rocksdb

Author	SHA1	Message	Date
Yanqin Jin	a78503bd6c	Temporarily disable snapshot list refresh for atomic flush stress test (#5581 ) Summary: Atomic flush test started to fail after https://github.com/facebook/rocksdb/issues/5099. Then https://github.com/facebook/rocksdb/issues/5278 provided a fix after which the same error occurred much less frequently. However, it still occurs occasionally. The root cause is not yet known. This PR disables the snapshot list refresh feature, and we should keep an eye on the failure in the future. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5581 Differential Revision: D16295985 Pulled By: riversand963 fbshipit-source-id: c9e62e65133c52c21b07097de359632ca62571e4	2019-07-22 14:38:16 -07:00
sdong	6bb3b4b567	ldb idump to support non-default column families. (#5594 ) Summary: ldb idump now only works for default column family. Extend it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5594 Test Plan: Compile and run the tool against a multiple CF DB. Differential Revision: D16380684 fbshipit-source-id: bfb8af36fdad1806837c90aaaab492d71528aceb	2019-07-19 11:36:59 -07:00
haoyuhuang	8a008d4170	Block access tracing: Trace referenced key for Get on non-data blocks. (#5548 ) Summary: This PR traces the referenced key for Get for all types of blocks. This is useful when evaluating hybrid row-block caches. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5548 Test Plan: make clean && USE_CLANG=1 make check -j32 Differential Revision: D16157979 Pulled By: HaoyuHuang fbshipit-source-id: f6327411c9deb74e35e22a35f66cdbae09ab9d87	2019-07-17 13:05:58 -07:00
Levi Tamasi	3bde41b5a3	Move the filter readers out of the block cache (#5504 ) Summary: Currently, when the block cache is used for the filter block, it is not really the block itself that is stored in the cache but a FilterBlockReader object. Since this object is not pure data (it has, for instance, pointers that might dangle, including in one case a back pointer to the TableReader), it's not really sharable. To avoid the issues around this, the current code erases the cache entries when the TableReader is closed (which, BTW, is not sufficient since a concurrent TableReader might have picked up the object in the meantime). Instead of doing this, the patch moves the FilterBlockReader out of the cache altogether, and decouples the filter reader object from the filter block. In particular, instead of the TableReader owning, or caching/pinning the FilterBlockReader (based on the customer's settings), with the change the TableReader unconditionally owns the FilterBlockReader, which in turn owns/caches/pins the filter block. This change also enables us to reuse the code paths historically used for data blocks for filters as well. Note: Eviction statistics for filter blocks are temporarily broken. We plan to fix this in a separate phase. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5504 Test Plan: make asan_check Differential Revision: D16036974 Pulled By: ltamasi fbshipit-source-id: 770f543c5fb4ed126fd1e04bfd3809cf4ff9c091	2019-07-16 13:14:58 -07:00
haoyuhuang	68d43b4d30	A python script to plot graphs for csv files generated by block_cache_trace_analyzer Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5563 Test Plan: Manually run the script on files generated by block_cache_trace_analyzer. Differential Revision: D16214400 Pulled By: HaoyuHuang fbshipit-source-id: 94485eed995e9b2b63e197c5dfeb80129fa7897f	2019-07-12 18:56:20 -07:00
haoyuhuang	3e9c5a3523	Block cache analyzer: Add more stats (#5516 ) Summary: This PR provides more command line options for block cache analyzer to better understand block cache access pattern. -analyze_bottom_k_access_count_blocks -analyze_top_k_access_count_blocks -reuse_lifetime_labels -reuse_lifetime_buckets -analyze_callers -access_count_buckets -analyze_blocks_reuse_k_reuse_window Pull Request resolved: https://github.com/facebook/rocksdb/pull/5516 Test Plan: make clean && COMPILE_WITH_ASAN=1 make check -j32 Differential Revision: D16037440 Pulled By: HaoyuHuang fbshipit-source-id: b9a4ac0d4712053fab910732077a4d4b91400bc8	2019-07-12 16:55:34 -07:00
haoyuhuang	1a59b6e2a9	Cache simulator: Add a ghost cache for admission control and a hybrid row-block cache. (#5534 ) Summary: This PR adds a ghost cache for admission control. Specifically, it admits an entry on its second access. It also adds a hybrid row-block cache that caches the referenced key-value pairs of a Get/MultiGet request instead of its blocks. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5534 Test Plan: make clean && COMPILE_WITH_ASAN=1 make check -j32 Differential Revision: D16101124 Pulled By: HaoyuHuang fbshipit-source-id: b99edda6418a888e94eb40f71ece45d375e234b1	2019-07-11 12:43:29 -07:00
Yanqin Jin	f786b4a5b4	Improve result print on atomic flush stress test failure (#5549 ) Summary: When atomic flush stress test fails, we print internal keys within the range with mismatched key/values for all column families. Test plan (on devserver) Manually hack the code to randomly insert wrong data. Run the test. ``` $make clean && COMPILE_WITH_TSAN=1 make -j32 db_stress $./db_stress -test_atomic_flush=true -ops_per_thread=10000 ``` Check that proper error messages are printed, as follows: ``` 2019/07/08-17:40:14 Starting verification Verification failed Latest Sequence Number: 190903 [default] 000000000000050B => 56290000525350515E5F5C5D5A5B5859 [3] 0000000000000533 => EE100000EAEBE8E9E6E7E4E5E2E3E0E1FEFFFCFDFAFBF8F9 Internal keys in CF 'default', [000000000000050B, 0000000000000533] (max 8) key 000000000000050B seq 139920 type 1 key 0000000000000533 seq 0 type 1 Internal keys in CF '3', [000000000000050B, 0000000000000533] (max 8) key 0000000000000533 seq 0 type 1 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5549 Differential Revision: D16158709 Pulled By: riversand963 fbshipit-source-id: f07fa87763f87b3bd908da03c956709c6456bcab	2019-07-09 16:27:22 -07:00
sdong	aa0367aabb	Allow ldb to open DB as secondary (#5537 ) Summary: Right now ldb can open a running DB as a read-only DB. However, it might leave info log files in the read-only DB directory. Add an option to open the DB as secondary to avoid it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5537 Test Plan: Run ./ldb scan --max_keys=10 --db=/tmp/rocksdbtest-2491/dbbench --secondary_path=/tmp --no_value --hex and ./ldb get 0x00000000000000103030303030303030 --hex --db=/tmp/rocksdbtest-2491/dbbench --secondary_path=/tmp against a normal db_bench run and observe the output changes. Also observe that no new info log files are created under /tmp/rocksdbtest-2491/dbbench. Run without --secondary_path and observe that new info log files are created under /tmp/rocksdbtest-2491/dbbench. Differential Revision: D16113886 fbshipit-source-id: 4e09dec47c2528f6ca08a9e7a7894ba2d9daebbb	2019-07-09 12:51:28 -07:00
Tim Hatch	a6a9213a36	Fix interpreter lines for files with python2-only syntax. Reviewed By: lisroach Differential Revision: D15362271 fbshipit-source-id: 48fab12ab6e55a8537b19b4623d2545ca9950ec5	2019-07-09 10:51:37 -07:00
sdong	872a261ffc	db_stress to print some internal keys after verification failure (#5543 ) Summary: Print out some more information when db_stress fails with verification failures to help debugging problems. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5543 Test Plan: Manually ingest some failures and observe the outputs are like this: Verification failed [default] 0000000000199A5A => 7C3D000078797A7B74757677707172736C6D6E6F68696A6B [6] 000000000019C8BD => 65380000616063626D6C6F6E69686B6A internal keys in default CF [0000000000199A5A, 000000000019C8BD] (max 8) key 0000000000199A5A seq 179246 type 1 key 000000000019C8BD seq 163970 type 1 Lastest Sequence Number: 292234 Differential Revision: D16153717 fbshipit-source-id: b33fa50a828c190cbf8249a37955432044f92daf	2019-07-08 13:36:37 -07:00
sdong	e4dcf5fd22	db_bench to add a new "benchmark" to print out all stats history (#5532 ) Summary: Sometimes it is helpful to fetch the whole history of stats after benchmark runs. Add such an option Pull Request resolved: https://github.com/facebook/rocksdb/pull/5532 Test Plan: Run the benchmark manually and observe the output is as expected. Differential Revision: D16097764 fbshipit-source-id: 10b5b735a22a18be198b8f348be11f11f8806904	2019-07-03 20:03:28 -07:00
haoyuhuang	66464d1fde	Remove multiple declarations of kMicrosInSecond. Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5526 Test Plan: OPT=-g V=1 make J=1 unity_test -j32 make clean && make -j32 Differential Revision: D16079315 Pulled By: HaoyuHuang fbshipit-source-id: 294ab439cf0db8dd5da44e30eabf0cbb2bb8c4f6	2019-07-01 15:15:12 -07:00
Eli Pozniansky	3e6c185381	Formatting fixes in db_bench_tool (#5525 ) Summary: Formatting fixes in db_bench_tool that were accidentally omitted Pull Request resolved: https://github.com/facebook/rocksdb/pull/5525 Test Plan: Unit tests Differential Revision: D16078516 Pulled By: elipoz fbshipit-source-id: bf8df0e3f08092a91794ebf285396d9b8a335bb9	2019-07-01 14:57:28 -07:00
Eli Pozniansky	f872009237	Fix from some C-style casting (#5524 ) Summary: Fix from some C-style casting in bloom.cc and ./tools/db_bench_tool.cc Pull Request resolved: https://github.com/facebook/rocksdb/pull/5524 Differential Revision: D16075626 Pulled By: elipoz fbshipit-source-id: 352948885efb64a7ef865942c75c3c727a914207	2019-07-01 13:05:34 -07:00
haoyuhuang	9f0bd56889	Cache simulator: Refactor the cache simulator so that we can add alternative policies easily (#5517 ) Summary: This PR creates cache_simulator.h file. It contains a CacheSimulator that runs against a block cache trace record. We can add alternative cache simulators derived from CacheSimulator later. For example, this PR adds a PrioritizedCacheSimulator that inserts filter/index/uncompressed dictionary blocks with high priority. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5517 Test Plan: make clean && COMPILE_WITH_ASAN=1 make check -j32 Differential Revision: D16043689 Pulled By: HaoyuHuang fbshipit-source-id: 65f28ed52b866ffb0e6eceffd7f9ca7c45bb680d	2019-07-01 12:46:32 -07:00
Yanqin Jin	c360675750	Add secondary instance to stress test (#5479 ) Summary: This PR allows users to run stress tests on secondary instance. Test plan (on devserver) ``` ./db_stress -ops_per_thread=100000 -enable_secondary=true -threads=32 -secondary_catch_up_one_in=10000 -clear_column_family_one_in=1000 -reopen=100 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5479 Differential Revision: D16074325 Pulled By: riversand963 fbshipit-source-id: c0ed959e7b6c7cda3efd0b3070ab379de3b29f1c	2019-07-01 11:49:50 -07:00
sdong	10bae8ceb3	Add more release versions to tools/check_format_compatible.sh (#5518 ) Summary: tools/check_format_compatible.sh has fallen behind. Catch up. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5518 Test Plan: Run the command Differential Revision: D16063180 fbshipit-source-id: d063eb42df9653dec06a2cf0fb982b8a60ca3d2f	2019-06-28 17:41:58 -07:00
Aaron Gao	5c2f13fb14	add create_column_family and drop_column_family cmd to ldb tool (#5503 ) Summary: `create_column_family` cmd already exists but was somehow missed in the help message. also add `drop_column_family` cmd which can drop a cf without opening db. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5503 Test Plan: Updated existing ldb_test.py to test deleting a column family. Differential Revision: D16018414 Pulled By: lightmark fbshipit-source-id: 1fc33680b742104fea86b10efc8499f79e722301	2019-06-27 11:11:48 -07:00
haoyuhuang	554a6456aa	Block cache trace analysis: Write time series graphs in csv files (#5490 ) Summary: This PR adds a feature in block cache trace analysis tool to write statistics into csv files. 1. The analysis tool supports grouping the number of accesses per second by various labels, e.g., block, column family, block type, or a combination of them. 2. It also computes reuse distance and reuse interval. Reuse distance: The cumulated size of unique blocks read between two consecutive accesses on the same block. Reuse interval: The time between two consecutive accesses on the same block. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5490 Differential Revision: D15901322 Pulled By: HaoyuHuang fbshipit-source-id: b5454fea408a32757a80be63de6fe1c8149ca70e	2019-06-24 20:42:12 -07:00
Yanqin Jin	1bfeffab2d	Stop printing after verification fails (#5493 ) Summary: Stop verification and printing once verification fails. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5493 Differential Revision: D15928992 Pulled By: riversand963 fbshipit-source-id: 699feac034a217d57280aa3fb50f5aba06adf317	2019-06-20 22:16:58 -07:00
haoyuhuang	705b8eecb4	Add more callers for table reader. (#5454 ) Summary: This PR adds more callers for table readers. This information is only used for block cache analysis so that we can know which caller accesses a block. 1. It renames the BlockCacheLookupCaller to TableReaderCaller as passing the caller from upstream requires changes to table_reader.h and TableReaderCaller is a more appropriate name. 2. It adds more table reader callers in table/table_reader_caller.h, e.g., kCompactionRefill, kExternalSSTIngestion, and kBuildTable. This PR is long as it requires modification of interfaces in table_reader.h, e.g., NewIterator. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5454 Test Plan: make clean && COMPILE_WITH_ASAN=1 make check -j32. Differential Revision: D15819451 Pulled By: HaoyuHuang fbshipit-source-id: b6caa704c8fb96ddd15b9a934b7e7ea87f88092d	2019-06-20 14:31:48 -07:00
haoyuhuang	2e8ad03ab3	Add more stats in the block cache trace analyzer (#5482 ) Summary: This PR adds more stats in the block cache trace analyzer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5482 Differential Revision: D15883553 Pulled By: HaoyuHuang fbshipit-source-id: 6d440e4f657af75690420102d532d0ee1ed4e9cf	2019-06-18 18:38:42 -07:00
Huisheng Liu	92f631da33	replace sprintf with its safe version snprintf (#5475 ) Summary: sprintf is unsafe and has buffer overrun risk. Replace it with the safer version snprintf where buffer size is supplied to avoid overrun. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5475 Differential Revision: D15879481 Pulled By: sagar0 fbshipit-source-id: 7ae1958ffc9727fa50261dfbb98ddd74e70a72d8	2019-06-18 16:42:26 -07:00
haoyuhuang	bcfc53b436	Block cache tracing: Fix minor bugs with downsampling and some benchmark results. (#5473 ) Summary: As the code changes for block cache tracing are almost complete, I did a benchmark to compare the performance when block cache tracing is enabled/disabled. With 1% downsampling ratio, the performance overhead of block cache tracing is negligible. When we trace all block accesses, the throughput drops about sixfold with 16 threads issuing random reads and all reads served from the block cache. Setup: RocksDB: version 6.2 Date: Mon Jun 17 17:11:13 2019 CPU: 24 * Intel Core Processor (Skylake) CPUCache: 16384 KB Keys: 20 bytes each Values: 100 bytes each (100 bytes after compression) Entries: 10000000 Prefix: 20 bytes Keys per prefix: 0 RawSize: 1144.4 MB (estimated) FileSize: 1144.4 MB (estimated) Write rate: 0 bytes/second Read rate: 0 ops/second Compression: NoCompression Compression sampling rate: 0 Memtablerep: skip_list Perf Level: 1 I ran the readrandom workload for 1 minute. Detailed throughput results: (ops/second) Sample rate 0: no block cache tracing. Sample rate 1: trace all block accesses. Sample rate 100: trace accesses to 1% of blocks. 1 thread \| \| \| -- \| -- \| -- \| -- Sample rate \| 0 \| 1 \| 100 1 MB block cache size \| 13,094 \| 13,166 \| 13,341 10 GB block cache size \| 202,243 \| 188,677 \| 229,182 16 threads \| \| \| -- \| -- \| -- \| -- Sample rate \| 0 \| 1 \| 100 1 MB block cache size \| 208,761 \| 178,700 \| 201,872 10 GB block cache size \| 2,645,996 \| 426,295 \| 2,587,605 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5473 Differential Revision: D15869479 Pulled By: HaoyuHuang fbshipit-source-id: 7ae802abe84811281a6af8649f489887cd7c4618	2019-06-17 17:59:02 -07:00
haoyuhuang	2d1dd5bce7	Support computing miss ratio curves using sim_cache. (#5449 ) Summary: This PR adds a BlockCacheTraceSimulator that reports the miss ratios given different cache configurations. A cache configuration contains "cache_name,num_shard_bits,cache_capacities". For example, "lru, 1, 1K, 2K, 4M, 4G". When we replay the trace, we also perform lookups and inserts on the simulated caches. In the end, it reports the miss ratio for each tuple <cache_name, num_shard_bits, cache_capacity> in a output file. This PR also adds a main source block_cache_trace_analyzer so that we can run the analyzer in command line. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5449 Test Plan: Added tests for block_cache_trace_analyzer. COMPILE_WITH_ASAN=1 make check -j32. Differential Revision: D15797073 Pulled By: HaoyuHuang fbshipit-source-id: aef0c5c2e7938f3e8b6a10d4a6a50e6928ecf408	2019-06-17 16:41:12 -07:00
Zhongyi Xie	671d15cbdd	Persistent Stats: persist stats history to disk (#5046 ) Summary: This PR continues the work in https://github.com/facebook/rocksdb/pull/4748 and https://github.com/facebook/rocksdb/pull/4535 by adding a new DBOption `persist_stats_to_disk` which instructs RocksDB to persist stats history to RocksDB itself. When statistics is enabled, and both options `stats_persist_period_sec` and `persist_stats_to_disk` are set, RocksDB will periodically write stats to a built-in column family in the following form: key -> (timestamp in microseconds)#(stats name), value -> stats value. The existing API `GetStatsHistory` will detect the current value of `persist_stats_to_disk` and either read from in-memory data structure or from the hidden column family on disk. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5046 Differential Revision: D15863138 Pulled By: miasantreble fbshipit-source-id: bb82abdb3f2ca581aa42531734ac799f113e931b	2019-06-17 15:21:50 -07:00
haoyuhuang	d43b4cd570	Integrate block cache tracing into db_bench (#5459 ) Summary: This PR integrates the block cache tracing into db_bench. It adds three command line arguments. -block_cache_trace_file (Block cache trace file path.) type: string default: "" -block_cache_trace_max_trace_file_size_in_bytes (The maximum block cache trace file size in bytes. Block cache accesses will not be logged if the trace file size exceeds this threshold. Default is 64 GB.) type: int64 default: 68719476736 -block_cache_trace_sampling_frequency (Block cache trace sampling frequency, termed s. It uses spatial downsampling and samples accesses to one out of s blocks.) type: int32 default: 1 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5459 Differential Revision: D15832031 Pulled By: HaoyuHuang fbshipit-source-id: 0ecf2f2686557251fe741a2769b21170777efa3d	2019-06-17 11:08:21 -07:00
haoyuhuang	7a8d7358bb	Integrate block cache tracer in block based table reader. (#5441 ) Summary: This PR integrates the block cache tracer into block based table reader. The tracer will write the block cache accesses using the trace_writer. The tracer is null in this PR so that nothing will be logged. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5441 Differential Revision: D15772029 Pulled By: HaoyuHuang fbshipit-source-id: a64adb92642cd23222e0ba8b10d86bf522b42f9b	2019-06-14 17:40:31 -07:00
haoyuhuang	bb4178066d	Integrate block cache tracer into db_impl (#5433 ) Summary: This PR integrates the block cache tracer class into db_impl.cc. db_impl.cc contains a member variable of AtomicBlockCacheTraceWriter class and passes its reference to the block_based_table_reader. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5433 Differential Revision: D15728016 Pulled By: HaoyuHuang fbshipit-source-id: 23d5659e8c82d556833dcc1a5558aac8c1f7db71	2019-06-13 15:43:10 -07:00
Maysam Yabandeh	f9842869cf	Disable pipeline writes in stress test (#5445 ) Summary: The tsan crash tests are failing with a data race complaint with the pipelined write option. Temporarily disable it until its concurrency issues are fixed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5445 Differential Revision: D15783824 Pulled By: maysamyabandeh fbshipit-source-id: 413a0c3230b86f524fc7eeea2cf8e8375406e65b	2019-06-12 11:12:36 -07:00
haoyuhuang	9bbccda01e	First commit for block cache trace analyzer (#5425 ) Summary: This PR contains the first commit for block cache trace analyzer. It reads a block cache trace file and prints statistics of the traces. We will extend this class to provide more functionalities. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5425 Differential Revision: D15709580 Pulled By: HaoyuHuang fbshipit-source-id: 2f43bd2311f460ab569880819d95eeae217c20bb	2019-06-11 12:22:44 -07:00
Zhongyi Xie	d68f9f4580	simplify include directive involving inttypes (#5402 ) Summary: When using `PRIu64` type of printf specifier, current code base does the following: ``` #ifndef __STDC_FORMAT_MACROS #define __STDC_FORMAT_MACROS #endif #include <inttypes.h> ``` However, this can be simplified to ``` #include <cinttypes> ``` as long as flag `-std=c++11` is used. This should solve issues like https://github.com/facebook/rocksdb/issues/5159 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5402 Differential Revision: D15701195 Pulled By: miasantreble fbshipit-source-id: 6dac0a05f52aadb55e9728038599d3d2e4b59d03	2019-06-06 13:56:07 -07:00
Siying Dong	5851cb7fdb	Move util/trace_replay.* to trace_replay/ (#5376 ) Summary: util/ is meant for lower-level libraries. trace_replay is highly integrated with the DB and sometimes calls into it. Move it out to a separate directory. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5376 Differential Revision: D15550938 Pulled By: siying fbshipit-source-id: f46dce5ceffdc05a73f26379c7bb1b79ebe6c207	2019-06-03 13:25:26 -07:00
Siying Dong	000b9ec217	Move some logging related files to logging/ (#5387 ) Summary: Many logging related source files are under util/. It will be more structured if they are together. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5387 Differential Revision: D15579036 Pulled By: siying fbshipit-source-id: 3850134ed50b8c0bb40a0c8ae1f184fa4081303f	2019-05-31 17:23:59 -07:00
Vijay Nadimpalli	49c5a12dbe	Organizing rocksdb/db directory Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5390 Differential Revision: D15579388 Pulled By: vjnadimpalli fbshipit-source-id: 5bfc95e31554b8ff05b97b76d6534113f527f366	2019-05-31 11:57:01 -07:00
Yanqin Jin	83f7a8eed0	Fix compilation error in LITE mode (#5391 ) Summary: Add macro ROCKSDB_LITE to fix compilation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5391 Differential Revision: D15574522 Pulled By: riversand963 fbshipit-source-id: 95aea83c5d9b2bf98a3ba0ef9167b63c9be2988b	2019-05-31 08:32:22 -07:00
Yanqin Jin	b9f5900658	Fix WAL replay by skipping old write batches (#5170 ) Summary: 1. Fix a bug in WAL replay in which write batches with old sequence numbers are mistakenly inserted into memtables. 2. Add support for benchmarking secondary instance to db_bench_tool. With changes made in this PR, we can start benchmarking secondary instance using two processes. It is also possible to vary the frequency at which the secondary instance tries to catch up with the primary. The info log of the secondary can be found in a directory whose path can be specified with '-secondary_path'. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5170 Differential Revision: D15564608 Pulled By: riversand963 fbshipit-source-id: ce97688ed3d33f69d3a0b9266ebbbbf887aa0ec8	2019-05-30 19:33:33 -07:00
Siying Dong	8843129ece	Move some memory related files from util/ to memory/ (#5382 ) Summary: Move arena, allocator, and memory tools under util to a separate memory/ directory. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5382 Differential Revision: D15564655 Pulled By: siying fbshipit-source-id: 9cd6b5d0d3d52b39606e19221fa154596e5852a5	2019-05-30 17:44:09 -07:00
Vijay Nadimpalli	50e470791d	Organizing rocksdb/table directory by format Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5373 Differential Revision: D15559425 Pulled By: vjnadimpalli fbshipit-source-id: 5d6d6d615582bedd96a4b879bb25d429a6de8b55	2019-05-30 14:51:11 -07:00
anand76	bd44ec2006	Fix reopen voting logic in db_stress when using MultiGet (#5374 ) Summary: When the --reopen option is non-zero, the DB is reopened after every ops_per_thread/(reopen+1) ops, with the check being done after every op. With MultiGet, we might do multiple ops in one iteration, which broke the logic that checked when to synchronize among the threads and reopen the DB. This PR fixes that logic. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5374 Differential Revision: D15559780 Pulled By: anand1976 fbshipit-source-id: ee6563a68045df7f367eca3cbc2500d3e26359ef	2019-05-30 11:41:08 -07:00
Siying Dong	e9e0101ca4	Move test related files under util/ to test_util/ (#5377 ) Summary: There are too many types of files under util/. Some test related files don't belong there or are only loosely related. Move them to a new directory test_util/, so that util/ is cleaner. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5377 Differential Revision: D15551366 Pulled By: siying fbshipit-source-id: 0f5c8653832354ef8caa31749c0143815d719e2c	2019-05-30 11:25:51 -07:00
Siying Dong	545d206040	Move some file related files outside util/ (#5375 ) Summary: util/ is meant for lower-level libraries, so it's a good idea to move out the files that require knowledge of the DB. Create a file/ directory and move some files there. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5375 Differential Revision: D15550935 Pulled By: siying fbshipit-source-id: 61a9715dcde5386eebfb43e93f847bba1ae0d3f2	2019-05-29 20:47:06 -07:00
Maysam Yabandeh	eab4f49a2c	WritePrepared: skip_concurrency_control option (#5330 ) Summary: This enables the user to set TransactionDBOptions::skip_concurrency_control so the standard `DB::Write(const WriteOptions& opts, WriteBatch* updates)` would skip the concurrency control. This would give higher throughput to the users who know their use case doesn't need concurrency control. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5330 Differential Revision: D15525932 Pulled By: maysamyabandeh fbshipit-source-id: 68421ac1ba34f549a4a8de9ce4c2dccf6fb4b06b	2019-05-28 16:29:45 -07:00
Silver Chan	2095ae8858	fixed db_stress.cc build error (#5307 ) Summary: When building this file using Xcode 10.2.1 on MacOSX10.14, the compiler reports this error: ` rocksdb/tools/db_stress.cc:3613:33: error: implicit instantiation of undefined template 'std::__1::array<std::__1::basic_string<char>, 10>' std::array<std::string, 10> keys = {"0", "1", "2", "3", "4", "5", "6", "7", "8", "9"}; /usr/include/c++/v1/__tuple:223:64: note: template is declared here template <class _Tp, size_t _Size> struct _LIBCPP_TEMPLATE_VIS array; ^ 1 error generated. ` Including <array> fixes this error. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5307 Differential Revision: D15475217 Pulled By: sagar0 fbshipit-source-id: b04a7658c2ca2573157028863b3a80f5ab52b9de	2019-05-23 14:03:25 -07:00
Zhichao Cao	a13026fb2f	Added trace replay fast forward function (#5273 ) Summary: In the current db_bench trace replay, the replay process strictly follows the timestamp to issue the queries. In some cases, user does not care about the time. Therefore, fast forward is needed for users to speed up the replay process. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5273 Differential Revision: D15389232 Pulled By: zhichao-cao fbshipit-source-id: 735d629b9d2a167b05af3e4fa0ddf9d5d0be1806	2019-05-16 20:21:18 -07:00
anand76	6492430eaf	Fix a bug in db_stress and an incorrect assertion in FilePickerMultiGet (#5301 ) Summary: This PR has two fixes for crash test failures - 1. Fix a bug in TestMultiGet() in db_stress that was passing the list of keys to MultiGet() in the wrong order, so that actual values did not match expected values 2. Remove an incorrect assertion in FilePickerMultiGet::GetNextFileInLevelWithKeys() that checks that files in a level are in sorted order. This is not true with MultiGet(), especially if there are duplicate keys and we may have to go back one file for the next key. Furthermore, this assertion makes more sense when a new version is created, rather than at lookup time Test - asan_crash and ubsan_crash tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/5301 Differential Revision: D15337383 Pulled By: anand1976 fbshipit-source-id: 35092cb15bbc1700e5e823cbe07bfa62f1e9e6c6	2019-05-14 11:58:04 -07:00
Maysam Yabandeh	f383641a1d	Unordered Writes (#5218 ) Summary: Performing unordered writes in rocksdb when unordered_write option is set to true. When enabled the writes to memtable are done without joining any write thread. This offers much higher write throughput since the upcoming writes would not have to wait for the slowest memtable write to finish. The tradeoff is that the writes visible to a snapshot might change over time. If the application cannot tolerate that, it should implement its own mechanisms to work around that. Using TransactionDB with WRITE_PREPARED write policy is one way to achieve that. Doing so increases the max throughput by 2.2x without compromising the snapshot guarantees. The patch is prepared based on an original by siying. Existing unit tests are extended to include unordered_write option. Benchmark Results: ``` TEST_TMPDIR=/dev/shm/ ./db_bench_unordered --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions --unordered_write=1 ``` With WAL - Vanilla RocksDB: 78.6 MB/s - WRITE_PREPARED with unordered_write: 177.8 MB/s (2.2x) - unordered_write: 368.9 MB/s (4.7x with relaxed snapshot guarantees) Without WAL - Vanilla RocksDB: 111.3 MB/s - WRITE_PREPARED with unordered_write: 259.3 MB/s (2.3x) - unordered_write: 645.6 MB/s (5.8x with relaxed snapshot guarantees) - WRITE_PREPARED with unordered_write disable concurrency control: 185.3 MB/s (2.35x) Limitations: - The feature is not yet extended to `max_successive_merges` > 0. The feature is also incompatible with `enable_pipelined_write` = true as well as with `allow_concurrent_memtable_write` = false. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5218 Differential Revision: D15219029 Pulled By: maysamyabandeh fbshipit-source-id: 38f2abc4af8780148c6128acdba2b3227bc81759	2019-05-13 17:47:21 -07:00
Yi Wu	92c60547fe	db_bench: fix hang on IO error (#5300 ) Summary: db_bench will wait indefinitely if there's a background error. Fix by passing `abs_time_us` to the cond var. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5300 Differential Revision: D15319945 Pulled By: miasantreble fbshipit-source-id: 0034fb7f6ec7c3303c4ccf26e54c20fbdac8ab44	2019-05-13 11:30:35 -07:00
anand76	181bb43f08	Fix bugs in FilePickerMultiGet (#5292 ) Summary: This PR fixes a couple of bugs in FilePickerMultiGet that were causing db_stress test failures. The failures were caused by - 1. Improper handling of a key that matches the user key portion of an L0 file's largest key. In this case, the curr_index_in_curr_level file index in L0 for that key was getting incremented, but batch_iter_ was not advanced. By design, all keys in a batch are supposed to be checked against an L0 file before advancing to the next L0 file. Not advancing to the next key in the batch was causing a double increment of curr_index_in_curr_level due to the same key being processed again 2. Improper handling of a key that matches the user key portion of the largest key in the last file of L1 and higher. This was resulting in a premature end to the processing of the batch for that level when the next key in the batch is a duplicate. Typically, the keys in MultiGet will not be duplicates, but it's good to handle that case correctly Test - asan_crash make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/5292 Differential Revision: D15282530 Pulled By: anand1976 fbshipit-source-id: d1a6a86e0af273169c3632db22a44d79c66a581f	2019-05-09 13:18:00 -07:00
anand76	930bfa5750	Disable MultiGet from db_stress (#5284 ) Summary: Disable it for now until we can get stress tests to pass consistently. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5284 Differential Revision: D15230727 Pulled By: anand1976 fbshipit-source-id: 239baacdb3c4cd4fb7c4447f7582b9042501d752	2019-05-06 18:26:50 -07:00
Maysam Yabandeh	6a40ee5eb1	Refresh snapshot list during long compactions (2nd attempt) (#5278 ) Summary: Part of compaction cpu goes to processing snapshot list, the larger the list the bigger the overhead. Although the lifetime of most of the snapshots is much shorter than the lifetime of compactions, the compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction continue more efficiently with a much smaller snapshot list. For simplicity, the feature is disabled in two cases: i) when more than one sub-compaction is sharing the same snapshot list, ii) when Range Delete is used, in which case the range delete aggregator has its own copy of the snapshot list. This fixes the reverted https://github.com/facebook/rocksdb/pull/5099 issue with range deletes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5278 Differential Revision: D15203291 Pulled By: maysamyabandeh fbshipit-source-id: fa645611e606aa222c7ce53176dc5bb6f259c258	2019-05-03 17:30:22 -07:00
Zhongyi Xie	5d27d65bef	multiget: fix memory issues due to vector auto resizing (#5279 ) Summary: This PR fixes three memory issues found by ASAN * in db_stress, the key vector for MultiGet is created using `emplace_back` which could potentially invalidate references to the underlying storage (vector<string>) due to auto resizing. Fix by calling reserve in advance. * Similar issue in construction of GetContext autovector in version_set.cc * In multiget_context.h use T[] specialization for unique_ptr that holds a char array Pull Request resolved: https://github.com/facebook/rocksdb/pull/5279 Differential Revision: D15202893 Pulled By: miasantreble fbshipit-source-id: 14cc2cda0ed64d29f2a1e264a6bfdaa4294ee75d	2019-05-03 15:58:43 -07:00
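The first fix in the entry above (reserve before `emplace_back`) can be illustrated with a minimal sketch. `KeyBatch` and `BuildKeyBatch` are hypothetical names invented for this illustration and are not taken from db_stress; the sketch only assumes standard `std::vector` behaviour, where growing past the current capacity reallocates the storage and invalidates pointers to existing elements.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical container for this sketch: owned keys plus pointers into them.
struct KeyBatch {
  std::vector<std::string> keys;
  std::vector<const std::string*> refs;
};

// Builds `n` keys and records a pointer to each one as it is created.
// Reserving capacity up front means emplace_back never reallocates,
// so the pointers recorded along the way stay valid.
KeyBatch BuildKeyBatch(std::size_t n) {
  KeyBatch batch;
  batch.keys.reserve(n);  // without this, growth may invalidate earlier pointers
  batch.refs.reserve(n);
  for (std::size_t i = 0; i < n; ++i) {
    batch.keys.emplace_back("key" + std::to_string(i));
    batch.refs.push_back(&batch.keys.back());
  }
  assert(batch.keys.capacity() >= n);
  return batch;
}
```

Dropping the `reserve` call on `keys` is the kind of bug ASAN reports as a heap-use-after-free once the vector grows.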
Zhongyi Xie	3e994809a1	fix implicit conversion error reported by clang check (#5277 ) Summary: fix the following clang check errors ``` tools/db_stress.cc:3609:30: error: implicit conversion loses integer precision: 'std::vector::size_type' (aka 'unsigned long') to 'int' [-Werror,-Wshorten-64-to-32] int num_keys = rand_keys.size(); ~~~~~~~~ ~~~~~~~~~~^~~~~~ tools/db_stress.cc:3888:30: error: implicit conversion loses integer precision: 'std::vector::size_type' (aka 'unsigned long') to 'int' [-Werror,-Wshorten-64-to-32] int num_keys = rand_keys.size(); ~~~~~~~~ ~~~~~~~~~~^~~~~~ 2 errors generated. make: *** [tools/db_stress.o] Error 1 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5277 Differential Revision: D15196620 Pulled By: miasantreble fbshipit-source-id: d56b1420d4a9f1df875fc52877a5fbb342bc7cae	2019-05-03 10:02:27 -07:00
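A common way to silence `-Wshorten-64-to-32` in cases like the one quoted above is an explicit `static_cast`, ideally guarded by a range check. The sketch below shows that pattern; `SizeAsInt` is a hypothetical helper, and the source does not say whether the actual patch used a cast or changed the variable type.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: converts a vector's size to int explicitly.
// The assert documents the assumption that the size fits in an int.
template <typename T>
int SizeAsInt(const std::vector<T>& v) {
  assert(v.size() <=
         static_cast<std::size_t>(std::numeric_limits<int>::max()));
  return static_cast<int>(v.size());
}
```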
anand76	434ccf2df4	Add option to use MultiGet in db_stress (#5264 ) Summary: The new option will pick a batch size randomly in the range 1-64. It will then space the keys in the batch by random intervals. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5264 Differential Revision: D15175522 Pulled By: anand1976 fbshipit-source-id: c16baa69d0f1ff4cf53c55c813ddd82c8aeb58fc	2019-05-01 23:06:56 -07:00
Andrew Kryczka	b02d0c238d	Init compression dict handle before reading meta-blocks (#5267 ) Summary: At least one of the meta-block loading functions (`ReadRangeDelBlock`) uses the same block reading function (`NewDataBlockIterator`) as data block reads, which means it uses the dictionary handle. However, the dictionary handle was uninitialized while reading meta-blocks, causing readers to receive an error. This situation was only noticed when `cache_index_and_filter_blocks=true`. This PR initializes the handle to null while reading meta-blocks to prevent the error. It also adds support to `db_stress` / `db_crashtest.py` for `cache_index_and_filter_blocks`. Fixes #5263. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5267 Differential Revision: D15149264 Pulled By: maysamyabandeh fbshipit-source-id: 991d38a306c62db5976778bfb050fa3cd4a0671b	2019-04-30 09:50:49 -07:00
Yanqin Jin	210b49cac9	Disable pipelined write in atomic flush stress test (#5266 ) Summary: Since currently pipelined write allows one thread to perform memtable writes while another thread is traversing the `flush_scheduler_`, it will cause an assertion failure in `FlushScheduler::Clear`. To unblock crash recovery tests, we temporarily disable pipelined write when atomic flush is enabled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5266 Differential Revision: D15142285 Pulled By: riversand963 fbshipit-source-id: a0c20fe4ac543e08feaed602414f982054df7831	2019-04-30 08:12:42 -07:00
Sagar Vemuri	3548e4220d	Improve explicit user readahead performance (#5246 ) Summary: Improve the iterators performance when the user explicitly sets the readahead size via `ReadOptions.readahead_size`. 1. Stop creating new table readers when the user explicitly sets readahead size. 2. Make use of an internal buffer based on `FilePrefetchBuffer` instead of using `ReadaheadRandomAccessFileReader`, to handle the user readahead requests (for both buffered and direct io cases). 3. Add `readahead_size` to db_bench. Benchmarks: https://gist.github.com/sagar0/53693edc320a18abeaeca94ca32f5737 For 1 MB readahead, Buffered IO performance improves by 28% and Direct IO performance improves by 50%. For 512KB readahead, Buffered IO performance improves by 30% and Direct IO performance improves by 67%. Test Plan: Updated `DBIteratorTest.ReadAhead` test to make sure that: - no new table readers are created for iterators on setting ReadOptions.readahead_size - At least "readahead" number of bytes are actually getting read on each iterator read. TODO later: - Use similar logic for compactions as well. - This ties in nicely with #4052 and paves the way for removing ReadaheadRandomAcessFile later. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5246 Differential Revision: D15107946 Pulled By: sagar0 fbshipit-source-id: 2c1149729ca7d779e4e8b7710ba6f4e8cbfd3bea	2019-04-26 21:24:10 -07:00
Adam Retter	990b2f4cb3	Fix compilation on db_bench_tool.cc on Windows (#5227 ) Summary: I needed this change to be able to build the v6.0.1 release on Windows. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5227 Differential Revision: D15033815 Pulled By: sagar0 fbshipit-source-id: 579f3b8e694c34c0d43527eb2fa37175e37f5911	2019-04-23 11:16:51 -07:00
Fosco Marotto	6c2bf9e916	Add copyright headers per FB open-source checkup tool. (#5199 ) Summary: internal task: T35568575 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5199 Differential Revision: D14962794 Pulled By: gfosco fbshipit-source-id: 93838ede6d0235eaecff90d200faed9a8515bbbe	2019-04-18 10:55:01 -07:00
Zhongyi Xie	baa5302447	Avoid double-compacting data in bottom level in manual compactions (#5138 ) Summary: Depending on the config, manual compaction (leveled compaction style) does following compactions: L0->L1 L1->L2 ... Ln-1 -> Ln Ln -> Ln The final Ln -> Ln compaction is partly unnecessary as it recompacts all the files that were just generated by the Ln-1 -> Ln. We should avoid recompacting such files. This rule should be applied to Lmax only. Resolves issue https://github.com/facebook/rocksdb/issues/4995 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5138 Differential Revision: D14940106 Pulled By: miasantreble fbshipit-source-id: 8d3cf5507a17e76f3333cfd4bac5256d005636e5	2019-04-16 23:32:20 -07:00
Yi Wu	b70967aac7	db_bench: support seek to non-exist prefix (#5163 ) Summary: Add `--seek_missing_prefix` flag to db_bench to allow benchmarking seeking to non-existing prefix. Usage example: ``` ./db_bench --db=/dev/shm/db_bench --use_existing_db=false --benchmarks=fillrandom --num=100000000 --prefix_size=9 --keys_per_prefix=10 ./db_bench --db=/dev/shm/db_bench --use_existing_db=true --benchmarks=seekrandom --disable_auto_compactions=true --num=100000000 --prefix_size=9 --keys_per_prefix=10 --reads=1000 --prefix_same_as_start=true --seek_missing_prefix=true ``` Also adding `--total_order_seek` and `--prefix_same_as_start` flags. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5163 Differential Revision: D14935724 Pulled By: riversand963 fbshipit-source-id: 7c41023f007febe373eb1589861f215432a9e18a	2019-04-15 10:54:58 -07:00
Yanqin Jin	3189398c00	Fix bugs detected by clang analyzer (#5185 ) Summary: as titled. False positive included, fixed anyway to make the check pass. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5185 Differential Revision: D14909384 Pulled By: riversand963 fbshipit-source-id: dc5177e72b1929ccfd6175a60e2cd7bdb9bd80f3	2019-04-12 10:45:56 -07:00
anand76	fefd4b98c5	Introduce a new MultiGet batching implementation (#5011 ) Summary: This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching. Batching is useful when there is some spatial locality to the keys being queried, as well as larger batch sizes. The main benefits are due to - 1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch() 2. Bloom filter cachelines can be prefetched, hiding the cache miss latency The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress. Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32). 
Batch Sizes: 1 \| 2 \| 4 \| 8 \| 16 \| 32
Random pattern (Stride length 0)
4.158 \| 4.109 \| 4.026 \| 4.05 \| 4.1 \| 4.074 - Get
4.438 \| 4.302 \| 4.165 \| 4.122 \| 4.096 \| 4.075 - MultiGet (no batching)
4.461 \| 4.256 \| 4.277 \| 4.11 \| 4.182 \| 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 \| 3.659 \| 3.248 \| 2.99 \| 2.84 \| 2.753 - Get
4.429 \| 3.728 \| 3.406 \| 3.053 \| 2.911 \| 2.781 - MultiGet (no batching)
4.452 \| 3.45 \| 2.833 \| 2.451 \| 2.233 \| 2.135 - MultiGet (w/ batching)
Good locality (Stride length 256)
4.066 \| 3.786 \| 3.581 \| 3.447 \| 3.415 \| 3.232 - Get
4.406 \| 4.005 \| 3.644 \| 3.49 \| 3.381 \| 3.268 - MultiGet (no batching)
4.393 \| 3.649 \| 3.186 \| 2.882 \| 2.676 \| 2.62 - MultiGet (w/ batching)
Medium locality (Stride length 4096)
4.012 \| 3.922 \| 3.768 \| 3.61 \| 3.582 \| 3.555 - Get
4.364 \| 4.057 \| 3.791 \| 3.65 \| 3.57 \| 3.465 - MultiGet (no batching)
4.479 \| 3.758 \| 3.316 \| 3.077 \| 2.959 \| 2.891 - MultiGet (w/ batching)
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011 Differential Revision: D14348703 Pulled By: anand1976 fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b	2019-04-11 14:28:26 -07:00
datonli	f0edf9d575	#5145 , rename port/dirent.h to port/port_dirent.h to avoid compile err when use port dir as header dir output (#5152 ) Summary: mv port/dirent.h to port/port_dirent.h to avoid compile err when use port dir as header dir output Pull Request resolved: https://github.com/facebook/rocksdb/pull/5152 Differential Revision: D14779409 Pulled By: siying fbshipit-source-id: d4162c47c979c6e8cc6a9e601802864ab3768ecb	2019-04-04 11:38:19 -07:00
Yanqin Jin	d77476ef55	Fix db_stress for custom env (#5122 ) Summary: Fix some hdfs-related code so that it can compile and run 'db_stress' Pull Request resolved: https://github.com/facebook/rocksdb/pull/5122 Differential Revision: D14675495 Pulled By: riversand963 fbshipit-source-id: cac280479efcf5451982558947eac1732e8bc45a	2019-03-28 19:20:27 -07:00
Siying Dong	2b4d5ceb47	Remove some "using std::..." from header files. (#5113 ) Summary: The code convention we are following, Google C++ Style, discourages aliases in header files, especially public headers: https://google.github.io/styleguide/cppguide.html#Aliases Remove some of them. Might have removed some from .cc files as well to be consistent. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5113 Differential Revision: D14633030 Pulled By: siying fbshipit-source-id: b990edc919d5de60295992284f980195e501d424	2019-03-27 10:28:21 -07:00
Yanqin Jin	9358178edc	Support for single-primary, multi-secondary instances (#4899 ) Summary: This PR allows RocksDB to run in single-primary, multi-secondary process mode. The writer is a regular RocksDB (e.g. an `DBImpl`) instance playing the role of a primary. Multiple `DBImplSecondary` processes (secondaries) share the same set of SST files, MANIFEST, WAL files with the primary. Secondaries tail the MANIFEST of the primary and apply updates to their own in-memory state of the file system, e.g. `VersionStorageInfo`. This PR has several components: 1. (Originally in #4745). Add a `PathNotFound` subcode to `IOError` to denote the failure when a secondary tries to open a file which has been deleted by the primary. 2. (Similar to #4602). Add `FragmentBufferedReader` to handle partially-read, trailing record at the end of a log from where future read can continue. 3. (Originally in #4710 and #4820). Add implementation of the secondary, i.e. `DBImplSecondary`. 3.1 Tail the primary's MANIFEST during recovery. 3.2 Tail the primary's MANIFEST during normal processing by calling `ReadAndApply`. 3.3 Tailing WAL will be in a future PR. 4. Add an example in 'examples/multi_processes_example.cc' to demonstrate the usage of secondary RocksDB instance in a multi-process setting. Instructions to run the example can be found at the beginning of the source code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4899 Differential Revision: D14510945 Pulled By: riversand963 fbshipit-source-id: 4ac1c5693e6012ad23f7b4b42d3c374fecbe8886	2019-03-26 16:45:31 -07:00
Zhongyi Xie	52e6404e0f	ldb command parsing: allow option values to contain equals signs (#5088 ) Summary: Right now ldb command doesn't allow cases where option values contain equals sign. For example, ``` ldb --db=/tmp/test scan --from='q=3' --max_keys=1 ``` after parsing, ldb will have one option 'db', 'max_keys' and one flag 'from'. This PR updates the parsing logic so that it now supports the above mentioned cases Pull Request resolved: https://github.com/facebook/rocksdb/pull/5088 Differential Revision: D14600869 Pulled By: miasantreble fbshipit-source-id: c6ef518c74a98d7b6675ea5954ae08b1bda5554e	2019-03-25 13:23:11 -07:00
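The parsing rule described in the entry above (an option value may itself contain `=`) amounts to splitting each argument on the first `=` only. The following is a minimal sketch of that idea, not ldb's actual parser; `ParseOption` is a hypothetical helper, and it assumes the shell has already stripped the quotes so the argument arrives as `--from=q=3`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not ldb's real code): parses "--name=value",
// splitting on the FIRST '=' so the value may contain further '=' signs.
// Returns false for arguments that are not of that form (e.g. bare flags).
bool ParseOption(const std::string& arg, std::string* name,
                 std::string* value) {
  const std::string prefix = "--";
  if (arg.compare(0, prefix.size(), prefix) != 0) return false;
  const std::size_t eq = arg.find('=', prefix.size());
  if (eq == std::string::npos) return false;  // no '=': treat as a flag
  *name = arg.substr(prefix.size(), eq - prefix.size());
  *value = arg.substr(eq + 1);  // everything after the first '='
  return !name->empty();
}
```

With this rule, `--from=q=3` yields the option `from` with value `q=3` rather than being misread as a flag.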
Shobhit Dayal	b45b1cde3e	Feature for sampling and reporting compressibility (#4842 ) Summary: This is a feature to sample data-block compressibility and report it as stats. 1 in N (tunable) blocks is sampled for compressibility using two algorithms: 1. lz4 or snappy for fast compression 2. zstd or zlib for slow but higher compression. The stats are reported to the caller as raw-bytes and compressed-bytes. The block continues to be compressed for storage using the specified CompressionType. The db_bench_tool now has a command line option for specifying the sampling rate. Its default value is 0 (no sampling). To test the overhead for a certain value, users can compare the performance of db_bench_tool, varying the sampling rate. It is unlikely to have a noticeable impact for high values like 20. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4842 Differential Revision: D13629011 Pulled By: shobhitdayal fbshipit-source-id: 14ca668bcab6499b2a1734edf848eb62a4f4fafa	2019-03-18 12:15:34 -07:00
Andrew Kryczka	2263f86901	exercise WAL recycling in crash test (#5070 ) Summary: Since this feature affects the WAL behavior, it seems important our crash-recovery tests cover it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5070 Differential Revision: D14470085 Pulled By: miasantreble fbshipit-source-id: 9b9682a718a926d57d055e0a5ec867efbd2eb9c1	2019-03-15 12:03:26 -07:00
Zhichao Cao	dcde292c3b	Add the -try_process_corrupted_trace option to trace_analyzer (#5067 ) Summary: In the current trace_analyzer implementation, once the trace file has corrupted content, which can be caused by unexpected tracing operations or other reasons, trace_analyzer will print the error and stop analyzing. By adding the -try_process_corrupted_trace option, users can try to process the corrupted trace file and get the analysis results for the trace records from the beginning to the first corrupted point in the trace file. Analysis might fail even if this option is enabled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5067 Differential Revision: D14433037 Pulled By: zhichao-cao fbshipit-source-id: d095233ba371726869af0def0cdee23b69896831	2019-03-14 20:03:01 -07:00
Andrew Kryczka	5a5c0492db	ldb: set `total_order_seek` for scans (#5066 ) Summary: Without `total_order_seek=true`, using this command with `prefix_extractor` set skips over lots of keys. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5066 Differential Revision: D14425967 Pulled By: sagar0 fbshipit-source-id: f6f142733258d92604f920615be9266e1fe797f8	2019-03-12 13:10:39 -07:00
Zhichao Cao	05ebfebc17	Fixed the potential stack overflow of MixGraph in db_bench (#5051 ) Summary: In the MixGraph benchmark of db_bench, The max buffer size used for value of KV-pair might be extremely large (64MB), which might cause function stack overflow in some platforms, reduced to 1MB. Added the finished ops printing in MixGraph benchmark. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5051 Differential Revision: D14379571 Pulled By: zhichao-cao fbshipit-source-id: 24084fbe38f60f2902d9a40f6bc9a25e4e2c9bb9	2019-03-08 14:10:17 -08:00
Andrew Kryczka	18d2e4beb7	Run db_bench on database generated externally (#5017 ) Summary: Added an option, `-use_existing_keys`, which can be set to run benchmarks against an arbitrary existing database. Now users can benchmark against their actual database rather than synthetic data. Before the run begins, it loads all the keys into memory, then uses that set of keys rather than synthesizing new ones in `GenerateKeyFromInt`. This is mainly intended for small-scale DBs where the memory consumption is not a concern. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5017 Differential Revision: D14270303 Pulled By: riversand963 fbshipit-source-id: 6328df9dffb5e19170270dd00a69f4bbe424e5ed	2019-03-01 11:19:03 -08:00
Siying Dong	aef763b6d6	Make statistics's stats_level change thread-safe (#5030 ) Summary: Right now, users can change statistics.stats_level while DB is running, but TSAN may report data race. We make stats_level_ to be atomic, and access them using accessors. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5030 Differential Revision: D14267519 Pulled By: siying fbshipit-source-id: 37d7ebeff7a43a406230143422a16af899163f73	2019-03-01 10:42:09 -08:00
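The approach described in the entry above (an atomic level field reached only through accessors) can be sketched as follows. `DemoStatsLevel` and `DemoStatistics` are hypothetical stand-ins invented for this illustration, and the relaxed memory ordering is an assumption; the source does not state which ordering the patch uses.

```cpp
#include <atomic>

// Hypothetical levels for this sketch; not RocksDB's actual enum values.
enum class DemoStatsLevel : int { kBasic, kDetailed, kAll };

// Hypothetical statistics holder: the level is stored in an atomic and is
// only read or written through accessors, so concurrent changes while other
// threads are reading do not constitute a data race.
class DemoStatistics {
 public:
  void set_level(DemoStatsLevel level) {
    level_.store(level, std::memory_order_relaxed);
  }
  DemoStatsLevel level() const {
    return level_.load(std::memory_order_relaxed);
  }

 private:
  std::atomic<DemoStatsLevel> level_{DemoStatsLevel::kAll};
};
```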
Maysam Yabandeh	0b80f6b380	WritePrepared: script to analyze stress test failures (#5033 ) Summary: This is the hackish script we used to find the root cause of failures in transaction stress tests. It is not well-written and does not require rigorous reviewing but it is better than starting from scratch each time we observe an issue. The stress tests would just say at which snapshots the sum of all the keys in a set is inconsistent with another set. To help debugging one needs to know exactly which key returned inconsistent results. The script looks at the transactions between two conflicting snapshots, and performs the changes manually to see for which key the read value was inconsistent. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5033 Differential Revision: D14280362 Pulled By: maysamyabandeh fbshipit-source-id: d5826055c46711460ba81480d96cb5ea082814a5	2019-03-01 09:18:40 -08:00
Siying Dong	5e298f865b	Add two more StatsLevel (#5027 ) Summary: Statistics cost too much CPU for some use cases. Add two stats levels so that people can choose to skip two types of expensive stats, timers and histograms. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5027 Differential Revision: D14252765 Pulled By: siying fbshipit-source-id: 75ecec9eaa44c06118229df4f80c366115346592	2019-02-28 10:27:59 -08:00
Zhongyi Xie	c4f5d0aa15	add GetStatsHistory to retrieve stats snapshots (#4748 ) Summary: This PR adds public `GetStatsHistory` API to retrieve stats history in the form of an std map. The key of the map is the timestamp in microseconds when the stats snapshot is taken, the value is another std map from stats name to stats value (stored in std string). Two DBOptions are introduced: `stats_persist_period_sec` (default 10 minutes) controls the interval between two snapshots; `max_stats_history_count` (default 10) controls the max number of history snapshots to keep in memory. RocksDB will stop collecting stats snapshots if `stats_persist_period_sec` is set to 0. (This PR is the in-memory part of https://github.com/facebook/rocksdb/pull/4535) Pull Request resolved: https://github.com/facebook/rocksdb/pull/4748 Differential Revision: D13961471 Pulled By: miasantreble fbshipit-source-id: ac836d401ecb84ea92216bf9966f969dedf4ad04	2019-02-20 15:52:54 -08:00
Zhongyi Xie	ed995c6a69	add whole key bloom filter support in memtables (#4985 ) Summary: MyRocks calls `GetForUpdate` on `INSERT`, for unique key check, and in almost all cases GetForUpdate returns empty result. For such cases, whole key bloom filter is helpful. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4985 Differential Revision: D14118257 Pulled By: miasantreble fbshipit-source-id: d35cb7109c62fd5ad541a26968e3a3e16d3e85ea	2019-02-19 12:15:39 -08:00
Siying Dong	4db46aa2e6	Fix LITE Build (#4989 ) Summary: LITE mode has EventListener to be an empty class. However in db_bench, it is used. When "override" is added to the functions, the build breaks. Fix it by keeping the listener empty in LITE mode. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4989 Differential Revision: D14108132 Pulled By: siying fbshipit-source-id: 80121aab35b1120e502b37b782301dd700692697	2019-02-15 16:13:11 -08:00
Aubin Sanyal	3231a2e581	Deprecate ttl option from CompactionOptionsFIFO (#4965 ) Summary: We introduced ttl option in CompactionOptionsFIFO when ttl-based file deletion (compaction) was supported only as part of FIFO Compaction. But with the extension of ttl semantics even to Level compaction, CompactionOptionsFIFO.ttl can now be deprecated. Instead we will start using ColumnFamilyOptions.ttl for FIFO compaction as well. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4965 Differential Revision: D14072960 Pulled By: sagar0 fbshipit-source-id: c98cc2ae695a28136295787cd88d36a220fc219e	2019-02-15 09:51:41 -08:00
Michael Liu	ca89ac2ba9	Apply modernize-use-override (2nd iteration) Summary: Use C++11’s override and remove virtual where applicable. Change are automatically generated. Reviewed By: Orvid Differential Revision: D14090024 fbshipit-source-id: 1e9432e87d2657e1ff0028e15370a85d1739ba2a	2019-02-14 14:41:36 -08:00
Yanqin Jin	4fc442029a	Avoid using kInAtomicGroup tag for single-cf op (#4981 ) Summary: if an operation just involves a single column family, then we do not have to set the kInAtomicGroup tag when writing to MANIFEST. This change can fix a compatibility test failure, i.e. 5.15 and earlier cannot recognize kInAtomicGroup tag. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4981 Differential Revision: D14072687 Pulled By: riversand963 fbshipit-source-id: 46b0c61e399f16c6b7169de0b33430d0ed90d6d4	2019-02-13 18:33:42 -08:00
Andrew Kryczka	62f70f6d14	Reduce scope of compression dictionary to single SST (#4952 ) Summary: Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio. So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include: - The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called. - After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up. - Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952 Differential Revision: D13967980 Pulled By: ajkr fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f	2019-02-11 19:47:32 -08:00
Andrew Kryczka	1218704b61	Fix `compression_zstd_max_train_bytes` coverage in stress test (#4957 ) Summary: Previously `finalize_and_sanitize` function was always zeroing out `compression_zstd_max_train_bytes`. It was only supposed to do that when non-ZSTD compression was used. But since `--compression_type` was an unknown argument (i.e., one that `db_crashtest.py` does not recognize and blindly forwards to `db_stress`), `finalize_and_sanitize` could not tell whether ZSTD was used. This PR fixes it simply by making `--compression_type` a known argument with snappy as default (same as `db_stress`). Pull Request resolved: https://github.com/facebook/rocksdb/pull/4957 Differential Revision: D13994302 Pulled By: ajkr fbshipit-source-id: 1b0baea7331397822830970d3698642eb7a7df65	2019-02-11 14:56:39 -08:00
Siying Dong	cf3a671733	Remove cuckoo hash memtable (#4953 ) Summary: Cuckoo Hash is less useful than we initially expected. Remove it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4953 Differential Revision: D13979264 Pulled By: siying fbshipit-source-id: 2a60afdaa989f045357398b43a1cc5d46f4492ed	2019-02-07 16:15:27 -08:00
Siying Dong	d9c9f3c809	db_bench: fix "micros/op" reporting (#4949 ) Summary: `4985a9f73b (diff-e5276985b26a0551957144f4420a594bR511)` changes the meaning of latency reporting from running time per query, to elapse_time / #ops, without providing a reason why. Since this reporting is counter-intuitive, revert the change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4949 Differential Revision: D13964684 Pulled By: siying fbshipit-source-id: d6304d3d4b5a802daa292302623c7dbca9a680bc	2019-02-05 17:20:02 -08:00
Alexander Zinoviev	32a6dd9a41	Add a new CPU time counter to compaction report (#4889 ) Summary: Measure CPU time consumed for a compaction and report it in the stats report Enable NowCPUNanos() to work for MacOS Pull Request resolved: https://github.com/facebook/rocksdb/pull/4889 Differential Revision: D13701276 Pulled By: zinoale fbshipit-source-id: 5024e5bbccd4dd10fd90d947870237f436445055	2019-01-29 17:24:00 -08:00
zhichao-cao	e2547103fd	Fix the build error caused by the dynamic array (#4918 ) Summary: In the MixGraph benchmark of db_bench #4788 , the char array is initialized with an argument from user's input, which can cause build error on some platforms. Also, the msg char array size can be potentially smaller than the printed data, which should be extended from 100 to 256. Tested with make check. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4918 Differential Revision: D13844298 Pulled By: sagar0 fbshipit-source-id: 33c4809c5c4438f0a9f7b289d3f42e20c545bbab	2019-01-28 12:39:57 -08:00
Andrew Kryczka	e242fa4664	Add latest toolchain (gcc-8, etc.) build support for fbcode users (#4923 ) Summary: - When building with internal dependencies, specify this toolchain by setting `ROCKSDB_FBCODE_BUILD_WITH_PLATFORM007=1` - It is not enabled by default. However, it is enabled for TSAN builds in CI since there is a known problem with TSAN in gcc-5: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71090 - I did not add support for Lua since (1) we agreed to deprecate it, and (2) we only have an internal build for v5.3 with this toolchain while that has breaking changes compared to our current version (v5.2). Pull Request resolved: https://github.com/facebook/rocksdb/pull/4923 Differential Revision: D13827226 Pulled By: ajkr fbshipit-source-id: 9aa3388ed3679777cfb15ef8cbcb83c07f62f947	2019-01-28 11:26:32 -08:00
Sagar Vemuri	0cead31d10	Fix Clang static analyzer warning in db_bench (#4910 ) Summary: Fixed clang static analyzer warning about division by 0. ``` ar: creating librocksdb_debug.a tools/db_bench_tool.cc:4650:43: warning: Division by zero int pos = static_cast<int>(rand_num % range_); ~~~~~~~~~^~~~~~~~ 1 warning generated. make: *** [analyze] Error 1 ``` This is from the new code I recently merged in `ce8e88d2d7`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4910 Differential Revision: D13788037 Pulled By: sagar0 fbshipit-source-id: f48851dca85047c19fbb1a361e25ce643aa4c7ea	2019-01-23 13:33:02 -08:00
Zhongyi Xie	cbe0239270	add cast to avoid loss of precision error (#4906 ) Summary: this PR address the following error: > tools/db_bench_tool.cc:4776:68: error: implicit conversion loses integer precision: 'int64_t' (aka 'long') to 'unsigned int' [-Werror,-Wshorten-64-to-32] s = db_with_cfh->db->Put(write_options_, key, gen.Generate(value_size)); Pull Request resolved: https://github.com/facebook/rocksdb/pull/4906 Differential Revision: D13780185 Pulled By: miasantreble fbshipit-source-id: 1c83a77d341099518c72f0f4a63e97ab9c4784b3	2019-01-22 22:44:17 -08:00
Zhichao Cao	ce8e88d2d7	Generate mixed workload with Get, Put, Seek in db_bench (#4788 ) Summary: Based on the specific workload models (key access distribution, value size distribution, iterator scan length distribution, and the QPS variation), the MixGraph benchmark generates a synthetic workload according to these distributions, which can reflect real-world workload characteristics. After users enable the tracing function, they will get the trace file. By analyzing the trace file with the trace_analyzer tool, users can generate a set of statistics data files. The _accessed_key_stats.txt, -accessed_value_size_distribution.txt, -iterator_length_distribution.txt, and -qps_stats.txt files are mainly used for the Matlab model fitting. After that, users can get the parameters of the workload distributions (the modeling details are described [here](https://github.com/facebook/rocksdb/wiki/RocksDB-Trace%2C-Replay%2C-and-Analyzer)). The key access distribution follows the two-term power model. The probability density function is: `f(x) = ax^{b}+c`. The corresponding parameters are key_dist_a, key_dist_b, and key_dist_c in db_bench. The value size distribution and iterator scan length distribution both follow the Generalized Pareto Distribution. The probability density function is `f(x) = (1/sigma)*(1+k*(x-theta)/sigma)^{-1-1/k}`. The parameters are: value_k, value_theta, value_sigma and iter_k, iter_theta, iter_sigma. For more information about the Generalized Pareto Distribution, users can see the [wiki](https://en.wikipedia.org/wiki/Generalized_Pareto_distribution) and the [Matlab page](https://www.mathworks.com/help/stats/generalized-pareto-distribution.html). As for the QPS, it follows a diurnal pattern, so a sine function is a good model to fit it: `F(x) = sine_a*sin(sine_b*x + sine_c) + sine_d`. The trace_analyzer will tell you the average QPS in the printed results, which is sine_d. 
After user fit the "-qps_stats.txt" to the Matlab model, user can get the sine_a, sine_b, and sine_c. By using the 4 parameters, user can control the QPS variation including the period, average, changes. To use the bench mark, user can indicate the following parameters as examples: ``` -benchmarks="mixgraph" -key_dist_a=0.002312 -key_dist_b=0.3467 -value_k=0.9233 -value_sigma=226.4092 -iter_k=2.517 -iter_sigma=14.236 -mix_get_ratio=0.7 -mix_put_ratio=0.25 -mix_seek_ratio=0.05 -sine_mix_rate_interval_milliseconds=500 -sine_a=15000 -sine_b=1 -sine_d=20000 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4788 Differential Revision: D13573940 Pulled By: sagar0 fbshipit-source-id: e184c27e07b4f1bc0b436c2be36c5090c1fb0222	2019-01-22 10:44:26 -08:00
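The three models named in the MixGraph commit message (two-term power, Generalized Pareto, and sine QPS) can be written out directly to sanity-check fitted parameters. The following is a minimal sketch, not db_bench code: the function names are hypothetical, and the defaults `c=0.0` and `theta=0.0` are assumptions made only because the example flags in the message do not set those values.

```python
import math


def key_access_pdf(x, a, b, c=0.0):
    """Two-term power model: f(x) = a * x**b + c (c=0.0 is an assumed default)."""
    return a * x ** b + c


def generalized_pareto_pdf(x, k, sigma, theta=0.0):
    """Generalized Pareto density for k != 0 and x >= theta:
    f(x) = (1/sigma) * (1 + k*(x - theta)/sigma) ** (-1 - 1/k).
    theta=0.0 is an assumed default."""
    return (1.0 / sigma) * (1.0 + k * (x - theta) / sigma) ** (-1.0 - 1.0 / k)


def sine_qps(x, sine_a, sine_b, sine_c, sine_d):
    """Diurnal QPS model: F(x) = sine_a * sin(sine_b * x + sine_c) + sine_d."""
    return sine_a * math.sin(sine_b * x + sine_c) + sine_d


if __name__ == "__main__":
    # Parameter values taken from the example flags in the commit message.
    print(key_access_pdf(1.0, a=0.002312, b=0.3467))
    print(generalized_pareto_pdf(100.0, k=0.9233, sigma=226.4092))
    print(sine_qps(0.0, sine_a=15000, sine_b=1, sine_c=0.0, sine_d=20000))
```

At `x = theta` the Pareto density reduces to `1/sigma`, and the sine model oscillates around `sine_d`, which matches the statement that `sine_d` is the average QPS.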
Andrew Kryczka	01013ae766	Digest ZSTD compression dictionary once when writing SST file (#4849 ) Summary: This is essentially a re-submission of #4251 with a few improvements: - Split `CompressionDict` into two separate classes: `CompressionDict` and `UncompressionDict` - Eliminated `Init` functions. Instead do all initialization work in constructors. - Added test case for parallel DB open, which is the scenario where #4251 failed under TSAN. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4849 Differential Revision: D13606039 Pulled By: ajkr fbshipit-source-id: 08c236059798c710db9cbf545fce0f371232d447	2019-01-18 19:12:57 -08:00
Siying Dong	4e37251b4d	With ldb --try_load_options and wal_dir doesn't exist, ignore it (#4875 ) Summary: LDB is frequently used to examine copied data. The wal_dir in the option file is not modified, so it usually points to the path the data was copied from. The user experience is better if, when ldb sees that the wal_dir given by the option file doesn't exist, it ignores the setting rather than failing. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4875 Differential Revision: D13643173 Pulled By: siying fbshipit-source-id: 2e64d4ea2ec49a6794b9a706b7fc1ba901128bb8	2019-01-11 16:48:32 -08:00
Yanqin Jin	ffc9f84649	Free memory after use Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4857 Differential Revision: D13602688 Pulled By: riversand963 fbshipit-source-id: 993419a6afb982a7a701ff71daebebb4b4a6b265	2019-01-08 17:19:09 -08:00
Yanqin Jin	e686caffec	Remove unnecessary assertion in AtomicFlushStressTest::TestCheckpoint (#4846 ) Summary: as titled. We can remove the assertion because we do not perform verification in AtomicFlushStressTest::TestCheckpoint for similar reasons to TestGet, TestPut, etc. Therefore, we override TestCheckpoint in AtomicFlushStressTest so that the assertion `rand_column_families.size() == rand_keys.size()` is removed, and we do not verify the DB in this function. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4846 Differential Revision: D13583377 Pulled By: riversand963 fbshipit-source-id: 03647f3da67e27a397413fd666e3bb43003bf596	2019-01-07 16:47:26 -08:00
Huachao Huang	74f7d7551e	tools: use provided options instead of the default (#4839 ) Summary: The current implementation hardcodes the default options in different places, which makes it impossible to support other environments (like an encrypted environment). Pull Request resolved: https://github.com/facebook/rocksdb/pull/4839 Differential Revision: D13573578 Pulled By: sagar0 fbshipit-source-id: 76b58b4b758902798d10ff2f52d9f39abff015e7	2019-01-03 11:23:49 -08:00
Yanqin Jin	565b5bdc42	Add support for read-only db chkpt stress (#4690 ) Summary: Updated stress test will support testing of db in read-only mode. The user has to make sure that only read/scan operations are enabled. This PR relies on #4681. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4690 Differential Revision: D13102741 Pulled By: riversand963 fbshipit-source-id: f5a36b34db187fe12dd355f7eda161f99d6c75e4	2019-01-02 17:40:53 -08:00
Andrew Kryczka	ace543a815	fix accounting for range tombstones in TableProperties (#4841 ) Summary: - To be consistent with the accounting of other optypes in `TableProperties`, we should count range tombstones in `TableProperties::num_entries` and `TableProperties::num_deletions`. - Updated assertions in stress test's `OnTableFileCreated` handler to accept files with range tombstones only. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4841 Differential Revision: D13568424 Pulled By: ajkr fbshipit-source-id: 0139d7806494eda20ece67ec460d2458dbbf6026	2019-01-02 15:08:53 -08:00
Andrew Kryczka	68d949b3e3	Enable DeleteRange in stress/crash tests (#4483 ) Summary: Set `delrangepercent=1` when `test_batches_snapshots=false`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4483 Differential Revision: D10324361 Pulled By: ajkr fbshipit-source-id: 0cde1f1504f9493408a0c6493b976d7e5f5b2d23	2018-12-18 13:42:49 -08:00
Fosco Marotto	311cd8cf2f	Updated benchmark script (#4134 ) Summary: When producing the updated performance on flash results for the wiki, these are the updates which were made. https://github.com/facebook/rocksdb/wiki/Performance-Benchmarks Pull Request resolved: https://github.com/facebook/rocksdb/pull/4134 Differential Revision: D13491052 Pulled By: gfosco fbshipit-source-id: dcd92f24659e0917cb1ac54a4446aa8e7aac8b0d	2018-12-17 16:34:30 -08:00
Andrew Kryczka	8d2b74d287	Refine db_stress params for atomic flush (#4781 ) Summary: Separate flag for enabling option from flag for enabling dedicated atomic stress test. I have found setting the former without setting the latter can detect different problems. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4781 Differential Revision: D13463211 Pulled By: ajkr fbshipit-source-id: 054f777885b2dc7d5ea99faafa21d6537eee45fd	2018-12-13 22:10:38 -08:00
DorianZheng	2670fe8c73	Get `CompactionJobInfo` from CompactFiles Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4716 Differential Revision: D13207677 Pulled By: ajkr fbshipit-source-id: d0ccf5a66df6cbb07288b0c5ebad81fd9df3926b	2018-12-13 14:21:24 -08:00
Sagar Vemuri	70645355ad	Move FIFOCompactionPicker to a separate file (#4724 ) Summary: Summary: Simplified the code layout by moving FIFOCompactionPicker to a separate file. Why?: While trying to add ttl functionality to universal compaction, I found that `FIFOCompactionPicker` class and its impl methods to be interspersed between `LevelCompactionPicker` methods which kind-of made the code a little hard to traverse. So I moved `FIFOCompactionPicker` to a separate compaction_picker_fifo.h/cc file, similar to `UniversalCompactionPicker`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4724 Differential Revision: D13227914 Pulled By: sagar0 fbshipit-source-id: 89471766ea67fa4d87664a41c057dd7df4b3d4e3	2018-11-29 16:04:52 -08:00
Sagar Vemuri	c94f073e5e	Fix Mac build break in casting (#4722 ) Summary: Mac build is failing with the below error: ``` $ make db_bench -j8 ... ... tools/db_bench_tool.cc:4583:25: error: no matching function for call to 'max' (uint64_t)std::max(0l, seek_pos - FLAGS_max_scan_distance), ^~~~~~~~ /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/algorithm:2717:1: note: candidate template ignored: deduced conflicting types for parameter '_Tp' ('long' vs. 'long long') max(const _Tp& __a, const _Tp& __b) ^ /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/algorithm:2727:1: note: candidate template ignored: could not match 'initializer_list<type-parameter-0-0>' against 'long' max(initializer_list<_Tp> __t, _Compare __comp) ^ /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/algorithm:2709:1: note: candidate function template not viable: requires 3 arguments, but 2 were provided max(const _Tp& __a, const _Tp& __b, _Compare __comp) ^ /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/algorithm:2735:1: note: candidate function template not viable: requires single argument '__t', but 2 arguments were provided max(initializer_list<_Tp> __t) ^ 1 error generated. make: *** [tools/db_bench_tool.o] Error 1 ``` My compiler version: Mac OS X Mojave ``` $ clang++ --version Apple LLVM version 10.0.0 (clang-1000.11.45.5) Target: x86_64-apple-darwin18.2.0 Thread model: posix InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4722 Differential Revision: D13220196 Pulled By: sagar0 fbshipit-source-id: 01e5e928288a5613027c83a26ad8aedf04438b14	2018-11-27 13:30:16 -08:00
Huachao Huang	5e72bc113a	Add SstFileReader to read sst files (#4717 ) Summary: A user friendly sst file reader is useful when we want to access sst files outside of RocksDB. For example, we can generate an sst file with SstFileWriter and send it to other places, then use SstFileReader to read the file and process the entries in other ways. Also rename the original SstFileReader to SstFileDumper because of name conflict, and seems SstFileDumper is more appropriate for tools. TODO: there is only a very simple test now, because I want to get some feedback first. If the changes look good, I will add more tests soon. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4717 Differential Revision: D13212686 Pulled By: ajkr fbshipit-source-id: 737593383264c954b79e63edaf44aaae0d947e56	2018-11-27 13:02:23 -08:00
Po-Chuan Hsieh	60deb4485e	Fix build with ROCKSDB_LITE and -Wunused-private-field (#4715 ) Summary: The error message of databases/rocksdb-lite (FreeBSD port) is as follows: ``` tools/db_bench_tool.cc:1976:16: error: private field 'trace_options_' is not used [-Werror,-Wunused-private-field] TraceOptions trace_options_; ^ ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4715 Differential Revision: D13207902 Pulled By: ajkr fbshipit-source-id: be3c612eba656aeddb77e35e2f201dd25dc92f7e	2018-11-26 21:35:38 -08:00
Abhishek Madan	0ed738fdd0	Add max_scan_distance flag to db_bench (#4660 ) Summary: The new flag makes it possible to constrain iterator traversal by the upper/lower bound the iterator is expected to pass. This allows seekrandom results to be more easily comparable between DBs with and without deletions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4660 Differential Revision: D13053111 Pulled By: abhimadan fbshipit-source-id: 33e250f2e2d210b54c7726399da30a33f723c33c	2018-11-14 10:46:12 -08:00
Yanqin Jin	de65103553	Improve result report of scan (#4648 ) Summary: When iterator becomes invalid, there are two possibilities. First, all data in the column family have been scanned and there is nothing more to scan. Second, an underlying error has occurred, causing `status()` to be !ok. Therefore, we need to check for both cases when `!iter->Valid()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4648 Differential Revision: D12959601 Pulled By: riversand963 fbshipit-source-id: 49c9382c9ea9e78f2e2b6f3708f0670b822ca8dd	2018-11-13 20:03:59 -08:00
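The two cases described in the scan-report commit above (data exhausted versus an underlying error) can be illustrated with a small model. This is a sketch only: `FakeIterator` and `scan` are hypothetical stand-ins written for this example and are not the RocksDB iterator API; the point is that the status has to be checked after the loop ends, because an invalid iterator alone does not say which case occurred.

```python
class FakeIterator:
    """Hypothetical stand-in for a DB iterator; not the RocksDB API."""

    def __init__(self, keys, fail_at=None):
        self._keys = list(keys)
        self._pos = 0
        self._fail_at = fail_at  # position at which a simulated error occurs
        self._error = None

    def valid(self):
        # Invalid either because the data ran out or because an error occurred.
        return self._error is None and self._pos < len(self._keys)

    def next(self):
        self._pos += 1
        if self._fail_at is not None and self._pos == self._fail_at:
            self._error = "IO error (simulated)"

    def error(self):
        # None means the status is OK.
        return self._error


def scan(it):
    """Count entries, then report why the iterator became invalid."""
    count = 0
    while it.valid():
        count += 1
        it.next()
    if it.error() is None:
        return count, "end of data"
    return count, "error: " + it.error()


if __name__ == "__main__":
    print(scan(FakeIterator(["a", "b", "c"])))
    print(scan(FakeIterator(["a", "b", "c"], fail_at=2)))
```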
Zhichao Cao	d761857d56	Add unique key number changing statistics to Trace_analyzer (#4646 ) Summary: Changes: 1. In the current version, the key size distribution is printed out as the result. In this change, the result is written to a file to make further analysis easier. 2. To understand how the unique keys are accessed over time, the total unique key number of each CF of each query type in each second is written to a file. In this way, users can see when the unique keys are accessed frequently or rarely. 3. Output the total QPS of each CF to a file. 4. Add the printed result of total queries of each CF of each query type. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4646 Differential Revision: D12968156 Pulled By: zhichao-cao fbshipit-source-id: 6c411c7ec47c7843a70929136efd71a150db0e4c	2018-11-12 08:26:50 -08:00
Sagar Vemuri	dc3528077a	Update all unique/shared_ptr instances to be qualified with namespace std (#4638 ) Summary: Ran the following commands to recursively change all the files under RocksDB: ``` find . -type f -name ".cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} + find . -type f -name ".cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} + find . -type f -name ".cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} + find . -type f -name ".cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} + ``` Running `make format` updated some formatting on the files touched. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638 Differential Revision: D12934992 Pulled By: sagar0 fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8	2018-11-09 11:19:58 -08:00
Andrew Kryczka	8ba17f382e	Verify restore from backup in db_stress (#4655 ) Summary: We already exercised backup functionality in `db_stress` according to the `-backup_one_in` flag. This PR verifies the backup can be restored/opened and sanity checks a few keys. Changes in this PR: - Extracted existing backup-related logic to a helper function, `TestBackupRestore` - Added restore logic, which targets a hidden directory named "./.restore\<thread number\>", similar to how backups target hidden directories named "./.backup\<thread number\>". - After restore, check the existence/non-existence of a few keys. - With this PR, backup is no longer compatible with clearing column families. - Also included unrelated fixes to set `ReadOptions::total_order_seek=true` when using `-compare_full_db_state_snapshot` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4655 Differential Revision: D12972496 Pulled By: ajkr fbshipit-source-id: 481a40052d9a38d1bd5c5159aa4d7c5a4b546b80	2018-11-08 15:15:24 -08:00
Yanqin Jin	d7a04383d1	Include newer RocksDB versions in compat test (#4634 ) Summary: Include 5.16 and 5.17 in check_format_compatible.sh Pull Request resolved: https://github.com/facebook/rocksdb/pull/4634 Differential Revision: D12947140 Pulled By: riversand963 fbshipit-source-id: 79852b76d5139b2f31db59ed14cb368be01f2c32	2018-11-06 14:25:39 -08:00
Yanqin Jin	50895e5f0d	Update manual flush stress test (#4608 ) Summary: Originally, the manual flush calls in db_stress flushed only a single column family, which is not sufficient when atomic flush is enabled. With atomic flush, we should call `Flush(flush_opts, cfhs)` to better test this new feature. Specifically, we manually flush all column families so that database verification is easier. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4608 Differential Revision: D12849160 Pulled By: riversand963 fbshipit-source-id: ae1f0dd825247b42c0aba520a5c967335102c876	2018-10-30 17:30:28 -07:00
Abhishek Madan	eaaf1a6f05	Promote rocksdb.{deleted.keys,merge.operands} to main table properties (#4594 ) Summary: Since the number of range deletions are reported in TableProperties, it is confusing to not report the number of merge operands and point deletions as top-level properties; they are accessible through the public API, but since they are not the "main" properties, they do not appear in aggregated table properties, or the string representation of table properties. This change promotes those two property keys to `rocksdb/table_properties.h`, adds corresponding uint64 members for them, deprecates the old access methods `GetDeletedKeys()` and `GetMergeOperands()` (though they are still usable for now), and removes `InternalKeyPropertiesCollector`. The property key strings are the same as before this change, so this should be able to read DBs written from older versions (though I haven't tested this yet). Pull Request resolved: https://github.com/facebook/rocksdb/pull/4594 Differential Revision: D12826893 Pulled By: abhimadan fbshipit-source-id: 9e4e4fbdc5b0da161c89582566d184101ba8eb68	2018-10-30 15:34:27 -07:00
Yanqin Jin	912bbbbc72	Enable crash-recovery stress test for atomic flush (#4605 ) Summary: This PR adds test of atomic flush to our continuous stress tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4605 Differential Revision: D12840607 Pulled By: riversand963 fbshipit-source-id: 0da187572791a59530065a7952697c05b1197ad9	2018-10-30 14:03:36 -07:00
Yanqin Jin	7fb39f1ae1	Fix a warning against implicit type conversion (#4593 ) Summary: Test plan ``` $USE_CLANG=1 make -j32 all check ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4593 Differential Revision: D12811159 Pulled By: riversand963 fbshipit-source-id: 5e3bbe058c5a8d5a286a19d7643593fc154a2d6d	2018-10-29 09:54:36 -07:00
Yanqin Jin	5b4c709fad	Enable atomic flush (#4023 ) Summary: Adds a DB option `atomic_flush` to control whether to enable this feature. This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752). Pull Request resolved: https://github.com/facebook/rocksdb/pull/4023 Differential Revision: D8518381 Pulled By: riversand963 fbshipit-source-id: 1e3bb33e99bb102876a31b378d93b0138ff6634f	2018-10-26 15:08:43 -07:00
Zhongyi Xie	fe0d23059d	Fix two contrun job failures (#4587 ) Summary: Currently there are two contrun test failures: * rocksdb-contrun-lite: > tools/db_bench_tool.cc: In function ‘int rocksdb::db_bench_tool(int, char**)’: tools/db_bench_tool.cc:5814:5: error: ‘DumpMallocStats’ is not a member of ‘rocksdb’ rocksdb::DumpMallocStats(&stats_string); ^ make: *** [tools/db_bench_tool.o] Error 1 * rocksdb-contrun-unity: > In file included from unity.cc:44:0: db/range_tombstone_fragmenter.cc: In member function ‘void rocksdb::FragmentedRangeTombstoneIterator::FragmentTombstones(std::unique_ptr<rocksdb::InternalIteratorBase<rocksdb::Slice> >, rocksdb::SequenceNumber)’: db/range_tombstone_fragmenter.cc:90:14: error: reference to ‘ParsedInternalKeyComparator’ is ambiguous auto cmp = ParsedInternalKeyComparator(icmp_); This PR fixes them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4587 Differential Revision: D10846554 Pulled By: miasantreble fbshipit-source-id: 8d3358879e105060197b1379c84aecf51b352b93	2018-10-24 20:16:45 -07:00
Yi Wu	0415244bfa	option to print malloc stats at the end of db_bench (#4582 ) Summary: Option to print malloc stats to stdout at the end of db_bench. This is different from `--dump_malloc_stats`, which periodically prints the same information to the LOG file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4582 Differential Revision: D10520814 Pulled By: yiwu-arbug fbshipit-source-id: beff5e514e414079d31092b630813f82939ffe5c	2018-10-24 11:39:05 -07:00
Simon Grätzer	f959e88048	Fix printf formatting on MacOS (#4533 ) Summary: On MacOS with clang the compilation of _tools/db_bench_tool.cc_ always fails because the format used in a `fprintf` call has the wrong type. This PR should hopefully fix this issue ``` tools/db_bench_tool.cc:4233:61: error: format specifies type 'unsigned long long' but the argument has type 'size_t' (aka 'unsigned long') ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4533 Differential Revision: D10471657 Pulled By: maysamyabandeh fbshipit-source-id: f20f5f3756d3571b586c895c845d0d4d1e34a398	2018-10-19 14:46:09 -07:00
Yanqin Jin	da4aa59b4c	Add read retry support to log reader (#4394 ) Summary: Current `log::Reader` does not perform retry after encountering `EOF`. In the future, we need the log reader to be able to retry tailing the log even after `EOF`. Current implementation is simple. It does not provide more advanced retry policies. Will address this in the future. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4394 Differential Revision: D9926508 Pulled By: riversand963 fbshipit-source-id: d86d145792a41bd64a72f642a2a08c7b7b5201e1	2018-10-19 11:53:00 -07:00
Abhishek Madan	35cd754a6d	Add writes_before_delete_range flag to db_bench (#4538 ) Summary: The new flag allows tombstones to be generated after enough keys have been written to the database, which makes it easier to ensure that tombstones cover a lot of keys. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4538 Differential Revision: D10455685 Pulled By: abhimadan fbshipit-source-id: f25d5421745a353c830dea12b79784e852056551	2018-10-18 17:19:59 -07:00
Zhongyi Xie	d6ec288703	Add PerfContextByLevel to provide per level perf context information (#4226 ) Summary: The current implementation of perf context is level agnostic, making it hard to do performance evaluation for the LSM tree. This PR adds `PerfContextByLevel` to decompose the counters by level. This will be helpful when analyzing point and range query performance as well as when tuning the bloom filter. Also replaced __thread with the thread_local keyword for perf_context. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4226 Differential Revision: D10369509 Pulled By: miasantreble fbshipit-source-id: f1ced4e0de5fcebdb7f9cff36164516bc6382d82	2018-10-17 11:19:40 -07:00
Young Tack Jin	c648d90f8e	benchmark.sh: to fix divide by zero runtime error (#4442 ) Summary: "Write (GB)" of $9 rather than "Rnp1 (GB)" of $8 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4442 Differential Revision: D10318193 Pulled By: yiwu-arbug fbshipit-source-id: 03a7ef1938d9332e06fb3fd8490ca212f61fac6b	2018-10-10 21:03:19 -07:00
Zhichao Cao	7ca1a1f0d8	Fix trace_analyzer potential huge memory wasting due to no valid query analyzed (#4473 ) Summary: If the query types being analyzed do not appear in the trace, the current trace_analyzer uses 0 as the begin time, which creates a time duration from 1970/01/01 to the current time and wastes a huge amount of memory. Fixed by adding the trace_create_time to limit the duration. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4473 Differential Revision: D10246204 Pulled By: zhichao-cao fbshipit-source-id: 42850b080b2e62f586fe73afd7737c2246d1a8c8	2018-10-10 10:00:00 -07:00
Igor Canadi	1cf5deb8fd	Introduce CacheAllocator, a custom allocator for cache blocks (#4437 ) Summary: This is a conceptually simple change, but it touches many files to pass the allocator through function calls. We introduce CacheAllocator, which can be used by clients to configure custom allocator for cache blocks. Our motivation is to hook this up with folly's `JemallocNodumpAllocator` (`f43ce6d686/folly/experimental/JemallocNodumpAllocator.h`), but there are many other possible use cases. Additionally, this commit cleans up memory allocation in `util/compression.h`, making sure that all allocations are wrapped in a unique_ptr as soon as possible. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4437 Differential Revision: D10132814 Pulled By: yiwu-arbug fbshipit-source-id: be1343a4b69f6048df127939fea9bbc96969f564	2018-10-02 17:24:58 -07:00
Andrew Kryczka	d56070d875	Fix benchmark script with vector memtable (#4428 ) Summary: I guess we didn't update this script when `--allow_concurrent_memtable_write` became true by default. Fixes #4413. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4428 Differential Revision: D10036452 Pulled By: ajkr fbshipit-source-id: f464be0642bd096d9040f82cdc3eae614a902183	2018-09-26 13:22:45 -07:00
Abhishek Madan	519f8b145f	Generate appropriate number of keys in db_bench (#4404 ) Summary: If range tombstones are generated every few writes, the KeyGenerator's limit is now extended to account for the additional Next() calls. This is primarily important for `filluniquerandom` benchmarks that enforce the call limit. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4404 Differential Revision: D9949326 Pulled By: abhimadan fbshipit-source-id: 0bdfeb2cad2098dc0b8b029236dab5e4bef25e38	2018-09-19 16:28:21 -07:00
Zhongyi Xie	9b3cf908a6	add missing range in random.choice argument (#4397 ) Summary: This will fix the broken asan crash test: > Traceback (most recent call last): File "tools/db_crashtest.py", line 384, in <module> main() File "tools/db_crashtest.py", line 368, in main parser.add_argument("--" + k, type=type(v() if callable(v) else v)) File "tools/db_crashtest.py", line 59, in <lambda> "index_block_restart_interval": lambda: random.choice(1, 16), TypeError: choice() takes exactly 2 arguments (3 given) Pull Request resolved: https://github.com/facebook/rocksdb/pull/4397 Differential Revision: D9933041 Pulled By: miasantreble fbshipit-source-id: 10998e5bc6b6a5cea3e4088b18465affc246e639	2018-09-19 12:13:20 -07:00
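The traceback in the commit above comes from calling `random.choice(1, 16)`: `random.choice` takes a single non-empty sequence, not separate values. A minimal sketch of one way to express "either 1 or 16" follows. The dictionary name `default_params` and the helper `resolve` are hypothetical, and whether the actual patch passed a list or a `range(...)` is not stated in the message, so the list form here is an assumption.

```python
import random

# Hypothetical parameter table in the style shown in the traceback:
# a value may be a constant or a zero-argument callable producing one.
default_params = {
    # Assumed fix: pass ONE sequence argument instead of two separate values.
    "index_block_restart_interval": lambda: random.choice([1, 16]),
}


def resolve(value):
    """Return the concrete value, calling it first if it is a callable."""
    return value() if callable(value) else value


if __name__ == "__main__":
    print(resolve(default_params["index_block_restart_interval"]))
    try:
        random.choice(1, 16)  # the broken call from the traceback
    except TypeError as exc:
        print("TypeError:", exc)
```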
Maysam Yabandeh	a0ebec3804	Extend crash test with index_block_restart_interval (#4383 ) Summary: The default for index_block_restart_interval is 1 but some use 16 in production. The patch extends crash test to test both values. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4383 Differential Revision: D9887304 Pulled By: maysamyabandeh fbshipit-source-id: a8d00fea974a79ad563f9f4d9d7b069e9f746a8f	2018-09-18 15:43:29 -07:00
Andrew Kryczka	8c25204633	Support manual flush in stress/crash tests (#4368 ) Summary: - Made stress test call `Flush()` periodically according to `--flush_one_in` flag. - Enabled by default in crash test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4368 Differential Revision: D9838593 Pulled By: ajkr fbshipit-source-id: fe5a6e49b36e5ea752acc3aa8be364f8ef34d9cc	2018-09-17 12:27:55 -07:00
Anand Ananthabhotla	a27fce408e	Auto recovery from out of space errors (#4164 ) Summary: This commit implements automatic recovery from a Status::NoSpace() error during background operations such as write callback, flush and compaction. The broad design is as follows - 1. Compaction errors are treated as soft errors and don't put the database in read-only mode. A compaction is delayed until enough free disk space is available to accommodate the compaction outputs, which is estimated based on the input size. This means that users can continue to write, and we rely on the WriteController to delay or stop writes if the compaction debt becomes too high due to a persistent low disk space condition. 2. Errors during write callback and flush are treated as hard errors, i.e. the database is put in read-only mode and goes back to read-write only after certain recovery actions are taken. 3. Both types of recovery rely on the SstFileManagerImpl to poll for sufficient disk space. We assume that there is a 1-1 mapping between an SFM and the underlying OS storage container. For cases where multiple DBs are hosted on a single storage container, the user is expected to allocate a single SFM instance and use the same one for all the DBs. If no SFM is specified by the user, DBImpl::Open() will allocate one, but this will be one per DB and each DB will recover independently. The recovery implemented by SFM is as follows - a) On the first occurrence of an out of space error during compaction, subsequent compactions will be delayed until the disk free space check indicates enough available space. The required space is computed as the sum of input sizes. 
b) The free space check requirement will be removed once the amount of free space is greater than the size reserved by in-progress compactions when the first error occurred. c) If the out of space error is a hard error, a background thread in SFM will poll for sufficient headroom before triggering the recovery of the database and putting it back in read-write mode. The headroom is calculated as the sum of the write_buffer_size of all the DB instances associated with the SFM. 4. EventListener callbacks will be called at the start and completion of automatic recovery. Users can disable the auto recovery in the start callback, and later initiate it manually by calling DB::Resume(). Todo: 1. More extensive testing 2. Add disk full condition to db_stress (follow-on PR) Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164 Differential Revision: D9846378 Pulled By: anand1976 fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a	2018-09-15 13:43:04 -07:00
Dmitri Smirnov	879998b369	Adjust c test and fix windows compilation issues Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4369 Differential Revision: D9844200 Pulled By: sagar0 fbshipit-source-id: 0d9f5f73b28234eaac55d3551ce4e2dc177af138	2018-09-14 20:57:22 -07:00
Andrew Kryczka	c94523ee56	Delete code for WAL reader to start at nonzero offset (#4362 ) Summary: The code is dead in RocksDB as `log::Reader::initial_offset_` is always zero. We should delete it so we don't have to maintain it like in #4359. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4362 Differential Revision: D9817829 Pulled By: ajkr fbshipit-source-id: 474a2c679e5bd273b40608f3a5332931d9eefe6d	2018-09-13 17:13:03 -07:00
kckjn97	902261519e	correct mistyped msg. (#4341 ) Summary: corrected the mistyped message. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4341 Differential Revision: D9816571 Pulled By: ajkr fbshipit-source-id: 1df0424e981a01470a638a37b925c4133d59a48b	2018-09-13 14:57:38 -07:00
Maysam Yabandeh	3f5282268f	Skip concurrency control during recovery of pessimistic txn (#4346 ) Summary: TransactionOptions::skip_concurrency_control allows pessimistic transactions to skip the overhead of concurrency control. This could be used as an optimization if the application knows that the transaction would not have any conflict with concurrent transactions. It is currently used during recovery, assuming (i) the application guarantees no conflict between prepared transactions in the WAL and (ii) the application guarantees that recovered transactions will be rolled back/committed before new transactions start. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4346 Differential Revision: D9759149 Pulled By: maysamyabandeh fbshipit-source-id: f896e84fa58b0b584be904c7fd3883a41ea3215b	2018-09-10 16:57:53 -07:00
Andrew Kryczka	2c14662213	Revert "Digest ZSTD compression dictionary once per SST file (#4251 )" (#4347 ) Summary: Reverting is needed to unblock a user building against master, who is blocked for multiple days due to a thread-safety issue in `GetEmptyDict`. We haven't been able to fix it quickly, so reverting. Simply ran `git revert 6c40806e51a89386d2b066fddf73d3fd03a36f65`. There were no merge conflicts. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4347 Differential Revision: D9668365 Pulled By: ajkr fbshipit-source-id: 0c56334f0a23cf5ee0233d4e4679eae6709739cd	2018-09-06 09:58:34 -07:00
Andrew Kryczka	1a88c43751	Reduce empty SST creation/deletion in compaction (#4336 ) Summary: This is a followup to #4311. Checking `!RangeDelAggregator::IsEmpty()` before opening a dedicated range tombstone SST did not properly prevent empty SSTs from being generated. That's because it relies on `CollapsedRangeDelMap::Size`, which had an underflow bug when the map was empty. This PR fixes that underflow bug. Also fixed an uninitialized variable in db_stress. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4336 Differential Revision: D9600080 Pulled By: ajkr fbshipit-source-id: bc6980ca79d2cd01b825ebc9dbccd51c1a70cfc7	2018-08-31 12:28:52 -07:00
Zhongyi Xie	1cf17ba53b	Rename DecodeCFAndKey to resolve naming conflict in unity test (#4323 ) Summary: Currently unity-test is failing because both trace_replay.cc and trace_analyzer_tool.cc defined `DecodeCFAndKey` under anonymous namespace. It is supposed to be fine except unity test will dump all source files together and now we have a conflict. Another issue with trace_analyzer_tool.cc is that it is using some utility functions from ldb_cmd which is not included in Makefile for unity_test, I chose to update TESTHARNESS to include LIBOBJECTS. Feel free to comment if there is a less intrusive way to solve this. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4323 Differential Revision: D9599170 Pulled By: miasantreble fbshipit-source-id: 38765b11f8e7de92b43c63bdcf43ea914abdc029	2018-08-30 18:42:51 -07:00
Shrikanth Shankar	4848bd0c4e	Drop unnecessary deletion markers during compaction (issue - 3842) (#4289 ) Summary: This PR fixes issue 3842. We drop deletion markers iff 1. We are at the bottommost level AND 2. All other occurrences of the key are in the same snapshot range as the delete. I've also enhanced db_stress_test to add an option that does a full compare of the keys. This is done by a single thread (thread # 0). For tests I've run (so far) make check -j64 db_stress db_stress --acquire_snapshot_one_in=1000 --ops_per_thread=100000 /* to verify that new code doesn't break existing tests */ ./db_stress --compare_full_db_state_snapshot=true --acquire_snapshot_one_in=1000 --ops_per_thread=100000 /* to verify new test code */ Pull Request resolved: https://github.com/facebook/rocksdb/pull/4289 Differential Revision: D9491165 Pulled By: shrikanthshankar fbshipit-source-id: ce144834f31736c189aaca81bed356ba990331e2	2018-08-24 15:17:54 -07:00
Yanqin Jin	8022500ecc	Add compatibility test of SST ingestion (#4310 ) Summary: Test plan ``` $cd rocksdb/ $./tools/check_format_compatible.sh ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4310 Differential Revision: D9498125 Pulled By: riversand963 fbshipit-source-id: 83cf6992949a52199e7812bb41bc9281ac271a24	2018-08-24 14:27:43 -07:00
Andrew Kryczka	e7bb8e9b92	Fix clang build of db_stress (#4312 ) Summary: Blame: #4307 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4312 Differential Revision: D9494093 Pulled By: ajkr fbshipit-source-id: eb6be2675c08b9ab508378d45110eb0fcf260a42	2018-08-23 21:57:57 -07:00
Andrew Kryczka	6c40806e51	Digest ZSTD compression dictionary once per SST file (#4251 ) Summary: In RocksDB, for a given SST file, all data blocks are compressed with the same dictionary. When we compress a block using the dictionary's raw bytes, the compression library first has to digest the dictionary to get it into a usable form. This digestion work is redundant and ideally should be done once per file. ZSTD offers APIs for the caller to create and reuse a digested dictionary object (`ZSTD_CDict`). In this PR, we call `ZSTD_createCDict` once per file to digest the raw bytes. Then we use `ZSTD_compress_usingCDict` to compress each data block using the pre-digested dictionary. Once the file's created `ZSTD_freeCDict` releases the resources held by the digested dictionary. There are a couple other changes included in this PR: - Changed the parameter object for (un)compression functions from `CompressionContext`/`UncompressionContext` to `CompressionInfo`/`UncompressionInfo`. This avoids the previous pattern, where `CompressionContext`/`UncompressionContext` had to be mutated before calling a (un)compression function depending on whether dictionary should be used. I felt that mutation was error-prone so eliminated it. - Added support for digested uncompression dictionaries (`ZSTD_DDict`) as well. However, this PR does not support reusing them across uncompression calls for the same file. That work is deferred to a later PR when we will store the `ZSTD_DDict` objects in block cache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4251 Differential Revision: D9257078 Pulled By: ajkr fbshipit-source-id: 21b8cb6bbdd48e459f1c62343780ab66c0a64438	2018-08-23 19:28:18 -07:00
Andrew Kryczka	ee234e83e3	Invoke OnTableFileCreated for empty SSTs (#4307 ) Summary: The API comment on `OnTableFileCreationStarted` (`b6280d01f9/include/rocksdb/listener.h (L331-L333)`) led users to believe a call to `OnTableFileCreationStarted` will always be matched with a call to `OnTableFileCreated`. However, we were skipping the `OnTableFileCreated` call in one case: no error happens but also no file is generated since there's no data. This PR adds the call to `OnTableFileCreated` for that case. The filename will be "(nil)" and the size will be zero. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4307 Differential Revision: D9485201 Pulled By: ajkr fbshipit-source-id: 2f077ec7913f128487aae2624c69a50762394df6	2018-08-23 18:27:30 -07:00
zhichao-cao	cf7150ac2e	Add the unit test of Iterator to trace_analyzer_test (#4282 ) Summary: Add the unit test of Iterator (Seek and SeekForPrev) to trace_analyzer_test. The output files after analyzing the trace file are checked to make sure that analyzing results are correct. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4282 Differential Revision: D9436758 Pulled By: zhichao-cao fbshipit-source-id: 88d471c9a69e07382d9c6a45eba72773b171e7c2	2018-08-23 17:28:32 -07:00
Yanqin Jin	bb5dcea98e	Add path to WritableFileWriter. (#4039 ) Summary: We want to sample the file I/O issued by RocksDB and report the function calls. This requires us to include the file paths otherwise it's hard to tell what has been going on. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4039 Differential Revision: D8670178 Pulled By: riversand963 fbshipit-source-id: 97ee806d1c583a2983e28e213ee764dc6ac28f7a	2018-08-23 10:12:58 -07:00
Fenggang Wu	9d646a6311	Add db_bench options of data block hash index (#4281 ) Summary: Add `--data_block_index_type` and `--data_block_hash_table_util_ratio` option to `db_bench`. `--data_block_index_type` can be either of `binary` (default) or `binary_and_hash`; `--data_block_hash_table_util_ratio` will be a double. The default value is `0.75`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4281 Differential Revision: D9361476 Pulled By: fgwu fbshipit-source-id: dc53e01acef9db81b9eec5e8a96f3bc8ed718c10	2018-08-16 18:42:46 -07:00
Siying Dong	9c0c8f5ff6	GetAllKeyVersions() to take an extra argument of `max_num_ikeys`. (#4271 ) Summary: Right now, `ldb idump` may have memory out of control if there is a big range of tombstones. Add an option to cut maxinum number of keys in GetAllKeyVersions(), and push down --max_num_ikeys from ldb. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4271 Differential Revision: D9369149 Pulled By: siying fbshipit-source-id: 7cbb797b7d2fa16573495a7e84937456d3ff25bf	2018-08-16 15:57:08 -07:00
Zhichao Cao	8ae2bf5331	Fix the build and test bugs in the Trace_analyzer (#4274 ) Summary: The wrong options are used in the trace_analyzer_test, removed. The potential loses integer precision are fixed. Pass the specified testing case, make asan_check Pull Request resolved: https://github.com/facebook/rocksdb/pull/4274 Reviewed By: yiwu-arbug Differential Revision: D9327811 Pulled By: zhichao-cao fbshipit-source-id: d62cb18d6586503a490cd323bfc1c672b68b346e	2018-08-14 18:27:48 -07:00
Anand Ananthabhotla	bf07e90cf2	Fix db_stress assertion failures on 0 byte SSTs (#4273 ) Summary: In the OnTableFileCreation() listener, assert on various TableProperties only when file size > 0 bytes. The listener can get called even for 0 byte SSTs which have been deleted. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4273 Differential Revision: D9322738 Pulled By: anand1976 fbshipit-source-id: 17cdfb3d0da946b9a158d7328e5db1c87973956b	2018-08-14 14:58:26 -07:00
Maysam Yabandeh	d122025891	Extend stress test to format_version 4 (#4265 ) Summary: Stress tests currently cover format_version 2 and 3. The patch adds 4 as well. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4265 Differential Revision: D9323185 Pulled By: maysamyabandeh fbshipit-source-id: 54d11e41ecae09bae14cadd7313f07c9a3db5a57	2018-08-14 14:13:33 -07:00
Zhichao Cao	999d955e4f	RocksDB Trace Analyzer (#4091 ) Summary: A framework of trace analyzing for RocksDB After collecting the trace by using the tool of [PR #3837](https://github.com/facebook/rocksdb/pull/3837). User can use the Trace Analyzer to interpret, analyze, and characterize the collected workload. Input: 1. trace file 2. Whole keys space file Statistics: 1. Access count of each operation (Get, Put, Delete, SingleDelete, DeleteRange, Merge) in each column family. 2. Key hotness (access count) of each one 3. Key space separation based on given prefix 4. Key size distribution 5. Value size distribution if appliable 6. Top K accessed keys 7. QPS statistics including the average QPS and peak QPS 8. Top K accessed prefix 9. The query correlation analyzing, output the number of X after Y and the corresponding average time intervals Output: 1. key access heat map (either in the accessed key space or whole key space) 2. trace sequence file (interpret the raw trace file to line base text file for future use) 3. Time serial (The key space ID and its access time) 4. Key access count distritbution 5. Key size distribution 6. Value size distribution (in each intervals) 7. whole key space separation by the prefix 8. Accessed key space separation by the prefix 9. QPS of each operation and each column family 10. Top K QPS and their accessed prefix range Test: 1. Added the unit test of analyzing Get, Put, Delete, SingleDelete, DeleteRange, Merge 2. Generated the trace and analyze the trace Implemented but not tested (due to the limitation of trace_replay): 1. Analyzing Iterator, supporting Seek() and SeekForPrev() analyzing 2. Analyzing the number of Key found by Get Future Work: 1. Support execution time analyzing of each requests 2. Support cache hit situation and block read situation of Get Pull Request resolved: https://github.com/facebook/rocksdb/pull/4091 Differential Revision: D9256157 Pulled By: zhichao-cao fbshipit-source-id: f0ceacb7eedbc43a3eee6e85b76087d7832a8fe6	2018-08-13 11:44:02 -07:00
Yanqin Jin	1b1d264342	Remove an assersion about file size (#4268 ) Summary: Due to `4ea56b1bd0`, we should also remove the assersion in stress test. This removal can be temporary, and we can add it back once we figure out the reason for the 0-byte SSTs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4268 Differential Revision: D9297186 Pulled By: riversand963 fbshipit-source-id: cebba9a68f42e815f8cf24471176d2cfdf962f63	2018-08-13 11:12:50 -07:00
Yanqin Jin	b271f956c2	Fix a TSAN failure (#4250 ) Summary: TSAN fails due to comparison between signed int and unsigned long. Fix it by static_casting. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4250 Differential Revision: D9256535 Pulled By: riversand963 fbshipit-source-id: c6bad23ff70c6d0ec58e2e85c401ce0ad45de609	2018-08-09 19:42:32 -07:00
Dmitri Smirnov	ab22cf349e	Implement Env::NumFileLinks (#4221 ) Summary: Although delete scheduler implementation allows for the interface not to be supported, the delete_scheduler_test does not allow for that. Address compiler warnings Make sst_dump_test use test directory structure as the current execution directory may not be writiable. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4221 Differential Revision: D9210152 Pulled By: siying fbshipit-source-id: 381a74511e969ecb8089d5c4b4df87dc30c8df63	2018-08-09 14:29:11 -07:00
Yanqin Jin	de7f423a82	Add SST ingestion to ldb (#4205 ) Summary: We add two subcommands `write_extern_sst` and `ingest_extern_sst` to ldb. This PR avoids changing existing code because we hope to cherry-pick to earlier releases to support compatibility check for external SST file ingestion. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4205 Differential Revision: D9112711 Pulled By: riversand963 fbshipit-source-id: 7cae88380d4de86da8440230e87eca66755648e4	2018-08-09 14:29:11 -07:00
Andrew Kryczka	7a9a164276	Fix db_bench default compression level (#4248 ) Summary: db_bench's previous default compression level (-1) was not the default compression level in all libraries. In particular, in ZSTD negative values are valid compression levels, while ZSTD's default compression level is three. This PR changes db_bench's default to be RocksDB's library-independent default compression level (see #3895). I also changed a couple other flags to get their default values from an options object directly rather than hardcoding. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4248 Differential Revision: D9235140 Pulled By: ajkr fbshipit-source-id: be4e0722d59fa1968832183db36d1d20fcf11e5b	2018-08-09 10:28:14 -07:00
Andrew Kryczka	6175b4b294	Support dictionary compression in stress/crash tests (#4234 ) Summary: - Add `--compression_max_dict_bytes` and `--compression_zstd_max_train_bytes` flags to stress test - Randomly enable/disable the above flags in crash test - Set `--compression_type=zstd` in FB-specific crash test runs Pull Request resolved: https://github.com/facebook/rocksdb/pull/4234 Differential Revision: D9187207 Pulled By: ajkr fbshipit-source-id: 8d78cf8d8e1165f2cd1c32e069b73726b5bc1fd2	2018-08-06 15:27:29 -07:00
Sagar Vemuri	fefdac1004	Fix lite build failure in db_bench due to trace/replay (#4225 ) Summary: Fix lite build failure in db_bench due to trace/replay feature. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4225 Differential Revision: D9153303 Pulled By: sagar0 fbshipit-source-id: 9f7a8035429d0dcdbe99616d11389ed7bccf44be	2018-08-03 11:58:55 -07:00
Pooja Malik	9dbf39399e	Rules Advisor: some fixes to support fetching stats from ODS (#4223 ) Summary: This PR includes fixes for some bugs that I encountered while testing the Optimizer with ODS stats support. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4223 Differential Revision: D9140786 Pulled By: poojam23 fbshipit-source-id: 045cb3f27d075c2042040ac2d561938349419516	2018-08-02 15:42:42 -07:00
Pooja Malik	892a156267	Advisor: README and blog, and also tests for DBBenchRunner, DatabaseOptions (#4201 ) Summary: This pull request adds a README file and a blog post for the Advisor tool. It also adds the missing tests for some Optimizer modules. Some comments are added to the classes being tested for improved readability. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4201 Reviewed By: maysamyabandeh Differential Revision: D9125311 Pulled By: poojam23 fbshipit-source-id: aefcf2f06eaa05490cc2834ef5aa6e21f0d1dc55	2018-08-01 16:13:09 -07:00
Sagar Vemuri	12b6cdeed3	Trace and Replay for RocksDB (#3837 ) Summary: A framework for tracing and replaying RocksDB operations. A binary trace file is created by capturing the DB operations, and it can be replayed back at the same rate using db_bench. - Column-families are supported - Multi-threaded tracing is supported. - TraceReader and TraceWriter are exposed to the user, so that tracing to various destinations can be enabled (say, to other messaging/logging services). By default, a FileTraceReader and FileTraceWriter are implemented to capture to a file and replay from it. - This is not yet ideal to be enabled in production due to large performance overhead, but it can be safely tried out in a shadow setup, say, for analyzing RocksDB operations. Currently supported DB operations: - Writes: -- Put -- Merge -- Delete -- SingleDelete -- DeleteRange -- Write - Reads: -- Get (point lookups) Pull Request resolved: https://github.com/facebook/rocksdb/pull/3837 Differential Revision: D7974837 Pulled By: sagar0 fbshipit-source-id: 8ec65aaf336504bc1f6ed0feae67f6ed5ef97a72	2018-08-01 00:27:08 -07:00
Yanqin Jin	8abafb1feb	Generalize parameters generation. (#4046 ) Summary: Making generation of column families and keys virtual function so that subclasses of StressTest can override them to provide custom parameter generation for more flexibility. This will be useful for future tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4046 Differential Revision: D9073382 Pulled By: riversand963 fbshipit-source-id: 2754f0fdfa5c24d95c1f92d4944bc479552fb665	2018-07-30 17:42:12 -07:00
Yanqin Jin	54de56844d	Remove random writes from SST file ingestion (#4172 ) Summary: RocksDB used to store global_seqno in external SST files written by SstFileWriter. During file ingestion, RocksDB uses `pwrite` to update the `global_seqno`. Since random write is not supported in some non-POSIX compliant file systems, external SST file ingestion is not supported on these file systems. To address this limitation, we no longer update `global_seqno` during file ingestion. Later RocksDB uses the MANIFEST and other information in table properties to deduce global seqno for externally-ingested SST files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4172 Differential Revision: D8961465 Pulled By: riversand963 fbshipit-source-id: 4382ec85270a96be5bc0cf33758ca2b167b05071	2018-07-27 16:12:23 -07:00
DorianZheng	f5e46354d2	Protect external file when ingesting (#4099 ) Summary: If crash happen after a hard link established, Recover function may reuse the file number that has already assigned to the internal file, and this will overwrite the external file. To protect the external file, we have to make sure the file number will never being reused. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4099 Differential Revision: D9034092 Pulled By: riversand963 fbshipit-source-id: 3f1a737440b86aa2ef01673e5013aacbb7c33e28	2018-07-27 14:13:12 -07:00
Pooja Malik	134a52e144	Optimizer's skeleton: use advisor to optimize config options (#4169 ) Summary: In https://github.com/facebook/rocksdb/pull/3934 we introduced advisor scripts that make suggestions in the config options based on the log file and stats from a run of rocksdb. The optimizer runs the advisor on a benchmark application in a loop and automatically applies the suggested changes until the config options are optimized. This is a work in progress and the patch is the initial skeleton for the optimizer. The sample application that is run in the loop is currently dbbench. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4169 Reviewed By: maysamyabandeh Differential Revision: D9023671 Pulled By: poojam23 fbshipit-source-id: a6192d475c462cf6eb2b316716f97cb400fcb64d	2018-07-26 17:13:32 -07:00
Siying Dong	4b0a43574a	db_stress to cover upper bound in iterators (#4162 ) Summary: db_stress doesn't cover upper or lower bound in iterators. Try to cover it by randomly assigning a random one. Also in prefix scan tests, with 50% of the chance, set next prefix as the upper bound. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4162 Differential Revision: D8953507 Pulled By: siying fbshipit-source-id: f0f04e9cb6c07cbebbb82b892ca23e0daeea708b	2018-07-23 10:45:29 -07:00
Zhichao Cao	6811fb0658	Fixed the db_bench MergeRandom only access CF_default (#4155 ) Summary: When running the tracing and analyzing, I found that MergeRandom benchmark in db_bench only access the default column family even the -num_column_families is specified > 1. changes: Using the db_with_cfh as DB to randomly select the column family to execute the Merge operation if -num_column_families is specified > 1. Tested with make asan_check and verified in tracing Pull Request resolved: https://github.com/facebook/rocksdb/pull/4155 Differential Revision: D8907888 Pulled By: zhichao-cao fbshipit-source-id: 2b4bc8fe0e99c8f262f5be6b986c7025d62cf850	2018-07-20 15:58:54 -07:00
Siying Dong	a5e851e113	Reformatting some recent changes (#4161 ) Summary: Lint is not happy with some new code recently committed. Format them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4161 Differential Revision: D8940582 Pulled By: siying fbshipit-source-id: c9b43b1ef8c88b5e923911058b44eb77234b36b7	2018-07-20 14:43:38 -07:00
Pooja Malik	1857576e03	db_bench support for OPTIONS+bloom and nicer output for perf_context (#4153 ) Summary: Adding the string "PERF_CONTEXT:" before the perf_context stats are printed. Setting the filter policy if it's a block based table even when options are being loaded from the provided FLAGS_options_file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4153 Differential Revision: D8905517 Pulled By: poojam23 fbshipit-source-id: 5956ed7882d39ec8ae654d5dadeb88727a36f0dd	2018-07-18 16:27:49 -07:00
Maysam Yabandeh	8581a93a6b	Per-thread unique test db names (#4135 ) Summary: The patch makes sure that two parallel test threads will operate on different db paths. This enables using open source tools such as gtest-parallel to run the tests of a file in parallel. Example: ``` ~/gtest-parallel/gtest-parallel ./table_test``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4135 Differential Revision: D8846653 Pulled By: maysamyabandeh fbshipit-source-id: 799bad1abb260e3d346bcb680d2ae207a852ba84	2018-07-13 17:27:39 -07:00
Zhongyi Xie	23b76252c8	db_bench: enable setting cache_size when loading options file Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4118 Differential Revision: D8845554 Pulled By: miasantreble fbshipit-source-id: 13bd3c1259a7c30bad762a413fe3bb24eea650ba	2018-07-13 16:43:53 -07:00
Zhongyi Xie	de98fd88e3	Support compaction filter in db_bench (#4106 ) Summary: Right now there is no support for enabling compaction filter in db_bench, we should add support for that to facilitate testing of compaction filter. This PR adds a compaction filter called KeepFilter and make `Filter` always returns false, essentially a noop compaction filter. This will allow us to test compaction filter code path without having to support arbitrary compaction filters Pull Request resolved: https://github.com/facebook/rocksdb/pull/4106 Differential Revision: D8828517 Pulled By: miasantreble fbshipit-source-id: 9ad76d04103eaa9d00da98334b4a39e542d26c41	2018-07-12 19:42:27 -07:00
Andrew Kryczka	97fe23fc5c	Fix unsigned int flag in db_bench (#4129 ) Summary: `DEFINE_uint32` was unavailable on some platforms, e.g., https://travis-ci.org/facebook/rocksdb/jobs/403352902. Use `DEFINE_uint64` instead which should work as it's used many times elsewhere in this file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4129 Differential Revision: D8830311 Pulled By: ajkr fbshipit-source-id: b4fc90ba3f50e649c070ce8069c68e530d731f05	2018-07-12 18:43:23 -07:00
Andrew Kryczka	63904434eb	db_bench periodically dump stats to info log (#4109 ) Summary: give control of how often stats are printed, including jemalloc stats if enabled. Previously the default was 10 minutes so we'd only see updated stats for very long benchmark runs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4109 Differential Revision: D8796444 Pulled By: ajkr fbshipit-source-id: fd7902fe3f105fae89322c4ab63316bba4a2b15e	2018-07-12 15:57:42 -07:00
Manuel Ung	b9846370e9	WriteUnPrepared: Add support for recovering WriteUnprepared transactions (#4078 ) Summary: This adds support for recovering WriteUnprepared transactions through the following changes: - The information in `RecoveredTransaction` is extended so that it can reference multiple batches. - `MarkBeginPrepare` is extended with a bool indicating whether it is an unprepared begin, and this is passed down to `InsertRecoveredTransaction` to indicate whether the current transaction is prepared or not. - `WriteUnpreparedTxnDB::Initialize` is overridden so that it will rollback unprepared transactions from the recovered transactions. This can be done without updating the prepare heap/commit map, because this is before the DB has finished initializing, and after writing the rollback batch, those data structures should not contain information about the rolled back transaction anyway. Commit/Rollback of live transactions is still unimplemented and will come later. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4078 Differential Revision: D8703382 Pulled By: lth fbshipit-source-id: 7e0aada6c23bd39299f1f20d6c060492e0e6b60a	2018-07-06 17:59:13 -07:00
Maysam Yabandeh	235ab9dd32	Pin mmap files in ReadOnlyDB (#4053 ) Summary: https://github.com/facebook/rocksdb/pull/3881 fixed a bug where PinnableSlice pin mmap files which could be deleted with background compaction. This is however a non-issue for ReadOnlyDB when there is no compaction running and max_open_files is -1. This patch reenables the pinning feature for that case. Closes https://github.com/facebook/rocksdb/pull/4053 Differential Revision: D8662546 Pulled By: maysamyabandeh fbshipit-source-id: 402962602eb0f644e17822748332999c3af029fd	2018-06-27 17:13:34 -07:00
Peter (Stig) Edwards	2694b6dc26	Remove unused imports, from python scripts. (#4057 ) Summary: Also remove redefined variable. As reported on https://lgtm.com/projects/g/facebook/rocksdb/ Closes https://github.com/facebook/rocksdb/pull/4057 Differential Revision: D8648342 Pulled By: ajkr fbshipit-source-id: afd2ba84d1364d316010179edd44777e64ca9183	2018-06-26 12:43:04 -07:00
Yanqin Jin	2729dd72ad	Reclaim memory allocated to backup_engine. Summary: Closes https://github.com/facebook/rocksdb/pull/4045 Differential Revision: D8595609 Pulled By: riversand963 fbshipit-source-id: 5ba5954d804b82b0e7264b2e18e1da4c94103b53	2018-06-23 17:12:14 -07:00
Maysam Yabandeh	80ade9ad83	Pin top-level index on partitioned index/filter blocks (#4037 ) Summary: Top-level index in partitioned index/filter blocks are small and could be pinned in memory. So far we use that by cache_index_and_filter_blocks to false. This however make it difficult to keep account of the total memory usage. This patch introduces pin_top_level_index_and_filter which in combination with cache_index_and_filter_blocks=true keeps the top-level index in cache and yet pinned them to avoid cache misses and also cache lookup overhead. Closes https://github.com/facebook/rocksdb/pull/4037 Differential Revision: D8596218 Pulled By: maysamyabandeh fbshipit-source-id: 3a5f7f9ca6b4b525b03ff6bd82354881ae974ad2	2018-06-22 15:27:46 -07:00
Yi Wu	c726f7fda8	Fix dangling checkpoint pointer in db_stress (#4042 ) Summary: Fix db_stress failed to delete checkpoint pointer. It's caught by asan_crash test. Closes https://github.com/facebook/rocksdb/pull/4042 Differential Revision: D8592604 Pulled By: yiwu-arbug fbshipit-source-id: 7b2d67d5e3dfb05f71c33fcf320482303e97d3ef	2018-06-22 11:43:50 -07:00
Andrew Kryczka	0a5b16c7c5	Cleanup staging directory at start of checkpoint (#4035 ) Summary: - Attempt to clean the checkpoint staging directory before starting a checkpoint. It was already cleaned up at the end of checkpoint. But it wasn't cleaned up in the edge case where the process crashed while staging checkpoint files. - Attempt to clean the checkpoint directory before calling `Checkpoint::Create` in `db_stress`. This handles the case where checkpoint directory was created by a previous `db_stress` run but the process crashed before cleaning it up. - Use `DestroyDB` for cleaning checkpoint directory since a checkpoint is a DB. Closes https://github.com/facebook/rocksdb/pull/4035 Reviewed By: yiwu-arbug Differential Revision: D8580223 Pulled By: ajkr fbshipit-source-id: 28c667400e249fad0fdedc664b349031b7b61599	2018-06-21 16:27:12 -07:00
Yanqin Jin	397495964b	Fix a warning (treated as error) caused by type mismatch. Summary: Closes https://github.com/facebook/rocksdb/pull/4032 Differential Revision: D8573061 Pulled By: riversand963 fbshipit-source-id: 112324dcb35956d6b3ec891073f4f21493933c8b	2018-06-21 11:13:09 -07:00
Yanqin Jin	524c6e6b72	Add file name info to SequentialFileReader. (#4026 ) Summary: We potentially need this information for tracing, profiling and diagnosis. Closes https://github.com/facebook/rocksdb/pull/4026 Differential Revision: D8555214 Pulled By: riversand963 fbshipit-source-id: 4263e06c00b6d5410b46aa46eb4e358ff2161dd2	2018-06-21 08:42:24 -07:00
Andrew Kryczka	14cee194d6	Support file ingestion in stress test (#4018 ) Summary: Once per `ingest_external_file_one_in` operations, uses SstFileWriter to create a file containing `ingest_external_file_width` consecutive keys. The file is named containing the thread ID to avoid clashes. The file is then added to the DB using `IngestExternalFile`. We can't enable it by default in crash test because `nooverwritepercent` and `test_batches_snapshot` both must be zero for the DB's whole lifetime. Perhaps we should setup a separate test with that config as range deletion also requires it. Closes https://github.com/facebook/rocksdb/pull/4018 Differential Revision: D8507698 Pulled By: ajkr fbshipit-source-id: 1437ea26fd989349a9ce8b94117241c65e40f10f	2018-06-20 22:27:45 -07:00
Andrew Kryczka	7f3a634e06	Support pipelined write in stress/crash tests Summary: Closes https://github.com/facebook/rocksdb/pull/4019 Differential Revision: D8508681 Pulled By: ajkr fbshipit-source-id: 23a3c07d642386446e322b02e69cdf70d12ef009	2018-06-19 09:14:12 -07:00
Andrew Kryczka	8585059ae0	Support backup and checkpoint in db_stress (#4005 ) Summary: Add the `backup_one_in` and `checkpoint_one_in` options to periodically trigger backups and checkpoints. The directory names contain thread ID to avoid clashing with parallel backups/checkpoints. Enable checkpoint in crash test so our CI runs will use it. Didn't enable backup in crash test since it copies all the files which is too slow. Closes https://github.com/facebook/rocksdb/pull/4005 Differential Revision: D8472275 Pulled By: ajkr fbshipit-source-id: ff91bdc37caac4ffd97aea8df96b3983313ac1d5	2018-06-18 19:28:18 -07:00
Andrew Kryczka	de2c6fb158	Fix stderr processing in crash test (#4006 ) Summary: Fixed bug where `db_stress` output a line with a warning followed by a line with an error, and `db_crashtest.py` considered that a success. For example: ``` WARNING: prefix_size is non-zero but memtablerep != prefix_hash open error: Corruption: SST file is ahead of WALs ``` Closes https://github.com/facebook/rocksdb/pull/4006 Differential Revision: D8473463 Pulled By: ajkr fbshipit-source-id: 60461bdd7491d9d26c63f7d4ee522a0f88ba3de7	2018-06-18 17:58:13 -07:00
Hans-Wilhelm Warlo	4faaab70a6	Benchmark sine wave write rate limit (#3914 ) Summary: As mentioned at the [dev forum.](https://www.facebook.com/groups/rocksdb.dev/1693425187422655/) Let me know if you would like me to do any changes! Closes https://github.com/facebook/rocksdb/pull/3914 Differential Revision: D8452824 Pulled By: siying fbshipit-source-id: 56439b3228ecdcc5a199d5198eff2fab553be961	2018-06-15 12:12:03 -07:00
Siying Dong	f5281a53a4	tools/check_format_compatible.sh to cover forward option reading too (#3994 ) Summary: Make sure that some recent releases can read master's option files while ignoring unknown options. Also add two more recent release branches. Closes https://github.com/facebook/rocksdb/pull/3994 Differential Revision: D8409499 Pulled By: siying fbshipit-source-id: 1b025f19ba288da0517f6b4572797573e23e23c2	2018-06-15 11:12:29 -07:00
Andrew Kryczka	7497f992e0	Run manual compaction in stress/crash tests (#3936 ) Summary: - Add support to `db_stress` for `CompactRange` - Enable `CompactRange` and `CompactFiles` in crash tests Closes https://github.com/facebook/rocksdb/pull/3936 Differential Revision: D8230953 Pulled By: ajkr fbshipit-source-id: 208f9980b5bc8c204b1fa726e83791ad674e21e8	2018-06-13 16:45:28 -07:00
Andrew Kryczka	dd216dd76a	Choose unique keys faster in db_stress (#3990 ) Summary: db_stress initialization randomly chooses a set of keys to not overwrite. It was doing it separately for each column family. That caused 30+ second initialization times for the non-simple crash tests, which have 10 CFs. This PR: - reuses the same set of randomly chosen no-overwrite keys across all CFs - logs a couple more timestamps so we can more easily see initialization time Closes https://github.com/facebook/rocksdb/pull/3990 Differential Revision: D8393821 Pulled By: ajkr fbshipit-source-id: d0b263a298df607285ffdd8b0983ff6575cc6c34	2018-06-13 13:43:23 -07:00
Yanqin Jin	3470c75852	Fix build errors. Summary: Closes https://github.com/facebook/rocksdb/pull/3967 Differential Revision: D8322775 Pulled By: riversand963 fbshipit-source-id: bd73067bd5d3ed4627348f0685bc499359ad6442	2018-06-07 15:43:09 -07:00
Zhichao Cao	23e1d23675	Fixed the fprintf of uint64_t by using PRIu64 (#3963 ) Summary: Fixed the fprintf format of uint64_t by using PRIu64 in file tools/ldb_cmd.cc Closes https://github.com/facebook/rocksdb/pull/3963 Differential Revision: D8306179 Pulled By: zhichao-cao fbshipit-source-id: 597dcd55321576801bbf2cf4714736ebc4750a0c	2018-06-07 11:44:48 -07:00
Yanqin Jin	0a0860a5fb	Refactoring db_stress.cc (#3902 ) Summary: We use `db_stress.cc` intensively to test and verify the behavior of RocksDB. Sometimes we need to add new tests for recently added features. Original `StressTest` class provides many general functionality that can be leveraged by other tests. Therefore, in this refactoring PR, I try to identify the general operations as well as operations that future tests most likely want to customize. Future tests can inherit `StressTest` and overriding the virtual functions to test custom logic. Closes https://github.com/facebook/rocksdb/pull/3902 Differential Revision: D8284607 Pulled By: riversand963 fbshipit-source-id: 019302d04665a2b18334b6d05d04a477168c8ea4	2018-06-07 10:43:00 -07:00
Pooja Malik	5504a056f8	Adding advisor Rules and parser scripts with unit tests. (#3934 ) Summary: This adds some rules in the tools/advisor/advisor/rules.ini (refer this for more information) file and corresponding python parser scripts for parsing the rules file and the rocksdb LOG and OPTIONS files. This is WIP for adding rules depending on ODS. The starting point of the script is the rocksdb/tools/advisor/advisor/rule_parser.py file. Closes https://github.com/facebook/rocksdb/pull/3934 Reviewed By: maysamyabandeh Differential Revision: D8304059 Pulled By: poojam23 fbshipit-source-id: 47f2a50f04d46d40e225dd1cbf58ba490f79e239	2018-06-06 14:42:59 -07:00
Zhongyi Xie	f1592a06c2	run make format for PR 3838 (#3954) Summary: PR https://github.com/facebook/rocksdb/pull/3838 made some changes that trigger lint warnings. Run `make format` to fix formatting as suggested by siying. Also piggyback two changes: 1) fix singleton destruction order for windows and posix env 2) fix two clang warnings Closes https://github.com/facebook/rocksdb/pull/3954 Differential Revision: D8272041 Pulled By: miasantreble fbshipit-source-id: 7c4fd12bd17aac13534520de0c733328aa3c6c9f	2018-06-05 12:58:02 -07:00
Maysam Yabandeh	d0c38c0c8c	Extend some tests to format_version=3 (#3942 ) Summary: format_version=3 changes the format of SST index. This is however not being tested currently since tests only work with the default format_version which is currently 2. The patch extends the most related tests to also test for format_version=3. Closes https://github.com/facebook/rocksdb/pull/3942 Differential Revision: D8238413 Pulled By: maysamyabandeh fbshipit-source-id: 915725f55753dd8e9188e802bf471c23645ad035	2018-06-04 20:13:00 -07:00
Dmitri Smirnov	f4b72d7056	Provide a way to override windows memory allocator with jemalloc for ZSTD Summary: Windows does not have an LD_PRELOAD mechanism to override all memory allocation functions, and ZSTD makes use of the C runtime calloc. During flushes and compactions the default system allocator fragments and the system slows down considerably. For builds with jemalloc we employ an advanced ZSTD context creation API that redirects memory allocation to jemalloc. To reduce the cost of context creation on each block we cache the ZSTD context within the block based table builder while a new SST file is being built; this will help all platform builds including those w/o jemalloc. This avoids system allocator fragmentation and improves the performance. The change does not address random reads, and currently on Windows reads with ZSTD regress as compared with SNAPPY compression. Closes https://github.com/facebook/rocksdb/pull/3838 Differential Revision: D8229794 Pulled By: miasantreble fbshipit-source-id: 719b622ab7bf4109819bc44f45ec66f0dd3ee80d	2018-06-04 12:12:48 -07:00
Andrew Kryczka	4f297ad05f	Fix crash test check for direct I/O Summary: We need to keep the DB directory around since the direct IO check in "db_crashtest.py" relies on it existing. This PR fixes an issue where it was removed after each stress test run during the second half of whitebox crash testing. Closes https://github.com/facebook/rocksdb/pull/3946 Differential Revision: D8247998 Pulled By: ajkr fbshipit-source-id: 4e7cffbdab9b40df125e7842d0d59916e76261d3	2018-06-03 21:42:12 -07:00
Andrew Kryczka	88c3ee2d31	Configure direct I/O statically in db_stress Summary: Previously `db_stress` attempted to configure direct I/O dynamically in `SetOptions()` which had multiple problems (ummm must've never been tested): - It's a DB option so SetDBOptions should've been called instead - It's not a dynamic option so even SetDBOptions would fail - It required enabling SyncPoint to mask O_DIRECT since it had no way to detect whether the DB directory was in tmpfs or not. This required locking that consumed ~80% of db_stress CPU. In this PR I delete the broken dynamic config and instead configure it statically, only enabling it if the DB directory truly supports O_DIRECT. Closes https://github.com/facebook/rocksdb/pull/3939 Differential Revision: D8238120 Pulled By: ajkr fbshipit-source-id: 60bb2deebe6c9b54a3f788079261715b4a229279	2018-06-01 16:42:34 -07:00
Jacquin Mininger	727eb881a5	Compile error in db bench tool Summary: Small format error below causes build to fail. I believe that this : ``` fprintf(stderr, "num reads to do %lu\n", reads_); ``` Can be changed to this: ``` fprintf(stderr, "num reads to do %" PRIu64 "\n", reads_); ``` Successful build ``` CC utilities/blob_db/blob_dump_tool.o AR librocksdb_debug.a ar: creating archive librocksdb_debug.a /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ranlib: file: librocksdb_debug.a(rocks_lua_compaction_filter.o) has no symbols CC tools/db_bench.o CC tools/db_bench_tool.o tools/db_bench_tool.cc:4532:46: error: format specifies type 'unsigned long' but the argument has type 'int64_t' (aka 'long long') [-Werror,-Wformat] fprintf(stderr, "num reads to do %lu\n", reads_); ~~~ ^~~~~~ %lld 1 error generated. make: *** [tools/db_bench_tool.o] Error 1 ``` ``` $ cd rocksdb $ make all $ g++ --version Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1 Apple LLVM version 9.1.0 (clang-902.0.39.1) Target: x86_64-apple-darwin17.5.0 Thread model: posix InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin ``` Closes https://github.com/facebook/rocksdb/pull/3909 Differential Revision: D8215710 Pulled By: siying fbshipit-source-id: 15e49fb02a818fec846e9f9b2a50e372b6b67751	2018-05-30 18:01:36 -07:00
Yi Wu	bc7e8d472e	LRUCache midpoint insertion Summary: Implement midpoint insertion strategy where new blocks are inserted into the middle of the LRU list, then moved to the head on the first hit in cache. Closes https://github.com/facebook/rocksdb/pull/3877 Differential Revision: D8100895 Pulled By: yiwu-arbug fbshipit-source-id: f4bd83cb8be469e5d02072cfc8bd66011391f3da	2018-05-24 15:57:33 -07:00
Dmitri Smirnov	3db8504cde	Catchup with posix features Summary: Catch up with Posix features: NewWritableRWFile must fail when the file does not exist; implement Env::Truncate(); adjust Env options optimization functions; implement MemoryMappedBuffer on Windows. Closes https://github.com/facebook/rocksdb/pull/3857 Differential Revision: D8053610 Pulled By: ajkr fbshipit-source-id: ccd0d46c29648a9f6f496873bc1c9d6c5547487e	2018-05-24 15:13:04 -07:00
Andrew Kryczka	fcb31016e9	Avoid single-deleting merge operands in db_stress Summary: I repro'd some of the "unexpected value" failures showing up in our CI lately and they always happened on keys that have a mix of single deletes and merge operands. The `SingleDelete()` API comment mentions it's incompatible with `Merge()`, so this PR prevents `db_stress` from mixing them. Closes https://github.com/facebook/rocksdb/pull/3878 Differential Revision: D8097346 Pulled By: ajkr fbshipit-source-id: 357a48c6a31156f4f8db3ce565638ad924c437a1	2018-05-22 10:58:36 -07:00
Zhongyi Xie	c3ebc75843	Move prefix_extractor to MutableCFOptions Summary: Currently it is not possible to change the bloom filter config without restarting the DB, which is causing a lot of operational complexity for users. This PR aims to make it possible to dynamically change the bloom filter config. Closes https://github.com/facebook/rocksdb/pull/3601 Differential Revision: D7253114 Pulled By: miasantreble fbshipit-source-id: f22595437d3e0b86c95918c484502de2ceca120c	2018-05-21 14:43:11 -07:00
Yanqin Jin	a0c7b4d526	Set the default value of max_manifest_file_size. Summary: In the past, the default value of max_manifest_file_size is uint64_t::MAX, allowing a long running RocksDB process to grow its MANIFEST file to take up the entire disk, as reported in [issue 3851](https://github.com/facebook/rocksdb/issues/3851). It is reasonable and common to provide a default non-max value for this option. Therefore, I set the value to 1GB. siying miasantreble Please let me know whether this looks good to you. Thanks! Closes https://github.com/facebook/rocksdb/pull/3867 Differential Revision: D8051524 Pulled By: riversand963 fbshipit-source-id: 50251f0804b1fa933a19a30d19d261ea8b9d2b72	2018-05-18 08:11:55 -07:00
Sagar Vemuri	ebb823f746	Fix db_stress build on mac Summary: I noticed, while debugging an unrelated issue, that db_stress is failing to build on mac, leading to a failed `make all`. ``` $ make db_stress -j4 ... tools/db_stress.cc:862:69: error: cannot initialize a parameter of type 'uint64_t ' (aka 'unsigned long long ') with an rvalue of type 'size_t ' (aka 'unsigned long ') status = FLAGS_env->GetFileSize(FLAGS_expected_values_path, &size); ^~~~~ ./include/rocksdb/env.h:277:66: note: passing argument to parameter 'file_size' here virtual Status GetFileSize(const std::string& fname, uint64_t* file_size) = 0; ^ 1 error generated. make: * [tools/db_stress.o] Error 1 make: * Waiting for unfinished jobs.... ``` Closes https://github.com/facebook/rocksdb/pull/3839 Differential Revision: D7979236 Pulled By: sagar0 fbshipit-source-id: 0615e7bb5405bade71e4203803bf723720422d62	2018-05-14 11:14:07 -07:00
Andrew Kryczka	072ae671a7	Apply use_direct_io_for_flush_and_compaction to writes only Summary: Previously `DBOptions::use_direct_io_for_flush_and_compaction=true` combined with `DBOptions::use_direct_reads=false` could cause RocksDB to simultaneously read from two file descriptors for the same file, where background reads used direct I/O and foreground reads used buffered I/O. Our measurements found this mixed-mode I/O negatively impacted foreground read perf, compared to when only buffered I/O was used. This PR makes the mixed-mode I/O situation impossible by repurposing `DBOptions::use_direct_io_for_flush_and_compaction` to only apply to background writes, and `DBOptions::use_direct_reads` to apply to all reads. There is no risk of direct background writes happening simultaneously with buffered reads since we never read from and write to the same file simultaneously. Closes https://github.com/facebook/rocksdb/pull/3829 Differential Revision: D7915443 Pulled By: ajkr fbshipit-source-id: 78bcbf276449b7e7766ab6b0db246f789fb1b279	2018-05-09 19:42:58 -07:00
Andrew Kryczka	d19f568abf	Refactor argument handling in db_crashtest.py Summary: - Any options unknown to `db_crashtest.py` are now passed directly to `db_stress`. This way, we won't need to update `db_crashtest.py` every time `db_stress` gets a new option. - Remove `db_crashtest.py` redundant arguments where the value is the same as `db_stress`'s default - Remove `db_crashtest.py` redundant arguments where the value is the same in a previously applied options map. For example, default_params are always applied before whitebox_default_params, so if they require the same value for an argument, that value only needs to be provided in default_params. - Made the simple option maps applied in addition to the regular option maps. Previously they were exclusive which led to lots of duplication Closes https://github.com/facebook/rocksdb/pull/3809 Differential Revision: D7885779 Pulled By: ajkr fbshipit-source-id: 3a3243b55724d6d5bff36e939b582b9b62c538a8	2018-05-09 13:42:41 -07:00
Andrew Kryczka	4c5a3232e4	Fix db_stress memory leak ASAN error Summary: In case `--expected_values_path` is unset, we allocate a buffer internally to hold the expected DB state. This PR makes sure it is freed. Closes https://github.com/facebook/rocksdb/pull/3804 Differential Revision: D7874694 Pulled By: ajkr fbshipit-source-id: a8f7655e009507c4e639ceebfc3525d69c856e3b	2018-05-04 16:45:15 -07:00
Zhongyi Xie	a703432808	MaxFileSizeForLevel: adjust max_file_size for dynamic level compaction Summary: `MutableCFOptions::RefreshDerivedOptions` always assume base level is L1, which is not true when `level_compaction_dynamic_level_bytes=true` and Level based compaction is used. This PR fixes this by recomputing `max_file_size` at query time (in `MaxFileSizeForLevel`) Fixes https://github.com/facebook/rocksdb/issues/3229 In master: ``` Level Files Size(MB) -------------------- 0 14 846 1 0 0 2 0 0 3 0 0 4 0 0 5 15 366 6 11 481 Cumulative compaction: 3.83 GB write, 2.27 GB read ``` In branch: ``` Level Files Size(MB) -------------------- 0 9 544 1 0 0 2 0 0 3 0 0 4 0 0 5 0 0 6 445 935 Cumulative compaction: 2.91 GB write, 1.46 GB read ``` db_bench command used: ``` ./db_bench --benchmarks="fillrandom,deleterandom,fillrandom,levelstats,stats" --statistics -deletes=5000 -db=tmp -compression_type=none --num=20000 -value_size=100000 -level_compaction_dynamic_level_bytes=true -target_file_size_base=2097152 -target_file_size_multiplier=2 ``` Closes https://github.com/facebook/rocksdb/pull/3755 Differential Revision: D7721381 Pulled By: miasantreble fbshipit-source-id: 39afb8503190bac3b466adf9bbf2a9b3655789f8	2018-05-03 16:42:13 -07:00
Dmitri Smirnov	acb61b7a52	Adjust pread/pwrite to return Status Summary: Returning bytes_read causes the caller to call GetLastError() to report failure but the lasterror may be overwritten by then so we lose the error code. Fix up CMake file to include xpress source code only when needed. Fix warning for the uninitialized var. Closes https://github.com/facebook/rocksdb/pull/3795 Differential Revision: D7832935 Pulled By: anand1976 fbshipit-source-id: 4be21affb9b85d361b96244f4ef459f492b7cb2b	2018-05-01 13:42:46 -07:00
Andrew Kryczka	46152d53bf	Second attempt at db_stress crash-recovery verification Summary: - Original commit: `a4fb1f8c04` - Revert commit (we reverted as a quick fix to get crash tests passing): `6afe22db2e` This PR includes the contents of the original commit plus two bug fixes, which are: - In whitebox crash test, only set `--expected_values_path` for `db_stress` runs in the first half of the crash test's duration. In the second half, a fresh DB is created for each `db_stress` run, so we cannot maintain expected state across `db_stress` runs. - Made `Exists()` return true for `UNKNOWN_SENTINEL` values. I previously had an assert in `Exists()` that value was not `UNKNOWN_SENTINEL`. But it is possible for post-crash-recovery expected values to be `UNKNOWN_SENTINEL` (i.e., if the crash happens in the middle of an update), in which case this assertion would be tripped. The effect of returning true in this case is there may be cases where a `SingleDelete` deletes no data. But if we had returned false, the effect would be calling `SingleDelete` on a key with multiple older versions, which is not supported. Closes https://github.com/facebook/rocksdb/pull/3793 Differential Revision: D7811671 Pulled By: ajkr fbshipit-source-id: 67e0295bfb1695ff9674837f2e05bb29c50efc30	2018-04-30 12:27:34 -07:00
Andrew Kryczka	6afe22db2e	revert db_stress crash-recovery verification Summary: crash-recovery verification is failing in the whitebox testing, which may or may not be a valid correctness issue -- need more time to investigate. In the meantime, reverting so we don't mask other failures. Closes https://github.com/facebook/rocksdb/pull/3786 Differential Revision: D7794516 Pulled By: ajkr fbshipit-source-id: 28ccdfdb9ec9b3b0fb08c15cbf9d2e282201ff33	2018-04-27 12:57:01 -07:00
Zhongyi Xie	459bb9028f	remove prefixscanrandom from db_bench help Summary: fix issue reported in https://github.com/facebook/rocksdb/issues/3757 Closes https://github.com/facebook/rocksdb/pull/3784 Differential Revision: D7794107 Pulled By: miasantreble fbshipit-source-id: 43535074fcb82adb5656bcb916284b2dfc5cbb64	2018-04-27 12:13:19 -07:00
Andrew Kryczka	db36f222d8	Allow options file in db_stress and db_crashtest Summary: - When options file is provided to db_stress, take supported options from the file instead of from flags - Call `BuildOptionsTable` after `Open` so it can use `options_` once it has been populated either from flags or from file - Allow options filename to be passed via `db_crashtest.py` Closes https://github.com/facebook/rocksdb/pull/3768 Differential Revision: D7755331 Pulled By: ajkr fbshipit-source-id: 5205cc5deb0d74d677b9832174153812bab9a60a	2018-04-26 18:42:07 -07:00
Andrew Kryczka	a4fb1f8c04	Add crash-recovery correctness check to db_stress Summary: Previously, our `db_stress` tool held the expected state of the DB in-memory, so after crash-recovery, there was no way to verify data correctness. This PR adds an option, `--expected_values_file`, which specifies a file holding the expected values. In black-box testing, the `db_stress` process can be killed arbitrarily, so updates to the `--expected_values_file` must be atomic. We achieve this by `mmap`ing the file and relying on `std::atomic<uint32_t>` for atomicity. Actually this doesn't provide a total guarantee on what we want as `std::atomic<uint32_t>` could, in theory, be translated into multiple stores surrounded by a mutex. We can verify our assumption by looking at `std::atomic::is_always_lock_free`. For the `mmap`'d file, we didn't have an existing way to expose its contents as a raw memory buffer. This PR adds it in the `Env::NewMemoryMappedFileBuffer` function, and `MemoryMappedFileBuffer` class. `db_crashtest.py` is updated to use an expected values file for black-box testing. On the first iteration (when the DB is created), an empty file is provided as `db_stress` will populate it when it runs. On subsequent iterations, that same filename is provided so `db_stress` can check the data is as expected on startup. Closes https://github.com/facebook/rocksdb/pull/3629 Differential Revision: D7463144 Pulled By: ajkr fbshipit-source-id: c8f3e82c93e045a90055e2468316be155633bd8b	2018-04-24 15:58:22 -07:00
Gabriel Wicke	090c78a0d7	Support lowering CPU priority of background threads Summary: Background activities like compaction can negatively affect latency of higher-priority tasks like request processing. To avoid this, rocksdb already lowers the IO priority of background threads on Linux systems. While this takes care of typical IO-bound systems, it does not help much when CPU (temporarily) becomes the bottleneck. This is especially likely when using more expensive compression settings. This patch adds an API to allow for lowering the CPU priority of background threads, modeled on the IO priority API. Benchmarks (see below) show significant latency and throughput improvements when CPU bound. As a result, workloads with some CPU usage bursts should benefit from lower latencies at a given utilization, or should be able to push utilization higher at a given request latency target. A useful side effect is that compaction CPU usage is now easily visible in common tools, allowing for an easier estimation of the contribution of compaction vs. request processing threads. As with IO priority, the implementation is limited to Linux, degrading to a no-op on other systems. Closes https://github.com/facebook/rocksdb/pull/3763 Differential Revision: D7740096 Pulled By: gwicke fbshipit-source-id: e5d32373e8dc403a7b0c2227023f9ce4f22b413c	2018-04-24 08:41:51 -07:00
Zhongyi Xie	8a9c7f71c9	fix compilation error: implicit conversion loses integer precision Summary: Fix compilation error with clang: > tools/db_stress.cc:2598:21: error: implicit conversion loses integer precision: 'gflags::uint64' (aka 'unsigned long') to 'uint32_t' (aka 'unsigned int') [-Werror,-Wshorten-64-to-32] Random rand(FLAGS_seed); ~~~~ ^~~~~~~~~~ Closes https://github.com/facebook/rocksdb/pull/3746 Differential Revision: D7703209 Pulled By: miasantreble fbshipit-source-id: 18c56a5138a2f308e4213594bc82e8e64bc21570	2018-04-19 18:57:43 -07:00
Maysam Yabandeh	6d06be22c0	Improve db_stress with transactions Summary: db_stress was already capable of running transactions by setting use_txn. Running it under stress showed a couple of problems fixed in this patch. - The uncommitted transaction must be either rolled back or committed after recovery. - The current implementation of WritePrepared transactions cannot handle a cf drop before a crash. Clarified that in the comments and added safety checks. When running with use_txn, clear_column_family_one_in must be set to 0. Closes https://github.com/facebook/rocksdb/pull/3733 Differential Revision: D7654419 Pulled By: maysamyabandeh fbshipit-source-id: a024bad80a9dc99677398c00d29ff17d4436b7f3	2018-04-18 16:32:35 -07:00
Yanqin Jin	2ee1496c43	Add missing whitespace. Summary: Closes https://github.com/facebook/rocksdb/pull/3729 Differential Revision: D7645465 Pulled By: riversand963 fbshipit-source-id: a64da0960fe6c39847ef848b8888fe9a9c1df25d	2018-04-17 09:57:40 -07:00
Yi Wu	2c2f388897	db_bench fillXXXdeterministic should respect compression type Summary: db_bench fillXXXdeterministic should respect compression type when calling CompactFiles(). Closes https://github.com/facebook/rocksdb/pull/3731 Differential Revision: D7647761 Pulled By: yiwu-arbug fbshipit-source-id: 15e12429e0dd93ece2231b015f2e26c2d94781e6	2018-04-16 18:01:47 -07:00
Zhongyi Xie	af95aecd01	use delete[] to dealloc an array Summary: fix a bug in `db_stress` where an int array was incorrectly deallocated using delete instead of delete[] Closes https://github.com/facebook/rocksdb/pull/3725 Differential Revision: D7634749 Pulled By: miasantreble fbshipit-source-id: 489b776f5f4c03de1824edac5495787ec19cc910	2018-04-15 23:56:39 -07:00
Zhongyi Xie	954b496b3f	fix memory leak in two_level_iterator Summary: this PR fixes a few failed contbuild: 1. ASAN memory leak in Block::NewIterator (table/block.cc:429). the proper destruction of first_level_iter_ and second_level_iter_ of two_level_iterator.cc is missing from the code after the refactoring in https://github.com/facebook/rocksdb/pull/3406 2. various unused param errors introduced by https://github.com/facebook/rocksdb/pull/3662 3. updated comment for `ForceReleaseCachedEntry` to emphasize the use of `force_erase` flag. Closes https://github.com/facebook/rocksdb/pull/3718 Reviewed By: maysamyabandeh Differential Revision: D7621192 Pulled By: miasantreble fbshipit-source-id: 476c94264083a0730ded957c29de7807e4f5b146	2018-04-15 17:26:26 -07:00
Amy Tai	28087acd79	Implemented Knuth shuffle to construct permutation for selecting no_o… Summary: …verwrite_keys. Also changed each no_overwrite_key set to an unordered set, otherwise Knuth shuffle only gets you 2x time improvement, because insertion (and subsequent internal sorting) into an ordered set is the bottleneck. With this change, each iteration of permutation construction and prefix selection takes around 40 secs, as opposed to 360 secs previously. However, this still means that with the default 10 CF per blackbox test case, the test is going to time out given the default interval of 200 secs. Also, there is currently an assertion error affecting all blackbox tests in db_crashtest.py; this assertion error will be fixed in a future PR. Closes https://github.com/facebook/rocksdb/pull/3699 Differential Revision: D7624616 Pulled By: amytai fbshipit-source-id: ea64fbe83407ff96c1c0ecabbc6c830576939393	2018-04-13 22:13:13 -07:00
David Lai	3be9b36453	comment unused parameters to turn on -Wunused-parameter flag Summary: This PR comments out the rest of the unused arguments which allow us to turn on the -Wunused-parameter flag. This is the second part of a codemod relating to https://github.com/facebook/rocksdb/pull/3557. Closes https://github.com/facebook/rocksdb/pull/3662 Differential Revision: D7426121 Pulled By: Dayvedde fbshipit-source-id: 223994923b42bd4953eb016a0129e47560f7e352	2018-04-12 17:59:16 -07:00
Maysam Yabandeh	eb5a295440	WritePrepared Txn: add write_committed option to dump_wal Summary: Currently dump_wal cannot print the prepared records from the WAL that is generated by WRITE_PREPARED write policy since the default reaction of the handler is to return NotSupported if markers of WRITE_PREPARED are encountered. This patch enables the admin to pass --write_committed=false option, which will be accordingly passed to the handler. Note that DBFileDumperCommand and DBDumperCommand are still not updated by this patch but firstly they are not urgent and secondly we need to revise this approach later when we also add WRITE_UNPREPARED markers so I leave it for future work. Tested by running it on a WAL generated by WRITE_PREPARED: $ ./ldb dump_wal --walfile=/dev/shm/dbbench/000003.log | grep BEGIN_PREARE | head -1 1,2,70,0,BEGIN_PREARE $ ./ldb dump_wal --walfile=/dev/shm/dbbench/000003.log --write_committed=false | grep BEGIN_PREARE | head -1 1,2,70,0,BEGIN_PREARE PUT(0) : 0x30303031313330313938 PUT(0) : 0x30303032353732313935 END_PREPARE(0x74786E31313535383434323738303738363938313335312D30) Closes https://github.com/facebook/rocksdb/pull/3682 Differential Revision: D7522090 Pulled By: maysamyabandeh fbshipit-source-id: a0332207261c61e18b2f9dfbe9feecd9a1339aca	2018-04-07 21:56:42 -07:00
Phani Shekhar Mantripragada	446b32cfc3	Support for Column family specific paths. Summary: In this change, an option to set different paths for different column families is added. This option is set via cf_paths setting of ColumnFamilyOptions. This option will work in a similar fashion to db_paths setting. Cf_paths is a vector of Dbpath values which contains a pair of the absolute path and target size. Multiple levels in a Column family can go to different paths if cf_paths has more than one path. To maintain backward compatibility, if cf_paths is not specified for a column family, db_paths setting will be used. Note that, if db_paths setting is also not specified, RocksDB already has code to use db_name as the only path. Changes : 1) A new member "cf_paths" is added to ImmutableCfOptions. This is set, based on cf_paths setting of ColumnFamilyOptions and db_paths setting of ImmutableDbOptions. This member is used to identify the path information whenever files are accessed. 2) Validation checks are added for cf_paths setting based on existing checks for db_paths setting. 3) DestroyDB, PurgeObsoleteFiles etc. are edited to support multiple cf_paths. 4) Unit tests are added appropriately. Closes https://github.com/facebook/rocksdb/pull/3102 Differential Revision: D6951697 Pulled By: ajkr fbshipit-source-id: 60d2262862b0a8fd6605b09ccb0da32bb331787d	2018-04-05 19:58:20 -07:00
Yi Wu	36a9f22931	Blob DB: blob_dump to show uncompressed values Summary: Make blob_dump tool able to show uncompressed values if the blob file is compressed. Also show total compressed vs. raw size at the end if --show_summary is provided. Closes https://github.com/facebook/rocksdb/pull/3633 Differential Revision: D7348926 Pulled By: yiwu-arbug fbshipit-source-id: ca709cb4ed5cf6a550ff2987df8033df81516f8e	2018-04-05 11:12:16 -07:00
Andrew Kryczka	b058a33705	Reduce default --nooverwritepercent in black-box crash tests Summary: Previously `python tools/db_crashtest.py blackbox` would do no useful work as the crash interval (two minutes) was shorter than the preparation phase. The preparation phase is slow because of the ridiculously inefficient way it computes which keys should not be overwritten. It was doing this for 60M keys since default values were `FLAGS_nooverwritepercent == 60` and `FLAGS_max_key == 100000000`. Move the "nooverwritepercent" override from whitebox-specific to the general options so it also applies to blackbox test runs. Now preparation phase takes a few seconds. Closes https://github.com/facebook/rocksdb/pull/3671 Differential Revision: D7457732 Pulled By: ajkr fbshipit-source-id: 601f4461a6a7e49e50449dcf15aebc9b8a98d6f0	2018-04-03 15:28:40 -07:00
Anand Ananthabhotla	f9f4d40f93	Align SST file data blocks to avoid spanning multiple pages Summary: Provide a block_align option in BlockBasedTableOptions to allow alignment of SST file data blocks. This will avoid higher IOPS/throughput load due to < 4KB data blocks spanning two 4KB pages. When this option is set to true, the block alignment is set to the lower of block size and 4KB. Closes https://github.com/facebook/rocksdb/pull/3502 Differential Revision: D7400897 Pulled By: anand1976 fbshipit-source-id: 04cc3bd144e88e3431a4f97604e63ad7a0f06d44	2018-03-26 20:26:10 -07:00
Sagar Vemuri	a993c0139d	Add 5.11 and 5.12 to tools/check_format_compatible.sh Summary: Closes https://github.com/facebook/rocksdb/pull/3646 Differential Revision: D7384727 Pulled By: sagar0 fbshipit-source-id: f713af7adb2ffea5303bbf0fac8a8a1630af7b38	2018-03-23 12:43:06 -07:00
Siying Dong	6383e42362	benchmark.sh to use --max_background_job Summary: Closes https://github.com/facebook/rocksdb/pull/3632 Differential Revision: D7347012 Pulled By: siying fbshipit-source-id: 46230ec4a917ccf4c478825b07e92b4665a4820b	2018-03-20 18:57:55 -07:00
Bruce Mitchener	a3a3f5497c	Fix some typos in comments and docs. Summary: Closes https://github.com/facebook/rocksdb/pull/3568 Differential Revision: D7170953 Pulled By: siying fbshipit-source-id: 9cfb8dd88b7266da920c0e0c1e10fb2c5af0641c	2018-03-08 10:27:25 -08:00
Yi Wu	b864bc9b5b	Blob DB: Improve FIFO eviction Summary: Improving blob db FIFO eviction with the following changes: * Change blob_dir_size to max_db_size. Take into account SST file size when computing DB size. * FIFO now only takes into account live sst files and live blob files. It is normal for disk usage to go over max_db_size because there are obsolete sst files and blob files pending deletion. * FIFO eviction now also evicts TTL blob files that are still open. It doesn't evict non-TTL blob files. * If FIFO is triggered, it will pass an expiration and the current sequence number to compaction filter. Compaction filter will then filter inlined keys to evict those with an earlier expiration and smaller sequence number. So call LSM FIFO. * Compaction filter also filters those blob indexes where the corresponding blob file is gone. * Add an event listener to listen for compaction/flush events and update sst file size. * Implement DB::Close() to make sure base db, as well as event listener and compaction filter, destruct before blob db. * More blob db statistics around FIFO. * Fix some locking issues when accessing a blob file. Closes https://github.com/facebook/rocksdb/pull/3556 Differential Revision: D7139328 Pulled By: yiwu-arbug fbshipit-source-id: ea5edb07b33dfceacb2682f4789bea61de28bbfa	2018-03-06 11:57:42 -08:00
Pooya Shareghi	0a2354ca8f	Added bytes XOR merge operator Summary: Closes https://github.com/facebook/rocksdb/pull/575 I fixed the merge conflicts etc. Closes https://github.com/facebook/rocksdb/pull/3065 Differential Revision: D7128233 Pulled By: sagar0 fbshipit-source-id: 2c23a48c9f0432c290b0cd16a12fb691bb37820c	2018-03-06 10:27:36 -08:00
Andrew Kryczka	5d68243e61	Comment out unused variables Summary: Submitting on behalf of another employee. Closes https://github.com/facebook/rocksdb/pull/3557 Differential Revision: D7146025 Pulled By: ajkr fbshipit-source-id: 495ca5db5beec3789e671e26f78170957704e77e	2018-03-05 13:13:41 -08:00
Maysam Yabandeh	d060421c77	Fix a leak in prepared_section_completed_ Summary: The zeroed entries were not removed from prepared_section_completed_ map. This patch adds a unit test to show the problem and fixes that by refactoring the code. The new code is more efficient since i) it uses two separate mutexes to avoid contention between commit and prepare threads, ii) it uses a sorted vector for maintaining unique log entries with prepare, which avoids a very large heap with many duplicate entries. Closes https://github.com/facebook/rocksdb/pull/3545 Differential Revision: D7106071 Pulled By: maysamyabandeh fbshipit-source-id: b3ae17cb6cd37ef10b6b35e0086c15c758768a48	2018-03-01 20:41:56 -08:00
Igor Sugak	aba3409740	Back out "[codemod] - comment out unused parameters" Reviewed By: igorsugak fbshipit-source-id: 4a93675cc1931089ddd574cacdb15d228b1e5f37	2018-02-22 12:43:17 -08:00
David Lai	f4a030ce81	- comment out unused parameters Reviewed By: everiq, igorsugak Differential Revision: D7046710 fbshipit-source-id: 8e10b1f1e2aecebbfb229c742e214db887e5a461	2018-02-22 09:44:23 -08:00
Andrew Kryczka	1960e73e21	fix handling of empty string as checkpoint directory Summary: - made `CreateCheckpoint` properly return `InvalidArgument` when called with an empty directory. Previously it triggered an assertion failure due to a bug in the logic. - made `ldb` set empty `checkpoint_dir` if that's what the user specifies, so that we can use it to properly test `CreateCheckpoint` in the future. Differential Revision: D6874562 fbshipit-source-id: dcc1bd41768261d9338987fa7711444289707ed7	2018-02-20 16:44:00 -08:00
Yi Wu	989d12313c	Legocastle job to report lite build binary size to scuba Summary: Add a legocastle job to continuously build the last 10 commits every 4 hours and report lite build binary size to scuba. Closes https://github.com/facebook/rocksdb/pull/3511 Differential Revision: D7001730 Pulled By: yiwu-arbug fbshipit-source-id: 7c8ca87c46d663c786a0d32be69ebbe7b19a5eb9	2018-02-15 17:27:24 -08:00
Andrew Kryczka	0a0fad447b	db_bench separate options for partition index and filters Summary: Some workloads (like my current benchmarking) may want partitioned indexes without partitioned filters. Particularly, when `-optimize_filters_for_hits=true`, the total index size may be larger than the total filter size, so it can make sense to hold all filters in-memory but not all indexes. Closes https://github.com/facebook/rocksdb/pull/3492 Differential Revision: D6970092 Pulled By: ajkr fbshipit-source-id: b7fa1828e1d13829339aefb90fd56eb7c5337f61	2018-02-12 14:57:13 -08:00
Chinmay Kamat	9fc72d6f16	Compilation fixes for powerpc build, -Wparentheses-equality error and missing header guards Summary: This pull request contains miscellaneous compilation fixes. Thanks, Chinmay Closes https://github.com/facebook/rocksdb/pull/3462 Differential Revision: D6941424 Pulled By: sagar0 fbshipit-source-id: fe9c26507bf131221f2466740204bff40a15614a	2018-02-09 14:12:43 -08:00
Tamir Duberstein	cd5092e168	Suppress unused warnings Summary: - Use `__unused__` everywhere - Suppress unused warnings in Release mode + This currently affects non-MSVC builds (e.g. mingw64). Closes https://github.com/facebook/rocksdb/pull/3448 Differential Revision: D6885496 Pulled By: miasantreble fbshipit-source-id: f2f6adacec940cc3851a9eee328fafbf61aad211	2018-02-02 12:27:07 -08:00
Siying Dong	e2d4b0efb1	db_bench: sanity check CuckooTable with mmap_read option Summary: This is to avoid run time error. Fail the db_bench immediately if cuckoo table is used but mmap_read is not specified. Closes https://github.com/facebook/rocksdb/pull/3420 Differential Revision: D6838284 Pulled By: siying fbshipit-source-id: 20893fa28d40fadc31e4ff154bed02f5a1bad341	2018-01-29 14:27:32 -08:00

1040 Commits
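
Note on the `PRIu64` fixes logged above (23e1d23675 and 727eb881a5): both replace a hard-coded `%lu` with the `<cinttypes>` format macro, because `unsigned long` is not 64 bits wide on every platform. The following is a minimal sketch of that pattern, not code from RocksDB; the helper name `FormatReads` is hypothetical, and the message text is borrowed from the commit description only for illustration.

```cpp
#include <cinttypes>  // PRIu64
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper (not part of RocksDB): formats a 64-bit counter
// without assuming how wide `unsigned long` is on the target platform.
std::string FormatReads(uint64_t reads) {
  char buf[64];
  // PRIu64 expands to the correct conversion specifier for uint64_t,
  // so the same source line compiles cleanly under -Wformat everywhere.
  std::snprintf(buf, sizeof(buf), "num reads to do %" PRIu64 "\n", reads);
  return std::string(buf);
}
```

For a signed `int64_t` value, the matching macro from the same header is `PRId64`.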