rocksdb

Author	SHA1	Message	Date
Akanksha Mahajan	e8a7001159	Update branch as "main" in tools/advisor/README.md (#8744 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8744 Reviewed By: ltamasi Differential Revision: D30716145 Pulled By: akankshamahajan15 fbshipit-source-id: c2fcaf9ddcae85a86c0f10496acab28cd795ff12	2021-09-01 20:26:28 -07:00
Peter Dillinger	c9cd5d25a8	Remove some unneeded code (#8736 ) Summary: * FullKey and ParseFullKey appear to serve no purpose in the public API (or anything else) so removed. Only use in one test updated. * NumberToString serves no purpose vs. ToString so removed, numerous calls updated * Remove unnecessary forward declarations in metadata.h by re-arranging class definitions. * Remove some unneeded semicolons Pull Request resolved: https://github.com/facebook/rocksdb/pull/8736 Test Plan: existing tests Reviewed By: mrambacher Differential Revision: D30700039 Pulled By: pdillinger fbshipit-source-id: 1e436a576f511a6ed8b4d97af7cc8216bc729af2	2021-09-01 14:28:58 -07:00
Levi Tamasi	22ecd7edc1	Add 6.24 to the format compatibility checker (#8716 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8716 Reviewed By: jay-zhuang Differential Revision: D30587722 Pulled By: ltamasi fbshipit-source-id: d2f5b08084778779c5a8b85635977babd8194d5c	2021-08-26 16:35:58 -07:00
Yanqin Jin	d8eb824325	Temporarily disable block-based filter when stress testing timestamp (#8703 ) Summary: Current implementation does not support user-defined timestamp when block-based filter is used. Will implement the support in the future, or wait to see if block-based filter can be deprecated and removed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8703 Test Plan: make whitebox_crash_test_with_ts Reviewed By: pdillinger Differential Revision: D30528931 Pulled By: riversand963 fbshipit-source-id: 60dd74ee0a6194e69072069d8c4bd876f249f38d	2021-08-24 19:04:58 -07:00
Merlin Mao	785faf2d07	Simplify `TraceAnalyzer` (#8697 ) Summary: Handler functions now use a common output function to output to stdout/files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8697 Test Plan: `trace_analyzer_test` can pass. Reviewed By: zhichao-cao Differential Revision: D30527696 Pulled By: autopear fbshipit-source-id: c626cf4d53a39665a9c4bcf0cb019c448434abe4	2021-08-24 18:18:36 -07:00
Merlin Mao	f6437ea4d7	Refactor TraceAnalyzer to use `TraceRecord::Handler` to avoid casting. (#8678 ) Summary: `TraceAnalyzer` privately inherits `TraceRecord::Handler` and `WriteBatch::Handler`. `trace_analyzer_test` can pass with this change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8678 Reviewed By: zhichao-cao Differential Revision: D30459814 Pulled By: autopear fbshipit-source-id: a27f59ac4600f7c3682830c9b1d9dc79e53425be	2021-08-23 17:18:27 -07:00
Peter Dillinger	2a383f21f4	Add Bloom/Ribbon hybrid API support (#8679 ) Summary: This is essentially resurrection and fixing of the part of https://github.com/facebook/rocksdb/issues/8198 that was reverted in https://github.com/facebook/rocksdb/issues/8212, using data added in https://github.com/facebook/rocksdb/issues/8246. Basically, when configuring Ribbon filter, you can specify an LSM level before which Bloom will be used instead of Ribbon. But Bloom is only considered for Leveled and Universal compaction styles and file going into a known LSM level. This way, SST file writer, FIFO compaction, etc. use Ribbon filter as you would expect with NewRibbonFilterPolicy. So that this can be controlled with a single int value and so that flushes can be distinguished from intra-L0, we consider flush to go to level -1 for the purposes of this option. (Explained in API comment.) I also expect the most common and recommended Ribbon configuration to use Bloom during flush, to minimize slowing down writes and because according to my estimates, Ribbon only pays off if the structure lives in memory for more than an hour. Thus, I have changed the default for NewRibbonFilterPolicy to be this mild hybrid configuration. I don't really want to add something like NewHybridFilterPolicy because at least the mild hybrid configuration (Bloom for flush, Ribbon otherwise) should be considered a natural choice. C APIs also updated, but because they don't support overloading, rocksdb_filterpolicy_create_ribbon is kept pure ribbon for clarity and rocksdb_filterpolicy_create_ribbon_hybrid must be called for a hybrid configuration. While touching C API, I changed bits per key options from int to double. BuiltinFilterPolicy is needed so that LevelThresholdFilterPolicy doesn't inherit unused fields from BloomFilterPolicy. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8679 Test Plan: new + updated tests, including crash test Reviewed By: jay-zhuang Differential Revision: D30445797 Pulled By: pdillinger fbshipit-source-id: 6f5aeddfd6d79f7e55493b563c2d1d2d568892e1	2021-08-20 18:00:16 -07:00
Merlin Mao	d10801e983	Allow Replayer to report the results of TraceRecords. (#8657 ) Summary: `Replayer::Execute()` can directly returns the result (e.g, request latency, DB::Get() return code, returned value, etc.) `Replayer::Replay()` reports the results via a callback function. New interface: `TraceRecordResult` in "rocksdb/trace_record_result.h". `DBTest2.TraceAndReplay` and `DBTest2.TraceAndManualReplay` are updated accordingly. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8657 Reviewed By: ajkr Differential Revision: D30290216 Pulled By: autopear fbshipit-source-id: 3c8d4e6b180ec743de1a9d9dcaee86064c74f0d6	2021-08-18 17:06:14 -07:00
Adam Retter	48c468c22e	Use non-zero exit codes in benchmark.sh when the benchmark cannot be run (#8554 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8554 Reviewed By: ajkr Differential Revision: D29756562 Pulled By: mrambacher fbshipit-source-id: ab2f5ef988c8ac7ea7c633e6a3dacaf16f021529	2021-08-16 06:25:28 -07:00
Merlin Mao	f58d276764	Make TraceRecord and Replayer public (#8611 ) Summary: New public interfaces: `TraceRecord` and `TraceRecord::Handler`, available in "rocksdb/trace_record.h". `Replayer`, available in `rocksdb/utilities/replayer.h`. User can use `DB::NewDefaultReplayer()` to create a Replayer to auto/manual replay a trace file. Unit tests: - `./db_test2 --gtest_filter="DBTest2.TraceAndReplay"`: Updated with the internal API changes. - `./db_test2 --gtest_filter="DBTest2.TraceAndManualReplay"`: New for manual replay. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8611 Reviewed By: ajkr Differential Revision: D30266329 Pulled By: autopear fbshipit-source-id: 1ecb3cbbedae0f6a67c18f0cc82e002b4d81b6f8	2021-08-11 19:32:46 -07:00
Peter Dillinger	6450e9fc38	Update and enhance check_format_compatible.sh (#8651 ) Summary: The last few releases overlooked adding to this test. This change fixes that. This change also fixes the problem of older branches not understanding ROCKSDB_NO_FBCODE and referencing compilers no longer supported. During the test, build_detect_platform is patched to force no FBCODE compiler usage. (We should not need to update old branches perpetually.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/8651 Test Plan: local run reproduces regression described in https://github.com/facebook/rocksdb/issues/8650 Reviewed By: jay-zhuang, zhichao-cao Differential Revision: D30261872 Pulled By: pdillinger fbshipit-source-id: 02b447d224d7e0eb8613c63185437ded146713bc	2021-08-11 16:02:26 -07:00
Baptiste Lemaire	e3a96c4823	Memtable sampling for mempurge heuristic. (#8628 ) Summary: Changes the API of the MemPurge process: the `bool experimental_allow_mempurge` and `experimental_mempurge_policy` flags have been replaced by a `double experimental_mempurge_threshold` option. This change of API reflects another major change introduced in this PR: the MemPurgeDecider() function now works by sampling the memtables being flushed to estimate the overall amount of useful payload (payload minus the garbage), and then compare this useful payload estimate with the `double experimental_mempurge_threshold` value. Therefore, when the value of this flag is `0.0` (default value), mempurge is simply deactivated. On the other hand, a value of `DBL_MAX` would be equivalent to always going through a mempurge regardless of the garbage ratio estimate. At the moment, a `double experimental_mempurge_threshold` value else than 0.0 or `DBL_MAX` is opnly supported`with the `SkipList` memtable representation. Regarding the sampling, this PR includes the introduction of a `MemTable::UniqueRandomSample` function that collects (approximately) random entries from the memtable by using the new `SkipList::Iterator::RandomSeek()` under the hood, or by iterating through each memtable entry, depending on the target sample size and the total number of entries. The unit tests have been readapted to support this new API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8628 Reviewed By: pdillinger Differential Revision: D30149315 Pulled By: bjlemaire fbshipit-source-id: 1feef5390c95db6f4480ab4434716533d3947f27	2021-08-10 18:09:03 -07:00
Andrew Kryczka	82b81dc8b5	Simplify GenericRateLimiter algorithm (#8602 ) Summary: `GenericRateLimiter` slow path handles requests that cannot be satisfied immediately. Such requests enter a queue, and their thread stays in `Request()` until they are granted or the rate limiter is stopped. These threads are responsible for unblocking themselves. The work to do so is split into two main duties. (1) Waiting for the next refill time. (2) Refilling the bytes and granting requests. Prior to this PR, the slow path logic involved a leader election algorithm to pick one thread to perform (1) followed by (2). It elected the thread whose request was at the front of the highest priority non-empty queue since that request was most likely to be granted. This algorithm was efficient in terms of reducing intermediate wakeups, which is a thread waking up only to resume waiting after finding its request is not granted. However, the conceptual complexity of this algorithm was too high. It took me a long time to draw a timeline to understand how it works for just one edge case yet there were so many. This PR drops the leader election to reduce conceptual complexity. Now, the two duties can be performed by whichever thread acquires the lock first. The risk of this change is increasing the number of intermediate wakeups, however, we took steps to mitigate that. - `wait_until_refill_pending_` flag ensures only one thread performs (1). This\ prevents the thundering herd problem at the next refill time. The remaining\ threads wait on their condition variable with an unbounded duration -- thus we\ must remember to notify them to ensure forward progress. - (1) is typically done by a thread at the front of a queue. This is trivial\ when the queues are initially empty as the first choice that arrives must be\ the only entry in its queue. When queues are initially non-empty, we achieve\ this by having (2) notify a thread at the front of a queue (preferring higher\ priority) to perform the next duty. - We do not require any additional wakeup for (2). Typically it will just be\ done by the thread that finished (1). Combined, the second and third bullet points above suggest the refill/granting will typically be done by a request at the front of its queue. This is important because one wakeup is saved when a granted request happens to be in an already running thread. Note there are a few cases that still lead to intermediate wakeup, however. The first two are existing issues that also apply to the old algorithm, however, the third (including both subpoints) is new. - No request may be granted (only possible when rate limit dynamically\ decreases). - Requests from a different queue may be granted. - (2) may be run by a non-front request thread causing it to not be granted even\ if some requests in that same queue are granted. It can happen for a couple\ (unlikely) reasons. - A new request may sneak in and grab the lock at the refill time, before the\ thread finishing (1) can wake up and grab it. - A new request may sneak in and grab the lock and execute (1) before (2)'s\ chosen candidate can wake up and grab the lock. Then that non-front request\ thread performing (1) can carry over to perform (2). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8602 Test Plan: - Use existing tests. The edge cases listed in the comment are all performance\ related; I could not really think of any related to correctness. The logic\ looks the same whether a thread wakes up/finishes its work early/on-time/late,\ or whether the thread is chosen vs. "steals" the work. - Verified write throughput and CPU overhead are basically the same with and\ without this change, even in a rate limiter heavy workload: Test command: ``` $ rm -rf /dev/shm/dbbench/ && TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -benchmarks=fillrandom -num_multi_db=64 -num_low_pri_threads=64 -num_high_pri_threads=64 -write_buffer_size=262144 -target_file_size_base=262144 -max_bytes_for_level_base=1048576 -rate_limiter_bytes_per_sec=16777216 -key_size=24 -value_size=1000 -num=10000 -compression_type=none -rate_limiter_refill_period_us=1000 ``` Results before this PR: ``` fillrandom : 108.463 micros/op 9219 ops/sec; 9.0 MB/s 7.40user 8.84system 1:26.20elapsed 18%CPU (0avgtext+0avgdata 256140maxresident)k ``` Results after this PR: ``` fillrandom : 108.108 micros/op 9250 ops/sec; 9.0 MB/s 7.45user 8.23system 1:26.68elapsed 18%CPU (0avgtext+0avgdata 255688maxresident)k ``` Reviewed By: hx235 Differential Revision: D30048013 Pulled By: ajkr fbshipit-source-id: 6741bba9d9dfbccab359806d725105817fef818b	2021-08-09 16:47:15 -07:00
sdong	e7c24168d8	Move old files to warm tier in FIFO compactions (#8310 ) Summary: Some FIFO users want to keep the data for longer, but the old data is rarely accessed. This feature allows users to configure FIFO compaction so that data older than a threshold is moved to a warm storage tier. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8310 Test Plan: Add several unit tests. Reviewed By: ajkr Differential Revision: D28493792 fbshipit-source-id: c14824ea634814dee5278b449ab5c98b6e0b5501	2021-08-09 12:51:14 -07:00
Peter (Stig) Edwards	543a201b93	Remove unused variable - run_had_errors (#8599 ) Summary: Unused since `ab718b415f` . Noticed on `b215f1a832/files/tools/db_crashtest.py (xf254f528ad18f108)`:1 Pull Request resolved: https://github.com/facebook/rocksdb/pull/8599 Reviewed By: ajkr Differential Revision: D30057041 Pulled By: zhichao-cao fbshipit-source-id: e80438cf9717086d2bf67461e19393d426a7676e	2021-08-06 14:46:37 -07:00
HappyUncle	d56f74a4db	Update benchmark.sh (#8615 ) Summary: Fix help message. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8615 Reviewed By: siying Differential Revision: D30136092 Pulled By: mrambacher fbshipit-source-id: edf4112570514d709560baaf96a47c5f36f00665	2021-08-06 14:35:34 -07:00
mrambacher	d057e8326d	Make MergeOperator+CompactionFilter/Factory into Customizable Classes (#8481 ) Summary: - Changed MergeOperator, CompactionFilter, and CompactionFilterFactory into Customizable classes. - Added Options/Configurable/Object Registration for TTL and Cassandra variants - Changed the StringAppend MergeOperators to accept a string delimiter rather than a simple char. Made the delimiter into a configurable option - Added tests for new functionality Pull Request resolved: https://github.com/facebook/rocksdb/pull/8481 Reviewed By: zhichao-cao Differential Revision: D30136050 Pulled By: mrambacher fbshipit-source-id: 271d1772835935b6773abaf018ee71e42f9491af	2021-08-06 08:27:25 -07:00
mrambacher	ab7f7c9e49	Allow WAL dir to change with db dir (#8582 ) Summary: Prior to this change, the "wal_dir" DBOption would always be set (defaults to dbname) when the DBOptions were sanitized. Because of this setitng in the options file, it was not possible to rename/relocate a database directory after it had been created and use the existing options file. After this change, the "wal_dir" option is only set under specific circumstances. Methods were added to the ImmutableDBOptions class to see if it is set and if it is set to something other than the dbname. Additionally, a method was added to retrieve the effective value of the WAL dir (either the option or the dbname/path). Tests were added to the core and ldb to test that a database could be created and renamed without issue. Additional tests for various permutations of wal_dir were also added. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8582 Reviewed By: pdillinger, autopear Differential Revision: D29881122 Pulled By: mrambacher fbshipit-source-id: 67d3d033dc8813d59917b0a3fba2550c0efd6dfb	2021-07-30 12:16:44 -07:00
Baptiste Lemaire	9501279d5f	Create fillanddeleteuniquerandom benchmark (db_bench), with new option flags. (#8593 ) Summary: Introduction of a new `fillanddeleteuniquerandom` benchmark (`db_bench`) with 5 new option flags to simulate a benchmark where the following sequence is repeated multiple times: "A set of keys S1 is inserted ('`disposable entries`'), then after some delay another set of keys S2 is inserted ('`persistent entries`') and the first set of keys S1 is deleted. S2 artificially represents the insertion of hypothetical results from some undefined computation done on the first set of keys S1. The next sequence can start as soon as the last disposable entry in the set S1 of this sequence is inserted, if the `delay` is non negligible." New flags: - `disposable_entries_delete_delay`: minimum delay in microseconds between insertion of the last `disposable` entry, and the start of the insertion of the first `persistent` entry. - `disposable_entries_batch_size`: number of `disposable` entries inserted at the beginning of each sequence. - `disposable_entries_value_size`: size of the random `value` string for the `disposable` entries. - `persistent_entries_batch_size`: number of `persistent` entries inserted at the end of each sequence, right before the deletion of the `disposable` entries starts. - `persistent_entries_value_size`: size of the random value string for the `persistent` entries. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8593 Reviewed By: pdillinger Differential Revision: D29974436 Pulled By: bjlemaire fbshipit-source-id: f578033e5b45e8268ba6fa6f38f4770c2e6e801d	2021-07-29 17:23:01 -07:00
Peter Dillinger	0804b44fb6	Some fixes and enhancements to `ldb repair` (#8544 ) Summary: * Basic handling of SST file with just range tombstones rather than failing assertion about smallest_seqno <= largest_seqno * Adds --verbose option so that there exists a way to see the INFO output from Repairer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8544 Test Plan: unit test added, manual testing for --verbose Reviewed By: ajkr Differential Revision: D29954805 Pulled By: pdillinger fbshipit-source-id: 696af25805fc36cc178b04ba6045922a22625fd9	2021-07-28 16:44:14 -07:00
Baptiste Lemaire	d6006f9c9b	Add experimental mempurge policy flag to db_stress. (#8588 ) Summary: Add `experimental_mempurge_policy` flag to `db_stress` and `db_crashtest.py`. This flag is only read if the `experimental_allow_mempurge` flag is set to `true`. This flag can take the following values: `kAlways`, and `kAlternate` (default). - `kAlways`: a flush is always redirected to a mempurge. If the mempurge aborts, the a regular flush proceeds. - `kAlternate`: if one or more of the flush input memtables is an mempurge output memtable, then a flush is performed, else a mempurge is carried out. Similar to kAlways, if a mempurge aborts, the FlushJob proceeds to a regular flush to storage. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8588 Reviewed By: pdillinger Differential Revision: D29934251 Pulled By: bjlemaire fbshipit-source-id: 90c1debed2029b9915d066914556547507c33dae	2021-07-28 13:27:58 -07:00
mrambacher	3aee4fbd41	Make EventListener into a Customizable Class (#8473 ) Summary: - Added Type/CreateFromString - Added ability to load EventListeners to DBOptions - Since EventListeners did not previously have a Name(), defaulted to "". If there is no name, the listener cannot be loaded from the ObjectRegistry. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8473 Reviewed By: zhichao-cao Differential Revision: D29901488 Pulled By: mrambacher fbshipit-source-id: 2d3a4aa6db1562ac03e7ad41b360e3521d486254	2021-07-27 07:47:02 -07:00
Baptiste Lemaire	4361d6d163	Add simple heuristics for experimental mempurge. (#8583 ) Summary: Add `experimental_mempurge_policy` option flag and introduce two new `MemPurge` (Memtable Garbage Collection) policies: 'ALWAYS' and 'ALTERNATE'. Default value: ALTERNATE. `ALWAYS`: every flush will first go through a `MemPurge` process. If the output is too big to fit into a single memtable, then the mempurge is aborted and a regular flush process carries on. `ALWAYS` is designed for user that need to reduce the number of L0 SST file created to a strict minimum, and can afford a small dent in performance (possibly hits to CPU usage, read efficiency, and maximum burst write throughput). `ALTERNATE`: a flush is transformed into a `MemPurge` except if one of the memtables being flushed is the product of a previous `MemPurge`. `ALTERNATE` is a good tradeoff between reduction in number of L0 SST files created and performance. `ALTERNATE` perform particularly well for completely random garbage ratios, or garbage ratios anywhere in (0%,50%], and even higher when there is a wild variability in garbage ratios. This PR also includes support for `experimental_mempurge_policy` in `db_bench`. Testing was done locally by replacing all the `MemPurge` policies of the unit tests with `ALTERNATE`, as well as local testing with `db_crashtest.py` `whitebox` and `blackbox`. Overall, if an `ALWAYS` mempurge policy passes the tests, there is no reasons why an `ALTERNATE` policy would fail, and therefore the mempurge policy was set to `ALWAYS` for all mempurge unit tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8583 Reviewed By: pdillinger Differential Revision: D29888050 Pulled By: bjlemaire fbshipit-source-id: e2cf26646d66679f6f5fb29842624615610759c1	2021-07-26 11:56:29 -07:00
leipeng	2febf1c45c	db_bench_tool.cc: fix copy - paste (#8553 ) Summary: PR https://github.com/facebook/rocksdb/issues/8519 fix db_bench_tool.cc for MSVC build errors by simply copy-paste, this PR fix the copy-paste while also works for MSVC. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8553 Reviewed By: ajkr Differential Revision: D29838056 Pulled By: jay-zhuang fbshipit-source-id: 0cd60c146b87a355c3dc1061dfe813169d75cea4	2021-07-23 14:31:29 -07:00
Zhichao Cao	61c9bd49c1	Analyze MultiGet in trace_analyzer (#8575 ) Summary: Now we can analyze the MultiGet queries in the trace file and generate a set of the statistic and analysis files. Note that, when one MultiGet access N keys, we count each sub-get-query individually. But the over all query number is still the MultiGet not the sub-get-query. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8575 Test Plan: added new unit test and make check Reviewed By: anand1976 Differential Revision: D29860633 Pulled By: zhichao-cao fbshipit-source-id: a132128527f36828d266df8e36e3ec626c2170be	2021-07-22 16:52:20 -07:00
Baptiste Lemaire	6b4cdacf41	Add overwrite_probability for filluniquerandom benchmark in db_bench (#8569 ) Summary: Add flags `overwrite_probability` and `overwrite_window_size` flag to `db_bench`. Add the possibility of performing a `filluniquerandom` benchmark with an overwrite probability. For each write operation, there is a probability _p_ that the write is an overwrite (_p_=`overwrite_probability`). When an overwrite is decided, the key is randomly chosen from the last _N_ keys previously inserted into the DB (with _N_=`overwrite_window_size`). When a pure write is decided, the key inserted into the DB is unique and therefore will not be an overwrite. The `overwrite_window_size` is used so that the user can decide if the overwrite are mostly targeting recently inserted keys (when `overwrite_window_size` is small compared to the total number of writes), or can also target keys inserted "a long time ago" (when `overwrite_window_size` is comparable to total number of writes). Note that total number of writes = # of unique insertions + # of overwrites. No unit test specifically added. Local testing show the following throughputs for `filluniquerandom` with 1M total writes: - bypass the code inserts (no `overwrite_probability` flag specified): ~14.0MB/s - `overwrite_probability=0.99`, `overwrite_window_size=10`: ~17.0MB/s - `overwrite_probability=0.10`, `overwrite_window_size=10`: ~14.0MB/s - `overwrite_probability=0.99`, `overwrite_window_size=1M`: ~14.5MB/s - `overwrite_probability=0.10`, `overwrite_window_size=1M`: ~14.0MB/s Pull Request resolved: https://github.com/facebook/rocksdb/pull/8569 Reviewed By: pdillinger Differential Revision: D29818631 Pulled By: bjlemaire fbshipit-source-id: d472b4ea4e457a4da7c4ee4f14b40cccd6a4587a	2021-07-21 11:33:33 -07:00
sdong	bbc85a5f22	Fix minor wrong variable name in db_bench (#8549 ) Summary: Fix a minor variable name that is not accurate. This is recently introduced in https://github.com/facebook/rocksdb/pull/7818 Pull Request resolved: https://github.com/facebook/rocksdb/pull/8549 Reviewed By: zhichao-cao Differential Revision: D29745585 fbshipit-source-id: 6268b348878fdf99a162b2cc3d5876fbd9bb10d9	2021-07-19 17:08:15 -07:00
Baptiste Lemaire	f4529a54bb	Add experimental_allow_mempurge flag to benchmark. (#8546 ) Summary: Tiny PR to add the `experimental_allow_mempurge` to the `db_bench` tool (`Mempurge` is the current prototype for memtable garbage collection). This is useful to benchmark the prototype of this new feature, stress test it and help find new meaningful heuristics for GC. By default, the flag to allow `mempurge` is set to `false`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8546 Reviewed By: anand1976 Differential Revision: D29738338 Pulled By: bjlemaire fbshipit-source-id: 01892883a2f1c714c110718674da05992d6e2dd6	2021-07-19 11:19:21 -07:00
sdong	1e5b631e51	db_bench seekrandom with multiDB should only create iterators queried (#7818 ) Summary: Right now, db_bench with seekrandom and multiple DB setup creates iterator for all DBs just to query one of them. It's different from most real workloads. Fix it by only creating iterators that will be queried. Also fix a bug that DBs are not destroyed in multi-DB mode. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7818 Test Plan: Run db_bench with single/multiDB X using/not using tailing iterator with ASAN build, and validate the behavior is expected. Reviewed By: ajkr Differential Revision: D25720226 fbshipit-source-id: c2ff7ff7120e5ba64287a30b057c5d29b2cbe20b	2021-07-16 12:28:10 -07:00
Baptiste Lemaire	0229a88dfe	Crashtest mempurge (#8545 ) Summary: Add `experiemental_allow_mempurge` flag support for `db_stress` and `db_crashtest.py`, with a `false` default value. I succesfully tested locally both `whitebox` and `blackbox` crash tests with `experiemental_allow_mempurge` flag set as true. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8545 Reviewed By: akankshamahajan15 Differential Revision: D29734513 Pulled By: bjlemaire fbshipit-source-id: 24316c0eccf6caf409e95c035f31d822c66714ae	2021-07-16 10:20:22 -07:00
sherriiiliu	7b9ecd4067	fix several MSVC build errors (#8519 ) Summary: Fixed a few MSVC (VCToolsVersion=14.0) build errors and warnings * `DEFINE_string` is a macro and VC compiler complains that it cannot put [ifdef-inside-define](https://stackoverflow.com/questions/5586429/ifdef-inside-define) * `sleep()` is not a recognizable function. Use `FLAGS_env->SleepForMicroseconds` instead * Define precise type in comparison to avoid mismatch warning Pull Request resolved: https://github.com/facebook/rocksdb/pull/8519 Reviewed By: jay-zhuang Differential Revision: D29683086 fbshipit-source-id: 8c80941472089f8daba84ae29597e75e603850e4	2021-07-13 12:40:43 -07:00
mrambacher	da90e23998	Improvements to benchmark.sh script (#8346 ) Summary: 1. Fix printing of stats when there are no writes (wamp=0). Previously had a div0 error 2. Added multireadrandom command as a valid target 3. Added ability to pass additional command line options to db_bench. Now can say things like benchmark.sh readrandom --mmap_read and the option will be passed to db_bench. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8346 Reviewed By: zhichao-cao Differential Revision: D29500436 Pulled By: mrambacher fbshipit-source-id: 54e90708aae9133be3a903e35efdf8f8abbd86fa	2021-07-12 12:18:17 -07:00
Adam Retter	5afd1e309c	Correct CVS -> CSV typo (#8513 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8513 Reviewed By: jay-zhuang Differential Revision: D29654066 Pulled By: mrambacher fbshipit-source-id: b8f492fe21edd37fe1f1c5a4a0e9153f58bbf3e2	2021-07-12 05:05:16 -07:00
sdong	f33611d5e9	Stress test to inject read failures in DB reopen (#8476 ) Summary: Inject read failures in DB reopen, just as what we do for metadata writes and writes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8476 Test Plan: Some manual tests and make sure failures are triggered. Reviewed By: anand1976 Differential Revision: D29507283 fbshipit-source-id: d04da0163973447041038bd87701686a417c4e0c	2021-07-06 11:05:27 -07:00
mrambacher	570248aeff	Make SecondaryCache Customizable (#8480 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8480 Reviewed By: zhichao-cao Differential Revision: D29528740 Pulled By: mrambacher fbshipit-source-id: fd0f70d15f66611c8498257a9973f7e98ca13839	2021-07-06 09:18:08 -07:00
Peter (Stig) Edwards	b20737709f	Add -report_open_timing to db_bench (#8464 ) Summary: Hello and thanks for RocksDB, This PR adds support for ```-report_open_timing true``` to ```db_bench```. It can be useful when tuning RocksDB on filesystem/env with high latencies for file level operations (create/delete/rename...) seen during ```((Optimistic)Transaction)DB::Open```. Some examples: ``` > db_bench -benchmarks updaterandom -num 1 -db /dev/shm/db_bench > db_bench -benchmarks updaterandom -num 0 -db /dev/shm/db_bench -use_existing_db true -report_open_timing true -readonly true 2>&1 \| grep OpenDb OpenDb: 3.90133 milliseconds > db_bench -benchmarks updaterandom -num 0 -db /dev/shm/db_bench -use_existing_db true -report_open_timing true -use_secondary_db true 2>&1 \| grep OpenDb OpenDb: 3.33414 milliseconds > db_bench -benchmarks updaterandom -num 0 -db /dev/shm/db_bench -use_existing_db true -report_open_timing true 2>&1 \| grep -A1 OpenDb OpenDb: 6.05423 milliseconds > db_bench -benchmarks updaterandom -num 1 > db_bench -benchmarks updaterandom -num 0 -use_existing_db true -report_open_timing true -readonly true 2>&1 \| grep OpenDb OpenDb: 4.06859 milliseconds > db_bench -benchmarks updaterandom -num 0 -use_existing_db true -report_open_timing true -use_secondary_db true 2>&1 \| grep OpenDb OpenDb: 2.85794 milliseconds > db_bench -benchmarks updaterandom -num 0 -use_existing_db true -report_open_timing true 2>&1 \| grep OpenDb OpenDb: 6.46376 milliseconds > db_bench -benchmarks updaterandom -num 1 -db /clustered_fs/db_bench > db_bench -benchmarks updaterandom -num 0 -db /clustered_fs/db_bench -use_existing_db true -report_open_timing true -readonly true 2>&1 \| grep OpenDb OpenDb: 3.79805 milliseconds > db_bench -benchmarks updaterandom -num 0 -db /clustered_fs/db_bench -use_existing_db true -report_open_timing true -use_secondary_db true 2>&1 \| grep OpenDb OpenDb: 3.00174 milliseconds > db_bench -benchmarks updaterandom -num 0 -db /clustered_fs/db_bench -use_existing_db true -report_open_timing true 2>&1 \| grep OpenDb OpenDb: 24.8732 milliseconds ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/8464 Reviewed By: hx235 Differential Revision: D29398096 Pulled By: zhichao-cao fbshipit-source-id: 8f05dc3284f084612a3f30234e39e1c37548f50c	2021-07-01 18:42:19 -07:00
sdong	ba224b75c7	Stress Test to inject write failures in reopen (#8474 ) Summary: Previously Stress can inject metadata write failures when reopening a DB. We extend it to file append too, in the same way. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8474 Test Plan: manually run crash test with various setting and make sure the failures are triggered as expected. Reviewed By: zhichao-cao Differential Revision: D29503116 fbshipit-source-id: e73a446e80ccbd09301a579280e56ff949381fab	2021-06-30 16:46:41 -07:00
anand76	6f9ed59b1d	Allow db_stress to use a secondary cache (#8455 ) Summary: Add a ```-secondary_cache_uri``` to db_stress to allow the user to specify a custom ```SecondaryCache``` object from the object registry. Also allow db_crashtest.py to be run with an alternate db_stress location. Together, these changes will allow us to run db_stress using FB internal components. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8455 Reviewed By: zhichao-cao Differential Revision: D29371972 Pulled By: anand1976 fbshipit-source-id: dd1b1fd80ebbedc11aa63d9246ea6ae49edb77c4	2021-06-27 23:54:39 -07:00
Akanksha Mahajan	be8199cdb9	Run Merge with Integrated BlobDB in stress, crash and db_bench (#8461 ) Summary: Run Merge with Intergrated BlobDB in stress tests, crash tests and db_bench. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8461 Test Plan: 1. python3 -u tools/db_crashtest.py --simple whitebox ---use_merge=1 --enable_blob_files=1 2. ./db_bench --benchmarks="readwhilemerging" --merge_operator=uint64add --enable_blob_files=true Reviewed By: ltamasi Differential Revision: D29394824 Pulled By: akankshamahajan15 fbshipit-source-id: 0a8e492b13129673e088fb8af3402ab678bb473a	2021-06-25 10:45:52 -07:00
Peter (Stig) Edwards	75741eb0ce	Add more ops to: db_bench -report_file_operations (#8448 ) Summary: Hello and thanks for RocksDB, Here is a PR to add file deletes, renames and ```Flush()```, ```Sync()```, ```Fsync()``` and ```Close()``` to file ops report. The reason is to help tune RocksDB options when using an env/filesystem with high latencies for file level ("metadata") operations, typically seen during ```DB::Open``` (```db_bench -num 0``` also see https://github.com/facebook/rocksdb/pull/7203 where IOTracing does not trace ```DB::Open```). Before: ``` > db_bench -benchmarks updaterandom -num 0 -report_file_operations true ... Entries: 0 ... Num files opened: 12 Num Read(): 6 Num Append(): 8 Num bytes read: 6216 Num bytes written: 6289 ``` After: ``` > db_bench -benchmarks updaterandom -num 0 -report_file_operations true ... Entries: 0 ... Num files opened: 12 Num files deleted: 3 Num files renamed: 4 Num Flush(): 10 Num Sync(): 5 Num Fsync(): 1 Num Close(): 2 Num Read(): 6 Num Append(): 8 Num bytes read: 6216 Num bytes written: 6289 ``` Before: ``` > db_bench -benchmarks updaterandom -report_file_operations true ... Entries: 1000000 ... Num files opened: 18 Num Read(): 396339 Num Append(): 1000058 Num bytes read: 892030224 Num bytes written: 187569238 ``` After: ``` > db_bench -benchmarks updaterandom -report_file_operations true ... Entries: 1000000 ... Num files opened: 18 Num files deleted: 5 Num files renamed: 4 Num Flush(): 1000068 Num Sync(): 9 Num Fsync(): 1 Num Close(): 6 Num Read(): 396339 Num Append(): 1000058 Num bytes read: 892030224 Num bytes written: 187569238 ``` Another example showing how using ```DB::OpenForReadOnly``` reduces file operations compared to ```((Optimistic)Transaction)DB::Open```: ``` > db_bench -benchmarks updaterandom -num 1 > db_bench -benchmarks updaterandom -num 0 -use_existing_db true -readonly true -report_file_operations true ... Entries: 0 ... Num files opened: 8 Num files deleted: 0 Num files renamed: 0 Num Flush(): 0 Num Sync(): 0 Num Fsync(): 0 Num Close(): 0 Num Read(): 13 Num Append(): 0 Num bytes read: 374 Num bytes written: 0 ``` ``` > db_bench -benchmarks updaterandom -num 1 > db_bench -benchmarks updaterandom -num 0 -use_existing_db true -report_file_operations true ... Entries: 0 ... Num files opened: 14 Num files deleted: 3 Num files renamed: 4 Num Flush(): 14 Num Sync(): 5 Num Fsync(): 1 Num Close(): 3 Num Read(): 11 Num Append(): 10 Num bytes read: 7291 Num bytes written: 7357 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/8448 Reviewed By: anand1976 Differential Revision: D29333818 Pulled By: zhichao-cao fbshipit-source-id: a06a8c87f799806462319115195b3e94faf5f542	2021-06-24 11:56:51 -07:00
Baptiste Lemaire	3f20925dc4	Add list live files metadata (#8446 ) Summary: Add an argument to ldb to dump live file names, column families, and levels, `list_live_files_metadata`. The output shows all active SST file names, sorted first by column family and then by level. For each level the SST files are sorted alphabetically. Typically, the output looks like this: ``` ./ldb --db=/tmp/test_db list_live_files_metadata Live SST Files: ===== Column Family: default ===== ---------- level 0 ---------- /tmp/test_db/000069.sst ---------- level 1 ---------- /tmp/test_db/000064.sst /tmp/test_db/000065.sst /tmp/test_db/000066.sst /tmp/test_db/000071.sst ---------- level 2 ---------- /tmp/test_db/000038.sst /tmp/test_db/000039.sst /tmp/test_db/000052.sst /tmp/test_db/000067.sst /tmp/test_db/000070.sst ------------------------------ ``` Second, a flag was added `--sort_by_filename`, to change the layout of the output. When this flag is added to the command, the output shows all active SST files sorted by name, in front of which the LSM level and the column family are mentioned. With the same example, the following command would return: ``` ./ldb --db=/tmp/test_db list_live_files_metadata --sort_by_filename Live SST Files: /tmp/test_db/000038.sst : level 2, column family 'default' /tmp/test_db/000039.sst : level 2, column family 'default' /tmp/test_db/000052.sst : level 2, column family 'default' /tmp/test_db/000064.sst : level 1, column family 'default' /tmp/test_db/000065.sst : level 1, column family 'default' /tmp/test_db/000066.sst : level 1, column family 'default' /tmp/test_db/000067.sst : level 2, column family 'default' /tmp/test_db/000069.sst : level 0, column family 'default' /tmp/test_db/000070.sst : level 2, column family 'default' /tmp/test_db/000071.sst : level 1, column family 'default' ------------------------------ ``` Thus, the user can either request to show the files by levels, or sorted by filenames. This PR includes a simple Python unit test that makes sure the file name and level printed out by this new feature matches the one found with an existing feature, `dump_live_file`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8446 Reviewed By: akankshamahajan15 Differential Revision: D29320080 Pulled By: bjlemaire fbshipit-source-id: 01fb7b5637c59010d74c80730a28d815994e7009	2021-06-22 19:07:46 -07:00
Baptiste Lemaire	0a1aed4e71	Fix double slashes in user-provided db path. (#8439 ) Summary: At the moment, the following command : "`./ --db=mypath/ dump_file_files`" returns a series of erronous names with double slashes, ie: "`mypath//000xxx.sst`", including manifest file names with double slashes "`mypath//MANIFEST-00XXX`", whereas "`./ --db=mypath dump_file_files`" correctly returns "`mypath/000xxx.sst`" and "`mypath/MANIFEST-00XXX`". This (very short) PR simply checks if there is a need to add or remove any '`/`' character when the `db_path` and `manifest_filename`/sst `filenames` are concatenated. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8439 Reviewed By: akankshamahajan15 Differential Revision: D29301349 Pulled By: bjlemaire fbshipit-source-id: 3e9e58f9749d278b654ae838fcee13ad698705a8	2021-06-22 11:46:25 -07:00
Zhichao Cao	82a70e1470	Trace MultiGet Keys and CF_IDs to the trace file (#8421 ) Summary: Tracing the MultiGet information including timestamp, keys, and CF_IDs to the trace file for analyzing and replay. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8421 Test Plan: make check, add test to trace_analyzer_test Reviewed By: anand1976 Differential Revision: D29221195 Pulled By: zhichao-cao fbshipit-source-id: 30c677d6c39ab31ef4bbdf7e0d1fa1fd79f295ff	2021-06-18 15:04:05 -07:00
Akanksha Mahajan	5ba1b6e549	Cache warming data blocks during flush (#8242 ) Summary: This PR prepopulates warm/hot data blocks which are already in memory into block cache at the time of flush. On a flush, the data block that is in memory (in memtables) get flushed to the device. If using Direct IO, additional IO is incurred to read this data back into memory again, which is avoided by enabling newly added option. Right now, this is enabled only for flush for data blocks. We plan to expand this option to cover compactions in the future and for other types of blocks. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8242 Test Plan: Add new unit test Reviewed By: anand1976 Differential Revision: D28521703 Pulled By: akankshamahajan15 fbshipit-source-id: 7219d6958821cedce689a219c3963a6f1a9d5f05	2021-06-17 21:56:47 -07:00
Peter Dillinger	865a25101d	Mark Ribbon filter and optimize_filters_for_memory as production (#8408 ) Summary: Marked the Ribbon filter and optimize_filters_for_memory features as production-ready, each enabling memory savings for Bloom-like filters. Use `NewRibbonFilterPolicy` in place of `NewBloomFilterPolicy` to use Ribbon filters instead of Bloom, or `ribbonfilter` in place of `bloomfilter` in configuration string. Some small refactoring in db_stress. Removed/refactored unused code in db_bench, in part preparing for future default possibly being different from "disabled." Pull Request resolved: https://github.com/facebook/rocksdb/pull/8408 Test Plan: Lots of prior automated, ad-hoc, and "real world" testing. Updated tests for new API names. Quick db_bench test: bloom fillrandom 77730 ops/sec rocksdb.block.cache.filter.bytes.insert COUNT : 89929384 ribbon fillrandom 71492 ops/sec rocksdb.block.cache.filter.bytes.insert COUNT : 64531384 Reviewed By: mrambacher Differential Revision: D29140805 Pulled By: pdillinger fbshipit-source-id: d742c922722421678f95ad85eeb0aaebc9f5e49a	2021-06-17 12:29:16 -07:00
mrambacher	281ac9c89e	Add CreateFrom methods to Env/FileSystem (#8174 ) Summary: - Added CreateFromString method to Env and FilesSystem to replace LoadEnv/Load. This method/signature is a precursor to making these classes extend Customizable. - Added CreateFromSystem to Env. This method standardizes creating an Env from the environment variables. Previously, some places would check TEST_ENV_URI and others would also check TEST_FS_URI. Now the code is more command/standardized. - Added CreateFromFlags to Env. These method allows Env to be create from string options (such as GFLAGS options) in a more standard way. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8174 Reviewed By: zhichao-cao Differential Revision: D28999603 Pulled By: mrambacher fbshipit-source-id: 88e6911e7e91f908458a7fe10a20e93ecbc275fb	2021-06-15 03:43:48 -07:00
Baptiste Lemaire	d61a449364	Fixed manifest_dump issues when printing keys and values containing null characters (#8378 ) Summary: Changed fprintf function to fputc in ApplyVersionEdit, and replaced null characters with whitespaces. Added unit test in ldb_test.py - verifies that manifest_dump --verbose output is correct when keys and values containing null characters are inserted. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8378 Reviewed By: pdillinger Differential Revision: D29034584 Pulled By: bjlemaire fbshipit-source-id: 50833687a8a5f726e247c38457eadc3e6dbab862	2021-06-10 12:55:20 -07:00
Zhichao Cao	f44e69c64a	Use DbSessionId as cache key prefix when secondary cache is enabled (#8360 ) Summary: Currently, we either use the file system inode or a monotonically incrementing runtime ID as the block cache key prefix. However, if we use a monotonically incrementing runtime ID (in the case that the file system does not support inode id generation), in some cases, it cannot ensure uniqueness (e.g., we have secondary cache migrated from host to host). We use DbSessionID (20 bytes) + current file number (at most 10 bytes) as the new cache block key prefix when the secondary cache is enabled. So can accommodate scenarios such as transfer of cache state across hosts. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8360 Test Plan: add the test to lru_cache_test Reviewed By: pdillinger Differential Revision: D29006215 Pulled By: zhichao-cao fbshipit-source-id: 6cff686b38d83904667a2bd39923cd030df16814	2021-06-10 11:02:43 -07:00
Jay Zhuang	eda83eaac0	Fix cmake build failure with gflags (#8324 ) Summary: - Fix cmake build failure with gflags. - Add CI tests for both gflags 2.1 and 2.2. - Fix ctest config with gtest. - Add CI to run test with ctest. One benefit of ctest is it support timeout, it's set to 5min in our CI, so we will know which test is hang. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8324 Test Plan: CI pass Reviewed By: ajkr Differential Revision: D28762517 Pulled By: jay-zhuang fbshipit-source-id: 09063c5af5f9f33abfcdeb48593acbd9826cd199	2021-06-01 14:43:15 -07:00
sdong	ab718b415f	Kill whitebox crash test if it is 15 minutes over the limit (#8341 ) Summary: Whitebox crash test can run significantly over the time limit for test slowness or no kiling points. This indefinite job can create problem when this test is periodically scheduled as a job. Instead, kill the job if it is 15 minutes over the limit. Refactor the code slightly to consolidate the code for executing commands for white and black box tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8341 Test Plan: Run both of black and white box tests with both of natual and explicit kill condition. Reviewed By: jay-zhuang Differential Revision: D28756170 fbshipit-source-id: f253149890e62ace78f871be927e093e9b12f49b	2021-06-01 09:34:53 -07:00

1 2 3 4 5 ...

1169 Commits