rocksdb

Author	SHA1	Message	Date
Mike Kolupaev	637e64b9ac	Add an option to prevent DB::Open() from querying sizes of all sst files (#6353 ) Summary: When paranoid_checks is on, DBImpl::CheckConsistency() iterates over all sst files and calls Env::GetFileSize() for each of them. As far as I could understand, this is pretty arbitrary and doesn't affect correctness - if filesystem doesn't corrupt fsynced files, the file sizes will always match; if it does, it may as well corrupt contents as well as sizes, and rocksdb doesn't check contents on open. If there are thousands of sst files, getting all their sizes takes a while. If, on top of that, Env is overridden to use some remote storage instead of local filesystem, it can be really slow and overload the remote storage service. This PR adds an option to not do GetFileSize(); instead it does GetChildren() for parent directory to check that all the expected sst files are at least present, but doesn't check their sizes. We can't just disable paranoid_checks instead because paranoid_checks do a few other important things: make the DB read-only on write errors, print error messages on read errors, etc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6353 Test Plan: ran the added sanity check unit test. Will try it out in a LogDevice test cluster where the GetFileSize() calls are causing a lot of trouble. Differential Revision: D19656425 Pulled By: al13n321 fbshipit-source-id: c2c421b367633033760d1f56747bad206d1fbf82	2020-02-04 01:27:26 -08:00
Adam Retter	7242dae7fe	Improve RocksJava Comparator (#6252 ) Summary: This is a redesign of the API for RocksJava comparators with the aim of improving performance. It also simplifies the class hierarchy. NOTE: This breaks backwards compatibility for existing 3rd party Comparators implemented in Java... so we need to consider carefully which release branches this goes into. Previously when implementing a comparator in Java the developer had a choice of subclassing either `DirectComparator` or `Comparator` which would use direct and non-direct byte-buffers resepectively (via `DirectSlice` and `Slice`). In this redesign there we have eliminated the overhead of using the Java Slice classes, and just use `ByteBuffer`s. The `ComparatorOptions` supplied when constructing a Comparator allow you to choose between direct and non-direct byte buffers by setting `useDirect`. In addition, the `ComparatorOptions` now allow you to choose whether a ByteBuffer is reused over multiple comparator calls, by setting `maxReusedBufferSize > 0`. When buffers are reused, ComparatorOptions provides a choice of mutex type by setting `useAdaptiveMutex`. --- [JMH benchmarks previously indicated](https://github.com/facebook/rocksdb/pull/6241#issue-356398306) that the difference between C++ and Java for implementing a comparator was ~7x slowdown in Java. With these changes, when reusing buffers and guarding access to them via mutexes the slowdown is approximately the same. However, these changes offer a new facility to not reuse mutextes, which reduces the slowdown to ~5.5x in Java. We also offer a `thread_local` mechanism for reusing buffers, which reduces slowdown to ~5.2x in Java (closes https://github.com/facebook/rocksdb/pull/4425). These changes also form a good base for further optimisation work such as further JNI lookup caching, and JNI critical. --- These numbers were captured without jemalloc. With jemalloc, the performance improves for all tests, and the Java slowdown reduces to between 4.8x and 5.x. ``` ComparatorBenchmarks.put native_bytewise thrpt 25 124483.795 ± 2032.443 ops/s ComparatorBenchmarks.put native_reverse_bytewise thrpt 25 114414.536 ± 3486.156 ops/s ComparatorBenchmarks.put java_bytewise_non-direct_reused-64_adaptive-mutex thrpt 25 17228.250 ± 1288.546 ops/s ComparatorBenchmarks.put java_bytewise_non-direct_reused-64_non-adaptive-mutex thrpt 25 16035.865 ± 1248.099 ops/s ComparatorBenchmarks.put java_bytewise_non-direct_reused-64_thread-local thrpt 25 21571.500 ± 871.521 ops/s ComparatorBenchmarks.put java_bytewise_direct_reused-64_adaptive-mutex thrpt 25 23613.773 ± 8465.660 ops/s ComparatorBenchmarks.put java_bytewise_direct_reused-64_non-adaptive-mutex thrpt 25 16768.172 ± 5618.489 ops/s ComparatorBenchmarks.put java_bytewise_direct_reused-64_thread-local thrpt 25 23921.164 ± 8734.742 ops/s ComparatorBenchmarks.put java_bytewise_non-direct_no-reuse thrpt 25 17899.684 ± 839.679 ops/s ComparatorBenchmarks.put java_bytewise_direct_no-reuse thrpt 25 22148.316 ± 1215.527 ops/s ComparatorBenchmarks.put java_reverse_bytewise_non-direct_reused-64_adaptive-mutex thrpt 25 11311.126 ± 820.602 ops/s ComparatorBenchmarks.put java_reverse_bytewise_non-direct_reused-64_non-adaptive-mutex thrpt 25 11421.311 ± 807.210 ops/s ComparatorBenchmarks.put java_reverse_bytewise_non-direct_reused-64_thread-local thrpt 25 11554.005 ± 960.556 ops/s ComparatorBenchmarks.put java_reverse_bytewise_direct_reused-64_adaptive-mutex thrpt 25 22960.523 ± 1673.421 ops/s ComparatorBenchmarks.put java_reverse_bytewise_direct_reused-64_non-adaptive-mutex thrpt 25 18293.317 ± 1434.601 ops/s ComparatorBenchmarks.put java_reverse_bytewise_direct_reused-64_thread-local thrpt 25 24479.361 ± 2157.306 ops/s ComparatorBenchmarks.put java_reverse_bytewise_non-direct_no-reuse thrpt 25 7942.286 ± 626.170 ops/s ComparatorBenchmarks.put java_reverse_bytewise_direct_no-reuse thrpt 25 11781.955 ± 1019.843 ops/s ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/6252 Differential Revision: D19331064 Pulled By: pdillinger fbshipit-source-id: 1f3b794e6a14162b2c3ffb943e8c0e64a0c03738	2020-02-03 12:30:13 -08:00
Maysam Yabandeh	3316d29221	Disable recycle_log_file_num when it is incompatible with recovery mode (#6351 ) Summary: Non-zero recycle_log_file_num is incompatible with kPointInTimeRecovery and kAbsoluteConsistency recovery modes. Currently SanitizeOptions changes the recovery mode to kTolerateCorruptedTailRecords, while to resolve this option conflict it makes more sense to compromise recycle_log_file_num, which is a performance feature, instead of wal_recovery_mode, which is a safety feature. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6351 Differential Revision: D19648931 Pulled By: maysamyabandeh fbshipit-source-id: dd0bf78349edc007518a00c4d63931fd69294ad7	2020-01-31 07:28:30 -08:00
anand76	fb05b5a652	Force a new manifest file if append to current one fails (#6331 ) Summary: Fix for issue https://github.com/facebook/rocksdb/issues/6316 When an append/sync of the manifest file fails due to an IO error such as NoSpace, we don't always put the DB in read-only mode. This is true for flush and compactions, as well as foreground operatons such as column family add/drop, CompactFiles etc. Subsequent changes to the DB will be recorded in the same manifest file, which would have a corrupted record in the middle due to the previous failure. On next DB::Open(), it will fail to process the full manifest and data will be lost. To fix this, we reset VersionSet::descriptor_log_ on append/sync failure, which will force a new manifest file to be written on the next append. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6331 Test Plan: Add new unit tests in error_handler_test.cc Differential Revision: D19632951 Pulled By: anand1976 fbshipit-source-id: 68d527cb6e59a94cbbbf9f5a17a7f464381d51e3	2020-01-30 10:56:29 -08:00
Levi Tamasi	9e3ace42a4	Add statistics for BlobDB GC (#6296 ) Summary: The patch adds statistics support to the new BlobDB garbage collection implementation; namely, it adds support for the following (pre-existing) tickers: `BLOB_DB_GC_NUM_FILES`: the number of blob files obsoleted by the GC logic. `BLOB_DB_GC_NUM_NEW_FILES`: the number of new blob files generated by the GC logic. `BLOB_DB_GC_FAILURES`: the number of failed GC passes (where a GC pass is equivalent to a (sub)compaction). `BLOB_DB_GC_NUM_KEYS_RELOCATED`: the number of blobs relocated to new blob files by the GC logic. `BLOB_DB_GC_BYTES_RELOCATED`: the total size of blobs relocated to new blob files. The tickers `BLOB_DB_GC_NUM_KEYS_OVERWRITTEN`, `BLOB_DB_GC_NUM_KEYS_EXPIRED`, `BLOB_DB_GC_BYTES_OVERWRITTEN`, `BLOB_DB_GC_BYTES_EXPIRED`, and `BLOB_DB_GC_MICROS` are not relevant for the new GC logic, and are thus marked deprecated. The patch also adds a couple of log messages that log the number and total size of blobs encountered and relocated during a GC pass, as well as the number of blob files created and obsoleted. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6296 Test Plan: Extended unit tests and used the BlobDB mode of `db_bench`. Differential Revision: D19402513 Pulled By: ltamasi fbshipit-source-id: d53d2bfbf4928a1db1e9346c67ebb9007b8932ec	2020-01-29 16:46:16 -08:00
Maysam Yabandeh	2f973ca96e	Double Crash in kPointInTimeRecovery with TransactionDB (#6313 ) Summary: In WritePrepared there could be gap in sequence numbers. This breaks the trick we use in kPointInTimeRecovery which assume the first seq in the log right after the corrupted log is one larger than the last seq we read from the logs. To let this trick keep working, we add a dummy entry with the expected sequence to the first log right after recovery. Also in WriteCommitted, if the log right after the corrupted log is empty, since it has no sequence number to let the sequential trick work, it is assumed as unexpected behavior. This is however expected to happen if we close the db after recovering from a corruption and before writing anything new to it. To remedy that, we apply the same technique by writing a dummy entry to the log that is created after the corrupted log. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6313 Differential Revision: D19458291 Pulled By: maysamyabandeh fbshipit-source-id: 09bc49e574690085df45b034ca863ff315937e2d	2020-01-29 11:40:55 -08:00
sdong	8f2bee6747	Add ReadOptions.auto_prefix_mode (#6314 ) Summary: Add a new option ReadOptions.auto_prefix_mode. When set to true, iterator should return the same result as total order seek, but may choose to do prefix seek internally, based on iterator upper bounds. Also fix two previous bugs when handling prefix extrator changes: (1) reverse iterator should not rely on upper bound to determine prefix. Fix it with skipping prefix check. (2) block-based filter is not handled properly. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6314 Test Plan: (1) add a unit test; (2) add the check to stress test and run see whether it can pass at least one run. Differential Revision: D19458717 fbshipit-source-id: 51c1bcc5cdd826c2469af201979a39600e779bce	2020-01-28 14:44:05 -08:00
sdong	7aa66c704f	Move HISTORY.md entry of hash index fix from 6.7 to unreleased (#6337 ) Summary: Commits related to hash index fix have been reverted in 6.7.fb branch. Update HISTORY.md to keep it in sync. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6337 Differential Revision: D19593717 fbshipit-source-id: 466178dc6205c9e41ccced41bf281a0952bdc2ca	2020-01-27 17:45:42 -08:00
sdong	f10f135938	Fix regression bug of hash index with iterator total order seek (#6328 ) Summary: https://github.com/facebook/rocksdb/pull/6028 introduces a bug for hash index in SST files. If a table reader is created when total order seek is used, prefix_extractor might be passed into table reader as null. While later when prefix seek is used, the same table reader used, hash index is checked but prefix extractor is null and the program would crash. Fix the issue by fixing http://github.com/facebook/rocksdb/pull/6028 in the way that prefix_extractor is preserved but ReadOptions.total_order_seek is checked Also, a null pointer check is added so that a bug like this won't cause segfault in the future. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6328 Test Plan: Add a unit test that would fail without the fix. Stress test that reproduces the crash would pass. Differential Revision: D19586751 fbshipit-source-id: 8de77690167ddf5a77a01e167cf89430b1bfba42	2020-01-27 15:44:54 -08:00
Fosco Marotto	bd698e4f55	Update version for next release, 6.7.0 (#6320 ) Summary: Adjusted history for 6.6.1 and 6.6.2, switched master version to 6.7.0. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6320 Differential Revision: D19499272 Pulled By: gfosco fbshipit-source-id: 2bafb2456951f231e411e9c03aaa4c044f497684	2020-01-24 15:36:32 -08:00
Levi Tamasi	f34782a67d	Fix the "records dropped" statistics (#6325 ) Summary: The earlier code used two conflicting definitions for the number of input records going into a compaction, one based on the `rocksdb.num.entries` table property and one based on `CompactionIterationStats`. The first one is correct and in line with how output records are counted, while the second one incorrectly ignores input records in various cases when the `CompactionIterator` advances or reseeks the input iterator (this can happen, amongst other cases, when dealing with `SingleDelete`s, regular `Delete`s, `Merge`s, and compaction filters). This can result in the code undercounting the input records and computing an incorrect value for "records dropped" during the compaction. The patch fixes this by switching over to the correct (table property based) input record count for "records dropped". Pull Request resolved: https://github.com/facebook/rocksdb/pull/6325 Test Plan: Tested using `make check` and `db_bench`. Differential Revision: D19525491 Pulled By: ltamasi fbshipit-source-id: 4340b0b2f41546db8e356db70ca02199e48fa636	2020-01-23 15:27:22 -08:00
anand76	0672a6db64	Fix queue manipulation in WriteThread::BeginWriteStall() (#6322 ) Summary: When there is a write stall, the active write group leader calls ```BeginWriteStall()``` to walk the queue of writers and remove any with the ```no_slowdown``` option set. There was a bug in the code which updated the back pointer but not the forward pointer (```link_newer```), corrupting the list and causing some threads to wait forever. This PR fixes it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6322 Test Plan: Add a unit test in db_write_test Differential Revision: D19538313 Pulled By: anand1976 fbshipit-source-id: 6fbed819e594913f435886606f5d36f74f235c3a	2020-01-23 14:01:28 -08:00
Levi Tamasi	73f65b457e	Adjust thread pool sizes when setting max_background_jobs dynamically (#6300 ) Summary: https://github.com/facebook/rocksdb/pull/2205 introduced a new configuration option called `max_background_jobs`, superseding the earlier options `max_background_flushes` and `max_background_compactions`. However, unlike `max_background_compactions`, setting `max_background_jobs` dynamically through the `SetDBOptions` interface does not adjust the size of the thread pools (see https://github.com/facebook/rocksdb/issues/6298). The patch fixes this. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6300 Test Plan: Extended unit test. Differential Revision: D19430899 Pulled By: ltamasi fbshipit-source-id: 704006605b3c13c3d1b997ccc0831ee369721074	2020-01-16 14:35:10 -08:00
sdong	d2b4d42d4b	Fix kHashSearch bug with SeekForPrev (#6297 ) Summary: When prefix is enabled the expected behavior when the prefix of the target does not exist is for Seek is to seek to any key larger than target and SeekToPrev to any key less than the target. Currently. the prefix index (kHashSearch) returns OK status but sets Invalid() to indicate two cases: a prefix of the searched key does not exist, ii) the key is beyond the range of the keys in SST file. The SeekForPrev implementation in BlockBasedTable thus does not have enough information to know when it should set the index key to first (to return a key smaller than target). The patch fixes that by returning NotFound status for cases that the prefix does not exist. SeekForPrev in BlockBasedTable accordingly SeekToFirst instead of SeekToLast on the index iterator. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6297 Test Plan: SeekForPrev of non-exsiting prefix is added to block_test.cc, and a test case is added in db_test2, which fails without the fix. Differential Revision: D19404695 fbshipit-source-id: cafbbf95f8f60ff9ede9ccc99d25bfa1cf6fcdc3	2020-01-15 14:28:39 -08:00
Maysam Yabandeh	d4b7fbf0d5	kHashSearch incompatible with index_block_restart_interval>1 (#6294 ) Summary: kHashSearch index type is incompatible with index_block_restart_interval larger than 1. The patch asserts that and also resets index_block_restart_interval value if it is incompatible with kHashSearch. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6294 Differential Revision: D19394229 Pulled By: maysamyabandeh fbshipit-source-id: 8a12712ab25e81094a7f71ecd43f773dd4fb6acd	2020-01-14 11:21:27 -08:00
sdong	894c6d21af	Bug when multiple files at one level contains the same smallest key (#6285 ) Summary: The fractional cascading index is not correctly generated when two files at the same level contains the same smallest or largest user key. The result would be that it would hit an assertion in debug mode and lower level files might be skipped. This might cause wrong results when the same user keys are of merge operands and Get() is called using the exact user key. In that case, the lower files would need to further checked. The fix is to fix the fractional cascading index. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6285 Test Plan: Add a unit test which would cause the assertion which would be fixed. Differential Revision: D19358426 fbshipit-source-id: 39b2b1558075fd95e99491d462a67f9f2298c48e	2020-01-13 16:27:42 -08:00
Sagar Vemuri	cfa585611d	Consider all compaction input files to compute the oldest ancestor time (#6279 ) Summary: Look at all compaction input files to compute the oldest ancestor time. In https://github.com/facebook/rocksdb/issues/5992 we changed how creation_time (aka oldest-ancestor-time) table property of compaction output files is computed from max(creation-time-of-all-compaction-inputs) to min(creation-time-of-all-inputs). This exposed a bug where, during compaction, the creation_time:s of only the L0 compaction inputs were being looked at, and all other input levels were being ignored. This PR fixes the issue. Some TTL compactions when using Level-Style compactions might not have run due to this bug. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6279 Test Plan: Enhanced the unit tests to validate that the correct time is propagated to the compaction outputs. Differential Revision: D19337812 Pulled By: sagar0 fbshipit-source-id: edf8a72f11e405e93032ff5f45590816debe0bb4	2020-01-10 19:02:42 -08:00
wolfkdy	1ab1231acf	parallel occ (#6240 ) Summary: This is a continuation of https://github.com/facebook/rocksdb/pull/5320/files I open a new mr for these purposes, half a year has past since the old mr is posted so it's almost impossible to fulfill some points below on the old mr, especially 5) 1) add validation modes for optimistic txns 2) modify unittests to test both modes 3) make format 4) refine hash functor 5) push to master Pull Request resolved: https://github.com/facebook/rocksdb/pull/6240 Differential Revision: D19301296 fbshipit-source-id: 5b5b3cbd39558f43947f7d2dec6cd31a06386edb	2020-01-07 14:20:38 -08:00
Yanqin Jin	1aaa145877	Fix a data race for cfd->log_number_ (#6249 ) Summary: A thread calling LogAndApply may release db mutex when calling WriteCurrentStateToManifest() which reads cfd->log_number_. Another thread can call SwitchMemtable() and writes to cfd->log_number_. Solution is to cache the cfd->log_number_ before releasing mutex in LogAndApply. Test Plan (on devserver): ``` $COMPILE_WITH_TSAN=1 make db_stress $./db_stress --acquire_snapshot_one_in=10000 --avoid_unnecessary_blocking_io=1 --block_size=16384 --bloom_bits=16 --bottommost_compression_type=zstd --cache_index_and_filter_blocks=1 --cache_size=1048576 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_ttl=0 --compression_max_dict_bytes=16384 --compression_type=zstd --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --db=/dev/shm/rocksdb/rocksdb_crashtest_blackbox --db_write_buffer_size=1048576 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --enable_pipelined_write=0 --flush_one_in=1000000 --format_version=5 --get_live_files_and_wal_files_one_in=1000000 --index_block_restart_interval=5 --index_type=0 --log2_keys_per_lock=22 --long_running_snapshots=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=1000000 --max_manifest_file_size=16384 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=3 --memtablerep=skip_list --mmap_read=0 --nooverwritepercent=1 --open_files=500000 --ops_per_thread=100000000 --partition_filters=0 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefixpercent=5 --progress_reports=0 --readpercent=45 --recycle_log_file_num=0 --reopen=20 --set_options_one_in=10000 --snapshot_hold_ops=100000 --subcompactions=2 --sync=1 --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --write_buffer_size=4194304 --write_dbid_to_manifest=1 --writepercent=35 ``` Then repeat the following multiple times, e.g. 100 after compiling with tsan. ``` $./db_test2 --gtest_filter=DBTest2.SwitchMemtableRaceWithNewManifest ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/6249 Differential Revision: D19235077 Pulled By: riversand963 fbshipit-source-id: 79467b52f48739ce7c27e440caa2447a40653173	2020-01-06 20:09:51 -08:00
Adam Retter	e697da0b18	RocksDB#keyMayExist should not assume database values are unicode strings (#6186 ) Summary: Closes https://github.com/facebook/rocksdb/issues/6183 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6186 Differential Revision: D19201281 Pulled By: pdillinger fbshipit-source-id: 1c96b4ea09e826f91e44b0009eba3de0991d9053	2019-12-20 14:27:58 -08:00
Levi Tamasi	7a7ca8eb5b	BlobDB: only compare CF IDs when checking whether an API call is for the default CF (#6226 ) Summary: BlobDB currently only supports using the default column family. The earlier code enforces this by comparing the `ColumnFamilyHandle` passed to the `Get`/`Put`/etc. call with the handle returned by `DefaultColumnFamily` (which, at the end of the day, comes from `DBImpl::default_cf_handle_`). Since other `ColumnFamilyHandle`s can also point to the default column family, this can reject legitimate requests as well. (As an example, with the earlier code, the handle returned by `BlobDB::Open` cannot actually be used in API calls.) The patch fixes this by comparing only the IDs of the column family handles instead of the pointers themselves. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6226 Test Plan: `make check` Differential Revision: D19187461 Pulled By: ltamasi fbshipit-source-id: 54ce2e12ebb1f07e6d1e70e3b1e0213dfa94bda2	2019-12-19 18:05:49 -08:00
Levi Tamasi	130e710056	Add BlobDB GC cutoff parameter to db_bench (#6211 ) Summary: The patch makes it possible to set the BlobDB configuration option `garbage_collection_cutoff` on the command line. In addition, it changes the `db_bench` code so that the default values of BlobDB related parameters are taken from the defaults of the actual BlobDB configuration options (note: this changes the the default of `blob_db_bytes_per_sync`). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6211 Test Plan: Ran `db_bench` with various values of the new parameter. Differential Revision: D19166895 Pulled By: ltamasi fbshipit-source-id: 305ccdf0123b9db032b744715810babdc3e3b7d5	2019-12-18 17:46:08 -08:00
Jermy Li	f453bcb40d	Add unit tests for concurrent CF iteration and drop (#6180 ) Summary: improve https://github.com/facebook/rocksdb/issues/6147 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6180 Differential Revision: D19148936 fbshipit-source-id: f691c9879fd51d54e96c1a99670cf85ca4485a89	2019-12-18 11:54:35 -08:00
Levi Tamasi	cbd58af9c3	Update HISTORY.md with recent BlobDB related changes Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6202 Differential Revision: D19144158 Pulled By: ltamasi fbshipit-source-id: 3e2522ced458568e3a2a045663704e30ab0ac223	2019-12-17 19:09:21 -08:00
anand1976	1be48cb895	Fix crash in Transaction::MultiGet() when num_keys > 32 Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6192 Test Plan: Add a unit test that fails without the fix and passes now make check Differential Revision: D19124781 Pulled By: anand1976 fbshipit-source-id: 8c8cb6fa16c3fc23ec011e168561a13f76bbd783	2019-12-16 20:39:35 -08:00
Levi Tamasi	2d095b4dbc	Update HISTORY.md with the recent memtable trimming fixes Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6194 Differential Revision: D19125292 Pulled By: ltamasi fbshipit-source-id: d41aca2755ec4bec07feedd6b561e8d18606a931	2019-12-16 15:19:52 -08:00
anand76	afa2420c2b	Introduce a new storage specific Env API (#5761 ) Summary: The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc. This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO. The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before. This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection. The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761 Differential Revision: D18868376 Pulled By: anand1976 fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f	2019-12-13 14:48:41 -08:00
奏之章	c4ce8e637f	Fix RangeDeletion bug (#6062 ) Summary: Read keys from a snapshot that a range deletion were added after the snapshot was created and this range deletion was inside an immutable memtable, we will get wrong key set. More detail rest in codes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6062 Differential Revision: D18966785 Pulled By: pdillinger fbshipit-source-id: 38a60bb1e2d0a1dbfc8ec641617200b6a02b86c3	2019-12-12 15:18:02 -08:00
ferhat elmas	afdc58d478	Fix typos in history Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6116 Differential Revision: D18935622 fbshipit-source-id: 59f7a7bc9f0116ae6354ea217896622a34329d3c	2019-12-11 11:04:46 -08:00
Yanqin Jin	09fcf4fb6b	Fix a potential bug scheduling unnecessary threads (#6104 ) Summary: RocksDB should decrement the counter `unscheduled_flushes_` as soon as the bg thread is scheduled. Before this fix, the counter is decremented only when the bg thread starts and picks an element from the flush queue. This may cause more than necessary bg threads to be scheduled. Not a correctness issue, but may affect flush thread count. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6104 Test Plan: ``` make check ``` Differential Revision: D18735584 Pulled By: riversand963 fbshipit-source-id: d36272d4a08a494aeeab6200a3cff7a3d1a2dc10	2019-11-27 14:48:49 -08:00
sdong	aa1857e2df	Support options.max_open_files = -1 with periodic_compaction_seconds (#6090 ) Summary: options.periodic_compaction_seconds isn't supported when options.max_open_files != -1. It's because that the information of file creation time is stored in table properties and are not guaranteed to be loaded unless options.max_open_files = -1. Relax this constraint by storing the information in manifest. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6090 Test Plan: Pass all existing tests; Modify an existing test to force the manifest value to take 0 to simulate backward compatibility case; manually open the DB generated with the change by release 4.2. Differential Revision: D18702268 fbshipit-source-id: 13e0bd94f546498a04f3dc5fc0d9dff5125ec9eb	2019-11-26 21:39:56 -08:00
anand76	496a6ae895	Fix HISTORY.md for 6.6.0 (#6096 ) Summary: Some of the entries were incorrectly listed under 6.5.0. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6096 Differential Revision: D18722801 Pulled By: gfosco fbshipit-source-id: 18d1187deb6a9d69a8feb68b727d2f720a65f2bc	2019-11-26 19:04:49 -08:00
Peter Dillinger	ca3b6c28c9	Expose and elaborate FilterBuildingContext (#6088 ) Summary: This change enables custom implementations of FilterPolicy to wrap a variety of NewBloomFilterPolicy and select among them based on contextual information such as table level and compaction style. * Moves FilterBuildingContext to public API and elaborates it with more useful data. (It would be nice to put more general options-like data, but at the time this object is constructed, we are using internal APIs ImmutableCFOptions and MutableCFOptions and don't have easy access to ColumnFamilyOptions that I can tell.) * Renames BloomFilterPolicy::GetFilterBitsBuilderInternal to GetBuilderWithContext, because it's now public. * Plumbs through the table's "level_at_creation" for filter building context. * Simplified some tests by adding GetBuilder() to MockBlockBasedTableTester. * Adds test as DBBloomFilterTest.ContextCustomFilterPolicy, including sample wrapper class LevelAndStyleCustomFilterPolicy. * Fixes a cross-test bug in DBBloomFilterTest.OptimizeFiltersForHits where it does not reset perf context. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6088 Test Plan: make check, valgrind on db_bloom_filter_test Differential Revision: D18697817 Pulled By: pdillinger fbshipit-source-id: 5f987a2d7b07cc7a33670bc08ca6b4ca698c1cf4	2019-11-26 18:24:10 -08:00
Peter Dillinger	57f3032285	Allow fractional bits/key in BloomFilterPolicy (#6092 ) Summary: There's no technological impediment to allowing the Bloom filter bits/key to be non-integer (fractional/decimal) values, and it provides finer control over the memory vs. accuracy trade-off. This is especially handy in using the format_version=5 Bloom filter in place of the old one, because bits_per_key=9.55 provides the same accuracy as the old bits_per_key=10. This change not only requires refining the logic for choosing the best num_probes for a given bits/key setting, it revealed a flaw in that logic. As bits/key gets higher, the best num_probes for a cache-local Bloom filter is closer to bpk / 2 than to bpk * 0.69, the best choice for a standard Bloom filter. For example, at 16 bits per key, the best num_probes is 9 (FP rate = 0.0843%) not 11 (FP rate = 0.0884%). This change fixes and refines that logic (for the format_version=5 Bloom filter only, just in case) based on empirical tests to find accuracy inflection points between each num_probes. Although bits_per_key is now specified as a double, the new Bloom filter converts/rounds this to "millibits / key" for predictable/precise internal computations. Just in case of unforeseen compatibility issues, we round to the nearest whole number bits / key for the legacy Bloom filter, so as not to unlock new behaviors for it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6092 Test Plan: unit tests included Differential Revision: D18711313 Pulled By: pdillinger fbshipit-source-id: 1aa73295f152a995328cb846ef9157ae8a05522a	2019-11-26 15:59:34 -08:00
sdong	77eab5c85a	Make default value of options.ttl to be 30 days when it is supported. (#6073 ) Summary: By default options.ttl is disabled. We believe a better default will be 30 days, which means deleted data the database will be removed from SST files slightly after 30 days, for most of the cases. Make the default UINT64_MAX - 1 to indicate that it is not overridden by users. Change periodic_compaction_seconds to be UINT64_MAX - 1 to UINT64_MAX too to be consistent. Also fix a small bug in the previous periodic_compaction_seconds default code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6073 Test Plan: Add unit tests for it. Differential Revision: D18669626 fbshipit-source-id: 957cd4374cafc1557d45a0ba002010552a378cc8	2019-11-26 10:00:32 -08:00
Sebastiano Peluso	fcd7e03832	Ignore value of BackupableDBOptions::max_valid_backups_to_open when B… (#6072 ) Summary: This change ignores the value of BackupableDBOptions::max_valid_backups_to_open when a BackupEngine is not read-only. Issue: https://github.com/facebook/rocksdb/issues/4997 Note on tests: I had to remove test case WriteOnlyEngine of BackupableDBTest because it was not consistent with the new semantic of BackupableDBOptions::max_valid_backups_to_open. Maybe, we should think about adding a new interface for append-only BackupEngines. On the other hand, I changed LimitBackupsOpened test case to use a read-only BackupEngine, and I added a new specific test case for the change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6072 Reviewed By: pdillinger Differential Revision: D18687364 Pulled By: sebastianopeluso fbshipit-source-id: 77bc1f927d623964d59137a93de123bbd719da4e	2019-11-26 10:00:31 -08:00
sdong	0bc87442ae	Update HISTORY.md for forward compatibility (#6085 ) Summary: https://github.com/facebook/rocksdb/pull/6060 broke forward compatiblity for releases from 3.10 to 4.2. Update HISTORY.md to mention it. Also remove it from the compatibility tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6085 Differential Revision: D18691694 fbshipit-source-id: 4ef903783dc722b8a4d3e8229abbf0f021a114c9	2019-11-26 10:00:31 -08:00
Sagar Vemuri	669ea77d9f	Support ttl in Universal Compaction (#6071 ) Summary: `options.ttl` is now supported in universal compaction, similar to how periodic compactions are implemented in PR https://github.com/facebook/rocksdb/issues/5970 . Setting `options.ttl` will simply set `options.periodic_compaction_seconds` to execute the periodic compactions code path. Discarded PR https://github.com/facebook/rocksdb/issues/4749 in lieu of this. This is a short term work-around/hack of falling back to periodic compactions when ttl is set. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6071 Test Plan: Added a unit test. Differential Revision: D18668336 Pulled By: sagar0 fbshipit-source-id: e75f5b81ba949f77ef9eff05e44bb1c757f58612	2019-11-22 22:13:35 -08:00
sdong	d8c28e692a	Support options.ttl with options.max_open_files = -1 (#6060 ) Summary: Previously, options.ttl cannot be set with options.max_open_files = -1, because it makes use of creation_time field in table properties, which is not available unless max_open_files = -1. With this commit, the information will be stored in manifest and when it is available, will be used instead. Note that, this change will break forward compatibility for release 5.1 and older. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6060 Test Plan: Extend existing test case to options.max_open_files != -1, and simulate backward compatility in one test case by forcing the value to be 0. Differential Revision: D18631623 fbshipit-source-id: 30c232a8672de5432ce9608bb2488ecc19138830	2019-11-22 21:23:00 -08:00
Yanqin Jin	0ce0edbe12	Fix a data race between GetColumnFamilyMetaData and MarkFilesBeingCompacted (#6056 ) Summary: Use db mutex to protect the execution of Version::GetColumnFamilyMetaData() called in DBImpl::GetColumnFamilyMetaData(). Without mutex, GetColumnFamilyMetaData() races with MarkFilesBeingCompacted() for access to FileMetaData::being_compacted. Other than mutex, there are several more alternatives. - Make FileMetaData::being_compacted an atomic variable. This will make FileMetaData non-copy-able. - Separate being_compacted from FileMetaData. This requires re-organizing data structures that are already used in many places. Test Plan (dev server): ``` make check ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/6056 Differential Revision: D18620488 Pulled By: riversand963 fbshipit-source-id: 87f89660b5d5e2ab4ef7962b7b2a7d00e346aa3b	2019-11-20 16:36:29 -08:00
Little-Wallace	ec3e3c3e02	Fix corruption with intra-L0 on ingested files (#5958 ) Summary: ## Problem Description Our process was abort when it call `CheckConsistency`. And the information in `stderr` show that "`L0 files seqno 3001491972 3004797440 vs. 3002875611 3004524421` ". Here are the causes of the accident I investigated. * RocksDB will call `CheckConsistency` whenever `MANIFEST` file is update. It will check sequence number interval of every file, except files which were ingested. * When one file is ingested into RocksDB, it will be assigned the value of global sequence number, and the minimum and maximum seqno of this file are equal, which are both equal to global sequence number. * `CheckConsistency` determines whether the file is ingested by whether the smallest and largest seqno of an sstable file are equal. * If IntraL0Compaction picks one sst which was ingested just now and compacted it into another sst, the `smallest_seqno` of this new file will be smaller than his `largest_seqno`. * If more than one ingested file was ingested before memtable schedule flush, and they all compact into one new sstable file by `IntraL0Compaction`. The sequence interval of this new file will be included in the interval of the memtable. So `CheckConsistency` will return a `Corruption`. * If a sstable was ingested after the memtable was schedule to flush, which would assign a larger seqno to it than memtable. Then the file was compacted with other files (these files were all flushed before the memtable) in L0 into one file. This compaction start before the flush job of memtable start, but completed after the flush job finish. So this new file produced by the compaction (we call it s1) would have a larger interval of sequence number than the file produced by flush (we call it s2). But there was still some data in s1 written into RocksDB before the s2, so it's possible that some data in s2 was cover by old data in s1. Of course, it would also make a `Corruption` because of overlap of seqno. There is the relationship of the files: > s1.smallest_seqno < s2.smallest_seqno < s2.largest_seqno < s1.largest_seqno So I skip pick sst file which was ingested in function `FindIntraL0Compaction ` ## Reason Here is my bug report: https://github.com/facebook/rocksdb/issues/5913 There are two situations that can cause the check to fail. ### First situation： - First we ingest five external sst into Rocksdb, and they happened to be ingested in L0. and there had been some data in memtable, which make the smallest sequence number of memtable is less than which of sst that we ingest. - If there had been one compaction job which compacted sst from L0 to L1, `LevelCompactionPicker` would trigger a `IntraL0Compaction` which would compact this five sst from L0 to L0. We call this sst A, which was merged from five ingested sst. - Then some data was put into memtable, and memtable was flushed to L0. We called this sst B. - RocksDB check consistency , and find the `smallest_seqno` of B is less than that of A and crash. Because A was merged from five sst, the smallest sequence number of it was less than the biggest sequece number of itself, so RocksDB could not tell if A was produce by ingested. ### Secondary situaion - First we have flushed many sst in L0, we call them [s1, s2, s3]. - There is an immutable memtable request to be flushed, but because flush thread is busy, so it has not been picked. we call it m1. And at the moment, one sst is ingested into L0. We call it s4. Because s4 is ingested after m1 became immutable memtable, so it has a larger log sequence number than m1. - m1 is flushed in L0. because it is small, this flush job finish quickly. we call it s5. - [s1, s2, s3, s4] are compacted into one sst to L0, by IntraL0Compaction. We call it s6. - compacted 4@0 files to L0 - When s6 is added into manifest, the corruption happened. because the largest sequence number of s6 is equal to s4, and they are both larger than that of s5. But because s1 is older than m1, so the smallest sequence number of s6 is smaller than that of s5. - s6.smallest_seqno < s5.smallest_seqno < s5.largest_seqno < s6.largest_seqno Pull Request resolved: https://github.com/facebook/rocksdb/pull/5958 Differential Revision: D18601316 fbshipit-source-id: 5fe54b3c9af52a2e1400728f565e895cde1c7267	2019-11-19 15:09:11 -08:00
tabokie	20b48c6478	Fix blob context when db_iter uses seek (#6051 ) Summary: Fix: when `db_iter` falls back to using seek by `FindValueForCurrentKeyUsingSeek`, `is_blob_` flag is not properly set on encountering BlobIndex. Also patch existing test for the mentioned code path. Signed-off-by: tabokie <xy.tao@outlook.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/6051 Differential Revision: D18596274 Pulled By: ltamasi fbshipit-source-id: 8e4714af263b99dc2c379707d50db88fe6799278	2019-11-19 11:39:02 -08:00
Peter Dillinger	f059c7d9b9	New Bloom filter implementation for full and partitioned filters (#6007 ) Summary: Adds an improved, replacement Bloom filter implementation (FastLocalBloom) for full and partitioned filters in the block-based table. This replacement is faster and more accurate, especially for high bits per key or millions of keys in a single filter. Speed The improved speed, at least on recent x86_64, comes from * Using fastrange instead of modulo (%) * Using our new hash function (XXH3 preview, added in a previous commit), which is much faster for large keys and only slightly slower on keys around 12 bytes if hashing the same size many thousands of times in a row. * Optimizing the Bloom filter queries with AVX2 SIMD operations. (Added AVX2 to the USE_SSE=1 build.) Careful design was required to support (a) SIMD-optimized queries, (b) compatible non-SIMD code that's simple and efficient, (c) flexible choice of number of probes, and (d) essentially maximized accuracy for a cache-local Bloom filter. Probes are made eight at a time, so any number of probes up to 8 is the same speed, then up to 16, etc. * Prefetching cache lines when building the filter. Although this optimization could be applied to the old structure as well, it seems to balance out the small added cost of accumulating 64 bit hashes for adding to the filter rather than 32 bit hashes. Here's nominal speed data from filter_bench (200MB in filters, about 10k keys each, 10 bits filter data / key, 6 probes, avg key size 24 bytes, includes hashing time) on Skylake DE (relatively low clock speed): $ ./filter_bench -quick -impl=2 -net_includes_hashing # New Bloom filter Build avg ns/key: 47.7135 Mixed inside/outside queries... Single filter net ns/op: 26.2825 Random filter net ns/op: 150.459 Average FP rate %: 0.954651 $ ./filter_bench -quick -impl=0 -net_includes_hashing # Old Bloom filter Build avg ns/key: 47.2245 Mixed inside/outside queries... Single filter net ns/op: 63.2978 Random filter net ns/op: 188.038 Average FP rate %: 1.13823 Similar build time but dramatically faster query times on hot data (63 ns to 26 ns), and somewhat faster on stale data (188 ns to 150 ns). Performance differences on batched and skewed query loads are between these extremes as expected. The only other interesting thing about speed is "inside" (query key was added to filter) vs. "outside" (query key was not added to filter) query times. The non-SIMD implementations are substantially slower when most queries are "outside" vs. "inside". This goes against what one might expect or would have observed years ago, as "outside" queries only need about two probes on average, due to short-circuiting, while "inside" always have num_probes (say 6). The problem is probably the nastily unpredictable branch. The SIMD implementation has few branches (very predictable) and has pretty consistent running time regardless of query outcome. Accuracy The generally improved accuracy (re: Issue https://github.com/facebook/rocksdb/issues/5857) comes from a better design for probing indices within a cache line (re: Issue https://github.com/facebook/rocksdb/issues/4120) and improved accuracy for millions of keys in a single filter from using a 64-bit hash function (XXH3p). Design details in code comments. Accuracy data (generalizes, except old impl gets worse with millions of keys): Memory bits per key: FP rate percent old impl -> FP rate percent new impl 6: 5.70953 -> 5.69888 8: 2.45766 -> 2.29709 10: 1.13977 -> 0.959254 12: 0.662498 -> 0.411593 16: 0.353023 -> 0.0873754 24: 0.261552 -> 0.0060971 50: 0.225453 -> ~0.00003 (less than 1 in a million queries are FP) Fixes https://github.com/facebook/rocksdb/issues/5857 Fixes https://github.com/facebook/rocksdb/issues/4120 Unlike the old implementation, this implementation has a fixed cache line size (64 bytes). At 10 bits per key, the accuracy of this new implementation is very close to the old implementation with 128-byte cache line size. If there's sufficient demand, this implementation could be generalized. Compatibility Although old releases would see the new structure as corrupt filter data and read the table as if there's no filter, we've decided only to enable the new Bloom filter with new format_version=5. This provides a smooth path for automatic adoption over time, with an option for early opt-in. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6007 Test Plan: filter_bench has been used thoroughly to validate speed, accuracy, and correctness. Unit tests have been carefully updated to exercise new and old implementations, as well as the logic to select an implementation based on context (format_version). Differential Revision: D18294749 Pulled By: pdillinger fbshipit-source-id: d44c9db3696e4d0a17caaec47075b7755c262c5f	2019-11-13 16:44:01 -08:00
anand76	6c7b1a0cc7	Batched MultiGet API for multiple column families (#5816 ) Summary: Add a new API that allows a user to call MultiGet specifying multiple keys belonging to different column families. This is mainly useful for users who want to do a consistent read of keys across column families, with the added performance benefits of batching and returning values using PinnableSlice. As part of this change, the code in the original multi-column family MultiGet for acquiring the super versions has been refactored into a separate function that can be used by both, the batching and the non-batching versions of MultiGet. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5816 Test Plan: make check make asan_check asan_crash_test Differential Revision: D18408676 Pulled By: anand1976 fbshipit-source-id: 933e7bec91dd70e7b633be4ff623a1116cc28c8d	2019-11-12 13:52:55 -08:00
anand76	03ce7fb292	Fix a buffer overrun problem in BlockBasedTable::MultiGet (#6014 ) Summary: The calculation in BlockBasedTable::MultiGet for the required buffer length for reading in compressed blocks is incorrect. It needs to take the 5-byte block trailer into account. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6014 Test Plan: Add a unit test DBBasicTest.MultiGetBufferOverrun that fails in asan_check before the fix, and passes after. Differential Revision: D18412753 Pulled By: anand1976 fbshipit-source-id: 754dfb66be1d5f161a7efdf87be872198c7e3b72	2019-11-11 16:59:15 -08:00
Sagar Vemuri	c17384fea4	Cascade TTL Compactions to move expired key ranges to bottom levels faster (#5992 ) Summary: When users use Level-Compaction-with-TTL by setting `cf_options.ttl`, the ttl-expired data could take n*ttl time to reach the bottom level (where n is the number of levels) due to how the `creation_time` table property was calculated for the newly created files during compaction. The creation time of new files was set to a max of all compaction-input-files-creation-times which essentially resulted in resetting the ttl as the key range moves across levels. This behavior is now fixed by changing the `creation_time` to be based on minimum of all compaction-input-files-creation-times; this will cause cascading compactions across levels for the ttl-expired data to move to the bottom level, resulting in getting rid of tombstones/deleted-data faster. This will help start cascading compactions to move the expired key range to the bottom-most level faster. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5992 Test Plan: `make check` Differential Revision: D18257883 Pulled By: sagar0 fbshipit-source-id: 00df0bb8d0b7e14d9fc239df2cba8559f3e54cbc	2019-11-11 14:09:01 -08:00
Peter Dillinger	aa63abf698	Auto-GarbageCollect on PurgeOldBackups and DeleteBackup (#6015 ) Summary: Only if there is a crash, power failure, or I/O error in DeleteBackup, shared or private files from the backup might be left behind that are not cleaned up by PurgeOldBackups or DeleteBackup-- only by GarbageCollect. This makes the BackupEngine API "leaky by default." Even if it means a modest performance hit, I think we should make Delete and Purge do as they say, with ongoing best effort: i.e. future calls will attempt to finish any incomplete work from earlier calls. This change does that by having DeleteBackup and PurgeOldBackups do a GarbageCollect, unless (to minimize performance hit) this BackupEngine has already done a GarbageCollect and there have been no deletion-related I/O errors in that GarbageCollect or since then. Rejected alternative 1: remove meta file last instead of first. This would in theory turn partially deleted backups into corrupted backups, but code changes would be needed to allow the missing files and consider it acceptably corrupt, rather than failing to open the BackupEngine. This might be a reasonable choice, but I mostly rejected it because it doesn't solve the legacy problem of cleaning up existing lingering files. Rejected alternative 2: use a deletion marker file. If deletion started with creating a file that marks a backup as flagged for deletion, then we could reliably detect partially deleted backups and efficiently finish removing them. In addition to not solving the legacy problem, this could be precarious if there's a disk full situation, and we try to create a new file in order to delete some files. Ugh. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6015 Test Plan: Updated unit tests Differential Revision: D18401333 Pulled By: pdillinger fbshipit-source-id: 12944e372ce6809f3f5a4c416c3b321a8927d925	2019-11-08 19:15:35 -08:00
Levi Tamasi	f80050fa8f	Add file number/oldest referenced blob file number to {Sst,Live}FileMetaData (#6011 ) Summary: The patch exposes the file numbers of the SSTs as well as the oldest blob files they contain a reference to through the GetColumnFamilyMetaData/ GetLiveFilesMetaData interface. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6011 Test Plan: Fixed and extended the existing unit tests. (The earlier ColumnFamilyMetaDataTest wasn't really testing anything because the generated memtables were never flushed, so the metadata structure was essentially empty.) Differential Revision: D18361697 Pulled By: ltamasi fbshipit-source-id: d5ed1d94ac70858b84393c48711441ddfe1251e9	2019-11-07 14:04:16 -08:00
anand76	9836a1fa33	Fix MultiGet crash when no_block_cache is set (#5991 ) Summary: This PR fixes https://github.com/facebook/rocksdb/issues/5975. In ```BlockBasedTable::RetrieveMultipleBlocks()```, we were calling ```MaybeReadBlocksAndLoadToCache()```, which is a no-op if neither uncompressed nor compressed block cache are configured. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5991 Test Plan: 1. Add unit tests that fail with the old code and pass with the new 2. make check and asan_check Cc spetrunia Differential Revision: D18272744 Pulled By: anand1976 fbshipit-source-id: e62fa6090d1a6adf84fcd51dfd6859b03c6aebfe	2019-11-07 12:02:21 -08:00
Peter Dillinger	18f57f5ef8	Add new persistent 64-bit hash (#5984 ) Summary: For upcoming new SST filter implementations, we will use a new 64-bit hash function (XXH3 preview, slightly modified). This change updates hash.{h,cc} for that change, adds unit tests, and out-of-lines the implementations to keep hash.h as clean/small as possible. In developing the unit tests, I discovered that the XXH3 preview always returns zero for the empty string. Zero is problematic for some algorithms (including an upcoming SST filter implementation) if it occurs more often than at the "natural" rate, so it should not be returned from trivial values using trivial seeds. I modified our fork of XXH3 to return a modest hash of the seed for the empty string. With hash function details out-of-lines in hash.h, it makes sense to enable XXH_INLINE_ALL, so that direct calls to XXH64/XXH32/XXH3p are inlined. To fix array-bounds warnings on some inline calls, I injected some casts to uintptr_t in xxhash.cc. (Issue reported to Yann.) Revised: Reverted using XXH_INLINE_ALL for now. Some Facebook checks are unhappy about #include on xxhash.cc file. I would fix that by rename to xxhash_cc.h, but to best preserve history I want to do that in a separate commit (PR) from the uintptr casts. Also updated filter_bench for this change, improving the performance predictability of dry run hashing and adding support for 64-bit hash (for upcoming new SST filter implementations, minor dead code in the tool for now). Pull Request resolved: https://github.com/facebook/rocksdb/pull/5984 Differential Revision: D18246567 Pulled By: pdillinger fbshipit-source-id: 6162fbf6381d63c8cc611dd7ec70e1ddc883fbb8	2019-10-31 16:36:35 -07:00
sdong	aa6f7d0995	Support periodic compaction in universal compaction (#5970 ) Summary: Previously, periodic compaction is not supported in universal compaction. Add the support using following approach: if any file is marked as qualified for periodid compaction, trigger a full compaction. If a full compaction is prevented by files being compacted, try to compact the higher levels than files currently being compacted. If in this way we can only compact the last sorted run and none of the file to be compacted qualifies for periodic compaction, skip the compact. This is to prevent the same single level compaction from being executed again and again. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5970 Test Plan: Add several test cases. Differential Revision: D18147097 fbshipit-source-id: 8ecc308154d9aca96fb192c51fbceba3947550c1	2019-10-31 11:31:37 -07:00
sdong	2a9e5caffe	Make FIFO compaction take default 30 days TTL by default (#5987 ) Summary: Right now, by default FIFO compaction has no TTL. We believe that a default TTL of 30 days will be better. With this patch, the default will be changed to 30 days. Default of Options.periodic_compaction_seconds will mean the same as options.ttl. If Options.ttl and Options.periodic_compaction_seconds left default, a default 30 days TTL will be used. If both options are set, the stricter value of the two will be used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5987 Test Plan: Add an option sanitize test to cover the case. Differential Revision: D18237935 fbshipit-source-id: a6dcea1f36c3849e13c0a69e413d73ad8eab58c9	2019-10-31 11:13:12 -07:00
Sagar Vemuri	4c9aa30a62	Auto enable Periodic Compactions if a Compaction Filter is used (#5865 ) Summary: - Periodic compactions are auto-enabled if a compaction filter or a compaction filter factory is set, in Level Compaction. - The default value of `periodic_compaction_seconds` is changed to UINT64_MAX, which lets RocksDB auto-tune periodic compactions as needed. An explicit value of 0 will still work as before ie. to disable periodic compactions completely. For now, on seeing a compaction filter along with a UINT64_MAX value for `periodic_compaction_seconds`, RocksDB will make SST files older than 30 days to go through periodic copmactions. Some RocksDB users make use of compaction filters to control when their data can be deleted, usually with a custom TTL logic. But it is occasionally possible that the compactions get delayed by considerable time due to factors like low writes to a key range, data reaching bottom level, etc before the TTL expiry. Periodic Compactions feature was originally built to help such cases. Now periodic compactions are auto enabled by default when compaction filters or compaction filter factories are used, as it is generally helpful to all cases to collect garbage. `periodic_compaction_seconds` is set to a large value, 30 days, in `SanitizeOptions` when RocksDB sees that a `compaction_filter` or `compaction_filter_factory` is used. This is done only for Level Compaction style. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5865 Test Plan: - Added a new test `DBCompactionTest.LevelPeriodicCompactionWithCompactionFilters` to make sure that `periodic_compaction_seconds` is set if either `compaction_filter` or `compaction_filter_factory` options are set. - `COMPILE_WITH_ASAN=1 make check` Differential Revision: D17659180 Pulled By: sagar0 fbshipit-source-id: 4887b9cf2e53cf2dc93a7b658c6b15e1181217ee	2019-10-29 15:05:51 -07:00
Vijay Nadimpalli	79018ba51b	Upgrading version to 6.6.0 on Master (#5965 ) Summary: Upgrading version to 6.6.0 on Master. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5965 Differential Revision: D18119839 Pulled By: vjnadimpalli fbshipit-source-id: 4adbcbb82b108d2f626e88c786453baad8455f4e	2019-10-28 13:14:45 -07:00
Yanqin Jin	2309fd63bf	Update column families' log number altogether after flushing during recovery (#5856 ) Summary: A bug occasionally shows up in crash test, and https://github.com/facebook/rocksdb/issues/5851 reproduces it. The bug can surface in the following way. 1. Database has multiple column families. 2. Between one DB restart, the last log file is corrupted in the middle (not the tail) 3. During restart, DB crashes between flushing between two column families. Then DB will fail to be opened again with error "SST file is ahead of WALs". Solution is to update the log number associated with each column family altogether after flushing all column families' memtables. The version edits should be written to a new MANIFEST. Only after writing to all these version edits succeed does RocksDB (atomically) points the CURRENT file to the new MANIFEST. Test plan (on devserver): ``` $make all && make check ``` Specifically ``` $make db_test2 $./db_test2 --gtest_filter=DBTest2.CrashInRecoveryMultipleCF ``` Also checked for compatibility as follows. Use this branch, run DBTest2.CrashInRecoveryMultipleCF and preserve the db directory. Then checkout 5.4, build ldb, and dump the MANIFEST. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5856 Differential Revision: D17620818 Pulled By: riversand963 fbshipit-source-id: b52ce5969c9a8052cacec2bd805fcfb373589039	2019-10-24 18:29:30 -07:00
Levi Tamasi	f7e7b34ebe	Propagate SST and blob file numbers through the EventListener interface (#5962 ) Summary: This patch adds a number of new information elements to the FlushJobInfo and CompactionJobInfo structures that are passed to EventListeners via the OnFlush{Begin, Completed} and OnCompaction{Begin, Completed} callbacks. Namely, for flushes, the file numbers of the new SST and the oldest blob file it references are propagated. For compactions, the new pieces of information are the file number, level, and the oldest blob file referenced by each compaction input and output file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5962 Test Plan: Extended the EventListener unit tests with logic that checks that these information elements are correctly propagated from the corresponding FileMetaData. Differential Revision: D18095568 Pulled By: ltamasi fbshipit-source-id: 6874359a6aadb53366b5fe87adcb2f9bd27a0a56	2019-10-24 14:44:15 -07:00
Peter Dillinger	27a124571f	Fix memory leak on error opening PlainTable (#5951 ) Summary: Several error paths in opening of a plain table would leak memory. PR https://github.com/facebook/rocksdb/issues/5940 opened the leak to one more error path, which happens to have been (mistakenly) exercised by CuckooTableDBTest.AdaptiveTable. That test has been fixed, and the exercising of plain table error cases (more than before) has been added as BadOptions1 and BadOptions2 to PlainTableDBTest. This effectively moved the memory leak to plain_table_db_test. Also here is a cheap fix for the memory leak, without (yet?) changing the signature of ReadTableProperties. This fixes ASAN on unit tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5951 Test Plan: make COMPILE_WITH_ASAN=1 check Differential Revision: D18051940 Pulled By: pdillinger fbshipit-source-id: e2952930c09a2b46c4f1ff09818c5090426929de	2019-10-21 16:53:06 -07:00
sdong	a0cd920026	LevelIterator to avoid gap after prefix bloom filters out a file (#5861 ) Summary: Right now, when LevelIterator::Seek() is called, when a file is filtered out by prefix bloom filter, the position is put to the beginning of the next file. This is a confusing internal interface because many keys in the levels are skipped. Avoid this behavior by checking the key of the next file against the seek key, and invalidate the whole iterator if the prefix doesn't match. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5861 Test Plan: Add a new unit test to validate the behavior; run all exsiting tests; run crash_test Differential Revision: D17918213 fbshipit-source-id: f06b47d937c7cc8919001f18dcc3af5b28c9cdac	2019-10-21 11:40:57 -07:00
Levi Tamasi	69bd8a2859	Update HISTORY.md with recent BlobDB adjacent changes Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5939 Differential Revision: D18009096 Pulled By: ltamasi fbshipit-source-id: 032a48a302f9da38aecf4055b5a8d4e1dffd9dc7	2019-10-18 10:24:23 -07:00
Yi Wu	1f9d7c0f54	Fix OnFlushCompleted fired before flush result write to MANIFEST (#5908 ) Summary: When there are concurrent flush job on the same CF, `OnFlushCompleted` can be called before the flush result being install to LSM. Fixing the issue by passing `FlushJobInfo` through `MemTable`, and the thread who commit the flush result can fetch the `FlushJobInfo` and fire `OnFlushCompleted` on behave of the thread actually writing the SST. Fix https://github.com/facebook/rocksdb/issues/5892 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5908 Test Plan: Add new test. The test will fail without the fix. Differential Revision: D17916144 Pulled By: riversand963 fbshipit-source-id: e18df67d9533b5baee52ae3605026cdeb05cbe10	2019-10-16 10:40:23 -07:00
Maysam Yabandeh	2c9e9f2a59	Update HISTORY for SeekForPrev bug fix (#5925 ) Summary: Update history for the bug fix in https://github.com/facebook/rocksdb/pull/5907 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5925 Differential Revision: D17952605 Pulled By: maysamyabandeh fbshipit-source-id: 609afcbb2e4087f9153822c4d11193a75a7b0e7a	2019-10-16 07:59:26 -07:00
Yanqin Jin	6febfd8451	OnTableFileCreationCompleted use "(nil)" for empty file during flush (#5905 ) Summary: Compaction can call OnTableFileCreationCompleted(). If file is empty, "(nil)" is used as the file name. Do the same for flush. Test plan (dev server): ``` make all make check ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5905 Differential Revision: D17883285 Pulled By: riversand963 fbshipit-source-id: 6565884adbb00e8023d88b17dfb3b6eb92220b59	2019-10-14 09:54:10 -07:00
Yanqin Jin	167cdc9f17	Support custom env in sst_dump (#5845 ) Summary: This PR allows for the creation of custom env when using sst_dump. If the user does not set options.env or set options.env to nullptr, then sst_dump will automatically try to create a custom env depending on the path to the sst file or db directory. In order to use this feature, the user must call ObjectRegistry::Register() beforehand. Test Plan (on devserver): ``` $make all && make check ``` All tests must pass to ensure this change does not break anything. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5845 Differential Revision: D17678038 Pulled By: riversand963 fbshipit-source-id: 58ecb4b3f75246d52b07c4c924a63ee61c1ee626	2019-10-08 19:19:12 -07:00
Tomas Kolda	e3a93c9ee1	Fix crash when background task fails (#5879 ) Summary: Fixing crash. Full story in issue: https://github.com/facebook/rocksdb/issues/5878 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5879 Differential Revision: D17812299 Pulled By: anand1976 fbshipit-source-id: 14e5a4fc502ade974583da9692d0ed6e5014613a	2019-10-08 14:20:01 -07:00
sdong	846e05005d	Revert "Merging iterator to avoid child iterator reseek for some cases (#5286 )" (#5871 ) Summary: This reverts commit `9fad3e21eb`. Iterator verification in stress tests sometimes fail for assertion table/block_based/block_based_table_reader.cc:2973: void rocksdb::BlockBasedTableIterator<TBlockIter, TValue>::FindBlockForward() [with TBlockIter = rocksdb::DataBlockIter; TValue = rocksdb::Slice]: Assertion `!next_block_is_out_of_bound \|\| user_comparator_.Compare(*read_options_.iterate_upper_bound, index_iter_->user_key()) <= 0' failed. It is likely to be linked to https://github.com/facebook/rocksdb/pull/5286 together with https://github.com/facebook/rocksdb/pull/5468 as the former PR makes some child iterator's seek being avoided, so that upper bound condition fails to be updated there. Strictly speaking, the former PR was merged before the latter one, but the latter one feels a more important improvement so I choose to revert the former one for now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5871 Differential Revision: D17689196 fbshipit-source-id: 4ded5be68f67bee2782d31a29cb72ea68f59dd8c	2019-10-01 11:22:41 -07:00
Yanqin Jin	ae45835703	Add TryCatchUpWithPrimary to StackableDB (#5855 ) Summary: as title. Test Plan (on devserver): ``` $make all && make check ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5855 Differential Revision: D17615125 Pulled By: riversand963 fbshipit-source-id: bd6ed8cf59eafff41f0d1fc044f39e8f3573172a	2019-09-26 17:00:55 -07:00
Maysam Yabandeh	6652c94f59	Fix a bug in format_version 3 + partition filters + prefix search (#5835 ) Summary: Partitioned filters make use of a top-level index to find the partition in which the filter resides. The top-level index has a key per partition. The key is guaranteed to be larger or equal than any key in that partition. When used with format_version 3, which excludes the sequence number form index keys, the separator key in the index could be equal to the prefix of the keys in the next partition. In this way, when searching for the key, the top-level index will lead us to the previous partition, which has no key with that prefix. The prefix bloom test thus returns false, although the prefix exists in the bloom of the next partition. The patch fixes that by a hack: It always adds the prefix of the first key of the next partition to the bloom of the current partition. In this way, in the corner cases that the index will lead us to the previous partition, we still can find the bloom filter there. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5835 Differential Revision: D17513585 Pulled By: maysamyabandeh fbshipit-source-id: e2d1ff26c759e6e03875c4d57f4228316ecf50e9	2019-09-24 14:00:11 -07:00
WangQingping	02554b3c38	Update HISTORY.md for stop manual compaction Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5837 Differential Revision: D17529753 fbshipit-source-id: 98bbf22c690384b2f440286151dffdaaa744e97c	2019-09-23 15:03:39 -07:00
Maysam Yabandeh	6ec6a4a9a4	Remove snap_refresh_nanos option (#5826 ) Summary: The snap_refresh_nanos option didn't bring much benefit. Remove the feature to simplify the code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5826 Differential Revision: D17467147 Pulled By: maysamyabandeh fbshipit-source-id: 4f950b046990d0d1292d7fc04c2ccafaf751c7f0	2019-09-18 20:26:04 -07:00
Maysam Yabandeh	638d239507	Charge block cache for cache internal usage (#5797 ) Summary: For our default block cache, each additional entry has extra memory overhead. It include LRUHandle (72 bytes currently) and the cache key (two varint64, file id and offset). The usage is not negligible. For example for block_size=4k, the overhead accounts for an extra 2% memory usage for the cache. The patch charging the cache for the extra usage, reducing untracked memory usage outside block cache. The feature is enabled by default and can be disabled by passing kDontChargeCacheMetadata to the cache constructor. This PR builds up on https://github.com/facebook/rocksdb/issues/4258 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5797 Test Plan: - Existing tests are updated to either disable the feature when the test has too much dependency on the old way of accounting the usage or increasing the cache capacity to account for the additional charge of metadata. - The Usage tests in cache_test.cc are augmented to test the cache usage under kFullChargeCacheMetadata. Differential Revision: D17396833 Pulled By: maysamyabandeh fbshipit-source-id: 7684ccb9f8a40ca595e4f5efcdb03623afea0c6f	2019-09-16 15:26:21 -07:00
Peter (Stig) Edwards	54fc617891	Mention sst_dump cmd=recompress changes (#5807 ) Summary: As requested by siying in https://github.com/facebook/rocksdb/pull/5791#issuecomment-531417468 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5807 Differential Revision: D17399349 fbshipit-source-id: 5986c3894f356becd393fee0f1aeadcd9affc798	2019-09-16 10:45:03 -07:00
Lingjing You	9ba88a1e5d	Update history.md for option memtable_insert_hint_per_batch (#5799 ) Summary: Update history.md for option memtable_insert_hint_per_batch Pull Request resolved: https://github.com/facebook/rocksdb/pull/5799 Differential Revision: D17369186 fbshipit-source-id: 71d82f9d99d9a52d1475d1b0153670957b6111e9	2019-09-13 10:51:32 -07:00
Ronak Sisodia	27f516acc8	Update HISTORY.md for option to make write group size configurable (#5798 ) Summary: Update HISTORY.md for option to make write group size configurable . Pull Request resolved: https://github.com/facebook/rocksdb/pull/5798 Differential Revision: D17369062 fbshipit-source-id: 390a3fa0b01675e91879486a729cf2cc7624d106	2019-09-13 10:43:09 -07:00
Peter Dillinger	b55b2f45d0	Faster new DynamicBloom implementation (for memtable) (#5762 ) Summary: Since DynamicBloom is now only used in-memory, we're free to change it without schema compatibility issues. The new implementation is drawn from (with manifest permission) `303542a767/bloom_simulation_tests/foo.cc (L613)` This has several speed advantages over the prior implementation: * Uses fastrange instead of % * Minimum logic to determine first (and all) probed memory addresses * (Major) Two probes per 64-bit memory fetch/write. * Very fast and effective (murmur-like) hash expansion/re-mixing. (At least on recent CPUs, integer multiplication is very cheap.) While a Bloom filter with 512-bit cache locality has about a 1.15x FP rate penalty (e.g. 0.84% to 0.97%), further restricting to two probes per 64 bits incurs an additional 1.12x FP rate penalty (e.g. 0.97% to 1.09%). Nevertheless, the unit tests show no "mediocre" FP rate samples, unlike the old implementation with more erratic FP rates. Especially for the memtable, we expect speed to outweigh somewhat higher FP rates. For example, a negative table query would have to be 1000x slower than a BF query to justify doubling BF query time to shave 10% off FP rate (working assumption around 1% FP rate). While that seems likely for SSTs, my data suggests a speed factor of roughly 50x for the memtable (vs. BF; ~1.5% lower write throughput when enabling memtable Bloom filter, after this change). Thus, it's probably not worth even 5% more time in the Bloom filter to shave off 1/10th of the Bloom FP rate, or 0.1% in absolute terms, and it's probably at least 20% slower to recoup that much FP rate from this new implementation. Because of this, we do not see a need for a 'locality' option that affects the MemTable Bloom filter and have decoupled the MemTable Bloom filter from Options::bloom_locality. Note that just 3% more memory to the Bloom filter (10.3 bits per key vs. just 10) is able to make up for the ~12% FP rate drop in the new implementation: [] # Nearly "ideal" FP-wise but reasonably fast cache-local implementation [~/wormhashing/bloom_simulation_tests] ./foo_gcc_IMPL_CACHE_WORM64_FROM32_any.out 10000000 6 10 $RANDOM 100000000 ./foo_gcc_IMPL_CACHE_WORM64_FROM32_any.out time: 3.29372 sampled_fp_rate: 0.00985956 ... [] # Close match to this new implementation [~/wormhashing/bloom_simulation_tests] ./foo_gcc_IMPL_CACHE_MUL64_BLOCK_FROM32_any.out 10000000 6 10.3 $RANDOM 100000000 ./foo_gcc_IMPL_CACHE_MUL64_BLOCK_FROM32_any.out time: 2.10072 sampled_fp_rate: 0.00985655 ... [] # Old locality=1 implementation [~/wormhashing/bloom_simulation_tests] ./foo_gcc_IMPL_CACHE_ROCKSDB_DYNAMIC_any.out 10000000 6 10 $RANDOM 100000000 ./foo_gcc_IMPL_CACHE_ROCKSDB_DYNAMIC_any.out time: 3.95472 sampled_fp_rate: 0.00988943 ... Also note the dramatic speed improvement vs. alternatives. -- Performance unit test: DynamicBloomTest.concurrent_with_perf is updated to report more precise timing data. (Measure running time of each thread, not just longest running thread, etc.) Results averaged over various sizes enabled with --enable_perf and 20 runs each; old dynamic bloom refers to locality=1, the faster of the old: old dynamic bloom, avg add latency = 65.6468 new dynamic bloom, avg add latency = 44.3809 old dynamic bloom, avg query latency = 50.6485 new dynamic bloom, avg query latency = 43.2186 old avg parallel add latency = 41.678 new avg parallel add latency = 24.5238 old avg parallel hit latency = 14.6322 new avg parallel hit latency = 12.3939 old avg parallel miss latency = 16.7289 new avg parallel miss latency = 12.2134 Tested on a dedicated 64-bit production machine at Facebook. Significant improvement all around. Despite now using std::atomic<uint64_t>, quick before-and-after test on a 32-bit machine (Intel Atom N270, released 2008) shows no regression in performance, in some cases modest improvement. -- Performance integration test (synthetic): with DEBUG_LEVEL=0, used TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=fillrandom,readmissing,readrandom,stats --num=2000000 and optionally with -memtable_whole_key_filtering -memtable_bloom_size_ratio=0.01 300 runs each configuration. Write throughput change by enabling memtable bloom: Old locality=0: -3.06% Old locality=1: -2.37% New: -1.50% conclusion -> seems to substantially close the gap Readmissing throughput change by enabling memtable bloom: Old locality=0: +34.47% Old locality=1: +34.80% New: +33.25% conclusion -> maybe a small new penalty from FP rate Readrandom throughput change by enabling memtable bloom: Old locality=0: +31.54% Old locality=1: +31.13% New: +30.60% conclusion -> maybe also from FP rate (after memtable flush) -- Another conclusion we can draw from this new implementation is that the existing 32-bit hash function is not inherently crippling the Bloom filter speed or accuracy, below about 5 million keys. For speed, the implementation is essentially the same whether starting with 32-bits or 64-bits of hash; it just determines whether the first multiplication after fastrange is a pseudorandom expansion or needed re-mix. Note that this multiplication can occur while memory is fetching. For accuracy, in a standard configuration, you need about 5 million keys before you have about a 1.1x FP penalty due to using a 32-bit hash vs. 64-bit: [~/wormhashing/bloom_simulation_tests] ./foo_gcc_IMPL_CACHE_MUL64_BLOCK_FROM32_any.out $((5 * 1000 * 1000 * 10)) 6 10 $RANDOM 100000000 ./foo_gcc_IMPL_CACHE_MUL64_BLOCK_FROM32_any.out time: 2.52069 sampled_fp_rate: 0.0118267 ... [~/wormhashing/bloom_simulation_tests] ./foo_gcc_IMPL_CACHE_MUL64_BLOCK_any.out $((5 * 1000 * 1000 * 10)) 6 10 $RANDOM 100000000 ./foo_gcc_IMPL_CACHE_MUL64_BLOCK_any.out time: 2.43871 sampled_fp_rate: 0.0109059 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5762 Differential Revision: D17214194 Pulled By: pdillinger fbshipit-source-id: ad9da031772e985fd6b62a0e1db8e81892520595	2019-09-05 14:59:25 -07:00
Affan Dar	229e6fbe0e	Adding DB::GetCurrentWalFile() API as a repliction/backup helper (#5765 ) Summary: Adding a light weight API to get last live WAL file name and size. Meant to be used as a helper for backup/restore tooling in a larger ecosystem such as MySQL with a MyRocks storage engine. Specifically within MySQL's backup/restore mechanism, this call can be made with a write lock on the mysql db to get a transactionally consistent snapshot of the current WAL file position along with other non-rocksdb log/data files. Without this, the alternative would be to take the aforementioned lock, scan the WAL dir for all files, find the last file and note its exact size as the rocksdb 'checkpoint'. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5765 Differential Revision: D17172717 Pulled By: affandar fbshipit-source-id: f2fabafd4c0e6fc45f126670c8c88a9f84cb8a37	2019-09-04 12:10:17 -07:00
Yanqin Jin	44eca41add	Fix a bug in file ingestion (#5760 ) Summary: Before this PR, when the number of column families involved in a file ingestion exceeds 2, a bug in the looping logic prevents correct file number being assigned to each ingestion job. Also skip deleting non-existing hard links during cleanup-after-failure. Test plan (devserver) ``` $COMPILE_WITH_ASAN=1 make all $./external_sst_file_test --gtest_filter=ExternalSSTFileTest/ExternalSSTFileTest.IngestFilesIntoMultipleColumnFamilies_/ $makke check ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5760 Differential Revision: D17142982 Pulled By: riversand963 fbshipit-source-id: 06c1847a4e7a402647bcf28d124e70f2a0f9daf6	2019-08-30 18:29:07 -07:00
Pratik Dhandharia	a281822331	Lower the risk for users to run options.force_consistency_checks = true (#5744 ) Summary: Open-source users recently reported two occurrences of LSM-tree corruption (https://github.com/facebook/rocksdb/issues/5558 is one), which would be caught by options.force_consistency_checks = true. options.force_consistency_checks has a usability limitation because it crashes the service once inconsistency is detected. This makes the feature hard to use. Most users serve from multiple RocksDB shards per server and the impacts of crashing the service is higher than it should be. Instead, we just pass the error back to users without killing the service, and ask them to deal with the problem accordingly. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5744 Differential Revision: D17096940 Pulled By: pdhandharia fbshipit-source-id: b6780039044e265f26ed2ad03c51f4abbe8b603c	2019-08-29 14:07:37 -07:00
Zhongyi Xie	2f41ecfe75	Refactor trimming logic for immutable memtables (#5022 ) Summary: MyRocks currently sets `max_write_buffer_number_to_maintain` in order to maintain enough history for transaction conflict checking. The effectiveness of this approach depends on the size of memtables. When memtables are small, it may not keep enough history; when memtables are large, this may consume too much memory. We are proposing a new way to configure memtable list history: by limiting the memory usage of immutable memtables. The new option is `max_write_buffer_size_to_maintain` and it will take precedence over the old `max_write_buffer_number_to_maintain` if they are both set to non-zero values. The new option accounts for the total memory usage of flushed immutable memtables and mutable memtable. When the total usage exceeds the limit, RocksDB may start dropping immutable memtables (which is also called trimming history), starting from the oldest one. The semantics of the old option actually works both as an upper bound and lower bound. History trimming will start if number of immutable memtables exceeds the limit, but it will never go below (limit-1) due to history trimming. In order the mimic the behavior with the new option, history trimming will stop if dropping the next immutable memtable causes the total memory usage go below the size limit. For example, assuming the size limit is set to 64MB, and there are 3 immutable memtables with sizes of 20, 30, 30. Although the total memory usage is 80MB > 64MB, dropping the oldest memtable will reduce the memory usage to 60MB < 64MB, so in this case no memtable will be dropped. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5022 Differential Revision: D14394062 Pulled By: miasantreble fbshipit-source-id: 60457a509c6af89d0993f988c9b5c2aa9e45f5c5	2019-08-23 13:55:34 -07:00
anand76	9046bdc5d3	Fix MultiGet() bug when whole_key_filtering is disabled (#5665 ) Summary: The batched MultiGet() implementation was not correctly handling bloom filter lookups when whole_key_filtering is disabled. It was incorrectly skipping keys not in the prefix_extractor domain, and not calling transform for keys in domain. This PR fixes both problems by moving the domain check and transformation to the FilterBlockReader. Tests: Unit test (confirmed failed before the fix) make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/5665 Differential Revision: D16902380 Pulled By: anand1976 fbshipit-source-id: a6be81ad68a6e37134a65246aec7a2c590eccf00	2019-08-21 10:23:23 -07:00
Maysam Yabandeh	7bc18e2727	Disable snapshot refresh feature when snap_refresh_nanos is 0 (#5724 ) Summary: The comments of snap_refresh_nanos advertise that the snapshot refresh feature will be disabled when the option is set to 0. This contract is however not honored in the code: https://github.com/facebook/rocksdb/pull/5278 The patch fixes that and also adds an assert to ensure that the feature is not used when the option is zero. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5724 Differential Revision: D16918185 Pulled By: maysamyabandeh fbshipit-source-id: fec167287df7d85093e087fc39c0eb243e3bbd7e	2019-08-20 11:40:07 -07:00
sdong	3552473668	Introduce IngestExternalFileOptions.verify_checksums_readahead_size (#5721 ) Summary: Recently readahead is introduced for checksum verifying. However, users cannot override the setting for the checksum verifying before external SST file ingestion. Introduce a new option for the purpose. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5721 Test Plan: Add a new unit test for it. Differential Revision: D16906896 fbshipit-source-id: 218ec37001ddcc05411cefddbe233d15ab308476	2019-08-20 10:43:39 -07:00
sdong	e1c468d16f	Do readahead in VerifyChecksum() (#5713 ) Summary: Right now VerifyChecksum() doesn't do read-ahead. In some use cases, users won't be able to achieve good performance. With this change, by default, RocksDB will do a default readahead, and users will be able to overwrite the readahead size by passing in a ReadOptions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5713 Test Plan: Add a new unit test. Differential Revision: D16860874 fbshipit-source-id: 0cff0fe79ac855d3d068e6ccd770770854a68413	2019-08-16 16:42:56 -07:00
Yanqin Jin	353a68d550	Update HISTORY.md for 6.4.0 (#5714 ) Summary: Update HISTORY.md by removing a feature from "Unreleased" to 6.4.0 after cherry-picking related commits to 6.4.fb branch. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5714 Differential Revision: D16865334 Pulled By: riversand963 fbshipit-source-id: f17ede905a1dfbbcdf98806ca398c618cf54748a	2019-08-16 15:09:20 -07:00
Levi Tamasi	3a3dc29437	Update HISTORY.md for 6.3.2/6.4.0 and add a not-yet-released change Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5710 Test Plan: HISTORY.md-only change, no testing required. Differential Revision: D16836869 Pulled By: ltamasi fbshipit-source-id: 978148f1d14b0c46839a94d7ada8a5e8ecf73965	2019-08-16 11:17:03 -07:00
sdong	bd2c753dd0	Add command "list_file_range_deletes" in ldb (#5615 ) Summary: Add a command in ldb so that users can print out tombstones in SST files. In order to test the code, change the interface of LDBCommandRunner::RunCommand() so that it doesn't return from the program, but return the status code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5615 Test Plan: Add a new unit test Differential Revision: D16550326 fbshipit-source-id: 88ddfe6984bdcbb3a528abdd115089df09eba52e	2019-08-15 17:01:03 -07:00
Yanqin Jin	5d9a67e718	Support loading custom objects in unit tests (#5676 ) Summary: Most existing RocksDB unit tests run on `Env::Default()`. It will be useful to port the unit tests to non-default environments, e.g. `HdfsEnv`, etc. This pull request is one step towards this goal. If RocksDB unit tests are built with a static library exposing a function `RegisterCustomObjects()`, then it is possible to implement custom object registrar logic in the library. RocksDB unit test can call `RegisterCustomObjects()` at the beginning. By default, `ROCKSDB_UNITTESTS_WITH_CUSTOM_OBJECTS_FROM_STATIC_LIBS` is not defined, thus this PR has no impact on existing RocksDB because `RegisterCustomObjects()` is a noop. Test plan (on devserver): ``` $make clean && COMPILE_WITH_ASAN=1 make -j32 all $make check ``` All unit tests must pass. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5676 Differential Revision: D16679157 Pulled By: riversand963 fbshipit-source-id: aca571af3fd0525277cdc674248d0fe06e060f9d	2019-08-09 15:12:08 -07:00
Eli Pozniansky	4834dab578	Improve CPU Efficiency of ApproximateSize (part 2) (#5609 ) Summary: In some cases, we don't have to get really accurate number. Something like 10% off is fine, we can create a new option for that use case. In this case, we can calculate size for full files first, and avoid estimation inside SST files if full files got us a huge number. For example, if we already covered 100GB of data, we should be able to skip partial dives into 10 SST files of 30MB. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5609 Differential Revision: D16433481 Pulled By: elipoz fbshipit-source-id: 5830b31e1c656d0fd3a00d7fd2678ddc8f6e601b	2019-07-31 08:50:00 -07:00
Levi Tamasi	b538e756c2	Split the recent block based table changes between 6.3 and 6.4 in HISTORY.md Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5653 Differential Revision: D16573445 Pulled By: ltamasi fbshipit-source-id: 19c639044fcfd43b5d5c627c8def33ff2dbb2af8	2019-07-30 17:46:02 -07:00
Fosco Marotto	265db3ebb5	Update history and version for 6.4.0 (#5652 ) Summary: Master branch had been left at 6.2 and history of 6.3 and beyond were merged. Updated this to correct. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5652 Differential Revision: D16570498 Pulled By: gfosco fbshipit-source-id: 79f62ec570539a3e3d7d7c84a6cf7b722395fafe	2019-07-30 16:10:06 -07:00
Eli Pozniansky	9625a2bc2b	Added SizeApproximationOptions to DB::GetApproximateSizes (#5626 ) Summary: The new DB::GetApproximateSizes with SizeApproximationOptions argument, which allows to add more options/knobs to the DB::GetApproximateSizes call (beyond only the include_flags) Pull Request resolved: https://github.com/facebook/rocksdb/pull/5626 Differential Revision: D16496913 Pulled By: elipoz fbshipit-source-id: ee8c6c182330a285fa056ecfc3905a592b451720	2019-07-25 22:42:30 -07:00
Mark Rambacher	cfcf045acc	The ObjectRegistry class replaces the Registrar and NewCustomObjects.… (#5293 ) Summary: The ObjectRegistry class replaces the Registrar and NewCustomObjects. Objects are registered with the registry by Type (the class must implement the static const char *Type() method). This change is necessary for a few reasons: - By having a class (rather than static template instances), the class can be passed between compilation units, meaning that objects could be registered and shared from a dynamic library with an executable. - By having a class with instances, different units could have different objects registered. This could be useful if, for example, one Option allowed for a dynamic library and one did not. When combined with some other PRs (being able to load shared libraries, a Configurable interface to configure objects to/from string), this code will allow objects in external shared libraries to be added to a RocksDB image at run-time, rather than requiring every new extension to be built into the main library and called explicitly by every program. Test plan (on riversand963's devserver) ``` $COMPILE_WITH_ASAN=1 make -j32 all && sleep 1 && make check ``` All tests pass. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5293 Differential Revision: D16363396 Pulled By: riversand963 fbshipit-source-id: fbe4acb615bfc11103eef40a0b288845791c0180	2019-07-23 17:13:05 -07:00
Levi Tamasi	092f417037	Move the uncompression dictionary object out of the block cache (#5584 ) Summary: RocksDB has historically stored uncompression dictionary objects in the block cache as opposed to storing just the block contents. This neccesitated evicting the object upon table close. With the new code, only the raw blocks are stored in the cache, eliminating the need for eviction. In addition, the patch makes the following improvements: 1) Compression dictionary blocks are now prefetched/pinned similarly to index/filter blocks. 2) A copy operation got eliminated when the uncompression dictionary is retrieved. 3) Errors related to retrieving the uncompression dictionary are propagated as opposed to silently ignored. Note: the patch temporarily breaks the compression dictionary evicition stats. They will be fixed in a separate phase. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5584 Test Plan: make asan_check Differential Revision: D16344151 Pulled By: ltamasi fbshipit-source-id: 2962b295f5b19628f9da88a3fcebbce5a5017a7b	2019-07-23 16:01:44 -07:00
sdong	3782accf7d	ldb sometimes specify a string-append merge operator (#5607 ) Summary: Right now, ldb cannot scan a DB with merge operands with default ldb. There is no hard to give a general merge operator so that it can at least print out something Pull Request resolved: https://github.com/facebook/rocksdb/pull/5607 Test Plan: Run ldb against a DB with merge operands and see the outputs. Differential Revision: D16442634 fbshipit-source-id: c66c414ec07f219cfc6e6ec2cc14c783ee95df54	2019-07-23 14:25:18 -07:00
Maysam Yabandeh	327c4807a7	Disable refresh snapshot feature by default (#5606 ) Summary: There are concerns about the correctness of this patch. Disabling by default until the concerns are resolved. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5606 Differential Revision: D16428064 Pulled By: maysamyabandeh fbshipit-source-id: a89280f0ea85796c9c9dfbfd9a8e91dad9b000b3	2019-07-22 20:05:00 -07:00
sdong	66b5613d0c	row_cache to share entry for recent snapshots (#5600 ) Summary: Right now, users cannot take advantage of row cache, unless no snapshot is used, or Get() is repeated for the same snapshots. This limits the usage of row cache. This change eliminate this restriction in some cases. If the snapshot used is newer than the largest sequence number in the file, and write callback function is not registered, the same row cache key is used as no snapshot is given. We still need the callback function restriction for now because the callback function may filter out different keys for different snapshots even if the snapshots are new. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5600 Test Plan: Add a unit test. Differential Revision: D16386616 fbshipit-source-id: 6b7d214bd215d191b03ccf55926ad4b703ec2e53	2019-07-22 18:56:19 -07:00
Venki Pallipadi	3a6e83b56b	HISTORY update for export and import column family APIs Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5587 Differential Revision: D16359919 fbshipit-source-id: cfd9c448d79a8b8e7ac1d2b661d10151df269dba	2019-07-18 10:16:38 -07:00
Levi Tamasi	3bde41b5a3	Move the filter readers out of the block cache (#5504 ) Summary: Currently, when the block cache is used for the filter block, it is not really the block itself that is stored in the cache but a FilterBlockReader object. Since this object is not pure data (it has, for instance, pointers that might dangle, including in one case a back pointer to the TableReader), it's not really sharable. To avoid the issues around this, the current code erases the cache entries when the TableReader is closed (which, BTW, is not sufficient since a concurrent TableReader might have picked up the object in the meantime). Instead of doing this, the patch moves the FilterBlockReader out of the cache altogether, and decouples the filter reader object from the filter block. In particular, instead of the TableReader owning, or caching/pinning the FilterBlockReader (based on the customer's settings), with the change the TableReader unconditionally owns the FilterBlockReader, which in turn owns/caches/pins the filter block. This change also enables us to reuse the code paths historically used for data blocks for filters as well. Note: Eviction statistics for filter blocks are temporarily broken. We plan to fix this in a separate phase. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5504 Test Plan: make asan_check Differential Revision: D16036974 Pulled By: ltamasi fbshipit-source-id: 770f543c5fb4ed126fd1e04bfd3809cf4ff9c091	2019-07-16 13:14:58 -07:00
sdong	aa0367aabb	Allow ldb to open DB as secondary (#5537 ) Summary: Right now ldb can open running DB through read-only DB. However, it might leave info logs files to the read-only DB directory. Add an option to open the DB as secondary to avoid it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5537 Test Plan: Run ./ldb scan --max_keys=10 --db=/tmp/rocksdbtest-2491/dbbench --secondary_path=/tmp --no_value --hex and ./ldb get 0x00000000000000103030303030303030 --hex --db=/tmp/rocksdbtest-2491/dbbench --secondary_path=/tmp against a normal db_bench run and observe the output changes. Also observe that no new info logs files are created under /tmp/rocksdbtest-2491/dbbench. Run without --secondary_path and observe that new info logs created under /tmp/rocksdbtest-2491/dbbench. Differential Revision: D16113886 fbshipit-source-id: 4e09dec47c2528f6ca08a9e7a7894ba2d9daebbb	2019-07-09 12:51:28 -07:00
Yanqin Jin	7c76a7fba2	Support GetAllKeyVersions() for non-default cf (#5544 ) Summary: Previously `GetAllKeyVersions()` supports default column family only. This PR add support for other column families. Test plan (devserver): ``` $make clean && COMPILE_WITH_ASAN=1 make -j32 db_basic_test $./db_basic_test --gtest_filter=DBBasicTest.GetAllKeyVersions ``` All other unit tests must pass. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5544 Differential Revision: D16147551 Pulled By: riversand963 fbshipit-source-id: 5a61aece2a32d789e150226a9b8d53f4a5760168	2019-07-07 22:43:52 -07:00
anand76	e0d9d57750	Fix bugs in WAL trash file handling (#5520 ) Summary: 1. Cleanup WAL trash files on open 2. Don't apply deletion rate limit if WAL dir is different from db dir Pull Request resolved: https://github.com/facebook/rocksdb/pull/5520 Test Plan: Add new unit tests and make check Differential Revision: D16096750 Pulled By: anand1976 fbshipit-source-id: 6f07858ad864b754b711db416f0389c45ede599b	2019-07-06 21:07:32 -07:00
sdong	e4dcf5fd22	db_bench to add a new "benchmark" to print out all stats history (#5532 ) Summary: Sometimes it is helpful to fetch the whole history of stats after benchmark runs. Add such an option Pull Request resolved: https://github.com/facebook/rocksdb/pull/5532 Test Plan: Run the benchmark manually and observe the output is as expected. Differential Revision: D16097764 fbshipit-source-id: 10b5b735a22a18be198b8f348be11f11f8806904	2019-07-03 20:03:28 -07:00
Yi Wu	662ce62044	Reduce iterator key comparison for upper/lower bound check (2nd attempt) (#5468 ) Summary: This is a second attempt for https://github.com/facebook/rocksdb/issues/5111, with the fix to redo iterate bounds check after `SeekXXX()`. This is because MyRocks may change iterate bounds between seek. See https://github.com/facebook/rocksdb/issues/5111 for original benchmark result and discussion. Closes https://github.com/facebook/rocksdb/issues/5463. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5468 Test Plan: Existing rocksdb tests, plus myrocks test `rocksdb.optimizer_loose_index_scans` and `rocksdb.group_min_max`. Differential Revision: D15863332 fbshipit-source-id: ab4aba5899838591806b8673899bd465f3f53e18	2019-07-02 11:48:46 -07:00
anand76	7259e28d91	MultiGet parallel IO (#5464 ) Summary: Enhancement to MultiGet batching to read data blocks required for keys in a batch in parallel from disk. It uses Env::MultiRead() API to read multiple blocks and reduce latency. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5464 Test Plan: 1. make check 2. make asan_check 3. make asan_crash Differential Revision: D15911771 Pulled By: anand1976 fbshipit-source-id: 605036b9af0f90ca0020dc87c3a86b4da6e83394	2019-06-30 20:56:04 -07:00
sdong	15fd3be07b	LRU Cache to enable mid-point insertion by default (#5508 ) Summary: Mid-point insertion is a useful feature and is mature now. Make it default. Also changed cache_index_and_filter_blocks_with_high_priority=true as default accordingly, so that we won't evict index and filter blocks easier after the change, to avoid too many surprises to users. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5508 Test Plan: Run all existing tests. Differential Revision: D16021179 fbshipit-source-id: ce8456e8d43b3bfb48df6c304b5290a9d19817eb	2019-06-27 10:20:57 -07:00
Yanqin Jin	c08c0ae731	Add C binding for secondary instance (#5505 ) Summary: Add C binding for secondary instance as well as unit test. Test plan (on devserver) ``` $make clean && COMPILE_WITH_ASAN=1 make -j20 all $./c_test $make check ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5505 Differential Revision: D16000043 Pulled By: riversand963 fbshipit-source-id: 3361ef6bfdf4ce12438cee7290a0ac203b5250bd	2019-06-27 08:58:54 -07:00
Mike Kolupaev	b4d7209428	Add an option to put first key of each sst block in the index (#5289 ) Summary: The first key is used to defer reading the data block until this file gets to the top of merging iterator's heap. For short range scans, most files never make it to the top of the heap, so this change can reduce read amplification by a lot sometimes. Consider the following workload. There are a few data streams (we'll be calling them "logs"), each stream consisting of a sequence of blobs (we'll be calling them "records"). Each record is identified by log ID and a sequence number within the log. RocksDB key is concatenation of log ID and sequence number (big endian). Reads are mostly relatively short range scans, each within a single log. Writes are mostly sequential for each log, but writes to different logs are randomly interleaved. Compactions are disabled; instead, when we accumulate a few tens of sst files, we create a new column family and start writing to it. So, a typical sst file consists of a few ranges of blocks, each range corresponding to one log ID (we use FlushBlockPolicy to cut blocks at log boundaries). A typical read would go like this. First, iterator Seek() reads one block from each sst file. Then a series of Next()s move through one sst file (since writes to each log are mostly sequential) until the subiterator reaches the end of this log in this sst file; then Next() switches to the next sst file and reads sequentially from that, and so on. Often a range scan will only return records from a small number of blocks in small number of sst files; in this case, the cost of initial Seek() reading one block from each file may be bigger than the cost of reading the actually useful blocks. Neither iterate_upper_bound nor bloom filters can prevent reading one block from each file in Seek(). But this PR can: if the index contains first key from each block, we don't have to read the block until this block actually makes it to the top of merging iterator's heap, so for short range scans we won't read any blocks from most of the sst files. This PR does the deferred block loading inside value() call. This is not ideal: there's no good way to report an IO error from inside value(). As discussed with siying offline, it would probably be better to change InternalIterator's interface to explicitly fetch deferred value and get status. I'll do it in a separate PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5289 Differential Revision: D15256423 Pulled By: al13n321 fbshipit-source-id: 750e4c39ce88e8d41662f701cf6275d9388ba46a	2019-06-24 20:54:04 -07:00
Yi Wu	2730fe693e	Fix ingested file and direcotry not being sync (#5435 ) Summary: It it not safe to assume application had sync the SST file before ingest it into DB. Also the directory to put the ingested file needs to be fsync, otherwise the file can be lost. For integrity of RocksDB we need to sync the ingested file and directory before apply the change to manifest. Also syncing after writing global sequence when write_global_seqno=true was removed in https://github.com/facebook/rocksdb/issues/4172. Adding it back. Fixes https://github.com/facebook/rocksdb/issues/5287. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5435 Test Plan: Test ingest file with ldb command and observe fsync/fdatasync in strace output. Tried both move_files=true and move_files=false. https://gist.github.com/yiwu-arbug/650a4023f57979056d83485fa863bef9 More test suggestions are welcome. Differential Revision: D15941675 Pulled By: riversand963 fbshipit-source-id: 389533f3923065a96df2cdde23ff4724a1810d78	2019-06-21 10:15:38 -07:00
Simon Grätzer	fe90ed7a70	Replace Corruption with TryAgain status when new tail is not visible to TransactionLogIterator (#5474 ) Summary: When tailing the WAL with TransactionLogIterator, it used to return Corruption status to indicate that the WAL has new tail that is not visible to the iterator, which is a misleading status. The patch replaces it with TryAgain which is more descriptive of a status, indicating that the user needs to create a new iterator to fetch the recent tail. Fixes https://github.com/facebook/rocksdb/issues/5455 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5474 Differential Revision: D15898953 Pulled By: maysamyabandeh fbshipit-source-id: 40966f6457cb539e1aeb104daeada6b0e46059fc	2019-06-19 08:10:08 -07:00
Levi Tamasi	5355e527d9	Make the 'block read count' performance counters consistent (#5484 ) Summary: The patch brings the semantics of per-block-type read performance context counters in sync with the generic block_read_count by only incrementing the counter if the block was actually read from the file. It also fixes index_block_read_count, which fell victim to the refactoring in PR https://github.com/facebook/rocksdb/issues/5298. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5484 Test Plan: Extended the unit tests. Differential Revision: D15887431 Pulled By: ltamasi fbshipit-source-id: a3889759d0ac5759d56625d692cd828d1b9207a6	2019-06-18 19:03:24 -07:00
Yanqin Jin	f287f8dc93	Fix a bug caused by secondary not skipping the beginning of new MANIFEST (#5472 ) Summary: While the secondary is replaying after the primary, the primary may switch to a new MANIFEST. The secondary is already able to detect and follow the primary to the new MANIFEST. However, the current implementation has a bug, described as follows. The new MANIFEST's first records have been generated by VersionSet::WriteSnapshot to describe the current state of the column families and the db as of the MANIFEST creation. Since the secondary instance has already finished recovering upon start, there is no need for the secondary to process these records. Actually, if the secondary were to replay these records, the secondary may end up adding the same SST files again to each column family, causing consistency checks done by VersionBuilder to fail. Therefore, we record the number of records to skip at the beginning of the new MANIFEST and ignore them. Test plan (on dev server) ``` $make clean && make -j32 all $./db_secondary_test ``` All existing unit tests must pass as well. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5472 Differential Revision: D15866771 Pulled By: riversand963 fbshipit-source-id: a1eec4837fb2ad13059398efb0f437e74fd53bed	2019-06-18 11:21:37 -07:00
Sagar Vemuri	f1219644ec	Validate CF Options when creating a new column family (#5453 ) Summary: It seems like CF Options are not properly validated when creating a new column family with `CreateColumnFamily` API; only a selected few checks are done. Calling `ColumnFamilyData::ValidateOptions`, which is the single source for all CFOptions validations, will help fix this. (`ColumnFamilyData::ValidateOptions` is already called at the time of `DB::Open`). Test Plan: Added a new test: `DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions` ``` TEST_TMPDIR=/dev/shm ./db_test --gtest_filter=DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions ``` Also ran gtest-parallel to make sure the new test is not flaky. ``` TEST_TMPDIR=/dev/shm ~/gtest-parallel/gtest-parallel ./db_test --gtest_filter=DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions --repeat=10000 [10000/10000] DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions (15 ms) ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5453 Differential Revision: D15816851 Pulled By: sagar0 fbshipit-source-id: 9e702b9850f5c4a7e0ef8d39e1e6f9b81e7fe1e5	2019-06-14 14:11:10 -07:00
Levi Tamasi	ba64a4cf52	Revert "Reduce iterator key comparison for upper/lower bound check (#5111 )" (#5440 ) Summary: This reverts commit `f3a7847598`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5440 Differential Revision: D15765967 Pulled By: ltamasi fbshipit-source-id: d027fe24132e3729289cd7c01857a7eb449d9dd0	2019-06-11 16:23:41 -07:00
Yanqin Jin	6ce5580882	Improve memtable earliest seqno assignment for secondary instance (#5413 ) Summary: In regular RocksDB instance, `MemTable::earliest_seqno_` is "db sequence number at the time of creation". However, we cannot use the db sequence number to set the value of `MemTable::earliest_seqno_` for secondary instance, i.e. `DBImplSecondary` due to the logic of MANIFEST and WAL replay. When replaying the log files of the primary, the secondary instance first replays MANIFEST and updates the db sequence number if necessary. Next, the secondary replays WAL files, creates new memtables if necessary and inserts key-value pairs into memtables. The following can occur when the db has two or more column families. Assume the db has column family "default" and "cf1". At a certain in time, both "default" and "cf1" have data in memtables. 1. Primary triggers a flush and flushes "cf1". "default" is not flushed. 2. Secondary replays the MANIFEST updates its db sequence number to the latest value learned from the MANIFEST. 3. Secondary starts to replay WAL that contains the writes to "default". It is possible that the write batches' sequence numbers are smaller than the db sequence number. In this case, these write batches will be skipped, and these updates will not be visible to reader until "default" is later flushed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5413 Differential Revision: D15637407 Pulled By: riversand963 fbshipit-source-id: 3de3fe35cfc6f1b9f844f3f926f0df29717b6580	2019-06-10 12:58:14 -07:00
Levi Tamasi	bee2f48a66	Refactor the handling of cache related counters and statistics (#5408 ) Summary: The patch cleans up the handling of cache hit/miss/insertion related performance counters, get context counters, and statistics by eliminating some code duplication and factoring out the affected logic into separate methods. In addition, it makes the semantics of cache hit metrics more consistent by changing the code so that accessing a partition of partitioned indexes/filters through a pinned reference no longer counts as a cache hit. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5408 Differential Revision: D15610883 Pulled By: ltamasi fbshipit-source-id: ee749c18965077aca971d8f8bee8b24ed8fa76f1	2019-06-06 11:36:40 -07:00
Yanqin Jin	340ed4fac7	Add support for timestamp in Get/Put (#5079 ) Summary: It's useful to be able to (optionally) associate key-value pairs with user-provided timestamps. This PR is an early effort towards this goal and continues the work of facebook#4942. A suite of new unit tests exist in DBBasicTestWithTimestampWithParam. Support for timestamp requires the user to provide timestamp as a slice in `ReadOptions` and `WriteOptions`. All timestamps of the same database must share the same length, format, etc. The format of the timestamp is the same throughout the same database, and the user is responsible for providing a comparator function (Comparator) to order the <key, timestamp> tuples. Once created, the format and length of the timestamp cannot change (at least for now). Test plan (on devserver): ``` $COMPILE_WITH_ASAN=1 make -j32 all $./db_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/* $make check ``` All tests must pass. We also run the following db_bench tests to verify whether there is regression on Get/Put while timestamp is not enabled. ``` $TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillseq,readrandom -num=1000000 $TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=1000000 ``` Repeat for 6 times for both versions. Results are as follows: ``` \| \| readrandom \| fillrandom \| \| master \| 16.77 MB/s \| 47.05 MB/s \| \| PR5079 \| 16.44 MB/s \| 47.03 MB/s \| ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5079 Differential Revision: D15132946 Pulled By: riversand963 fbshipit-source-id: 833a0d657eac21182f0f206c910a6438154c742c	2019-06-05 23:10:47 -07:00
Andrew Kryczka	ebe89ef9d8	Fix merging range tombstone covering put during flush/compaction (#5406 ) Summary: Flush/compaction use `MergeUntil` which has a special code path to handle a merge ending with a non-`Merge` point key. In particular if that key is a `Put` we forgot to check whether it is covered by a range tombstone. If it is covered then we must not include it in the following call to `TimedFullMerge`. Fixes #5392. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5406 Differential Revision: D15611144 Pulled By: sagar0 fbshipit-source-id: ba6a7863ca2d043f591de78fd0c4f4561f0c500e	2019-06-04 10:24:14 -07:00
Siying Dong	cb094e13bb	Auto roll logger to enforce options.keep_log_file_num immediately after a new file is created (#5370 ) Summary: Right now, with auto roll logger, options.keep_log_file_num enforcement is triggered by events like DB reopen or full obsolete scan happens. In the mean time, the size and number of log files can grow without a limit. We put a stronger enforcement to the option, so that the number of log files can always under control. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5370 Differential Revision: D15570413 Pulled By: siying fbshipit-source-id: 0916c3c4d42ab8fdd29389ee7fd7e1557b03176e	2019-05-31 10:50:19 -07:00
Yanqin Jin	b9f5900658	Fix WAL replay by skipping old write batches (#5170 ) Summary: 1. Fix a bug in WAL replay in which write batches with old sequence numbers are mistakenly inserted into memtables. 2. Add support for benchmarking secondary instance to db_bench_tool. With changes made in this PR, we can start benchmarking secondary instance using two processes. It is also possible to vary the frequency at which the secondary instance tries to catch up with the primary. The info log of the secondary can be found in a directory whose path can be specified with '-secondary_path'. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5170 Differential Revision: D15564608 Pulled By: riversand963 fbshipit-source-id: ce97688ed3d33f69d3a0b9266ebbbbf887aa0ec8	2019-05-30 19:33:33 -07:00
Levi Tamasi	1e35584251	Move the index readers out of the block cache (#5298 ) Summary: Currently, when the block cache is used for index blocks as well, it is not really the index block that is stored in the cache but an IndexReader object. Since this object is not pure data (it has, for instance, pointers that might dangle), it's not really sharable. To avoid the issues around this, the current code uses a dummy unique cache key for each TableReader to store the IndexReader, and erases the IndexReader entry when the TableReader is closed. Instead of doing this, the new code moves the IndexReader out of the cache altogether. In particular, instead of the TableReader owning, or caching/pinning the IndexReader based on the customer's settings, the TableReader unconditionally owns the IndexReader, which in turn owns/caches/pins the index block (which is itself sharable and thus can be safely put in the cache without any hacks). Note: the change has two side effects: 1) Partitions of partitioned indexes no longer affect the read amplification statistics. 2) Eviction statistics for index blocks are temporarily broken. We plan to fix this in a separate phase. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5298 Differential Revision: D15303203 Pulled By: ltamasi fbshipit-source-id: 935a69ba59d87d5e44f42e2310619b790c366e47	2019-05-30 11:53:27 -07:00
haoyuhuang	74a334a2eb	Provide an option so that SST ingestion won't fall back to copy after hard linking fails (#5333 ) Summary: RocksDB always tries to perform a hard link operation on the external SST file to ingest. This operation can fail if the external SST resides on a different device/FS, or the underlying FS does not support hard link. Currently RocksDB assumes that if the link fails, the user is willing to perform file copy, which is not true according to the post. This commit provides an option named 'failed_move_fall_back_to_copy' for users to choose which behavior they want. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5333 Differential Revision: D15457597 Pulled By: HaoyuHuang fbshipit-source-id: f3626e13f845db4f7ed970a53ec8a2b1f0d62214	2019-05-23 21:58:52 -07:00
Siying Dong	b2274da0e5	LogWriter to only flush after finish generating whole record (#5328 ) Summary: Right now, in log writer, we call flush after writing each physical record. I don't see the necessarity of it. Right now, the underlying writer has a buffer, so there isn't a concern that the write request is too large either. On the other hand, in an Env where every flush is expensive, the current approach is significantly slower than only flushing after a whole record finishes, when the record is very large. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5328 Differential Revision: D15425032 Pulled By: siying fbshipit-source-id: 440ebef002dfbb60c59d8388c9ddfc83d79700aa	2019-05-21 12:33:17 -07:00
Vijay Nadimpalli	931c9df886	Use separate status code for column family drop and db shutdown in progress (#5275 ) Summary: Currently RocksDB uses Status::ShutdownInProgress to inform about column family drop. I would like to have a separate Status code for this event. https://github.com/facebook/rocksdb/blob/master/include/rocksdb/status.h#L55 Comment on this: `abc4202e47/db/version_set.cc (L2742)`:L2743 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5275 Differential Revision: D15204583 Pulled By: vjnadimpalli fbshipit-source-id: 95e99e34b27bc165b554ecb8a48a7f8e60f21e2a	2019-05-20 10:47:32 -07:00
Maysam Yabandeh	5c0e304170	WritePrepared: Clarify the need for two_write_queues in unordered_write (#5313 ) Summary: WritePrepared transactions when configured with two_write_queues=true offers higher throughput with unordered_write feature without however compromising the rocksdb guarantees. This is because it performs ordering among writes in a 2nd step that is not tied to memtable write speed. The 2nd step is naturally provided by 2PC when the commit phase does the ordering as well. Without 2PC, the 2nd step would only be provided when we use two_write_queues=true, where WritePrepared after performing the writes, in a 2nd step uses the 2nd queue to assign order to the writes. The patch clarifies the need for two_write_queues=true in the HISTORY and inline comments of unordered_writes. Moreover it extends the stress tests of WritePrepared to unordred_write. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5313 Differential Revision: D15379977 Pulled By: maysamyabandeh fbshipit-source-id: 5b6f05b9b59285dcbf3b0532215ba9fe7d926e00	2019-05-20 07:49:20 -07:00
Yanqin Jin	fb4c6a31ce	Log replay integration for secondary instance (#5305 ) Summary: RocksDB secondary can replay both MANIFEST and WAL now. On the one hand, the memory usage by memtables will grow after replaying WAL for sometime. On the other hand, replaying the MANIFEST can bring the database persistent data to a more recent point in time, giving us the opportunity to discard some memtables containing out-dated data. This PR coordinates the MANIFEST and WAL replay, using the updates from MANIFEST replay to update the active memtable and immutable memtable list of each column family. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5305 Differential Revision: D15386512 Pulled By: riversand963 fbshipit-source-id: a3ea6fc415f8382d8cf624f52a71ebdcffa3e355	2019-05-17 19:19:51 -07:00
yiwu-arbug	f3a7847598	Reduce iterator key comparison for upper/lower bound check (#5111 ) Summary: Previously if iterator upper/lower bound presents, `DBIter` will check the bound for every key. This patch turns the check into per-file or per-data block check when applicable, by checking against either file largest/smallest key or block index key. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5111 Differential Revision: D15330061 Pulled By: siying fbshipit-source-id: 8a653fe3cd50d94d81eb2d13b087326c58ee2024	2019-05-17 10:28:31 -07:00
Yanqin Jin	ad27045d14	Update HISTORY after cherrypicking a bug fix to 6.2 (#5309 ) Summary: After cherry-pick a bug fix to 6.2.fb branch, update the HISTORY.md file to reflect this change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5309 Differential Revision: D15358002 Pulled By: riversand963 fbshipit-source-id: 5a60510ec6dd444ce5ffaefc69b2e4c38914a921	2019-05-15 13:47:36 -07:00
Maysam Yabandeh	f383641a1d	Unordered Writes (#5218 ) Summary: Performing unordered writes in rocksdb when unordered_write option is set to true. When enabled the writes to memtable are done without joining any write thread. This offers much higher write throughput since the upcoming writes would not have to wait for the slowest memtable write to finish. The tradeoff is that the writes visible to a snapshot might change over time. If the application cannot tolerate that, it should implement its own mechanisms to work around that. Using TransactionDB with WRITE_PREPARED write policy is one way to achieve that. Doing so increases the max throughput by 2.2x without however compromising the snapshot guarantees. The patch is prepared based on an original by siying Existing unit tests are extended to include unordered_write option. Benchmark Results: ``` TEST_TMPDIR=/dev/shm/ ./db_bench_unordered --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions --unordered_write=1 ``` With WAL - Vanilla RocksDB: 78.6 MB/s - WRITER_PREPARED with unordered_write: 177.8 MB/s (2.2x) - unordered_write: 368.9 MB/s (4.7x with relaxed snapshot guarantees) Without WAL - Vanilla RocksDB: 111.3 MB/s - WRITER_PREPARED with unordered_write: 259.3 MB/s MB/s (2.3x) - unordered_write: 645.6 MB/s (5.8x with relaxed snapshot guarantees) - WRITER_PREPARED with unordered_write disable concurrency control: 185.3 MB/s MB/s (2.35x) Limitations: - The feature is not yet extended to `max_successive_merges` > 0. The feature is also incompatible with `enable_pipelined_write` = true as well as with `allow_concurrent_memtable_write` = false. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5218 Differential Revision: D15219029 Pulled By: maysamyabandeh fbshipit-source-id: 38f2abc4af8780148c6128acdba2b3227bc81759	2019-05-13 17:47:21 -07:00
Yanqin Jin	e626016545	Fix a race condition caused by unlocking db mutex (#5294 ) Summary: Previous code may call `~ColumnFamilyData` in `DBImpl::AtomicFlushMemTablesToOutputFiles` if the column family is dropped or `cfd->IsFlushPending() == false`. In `~ColumnFamilyData`, the db mutex is released briefly and re-acquired. This can cause correctness issue. The reason is as follows. Assume there are more bg flush threads. After bg_flush_thr1 releases the db mutex, bg_flush_thr2 can grab it and pop an element from the flush queue. This will cause bg_flush_thr2 to accidentally pick some memtables which should have been picked by bg_flush_thr1. To make the matter worse, bg_flush_thr2 can clear `flush_requested_` flag for the memtable list, causing a subsequent call to `MemTableList::IsFlushPending()` by bg_flush_thr1 to return false, which is wrong. The fix is to delay `ColumnFamilyData::Unref` and `~ColumnFamilyData` for column families not selected for flush until `AtomicFlushMemTablesToOutputFiles` returns. Furthermore, a bg flush thread should not clear `MemTableList::flush_requested_` in `MemTableList::PickMemtablesToFlush` unless atomic flush is not used or the memtable list does not have unpicked memtables. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5294 Differential Revision: D15295297 Pulled By: riversand963 fbshipit-source-id: 03b101205ca22c242647cbf488bcf0ed80b2ecbd	2019-05-10 17:56:48 -07:00
Siying Dong	9fad3e21eb	Merging iterator to avoid child iterator reseek for some cases (#5286 ) Summary: When reseek happens in merging iterator, reseeking a child iterator can be avoided if: (1) the iterator represents imutable data (2) reseek() to a larger key than the current key (3) the current key of the child iterator is larger than the seek key because it is guaranteed that the result will fall into the same position. This optimization will be useful for use cases where users keep seeking to keys nearby in ascending order. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5286 Differential Revision: D15283635 Pulled By: siying fbshipit-source-id: 35f79ffd5ce3609146faa8cd55f2bfd733502f83	2019-05-09 14:20:04 -07:00
Siying Dong	25d81e4577	DBIter::Next() can skip user key checking if previous entry's seqnum is 0 (#5244 ) Summary: Right now, DBIter::Next() always checks whether an entry is for the same user key as the previous entry to see whether the key should be hidden to the user. However, if previous entry's sequence number is 0, the check is not needed because 0 is the oldest possible sequence number. We could extend it from seqnum 0 case to simply prev_seqno >= current_seqno. However, it is less robust with bug or unexpected situations, while the gain is relatively low. We can always extend it later when needed. In a readseq benchmark with full formed LSM-tree, number of key comparisons called is reduced from 2.981 to 2.165. readseq against a fully compacted DB, no key comparison is called. Performance in this benchmark didn't show obvious improvement, which is expected because key comparisons only takes small percentage of CPU. But it may show up to be more effective if users have an expensive customized comparator. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5244 Differential Revision: D15067257 Pulled By: siying fbshipit-source-id: b7e1ef3ec4fa928cba509683d2b3246e35d270d9	2019-05-09 12:24:04 -07:00
Maysam Yabandeh	6a40ee5eb1	Refresh snapshot list during long compactions (2nd attempt) (#5278 ) Summary: Part of compaction cpu goes to processing snapshot list, the larger the list the bigger the overhead. Although the lifetime of most of the snapshots is much shorter than the lifetime of compactions, the compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction to continue more efficiently with much smaller snapshot list. For simplicity, to avoid the feature is disabled in two cases: i) When more than one sub-compaction are sharing the same snapshot list, ii) when Range Delete is used in which the range delete aggregator has its own copy of snapshot list. This fixes the reverted https://github.com/facebook/rocksdb/pull/5099 issue with range deletes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5278 Differential Revision: D15203291 Pulled By: maysamyabandeh fbshipit-source-id: fa645611e606aa222c7ce53176dc5bb6f259c258	2019-05-03 17:30:22 -07:00
Siying Dong	4479dff208	Reduce binary search when reseek into the same data block (#5256 ) Summary: Right now, when Seek() is called again, RocksDB always does a binary search against the files and index blocks, even if they end up with the same file/block. Improve it as following: 1. in LevelIterator, reseek first try to check the boundary of the current file. If it falls into the same file, skip the binary search to find the file 2. in block based table iterator, reseek skip to reseek the iterator block if the seek key is larger than the current key and lower than the index key (boundary of the current block and the next block). Pull Request resolved: https://github.com/facebook/rocksdb/pull/5256 Differential Revision: D15105072 Pulled By: siying fbshipit-source-id: 39634bdb4a881082451fa39cecd7ecf12160bf80	2019-05-01 14:26:30 -07:00
Siying Dong	4e0f2aadb0	DB::Close() to fail when there are unreleased snapshots (#5272 ) Summary: Sometimes, users might make mistake of not releasing snapshots before closing the DB. This is undocumented use of RocksDB and the behavior is unknown. We return DB::Close() to provide a way to check it for the users. Aborted() will be returned to users when they call DB::Close(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/5272 Differential Revision: D15159713 Pulled By: siying fbshipit-source-id: 39369def612398d9f239d83d396b5a28e5af65cd	2019-05-01 10:17:30 -07:00
Maysam Yabandeh	521d234bda	Revert snap_refresh_nanos feature (#5269 ) Summary: Our daily stress tests are failing after this feature. Reverting temporarily until we figure the reason for test failures. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5269 Differential Revision: D15151285 Pulled By: maysamyabandeh fbshipit-source-id: e4002b99690a97df30d4b4b58bf0f61e9591bc6e	2019-05-01 10:07:30 -07:00
Fosco Marotto	36ea379cdc	Update history and version for future 6.2.0 (#5270 ) Summary: Update history before branch cut. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5270 Differential Revision: D15153700 Pulled By: gfosco fbshipit-source-id: 2c81e01a2ab965661b1d88209dca74ba0a3756cb	2019-04-30 15:09:36 -07:00
Sagar Vemuri	3548e4220d	Improve explicit user readahead performance (#5246 ) Summary: Improve the iterators performance when the user explicitly sets the readahead size via `ReadOptions.readahead_size`. 1. Stop creating new table readers when the user explicitly sets readahead size. 2. Make use of an internal buffer based on `FilePrefetchBuffer` instead of using `ReadaheadRandomAccessFileReader`, to handle the user readahead requests (for both buffered and direct io cases). 3. Add `readahead_size` to db_bench. Benchmarks: https://gist.github.com/sagar0/53693edc320a18abeaeca94ca32f5737 For 1 MB readahead, Buffered IO performance improves by 28% and Direct IO performance improves by 50%. For 512KB readahead, Buffered IO performance improves by 30% and Direct IO performance improves by 67%. Test Plan: Updated `DBIteratorTest.ReadAhead` test to make sure that: - no new table readers are created for iterators on setting ReadOptions.readahead_size - At least "readahead" number of bytes are actually getting read on each iterator read. TODO later: - Use similar logic for compactions as well. - This ties in nicely with #4052 and paves the way for removing ReadaheadRandomAcessFile later. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5246 Differential Revision: D15107946 Pulled By: sagar0 fbshipit-source-id: 2c1149729ca7d779e4e8b7710ba6f4e8cbfd3bea	2019-04-26 21:24:10 -07:00
Maysam Yabandeh	506e8448be	Refresh snapshot list during long compactions (#5099 ) Summary: Part of compaction cpu goes to processing snapshot list, the larger the list the bigger the overhead. Although the lifetime of most of the snapshots is much shorter than the lifetime of compactions, the compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction to continue more efficiently with much smaller snapshot list. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5099 Differential Revision: D15086710 Pulled By: maysamyabandeh fbshipit-source-id: 7649f56c3b6b2fb334962048150142a3bf9c1a12	2019-04-25 18:17:22 -07:00
Andrew Kryczka	6eb317bb4c	Option string/map/file can set env from object registry (#5237 ) Summary: - By providing the "env" field in any text-based options (i.e., string, map, or file), we can use `NewCustomObject` to deserialize the text value into an actual `Env` object. - Currently factory functions for `Env` registered with object registry should only return pointer to static `Env` objects. That's because `DBOptions::env` is a raw pointer so we cannot easily delegate cleanup. - Note I did not add `env` to `db_option_type_info`. It wasn't needed for (de)serialization, and I believe we don't want to do verification on `env`, even by checking name. That's because the user should be able to copy their DB from Linux to Windows, change envs, and not see an option verification error. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5237 Differential Revision: D15056360 Pulled By: siying fbshipit-source-id: 4b5f0b83297a5058f8949ec955dbf27d98d73d7e	2019-04-25 11:35:09 -07:00
Yanqin Jin	da96f2fe00	Close WAL files before deletion (#5233 ) Summary: Currently one thread in RocksDB keeps a WAL file open while another thread deletes it. Although the first thread never writes to the WAL again, it still tries to close it in the end. This is fine on POSIX, but can be problematic on other platforms, e.g. HDFS, etc.. It will either cause a lot of warning messages or throw exceptions. The solution is to let the second thread close the WAL before deleting it. RocksDB keeps the writers of the logs to delete in `logs_to_free_`, which is passed to `job_context` during `FindObsoleteFiles` (holding mutex). Then in `PurgeObsoleteFiles` (without mutex), these writers should close the logs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5233 Differential Revision: D15032670 Pulled By: riversand963 fbshipit-source-id: c55e8a612db8cc2306644001a5e6d53842a8f754	2019-04-25 10:11:41 -07:00
Zhongyi Xie	66d8360beb	update history.md (#5245 ) Summary: update history.md for `BottommostLevelCompaction::kForceOptimized` to mention possible user impact. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5245 Differential Revision: D15073712 Pulled By: miasantreble fbshipit-source-id: d40f698c42e8a6368be4eac0a00d02279615edea	2019-04-24 21:30:00 -07:00
Andrew Kryczka	8272a6de57	Optionally wait on bytes_per_sync to smooth I/O (#5183 ) Summary: The existing implementation does not guarantee bytes reach disk every `bytes_per_sync` when writing SST files, or every `wal_bytes_per_sync` when writing WALs. This can cause confusing behavior for users who enable this feature to avoid large syncs during flush and compaction, but then end up hitting them anyways. My understanding of the existing behavior is we used `sync_file_range` with `SYNC_FILE_RANGE_WRITE` to submit ranges for async writeback, such that we could continue processing the next range of bytes while that I/O is happening. I believe we can preserve that benefit while also limiting how far the processing can get ahead of the I/O, which prevents huge syncs from happening when the file finishes. Consider this `sync_file_range` usage: `sync_file_range(fd_, 0, static_cast<off_t>(offset + nbytes), SYNC_FILE_RANGE_WAIT_BEFORE \| SYNC_FILE_RANGE_WRITE)`. Expanding the range to start at 0 and adding the `SYNC_FILE_RANGE_WAIT_BEFORE` flag causes any pending writeback (like from a previous call to `sync_file_range`) to finish before it proceeds to submit the latest `nbytes` for writeback. The latest `nbytes` are still written back asynchronously, unless processing exceeds I/O speed, in which case the following `sync_file_range` will need to wait on it. There is a second change in this PR to use `fdatasync` when `sync_file_range` is unavailable (determined statically) or has some known problem with the underlying filesystem (determined dynamically). The above two changes only apply when the user enables a new option, `strict_bytes_per_sync`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5183 Differential Revision: D14953553 Pulled By: siying fbshipit-source-id: 445c3862e019fb7b470f9c7f314fc231b62706e9	2019-04-22 11:51:39 -07:00
Mike Kolupaev	df38c1ce66	Add BlockBasedTableOptions::index_shortening (#5174 ) Summary: Introduce BlockBasedTableOptions::index_shortening to give users control on which key shortening techniques to be used in building index blocks. Before this patch, both separators and successor keys where shortened in indexes. With this patch, the default is set to kShortenSeparators to only shorten the separators. Since each index block has many separators and only one successor (last key), the change should not have negative impact on index block size. However it should prevent many unnecessary block loads where due to approximation introduced by shorted successor, seek would land us to the previous block and then fix it by moving to the next one. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5174 Differential Revision: D14884185 Pulled By: al13n321 fbshipit-source-id: 1b08bc8c03edcf09b6b8c16e9a7eea08ad4dd534	2019-04-22 08:20:35 -07:00
Sagar Vemuri	dc64c2f5cc	Fix history to not include some features in 6.1 (#5224 ) Summary: Fix HISTORY.md by removing a few items from 6.1.1 history as they did not make into the 6.1.fb branch. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5224 Differential Revision: D15017030 Pulled By: sagar0 fbshipit-source-id: 090724d326d29168952e06dc1a5090c03fdd739e	2019-04-19 13:00:53 -07:00
Zhongyi Xie	3bdce20e2b	reorganize history.md to list unreleased changes separately Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5216 Differential Revision: D15003749 Pulled By: miasantreble fbshipit-source-id: a52c264e694cd7c55813be33ee22b4f3046b545a	2019-04-18 14:55:57 -07:00
Zhongyi Xie	baa5302447	Avoid double-compacting data in bottom level in manual compactions (#5138 ) Summary: Depending on the config, manual compaction (leveled compaction style) does following compactions: L0->L1 L1->L2 ... Ln-1 -> Ln Ln -> Ln The final Ln -> Ln compaction is partly unnecessary as it recompacts all the files that were just generated by the Ln-1 -> Ln. We should avoid recompacting such files. This rule should be applied to Lmax only. Resolves issue https://github.com/facebook/rocksdb/issues/4995 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5138 Differential Revision: D14940106 Pulled By: miasantreble fbshipit-source-id: 8d3cf5507a17e76f3333cfd4bac5256d005636e5	2019-04-16 23:32:20 -07:00
Siying Dong	beb44ec3eb	WriteBufferManager's dummy entry size to block cache 1MB -> 256KB (#5175 ) Summary: Dummy cache size of 1MB is too large for small block sizes. Our GetDefaultCacheShardBits() use min_shard_size = 512L * 1024L to determine number of shards, so 1MB will excceeds the size of the whole shard and make the cache excceeds the budget. Change it to 256KB accordingly. There shouldn't be obvious performance impact, since inserting a cache entry every 256KB of memtable inserts is still infrequently enough. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5175 Differential Revision: D14954289 Pulled By: siying fbshipit-source-id: 2c275255c1ac3992174e06529e44c55538325c94	2019-04-16 12:03:07 -07:00
yiwu-arbug	f1239d5f10	Avoid per-key upper bound check in BlockBasedTableIterator (#5142 ) Summary: This is second attempt for #5101. Original commit message: `BlockBasedTableIterator` avoid reading next block on `Next()` if it detects the iterator will be out of bound, by checking against index key. The optimization was added in #2239, and by the time it only check the bound per block. It seems later change make it a per-key check, which introduce unnecessary key comparisons. This patch come with two fixes: Fix 1: To optimize checking for bounds, we need comparing the bounds with index key as well. However BlockBasedTableIterator doesn't know whether its index iterator is internally using user keys or internal keys. The patch fixes that by extending InternalIterator with a user_key() function that is overridden by In IndexBlockIter. Fix 2: In #5101 we return `IsOutOfBound()=true` when block index key is out of bound. But the index key can be larger than smallest key of the next file on the level. That file can be within upper bound and should not be filtered out. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5142 Differential Revision: D14907113 Pulled By: siying fbshipit-source-id: ac95775c5b4e7b700f76ab43e39f45402c98fbfb	2019-04-16 11:37:47 -07:00
Fosco Marotto	b5cad5c986	Update history and version to 6.1.1 (#5171 ) Summary: Including latest fixes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5171 Differential Revision: D14875157 Pulled By: gfosco fbshipit-source-id: 86ec7ee3553a9b25ab71ed98966ce08a16322e2c	2019-04-15 10:49:38 -07:00
yiwu-arbug	cca141ecf8	Fix crash with memtable prefix bloom and key out of prefix extractor domain (#5190 ) Summary: Before using prefix extractor `InDomain()` should be check. All uses in memtable.cc didn't check `InDomain()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5190 Differential Revision: D14923773 Pulled By: miasantreble fbshipit-source-id: b3ad60bcca5f3a1a2b929a6eb34b0b7ba6326f04	2019-04-12 17:07:49 -07:00
Maysam Yabandeh	fe642cbee6	WritePrepared: fix race condition in reading batch with duplicate keys (#5147 ) Summary: When ReadOption doesn't specify a snapshot, WritePrepared::Get used kMaxSequenceNumber to avoid the cost of creating a new snapshot object (that requires sync over db_mutex). This creates a race condition if it is reading from the writes of a transaction that had duplicate keys: each instance of duplicate key is inserted with a different sequence number and depending on the ordering the ::Get might skip the newer one and read the older one that is obsolete. The patch fixes that by using last published seq as the snapshot sequence number. It also adds a check after the read is done to ensure that the max_evicted_seq has not advanced the aforementioned seq, which is a very unlikely event. If it did, then the read is not valid since the seq is not backed by an actually snapshot to let IsInSnapshot handle that properly when an overlapping commit is evicted from commit cache. A unit test is added to reproduce the race condition with duplicate keys. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5147 Differential Revision: D14758815 Pulled By: maysamyabandeh fbshipit-source-id: a56915657132cf6ba5e3f5ea1b5d78c803407719	2019-04-12 14:40:41 -07:00

1 2 3 4 5 ...

770 Commits