rocksdb

Author	SHA1	Message	Date
matthewvon	5a2b4ed671	BugFix: fs_posix.cc GetFreeSpace uses wrong value non-root users (#8370 ) Summary: fs_posix.cc GetFreeSpace() calculates free space based upon a call to statvfs(). However, there are two extremely different values in statvfs's returned structure: f_bfree which is free space for root and f_bavail which is free space for non-root users. The existing code uses f_bfree. Many disks have 5 to 10% of the total disk space reserved for root only. Therefore GetFreeSpace() does not realize that non-root users may not have storage available. This PR detects whether the effective posix user is root or not, then selects the appropriate available space value. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8370 Reviewed By: mrambacher Differential Revision: D29032710 Pulled By: jay-zhuang fbshipit-source-id: 57feba34ed035615a479956d28f98d85735281c0	2021-06-10 11:11:54 -07:00
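The selection described in this commit can be sketched as follows. This is a minimal illustration, not the code from fs_posix.cc: the helper name `FreeSpaceForCaller` is hypothetical, and the block count is multiplied by `f_frsize` because POSIX defines `f_bfree` and `f_bavail` in those units.

```cpp
#include <sys/statvfs.h>
#include <unistd.h>

#include <cstdint>
#include <string>

// Hypothetical helper (not the RocksDB function): reports the bytes that the
// calling user can actually allocate on the file system containing `path`.
// Returns false if statvfs() fails; errno then holds the reason.
bool FreeSpaceForCaller(const std::string& path, uint64_t* free_bytes) {
  struct statvfs buf;
  if (statvfs(path.c_str(), &buf) != 0) {
    return false;
  }
  // f_bfree includes blocks reserved for root; f_bavail excludes them.
  // Only an effective uid of 0 may count the reserved blocks as usable.
  const uint64_t blocks = (geteuid() == 0) ? buf.f_bfree : buf.f_bavail;
  *free_bytes = blocks * static_cast<uint64_t>(buf.f_frsize);
  return true;
}
```

Calling `FreeSpaceForCaller("/", &n)` as a non-root user on a file system with a root reserve should report less space than a root caller would see.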
David Devecsery	80a59a03a7	Cancel compact range (#8351 ) Summary: Added the ability to cancel an in-progress range compaction by storing to an atomic "canceled" variable pointed to within the CompactRangeOptions structure. Tested via two tests added to db_tests2.cc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8351 Reviewed By: ajkr Differential Revision: D28808894 Pulled By: ddevec fbshipit-source-id: cb321361c9e23b084b188bb203f11c375a22c2dd	2021-06-07 11:41:31 -07:00
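The cancellation pattern from this commit (a caller-owned atomic flag that the long-running job polls) might look roughly like the sketch below. `SketchOptions` and `RunUntilCanceled` are hypothetical stand-ins written for illustration; only the idea of a pointer to an atomic "canceled" variable inside the options structure comes from the commit message.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical stand-in for an options struct that carries a pointer to a
// caller-owned cancellation flag (nullptr means "not cancelable").
struct SketchOptions {
  std::atomic<bool>* canceled = nullptr;
};

// Runs `step(i)` for i in [0, total_steps) and checks the flag before each
// step. Returns the number of steps that completed.
template <typename StepFn>
std::size_t RunUntilCanceled(const SketchOptions& opts,
                             std::size_t total_steps, StepFn step) {
  std::size_t done = 0;
  while (done < total_steps) {
    if (opts.canceled != nullptr &&
        opts.canceled->load(std::memory_order_acquire)) {
      break;  // another thread (or the step itself) asked us to stop
    }
    step(done);
    ++done;
  }
  return done;
}
```

Another thread cancels the work by storing `true` to the flag; work already finished is not rolled back in this sketch.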
Andrew Kryczka	9167ece586	Snapshot release triggered compaction without multiple tombstones (#8357 ) Summary: This is a duplicate of https://github.com/facebook/rocksdb/issues/4948 by mzhaom to fix tests after rebase. This change is a follow-up to https://github.com/facebook/rocksdb/issues/4927, which made this possible by allowing tombstone dropping/seqnum zeroing optimizations on the last key in the compaction. Now the `largest_seqno != 0` condition suffices to prevent snapshot release triggered compaction from entering an infinite loop. The issues caused by the extraneous condition `level_and_file.second->num_deletions > 1` are: - files could have `largest_seqno > 0` forever making it impossible to tell they cannot contain any covering keys - it doesn't trigger compaction when there are many overwritten keys. Some MyRocks use case actually doesn't use Delete but instead calls Put with empty value to "delete" keys, so we'd like to be able to trigger compaction in this case too. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8357 Test Plan: - make check Reviewed By: jay-zhuang Differential Revision: D28855340 Pulled By: ajkr fbshipit-source-id: a261b51eecafec492499e6d01e8e43112f801798	2021-06-04 00:21:40 -07:00
anand76	799cf37cb1	Update HISTORY and version to 6.21 (#8363 ) Summary: Update HISTORY and version to 6.21 on master. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8363 Reviewed By: jay-zhuang Differential Revision: D28888818 Pulled By: anand1976 fbshipit-source-id: 9e5fac3b99ecc9f3b7d9f21474a39fa50decb117	2021-06-03 19:32:14 -07:00
Peter Dillinger	956ce9bde2	Some API clarification for manual compaction and listeners (#8330 ) Summary: Avoid people hitting bugs Pull Request resolved: https://github.com/facebook/rocksdb/pull/8330 Test Plan: comments only Reviewed By: siying Differential Revision: D28683157 Pulled By: pdillinger fbshipit-source-id: 2b34d3efb5e2fa34bea93d54c940cbd425212d25	2021-05-26 08:14:38 -07:00
Peter Dillinger	3469d60fcc	Add table properties for number of entries added to filters (#8323 ) Summary: With Ribbon filter work and possible variance in actual bits per key (or prefix; general term "entry") to achieve certain FP rates, I've received a request to be able to track actual bits per key in generated filters. This change adds a num_filter_entries table property, which can be combined with filter_size to get bits per key (entry). This can vary from num_entries in at least these ways: * Different versions of same key are only counted once in filters. * With prefix filters, several user keys map to the same filter entry. * A single filter can include both prefixes and user keys. Note that FilterBlockBuilder::NumAdded() didn't do anything useful except distinguish empty from non-empty. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8323 Test Plan: basic unit test included, others updated Reviewed By: jay-zhuang Differential Revision: D28596210 Pulled By: pdillinger fbshipit-source-id: 529a111f3c84501e5a470bc84705e436ee68c376	2021-05-21 17:11:32 -07:00
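As a worked example of combining the two properties named in this commit, the helper below derives bits per entry. The function name is hypothetical, and it assumes `filter_size` is reported in bytes.

```cpp
#include <cstdint>

// Hypothetical helper: bits per filter entry from two table properties.
// Assumes filter_size is in bytes; returns 0 when the filter has no entries.
double BitsPerFilterEntry(uint64_t filter_size_bytes,
                          uint64_t num_filter_entries) {
  if (num_filter_entries == 0) {
    return 0.0;
  }
  return static_cast<double>(filter_size_bytes) * 8.0 /
         static_cast<double>(num_filter_entries);
}
```

For instance, a 1250-byte filter holding 1000 entries works out to 10 bits per entry.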
Jay Zhuang	6c86543590	Fix manual compaction `max_compaction_bytes` under-calculated issue (#8269 ) Summary: Fix a bug where, for manual compaction, `max_compaction_bytes` only limited the SST files from the input level, not the overlapping files on the output level. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8269 Test Plan: `make check` Reviewed By: ajkr Differential Revision: D28231044 Pulled By: jay-zhuang fbshipit-source-id: 9d7d03004f30cc4b1b9819830141436907554b7c	2021-05-21 14:03:44 -07:00
sdong	bd3d080ef8	Try to build with liburing by default. (#8322 ) Summary: By default, try to build with liburing. For make, if ROCKSDB_USE_IO_URING is not set, treat as 1, which means RocksDB will try to build with liburing. For cmake, add WITH_LIBURING to control it, with default on. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8322 Test Plan: Build using cmake and make. Reviewed By: anand1976 Differential Revision: D28586498 fbshipit-source-id: cfd39159ab697f4b93a9293a59c07f839b1e7ed5	2021-05-21 10:21:53 -07:00
sdong	2f1984dd45	Compare memtable insert and flush count (#8288 ) Summary: When a memtable is flushed, it will validate the number of entries it reads and compare that number with how many entries were inserted into the memtable. This serves as one sanity check against memory corruption. This change will also allow more counters to be added in the future for better validation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8288 Test Plan: Pass all existing tests Reviewed By: ajkr Differential Revision: D28369194 fbshipit-source-id: 7ff870380c41eab7f99eee508550dcdce32838ad	2021-05-20 16:07:28 -07:00
Jay Zhuang	3786181a90	Add remote compaction public API (#8300 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8300 Reviewed By: ajkr Differential Revision: D28464726 Pulled By: jay-zhuang fbshipit-source-id: 49e9f4fb791808a6cbf39a7b1a331373f645fc5e	2021-05-19 21:41:31 -07:00
Peter Dillinger	311a544c2a	Use deleters to label cache entries and collect stats (#8297 ) Summary: This change gathers and publishes statistics about the kinds of items in block cache. This is especially important for profiling relative usage of cache by index vs. filter vs. data blocks. It works by iterating over the cache during periodic stats dump (InternalStats, stats_dump_period_sec) or on demand when DB::Get(Map)Property(kBlockCacheEntryStats), except that for efficiency and sharing among column families, saved data from the last scan is used when the data is not considered too old. The new information can be seen in info LOG, for example: Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0 Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%) And also through DB::GetProperty and GetMapProperty (here using ldb just for demonstration): $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats rocksdb.block-cache-entry-stats.bytes.data-block: 0 rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0 rocksdb.block-cache-entry-stats.bytes.index-block: 178992 rocksdb.block-cache-entry-stats.bytes.misc: 0 rocksdb.block-cache-entry-stats.bytes.other-block: 0 rocksdb.block-cache-entry-stats.bytes.write-buffer: 0 rocksdb.block-cache-entry-stats.capacity: 8388608 rocksdb.block-cache-entry-stats.count.data-block: 0 rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-meta-block: 0 rocksdb.block-cache-entry-stats.count.index-block: 215 
rocksdb.block-cache-entry-stats.count.misc: 1 rocksdb.block-cache-entry-stats.count.other-block: 0 rocksdb.block-cache-entry-stats.count.write-buffer: 0 rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290 rocksdb.block-cache-entry-stats.percent.data-block: 0.000000 rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000 rocksdb.block-cache-entry-stats.percent.index-block: 2.133751 rocksdb.block-cache-entry-stats.percent.misc: 0.000000 rocksdb.block-cache-entry-stats.percent.other-block: 0.000000 rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000 rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052 rocksdb.block-cache-entry-stats.secs_since_last_collection: 0 Solution detail - We need some way to flag what kind of blocks each entry belongs to, preferably without changing the Cache API. One of the complications is that Cache is a general interface that could have other users that don't adhere to whichever convention we decide on for keys and values. Or we would pay for an extra field in the Handle that would only be used for this purpose. This change uses a back-door approach, the deleter, to indicate the "role" of a Cache entry (in addition to the value type, implicitly). This has the added benefit of ensuring proper code origin whenever we recognize a particular role for a cache entry; if the entry came from some other part of the code, it will use an unrecognized deleter, which we simply attribute to the "Misc" role. An internal API makes for simple instantiation and automatic registration of Cache deleters for a given value type and "role". Another internal API, CacheEntryStatsCollector, solves the problem of caching the results of a scan and sharing them, to ensure scans are neither excessive nor redundant so as not to harm Cache performance. 
Because code is added to BlocklikeTraits, it is pulled out of block_based_table_reader.cc into its own file. This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option (could still be added), and with actual stat gathering. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297 Test Plan: manual testing with db_bench, and a couple of basic unit tests Reviewed By: ltamasi Differential Revision: D28488721 Pulled By: pdillinger fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb	2021-05-19 16:51:13 -07:00
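The "deleter as role label" idea in this commit can be illustrated with a small self-contained sketch. None of the names below are RocksDB's internal API; `Role`, `GetDeleter`, and `RoleOf` are hypothetical, and the registry is not thread-safe.

```cpp
#include <map>
#include <string>

// Hypothetical roles; unknown deleters are attributed to kMisc.
enum class Role { kDataBlock, kFilterBlock, kIndexBlock, kMisc };

using Deleter = void (*)(void* value);

// Maps each registered deleter address to the role it stands for.
inline std::map<Deleter, Role>& Registry() {
  static std::map<Deleter, Role> registry;
  return registry;
}

// One distinct function per (value type, role) pair, so the function address
// alone identifies both how to free the value and what it was used for.
template <typename T, Role R>
void DeleteAs(void* value) {
  delete static_cast<T*>(value);
}

// Returns the deleter for (T, R), registering it on first use.
template <typename T, Role R>
Deleter GetDeleter() {
  Deleter d = &DeleteAs<T, R>;
  Registry()[d] = R;
  return d;
}

// Looks up the role for a deleter; anything unregistered counts as kMisc.
inline Role RoleOf(Deleter d) {
  auto it = Registry().find(d);
  return it == Registry().end() ? Role::kMisc : it->second;
}
```

A stats scan can then call `RoleOf` on each entry's deleter and accumulate counts and charges per role, without an extra field in the cache handle.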
anand76	9d61a0856d	Sync ingested files only if reopen is supported by the FS (#8296 ) Summary: Some file systems (especially distributed FS) do not support reopening a file for writing. The ExternalSstFileIngestionJob calls ReopenWritableFile in order to sync the ingested file, which typically makes sense only on a local file system with a page cache (i.e Posix). So this change tries to sync the ingested file only if ReopenWritableFile doesn't return Status::NotSupported(). Tests: Add a new unit test in external_sst_file_basic_test Pull Request resolved: https://github.com/facebook/rocksdb/pull/8296 Reviewed By: jay-zhuang Differential Revision: D28420865 Pulled By: anand1976 fbshipit-source-id: 380e7f5ff95324997f7a59864a9ac96ebbd0100c	2021-05-18 19:33:55 -07:00
sdong	60e5af83c1	Handle return code by io_uring_submit_and_wait() and io_uring_wait_cqe() (#8311 ) Summary: Right now the return codes of io_uring_submit_and_wait() and io_uring_wait_cqe() are not handled, which is not good practice. Although these two functions are not supposed to return non-0 values in normal execution, people suspect that they might return a non-0 value when an interruption happens, and the code might hang. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8311 Test Plan: Make sure at least normal test cases still pass. Reviewed By: anand1976 Differential Revision: D28500828 fbshipit-source-id: 8a76cea9cafbd041102e0b6a8eef9d0bfed7c211	2021-05-18 16:09:14 -07:00
Peter Dillinger	78a309bf86	New Cache API for gathering statistics (#8225 ) Summary: Adds a new Cache::ApplyToAllEntries API that we expect to use (in follow-up PRs) for efficiently gathering block cache statistics. Notable features vs. old ApplyToAllCacheEntries: * Includes key and deleter (in addition to value and charge). We could have passed in a Handle but then more virtual function calls would be needed to get the "fields" of each entry. We expect to use the 'deleter' to identify the origin of entries, perhaps even more. * Heavily tuned to minimize latency impact on operating cache. It does this by iterating over small sections of each cache shard while cycling through the shards. * Supports tuning roughly how many entries to operate on for each lock acquire and release, to control the impact on the latency of other operations without excessive lock acquire & release. The right balance can depend on the cost of the callback. Good default seems to be around 256. * There should be no need to disable thread safety. (I would expect uncontended locks to be sufficiently fast.) I have enhanced cache_bench to validate this approach: * Reports a histogram of ns per operation, so we can look at the distribution of times, not just throughput (average). * Can add a thread for simulated "gather stats" which calls ApplyToAllEntries at a specified interval. We also generate a histogram of time to run ApplyToAllEntries. To make the iteration over some entries of each shard work as cleanly as possible, even with resize between next set of entries, I have re-arranged which hash bits are used for sharding and which for indexing within a shard. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8225 Test Plan: A couple of unit tests are added, but primary validation is manual, as the primary risk is to performance. 
The primary validation is using cache_bench to ensure that neither the minor hashing changes nor the simulated stats gathering significantly impact QPS or latency distribution. Note that adding op latency histogram seriously impacts the benchmark QPS, so for a fair baseline, we need the cache_bench changes (except remove simulated stat gathering to make it compile). In short, we don't see any reproducible difference in ops/sec or op latency unless we are gathering stats nearly continuously. Test uses 10GB block cache with 8KB values to be somewhat realistic in the number of items to iterate over. Baseline typical output: ``` Complete in 92.017 s; Rough parallel ops/sec = 869401 Thread ops/sec = 54662 Operation latency (ns): Count: 80000000 Average: 11223.9494 StdDev: 29.61 Min: 0 Median: 7759.3973 Max: 9620500 Percentiles: P50: 7759.40 P75: 14190.73 P99: 46922.75 P99.9: 77509.84 P99.99: 217030.58 ------------------------------------------------------ [ 0, 1 ] 68 0.000% 0.000% ( 2900, 4400 ] 89 0.000% 0.000% ( 4400, 6600 ] 33630240 42.038% 42.038% ######## ( 6600, 9900 ] 18129842 22.662% 64.700% ##### ( 9900, 14000 ] 7877533 9.847% 74.547% ## ( 14000, 22000 ] 15193238 18.992% 93.539% #### ( 22000, 33000 ] 3037061 3.796% 97.335% # ( 33000, 50000 ] 1626316 2.033% 99.368% ( 50000, 75000 ] 421532 0.527% 99.895% ( 75000, 110000 ] 56910 0.071% 99.966% ( 110000, 170000 ] 16134 0.020% 99.986% ( 170000, 250000 ] 5166 0.006% 99.993% ( 250000, 380000 ] 3017 0.004% 99.996% ( 380000, 570000 ] 1337 0.002% 99.998% ( 570000, 860000 ] 805 0.001% 99.999% ( 860000, 1200000 ] 319 0.000% 100.000% ( 1200000, 1900000 ] 231 0.000% 100.000% ( 1900000, 2900000 ] 100 0.000% 100.000% ( 2900000, 4300000 ] 39 0.000% 100.000% ( 4300000, 6500000 ] 16 0.000% 100.000% ( 6500000, 9800000 ] 7 0.000% 100.000% ``` New, gather_stats=false. 
Median thread ops/sec of 5 runs: ``` Complete in 92.030 s; Rough parallel ops/sec = 869285 Thread ops/sec = 54458 Operation latency (ns): Count: 80000000 Average: 11298.1027 StdDev: 42.18 Min: 0 Median: 7722.0822 Max: 6398720 Percentiles: P50: 7722.08 P75: 14294.68 P99: 47522.95 P99.9: 85292.16 P99.99: 228077.78 ------------------------------------------------------ [ 0, 1 ] 109 0.000% 0.000% ( 2900, 4400 ] 793 0.001% 0.001% ( 4400, 6600 ] 34054563 42.568% 42.569% ######### ( 6600, 9900 ] 17482646 21.853% 64.423% #### ( 9900, 14000 ] 7908180 9.885% 74.308% ## ( 14000, 22000 ] 15032072 18.790% 93.098% #### ( 22000, 33000 ] 3237834 4.047% 97.145% # ( 33000, 50000 ] 1736882 2.171% 99.316% ( 50000, 75000 ] 446851 0.559% 99.875% ( 75000, 110000 ] 68251 0.085% 99.960% ( 110000, 170000 ] 18592 0.023% 99.983% ( 170000, 250000 ] 7200 0.009% 99.992% ( 250000, 380000 ] 3334 0.004% 99.997% ( 380000, 570000 ] 1393 0.002% 99.998% ( 570000, 860000 ] 700 0.001% 99.999% ( 860000, 1200000 ] 293 0.000% 100.000% ( 1200000, 1900000 ] 196 0.000% 100.000% ( 1900000, 2900000 ] 69 0.000% 100.000% ( 2900000, 4300000 ] 32 0.000% 100.000% ( 4300000, 6500000 ] 10 0.000% 100.000% ``` New, gather_stats=true, 1 second delay between scans. Scans take about 1 second here so it's spending about 50% time scanning. Still the effect on ops/sec and latency seems to be in the noise. 
Median thread ops/sec of 5 runs: ``` Complete in 91.890 s; Rough parallel ops/sec = 870608 Thread ops/sec = 54551 Operation latency (ns): Count: 80000000 Average: 11311.2629 StdDev: 45.28 Min: 0 Median: 7686.5458 Max: 10018340 Percentiles: P50: 7686.55 P75: 14481.95 P99: 47232.60 P99.9: 79230.18 P99.99: 232998.86 ------------------------------------------------------ [ 0, 1 ] 71 0.000% 0.000% ( 2900, 4400 ] 291 0.000% 0.000% ( 4400, 6600 ] 34492060 43.115% 43.116% ######### ( 6600, 9900 ] 16727328 20.909% 64.025% #### ( 9900, 14000 ] 7845828 9.807% 73.832% ## ( 14000, 22000 ] 15510654 19.388% 93.220% #### ( 22000, 33000 ] 3216533 4.021% 97.241% # ( 33000, 50000 ] 1680859 2.101% 99.342% ( 50000, 75000 ] 439059 0.549% 99.891% ( 75000, 110000 ] 60540 0.076% 99.967% ( 110000, 170000 ] 14649 0.018% 99.985% ( 170000, 250000 ] 5242 0.007% 99.991% ( 250000, 380000 ] 3260 0.004% 99.995% ( 380000, 570000 ] 1599 0.002% 99.997% ( 570000, 860000 ] 1043 0.001% 99.999% ( 860000, 1200000 ] 471 0.001% 99.999% ( 1200000, 1900000 ] 275 0.000% 100.000% ( 1900000, 2900000 ] 143 0.000% 100.000% ( 2900000, 4300000 ] 60 0.000% 100.000% ( 4300000, 6500000 ] 27 0.000% 100.000% ( 6500000, 9800000 ] 7 0.000% 100.000% ( 9800000, 14000000 ] 1 0.000% 100.000% Gather stats latency (us): Count: 46 Average: 980387.5870 StdDev: 60911.18 Min: 879155 Median: 1033777.7778 Max: 1261431 Percentiles: P50: 1033777.78 P75: 1120666.67 P99: 1261431.00 P99.9: 1261431.00 P99.99: 1261431.00 ------------------------------------------------------ ( 860000, 1200000 ] 45 97.826% 97.826% #################### ( 1200000, 1900000 ] 1 2.174% 100.000% Most recent cache entry stats: Number of entries: 1295133 Total charge: 9.88 GB Average key size: 23.4982 Average charge: 8.00 KB Unique deleters: 3 ``` Reviewed By: mrambacher Differential Revision: D28295742 Pulled By: pdillinger fbshipit-source-id: bbc4a552f91ba0fe10e5cc025c42cef5a81f2b95	2021-05-11 16:17:10 -07:00
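The iteration strategy described in this commit (visit a small batch from one shard, release its lock, move to the next shard, and repeat) can be sketched as below. This is an illustrative simplification with hypothetical types; the real cache shards are hash tables that can resize between batches, which this vector-based sketch does not model.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical cache entry and shard, reduced to what the sketch needs.
struct Entry {
  std::string key;
  std::size_t charge;
};

struct Shard {
  std::mutex mu;
  std::vector<Entry> entries;
};

// Visits every entry while holding each shard lock for at most `batch`
// callbacks at a time, cycling through the shards between batches.
void ApplyToAllEntriesSketch(std::vector<Shard>& shards, std::size_t batch,
                             const std::function<void(const Entry&)>& cb) {
  if (batch == 0) {
    batch = 1;  // guard against a zero batch size
  }
  std::vector<std::size_t> next(shards.size(), 0);  // resume point per shard
  bool more = true;
  while (more) {
    more = false;
    for (std::size_t s = 0; s < shards.size(); ++s) {
      std::lock_guard<std::mutex> lock(shards[s].mu);
      const std::size_t size = shards[s].entries.size();
      const std::size_t end = std::min(size, next[s] + batch);
      for (std::size_t i = next[s]; i < end; ++i) {
        cb(shards[s].entries[i]);
      }
      next[s] = end;
      if (end < size) {
        more = true;  // this shard still has unvisited entries
      }
    }
  }
}
```

A larger batch means fewer lock acquisitions but longer pauses for concurrent cache operations on that shard, which is the trade-off the commit message tunes around 256.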
mrambacher	9f2d255aed	Add ObjectRegistry to ConfigOptions (#8166 ) Summary: This change enables a couple of things: - Different ConfigOptions can have different registry/factory associated with it, thereby allowing things like a "Test" ConfigOptions versus a "Production" - The ObjectRegistry is created fewer times and can be re-used The ConfigOptions can also be initialized/constructed from a DBOptions, in which case it will grab some of its settings (Env, Logger) from the DBOptions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8166 Reviewed By: zhichao-cao Differential Revision: D27657952 Pulled By: mrambacher fbshipit-source-id: ae1d6200bb7ab127405cdeefaba43c7fe694dfdd	2021-05-11 06:47:22 -07:00
mrambacher	ff463742b5	Add Merge Operator support to WriteBatchWithIndex (#8135 ) Summary: The WBWI has two differing modes of operation dependent on the value of the constructor parameter `overwrite_key`. Currently, regardless of the parameter, neither mode performs as expected when using Merge. This PR remedies this by correctly invoking the appropriate Merge Operator before returning results from the WBWI. Examples of issues that exist which are solved by this PR: ## Example 1 with `overwrite_key=false` Currently, from an empty database, the following sequence: ``` Put('k1', 'v1') Merge('k1', 'v2') Get('k1') ``` Incorrectly yields `v2`, that is to say that the Merge behaves like a Put. ## Example 2 with `overwrite_key=true` Currently, from an empty database, the following sequence: ``` Put('k1', 'v1') Merge('k1', 'v2') Get('k1') ``` Incorrectly yields `ERROR: kMergeInProgress`. ## Example 3 with `overwrite_key=false` Currently, with a database containing `('k1' -> 'v1')`, the following sequence: ``` Merge('k1', 'v2') GetFromBatchAndDB('k1') ``` Incorrectly yields `v1,v2` ## Example 4 with `overwrite_key=true` Currently, with a database containing `('k1' -> 'v1')`, the following sequence: ``` Merge('k1', 'v1') GetFromBatchAndDB('k1') ``` Incorrectly yields `ERROR: kMergeInProgress`. ## Example 5 with `overwrite_key=false` Currently, from an empty database, the following sequence: ``` Put('k1', 'v1') Merge('k1', 'v2') GetFromBatchAndDB('k1') ``` Incorrectly yields `v1,v2` ## Example 6 with `overwrite_key=true` Currently, from an empty database, `('k1' -> 'v1')`, the following sequence: ``` Put('k1', 'v1') Merge('k1', 'v2') GetFromBatchAndDB('k1') ``` Incorrectly yields `ERROR: kMergeInProgress`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8135 Reviewed By: pdillinger Differential Revision: D27657938 Pulled By: mrambacher fbshipit-source-id: 0fbda6bbc66bedeba96a84786d90141d776297df	2021-05-10 12:50:25 -07:00
sdong	f89a53655d	Change date format in HISTORY.md (#8278 ) Summary: Per previous discussion, change date format in HISTORY.md to follow ISO 8601. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8278 Reviewed By: jay-zhuang Differential Revision: D28294022 fbshipit-source-id: 563f29c56143519b4a871df82a17dd0a168a578c	2021-05-07 16:16:30 -07:00
Andrew Kryczka	a639c02f8e	Allow applying `CompactionFilter` outside of compaction (#8243 ) Summary: From HISTORY.md release note: - Allow `CompactionFilter`s to apply in more table file creation scenarios such as flush and recovery. For compatibility, `CompactionFilter`s by default apply during compaction. Users can customize this behavior by overriding `CompactionFilterFactory::ShouldFilterTableFileCreation()`. - Removed unused structure `CompactionFilterContext` Pull Request resolved: https://github.com/facebook/rocksdb/pull/8243 Test Plan: added unit tests Reviewed By: pdillinger Differential Revision: D28088089 Pulled By: ajkr fbshipit-source-id: 0799be7908e3b39fea09fc3f1ab00e13ad817fae	2021-05-07 16:01:40 -07:00
Peter Dillinger	c26b75baa5	Deprecate obsolete "backupable db" from public APIs (#8274 ) Summary: An early design of BackupEngine used stackable DB, so I guess a DB had to opt-in to being backupable. Unfortunately the naming of that obsolete design still infects our public API and implementation. This change fixes the public API, with a deprecated backward-compatibility header. `BackupableDBOptions` is renamed to `BackupEngineOptions` (copy-replace in the public header) and backup_engine.h replaces backupable_db.h (present for backward compatibility). The only other change in backupable_db.h -> backup_engine.h is cleaning up headers. Later changes will fix the internal implementation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8274 Test Plan: The internal implementation of BackupEngine uses the name BackupEngineOptions, while the unit tests use the old name BackupableDBOptions. This gives me confidence that both still work. Reviewed By: mrambacher Differential Revision: D28259471 Pulled By: pdillinger fbshipit-source-id: a25dbe327b9772143488e7bb0ec7139ee42d0613	2021-05-07 13:53:15 -07:00
sdong	a4919d6b62	Cap automatic arena block size to 1 MB (#7907 ) Summary: Larger arena block size does provide the benefit of reducing allocation overhead, however it may cause other troubles. For example, allocator is more likely not to allocate them to physical memory and trigger page fault. Weighing the risk, we cap the arena block size to 1MB. Users can always use a larger value if they want. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7907 Test Plan: Run all existing tests Reviewed By: pdillinger Differential Revision: D26135269 fbshipit-source-id: b7f55afd03e6ee1d8715f90fa11b6c33944e9ea8	2021-05-07 13:15:34 -07:00
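The capping rule from this commit can be expressed in a few lines. In this sketch the function name is hypothetical, and deriving the automatic size as `write_buffer_size / 8` is an assumption about the default rather than something stated in the commit message; only the 1 MB cap and the "explicit values are left alone" behavior come from the text.

```cpp
#include <algorithm>
#include <cstddef>

// Cap applied only when the arena block size is chosen automatically.
constexpr std::size_t kMaxAutoArenaBlockSize = std::size_t{1} << 20;  // 1 MB

// Hypothetical helper: `configured == 0` means "pick automatically".
// The write_buffer_size / 8 default is an assumption made for illustration.
std::size_t EffectiveArenaBlockSize(std::size_t configured,
                                    std::size_t write_buffer_size) {
  if (configured != 0) {
    return configured;  // an explicit user value is not capped
  }
  return std::min(kMaxAutoArenaBlockSize, write_buffer_size / 8);
}
```

Under these assumptions a 64 MB write buffer gets a 1 MB arena block instead of 8 MB, while a 4 MB write buffer still gets 512 KB.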
Andrew Kryczka	0f42e50fec	Fix `GetLiveFiles()` returning OPTIONS-000000 (#8268 ) Summary: See release note in HISTORY.md. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8268 Test Plan: unit test repro Reviewed By: siying Differential Revision: D28227901 Pulled By: ajkr fbshipit-source-id: faf61d13b9e43a761e3d5dcf8203923126b51339	2021-05-05 12:54:46 -07:00
Peter Dillinger	3b981eaa1d	Fix use-after-free threading bug in ClockCache (#8261 ) Summary: In testing for https://github.com/facebook/rocksdb/issues/8225 I found cache_bench would crash with -use_clock_cache, as well as db_bench -use_clock_cache, but not single-threaded. Smaller cache size hits failure much faster. ASAN reported the failure as calling malloc_usable_size on the `key` pointer of a ClockCache handle after it was reportedly freed. On detailed inspection I found this bad sequence of operations for a cache entry: state=InCache=1,refs=1 [thread 1] Start ClockCacheShard::Unref (from Release, no mutex) [thread 1] Decrement ref count state=InCache=1,refs=0 [thread 1] Suspend before CalcTotalCharge (no mutex) [thread 2] Start UnsetInCache (from Insert, mutex held) [thread 2] clear InCache bit state=InCache=0,refs=0 [thread 2] Calls RecycleHandle (based on pre-updated state) [thread 2] Returns to Insert which calls Cleanup which deletes `key` [thread 1] Resume ClockCacheShard::Unref [thread 1] Read `key` in CalcTotalCharge To fix this, I've added a field to the handle to store the metadata charge so that we can efficiently remember everything we need from the handle in Unref. We must not read from the handle again if we decrement the count to zero with InCache=1, which means we don't own the entry and someone else could eject/overwrite it immediately. Note before this change, on amd64 sizeof(Handle) == 56 even though there are only 48 bytes of data. Grouping together the uint32_t fields would cut it down to 48, but I've added another uint32_t, which takes it back up to 56. Not a big deal. Also fixed DisownData to cooperate with ASAN as in LRUCache. 
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8261 Test Plan: Manual + adding use_clock_cache to db_crashtest.py Base performance ./cache_bench -use_clock_cache Complete in 17.060 s; QPS = 2458513 New performance ./cache_bench -use_clock_cache Complete in 17.052 s; QPS = 2459695 Any difference is easily buried in small noise. Crash test shows still more bug(s) in ClockCache, so I'm expecting to disable ClockCache from production code in a follow-up PR (if we can't find and fix the bug(s)) Reviewed By: mrambacher Differential Revision: D28207358 Pulled By: pdillinger fbshipit-source-id: aa7a9322afc6f18f30e462c75dbbe4a1206eb294	2021-05-04 22:18:00 -07:00
Peter Dillinger	d2ca04e3ed	Add more LSM info to FilterBuildingContext (#8246 ) Summary: Add `num_levels`, `is_bottommost`, and table file creation `reason` to `FilterBuildingContext`, in anticipation of more powerful Bloom-like filter support. To support this, added `is_bottommost` and `reason` to `TableBuilderOptions`, which allowed removing `reason` parameter from `rocksdb::BuildTable`. I attempted to remove `skip_filters` from `TableBuilderOptions`, because filter construction decisions should arise from options, not one-off parameters. I could not completely remove it because the public API for SstFileWriter takes a `skip_filters` parameter, and translating this into an option change would mean awkwardly replacing the table_factory if it is BlockBasedTableFactory with new filter_policy=nullptr option. I marked this public skip_filters option as deprecated because of this oddity. (skip_filters on the read side probably makes sense.) At least `skip_filters` is now largely hidden for users of `TableBuilderOptions` and is no longer used for implementing the optimize_filters_for_hits option. Bringing the logic for that option closer to handling of FilterBuildingContext makes it more obvious that these two are using the same notion of "bottommost." (Planned: configuration options for Bloom-like filters that generalize `optimize_filters_for_hits`) Recommended follow-up: Try to get away from "bottommost level" naming of things, which is inaccurate (see VersionStorageInfo::RangeMightExistAfterSortedRun), and move to "bottommost run" or just "bottommost." Pull Request resolved: https://github.com/facebook/rocksdb/pull/8246 Test Plan: extended an existing unit test to exercise and check various filter building contexts. 
Also, existing tests for optimize_filters_for_hits validate some of the "bottommost" handling, which is now closely connected to FilterBuildingContext::is_bottommost through TableBuilderOptions::is_bottommost Reviewed By: mrambacher Differential Revision: D28099346 Pulled By: pdillinger fbshipit-source-id: 2c1072e29c24d4ac404c761a7b7663292372600a	2021-04-30 13:50:13 -07:00
Peter Dillinger	85becd94c1	Refactor: use TableBuilderOptions to reduce parameter lists (#8240 ) Summary: Greatly reduced the not-quite-copy-paste giant parameter lists of rocksdb::NewTableBuilder, rocksdb::BuildTable, BlockBasedTableBuilder::Rep ctor, and BlockBasedTableBuilder ctor. Moved weird separate parameter `uint32_t column_family_id` of TableFactory::NewTableBuilder into TableBuilderOptions. Re-ordered parameters to TableBuilderOptions ctor, so that `uint64_t target_file_size` is not randomly placed between uint64_t timestamps (was easy to mix up). Replaced a couple of fields of BlockBasedTableBuilder::Rep with a FilterBuildingContext. The motivation for this change is making it easier to pass along more data into new fields in FilterBuildingContext (follow-up PR). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8240 Test Plan: ASAN make check Reviewed By: mrambacher Differential Revision: D28075891 Pulled By: pdillinger fbshipit-source-id: fddb3dbb8260a0e8bdcbb51b877ebabf9a690d4f	2021-04-29 07:00:50 -07:00
Akanksha Mahajan	a0e0feca62	Improve BlockPrefetcher to prefetch only for sequential scans (#7394 ) Summary: BlockPrefetcher is used by iterators to prefetch data if they anticipate more data to be used in future and this is valid for forward sequential scans. But BlockPrefetcher tracks only num_file_reads_ and not if reads are sequential. This presents problem for MultiGet with large number of keys when it reseeks index iterator and data block. FilePrefetchBuffer can end up doing large readahead for reseeks as readahead size increases exponentially once readahead is enabled. Same issue is with BlockBasedTableIterator. Add previous length and offset read as well in BlockPrefetcher (creates FilePrefetchBuffer) and FilePrefetchBuffer (does prefetching of data) to determine if reads are sequential and then prefetch. Update the last block read after cache hit to take reads from cache also in account. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7394 Test Plan: Add new unit test case Reviewed By: anand1976 Differential Revision: D23737617 Pulled By: akankshamahajan15 fbshipit-source-id: 8e6917c25ed87b285ee495d1b68dc623d71205a3	2021-04-28 12:53:46 -07:00
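A minimal sketch of the sequential-read check described in this commit: remember where the previous read ended, and only grow readahead when the next read starts exactly there. The class below is hypothetical and does not mirror the BlockPrefetcher or FilePrefetchBuffer interfaces; the initial and maximum readahead sizes are caller-supplied placeholders.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical tracker: readahead is enabled and doubled only while reads
// stay sequential, and is reset by any reseek.
class SequentialReadTracker {
 public:
  SequentialReadTracker(std::size_t initial_readahead,
                        std::size_t max_readahead)
      : initial_(initial_readahead), max_(max_readahead) {}

  // Records a read of `len` bytes at `offset` and returns the readahead size
  // to use for it (0 means "do not prefetch").
  std::size_t OnRead(uint64_t offset, std::size_t len) {
    const bool sequential = has_prev_ && offset == prev_end_;
    prev_end_ = offset + len;
    has_prev_ = true;
    if (!sequential) {
      readahead_ = 0;  // a reseek cancels any accumulated readahead
      return 0;
    }
    readahead_ = (readahead_ == 0) ? initial_ : std::min(max_, readahead_ * 2);
    return readahead_;
  }

 private:
  std::size_t initial_;
  std::size_t max_;
  std::size_t readahead_ = 0;
  uint64_t prev_end_ = 0;
  bool has_prev_ = false;
};
```

With this shape, a MultiGet that jumps between distant blocks never triggers the exponential growth, because each jump resets the tracker.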
Zhichao Cao	09a9ec3ac0	Fix the false positive alert of CF consistency check in WAL recovery (#8207 ) Summary: In current RocksDB, when recovering information from the WAL, we do the consistency check for each column family when one WAL file is corrupted and PointInTimeRecovery is set. However, it reports a false positive alert of "SST file is ahead of WALs" when the current log number of one of the CFs is greater than the corrupted WAL number (the CF contains data beyond the corrupted WAL) due to a new column family creation during flush. In this case, a new (empty) WAL is created during a flush. Also, for some reason (e.g., a storage issue, or a crash before SyncCloseLog is called), the old WAL is corrupted. The new CF has no data and therefore has no consistency issue. Fix: when checking cfd->GetLogNumber() > corrupted_wal_number also check cfd->GetLiveSstFilesSize() > 0. So the CFs with no SST file data will skip the check here. Note a potential inconsistency that this fix ignores: an empty CF can also be caused by write+delete. In this case, after flush, no SST files are generated. However, this CF still has its log in the WAL. When the WAL is corrupted, the DB might be inconsistent. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8207 Test Plan: added unit test, make crash_test Reviewed By: riversand963 Differential Revision: D27898839 Pulled By: zhichao-cao fbshipit-source-id: 931fc2d8b92dd00b4169bf84b94e712fd688a83e	2021-04-22 10:28:37 -07:00
Akanksha Mahajan	596e9008e4	Stall writes in WriteBufferManager when memory_usage exceeds buffer_size (#7898 ) Summary: When WriteBufferManager is shared across DBs and column families to maintain memory usage under a limit, OOMs have been observed when flush cannot finish but writes continuously insert to memtables. In order to avoid OOMs, when memory usage goes beyond buffer_limit_ and DBs try to write, this change will stall incoming writers until flush is completed and memory_usage drops. Design: Stall condition: When total memory usage exceeds WriteBufferManager::buffer_size_ (memory_usage() >= buffer_size_), WriteBufferManager::ShouldStall() returns true. DBImpl first blocks incoming/future writers by calling write_thread_.BeginWriteStall() (which adds a dummy stall object to the writer's queue). Then the DB is blocked in state State::Blocked (the current write doesn't go through). The WBMStallInterface object maintained by every DB instance is added to the queue of WriteBufferManager. If multiple DBs try to write during this stall, they will also be blocked when the check WriteBufferManager::ShouldStall() returns true. End Stall condition: When flush is finished and memory usage goes down, the stall will end only if memory waiting to be flushed is less than buffer_size/2. This lower limit gives time for flush to complete and avoids continuous stalling if memory usage remains close to buffer_size. WriteBufferManager::EndWriteStall() is called, which removes all instances from its queue and signals them to continue. Their state is changed to State::Running and they are unblocked. DBImpl then signals all incoming writers of that DB to continue by calling write_thread_.EndWriteStall() (which removes the dummy stall object from the queue). Each DB instance creates a WBMStallInterface, which is an interface to block and signal DBs during a stall. When a DB needs to be blocked or signalled by WriteBufferManager, the state_for_wbm_ state is changed accordingly (RUNNING or BLOCKED). 
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7898 Test Plan: Added a new test db/db_write_buffer_manager_test.cc Reviewed By: anand1976 Differential Revision: D26093227 Pulled By: akankshamahajan15 fbshipit-source-id: 2bbd982a3fb7033f6de6153aa92a221249861aae	2021-04-21 13:54:02 -07:00
Peter Dillinger	95f6add746	Revert Ribbon starting level support from #8198 (#8212 ) Summary: This partially reverts commit `10196d7edc`. The problem with this change is because of important filter use cases: FIFO compaction and SST writer. FIFO "compaction" always uses level 0 so would only use Ribbon filters if specifically including level 0 for the Ribbon filter policy. SST writer sets level_at_creation=-1 to indicate unknown level, and this would be treated the same as level 0 unless fixed. We are keeping the part about committing to permanent schema, which is only changes to API comments and HISTORY.md. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8212 Test Plan: CI Reviewed By: jay-zhuang Differential Revision: D27896468 Pulled By: pdillinger fbshipit-source-id: 50a775f7cba5d64fb729d9b982e355864020596e	2021-04-20 19:46:40 -07:00
Andrew Kryczka	905dd17b35	Fix seqno in ingested file boundary key metadata (#8209 ) Summary: Fixes https://github.com/facebook/rocksdb/issues/6245. Adapted from https://github.com/facebook/rocksdb/issues/8201 and https://github.com/facebook/rocksdb/issues/8205. Previously we were writing the ingested file's smallest/largest internal keys with sequence number zero, or `kMaxSequenceNumber` in case of range tombstone. The former (sequence number zero) is incorrect and can lead to files being incorrectly ordered. The fix in this PR is to overwrite boundary keys that have sequence number zero with the ingested file's assigned sequence number. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8209 Test Plan: repro unit test Reviewed By: riversand963 Differential Revision: D27885678 Pulled By: ajkr fbshipit-source-id: 4a9f2c6efdfff81c3a9923e915ea88b250ee7b6a	2021-04-20 14:00:21 -07:00
Levi Tamasi	1b99947e99	Mention PR 8206 in HISTORY.md (#8210 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8210 Reviewed By: akankshamahajan15 Differential Revision: D27887612 Pulled By: ltamasi fbshipit-source-id: 0db8d0b6047334dc47fe30a98804449043454386	2021-04-20 12:07:40 -07:00
Yanqin Jin	a376c22066	Handle rename() failure in non-local FS (#8192 ) Summary: In a distributed environment, a file `rename()` operation can succeed on server (remote) side, but the client can somehow return non-ok status to RocksDB. Possible reasons include network partition, connection issue, etc. This happens in `rocksdb::SetCurrentFile()`, which can be called in `LogAndApply() -> ProcessManifestWrites()` if RocksDB tries to switch to a new MANIFEST. We currently always delete the new MANIFEST if an error occurs. This is problematic in distributed world. If the server-side successfully updates the CURRENT file via renaming, then a subsequent `DB::Open()` will try to look for the new MANIFEST and fail. As a fix, we can track the execution result of IO operations on the new MANIFEST. - If IO operations on the new MANIFEST fail, then we know the CURRENT must point to the original MANIFEST. Therefore, it is safe to remove the new MANIFEST. - If IO operations on the new MANIFEST all succeed, but somehow we end up in the clean up code block, then we do not know whether CURRENT points to the new or old MANIFEST. (For local POSIX-compliant FS, it should still point to old MANIFEST, but it does not matter if we keep the new MANIFEST.) Therefore, we keep the new MANIFEST. - Any future `LogAndApply()` will switch to a new MANIFEST and update CURRENT. - If process reopens the db immediately after the failure, then the CURRENT file can point to either the new MANIFEST or the old one, both of which exist. Therefore, recovery can succeed and ignore the other. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8192 Test Plan: make check Reviewed By: zhichao-cao Differential Revision: D27804648 Pulled By: riversand963 fbshipit-source-id: 9c16f2a5ce41bc6aadf085e48449b19ede8423e4	2021-04-19 18:11:13 -07:00
Akanksha Mahajan	531a5f88a1	Update release version to 6.20 (#8199 ) Summary: Update release version to 6.20 Pull Request resolved: https://github.com/facebook/rocksdb/pull/8199 Test Plan: No code change Reviewed By: ajkr Differential Revision: D27838750 Pulled By: akankshamahajan15 fbshipit-source-id: f02f722fc6bdd37d626d47a0e932bbecea3507a8	2021-04-16 20:15:36 -07:00
Peter Dillinger	10196d7edc	Ribbon long-term support, starting level support (#8198 ) Summary: Since the Ribbon filter schema seems good (compatible back to 6.15.0), this change commits to long term support of the SST schema, even though we expect the API for enabling Ribbon to change (still called NewExperimentalRibbonFilterPolicy). This also adds support for "hybrid" configuration in which some levels use Bloom (higher levels, lower numbered) for speed and the rest use Ribbon (lower levels, higher numbered) for memory space efficiency. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8198 Test Plan: unit test added, crash test support Reviewed By: jay-zhuang Differential Revision: D27831232 Pulled By: pdillinger fbshipit-source-id: 90e528677689474d293ed6710b42ba89fbd5b5ab	2021-04-16 15:43:08 -07:00
Akanksha Mahajan	296b47db25	Extend file_checksum_dump ldb command and DB::GetLiveFilesChecksumInfo to blob files (#8179 ) Summary: Extend the DB::GetLiveFilesChecksumInfo API to blob files. This API is also used by the file_checksum_dump ldb command to dump checksum of SST files which now also dumps blob files checksum. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8179 Test Plan: Add new unit test Reviewed By: zhichao-cao Differential Revision: D27714965 Pulled By: akankshamahajan15 fbshipit-source-id: d8b7343ea845a64c83800336d88cced7152a8c92	2021-04-15 09:38:13 -07:00
Justin Chapman	d89483098f	Assert unlimited max_open_files for FIFO compaction. (#8172 ) Summary: Resolves https://github.com/facebook/rocksdb/issues/8014 - Add an assertion on `DB::Open` to ensure `db_options.max_open_files` is unlimited if FIFO Compaction is being used. - This is to align with what the docs mention and to prevent premature data deletion. - Update tests to work with this assertion. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8172 Test Plan: ```bash $ make check -j$(nproc) Generated TARGETS Summary: - 6 libs - 0 binarys - 180 tests ``` Reviewed By: ajkr Differential Revision: D27768792 Pulled By: thejchap fbshipit-source-id: cf6350535e3a3577fec72bcba75b3c094dc7a6f3	2021-04-14 12:05:47 -07:00
Yanqin Jin	fd00f39f97	Disable IOStatsContext/PerfContext if no thread local (#8117 ) Summary: Before this PR, `get_iostats_context()` will silently return a nullptr if no thread_local support is detected. This can be the result of build_detect_platform's failure to compile the simple code snippet on certain platforms, as reported in https://github.com/facebook/mysql-5.6/issues/904. To be safe, we should fail the compilation if user does not opt out IOStatsContext and ROCKSDB_SUPPORT_THREAD_LOCAL is not defined. If RocksDB relies on c++11, can we just always use thread_local? It turns out there might be performance concerns (https://github.com/facebook/rocksdb/issues/5774), which is beyond the scope of this PR. We can revisit this later. Here, we stick to the original impl. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8117 Reviewed By: ajkr Differential Revision: D27356847 Pulled By: riversand963 fbshipit-source-id: f7d5776842277598d8341b955febb601946801ae	2021-04-13 07:56:59 -07:00
Peter Dillinger	bb75092574	Misc Backup API enhancements (#8170 ) Summary: * CreateNewBackup(WithMetadata) returning the BackupID of new backup through optional new output param. This is especially useful with the new multithreading support, so that you can transactionally determine the ID of a backup you create. * GetBackupInfo / GetLatestBackupInfo for individual backups, so that you don't have to comb through a vector of backups if you don't want to. Updated HISTORY.md (including re: BlobDB support as new feature) Pull Request resolved: https://github.com/facebook/rocksdb/pull/8170 Test Plan: Added test logic to existing tests, to minimize increase in cost of running tests Reviewed By: zhichao-cao Differential Revision: D27680410 Pulled By: pdillinger fbshipit-source-id: 1fc45b73d81aae293ccd4a43d9583d7fd915d3eb	2021-04-12 11:00:47 -07:00
Giuseppe Ottaviano	48cd7a3aae	Fix flush reason attribution (#8150 ) Summary: Current flush reason attribution is misleading or incorrect (depending on what the original intention was): - Flush due to WAL reaching its maximum size is attributed to `kWriteBufferManager` - Flushes due to full write buffer and write buffer manager are not distinguishable, both are attributed to `kWriteBufferFull` This changes the first to a new flush reason `kWALFull`, and splits the second between `kWriteBufferManager` and `kWriteBufferFull`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8150 Reviewed By: zhichao-cao Differential Revision: D27569645 Pulled By: ot fbshipit-source-id: 7e3c8ca186a6e71976e6b8e937297eebd4b769cc	2021-04-07 23:18:37 -07:00
Akanksha Mahajan	d52b520d51	Integrated BlobDB for backup/restore support (#8129 ) Summary: Add backup/restore support for blob files, as for table files. Because DB session ID is currently not supported for blob files (there is no place to store it in the header), blob files use the kLegacyCrc32cAndFileSize naming scheme even if share_files_with_checksum_naming is set to kUseDbSessionId. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8129 Test Plan: Add new unit tests Reviewed By: ltamasi Differential Revision: D27408510 Pulled By: akankshamahajan15 fbshipit-source-id: b27434d189a639ef3e6ad165c61a143a2daaf06e	2021-04-07 13:38:54 -07:00
Peter Dillinger	879357fdb0	Make backups openable as read-only DBs (#8142 ) Summary: A current limitation of backups is that you don't know the exact database state of when the backup was taken. With this new feature, you can at least inspect the backup's DB state without restoring it by opening it as a read-only DB. Rather than add something like OpenAsReadOnlyDB to the BackupEngine API, which would inhibit opening stackable DB implementations read-only (if/when their APIs support it), we instead provide a DB name and Env that can be used to open as a read-only DB. Possible follow-up work: * Add a version of GetBackupInfo for a single backup. * Let CreateNewBackup return the BackupID of the newly-created backup. Implementation details: Refactored ChrootFileSystem to split off new base class RemapFileSystem, which allows more general remapping of files. We use this base class to implement BackupEngineImpl::RemapSharedFileSystem. To minimize API impact, I decided to just add these fields `name_for_open` and `env_for_open` to those set by GetBackupInfo when include_file_details=true. Creating the RemapSharedFileSystem adds a bit to the memory consumption, perhaps unnecessarily in some cases, but this has been mitigated by (a) only initialize the RemapSharedFileSystem lazily when GetBackupInfo with include_file_details=true is called, and (b) using the existing `shared_ptr<FileInfo>` objects to hold most of the mapping data. To enhance API safety, RemapSharedFileSystem is wrapped by new ReadOnlyFileSystem which rejects any attempts to write. This uncovered a couple of places in which DB::OpenForReadOnly would write to the filesystem, so I fixed these. Added a release note because this affects logging. Additional minor refactoring in backupable_db.cc to support the new functionality. 
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8142 Test Plan: new test (run with ASAN and UBSAN), added to stress test and ran it for a while with amplified backup_one_in Reviewed By: ajkr Differential Revision: D27535408 Pulled By: pdillinger fbshipit-source-id: 04666d310aa0261ef6b2385c43ca793ce1dfd148	2021-04-06 14:37:53 -07:00
Yanqin Jin	09528f9fa1	Fix a bug for SeekForPrev with partitioned filter and prefix (#8137 ) Summary: According to https://github.com/facebook/rocksdb/issues/5907, each filter partition "should include the bloom of the prefix of the last key in the previous partition" so that SeekForPrev() in prefix mode can return correct result. The prefix of the last key in the previous partition does not necessarily have the same prefix as the first key in the current partition. Regardless of the first key in current partition, the prefix of the last key in the previous partition should be added. The existing code, however, does not follow this. Furthermore, there is another issue: when finishing current filter partition, `FullFilterBlockBuilder::AddPrefix()` is called for the first key in next filter partition, which effectively overwrites `last_prefix_str_` prematurely. Consequently, when the filter block builder proceeds to the next partition, `last_prefix_str_` will be the prefix of its first key, leaving no way of adding the bloom of the prefix of the last key of the previous partition. Prefix extractor is FixedLength.2. ``` [ filter part 1 ] [ filter part 2 ] abc d ``` When SeekForPrev("abcd"), checking the filter partition will land on filter part 2 because "abcd" > "abc" but smaller than "d". If the filter in filter part 2 happens to return false for the test for "ab", then SeekForPrev("abcd") will build incorrect iterator tree in non-total-order mode. Also fix a unit test which starts to fail following this PR. `InDomain` should not fail due to assertion error when checking on an arbitrary key. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8137 Test Plan: ``` make check ``` Without this fix, the following command will fail pretty soon. 
``` ./db_stress --acquire_snapshot_one_in=10000 --avoid_flush_during_recovery=0 \ --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 \ --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=17 \ --bottommost_compression_type=disable --cache_index_and_filter_blocks=1 --cache_size=1048576 \ --checkpoint_one_in=0 --checksum_type=kxxHash64 --clear_column_family_one_in=0 \ --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_ttl=0 \ --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 \ --compression_parallel_threads=1 --compression_type=zstd --compression_zstd_max_train_bytes=0 \ --continuous_verification_interval=0 --db=/dev/shm/rocksdb/rocksdb_crashtest_whitebox \ --db_write_buffer_size=8388608 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --enable_blob_files=0 \ --enable_compaction_filter=0 --enable_pipelined_write=1 --file_checksum_impl=big --flush_one_in=1000000 \ --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 \ --get_sorted_wal_files_one_in=0 --index_block_restart_interval=4 --index_type=2 --ingest_external_file_one_in=0 \ --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True \ --log2_keys_per_lock=10 --long_running_snapshots=1 --mark_for_compaction_one_file_in=0 \ --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000000 --max_key_len=3 \ --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=3 \ --max_write_buffer_size_to_maintain=8388608 --memtablerep=skip_list --mmap_read=1 --mock_direct_io=False \ --nooverwritepercent=0 --open_files=500000 --ops_per_thread=20000000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=1 --partition_pinning=0 --pause_background_one_in=1000000 \ --periodic_compaction_seconds=0 --prefixpercent=5 --progress_reports=0 
--read_fault_one_in=0 --read_only=0 \ --readpercent=45 --recycle_log_file_num=0 --reopen=20 --secondary_catch_up_one_in=0 \ --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 \ --sst_file_manager_bytes_per_truncate=0 --subcompactions=2 --sync=0 --sync_fault_injection=False \ --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=0 \ --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_blob_db=0 --use_block_based_filter=0 \ --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 \ --use_multiget=0 --use_ribbon_filter=0 --use_txn=0 --user_timestamp_size=8 --verify_checksum=1 \ --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --write_buffer_size=4194304 \ --write_dbid_to_manifest=1 --writepercent=35 ``` Reviewed By: pdillinger Differential Revision: D27553054 Pulled By: riversand963 fbshipit-source-id: 60e391e4a2d8d98a9a3172ec5d6176b90ec3de98	2021-04-06 12:14:08 -07:00
Akanksha Mahajan	689b13e639	Add request_id in IODebugContext. (#8045 ) Summary: Add request_id in IODebugContext, which will be populated by the underlying FileSystem for IOTracing purposes. Update IOTracer to trace request_id in the tracing records. Provide the API IODebugContext::SetRequestId, which sets the request_id and enables tracing for request_id. The API hides the implementation, and the underlying file system needs to call this API directly. Update the DB::StartIOTrace API and remove the redundant Env* from the arguments, as it is not used and the DB already has an Env that is passed down to IOTracer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8045 Test Plan: Update unit test. Differential Revision: D26899871 Pulled By: akankshamahajan15 fbshipit-source-id: 56adef52ee5af0fb3060b607c3af1ec01635fa2b	2021-04-01 13:14:51 -07:00
Andrew Kryczka	c43a37a922	Fix compression dictionary sampling with dedicated range tombstone SSTs (#8141 ) Summary: Return early in case there are zero data blocks when `BlockBasedTableBuilder::EnterUnbuffered()` is called. This crash can only be triggered by applying dictionary compression to SST files that contain only range tombstones. It cannot be triggered by a low buffer limit alone since we only consider entering unbuffered mode after buffering a data block causing the limit to be breached, or `Finish()`ing the file. It also cannot be triggered by a totally empty file because those go through `Abandon()` rather than `Finish()` so unbuffered mode is never entered. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8141 Test Plan: added a unit test that repro'd the "Floating point exception" Reviewed By: riversand963 Differential Revision: D27495640 Pulled By: ajkr fbshipit-source-id: a463cfba476919dc5c5c380800a75a86c31ffa23	2021-04-01 05:08:17 -07:00
Andrew Kryczka	1ba2b8a568	Add sample_for_compression results to table properties (#8139 ) Summary: Added `TableProperties::{fast,slow}_compression_estimated_data_size`. These properties are present in block-based tables when `ColumnFamilyOptions::sample_for_compression > 0` and the necessary compression library is supported when the file is generated. They contain estimates of what `TableProperties::data_size` would be if the "fast"/"slow" compression library had been used instead. One limitation is we do not record exactly which "fast" (LZ4 or Snappy) or "slow" (ZSTD or Zlib) compression library produced the result. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8139 Test Plan: - new unit test - ran `db_bench` with `sample_for_compression=1`; verified the `data_size` property matches the `{slow,fast}_compression_estimated_data_size` when the same compression type is used for the output file compression and the sampled compression Reviewed By: riversand963 Differential Revision: D27454338 Pulled By: ajkr fbshipit-source-id: 9529293de93ddac7f03b2e149d746e9f634abac4	2021-03-31 18:21:50 -07:00
sherriiiliu	e6534900bd	Fix possible hang issue in ~DBImpl() when flush is scheduled in LOW pool (#8125 ) Summary: In DBImpl::CloseHelper, we wait for bg_compaction_scheduled_ and bg_flush_scheduled_ to drop to 0. Unschedule is called prior to cancel any unscheduled flushes/compactions. It is assumed that anything in the high priority is a flush, and anything in the low priority pool is a compaction. This assumption, however, is broken when the high-pri pool is full. As a result, bg_compaction_scheduled_ can go < 0 and bg_flush_scheduled_ will remain > 0 and DB can be in hang state. The fix is, we decrement the `bg_{flush,compaction,bottom_compaction}_scheduled_` inside the `Unschedule{Flush,Compaction,BottomCompaction}Callback()`s. DB `mutex_` will make the counts atomic in `Unschedule`. Related discussion: https://github.com/facebook/rocksdb/issues/7928 Pull Request resolved: https://github.com/facebook/rocksdb/pull/8125 Test Plan: Added new test case which hangs without the fix. Reviewed By: jay-zhuang Differential Revision: D27390043 Pulled By: ajkr fbshipit-source-id: 78a367fba9a59ac5607ad24bd1c46dc16d5ec110	2021-03-30 18:35:20 -07:00
Peter Dillinger	ec11c23caa	Add thread safety to BackupEngine, explain more (#8115 ) Summary: BackupEngine previously had unclear but strict concurrency requirements that the API user must follow for safe use. Now we make that clear, by separating operations into "Read," "Append," and "Write" operations, and specifying which combinations are safe across threads on the same BackupEngine object (previously none; now all, using a read-write lock), and which are safe across different BackupEngine instances open on the same backup_dir. The changes to backupable_db.h should be backward compatible. It is mostly about eliminating copies of what should be the same function and (unsurprisingly) useful documentation comments were often placed on only one of the two copies. With the re-organization, we are also grouping different categories of operations. In the future we might add BackupEngineReadAppendOnly, but that didn't seem necessary. To mark API Read operations 'const', I had to mark some implementation functions 'const' and some fields mutable. Functional changes: * Added RWMutex locking around public API functions to implement thread safety on a single object. To avoid future bugs, this is another internal class layered on top (removing many "override" in BackupEngineImpl). It would be possible to allow more concurrency between operations, rather than mutual exclusion, but IMHO not worth the work. * Fixed a race between Open() (Initialize()) and CreateNewBackup() for different objects on the same backup_dir, where Initialize() could delete the temporary meta file created during CreateNewBackup(). (This was found by the new test.) Also cleaned up a couple of "status checked" TODOs, and improved a checksum mismatch error message to include involved files. Potential follow-up work: * CreateNewBackup has an API wart because it doesn't tell you the BackupID it just created, which makes it of limited use in a multithreaded setting. 
* We could also consider a Refresh() function to catch up to changes made from another BackupEngine object to the same dir. * Use a lock file to prevent multiple writer BackupEngines, but this won't work on remote filesystems not supporting lock files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8115 Test Plan: new mini-stress test in backup unit tests, run with gcc, clang, ASC, TSAN, and UBSAN, 100 iterations each. Reviewed By: ajkr Differential Revision: D27347589 Pulled By: pdillinger fbshipit-source-id: 28d82ed2ac672e44085a739ddb19d297dad14b15	2021-03-29 22:41:51 -07:00
Jay Zhuang	a037bb35e9	Compaction should not move data to up level (#8116 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8116 Reviewed By: ajkr, mrambacher Differential Revision: D27353828 Pulled By: jay-zhuang fbshipit-source-id: 42703fb01b04d92cc097d7979e64798448852e88	2021-03-29 17:10:42 -07:00
wolfkdy	63748c2204	On ARM platform, use yield op to relax CPU. See issue 7376 (#7438 ) Summary: see https://github.com/facebook/rocksdb/issues/7376. The `wfe` op on ARM platform is not suitable to relax CPU. Use `yield` op. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7438 Reviewed By: riversand963 Differential Revision: D24063427 Pulled By: jay-zhuang fbshipit-source-id: b0ebc5590d7555bd21b30f15cd59f84dc006367a	2021-03-26 18:13:24 -07:00
Andrew Kryczka	c20a7cd6c7	Apply `sample_for_compression` to all block-based tables (#8105 ) Summary: Previously it only applied to block-based tables generated by flush. This restriction was undocumented and blocked a new use case. Now compression sampling applies to all block-based tables we generate when it is enabled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8105 Test Plan: new unit test Reviewed By: riversand963 Differential Revision: D27317275 Pulled By: ajkr fbshipit-source-id: cd9fcc5178d6515e8cb59c6facb5ac01893cb5b0	2021-03-25 15:00:45 -07:00
Jay Zhuang	45c65d6dcf	Use thread-safe `strerror_r()` to get error message (#8087 ) Summary: `strerror()` is not thread-safe, using `strerror_r()` instead. The API could be different on the different platforms, used the code from `0deef031cb/folly/String.cpp (L457)` Pull Request resolved: https://github.com/facebook/rocksdb/pull/8087 Reviewed By: mrambacher Differential Revision: D27267151 Pulled By: jay-zhuang fbshipit-source-id: 4b8856d1ec069d5f239b764750682c56e5be9ddb	2021-03-24 23:07:27 -07:00
Zhichao Cao	7457c7cd00	Update release version to 6.19 (#8083 ) Summary: Update release version to 6.19 Pull Request resolved: https://github.com/facebook/rocksdb/pull/8083 Test Plan: no code change Reviewed By: riversand963 Differential Revision: D27222083 Pulled By: zhichao-cao fbshipit-source-id: 94b49997019347e6e6a9e341837f4f9d3149428c	2021-03-21 18:33:46 -07:00
Zhichao Cao	dd0447ae2c	Add new Append API with DataVerificationInfo to Env WritableFile (#8071 ) Summary: Add the new Append and PositionedAppend API to env WritableFile. User is able to benefit from the write checksum handoff API when using the legacy Env classes. FileSystem already implemented the checksum handoff API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8071 Test Plan: make check, added new unit test. Reviewed By: anand1976 Differential Revision: D27177043 Pulled By: zhichao-cao fbshipit-source-id: 430c8331fc81099fa6d00f4fff703b68b9e8080e	2021-03-19 11:44:13 -07:00
Zhichao Cao	c810947184	Separate handling of WAL Sync io error with SST flush io error (#8049 ) Summary: In the previous codebase, if WAL is used, all retryable IO errors are treated as hard errors, so writes are stalled. In this PR, the retryable IO error from WAL sync is separated from the SST file flush IO error. If WAL sync is ok and a retryable IO error only happens during SST flush, the error is mapped to a soft error, so the user can continue to insert to the memtable and append to the WAL. Also resolve the bug that, if WAL sync fails, the memtable status is not rolled back because PickMemtable is called earlier than SyncClosedLog is called and checked. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8049 Test Plan: added new unit test, make check Reviewed By: anand1976 Differential Revision: D26965529 Pulled By: zhichao-cao fbshipit-source-id: f5fecb66602212523c92ee49d7edcb6065982410	2021-03-18 14:33:16 -07:00
Peter Dillinger	e7a60d01b2	Revamp WriteController (#8064 ) Summary: WriteController had a number of issues: * It could introduce a delay of 1ms even if the write rate never exceeded the configured delayed_write_rate. * The DB-wide delayed_write_rate could be exceeded in a number of ways with multiple column families: * Wiping all pending delay "debts" when another column family joins the delay with GetDelayToken(). * Resetting last_refill_time_ to (now + sleep amount) means each column family can write with delayed_write_rate for large writes. * Updating bytes_left_ for a partial refill without updating last_refill_time_ would essentially give out random bonuses, especially to medium-sized writes. Now the code is much simpler, with these issues fixed. See comments in the new code and new (replacement) tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8064 Test Plan: new tests, better than old tests Reviewed By: mrambacher Differential Revision: D27064936 Pulled By: pdillinger fbshipit-source-id: 497c23fe6819340b8f3d440bd634d8a2bc47323f	2021-03-18 09:47:31 -07:00
Zhichao Cao	08ec5e7321	Add the statistics and info log for Error handler (#8050 ) Summary: Add statistics and info log for error handler: counters for bg error, bg io error, bg retryable io error, auto resume, auto resume total retry, and auto resume success; histogram for auto resume retry count in each recovery call. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8050 Test Plan: make check and add test to error_handler_fs_test Reviewed By: anand1976 Differential Revision: D26990565 Pulled By: zhichao-cao fbshipit-source-id: 49f71e8ea4e9db8b189943976404205b56ab883f	2021-03-17 22:38:13 -07:00
Akanksha Mahajan	27d57a035e	Use SST file manager to track blob files as well (#8037 ) Summary: Extend support to track blob files in SST File manager. This PR notifies SstFileManager via OnAddFile whenever a new blob file is created and via OnDeleteFile whenever an obsolete blob file is deleted, and deletes the file via ScheduleFileDeletion. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8037 Test Plan: Add new unit tests Reviewed By: ltamasi Differential Revision: D26891237 Pulled By: akankshamahajan15 fbshipit-source-id: 04c69ccfda2a73782fd5c51982dae58dd11979b6	2021-03-17 20:44:49 -07:00
Mark Callaghan	326670d265	Add new db_bench --benchmarks options for controlling compaction (#8027 ) Summary: The new options are: * compact0 - compact L0 into L1 using one thread * compact1 - compact L1 into L2 using one thread * flush - flush memtable * waitforcompaction - wait for compaction to finish These are useful for reproducible benchmarks to help get the LSM tree shape into a deterministic state. I wrote about this at: http://smalldatum.blogspot.com/2021/02/read-only-benchmarks-with-lsm-are.html Pull Request resolved: https://github.com/facebook/rocksdb/pull/8027 Reviewed By: riversand963 Differential Revision: D27053861 Pulled By: ajkr fbshipit-source-id: 1646f35584a3db03740fbeb47d91c3f00fb35d6e	2021-03-17 09:12:27 -07:00
Peter Dillinger	01c2ec3fcb	Add ROCKSDB_GTEST_BYPASS (#8048 ) Summary: This is for cases that do not meet the Facebook criteria for SKIP (see new comments). Also made ROCKSDB_GTEST_{SKIP,BYPASS} print the message because gtest doesn't ever seem to. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8048 Test Plan: manual inspection of ./ribbon_test output, CI Reviewed By: mrambacher Differential Revision: D26953688 Pulled By: pdillinger fbshipit-source-id: c914eaffe7d419db6ab90a193d474531e23582e5	2021-03-12 16:02:06 -08:00
Peter Dillinger	589ea6bec2	Add BackupEngine API for backup file details (#8042 ) Summary: This API can be used for things like determining how much space can be freed up by deleting a particular backup, etc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8042 Test Plan: validation of the API added to many existing backup unit tests Reviewed By: mrambacher Differential Revision: D26936577 Pulled By: pdillinger fbshipit-source-id: f0bbd90f0917b9781a6837652fb4616d9247816a	2021-03-12 11:03:54 -08:00
Yanqin Jin	82b3888433	Enable backward iterator for keys with user-defined timestamp (#8035 ) Summary: This PR does the following: - Enable backward iteration for keys with user-defined timestamp. Note that merge, single delete, range delete are not supported yet. - Introduces a new helper API `Comparator::EqualWithoutTimestamp()`. - Fix a typo in `SetTimestamp()`. - Add/update unit tests Run db_bench (built with DEBUG_LEVEL=0) to demonstrate that no overhead is introduced for CPU-intensive workloads with a lot of `Prev()`. Also provided results of iterating keys with timestamps. 1. Disable timestamp, run: ``` ./db_bench -db=/dev/shm/rocksdb -disable_wal=1 -benchmarks=fillseq,seekrandom[-W1-X6] -reverse_iterator=1 -seek_nexts=5 ``` Results: > Baseline > - seekrandom [AVG 6 runs] : 96115 ops/sec; 53.2 MB/sec > - seekrandom [MEDIAN 6 runs] : 98075 ops/sec; 54.2 MB/sec > > This PR > - seekrandom [AVG 6 runs] : 95521 ops/sec; 52.8 MB/sec > - seekrandom [MEDIAN 6 runs] : 96338 ops/sec; 53.3 MB/sec 2. Enable timestamp, run: ``` ./db_bench -user_timestamp_size=8 -db=/dev/shm/rocksdb -disable_wal=1 -benchmarks=fillseq,seekrandom[-W1-X6] -reverse_iterator=1 -seek_nexts=5 ``` Result: > Baseline: not supported > > This PR > - seekrandom [AVG 6 runs] : 90514 ops/sec; 50.1 MB/sec > - seekrandom [MEDIAN 6 runs] : 90834 ops/sec; 50.2 MB/sec Pull Request resolved: https://github.com/facebook/rocksdb/pull/8035 Reviewed By: ltamasi Differential Revision: D26926668 Pulled By: riversand963 fbshipit-source-id: 95330cc2242397c03e09d29e5417dfb0adc98ef5	2021-03-10 11:15:46 -08:00
Peter Dillinger	847ca9f964	Make default share_files_with_checksum=true (#8020 ) Summary: New comment for share_files_with_checksum: // Only used if share_table_files is set to true. Setting to false is // DEPRECATED and potentially dangerous because in that case BackupEngine // can lose data if backing up databases with distinct or divergent // history, for example if restoring from a backup other than the latest, // writing to the DB, and creating another backup. Setting to true (default) // prevents these issues by ensuring that different table files (SSTs) with // the same number are treated as distinct. See // share_files_with_checksum_naming and ShareFilesNaming. I have also removed interim option kFlagMatchInterimNaming, which is no longer needed and was never needed for correct+compatible operation (just performance). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8020 Test Plan: tests updated. Backward+forward compatibility verified with SHORT_TEST=1 check_format_compatible.sh. ldb uses default backup options, and I manually verified shared_checksum in /tmp/rocksdb_format_compatible_peterd/bak/current/ after run. Reviewed By: ajkr Differential Revision: D26786331 Pulled By: pdillinger fbshipit-source-id: 36f968dfef1f5cacbd65154abe1d846151a55130	2021-03-09 16:27:13 -08:00
Peter Dillinger	0028e3398b	Make format_version=5 new default (#8017 ) Summary: Haven't seen any production issues with new Bloom filter and it's now > 1 year old (added in 6.6.0). Updated check_format_compatible.sh and HISTORY.md Pull Request resolved: https://github.com/facebook/rocksdb/pull/8017 Test Plan: tests updated (or prior bugs fixed) Reviewed By: ajkr Differential Revision: D26762197 Pulled By: pdillinger fbshipit-source-id: 0e755c46b443087c1544da0fd545beb9c403d1c2	2021-03-09 12:42:53 -08:00
Peter Dillinger	ce391ff84b	Clarifying comments for Read() APIs (#8029 ) Summary: I recently discovered the confusing, undocumented semantics of Read() functions in the FileSystem and Env APIs. I have added clarification to the best of my reverse-engineered understanding, and made a note in HISTORY.md for implementors to check their implementations, as a subtly non-adherent implementation could lead to RocksDB quietly ignoring some portion of a file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8029 Test Plan: no code changes Reviewed By: anand1976 Differential Revision: D26831698 Pulled By: pdillinger fbshipit-source-id: 208f97ff6037bc13bb2ef360b987c2640c79bd03	2021-03-05 14:42:19 -08:00
Levi Tamasi	cb25bc1128	Update compaction statistics to include the amount of data read from blob files (#8022 ) Summary: The patch does the following: 1) Exposes the amount of data (number of bytes) read from blob files from `BlobFileReader::GetBlob` / `Version::GetBlob`. 2) Tracks the total number and size of blobs read from blob files during a compaction (due to garbage collection or compaction filter usage) in `CompactionIterationStats` and propagates this data to `InternalStats::CompactionStats` / `CompactionJobStats`. 3) Updates the formulae for write amplification calculations to include the amount of data read from blob files. 4) Extends the compaction stats dump with a new column `Rblob(GB)` and a new line containing the total number and size of blob files in the current `Version` to complement the information about the shape and size of the LSM tree that's already there. 5) Updates `CompactionJobStats` so that the number of files and amount of data written by a compaction are broken down per file type (i.e. table/blob file). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8022 Test Plan: Ran `make check` and `db_bench`. Reviewed By: riversand963 Differential Revision: D26801199 Pulled By: ltamasi fbshipit-source-id: 28a5f072048a702643b28cb5971b4099acabbfb2	2021-03-04 00:43:48 -08:00
matthewvon	4126bdc0e1	Feature: add SetBufferSize() so that managed size can be dynamic (#7961 ) Summary: This PR adds SetBufferSize() to the WriteBufferManager object. This enables user code to adjust the global budget for write_buffers based upon other memory conditions such as growth in table reader memory as the dataset grows. The buffer_size_ member variable is now atomic to match design of other changeable size_t members within WriteBufferManager. This change is useful as is. However, this change is also essential if someone decides they wanted to enable db_write_buffer_size modifications through the DB::SetOptions() API, i.e. no waste taking this as is. Any format / spacing changes are due to clang-format as required by check-in automation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7961 Reviewed By: ajkr Differential Revision: D26639075 Pulled By: akankshamahajan15 fbshipit-source-id: 0604348caf092d35f44e85715331dc920e5c1033	2021-03-03 14:22:11 -08:00
Levi Tamasi	a46f080cce	Break down the amount of data written during flushes/compactions per file type (#8013 ) Summary: The patch breaks down the "bytes written" (as well as the "number of output files") compaction statistics into two, so the values are logged separately for table files and blob files in the info log, and are shown in separate columns (`Write(GB)` for table files, `Wblob(GB)` for blob files) when the compaction statistics are dumped. This will also come in handy for fixing the write amplification statistics, which currently do not consider the amount of data read from blob files during compaction. (This will be fixed by an upcoming patch.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/8013 Test Plan: Ran `make check` and `db_bench`. Reviewed By: riversand963 Differential Revision: D26742156 Pulled By: ltamasi fbshipit-source-id: 31d18ee8f90438b438ca7ed1ea8cbd92114442d5	2021-03-02 09:48:00 -08:00
Akanksha Mahajan	f19612970d	Support retrieving checksums for blob files from the MANIFEST when checkpointing (#8003 ) Summary: The checkpointing logic supports passing file level checksums to the copy_file_cb callback function which is used by the backup code for detecting corruption during file copies. However, this is currently implemented only for table files. This PR extends the checksum retrieval to blob files as well. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8003 Test Plan: Add new test units Reviewed By: ltamasi Differential Revision: D26680701 Pulled By: akankshamahajan15 fbshipit-source-id: 1bd1e2464df6e9aa31091d35b8c72786d94cd1c5	2021-03-01 20:07:07 -08:00
Yanqin Jin	cef4a6c49f	Compaction filter support for (new) BlobDB (#7974 ) Summary: Allow applications to implement a custom compaction filter and pass it to BlobDB. The compaction filter's custom logic can operate on blobs. To do so, the application needs to subclass the `CompactionFilter` abstract class and implement the `FilterV2()` method. Optionally, a method called `ShouldFilterBlobByKey()` can be implemented if the application's custom logic relies solely on the key to make a decision without reading the blob, thus saving extra IO. Examples can be found in db/blob/db_blob_compaction_test.cc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7974 Test Plan: make check Reviewed By: ltamasi Differential Revision: D26509280 Pulled By: riversand963 fbshipit-source-id: 59f9ae5614c4359de32f4f2b16684193cc537b39	2021-02-25 16:32:35 -08:00
Akanksha Mahajan	2772eb7735	Update History.md for VerifyFileChecksums API supporting blob file (#7995 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7995 Reviewed By: ltamasi Differential Revision: D26625766 Pulled By: akankshamahajan15 fbshipit-source-id: d83c9e77695f4193da979b1ce7103b43bc1dd46c	2021-02-24 10:25:03 -08:00
xinyuliu	b085ee13e0	Append all characters not captured by xsputn() in overflow() function (#7991 ) Summary: In the adapter class `WritableFileStringStreamAdapter`, which wraps WritableFile to be used for std::ostream, previously only `std::endl` was considered a special case because `endl` is written by `os.put()` directly without going through `xsputn()`. `os.put()` will call `sputc()` and if we further check the internal implementation of `sputc()`, we will see it is ``` int_type __CLR_OR_THIS_CALL sputc(_Elem _Ch) { // put a character return 0 < _Pnavail() ? _Traits::to_int_type(*_Pninc() = _Ch) : overflow(_Traits::to_int_type(_Ch)); ``` As we explicitly disabled buffering, _Pnavail() is always 0. Thus every write not captured by xsputn becomes an overflow. When I ran tests on Windows, I found that not only does `std::endl` fall into this case, but writing an unsigned long long will also call `os.put()`, followed by `sputc()`, and eventually call `overflow()`. Therefore, instead of only checking `std::endl`, we should try to append other characters as well unless the appending operation fails. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7991 Reviewed By: jay-zhuang Differential Revision: D26615692 Pulled By: ajkr fbshipit-source-id: 4c0003de1645b9531545b23df69b000e07014468	2021-02-23 21:44:48 -08:00
Akanksha Mahajan	cd79a00903	Make BlockBasedTable::kMaxAutoReadAheadSize configurable (#7951 ) Summary: RocksDB does auto-readahead for iterators on noticing more than two reads for a table file. The readahead starts at 8KB and doubles on every additional read up to BlockBasedTable::kMaxAutoReadAheadSize, which is 256*1024. This PR adds a new option BlockBasedTableOptions::max_auto_readahead_size which replaces BlockBasedTable::kMaxAutoReadAheadSize and can be configured. If max_auto_readahead_size is set to 0, no implicit auto prefetching will be done. If the max_auto_readahead_size provided is less than 8KB (which is the initial readahead size used by RocksDB in case of auto-readahead), the readahead size will remain the same as max_auto_readahead_size. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7951 Test Plan: Add new unit test case. Reviewed By: anand1976 Differential Revision: D26568085 Pulled By: akankshamahajan15 fbshipit-source-id: b6543520fc74e97d859f2002328d4c5254d417af	2021-02-23 16:54:08 -08:00
Yanqin Jin	7343eb4a74	Update HISTORY and bump version (#7984 ) Summary: Prepare to cut 6.18.fb branch Pull Request resolved: https://github.com/facebook/rocksdb/pull/7984 Reviewed By: ajkr Differential Revision: D26557151 Pulled By: riversand963 fbshipit-source-id: 8c144c807090cdae67e6655e7a17056ce8c50bc0	2021-02-19 19:21:49 -08:00
Andrew Kryczka	d904233d2f	Limit buffering for collecting samples for compression dictionary (#7970 ) Summary: For dictionary compression, we need to collect some representative samples of the data to be compressed, which we use to either generate or train (when `CompressionOptions::zstd_max_train_bytes > 0`) a dictionary. Previously, the strategy was to buffer all the data blocks during flush, and up to the target file size during compaction. That strategy allowed us to randomly pick samples from as wide a range as possible that'd be guaranteed to land in a single output file. However, some users try to make huge files in memory-constrained environments, where this strategy can cause OOM. This PR introduces an option, `CompressionOptions::max_dict_buffer_bytes`, that limits how much data blocks are buffered before we switch to unbuffered mode (which means creating the per-SST dictionary, writing out the buffered data, and compressing/writing new blocks as soon as they are built). It is not strict as we currently buffer more than just data blocks -- also keys are buffered. But it does make a step towards giving users predictable memory usage. Related changes include: - Changed sampling for dictionary compression to select unique data blocks when there is limited availability of data blocks - Made use of `BlockBuilder::SwapAndReset()` to save an allocation+memcpy when buffering data blocks for building a dictionary - Changed `ParseBoolean()` to accept an input containing characters after the boolean. This is necessary since, with this PR, a value for `CompressionOptions::enabled` is no longer necessarily the final component in the `CompressionOptions` string. 
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7970 Test Plan: - updated `CompressionOptions` unit tests to verify limit is respected (to the extent expected in the current implementation) in various scenarios of flush/compaction to bottommost/non-bottommost level - looked at jemalloc heap profiles right before and after switching to unbuffered mode during flush/compaction. Verified memory usage in buffering is proportional to the limit set. Reviewed By: pdillinger Differential Revision: D26467994 Pulled By: ajkr fbshipit-source-id: 3da4ef9fba59974e4ef40e40c01611002c861465	2021-02-19 14:09:54 -08:00
mrambacher	4bc9df9459	Fix handling of Mutable options; Allow DB::SetOptions to update mutable TableFactory Options (#7936 ) Summary: Added a "only_mutable_options" flag to the ConfigOptions. When set, the Configurable methods will only look at/update options that are marked as kMutable. Fixed DB::SetOptions to allow for the update of any mutable TableFactory options. Fixes https://github.com/facebook/rocksdb/issues/7385. Added tests for the new flag. Updated HISTORY.md Pull Request resolved: https://github.com/facebook/rocksdb/pull/7936 Reviewed By: akankshamahajan15 Differential Revision: D26389646 Pulled By: mrambacher fbshipit-source-id: 6dc247f6e999fa2814059ebbd0af8face109fea0	2021-02-19 10:29:02 -08:00
Zhichao Cao	b0fd1cc45a	Introduce a new trace file format (v 0.2) for better extension (#7977 ) Summary: The trace file record and payload encoding is fixed, which requires complex backward compatibility handling. This PR introduces a new trace file format, which makes it easier to add new entries to the payload and does not have backward compatibility issues. V 0.1 is still supported in this PR. Added tracing of lower_bound and upper_bound for iterators. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7977 Test Plan: make check. tested with old trace file in replay and analyzing. Reviewed By: anand1976 Differential Revision: D26529948 Pulled By: zhichao-cao fbshipit-source-id: ebb75a127ce3c07c25a1ccc194c551f917896a76	2021-02-18 23:05:35 -08:00
Akanksha Mahajan	eacb14a10a	Update history.md for bug fix of actual error returned in DB::OpenForReadOnly (#7978 ) Summary: Update history.md for bug fix of actual error returned in DB::OpenForReadOnly Pull Request resolved: https://github.com/facebook/rocksdb/pull/7978 Reviewed By: jay-zhuang Differential Revision: D26519195 Pulled By: akankshamahajan15 fbshipit-source-id: 39fd2bcc12ab92a492e8254090b742efa377ed51	2021-02-18 11:42:05 -08:00
Jay Zhuang	59ba104e4a	Fix txn `MultiGet()` return un-committed data with snapshot (#7963 ) Summary: TransactionDB uses read callback to filter out un-committed data before a snapshot. But `MultiGet()` API doesn't use that at all, which causes returning unwanted data. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7963 Test Plan: Added unittest to reproduce Reviewed By: anand1976 Differential Revision: D26455851 Pulled By: jay-zhuang fbshipit-source-id: 265276698cf9d8c4cd79e3250ef10d14375bac55	2021-02-18 08:49:00 -08:00
Levi Tamasi	ba8008c870	Mention the new BlobDB in HISTORY.md and remove the "under construction" signs (#7969 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7969 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D26467043 Pulled By: ltamasi fbshipit-source-id: c69a725669d18af6e911743c998e3a1db75948c0	2021-02-16 16:20:22 -08:00
Zhichao Cao	d1c510baec	Handoff checksum Implementation (#7523 ) Summary: In PR https://github.com/facebook/rocksdb/issues/7419 , we introduced the new Append and PositionedAppend APIs to WritableFile at File System, which enable RocksDB to pass the data verification information (e.g., checksum of the data) to the lower layer. In this PR, we use the new API in WritableFileWriter, such that a file created via WritableFileWriter can pass the checksum to the storage layer. To control which file types should apply the checksum handoff, we add checksum_handoff_file_types to DBOptions. Users can use this option to control which file types (currently supported file types: kLogFile, kTableFile, kDescriptorFile) should use the new Append and PositionedAppend APIs to hand off the verification information. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7523 Test Plan: add new unit test, pass make check/ make asan_check Reviewed By: pdillinger Differential Revision: D24313271 Pulled By: zhichao-cao fbshipit-source-id: aafd69091ae85c3318e3e17cbb96fe7338da11d0	2021-02-10 22:20:32 -08:00
Peter Dillinger	e4f1e64c30	Add prefetching (batched MultiGet) for experimental Ribbon filter (#7889 ) Summary: Adds support for prefetching data in Ribbon queries, which especially optimizes batched Ribbon queries for MultiGet (~222ns/key to ~97ns/key) but also single key queries on cold memory (~333ns to ~226ns) because many queries span more than one cache line. This required some refactoring of the query algorithm, and there does not appear to be a noticeable regression in "hot memory" query times (perhaps from 48ns to 50ns). Pull Request resolved: https://github.com/facebook/rocksdb/pull/7889 Test Plan: existing unit tests, plus performance validation with filter_bench: Each data point is the best of two runs. I saturated the machine CPUs with other filter_bench runs in the background. Before: $ ./filter_bench -impl=3 -m_keys_total_max=200 -average_keys_per_filter=100000 -m_queries=50 WARNING: Assertions are enabled; benchmarks unnecessarily slow Building... Build avg ns/key: 125.86 Number of filters: 1993 Total size (MB): 168.166 Reported total allocated memory (MB): 183.211 Reported internal fragmentation: 8.94626% Bits/key stored: 7.05341 Prelim FP rate %: 0.951827 ---------------------------- Mixed inside/outside queries... Single filter net ns/op: 48.0111 Batched, prepared net ns/op: 222.384 Batched, unprepared net ns/op: 343.908 Skewed 50% in 1% net ns/op: 252.916 Skewed 80% in 20% net ns/op: 320.579 Random filter net ns/op: 332.957 After: $ ./filter_bench -impl=3 -m_keys_total_max=200 -average_keys_per_filter=100000 -m_queries=50 WARNING: Assertions are enabled; benchmarks unnecessarily slow Building... Build avg ns/key: 128.117 Number of filters: 1993 Total size (MB): 168.166 Reported total allocated memory (MB): 183.211 Reported internal fragmentation: 8.94626% Bits/key stored: 7.05341 Prelim FP rate %: 0.951827 ---------------------------- Mixed inside/outside queries... 
Single filter net ns/op: 49.8812 Batched, prepared net ns/op: 97.1514 Batched, unprepared net ns/op: 222.025 Skewed 50% in 1% net ns/op: 197.48 Skewed 80% in 20% net ns/op: 212.457 Random filter net ns/op: 226.464 Bloom comparison, for reference: $ ./filter_bench -impl=2 -m_keys_total_max=200 -average_keys_per_filter=100000 -m_queries=50 WARNING: Assertions are enabled; benchmarks unnecessarily slow Building... Build avg ns/key: 35.3042 Number of filters: 1993 Total size (MB): 238.488 Reported total allocated memory (MB): 262.875 Reported internal fragmentation: 10.2255% Bits/key stored: 10.0029 Prelim FP rate %: 0.965327 ---------------------------- Mixed inside/outside queries... Single filter net ns/op: 9.09931 Batched, prepared net ns/op: 34.21 Batched, unprepared net ns/op: 88.8564 Skewed 50% in 1% net ns/op: 139.75 Skewed 80% in 20% net ns/op: 181.264 Random filter net ns/op: 173.88 Reviewed By: jay-zhuang Differential Revision: D26378710 Pulled By: pdillinger fbshipit-source-id: 058428967c55ed763698284cd3b4bbe3351b6e69	2021-02-10 21:04:56 -08:00
Andrew Kryczka	c16d5a4fda	Makefile support to statically link external plugin code (#7918 ) Summary: Added support for detecting plugins linked in the "plugin/" directory and building them from our Makefile in a standardized way. See "plugin/README.md" for details. An example of a plugin that can be built in this way can be found in https://github.com/ajkr/dedupfs. There will be more to do in terms of making this process more convenient and adding support for CMake. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7918 Test Plan: my own plugin (https://github.com/ajkr/dedupfs) and also heard this patch worked with ZenFS. Reviewed By: pdillinger Differential Revision: D26189969 Pulled By: ajkr fbshipit-source-id: 6624d4357d0ffbaedb42f0d12a3fcb737c78f758	2021-02-10 08:35:34 -08:00
Jay Zhuang	cf160b98e1	Add full_history_ts_low option to compaction (#7884 ) Summary: The full_history_ts_low is used for user-defined timestamp GC compaction, which is introduced in https://github.com/facebook/rocksdb/issues/7740, https://github.com/facebook/rocksdb/issues/7657 and https://github.com/facebook/rocksdb/issues/7655. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7884 Reviewed By: ltamasi Differential Revision: D25982553 Pulled By: jay-zhuang fbshipit-source-id: 36303d412d65b5d8166b6da24fa21ad85adbabee	2021-02-08 13:45:48 -08:00
Levi Tamasi	974458891c	Revert "Turn on memtable bloom filter by default. (#6584 )" (#7939 ) Summary: This reverts commit `ee79a28963`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7939 Reviewed By: siying Differential Revision: D26298564 Pulled By: ltamasi fbshipit-source-id: 6d663516e82e6de436f8d5317932ca9a98e152bd	2021-02-06 22:34:30 -08:00
Andrew Kryczka	8d2bbdd04f	Allow range deletions in `*TransactionDB` only when safe (#7929 ) Summary: Explicitly reject all range deletions on `TransactionDB` or `OptimisticTransactionDB`, except when the user provides sufficient promises that allow us to proceed safely. The necessary promises are described in the API doc for `TransactionDB::DeleteRange()`. There is currently no way to provide enough promises to make it safe in `OptimisticTransactionDB`. Fixes https://github.com/facebook/rocksdb/issues/7913. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7929 Test Plan: unit tests covering the cases it's permitted/rejected Reviewed By: ltamasi Differential Revision: D26240254 Pulled By: ajkr fbshipit-source-id: 2834a0ce64cc3e4c3799e35b885a5e79c2f4f6d9	2021-02-05 15:57:26 -08:00
sdong	ee79a28963	Turn on memtable bloom filter by default. (#6584 ) Summary: Memtable bloom filter is useful in many use cases. A default value on with conservative 1.5% memory can benefit more use cases than use cases impacted. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6584 Test Plan: Run all existing tests. Reviewed By: pdillinger Differential Revision: D20626739 fbshipit-source-id: 1dd45532b932139552519b8c2682bd954550c2f9	2021-02-05 12:59:46 -08:00
Andrew Kryczka	78ee8564ad	Integrity protection for live updates to WriteBatch (#7748 ) Summary: This PR adds the foundation classes for key-value integrity protection and the first use case: protecting live updates from the source buffers added to `WriteBatch` through the destination buffer in `MemTable`. The width of the protection info is not yet configurable -- only eight bytes per key is supported. This PR allows users to enable protection by constructing `WriteBatch` with `protection_bytes_per_key == 8`. It does not yet expose a way for users to get integrity protection via other write APIs (e.g., `Put()`, `Merge()`, `Delete()`, etc.). The foundation classes (`ProtectionInfo.`) embed the coverage info in their type, and provide `Protect.()` and `Strip.()` functions to navigate between types with different coverage. For making bytes per key configurable (for powers of two up to eight) in the future, these classes are templated on the unsigned integer type used to store the protection info. That integer contains the XOR'd result of hashes with independent seeds for all covered fields. For integer fields, the hash is computed on the raw unadjusted bytes, so the result is endian-dependent. The most significant bytes are truncated when the hash value (8 bytes) is wider than the protection integer. When `WriteBatch` is constructed with `protection_bytes_per_key == 8`, we hold a `ProtectionInfoKVOTC` (i.e., one that covers key, value, optype aka `ValueType`, timestamp, and CF ID) for each entry added to the batch. The protection info is generated from the original buffers passed by the user, as well as the original metadata generated internally. When writing to memtable, each entry is transformed to a `ProtectionInfoKVOTS` (i.e., dropping coverage of CF ID and adding coverage of sequence number), since at that point we know the sequence number, and have already selected a memtable corresponding to a particular CF. 
This protection info is verified once the entry is encoded in the `MemTable` buffer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7748 Test Plan: - an integration test to verify a wide variety of single-byte changes to the encoded `MemTable` buffer are caught - add to stress/crash test to verify it works in variety of configs/operations without intentional corruption - [deferred] unit tests for `ProtectionInfo.` classes for edge cases like KV swap, `SliceParts` and `Slice` APIs are interchangeable, etc. Reviewed By: pdillinger Differential Revision: D25754492 Pulled By: ajkr fbshipit-source-id: e481bac6c03c2ab268be41359730f1ceb9964866	2021-01-29 12:18:58 -08:00
mrambacher	0a9a05ae12	Make builds reproducible (#7866 ) Summary: Closes https://github.com/facebook/rocksdb/issues/7035 Changed how build_version.cc was generated: - Included the GIT tag/branch in the build_version file - Changed the "Build Date" to be: - If the GIT branch is "clean" (no changes), the date of the last git commit - If the branch is not clean, the current date - Added APIs to access the "build information", rather than accessing the strings directly. The build_version.cc file is now regenerated whenever the library objects are rebuilt. Verified that the built files remain the same size across builds on a "clean build" and the same information is reported by sst_dump --version Pull Request resolved: https://github.com/facebook/rocksdb/pull/7866 Reviewed By: pdillinger Differential Revision: D26086565 Pulled By: mrambacher fbshipit-source-id: 6fcbe47f6033989d5cf26a0ccb6dfdd9dd239d7f	2021-01-28 17:42:16 -08:00
Zhichao Cao	95013df278	Do not set bg error for compaction in retryable IO Error case (#7899 ) Summary: When a retryable IO error occurs during compaction, it is mapped to a soft error and the BG error is set. However, auto resume is not called to clean the soft error since compaction will reschedule by itself. In this change, when a retryable IO error occurs during compaction, the BG error is not set. The user will be informed of the error via EventHelper. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7899 Test Plan: tested with error_handler_fs_test Reviewed By: anand1976 Differential Revision: D26094097 Pulled By: zhichao-cao fbshipit-source-id: c53424f11d237405592cd762f43cbbdf8da8234f	2021-01-27 17:58:12 -08:00
mrambacher	12f1137355	Add a SystemClock class to capture the time functions of an Env (#7858 ) Summary: Introduces and uses a SystemClock class to RocksDB. This class contains the time-related functions of an Env and these functions can be redirected from the Env to the SystemClock. Many of the places that used an Env (Timer, PerfStepTimer, RepeatableThread, RateLimiter, WriteController) for time-related functions have been changed to use SystemClock instead. There are likely more places that can be changed, but this is a start to show what can/should be done. Over time it would be nice to migrate most (if not all) of the uses of the time functions from the Env to the SystemClock. There are several Env classes that implement these functions. Most of these have not been converted yet to SystemClock implementations; that will come in a subsequent PR. It would be good to unify many of the Mock Timer implementations, so that they behave similarly and be tested similarly (some override Sleep, some use a MockSleep, etc). Additionally, this change will allow new methods to be introduced to the SystemClock (like https://github.com/facebook/rocksdb/issues/7101 WaitFor) in a consistent manner across a smaller number of classes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7858 Reviewed By: pdillinger Differential Revision: D26006406 Pulled By: mrambacher fbshipit-source-id: ed10a8abbdab7ff2e23d69d85bd25b3e7e899e90	2021-01-25 22:09:11 -08:00
Levi Tamasi	19076c95aa	Update HISTORY.md for PR 7888 (#7890 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7890 Reviewed By: ajkr Differential Revision: D26005509 Pulled By: ltamasi fbshipit-source-id: e7eb732180d447900788d0e3a17dfd1c3f1e708a	2021-01-21 14:20:10 -08:00
Andrew Kryczka	e18a4df62a	workaround race conditions during `PeriodicWorkScheduler` registration (#7888 ) Summary: This provides a workaround for two race conditions that will be fixed in a more sophisticated way later. This PR: (1) Makes the client serialize calls to `Timer::Start()` and `Timer::Shutdown()` (see https://github.com/facebook/rocksdb/issues/7711). The long-term fix will be to make those functions thread-safe. (2) Makes `PeriodicWorkScheduler` atomically add/cancel work together with starting/shutting down its `Timer`. The long-term fix will be for `Timer` API to offer more specialized APIs so the client will not need to synchronize. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7888 Test Plan: ran the repro provided in https://github.com/facebook/rocksdb/issues/7881 Reviewed By: jay-zhuang Differential Revision: D25990891 Pulled By: ajkr fbshipit-source-id: a97fdaebbda6d7db7ddb1b146738b68c16c5be38	2021-01-21 08:50:38 -08:00
Cheng Chang	b0c43e7081	Update HISTORY.md (#7887 ) Summary: Mention the forward compatibility fix for WAL related version edits. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7887 Reviewed By: ltamasi Differential Revision: D25982494 Pulled By: cheng-chang fbshipit-source-id: 4be292aa4bf7fbc8a27c0bef1e7a98ad3ea8e1fa	2021-01-20 14:33:59 -08:00
Cheng Chang	928dea0e32	Update HISTORY.md (#7874 ) Summary: I find that the `track_and_verify_wals_in_manifest` option was only removed from 6.15 branch's HISTORY, but still appears under 6.15 in master branch's HISTORY. It should be moved to 6.16 since that's when the feature should be available. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7874 Reviewed By: jay-zhuang Differential Revision: D25935971 Pulled By: cheng-chang fbshipit-source-id: fe8bf1ec111597f9207e109aa3be65f8f919f1fd	2021-01-19 16:10:13 -08:00
Levi Tamasi	ffe4906192	Update version to 6.17 (#7871 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7871 Test Plan: `make check` Reviewed By: jay-zhuang Differential Revision: D25932233 Pulled By: ltamasi fbshipit-source-id: 8b80b0638a4f34f21a27ba80b3eda7d75410b2e8	2021-01-15 18:53:00 -08:00
anand76	8e7b068ecc	Make ldb load column family options from OPTIONS file (#7847 ) Summary: When --try_load_options is used in conjunction with the --column_family option, ldb incorrectly sets the ColumnFamilyOptions for that column family to defaults. This PR fixes that by retaining the options from the OPTIONS file and applying command line overrides. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7847 Test Plan: Add a unit test in ldb_cmd_test Reviewed By: ajkr Differential Revision: D25874720 Pulled By: anand1976 fbshipit-source-id: 04bcf23b55e5a30b5b6a59b0e5cb4faef3da7429	2021-01-11 20:56:34 -08:00
Cheng Chang	fdbebdf484	Add note for PR 7789 in history (#7855 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7855 Reviewed By: ajkr Differential Revision: D25872797 Pulled By: cheng-chang fbshipit-source-id: 82159a13f897aaaad5f3c70c7dfa822e073bc623	2021-01-11 13:34:15 -08:00
Adam Retter	4926b33742	Improvements to Env::GetChildren (#7819 ) Summary: The main improvement here is to not include `.` or `..` in the results of `Env::GetChildren`. The occurrence of `.` or `..` is non-portable: it depends on the operating system and the file system. See: https://www.gnu.org/software/libc/manual/html_node/Reading_002fClosing-Directory.html There were lots of duplicate checks spread through the RocksDB codebase previously to skip `.` and `..`. This change removes the need for those by skipping them at the source. Also some minor fixes to `Env::GetChildren`: * Improve error handling in POSIX implementation * Remove unnecessary array allocation on Windows * Fix struct name for Windows Non-UTF-8 API Pull Request resolved: https://github.com/facebook/rocksdb/pull/7819 Reviewed By: ajkr Differential Revision: D25837394 Pulled By: jay-zhuang fbshipit-source-id: 1e137e7218d38b450af9c083f73d5357abcbba2e	2021-01-09 09:44:34 -08:00
Akanksha Mahajan	8ed680bdb0	Add new API to report dummy entries size in cache in WriteBufferManager (#7837 ) Summary: Add new API WriteBufferManager::dummy_entries_in_cache_usage() which reports the dummy entries size stored in cache to account for DataBlocks in WriteBufferManager. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7837 Test Plan: Updated test ./write_buffer_manager_test Reviewed By: ajkr Differential Revision: D25794312 Pulled By: akankshamahajan15 fbshipit-source-id: 197f5e8701e3dc57a7df72dab1735624f90daf4b	2021-01-08 13:26:24 -08:00
Zhichao Cao	48c0843e69	Treat File Scope Write IO Error the same as Retryable IO Error (#7840 ) Summary: In RocksDB, when an IO error happens, the flags of IOStatus can be set. If the IOStatus is set as "File Scope IO Error", it indicates that the error is constrained to the file level. Since RocksDB does not continue writing data to a file when any IO Error happens, File Scope IO Error can be treated the same as Retryable IO Error. This change adds logic to ErrorHandler::SetBGError to include the file scope IO Error in its error handling, which is the same as for retryable IO Error. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7840 Test Plan: added new unit tests in error_handler_fs_test. make check Reviewed By: anand1976 Differential Revision: D25820481 Pulled By: zhichao-cao fbshipit-source-id: 69cabd3d010073e064d6142ce1cabf341b8a6806	2021-01-07 16:31:33 -08:00
Adam Retter	6e0f62f2b6	Add more tests to ASSERT_STATUS_CHECKED (3), API change (#7715 ) Summary: Third batch of adding more tests to ASSERT_STATUS_CHECKED. * db_compaction_filter_test * db_compaction_test * db_dynamic_level_test * db_inplace_update_test * db_sst_test * db_tailing_iter_test * db_io_failure_test Also update GetApproximateSizes APIs to all return Status. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7715 Reviewed By: jay-zhuang Differential Revision: D25806896 Pulled By: pdillinger fbshipit-source-id: 6cb9d62ba5a756c645812754c596ad3995d7c262	2021-01-06 14:15:02 -08:00
Andrew Kryczka	225abffd8f	Verify file checksum generator name (#7824 ) Summary: Previously we only had a debug assertion to check the right generator was being used for verification. However a user hit a problem in production where their factory was creating the wrong generator for some files, leading to checksum mismatches. It would have been easier to debug if we verified in optimized builds that the generator with the proper name is used. This PR adds such verification. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7824 Reviewed By: zhichao-cao Differential Revision: D25740254 Pulled By: ajkr fbshipit-source-id: a6231521747605021bad3231484b5d4f99f4044f	2021-01-04 11:51:50 -08:00
Jay Zhuang	a8aeefd0fd	Update release version to 6.16 (#7782 ) Summary: Update release version to 6.16 Pull Request resolved: https://github.com/facebook/rocksdb/pull/7782 Reviewed By: siying Differential Revision: D25648579 Pulled By: jay-zhuang fbshipit-source-id: c536d606868b95c5fb2ae8f19c17eb259d67bc51	2020-12-19 12:39:21 -08:00
Peter Dillinger	4d1ac19e3d	aggregated-table-properties with GetMapProperty (#7779 ) Summary: So that we can more easily get aggregate live table data such as total filter, index, and data sizes. Also adds ldb support for getting properties. Also fixed some missing/inaccurate related comments in db.h. For example: $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties rocksdb.aggregated-table-properties.data_size: 102871 rocksdb.aggregated-table-properties.filter_size: 0 rocksdb.aggregated-table-properties.index_partitions: 0 rocksdb.aggregated-table-properties.index_size: 2232 rocksdb.aggregated-table-properties.num_data_blocks: 100 rocksdb.aggregated-table-properties.num_deletions: 0 rocksdb.aggregated-table-properties.num_entries: 15000 rocksdb.aggregated-table-properties.num_merge_operands: 0 rocksdb.aggregated-table-properties.num_range_deletions: 0 rocksdb.aggregated-table-properties.raw_key_size: 288890 rocksdb.aggregated-table-properties.raw_value_size: 198890 rocksdb.aggregated-table-properties.top_level_index_size: 0 $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties-at-level1 rocksdb.aggregated-table-properties-at-level1.data_size: 80909 rocksdb.aggregated-table-properties-at-level1.filter_size: 0 rocksdb.aggregated-table-properties-at-level1.index_partitions: 0 rocksdb.aggregated-table-properties-at-level1.index_size: 1787 rocksdb.aggregated-table-properties-at-level1.num_data_blocks: 81 rocksdb.aggregated-table-properties-at-level1.num_deletions: 0 rocksdb.aggregated-table-properties-at-level1.num_entries: 12466 rocksdb.aggregated-table-properties-at-level1.num_merge_operands: 0 rocksdb.aggregated-table-properties-at-level1.num_range_deletions: 0 rocksdb.aggregated-table-properties-at-level1.raw_key_size: 238210 rocksdb.aggregated-table-properties-at-level1.raw_value_size: 163414 rocksdb.aggregated-table-properties-at-level1.top_level_index_size: 0 $ Pull Request resolved: https://github.com/facebook/rocksdb/pull/7779 Test Plan: Added a test to ldb_test.py Reviewed By: jay-zhuang Differential Revision: D25653103 Pulled By: pdillinger fbshipit-source-id: 2905469a08a64dd6b5510cbd7be2e64d3234d6d3	2020-12-19 08:00:14 -08:00
Peter Dillinger	239d17a19c	Support optimize_filters_for_memory for Ribbon filter (#7774 ) Summary: Primarily this change refactors the optimize_filters_for_memory code for Bloom filters, based on malloc_usable_size, to also work for Ribbon filters. This change also replaces the somewhat slow but general BuiltinFilterBitsBuilder::ApproximateNumEntries with implementation-specific versions for Ribbon (new) and Legacy Bloom (based on a recently deleted version). The reason is to emphasize speed in ApproximateNumEntries rather than 100% accuracy. Justification: ApproximateNumEntries (formerly CalculateNumEntry) is only used by RocksDB for range-partitioned filters, called each time we start to construct one. (In theory, it should be possible to reuse the estimate, but the abstractions provided by FilterPolicy don't really make that workable.) But this is only used as a heuristic estimate for hitting a desired partitioned filter size because of alignment to data blocks, which have various numbers of unique keys or prefixes. The two factors lead us to prioritize reasonable speed over 100% accuracy. optimize_filters_for_memory adds extra complication, because precisely calculating num_entries for some allowed number of bytes depends on state with optimize_filters_for_memory enabled. And the allocator-agnostic implementation of optimize_filters_for_memory, using malloc_usable_size, means we would have to actually allocate memory, many times, just to precisely determine how many entries (keys) could be added and stay below some size budget, for the current state. (In a draft, I got this working, and then realized the balance of speed vs. accuracy was all wrong.) So related to that, I have made CalculateSpace, an internal-only API only used for testing, non-authoritative also if optimize_filters_for_memory is enabled. This simplifies some code. 
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7774 Test Plan: unit test updated, and for FilterSize test, range of tested values is greatly expanded (still super fast) Also tested `db_bench -benchmarks=fillrandom,stats -bloom_bits=10 -num=1000000 -partition_index_and_filters -format_version=5 [-optimize_filters_for_memory] [-use_ribbon_filter]` with temporary debug output of generated filter sizes. Bloom+optimize_filters_for_memory: 1 Filter size: 197 (224 in memory) 134 Filter size: 3525 (3584 in memory) 107 Filter size: 4037 (4096 in memory) Total on disk: 904,506 Total in memory: 918,752 Ribbon+optimize_filters_for_memory: 1 Filter size: 3061 (3072 in memory) 110 Filter size: 3573 (3584 in memory) 58 Filter size: 4085 (4096 in memory) Total on disk: 633,021 (-30.0%) Total in memory: 634,880 (-30.9%) Bloom (no offm): 1 Filter size: 261 (320 in memory) 1 Filter size: 3333 (3584 in memory) 240 Filter size: 3717 (4096 in memory) Total on disk: 895,674 (-1% on disk vs. +offm; known tolerable overhead of offm) Total in memory: 986,944 (+7.4% vs. +offm) Ribbon (no offm): 1 Filter size: 2949 (3072 in memory) 1 Filter size: 3381 (3584 in memory) 167 Filter size: 3701 (4096 in memory) Total on disk: 624,397 (-30.3% vs. Bloom) Total in memory: 690,688 (-30.0% vs. Bloom) Note that optimize_filters_for_memory is even more effective for Ribbon filter than for cache-local Bloom, because it can close the unused memory gap even tighter than Bloom filter, because of 16 byte increments for Ribbon vs. 64 byte increments for Bloom. Reviewed By: jay-zhuang Differential Revision: D25592970 Pulled By: pdillinger fbshipit-source-id: 606fdaa025bb790d7e9c21601e8ea86e10541912	2020-12-18 14:31:03 -08:00
Burton Li	2021392e25	Do not full scan obsolete files on compaction busy (#7739 ) Summary: When ConcurrentTaskLimiter is enabled and there are too many outstanding compactions, BackgroundCompaction returns Status::Busy(), which shouldn't be treated as a compaction failure. This caused a performance issue when outstanding compactions reached the limit. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7739 Reviewed By: cheng-chang Differential Revision: D25508319 Pulled By: ltamasi fbshipit-source-id: 3b181b16ada0ca3393cfa3a7412985764e79c719	2020-12-15 13:51:10 -08:00
Peter Dillinger	003e72b201	Use size_t for filter APIs, protect against overflow (#7726 ) Summary: Deprecate CalculateNumEntry and replace with ApproximateNumEntries (better name) using size_t instead of int and uint32_t, to minimize confusing casts and bad overflow behavior (possible though probably not realistic). Bloom sizes are now explicitly capped at max size supported by implementations: just under 4GiB for fv=5 Bloom, and just under 512MiB for fv<5 Legacy Bloom. This hardening could help to set up for fuzzing. Also, since RocksDB only uses this information as an approximation for trying to hit certain sizes for partitioned filters, it's more important that the function be reasonably fast than for it to be completely accurate. It's hard enough to be 100% accurate for Ribbon (currently reversing CalculateSpace) that adding optimize_filters_for_memory into the mix is just not worth trying to be 100% accurate for num entries for bytes. Also: - Cleaned up filter_policy.h to remove MSVC warning handling and potentially unsafe use of exception for "not implemented" - Correct the number of entries limit beyond which current Ribbon implementation falls back on Bloom instead. - Consistently use "num_entries" rather than "num_entry" - Remove LegacyBloomBitsBuilder::CalculateNumEntry as it's essentially obsolete from general implementation BuiltinFilterBitsBuilder::CalculateNumEntries. - Fix filter_bench to skip some tests that don't make sense when only one or a small number of filters has been generated. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7726 Test Plan: expanded existing unit tests for CalculateSpace / ApproximateNumEntries. Also manually used filter_bench to verify Legacy and fv=5 Bloom size caps work (much too expensive for unit test). Note that the actual bits per key is below requested due to space cap. 
$ ./filter_bench -impl=0 -bits_per_key=20 -average_keys_per_filter=256000000 -vary_key_count_ratio=0 -m_keys_total_max=256 -allow_bad_fp_rate ... Total size (MB): 511.992 Bits/key stored: 16.777 ... $ ./filter_bench -impl=2 -bits_per_key=20 -average_keys_per_filter=2000000000 -vary_key_count_ratio=0 -m_keys_total_max=2000 ... Total size (MB): 4096 Bits/key stored: 17.1799 ... $ Reviewed By: jay-zhuang Differential Revision: D25239800 Pulled By: pdillinger fbshipit-source-id: f94e6d065efd31e05ec630ae1a82e6400d8390c4	2020-12-11 22:18:12 -08:00
anand76	8a1488efbf	Ensure that MultiGet works properly with compressed cache (#7756 ) Summary: Ensure that when direct IO is enabled and a compressed block cache is configured, MultiGet inserts compressed data blocks into the compressed block cache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7756 Test Plan: Add unit test to db_basic_test Reviewed By: cheng-chang Differential Revision: D25416240 Pulled By: anand1976 fbshipit-source-id: 75d57526370c9c0a45ff72651f3278dbd8a9086f	2020-12-09 17:01:13 -08:00
Cheng Chang	70f2e0916a	Write min_log_number_to_keep to MANIFEST during atomic flush under 2 phase commit (#7570 ) Summary: When 2 phase commit is enabled, if there is prepared data in a WAL, the WAL should be kept; the minimum log number for such a WAL is written to MANIFEST during flush. In atomic flush, such information was not written to MANIFEST. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7570 Test Plan: Added a new unit test `DBAtomicFlushTest.ManualFlushUnder2PC`. This test fails in atomic flush without this PR; with this PR, it succeeds. Reviewed By: riversand963 Differential Revision: D24394222 Pulled By: cheng-chang fbshipit-source-id: 60ce74b21b704804943be40c8de01b41269cf116	2020-12-03 19:22:24 -08:00
Jay Zhuang	7fec715db4	Make CompactRange and GetApproximateSizes work with timestamp (#7684 ) Summary: Add timestamp to the `CompactRange()` and `GetApproximateSizes` range keys if needed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7684 Test Plan: make check Reviewed By: riversand963 Differential Revision: D25015421 Pulled By: jay-zhuang fbshipit-source-id: 51ca0756087eb053a3b11801e5c7ce1c6e2d38a9	2020-12-02 13:00:53 -08:00
Jay Zhuang	9e1640403a	Exclude timestamp from prefix extractor (#7668 ) Summary: Timestamp should not be included in prefix extractor, as we discussed here: https://github.com/facebook/rocksdb/pull/7589#discussion_r511068586 Pull Request resolved: https://github.com/facebook/rocksdb/pull/7668 Test Plan: added unittest Reviewed By: riversand963 Differential Revision: D24966265 Pulled By: jay-zhuang fbshipit-source-id: 0dae618c333d4b7942a40d556535a1795e060aea	2020-12-01 14:07:15 -08:00
Andrew Kryczka	eb65d673fe	Fix kPointInTimeRecovery handling of truncated WAL (#7701 ) Summary: WAL may be truncated to an incomplete record due to crash while writing the last record or corruption. In the former case, no hole will be produced since no ACK'd data was lost. In the latter case, a hole could be produced without this PR since we proceeded to recover the next WAL as if nothing happened. This PR changes the record reading code to always report a corruption for incomplete records in `kPointInTimeRecovery` mode, and the upper layer will only ignore them if the next WAL has consecutive seqnum (i.e., we are guaranteed no hole). While this solves the hole problem for the case of incomplete records, the possibility is still there if the WAL is corrupted by truncation to an exact record boundary. This PR also regresses how much data can be recovered when writes are mixed with/without `WriteOptions::disableWAL`, as then we can not distinguish between a seqnum gap caused by corruption and a seqnum gap caused by a `disableWAL` write. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7701 Test Plan: Interestingly there already was a test for this case (`DBWALTestWithParams.kPointInTimeRecovery`); it just had a typo bug in the verification that prevented it from noticing holes in recovery. Reviewed By: anand1976 Differential Revision: D25111765 Pulled By: ajkr fbshipit-source-id: 5e330b13b1ee2b5be096cea9d0ff6075843e57b6	2020-11-30 18:11:38 -08:00
Cheng Chang	5c585e1908	Ship the track WAL in MANIFEST feature (#7689 ) Summary: Updates the option description and HISTORY. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7689 Test Plan: N/A Reviewed By: zhichao-cao Differential Revision: D25056238 Pulled By: cheng-chang fbshipit-source-id: 6af1ef6f8dcf2173cbc0fccadc0e06cefd92bcae	2020-11-19 14:45:54 -08:00
Cheng Chang	8a97f35619	Call out a bug in HISTORY (#7690 ) Summary: It's worth mentioning the corner case bug fixed in PR 7621. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7690 Test Plan: N/A Reviewed By: zhichao-cao Differential Revision: D25056678 Pulled By: cheng-chang fbshipit-source-id: 1ab42ec080f3ffe21f5d97acf65ee0af993112ba	2020-11-18 14:54:22 -08:00
Yanqin Jin	84a700819e	Fix the logic of setting read_amp_bytes_per_bit from OPTIONS file (#7680 ) Summary: Instead of using `EncodeFixed32`, which always serializes an integer to little endian, we should use the local machine's endianness when populating a native data structure during options parsing. Without this fix, `read_amp_bytes_per_bit` may be populated incorrectly on big-endian machines. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7680 Test Plan: make check Reviewed By: pdillinger Differential Revision: D24999166 Pulled By: riversand963 fbshipit-source-id: dc603cff6e17f8fa32479ce6df93b93082e6b0c4	2020-11-17 00:44:30 -08:00
Andrew Kryczka	1c5f13f2a5	Fail early when `merge_operator` not configured (#7667 ) Summary: An application may accidentally write merge operands without properly configuring `merge_operator`. We should alert them as early as possible that there's an API misuse. Previously RocksDB only notified them when a query or background operation needed to merge but couldn't. With this PR, RocksDB notifies them of the problem before applying the merge operand to the memtable (although it may already be in WAL, which seems it'd cause a crash loop until they enable `merge_operator`). Pull Request resolved: https://github.com/facebook/rocksdb/pull/7667 Reviewed By: riversand963 Differential Revision: D24933360 Pulled By: ajkr fbshipit-source-id: 3a4a2ceb0b7aed184113dd03b8efd735a8332f7f	2020-11-16 20:39:01 -08:00
Ramkumar Vadivelu	5bd1258381	Update release history to 6.15 (#7673 ) Summary: Update release history to 6.15 Pull Request resolved: https://github.com/facebook/rocksdb/pull/7673 Test Plan: No code change Reviewed By: ajkr Differential Revision: D24971069 Pulled By: ramvadiv fbshipit-source-id: 5cb3f5cbc1b19beb580ea8095acdef72cc092905	2020-11-15 12:37:24 -08:00
Peter Dillinger	60af964372	Experimental (production candidate) SST schema for Ribbon filter (#7658 ) Summary: Added experimental public API for Ribbon filter: NewExperimentalRibbonFilterPolicy(). This experimental API will take a "Bloom equivalent" bits per key, and configure the Ribbon filter for the same FP rate as Bloom would have but ~30% space savings. (Note: optimize_filters_for_memory is not yet implemented for Ribbon filter. That can be added with no effect on schema.) Internally, the Ribbon filter is configured using a "one_in_fp_rate" value, which is 1 over desired FP rate. For example, use 100 for 1% FP rate. I'm expecting this will be used in the future for configuring Bloom-like filters, as I expect people to more commonly hold constant the filter accuracy and change the space vs. time trade-off, rather than hold constant the space (per key) and change the accuracy vs. time trade-off, though we might make that available. ### Benchmarking ``` $ ./filter_bench -impl=2 -quick -m_keys_total_max=200 -average_keys_per_filter=100000 -net_includes_hashing Building... Build avg ns/key: 34.1341 Number of filters: 1993 Total size (MB): 238.488 Reported total allocated memory (MB): 262.875 Reported internal fragmentation: 10.2255% Bits/key stored: 10.0029 ---------------------------- Mixed inside/outside queries... Single filter net ns/op: 18.7508 Random filter net ns/op: 258.246 Average FP rate %: 0.968672 ---------------------------- Done. (For more info, run with -legend or -help.) $ ./filter_bench -impl=3 -quick -m_keys_total_max=200 -average_keys_per_filter=100000 -net_includes_hashing Building... Build avg ns/key: 130.851 Number of filters: 1993 Total size (MB): 168.166 Reported total allocated memory (MB): 183.211 Reported internal fragmentation: 8.94626% Bits/key stored: 7.05341 ---------------------------- Mixed inside/outside queries... 
Single filter net ns/op: 58.4523 Random filter net ns/op: 363.717 Average FP rate %: 0.952978 ---------------------------- Done. (For more info, run with -legend or -help.) ``` 168.166 / 238.488 = 0.705 -> 29.5% space reduction 130.851 / 34.1341 = 3.83x construction time for this Ribbon filter vs. latest Bloom filter (could make that as little as about 2.5x for less space reduction) ### Working around a hashing "flaw" bloom_test discovered a flaw in the simple hashing applied in StandardHasher when num_starts == 1 (num_slots == 128), showing an excessively high FP rate. The problem is that when many entries, on the order of number of hash bits or kCoeffBits, are associated with the same start location, the correlation between the CoeffRow and ResultRow (for efficiency) can lead to a solution that is "universal," or nearly so, for entries mapping to that start location. (Normally, variance in start location breaks the effective association between CoeffRow and ResultRow; the same value for CoeffRow is effectively different if start locations are different.) Without kUseSmash and with num_starts > 1 (thus num_starts ~= num_slots), this flaw should be completely irrelevant. Even with 10M slots, the chances of a single slot having just 16 (or more) entries map to it--not enough to cause an FP problem, which would be local to that slot if it happened--is 1 in millions. This spreadsheet formula shows that: =1/(10000000*(1 - POISSON(15, 1, TRUE))) As kUseSmash==false (the setting for Standard128RibbonBitsBuilder) is intended for CPU efficiency of filters with many more entries/slots than kCoeffBits, a very reasonable work-around is to disallow num_starts==1 when !kUseSmash, by making the minimum non-zero number of slots 2*kCoeffBits. This is the work-around I've applied. This also means that the new Ribbon filter schema (Standard128RibbonBitsBuilder) is not space-efficient for less than a few hundred entries. 
Because of this, I have made it fall back on constructing a Bloom filter, under existing schema, when that is more space efficient for small filters. (We can change this in the future if we want.) TODO: better unit tests for this case in ribbon_test, and probably update StandardHasher for kUseSmash case so that it can scale nicely to small filters. ### Other related changes * Add Ribbon filter to stress/crash test * Add Ribbon filter to filter_bench as -impl=3 * Add option string support, as in "filter_policy=experimental_ribbon:5.678;" where 5.678 is the Bloom equivalent bits per key. * Rename internal mode BloomFilterPolicy::kAuto to kAutoBloom * Add a general BuiltinFilterBitsBuilder::CalculateNumEntry based on binary searching CalculateSpace (inefficient), so that subclasses (especially experimental ones) don't have to provide an efficient implementation inverting CalculateSpace. * Minor refactor FastLocalBloomBitsBuilder for new base class XXH3pFilterBitsBuilder shared with new Standard128RibbonBitsBuilder, which allows the latter to fall back on Bloom construction in some extreme cases. * Mostly updated bloom_test for Ribbon filter, though a test like FullBloomTest::Schema is a next TODO to ensure schema stability (in case this becomes production-ready schema as it is). * Add some APIs to ribbon_impl.h for configuring Ribbon filters. Although these are reasonably covered by bloom_test, TODO more unit tests in ribbon_test * Added a "tool" FindOccupancyForSuccessRate to ribbon_test to get data for constructing the linear approximations in GetNumSlotsFor95PctSuccess. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7658 Test Plan: Some unit tests updated but other testing is left TODO. This is considered experimental but laying down schema compatibility as early as possible in case it proves production-quality. Also tested in stress/crash test. 
Reviewed By: jay-zhuang Differential Revision: D24899349 Pulled By: pdillinger fbshipit-source-id: 9715f3e6371c959d923aea8077c9423c7a9f82b8	2020-11-12 20:46:14 -08:00
Yanqin Jin	2400cd69e3	Update HISTORY.md for PR6069 (#7663 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7663 Reviewed By: ajkr Differential Revision: D24913081 Pulled By: riversand963 fbshipit-source-id: 704f427812f2b4f92e16d6cbc93be64d730d1cf9	2020-11-12 08:38:41 -08:00
Andrew Kryczka	ec346da98c	Always apply bottommost_compression_opts when enabled (#7633 ) Summary: Previously, even when `bottommost_compression_opts`'s `enabled` flag was set, it only took effect when `bottommost_compression` was also set to something other than `kDisableCompressionOption`. This wasn't documented and, if we kept the old behavior, it'd make things complicated like the migration instructions in https://github.com/facebook/rocksdb/issues/7619. We can simplify the API by making `bottommost_compression_opts` always take effect when its `enabled` flag is set. Fixes https://github.com/facebook/rocksdb/issues/7631. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7633 Reviewed By: ltamasi Differential Revision: D24710358 Pulled By: ajkr fbshipit-source-id: bbbdf9c1b53c63a4239d902cc3f5a11da1874647	2020-11-11 20:32:28 -08:00
mrambacher	c442f6809f	Create a Customizable class to load classes and configurations (#6590 ) Summary: The Customizable class is an extension of the Configurable class and allows instances to be created by a name/ID. Classes that extend customizable can define their Type (e.g. "TableFactory", "Cache") and a method to instantiate them (TableFactory::CreateFromString). Customizable objects can be registered with the ObjectRegistry and created dynamically. Future PRs will make more types of objects extend Customizable. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6590 Reviewed By: cheng-chang Differential Revision: D24841553 Pulled By: zhichao-cao fbshipit-source-id: d0c2132bd932e971cbfe2c908ca2e5db30c5e155	2020-11-11 15:10:41 -08:00
Jay Zhuang	18aee7db7e	Fix a seek issue with prefix extractor and timestamp (#7644 ) Summary: During seek, prefix compare should not include timestamp. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7644 Test Plan: added unittest Reviewed By: riversand963 Differential Revision: D24772066 Pulled By: jay-zhuang fbshipit-source-id: 3982655a8bf8da256a738e8497b73b3d9bdac92e	2020-11-10 14:53:13 -08:00
Yanqin Jin	b6d8e36741	Compute NeedCompact() after table builder Finish() (#7627 ) Summary: In `BuildTable()`, we call `builder->Finish()` before evaluating `builder->NeedCompact()`. However, we call `builder->NeedCompact()` before `builder->Finish()` in compaction job. This can be wrong because the table properties collectors may rely on the success of `Finish()` to provide correct result for `NeedCompact()`. Test plan (on devserver): make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/7627 Reviewed By: ajkr Differential Revision: D24728741 Pulled By: riversand963 fbshipit-source-id: 5a0dce244e14eb1106c4f87021e6bebca82b486e	2020-11-04 10:44:56 -08:00
Yanqin Jin	fde0cd7ced	Add API to verify whole sst file checksum (#7578 ) Summary: Existing API `VerifyChecksum()` allows application to verify sst files' block checksums. Since whole file, user-specified checksum is tracked in MANIFEST, we can expose a new API to verify sst files' file checksums. ``` // Compute table file checksums if applicable and compare with MANIFEST. // Returns OK if no file has mismatching whole-file checksum. Status DB::VerifyFileChecksums(const ReadOptions& /*read_options*/); ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/7578 Test Plan: make check Reviewed By: pdillinger Differential Revision: D24436783 Pulled By: riversand963 fbshipit-source-id: 52b51519b842f2b3c4e3351998a97c86cbec85b3	2020-11-03 20:34:56 -08:00
Jay Zhuang	881e0dcc09	Fix MultiGet unable to query timestamp data issue (#7589 ) Summary: The filter query key should not contain timestamp. The timestamp is stripped for Get(), but not MultiGet(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/7589 Reviewed By: riversand963 Differential Revision: D24494661 Pulled By: jay-zhuang fbshipit-source-id: fc5ff40f9d683a89a760c6ff0ab3aed05a70c317	2020-11-03 09:45:41 -08:00
Andrew Kryczka	1adbceb581	Expand effect of dictionary settings in `ColumnFamilyOptions::compression_opts` (#7619 ) Summary: In dictionary compression's initial implementation, in order to save CPU overhead, we only enabled it for bottom level under the assumption that the vast majority of data is stored there. At that time, there was no such thing as `ColumnFamilyOptions::bottommost_compression_opts`, so we just hardcoded disabling dictionary compression in flush and compactions to non-bottommost level. Now, we have users who generate all their files through flush and are considering using dictionary compression. To support such a use case, this PR expands the scope of `ColumnFamilyOptions::compression_opts` to additionally include flushed files and files generated by compaction to a non-bottommost level. Users can still get the old behavior by moving their dictionary settings to `ColumnFamilyOptions::bottommost_compression_opts` and explicitly enabling both that and `ColumnFamilyOptions::bottommost_compression`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7619 Reviewed By: ltamasi Differential Revision: D24665610 Pulled By: ajkr fbshipit-source-id: 656b90bce1033fe21c71e09af931ef5bde3e464c	2020-11-02 19:21:11 -08:00
Andrew Kryczka	a388c8cc6b	Add recent fixes to HISTORY.md (#7617 ) Summary: The recently reverted behavior changes were released to at least one place internally, so we should mention the reverts in release notes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7617 Reviewed By: akankshamahajan15 Differential Revision: D24654343 Pulled By: ajkr fbshipit-source-id: eb64b2797d8508cd95a2dc2698122c1be29ce817	2020-10-30 14:03:35 -07:00
Yanqin Jin	6134ce6444	Perform post-flush updates of memtable list in a callback (#6069 ) Summary: Currently, the following interleaving of events can lead to SuperVersion containing both immutable memtables as well as the resulting L0. This can cause Get to return incorrect result if there are merge operands. This may also affect other operations such as single deletes. ``` time main_thr bg_flush_thr bg_compact_thr compact_thr set_opts_thr 0 \| WriteManifest:0 1 \| issue compact 2 \| wait 3 \| Merge(counter) 4 \| issue flush 5 \| wait 6 \| WriteManifest:1 7 \| wake up 8 \| write manifest 9 \| wake up 10 \| Get(counter) 11 \| remove imm V ``` The reason behind is that: one bg flush thread's installing new `Version` can be batched and performed by another thread that is the "leader" MANIFEST writer. This bg thread removes the memtables from current super version only after `LogAndApply` returns. After the leader MANIFEST writer signals (releasing mutex) this bg flush thread, it is possible that another thread sees this cf with both memtables (whose data have been flushed to the newest L0) and the L0 before this bg flush thread removes the memtables. To address this issue, each bg flush thread can pass a callback function to `LogAndApply`. The callback is responsible for removing the memtables. Therefore, the leader MANIFEST writer can call this callback and remove the memtables before releasing the mutex. Test plan (devserver) ``` $make merge_test $./merge_test --gtest_filter=MergeTest.MergeWithCompactionAndFlush $make check ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/6069 Reviewed By: cheng-chang Differential Revision: D18790894 Pulled By: riversand963 fbshipit-source-id: e41bd600c0448b4f4b2deb3f7677f95e3076b4ed	2020-10-26 18:23:01 -07:00
Akanksha Mahajan	eef27d0048	Bug fix to remove function calling in assert statement (#7581 ) Summary: Remove the function call from the assert statement, since assert is a no-op in opt builds and the function might not be called. This causes a hang when closing RocksDB while refit level is set. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7581 Test Plan: make check -j64 Reviewed By: riversand963 Differential Revision: D24466420 Pulled By: akankshamahajan15 fbshipit-source-id: 97db4ec5a95ae693c3290e176a3c12a9b1ad2f6d	2020-10-21 20:18:06 -07:00
anand76	00751e4292	Add a host location property to TableProperties (#7479 ) Summary: This PR adds support for writing a location identifier of the DB host to SST files as a table property. By default, the hostname is used, but it can be overridden by the user. There have been some recent corruptions in files written by ```SstFileWriter``` before checksumming, so this property can be used to trace a corrupted file back to the writing host and check that host for hardware issues. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7479 Test Plan: Add new unit tests Reviewed By: pdillinger Differential Revision: D24340671 Pulled By: anand1976 fbshipit-source-id: 2038949fd8d160c0633ccb4f9da77740f19fa2a2	2020-10-19 11:38:48 -07:00
Jay Zhuang	c87c3a48af	Add a missing bug fix in HISTORY.md (#7549 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7549 Reviewed By: ajkr, zhichao-cao Differential Revision: D24292032 Pulled By: jay-zhuang fbshipit-source-id: 0442283386ae20d10410a8d013a431d7cd282b22	2020-10-13 18:00:17 -07:00
Andrew Kryczka	3dc823212d	add missing release notes to HISTORY.md (#7545 ) Summary: These notes existed on the release branches where they were backported, but were never added on master branch. Added them now and mentioned what minor release the fix originally appeared. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7545 Reviewed By: riversand963 Differential Revision: D24281759 Pulled By: ajkr fbshipit-source-id: 7422e984b667793d6260dd32a7492afcb2ff1c4b	2020-10-13 12:13:47 -07:00
Andrew Kryczka	75d3b6fdf0	Redesign block cache pinning API (#7520 ) Summary: The old flag-based APIs (`BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache` and `BlockBasedTableOptions::pin_top_level_index_and_filter`) were insufficient for our needs. For example, it was impossible to pin only unpartitioned meta-blocks, which could prevent block cache contention when turning on dictionary compression or during a migration to partitioned indexes/filters. It was also impossible to pin all meta-blocks in memory while having predictable memory usage via block cache. If we had continued adding flags to address these scenarios, they would have had significant overlap causing confusion. Instead, this PR deprecates the flags and starts a new API with non-overlapping options. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7520 Test Plan: - new unit test - added new options to stress/crash test and ran for a while: `$ python tools/db_crashtest.py blackbox --simple --max_key=1000000 -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 --interval=10 -value_size_mult=33 -column_families=1 -reopen=0` Reviewed By: pdillinger Differential Revision: D24200034 Pulled By: ajkr fbshipit-source-id: 3fa7cfc71e7960f7a867511dd6ae5834dd73b13e	2020-10-11 14:58:24 -07:00
Akanksha Mahajan	9dd25487cc	Update release history 6.14 (#7525 ) Summary: Update release history for 6.14 Pull Request resolved: https://github.com/facebook/rocksdb/pull/7525 Test Plan: No code change Reviewed By: jay-zhuang Differential Revision: D24224690 Pulled By: akankshamahajan15 fbshipit-source-id: 95441aefde96672fea5a6af5d7e67cdafb1ebdd2	2020-10-09 16:05:22 -07:00
Akanksha Mahajan	38d0a365e3	Add Stats for MultiGet (#7366 ) Summary: Add following stats for MultiGet in Histogram to get more insight on MultiGet. 1. Number of index and filter blocks read from file as part of MultiGet request per level. 2. Number of data blocks read from file per level. 3. Number of SST files loaded from file system per level. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7366 Reviewed By: anand1976 Differential Revision: D24127040 Pulled By: akankshamahajan15 fbshipit-source-id: e63a003056b833729b277edc0639c08fb432756b	2020-10-07 13:28:48 -07:00
Jay Zhuang	8891e9a0eb	Disallow trivial move if BottommostLevelCompaction is kForce* (#7368 ) Summary: If `BottommostLevelCompaction.kForce*` is set, compaction should avoid trivial move and always compact the sst to the target size. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7368 Reviewed By: ajkr Differential Revision: D23629525 Pulled By: jay-zhuang fbshipit-source-id: 79f23c79ecb31587e0593b28cce43131107bbcd0	2020-10-07 13:19:31 -07:00
Peter Dillinger	9082771b86	Add is_full_compaction to CompactionJobStats, cleanup (#7451 ) Summary: This exposes to the listener interface whether a compaction was full or not. Also cleaned up API comment for CompactionJobInfo::stats, which is not of a nullable type. And since CompactionJob is always created with non-null CompactionJobStats, removed conditionals on it being nullptr and instead assert non-null. TODO later: update C and Java interfaces Pull Request resolved: https://github.com/facebook/rocksdb/pull/7451 Test Plan: updated existing unit tests to check new field, make check Reviewed By: ltamasi Differential Revision: D23977796 Pulled By: pdillinger fbshipit-source-id: 1ae7e26cb949631c2b2fb9e696710daf53cc378d	2020-10-01 12:52:58 -07:00
sdong	7508175558	Introduce options.check_flush_compaction_key_order (#7467 ) Summary: Introduce a new option options.check_flush_compaction_key_order, by default set to true, which checks the key order of flush and compaction, and fails the operation if the order is violated. Also did a minor refactor of the hash checking code, which consolidates the hashing logic into a validation class, where the key ordering logic is added. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7467 Test Plan: Add unit tests to validate the check can catch reordering in flush and compaction, and can be properly disabled. Reviewed By: riversand963 Differential Revision: D24010683 fbshipit-source-id: 8dd6292d2cda8006054e9ded7cfa4bf405f0527c	2020-10-01 10:10:26 -07:00
Peter Dillinger	ddbc5dad05	Enable force_consistency_checks by default (#7446 ) Summary: This has been running in production on some key workloads, so we believe it to be safe and extremely low cost. Nevertheless, I've added code to ensure that "force_consistency_checks" is mentioned in any corruption reports so that people know how to disable in case of false positive corruption reports. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7446 Test Plan: make check, CI, temporary debug print new message with ./version_builder_test Reviewed By: ajkr Differential Revision: D23972101 Pulled By: pdillinger fbshipit-source-id: 9623e400f3752577c0ecf977e6d0915562cf9968	2020-09-30 11:57:32 -07:00
Akanksha Mahajan	9d212d3f0e	Provide users with option to opt-in to get corrupt data in logs/messages (#7420 ) Summary: Add a new Option "allow_data_in_errors". When it's set by users, it allows them to opt in to error messages containing corrupted keys/values. Corrupt keys and values will be logged in messages, logs, status, etc., giving users useful information about the affected data. By default, the value is set to false to prevent users' data from being exposed in the messages. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7420 Test Plan: 1. make check -j64 2. Add a new test case Reviewed By: ajkr Differential Revision: D23835028 Pulled By: akankshamahajan15 fbshipit-source-id: 8d2eba8fb898e79fcf1fccc07295065a75eb59b1	2020-09-29 23:17:45 -07:00
Ramkumar Vadivelu	c203e01773	reset refitting_level_ flag to false in error paths (#7403 ) Summary: Reset refitting_level_ flag to false in error paths in DBImpl::ReFitLevel() Pull Request resolved: https://github.com/facebook/rocksdb/pull/7403 Reviewed By: ajkr Differential Revision: D23909028 Pulled By: ramvadiv fbshipit-source-id: 521ad9aadc1b734bef9ef9119d1e1ee1fa8126e9	2020-09-28 11:37:00 -07:00
Peter Dillinger	9d8eb77c4d	Less I/O for incremental backups, slightly better corruption detection (#7413 ) Summary: Two relatively simple functional changes to incremental backup behavior, integrated with a minor refactoring to reduce code redundancy and improve error/log message. There are nuances to the impact of these changes, but I believe they are fundamentally good and generally safe. Those functional changes: * Incremental backups no longer read DB table files that are already saved to a shared part of the backup directory, unless `share_files_with_checksum` is used with `kLegacyCrc32cAndFileSize` naming (discouraged) where crc32c full file checksums are needed to determine file naming. * Justification: incremental backups should not need to read the whole DB, especially without rate limiting. (Although other BackupEngine reads are not rate limited either, other non-trivial reads are generally limited by a corresponding write, as in copying files.) Also, the fact that this is not already fixed was arguably a bug/oversight in the implementation of https://github.com/facebook/rocksdb/issues/7110. * When considering whether a table file is already backed up in a shared part of backup directory, BackupEngine would already query the sizes of source (DB) and pre-existing destination (backup) files. BackupEngine now uses these file sizes to detect corruption, as at least one of (a) old backup, (b) backup in progress, or (c) current DB is corrupt if there's a size mismatch. * Justification: a random related fix that also helps to cover a small hole in corruption checking uncovered by the other functional change: * For `share_table_files` without "checksum" (not recommended), the other change regresses in detecting fundamentally unsafe use of this option combination: when you might generate different versions of same SST file number. As demonstrated by `BackupableDBTest.FailOverwritingBackups,` this regression is greatly mitigated by the new file size checking. 
Nevertheless, almost no reason to use `share_files_with_checksum=false` should remain, and comments are updated appropriately. Also, this change renames internal function `CalculateChecksum` to `ReadFileAndComputeChecksum` to make the performance impact of this function clear in code reviews. It is not clear what 'same_path' is for in backupable_db.cc, and I suspect it cannot be true for a DB with unique file names (like DBImpl). Nevertheless, I've tried to keep its functionality intact when `true` to minimize risk for now, despite having no unit tests for which it is true. Select impact details (much more in unit tests): For `share_files_with_checksum`, I am confident there is no regression (vs. pre-6.12) in detecting DB or backup corruption at backup creation time, mostly because the old design did not leverage this extra checksum computation for detecting inconsistencies at backup creation time. (With computed checksums in names, a recently corrupted file just looked like a different file vs. what was already backed up.) Even in the hypothetical case of DB session id collision (~100 bits entropy collision), file size in name and/or our file size check add an extra layer of protection against false success in creating an accurate new backup. (Unit test included.) `DB::VerifyChecksum` and `BackupEngine::VerifyBackup` with checksum checking are still able to catch corruptions that `CreateNewBackup` does not. Note that when custom file checksum support is added to BackupEngine, that will essentially give the same power as `DB::VerifyChecksum` into `CreateNewBackup`. We could add options for `CreateNewBackup` to cover some of what would be caught by `VerifyBackup` with checksum checking. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7413 Test Plan: Two new unit tests included, both of which fail without these changes. 
Although we don't test the I/O improvement directly, we test it indirectly in DB corruption detection power that was inadvertently unlocked with new backup file naming PLUS computing current content checksums (now removed). (I don't think that case of DB corruption detection justifies reading the whole DB on incremental backup.) Reviewed By: zhichao-cao Differential Revision: D23818480 Pulled By: pdillinger fbshipit-source-id: 148aff16f001af5b9fd4b22f155311c2461f1bac	2020-09-21 16:19:24 -07:00
Peter Dillinger	52691703fc	Update HISTORY.md for #7346 (#7417 ) Summary: Copied from Andrew's entry for 6.12.3. Inserted here retroactive to 6.12 Pull Request resolved: https://github.com/facebook/rocksdb/pull/7417 Test Plan: no code change Reviewed By: jay-zhuang Differential Revision: D23815980 Pulled By: pdillinger fbshipit-source-id: 3c8a052cdb61be1215d311556c9487f9ea5c8cb0	2020-09-21 09:47:36 -07:00
Peter Dillinger	b475a83f9d	Postponing custom checksum support in BackupEngine (#7411 ) Summary: This change reverts BackupEngine to 6.12 state to accommodate a higher-priority fix that does not easily merge with this custom checksum support. We intend to reinstate this support soon, by merging a revert of this change. For backupable_db_test, I've removed the tests depending on this feature. I've also removed relevant HISTORY.md entry. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7411 Test Plan: unit tests Reviewed By: ajkr Differential Revision: D23793835 Pulled By: pdillinger fbshipit-source-id: 7e861436539584799b13d1a8ae559b81b6d08052	2020-09-18 15:27:03 -07:00
Zhichao Cao	c268628c25	Map retryable IO error during Flush without WAL to soft error and no switch memtable during resume (#7310 ) Summary: In the current implementation, any retryable IO error that happens during Flush is mapped to a hard error. In this case, the DB is stopped and writes are stalled unless the background error is cleaned. In this PR, if WAL is DISABLED, a retryable IO error during Flush is mapped to a soft error, so that the memtable can continue to receive writes. At the same time, if auto resume is triggered, SwitchMemtable will not be called during Flush when resuming the DB, to avoid too many small memtables. Testing cases are added. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7310 Test Plan: adding new unit test, pass make check. Reviewed By: anand1976 Differential Revision: D23710892 Pulled By: zhichao-cao fbshipit-source-id: bc4ca50d11c6b23b60d2c0cb171d86d542b038e9	2020-09-17 20:25:45 -07:00
anand76	b9750c7c3c	Update HISTORY.md with IO fencing error code (#7402 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7402 Reviewed By: pdillinger Differential Revision: D23761689 Pulled By: anand1976 fbshipit-source-id: 59e10f0aaa80f6c0f5a46dc99467138c4cee0511	2020-09-17 11:31:24 -07:00
Peter Dillinger	93719fc953	Restore file size in backup table file names (and other cleanup) (#7400 ) Summary: Prior to 6.12, backup files using share_files_with_checksum had the file size encoded in the file name, after the last '\_' and before the last '.'. We considered this an implementation detail subject to change, and indeed removed this information from the file name (with an option to use old behavior) because it was considered ineffective/inefficient for file name uniqueness. However, some downstream RocksDB users were relying on this information since the file size is not explicitly in the backup manifest file. The primary purpose of this change is "retrofitting" the 6.12 release (not yet a public release) to simultaneously support the benefits of the new naming scheme (I/O performance and data correctness at scale) and preserve the file size information, both as default behaviors. With this change, we are essentially making the file size information encoded in the file name an official, though obscure, extension of the backup meta file format. We preserve an option (kLegacyCrc32cAndFileSize) to use the original "legacy" naming scheme, with its caveats, and make it easy to omit the file size information (no kFlagIncludeFileSize), for more compact file names. But note that changing the naming scheme used on an existing db and backup directory can lead to transient space amplification, as some files will be stored under two names in the shared_checksum directory. Because some backups were saved using the original 6.12 naming scheme, we offer two ways of dealing with those files: SST files generated by older 6.12 versions can either use the default naming scheme in effect when the SST files were generated (kFlagMatchInterimNaming, default, no transient space amplification) or can use a new naming scheme (no kFlagMatchInterimNaming, potential space amplification because some already stored files getting a new name). 
We don't have a natural way to detect which files were generated by previous 6.12 versions, but this change hacks one in by changing DB session ids to now use a more concise encoding, reducing file name length, saving ~dozen bytes from SST files, and making them visually distinct from DB ids so that they are less likely to be mixed up. Two final auxiliary notes: Recognizing that the backup file names have become a de facto part of the backup meta schema, this change makes them easier to parse and extend by putting a distinct marker, 's', before DB session ids embedded in the name. When we extend this to allow custom checksums in the name, they can get their own marker to ensure safe parsing. For backward compatibility, file size does not get a marker but is assumed for `_[0-9]+[.]` Another change from initial 6.12 default behavior is never including file custom checksum in the file name. Looking ahead to 6.13, we do not want the default behavior to cause backup space amplification for someone turning on file custom checksum checking in BackupEngine; we want that to be an easy decision. When implemented, including file custom checksums in backup file names will be a non-default option. Actual file name patterns and priorities, as regexes: kLegacyCrc32cAndFileSize OR pre-6.12 SST file -> [0-9]+_[0-9]+_[0-9]+[.]sst kFlagMatchInterimNaming set (default) AND early 6.12 SST file -> [0-9]+_[0-9a-fA-F-]+[.]sst kUseDbSessionId AND NOT kFlagIncludeFileSize -> [0-9]+_s[0-9A-Z]{20}[.]sst kUseDbSessionId AND kFlagIncludeFileSize (default) -> [0-9]+_s[0-9A-Z]{20}_[0-9]+[.]sst We might add opt-in options for more '\_' separated data in the name, but embedded file size, if present, will always be after last '\_' and before '.sst'. This change was originally applied to version 6.12. (See https://github.com/facebook/rocksdb/issues/7390) Pull Request resolved: https://github.com/facebook/rocksdb/pull/7400 Test Plan: unit tests included. 
Sync point callbacks are used to mimic previous version SST files. Reviewed By: ajkr Differential Revision: D23759587 Pulled By: pdillinger fbshipit-source-id: f62d8af4e0978de0a34f26288cfbe66049b70025	2020-09-17 10:24:22 -07:00
Peter Dillinger	7780a360eb	Fix HISTORY.md and check_format_compatible.sh for 6.13 branch (#7401 ) Summary: Make "unreleased" section for HISTORY.md with things misplaced into 6.12 and 6.13 Pull Request resolved: https://github.com/facebook/rocksdb/pull/7401 Test Plan: see how it goes, and `git diff origin/6.13.fb HISTORY.md` Reviewed By: jay-zhuang Differential Revision: D23759740 Pulled By: pdillinger fbshipit-source-id: fc441916c7ff2bbb8d5384137653b340d4c47674	2020-09-17 09:00:13 -07:00
mrambacher	67bd5401e9	Changes to EncryptedEnv public API (#7279 ) Summary: Cleaned up the public API to use the EncryptedEnv. This change will allow providers to be developed and added to the system more easily in the future. It will also allow better integration in the future with the OPTIONS file. - The internal classes were moved out of the public API into an internal "env_encryption_ctr.h" header. Short-cut constructors were added to provide the original API functionality. - The APIs to the constructors were changed to take shared_ptr, rather than raw pointers or references, to allow better memory management and alternative implementations. - CreateFromString methods were added to allow future expansion to other provider and cipher implementations through a standard API. Additionally, there was code duplication in the NewXXXFile methods. This common code was moved under a templatized function. A first pass at structuring the code was made to potentially allow multiple EncryptionProviders in a single EncryptedEnv. The idea was that different providers may use different cipher keys or different versions/algorithms. The EncryptedEnv should have some means of picking different providers based on information. The groundwork was started for this (the use of the provider_ member variable was localized) but the work has not been completed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7279 Reviewed By: jay-zhuang Differential Revision: D23709440 Pulled By: zhichao-cao fbshipit-source-id: 0e845fff0e03a52603eb9672b4ade32d063ff2f2	2020-09-15 17:14:10 -07:00
mrambacher	7d472accdc	Bring the Configurable options together (#5753 ) Summary: This PR merges the functionality of making the ColumnFamilyOptions, TableFactory, and DBOptions into Configurable into a single PR, resolving any merge conflicts Pull Request resolved: https://github.com/facebook/rocksdb/pull/5753 Reviewed By: ajkr Differential Revision: D23385030 Pulled By: zhichao-cao fbshipit-source-id: 8b977a7731556230b9b8c5a081b98e49ee4f160a	2020-09-14 17:01:01 -07:00
Peter Dillinger	ecc8ffe17b	Update master to version 6.13 (#7378 ) Summary: for release fork Pull Request resolved: https://github.com/facebook/rocksdb/pull/7378 Test Plan: make check + CI Reviewed By: jay-zhuang Differential Revision: D23669163 Pulled By: pdillinger fbshipit-source-id: 14cbf95b32717c28418c71cc8e10f06733bbc49f	2020-09-12 13:18:09 -07:00
Yanqin Jin	205e577694	Cancel tombstone skipping during bottommost compaction (#7356 ) Summary: During bottommost compaction, RocksDB cannot simply drop a tombstone if this tombstone is not in the earliest snapshot. The current behavior is: RocksDB skips other internal keys (of the same user key) in the same snapshot range. In the meantime, RocksDB should check for the `shutting_down` flag. Otherwise, it is possible for a bottommost compaction that has already started running to take a long time to finish, even if the application has tried to cancel all background jobs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7356 Test Plan: make check Reviewed By: ltamasi Differential Revision: D23663241 Pulled By: riversand963 fbshipit-source-id: 25f8e9b51bc3bfa3353cdf87557800f9d90ee0b5	2020-09-11 17:45:43 -07:00
Peter Dillinger	92639b93a6	Fix checkpoint file deletion race with avoid_unnecessary_blocking_io (#7369 ) Summary: https://github.com/facebook/rocksdb/issues/3341 guaranteed that upon return of `GetSortedWalFiles` after `DisableFileDeletions`, all pending purges of previously obsolete WAL files will have finished. However, the addition of avoid_unnecessary_blocking_io in https://github.com/facebook/rocksdb/issues/5043 opened a hole in the code making that assurance, which can lead to files to be copied for checkpoint or backup going missing before being copied, with that option enabled. This change patches the hole. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7369 Test Plan: apparent fix to backups in crash test observed. Will work on a unit test for another commit Reviewed By: ajkr Differential Revision: D23620258 Pulled By: pdillinger fbshipit-source-id: bea36b461a5b719c3e3ef802f967bc3e8ae71614	2020-09-10 22:35:25 -07:00
Yanqin Jin	8307d4400c	Update HISTORY.md for PR7329 (#7355 ) Summary: As title. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7355 Reviewed By: pdillinger Differential Revision: D23566635 Pulled By: riversand963 fbshipit-source-id: f8d846bcff637e7617b764b7bfb9a948ea18d195	2020-09-08 11:10:25 -07:00
Andrew Kryczka	5746767387	add `ldb unsafe_remove_sst_file` subcommand (#7335 ) Summary: This is adapted from https://github.com/facebook/rocksdb/issues/6678 but takes a different approach, avoiding opening a read-write DB and avoiding the `DeleteFile()` API. First, this PR refactors how options variables are initialized in `ldb` so it can be reused in a subcommand that doesn't open a DB: - Separated remaining option initialization logic out of `OpenDB()`. The new `PrepareOptions()` function initializes the full options state. - Fixed an old TODO about applying the subcommand CF option overrides to the proper `ColumnFamilyOptions` object. Second, this PR adds the `ldb unsafe_remove_sst_file` subcommand. It uses the `VersionSet`-level APIs to remove the file with the specified number. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7335 Test Plan: played with interactive python and this file removal command. Verified openability/correct results in case of multiple column families, multiple levels, etc. Reviewed By: pdillinger Differential Revision: D23454575 Pulled By: ajkr fbshipit-source-id: 039b7a8cbfc42fd123dcb25821eef51d61148afe	2020-09-03 16:54:51 -07:00
Andrew Kryczka	40e97b02be	add warning on `DeleteFile()` API (#7337 ) Summary: Since we can't land https://github.com/facebook/rocksdb/issues/7336 until the next major release, added a strong warning against the `DeleteFile()` API in the meantime. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7337 Reviewed By: pdillinger Differential Revision: D23459728 Pulled By: ajkr fbshipit-source-id: 326cb9b18190386080c35c761a8736d8a877dafb	2020-09-03 16:42:01 -07:00
Andrew Kryczka	af54c4092a	fix SstFileWriter with dictionary compression (#7323 ) Summary: In block-based table builder, the cut-over from buffered to unbuffered mode involves sampling the buffered blocks and generating a dictionary. There was a bug where `SstFileWriter` passed zero as the `target_file_size` causing the cutover to happen immediately, so there were no samples available for generating the dictionary. This PR changes the meaning of `target_file_size == 0` to mean buffer the whole file before cutting over. It also adds dictionary compression support to `sst_dump --command=recompress` for easy evaluation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7323 Reviewed By: cheng-chang Differential Revision: D23412158 Pulled By: ajkr fbshipit-source-id: 3b232050e70ef3c2ee85a4b5f6fadb139c569873	2020-09-03 15:49:57 -07:00
Hiep	d0c1a01c1b	Avoid converting MERGES to PUTS when allow_ingest_behind is true (#7166 ) Summary: - Closes https://github.com/facebook/rocksdb/issues/6490 - Currently MERGEs are converted to PUTs at the bottom level or when compaction has reached the beginning of the key; this can wrongly cover a PUT future base case. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7166 Test Plan: - Automated: `make all check` - Manual: With `allow_ingest_behind = true`, add Merge operations to a key then run compaction. Then ingest external files to make sure the base case is properly compacted with existing Merges. Reviewed By: cheng-chang Differential Revision: D23325425 Pulled By: ajkr fbshipit-source-id: 3eb415eb7b381b5453e45245393566153b1abb68	2020-09-03 14:39:58 -07:00
Andrew Kryczka	177f8bd063	Bound L0->Lbase fanout in dynamic leveled compaction (#7325 ) Summary: L0 score is based on size target and number of files. The size target used is `max_bytes_for_level_base`. However, the base level's size can dynamically expand in write burst mode. In fact, it can expand so much that L0->Lbase becomes the highest fanout in target sizes. This doesn't make sense from an efficiency perspective, so this PR bounds the L0->Lbase fanout to the smoothed level multiplier. The L0 scoring based on file count remains unchanged. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7325 Test Plan: contrived benchmark that exhibits the problem: ``` $ TEST_TMPDIR=/data/users/andrewkr/ ./db_bench -benchmarks=filluniquerandom,readrandom -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -level0_file_num_compaction_trigger=4 -level_compaction_dynamic_level_bytes=true -compression_type=none -max_background_jobs=12 -rate_limiter_bytes_per_sec=104857600 -benchmark_write_rate_limit=10485760 -num=100000000 ``` Results: - "Burst W-Amp" is the write-amp near the end of the fillrandom benchmark - "Total W-Amp" is the write-amp after readrandom has run a while and all levels no longer need compaction Branch \| Burst W-Amp \| Total W-Amp \| fillrandom (MB/s) -- \| -- \| -- \| -- master \| 20.2 \| 21.5 \| 4.7 dynamic-l0-score \| 12.6 \| 14.1 \| 7.2 Reviewed By: siying Differential Revision: D23412935 Pulled By: ajkr fbshipit-source-id: f91f2067188e432dd39deab02f1c56f195057a0e	2020-09-01 19:34:01 -07:00
Akanksha Mahajan	963314ffd6	Add unit test for max_write_buffer_size_to_maintain (#7311 ) Summary: Add a unit test case to check memory usage when max_write_buffer_size_to_maintain is set if flushed immutable memtables are trimmed timely or not. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7311 Test Plan: Compared the results with before bug fix. Reviewed By: ltamasi Differential Revision: D23321702 Pulled By: akankshamahajan15 fbshipit-source-id: da04ee21137d641a07fd499a9e2749eb036fcb1e	2020-08-28 17:38:05 -07:00
Jay Zhuang	c2485f2d81	Add buffer prefetch support for non directIO usecase (#7312 ) Summary: A new file interface `SupportPrefetch()` is added. When the user overrides it to `false`, an internal prefetch buffer will be used for readahead. Useful for non-directIO but FS doesn't have readahead support. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7312 Reviewed By: anand1976 Differential Revision: D23329847 Pulled By: jay-zhuang fbshipit-source-id: 71cd4ce6f4a820840294e4e6aec111ab76175527	2020-08-27 18:16:53 -07:00
Peter Dillinger	9aad24da55	Real fix for race in backup custom checksum checking (#7309 ) Summary: This is a "real" fix for the issue worked around in https://github.com/facebook/rocksdb/issues/7294. To get DB checksum info for live files, we now read the manifest file that will become part of the checkpoint/backup. This requires a little extra handling in taking a custom checkpoint, including only reading the manifest file up to the size prescribed by the checkpoint. This moves GetFileChecksumsFromManifest from backup code to file_checksum_helper.{h,cc} and removes apparently unnecessary checking related to column families. Updated HISTORY.md and warned potential future users of DB::GetLiveFilesChecksumInfo() Pull Request resolved: https://github.com/facebook/rocksdb/pull/7309 Test Plan: updated unit test, before and after Reviewed By: ajkr Differential Revision: D23311994 Pulled By: pdillinger fbshipit-source-id: 741e30a2dc1830e8208f7648fcc8c5f000d4e2d5	2020-08-26 10:39:20 -07:00
sdong	722814e357	Get() to fail with underlying failures in PartitionIndexReader::CacheDependencies() (#7297 ) Summary: Right now all I/O failures under PartitionIndexReader::CacheDependencies() are swallowed. This doesn't impact correctness but we've made a decision that any I/O error in read path now should be returned to users for awareness. Return errors in those cases instead. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7297 Test Plan: Add a new unit test that injects errors in this code path and checks that Get() fails. Only one I/O path is hit in PartitionIndexReader::CacheDependencies(). Several option changes were attempted but could not trigger other pread paths. Not sure whether other failure cases would even be possible. Would rely on continuous stress test to validate it. Reviewed By: anand1976 Differential Revision: D23257950 fbshipit-source-id: 859dbc92fa239996e1bb378329344d3d54168c03	2020-08-25 19:01:05 -07:00
Zhichao Cao	d51f88c9e4	Pass SST file checksum information through OnTableFileCreated (#7108 ) Summary: When SST file is created, application is able to know the file information through OnTableFileCreated callback in LogAndNotifyTableFileCreationFinished. Since file checksum information can be useful for application when the SST file is created, we add file_checksum and file_checksum_func_name information to TableFileCreationInfo, which will be passed through OnTableFileCreated. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7108 Test Plan: make check, listener_test. Reviewed By: ajkr Differential Revision: D22470240 Pulled By: zhichao-cao fbshipit-source-id: 92c20344d9b986eadfe3480f3769bf4add0dbaae	2020-08-25 10:46:11 -07:00
Connor1996	416943bf28	Eliminates a no-op compaction upon snapshot release when disabling auto compactions (#7267 ) Summary: After releasing a snapshot, it checks whether it is suitable to trigger bottom compactions. When disabling auto compactions, it may still schedule compaction when releasing a snapshot. Whereas no compaction job will be actually handled, so the state of LSM is not changed and compaction will be triggered again and again every time releasing a snapshot. Too frequent compactions lead to high CPU usage and high db_mutex lock contention which affects foreground write duration finally. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7267 Test Plan: - make check - manual test Reviewed By: akankshamahajan15 Differential Revision: D23252880 Pulled By: ajkr fbshipit-source-id: 4431e071a35d9912a2a3592875db27bae521434b	2020-08-24 22:06:45 -07:00
Akanksha Mahajan	3844612625	Bug Fix for memtables not trimmed down. (#7296 ) Summary: When a memtable is trimmed in MemTableListVersion, the memtable is only added to delete list if it is the last reference. However it is not the last reference as it is held by the super version. But the super version would not be switched if the delete list is empty. So the memtable is never destroyed and memory usage increases beyond write_buffer_size + max_write_buffer_size_to_maintain. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7296 Test Plan: 1. ./db_bench -benchmarks=randomtransaction -optimistic_transaction_db=1 -statistics -stats_interval_seconds=1 -duration=90 -num=500000 --max_write_buffer_size_to_maintain=16000000 --transaction_set_snapshot Reviewed By: ltamasi Differential Revision: D23267395 Pulled By: akankshamahajan15 fbshipit-source-id: 3a8d437fe9f4015f851ff84c0e29528aa946b650	2020-08-21 13:29:05 -07:00
Peter Dillinger	a1b5484811	Work around a backup bug with DB custom checksums (#7294 ) Summary: On a read-write DB configured with DBOptions::file_checksum_gen_factory, BackupEngine::CreateNewBackup can fail intermittently, with non-OK status. This is due to a race between GetLiveFiles and GetLiveFilesChecksumInfo in creating backups. For patching 6.12 release (as this commit is intended for, except this is a forward-merged version), we can simply treat files for which we falsely failed to get checksum info as legacy files lacking checksum info. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7294 Test Plan: unit test reproducer included Reviewed By: ajkr Differential Revision: D23253489 Pulled By: pdillinger fbshipit-source-id: 9e4945dad120b776ad3e753be10b962f61f28e14	2020-08-21 08:16:04 -07:00
Andrew Kryczka	5d5ff82408	Disable `recycle_log_file_num` with `kTolerateCorruptedTailRecords` (#7271 ) Summary: The two features are naturally incompatible. WAL recycling expects the recovery to succeed upon encountering a corrupt record at the point where new data ends and recycled data remains at the tail. However, `WALRecoveryMode::kTolerateCorruptedTailRecords` must fail upon encountering any such corrupt record, as it cannot differentiate between this and a real corruption, which would cause committed updates to be truncated. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7271 Reviewed By: riversand963 Differential Revision: D23169923 Pulled By: ajkr fbshipit-source-id: 2cf8a3bcd2c9a0ecb0055a84725047a10fd4db50	2020-08-17 18:21:10 -07:00
Yanqin Jin	92593d511a	Add a new EntryType for deletion with timestamp (#7195 ) Summary: Add `kEntryDeleteWithTimestamp` to `EntryType` which is a public API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7195 Test Plan: make check Reviewed By: ajkr Differential Revision: D22914704 Pulled By: riversand963 fbshipit-source-id: 886f73c6b70c527cad1c8fc9fc8d3afe60e1ea39	2020-08-17 16:26:06 -07:00
sdong	1760637539	CompactRange() refit level should confirm destination level is not empty (#7261 ) Summary: There is a potential data race related to CompactRange() with level refitting. After the compaction step and refitting step, some automatic compaction could put data to the destination level and cause the DB to be corrupted. Fix the bug by checking the target level to be empty. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7261 Test Plan: Add a unit test, which would fail with "Corruption: L1 have overlapping ranges '666F6F' seq:6, type:1 vs. '626172' seq:2, type:1", and now it succeeds. Reviewed By: ajkr Differential Revision: D23142269 fbshipit-source-id: 28bc14d5ac934c192260b23a4ce3f10a95e3ee91	2020-08-17 14:21:53 -07:00
Jay Zhuang	69760b4d05	Introduce a global StatsDumpScheduler for stats dumping (#7223 ) Summary: Have a global StatsDumpScheduler for all DB instance stats dumping, including `DumpStats()` and `PersistStats()`. Before this, there were 2 dedicated threads for every DB instance, one for DumpStats() and one for PersistStats(), which could create lots of threads if there are hundreds of DB instances. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7223 Reviewed By: riversand963 Differential Revision: D23056737 Pulled By: jay-zhuang fbshipit-source-id: 0faa2311142a73433ebb3317361db7cbf43faeba	2020-08-14 20:12:44 -07:00
Andrew Kryczka	a1aa3f8385	Disable manual compaction during `ReFitLevel()` (#7250 ) Summary: Manual compaction with `CompactRangeOptions::change_levels` set could refit to a level targeted by another manual compaction. If force_consistency_checks were disabled, it could be possible for overlapping files to be written at that target level. This PR prevents the possibility by calling `DisableManualCompaction()` prior to `ReFitLevel()`. It also improves the manual compaction disabling mechanism to wait for pending manual compactions to complete before returning, and support disabling from multiple threads. Fixes https://github.com/facebook/rocksdb/issues/6432. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7250 Test Plan: crash test command that repro'd the bug reliably: ``` $ TEST_TMPDIR=/dev/shm python tools/db_crashtest.py blackbox --simple -target_file_size_base=524288 -write_buffer_size=1048576 -clear_column_family_one_in=0 -reopen=0 -max_key=10000000 -column_families=1 -max_background_compactions=8 -compact_range_one_in=100000 -compression_type=none -compaction_style=1 -num_levels=5 -universal_min_merge_width=4 -universal_max_merge_width=8 -level0_file_num_compaction_trigger=12 -rate_limiter_bytes_per_sec=1048576000 -universal_max_size_amplification_percent=100 --duration=3600 --interval=60 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --enable_compaction_filter=0 ``` Reviewed By: ltamasi Differential Revision: D23090800 Pulled By: ajkr fbshipit-source-id: afcbcd51b42ce76789fdb907d8b9ada790709c13	2020-08-14 11:29:52 -07:00
Zitan Chen	b578ca2e4d	BackupEngine supports custom file checksums (#7085 ) Summary: A new option `std::shared_ptr<FileChecksumGenFactory> backup_checksum_gen_factory` is added to `BackupableDBOptions`. This allows custom checksum functions to be used for creating, verifying, or restoring backups. Tests are added. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7085 Test Plan: Passed make check Reviewed By: pdillinger Differential Revision: D22390756 Pulled By: gg814 fbshipit-source-id: 3b7756ca444c2129844536b91c3ca09f53b6248f	2020-08-12 13:31:09 -07:00
sdong	41c328fe57	Fix a perf regression that caused every key to go through upper bound check (#7209 ) Summary: https://github.com/facebook/rocksdb/pull/5289 introduces a performance regression that caused an upper bound check within every BlockBasedTableIterator::Next(). This is unnecessary if we've checked the boundary key for current block and it is within upper bound. Fix the bug. Also change the boolean to an enum so that the code is slightly more readable. The original regression was probably to fix a bug that the block upper bound check status is not reset after a new block is created. Fix that bug so that the regression can be avoided without hitting the bug. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7209 Test Plan: Run all existing tests. Will run atomic black box crash test for a while. Reviewed By: anand1976 Differential Revision: D22859246 fbshipit-source-id: cbdad1f5e656c55fd8b71726d5a4f6cb53ff9140	2020-08-04 11:30:09 -07:00
Yanqin Jin	a38f04ac26	Update HISTORY and version for 6.12 release (#7194 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7194 Reviewed By: gg814 Differential Revision: D22810654 Pulled By: riversand963 fbshipit-source-id: 01f13089fa2b7e31b827da3e30c90e5c62c41380	2020-07-29 10:13:21 -07:00
Tomas Kolda	cd4592c220	SST Partitioner interface that allows to split SST files (#6957 ) Summary: SST Partitioner interface that allows splitting SST files during compactions. It basically instructs compaction to create a new file when needed. When one is using well-defined prefixes and a prefixed way of defining tables, it is good to also define partitioning so that promotion of some SST file does not cover a huge key space on the next level (worst case the complete space). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6957 Reviewed By: ajkr Differential Revision: D22461239 fbshipit-source-id: 9ce07bba08b3ba89c2d45630520368f704d1316e	2020-07-24 13:44:49 -07:00
Cheng Chang	7af1fab443	Update HISTORY (#7158 ) Summary: Mention the MultiRead bug in HISTORY. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7158 Test Plan: N/A Reviewed By: siying Differential Revision: D22670565 Pulled By: cheng-chang fbshipit-source-id: 16abf0192957be66511f6a08e00157bfd37b189f	2020-07-23 08:47:13 -07:00
Jay Zhuang	b0c5ecd6b3	Make max_subcompactions dynamically changeable (#7159 ) Summary: Make `max-subcompactions` dynamically changeable by passing the `DBOption` to Compaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7159 Reviewed By: siying Differential Revision: D22671238 Pulled By: jay-zhuang fbshipit-source-id: 311ca9f6bb606965544d8708616d358cfed5be42	2020-07-22 18:32:52 -07:00
mrambacher	d44cbc5314	Add hash of key/value checks when paranoid_file_checks=true (#7134 ) Summary: When paranoid_file_checks=true, a rolling key-value hash is generated and compared to what is written to the file. If the values do not match, the SST file is rejected. Code put in place for the check for both flush and compaction jobs. Corresponding test added to corruption_test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7134 Reviewed By: cheng-chang Differential Revision: D22646149 fbshipit-source-id: 8fde1984a1a11edd3bd82a413acffc5ea7aa683f	2020-07-22 11:04:40 -07:00
Haosen Wen	dbc51adbac	Use steady_clock instead of system_clock in FileOperationInfo::TimePoint (#7153 ) Summary: Issue https://github.com/facebook/rocksdb/issues/7133 reported that using `system_clock` in `FileOperationInfo::TimePoint` causes the duration of a file flush operation (which can be a noop on MacOS in some scenarios) to appear to be 0 and fail an assertion in listener_test. Using `steady_clock` supposedly fixed the problem. `steady_clock` actually fits better into the use cases of `FileOperationInfo::TimePoint` as all usages care about durations but not wall clock time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7153 Test Plan: make check. Reviewed By: riversand963 Differential Revision: D22654136 Pulled By: roghnin fbshipit-source-id: 5980b1080734bdae496a18071a2c2b5887c67d85	2020-07-22 08:55:02 -07:00
Zitan Chen	b923dc720b	BackupEngine computes table checksums only once if db session ids are available (#7110 ) Summary: BackupEngine requires computing table checksums twice when backing up table files to the `shared_checksum` directory. The repeated computation can be avoided by utilizing the db session id stored as a part of the table properties. Filenames of table files in the `shared_checksum` directory depend on the following conditions: 1. the naming scheme is `kOptionalChecksumAndDbSessionId`, 2. `db_session_id` is not empty, 3. checksum is available in the DB manifest. If 1,2,3 are satisfied, then the filenames will be of the form `<file_number>_<checksum>_<db_session_id>.sst`. If 1,2 are satisfied, then the filenames will be of the form `<file_number>_<db_session_id>.sst`. In all other cases, the filenames are of the form `<file_number>_<checksum>_<size>.sst`. Additionally, if `kOptionalChecksumAndDbSessionId` is used (and not falling back to `kChecksumAndFileSize`), the `<checksum>` appearing in the filenames is hexadecimal-encoded, instead of being a plain `uint32_t` value. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7110 Test Plan: backupable_db_test and manual tests. Reviewed By: ajkr Differential Revision: D22508992 Pulled By: gg814 fbshipit-source-id: 5669f0ea9ad5a097f69f6d87aca4abba15032389	2020-07-21 10:35:40 -07:00
Andrew Kryczka	9a83fd21e6	stagger first DumpMallocStats after opening DB (#7145 ) Summary: Previously when running `db_bench` with large value for `num_multi_dbs` and enabled `Options::dump_malloc_stats`, we would see most CPU spent in jemalloc locking. After this PR that no longer shows up at the top of the profile. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7145 Reviewed By: riversand963 Differential Revision: D22593031 Pulled By: ajkr fbshipit-source-id: 3b3fc91f93249c6afee53f59f34c487c3fc5add6	2020-07-17 16:13:26 -07:00
Zhichao Cao	a10f12eda1	Auto resume the DB from Retryable IO Error (#6765 ) Summary: In the current codebase, in the write path, if a Retryable IO Error happens, SetBGError is called. The retryable IO Error is converted to a hard error and the DB is in read only mode. User or application needs to resume it. In this PR, if a Retryable IO Error happens in one DB, SetBGError will create a new thread to call Resume (auto resume). options.max_bgerror_resume_count controls if auto resume is enabled or not (if max_bgerror_resume_count<=0, auto resume will not be enabled). options.bgerror_resume_retry_interval controls the time interval to call Resume again if the previous resume fails due to the Retryable IO Error. If a non-retryable error happens during resume, auto resume will terminate. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6765 Test Plan: Added the unit test cases in error_handler_fs_test and pass make asan_check Reviewed By: anand1976 Differential Revision: D21916789 Pulled By: zhichao-cao fbshipit-source-id: acb8b5e5dc3167adfa9425a5b7fc104f6b95cb0b	2020-07-15 11:03:58 -07:00
Yanqin Jin	27735dea9a	Report corrupted keys during compaction (#7124 ) Summary: Currently, RocksDB lets compaction go through even in case of corrupted keys, the number of which is reported in CompactionJobStats. However, RocksDB does not check this value. We should let compaction run in a stricter mode. Temporarily disable two tests that allow corrupted keys in compaction. With this PR, the two tests will assert(false) and terminate. Still need to investigate what is the recommended google-test way of doing it. Death test (EXPECT_DEATH) in gtest has warnings now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7124 Test Plan: make check Reviewed By: ajkr Differential Revision: D22530722 Pulled By: riversand963 fbshipit-source-id: 6a5a6a992028c6d4f92cb74693c92db462ae4ad6	2020-07-14 17:18:17 -07:00
Andrew Kryczka	82611ee25a	save key comparisons in BlockIter::BinarySeek (#7068 ) Summary: This is a followup to https://github.com/facebook/rocksdb/issues/6646. In that PR, for simplicity I just appended a comparison against the 0th restart key in case `BinarySeek()`'s binary search landed at index 0. As a result there were `2/(N+1) + log_2(N)` key comparisons. This PR does it differently. Now we expand the binary search range by one so it also covers the case where target is at or before the restart key at index 0. As a result, it involves `log_2(N+1)` key comparisons. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7068 Test Plan: ran readrandom with mostly default settings and counted key comparisons using `PerfContext`. before: `user_key_comparison_count = 28881965` after: `user_key_comparison_count = 27823245` setup command: ``` $ TEST_TMPDIR=/dev/shm/dbbench ./db_bench -benchmarks=fillrandom,compact -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -max_background_jobs=12 -level_compaction_dynamic_level_bytes=true -num=10000000 ``` benchmark command: ``` $ TEST_TMPDIR=/dev/shm/dbbench/ ./db_bench -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=10000000 -compression_type=none -reads=1000000 -perf_level=3 ``` Reviewed By: anand1976 Differential Revision: D22357032 Pulled By: ajkr fbshipit-source-id: 8b01e9c1c2a4e9d02fc9dfe16c1cc0327f8bdf24	2020-07-09 12:27:20 -07:00
Akanksha Mahajan	54f171fe90	Update Flush policy in PartitionedIndexBuilder on switching from user-key to internal-key mode (#7096 ) Summary: When format_version is high enough to support user-key and there are index entries for same user key that spans multiple data blocks then it changes from user-key mode to internal-key mode. But the flush policy is not reset to point to Block Builder of internal-keys. After this switch, no entries are added to user key index partition result, thus it never triggers flushing the block. Fix: 1. After adding the entry in sub_builder_index_, if there is a switch from user-key to internal-key, then flush policy is updated to point to Block Builder of internal-keys index partition. 2. Set sub_builder_index_->seperator_is_key_plus_seq_ = true if seperator_is_key_plus_seq_ is set to true so that subsequent partitions can also use internal key mode. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7096 Test Plan: make check -j64 Reviewed By: ajkr Differential Revision: D22416598 Pulled By: akankshamahajan15 fbshipit-source-id: 01fc2dc07ea1b32f8fb803995ebe6e9a3fbe67ac	2020-07-08 21:03:04 -07:00
wenh	226d1f9c73	extend listener callback functions to more file I/O operations (#7055 ) Summary: Currently, `EventListener` in listener.h only has callback functions for file read and write. One may favor extended callback functions for more file I/O operations like flush, sync and close. This PR tries to add those interfaces and have them called when appropriate throughout the code base. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7055 Test Plan: Write an experimental listener with those new callback functions with log output in them; run experiments and check logs to see those functions are actually called. Default test suite `make check` should also be included. Reviewed By: riversand963 Differential Revision: D22380624 Pulled By: roghnin fbshipit-source-id: 4121491d45c2c2aae8c255e7998090559a241c6a	2020-07-07 18:21:18 -07:00
Andrew Kryczka	dd29ad4223	Separate internal and user key comparators in `BlockIter` (#6944 ) Summary: Replace `BlockIter::comparator_` and `IndexBlockIter::user_comparator_wrapper_` with a concrete `UserComparatorWrapper` and `InternalKeyComparator`. The motivation for this change was the inconvenience of not knowing the concrete type of `BlockIter::comparator_`, which prevented calling specialized internal key comparison functions to optimize comparison of keys with global seqno applied. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6944 Test Plan: benchmark setup -- single file DBs, in-memory, no compression. "normal_db" created by regular flush; "ingestion_db" created by ingesting a file. Both DBs have same contents. ``` $ TEST_TMPDIR=/dev/shm/normal_db/ ./db_bench -benchmarks=fillrandom,compact -write_buffer_size=10485760000 -disable_auto_compactions=true -compression_type=none -num=1000000 $ ./ldb write_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/ --compression_type=no --hex --create_if_missing < <(./sst_dump --command=scan --output_hex --file=/dev/shm/normal_db/dbbench/000007.sst \| awk 'began {print "0x" substr($1, 2, length($1) - 2), "==>", "0x" $5} ; /^Sst file format: block-based/ {began=1}') $ ./ldb ingest_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/ ``` benchmark run command: ``` $ TEST_TMPDIR=/dev/shm/$DB/ ./db_bench -benchmarks=seekrandom -seek_nexts=$SEEK_NEXT -use_existing_db=true -cache_index_and_filter_blocks=false -num=1000000 -cache_size=0 -threads=1 -reads=200000000 -mmap_read=1 -verify_checksum=false ``` results: perf improved marginally for ingestion_db and did not change significantly for normal_db: SEEK_NEXT \| DB \| code \| ops/sec \| % change -- \| -- \| -- \| -- \| -- 0 \| normal_db \| master \| 350880 \| 0 \| normal_db \| PR6944 \| 351040 \| 0.0 0 \| ingestion_db \| master \| 343255 \| 0 \| ingestion_db \| PR6944 \| 349424 \| 1.8 10 \| normal_db \| master \| 218711 \| 10 \| normal_db \| PR6944 \| 217892 \| -0.4 10 \| ingestion_db \| master \| 220334 \| 10 \| ingestion_db \| PR6944 \| 226437 \| 2.8 Reviewed By: pdillinger Differential Revision: D21924676 Pulled By: ajkr fbshipit-source-id: ea4288a2eefa8112eb6c651a671c1de18c12e538	2020-07-07 17:26:16 -07:00
Zitan Chen	373d5ac485	BackupEngine verifies table file checksums on creating new backups (#7015 ) Summary: When table file checksums are enabled and stored in the DB manifest by using the RocksDB default crc32c checksum function, BackupEngine will calculate the crc32c checksum of the file to be copied and compare the calculated result with the one stored in the DB manifest before copying the file to the backup directory. After copying to the backup directory, BackupEngine will verify the checksum of the copied file with the one calculated before copying. This helps detect some rare corruption events such as bit-flips during the copying process. No verification with checksums in DB manifest will be performed if the table file checksum function is not the RocksDB default crc32c checksum function. In addition, if `share_table_files` and `share_files_with_checksum` are true, BackupEngine will compare the checksums computed before and after copying of the table files. Corresponding tests are added. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7015 Test Plan: Passed make check Reviewed By: pdillinger Differential Revision: D22165732 Pulled By: gg814 fbshipit-source-id: ee0e8cc397c455eba64545c29380b9d9853588ec	2020-07-02 18:15:12 -07:00
Peter Dillinger	a680a7ea37	Un-revert #7049 , revert #7022 (#7071 ) Summary: Even though local bisection gave me a clear signal (and still does) that reverting https://github.com/facebook/rocksdb/issues/7049 would fix the failures in MultiThreadedDBTest, https://github.com/facebook/rocksdb/issues/7022 seems to be the root cause. Reverting https://github.com/facebook/rocksdb/issues/7022 and keeping https://github.com/facebook/rocksdb/issues/7049 seems to fix the issue in local reproducer also. (Had these landed in opposite order, bisection would have found the root cause.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/7071 Reviewed By: akankshamahajan15 Differential Revision: D22362857 Pulled By: pdillinger fbshipit-source-id: ed63df3d74e9d4ce1604de8fe43b216166c7a3f0	2020-07-02 13:30:41 -07:00
Akanksha Mahajan	5edfe3a3d8	Update Flush policy in PartitionedIndexBuilder on switching from user-key to internal-key mode (#7022 ) Summary: When format_version is high enough to support user-key and there are index entries for same user key that spans multiple data blocks then it changes from user-key mode to internal-key mode. But the flush policy is not reset to point to Block Builder of internal-keys. After this switch, no entries are added to user key index partition result, thus it never triggers flushing the block. Fix: After adding the entry in sub_builder_index_, if there is a switch from user-key to internal-key, then flush policy is updated to point to Block Builder of internal-keys index partition. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7022 Test Plan: 1. make check -j64 2. Added one unit test case Reviewed By: ajkr Differential Revision: D22197734 Pulled By: akankshamahajan15 fbshipit-source-id: d87e9e46bccab8e896ee6979d6b79c51f73d479e	2020-07-01 14:58:08 -07:00
Zitan Chen	6a243b3ade	Generalize BackupEngine naming option for share_files_with_checksum SSTs and revert BackupEngine::VerifyBackup to check only file sizes by default (#7032 ) Summary: `bool BackupableDBOptions::new_naming_for_backup_files` is updated to `BackupTableNameOption BackupableDBOptions::share_files_with_checksum_naming`, where `BackupTableNameOption` is an `enum` type with two enumerators `kChecksumAndFileSize` and `kChecksumAndDbSessionId`. This opens up possibilities of extending the current naming scheme for backup table files. By default, `BackupTableNameOption BackupableDBOptions::share_files_with_checksum_naming` is set to `kChecksumAndDbSessionId`. Revert `BackupEngine::VerifyBackup` to only check file sizes by default. Also fix the construction of the `SstFileDumper` in `GetFileDbIdentities` by setting a proper `Env` of the `Options` passed in the constructor. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7032 Test Plan: make check Reviewed By: ajkr Differential Revision: D22237763 Pulled By: gg814 fbshipit-source-id: 466902a4e731babd64e30f0e82ca1aa82962e52e	2020-06-30 18:47:16 -07:00
Burton Li	5be2cb6948	Compaction filter support for BlobDB (#6850 ) Summary: Added compaction filter support for BlobDB non-TTL values. Same as vanilla RocksDB, user compaction filter applies to all k/v pairs of the compaction for non-TTL values. It honors `min_blob_size`, which potentially results in value transitions between inlined data and stored-in-blob data when the size of a value is changed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6850 Reviewed By: siying Differential Revision: D22263487 Pulled By: ltamasi fbshipit-source-id: 8fc03f8cde2a5c831e63b436b3dbf1b7f90939e8	2020-06-29 17:32:14 -07:00
Zitan Chen	1569dc48f5	`BackupEngine::VerifyBackup` verifies checksum by default (#7014 ) Summary: A parameter `verify_with_checksum` is added to `BackupEngine::VerifyBackup`, which is true by default. So now `BackupEngine::VerifyBackup` verifies backup files with checksum AND file size by default. When `verify_with_checksum` is false, `BackupEngine::VerifyBackup` only compares file sizes to verify backup files. Also add a test for the case when corruption does not change the file size. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7014 Test Plan: Passed backupable_db_test Reviewed By: zhichao-cao Differential Revision: D22165590 Pulled By: gg814 fbshipit-source-id: 606a7450714e868bceb38598c89fd356c6004f4f	2020-06-26 11:42:12 -07:00
Zitan Chen	95fbb62c44	Update HISTORY.md to include the Public API Change for DB::OpenForReadonly introduced earlier (#7023 ) Summary: `DB::OpenForReadOnly()` now returns `Status::NotFound` when the specified DB directory does not exist. Previously the error returned depended on the underlying `Env`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7023 Reviewed By: ajkr Differential Revision: D22207845 Pulled By: gg814 fbshipit-source-id: f35830811a0e67efb0ee82eda3a9739bc526baba	2020-06-25 06:14:29 -07:00
Zitan Chen	be41c61f22	Add a new option for BackupEngine to store table files under shared_checksum using DB session id in the backup filenames (#6997 ) Summary: `BackupableDBOptions::new_naming_for_backup_files` is added. This option is false by default. When it is true, backup table filenames under directory shared_checksum are of the form `<file_number>_<crc32c>_<db_session_id>.sst`. Note that when this option is true, it comes into effect only when both `share_files_with_checksum` and `share_table_files` are true. Three new test cases are added. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6997 Test Plan: Passed make check. Reviewed By: ajkr Differential Revision: D22098895 Pulled By: gg814 fbshipit-source-id: a1d9145e7fe562d71cde7ac995e17cb24fd42e76	2020-06-24 19:31:25 -07:00
Yanqin Jin	e66199d848	First step towards handling MANIFEST write error (#6949 ) Summary: This PR provides preliminary support for handling IO error during MANIFEST write. File write/sync is not guaranteed to be atomic. If we encounter an IOError while writing/syncing to the MANIFEST file, we cannot be sure about the state of the MANIFEST file. The version edits may or may not have reached the file. During cleanup, if we delete the newly-generated SST files referenced by the pending version edit(s), but the version edit(s) actually are persistent in the MANIFEST, then next recovery attempt will process the version edits(s) and then fail since the SST files have already been deleted. One approach is to truncate the MANIFEST after write/sync error, so that it is safe to delete the SST files. However, file truncation may not be supported on certain file systems. Therefore, we take the following approach. If an IOError is detected during MANIFEST write/sync, we disable file deletions for the faulty database. Depending on whether the IOError is retryable (set by underlying file system), either RocksDB or application can call `DB::Resume()`, or simply shutdown and restart. During `Resume()`, RocksDB will try to switch to a new MANIFEST and write all existing in-memory version storage in the new file. If this succeeds, then RocksDB may proceed. If all recovery is completed, then file deletions will be re-enabled. Note that multiple threads can call `LogAndApply()` at the same time, though only one of them will be going through the process MANIFEST write, possibly batching the version edits of other threads. When the leading MANIFEST writer finishes, all of the MANIFEST writing threads in this batch will have the same IOError. They will all call `ErrorHandler::SetBGError()` in which file deletion will be disabled. Possible future directions: - Add an `ErrorContext` structure so that it is easier to pass more info to `ErrorHandler`. 
Currently, as in this example, a new `BackgroundErrorReason` has to be added. Test plan (dev server): make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/6949 Reviewed By: anand1976 Differential Revision: D22026020 Pulled By: riversand963 fbshipit-source-id: f3c68a2ef45d9b505d0d625c7c5e0c88495b91c8	2020-06-24 19:07:08 -07:00
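The recovery flow described in this summary can be sketched as a small state machine. This is a toy model under stated assumptions, not RocksDB's implementation: `ToyDb` and its members are hypothetical, IO failures are passed in as boolean flags, and rejecting further edits until `Resume` succeeds is a simplification made for the sketch.

```cpp
#include <cassert>

// Toy model of the flow in the commit summary. All names are hypothetical.
class ToyDb {
 public:
  // Simulates committing a version edit to the MANIFEST.
  // manifest_io_ok == false stands in for an IOError during write/sync.
  bool LogAndApply(bool manifest_io_ok) {
    if (bg_error_) return false;  // simplification: reject work until resumed
    if (!manifest_io_ok) {
      // The edit may or may not have reached the MANIFEST, so the new SST
      // files cannot be safely deleted. Disable deletions instead.
      bg_error_ = true;
      file_deletions_enabled_ = false;
      return false;
    }
    return true;
  }

  // Simulates Resume(): switch to a new MANIFEST, then re-enable deletions.
  bool Resume(bool new_manifest_io_ok) {
    if (!bg_error_) return true;  // nothing to recover from
    assert(!file_deletions_enabled_);
    if (!new_manifest_io_ok) return false;  // stay in the error state
    ++manifest_number_;
    bg_error_ = false;
    file_deletions_enabled_ = true;
    return true;
  }

  bool file_deletions_enabled() const { return file_deletions_enabled_; }
  int manifest_number() const { return manifest_number_; }

 private:
  bool bg_error_ = false;
  bool file_deletions_enabled_ = true;
  int manifest_number_ = 1;
};
```

The point of the design is that deletions stay off for as long as the MANIFEST state is unknown; only a successful switch to a new MANIFEST turns them back on.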
Peter Dillinger	5b2bbacb6f	Minimize memory internal fragmentation for Bloom filters (#6427 ) Summary: New experimental option BBTO::optimize_filters_for_memory builds filters that maximize their use of "usable size" from malloc_usable_size, which is also used to compute block cache charges. Rather than always "rounding up," we track state in the BloomFilterPolicy object to mix essentially "rounding down" and "rounding up" so that the average FP rate of all generated filters is the same as without the option. (YMMV as heavily accessed filters might be unluckily lower accuracy.) Thus, the option near-minimizes what the block cache considers as "memory used" for a given target Bloom filter false positive rate and Bloom filter implementation. There are no forward or backward compatibility issues with this change, though it only works on the format_version=5 Bloom filter. With Jemalloc, we see about 10% reduction in memory footprint (and block cache charge) for Bloom filters, but 1-2% increase in storage footprint, due to encoding efficiency losses (FP rate is non-linear with bits/key). Why not weighted random round up/down rather than state tracking? By only requiring malloc_usable_size, we don't actually know what the next larger and next smaller usable sizes for the allocator are. We pick a requested size, accept and use whatever usable size it has, and use the difference to inform our next choice. This allows us to narrow in on the right balance without tracking/predicting usable sizes. Why not weight history of generated filter false positive rates by number of keys? This could lead to excess skew in small filters after generating a large filter. 
Results from filter_bench with jemalloc (irrelevant details omitted): (normal keys/filter, but high variance) $ ./filter_bench -quick -impl=2 -average_keys_per_filter=30000 -vary_key_count_ratio=0.9 Build avg ns/key: 29.6278 Number of filters: 5516 Total size (MB): 200.046 Reported total allocated memory (MB): 220.597 Reported internal fragmentation: 10.2732% Bits/key stored: 10.0097 Average FP rate %: 0.965228 $ ./filter_bench -quick -impl=2 -average_keys_per_filter=30000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory Build avg ns/key: 30.5104 Number of filters: 5464 Total size (MB): 200.015 Reported total allocated memory (MB): 200.322 Reported internal fragmentation: 0.153709% Bits/key stored: 10.1011 Average FP rate %: 0.966313 (very few keys / filter, optimization not as effective due to ~59 byte internal fragmentation in blocked Bloom filter representation) $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000 -vary_key_count_ratio=0.9 Build avg ns/key: 29.5649 Number of filters: 162950 Total size (MB): 200.001 Reported total allocated memory (MB): 224.624 Reported internal fragmentation: 12.3117% Bits/key stored: 10.2951 Average FP rate %: 0.821534 $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory Build avg ns/key: 31.8057 Number of filters: 159849 Total size (MB): 200 Reported total allocated memory (MB): 208.846 Reported internal fragmentation: 4.42297% Bits/key stored: 10.4948 Average FP rate %: 0.811006 (high keys/filter) $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000000 -vary_key_count_ratio=0.9 Build avg ns/key: 29.7017 Number of filters: 164 Total size (MB): 200.352 Reported total allocated memory (MB): 221.5 Reported internal fragmentation: 10.5552% Bits/key stored: 10.0003 Average FP rate %: 0.969358 $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory Build avg ns/key: 30.7131 Number of 
filters: 160 Total size (MB): 200.928 Reported total allocated memory (MB): 200.938 Reported internal fragmentation: 0.00448054% Bits/key stored: 10.1852 Average FP rate %: 0.963387 And from db_bench (block cache) with jemalloc: $ ./db_bench -db=/dev/shm/dbbench.no_optimize -benchmarks=fillrandom -format_version=5 -value_size=90 -bloom_bits=10 -num=2000000 -threads=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false $ ./db_bench -db=/dev/shm/dbbench -benchmarks=fillrandom -format_version=5 -value_size=90 -bloom_bits=10 -num=2000000 -threads=8 -optimize_filters_for_memory -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false $ (for FILE in /dev/shm/dbbench.no_optimize/*.sst; do ./sst_dump --file=$FILE --show_properties | grep 'filter block' ; done) | awk '{ t += $4; } END { print t; }' 17063835 $ (for FILE in /dev/shm/dbbench/*.sst; do ./sst_dump --file=$FILE --show_properties | grep 'filter block' ; done) | awk '{ t += $4; } END { print t; }' 17430747 $ #^ 2.1% additional filter storage $ ./db_bench -db=/dev/shm/dbbench.no_optimize -use_existing_db -benchmarks=readrandom,stats -statistics -bloom_bits=10 -num=2000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false -duration=10 -cache_index_and_filter_blocks -cache_size=1000000000 rocksdb.block.cache.index.add COUNT : 33 rocksdb.block.cache.index.bytes.insert COUNT : 8440400 rocksdb.block.cache.filter.add COUNT : 33 rocksdb.block.cache.filter.bytes.insert COUNT : 21087528 rocksdb.bloom.filter.useful COUNT : 4963889 rocksdb.bloom.filter.full.positive COUNT : 1214081 rocksdb.bloom.filter.full.true.positive COUNT : 1161999 $ #^ 1.04 % observed FP rate $ ./db_bench -db=/dev/shm/dbbench -use_existing_db -benchmarks=readrandom,stats -statistics -bloom_bits=10 -num=2000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 
-fifo_compaction_allow_compaction=false -optimize_filters_for_memory -duration=10 -cache_index_and_filter_blocks -cache_size=1000000000 rocksdb.block.cache.index.add COUNT : 33 rocksdb.block.cache.index.bytes.insert COUNT : 8448592 rocksdb.block.cache.filter.add COUNT : 33 rocksdb.block.cache.filter.bytes.insert COUNT : 18220328 rocksdb.bloom.filter.useful COUNT : 5360933 rocksdb.bloom.filter.full.positive COUNT : 1321315 rocksdb.bloom.filter.full.true.positive COUNT : 1262999 $ #^ 1.08 % observed FP rate, 13.6% less memory usage for filters (Due to specific key density, this example tends to generate filters that are "worse than average" for internal fragmentation. "Better than average" cases can show little or no improvement.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/6427 Test Plan: unit test added, 'make check' with gcc, clang and valgrind Reviewed By: siying Differential Revision: D22124374 Pulled By: pdillinger fbshipit-source-id: f3e3aa152f9043ddf4fae25799e76341d0d8714e	2020-06-22 13:32:07 -07:00
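The annotated percentages in the db_bench output can be reproduced from the raw counters. The sketch below assumes (this is an interpretation of the annotations, not something the commit states) that false positives are `full.positive` minus `full.true.positive` and that `bloom.filter.useful` counts the true negatives; the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Observed FP rate in percent: FP / (FP + true negatives).
// Assumed counter meanings:
//   FP             = bloom.filter.full.positive - bloom.filter.full.true.positive
//   true negatives = bloom.filter.useful
double ObservedFpRatePercent(uint64_t useful, uint64_t full_positive,
                             uint64_t full_true_positive) {
  assert(full_positive >= full_true_positive);
  const uint64_t false_positives = full_positive - full_true_positive;
  const uint64_t negatives = false_positives + useful;
  assert(negatives > 0);
  return 100.0 * static_cast<double>(false_positives) /
         static_cast<double>(negatives);
}

// Relative change from `before` to `after`, in percent (negative = reduction).
double PercentChange(double before, double after) {
  assert(before != 0.0);
  return 100.0 * (after - before) / before;
}
```

Plugging in the counters above: the baseline run gives 52082 / 5015971, about 1.04%; the optimized run gives 58316 / 5419249, about 1.08%. `PercentChange(21087528, 18220328)` is about -13.6 (filter bytes inserted into the block cache), and `PercentChange(17063835, 17430747)` is about +2.15 (filter block storage), in line with the annotations.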
Matthew Von-Maszewski	1092f19d95	Make EncryptEnv inheritable (#6830 ) Summary: EncryptEnv class is both declared and defined within env_encryption.cc. This makes it really tough to derive new classes from that base. This branch moves declaration of the class to rocksdb/env_encryption.h. The change facilitates making new encryption modules (such as an upcoming openssl AES CTR pull request) possible / easy. The only coding change was to add the EncryptEnv object to env_basic_test.cc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6830 Reviewed By: riversand963 Differential Revision: D21706593 Pulled By: ajkr fbshipit-source-id: 64d2da95a1569ceeb9b1549c3bec5404cf4c89f0	2020-06-22 13:27:16 -07:00
sdong	d6b7b7712f	Fix a bug that causes iterator to return wrong result in a rare data race (#6973 ) Summary: The bug fixed in https://github.com/facebook/rocksdb/pull/1816/ now applies to iterators too. This was not an issue until https://github.com/facebook/rocksdb/pull/2886 caused the regression. If a put and a DB flush happen between the iterator getting the latest sequence number and getting the super version, an empty result for the key or an older value can be returned, which is wrong. Fix it in the same way as the fix in https://github.com/facebook/rocksdb/issues/1816, that is, get the sequence number after referencing the super version. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6973 Test Plan: Will run stress tests for a while to make sure there is no general regression. Reviewed By: ajkr Differential Revision: D22029348 fbshipit-source-id: 94390f93630906796d6e2fec321f44a920953fd1	2020-06-18 10:16:38 -07:00
Yanqin Jin	569b87e8c7	Fail recovery when MANIFEST record checksum mismatch (#6996 ) Summary: https://github.com/facebook/rocksdb/issues/5411 refactored `VersionSet::Recover` but introduced a bug, explained as follows. Before, once a checksum mismatch happens, `reporter` will set `s` to be non-ok. Therefore, Recover will stop processing the MANIFEST any further. ``` // Correct // Inside Recover LogReporter reporter; reporter.status = &s; log::Reader reader(..., reporter); while (reader.ReadRecord() && s.ok()) { ... } ``` The bug is that the local variable `s` in `ReadAndRecover` won't be updated by `reporter` while reading the MANIFEST. It is possible that the reader sees a checksum mismatch in a record, but `ReadRecord` internally retries the read and finds the next valid record. The mismatched record will be ignored and no error is reported. ``` // Incorrect // Inside Recover LogReporter reporter; reporter.status = &s; log::Reader reader(..., reporter); s = ReadAndRecover(reader, ...); // Inside ReadAndRecover Status s; // Shadows the s in Recover. while (reader.ReadRecord() && s.ok()) { ... } ``` `LogReporter` can use a separate `log_read_status` to track the errors while reading the MANIFEST. RocksDB can process more MANIFEST entries only if `log_read_status.ok()`. Test plan (devserver): make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/6996 Reviewed By: ajkr Differential Revision: D22105746 Pulled By: riversand963 fbshipit-source-id: b22f717a423457a41ca152a242abbb64cf91fc38	2020-06-18 10:09:12 -07:00
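The shadowing bug described in this summary can be reproduced in a few lines. The sketch below uses hypothetical stand-ins (`Status`, `Reporter`, `Reader`, `RecoverBuggy`, `RecoverFixed`) rather than RocksDB's real classes, and models a reader that reports a bad record and then skips ahead to the next valid one, as the summary describes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only.
struct Status {
  bool is_ok = true;
  std::string message;
  bool ok() const { return is_ok; }
};

struct Record { bool checksum_ok; };

struct Reporter {
  Status* status = nullptr;  // where corruption is recorded
  void Corruption(const std::string& why) {
    if (status != nullptr && status->ok()) {
      status->is_ok = false;
      status->message = why;
    }
  }
};

class Reader {
 public:
  Reader(const std::vector<Record>& records, Reporter* reporter)
      : records_(records), reporter_(reporter) { assert(reporter != nullptr); }
  // Reports a bad record, then keeps going until a valid record is found.
  bool ReadRecord() {
    while (pos_ < records_.size()) {
      if (records_[pos_++].checksum_ok) return true;
      reporter_->Corruption("checksum mismatch");
    }
    return false;
  }
 private:
  std::vector<Record> records_;
  Reporter* reporter_;
  std::size_t pos_ = 0;
};

// Buggy: the inner `s` is never touched by the reporter, and assigning the
// return value overwrites the corruption stored in the outer `s`.
Status RecoverBuggy(const std::vector<Record>& records, int* applied) {
  Status s;
  Reporter reporter;
  reporter.status = &s;
  Reader reader(records, &reporter);
  auto read_and_recover = [&reader, applied]() {
    Status s;  // shadows the outer s
    while (reader.ReadRecord() && s.ok()) ++*applied;
    return s;
  };
  s = read_and_recover();
  return s;
}

// Fixed: read errors go to a separate log_read_status checked by the loop.
Status RecoverFixed(const std::vector<Record>& records, int* applied) {
  Status s;
  Status log_read_status;
  Reporter reporter;
  reporter.status = &log_read_status;
  Reader reader(records, &reporter);
  while (reader.ReadRecord() && s.ok() && log_read_status.ok()) ++*applied;
  return log_read_status.ok() ? s : log_read_status;
}
```

For the record sequence valid, corrupt, valid, `RecoverBuggy` returns an OK status after applying two records, silently skipping the corrupt one, while `RecoverFixed` applies one record and returns the corruption.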

1105 Commits