rocksdb

Author	SHA1	Message	Date
Zhichao Cao	d1c510baec	Handoff checksum Implementation (#7523 ) Summary: in PR https://github.com/facebook/rocksdb/issues/7419 , we introduce the new Append and PositionedAppend APIs to WritableFile at File System, which enable RocksDB to pass the data verification information (e.g., checksum of the data) to the lower layer. In this PR, we use the new API in WritableFileWriter, such that the file created via WritableFileWrite can pass the checksum to the storage layer. To control which types file should apply the checksum handoff, we add checksum_handoff_file_types to DBOptions. User can use this option to control which file types (Currently supported file tyes: kLogFile, kTableFile, kDescriptorFile.) should use the new Append and PositionedAppend APIs to handoff the verification information. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7523 Test Plan: add new unit test, pass make check/ make asan_check Reviewed By: pdillinger Differential Revision: D24313271 Pulled By: zhichao-cao fbshipit-source-id: aafd69091ae85c3318e3e17cbb96fe7338da11d0	2021-02-10 22:20:32 -08:00
Levi Tamasi	974458891c	Revert "Turn on memtable bloom filter by default. (#6584 )" (#7939 ) Summary: This reverts commit `ee79a28963`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7939 Reviewed By: siying Differential Revision: D26298564 Pulled By: ltamasi fbshipit-source-id: 6d663516e82e6de436f8d5317932ca9a98e152bd	2021-02-06 22:34:30 -08:00
sdong	ee79a28963	Turn on memtable bloom filter by default. (#6584 ) Summary: Memtable bloom filter is useful in many use cases. A default value on with conservative 1.5% memory can benefit more use cases than use cases impacted. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6584 Test Plan: Run all existing tests. Reviewed By: pdillinger Differential Revision: D20626739 fbshipit-source-id: 1dd45532b932139552519b8c2682bd954550c2f9	2021-02-05 12:59:46 -08:00
mrambacher	c1a65a4de4	Make StringEnv, StringSink, StringSource use FS classes (#7786 ) Summary: Change the StringEnv and related classes to be based on FileSystem APIs rather than the corresponding Env ones. The StringSink and StringSource classes were changed to be based on the corresponding FS file classes. Part of a cleanup to use the newer interfaces. This change also eliminates some of the casts/wrappers to LegacyFile classes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7786 Reviewed By: jay-zhuang Differential Revision: D25761460 Pulled By: anand1976 fbshipit-source-id: 428ae8e32b3db97dbeeca08c9d3bb0d9d4d3a38f	2021-01-04 16:01:01 -08:00
Yanqin Jin	9aa1b1dc19	Hack to load OPTIONS file for read_amp_bytes_per_bit (#7659 ) Summary: A temporary hack to work around a bug in 6.10, 6.11, 6.12, 6.13 and 6.14. The bug will write out 8 bytes to OPTIONS file from the starting address of BlockBasedTableOptions.read_amp_bytes_per_bit which is actually a uint32. Consequently, the value of read_amp_bytes_per_bit written in the OPTIONS file is wrong. From 6.15, RocksDB will try to parse the read_amp_bytes_per_bit from OPTIONS file as a uint32. To be able to load OPTIONS file generated by affected releases before the fix, we need to manually parse read_amp_bytes_per_bit with this hack. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7659 Test Plan: Generate a db with current 6.14.fb (head at `b6db05dbb5`). Maybe use db_stress. Checkout this PR, run ``` ~/rocksdb/ldb --db=. --try_load_options --ignore_unknown_options idump --count_only ``` Expect success, and should not see ``` Failed: Invalid argument: Error parsing read_amp_bytes_per_bit:17179869184 ``` Also make check Reviewed By: anand1976 Differential Revision: D24954752 Pulled By: riversand963 fbshipit-source-id: c7b802fc3e52acd050a4fc1cd475016122234394	2020-11-13 11:52:50 -08:00
Akanksha Mahajan	e300ce211d	Update option "allow_data_in_errors" in BuildOptions (#7665 ) Summary: "allow_data_in_errors" is not updated in BuildOptions. So it would assume default value when BuildOptions is called. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7665 Test Plan: make check -j64 Reviewed By: zhichao-cao Differential Revision: D24929100 Pulled By: akankshamahajan15 fbshipit-source-id: dd6225a6c9f13b20027ff1b6de8e79801b57b3f7	2020-11-12 22:09:17 -08:00
Peter Dillinger	60af964372	Experimental (production candidate) SST schema for Ribbon filter (#7658 ) Summary: Added experimental public API for Ribbon filter: NewExperimentalRibbonFilterPolicy(). This experimental API will take a "Bloom equivalent" bits per key, and configure the Ribbon filter for the same FP rate as Bloom would have but ~30% space savings. (Note: optimize_filters_for_memory is not yet implemented for Ribbon filter. That can be added with no effect on schema.) Internally, the Ribbon filter is configured using a "one_in_fp_rate" value, which is 1 over desired FP rate. For example, use 100 for 1% FP rate. I'm expecting this will be used in the future for configuring Bloom-like filters, as I expect people to more commonly hold constant the filter accuracy and change the space vs. time trade-off, rather than hold constant the space (per key) and change the accuracy vs. time trade-off, though we might make that available. ### Benchmarking ``` $ ./filter_bench -impl=2 -quick -m_keys_total_max=200 -average_keys_per_filter=100000 -net_includes_hashing Building... Build avg ns/key: 34.1341 Number of filters: 1993 Total size (MB): 238.488 Reported total allocated memory (MB): 262.875 Reported internal fragmentation: 10.2255% Bits/key stored: 10.0029 ---------------------------- Mixed inside/outside queries... Single filter net ns/op: 18.7508 Random filter net ns/op: 258.246 Average FP rate %: 0.968672 ---------------------------- Done. (For more info, run with -legend or -help.) $ ./filter_bench -impl=3 -quick -m_keys_total_max=200 -average_keys_per_filter=100000 -net_includes_hashing Building... Build avg ns/key: 130.851 Number of filters: 1993 Total size (MB): 168.166 Reported total allocated memory (MB): 183.211 Reported internal fragmentation: 8.94626% Bits/key stored: 7.05341 ---------------------------- Mixed inside/outside queries... Single filter net ns/op: 58.4523 Random filter net ns/op: 363.717 Average FP rate %: 0.952978 ---------------------------- Done. (For more info, run with -legend or -help.) ``` 168.166 / 238.488 = 0.705 -> 29.5% space reduction 130.851 / 34.1341 = 3.83x construction time for this Ribbon filter vs. lastest Bloom filter (could make that as little as about 2.5x for less space reduction) ### Working around a hashing "flaw" bloom_test discovered a flaw in the simple hashing applied in StandardHasher when num_starts == 1 (num_slots == 128), showing an excessively high FP rate. The problem is that when many entries, on the order of number of hash bits or kCoeffBits, are associated with the same start location, the correlation between the CoeffRow and ResultRow (for efficiency) can lead to a solution that is "universal," or nearly so, for entries mapping to that start location. (Normally, variance in start location breaks the effective association between CoeffRow and ResultRow; the same value for CoeffRow is effectively different if start locations are different.) Without kUseSmash and with num_starts > 1 (thus num_starts ~= num_slots), this flaw should be completely irrelevant. Even with 10M slots, the chances of a single slot having just 16 (or more) entries map to it--not enough to cause an FP problem, which would be local to that slot if it happened--is 1 in millions. This spreadsheet formula shows that: =1/(10000000(1 - POISSON(15, 1, TRUE))) As kUseSmash==false (the setting for Standard128RibbonBitsBuilder) is intended for CPU efficiency of filters with many more entries/slots than kCoeffBits, a very reasonable work-around is to disallow num_starts==1 when !kUseSmash, by making the minimum non-zero number of slots 2kCoeffBits. This is the work-around I've applied. This also means that the new Ribbon filter schema (Standard128RibbonBitsBuilder) is not space-efficient for less than a few hundred entries. Because of this, I have made it fall back on constructing a Bloom filter, under existing schema, when that is more space efficient for small filters. (We can change this in the future if we want.) TODO: better unit tests for this case in ribbon_test, and probably update StandardHasher for kUseSmash case so that it can scale nicely to small filters. ### Other related changes * Add Ribbon filter to stress/crash test * Add Ribbon filter to filter_bench as -impl=3 * Add option string support, as in "filter_policy=experimental_ribbon:5.678;" where 5.678 is the Bloom equivalent bits per key. * Rename internal mode BloomFilterPolicy::kAuto to kAutoBloom * Add a general BuiltinFilterBitsBuilder::CalculateNumEntry based on binary searching CalculateSpace (inefficient), so that subclasses (especially experimental ones) don't have to provide an efficient implementation inverting CalculateSpace. * Minor refactor FastLocalBloomBitsBuilder for new base class XXH3pFilterBitsBuilder shared with new Standard128RibbonBitsBuilder, which allows the latter to fall back on Bloom construction in some extreme cases. * Mostly updated bloom_test for Ribbon filter, though a test like FullBloomTest::Schema is a next TODO to ensure schema stability (in case this becomes production-ready schema as it is). * Add some APIs to ribbon_impl.h for configuring Ribbon filters. Although these are reasonably covered by bloom_test, TODO more unit tests in ribbon_test * Added a "tool" FindOccupancyForSuccessRate to ribbon_test to get data for constructing the linear approximations in GetNumSlotsFor95PctSuccess. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7658 Test Plan: Some unit tests updated but other testing is left TODO. This is considered experimental but laying down schema compatibility as early as possible in case it proves production-quality. Also tested in stress/crash test. Reviewed By: jay-zhuang Differential Revision: D24899349 Pulled By: pdillinger fbshipit-source-id: 9715f3e6371c959d923aea8077c9423c7a9f82b8	2020-11-12 20:46:14 -08:00
Levi Tamasi	bbbb5a280d	Add options for integrated blob GC (#7661 ) Summary: This patch simply adds a couple of options that will enable users to configure garbage collection when using the integrated BlobDB implementation. The actual GC logic will be added in a separate step. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7661 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D24906544 Pulled By: ltamasi fbshipit-source-id: ee0e056a712a4b4475cd90de8b27d969bd61b7e1	2020-11-12 18:58:44 -08:00
mrambacher	c442f6809f	Create a Customizable class to load classes and configurations (#6590 ) Summary: The Customizable class is an extension of the Configurable class and allows instances to be created by a name/ID. Classes that extend customizable can define their Type (e.g. "TableFactory", "Cache") and a method to instantiate them (TableFactory::CreateFromString). Customizable objects can be registered with the ObjectRegistry and created dynamically. Future PRs will make more types of objects extend Customizable. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6590 Reviewed By: cheng-chang Differential Revision: D24841553 Pulled By: zhichao-cao fbshipit-source-id: d0c2132bd932e971cbfe2c908ca2e5db30c5e155	2020-11-11 15:10:41 -08:00
Andrew Kryczka	1adbceb581	Expand effect of dictionary settings in `ColumnFamilyOptions::compression_opts` (#7619 ) Summary: In dictionary compression's initial implementation, in order to save CPU overhead, we only enabled it for bottom level under the assumption that the vast majority of data is stored there. At that time, there was no such thing as `ColumnFamilyOptions::bottommost_compression_opts`, so we just hardcoded disabling dictionary compression in flush and compactions to non-bottommost level. Now, we have users who generate all their files through flush and are considering using dictionary compression. To support such a use case, this PR expands the scope of `ColumnFamilyOptions::compression_opts` to additionally include flushed files and files generated by compaction to a non-bottommost level. Users can still get the old behavior by moving their dictionary settings to `ColumnFamilyOptions::bottommost_compression_opts` and explicitly enabling both that and `ColumnFamilyOptions::bottommost_compression`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7619 Reviewed By: ltamasi Differential Revision: D24665610 Pulled By: ajkr fbshipit-source-id: 656b90bce1033fe21c71e09af931ef5bde3e464c	2020-11-02 19:21:11 -08:00
mrambacher	30beecef8c	Return NotFound from TableFactory configuration errors during options loading (#7615 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7615 Reviewed By: riversand963 Differential Revision: D24637054 Pulled By: ajkr fbshipit-source-id: 7da20d44289eaa2387af4edf8c3c48057425cc1c	2020-10-29 18:44:24 -07:00
mrambacher	7eb2824e3f	Revert LoadLatestOptions handling of ignore_unknown_options if versions differ (#7612 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7612 Reviewed By: zhichao-cao Differential Revision: D24627054 Pulled By: riversand963 fbshipit-source-id: 451b4da742e3e84c7442bc7cc4959d39089b89d0	2020-10-29 13:46:26 -07:00
mrambacher	1eda625eab	Revert `Status`es returned from pre-`Configurable` options functions (#7563 ) Summary: Further refinement of the earlier PR. Now the Status is NotFound with a subcode of PathNotFound. Also the existing functions for options parsing/loading are reverted to return InvalidArgument no matter in which way the user-provided arguments are deemed invalid. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7563 Reviewed By: zhichao-cao Differential Revision: D24422491 Pulled By: ajkr fbshipit-source-id: ba6b237cd0584d3f925c5ba0d349aeb8c250af67	2020-10-20 11:53:28 -07:00
anand76	00751e4292	Add a host location property to TableProperties (#7479 ) Summary: This PR adds support for writing a location identifier of the DB host to SST files as a table property. By default, the hostname is used, but can be overridden by the user. There have been some recent corruptions in files written by ```SstFileWriter``` before checksumming, so this property can be used to trace it back to the writing host and checking the host for hardware isues. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7479 Test Plan: Add new unit tests Reviewed By: pdillinger Differential Revision: D24340671 Pulled By: anand1976 fbshipit-source-id: 2038949fd8d160c0633ccb4f9da77740f19fa2a2	2020-10-19 11:38:48 -07:00
mrambacher	a8c89cc969	Test for LoadLatestOptions (#7554 ) Summary: Make LoadLatestOptions return PathNotFound if the options file does not exist. Added tests for the LoadOptions related methods. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7554 Reviewed By: akankshamahajan15 Differential Revision: D24298985 Pulled By: zhichao-cao fbshipit-source-id: c9ae3cb12fc4a5bbef07743e1c1300f98a2441b3	2020-10-14 22:28:55 -07:00
Andrew Kryczka	75d3b6fdf0	Redesign block cache pinning API (#7520 ) Summary: The old flag-based APIs (`BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache` and `BlockBasedTableOptions::pin_top_level_index_and_filter`) were insufficient for our needs. For example, it was impossible to pin only unpartitioned meta-blocks, which could prevent block cache contention when turning on dictionary compression or during a migration to partitioned indexes/filters. It was also impossible to pin all meta-blocks in memory while having predictable memory usage via block cache. If we had continued adding flags to address these scenarios, they would have had significant overlap causing confusion. Instead, this PR deprecates the flags and starts a new API with non-overlapping options. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7520 Test Plan: - new unit test - added new options to stress/crash test and ran for a while: `$ python tools/db_crashtest.py blackbox --simple --max_key=1000000 -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 --interval=10 -value_size_mult=33 -column_families=1 -reopen=0` Reviewed By: pdillinger Differential Revision: D24200034 Pulled By: ajkr fbshipit-source-id: 3fa7cfc71e7960f7a867511dd6ae5834dd73b13e	2020-10-11 14:58:24 -07:00
Cheng Chang	12b78e40bd	Track WAL in MANIFEST: add option track_and_verify_wals_in_manifest (#7275 ) Summary: This option determines whether WALs will be tracked in MANIFEST and verified on recovery. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7275 Test Plan: db_options_test options_test Reviewed By: pdillinger Differential Revision: D23181418 Pulled By: cheng-chang fbshipit-source-id: 5dd1cdc166f3dfc1c93c094df4a2f7734e3b4547	2020-10-09 16:42:19 -07:00
sdong	7508175558	Introduce options.check_flush_compaction_key_order (#7467 ) Summary: Introduce an new option options.check_flush_compaction_key_order, by default set to true, which checks key order of flush and compaction, and fail the operation if the order is violated. Also did minor refactor hash checking code, which consolidates the hashing logic to a vlidation class, where the key ordering logic is added. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7467 Test Plan: Add unit tests to validate the check can catch reordering in flush and compaction, and can be properly disabled. Reviewed By: riversand963 Differential Revision: D24010683 fbshipit-source-id: 8dd6292d2cda8006054e9ded7cfa4bf405f0527c	2020-10-01 10:10:26 -07:00
Akanksha Mahajan	9d212d3f0e	Provide users with option to opt-in to get corrupt data in logs/messages (#7420 ) Summary: Add a new Option "allow_data_in_errors". When it's set by users, it allows them to opt-in to get error messages containing corrupted keys/values. Corrupt keys, values will be logged in the messages, logs, status etc. that will help users with the useful information regarding affected data. By default value is set false to prevent users data to be exposed in the messages. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7420 Test Plan: 1. make check -j64 2. Add a new test case Reviewed By: ajkr Differential Revision: D23835028 Pulled By: akankshamahajan15 fbshipit-source-id: 8d2eba8fb898e79fcf1fccc07295065a75eb59b1	2020-09-29 23:17:45 -07:00
mrambacher	5d6ff69375	Fix valgrind issues with configurable_test (#7424 ) Summary: Valgrind was reporting a problem with the configurable_test in some GTEST code. This problem was caused by using a std::function as a GTEST parameter. This change changes the test to use a string as a function parameter (backed by a map) and fixes the valgrind issue. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7424 Reviewed By: ajkr Differential Revision: D23855540 Pulled By: pdillinger fbshipit-source-id: 2f2be03f7f92d96644aa9fa6481e4f37f2cfa5f5	2020-09-23 09:34:01 -07:00
mrambacher	7d472accdc	Bring the Configurable options together (#5753 ) Summary: This PR merges the functionality of making the ColumnFamilyOptions, TableFactory, and DBOptions into Configurable into a single PR, resolving any merge conflicts Pull Request resolved: https://github.com/facebook/rocksdb/pull/5753 Reviewed By: ajkr Differential Revision: D23385030 Pulled By: zhichao-cao fbshipit-source-id: 8b977a7731556230b9b8c5a081b98e49ee4f160a	2020-09-14 17:01:01 -07:00
Levi Tamasi	b9bb59d49d	Add initial set of options for integrated blob write path (#7280 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7280 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D23195192 Pulled By: ltamasi fbshipit-source-id: 743b382de391963e62ba86119e9fbd0233ea3b3a	2020-08-18 18:32:37 -07:00
anand76	832b056a30	Enable IO timeouts for iterators (#7161 ) Summary: Introduce io_timeout in ReadOptions and enabled deadline/io_timeout for Iterators. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7161 Test Plan: New unit tests in db_basic_test Reviewed By: riversand963 Differential Revision: D22687352 Pulled By: anand1976 fbshipit-source-id: 67bbb0e6d7ae80b256589244468494292538c6ec	2020-08-07 12:01:08 -07:00
Yingchun Lai	67bbac3621	Remove duplicate colon in Status message (#7041 ) Summary: A colon will be added after 'msg' automatically when invoke function Status(Code _code, const Slice& msg, const Slice& msg2), it's not needed to append a colon explicitly to 'msg'. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7041 Reviewed By: ajkr Differential Revision: D22292801 fbshipit-source-id: 8f2d69065bb779d2613468bf9fc9169f32c3f1ec	2020-08-06 15:18:04 -07:00
Tomas Kolda	cd4592c220	SST Partitioner interface that allows to split SST files (#6957 ) Summary: SST Partitioner interface that allows to split SST files during compactions. It basically instruct compaction to create a new file when needed. When one is using well defined prefixes and prefixed way of defining tables it is good to define also partitioning so that promotion of some SST file does not cover huge key space on next level (worst case complete space). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6957 Reviewed By: ajkr Differential Revision: D22461239 fbshipit-source-id: 9ce07bba08b3ba89c2d45630520368f704d1316e	2020-07-24 13:44:49 -07:00
Jay Zhuang	b0c5ecd6b3	Make max_subcompactions dynamically changeable (#7159 ) Summary: Make `max-subcompactions` dynamically changeable by passing the `DBOption` to Compaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7159 Reviewed By: siying Differential Revision: D22671238 Pulled By: jay-zhuang fbshipit-source-id: 311ca9f6bb606965544d8708616d358cfed5be42	2020-07-22 18:32:52 -07:00
mrambacher	ec711b2315	Add Support for saving CompressionOptions to Options File (#6817 ) Summary: This PR does a few things: - The "compression_opts" and "bottom_compression_opts" can now be read/written as name/value pairs of options (instead of only a colon-separated list; - These options can now be read/written to the Options file; - The parallel_threads value can now be set (either in the colon or name-value format). The compression options are now stored and treated as a OptionTypeInfo::Struct by the options system, meaning they can be read and written like the other structs. This change allows them to be read/written easily to the options file. Additionally, the colon-format was extended to allow support for setting parallel threads. Tests were added to test all of the option settings via the optional parameters in the colon format. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6817 Reviewed By: ajkr Differential Revision: D22396004 Pulled By: zhichao-cao fbshipit-source-id: 38bcf74b7e9cd5bc2a84540fac2e9ba4f765b2c8	2020-07-16 19:06:31 -07:00
Zhichao Cao	a10f12eda1	Auto resume the DB from Retryable IO Error (#6765 ) Summary: In current codebase, in write path, if Retryable IO Error happens, SetBGError is called. The retryable IO Error is converted to hard error and DB is in read only mode. User or application needs to resume it. In this PR, if Retryable IO Error happens in one DB, SetBGError will create a new thread to call Resume (auto resume). otpions.max_bgerror_resume_count controls if auto resume is enabled or not (if max_bgerror_resume_count<=0, auto resume will not be enabled). options.bgerror_resume_retry_interval controls the time interval to call Resume again if the previous resume fails due to the Retryable IO Error. If non-retryable error happens during resume, auto resume will terminate. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6765 Test Plan: Added the unit test cases in error_handler_fs_test and pass make asan_check Reviewed By: anand1976 Differential Revision: D21916789 Pulled By: zhichao-cao fbshipit-source-id: acb8b5e5dc3167adfa9425a5b7fc104f6b95cb0b	2020-07-15 11:03:58 -07:00
rockeet	b649d8cb97	Fixed Factory construct just for calling .Name() (#7080 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7080 Reviewed By: riversand963 Differential Revision: D22412352 Pulled By: ajkr fbshipit-source-id: 1d7f4c1621040a0130245139b52c3f4d3deac865	2020-07-08 11:54:00 -07:00
sdong	9cc25190e1	Test CircleCI with CLANG-10 (#7025 ) Summary: It's useful to build RocksDB using a more recent clang version in CI. Add a CircleCI build and fix some issues with it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7025 Test Plan: See all tests pass. Reviewed By: pdillinger Differential Revision: D22215700 fbshipit-source-id: 914a729c2cd3f3ac4a627cc0ac58d4691dca2168	2020-06-24 16:22:49 -07:00
Peter Dillinger	5b2bbacb6f	Minimize memory internal fragmentation for Bloom filters (#6427 ) Summary: New experimental option BBTO::optimize_filters_for_memory builds filters that maximize their use of "usable size" from malloc_usable_size, which is also used to compute block cache charges. Rather than always "rounding up," we track state in the BloomFilterPolicy object to mix essentially "rounding down" and "rounding up" so that the average FP rate of all generated filters is the same as without the option. (YMMV as heavily accessed filters might be unluckily lower accuracy.) Thus, the option near-minimizes what the block cache considers as "memory used" for a given target Bloom filter false positive rate and Bloom filter implementation. There are no forward or backward compatibility issues with this change, though it only works on the format_version=5 Bloom filter. With Jemalloc, we see about 10% reduction in memory footprint (and block cache charge) for Bloom filters, but 1-2% increase in storage footprint, due to encoding efficiency losses (FP rate is non-linear with bits/key). Why not weighted random round up/down rather than state tracking? By only requiring malloc_usable_size, we don't actually know what the next larger and next smaller usable sizes for the allocator are. We pick a requested size, accept and use whatever usable size it has, and use the difference to inform our next choice. This allows us to narrow in on the right balance without tracking/predicting usable sizes. Why not weight history of generated filter false positive rates by number of keys? This could lead to excess skew in small filters after generating a large filter. Results from filter_bench with jemalloc (irrelevant details omitted): (normal keys/filter, but high variance) $ ./filter_bench -quick -impl=2 -average_keys_per_filter=30000 -vary_key_count_ratio=0.9 Build avg ns/key: 29.6278 Number of filters: 5516 Total size (MB): 200.046 Reported total allocated memory (MB): 220.597 Reported internal fragmentation: 10.2732% Bits/key stored: 10.0097 Average FP rate %: 0.965228 $ ./filter_bench -quick -impl=2 -average_keys_per_filter=30000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory Build avg ns/key: 30.5104 Number of filters: 5464 Total size (MB): 200.015 Reported total allocated memory (MB): 200.322 Reported internal fragmentation: 0.153709% Bits/key stored: 10.1011 Average FP rate %: 0.966313 (very few keys / filter, optimization not as effective due to ~59 byte internal fragmentation in blocked Bloom filter representation) $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000 -vary_key_count_ratio=0.9 Build avg ns/key: 29.5649 Number of filters: 162950 Total size (MB): 200.001 Reported total allocated memory (MB): 224.624 Reported internal fragmentation: 12.3117% Bits/key stored: 10.2951 Average FP rate %: 0.821534 $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory Build avg ns/key: 31.8057 Number of filters: 159849 Total size (MB): 200 Reported total allocated memory (MB): 208.846 Reported internal fragmentation: 4.42297% Bits/key stored: 10.4948 Average FP rate %: 0.811006 (high keys/filter) $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000000 -vary_key_count_ratio=0.9 Build avg ns/key: 29.7017 Number of filters: 164 Total size (MB): 200.352 Reported total allocated memory (MB): 221.5 Reported internal fragmentation: 10.5552% Bits/key stored: 10.0003 Average FP rate %: 0.969358 $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory Build avg ns/key: 30.7131 Number of filters: 160 Total size (MB): 200.928 Reported total allocated memory (MB): 200.938 Reported internal fragmentation: 0.00448054% Bits/key stored: 10.1852 Average FP rate %: 0.963387 And from db_bench (block cache) with jemalloc: $ ./db_bench -db=/dev/shm/dbbench.no_optimize -benchmarks=fillrandom -format_version=5 -value_size=90 -bloom_bits=10 -num=2000000 -threads=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false $ ./db_bench -db=/dev/shm/dbbench -benchmarks=fillrandom -format_version=5 -value_size=90 -bloom_bits=10 -num=2000000 -threads=8 -optimize_filters_for_memory -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false $ (for FILE in /dev/shm/dbbench.no_optimize/.sst; do ./sst_dump --file=$FILE --show_properties \| grep 'filter block' ; done) \| awk '{ t += $4; } END { print t; }' 17063835 $ (for FILE in /dev/shm/dbbench/.sst; do ./sst_dump --file=$FILE --show_properties \| grep 'filter block' ; done) \| awk '{ t += $4; } END { print t; }' 17430747 $ #^ 2.1% additional filter storage $ ./db_bench -db=/dev/shm/dbbench.no_optimize -use_existing_db -benchmarks=readrandom,stats -statistics -bloom_bits=10 -num=2000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false -duration=10 -cache_index_and_filter_blocks -cache_size=1000000000 rocksdb.block.cache.index.add COUNT : 33 rocksdb.block.cache.index.bytes.insert COUNT : 8440400 rocksdb.block.cache.filter.add COUNT : 33 rocksdb.block.cache.filter.bytes.insert COUNT : 21087528 rocksdb.bloom.filter.useful COUNT : 4963889 rocksdb.bloom.filter.full.positive COUNT : 1214081 rocksdb.bloom.filter.full.true.positive COUNT : 1161999 $ #^ 1.04 % observed FP rate $ ./db_bench -db=/dev/shm/dbbench -use_existing_db -benchmarks=readrandom,stats -statistics -bloom_bits=10 -num=2000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false -optimize_filters_for_memory -duration=10 -cache_index_and_filter_blocks -cache_size=1000000000 rocksdb.block.cache.index.add COUNT : 33 rocksdb.block.cache.index.bytes.insert COUNT : 8448592 rocksdb.block.cache.filter.add COUNT : 33 rocksdb.block.cache.filter.bytes.insert COUNT : 18220328 rocksdb.bloom.filter.useful COUNT : 5360933 rocksdb.bloom.filter.full.positive COUNT : 1321315 rocksdb.bloom.filter.full.true.positive COUNT : 1262999 $ #^ 1.08 % observed FP rate, 13.6% less memory usage for filters (Due to specific key density, this example tends to generate filters that are "worse than average" for internal fragmentation. "Better than average" cases can show little or no improvement.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/6427 Test Plan: unit test added, 'make check' with gcc, clang and valgrind Reviewed By: siying Differential Revision: D22124374 Pulled By: pdillinger fbshipit-source-id: f3e3aa152f9043ddf4fae25799e76341d0d8714e	2020-06-22 13:32:07 -07:00
Peter Dillinger	88b4210701	Remove racially charged terms "whitelist" and "blacklist" (#7008 ) Summary: We don't need them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7008 Test Plan: "make check" and ensure "make crash_test" starts Reviewed By: ajkr Differential Revision: D22143838 Pulled By: pdillinger fbshipit-source-id: 72c8e16603abc59f4954e304466bc4dc1f58f94e	2020-06-19 15:27:32 -07:00
Andrew Kryczka	02db03af8d	make L0 index/filter pinned memory usage predictable (#6911 ) Summary: Memory pinned by `pin_l0_filter_and_index_blocks_in_cache` needs to be predictable based on user config. This PR makes sure we do not pin extra memory for large files generated by intra-L0 (see https://github.com/facebook/rocksdb/issues/6889). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6911 Test Plan: unit test Reviewed By: siying Differential Revision: D21835818 Pulled By: ajkr fbshipit-source-id: a11a088549d06bed8aacc2548d266e5983f0ead4	2020-06-09 16:51:23 -07:00
Yanqin Jin	3020df9df5	Remove unnecessary inclusion of version_edit.h in env (#6952 ) Summary: In db_options.c, we should avoid including header files in the `db` directory to avoid introducing unnecessary dependency. The reason why `version_edit.h` has been included in `db_options.cc` is because we need two constants, `kUnknownChecksum` and `kUnknownChecksumFuncName`. We can put these two constants as `constexpr` in the public header `file_checksum.h`. Test plan (devserver): make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/6952 Reviewed By: zhichao-cao Differential Revision: D21925341 Pulled By: riversand963 fbshipit-source-id: 2902f3b74c97f0cf16c58ad24c095c787c3a40e2	2020-06-07 21:56:55 -07:00
mrambacher	0a17d95357	Add OptionTypeInfo::Vector to parse/serialize vectors (#6424 ) Summary: The OptionTypeInfo::Vector method allows a vector<T> to be converted to/from strings via the options. The kVectorInt and kVectorCompressionType vectors were replaced with this methodology. As part of this change, the NextToken method was added to the OptionTypeInfo. This method was refactored from code within the StringToMap function. Future types that could use this functionality include the EventListener vectors. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6424 Reviewed By: cheng-chang Differential Revision: D21832368 Pulled By: zhichao-cao fbshipit-source-id: e1ca766faff139d54e6e8407a9ec09ece6517439	2020-06-03 12:23:07 -07:00
Akanksha Mahajan	bcefc59e9f	Allow MultiGet users to limit cumulative value size (#6826 ) Summary: 1. Add a value_size in read options which limits the cumulative value size of keys read in batches. Once the size exceeds read_options.value_size, all the remaining keys are returned with status Abort without further fetching any key. 2. Add a unit test case MultiGetBatchedValueSizeSimple the reads keys from memory and sst files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6826 Test Plan: 1. make check -j64 2. Add a new unit test case Reviewed By: anand1976 Differential Revision: D21471483 Pulled By: akankshamahajan15 fbshipit-source-id: dea51b8e76d5d1df38ece8cdb29933b1d798b900	2020-05-27 13:07:14 -07:00
Peter Dillinger	803a517b48	Misc things for ASSERT_STATUS_CHECKED, also gcc 4.8.5 (#6871 ) Summary: * Print stack trace on status checked failure * Make folly_synchronization_distributed_mutex_test a parallel test * Disable ldb_test.py and rocksdb_dump_test.sh with ASSERT_STATUS_CHECKED (broken) * Fix shadow warning in random_access_file_reader.h reported by gcc 4.8.5 (ROCKSDB_NO_FBCODE), also https://github.com/facebook/rocksdb/issues/6866 * Work around compiler bug on max_align_t for gcc < 4.9 * Remove an apparently wrong comment in status.h * Use check_some in Travis config (for proper diagnostic output) * Fix ignored Status in loop in options_helper.cc Pull Request resolved: https://github.com/facebook/rocksdb/pull/6871 Test Plan: manual, CI Reviewed By: ajkr Differential Revision: D21706619 Pulled By: pdillinger fbshipit-source-id: daf6364173d6689904eb394461a69a11f5bee2cb	2020-05-23 06:53:37 -07:00
Peter Dillinger	35a25a3fb9	Fix/expand ASSERT_STATUS_CHECKED build, add to Travis (#6870 ) Summary: Fixed some option handling code that recently broke the ASSERT_STATUS_CHECKED build for options_test. Added all other existing tests that pass under ASSERT_STATUS_CHECKED to the whitelist. Added a Travis configuration to run all whitelisted tests with ASSERT_STATUS_CHECKED. (Someday we might enable this check by default in debug builds.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/6870 Test Plan: ASSERT_STATUS_CHECKED=1 make check, Travis Reviewed By: ajkr Differential Revision: D21704374 Pulled By: pdillinger fbshipit-source-id: 15daef98136a19d7a6843fa0c9ec08738c2ac693	2020-05-22 11:17:29 -07:00
mrambacher	38be686160	Add Struct Type to OptionsTypeInfo (#6425 ) Summary: Added code for generically handing structs to OptionTypeInfo. A struct is a collection of variables handled by their own map of OptionTypeInfos. Examples of structs include Compaction and Cache options. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6425 Reviewed By: siying Differential Revision: D21668789 Pulled By: zhichao-cao fbshipit-source-id: 064b110de39dadf82361ed4663f7ac1a535b0b07	2020-05-21 10:58:39 -07:00
Andrew Kryczka	1c84660457	prototype status check enforcement (#6798 ) Summary: Tried making Status object enforce that it is checked in some way. In cases it is not checked, `PermitUncheckedError()` must be called explicitly. Added a way to run tests (`ASSERT_STATUS_CHECKED=1 make -j48 check`) on a whitelist. The effort appears significant to get each test to pass with this assertion, so I only fixed up enough to get one test (`options_test`) working and added it to the whitelist. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6798 Reviewed By: pdillinger Differential Revision: D21377404 Pulled By: ajkr fbshipit-source-id: 73236f9c8df38f01cf24ecac4a6d1661b72d077e	2020-05-08 12:40:43 -07:00
mrambacher	394f2bbd13	Add OptionTypeInfo::Enum and related methods (#6423 ) Summary: Add methods and constructors for handling enums to the OptionTypeInfo. This change allows enums to be converted/compared without adding a special "type" to the OptionType. This change addresses a couple of issues: - It allows new enumerated types to be added to the options without editing the OptionType base class (and related methods) - It standardizes the procedure for adding enumerated types to the options, reducing potential mistakes - It moves the enum maps to the location where they are used, allowing them to be static file members rather than global values - It reduces the number of types and cases that need to be handled in the various OptionType methods Pull Request resolved: https://github.com/facebook/rocksdb/pull/6423 Reviewed By: siying Differential Revision: D21408713 Pulled By: zhichao-cao fbshipit-source-id: fc492af285d011822578b95d186a0fce25d35626	2020-05-05 15:04:04 -07:00
sdong	6504ae0c4e	Remove the support of setting CompressionOptions.parallel_threads from string for now (#6782 ) Summary: The current way of implementing CompressionOptions.parallel_threads introduces a format change. We plan to change CompressionOptions's serailization format to a new JSON-like format, which would be another format change. We would like to consolidate the two format changes into one, rather than making some users to change twice. Hold CompressionOptions.parallel_threads from being supported by option string for now. Will add it back after the general CompressionOptions's format change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6782 Test Plan: Run all existing tests. Reviewed By: zhichao-cao Differential Revision: D21338614 fbshipit-source-id: bca2dac3cb37d4e6e64b52cbbe8ea749cd848685	2020-04-30 17:01:17 -07:00
Derrick Pallas	5272305437	Fix FilterBench when RTTI=0 (#6732 ) Summary: The dynamic_cast in the filter benchmark causes release mode to fail due to no-rtti. Replace with static_cast_with_check. Signed-off-by: Derrick Pallas <derrick@pallas.us> Addition by peterd: Remove unnecessary 2nd template arg on all static_cast_with_check Pull Request resolved: https://github.com/facebook/rocksdb/pull/6732 Reviewed By: ltamasi Differential Revision: D21304260 Pulled By: pdillinger fbshipit-source-id: 6e8eb437c4ca5a16dbbfa4053d67c4ad55f1608c	2020-04-29 13:09:23 -07:00
mrambacher	618bf638aa	Add Functions to OptionTypeInfo (#6422 ) Summary: Added functions for parsing, serializing, and comparing elements to OptionTypeInfo. These functions allow all of the special cases that could not be handled directly in the map of OptionTypeInfo to be moved into the map. Using these functions, every type can be handled via the map rather than special cased. By adding these functions, the code for handling options can become more standardized (fewer special cases) and (eventually) handled completely by common classes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6422 Test Plan: pass make check Reviewed By: siying Differential Revision: D21269005 Pulled By: zhichao-cao fbshipit-source-id: 9ba71c721a38ebf9ee88259d60bd81b3282b9077	2020-04-28 18:04:26 -07:00
Cheng Chang	4cd859edf1	Fix build under LITE (#6758 ) Summary: GetSupportedCompressions needs to be defined under LITE. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6758 Test Plan: build under LITE Reviewed By: zhichao-cao Differential Revision: D21247937 Pulled By: cheng-chang fbshipit-source-id: 880e59d3e107cdd736d16427a68c5641d1318fb4	2020-04-27 16:55:14 -07:00
Cheng Chang	40497a875a	Reduce memory copies when fetching and uncompressing blocks from SST files (#6689 ) Summary: In https://github.com/facebook/rocksdb/pull/6455, we modified the interface of `RandomAccessFileReader::Read` to be able to get rid of memcpy in direct IO mode. This PR applies the new interface to `BlockFetcher` when reading blocks from SST files in direct IO mode. Without this PR, in direct IO mode, when fetching and uncompressing compressed blocks, `BlockFetcher` will first copy the raw compressed block into `BlockFetcher::compressed_buf_` or `BlockFetcher::stack_buf_` inside `RandomAccessFileReader::Read` depending on the block size. then during uncompressing, it will copy the uncompressed block into `BlockFetcher::heap_buf_`. In this PR, we get rid of the first memcpy and directly uncompress the block from `direct_io_buf_` to `heap_buf_`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6689 Test Plan: A new unit test `block_fetcher_test` is added. Reviewed By: anand1976 Differential Revision: D21006729 Pulled By: cheng-chang fbshipit-source-id: 2370b92c24075692423b81277415feb2aed5d980	2020-04-24 15:32:56 -07:00
mrambacher	4cbc19d2a1	Add a ConfigOptions for use in comparing objects and converting to/from strings (#6389 ) Summary: The methods in convenience.h are used to compare/convert objects to/from strings. There is a mishmash of parameters in use here with more needed in the future. This PR replaces those parameters with a single structure. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6389 Reviewed By: siying Differential Revision: D21163707 Pulled By: zhichao-cao fbshipit-source-id: f807b4cc7e2b0af3871536b69546b2604dfa81bd	2020-04-21 17:38:17 -07:00
anand76	c1ccd6b6af	Implement deadline support for MultiGet (#6710 ) Summary: Initial implementation of ReadOptions.deadline for MultiGet. If the request takes longer than the deadline, the keys not yet found will be returned with Status::TimedOut(). This implementation enforces the deadline in DBImpl, which is fairly high level. Its best effort and may not check the deadline after every key lookup, but may do so after a batch of keys. In subsequent stages, we will extend this to passing a timeout down to the FileSystem. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6710 Test Plan: Add new unit tests Reviewed By: riversand963 Differential Revision: D21149158 Pulled By: anand1976 fbshipit-source-id: 9f44eecffeb40873f5034ed59a66d21f9f88879e	2020-04-21 14:51:51 -07:00
Akanksha Mahajan	03a1d95db0	Set max_background_flushes dynamically (#6701 ) Summary: 1. Add changes so that max_background_flushes can be set dynamically. 2. Add a testcase DBOptionsTest.SetBackgroundFlushThreads which set the max_background_flushes dynamically using SetDBOptions. TestPlan: 1. make -j64 check 2. Using new testcase DBOptionsTest.SetBackgroundFlushThreads Pull Request resolved: https://github.com/facebook/rocksdb/pull/6701 Reviewed By: ajkr Differential Revision: D21028010 Pulled By: akankshamahajan15 fbshipit-source-id: 5f949e4a8fd3c32537b637947b7ee09a69cfc7c1	2020-04-20 16:19:02 -07:00
anand76	3d6d7bcf17	Log CompactOnDeletionCollectorFactory parameters on DB open (#6686 ) Summary: Log it in the info log to help in troubleshooting. It is logged as follows - ``` 2020/04/10-10:51:39.886662 7ffff7fef340 Options.table_properties_collectors: CompactOnDeletionCollector (Sliding window size = 100 Deletion trigger = 90); ``` Tests: make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/6686 Reviewed By: ltamasi Differential Revision: D21002442 Pulled By: anand1976 fbshipit-source-id: 7adf0dbae7f1febcb00ce61fea5097118ede5c6a	2020-04-13 19:58:04 -07:00
Huisheng Liu	9e89ffb776	make iterator return versions between timestamp bounds (#6544 ) Summary: (Based on Yanqin's idea) Add a new field in readoptions as lower timestamp bound for iterator. When the parameter is not supplied (nullptr), the iterator returns the latest visible version of a record. When it is supplied, the existing timestamp field is the upper bound. Together the two serves as a bounded time window. The iterator returns all versions of a record falling in the window. SeekRandom perf test (10 minutes) on the same development machine ram drive with the same DB data shows no regression (within marge of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks. base line (commit `e860f8840`): seekrandom : 7.836 micros/op 4082449 ops/sec; (0 of 73481999 found) This PR: seekrandom : 7.764 micros/op 4120935 ops/sec; (0 of 71303999 found) db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=seekrandom --use_existing_db=1 --num=25000000 --threads=32 --allow_concurrent_memtable_write=0 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6544 Reviewed By: ltamasi Differential Revision: D20844069 Pulled By: riversand963 fbshipit-source-id: d97f2bf38a323c8c6a68db213b2d3c694b1c1f74	2020-04-10 09:51:58 -07:00
sdong	e860f8840a	Fix memory corruption caused by new test in options_settable_test (#6676 ) Summary: https://github.com/facebook/rocksdb/pull/6668 added some new test code but it has a risk of memory corruption. Fix it Pull Request resolved: https://github.com/facebook/rocksdb/pull/6676 Test Plan: Run the test under ASAN and see it passes. Reviewed By: ajkr Differential Revision: D20937108 fbshipit-source-id: 22cc96bb02030df0a37a02e67a2cc37ca31ba22d	2020-04-09 11:23:32 -07:00
sdong	94f90ac6bc	compression related options are not copied back from MutableCFOptions… (#6668 ) Summary: … to CFOptions https://github.com/facebook/rocksdb/pull/6615 made several compression related options dynamically changeable. They are moved to MutableCFOptions. However, they are not copied back to ColumnFamilyOptions, so the changed values are not written to option files and for some other uses. Fix it by copying them back. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6668 Test Plan: Add a unit test to make sure that when a MutableCFOptions is converted to CFOptions and back to MutableCFOptions, they stay the same. This test would fail without the fix. Reviewed By: ajkr Differential Revision: D20923999 fbshipit-source-id: c3bccd6923b00d677764e2269bed6a95ad7ed780	2020-04-08 14:40:46 -07:00
mrambacher	259b6ec8da	Move the OptionTypeMap code closer to home (#6198 ) Summary: This is a predecessor to the Configurable PR. This change moves the OptionTypeInfo maps closer to where they will be used. When the Configurable changes are adopted, these values will become static and not associated with the OptionsHelper. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6198 Reviewed By: siying Differential Revision: D20778108 Pulled By: zhichao-cao fbshipit-source-id: a9f85fc73bc53503656e1958ecc1e764052fd1aa	2020-04-03 10:52:38 -07:00
Ziyue Yang	03a781a90c	Add pipelined & parallel compression optimization (#6262 ) Summary: This PR adds support for pipelined & parallel compression optimization for `BlockBasedTableBuilder`. This optimization makes block building, block compression and block appending a pipeline, and uses multiple threads to accelerate block compression. Users can set `CompressionOptions::parallel_threads` greater than 1 to enable compression parallelism. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6262 Reviewed By: ajkr Differential Revision: D20651306 fbshipit-source-id: 62125590a9c15b6d9071def9dc72589c1696a4cb	2020-04-01 16:40:18 -07:00
sdong	80979f81c7	Make options.bottommost_compression, compression_opts and bottommost_compression_opts dynamically changeable. (#6615 ) Summary: These three options should be made dynamically changeable. Simply add them to MutableCFOptions and made the change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6615 Test Plan: Add a unit test to make sure that SetOptions() can change the options. Reviewed By: riversand963 Differential Revision: D20755951 fbshipit-source-id: 8165f4fd7a7a665cc7fb049698935022a5d2e7ff	2020-03-31 12:11:42 -07:00
Zhichao Cao	e8d332d97e	Use FileChecksumGenFactory for SST file checksum (#6600 ) Summary: In the current implementation, sst file checksum is calculated by a shared checksum function object, which may make some checksum function hard to be applied here such as SHA1. In this implementation, each sst file will have its own checksum generator obejct, created by FileChecksumGenFactory. User needs to implement its own FilechecksumGenerator and Factory to plugin the in checksum calculation method. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6600 Test Plan: tested with make asan_check Reviewed By: riversand963 Differential Revision: D20717670 Pulled By: zhichao-cao fbshipit-source-id: 2a74c1c280ac11a07a1980185b43b671acaa71c6	2020-03-29 15:58:46 -07:00
anand76	a9d168cfd7	Simplify migration to FileSystem API (#6552 ) Summary: The current Env/FileSystem API separation has a couple of issues - 1. It requires the user to specify 2 options - ```Options::env``` and ```Options::file_system``` - which means they have to make code changes to benefit from the new APIs. Furthermore, there is a risk of accessing the same APIs in two different ways, through Env in the old way and through FileSystem in the new way. The two may not always match, for example, if env is ```PosixEnv``` and FileSystem is a custom implementation. Any stray RocksDB calls to env will use the ```PosixEnv``` implementation rather than the file_system implementation. 2. There needs to be a simple way for the FileSystem developer to instantiate an Env for backward compatibility purposes. This PR solves the above issues and simplifies the migration in the following ways - 1. Embed a shared_ptr to the ```FileSystem``` in the ```Env```, and remove ```Options::file_system``` as a configurable option. This way, no code changes will be required in application code to benefit from the new API. The default Env constructor uses a ```LegacyFileSystemWrapper``` as the embedded ```FileSystem```. 1a. - This also makes it more robust by ensuring that even if RocksDB has some stray calls to Env APIs rather than FileSystem, they will go through the same object and thus there is no risk of getting out of sync. 2. Provide a ```NewCompositeEnv()``` API that can be used to construct a PosixEnv with a custom FileSystem implementation. This eliminates an indirection to call Env APIs, and relieves the FileSystem developer of the burden of having to implement wrappers for the Env APIs. 3. Add a couple of missing FileSystem APIs - ```SanitizeEnvOptions()``` and ```NewLogger()``` Tests: 1. New unit tests 2. make check and make asan_check Pull Request resolved: https://github.com/facebook/rocksdb/pull/6552 Reviewed By: riversand963 Differential Revision: D20592038 Pulled By: anand1976 fbshipit-source-id: c3801ad4153f96d21d5a3ae26c92ba454d1bf1f7	2020-03-23 21:54:21 -07:00
Yanqin Jin	fb09ef05dc	Attempt to recover from db with missing table files (#6334 ) Summary: There are situations when RocksDB tries to recover, but the db is in an inconsistent state due to SST files referenced in the MANIFEST being missing. In this case, previous RocksDB will just fail the recovery and return a non-ok status. This PR enables another possibility. During recovery, RocksDB checks possible MANIFEST files, and try to recover to the most recent state without missing table file. `VersionSet::Recover()` applies version edits incrementally and "materializes" a version only when this version does not reference any missing table file. After processing the entire MANIFEST, the version created last will be the latest version. `DBImpl::Recover()` calls `VersionSet::Recover()`. Afterwards, WAL replay will not be performed. To use this capability, set `options.best_efforts_recovery = true` when opening the db. Best-efforts recovery is currently incompatible with atomic flush. Test plan (on devserver): ``` $make check $COMPILE_WITH_ASAN=1 make all && make check ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/6334 Reviewed By: anand1976 Differential Revision: D19778960 Pulled By: riversand963 fbshipit-source-id: c27ea80f29bc952e7d3311ecf5ee9c54393b40a8	2020-03-20 19:30:48 -07:00
Cheng Chang	6dea7530b5	Remove copy of pairs from the for range loop (#6514 ) Summary: Remove copy of pairs from the for range loop Pull Request resolved: https://github.com/facebook/rocksdb/pull/6514 Test Plan: make check Differential Revision: D20389688 Pulled By: cheng-chang fbshipit-source-id: 1c772091f955be33267514010f3596c61a6f46b5	2020-03-11 21:38:09 -07:00
sdong	fdf882ded2	Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433 ) Summary: When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433 Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag. Differential Revision: D19977691 fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e	2020-02-20 12:09:57 -08:00
Zhichao Cao	4369f2c7bb	Checksum for each SST file and stores in MANIFEST (#6216 ) Summary: In the current code base, RocksDB generate the checksum for each block and verify the checksum at usage. Current PR enable SST file checksum. After a SST file is generated by Flush or Compaction, RocksDB generate the SST file checksum and store the checksum value and checksum method name in the vs_info and MANIFEST as part for the FileMetadata. Added the enable_sst_file_checksum to Options to enable or disable file checksum. Added sst_file_checksum to Options such that user can plugin their own SST file checksum calculate method via overriding the SstFileChecksum class. The checksum information inlcuding uint32_t checksum value and a checksum name (string). A new tool is added to LDB such that user can dump out a list of file checksum information from MANIFEST. If user enables the file checksum but does not provide the sst_file_checksum instance, RocksDB will use the default crc32checksum implemented in table/sst_file_checksum_crc32c.h Pull Request resolved: https://github.com/facebook/rocksdb/pull/6216 Test Plan: Added the testing case in table_test and ldb_cmd_test to verify checksum is correct in different level. Pass make asan_check. Differential Revision: D19171461 Pulled By: zhichao-cao fbshipit-source-id: b2e53479eefc5bb0437189eaa1941670e5ba8b87	2020-02-10 15:52:52 -08:00
sdong	594e815e32	Make clang analyze happy with options_test (#6398 ) Summary: clang analysis shows following warning: options/options_test.cc:1554:24: warning: The left operand of '-' is a garbage value (file_size - 1) / readahead_size + 1); ~~~~~~~~~ ^ Explicitly initialize file_size and add an assertion to make clang analysis happy. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6398 Test Plan: Run "make analysis" and see the warning goes away. Differential Revision: D19819662 fbshipit-source-id: 1589ea91c0c8f78242538f01448e4ad0e5fbc219	2020-02-10 15:50:25 -08:00
sdong	b2bc1da561	Try to fix some analysis failures Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6384 Test Plan: Wait and see the analysis result. Differential Revision: D19781072 fbshipit-source-id: 75e7cb6ee619ebd289841eaabea03dd075c09d3b	2020-02-10 13:37:14 -08:00
sdong	876c2dbff4	Allow readahead when reading option files. (#6372 ) Summary: Right, when reading from option files, no readahead is used and 8KB buffer is used. It might introduce high latency if the file system provide high latency and doesn't do readahead. Instead, introduce a readahead to the file. When calling inside DB, infer the value from options.log_readahead. Otherwise, a default 512KB readahead size is used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6372 Test Plan: Add --log_readahead_size in db_bench. Run it with several options and observe read size from option files using strace. Differential Revision: D19727739 fbshipit-source-id: e6d8053b0a64259abc087f1f388b9cd66fa8a583	2020-02-07 15:18:26 -08:00
Mike Kolupaev	637e64b9ac	Add an option to prevent DB::Open() from querying sizes of all sst files (#6353 ) Summary: When paranoid_checks is on, DBImpl::CheckConsistency() iterates over all sst files and calls Env::GetFileSize() for each of them. As far as I could understand, this is pretty arbitrary and doesn't affect correctness - if filesystem doesn't corrupt fsynced files, the file sizes will always match; if it does, it may as well corrupt contents as well as sizes, and rocksdb doesn't check contents on open. If there are thousands of sst files, getting all their sizes takes a while. If, on top of that, Env is overridden to use some remote storage instead of local filesystem, it can be really slow and overload the remote storage service. This PR adds an option to not do GetFileSize(); instead it does GetChildren() for parent directory to check that all the expected sst files are at least present, but doesn't check their sizes. We can't just disable paranoid_checks instead because paranoid_checks do a few other important things: make the DB read-only on write errors, print error messages on read errors, etc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6353 Test Plan: ran the added sanity check unit test. Will try it out in a LogDevice test cluster where the GetFileSize() calls are causing a lot of trouble. Differential Revision: D19656425 Pulled By: al13n321 fbshipit-source-id: c2c421b367633033760d1f56747bad206d1fbf82	2020-02-04 01:27:26 -08:00
sdong	8f2bee6747	Add ReadOptions.auto_prefix_mode (#6314 ) Summary: Add a new option ReadOptions.auto_prefix_mode. When set to true, iterator should return the same result as total order seek, but may choose to do prefix seek internally, based on iterator upper bounds. Also fix two previous bugs when handling prefix extrator changes: (1) reverse iterator should not rely on upper bound to determine prefix. Fix it with skipping prefix check. (2) block-based filter is not handled properly. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6314 Test Plan: (1) add a unit test; (2) add the check to stress test and run see whether it can pass at least one run. Differential Revision: D19458717 fbshipit-source-id: 51c1bcc5cdd826c2469af201979a39600e779bce	2020-01-28 14:44:05 -08:00
anand76	afa2420c2b	Introduce a new storage specific Env API (#5761 ) Summary: The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc. This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO. The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before. This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection. The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761 Differential Revision: D18868376 Pulled By: anand1976 fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f	2019-12-13 14:48:41 -08:00
Peter Dillinger	57f3032285	Allow fractional bits/key in BloomFilterPolicy (#6092 ) Summary: There's no technological impediment to allowing the Bloom filter bits/key to be non-integer (fractional/decimal) values, and it provides finer control over the memory vs. accuracy trade-off. This is especially handy in using the format_version=5 Bloom filter in place of the old one, because bits_per_key=9.55 provides the same accuracy as the old bits_per_key=10. This change not only requires refining the logic for choosing the best num_probes for a given bits/key setting, it revealed a flaw in that logic. As bits/key gets higher, the best num_probes for a cache-local Bloom filter is closer to bpk / 2 than to bpk * 0.69, the best choice for a standard Bloom filter. For example, at 16 bits per key, the best num_probes is 9 (FP rate = 0.0843%) not 11 (FP rate = 0.0884%). This change fixes and refines that logic (for the format_version=5 Bloom filter only, just in case) based on empirical tests to find accuracy inflection points between each num_probes. Although bits_per_key is now specified as a double, the new Bloom filter converts/rounds this to "millibits / key" for predictable/precise internal computations. Just in case of unforeseen compatibility issues, we round to the nearest whole number bits / key for the legacy Bloom filter, so as not to unlock new behaviors for it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6092 Test Plan: unit tests included Differential Revision: D18711313 Pulled By: pdillinger fbshipit-source-id: 1aa73295f152a995328cb846ef9157ae8a05522a	2019-11-26 15:59:34 -08:00
Peter Dillinger	4f17d33db4	Remove unused/undefined ImmutableCFOptions() (#6086 ) Summary: default constructor not used or even defined Pull Request resolved: https://github.com/facebook/rocksdb/pull/6086 Differential Revision: D18695669 Pulled By: pdillinger fbshipit-source-id: 6b6ac46029f4fb6edf1c11ee6ce1d9f172b2eaf2	2019-11-26 13:28:31 -08:00
sdong	e8263dbdaa	Apply formatter to recent 200+ commits. (#5830 ) Summary: Further apply formatter to more recent commits. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5830 Test Plan: Run all existing tests. Differential Revision: D17488031 fbshipit-source-id: 137458fd94d56dd271b8b40c522b03036943a2ab	2019-09-20 12:04:26 -07:00
sdong	c06b54d0c6	Apply formatter on recent 45 commits. (#5827 ) Summary: Some recent commits might not have passed through the formatter. I formatted recent 45 commits. The script hangs for more commits so I stopped there. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5827 Test Plan: Run all existing tests. Differential Revision: D17483727 fbshipit-source-id: af23113ee63015d8a43d89a3bc2c1056189afe8f	2019-09-19 12:34:17 -07:00
Maysam Yabandeh	6ec6a4a9a4	Remove snap_refresh_nanos option (#5826 ) Summary: The snap_refresh_nanos option didn't bring much benefit. Remove the feature to simplify the code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5826 Differential Revision: D17467147 Pulled By: maysamyabandeh fbshipit-source-id: 4f950b046990d0d1292d7fc04c2ccafaf751c7f0	2019-09-18 20:26:04 -07:00
sdong	b931f84e56	Divide file_reader_writer.h and .cc (#5803 ) Summary: file_reader_writer.h and .cc contain several files and helper function, and it's hard to navigate. Separate it to multiple files and put them under file/ Pull Request resolved: https://github.com/facebook/rocksdb/pull/5803 Test Plan: Build whole project using make and cmake. Differential Revision: D17374550 fbshipit-source-id: 10efca907721e7a78ed25bbf74dc5410dea05987	2019-09-16 10:33:51 -07:00
Ronak Sisodia	d05c0fe4d1	Option to make write group size configurable (#5759 ) Summary: The max batch size that we can write to the WAL is controlled by a static manner. So if the leader write is less than 128 KB we will have the batch size as leader write size + 128 KB else the limit will be 1 MB. Both of them are statically defined. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5759 Differential Revision: D17329298 fbshipit-source-id: a3d910629d8d8ca84ea39ad89c2b2d284571ded5	2019-09-11 18:28:33 -07:00
Wilfried Goesgens	fbab9913e2	upgrade gtest 1.7.0 => 1.8.1 for json result writing Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5332 Differential Revision: D17242232 fbshipit-source-id: c0d4646556a1335e51ac7382b986ca7f6ced7b64	2019-09-09 11:24:11 -07:00
Vijay Nadimpalli	979fbdc696	Persistent globally unique DB ID in manifest (#5725 ) Summary: Each DB has a globally unique ID. A DB can be physically copied around, or backed-up and restored, and the users should be identify the same DB. This unique ID right now is stored as plain text in file IDENTITY under the DB directory. This approach introduces at least two problems: (1) the file is not checksumed; (2) the source of truth of a DB is the manifest file, which can be copied separately from IDENTITY file, causing the DB ID to be wrong. The goal of this PR is solve this problem by moving the DB ID to manifest. To begin with we will write to both identity file and manifest. Write to Manifest is controlled via the flag write_dbid_to_manifest in Options and default is false. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5725 Test Plan: Added unit tests. Differential Revision: D16963840 Pulled By: vjnadimpalli fbshipit-source-id: 8a86a4c8c82c716003c40fd6b9d2d758030d92e9	2019-09-03 08:52:24 -07:00
Zhongyi Xie	2f41ecfe75	Refactor trimming logic for immutable memtables (#5022 ) Summary: MyRocks currently sets `max_write_buffer_number_to_maintain` in order to maintain enough history for transaction conflict checking. The effectiveness of this approach depends on the size of memtables. When memtables are small, it may not keep enough history; when memtables are large, this may consume too much memory. We are proposing a new way to configure memtable list history: by limiting the memory usage of immutable memtables. The new option is `max_write_buffer_size_to_maintain` and it will take precedence over the old `max_write_buffer_number_to_maintain` if they are both set to non-zero values. The new option accounts for the total memory usage of flushed immutable memtables and mutable memtable. When the total usage exceeds the limit, RocksDB may start dropping immutable memtables (which is also called trimming history), starting from the oldest one. The semantics of the old option actually works both as an upper bound and lower bound. History trimming will start if number of immutable memtables exceeds the limit, but it will never go below (limit-1) due to history trimming. In order the mimic the behavior with the new option, history trimming will stop if dropping the next immutable memtable causes the total memory usage go below the size limit. For example, assuming the size limit is set to 64MB, and there are 3 immutable memtables with sizes of 20, 30, 30. Although the total memory usage is 80MB > 64MB, dropping the oldest memtable will reduce the memory usage to 60MB < 64MB, so in this case no memtable will be dropped. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5022 Differential Revision: D14394062 Pulled By: miasantreble fbshipit-source-id: 60457a509c6af89d0993f988c9b5c2aa9e45f5c5	2019-08-23 13:55:34 -07:00
Mark Rambacher	cfcf045acc	The ObjectRegistry class replaces the Registrar and NewCustomObjects.… (#5293 ) Summary: The ObjectRegistry class replaces the Registrar and NewCustomObjects. Objects are registered with the registry by Type (the class must implement the static const char *Type() method). This change is necessary for a few reasons: - By having a class (rather than static template instances), the class can be passed between compilation units, meaning that objects could be registered and shared from a dynamic library with an executable. - By having a class with instances, different units could have different objects registered. This could be useful if, for example, one Option allowed for a dynamic library and one did not. When combined with some other PRs (being able to load shared libraries, a Configurable interface to configure objects to/from string), this code will allow objects in external shared libraries to be added to a RocksDB image at run-time, rather than requiring every new extension to be built into the main library and called explicitly by every program. Test plan (on riversand963's devserver) ``` $COMPILE_WITH_ASAN=1 make -j32 all && sleep 1 && make check ``` All tests pass. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5293 Differential Revision: D16363396 Pulled By: riversand963 fbshipit-source-id: fbe4acb615bfc11103eef40a0b288845791c0180	2019-07-23 17:13:05 -07:00
Eli Pozniansky	c129c75fb7	Added log_readahead_size option to control prefetching for Log::Reader (#5592 ) Summary: Added log_readahead_size option to control prefetching for Log::Reader. This is mostly useful for reading a remotely located log, as it can save the number of round-trips when reading it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5592 Differential Revision: D16362989 Pulled By: elipoz fbshipit-source-id: c5d4d5245a44008cd59879640efff70c091ad3e8	2019-07-19 12:00:19 -07:00
sdong	15fd3be07b	LRU Cache to enable mid-point insertion by default (#5508 ) Summary: Mid-point insertion is a useful feature and is mature now. Make it default. Also changed cache_index_and_filter_blocks_with_high_priority=true as default accordingly, so that we won't evict index and filter blocks easier after the change, to avoid too many surprises to users. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5508 Test Plan: Run all existing tests. Differential Revision: D16021179 fbshipit-source-id: ce8456e8d43b3bfb48df6c304b5290a9d19817eb	2019-06-27 10:20:57 -07:00
Mike Kolupaev	b4d7209428	Add an option to put first key of each sst block in the index (#5289 ) Summary: The first key is used to defer reading the data block until this file gets to the top of merging iterator's heap. For short range scans, most files never make it to the top of the heap, so this change can reduce read amplification by a lot sometimes. Consider the following workload. There are a few data streams (we'll be calling them "logs"), each stream consisting of a sequence of blobs (we'll be calling them "records"). Each record is identified by log ID and a sequence number within the log. RocksDB key is concatenation of log ID and sequence number (big endian). Reads are mostly relatively short range scans, each within a single log. Writes are mostly sequential for each log, but writes to different logs are randomly interleaved. Compactions are disabled; instead, when we accumulate a few tens of sst files, we create a new column family and start writing to it. So, a typical sst file consists of a few ranges of blocks, each range corresponding to one log ID (we use FlushBlockPolicy to cut blocks at log boundaries). A typical read would go like this. First, iterator Seek() reads one block from each sst file. Then a series of Next()s move through one sst file (since writes to each log are mostly sequential) until the subiterator reaches the end of this log in this sst file; then Next() switches to the next sst file and reads sequentially from that, and so on. Often a range scan will only return records from a small number of blocks in small number of sst files; in this case, the cost of initial Seek() reading one block from each file may be bigger than the cost of reading the actually useful blocks. Neither iterate_upper_bound nor bloom filters can prevent reading one block from each file in Seek(). But this PR can: if the index contains first key from each block, we don't have to read the block until this block actually makes it to the top of merging iterator's heap, so for short range scans we won't read any blocks from most of the sst files. This PR does the deferred block loading inside value() call. This is not ideal: there's no good way to report an IO error from inside value(). As discussed with siying offline, it would probably be better to change InternalIterator's interface to explicitly fetch deferred value and get status. I'll do it in a separate PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5289 Differential Revision: D15256423 Pulled By: al13n321 fbshipit-source-id: 750e4c39ce88e8d41662f701cf6275d9388ba46a	2019-06-24 20:54:04 -07:00
Zhongyi Xie	671d15cbdd	Persistent Stats: persist stats history to disk (#5046 ) Summary: This PR continues the work in https://github.com/facebook/rocksdb/pull/4748 and https://github.com/facebook/rocksdb/pull/4535 by adding a new DBOption `persist_stats_to_disk` which instructs RocksDB to persist stats history to RocksDB itself. When statistics is enabled, and both options `stats_persist_period_sec` and `persist_stats_to_disk` are set, RocksDB will periodically write stats to a built-in column family in the following form: key -> (timestamp in microseconds)#(stats name), value -> stats value. The existing API `GetStatsHistory` will detect the current value of `persist_stats_to_disk` and either read from in-memory data structure or from the hidden column family on disk. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5046 Differential Revision: D15863138 Pulled By: miasantreble fbshipit-source-id: bb82abdb3f2ca581aa42531734ac799f113e931b	2019-06-17 15:21:50 -07:00
Zhongyi Xie	d68f9f4580	simplify include directive involving inttypes (#5402 ) Summary: When using `PRIu64` type of printf specifier, current code base does the following: ``` #ifndef __STDC_FORMAT_MACROS #define __STDC_FORMAT_MACROS #endif #include <inttypes.h> ``` However, this can be simplified to ``` #include <cinttypes> ``` as long as flag `-std=c++11` is used. This should solve issues like https://github.com/facebook/rocksdb/issues/5159 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5402 Differential Revision: D15701195 Pulled By: miasantreble fbshipit-source-id: 6dac0a05f52aadb55e9728038599d3d2e4b59d03	2019-06-06 13:56:07 -07:00
Yanqin Jin	340ed4fac7	Add support for timestamp in Get/Put (#5079 ) Summary: It's useful to be able to (optionally) associate key-value pairs with user-provided timestamps. This PR is an early effort towards this goal and continues the work of facebook#4942. A suite of new unit tests exist in DBBasicTestWithTimestampWithParam. Support for timestamp requires the user to provide timestamp as a slice in `ReadOptions` and `WriteOptions`. All timestamps of the same database must share the same length, format, etc. The format of the timestamp is the same throughout the same database, and the user is responsible for providing a comparator function (Comparator) to order the <key, timestamp> tuples. Once created, the format and length of the timestamp cannot change (at least for now). Test plan (on devserver): ``` $COMPILE_WITH_ASAN=1 make -j32 all $./db_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/* $make check ``` All tests must pass. We also run the following db_bench tests to verify whether there is regression on Get/Put while timestamp is not enabled. ``` $TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillseq,readrandom -num=1000000 $TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=1000000 ``` Repeat for 6 times for both versions. Results are as follows: ``` \| \| readrandom \| fillrandom \| \| master \| 16.77 MB/s \| 47.05 MB/s \| \| PR5079 \| 16.44 MB/s \| 47.03 MB/s \| ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5079 Differential Revision: D15132946 Pulled By: riversand963 fbshipit-source-id: 833a0d657eac21182f0f206c910a6438154c742c	2019-06-05 23:10:47 -07:00
Maysam Yabandeh	ae05a83e19	Call ValidateOptions from SetOptions (#5368 ) Summary: Currently we validate options in DB::Open. However the validation step is missing when options are dynamically updated in ::SetOptions. The patch fixes that. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5368 Differential Revision: D15540101 Pulled By: maysamyabandeh fbshipit-source-id: d27bbffd8f0252d1b50bcf59e0a70a278ed937f4	2019-06-03 19:49:57 -07:00
Siying Dong	000b9ec217	Move some logging related files to logging/ (#5387 ) Summary: Many logging related source files are under util/. It will be more structured if they are together. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5387 Differential Revision: D15579036 Pulled By: siying fbshipit-source-id: 3850134ed50b8c0bb40a0c8ae1f184fa4081303f	2019-05-31 17:23:59 -07:00
Siying Dong	8843129ece	Move some memory related files from util/ to memory/ (#5382 ) Summary: Move arena, allocator, and memory tools under util to a separate memory/ directory. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5382 Differential Revision: D15564655 Pulled By: siying fbshipit-source-id: 9cd6b5d0d3d52b39606e19221fa154596e5852a5	2019-05-30 17:44:09 -07:00
Vijay Nadimpalli	50e470791d	Organizing rocksdb/table directory by format Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5373 Differential Revision: D15559425 Pulled By: vjnadimpalli fbshipit-source-id: 5d6d6d615582bedd96a4b879bb25d429a6de8b55	2019-05-30 14:51:11 -07:00
Siying Dong	e9e0101ca4	Move test related files under util/ to test_util/ (#5377 ) Summary: There are too many types of files under util/. Some test related files don't belong to there or just are just loosely related. Mo ve them to a new directory test_util/, so that util/ is cleaner. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5377 Differential Revision: D15551366 Pulled By: siying fbshipit-source-id: 0f5c8653832354ef8caa31749c0143815d719e2c	2019-05-30 11:25:51 -07:00
Zhongyi Xie	87fe4bcab8	Fix FIFO dynamic options sanitization (#5367 ) Summary: When dynamically setting options, we check the option type info and skip options that are marked deprecated. However this check is only done at top level, which results in bugs where SetOptions will corrupt option values and cause unexpected system behavior iff a deprecated second level option is set dynamically. For exmaple, the following call: ``` dbfull()->SetOptions( {{"compaction_options_fifo", "{allow_compaction=true;max_table_files_size=1024;ttl=731;}"}}); ``` was from pre 6.0 release when `ttl` was part of `compaction_options_fifo`. Now that it got moved out of `compaction_options_fifo`, this call will incorrectly set `compaction_options_fifo.max_table_files_size` to 731 (as `max_table_files_size` is the first one in `OptionsHelper::fifo_compaction_options_type_info` struct) and cause files to gett evicted much faster than expected. This PR adds verification to second level options like `compaction_options_fifo.ttl` or `compaction_options_fifo.max_table_files_size` when set dynamically, and filter out those marked as deprecated. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5367 Differential Revision: D15530998 Pulled By: miasantreble fbshipit-source-id: 818258be5c3abe09cd82d62f3c083572d70fecdd	2019-05-30 10:46:28 -07:00
Dave Rigby	8149bb9d6a	Pass OptionTypeInfo maps by const& (#5295 ) Summary: In options_helper.cc various functions take a const unordered_map of string -> TypeInfo for options handling. These functions pass by-value the (const) maps, resulting in unnecessary copies. Change to pass by reference. This results in a noticable reduction in the amount of time spent parsing options - in my case a set of unit tests using RocksDB which call SetOptions() to modify options see a ~25% runtime reduction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5295 Differential Revision: D15296334 Pulled By: riversand963 fbshipit-source-id: 4d4be3db635264943607911b296dda27fd7ce1a7	2019-05-15 14:25:57 -07:00
Maysam Yabandeh	f383641a1d	Unordered Writes (#5218 ) Summary: Performing unordered writes in rocksdb when unordered_write option is set to true. When enabled the writes to memtable are done without joining any write thread. This offers much higher write throughput since the upcoming writes would not have to wait for the slowest memtable write to finish. The tradeoff is that the writes visible to a snapshot might change over time. If the application cannot tolerate that, it should implement its own mechanisms to work around that. Using TransactionDB with WRITE_PREPARED write policy is one way to achieve that. Doing so increases the max throughput by 2.2x without however compromising the snapshot guarantees. The patch is prepared based on an original by siying Existing unit tests are extended to include unordered_write option. Benchmark Results: ``` TEST_TMPDIR=/dev/shm/ ./db_bench_unordered --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions --unordered_write=1 ``` With WAL - Vanilla RocksDB: 78.6 MB/s - WRITER_PREPARED with unordered_write: 177.8 MB/s (2.2x) - unordered_write: 368.9 MB/s (4.7x with relaxed snapshot guarantees) Without WAL - Vanilla RocksDB: 111.3 MB/s - WRITER_PREPARED with unordered_write: 259.3 MB/s MB/s (2.3x) - unordered_write: 645.6 MB/s (5.8x with relaxed snapshot guarantees) - WRITER_PREPARED with unordered_write disable concurrency control: 185.3 MB/s MB/s (2.35x) Limitations: - The feature is not yet extended to `max_successive_merges` > 0. The feature is also incompatible with `enable_pipelined_write` = true as well as with `allow_concurrent_memtable_write` = false. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5218 Differential Revision: D15219029 Pulled By: maysamyabandeh fbshipit-source-id: 38f2abc4af8780148c6128acdba2b3227bc81759	2019-05-13 17:47:21 -07:00
Maysam Yabandeh	6a40ee5eb1	Refresh snapshot list during long compactions (2nd attempt) (#5278 ) Summary: Part of compaction cpu goes to processing snapshot list, the larger the list the bigger the overhead. Although the lifetime of most of the snapshots is much shorter than the lifetime of compactions, the compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction to continue more efficiently with much smaller snapshot list. For simplicity, to avoid the feature is disabled in two cases: i) When more than one sub-compaction are sharing the same snapshot list, ii) when Range Delete is used in which the range delete aggregator has its own copy of snapshot list. This fixes the reverted https://github.com/facebook/rocksdb/pull/5099 issue with range deletes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5278 Differential Revision: D15203291 Pulled By: maysamyabandeh fbshipit-source-id: fa645611e606aa222c7ce53176dc5bb6f259c258	2019-05-03 17:30:22 -07:00
Zhongyi Xie	d51eb0b583	set snappy compression only when supported (#4325 ) Summary: Right now `OptimizeLevelStyleCompaction` may set compression type to Snappy even when Snappy is not supported, this may cause errors like "no snappy compression support" Fixes https://github.com/facebook/rocksdb/issues/4283 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4325 Differential Revision: D15125542 Pulled By: miasantreble fbshipit-source-id: 70890b73ababe16752721555dbd290633c2aafac	2019-05-01 20:40:00 -07:00
Maysam Yabandeh	521d234bda	Revert snap_refresh_nanos feature (#5269 ) Summary: Our daily stress tests are failing after this feature. Reverting temporarily until we figure the reason for test failures. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5269 Differential Revision: D15151285 Pulled By: maysamyabandeh fbshipit-source-id: e4002b99690a97df30d4b4b58bf0f61e9591bc6e	2019-05-01 10:07:30 -07:00
Maysam Yabandeh	506e8448be	Refresh snapshot list during long compactions (#5099 ) Summary: Part of compaction cpu goes to processing snapshot list, the larger the list the bigger the overhead. Although the lifetime of most of the snapshots is much shorter than the lifetime of compactions, the compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction to continue more efficiently with much smaller snapshot list. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5099 Differential Revision: D15086710 Pulled By: maysamyabandeh fbshipit-source-id: 7649f56c3b6b2fb334962048150142a3bf9c1a12	2019-04-25 18:17:22 -07:00
Andrew Kryczka	6eb317bb4c	Option string/map/file can set env from object registry (#5237 ) Summary: - By providing the "env" field in any text-based options (i.e., string, map, or file), we can use `NewCustomObject` to deserialize the text value into an actual `Env` object. - Currently factory functions for `Env` registered with object registry should only return pointer to static `Env` objects. That's because `DBOptions::env` is a raw pointer so we cannot easily delegate cleanup. - Note I did not add `env` to `db_option_type_info`. It wasn't needed for (de)serialization, and I believe we don't want to do verification on `env`, even by checking name. That's because the user should be able to copy their DB from Linux to Windows, change envs, and not see an option verification error. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5237 Differential Revision: D15056360 Pulled By: siying fbshipit-source-id: 4b5f0b83297a5058f8949ec955dbf27d98d73d7e	2019-04-25 11:35:09 -07:00
Andrew Kryczka	8272a6de57	Optionally wait on bytes_per_sync to smooth I/O (#5183 ) Summary: The existing implementation does not guarantee bytes reach disk every `bytes_per_sync` when writing SST files, or every `wal_bytes_per_sync` when writing WALs. This can cause confusing behavior for users who enable this feature to avoid large syncs during flush and compaction, but then end up hitting them anyways. My understanding of the existing behavior is we used `sync_file_range` with `SYNC_FILE_RANGE_WRITE` to submit ranges for async writeback, such that we could continue processing the next range of bytes while that I/O is happening. I believe we can preserve that benefit while also limiting how far the processing can get ahead of the I/O, which prevents huge syncs from happening when the file finishes. Consider this `sync_file_range` usage: `sync_file_range(fd_, 0, static_cast<off_t>(offset + nbytes), SYNC_FILE_RANGE_WAIT_BEFORE \| SYNC_FILE_RANGE_WRITE)`. Expanding the range to start at 0 and adding the `SYNC_FILE_RANGE_WAIT_BEFORE` flag causes any pending writeback (like from a previous call to `sync_file_range`) to finish before it proceeds to submit the latest `nbytes` for writeback. The latest `nbytes` are still written back asynchronously, unless processing exceeds I/O speed, in which case the following `sync_file_range` will need to wait on it. There is a second change in this PR to use `fdatasync` when `sync_file_range` is unavailable (determined statically) or has some known problem with the underlying filesystem (determined dynamically). The above two changes only apply when the user enables a new option, `strict_bytes_per_sync`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5183 Differential Revision: D14953553 Pulled By: siying fbshipit-source-id: 445c3862e019fb7b470f9c7f314fc231b62706e9	2019-04-22 11:51:39 -07:00
Mike Kolupaev	df38c1ce66	Add BlockBasedTableOptions::index_shortening (#5174 ) Summary: Introduce BlockBasedTableOptions::index_shortening to give users control on which key shortening techniques to be used in building index blocks. Before this patch, both separators and successor keys where shortened in indexes. With this patch, the default is set to kShortenSeparators to only shorten the separators. Since each index block has many separators and only one successor (last key), the change should not have negative impact on index block size. However it should prevent many unnecessary block loads where due to approximation introduced by shorted successor, seek would land us to the previous block and then fix it by moving to the next one. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5174 Differential Revision: D14884185 Pulled By: al13n321 fbshipit-source-id: 1b08bc8c03edcf09b6b8c16e9a7eea08ad4dd534	2019-04-22 08:20:35 -07:00

1 2 3 4 5 ...

254 Commits