rocksdb

Author	SHA1	Message	Date
Mark Callaghan	326670d265	Add new db_bench --benchmarks options for controlling compaction (#8027 ) Summary: The new options are: * compact0 - compact L0 into L1 using one thread * compact1 - compact L1 into L2 using one thread * flush - flush memtable * waitforcompaction - wait for compaction to finish These are useful for reproducible benchmarks to help get the LSM tree shape into a deterministic state. I wrote about this at: http://smalldatum.blogspot.com/2021/02/read-only-benchmarks-with-lsm-are.html Pull Request resolved: https://github.com/facebook/rocksdb/pull/8027 Reviewed By: riversand963 Differential Revision: D27053861 Pulled By: ajkr fbshipit-source-id: 1646f35584a3db03740fbeb47d91c3f00fb35d6e	2021-03-17 09:12:27 -07:00
stefan-zobel	8d9088464b	Java-API: Fix minor Javadoc copy-paste errors (#8034 ) Summary: Fixes 3 minor Javadoc copy-paste errors in the `RocksDB#newIterator()` and `Transaction#getIterator()` variants that take a column family handle but are talking about iterating over "the database" or "the default column family". Pull Request resolved: https://github.com/facebook/rocksdb/pull/8034 Reviewed By: jay-zhuang Differential Revision: D26877667 Pulled By: mrambacher fbshipit-source-id: 95dd95b667c496e389f221acc9a91b340e4b63bf	2021-03-16 18:07:09 -07:00
mrambacher	1a343bc393	Make ChRootEnv, EncryptedEnv, and TimedEnv into FileSystems (#7968 ) Summary: These classes were wrappers of Env that provided only extensions to the FileSystem functionality. Changed the classes to be FileSystems, and the wrappers are now CompositeEnvWrappers. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7968 Reviewed By: anand1976 Differential Revision: D26900253 Pulled By: mrambacher fbshipit-source-id: 94001d8024a3c54a1c11adadca2bac66c3af2a77	2021-03-15 19:50:11 -07:00
Yanqin Jin	0304352882	Fix a bug in key comparison when index type is kBinarySearchWithFirstKey (#8062 ) Summary: When timestamps are enabled, key comparison should take them into account. In `BlockBasedTableReader::Get()` and `BlockBasedTableReader::MultiGet()`, assume the target key is `key`, and the timestamp upper bound is `ts`. The highest key in the current block is (key, ts1), while the lowest key in the next block is (key, ts2). If `ts1 > ts > ts2`, then `(key, ts1) < (key, ts) < (key, ts2)`. It can be shown that if `Compare()` is used, then we will mistakenly skip the next block. Instead, we should use `CompareWithoutTimestamp()`. The majority of this PR makes some existing tests in `db_with_timestamp_basic_test.cc` parameterized so that different index types can be tested. A new unit test is also added for more coverage. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8062 Test Plan: make check Reviewed By: ltamasi Differential Revision: D27057557 Pulled By: riversand963 fbshipit-source-id: c1062fa7c159ed600a1ad7e461531d52265021f1	2021-03-15 17:44:52 -07:00
Yanqin Jin	85d4f2c8b3	Move a test file to a better location (#8054 ) Summary: As title. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8054 Test Plan: make check Reviewed By: mrambacher Differential Revision: D27017955 Pulled By: riversand963 fbshipit-source-id: 829497d507bc89afbe982f8a8cf3555e52fd7098	2021-03-15 15:03:27 -07:00
mrambacher	3dff28cf9b	Use SystemClock* instead of std::shared_ptr<SystemClock> in lower level routines (#8033 ) Summary: For performance purposes, the lower level routines were changed to use a SystemClock* instead of a std::shared_ptr<SystemClock>. The shared ptr has some performance degradation on certain hardware classes. For most of the system, there is no risk of the pointer being deleted/invalid because the shared_ptr will be stored elsewhere. For example, the ImmutableDBOptions stores the Env which has a std::shared_ptr<SystemClock> in it. The SystemClock* within the ImmutableDBOptions is essentially a "short cut" to gain access to this constant resource. There were a few classes (PeriodicWorkScheduler?) where the "short cut" property did not hold. In those cases, the shared pointer was preserved. Using db_bench readrandom perf_level=3 on my EC2 box, this change performed as well as or better than 6.17: 6.17: readrandom : 28.046 micros/op 854902 ops/sec; 61.3 MB/s (355999 of 355999 found) 6.18: readrandom : 32.615 micros/op 735306 ops/sec; 52.7 MB/s (290999 of 290999 found) PR: readrandom : 27.500 micros/op 871909 ops/sec; 62.5 MB/s (367999 of 367999 found) (Note that the times for 6.18 are prior to revert of the SystemClock). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8033 Reviewed By: pdillinger Differential Revision: D27014563 Pulled By: mrambacher fbshipit-source-id: ad0459eba03182e454391b5926bf5cdd45657b67	2021-03-15 04:34:11 -07:00
Andrew Kryczka	b8f40f7f7b	Deflake tests of compaction based on compensated file size (#8036 ) Summary: CompactionDeletionTriggerReopen was observed to be flaky recently: https://app.circleci.com/pipelines/github/facebook/rocksdb/6030/workflows/787af4f3-b9f7-4645-8e8d-1fb0ebf05539/jobs/101451. I went through it and the related tests and arrived at different conclusions on what constraints we can expect on DB size. Some constraints got looser and some got tighter. The particular constraint that flaked got a lot looser so at least the flake linked above would have been prevented. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8036 Reviewed By: riversand963 Differential Revision: D26862566 Pulled By: ajkr fbshipit-source-id: 3512b86b4fb41aeecae32e1c7382c03916d88d88	2021-03-14 20:25:42 -07:00
Levi Tamasi	b708b166dc	Fix a harmless data race affecting two test cases (#8055 ) Summary: `DBTest.GetLiveBlobFiles` and `ObsoleteFilesTest.BlobFiles` both modify the current `Version` in their setup phase, implicitly assuming that no other threads would touch the `Version` while this is happening. The periodic stats dumper thread violates this assumption; the patch fixes this by disabling it in the affected test cases. (Note: the data race is harmless in the sense that it only affects test code.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/8055 Test Plan: `COMPILE_WITH_TSAN=1 make db_test -j24`; `gtest-parallel --repeat=10000 ./db_test --gtest_filter="GetLiveBlobFiles"`; `COMPILE_WITH_TSAN=1 make obsolete_files_test -j24`; `gtest-parallel --repeat=10000 ./obsolete_files_test --gtest_filter="BlobFiles"` Reviewed By: riversand963 Differential Revision: D27022715 Pulled By: ltamasi fbshipit-source-id: b6cc77ed63d8bc1cbe0603522ff1a572182fc9ab	2021-03-12 16:44:35 -08:00
Peter Dillinger	01c2ec3fcb	Add ROCKSDB_GTEST_BYPASS (#8048 ) Summary: This is for cases that do not meet the Facebook criteria for SKIP (see new comments). Also made ROCKSDB_GTEST_{SKIP,BYPASS} print the message because gtest doesn't ever seem to. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8048 Test Plan: manual inspection of ./ribbon_test output, CI Reviewed By: mrambacher Differential Revision: D26953688 Pulled By: pdillinger fbshipit-source-id: c914eaffe7d419db6ab90a193d474531e23582e5	2021-03-12 16:02:06 -08:00
Peter Dillinger	119dda2195	Instantiate tests DBIteratorTestForPinnedData (#8051 ) Summary: a trial gtest upgrade discovered some parameterized tests missing instantiation. By some miracle, they still pass. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8051 Test Plan: thisisthetest Reviewed By: mrambacher Differential Revision: D27003684 Pulled By: pdillinger fbshipit-source-id: cde1cab1551fb282f67d462d46574bd30bd5e61f	2021-03-12 12:31:29 -08:00
Peter Dillinger	589ea6bec2	Add BackupEngine API for backup file details (#8042 ) Summary: This API can be used for things like determining how much space can be freed up by deleting a particular backup, etc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8042 Test Plan: validation of the API added to many existing backup unit tests Reviewed By: mrambacher Differential Revision: D26936577 Pulled By: pdillinger fbshipit-source-id: f0bbd90f0917b9781a6837652fb4616d9247816a	2021-03-12 11:03:54 -08:00
Yanqin Jin	82b3888433	Enable backward iterator for keys with user-defined timestamp (#8035 ) Summary: This PR does the following: - Enable backward iteration for keys with user-defined timestamp. Note that merge, single delete, and range delete are not supported yet. - Introduce a new helper API `Comparator::EqualWithoutTimestamp()`. - Fix a typo in `SetTimestamp()`. - Add/update unit tests. Ran db_bench (built with DEBUG_LEVEL=0) to demonstrate that no overhead is introduced for CPU-intensive workloads with a lot of `Prev()`. Also provided results of iterating keys with timestamps. 1. Disable timestamp, run: `./db_bench -db=/dev/shm/rocksdb -disable_wal=1 -benchmarks=fillseq,seekrandom[-W1-X6] -reverse_iterator=1 -seek_nexts=5` Results: Baseline: seekrandom [AVG 6 runs] : 96115 ops/sec; 53.2 MB/sec; seekrandom [MEDIAN 6 runs] : 98075 ops/sec; 54.2 MB/sec. This PR: seekrandom [AVG 6 runs] : 95521 ops/sec; 52.8 MB/sec; seekrandom [MEDIAN 6 runs] : 96338 ops/sec; 53.3 MB/sec. 2. Enable timestamp, run: `./db_bench -user_timestamp_size=8 -db=/dev/shm/rocksdb -disable_wal=1 -benchmarks=fillseq,seekrandom[-W1-X6] -reverse_iterator=1 -seek_nexts=5` Result: Baseline: not supported. This PR: seekrandom [AVG 6 runs] : 90514 ops/sec; 50.1 MB/sec; seekrandom [MEDIAN 6 runs] : 90834 ops/sec; 50.2 MB/sec. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8035 Reviewed By: ltamasi Differential Revision: D26926668 Pulled By: riversand963 fbshipit-source-id: 95330cc2242397c03e09d29e5417dfb0adc98ef5	2021-03-10 11:15:46 -08:00
Yanqin Jin	64517d184a	Make secondary instance use ManifestTailer (#7998 ) Summary: This PR - adds a class `ManifestTailer` that inherits from `VersionEditHandlerPointInTime`. `ManifestTailer::Iterate()` can be called multiple times to tail the primary instance's MANIFEST and apply the changes to the secondary, - updates the implementation of `ReactiveVersionSet::ReadAndApply` to use this class, - removes unused code in version_set.cc, - updates existing tests, e.g. removing deleted sync points from unit tests, - adds a new test to address the bug in https://github.com/facebook/rocksdb/issues/7815. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7998 Test Plan: make check Existing and newly-added tests in version_set_test.cc and db_secondary_test.cc Reviewed By: jay-zhuang Differential Revision: D26926641 Pulled By: riversand963 fbshipit-source-id: 8d4dd15db0ba863c213f743e33b5a207e948c980	2021-03-10 10:59:44 -08:00
David CARLIER	7a3444bf1f	Mac M1 crc32 intrinsics ARM64 check support proposal (#7893 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7893 Reviewed By: ajkr Differential Revision: D26050966 Pulled By: jay-zhuang fbshipit-source-id: 9df2bb65d82defd7fad49d5369979b03e22d39c2	2021-03-10 09:05:56 -08:00
stefan-zobel	cc34da75b5	Java-API: byteCompressionType should be declared as primitive type byte (#7981 ) Summary: The variable `byteCompressionType` is only assigned values of primitive type and is never 'null', but it is declared with the boxed type 'Byte'. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7981 Reviewed By: ajkr Differential Revision: D26546600 Pulled By: jay-zhuang fbshipit-source-id: 07b579cdfcfc2262a448ca3626e216416fd05892	2021-03-09 22:05:16 -08:00
qinzuoyan	6fad38ebe8	Fix compile error (#7908 ) Summary: OS: Ubuntu 14.04; Compiler: GCC 4.9.4; Compile error: `db/forward_iterator.cc:996:62: error: declaration of ‘key’ shadows a member of 'this' [-Werror=shadow] auto cmp = [&](const FileMetaData* f, const Slice& key) -> bool {` Pull Request resolved: https://github.com/facebook/rocksdb/pull/7908 Reviewed By: jay-zhuang Differential Revision: D26899986 Pulled By: ajkr fbshipit-source-id: 66b0b97aefd0f13a085e063491f8207366a9f848	2021-03-09 20:53:33 -08:00
Hans Holmberg	670567db09	Add support for custom file systems to ldb and sst_dump (#8010 ) Summary: This PR adds support for custom file systems to ldb and sst_dump by adding command line options for specifying --fs_uri and --backup_fs_uri (for ldb backup/restore commands). fs_uri is already supported in db_bench and db_stress, and ldb and db_stress already support specifying customized envs. The PR also fixes what looks like a bug in the ldb backup/restore commands: currently, backups can only be made from and to the same environment/file system, which does not seem to be the intended behavior. This PR makes it possible to create and restore backups across different envs/file systems. Example: `./ldb backup --fs_uri=zenfs://dev:nvme2n1 --backup_fs_uri=posix:// --backup_dir=/tmp/my_rocksdb_backup --db=rocksdbtest/dbbench` Pull Request resolved: https://github.com/facebook/rocksdb/pull/8010 Reviewed By: jay-zhuang Differential Revision: D26904654 Pulled By: ajkr fbshipit-source-id: 9b695ed8b944fcc6b27c4daaa9f52e87ee2c1fb4	2021-03-09 20:49:15 -08:00
Ed rodriguez	7381dad1b1	make: Fix C header prototypes (#7994 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7994 Reviewed By: jay-zhuang Differential Revision: D26904603 Pulled By: ajkr fbshipit-source-id: 0af92a51de895b40c7faaa4f0870b3f63279fe21	2021-03-09 20:44:23 -08:00
Peter Dillinger	4b18c46d10	Refactor: add LineFileReader and Status::MustCheck (#8026 ) Summary: Removed confusing, awkward, and undocumented internal API ReadOneLine and replaced with very simple LineFileReader. In refactoring backupable_db.cc, this has the side benefit of removing the arbitrary cap on the size of backup metadata files. Also added Status::MustCheck to make it easy to mark a Status as "must check." Using this, I can ensure that after LineFileReader::ReadLine returns false the caller checks GetStatus(). Also removed some excessive conditional compilation in status.h Pull Request resolved: https://github.com/facebook/rocksdb/pull/8026 Test Plan: added unit test, and running tests with ASSERT_STATUS_CHECKED Reviewed By: mrambacher Differential Revision: D26831687 Pulled By: pdillinger fbshipit-source-id: ef749c265a7a26bb13cd44f6f0f97db2955f6f0f	2021-03-09 20:12:38 -08:00
Peter Dillinger	847ca9f964	Make default share_files_with_checksum=true (#8020 ) Summary: New comment for share_files_with_checksum: // Only used if share_table_files is set to true. Setting to false is // DEPRECATED and potentially dangerous because in that case BackupEngine // can lose data if backing up databases with distinct or divergent // history, for example if restoring from a backup other than the latest, // writing to the DB, and creating another backup. Setting to true (default) // prevents these issues by ensuring that different table files (SSTs) with // the same number are treated as distinct. See // share_files_with_checksum_naming and ShareFilesNaming. I have also removed interim option kFlagMatchInterimNaming, which is no longer needed and was never needed for correct+compatible operation (just performance). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8020 Test Plan: tests updated. Backward+forward compatibility verified with SHORT_TEST=1 check_format_compatible.sh. ldb uses default backup options, and I manually verified shared_checksum in /tmp/rocksdb_format_compatible_peterd/bak/current/ after run. Reviewed By: ajkr Differential Revision: D26786331 Pulled By: pdillinger fbshipit-source-id: 36f968dfef1f5cacbd65154abe1d846151a55130	2021-03-09 16:27:13 -08:00
Peter Dillinger	0028e3398b	Make format_version=5 new default (#8017 ) Summary: Haven't seen any production issues with new Bloom filter and it's now > 1 year old (added in 6.6.0). Updated check_format_compatible.sh and HISTORY.md Pull Request resolved: https://github.com/facebook/rocksdb/pull/8017 Test Plan: tests updated (or prior bugs fixed) Reviewed By: ajkr Differential Revision: D26762197 Pulled By: pdillinger fbshipit-source-id: 0e755c46b443087c1544da0fd545beb9c403d1c2	2021-03-09 12:42:53 -08:00
stefan-zobel	430842f948	Java-API: Missing space in string literal (#7982 ) Summary: `TtlDB.open()`: missing space after 'column' `AdvancedColumnFamilyOptionsInterface.setLevelCompactionDynamicLevelBytes()`: missing space after 'cause' Pull Request resolved: https://github.com/facebook/rocksdb/pull/7982 Reviewed By: ajkr Differential Revision: D26546632 Pulled By: jay-zhuang fbshipit-source-id: 885dedcaa2200842764fbac9ce3766d54e1c8914	2021-03-09 11:30:29 -08:00
xinyuliu	8643d63bb4	Add ${ARTIFACT_SUFFIX} to benchmark tools built with cmake (#8016 ) Summary: Add ${ARTIFACT_SUFFIX} to benchmark tool names to enable differentiating jemalloc and non-jemalloc versions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8016 Reviewed By: jay-zhuang Differential Revision: D26907007 Pulled By: ajkr fbshipit-source-id: 78d3b3372b5454d52d5b663ea982135ea9cf7bf8	2021-03-09 10:38:22 -08:00
fanrui03	67d72fb5dc	Fix checkpoint stuck (#7921 ) Summary: 1. Bug description: A RocksDB Checkpoint may get stuck in the `WaitUntilFlushWouldNotStallWrites` method. 2. Analysis: 2.1 Configuration parameters: `Compaction Style: Universal`, `max_write_buffer_number: 4`, `min_write_buffer_number_to_merge: 3`. Checkpoint is usually very fast. When a Checkpoint is executed, `WaitUntilFlushWouldNotStallWrites` is called. If there are 2 immutable MemTables, which is fewer than `min_write_buffer_number_to_merge`, they will not be flushed, but execution still reaches this code in `GetWriteStallConditionAndCause`: `if (mutable_cf_options.max_write_buffer_number > 3 && num_unflushed_memtables >= mutable_cf_options.max_write_buffer_number - 1) { return {WriteStallCondition::kDelayed, WriteStallCause::kMemtableLimit}; }` code link: `fbed72f03c/db/column_family.cc (L847)`. Checkpoint assumes a flush job is pending, but none is, so it waits forever. 2.2 Solution: Add a restriction so that Checkpoint waits only if the number of immutable MemTables is >= `min_write_buffer_number_to_merge`. If there are better solutions, please correct me. 2.3 Code that reproduces the problem: https://github.com/1996fanrui/fanrui-learning/blob/flink-1.12/module-java/src/main/java/com/dream/rocksdb/RocksDBCheckpointStuck.java 3. Interesting point: This bug is triggered only when the number of sorted runs >= `level0_file_num_compaction_trigger`, because otherwise `WaitUntilFlushWouldNotStallWrites` breaks out early: `if (cfd->imm()->NumNotFlushed() < cfd->ioptions()->min_write_buffer_number_to_merge && vstorage->l0_delay_trigger_count() < mutable_cf_options.level0_file_num_compaction_trigger) { break; }` code link: `fbed72f03c/db/db_impl/db_impl_compaction_flush.cc (L1974)`. With universal compaction, `l0_delay_trigger_count()` may be >= `level0_file_num_compaction_trigger`, which is how this bug is triggered. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7921 Reviewed By: jay-zhuang Differential Revision: D26900559 Pulled By: ajkr fbshipit-source-id: 133c1252dad7393753f04a47590b68c7d8e670df	2021-03-09 02:21:25 -08:00
kshair	d2e9eab1ea	Fix mis-spelling (#8001 ) Summary: concurrnet -> concurrent Pull Request resolved: https://github.com/facebook/rocksdb/pull/8001 Reviewed By: ajkr Differential Revision: D26659381 Pulled By: riversand963 fbshipit-source-id: 890d102d1cf836ed3b183da66d3d56a3158017d0	2021-03-09 01:19:18 -08:00
jsteemann	02974c9437	make PerfStepTimer struct smaller by reordering members (#7931 ) Summary: On x86_64, this makes the struct 8 bytes smaller, so creating a PerfStepTimer on the stack will use slightly less stack space. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7931 Reviewed By: jay-zhuang Differential Revision: D26529470 Pulled By: ajkr fbshipit-source-id: bbe2e843167152ffa05a5946f1add6621c9849f7	2021-03-08 21:33:15 -08:00
Andrew Kryczka	ef392fb04e	use `LIB_MODE=shared` on Travis `make` commands (#8043 ) Summary: We were seeing intermittent `ld` failures due to `No space left on device` such as https://travis-ci.org/github/facebook/rocksdb/jobs/761905070. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8043 Reviewed By: pdillinger Differential Revision: D26889711 Pulled By: ajkr fbshipit-source-id: 010b7617d339bddc30026586bfde41539632fb2d	2021-03-08 17:21:24 -08:00
Andrew Kryczka	0ff0b625a1	Deflake DBTest2.PartitionedIndexUserToInternalKey on ppc64le (#8044 ) Summary: For some reason I still cannot figure out, the manual flush in this test was sometimes producing a third tiny file. I saw it a bunch of times on ppc64le, but even running a qemu system with that architecture (and playing with various other options) could not repro. However we did get an instrumented Travis run to confirm the problem is indeed a third tiny file - https://travis-ci.org/github/facebook/rocksdb/jobs/761986592. We can avoid it by filling memtables less full and using manual flush. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8044 Reviewed By: akankshamahajan15 Differential Revision: D26892635 Pulled By: ajkr fbshipit-source-id: 775c04176931cf01d07cc78fb82cfe3a11beebcf	2021-03-08 14:47:56 -08:00
Peter Dillinger	ce391ff84b	Clarifying comments for Read() APIs (#8029 ) Summary: I recently discovered the confusing, undocumented semantics of Read() functions in the FileSystem and Env APIs. I have added clarification to the best of my reverse-engineered understanding, and made a note in HISTORY.md for implementors to check their implementations, as a subtly non-adherent implementation could lead to RocksDB quietly ignoring some portion of a file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8029 Test Plan: no code changes Reviewed By: anand1976 Differential Revision: D26831698 Pulled By: pdillinger fbshipit-source-id: 208f97ff6037bc13bb2ef360b987c2640c79bd03	2021-03-05 14:42:19 -08:00
Levi Tamasi	cb25bc1128	Update compaction statistics to include the amount of data read from blob files (#8022 ) Summary: The patch does the following: 1) Exposes the amount of data (number of bytes) read from blob files from `BlobFileReader::GetBlob` / `Version::GetBlob`. 2) Tracks the total number and size of blobs read from blob files during a compaction (due to garbage collection or compaction filter usage) in `CompactionIterationStats` and propagates this data to `InternalStats::CompactionStats` / `CompactionJobStats`. 3) Updates the formulae for write amplification calculations to include the amount of data read from blob files. 4) Extends the compaction stats dump with a new column `Rblob(GB)` and a new line containing the total number and size of blob files in the current `Version` to complement the information about the shape and size of the LSM tree that's already there. 5) Updates `CompactionJobStats` so that the number of files and amount of data written by a compaction are broken down per file type (i.e. table/blob file). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8022 Test Plan: Ran `make check` and `db_bench`. Reviewed By: riversand963 Differential Revision: D26801199 Pulled By: ltamasi fbshipit-source-id: 28a5f072048a702643b28cb5971b4099acabbfb2	2021-03-04 00:43:48 -08:00
matthewvon	4126bdc0e1	Feature: add SetBufferSize() so that managed size can be dynamic (#7961 ) Summary: This PR adds SetBufferSize() to the WriteBufferManager object. This enables user code to adjust the global budget for write_buffers based upon other memory conditions such as growth in table reader memory as the dataset grows. The buffer_size_ member variable is now atomic to match the design of other changeable size_t members within WriteBufferManager. This change is useful as is. It is also essential if someone later decides to enable db_write_buffer_size modifications through the DB::SetOptions() API, so taking it as is involves no wasted work. Any format / spacing changes are due to clang-format as required by check-in automation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7961 Reviewed By: ajkr Differential Revision: D26639075 Pulled By: akankshamahajan15 fbshipit-source-id: 0604348caf092d35f44e85715331dc920e5c1033	2021-03-03 14:22:11 -08:00
Yanqin Jin	72d1e258cd	Possibly bump NUMBER_OF_RESEEKS_IN_ITERATION (#8015 ) Summary: When changing db iterator direction, we may perform a reseek. Therefore, we should bump the NUMBER_OF_RESEEKS_IN_ITERATION counter. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8015 Test Plan: make check Reviewed By: ltamasi Differential Revision: D26755415 Pulled By: riversand963 fbshipit-source-id: 211f51f1a454bcda768fc46c0dce51edeb7f05fe	2021-03-02 22:41:04 -08:00
Peter Dillinger	a9046f3c45	Revamp check_format_compatible.sh (#8012 ) Summary: * Adds backup/restore forward/backward compatibility testing * Adds forward/backward compatibility testing to sst ingestion * More structure sharing and comments for the lists of branches comprising each group * Less reliant on invariants between groups with de-duplication logic * Restructured for n+1 branch checkout+build steps rather than something like 3n. Should be much faster despite more checks, and the restructuring also makes manual runs easier * On success, restores working trees to original working branch (aborts early if uncommitted changes) and deletes temporary branch & remote * Adds SHORT_TEST=1 mode that uses only the oldest version for each * Adds USE_SSH=1 to use ssh instead of https for github group Pull Request resolved: https://github.com/facebook/rocksdb/pull/8012 Test Plan: a number of manual tests, mostly with SHORT_TEST=1. Using one version older for any of the groups (except I didn't check db_backward_only_refs) fails. Changing default format_version to 5 (planned) without updating this script fails as it should, and passes with appropriate update. Full local run passed (had to remove "2.7.fb.branch" due to compiler issues, also before this change). Reviewed By: riversand963 Differential Revision: D26735840 Pulled By: pdillinger fbshipit-source-id: 1320c22de5674760657e385aa42df9fade8b6fff	2021-03-02 11:42:27 -08:00
Levi Tamasi	a46f080cce	Break down the amount of data written during flushes/compactions per file type (#8013 ) Summary: The patch breaks down the "bytes written" (as well as the "number of output files") compaction statistics into two, so the values are logged separately for table files and blob files in the info log, and are shown in separate columns (`Write(GB)` for table files, `Wblob(GB)` for blob files) when the compaction statistics are dumped. This will also come in handy for fixing the write amplification statistics, which currently do not consider the amount of data read from blob files during compaction. (This will be fixed by an upcoming patch.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/8013 Test Plan: Ran `make check` and `db_bench`. Reviewed By: riversand963 Differential Revision: D26742156 Pulled By: ltamasi fbshipit-source-id: 31d18ee8f90438b438ca7ed1ea8cbd92114442d5	2021-03-02 09:48:00 -08:00
Akanksha Mahajan	f19612970d	Support retrieving checksums for blob files from the MANIFEST when checkpointing (#8003 ) Summary: The checkpointing logic supports passing file level checksums to the copy_file_cb callback function which is used by the backup code for detecting corruption during file copies. However, this is currently implemented only for table files. This PR extends the checksum retrieval to blob files as well. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8003 Test Plan: Added new unit tests Reviewed By: ltamasi Differential Revision: D26680701 Pulled By: akankshamahajan15 fbshipit-source-id: 1bd1e2464df6e9aa31091d35b8c72786d94cd1c5	2021-03-01 20:07:07 -08:00
Yanqin Jin	1f11d07f24	Enable compaction filter for blobs in db_stress and db_bench (#8011 ) Summary: As title. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8011 Test Plan: `./db_bench -enable_blob_files=1 -use_keep_filter=1 -disable_auto_compactions=1` and `./db_stress -enable_blob_files=1 -enable_compaction_filter=1 -acquire_snapshot_one_in=0 -compact_range_one_in=0 -iterpercent=0 -test_batches_snapshots=0 -readpercent=10 -prefixpercent=20 -writepercent=55 -delpercent=15 -continuous_verification_interval=0` Reviewed By: ltamasi Differential Revision: D26736061 Pulled By: riversand963 fbshipit-source-id: 1c7834903c28431ce23324c4f259ed71255614e2	2021-03-01 17:24:47 -08:00
Yanqin Jin	9fdc9fbeea	Still use SystemClock* instead of shared_ptr in StepPerfTimer (#8006 ) Summary: This is likely a temp fix before we figure out a better way. PerfStepTimer is used intensively in certain benchmarks and tests. https://github.com/facebook/rocksdb/issues/7858 stores a `shared_ptr` to the system clock in PerfStepTimer, which is constructed each time a `PerfStepTimer` object is created. The atomic operations in `shared_ptr` may add overhead in CPU cycles. Therefore, we change it back to a raw `SystemClock*` for now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8006 Test Plan: make check Reviewed By: pdillinger Differential Revision: D26703560 Pulled By: riversand963 fbshipit-source-id: 519d0769b28da2334bea7d86c848fcc26ee8a17f	2021-02-26 20:57:18 -08:00
Peter Dillinger	a8b3b9a20c	Refine Ribbon configuration, improve testing, add Homogeneous (#7879 ) Summary: This change only affects non-schema-critical aspects of the production candidate Ribbon filter. Specifically, it refines choice of internal configuration parameters based on inputs. The changes are minor enough that the schema tests in bloom_test, some of which depend on this, are unaffected. There are also some minor optimizations and refactorings. This would be a schema change for "smash" Ribbon, to fix some known issues with small filters, but "smash" Ribbon is not accessible in public APIs. Unit test CompactnessAndBacktrackAndFpRate updated to test small and medium-large filters. Run with --thoroughness=100 or so for much better detection power (not appropriate for continuous regression testing). Homogenous Ribbon: This change adds internally a Ribbon filter variant we call Homogeneous Ribbon, in collaboration with Stefan Walzer. The expected "result" value for every key is zero, instead of computed from a hash. Entropy for queries not to be false positives comes from free variables ("overhead") in the solution structure, which are populated pseudorandomly. Construction is slightly faster for not tracking result values, and never fails. Instead, FP rate can jump up whenever and whereever entries are packed too tightly. For small structures, we can choose overhead to make this FP rate jump unlikely, as seen in updated unit test CompactnessAndBacktrackAndFpRate. Unlike standard Ribbon, Homogeneous Ribbon seems to scale to arbitrary number of keys when accepting an FP rate penalty for small pockets of high FP rate in the structure. For example, 64-bit ribbon with 8 solution columns and 10% allocated space overhead for slots seems to achieve about 10.5% space overhead vs. information-theoretic minimum based on its observed FP rate with expected pockets of degradation. (FP rate is close to 1/256.) 
If targeting a higher FP rate with fewer solution columns, Homogeneous Ribbon can be even more space efficient, because the penalty from degradation is relatively smaller. If targeting a lower FP rate, Homogeneous Ribbon is less space efficient, as more allocated overhead is needed to keep the FP rate impact of degradation relatively under control. The new OptimizeHomogAtScale tool in ribbon_test helps to find these optimal allocation overheads for different numbers of solution columns and Ribbon widths, with 128-bit Ribbon apparently cutting space overheads in half vs. 64-bit. Other misc item specifics: * Ribbon APIs in util/ribbon_config.h now provide configuration data not just for a 5% construction failure rate (95% success), but also for 50% and 0.1%. * Note that the Ribbon structure does not exhibit "threshold" behavior as the standard Xor filter does, so there is a roughly fixed space penalty to cut the construction failure rate in half. Thus, there isn't really an "almost sure" setting. * Although we can extrapolate settings for large filters, we don't have a good formula for configuring smaller filters (< 2^17 slots or so), and efforts to summarize them with a formula have failed. Thus, small data is hard-coded from the updated FindOccupancy tool. * Enhances ApproximateNumEntries for public API Ribbon using more precise data (new API GetNumToAdd), yielding a more accurate but not perfect reversal of CalculateSpace (bloom_test updated to expect the greater precision).
* Move EndianSwapValue from coding.h to coding_lean.h to keep Ribbon code easily transferable from RocksDB * Add some missing 'const' to member functions * Small optimization to 128-bit BitParity * Small refactoring of BandingStorage in ribbon_alg.h to support Homogeneous Ribbon * CompactnessAndBacktrackAndFpRate now has an "expand" test: on construction failure, a possible alternative to re-seeding hash functions is simply to increase the number of slots (allocated space overhead) and try again with essentially the same hash values. (Start locations will be different roundings of the same scaled hash values, because fastrange is used rather than mod.) This seems to be as effective as, or more effective than, re-seeding, as long as we increase the number of slots (m) by roughly m += m/w, where w is the Ribbon width. This way, there is effectively an expansion by one slot for each ribbon-width window in the banding. (This approach assumes that getting "bad data" from your hash function is as unlikely as it naturally should be, e.g. no adversary.) * 32-bit and 16-bit Ribbon configurations are added to ribbon_test for understanding their behavior, e.g. with FindOccupancy. They are not considered useful at this time and are not tested with CompactnessAndBacktrackAndFpRate. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7879 Test Plan: unit test updates included Reviewed By: jay-zhuang Differential Revision: D26371245 Pulled By: pdillinger fbshipit-source-id: da6600d90a3785b99ad17a88b2a3027710b4ea3a	2021-02-26 08:50:42 -08:00
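The "expand" retry described in #7879 can be illustrated with a small sketch. This is not RocksDB's Ribbon code: FastRange64, NumStarts, StartLocation and ExpandSlots are hypothetical names, and the sketch assumes a width-w band may start at any of m - w + 1 slot positions and a compiler that supports `unsigned __int128` (GCC/Clang). It shows why the same hash lands at a different rounding of the same scaled position after m grows by m/w.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch, not RocksDB code. Maps a 64-bit hash onto
// [0, range) with the "fastrange" multiply-and-shift trick instead of
// hash % range. unsigned __int128 is a GCC/Clang extension.
inline uint64_t FastRange64(uint64_t hash, uint64_t range) {
  return static_cast<uint64_t>(
      (static_cast<unsigned __int128>(hash) * range) >> 64);
}

// Assumption: a width-w band can start at any of m - w + 1 slot positions.
inline uint64_t NumStarts(uint64_t num_slots, uint64_t ribbon_width) {
  assert(num_slots >= ribbon_width);
  return num_slots - ribbon_width + 1;
}

// Start location of a key's band, derived from its (unchanged) hash.
inline uint64_t StartLocation(uint64_t hash, uint64_t num_slots,
                              uint64_t ribbon_width) {
  return FastRange64(hash, NumStarts(num_slots, ribbon_width));
}

// On construction failure, grow the slot count by about one slot per
// ribbon-width window (m += m / w) and retry with the same hashes.
inline uint64_t ExpandSlots(uint64_t num_slots, uint64_t ribbon_width) {
  return num_slots + num_slots / ribbon_width;
}
```

For example, with m = 6400 and w = 64, ExpandSlots returns 6500, and a hash of 2^63 maps to start 3168 before expansion and 3218 after it: roughly the same relative position, rounded differently, which is what lets the retry reuse the same hash values.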
Yanqin Jin	c370d8aa12	Remove unused/incorrect fwd declaration (#8002 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8002 Reviewed By: anand1976 Differential Revision: D26659354 Pulled By: riversand963 fbshipit-source-id: 6b464dbea9fd8240ead8cc5af393f0b78e8f9dd1	2021-02-25 23:07:31 -08:00
Yanqin Jin	cef4a6c49f	Compaction filter support for (new) BlobDB (#7974 ) Summary: Allow applications to implement a custom compaction filter and pass it to BlobDB. The compaction filter's custom logic can operate on blobs. To do so, the application needs to subclass the `CompactionFilter` abstract class and implement the `FilterV2()` method. Optionally, a method called `ShouldFilterBlobByKey()` can be implemented if the application's custom logic relies solely on the key to make a decision without reading the blob, thus saving extra IO. Examples can be found in db/blob/db_blob_compaction_test.cc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7974 Test Plan: make check Reviewed By: ltamasi Differential Revision: D26509280 Pulled By: riversand963 fbshipit-source-id: 59f9ae5614c4359de32f4f2b16684193cc537b39	2021-02-25 16:32:35 -08:00
Akanksha Mahajan	2772eb7735	Update History.md for VerifyFileChecksums API supporting blob file (#7995 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7995 Reviewed By: ltamasi Differential Revision: D26625766 Pulled By: akankshamahajan15 fbshipit-source-id: d83c9e77695f4193da979b1ce7103b43bc1dd46c	2021-02-24 10:25:03 -08:00
xinyuliu	b085ee13e0	Append all characters not captured by xsputn() in overflow() function (#7991 ) Summary: In the adapter class `WritableFileStringStreamAdapter`, which wraps WritableFile so it can be used as a std::ostream, `std::endl` was previously the only special case, because `endl` is written by `os.put()` directly without going through `xsputn()`. `os.put()` calls `sputc()`, and the internal implementation of `sputc()` is:
```
int_type __CLR_OR_THIS_CALL sputc(_Elem _Ch) { // put a character
    return 0 < _Pnavail() ? _Traits::to_int_type(*_Pninc() = _Ch) : overflow(_Traits::to_int_type(_Ch));
}
```
As we explicitly disabled buffering, _Pnavail() is always 0. Thus every write not captured by xsputn becomes an overflow. When I ran tests on Windows, I found that `std::endl` is not the only thing that drops into this case: writing an unsigned long long also calls `os.put()`, followed by `sputc()`, and eventually `overflow()`. Therefore, instead of only checking for `std::endl`, we should try to append other characters as well, unless the append operation fails. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7991 Reviewed By: jay-zhuang Differential Revision: D26615692 Pulled By: ajkr fbshipit-source-id: 4c0003de1645b9531545b23df69b000e07014468	2021-02-23 21:44:48 -08:00
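The fix in #7991 can be illustrated with a self-contained sketch. It is not the RocksDB adapter: StringSinkStreamBuf is a hypothetical class that appends to a std::string instead of a WritableFile. Like the adapter, it has no put area, so characters written through `sputc()` (for example by `std::endl` or numeric formatting) land in `overflow()`, which here appends any character rather than special-casing '\n'.

```cpp
#include <cassert>
#include <ostream>
#include <streambuf>
#include <string>

// Hypothetical stand-in for an unbuffered stream adapter: a streambuf
// with no put area, so every character reaches xsputn() or overflow().
class StringSinkStreamBuf : public std::streambuf {
 public:
  explicit StringSinkStreamBuf(std::string* sink) : sink_(sink) {
    assert(sink_ != nullptr);
    setp(nullptr, nullptr);  // buffering disabled: no put area
  }

 protected:
  // Bulk writes (e.g. os << "text") arrive here.
  std::streamsize xsputn(const char* s, std::streamsize n) override {
    sink_->append(s, static_cast<std::string::size_type>(n));
    return n;
  }

  // Single characters written via sputc() arrive here because the put
  // area is always empty. Append every character, not only '\n'.
  int_type overflow(int_type ch) override {
    if (traits_type::eq_int_type(ch, traits_type::eof())) {
      return traits_type::not_eof(ch);  // nothing to write; report success
    }
    sink_->push_back(traits_type::to_char_type(ch));
    return ch;
  }

 private:
  std::string* sink_;
};
```

Wrapping it as `std::ostream os(&buf);` and writing `os << "n=" << 12345ULL << std::endl;` yields "n=12345\n": the literal goes through `xsputn()`, while the digits and the newline go through `overflow()`, so an `overflow()` that kept only '\n' would drop the number.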
Akanksha Mahajan	cd79a00903	Make BlockBasedTable::kMaxAutoReadAheadSize configurable (#7951 ) Summary: RocksDB does auto-readahead for iterators on noticing more than two reads for a table file. The readahead starts at 8KB and doubles on every additional read, up to BlockBasedTable::kMaxAutoReadAheadSize, which is 256*1024 bytes (256KB). This PR adds a new option, BlockBasedTableOptions::max_auto_readahead_size, which replaces BlockBasedTable::kMaxAutoReadAheadSize and can be configured. If max_auto_readahead_size is set to 0, no implicit auto prefetching will be done. If max_auto_readahead_size is less than 8KB (the initial readahead size RocksDB uses for auto-readahead), the readahead size will stay equal to max_auto_readahead_size. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7951 Test Plan: Add new unit test case. Reviewed By: anand1976 Differential Revision: D26568085 Pulled By: akankshamahajan15 fbshipit-source-id: b6543520fc74e97d859f2002328d4c5254d417af	2021-02-23 16:54:08 -08:00
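The sizing policy described in #7951 can be modeled with a short sketch. This is not RocksDB's prefetch code: AutoReadaheadSizes and kInitAutoReadaheadSize are hypothetical names, and the sketch only models how the readahead size grows once auto-readahead is active. It does not model the "more than two reads" trigger.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical constant mirroring the 8KB initial auto-readahead size
// described in the commit message.
constexpr std::size_t kInitAutoReadaheadSize = 8 * 1024;

// Returns the readahead sizes that successive prefetches would use under
// the "start at 8KB, double up to the cap" policy. A cap of 0 disables
// implicit prefetching; a cap below 8KB is used as-is for every prefetch.
std::vector<std::size_t> AutoReadaheadSizes(std::size_t max_auto_readahead_size,
                                            int num_prefetches) {
  assert(num_prefetches >= 0);
  std::vector<std::size_t> sizes;
  if (max_auto_readahead_size == 0) {
    return sizes;  // no implicit auto prefetching
  }
  std::size_t size = std::min(kInitAutoReadaheadSize, max_auto_readahead_size);
  for (int i = 0; i < num_prefetches; ++i) {
    sizes.push_back(size);
    size = std::min(size * 2, max_auto_readahead_size);  // double, capped
  }
  return sizes;
}
```

With the old fixed cap of 256*1024, seven prefetches would use 8KB, 16KB, 32KB, 64KB, 128KB, 256KB and then stay at 256KB; with a 4KB cap every prefetch stays at 4KB.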
sherriiiliu	e017af15c1	Fix testcase failures on windows (#7992 ) Summary: Fixed 5 test case failures found on Windows 10/Windows Server 2016: 1. In `flush_job_test`, the DestroyDir function fails in the destructor because some file handles are still held by VersionSet. This happens on Windows Server 2016, so the versions_ pointer needs to be reset manually to release all file handles. 2. In the `StatsHistoryTest.InMemoryStatsHistoryPurging` test, with the latest changes the capped memory cost of stats_history_size on Windows becomes 14000 bytes, not 13000 bytes. 3. In the `SSTDumpToolTest.RawOutput` test, the output file handle was not closed at the end. 4. In the `FullBloomTest.OptimizeForMemory` test, ROCKSDB_MALLOC_USABLE_SIZE is undefined on Windows, so `total_mem` is always equal to `total_size`, and the internal memory fragmentation assertion does not apply. 5. In the `BlockFetcherTest.FetchAndUncompressCompressedDataBlock` test, XPRESS cannot reach an 87.5% compression ratio with the original CreateTable method, so I append extra zeros to the string value to improve the compression ratio. Besides, since XPRESS allocates memory internally and thus does not support custom allocator verification, we skip the allocator verification for XPRESS. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7992 Reviewed By: jay-zhuang Differential Revision: D26615283 Pulled By: ajkr fbshipit-source-id: 3632612f84b99e2b9c77c403b112b6bedf3b125d	2021-02-23 14:35:06 -08:00
sherriiiliu	75c6ffb9de	Always expose WITH_GFLAGS option to user (#7990 ) Summary: The WITH_GFLAGS option does not work on MSVC. I checked the usage of [CMAKE_DEPENDENT_OPTION](https://cmake.org/cmake/help/latest/module/CMakeDependentOption.html). It says that if the `depends` condition is not true, it will set the `option` to the value given by `force` and hide the option from the user. Therefore, `CMAKE_DEPENDENT_OPTION(WITH_GFLAGS "build with GFlags" ON "NOT MSVC;NOT MINGW" OFF)` will hide the WITH_GFLAGS option from the user when running on MSVC or MINGW and always set WITH_GFLAGS to OFF. To expose the WITH_GFLAGS option to the user, I removed CMAKE_DEPENDENT_OPTION and split the logic into if-else statements. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7990 Reviewed By: jay-zhuang Differential Revision: D26615755 Pulled By: ajkr fbshipit-source-id: 33ca39a73423d9516510c15aaf9efb5c4072cdf9	2021-02-23 14:31:27 -08:00
sherriiiliu	f91fd0c944	Extract test cases correctly in run_ci_db_test.ps1 script (#7989 ) Summary: Extract test cases correctly in the run_ci_db_test.ps1 script. There are some new test groups that end with # comments. Previously, when the script tried to extract test groups and test cases, the regex rule did not handle this case, so the concatenation of some test groups and test cases failed; see examples in comments. Also removed useless trailing whitespace in the script. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7989 Reviewed By: jay-zhuang Differential Revision: D26615909 Pulled By: ajkr fbshipit-source-id: 8e68d599994f17d6fefde0daa925c3018179521a	2021-02-23 14:25:42 -08:00
Akanksha Mahajan	46cf5fbfdd	Extend VerifyFileChecksums API for blob files (#7979 ) Summary: Extend VerifyFileChecksums API to verify blob files in case of use_file_checksum. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7979 Test Plan: New unit test db_blob_corruption_test Reviewed By: ltamasi Differential Revision: D26534040 Pulled By: akankshamahajan15 fbshipit-source-id: 7dc5951a3df9d265ea1265e0122b43c966856ade	2021-02-22 22:09:22 -08:00
Andrew Kryczka	daca92c17a	Pick samples for compression dictionary using prime number (#7987 ) Summary: The sample selection technique taken in https://github.com/facebook/rocksdb/issues/7970 was problematic because it had two code paths for sample selection depending on the number of data blocks, and one of those code paths involved an allocation. Using prime numbers, we can consolidate into one code path without allocation. The downside is that there will be values of N (number of data blocks buffered) that suffer from poor spread in the selected samples. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7987 Test Plan: `make check -j48` Reviewed By: pdillinger Differential Revision: D26586147 Pulled By: ajkr fbshipit-source-id: 62028e54336fadb6e2c7a7fe6747daa05a263d32	2021-02-22 17:43:03 -08:00
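One way to pick samples with a prime number and no allocation is a prime-stride walk modulo N. The sketch below is only a hypothetical illustration of that general technique and of the poor-spread trade-off mentioned in #7987; it is not RocksDB's actual selection code, and both SampleIndex and the prime 10007 are assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stride; the prime RocksDB actually uses is not stated here.
constexpr uint64_t kSampleStridePrime = 10007;

// Index of the i-th sample among n buffered data blocks, computed on the
// fly by stepping with a prime stride modulo n, so no index array is
// allocated. When n is not a multiple of the prime, gcd(prime, n) == 1 and
// the first n samples visit every block exactly once; when n is a multiple
// of the prime, the samples collapse onto a few blocks (poor spread).
inline uint64_t SampleIndex(uint64_t i, uint64_t n) {
  assert(n > 0);
  return (i * kSampleStridePrime) % n;
}
```

For n = 10 the first samples are blocks 0, 7, 4, ... (a permutation of all ten), while for n = 20014 = 2 * 10007 the walk alternates between blocks 0 and 10007, which is the kind of bad N the commit warns about.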
mrambacher	59d91796d2	Attempt to speed up tests by adding test to "slow" tests (#7973 ) Summary: I noticed tests frequently timing out on CircleCI when I submit a PR. I did some investigation and found that the SeqAdvanceConcurrentTest suite (OneWriteQueue, TwoWriteQueues) tests were all taking a long time to complete (30 tests, each taking at least 15K ms). This PR adds those tests to the "slow reg" list in order to move them earlier in the execution sequence so that they are not the "long tail". For completeness, other tests that were also slow are: * NumLevels/DBTestUniversalCompaction.UniversalCompactionTrivialMoveTest: 12 tests, all taking 12K+ ms * ReadSequentialFileTest with ReadaheadSize: 8 tests, all 12K+ ms * WriteUnpreparedTransactionTest.RecoveryTest: 2 tests at 22K+ ms * DBBasicTest.EmptyFlush: 1 test at 35K+ ms * RateLimiterTest.Rate: 1 test at 23K+ ms * BackupableDBTest.ShareTableFilesWithChecksumsTransition: 1 test at 16K+ ms * MultiThreadedDBTest.MultiThreaded: 78 tests at 10K+ ms * TransactionStressTest.DeadlockStress: 7 tests at 11K+ ms * DBBasicTestDeadline.IteratorDeadline: 3 tests at 10K+ ms. No effort was made to determine why the tests were slow. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7973 Reviewed By: jay-zhuang Differential Revision: D26519130 Pulled By: mrambacher fbshipit-source-id: 11555c9115acc207e45e210a7fc7f879170a3853	2021-02-22 05:27:51 -08:00
Akanksha Mahajan	6790a983eb	Fix for ASSERT_STATUS_CHECKED test failure (#7985 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7985 Test Plan: CircleCI ASSERT_STATUS_CHECKED test Reviewed By: jay-zhuang Differential Revision: D26568446 Pulled By: akankshamahajan15 fbshipit-source-id: bd0ab41f485942e313d82ce3895ce53e0967ba98	2021-02-20 19:13:55 -08:00

9835 Commits