rocksdb

Author	SHA1	Message	Date
Zhichao Cao	efaef9b40a	cleanup error_handler related code (#9098 ) Summary: Remove code not in use, add comments, remove redundant code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9098 Test Plan: make check Reviewed By: anand1976 Differential Revision: D32027219 Pulled By: zhichao-cao fbshipit-source-id: 253aae926c87726268af6c027bf805dc9156c8a8	2021-11-08 15:49:17 -08:00
anand76	dddb791c18	Enable a few unit tests to use custom Env objects (#9087 ) Summary: Allow compaction_job_test, db_io_failure_test, dbformat_test, deletefile_test, and fault_injection_test to use a custom Env object. Also move ```RegisterCustomObjects``` declaration to a header file to simplify things. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9087 Test Plan: Run manually using "buck test rocksdb/src:compaction_job_test_fbcode" etc. Reviewed By: riversand963 Differential Revision: D32007222 Pulled By: anand1976 fbshipit-source-id: 99af58559e25bf61563dfa95dc46e31fa7375792	2021-11-08 11:05:59 -08:00
Jay Zhuang	5aad38f262	Deflake DBBasicTestWithTimestampCompressionSettings.PutAndGetWithComp… (#9136 ) Summary: …action ``` db/db_with_timestamp_basic_test.cc:2643: Failure db_->CompactFiles(compact_opt, handles_[cf], collector->GetFlushedFiles(), static_cast<int>(kNumTimestamps - i)) Invalid argument: A compaction must contain at least one file. ``` Able to be reproduced by run multiple test in parallel. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9136 Test Plan: ``` gtest-parallel ./db_with_timestamp_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampCompressionSettings.PutAndGetWithCompaction/12 -r 100 -w 100 ``` Reviewed By: riversand963 Differential Revision: D32197734 Pulled By: jay-zhuang fbshipit-source-id: aeb0d6e9b37312f577e203ca81bb7a0f14d4e7ce	2021-11-05 17:57:50 -07:00
Levi Tamasi	3fbddb1d27	Refactor and unify blob file saving and the logic that finds the oldest live blob file (#9122 ) Summary: The patch refactors and unifies the logic in `VersionBuilder::SaveBlobFilesTo` and `VersionBuilder::GetMinOldestBlobFileNumber` by introducing a generic helper that can "merge" the list of `BlobFileMetaData` in the base version with the list of `MutableBlobFileMetaData` representing the updated state after applying a sequence of `VersionEdit`s. This serves as groundwork for subsequent changes that will enable us to determine whether a blob file is live after applying a sequence of edits without calling `VersionBuilder::SaveTo`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9122 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D32151472 Pulled By: ltamasi fbshipit-source-id: 11622b475866de823334b8bc21b0e99d913af97e	2021-11-05 16:50:52 -07:00
Yanqin Jin	5237b39d2e	Fix assertion error during compaction with write-prepared txn enabled (#9105 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9105 The user contract of SingleDelete is that: a SingleDelete can only be issued to a key that exists and has NOT been updated. For example, application can insert one key `key`, and uses a SingleDelete to delete it in the future. The `key` cannot be updated or removed using Delete. In reality, especially when write-prepared transaction is being used, things can get tricky. For example, a prepared transaction already writes `key` to the memtable after a successful Prepare(). Afterwards, should the transaction rollback, it will insert a Delete into the memtable to cancel out the prior Put. Consider the following sequence of operations. ``` // operation sequence 1 Begin txn Put(key) Prepare() Flush() Rollback txn Flush() ``` There will be two SSTs resulting from above. One of the contains a PUT, while the second one contains a Delete. It is also known that releasing a snapshot can lead to an L0 containing only a SD for a particular key. Consider the following operations following the above block. ``` // operation sequence 2 db->Put(key) db->SingleDelete(key) Flush() ``` The operation sequence 2 can result in an L0 with only the SD. Should there be a snapshot for conflict checking created before operation sequence 1, then an attempt to compact the db may hit the assertion failure below, because ikey_.type is Delete (from a rollback). ``` else if (clear_and_output_next_key_) { assert(ikey_.type == kTypeValue \|\| ikey_.type == kTypeBlobIndex); } ``` To fix the assertion failure, we can skip the SingleDelete if we detect an earlier Delete in the same snapshot interval. Reviewed By: ltamasi Differential Revision: D32056848 fbshipit-source-id: 23620a91e28562d91c45cf7e95f414b54b729748	2021-11-05 15:29:18 -07:00
Yanqin Jin	9b53f14a35	Fixed a bug in CompactionIterator when write-preared transaction is used (#9060 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9060 RocksDB bottommost level compaction may zero out an internal key's sequence if the key's sequence is in the earliest_snapshot. In write-prepared transaction, checking the visibility of a certain sequence in a specific released snapshot may return a "snapshot released" result. Therefore, it is possible, after a certain sequence of events, a PUT has its sequence zeroed out, but a subsequent SingleDelete of the same key will still be output with its original sequence. This violates the ascending order of keys and leads to incorrect result. The solution is to use an extra variable `last_key_seq_zeroed_` to track the information about visibility in earliest snapshot. With this variable, we can know for sure that a SingleDelete is in the earliest snapshot even if the said snapshot is released during compaction before processing the SD. Reviewed By: ltamasi Differential Revision: D31813016 fbshipit-source-id: d8cff59d6f34e0bdf282614034aaea99be9174e1	2021-11-03 15:55:00 -07:00
Jay Zhuang	29102641dd	Skip directory fsync for filesystem btrfs (#8903 ) Summary: Directory fsync might be expensive on btrfs and it may not be needed. Here are 4 directory fsync cases: 1. creating a new file: dir-fsync is not needed on btrfs, as long as the new file itself is synced. 2. renaming a file: dir-fsync is not needed if the renamed file is synced. So an API `FsyncAfterFileRename(filename, ...)` is provided to sync the file on btrfs. By default, it just calls dir-fsync. 3. deleting files: dir-fsync is forced by set `IOOptions.force_dir_fsync = true` 4. renaming multiple files (like backup and checkpoint): dir-fsync is forced, the same as above. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8903 Test Plan: run tests on btrfs and non btrfs Reviewed By: ajkr Differential Revision: D30885059 Pulled By: jay-zhuang fbshipit-source-id: dd2730b31580b0bcaedffc318a762d7dbf25de4a	2021-11-03 12:21:27 -07:00
Levi Tamasi	081722780b	Refactor the detailed consistency checks and the SST saving logic in VersionBuilder (#9099 ) Summary: The patch refactors the parts of `VersionBuilder` that deal with SST file comparisons. Specifically, it makes the following changes: * Turns `NewestFirstBySeqNo` and `BySmallestKey` from free-standing functions into function objects. Note: `BySmallestKey` has a pointer to the `InternalKeyComparator`, while `NewestFirstBySeqNo` is completely stateless. * Eliminates the wrapper `FileComparator`, which was essentially an unnecessary DIY virtual function call mechanism. * Refactors `CheckConsistencyDetails` and `SaveSSTFilesTo` using helper function templates that take comparator/checker function objects. Using static polymorphism eliminates the need to make runtime decisions about which comparator to use. * Extends some error messages returned by the consistency checks and makes them more uniform. * Removes some incomplete/redundant consistency checks from `VersionBuilder` and `FilePicker`. * Improves const correctness in several places. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9099 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D32027503 Pulled By: ltamasi fbshipit-source-id: 621326ae41f4f55f7ad6a91abbd6e666d5c7857c	2021-11-03 11:52:47 -07:00
Peter Dillinger	2b60621f16	Don't call OnTableFileCreated with OK for empty+deleted file (#9118 ) Summary: EventListener::OnTableFileCreated was previously called with OK status and file_size==0 in cases of no SST file contents written (because there was no content to add) and the empty file deleted before calling the listener. This could lead to a stress test assertion failure added in https://github.com/facebook/rocksdb/issues/9054. This changes the status to Aborted, to align with the API doc: "... if the file is successfully created. Now it will also be called on failure case. User can check info.status to see if it succeeded or not." For internal purposes, this case is considered "success" but for listener purposes, no SST file is (successfully) created. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9118 Test Plan: test case added + existing db_stress Reviewed By: ajkr, riversand963 Differential Revision: D32120232 Pulled By: pdillinger fbshipit-source-id: a804e2e0a52598018d3b182da97804d402ffcdfa	2021-11-03 08:43:27 -07:00
Peter Dillinger	21f8a57f2a	Fix TSAN report on MemPurge test (#9115 ) Summary: TSAN reported data race on count variables in MemPurgeBasic test. This suggests the test could fail if mempurges were slow enough that they don't complete before the count variables being checked, but injecting a long sleep into MemPurge (outside DB mutex) confirms that blocked writes ensure enough mempurges/flushes happen to make the test pass. All the possible different values on testing should be OK to make the test pass. So this change makes the variables atomic so that up-to-date value is always read and TSAN report suppressed. I have also used `.exchange(0)` to make the checking less stateful by "popping off" all the accumulated counts. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9115 Test Plan: updated test, watch for any flakiness Reviewed By: riversand963 Differential Revision: D32114432 Pulled By: pdillinger fbshipit-source-id: c985609d39896a0d8f69ebc87b221e688609bdd8	2021-11-02 21:54:29 -07:00
mrambacher	f72c834eab	Make FileSystem a Customizable Class (#8649 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8649 Reviewed By: zhichao-cao Differential Revision: D32036059 Pulled By: mrambacher fbshipit-source-id: 4f1e7557ecac52eb849b83ae02b8d7d232112295	2021-11-02 09:07:11 -07:00
sdong	a2b9be42b6	Try to start TTL earlier with kMinOverlappingRatio is used (#8749 ) Summary: Right now, when options.ttl is set, compactions are triggered around the time when TTL is reached. This might cause extra compactions which are often bursty. This commit tries to mitigate it by picking those files earlier in normal compaction picking process. This is only implemented using kMinOverlappingRatio with Leveled compaction as it is the default value and it is more complicated to change other styles. When a file is aged more than ttl/2, RocksDB starts to boost the compaction priority of files in normal compaction picking process, and hope by the time TTL is reached, very few extra compaction is needed. In order for this to work, another change is made: during a compaction, if an output level file is older than ttl/2, cut output files based on original boundary (if it is not in the last level). This is to make sure that after an old file is moved to the next level, and new data is merged from the upper level, the new data falling into this range isn't reset with old timestamp. Without this change, in many cases, most files from one level will keep having old timestamp, even if they have newer data and we stuck in it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8749 Test Plan: Add a unit test to test the boosting logic. Will add a unit test to test it end-to-end. Reviewed By: jay-zhuang Differential Revision: D30735261 fbshipit-source-id: 503c2d89250b22911eb99e72b379be154de3428e	2021-11-01 14:36:31 -07:00
leipeng	230c98f3ce	fix histogram NUM_FILES_IN_SINGLE_COMPACTION (#9026 ) Summary: currently histogram `NUM_FILES_IN_SINGLE_COMPACTION` just counted files in first level of compaction input, this fix counts files in all levels of compaction input. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9026 Reviewed By: ajkr Differential Revision: D31668241 Pulled By: jay-zhuang fbshipit-source-id: c02f6c4a5df9fbf0b7510036594811152e8738af	2021-11-01 12:57:27 -07:00
Levi Tamasi	b1c27a52d2	Add a consistency check that prevents the overflow of garbage in blob files (#9100 ) Summary: The number or total size of garbage blobs in any given blob file can never exceed the number or total size of all blobs in the file. (This would be a similar error to e.g. attempting to remove from the LSM tree an SST file that has already been removed.) The patch builds on https://github.com/facebook/rocksdb/issues/9085 and adds a consistency check to `VersionBuilder` that prevents the above from happening. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9100 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D32048982 Pulled By: ltamasi fbshipit-source-id: 6f7e0793bf534ad04c3359cc0f696b8e4e5ef81c	2021-11-01 12:32:14 -07:00
leipeng	2b70224f82	remove bad extra RecordTick(stats_, WRITE_WITH_WAL) (#9064 ) Summary: This PR fix wrong ticker `WRITE_WITH_WAL`. `RecordTick(WRITE_WITH_WAL)` will be called later in `WriteToWAL` and `ConcurrentWriteToWAL`. Fixes: 1. Delete these two extra `RecordTick(WRITE_WITH_WAL)` 2. Fix corresponding test case Pull Request resolved: https://github.com/facebook/rocksdb/pull/9064 Reviewed By: ajkr Differential Revision: D31944459 Pulled By: riversand963 fbshipit-source-id: f1aa8d2a4320456bc357bc5b0902032f7dcad086	2021-11-01 11:43:14 -07:00
leipeng	01bd86ad35	InternalStats::DumpCFMapStat: fix sum.w_amp (#9065 ) Summary: sum `w_amp` will be a very large number`(bytes_written + bytes_written_blob)` when there is no any flush and ingest. This PR set sum `w_amp` to zero if there is no any flush and ingest, this is conform to per-level `w_amp` computation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9065 Reviewed By: ajkr Differential Revision: D31943994 Pulled By: riversand963 fbshipit-source-id: acbef5e331debebfad09e0e0d8d0885ebbc00609	2021-10-31 23:11:43 -07:00
Yanqin Jin	8e59a1dc9a	Attempt to deflake ListenerTest.MultiCF (#9084 ) Summary: EventListenerTest.MultiCF uses TestFlushListener which has members flushed_dbs_ and flushed_column_family_names_ that are not protected by locks. This implicitly indicates that we need to ensure the methods accessing these data structures in a single threaded way. In other tests, e.g. MultiDBMultiListeners, we use TEST_WaitForFlushMemtable() to wait until all memtables of a given column family are flushed, hence no pending flush threads will concurrently call OnFlushCompleted() and cause data race for flushed_dbs_. To fix a test failure, we should do the same for MultiCF. Example data race stack traces reported by TSAN ``` Read of size 8 at 0x7b6000002840 by main thread: #0 std::vector<rocksdb::DB, std::allocator<rocksdb::DB> >::size() const /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/stl_vector.h:655:40 https://github.com/facebook/rocksdb/issues/1 rocksdb::EventListenerTest_MultiCF_Test::TestBody() /home/circleci/project/db/listener_test.cc:380:7 Previous write of size 8 at 0x7b6000002840 by thread T2: #0 void std::vector<rocksdb::DB, std::allocator<rocksdb::DB> >::_M_emplace_back_aux<rocksdb::DB* const&>(rocksdb::DB* const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/vector.tcc:442:26 https://github.com/facebook/rocksdb/issues/1 std::vector<rocksdb::DB, std::allocator<rocksdb::DB> >::push_back(rocksdb::DB* const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/stl_vector.h:923:4 https://github.com/facebook/rocksdb/issues/2 rocksdb::TestFlushListener::OnFlushCompleted(rocksdb::DB*, rocksdb::FlushJobInfo const&) /home/circleci/project/db/listener_test.cc:255:18 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/9084 Test Plan: ./listener_test --gtest_filter=EventListenerTest.MultiCF Reviewed By: jay-zhuang Differential Revision: D31952259 Pulled By: riversand963 fbshipit-source-id: 94a7f29e4e9466ead42418944eb2247fc32bd499	2021-10-31 22:12:15 -07:00
Yanqin Jin	8f4f302316	Attempt to deflake DBFlushTest.FireOnFlushCompletedAfterCommittedResult (#9083 ) Summary: DBFlushTest.FireOnFlushCompletedAfterCommittedResult uses test sync points to coordinate interleaving of different threads. Before this PR, the test writes some data to memtable, triggers a manual flush, and triggers a second manual flush after a first bg flush thread starts executing. Though unlikely, it is possible for the second bg flush thread to run faster than the first bg flush thread and deques flush queue first. In this case, the original test will fail. The fix is to wait until the first bg flush thread deques the flush queue before triggering second manual flush. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9083 Test Plan: ./db_flush_test --gtest_filter=DBFlushTest.FireOnFlushCompletedAfterCommittedResult Reviewed By: jay-zhuang Differential Revision: D31951239 Pulled By: riversand963 fbshipit-source-id: f32d7cdabe6ad6808fd18e54e663936dc0a9edb4	2021-10-31 22:08:48 -07:00
Levi Tamasi	44d04582cb	Aggregate blob file related changes in VersionBuilder as VersionEdits are applied (#9085 ) Summary: The current VersionBuilder code on mainline keeps track of blob file related changes ("delta") induced by a series of `VersionEdit`s in the form of `BlobFileMetaDataDelta` objects. Specifically, `BlobFileMetaDataDelta` contains the amount of additional garbage generated by compactions, as well as the set of newly linked/unlinked SSTs. This is very handy for detecting trivial moves, since in that case the newly linked and unlinked SSTs cancel each other out. However, this representation does not allow us to easily tell whether a certain blob file is obsolete after applying a set of `VersionEdit`s or not. In order to solve this issue, the patch introduces `MutableBlobFileMetaData`, which, in addition to the delta, also contains the materialized state after applying a set of version edits (i.e. the total amount of garbage and the resulting set of linked SSTs). This will enable us to add further consistency checks and to improve certain pieces of functionality where knowing up front which blob files get obsoleted is beneficial. (Note: this patch is just the refactoring part; I plan to create separate PRs for the enhancements.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9085 Test Plan: Ran `make check` and the stress tests in BlobDB mode. Reviewed By: riversand963 Differential Revision: D31980867 Pulled By: ltamasi fbshipit-source-id: cc4286778b10900af720423d6b772c77f28a93e3	2021-10-29 17:47:02 -07:00
Yanqin Jin	fdf2a0d7eb	Fix a compaction bug for write-prepared txn (#9061 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9061 In write-prepared txn, checking a sequence's visibility in a released (old) snapshot may return "Snapshot released". Suppose we have two snapshots: ``` earliest_snap < earliest_write_conflict_snap ``` If we release `earliest_write_conflict_snap` but keep `earliest_snap` during bottommost level compaction, then it is possible that certain sequence of events can lead to a PUT being seq-zeroed followed by a SingleDelete of the same key. This violates the ascending order of keys, and will cause data inconsistency. Reviewed By: ltamasi Differential Revision: D31813017 fbshipit-source-id: dc68ba2541d1228489b93cf3edda5f37ed06f285	2021-10-29 15:23:17 -07:00
Peter Dillinger	a7d4bea43a	Implement XXH3 block checksum type (#9069 ) Summary: XXH3 - latest hash function that is extremely fast on large data, easily faster than crc32c on most any x86_64 hardware. In integrating this hash function, I have handled the compression type byte in a non-standard way to avoid using the streaming API (extra data movement and active code size because of hash function complexity). This approach got a thumbs-up from Yann Collet. Existing functionality change: * reject bad ChecksumType in options with InvalidArgument This change split off from https://github.com/facebook/rocksdb/issues/9058 because context-aware checksum is likely to be handled through different configuration than ChecksumType. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9069 Test Plan: tests updated, and substantially expanded. Unit tests now check that we don't accidentally change the values generated by the checksum algorithms ("schema test") and that we properly handle invalid/unrecognized checksum types in options or in file footer. DBTestBase::ChangeOptions (etc.) updated from two to one configuration changing from default CRC32c ChecksumType. The point of this test code is to detect possible interactions among features, and the likelihood of some bad interaction being detected by including configurations other than XXH3 and CRC32c--and then not detected by stress/crash test--is extremely low. Stress/crash test also updated (manual run long enough to see it accepts new checksum type). db_bench also updated for microbenchmarking checksums. ### Performance microbenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor) ./db_bench -benchmarks=crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3 crc32c : 0.200 micros/op 5005220 ops/sec; 19551.6 MB/s (4096 per op) xxhash : 0.807 micros/op 1238408 ops/sec; 4837.5 MB/s (4096 per op) xxhash64 : 0.421 micros/op 2376514 ops/sec; 9283.3 MB/s (4096 per op) xxh3 : 0.171 micros/op 5858391 ops/sec; 22884.3 MB/s (4096 per op) crc32c : 0.206 micros/op 4859566 ops/sec; 18982.7 MB/s (4096 per op) xxhash : 0.793 micros/op 1260850 ops/sec; 4925.2 MB/s (4096 per op) xxhash64 : 0.410 micros/op 2439182 ops/sec; 9528.1 MB/s (4096 per op) xxh3 : 0.161 micros/op 6202872 ops/sec; 24230.0 MB/s (4096 per op) crc32c : 0.203 micros/op 4924686 ops/sec; 19237.1 MB/s (4096 per op) xxhash : 0.839 micros/op 1192388 ops/sec; 4657.8 MB/s (4096 per op) xxhash64 : 0.424 micros/op 2357391 ops/sec; 9208.6 MB/s (4096 per op) xxh3 : 0.162 micros/op 6182678 ops/sec; 24151.1 MB/s (4096 per op) As you can see, especially once warmed up, xxh3 is fastest. ### Performance macrobenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor) Test for I in `seq 1 50`; do for CHK in 0 1 2 3 4; do TEST_TMPDIR=/dev/shm/rocksdb$CHK ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=$CHK 2>&1 \| grep 'micros/op' \| tee -a results-$CHK & done; wait; done Results (ops/sec) for FILE in results; do echo -n "$FILE "; awk '{ s += $5; c++; } END { print 1.0 s / c; }' < $FILE; done results-0 252118 # kNoChecksum results-1 251588 # kCRC32c results-2 251863 # kxxHash results-3 252016 # kxxHash64 results-4 252038 # kXXH3 Reviewed By: mrambacher Differential Revision: D31905249 Pulled By: pdillinger fbshipit-source-id: cb9b998ebe2523fc7c400eedf62124a78bf4b4d1	2021-10-28 22:15:17 -07:00
Andrew Kryczka	f24c39ab3d	Prevent corruption with parallel manual compactions and `change_level == true` (#9077 ) Summary: The bug can impact the following scenario. There must be two `CompactRange()`s, call them A and B. Compaction A must have `change_level=true`. Compactions A and B must run in parallel, and new data must be added while they run as well. Now, on to the details of the race condition. Compaction A must reach the refitting phase while B's next step is to trivial move new data (i.e., data that has been inserted behind A) down to the same level that A's refit targets (`CompactRangeOptions::target_level`). B must be unregistered (i.e., has not yet called `AddManualCompaction()` for the current `RunManualCompaction()`) while A invokes `DisableManualCompaction()`s to prepare for refitting. In the old code, B could still proceed to register a manual compaction, while A had disabled manual compaction. The next part of the race condition is B picks and schedules a trivial move while A has released the lock in refitting phase in order to persist the LSM state change (i.e., the log phase of `LogAndApply()`). That way, B does not see the refitted data when picking a trivial-move compaction. So it is susceptible to picking one that overlaps. Finally, B executes the picked trivial-move compaction. Trivial-move compactions are special in that they never check whether manual compaction is disabled. So the picked compaction causing overlap ends up being applied, leading to LSM corruption if `force_consistency_checks=false`, or entering read-only mode with `Status::Corruption` if `force_consistency_checks=true` (the default). The fix is just to prevent B from registering itself in `RunManualCompaction()` while manual compactions are disabled, consequently preventing any trivial move or other compaction from being picked/scheduled. Thanks to siying for finding the bug. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9077 Test Plan: The test does not go all the way in exposing the bug because it requires a compaction to be picked/scheduled while logging LSM state change for RefitLevel(). But the fix is to make such a compaction not picked/scheduled in the first place, so any repro of that scenario would end up hanging RefitLevel() logging. So instead I just verified no such compaction is registered in the scenario where `RefitLevel()` disables manual compactions. Reviewed By: siying Differential Revision: D31921908 Pulled By: ajkr fbshipit-source-id: 9bb5d0e847ad428211227f40830c685c209fbecb	2021-10-27 23:08:56 -07:00
Jonathan Albrecht	e970248602	Add support for building on s390x platform (#8962 ) Summary: This PR adds support for building on s390x including updating travis CI. It uses the previous work in https://github.com/facebook/rocksdb/pull/6168 and adds some more changes to get all current tests (make check and jni tests) to pass. The tests were run with snappy, lz4, bzip2 and zstd all compiled in. There are a few pieces still needed to get the travis build working that I don't think I can do. adamretter is this something you could help with? 1. A prebuilt https://rocksdb-deps.s3-us-west-2.amazonaws.com/cmake/cmake-3.14.5-Linux-s390x.deb package 2. A https://hub.docker.com/r/evolvedbinary/rocksjava s390x image Not sure if there is more required for travis. Happy to help in any way I can. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8962 Reviewed By: mrambacher Differential Revision: D31802198 Pulled By: pdillinger fbshipit-source-id: 683511466fa6b505f85ba5a9964a268c6151f0c2	2021-10-22 10:13:15 -07:00
Yanqin Jin	f72fd58565	Fix atomic flush waiting forever for MANIFEST write (#9034 ) Summary: In atomic flush, concurrent background flush threads will commit to the MANIFEST one by one, in the order of the IDs of their picked memtables for all included column families. Each time, a background flush thread decides whether to wait based on two criteria: - Is db stopped? If so, don't wait. - Am I the one to commit the currently earliest memtable? If so, don't wait and ready to go. When atomic flush was implemented, error writing to or syncing the MANIFEST would cause the db to be stopped. Therefore, this background thread does not have to check for the background error while waiting. If there has been such an error, `DBStopped()` would have been true, and this thread will not wait forever. After we improved error handling, RocksDB may map an IOError while writing to MANIFEST to a soft error, if there is no WAL. This requires the background threads to check for background error while waiting. Otherwise, a background flush thread may wait forever. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9034 Test Plan: make check Reviewed By: zhichao-cao Differential Revision: D31639225 Pulled By: riversand963 fbshipit-source-id: e9ab07c4d8f2eade238adeefe3e42dd9a5a3ebbd	2021-10-20 21:34:47 -07:00
leipeng	0a73ada7b5	remove unused local obj and simpilify comple code (#9052 ) Summary: This PR does not change code sematics, it just changes for: 1. local obj `nonmem_w` and `lfile` are unused 2. null check for `delete ptr` is unnecessary 3. use `unique_ptr::reset` instead of `release` + `delete` Pull Request resolved: https://github.com/facebook/rocksdb/pull/9052 Reviewed By: zhichao-cao Differential Revision: D31801661 Pulled By: anand1976 fbshipit-source-id: 16a77d45da8c8833bf5bf3bce546bb3711b335df	2021-10-20 14:08:05 -07:00
leipeng	0c53b41856	db_impl_write.cc: use stats_ instead of immutable_db_options_.stats (#9053 ) Summary: This PR has no semantic changes, just to make code shorter. `stats_` has value same with `immutable_db_options_.stats`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9053 Reviewed By: zhichao-cao Differential Revision: D31801603 Pulled By: anand1976 fbshipit-source-id: cbd8fe478d3e90ae078ace49b4f2eb9bb028ccf6	2021-10-20 14:04:59 -07:00
Andrew Kryczka	4217d1bce7	Support `GetMapProperty()` with "rocksdb.dbstats" (#9057 ) Summary: This PR supports querying `GetMapProperty()` with "rocksdb.dbstats" to get the DB-level stats in a map format. It only reports cumulative stats over the DB lifetime and, as such, does not update the baseline for interval stats. Like other map properties, the string keys are not (yet) exposed in the public API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9057 Test Plan: new unit test Reviewed By: zhichao-cao Differential Revision: D31781495 Pulled By: ajkr fbshipit-source-id: 6f77d3aee8b4b1a015061b8c260a123859ceaf9b	2021-10-20 13:17:00 -07:00
sdong	c66b4429ff	Incremental Space Amp Compactions in Universal Style (#8655 ) Summary: This commit introduces incremental compaction in univeral style for space amplification. This follows the first improvement mentioned in https://rocksdb.org/blog/2021/04/12/universal-improvements.html . The implemention simply picks up files about size of max_compaction_bytes to compact and execute if the penalty is not too big. More optimizations can be done in the future, e.g. prioritizing between this compaction and other types. But for now, the feature is supposed to be functional and can often reduce frequency of full compactions, although it can introduce penalty. In order to add cut files more efficiently so that more files from upper levels can be included, SST file cutting threshold (for current file + overlapping parent level files) is set to 1.5X of target file size. A 2MB target file size will generate files like this: https://gist.github.com/siying/29d2676fba417404f3c95e6c013c7de8 Number of files indeed increases but it is not out of control. Two set of write benchmarks are run: 1. For ingestion rate limited scenario, we can see full compaction is mostly eliminated: https://gist.github.com/siying/959bc1186066906831cf4c808d6e0a19 . The write amp increased from 7.7 to 9.4, as expected. After applying file cutting, the number is improved to 8.9. In another benchmark, the write amp is even better with the incremental approach: https://gist.github.com/siying/d1c16c286d7c59c4d7bba718ca198163 2. For ingestion rate unlimited scenario, incremental compaction turns out to be too expensive most of the time and is not executed, as expected. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8655 Test Plan: Add unit tests to the functionality. Reviewed By: ajkr Differential Revision: D31787034 fbshipit-source-id: ce813e63b15a61d5a56e97bf8902a1b28e011beb	2021-10-20 10:04:13 -07:00
sdong	f053851af6	Ignore non-overlapping levels when determinig grandparent files (#9051 ) Summary: Right now, when picking a compaction, grand parent files are from output_level + 1. This usually works, but if the level doesn't have any overlapping file, it will be more efficient to go further down. This is because the files are likely to be trivial moved further and might create a violation of max_compaction_bytes. This situation can naturally happen and might happen even more with TTL compactions. There is no harm to fix it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9051 Test Plan: Run existing tests and see it passes. Also briefly run crash test. Reviewed By: ajkr Differential Revision: D31748829 fbshipit-source-id: 52b99ab4284dc816d22f34406d528a3c98ff6719	2021-10-19 12:48:18 -07:00
Peter Dillinger	ad5325a736	Experimental support for SST unique IDs (#8990 ) Summary: * New public header unique_id.h and function GetUniqueIdFromTableProperties which computes a universally unique identifier based on table properties of table files from recent RocksDB versions. * Generation of DB session IDs is refactored so that they are guaranteed unique in the lifetime of a process running RocksDB. (SemiStructuredUniqueIdGen, new test included.) Along with file numbers, this enables SST unique IDs to be guaranteed unique among SSTs generated in a single process, and "better than random" between processes. See https://github.com/pdillinger/unique_id * In addition to public API producing 'external' unique IDs, there is a function for producing 'internal' unique IDs, with functions for converting between the two. In short, the external ID is "safe" for things people might do with it, and the internal ID enables more "power user" features for the future. Specifically, the external ID goes through a hashing layer so that any subset of bits in the external ID can be used as a hash of the full ID, while also preserving uniqueness guarantees in the first 128 bits (bijective both on first 128 bits and on full 192 bits). Intended follow-up: * Use the internal unique IDs in cache keys. (Avoid conflicts with https://github.com/facebook/rocksdb/issues/8912) (The file offset can be XORed into the third 64-bit value of the unique ID.) * Publish the external unique IDs in FileStorageInfo (https://github.com/facebook/rocksdb/issues/8968) Pull Request resolved: https://github.com/facebook/rocksdb/pull/8990 Test Plan: Unit tests added, and checking of unique ids in stress test. NOTE in stress test we do not generate nearly enough files to thoroughly stress uniqueness, but the test trims off pieces of the ID to check for uniqueness so that we can infer (with some assumptions) stronger properties in the aggregate. Reviewed By: zhichao-cao, mrambacher Differential Revision: D31582865 Pulled By: pdillinger fbshipit-source-id: 1f620c4c86af9abe2a8d177b9ccf2ad2b9f48243	2021-10-18 23:32:01 -07:00
Giuseppe Ottaviano	f0841d4faf	Fix out-of-bounds access in MultiDBParallelOpenTest (#9046 ) Summary: `dbs` should not be cleared, as it is reused later when reopening the DBs, so we have an out-of-bounds access with `dbnames[dbnum]`. The values left in the vector don't need to be reset, as the db pointer is an out parameter for `DB::Open`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9046 Reviewed By: pdillinger Differential Revision: D31738263 Pulled By: ot fbshipit-source-id: c619e947b8d3dbc3d896f29971f093d3e3c794d3	2021-10-18 21:25:45 -07:00
Jay Zhuang	314de7e7de	Make `DB::Close()` thread-safe (#8970 ) Summary: If `DB::Close()` is called in multi-thread env, the resource could be double released, which causes exception or assert. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8970 Test Plan: Test with multi-thread benchmark, with each thread try to close the DB at the end. Reviewed By: pdillinger Differential Revision: D31242042 Pulled By: jay-zhuang fbshipit-source-id: a61276b1b61e07732e375554106946aea86a23eb	2021-10-18 20:32:35 -07:00
Jay Zhuang	53a0ab2bea	Deflaky ObsoleteFilesTest (#9049 ) Summary: WaitForFlushMemTable() may only wait for mem flush but not background flush finishing. The the obsoleted file may not be purged yet. `fcaa7ff638/db/db_impl/db_impl_compaction_flush.cc (L2200-L2203)` Use WaitForCompact() instead to wait for background flush job. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9049 Test Plan: `gtest-parallel ./obsolete_files_test --gtest_filter=ObsoleteFilesTest.DeleteObsoleteOptionsFile -r 1000` Reviewed By: zhichao-cao Differential Revision: D31737343 Pulled By: jay-zhuang fbshipit-source-id: 82276ebeae7c7c75a733d3e1fd1c130d45e4761f	2021-10-18 15:15:23 -07:00
Peter Dillinger	3ffb3baa0b	Add (Live)FileStorageInfo API (#8968 ) Summary: New classes FileStorageInfo and LiveFileStorageInfo and 'experimental' function DB::GetLiveFilesStorageInfo, which is intended to largely replace several fragmented DB functions needed to create checkpoints and backups. This function is now used to create checkpoints and backups, because it fixes many (probably not all) of the prior complexities of checkpoint not having atomic access to DB metadata. This also ensures strong functional test coverage of the new API. Specifically, much of the old CheckpointImpl::CreateCustomCheckpoint has been migrated to and updated in DBImpl::GetLiveFilesStorageInfo, with the former now calling the latter. Also, the class FileStorageInfo in metadata.h compatibly replaces BackupFileInfo and serves as a new base class for SstFileMetaData. Some old fields of SstFileMetaData are still provided (for now) but deprecated. Although FileStorageInfo::directory is accurate when using db_paths and/or cf_paths, these have never been supported by Checkpoint nor BackupEngine and still are not. This change does now detect these cases and return NotSupported when appropriate. (More work needed for support.) Somehow this change broke ProgressCallbackDuringBackup, but the progress_callback logic was dubious to begin with because it would call the callback based on copy buffer size, not size actually copied. Logic and test updated to track size actually copied per-thread. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8968 Test Plan: tests updated. DB::GetLiveFilesStorageInfo mostly tested by use in CheckpointImpl. DBTest.SnapshotFiles updated to also test GetLiveFilesStorageInfo, including reading the data after DB close. Added CheckpointTest.CheckpointWithDbPath (NotSupported). Reviewed By: siying Differential Revision: D31242045 Pulled By: pdillinger fbshipit-source-id: b183d1ce9799e220daaefd6b3b5365d98de676c0	2021-10-16 10:04:32 -07:00
Giuseppe Ottaviano	4bfd415e34	Fix sequence number bump logic in multi-CF SST ingestion (#9005 ) Summary: The code in `IngestExternalFiles()` that bumps the DB's sequence number depending on what seqnos were assigned to the files has 3 bugs: 1) There is an assertion that the sequence number is increased in all the affected column families, but this is unnecessary, it is fine if some files can stick to a lower sequence number. It is very easy to hit the assertion: it is sufficient to insert 2 files in 2 CFs, one which overlaps the CF and one that doesn't (for example the CF is empty). The line added in the `IngestFilesIntoMultipleColumnFamilies_Success` test makes the assertion fail. 2) SetLastSequence() is called with the sum of all the bumps across CFs, but we should take the maximum instead, as all CFs start with the current seqno and bump it independently. 3) The code above is accidentally under a `#ifndef NDEBUG`, so it doesn't run in optimized builds, so some files may be assigned seqnos from the future. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9005 Test Plan: Added line in `IngestFilesIntoMultipleColumnFamilies_Success` that triggers the assertion, verified that the test (and all the others) pass after the fix. Reviewed By: ajkr Differential Revision: D31597892 Pulled By: ot fbshipit-source-id: c2d3237f90290df1178736ace8653a9623f5a770	2021-10-12 20:39:52 -07:00
Levi Tamasi	7cc52cd8f5	Update HISTORY for PR 8994 (#9017 ) Summary: Also, expand on/clarify a comment in `VersionStorageInfoTest`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9017 Reviewed By: riversand963 Differential Revision: D31566130 Pulled By: ltamasi fbshipit-source-id: 1d30c7af084c4de7b2030bc6c768838d65746010	2021-10-12 10:19:56 -07:00
Giuseppe Ottaviano	22d4dc5066	Fix race in WriteBufferManager (#9009 ) Summary: EndWriteStall has a data race: `queue_.empty()` is checked outside of the mutex, so once we enter the critical section another thread may already have cleared the list, and accessing the `front()` is undefined behavior (and causes interesting crashes under high concurrency). This PR fixes the bug, and also rewrites the logic to make it easier to reason about it. It also fixes another subtle bug: if some writers are stalled and `SetBufferSize(0)` is called, which disables the WBM, the writer are not unblocked because of an early `enabled()` check in `EndWriteStall()`. It doesn't significantly change the locking behavior, as before writers won't lock unless entering a stall condition, and `FreeMem` almost always locks if stalling is allowed, but that is inevitable with the current design. Liveness is guaranteed by the fact that if some writes are blocked, eventually all writes will be blocked due to `stall_active_`, and eventually all memory is freed. While at it, do a couple of optimizations: - In `WBMStallInterface::Signal()` signal the CV only after releasing the lock. Signaling under the lock is a common pitfall, as it causes the woken-up thread to immediately go back to sleep because the mutex is still locked by the awaker. - Move all allocations and deallocations outside of the lock. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9009 Test Plan: ``` USE_CLANG=1 make -j64 all check ``` Reviewed By: akankshamahajan15 Differential Revision: D31550668 Pulled By: ot fbshipit-source-id: 5125387c3dc7ecaaa2b8bbc736e58c4156698580	2021-10-12 00:16:21 -07:00
Levi Tamasi	3e1bf771a3	Make it possible to force the garbage collection of the oldest blob files (#8994 ) Summary: The current BlobDB garbage collection logic works by relocating the valid blobs from the oldest blob files as they are encountered during compaction, and cleaning up blob files once they contain nothing but garbage. However, with sufficiently skewed workloads, it is theoretically possible to end up in a situation when few or no compactions get scheduled for the SST files that contain references to the oldest blob files, which can lead to increased space amp due to the lack of GC. In order to efficiently handle such workloads, the patch adds a new BlobDB configuration option called `blob_garbage_collection_force_threshold`, which signals to BlobDB to schedule targeted compactions for the SST files that keep alive the oldest batch of blob files if the overall ratio of garbage in the given blob files meets the threshold and all the given blob files are eligible for GC based on `blob_garbage_collection_age_cutoff`. (For example, if the new option is set to 0.9, targeted compactions will get scheduled if the sum of garbage bytes meets or exceeds 90% of the sum of total bytes in the oldest blob files, assuming all affected blob files are below the age-based cutoff.) The net result of these targeted compactions is that the valid blobs in the oldest blob files are relocated and the oldest blob files themselves cleaned up (since all SST files that rely on them get compacted away). These targeted compactions are similar to periodic compactions in the sense that they force certain SST files that otherwise would not get picked up to undergo compaction and also in the sense that instead of merging files from multiple levels, they target a single file. (Note: such compactions might still include neighboring files from the same level due to the need of having a "clean cut" boundary but they never include any files from any other level.) This functionality is currently only supported with the leveled compaction style and is inactive by default (since the default value is set to 1.0, i.e. 100%). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8994 Test Plan: Ran `make check` and tested using `db_bench` and the stress/crash tests. Reviewed By: riversand963 Differential Revision: D31489850 Pulled By: ltamasi fbshipit-source-id: 44057d511726a0e2a03c5d9313d7511b3f0c4eab	2021-10-11 18:03:01 -07:00
Andrew Kryczka	a282eff3d1	Protect existing files in `FaultInjectionTest{Env,FS}::ReopenWritableFile()` (#8995 ) Summary: `FaultInjectionTest{Env,FS}::ReopenWritableFile()` functions were accidentally deleting WALs from previous `db_stress` runs causing verification to fail. They were operating under the assumption that `ReopenWritableFile()` would delete any existing file. It was a reasonable assumption considering the `{Env,FileSystem}::ReopenWritableFile()` documentation stated that would happen. The only problem was neither the implementations we offer nor the "real" clients in RocksDB code followed that contract. So, this PR updates the contract as well as fixing the fault injection client usage. The fault injection change exposed that `ExternalSSTFileBasicTest.SyncFailure` was relying on a fault injection `Env` dropping unsynced data written by a regular `Env`. I changed that test to make its `SstFileWriter` use fault injection `Env`, and also implemented `LinkFile()` in fault injection so the unsynced data is tracked under the new name. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8995 Test Plan: - Verified it fixes the following failure: ``` $ ./db_stress --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/rocksdb_crashtest_whitebox --delpercent=5 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=100000 --max_key_len=3 --nooverwritepercent=1 --ops_per_thread=1000 --prefixpercent=0 --readpercent=60 --reopen=0 --target_file_size_base=1048576 --test_batches_snapshots=0 --write_buffer_size=1048576 --writepercent=35 --value_size_mult=33 -threads=1 ... $ ./db_stress --avoid_flush_during_recovery=1 --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/rocksdb_crashtest_whitebox --delpercent=5 --destroy_db_initially=0 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --iterpercent=10 --key_len_percent_dist=1,30,69 --max_bytes_for_level_base=4194304 --max_key=100000 --max_key_len=3 --nooverwritepercent=1 --open_files=-1 --open_metadata_write_fault_one_in=8 --open_write_fault_one_in=16 --ops_per_thread=1000 --prefix_size=-1 --prefixpercent=0 --readpercent=50 --sync=1 --target_file_size_base=1048576 --test_batches_snapshots=0 --write_buffer_size=1048576 --writepercent=35 --value_size_mult=33 -threads=1 ... Verification failed for column family 0 key 000000000000001300000000000000857878787878 (1143): Value not found: NotFound: Crash-recovery verification failed :( ... ``` - `make check -j48` Reviewed By: ltamasi Differential Revision: D31495388 Pulled By: ajkr fbshipit-source-id: 7886ccb6a07cb8b78ad7b6c1c341ccf40bb68385	2021-10-11 16:23:18 -07:00
Zhichao Cao	bcd049cd2d	Ingest external SST files with Temperature hints (#8949 ) Summary: Add the file temperature to `IngestExternalFileArg` such that when SST files are ingested, user is able to assign the temperature to each SST file. If the temperature vector is empty or its size does not match the file name vector size, all ingested SST files will be assigned with `Temperature::unKnown`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8949 Test Plan: add the new test and make check Reviewed By: siying Differential Revision: D31127852 Pulled By: zhichao-cao fbshipit-source-id: 141a81f0f7b473d88f4ab0cb2a21a114cbc6f83c	2021-10-08 10:32:24 -07:00
Andrew Kryczka	fcaa7ff638	Cancel manual compactions waiting on automatic compactions to drain (#8991 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8991 Test Plan: the new test hangs forever without this fix and passes with this fix. Reviewed By: hx235 Differential Revision: D31456419 Pulled By: ajkr fbshipit-source-id: a82c0e5560b6e6153089dccd8e46163c61b07bff	2021-10-07 15:23:55 -07:00
Kajetan Janiak	8717c26823	Warning about incompatible options with level_compaction_dynamic_level_bytes (#8329 ) Summary: This change introduces warnings instead of a silent override when trying to use level_compaction_dynamic_level_bytes with multiple cf_paths/db_paths. I have completed the CLA. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8329 Reviewed By: hx235 Differential Revision: D31399713 Pulled By: ajkr fbshipit-source-id: 29c6fe5258d1f739b4590ecd44aee44f55415595	2021-10-07 15:23:55 -07:00
Zhichao Cao	b632ed0c67	Add file temperature related counter and bytes stats to and io_stats (#8710 ) Summary: For tiered storage project, we need to know the block read count and read bytes of files with different temperature. Add FileIOByTemperature to IOStatsContext and collect the bytes read and read count from different temperature files through the RandomAccessFileReader. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8710 Test Plan: make check, add the testing cases Reviewed By: siying Differential Revision: D30582400 Pulled By: zhichao-cao fbshipit-source-id: d83173de594374fc8404af5ce93a6a9be72c7141	2021-10-07 14:58:41 -07:00
Ramkumar Vadivelu	fe994bbd0b	Misc doc fixes (#8983 ) Summary: - Update few stale GitHub wiki link references from rocksdb.org - Update the API comments for ignore_range_deletions Pull Request resolved: https://github.com/facebook/rocksdb/pull/8983 Reviewed By: ajkr Differential Revision: D31355965 Pulled By: ramvadiv fbshipit-source-id: 245ac4a6913976dd82afa308bc4aae6bff3d788c	2021-10-07 11:22:17 -07:00
mrambacher	53e595d1f3	Cleanup multiple implementations of VectorIterator (#8901 ) Summary: There were three implementations of VectorIterator (util/vector_iterator, test_util/testutil.h and LoggingForwardVectorIterator). Merged them into one class to increase code coverage/testing and reduce duplication. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8901 Reviewed By: pdillinger Differential Revision: D31022673 Pulled By: mrambacher fbshipit-source-id: 8e3acbd2dfd60b4df609d02cc72846de2389d531	2021-10-06 07:48:31 -07:00
Stefan Roesch	a776406de3	Add file operation callbacks to SequentialFileReader (#8982 ) Summary: This change adds File IO Notifications to the SequentialFileReader The SequentialFileReader is extended with a listener parameter. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8982 Test Plan: A new test EventListenerTest::OnWALOperationTest has been added. The test verifies that during restore the sequential file reader is called and the notifications are fired. Reviewed By: riversand963 Differential Revision: D31320844 Pulled By: shrfb fbshipit-source-id: 040b24da7c010d7c14ebb5c6460fae9a19b8c168	2021-10-05 10:51:59 -07:00
mrambacher	787229837e	Fix LITE mode builds on MacOs (#8981 ) Summary: On MacOS, there were errors building in LITE mode related to unused private member variables: In file included from ./db/compaction/compaction_job.h:20: ./db/blob/blob_file_completion_callback.h:87:19: error: private field ‘sst_file_manager_’ is not used [-Werror,-Wunused-private-field] SstFileManager* sst_file_manager_; ^ ./db/blob/blob_file_completion_callback.h:88:22: error: private field ‘mutex_’ is not used [-Werror,-Wunused-private-field] InstrumentedMutex* mutex_; ^ ./db/blob/blob_file_completion_callback.h:89:17: error: private field ‘error_handler_’ is not used [-Werror,-Wunused-private-field] ErrorHandler* error_handler_; This PR resolves those build issues by removing the values as members in LITE mode and fixing the constructor to ignore the input values in LITE mode (otherwise we get unused parameter warnings). Tested by validating compiles without warnings. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8981 Reviewed By: akankshamahajan15 Differential Revision: D31320141 Pulled By: mrambacher fbshipit-source-id: d67875ebbd39a9555e4f09b2d37159566dd8a085	2021-10-04 05:30:26 -07:00
Yanqin Jin	2cdaf5ca5b	Add additional checks for three existing unit tests (#8973 ) Summary: With test sync points, we can assert on the equality of iterator value in three existing unit tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8973 Test Plan: ``` gtest-parallel -r 1000 ./db_test2 --gtest_filter=DBTest2.IterRaceFlush2:DBTest2.IterRaceFlush1:DBTest2.IterRefreshRaceFlush ``` make check Reviewed By: akankshamahajan15 Differential Revision: D31256340 Pulled By: riversand963 fbshipit-source-id: a9440767ab383e0ec61bd43ffa8fbec4ba562ea2	2021-10-01 17:22:37 -07:00
anand76	532ff334d9	Don't ignore deletion rate limit if WAL dir is different (#8967 ) Summary: If WAL dir is different from the DB dir, we should still honor the SstFileManager deletion rate limit for SST files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8967 Test Plan: Add a new unit test in db_sst_test Reviewed By: pdillinger Differential Revision: D31220116 Pulled By: anand1976 fbshipit-source-id: bcde8a53a7d728e15e597fb5d07ee86c1b38bd28	2021-09-30 13:26:31 -07:00
Yanqin Jin	2acffecca1	Add comments for MultiGetBlob() and checks for MultiRead() (#8972 ) Summary: Add comments for MultiGetBlob() that input argument `offsets` must be sorted. In addition, add assertion for this condition in debug build. Repeat the same for RandomAccessFileReader::MultiRead(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8972 Test Plan: make check Reviewed By: pdillinger Differential Revision: D31253205 Pulled By: riversand963 fbshipit-source-id: 98758229b8052f3aeb319d5584026b4de2d220a2	2021-09-29 14:27:19 -07:00
mrambacher	13ae16c315	Cleanup includes in dbformat.h (#8930 ) Summary: This header file was including everything and the kitchen sink when it did not need to. This resulted in many places including this header when they needed other pieces instead. Cleaned up this header to only include what was needed and fixed up the remaining code to include what was now missing. Hopefully, this sort of code hygiene cleanup will speed up the builds... Pull Request resolved: https://github.com/facebook/rocksdb/pull/8930 Reviewed By: pdillinger Differential Revision: D31142788 Pulled By: mrambacher fbshipit-source-id: 6b45de3f300750c79f751f6227dece9cfd44085d	2021-09-29 04:04:40 -07:00
Jay Zhuang	6b34eb0ebc	Add remote compaction read/write bytes statistics (#8939 ) Summary: Add basic read/write bytes statistics on the primary side: `REMOTE_COMPACT_READ_BYTES` `REMOTE_COMPACT_WRITE_BYTES` Fixed existing statistics missing some IO for remote compaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8939 Test Plan: CI Reviewed By: ajkr Differential Revision: D31074672 Pulled By: jay-zhuang fbshipit-source-id: c57afdba369990185008ffaec7e3fe7c62e8902f	2021-09-28 14:00:37 -07:00
Hui Xiao	d6bd1a0291	Support "level_at_creation" in TablePropertiesCollectorFactory::Context (#8919 ) Summary: Context: Exposing the level of the sst file (i.e, table) where it is created in `TablePropertiesCollectorFactory::Context` allows users of `TablePropertiesCollectorFactory` to customize some implementation details of `TablePropertiesCollectorFactory` and `TablePropertiesCollector` based on the level of creation. For example, `TablePropertiesCollector::NeedCompact()` can return different values based on level of creation. - Declared an extra field `level_at_creation` in `TablePropertiesCollectorFactory::Context` - Allowed `level_at_creation` to be passed in as an argument in `IntTblPropCollectorFactory::CreateIntTblPropCollector()` and `UserKeyTablePropertiesCollectorFactory::CreateIntTblPropCollector()`, the latter of which is an internal wrapper of user's passed-in `TablePropertiesCollectorFactory::CreateTablePropertiesCollector()` used in table-building process - Called `IntTblPropCollectorFactory::CreateIntTblPropCollector()` with `level_at_creation` passed into both `BlockBasedTableBuilder` and `PlainTableBuilder` - `PlainTableBuilder` previously did not capture `level_at_creation` from `TableBuilderOptions` in `PlainTableFactory`. In order for it to call the method with this parameter, this PR also made `PlainTableBuilder` capture `level_at_creation` as a required parameter - Called `IntTblPropCollectorFactory::CreateIntTblPropCollector()` with `level_at_creation` its overridden functions in its derived classes, including `RegularKeysStartWithAFactory::CreateIntTblPropCollector()` in `table_properties_collector_test.cc`, `SstFileWriterPropertiesCollectorFactory::CreateIntTblPropCollector()` in `sst_file_writer_collectors.h` Pull Request resolved: https://github.com/facebook/rocksdb/pull/8919 Test Plan: - Passed the added assertion for `context.level_at_creation` - Passed existing tests - Run `Make` to make sure adding a required parameter to `PlainTableBuilder`'s constructor does not break anything Reviewed By: anand1976 Differential Revision: D30951729 Pulled By: hx235 fbshipit-source-id: c4a0173b0d9344a4cf47e1b987d759c1c73cb474	2021-09-28 12:35:24 -07:00
mrambacher	7fd68b7c39	Make WalFilter, SstPartitionerFactory, FileChecksumGenFactory, and TableProperties Customizable (#8638 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8638 Reviewed By: zhichao-cao Differential Revision: D31024729 Pulled By: mrambacher fbshipit-source-id: 954c04ccab0b8dee64050a27aadf78ed119106c0	2021-09-28 05:32:02 -07:00
Akanksha Mahajan	78afb4d81e	Support SingleDelete for user-defined timestamps (#8921 ) Summary: Added support for SingleDelete for user-defined timestamps. Users can now Get and Iterate over keys deleted with SingleDelete. It also includes changes in CompactionIterator which preserves the same user key with different timestamps, unless the timestamp is below a certain threshold full_history_ts_low. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8921 Test Plan: Added new unit tests Reviewed By: riversand963 Differential Revision: D31098191 Pulled By: akankshamahajan15 fbshipit-source-id: 78a59ef4b4884ae324fcd10f56e62a27d5ee2f49	2021-09-27 11:51:07 -07:00
Peter Dillinger	0774d640c0	Fix some lint warnings reported on 6.25 (#8945 ) Summary: Fix some lint warnings Pull Request resolved: https://github.com/facebook/rocksdb/pull/8945 Test Plan: existing tests, linters Reviewed By: zhichao-cao Differential Revision: D31103824 Pulled By: pdillinger fbshipit-source-id: 4dd9b0c30fa50e588107ac6ed392b2dfb507a5d4	2021-09-27 11:43:20 -07:00
mrambacher	e0f697d2bd	Make SliceTransform into a Customizable class (#8641 ) Summary: Made SliceTransform into a Customizable class. Would be nice to write a test that stored and used a custom transform in an SST table. There are a set of tests (DBBlockFliterTest.PrefixExtractor*, SamePrefixTest.InDomainTest, PrefixTest.PrefixAndWholeKeyTest that run the same with or without a SliceTransform/PrefixFilter. Is this expected? Pull Request resolved: https://github.com/facebook/rocksdb/pull/8641 Reviewed By: zhichao-cao Differential Revision: D31142793 Pulled By: mrambacher fbshipit-source-id: bb08672fccbfdc263dcae21f25a62307e1facda1	2021-09-27 07:43:47 -07:00
Yanqin Jin	b92cef2d1d	Sort per-file blob read requests by offset (#8953 ) Summary: `RandomAccessFileReader::MultiRead()` tries to merge requests in direct IO, assuming input IO requests are sorted by offsets. Add a test in direct IO mode. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8953 Test Plan: make check Reviewed By: ltamasi Differential Revision: D31183546 Pulled By: riversand963 fbshipit-source-id: 5d043ec68e2daa47a3149066150afd41ee3d73e6	2021-09-24 22:14:30 -07:00
Hui Xiao	58444eadda	Make RateLimiter::GetTotalPendingRequest() non pure virtual for backward compability (#8938 ) Summary: Context/Summary: https://github.com/facebook/rocksdb/pull/8890 added a public API `RateLimiter::GetTotalPendingRequest()` but mistakenly marked it as pure virtual, forcing RateLimiter's derived classes to implement this function and breaking backward compatibility. This PR makes `RateLimiter::GetTotalPendingRequest()` as non-pure virtual method by providing a trivial implementation in rate_limiter.h Pull Request resolved: https://github.com/facebook/rocksdb/pull/8938 Test Plan: Passing existing tests Reviewed By: pdillinger Differential Revision: D31100661 Pulled By: hx235 fbshipit-source-id: 06eff1005156a6e5a881e393b2c5b2ad706897d8	2021-09-21 21:29:26 -07:00
mrambacher	6924869867	Make SystemClock into a Customizable Class (#8636 ) Summary: Made SystemClock into a Customizable class, complete with CreateFromString. Cleaned up some of the existing SystemClock implementations that were redundant (NoSleep was the same as the internal one for MockEnv). Changed MockEnv construction to allow Clock to be passed to the Memory/MockFileSystem. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8636 Reviewed By: zhichao-cao Differential Revision: D30483360 Pulled By: mrambacher fbshipit-source-id: cd0e3a876c39f8c98fe13374c06e8edbd5b9f2a1	2021-09-21 09:23:48 -07:00
Jay Zhuang	1c290c785d	RemoteCompaction support Fallback to local compaction (#8709 ) Summary: Add support for fallback to local compaction, the user can return `CompactionServiceJobStatus::kUseLocal` to instruct RocksDB to run the compaction locally instead of waiting for the remote compaction result. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8709 Test Plan: unittest Reviewed By: ajkr Differential Revision: D30560163 Pulled By: jay-zhuang fbshipit-source-id: 65d8905a4a1bc185a68daa120997f21d3198dbe1	2021-09-18 00:25:04 -07:00
Yanqin Jin	b512f4bc76	Batch blob read IO for MultiGet (#8699 ) Summary: In batched `MultiGet()`, RocksDB batches blob read IO and uses `RandomAccessFileReader::MultiRead()` to read the blobs instead of issuing multiple `Read()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8699 Test Plan: ``` make check ``` Reviewed By: ltamasi Differential Revision: D31030861 Pulled By: riversand963 fbshipit-source-id: a0df6060cbfd54cff9515a4eee08807b1dbcb0c8	2021-09-17 19:23:13 -07:00
Akanksha Mahajan	d6aa8c49f8	Expose blob file information through the EventListener interface (#8675 ) Summary: 1. Extend FlushJobInfo and CompactionJobInfo with information about the blob files generated by flush/compaction jobs. This PR add two structures BlobFileInfo and BlobFileGarbageInfo that contains the required information of blob files. 2. Notify the creation and deletion of blob files through OnBlobFileCreationStarted, OnBlobFileCreated, and OnBlobFileDeleted. 3. Test OnFile*Finish operations notifications with Blob Files. 4. Log the blob file creation/deletion events through EventLogger in Log file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8675 Test Plan: Add new unit tests in listener_test Reviewed By: ltamasi Differential Revision: D30412613 Pulled By: akankshamahajan15 fbshipit-source-id: ca51b63c6e8c8d0485a38c503572bc5a82bd5d07	2021-09-16 17:23:36 -07:00
Jay Zhuang	b97c53b629	Add compaction priority information in RemoteCompaction (#8707 ) Summary: Add compaction priority information in RemoteCompaction, which can be used to schedule high priority job first. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8707 Test Plan: unittest Reviewed By: ajkr Differential Revision: D30548401 Pulled By: jay-zhuang fbshipit-source-id: b30446511fb31b4583c49edd8565d496cf013a34	2021-09-16 15:09:35 -07:00
Peter Dillinger	f4a1d10668	Fix flaky WALTrashCleanupOnOpen (#8917 ) Summary: Test did not consider that slower deletion rate only kicks in after a file is deleted Fixes https://github.com/facebook/rocksdb/issues/7546 Pull Request resolved: https://github.com/facebook/rocksdb/pull/8917 Test Plan: no longer reproduces using buck test mode/dev //internal_repo_rocksdb/repo:db_sst_test -- --exact 'internal_repo_rocksdb/repo:db_sst_test - DBWALTestWithParam/DBWALTestWithParam.WALTrashCleanupOnOpen/0' --jobs 40 --stress-runs 600 --record-results Reviewed By: siying Differential Revision: D30949127 Pulled By: pdillinger fbshipit-source-id: 5d0607f8f548071b07410fe8f532b4618cd225e5	2021-09-15 21:31:20 -07:00
Peter Dillinger	2819c7840e	Fix PrepopulateBlockCache::kFlushOnly (#8750 ) Summary: kFlushOnly currently means "always" except in the case of remote compaction. This makes it flushes only. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8750 Test Plan: test updated Reviewed By: akankshamahajan15 Differential Revision: D30968034 Pulled By: pdillinger fbshipit-source-id: 5dbd24dde18852a0e937a540995fba9bfbe89037	2021-09-15 15:33:20 -07:00
sdong	12d798ac06	Always iniitalize ArenaWrappedDBIter::db_iter_ to nullptr (#8889 ) Summary: ArenaWrappedDBIter::db_iter_ should never be nullptr. However, when debugging a segfault, it's hard to distinguish it is not initialized (not possible) and other corruption. Add this nullptr to help distinguish the case. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8889 Test Plan: Run existing unit tests. Reviewed By: pdillinger Differential Revision: D30814756 fbshipit-source-id: 4b1f36896a33dc203d4f1f424ded9554927d61ba	2021-09-14 14:33:15 -07:00
Andrew Kryczka	d648cb47b9	Adapt key-value checksum for timestamp-suffixed keys (#8914 ) Summary: After https://github.com/facebook/rocksdb/issues/8725, keys added to `WriteBatch` may be timestamp-suffixed, while `WriteBatch` has no awareness of the timestamp size. Therefore, `WriteBatch` can no longer calculate timestamp checksum separately from the rest of the key's checksum in all cases. This PR changes the definition of key in KV checksum to include the timestamp suffix. That way we do not need to worry about where the timestamp begins within the key. I believe the only practical effect of this change is now `AssignTimestamp()` requires recomputing the whole key checksum (`UpdateK()`) rather than just the timestamp portion (`UpdateT()`). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8914 Test Plan: run stress command that used to fail ``` $ ./db_stress --batch_protection_bytes_per_key=8 -clear_column_family_one_in=0 -test_batches_snapshots=1 ``` Reviewed By: riversand963 Differential Revision: D30925715 Pulled By: ajkr fbshipit-source-id: c143f7ccb46c0efb390ad57ef415c250d754deff	2021-09-14 13:14:39 -07:00
eharry	0b6be7eb68	Fix WAL log data corruption #8723 (#8746 ) Summary: Fix WAL log data corruption when using DBOptions.manual_wal_flush(true) and WriteOptions.sync(true) together (https://github.com/facebook/rocksdb/issues/8723) Pull Request resolved: https://github.com/facebook/rocksdb/pull/8746 Reviewed By: ajkr Differential Revision: D30758468 Pulled By: riversand963 fbshipit-source-id: 07c20899d5f2447dc77861b4845efc68a59aa4e8	2021-09-13 20:15:59 -07:00
Peter Dillinger	7bef598440	Bypass unused parameterization in ExternalSSTFileBasicTest.IngestExte… (#8910 ) Summary: Facebook infrastructure doesn't like continuously skipping tests, so fixing this permanently disabled parameterization to BYPASS instead of SKIP. (Internal ref: T100525285) Pull Request resolved: https://github.com/facebook/rocksdb/pull/8910 Test Plan: manual Reviewed By: anand1976 Differential Revision: D30905169 Pulled By: pdillinger fbshipit-source-id: e23d63d2aa800e54676269fad3a093cd3f9f222d	2021-09-13 12:18:15 -07:00
Levi Tamasi	306b779957	Use GetBlobFileSize instead of GetTotalBlobBytes in DB properties (#8902 ) Summary: The patch adjusts the definition of BlobDB's DB properties a bit by switching to `GetBlobFileSize` from `GetTotalBlobBytes`. The difference is that the value returned by `GetBlobFileSize` includes the blob file header and footer as well, and thus matches the on-disk size of blob files. In addition, the patch removes the `Version` number from the `blob_stats` property, and updates/extends the unit tests a little. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8902 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D30859542 Pulled By: ltamasi fbshipit-source-id: e3426d2d567bd1bd8c8636abdafaafa0743c854c	2021-09-13 10:47:16 -07:00
mrambacher	dafa584fd1	Change the File System File Wrappers to std::unique_ptr (#8618 ) Summary: This allows the wrapper classes to own the wrapped object and eliminates confusion as to ownership. Previously, many classes implemented their own ownership solutions. Fixes https://github.com/facebook/rocksdb/issues/8606 Pull Request resolved: https://github.com/facebook/rocksdb/pull/8618 Reviewed By: pdillinger Differential Revision: D30136064 Pulled By: mrambacher fbshipit-source-id: d0bf471df8818dbb1770a86335fe98f761cca193	2021-09-13 08:46:19 -07:00
Yanqin Jin	2a2b3e03a5	Allow WriteBatch to have keys with different timestamp sizes (#8725 ) Summary: In the past, we unnecessarily requires all keys in the same write batch to be from column families whose timestamps' formats are the same for simplicity. Specifically, we cannot use the same write batch to write to two column families, one of which enables timestamp while the other disables it. The limitation is due to the member `timestamp_size_` that used to exist in each `WriteBatch` object. We pass a timestamp_size to the constructor of `WriteBatch`. Therefore, users can simply use the old `WriteBatch::Put()`, `WriteBatch::Delete()`, etc APIs for write, while the internal implementation of `WriteBatch` will take care of memory allocation for timestamps. The above is not necessary. One the one hand, users can set up a memory buffer to store user key and then contiguously append the timestamp to the user key. Then the user can pass this buffer to the `WriteBatch::Put(Slice&)` API. On the other hand, users can set up a SliceParts object which is an array of Slices and let the last Slice to point to the memory buffer storing timestamp. Then the user can pass the SliceParts object to the `WriteBatch::Put(SliceParts&)` API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8725 Test Plan: make check Reviewed By: ltamasi Differential Revision: D30654499 Pulled By: riversand963 fbshipit-source-id: 9d848c77ad3c9dd629aa5fc4e2bc16fb0687b4a2	2021-09-12 15:34:26 -07:00
Peter Dillinger	bda8d93ba9	Fix and detect headers with missing dependencies (#8893 ) Summary: It's always annoying to find a header does not include its own dependencies and only works when included after other includes. This change adds `make check-headers` which validates that each header can be included at the top of a file. Some headers are excluded e.g. because of platform or external dependencies. rocksdb_namespace.h had to be re-worked slightly to enable checking for failure to include it. (ROCKSDB_NAMESPACE is a valid namespace name.) Fixes mostly involve adding and cleaning up #includes, but for FileTraceWriter, a constructor was out-of-lined to make a forward declaration sufficient. This check is not currently run with `make check` but is added to CircleCI build-linux-unity since that one is already relatively fast. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8893 Test Plan: existing tests and resolving issues detected by new check Reviewed By: mrambacher Differential Revision: D30823300 Pulled By: pdillinger fbshipit-source-id: 9fff223944994c83c105e2e6496d24845dc8e572	2021-09-10 10:00:26 -07:00
mrambacher	dc0dc90cf5	Make Statistics a Customizable Class (#8637 ) Summary: Make the Statistics object into a Customizable object. Statistics can now be stored and created to/from the Options file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8637 Reviewed By: zhichao-cao Differential Revision: D30530550 Pulled By: mrambacher fbshipit-source-id: 5fc7d01d8431f37b2c205bbbd8342c9f697023bd	2021-09-10 09:47:39 -07:00
anand76	eea566864e	Support custom Env in db_sst_test and external_sst_file_basic_test (#8888 ) Summary: Support custom Env in these tests. Some custom Envs do not support reopening a file for write, either normal mode or Random RW mode. Added some additional checks in external_sst_file_basic_test to accommodate those Envs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8888 Reviewed By: riversand963 Differential Revision: D30824481 Pulled By: anand1976 fbshipit-source-id: c3ac7a628e6df29e94f42e370e679934a4f77eac	2021-09-08 21:21:49 -07:00
Zhiyi Zhang	0cb0fc6fd3	Add DB properties for BlobDB (#8734 ) Summary: RocksDB exposes certain internal statistics via the DB property interface. However, there are currently no properties related to BlobDB. For starters, we would like to add the following BlobDB properties: `rocksdb.num-blob-files`: number of blob files in the current Version (kind of like `num-files-at-level` but note this is not per level, since blob files are not part of the LSM tree). `rocksdb.blob-stats`: this could return the total number and size of all blob files, and potentially also the total amount of garbage (in bytes) in the blob files in the current Version. `rocksdb.total-blob-file-size`: the total size of all blob files (as a blob counterpart for `total-sst-file-size`) of all Versions. `rocksdb.live-blob-file-size`: the total size of all blob files in the current Version. `rocksdb.estimate-live-data-size`: this is actually an existing property that we can extend so it considers blob files as well. When it comes to blobs, we actually have an exact value for live bytes. Namely, live bytes can be computed simply as total bytes minus garbage bytes, summed over the entire set of blob files in the Version. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8734 Test Plan: ``` ➜ rocksdb git:(new_feature_blobDB_properties) ./db_blob_basic_test [==========] Running 16 tests from 2 test cases. [----------] Global test environment set-up. [----------] 10 tests from DBBlobBasicTest [ RUN ] DBBlobBasicTest.GetBlob [ OK ] DBBlobBasicTest.GetBlob (12 ms) [ RUN ] DBBlobBasicTest.MultiGetBlobs [ OK ] DBBlobBasicTest.MultiGetBlobs (11 ms) [ RUN ] DBBlobBasicTest.GetBlob_CorruptIndex [ OK ] DBBlobBasicTest.GetBlob_CorruptIndex (10 ms) [ RUN ] DBBlobBasicTest.GetBlob_InlinedTTLIndex [ OK ] DBBlobBasicTest.GetBlob_InlinedTTLIndex (12 ms) [ RUN ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber [ OK ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber (9 ms) [ RUN ] DBBlobBasicTest.GenerateIOTracing [ OK ] DBBlobBasicTest.GenerateIOTracing (11 ms) [ RUN ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile [ OK ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile (13 ms) [ RUN ] DBBlobBasicTest.GetMergeBlobWithPut [ OK ] DBBlobBasicTest.GetMergeBlobWithPut (11 ms) [ RUN ] DBBlobBasicTest.MultiGetMergeBlobWithPut [ OK ] DBBlobBasicTest.MultiGetMergeBlobWithPut (14 ms) [ RUN ] DBBlobBasicTest.BlobDBProperties [ OK ] DBBlobBasicTest.BlobDBProperties (21 ms) [----------] 10 tests from DBBlobBasicTest (124 ms total) [----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 (12 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 (1011 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 (1013 ms) [----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest (2066 ms total) [----------] Global test environment tear-down [==========] 16 tests from 2 test cases ran. (2190 ms total) [ PASSED ] 16 tests. ``` Reviewed By: ltamasi Differential Revision: D30690849 Pulled By: Zhiyi-Zhang fbshipit-source-id: a7567319487ad76bd1a2e24bf143afdbbd9e4346	2021-09-08 12:22:04 -07:00
mrambacher	beed86473a	Make MemTableRepFactory into a Customizable class (#8419 ) Summary: This PR does the following: -> Makes the MemTableRepFactory into a Customizable class and creatable/configurable via CreateFromString -> Makes the existing implementations compatible with configurations -> Moves the "SpecialRepFactory" test class into testutil, accessible via the ObjectRegistry or a NewSpecial API New tests were added to validate the functionality and all existing tests pass. db_bench and memtablerep_bench were hand-tested to verify the functionality in those tools. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8419 Reviewed By: zhichao-cao Differential Revision: D29558961 Pulled By: mrambacher fbshipit-source-id: 81b7229636e4e649a0c914e73ac7b0f8454c931c	2021-09-08 07:46:44 -07:00
Andrew Kryczka	941543721d	Bytes read stat for `VerifyChecksum()` and `VerifyFileChecksums()` APIs (#8741 ) Summary: - Clarified some comments on compatibility for adding new ticker stats - Added read I/O stats for `VerifyChecksum()` and `VerifyFileChecksums()` APIs Pull Request resolved: https://github.com/facebook/rocksdb/pull/8741 Test Plan: new unit test Reviewed By: zhichao-cao Differential Revision: D30708578 Pulled By: ajkr fbshipit-source-id: d06b961f7e199ae92c266b683e39870aa8f63449	2021-09-07 13:28:29 -07:00
Peter Dillinger	0ef88538c6	Improve support for using regexes (#8740 ) Summary: * Consolidate use of std::regex for testing to testharness.cc, to minimize Facebook linters constantly flagging uses in non-production code. * Improve syntax and error messages for asserting some string matches a regex in tests. * Add a public Regex wrapper class to encapsulate existing usage in ObjectRegistry. * Remove unnecessary include <regex> * Put warnings that use of Regex in production code could cause bad performance or stack overflow. Intended follow-up work: * Replace std::regex with another underlying implementation like RE2 * Improve ObjectRegistry interface in terms of possibly confusing literal string matching vs. regex and in terms of reporting invalid regex. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8740 Test Plan: tests updated, basic unit test for public Regex, and some manual testing of temporary changes to see example error messages: utilities/backupable/backupable_db_test.cc:917: Failure 000010_1162373755_138626.blob (child.name) does not match regex [0-9]+_[0-9]+_[0-9]+[.]blobHAHAHA (pattern) db/db_basic_test.cc:74: Failure R3SHSBA8C4U0CIMV2ZB0 (sid3) does not match regex [0-9A-Z]{20}HAHAHA Reviewed By: mrambacher Differential Revision: D30706246 Pulled By: pdillinger fbshipit-source-id: ba845e8f563ccad39bdb58f44f04e9da8f78c3fd	2021-09-07 13:05:23 -07:00
Peter Dillinger	4750421ece	Replace most typedef with using= (#8751 ) Summary: Old typedef syntax is confusing Most but not all changes with perl -pi -e 's/typedef (.*) ([a-zA-Z0-9_]+);/using $2 = $1;/g' list_of_files make format Pull Request resolved: https://github.com/facebook/rocksdb/pull/8751 Test Plan: existing Reviewed By: zhichao-cao Differential Revision: D30745277 Pulled By: pdillinger fbshipit-source-id: 6f65f0631c3563382d43347896020413cc2366d9	2021-09-07 11:31:59 -07:00
Levi Tamasi	55ef8972fc	Support custom env in db_blob_{basic,compaction,corruption,index}_test (#8817 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8817 Test Plan: Ran `make check` and built/tested using internal custom environment. Reviewed By: riversand963 Differential Revision: D30768215 Pulled By: ltamasi fbshipit-source-id: cce96211d4c097612d20247f2e997358f40cc3d3	2021-09-07 11:13:56 -07:00
Akanksha Mahajan	e8a7001159	Update branch as "main" in tools/advisor/README.md (#8744 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8744 Reviewed By: ltamasi Differential Revision: D30716145 Pulled By: akankshamahajan15 fbshipit-source-id: c2fcaf9ddcae85a86c0f10496acab28cd795ff12	2021-09-01 20:26:28 -07:00
Peter Dillinger	c9cd5d25a8	Remove some unneeded code (#8736 ) Summary: * FullKey and ParseFullKey appear to serve no purpose in the public API (or anything else) so removed. Only use in one test updated. * NumberToString serves no purpose vs. ToString so removed, numerous calls updated * Remove unnecessary forward declarations in metadata.h by re-arranging class definitions. * Remove some unneeded semicolons Pull Request resolved: https://github.com/facebook/rocksdb/pull/8736 Test Plan: existing tests Reviewed By: mrambacher Differential Revision: D30700039 Pulled By: pdillinger fbshipit-source-id: 1e436a576f511a6ed8b4d97af7cc8216bc729af2	2021-09-01 14:28:58 -07:00
Peter Dillinger	13ded69484	Built-in support for generating unique IDs, bug fix (#8708 ) Summary: Env::GenerateUniqueId() works fine on Windows and on POSIX where /proc/sys/kernel/random/uuid exists. Our other implementation is flawed and easily produces collision in a new multi-threaded test. As we rely more heavily on DB session ID uniqueness, this becomes a serious issue. This change combines several individually suitable entropy sources for reliable generation of random unique IDs, with goal of uniqueness and portability, not cryptographic strength nor maximum speed. Specifically: * Moves code for getting UUIDs from the OS to port::GenerateRfcUuid rather than in Env implementation details. Callers are now told whether the operation fails or succeeds. * Adds an internal API GenerateRawUniqueId for generating high-quality 128-bit unique identifiers, by combining entropy from three "tracks": * Lots of info from default Env like time, process id, and hostname. * std::random_device * port::GenerateRfcUuid (when working) * Built-in implementations of Env::GenerateUniqueId() will now always produce an RFC 4122 UUID string, either from platform-specific API or by converting the output of GenerateRawUniqueId. DB session IDs now use GenerateRawUniqueId while DB IDs (not as critical) try to use port::GenerateRfcUuid but fall back on GenerateRawUniqueId with conversion to an RFC 4122 UUID. GenerateRawUniqueId is declared and defined under env/ rather than util/ or even port/ because of the Env dependency. Likely follow-up: enhance GenerateRawUniqueId to be faster after the first call and to guarantee uniqueness within the lifetime of a single process (imparting the same property onto DB session IDs). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8708 Test Plan: A new mini-stress test in env_test checks the various public and internal APIs for uniqueness, including each track of GenerateRawUniqueId individually. We can't hope to verify anywhere close to 128 bits of entropy, but it can at least detect flaws as bad as the old code. Serial execution of the new tests takes about 350 ms on my machine. Reviewed By: zhichao-cao, mrambacher Differential Revision: D30563780 Pulled By: pdillinger fbshipit-source-id: de4c9ff4b2f581cf784fcedb5f39f16e5185c364	2021-08-30 15:20:41 -07:00
Zaorang Yang	2bc914094d	Refactor with VersionBuilder (#8706 ) Summary: Introduce a new function to save sst files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8706 Reviewed By: jay-zhuang Differential Revision: D30544242 Pulled By: riversand963 fbshipit-source-id: 554755852daff7ae1c7864b0029f51b27099ee09	2021-08-27 12:15:08 -07:00
James Yin	7ddc096d7d	Fix typo in the comment of log_empty_ (#8711 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8711 Reviewed By: riversand963 Differential Revision: D30566761 Pulled By: jay-zhuang fbshipit-source-id: dd4690f5e2af2d263ed75ea1b9ed24692fe81362	2021-08-27 12:10:29 -07:00
anand76	ebaa3c8a59	Fix a race condition in DumpStats() during iteration of the ColumnFamilySet (#8714 ) Summary: DumpStats() iterates through the ColumnFamilySet. There is a potential race condition because it does Ref the cfd, and the cfd could get destroyed during the iteration. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8714 Test Plan: make check Reviewed By: ltamasi Differential Revision: D30580199 Pulled By: anand1976 fbshipit-source-id: 60a3443ad0d4f7ac6a977dec780e6d2c1b70b850	2021-08-26 15:40:26 -07:00
Jay Zhuang	4afa24f8ae	Deflake test `CompactionJobTest.InputSerialization` (#8712 ) Summary: It's invalid to have an empty file name. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8712 Test Plan: ``` $ gtest-parallel ./compaction_job_test --gtest_filter=CompactionJobTest.InputSerialization -r 10000 ``` Reviewed By: pdillinger Differential Revision: D30566739 Pulled By: jay-zhuang fbshipit-source-id: 41e73175e3c95c4b73b4fdcd33470788d4e29d37	2021-08-26 09:27:37 -07:00
Yanqin Jin	f235f4b0a3	Fix a bug of secondary instance sequence going backward (#8653 ) Summary: Recent refactor of `ReactiveVersionSet::ReadAndApply()` uses `ManifestTailer` whose `Iterate()` method can cause the db's `last_sequence_` to go backward. Consequently, read requests can see out-dated data. For example, latest changes to the primary will not be seen on the secondary even after a `TryCatchUpWithPrimary()` if no new write batches are read from the WALs and no new MANIFEST entries are read from the MANIFEST. Fix the bug so that `VersionEditHandler::CheckIterationResult` will never decrease `last_sequence_`, `last_allocated_sequence_` and `last_published_sequence_`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8653 Test Plan: make check Reviewed By: jay-zhuang Differential Revision: D30272084 Pulled By: riversand963 fbshipit-source-id: c6a49c534b2509b93ef62d8936ed0acd5b860eaa	2021-08-24 18:18:36 -07:00
Peter Dillinger	318fe6941a	Add port::GetProcessID() (#8693 ) Summary: Useful in some places for object uniqueness across processes. Currently used for generating a host-wide identifier of Cache objects but expected to be used soon in some unique id generation code. `int64_t` is chosen for return type because POSIX uses signed integer type, usually `int`, for `pid_t` and Windows uses `DWORD`, which is `uint32_t`. Future work: avoid copy-pasted declarations in port_*.h, perhaps with port_common.h always included from port.h Pull Request resolved: https://github.com/facebook/rocksdb/pull/8693 Test Plan: manual for now Reviewed By: ajkr, anand1976 Differential Revision: D30492876 Pulled By: pdillinger fbshipit-source-id: 39fc2788623cc9f4787866bdb67a4d183dde7eef	2021-08-24 17:46:14 -07:00
Yanqin Jin	229350ef48	Allow iterate refresh for secondary instance (#8700 ) Summary: Test plan make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/8700 Reviewed By: zhichao-cao Differential Revision: D30523907 Pulled By: riversand963 fbshipit-source-id: 68928ab4dafb64ce80ab7bc69d83727a4713ab91	2021-08-24 15:40:56 -07:00
Andrew Kryczka	c521f22a1e	Deflake write-prepared and write-unprepared tests (#8696 ) Summary: The `JobContext::job_snapshot` referenced DB state but could have been deleted by a BG thread after the signal/unlock allowing shutdown to proceed. Then we would see an error like this (valgrind): ``` ==354104== Thread 2: ==354104== Invalid read of size 8 ==354104== at 0x694C4D: rocksdb::ManagedSnapshot::~ManagedSnapshot() (snapshot_impl.cc:20) ==354104== by 0x58F5BA: operator() (unique_ptr.h:81) ==354104== by 0x58F5BA: operator() (unique_ptr.h:75) ==354104== by 0x58F5BA: ~unique_ptr (unique_ptr.h:292) ==354104== by 0x58F5BA: rocksdb::JobContext::~JobContext() (job_context.h:221) ==354104== by 0x5F155E: rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction, rocksdb::Env::Priority) (db_impl_compaction_flush.cc:2696) ==354104== by 0x5F1BC2: rocksdb::DBImpl::BGWorkCompaction(void) (db_impl_compaction_flush.cc:2468) ==354104== by 0x83707A: operator() (std_function.h:688) ==354104== by 0x83707A: rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long) (threadpool_imp.cc:266) ==354104== by 0x8373ED: rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*) (threadpool_imp.cc:307) ==354104== by 0x492A800: execute_native_thread_routine (in /usr/local/fbcode/platform009/lib/libstdc++.so.6.0.28) ==354104== by 0x4A5020B: start_thread (in /usr/local/fbcode/platform009/lib/libpthread-2.30.so) ==354104== by 0x4CF281E: clone (in /usr/local/fbcode/platform009/lib/libc-2.30.so) ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/8696 Test Plan: unable to repro Reviewed By: pdillinger Differential Revision: D30505277 Pulled By: ajkr fbshipit-source-id: 5a99f34137cd14d06b0f624add6d37a70a61135d	2021-08-23 23:09:17 -07:00
Jay Zhuang	249b1078c9	Add extra information to RemoteCompaction APIs (#8680 ) Summary: Currently, we only provide job_id in RemoteCompaction APIs, the main problem of `job_id` is it cannot uniquely identify a compaction job between DB instances or between sessions. Providing DB and session id to the user, which will make building cross DB compaction service easier. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8680 Test Plan: unittest Reviewed By: ajkr Differential Revision: D30444859 Pulled By: jay-zhuang fbshipit-source-id: fdf107f4286564049637f154193c6d94c3c59448	2021-08-23 16:27:38 -07:00
Peter Dillinger	04db764831	Embed original file number in SST table properties (#8686 ) Summary: I very recently realized that with https://github.com/facebook/rocksdb/issues/8669 we cannot later add file numbers to external SST files (so that more can share db session ids for better uniqueness properties), because of forward compatibility. We would have a version of RocksDB that assumes session IDs are unique on external SST files and therefore can't really break that invariant in future files. This change adds a table property for "orig_file_number" which is populated by normal SST files and also external SST files generated by SstFileWriter. SstFileWriter now keeps a db_session_id for life of the object and increments its own file numbers for embedding in table properties. (They are arguably "fake" file numbers because these numbers and not embedded in the file name.) While updating block_based_table_builder, I removed several unnecessary fields from Rep, because following the pattern would have created another unnecessary field. This change also updates block_based_table_reader to use this new property when available, which means that for newer SST files, we can determine the stable/original <db_session_id,file_number> unique identifier using just the file contents, not the file name. (It's a bit complicated; detailed comments in block_based_table_reader.) Also added DB host id to properties listing by sst_dump, which could be useful in debugging. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8686 Test Plan: majorly overhauled StableCacheKeys test for this change Reviewed By: zhichao-cao Differential Revision: D30457742 Pulled By: pdillinger fbshipit-source-id: 2e5ae7dddeb94fb9d8eac8a928486aed8b8cd445	2021-08-20 20:40:48 -07:00
Peter Dillinger	2a383f21f4	Add Bloom/Ribbon hybrid API support (#8679 ) Summary: This is essentially resurrection and fixing of the part of https://github.com/facebook/rocksdb/issues/8198 that was reverted in https://github.com/facebook/rocksdb/issues/8212, using data added in https://github.com/facebook/rocksdb/issues/8246. Basically, when configuring Ribbon filter, you can specify an LSM level before which Bloom will be used instead of Ribbon. But Bloom is only considered for Leveled and Universal compaction styles and file going into a known LSM level. This way, SST file writer, FIFO compaction, etc. use Ribbon filter as you would expect with NewRibbonFilterPolicy. So that this can be controlled with a single int value and so that flushes can be distinguished from intra-L0, we consider flush to go to level -1 for the purposes of this option. (Explained in API comment.) I also expect the most common and recommended Ribbon configuration to use Bloom during flush, to minimize slowing down writes and because according to my estimates, Ribbon only pays off if the structure lives in memory for more than an hour. Thus, I have changed the default for NewRibbonFilterPolicy to be this mild hybrid configuration. I don't really want to add something like NewHybridFilterPolicy because at least the mild hybrid configuration (Bloom for flush, Ribbon otherwise) should be considered a natural choice. C APIs also updated, but because they don't support overloading, rocksdb_filterpolicy_create_ribbon is kept pure ribbon for clarity and rocksdb_filterpolicy_create_ribbon_hybrid must be called for a hybrid configuration. While touching C API, I changed bits per key options from int to double. BuiltinFilterPolicy is needed so that LevelThresholdFilterPolicy doesn't inherit unused fields from BloomFilterPolicy. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8679 Test Plan: new + updated tests, including crash test Reviewed By: jay-zhuang Differential Revision: D30445797 Pulled By: pdillinger fbshipit-source-id: 6f5aeddfd6d79f7e55493b563c2d1d2d568892e1	2021-08-20 18:00:16 -07:00
Merlin Mao	baf22b4ee6	Add `IteratorTraceExecutionResult` for iterator related trace records. (#8687 ) Summary: - Allow to get `Valid()`, `status()`, `key()` and `value()` of an iterator from `IteratorTraceExecutionResult`. - Move lower bound and upper bound from `IteratorSeekQueryTraceRecord` to `IteratorQueryTraceRecord`. Added test in `DBTest2.TraceAndReplay`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8687 Reviewed By: zhichao-cao Differential Revision: D30457630 Pulled By: autopear fbshipit-source-id: be433099a25895b3aa6f0c00f95ad7b1d7489c1d	2021-08-20 15:35:56 -07:00
Akanksha Mahajan	5efec84c60	Fix blob callback in compaction and atomic flush (#8681 ) Summary: Pass BlobFileCompletionCallback in case of atomic flush and compaction job which is currently nullptr(default parameter). BlobFileCompletionCallback is used in case of IntegratedBlobDB to report new blob files to SstFileManager. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8681 Test Plan: CircleCI jobs Reviewed By: ltamasi Differential Revision: D30445998 Pulled By: akankshamahajan15 fbshipit-source-id: ba48093843864faec57f1f365cce7b5a569c4021	2021-08-20 11:41:14 -07:00
Merlin Mao	ff8953380f	Add iterator's lower and upper bounds to `TraceRecord` (#8677 ) Summary: Trace file V2 added lower/upper bounds to `Iterator::Seek()` and `Iterator::SeekForPrev()`. They were not used anywhere during the execution of a `TraceRecord`. Now they are added to be used by `ReadOptions` during `Iterator::Seek()` and `Iterator::SeekForPrev()` if they are set. Added test cases in `DBTest2.TraceAndManualReplay`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8677 Reviewed By: zhichao-cao Differential Revision: D30438255 Pulled By: autopear fbshipit-source-id: 82563006be0b69155990e506a74951c18af8d288	2021-08-19 17:27:12 -07:00
Baptiste Lemaire	c625b8d017	Add condition on NotifyOnFlushComplete that FlushJob was not mempurge. Add event listeners to mempurge tests. (#8672 ) Summary: Previously, when a `FlushJob` was redirected to a MemPurge, the function `DBImpl::NotifyOnFlushComplete` was called, which created a series of issues because the JobInfo was not correctly collected from the memtables. This diff aims at correcting these two issues (`FlushJobInfo` collection in `FlushJob::MemPurge` , no call to `DBImpl::NotifyOnFlushComplete` after successful mempurge). Event listeners were added to the unit tests to handle these situations. Surprisingly none of the crashtests caught this issue, I will try to add event listeners to crash tests in the future. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8672 Reviewed By: akankshamahajan15 Differential Revision: D30383109 Pulled By: bjlemaire fbshipit-source-id: 35a8d4295886923ee4049a6447f00022cb221c73	2021-08-18 17:40:01 -07:00
Merlin Mao	d10801e983	Allow Replayer to report the results of TraceRecords. (#8657 ) Summary: `Replayer::Execute()` can directly returns the result (e.g, request latency, DB::Get() return code, returned value, etc.) `Replayer::Replay()` reports the results via a callback function. New interface: `TraceRecordResult` in "rocksdb/trace_record_result.h". `DBTest2.TraceAndReplay` and `DBTest2.TraceAndManualReplay` are updated accordingly. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8657 Reviewed By: ajkr Differential Revision: D30290216 Pulled By: autopear fbshipit-source-id: 3c8d4e6b180ec743de1a9d9dcaee86064c74f0d6	2021-08-18 17:06:14 -07:00
Peter Dillinger	b6269b078a	Stable cache keys on ingested SST files (#8669 ) Summary: Extends https://github.com/facebook/rocksdb/issues/8659 to work for ingested external SST files, even the same file ingested into different DBs sharing a block cache. Note: These new cache keys are currently only enabled when FileSystem does not provide GetUniqueId. For now, they are typically larger, so slightly less efficient. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8669 Test Plan: Extended unit test Reviewed By: zhichao-cao Differential Revision: D30398532 Pulled By: pdillinger fbshipit-source-id: 1f13e2af4b8bfff5741953a69466e9589fbc23c7	2021-08-18 11:33:03 -07:00
Yanqin Jin	2b367fa8cc	Fix bug caused by releasing snapshot(s) during compaction (#8608 ) Summary: In debug mode, we are seeing assertion failure as follows ``` db/compaction/compaction_iterator.cc:980: void rocksdb::CompactionIterator::PrepareOutput(): \ Assertion `ikey_.type != kTypeDeletion && ikey_.type != kTypeSingleDeletion' failed. ``` It is caused by releasing earliest snapshot during compaction between the execution of `NextFromInput()` and `PrepareOutput()`. In one case, as demonstrated in unit test `WritePreparedTransaction.ReleaseEarliestSnapshotDuringCompaction_WithSD2`, incorrect result may be returned by a following range scan if we disable assertion, as in opt compilation level: the SingleDelete marker's sequence number is zeroed out, but the preceding PUT is also outputted to the SST file after compaction. Due to the logic of DBIter, the PUT will not be skipped and will be returned by iterator in range scan. https://github.com/facebook/rocksdb/issues/8661 illustrates what happened. Fix by taking a more conservative approach: make compaction zero out sequence number only if key is in the earliest snapshot when the compaction starts. Another assertion failure is ``` Assertion `current_user_key_snapshot_ == last_snapshot' failed. ``` It's caused by releasing the snapshot between the PUT and SingleDelete during compaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8608 Test Plan: make check Reviewed By: jay-zhuang Differential Revision: D30145645 Pulled By: riversand963 fbshipit-source-id: 699f58e66faf70732ad53810ccef43935d3bbe81	2021-08-17 22:14:20 -07:00
Levi Tamasi	6878cedcc3	Add statistics support to integrated BlobDB (#8667 ) Summary: The patch adds statistics support to the integrated BlobDB implementation, namely the tickers `BLOB_DB_BLOB_FILE_BYTES_READ` and `BLOB_DB_GC_{NUM_KEYS,BYTES}_RELOCATED`, and the histograms `BLOB_DB_(DE)COMPRESSION_MICROS`. (Some other statistics, like `BLOB_DB_BLOB_FILE_BYTES_WRITTEN`, `BLOB_DB_BLOB_FILE_SYNCED`, `BLOB_DB_BLOB_FILE_{READ,WRITE,SYNC}_MICROS` were already supported.) Note that the vast majority of the old BlobDB's tickers/histograms are not really applicable to the new implementation, since they e.g. pertain to calling dedicated BlobDB APIs (which the integrated BlobDB does not have) or are tied to the legacy BlobDB's design of writing blob files synchronously when a write API is called. Such statistics are marked "legacy BlobDB only" in `statistics.h`. Fixes https://github.com/facebook/rocksdb/issues/8645 . Pull Request resolved: https://github.com/facebook/rocksdb/pull/8667 Test Plan: Ran `make check` and tested the new statistics using `db_bench`. Reviewed By: riversand963 Differential Revision: D30356884 Pulled By: ltamasi fbshipit-source-id: 5f8a833faee60401c5643c2f0a6c0415488190a4	2021-08-17 17:22:31 -07:00
Peter Dillinger	a207c27809	Stable cache keys using DB session ids in SSTs (#8659 ) Summary: Use DB session ids in SST table properties to make cache keys stable across DB re-open and copy / move / restore / etc. These new cache keys are currently only enabled when FileSystem does not provide GetUniqueId. For now, they are typically larger, so slightly less efficient. Relevant to https://github.com/facebook/rocksdb/issues/7405 This change has a minor regression in PersistentCache functionality: metaindex blocks are no longer cached in PersistentCache. Table properties blocks already were not but ideally should be. I didn't spent effort to fix & test these issues because we don't believe PersistentCache is used much if at all and expect SecondaryCache to replace it. (Though PRs are welcome.) FIXME: there is more to be fixed for stable cache keys on external SST files Pull Request resolved: https://github.com/facebook/rocksdb/pull/8659 Test Plan: new unit test added, which fails when disabling new functionality Reviewed By: zhichao-cao Differential Revision: D30297705 Pulled By: pdillinger fbshipit-source-id: e8539a5c8802a79340405629870f2e3fb3822d3a	2021-08-16 20:37:20 -07:00
Adam Retter	5de333fd99	Add db_test2 to to ASSERT_STATUS_CHECKED (#8640 ) Summary: This is the `db_test2` parts of https://github.com/facebook/rocksdb/pull/7737 reworked on the latest HEAD. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8640 Reviewed By: akankshamahajan15 Differential Revision: D30303684 Pulled By: mrambacher fbshipit-source-id: 263e2f82d849bde4048b60aed8b31e7deed4706a	2021-08-16 08:10:32 -07:00
Jay Zhuang	c55460c734	Add property `LiveSstFilesSizeAtTemperature` for tiered storage (#8644 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8644 Reviewed By: siying, zhichao-cao Differential Revision: D30236535 Pulled By: jay-zhuang fbshipit-source-id: 1758d1c46d83a5087560fb63d53a016bf999da81	2021-08-15 14:17:45 -07:00
Baptiste Lemaire	e51be2c5a1	Improve MemPurge sampling (#8656 ) Summary: Previously, the `MemPurge` sampling function was assessing whether a random entry from a memtable was garbage or not by simply querying the given memtable (see https://github.com/facebook/rocksdb/issues/8628 for more details). In this diff, I am updating the sampling function by querying not only the memtable the entry was drawn from, but also all subsequent memtables that have a greater memtable ID. I also added the size of the value for KV entries in the payload/useful payload estimates (which was also one of the reasons why sampling was not as good as mempurging all the time in terms of L0 SST files reduction). Once these changes were made, I was able to clean obsolete objects and functions from the `MemtableList` struct, and did a bit of cleanup everywhere. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8656 Reviewed By: pdillinger Differential Revision: D30288583 Pulled By: bjlemaire fbshipit-source-id: 7646a545ec56f4715949daa59ab5eee74540feb3	2021-08-13 14:35:41 -07:00
Merlin Mao	f58d276764	Make TraceRecord and Replayer public (#8611 ) Summary: New public interfaces: `TraceRecord` and `TraceRecord::Handler`, available in "rocksdb/trace_record.h". `Replayer`, available in `rocksdb/utilities/replayer.h`. User can use `DB::NewDefaultReplayer()` to create a Replayer to auto/manual replay a trace file. Unit tests: - `./db_test2 --gtest_filter="DBTest2.TraceAndReplay"`: Updated with the internal API changes. - `./db_test2 --gtest_filter="DBTest2.TraceAndManualReplay"`: New for manual replay. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8611 Reviewed By: ajkr Differential Revision: D30266329 Pulled By: autopear fbshipit-source-id: 1ecb3cbbedae0f6a67c18f0cc82e002b4d81b6f8	2021-08-11 19:32:46 -07:00
Baptiste Lemaire	e3a96c4823	Memtable sampling for mempurge heuristic. (#8628 ) Summary: Changes the API of the MemPurge process: the `bool experimental_allow_mempurge` and `experimental_mempurge_policy` flags have been replaced by a `double experimental_mempurge_threshold` option. This change of API reflects another major change introduced in this PR: the MemPurgeDecider() function now works by sampling the memtables being flushed to estimate the overall amount of useful payload (payload minus the garbage), and then compare this useful payload estimate with the `double experimental_mempurge_threshold` value. Therefore, when the value of this flag is `0.0` (default value), mempurge is simply deactivated. On the other hand, a value of `DBL_MAX` would be equivalent to always going through a mempurge regardless of the garbage ratio estimate. At the moment, a `double experimental_mempurge_threshold` value else than 0.0 or `DBL_MAX` is opnly supported`with the `SkipList` memtable representation. Regarding the sampling, this PR includes the introduction of a `MemTable::UniqueRandomSample` function that collects (approximately) random entries from the memtable by using the new `SkipList::Iterator::RandomSeek()` under the hood, or by iterating through each memtable entry, depending on the target sample size and the total number of entries. The unit tests have been readapted to support this new API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8628 Reviewed By: pdillinger Differential Revision: D30149315 Pulled By: bjlemaire fbshipit-source-id: 1feef5390c95db6f4480ab4434716533d3947f27	2021-08-10 18:09:03 -07:00
Levi Tamasi	f63331ebaf	Attempt to deflake DBTestXactLogIterator.TransactionLogIteratorCorruptedLog (#8627 ) Summary: The patch attempts to deflake `DBTestXactLogIterator.TransactionLogIteratorCorruptedLog` by disabling file deletions while retrieving the list of WAL files and truncating the first WAL file. This is to prevent the `PurgeObsoleteFiles` call triggered by `GetSortedWalFiles` from invalidating the result of `GetSortedWalFiles`. The patch also cleans up the test case a bit and changes it to using `test::TruncateFile` instead of calling the `truncate` syscall directly. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8627 Test Plan: `make check` Reviewed By: akankshamahajan15 Differential Revision: D30147002 Pulled By: ltamasi fbshipit-source-id: db11072a4ad8900a2f859cb5294e22b1888c23f6	2021-08-10 11:10:07 -07:00
Jay Zhuang	61f83dfeb7	Add an unittest for tiered storage universal compaction (#8631 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8631 Reviewed By: siying Differential Revision: D30200385 Pulled By: jay-zhuang fbshipit-source-id: 0fa2bb15e74ff81762d767f234078e0fe0106c55	2021-08-09 13:44:23 -07:00
sdong	e7c24168d8	Move old files to warm tier in FIFO compactions (#8310 ) Summary: Some FIFO users want to keep the data for longer, but the old data is rarely accessed. This feature allows users to configure FIFO compaction so that data older than a threshold is moved to a warm storage tier. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8310 Test Plan: Add several unit tests. Reviewed By: ajkr Differential Revision: D28493792 fbshipit-source-id: c14824ea634814dee5278b449ab5c98b6e0b5501	2021-08-09 12:51:14 -07:00
Roy Crihfield	d4b75d295f	Add more C bindings for OptimisticTransactionDB (#8526 ) Summary: * `rocksdb_optimistictransactiondb_checkpoint_object_create` * `rocksdb_optimistictransactiondb_write` Pull Request resolved: https://github.com/facebook/rocksdb/pull/8526 Reviewed By: ajkr Differential Revision: D30076822 Pulled By: jay-zhuang fbshipit-source-id: a59956a8d5449e75d39a8087fbb2bad148cf697d	2021-08-06 19:10:48 -07:00
Levi Tamasi	87882736ef	Fix the sorting of KeyContexts for batched MultiGet (#8633 ) Summary: `CompareKeyContext::operator()` on the trunk has a bug: when comparing column family IDs, `lhs` is used for both sides of the comparison. This results in the `KeyContext`s getting sorted solely based on key, which in turn means that keys with the same column family do not necessarily form a single range in the sorted list. This violates an assumption of the batched `MultiGet` logic, leading to the same column family showing up multiple times in the list of `MultiGetColumnFamilyData`. The end result is the code attempting to check out the thread-local `SuperVersion` for the same CF multiple times, causing an assertion violation in debug builds and memory corruption/crash in release builds. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8633 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D30169182 Pulled By: ltamasi fbshipit-source-id: a47710652df7e95b14b40fb710924c11a8478023	2021-08-06 16:27:42 -07:00
Zaorang Yang	e95c570047	Fix the wrong comment of level compaction cf paths test (#8533 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8533 Reviewed By: ajkr Differential Revision: D29718067 fbshipit-source-id: b4b91c9271362e7a7d47ddbaf28f56fb537cc668	2021-08-06 15:27:12 -07:00
mrambacher	d057e8326d	Make MergeOperator+CompactionFilter/Factory into Customizable Classes (#8481 ) Summary: - Changed MergeOperator, CompactionFilter, and CompactionFilterFactory into Customizable classes. - Added Options/Configurable/Object Registration for TTL and Cassandra variants - Changed the StringAppend MergeOperators to accept a string delimiter rather than a simple char. Made the delimiter into a configurable option - Added tests for new functionality Pull Request resolved: https://github.com/facebook/rocksdb/pull/8481 Reviewed By: zhichao-cao Differential Revision: D30136050 Pulled By: mrambacher fbshipit-source-id: 271d1772835935b6773abaf018ee71e42f9491af	2021-08-06 08:27:25 -07:00
Akanksha Mahajan	fd2079938d	Dynamically configure BlockBasedTableOptions.prepopulate_block_cache (#8620 ) Summary: Dynamically configure BlockBasedTableOptions.prepopulate_block_cache using DB::SetOptions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8620 Test Plan: Added new unit test Reviewed By: anand1976 Differential Revision: D30091319 Pulled By: akankshamahajan15 fbshipit-source-id: fb586d1848a8dd525bba7b2f9eeac34f2fc6d82c	2021-08-05 19:44:51 -07:00
Levi Tamasi	9b25d26dc8	Attempt to deflake ObsoleteFilesTest.DeleteObsoleteOptionsFile (#8624 ) Summary: We've been seeing occasional crashes on CI while inserting into the vectors in `ObsoleteFilesTest.DeleteObsoleteOptionsFile`. The crashes don't reproduce locally (could be either a race or an object lifecycle issue) but the good news is that the vectors in question are not really used for anything meaningful by the test. (The assertion about the sizes of the two vectors being equal is guaranteed to hold, since the two sync points where they are populated are right after each other.) The patch simply removes the vectors from the test, alongside the associated callbacks and sync points. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8624 Test Plan: `make check` Reviewed By: akankshamahajan15 Differential Revision: D30118485 Pulled By: ltamasi fbshipit-source-id: 0a4c3d06584e84cd2b1dcc212d274fa1b89cb647	2021-08-05 18:36:16 -07:00
Andrew Kryczka	a685a701ca	Do not attempt to rename non-existent info log (#8622 ) Summary: Previously we attempted to rename "LOG" to "LOG.old.*" without checking its existence first. "LOG" had no reason to exist in a new DB. Errors in renaming a non-existent "LOG" were swallowed via `PermitUncheckedError()` so things worked. However the storage service's error monitoring was detecting all these benign rename failures. So it is better to fix it. Also with this PR we can now distinguish rename failure for other reasons and return them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8622 Test Plan: new unit test Reviewed By: akankshamahajan15 Differential Revision: D30115189 Pulled By: ajkr fbshipit-source-id: e2f337ffb2bd171be0203172abc8e16e7809b170	2021-08-04 17:25:00 -07:00
Yanqin Jin	0879c24040	Fix NotifyOnFlushCompleted() for atomic flush (#8585 ) Summary: PR https://github.com/facebook/rocksdb/issues/5908 added `flush_jobs_info_` to `FlushJob` to make sure `OnFlushCompleted()` is called after committing flush results to MANIFEST. However, `flush_jobs_info_` is not updated in atomic flush, causing `NotifyOnFlushCompleted()` to skip `OnFlushCompleted()`. This PR fixes this, in a similar way to https://github.com/facebook/rocksdb/issues/5908 that handles regular flush. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8585 Test Plan: make check Reviewed By: jay-zhuang Differential Revision: D29913720 Pulled By: riversand963 fbshipit-source-id: 4ff023c98372fa2c93188d4a5c8a4e9ffa0f4dda	2021-08-03 13:31:10 -07:00
Akanksha Mahajan	8b2f60b668	Cache warming blocks during flush (#8561 ) Summary: Insert warm blocks (data, uncompressed dict, index and filter blocks) during flush in Block cache which is enabled under option BlockBasedTableOptions.prepopulate_block_cache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8561 Test Plan: Added unit test Reviewed By: anand1976 Differential Revision: D29773411 Pulled By: akankshamahajan15 fbshipit-source-id: 6631123c10134340ef0bd7e90baafaa6deba0e66	2021-08-03 12:44:15 -07:00
Baptiste Lemaire	b278152261	Fix db stress crash mempurge (#8604 ) Summary: The db_stress crash was caused by a call to `IsFlushPending()` made by a stats function which triggered an `assert([false])`, which I didn't plan when I created the `trigger_flush` bool. It turns out that this bool variable is not useful: I created it because I thought the `imm_flush_needed` atomic bool would actually trigger a flush. It turns out that this bool is only checked in `IsFlushPending` - this is its only use - and a flush is triggered by either a background thread checking on the imm array, or by an explicit call to `SchedulePendingFlush` which creates a flush request, that is then added to a flush request queue. In this PR, I reverted the MemtableList::Add function to what it was before my changes. I tested the fix by running the exact command line that deterministically triggered the assert error (see below), which confirmed that this is where the error was coming from. I also run `db_crashtest.py whitebox` and `blackbox` for a couple hours locally before committing this PR. Experiment run: ```./db_stress --acquire_snapshot_one_in=0 --allow_concurrent_memtable_write=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=76.90653425292307 --bottommost_compression_type=disable --cache_index_and_filter_blocks=1 --cache_size=1048576 --checkpoint_one_in=1000000 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=0 --compaction_ttl=2 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=zstd --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --db=/dev/shm/rocksdb/rocksdb_crashtest_blackbox --db_write_buffer_size=0 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --enable_compaction_filter=1 --enable_pipelined_write=0 --expected_values_path=/dev/shm/rocksdb/rocksdb_crashtest_expected --experimental_allow_mempurge=1 --experimental_mempurge_policy=kAlternate --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000000 --format_version=2 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=14 --index_type=0 --iterpercent=0 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=False --long_running_snapshots=1 --mark_for_compaction_one_file_in=10 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtablerep=skip_list --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --open_files=-1 --open_metadata_write_fault_one_in=8 --open_read_fault_one_in=32 --open_write_fault_one_in=16 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=0 --pause_background_one_in=1000000 --periodic_compaction_seconds=1000 --prefix_size=-1 --prefixpercent=0 --progress_reports=0 --read_fault_one_in=0 --readpercent=60 --recycle_log_file_num=1 --reopen=20 --set_options_one_in=0 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --subcompactions=3 --sync=1 --sync_fault_injection=False --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=1 --unpartitioned_pinning=3 --use_clock_cache=0 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=0 --use_full_merge_v1=1 --use_merge=0 --use_multiget=0 --use_ribbon_filter=1 --user_timestamp_size=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --write_buffer_size=33554432 --write_dbid_to_manifest=1 --writepercent=35``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/8604 Reviewed By: pdillinger Differential Revision: D30047295 Pulled By: bjlemaire fbshipit-source-id: b9e379bfa3d6b9bd2b275725fb0bca4bd81a3dbe	2021-08-02 20:26:35 -07:00
Levi Tamasi	3f7e929865	Fix a race in ColumnFamilyData::UnrefAndTryDelete (#8605 ) Summary: The `ColumnFamilyData::UnrefAndTryDelete` code currently on the trunk unlocks the DB mutex before destroying the `ThreadLocalPtr` holding the per-thread `SuperVersion` pointers when the only remaining reference is the back reference from `super_version_`. The idea behind this was to break the circular dependency between `ColumnFamilyData` and `SuperVersion`: when the penultimate reference goes away, `ColumnFamilyData` can clean up the `SuperVersion`, which can in turn clean up `ColumnFamilyData`. (Assuming there is a `SuperVersion` and it is not referenced by anything else.) However, unlocking the mutex throws a wrench in this plan by making it possible for another thread to jump in and take another reference to the `ColumnFamilyData`, keeping the object alive in a zombie `ThreadLocalPtr`-less state. This can cause issues like https://github.com/facebook/rocksdb/issues/8440 , https://github.com/facebook/rocksdb/issues/8382 , and might also explain the `was_last_ref` assertion failures from the `ColumnFamilySet` destructor we sometimes observe during close in our stress tests. Digging through the archives, this unlocking goes way back to 2014 (or earlier). The original rationale was that `SuperVersionUnrefHandle` used to lock the mutex so it can call `SuperVersion::Cleanup`; however, this logic turned out to be deadlock-prone. https://github.com/facebook/rocksdb/pull/3510 fixed the deadlock but left the unlocking in place. https://github.com/facebook/rocksdb/pull/6147 then introduced the circular dependency and associated cleanup logic described above (in order to enable iterators to keep the `ColumnFamilyData` for dropped column families alive), and moved the unlocking-relocking snippet to its present location in `UnrefAndTryDelete`. Finally, https://github.com/facebook/rocksdb/pull/7749 fixed a memory leak but apparently exacerbated the race by (otherwise correctly) switching to `UnrefAndTryDelete` in `SuperVersion::Cleanup`. The patch simply eliminates the unlocking and relocking, which has been unnecessary ever since https://github.com/facebook/rocksdb/issues/3510 made `SuperVersionUnrefHandle` lock-free. This closes the window during which another thread could increase the reference count, and hopefully fixes the issues above. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8605 Test Plan: Ran `make check` and stress tests locally. Reviewed By: pdillinger Differential Revision: D30051035 Pulled By: ltamasi fbshipit-source-id: 8fe559e4b4ad69fc142579f8bc393ef525918528	2021-08-02 18:12:11 -07:00
yangzaorang	8e91bd90d2	Fix a issue with initializing blob header buffer (#8537 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8537 Reviewed By: ajkr Differential Revision: D29838132 Pulled By: jay-zhuang fbshipit-source-id: e3e78d5f85f240a1800ace417a8b634f74488e41	2021-08-02 17:15:06 -07:00
mrambacher	ab7f7c9e49	Allow WAL dir to change with db dir (#8582 ) Summary: Prior to this change, the "wal_dir" DBOption would always be set (defaults to dbname) when the DBOptions were sanitized. Because of this setitng in the options file, it was not possible to rename/relocate a database directory after it had been created and use the existing options file. After this change, the "wal_dir" option is only set under specific circumstances. Methods were added to the ImmutableDBOptions class to see if it is set and if it is set to something other than the dbname. Additionally, a method was added to retrieve the effective value of the WAL dir (either the option or the dbname/path). Tests were added to the core and ldb to test that a database could be created and renamed without issue. Additional tests for various permutations of wal_dir were also added. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8582 Reviewed By: pdillinger, autopear Differential Revision: D29881122 Pulled By: mrambacher fbshipit-source-id: 67d3d033dc8813d59917b0a3fba2550c0efd6dfb	2021-07-30 12:16:44 -07:00
Yanqin Jin	066b51126d	Several simple local code clean-ups (#8565 ) Summary: This PR tries to remove some unnecessary checks as well as unreachable code blocks to improve readability. An obvious non-public API method naming typo is also corrected. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8565 Test Plan: make check Reviewed By: lth Differential Revision: D29963984 Pulled By: riversand963 fbshipit-source-id: cc96e8f09890e5cfe9b20eadb63bdca5484c150a	2021-07-30 12:07:49 -07:00
Peter Dillinger	1d34cd797e	Fix insecure internal API for GetImpl (#8590 ) Summary: Calling the GetImpl function could leave reference to a local callback function in a field of a parameter struct. As this is performance-critical code, I'm not going to attempt to sanitize this code too much, but make the existing hack a bit cleaner by reverting what it overwrites in the input struct. Added SaveAndRestore utility class to make that easier. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8590 Test Plan: added unit test for SaveAndRestore; existing tests for GetImpl Reviewed By: riversand963 Differential Revision: D29947983 Pulled By: pdillinger fbshipit-source-id: 2f608853f970bc06724e834cc84dcc4b8599ddeb	2021-07-29 17:23:01 -07:00
sdong	e8f218cb68	DB::GetSortedWalFiles() to ensure file deletion is disabled (#8591 ) Summary: If DB::GetSortedWalFiles() runs without file deletion disbled, file might get deleted in the middle and error is returned to users. It makes the function hard to use. Fix it by disabling file deletion if it is not done. Fix another minor issue of logging within DB mutex, which should not be done unless a major failure happens. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8591 Test Plan: Run all existing tests Reviewed By: pdillinger Differential Revision: D29969412 fbshipit-source-id: d5f42b5271608a35b9b07687ce18157d7447b0de	2021-07-29 11:51:08 -07:00
Peter Dillinger	0804b44fb6	Some fixes and enhancements to `ldb repair` (#8544 ) Summary: * Basic handling of SST file with just range tombstones rather than failing assertion about smallest_seqno <= largest_seqno * Adds --verbose option so that there exists a way to see the INFO output from Repairer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8544 Test Plan: unit test added, manual testing for --verbose Reviewed By: ajkr Differential Revision: D29954805 Pulled By: pdillinger fbshipit-source-id: 696af25805fc36cc178b04ba6045922a22625fd9	2021-07-28 16:44:14 -07:00
jimmycleary	e0ff365a76	Replace macros in compaction_iterator.cc with inline functions (#8592 ) Summary: Internal task T96186510. Created new inline member functions in `CompactionIterator`, `DefinitelyInSnapshot`, `DefinitelyNotInSnapshot`, and `InEarliestSnapshot` to replace the macros at the top of `compaction_iterator.cc`. Placed the definitions in `compaction_iterator.h` in accordance with Google's style guide for inline functions. Separated the declarations and definitions, and only placed the `inline` keyword on the definitions, in line with ISO CPP recommendations. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8592 Test Plan: Ran `make check`. Successful build and all tests appeared to pass. Reviewed By: riversand963 Differential Revision: D29966782 Pulled By: jimmycFB fbshipit-source-id: 3584290bbbabf862e9ab58852281f46d37f58be6	2021-07-28 14:53:29 -07:00
Peter Dillinger	74b7c0d249	Fix use-after-free on implicit temporary FileOptions (#8571 ) Summary: FileOptions has an implicit conversion from EnvOptions and some internal APIs take `const FileOptions&` and save the reference, which is counter to Google C++ guidelines, > Avoid defining functions that require a const reference parameter to outlive the call, because const reference parameters bind to temporaries. Instead, find a way to eliminate the lifetime requirement (for example, by copying the parameter), or pass it by const pointer and document the lifetime and non-null requirements. This is at least a problem for repair.cc, which passes an EnvOptions to TableCache(), which would save a reference to the temporary copy as FileOptions. This was unfortunately only caught as a side effect of changes in https://github.com/facebook/rocksdb/issues/8544. This change fixes the repair.cc case and updates the involved internal APIs that save a reference to use `const FileOptions*` instead. Unfortunately, I don't know how to get any of our sanitizers to reliably report bugs like this, so I can't rule out more existing in our codebase. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8571 Test Plan: Test that issues seen with https://github.com/facebook/rocksdb/issues/8544 are fixed (can reproduce on AWS EC2) Reviewed By: ajkr Differential Revision: D29943890 Pulled By: pdillinger fbshipit-source-id: 95f9c5251548777b4dc994c1a083dd2add5799c9	2021-07-27 21:49:14 -07:00
Peter Dillinger	e352bd5742	Fix missing Handle release in TableCache::GetRangeTombstoneIterator (#8589 ) Summary: This appears to be little used code so not a major bug, but is blocking https://github.com/facebook/rocksdb/issues/8544 Pull Request resolved: https://github.com/facebook/rocksdb/pull/8589 Test Plan: Added regression test to the end of DBRangeDelTest::TableEvictedDuringScan. Without this fix, ASAN reports memory leak. Reviewed By: ajkr Differential Revision: D29943623 Pulled By: pdillinger fbshipit-source-id: f7115fa6d4440aef83888ff609aa03d09216463b	2021-07-27 21:32:11 -07:00
mrambacher	3aee4fbd41	Make EventListener into a Customizable Class (#8473 ) Summary: - Added Type/CreateFromString - Added ability to load EventListeners to DBOptions - Since EventListeners did not previously have a Name(), defaulted to "". If there is no name, the listener cannot be loaded from the ObjectRegistry. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8473 Reviewed By: zhichao-cao Differential Revision: D29901488 Pulled By: mrambacher fbshipit-source-id: 2d3a4aa6db1562ac03e7ad41b360e3521d486254	2021-07-27 07:47:02 -07:00
Baptiste Lemaire	4361d6d163	Add simple heuristics for experimental mempurge. (#8583 ) Summary: Add `experimental_mempurge_policy` option flag and introduce two new `MemPurge` (Memtable Garbage Collection) policies: 'ALWAYS' and 'ALTERNATE'. Default value: ALTERNATE. `ALWAYS`: every flush will first go through a `MemPurge` process. If the output is too big to fit into a single memtable, then the mempurge is aborted and a regular flush process carries on. `ALWAYS` is designed for user that need to reduce the number of L0 SST file created to a strict minimum, and can afford a small dent in performance (possibly hits to CPU usage, read efficiency, and maximum burst write throughput). `ALTERNATE`: a flush is transformed into a `MemPurge` except if one of the memtables being flushed is the product of a previous `MemPurge`. `ALTERNATE` is a good tradeoff between reduction in number of L0 SST files created and performance. `ALTERNATE` perform particularly well for completely random garbage ratios, or garbage ratios anywhere in (0%,50%], and even higher when there is a wild variability in garbage ratios. This PR also includes support for `experimental_mempurge_policy` in `db_bench`. Testing was done locally by replacing all the `MemPurge` policies of the unit tests with `ALTERNATE`, as well as local testing with `db_crashtest.py` `whitebox` and `blackbox`. Overall, if an `ALWAYS` mempurge policy passes the tests, there is no reasons why an `ALTERNATE` policy would fail, and therefore the mempurge policy was set to `ALWAYS` for all mempurge unit tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8583 Reviewed By: pdillinger Differential Revision: D29888050 Pulled By: bjlemaire fbshipit-source-id: e2cf26646d66679f6f5fb29842624615610759c1	2021-07-26 11:56:29 -07:00
leipeng	4171e3db9b	CompactionJob::Install(): fix log truncation (#8563 ) Summary: event log info may be truncated, the default buffer size is 512, this PR changes buffer size to 8192. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8563 Reviewed By: ajkr Differential Revision: D29838229 Pulled By: jay-zhuang fbshipit-source-id: 00c5dea3caff0641a209f02c972e92d65b505f50	2021-07-23 11:39:24 -07:00
Drewryz	3b27725245	Fix a minor issue with initializing the test path (#8555 ) Summary: The PerThreadDBPath has already specified a slash. It does not need to be specified when initializing the test path. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8555 Reviewed By: ajkr Differential Revision: D29758399 Pulled By: jay-zhuang fbshipit-source-id: 6d2b878523e3e8580536e2829cb25489844d9011	2021-07-23 08:38:45 -07:00
Baptiste Lemaire	c521a9ab2b	Retire superfluous functions introduced in earlier mempurge PRs. (#8558 ) Summary: The main challenge to make the memtable garbage collection prototype (nicknamed `mempurge`) was to not get rid of WAL files that contain unflushed (but mempurged) data. That was successfully guaranteed by not writing the VersionEdit to the MANIFEST file after a successful mempurge. By not writing VersionEdits to the `MANIFEST` file after a succesful mempurge operation, we do not change the earliest log file number that contains unflushed data: `cfd->GetLogNumber()` (`cfd->SetLogNumber()` is only called in `VersionSet::ProcessManifestWrites`). As a result, a number of functions introduced earlier just for the mempurge operation are not obscolete/redundant. (e.g.: `FlushJob::ExtractEarliestLogFileNumber`), and this PR aims at cleaning up all these now-unnecessary functions. In particular, we no longer need to store the earliest log file number in the `MemTable` struct itself. This PR therefore also reverts the `MemTable` struct to its original form. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8558 Test Plan: Already included in `db_flush_test.cc`. Reviewed By: anand1976 Differential Revision: D29764351 Pulled By: bjlemaire fbshipit-source-id: 0f43b260fa270251862512f397d3f24ee62e8437	2021-07-22 18:29:13 -07:00
Yanqin Jin	2e5388178f	Return error if trying to open secondary on missing or inaccessible primary (#8200 ) Summary: If the primary's CURRENT file is missing or inaccessible, the secondary should not hang trying repeatedly to switch to the next MANIFEST. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8200 Test Plan: make check Reviewed By: jay-zhuang Differential Revision: D27840627 Pulled By: riversand963 fbshipit-source-id: 071fed97cbab1bc5cdefd1dc235e5cd406c174e1	2021-07-22 15:48:58 -07:00
Peter Dillinger	84eef260de	Remove TaskLimiterToken::ReleaseOnce for fix (#8567 ) Summary: Rare TSAN and valgrind failures are caused by unnecessary reading of a field on the TaskLimiterToken::limiter_ for an assertion after the token has been released and the limiter destroyed. To simplify we can simply destroy the token before triggering DB shutdown (potentially destroying the limiter). This makes the ReleaseOnce logic unnecessary. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8567 Test Plan: watch for more failures in CI Reviewed By: ajkr Differential Revision: D29811795 Pulled By: pdillinger fbshipit-source-id: 135549ebb98fe4f176d1542ed85d5bd6350a40b3	2021-07-21 17:37:53 -07:00
Jay Zhuang	42eaa45c1b	Avoid updating option if there's no value updated (#8518 ) Summary: Try avoid expensive updating options operation if `SetDBOptions()` does not change any option value. Skip updating is not guaranteed, for example, changing `bytes_per_sync` to `0` may still trigger updating, as the value could be sanitized. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8518 Test Plan: added unittest Reviewed By: riversand963 Differential Revision: D29672639 Pulled By: jay-zhuang fbshipit-source-id: b7931de62ceea6f1bdff0d1209adf1197d3ed1f4	2021-07-21 13:45:59 -07:00
Zhichao Cao	87e82a41a9	Fix incorrect Status::NoSpace() status check (#8504 ) Summary: If we want to check whether a Status s is NoSpace() or not, we should check the subcode instread of using s==Status::NoSpace(). Fix some of the incorrect check in the ErrorHandler. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8504 Test Plan: make check Reviewed By: anand1976 Differential Revision: D29601764 Pulled By: zhichao-cao fbshipit-source-id: cdab56a827891c23746bba9cbb53f169fe35f086	2021-07-20 18:09:51 -07:00
sdong	9e885939a3	Change to code for trimmed memtable history is to released outside DB mutex (#8530 ) Summary: Currently, the code shows that we delete memtables immedately after it is trimmed from history. Although it should never happen as the super version still holds the memtable, which is only switched after it, it feels a good practice not to do it, but use clean it up in the standard way: put it to WriteContext and clean it after DB mutex. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8530 Test Plan: Run all existing tests. Reviewed By: ajkr Differential Revision: D29703410 fbshipit-source-id: 21d8068ac6377de4b6fa7a89697195742659fde4	2021-07-16 19:28:48 -07:00
Peter Dillinger	df5dc73bec	Don't hold DB mutex for block cache entry stat scans (#8538 ) Summary: I previously didn't notice the DB mutex was being held during block cache entry stat scans, probably because I primarily checked for read performance regressions, because they require the block cache and are traditionally latency-sensitive. This change does some refactoring to avoid holding DB mutex and to avoid triggering and waiting for a scan in GetProperty("rocksdb.cfstats"). Some tests have to be updated because now the stats collector is populated in the Cache aggressively on DB startup rather than lazily. (I hope to clean up some of this added complexity in the future.) This change also ensures proper treatment of need_out_of_mutex for non-int DB properties. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8538 Test Plan: Added unit test logic that uses sync points to fail if the DB mutex is held during a scan, covering the various ways that a scan might be triggered. Performance test - the known impact to holding the DB mutex is on TransactionDB, and the easiest way to see the impact is to hack the scan code to almost always miss and take an artificially long time scanning. Here I've injected an unconditional 5s sleep at the call to ApplyToAllEntries. Before (hacked): $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 \| egrep 'db.db.write.micros\|micros/op' randomtransaction : 433.219 micros/op 2308 ops/sec; 0.1 MB/s ( transactions:78999 aborts:0) rocksdb.db.write.micros P50 : 16.135883 P95 : 36.622503 P99 : 66.036115 P100 : 5000614.000000 COUNT : 149677 SUM : 8364856 $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 \| egrep 'db.db.write.micros\|micros/op' randomtransaction : 448.802 micros/op 2228 ops/sec; 0.1 MB/s ( transactions:75999 aborts:0) rocksdb.db.write.micros P50 : 16.629221 P95 : 37.320607 P99 : 72.144341 P100 : 5000871.000000 COUNT : 143995 SUM : 13472323 Notice the 5s P100 write time. After (hacked): $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 \| egrep 'db.db.write.micros\|micros/op' randomtransaction : 303.645 micros/op 3293 ops/sec; 0.1 MB/s ( transactions:98999 aborts:0) rocksdb.db.write.micros P50 : 16.061871 P95 : 33.978834 P99 : 60.018017 P100 : 616315.000000 COUNT : 187619 SUM : 4097407 $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 \| egrep 'db.db.write.micros\|micros/op' randomtransaction : 310.383 micros/op 3221 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0) rocksdb.db.write.micros P50 : 16.270026 P95 : 35.786844 P99 : 64.302878 P100 : 603088.000000 COUNT : 183819 SUM : 4095918 P100 write is now ~0.6s. Not good, but it's the same even if I completely bypass all the scanning code: $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 \| egrep 'db.db.write.micros\|micros/op' randomtransaction : 311.365 micros/op 3211 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0) rocksdb.db.write.micros P50 : 16.274362 P95 : 36.221184 P99 : 68.809783 P100 : 649808.000000 COUNT : 183819 SUM : 4156767 $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 \| egrep 'db.db.write.micros\|micros/op' randomtransaction : 308.395 micros/op 3242 ops/sec; 0.1 MB/s ( transactions:97999 aborts:0) rocksdb.db.write.micros P50 : 16.106222 P95 : 37.202403 P99 : 67.081875 P100 : 598091.000000 COUNT : 185714 SUM : 4098832 No substantial difference. Reviewed By: siying Differential Revision: D29738847 Pulled By: pdillinger fbshipit-source-id: 1c5c155f5a1b62e4fea0fd4eeb515a8b7474027b	2021-07-16 14:13:08 -07:00
Mark Rambacher	42ba60b3ba	Make EncryptionProvider and BlockCipher into Customizable objects (#8354 ) Summary: Made the EncryptionProvider and BlockCipher classes inherit from Customizable. Added/fixed the CreateFromString method to these classes to create instances from builtin or registered classes. Added tests to verify that instances can be registered and retrieved as appropriate. Added the ability to configure the builtin (CTR, ROT13) classes from configurable properties. Added the appropriate tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8354 Reviewed By: zhichao-cao Differential Revision: D29558949 Pulled By: mrambacher fbshipit-source-id: c20286b32d179777e060f51a58943e9b0cf81d04	2021-07-16 07:58:51 -07:00
Baptiste Lemaire	206845c057	Mempurge support for wal (#8528 ) Summary: In this PR, `mempurge` is made compatible with the Write Ahead Log: in case of recovery, the DB is now capable of recovering the data that was "mempurged" and kept in the `imm()` list of immutable memtables. The twist was to add a uint64_t to the `memtable` struct to store the number of the earliest log file containing entries from the `memtable`. When a `Flush` operation is replaced with a `MemPurge`, the `VersionEdit` (which usually contains the new min log file number to pick up for recovery and the level 0 file path of the newly created SST file) is no longer appended to the manifest log, and every time the `deleteWal` method is called, a check is made on the list of immutable memtables. This PR also includes a unit test that verifies that no data is lost upon Reopening of the database when the mempurge feature is activated. This extensive unit test includes two column families, with valid data contained in the imm() at time of "crash"/reopening (recovery). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8528 Reviewed By: pdillinger Differential Revision: D29701097 Pulled By: bjlemaire fbshipit-source-id: 072a900fb6ccc1edcf5eef6caf88f3060238edf9	2021-07-15 17:49:13 -07:00
longlijian	803a40d412	Delete legacy code not used any more. (#8508 ) Summary: The removed function in this PR, just only have declared and dose not have any reference used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8508 Reviewed By: mrambacher Differential Revision: D29649033 Pulled By: jay-zhuang fbshipit-source-id: df98143b73d6c184a2a60c9f7ea2548a065ee35d	2021-07-14 16:04:56 -07:00
hongrubb	870033291a	Fix Get() return status when block cache is disabled (#8485 ) Summary: This PR is for https://github.com/facebook/rocksdb/issues/8453 We need to update `s = biter.status();` when `biter.status().IsIncomplete()` is true. By doing this, can fix the problem in issue. Besides, we still need to update `db_statistics` in `get_context.ReportCounters()` before return back. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8485 Reviewed By: jay-zhuang Differential Revision: D29604835 Pulled By: ajkr fbshipit-source-id: c7f2f1cd058223ce1b507ec05d57cf264b9c9710	2021-07-13 18:13:24 -07:00
bjlemaire	955b80e84f	Add WARN/INFO for mempurge output status. (#8514 ) Summary: The MemPurge output status can either be an Abort if the mempurge is aborted due to the new_mem memtable reaching more than the target capacity (currently 60%), or for other reasons. As a result, in the log, we want to differentiate between an abort status, which in this PR only leads to a ROCKS_LOG_INFO, and any other status, which in this PR leads to a ROCKS_LOG_WARN. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8514 Reviewed By: pdillinger Differential Revision: D29662446 Pulled By: bjlemaire fbshipit-source-id: c9bec8e238ebc7ecb14fbbddf580e6887e281c16	2021-07-12 10:42:14 -07:00
Dmitry Vorobev	0b75b22321	Implement missing Handler methods in ColumnFamilyCollector. (#8456 ) Summary: When db is open as secondary, there are basically 2 step process: 1) Collect column families from wal log 2) Apply changes to Memtable In case primary db is TransactionDB instance, wal log will contain some additional data, like noop, etc. ColumnFamilyCollector doesn't implement methods to handle these, so it fails to open a wal log written by TransactionDB. (Everything works fine with standard DB::Open). Memtable recovery process knows how to handle such wal logs, so only missing piece seems to be ColumnFamilyCollector. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8456 Reviewed By: ajkr Differential Revision: D29455945 Pulled By: mrambacher fbshipit-source-id: 5b29560fcbc008e17e95d0dc4b07558f3d63e26f	2021-07-12 09:23:45 -07:00

1 2 3 4 5 ...

4752 Commits