rocksdb

Author	SHA1	Message	Date
Yanqin Jin	b2e53ab2d8	Add checking for `DB::DestroyColumnFamilyHandle()` (#9347 ) Summary: Closing https://github.com/facebook/rocksdb/issues/5006 Calling `DB::DestroyColumnFamilyHandle(column_family)` with `column_family` being the return value of `DB::DefaultColumnFamily()` will return `Status::InvalidArgument()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9347 Test Plan: make check Reviewed By: akankshamahajan15 Differential Revision: D33369675 Pulled By: riversand963 fbshipit-source-id: a8266a4daddf2b7a773c2dc7f3eb9a4adfb6b6dd	2022-01-05 20:26:53 -08:00
Andrew Kryczka	b860a42158	Recover to exact latest seqno of data committed to MANIFEST (#9305 ) Summary: The LastSequence field in the MANIFEST file is the baseline seqno for a recovered DB. Recovering WAL entries might cause the recovered DB's seqno to advance above this baseline, but the recovered DB will never use a smaller seqno. Before this PR, we were writing the DB's seqno at the time of LogAndApply() as the LastSequence value. This works in the sense that it is a large enough baseline for the recovered DB that it'll never overwrite any records in existing SST files. At the same time, it's arbitrarily larger than what's needed. This behavior comes from LevelDB, where there was no tracking of largest seqno in an SST file. Now we know the largest seqno of newly written SST files, so we can write an exact value in LastSequence that actually reflects the largest seqno in any file referred to by the MANIFEST. This is primarily useful for correctness testing with unsynced data loss, where the recovered DB's seqno needs to indicate what records were recovered. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9305 Test Plan: - https://github.com/facebook/rocksdb/issues/9338 adds crash-recovery correctness testing coverage for WAL disabled use cases - https://github.com/facebook/rocksdb/issues/9357 will extend that testing to cover file ingestion - Added assertion at end of LogAndApply() for `VersionSet::descriptor_last_sequence_` consistency with files - Manually tested upgrade/downgrade compatibility with a custom crash test that randomly picks between a `db_stress` built with and without this PR (for old code it must run with `-disable_wal=0`) Reviewed By: riversand963 Differential Revision: D33182770 Pulled By: ajkr fbshipit-source-id: 0bfafaf685f347cc8cb0e1d62e0186340a738f7d	2022-01-05 16:02:21 -08:00
mrambacher	fe31dc53ca	Make the Env class Customizable (#9293 ) Summary: Allows the Env to have options (Configurable) and loads like other Customizable classes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9293 Reviewed By: pdillinger, zhichao-cao Differential Revision: D33181591 Pulled By: mrambacher fbshipit-source-id: 55e823886c654d214eda9eedd45ccdc54dac14d7	2022-01-04 16:45:49 -08:00
Yanqin Jin	677d2b4a8f	Fix a bug in C-binding causing iterator to return incorrect result (#9343 ) Summary: Fixes https://github.com/facebook/rocksdb/issues/9339 When writing SST file, the name, computed as `prefix_extractor->GetId()` will be written to the properties block. When the SST is opened again in the future, `CreateFromString()` will take the name as argument and try to create a prefix extractor object. Without this fix, the C API will pass a `Wrapper` pointer to the underlying DB's `prefix_extractor`. `Wrapper::GetId()`, in this case, will be missing the prefix length component, causing a prefix extractor of length 0 to be silently created and used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9343 Test Plan: ``` make c_test ./c_test ``` Reviewed By: mrambacher Differential Revision: D33355549 Pulled By: riversand963 fbshipit-source-id: c92c3acd8be262c3bff8794b4229e42b9ee31203	2021-12-30 12:48:07 -08:00
Andrew Kryczka	aa2b3bf675	Added `TraceOptions::preserve_write_order` (#9334 ) Summary: This option causes trace records to be written in the serialized write thread. That way, the write records in the trace must follow the same order as writes that are logged to WAL and writes that are applied to the DB. By default I left it disabled to match existing behavior. I enabled it in `db_stress`, though, as that use case requires order of write records in trace matches the order in WAL. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9334 Test Plan: - See if below unsynced data loss crash test can run for 24h straight. It used to crash after a few hours when reaching an unlucky trace ordering. ``` DEBUG_LEVEL=0 TEST_TMPDIR=/dev/shm /usr/local/bin/python3 -u tools/db_crashtest.py blackbox --interval=10 --max_key=100000 --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152 --value_size_mult=33 --sync_fault_injection=1 --test_batches_snapshots=0 --duration=86400 ``` Reviewed By: zhichao-cao Differential Revision: D33301990 Pulled By: ajkr fbshipit-source-id: 82d97559727adb4462a7af69758449c8725b22d3	2021-12-28 15:04:26 -08:00
slk	2e5f764294	Make IncreaseFullHistoryTsLow to a public API (#9221 ) Summary: As discussed in https://github.com/facebook/rocksdb/issues/9210, full_history_ts_low is currently a member of CompactRangeOptions, which means a CF's fullHistoryTsLow is advanced only when users submit a CompactRange request. However, users may want to advance the fullHistoryTsLow without an immediate compaction. This merge makes IncreaseFullHistoryTsLow a public API so users can advance each CF's fullHistoryTsLow separately. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9221 Reviewed By: akankshamahajan15 Differential Revision: D33201106 Pulled By: riversand963 fbshipit-source-id: 9cb1d013ba93260f72e16353e693ffee167b47ee	2021-12-23 11:03:51 -08:00
Andrew Kryczka	538d2365e9	Fix race condition in BackupEngineTest.ChangeManifestDuringBackupCreation (#9327 ) Summary: The failure looked like this: ``` utilities/backupable/backupable_db_test.cc:3161: Failure Value of: db_chroot_env_->FileExists(prev_manifest_path).IsNotFound() Actual: false Expected: true ``` The failure could be coerced consistently with the following patch: ``` diff --git a/db/db_impl/db_impl_compaction_flush.cc b/db/db_impl/db_impl_compaction_flush.cc index 80410f671..637636791 100644 --- a/db/db_impl/db_impl_compaction_flush.cc +++ b/db/db_impl/db_impl_compaction_flush.cc @@ -2772,6 +2772,8 @@ void DBImpl::BackgroundCallFlush(Env::Priority thread_pri) { if (job_context.HaveSomethingToClean() || job_context.HaveSomethingToDelete() || !log_buffer.IsEmpty()) { mutex_.Unlock(); + bg_cv_.SignalAll(); + sleep(1); TEST_SYNC_POINT("DBImpl::BackgroundCallFlush:FilesFound"); // Have to flush the info logs before bg_flush_scheduled_-- // because if bg_flush_scheduled_ becomes 0 and the lock is ``` The cause was a familiar problem, which is manual flush/compaction may return before files they obsoleted are removed. The solution is just to wait for "scheduled" work to complete, which includes all phases including cleanup. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9327 Test Plan: after this PR, even the above patch to coerce the bug cannot cause the test to fail. Reviewed By: riversand963 Differential Revision: D33252208 Pulled By: ajkr fbshipit-source-id: 720a7eaca58c7247d221911fffe3d5e1dbf581e9	2021-12-22 21:59:53 -08:00
Andrew Kryczka	393fc231af	More asserts in listener_test for debuggability (#9320 ) Summary: We ran into a flake I could not debug so instead added assertions in case it happens again. Command was: ``` TEST_TMPDIR=/dev/shm/rocksdb COMPILE_WITH_UBSAN=1 USE_CLANG=1 OPT=-g SKIP_FORMAT_BUCK_CHECKS=1 make J=80 -j80 ubsan_check ``` Failure output was: ``` [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from EventListenerTest [ RUN ] EventListenerTest.DisableBGCompaction UndefinedBehaviorSanitizer:DEADLYSIGNAL ==1558126==ERROR: UndefinedBehaviorSanitizer: SEGV on unknown address 0x000000000031 (pc 0x7fd9c04dda22 bp 0x7fd9bf8aa580 sp 0x7fd9bf8aa540 T1558147) ==1558126==The signal is caused by a READ memory access. ==1558126==Hint: address points to the zero page. #0 0x7fd9c04dda21 in __dynamic_cast /home/engshare/third-party2/libgcc/9.x/src/gcc-9.x/x86_64-facebook-linux/libstdc++-v3/libsupc++/../../.././libstdc++-v3/libsupc++/dyncast.cc:49:3 #1 0x510c53 in __ubsan::checkDynamicType(void, void, unsigned long) (/data/sandcastle/boxes/eden-trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/listener_test+0x510c53) #2 0x50fb32 in HandleDynamicTypeCacheMiss(__ubsan::DynamicTypeCacheMissData, unsigned long, unsigned long, __ubsan::ReportOptions) (/data/sandcastle/boxes/eden-trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/listener_test+0x50fb32) #3 0x510230 in __ubsan_handle_dynamic_type_cache_miss_abort (/data/sandcastle/boxes/eden-trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/listener_test+0x510230) #4 0x63221a in rocksdb::ColumnFamilyHandleImpl rocksdb::static_cast_with_check<rocksdb::ColumnFamilyHandleImpl, rocksdb::ColumnFamilyHandle>(rocksdb::ColumnFamilyHandle) 
/data/sandcastle/boxes/trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/./util/cast_util.h:19:20 #5 0x71cafa in rocksdb::DBImpl::TEST_GetFilesMetaData(rocksdb::ColumnFamilyHandle, std::vector<std::vector<rocksdb::FileMetaData, std::allocator<rocksdb::FileMetaData> >, std::allocator<std::vector<rocksdb::FileMetaData, std::allocator<rocksdb::FileMetaData> > > >, std::vector<std::shared_ptr<rocksdb::BlobFileMetaData>, std::allocator<std::shared_ptr<rocksdb::BlobFileMetaData> > >) /data/sandcastle/boxes/trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/db/db_impl/db_impl_debug.cc:63:14 #6 0x53f6b4 in rocksdb::TestFlushListener::OnFlushCompleted(rocksdb::DB, rocksdb::FlushJobInfo const&) /data/sandcastle/boxes/trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/db/listener_test.cc:277:24 #7 0x6e2f7d in rocksdb::DBImpl::NotifyOnFlushCompleted(rocksdb::ColumnFamilyData, rocksdb::MutableCFOptions const&, std::__cxx11::list<std::unique_ptr<rocksdb::FlushJobInfo, std::default_delete<rocksdb::FlushJobInfo> >, std::allocator<std::unique_ptr<rocksdb::FlushJobInfo, std::default_delete<rocksdb::FlushJobInfo> > > >) /data/sandcastle/boxes/trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/db/db_impl/db_impl_compaction_flush.cc:863:19 #8 0x6e1074 in rocksdb::DBImpl::FlushMemTableToOutputFile(rocksdb::ColumnFamilyData, rocksdb::MutableCFOptions const&, bool, rocksdb::JobContext, rocksdb::SuperVersionContext, std::vector<unsigned long, std::allocator<unsigned long> >&, unsigned long, rocksdb::SnapshotChecker, rocksdb::LogBuffer, rocksdb::Env::Priority) /data/sandcastle/boxes/trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/db/db_impl/db_impl_compaction_flush.cc:314:5 #9 0x6e3412 in 
rocksdb::DBImpl::FlushMemTablesToOutputFiles(rocksdb::autovector<rocksdb::DBImpl::BGFlushArg, 8ul> const&, bool, rocksdb::JobContext, rocksdb::LogBuffer, rocksdb::Env::Priority) /data/sandcastle/boxes/trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/db/db_impl/db_impl_compaction_flush.cc:359:14 #10 0x700df6 in rocksdb::DBImpl::BackgroundFlush(bool, rocksdb::JobContext, rocksdb::LogBuffer, rocksdb::FlushReason, rocksdb::Env::Priority) /data/sandcastle/boxes/trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/db/db_impl/db_impl_compaction_flush.cc:2703:14 #11 0x6fe1f0 in rocksdb::DBImpl::BackgroundCallFlush(rocksdb::Env::Priority) /data/sandcastle/boxes/trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/db/db_impl/db_impl_compaction_flush.cc:2742:16 #12 0x6fc732 in rocksdb::DBImpl::BGWorkFlush(void) /data/sandcastle/boxes/trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/db/db_impl/db_impl_compaction_flush.cc:2569:44 #13 0xb3a820 in void std::_Bind<void ( (void))(void)>::operator()<void>() /mnt/gvfs/third-party2/libgcc/4959b39cfbe5965a37c861c4c327fa7c5c759b87/9.x/platform009/9202ce7/include/c++/9.x/functional:482:17 #14 0xb3a820 in std::_Function_handler<void (), std::_Bind<void (* (void))(void)> >::_M_invoke(std::_Any_data const&) /mnt/gvfs/third-party2/libgcc/4959b39cfbe5965a37c861c4c327fa7c5c759b87/9.x/platform009/9202ce7/include/c++/9.x/bits/std_function.h:300:2 #15 0xb347cc in rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long) /data/sandcastle/boxes/trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/util/threadpool_imp.cc:266:5 #16 0xb34a2f in rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*) 
/data/sandcastle/boxes/trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/util/threadpool_imp.cc:307:7 #17 0x7fd9c051a660 in execute_native_thread_routine /home/engshare/third-party2/libgcc/9.x/src/gcc-9.x/x86_64-facebook-linux/libstdc++-v3/src/c++11/../../../.././libstdc++-v3/src/c++11/thread.cc:80:18 #18 0x7fd9c041e20b in start_thread /home/engshare/third-party2/glibc/2.30/src/glibc-2.30/nptl/pthread_create.c:479:8 #19 0x7fd9c01dd16e in clone /home/engshare/third-party2/glibc/2.30/src/glibc-2.30/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:95 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/9320 Reviewed By: jay-zhuang Differential Revision: D33242185 Pulled By: ajkr fbshipit-source-id: 741984b10a610e0509e0d4e54c42cdbac03f5285	2021-12-21 12:27:54 -08:00
mrambacher	9a116ab4b4	Add NewMetaDataIterator method (#8692 ) Summary: Fixes a problem where the iterator for metadata was being treated as a non-user key when in fact it was a user key. This led to a problem where the property keys could not be searched for correctly. The main exposure of this problem was that the HashIndexReader could not get the "prefixes" property correctly, resulting in the failure of retrieval/creation of the BlockPrefixIndex. Added BlockBasedTableTest.SeekMetaBlocks test to validate this condition. Fixing this condition exposed two other tests (SeekWithPrefixLongerThanKey, MultiGetPrefixFilter) that passed incorrectly previously and now failed. Updated those two tests to pass. Not sure if the tests are functionally correct/still appropriate, but made them pass... Pull Request resolved: https://github.com/facebook/rocksdb/pull/8692 Reviewed By: riversand963 Differential Revision: D33119539 Pulled By: mrambacher fbshipit-source-id: 658969fe9265f73dc184dab97cc3f4eaed2d881a	2021-12-21 11:32:49 -08:00
Andrew Kryczka	782fcc44e1	Fix race condition in `error_handler_fs_test` (#9325 ) Summary: We saw the below assertion failure in `error_handler_fs_test`: ``` db/error_handler_fs_test.cc:2471: Failure Expected equality of these values: listener->new_bg_error() Which is: 16-byte object <00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00> Status::Aborted() Which is: 16-byte object <0A-00 00-00 60-61 00-00 00-00 00-00 00-00 00-00> terminate called after throwing an instance of 'testing::internal::GoogleTestFailureException' what(): db/error_handler_fs_test.cc:2471: Failure Expected equality of these values: listener->new_bg_error() Which is: 16-byte object <00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00> Status::Aborted() Which is: 16-byte object <0A-00 00-00 60-61 00-00 00-00 00-00 00-00 00-00> Received signal 6 (Aborted) ``` The problem was completing `OnErrorRecoveryCompleted()` would wake up the main thread and allow it to proceed to that assertion. But that assertion assumes `OnErrorRecoveryEnd()` has completed since only `OnErrorRecoveryEnd()` affects `new_bg_error()`. The fix is just to make `OnErrorRecoveryCompleted()` not wake up the main thread, by means of not implementing it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9325 Test Plan: - ran `while TEST_TMPDIR=/dev/shm ./error_handler_fs_test ; do : ; done` for a while - injected sleep between `OnErrorRecovery{Completed,End}()` callbacks, which guaranteed repro before this PR Reviewed By: anand1976 Differential Revision: D33249200 Pulled By: ajkr fbshipit-source-id: 1659ee183cd09f90d4dbd898f65103473fcf84a8	2021-12-20 23:16:52 -08:00
mrambacher	423538a816	Make MemoryAllocator into a Customizable class (#8980 ) Summary: - Make MemoryAllocator and its implementations into a Customizable class. - Added a "DefaultMemoryAllocator" which uses new and delete - Added a "CountedMemoryAllocator" that counts the number of allocs and free - Updated the existing tests to use these new allocators - Changed the memkind allocator test into a generic test that can test the various allocators. - Added tests for creating all of the allocators - Added tests to verify/create the JemallocNodumpAllocator using its options. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8980 Reviewed By: zhichao-cao Differential Revision: D32990403 Pulled By: mrambacher fbshipit-source-id: 6fdfe8218c10dd8dfef34344a08201be1fa95c76	2021-12-17 04:20:47 -08:00
Peter Dillinger	0050a73a4f	New stable, fixed-length cache keys (#9126 ) Summary: This change standardizes on a new 16-byte cache key format for block cache (incl compressed and secondary) and persistent cache (but not table cache and row cache). The goal is a really fast cache key with practically ideal stability and uniqueness properties without external dependencies (e.g. from FileSystem). A fixed key size of 16 bytes should enable future optimizations to the concurrent hash table for block cache, which is a heavy CPU user / bottleneck, but there appears to be measurable performance improvement even with no changes to LRUCache. This change replaces a lot of disjointed and ugly code handling cache keys with calls to a simple, clean new internal API (cache_key.h). (Preserving the old cache key logic under an option would be very ugly and likely negate the performance gain of the new approach. Complete replacement carries some inherent risk, but I think that's acceptable with sufficient analysis and testing.) The scheme for encoding new cache keys is complicated but explained in cache_key.cc. Also: EndianSwapValue is moved to math.h to be next to other bit operations. (Explains some new include "math.h".) ReverseBits operation added and unit tests added to hash_test for both. Fixes https://github.com/facebook/rocksdb/issues/7405 (presuming a root cause) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9126 Test Plan: ### Basic correctness Several tests needed updates to work with the new functionality, mostly because we are no longer relying on filesystem for stable cache keys so table builders & readers need more context info to agree on cache keys. This functionality is so core, a huge number of existing tests exercise the cache key functionality. 
### Performance Create db with `TEST_TMPDIR=/dev/shm ./db_bench -bloom_bits=10 -benchmarks=fillrandom -num=3000000 -partition_index_and_filters` And test performance with `TEST_TMPDIR=/dev/shm ./db_bench -readonly -use_existing_db -bloom_bits=10 -benchmarks=readrandom -num=3000000 -duration=30 -cache_index_and_filter_blocks -cache_size=250000 -threads=4` using DEBUG_LEVEL=0 and simultaneous before & after runs. Before ops/sec, avg over 100 runs: 121924 After ops/sec, avg over 100 runs: 125385 (+2.8%) ### Collision probability I have built a tool, ./cache_bench -stress_cache_key to broadly simulate host-wide cache activity over many months, by making some pessimistic simplifying assumptions: * Every generated file has a cache entry for every byte offset in the file (contiguous range of cache keys) * All of every file is cached for its entire lifetime We use a simple table with skewed address assignment and replacement on address collision to simulate files coming & going, with quite a variance (super-Poisson) in ages. Some output with `./cache_bench -stress_cache_key -sck_keep_bits=40`: ``` Total cache or DBs size: 32TiB Writing 925.926 MiB/s or 76.2939TiB/day Multiply by 9.22337e+18 to correct for simulation losses (but still assume whole file cached) ``` These come from default settings of 2.5M files per day of 32 MB each, and `-sck_keep_bits=40` means that to represent a single file, we are only keeping 40 bits of the 128-bit cache key. With file size of 2^25 contiguous keys (pessimistic), our simulation is about 2^(128-40-25) or about 9 billion billion times more prone to collision than reality. 
More default assumptions, relatively pessimistic: * 100 DBs in same process (doesn't matter much) * Re-open DB in same process (new session ID related to old session ID) on average every 100 files generated * Restart process (all new session IDs unrelated to old) 24 times per day After enough data, we get a result at the end: ``` (keep 40 bits) 17 collisions after 2 x 90 days, est 10.5882 days between (9.76592e+19 corrected) ``` If we believe the (pessimistic) simulation and the mathematical generalization, we would need to run a billion machines all for 97 billion days to expect a cache key collision. To help verify that our generalization ("corrected") is robust, we can make our simulation more precise with `-sck_keep_bits=41` and `42`, which takes more running time to get enough data: ``` (keep 41 bits) 16 collisions after 4 x 90 days, est 22.5 days between (1.03763e+20 corrected) (keep 42 bits) 19 collisions after 10 x 90 days, est 47.3684 days between (1.09224e+20 corrected) ``` The generalized prediction still holds. With the `-sck_randomize` option, we can see that we are beating "random" cache keys (except offsets still non-randomized) by a modest amount (roughly 20x less collision prone than random), which should make us reasonably comfortable even in "degenerate" cases: ``` 197 collisions after 1 x 90 days, est 0.456853 days between (4.21372e+18 corrected) ``` I've run other tests to validate other conditions behave as expected, never behaving "worse than random" unless we start chopping off structured data. Reviewed By: zhichao-cao Differential Revision: D33171746 Pulled By: pdillinger fbshipit-source-id: f16a57e369ed37be5e7e33525ace848d0537c88f	2021-12-16 17:15:13 -08:00
Akanksha Mahajan	96d0773a11	Update prepopulate_block_cache logic to support block-based filter (#9300 ) Summary: Update prepopulate_block_cache logic to support block-based filter during insertion in block cache Pull Request resolved: https://github.com/facebook/rocksdb/pull/9300 Test Plan: CircleCI tests, make crash_test -j64 Reviewed By: pdillinger Differential Revision: D33132018 Pulled By: akankshamahajan15 fbshipit-source-id: 241deabab8645bda704728e572d6de6354df18b2	2021-12-15 13:20:27 -08:00
Yanqin Jin	08721293ea	Fix a bug causing duplicate trailing entries in WritableFile (buffered IO) (#9236 ) Summary: `db_stress` is a user of `FaultInjectionTestFS`. After injecting a write error, `db_stress` probabilistically determines data drop (https://github.com/facebook/rocksdb/blob/6.27.fb/db_stress_tool/db_stress_test_base.cc#L2615:L2619). In some of our recent runs of `db_stress`, we found duplicate trailing entries corresponding to file trivial move in the MANIFEST, causing the recovery to fail, because the file move operation is not idempotent: you cannot delete a file from a given level twice. Investigation suggests that data buffering in both `WritableFileWriter` and `FaultInjectionTestFS` may be the root cause. WritableFileWriter buffers data to write in a memory buffer, `WritableFileWriter::buf_`. After each `WriteBuffered()`/`WriteBufferedWithChecksum()` succeeds, the `buf_` is cleared. If the underlying file `WritableFileWriter::writable_file_` is opened in buffered IO mode, then `FaultInjectionTestFS` buffers data written for each file until next file sync. After an injected error, a user of `FaultInjectionTestFS` can choose to drop some or none of the previously buffered data. If `db_stress` does not drop any unsynced data, then such data will still exist in the `FaultInjectionTestFS`'s buffer. The existing implementation of `WritableFileWriter::WriteBuffered()` does not clear `buf_` if there is an error. This may lead to the data being buffered in two copies: one in `WritableFileWriter`, and another in `FaultInjectionTestFS`. We also know that the `WritableFileWriter` of the MANIFEST file will close upon an error. During `Close()`, it will flush the content in `buf_`. If no write error is injected to `FaultInjectionTestFS` this time, then we end up with two copies of the data appended to the file. To fix, we clear the `WritableFileWriter::buf_` upon failure as well. We focus this PR on files opened in non-direct mode. 
This PR includes a unit test to reproduce a case when write error injection to `WritableFile` can cause duplicate trailing entries. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9236 Test Plan: make check Reviewed By: zhichao-cao Differential Revision: D33033984 Pulled By: riversand963 fbshipit-source-id: ebfa5a0db8cbf1ed73100528b34fcba543c5db31	2021-12-13 09:00:36 -08:00
Akanksha Mahajan	eca85cdb66	Fix flaky tests related to Blob file deletions (#9287 ) Summary: CompactRange() only waits for manual.done to be set which happens as soon as new version is installed. Added TEST_WaitForCompact() which waits for compaction thread to actually finish which is after PurgeObsoleteFiles(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/9287 Test Plan: Reproducible by adding `bg_cv_.SignalAll();` inside if condition `297d913275/db/db_impl/db_impl_compaction_flush.cc (L2876)` Reviewed By: ajkr Differential Revision: D33051122 Pulled By: akankshamahajan15 fbshipit-source-id: cd793c79efb8cf8587faaf89f7c51f5d8e5bb71d	2021-12-12 15:31:38 -08:00
Yanqin Jin	5455cacd18	Fix link error reported in issue 9272 (#9278 ) Summary: As title, Closes https://github.com/facebook/rocksdb/issues/9272 Since TimestampAssigner-related classes need to access `WriteBatch::ProtectionInfo` objects, which are for internal use only, it's difficult to make `AssignTimestamp` methods a template and put them in the same public header, `include/rocksdb/write_batch.h`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9278 Test Plan: ``` make check # Also manually test following the repro-steps in issue 9272 ``` Reviewed By: ltamasi Differential Revision: D33012686 Pulled By: riversand963 fbshipit-source-id: 89f24a86a1170125bd0b94ef3b32e69aa08bd949	2021-12-10 20:33:46 -08:00
Yanqin Jin	bd513fd075	Add commit marker with timestamp (#9266 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9266 This diff adds a new tag `CommitWithTimestamp`. Currently, there is no API to trigger writing this tag to WAL, thus it is unavailable to users. This is an ongoing effort to add user-defined timestamp support to write-committed transactions. This diff also indicates all column families that may potentially participate in the same transaction must either disable timestamp or have the same timestamp format, since `CommitWithTimestamp` tag is followed by a single byte-array denoting the commit timestamp of the transaction. We will enforce this checking in a future diff. We keep this diff small. Reviewed By: ltamasi Differential Revision: D31721350 fbshipit-source-id: e1450811443647feb6ca01adec4c8aaae270ffc6	2021-12-10 11:05:35 -08:00
Peter Dillinger	653c392e47	More refactoring ahead of footer & meta changes (#9240 ) Summary: I'm working on a new format_version=6 to support context checksum (https://github.com/facebook/rocksdb/issues/9058) and this includes much of the refactoring and test updates to support that change. Test coverage data and manual inspection agree on dead code in block_based_table_reader.cc (removed). Pull Request resolved: https://github.com/facebook/rocksdb/pull/9240 Test Plan: tests enhanced to cover more cases etc. Extreme case performance testing indicates small % regression in fillseq (w/ compaction), though CPU profile etc. doesn't suggest any explanation. There is enhanced correctness checking in Footer::DecodeFrom, but this should be negligible. TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=1 --disable_wal={false,true} (Each is ops/s averaged over 50 runs, run simultaneously with competing configuration for load fairness) Before w/ wal: 454512 After w/ wal: 444820 (-2.1%) Before w/o wal: 1004560 After w/o wal: 998897 (-0.6%) Since this doesn't modify WAL code, one would expect real effects to be larger in w/o wal case. This regression will be corrected in a follow-up PR. Reviewed By: ajkr Differential Revision: D32813769 Pulled By: pdillinger fbshipit-source-id: 444a244eabf3825cd329b7d1b150cddce320862f	2021-12-10 08:13:26 -08:00
mrambacher	5486717ee2	Fix an issue with MemTableRepFactory::CreateFromString (#9273 ) Summary: If ignore_unsupported_options=true, then it is possible for MemTableRepFactory::CreateFromString to succeed without setting a result (result=nullptr). This would cause the original value to be overwritten with null and an error would be raised later when PrepareOptions is invoked. Added unit test for this condition. Will add (in another PR unless required by reviewers) comparable tests for all of the other Customizable classes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9273 Reviewed By: ltamasi Differential Revision: D32990365 Pulled By: mrambacher fbshipit-source-id: b150724c3f5ae7346357b3866244fd93466875c7	2021-12-09 12:36:18 -08:00
Si Ke	79f4a04ee3	Get DBTest passing Assert Status Checked (#7737 ) Summary: Closes https://github.com/facebook/rocksdb/pull/7737 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9231 Reviewed By: hx235 Differential Revision: D32978332 Pulled By: pdillinger fbshipit-source-id: b28900b685d60c668529a90dbaa8e1b357b28f76	2021-12-09 11:00:17 -08:00
anand76	ecf2bec613	Add a listener callback for end of auto error recovery (#9244 ) Summary: Previously, the OnErrorRecoveryCompleted callback was called when RocksDB was able to successfully recover from a retryable error. However, if the recovery failed and was eventually stopped, there was no indication of the status. To fix that, a new OnErrorRecoveryEnd callback is introduced that deprecates the OnErrorRecoveryCompleted callback. The new callback is called with the original error and the new error status. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9244 Test Plan: Add a new unit test in error_handler_fs_test Reviewed By: zhichao-cao Differential Revision: D32922303 Pulled By: anand1976 fbshipit-source-id: f04e77a9cb92c5ea6385590682d3fcf559971b99	2021-12-08 14:30:57 -08:00
Akanksha Mahajan	9e4d56f2c9	Fix segmentation fault in table_options.prepopulate_block_cache when used with partition_filters (#9263 ) Summary: When table_options.prepopulate_block_cache is set to BlockBasedTableOptions::PrepopulateBlockCache::kFlushOnly and table_options.partition_filters is also set to true, there is a segmentation fault when the top-level filter is fetched, because it is entered in the cache with the wrong type. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9263 Test Plan: Updated unit tests; Ran db_stress: make crash_test -j32 Reviewed By: pdillinger Differential Revision: D32936566 Pulled By: akankshamahajan15 fbshipit-source-id: 8bd79e53830d3e3c1bb79787e1ffbc3cb46d4426	2021-12-08 12:44:38 -08:00
Levi Tamasi	94d99400dc	Fix a typo in DBSSTTest.DBWithMaxSpaceAllowedWithBlobFiles (#9270 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9270 Test Plan: ``` gtest-parallel --repeat=10000 ./db_sst_test --gtest_filter=DBSSTTest.DBWithMaxSpaceAllowedWithBlobFiles ``` Reviewed By: akankshamahajan15 Differential Revision: D32958154 Pulled By: ltamasi fbshipit-source-id: b6ec2fbbece80d73c567cec57638dffd3c84a2ba	2021-12-08 12:05:37 -08:00
Levi Tamasi	d1f053b0ae	Attempt to deflake DBSSTTest.DestroyDBWithRateLimitedDelete (#9269 ) Summary: This test case seems to be occasionally failing due to the code hitting the immediate deletion branch in `DeleteScheduler::DeleteFile`. The patch increases the allowed trash ratio to a huge value to prevent this from happening. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9269 Test Plan: ``` gtest-parallel --repeat=10000 ./db_sst_test --gtest_filter=DBSSTTest.DestroyDBWithRateLimitedDelete ``` Reviewed By: akankshamahajan15 Differential Revision: D32956596 Pulled By: ltamasi fbshipit-source-id: 3945e7c1c19ede76698e03c3f133bc1d9fd61b84	2021-12-08 11:16:46 -08:00
sdong	88875df821	File temperature information should be preserved when restart the DB (#9242 ) Summary: Fix a bug that causes file temperature not to be preserved after the DB is restarted or options.max_manifest_file_size is hit. Also, pass temperature information to NewRandomAccessFile() to allow users to hack a solution where they don't preserve tiering information. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9242 Test Plan: Add a unit test that would fail without the fix. Reviewed By: jay-zhuang Differential Revision: D32818150 fbshipit-source-id: 36aa3f148c60107f7b8e9d65b63b039f9e1a1eec	2021-12-03 14:43:14 -08:00
Levi Tamasi	930f2e92e6	Attempt to deflake DBSSTTest.DBWithSFMForBlobFilesAtomicFlush (#9241 ) Summary: When using the SST file manager, the actual deletion of DB files potentially occurs in the background. The patch adds another call to `SstFileManagerImpl::WaitForEmptyTrash` to the test case `DBSSTTest.DBWithSFMForBlobFilesAtomicFlush` to ensure the deletions are performed before the test checks the number of deleted files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9241 Test Plan: ``` gtest-parallel --repeat=1000 ./db_sst_test --gtest_filter=DBSSTTest.DBWithSFMForBlobFilesAtomicFlush ``` Reviewed By: akankshamahajan15 Differential Revision: D32811427 Pulled By: ltamasi fbshipit-source-id: 7f2ad649a22bd2d7900e5f132372034093cfcf47	2021-12-02 16:54:21 -08:00
lgqss	77c7085594	MemTableList::TrimHistory now use allocated bytes (#9020 ) Summary: Fix a bug when both max_write_buffer_size_to_maintain and max_write_buffer_number_to_maintain are 0. The bug was introduced in 6.5.0 and https://github.com/facebook/rocksdb/issues/5022. Fix https://github.com/facebook/rocksdb/issues/8371 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9020 Reviewed By: pdillinger Differential Revision: D32767084 Pulled By: ajkr fbshipit-source-id: c401ee6e2557230e892d0fe8abb4966cbd18e85f	2021-12-02 11:45:39 -08:00
Hui Xiao	9daf07305c	Replace TableProperties::properties_offsets map with external_sst_file_global_seqno_offset (#9212 ) Summary: Context: Searching `TableProperties::properties_offsets` across the codebase reveals that internally it is only used to find the external SST file's global seqno offset. Therefore we can narrow it down and replace this map property with a uint64_t property `external_sst_file_global_seqno_offset` to save memory usage related to table properties. Note: - See PR comments for discussion about potential impact on existing external usage of `TableProperties::properties_offsets` - See PR comments for discussion on keeping external SST file global seqno's offset VS using a simple flag indicating seqno's existence. Summary: - Replaced `TableProperties::properties_offsets` with `TableProperties::external_sst_file_global_seqno_offset` Pull Request resolved: https://github.com/facebook/rocksdb/pull/9212 Test Plan: - Relying on existing tests should be sufficient, since `TableProperties::properties_offsets` existed before and should already be tested. Reviewed By: ajkr Differential Revision: D32665941 Pulled By: hx235 fbshipit-source-id: 718e44617346dc4f3b1276ee953e61c196277795	2021-12-02 08:30:36 -08:00
mrambacher	7cd5835a28	Make RateLimiter Customizable (#9141 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9141 Reviewed By: zhichao-cao Differential Revision: D32432190 Pulled By: mrambacher fbshipit-source-id: 7930ed88a02412128cd407b5063522484e45c6ce	2021-12-01 06:57:02 -08:00
Yanqin Jin	924616526a	Update WriteBatch::AssignTimestamp() and Add (#9205 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9205 Update WriteBatch::AssignTimestamp() APIs so that they take an additional argument, i.e. a function object called `checker` indicating the user-specified logic of performing checks on timestamp sizes. WriteBatch is a building block used by multiple other RocksDB components, each of which may track timestamp information in different data structures. For example, transaction can either write to `WriteBatchWithIndex` which is a `WriteBatch` with index, or write directly to raw `WriteBatch` if `Transaction::DisableIndexing()` is called. `WriteBatchWithIndex` keeps mapping from column family id to comparator, and transaction needs to keep similar information for the `WriteBatch` if user calls `Transaction::DisableIndexing()` (dynamically) so that we will know the size of each timestamp later. The bookkeeping info maintained by `WriteBatchWithIndex` and `Transaction` should not overlap. When we later call `WriteBatch::AssignTimestamp()`, we need to use these data structures to guarantee that we do not accidentally assign timestamps for keys from column families that disable timestamp. Reviewed By: ltamasi Differential Revision: D31735186 fbshipit-source-id: 8b1709ed880ac72f995aa9e012e5873b290840a7	2021-11-30 22:33:00 -08:00
Artem Krylysov	552256cb1a	Add rocksdb_livefiles_column_family_name C API (#9232 ) Summary: Extend C API to add new function `rocksdb_livefiles_column_family_name`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9232 Reviewed By: akankshamahajan15 Differential Revision: D32736516 Pulled By: ajkr fbshipit-source-id: a854256a0f4652c903ab5ad8355ded051ac19987	2021-11-30 16:54:27 -08:00
leipeng	c712b68f5b	Fix num files in single compaction for universal compaction (#9168 ) Summary: https://github.com/facebook/rocksdb/issues/9026 fixed histogram NUM_FILES_IN_SINGLE_COMPACTION for level compaction, but missed the fix for universal compaction. This PR fixes NUM_FILES_IN_SINGLE_COMPACTION for universal compaction. Quote from https://github.com/facebook/rocksdb/issues/9026: > currently histogram `NUM_FILES_IN_SINGLE_COMPACTION` just counted files in first level of compaction input, this fix counts files in all levels of compaction input. Thanks to ajkr for pointing out this missed fix! Pull Request resolved: https://github.com/facebook/rocksdb/pull/9168 Reviewed By: akankshamahajan15 Differential Revision: D32434494 Pulled By: ajkr fbshipit-source-id: 93ea092af4afbd8dce67898ffb350cf26b065ed2	2021-11-30 15:11:21 -08:00
Peter Dillinger	2a67d475f1	Fix bug affecting GetSortedWalFiles, Backups, Checkpoint (#9208 ) Summary: Saw error like this: `Backup failed -- IO error: No such file or directory: While opening a file for sequentially reading: /dev/shm/rocksdb/rocksdb_crashtest_blackbox/004426.log: No such file or directory` Unfortunately, GetSortedWalFiles (used by Backups, Checkpoint, etc.) relies on no file deletions happening while it's operating, which means not only disabling (more) deletions, but ensuring any pending deletions are completed. Two fixes related to this: * There was a gap in several places between decrementing pending_purge_obsolete_files_ and incrementing bg_purge_scheduled_ where the db mutex would be released and GetSortedWalFiles (and others) could get false information that no deletions are pending. * The fix to https://github.com/facebook/rocksdb/issues/8591 (disabling deletions in GetSortedWalFiles) seems incomplete because it doesn't prevent pending deletions from occurring during the operation (if deletions not already disabled, the case that was to be fixed by the change). Pull Request resolved: https://github.com/facebook/rocksdb/pull/9208 Test Plan: existing tests (it's hard to write a test for interleavings that are now excluded - this is what stress test is for) Reviewed By: ajkr Differential Revision: D32630675 Pulled By: pdillinger fbshipit-source-id: a121e3da648de130cd24d44c524232f4eb22f178	2021-11-24 14:52:00 -08:00
Yanqin Jin	12e98add68	Print file checksum in hex (#9196 ) Summary: Printing file checksum (usually an integer) in non-hex format is barely useful. To make matters worse, it can mess with the output format. If you use `less` to view the output of `ldb manifest_dump`, a non-hex file checksum can cause `less` not to function as expected. Also output some additional fields to json output. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9196 Test Plan: manually test `ldb manifest_dump`. Reviewed By: ajkr Differential Revision: D32590253 Pulled By: riversand963 fbshipit-source-id: de434b7e60dd05b0b7cb76eff2240b21f9ae4b32	2021-11-22 09:30:47 -08:00
Yanqin Jin	43ac7a2774	Fix an assertion failure when ManifestTailer switches to new Manifest in multi-cf mode (#9143 ) Summary: The original unit test fails to test the case of multi-cf mode switching to a new manifest. The assertion failure triggers when the primary instance reopens and the secondary continues to tail the newly-created MANIFEST. Fix the assertion failure and update existing unit tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9143 Test Plan: make check Reviewed By: ltamasi Differential Revision: D32574233 Pulled By: riversand963 fbshipit-source-id: 857ddbe994019091276458abebcf8e2b65340468	2021-11-19 19:53:40 -08:00
Levi Tamasi	dc5de45af8	Support readahead during compaction for blob files (#9187 ) Summary: The patch adds a new BlobDB configuration option `blob_compaction_readahead_size` that can be used to enable prefetching data from blob files during compaction. This is important when using storage with higher latencies like HDDs or remote filesystems. If enabled, prefetching is used for all cases when blobs are read during compaction, namely garbage collection, compaction filters (when the existing value has to be read from a blob file), and `Merge` (when the value of the base `Put` is stored in a blob file). Pull Request resolved: https://github.com/facebook/rocksdb/pull/9187 Test Plan: Ran `make check` and the stress/crash test. Reviewed By: riversand963 Differential Revision: D32565512 Pulled By: ltamasi fbshipit-source-id: 87be9cebc3aa01cc227bec6b5f64d827b8164f5d	2021-11-19 17:53:47 -08:00
Peter Dillinger	cd4ea675e3	Fix backward compatibility with 2.5 through 2.7 (#9189 ) Summary: A bug in https://github.com/facebook/rocksdb/issues/9163 can cause checksum verification to fail if parsing a properties block fails. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9189 Test Plan: check_format_compatible.sh (never quite works locally but this particular case seems fixed using variants of SHORT_TEST=1). And added new unit test case. Reviewed By: ajkr Differential Revision: D32574626 Pulled By: pdillinger fbshipit-source-id: 6fa5c8595737b71a3c3d011a52daf6d6c08715d7	2021-11-19 17:31:01 -08:00
Jay Zhuang	6cde8d2190	Deprecating `iter_start_seqnum` and `preserve_deletes` (#9091 ) Summary: `ReadOptions::iter_start_seqnum` and `DBOptions::preserve_deletes` are deprecated, please try using user defined timestamp feature instead. The feature is used to support differential snapshots, but not well maintained (https://github.com/facebook/rocksdb/issues/6837, https://github.com/facebook/rocksdb/issues/8472) and the interface is not user friendly which returns an internal key from the iterator. The user defined timestamp feature is a more flexible feature to support similar usecase, please switch to that if you have such usecase. The deprecated feature will be removed in a future release. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9091 Test Plan: check LOG Fix https://github.com/facebook/rocksdb/issues/9090 Reviewed By: ajkr Differential Revision: D32071750 Pulled By: jay-zhuang fbshipit-source-id: b882c4668dd1bf26ce03c4c192f1bba584bf6104	2021-11-19 16:55:45 -08:00
slk	e12753eb71	Track each SST's timestamp information as user properties (#9093 ) Summary: Track each SST's timestamp information as user properties https://github.com/facebook/rocksdb/issues/8959 RocksDB supports the user-defined timestamp feature: an application can specify a timestamp when writing each k-v pair. When data is flushed from memory to disk, it is written to SST files. Each SST file consists of multiple data blocks and several metadata blocks. Among the metadata blocks, there is one called the Properties block that tracks some pre-defined properties of the SST file. This PR collects the min and max timestamps of all keys in the file as properties. With those properties, it is easier to tell whether the keys in an SST file have timestamps. The changes are as follows: 1) Add a class TimestampTablePropertiesCollector to collect the min/max timestamp when keys are added to a table. The way TimestampTablePropertiesCollector compares key timestamps is defined by the user, by implementing the Comparator::CompareTimestamp function in the user-defined comparator. 2) Add corresponding unit tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9093 Reviewed By: ltamasi Differential Revision: D32406927 Pulled By: riversand963 fbshipit-source-id: 25922971b7e67bacf4d53a1fb67c4c5ddaa61573	2021-11-19 11:37:06 -08:00
sdong	12117b26a3	Fix flaky DBTest2.RateLimitedCompactionReads (#9185 ) Summary: DBTest2.RateLimitedCompactionReads sometimes shows the following failure: what(): db/db_test2.cc:3976: Failure Expected equality of these values: i + 1 Which is: 4 NumTableFilesAtLevel(0) Which is: 0 The assertion itself doesn't appear to be correct. Fix it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9185 Test Plan: Removing an assertion shouldn't break anything. Reviewed By: ajkr Differential Revision: D32549530 fbshipit-source-id: 9993372d8af89161f903337a13f3e316e690a6b8	2021-11-19 10:08:59 -08:00
Yanqin Jin	1e8322c0f5	Fix a bug in FlushJob picking more memtables beyond synced WALs (#9142 ) Summary: After RocksDB 6.19 and before this PR, RocksDB FlushJob may pick more memtables to flush beyond synced WALs. This can be problematic if there are multiple column families, since it can prematurely advance the flushed column family's log_number. Should subsequent attempts fail to sync the latest WALs and the database goes through a recovery, it may detect corrupted WAL number below the flushed column family's log number and complain about column family inconsistency. To fix, we record the maximum memtable ID of the column family being flushed. Then we call SyncClosedLogs() so that all closed WALs at the time when memtable ID is recorded will be synced. I also disabled a unit test temporarily due to reasons described in https://github.com/facebook/rocksdb/issues/9151 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9142 Test Plan: make check Reviewed By: ajkr Differential Revision: D32299956 Pulled By: riversand963 fbshipit-source-id: 0da75888177d91905cf8c9d00605b73afb5970a7	2021-11-19 09:56:00 -08:00
Andrew Kryczka	8cf4294e25	Adhere to per-DB concurrency limit when bottom-pri compactions exist (#9179 ) Summary: - Fixed bug where bottom-pri manual compactions were counting towards `bg_compaction_scheduled_` instead of `bg_bottom_compaction_scheduled_`. It seems to have no negative effect. - Fixed bug where automatic compaction scheduling did not consider `bg_bottom_compaction_scheduled_`. Now automatic compactions cannot be scheduled that exceed the per-DB compaction concurrency limit (`max_compactions`) when some existing compactions are bottommost. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9179 Test Plan: new unit test for manual/automatic. Also verified the existing automatic/automatic test ("ConcurrentBottomPriLowPriCompactions") hanged until changing it to explicitly enable concurrency. Reviewed By: riversand963 Differential Revision: D32488048 Pulled By: ajkr fbshipit-source-id: 20c4c0693678e81e43f85ed3cc3402fcf26e3310	2021-11-18 17:31:50 -08:00
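The scheduling rule described in the commit above can be modeled in a few lines. This is a minimal sketch, not RocksDB source: the function name `can_schedule_automatic` is hypothetical, and the parameter names only mirror the counters and limit named in the commit message.

```python
def can_schedule_automatic(bg_compaction_scheduled,
                           bg_bottom_compaction_scheduled,
                           max_compactions):
    """Toy model: may one more automatic compaction be scheduled?

    Both low-pri and bottom-pri compactions count against the per-DB
    limit, which is the behavior the commit message describes.
    """
    in_flight = bg_compaction_scheduled + bg_bottom_compaction_scheduled
    return in_flight < max_compactions


# With a limit of 2, one low-pri plus one bottom-pri compaction already
# fills the budget, so a further automatic compaction must wait.
blocked = can_schedule_automatic(1, 1, 2)
allowed = can_schedule_automatic(1, 0, 2)
```

Before the fix, the check would have ignored the bottom-pri count, so the first call in this model would have been allowed to schedule.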
Peter Dillinger	230660be73	Improve / clean up meta block code & integrity (#9163 ) Summary: * Checksums are now checked on meta blocks unless specifically suppressed or not applicable (e.g. plain table). (Was other way around.) This means a number of cases that were not checking checksums now are, including direct read TableProperties in Version::GetTableProperties (fixed in meta_blocks ReadTableProperties), reading any block from PersistentCache (fixed in BlockFetcher), read TableProperties in SstFileDumper (ldb/sst_dump/BackupEngine) before table reader open, maybe more. * For that to work, I moved the global_seqno+TableProperties checksum logic to the shared table/ code, because that is used by many utilities such as SstFileDumper. * Also for that to work, we have to know when we're dealing with a block that has a checksum (trailer), so added that capability to Footer based on magic number, and from there BlockFetcher. * Knowledge of trailer presence has also fixed a problem where other table formats were reading blocks including bytes for a non-existent trailer--and awkwardly kind-of not using them, e.g. no shared code checking checksums. (BlockFetcher compression type was populated incorrectly.) Now we only read what is needed. * Minimized code duplication and differing/incompatible/awkward abstractions in meta_blocks.{cc,h} (e.g. SeekTo in metaindex block without parsing block handle) * Moved some meta block handling code from table_properties. * Moved some code specific to block-based table from shared table/ code to BlockBasedTable class. The checksum stuff means we can't completely separate it, but things that don't need to be in shared table/ code should not be. * Use unique_ptr rather than raw ptr in more places. (Note: you can std::move from unique_ptr to shared_ptr.) Without enhancements to GetPropertiesOfAllTablesTest (see below), net reduction of roughly 100 lines of code. 
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9163 Test Plan: existing tests and * Enhanced DBTablePropertiesTest.GetPropertiesOfAllTablesTest to verify that checksums are now checked on direct read of table properties by TableCache (new test would fail before this change) * Also enhanced DBTablePropertiesTest.GetPropertiesOfAllTablesTest to test putting table properties under old meta name * Also generally enhanced that same test to actually test what it was supposed to be testing already, by kicking things out of table cache when we don't want them there. Reviewed By: ajkr, mrambacher Differential Revision: D32514757 Pulled By: pdillinger fbshipit-source-id: 507964b9311d186ae8d1131182290cbd97a99fa9	2021-11-18 11:43:44 -08:00
Hui Xiao	74544d582f	Account Bloom/Ribbon filter construction memory in global memory limit (#9073 ) Summary: Note: This PR is the 4th part of a bigger PR stack (https://github.com/facebook/rocksdb/pull/9073) and will rebase/merge only after the first three PRs (https://github.com/facebook/rocksdb/pull/9070, https://github.com/facebook/rocksdb/pull/9071, https://github.com/facebook/rocksdb/pull/9130) merge. Context: Similar to https://github.com/facebook/rocksdb/pull/8428, this PR is to track memory usage during (new) Bloom Filter (i.e,FastLocalBloom) and Ribbon Filter (i.e, Ribbon128) construction, moving toward the goal of [single global memory limit using block cache capacity](https://github.com/facebook/rocksdb/wiki/Projects-Being-Developed#improving-memory-efficiency). It also constrains the size of the banding portion of Ribbon Filter during construction by falling back to Bloom Filter if that banding is, at some point, larger than the available space in the cache under `LRUCacheOptions::strict_capacity_limit=true`. The option to turn on this feature is `BlockBasedTableOptions::reserve_table_builder_memory = true` which by default is set to `false`. We [decided](https://github.com/facebook/rocksdb/pull/9073#discussion_r741548409) not to have separate option for separate memory user in table building therefore their memory accounting are all bundled under one general option. Summary: - Reserved/released cache for creation/destruction of three main memory users with the passed-in `FilterBuildingContext::cache_res_mgr` during filter construction: - hash entries (i.e`hash_entries`.size(), we bucket-charge hash entries during insertion for performance), - banding (Ribbon Filter only, `bytes_coeff_rows` +`bytes_result_rows` + `bytes_backtrack`), - final filter (i.e, `mutable_buf`'s size). 
- Implementation details: in order to use `CacheReservationManager::CacheReservationHandle` to account final filter's memory, we have to store the `CacheReservationManager` object and `CacheReservationHandle` for final filter in `XXPH3BitsFilterBuilder` as well as explicitly delete the filter bits builder when done with the final filter in block based table. - Added option to run `filter_bench` with this memory reservation feature Pull Request resolved: https://github.com/facebook/rocksdb/pull/9073 Test Plan: - Added new tests in `db_bloom_filter_test` to verify filter construction peak cache reservation under combination of `BlockBasedTable::Rep::FilterType` (e.g, `kFullFilter`, `kPartitionedFilter`), `BloomFilterPolicy::Mode`(e.g, `kFastLocalBloom`, `kStandard128Ribbon`, `kDeprecatedBlock`) and `BlockBasedTableOptions::reserve_table_builder_memory` - To address the concern for slow test: tests with memory reservation under `kFullFilter` + `kStandard128Ribbon` and `kPartitionedFilter` take around 3000 - 6000 ms and others take around 1500 - 2000 ms, in total adding 20000 - 25000 ms to the test suite running locally - Added new test in `bloom_test` to verify Ribbon Filter fallback on large banding in FullFilter - Added test in `filter_bench` to verify that this feature does not significantly slow down Bloom/Ribbon Filter construction speed. 
Local result averaged over 20 runs, as below: - FastLocalBloom - baseline `./filter_bench -impl=2 -quick -runs 20 | grep 'Build avg'`: - Build avg ns/key: 29.56295 (DEBUG_LEVEL=1), 29.98153 (DEBUG_LEVEL=0) - new feature (expected to be similar as above)`./filter_bench -impl=2 -quick -runs 20 -reserve_table_builder_memory=true | grep 'Build avg'`: - Build avg ns/key: 30.99046 (DEBUG_LEVEL=1), 30.48867 (DEBUG_LEVEL=0) - new feature of RibbonFilter with fallback (expected to be similar as above) `./filter_bench -impl=2 -quick -runs 20 -reserve_table_builder_memory=true -strict_capacity_limit=true | grep 'Build avg'` : - Build avg ns/key: 31.146975 (DEBUG_LEVEL=1), 30.08165 (DEBUG_LEVEL=0) - Ribbon128 - baseline `./filter_bench -impl=3 -quick -runs 20 | grep 'Build avg'`: - Build avg ns/key: 129.17585 (DEBUG_LEVEL=1), 130.5225 (DEBUG_LEVEL=0) - new feature (expected to be similar as above) `./filter_bench -impl=3 -quick -runs 20 -reserve_table_builder_memory=true | grep 'Build avg' `: - Build avg ns/key: 131.61645 (DEBUG_LEVEL=1), 132.98075 (DEBUG_LEVEL=0) - new feature of RibbonFilter with fallback (expected to be a lot faster than above due to fallback) `./filter_bench -impl=3 -quick -runs 20 -reserve_table_builder_memory=true -strict_capacity_limit=true | grep 'Build avg'` : - Build avg ns/key: 52.032965 (DEBUG_LEVEL=1), 52.597825 (DEBUG_LEVEL=0) - And the warning message of `"Cache reservation for Ribbon filter banding failed due to cache full"` is indeed logged to console. Reviewed By: pdillinger Differential Revision: D31991348 Pulled By: hx235 fbshipit-source-id: 9336b2c60f44d530063da518ceaf56dac5f9df8e	2021-11-18 09:42:20 -08:00
Zhichao Cao	b694cd0e0d	Add tiered storage related read bytes stats to Statistic (#9123 ) Summary: Add 3 read-bytes counters to Statistics, which will be used by storage tiering to get information for files with different temperatures. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9123 Test Plan: added new test cases. Reviewed By: siying Differential Revision: D32154745 Pulled By: zhichao-cao fbshipit-source-id: b7905d6dae469a72428742364ec07b634b6f15da	2021-11-16 15:17:17 -08:00
Peter Dillinger	f8c685c4fc	Check for and disallow shared key space in block caches (#9172 ) Summary: We have three layers of block cache that often use the same key but map to different physical data: * BlockBasedTableOptions::block_cache * BlockBasedTableOptions::block_cache_compressed * BlockBasedTableOptions::persistent_cache If any two of these happen to share an underlying implementation and key space (insertion into one shows up in another), then memory safety is broken. The simplest case is block_cache == block_cache_compressed. (Credit mrambacher for asking about this case in a review.) With this change, we explicitly check for overlap and preemptively and safely fail with a Status code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9172 Test Plan: test added. Crashes without new check Reviewed By: anand1976 Differential Revision: D32465659 Pulled By: pdillinger fbshipit-source-id: 3876b45b6dce6167e5a7a642725ddc86b96f8e40	2021-11-16 11:16:05 -08:00
Adam Simpkins	28f54e71f3	fix compile errors in db/kv_checksum.h (#9173 ) Summary: When defining a template class, the constructor should be specified simply using the class name; it does not take template arguments. Apparently older versions of gcc and clang did not complain about this syntax, but gcc 11.x and recent versions of clang both complain about this file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9173 Test Plan: When building with platform010 I got compile errors in this file both in `mode/dev` (clang) and in `mode/opt-gcc`. This diff fixes the compile failures. Reviewed By: ajkr Differential Revision: D32455881 Pulled By: simpkins fbshipit-source-id: 0682910d9e2cdade94ce1e77973d47ac04d9f7e2	2021-11-16 10:20:50 -08:00
Yanqin Jin	2035798834	Update TransactionUtil::CheckKeyForConflict to also use timestamps (#9162 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9162 Existing TransactionUtil::CheckKeyForConflict() performs only seq-based conflict checking. If user-defined timestamp is enabled, it should perform conflict checking based on timestamps too. Update TransactionUtil::CheckKey-related methods to verify the timestamp of the latest version of a key is smaller than the read timestamp. Note that CheckKeysForConflict() is not updated since it's used only by optimistic transaction, and we do not plan to update it in this upcoming batch of diffs. Existing GetLatestSequenceForKey() returns the sequence of the latest version of a specific user key. Since we support user-defined timestamp, we need to update this method to also return the timestamp (if enabled) of the latest version of the key. This will be needed for snapshot validation. Reviewed By: ltamasi Differential Revision: D31567960 fbshipit-source-id: 2e4a14aed267435a9aa91bc632d2411c01946d44	2021-11-15 12:52:18 -08:00
Akanksha Mahajan	17ce1ca48b	Reuse internal auto readhead_size at each Level (expect L0) for Iterations (#9056 ) Summary: RocksDB does auto-readahead for iterators on noticing more than two sequential reads for a table file if the user doesn't provide readahead_size. The readahead starts at 8KB and doubles on every additional read up to max_auto_readahead_size. However, at each level, if the iterator moves on to the next file, readahead_size starts again from 8KB. This PR introduces a new ReadOption "adaptive_readahead" which, when set to true, will maintain readahead_size at each level. So when the iterator moves from one file to another, the new file's readahead_size will continue from the previous file's readahead_size instead of starting from scratch. However, if reads are not sequential, it will fall back to 8KB (default) with no prefetching for that block. 1. If a block is found in cache but it was eligible for prefetch (block wasn't in RocksDB's prefetch buffer), readahead_size will decrease by 8KB. 2. It maintains readahead_size for L1 - Ln levels. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9056 Test Plan: Added new unit tests. Ran db_bench for "readseq, seekrandom, seekrandomwhilewriting, readrandom" with --adaptive_readahead=true and there was no regression if the new feature is enabled. Reviewed By: anand1976 Differential Revision: D31773640 Pulled By: akankshamahajan15 fbshipit-source-id: 7332d16258b846ae5cea773009195a5af58f8f98	2021-11-10 16:20:04 -08:00
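The doubling behavior described in the commit above can be illustrated with a small model. This is a sketch under simplifying assumptions, not RocksDB code: `readahead_sizes` and `INITIAL_READAHEAD` are hypothetical names, every read is treated as sequential, and the "more than two sequential reads" trigger and the 8KB decrease on cache hits are left out.

```python
INITIAL_READAHEAD = 8 * 1024  # 8KB starting size, per the commit message


def readahead_sizes(reads_per_file, max_readahead, adaptive):
    """Return the readahead size used for each read, grouped by file.

    reads_per_file: number of sequential reads issued against each
        consecutive file at one level.
    max_readahead: cap on the readahead size (plays the role of
        max_auto_readahead_size).
    adaptive: if True, carry the readahead size across files, as the
        adaptive_readahead option is described to do.
    """
    sizes = []
    current = INITIAL_READAHEAD
    for num_reads in reads_per_file:
        if not adaptive:
            current = INITIAL_READAHEAD  # each new file starts from 8KB
        per_file = []
        for _ in range(num_reads):
            per_file.append(current)
            current = min(current * 2, max_readahead)  # double up to cap
        sizes.append(per_file)
    return sizes


without = readahead_sizes([3, 2], 64 * 1024, adaptive=False)
with_adaptive = readahead_sizes([3, 2], 64 * 1024, adaptive=True)
```

In this model, the second file restarts at 8KB when `adaptive` is false, and continues at the 64KB cap when it is true.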
slk	937fbcbddc	Track per-SST user-defined timestamp information in MANIFEST (#9092 ) Summary: Track per-SST user-defined timestamp information in MANIFEST https://github.com/facebook/rocksdb/issues/8957 RocksDB supports the user-defined timestamp feature: an application can specify a timestamp when writing each k-v pair. When data is flushed from memory to SST files on disk, the file creation is committed to the MANIFEST. This commit tracks timestamp info in the MANIFEST for each file. The changes are as follows: 1) Track max/min timestamp in FileMetaData, and fix the code involved. 2) Add NewFileCustomTag::kMinTimestamp and NewFileCustomTag::kMaxTimestamp in NewFileCustomTag (in the kNewFile4 part), and support the code involved, such as VersionEdit Encode and Decode. 3) Add unit test code for VersionEdit EncodeDecodeNewFile4, and fix the test code involved. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9092 Reviewed By: ajkr, akankshamahajan15 Differential Revision: D32252323 Pulled By: riversand963 fbshipit-source-id: d2642898d6e3ad1fef0eb866b98045408bd4e162	2021-11-10 10:49:04 -08:00
Yanqin Jin	a113cecfc9	Fix a bug in timestamp-related GC (#9116 ) Summary: For multiple versions (ts + seq) of the same user key, if they cross the boundary of `full_history_ts_low_`, we should retain the version that is visible to the `full_history_ts_low_`. Namely, we keep the internal key with the largest timestamp smaller than `full_history_ts_low`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9116 Test Plan: make check Reviewed By: ltamasi Differential Revision: D32261514 Pulled By: riversand963 fbshipit-source-id: e10f47c254c04c05261440051e4f50cb7d95474e	2021-11-09 13:08:55 -08:00
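The retention rule in the commit above can be sketched as a toy model. This is an illustration, not the compaction iterator's logic: `retained_versions` is a hypothetical name, timestamps are plain integers, sequence numbers, snapshots, and tombstones are ignored, and it assumes that versions at or above `full_history_ts_low` are always kept.

```python
def retained_versions(timestamps, full_history_ts_low):
    """Toy model of timestamp-based GC for one user key.

    Keeps every version with ts >= full_history_ts_low (an assumption of
    this sketch), plus the single newest version below the boundary,
    which is the version visible at full_history_ts_low.
    Returns the kept timestamps, newest first.
    """
    newer = [ts for ts in timestamps if ts >= full_history_ts_low]
    older = [ts for ts in timestamps if ts < full_history_ts_low]
    kept = sorted(newer, reverse=True)
    if older:
        kept.append(max(older))  # largest timestamp below the boundary
    return kept


# Versions at ts 1, 3, 5, 7, 9 with the boundary at 6: versions 7 and 9
# stay, and of the older ones only 5 survives.
kept = retained_versions([1, 3, 5, 7, 9], 6)
```

The case the commit targets is the one where versions straddle the boundary: dropping version 5 here would leave a reader at timestamp 6 with nothing to see.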
Zhichao Cao	efaef9b40a	cleanup error_handler related code (#9098 ) Summary: Remove code not in use, add comments, remove redundant code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9098 Test Plan: make check Reviewed By: anand1976 Differential Revision: D32027219 Pulled By: zhichao-cao fbshipit-source-id: 253aae926c87726268af6c027bf805dc9156c8a8	2021-11-08 15:49:17 -08:00
anand76	dddb791c18	Enable a few unit tests to use custom Env objects (#9087 ) Summary: Allow compaction_job_test, db_io_failure_test, dbformat_test, deletefile_test, and fault_injection_test to use a custom Env object. Also move ```RegisterCustomObjects``` declaration to a header file to simplify things. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9087 Test Plan: Run manually using "buck test rocksdb/src:compaction_job_test_fbcode" etc. Reviewed By: riversand963 Differential Revision: D32007222 Pulled By: anand1976 fbshipit-source-id: 99af58559e25bf61563dfa95dc46e31fa7375792	2021-11-08 11:05:59 -08:00
Jay Zhuang	5aad38f262	Deflake DBBasicTestWithTimestampCompressionSettings.PutAndGetWithComp… (#9136 ) Summary: …action ``` db/db_with_timestamp_basic_test.cc:2643: Failure db_->CompactFiles(compact_opt, handles_[cf], collector->GetFlushedFiles(), static_cast<int>(kNumTimestamps - i)) Invalid argument: A compaction must contain at least one file. ``` Reproducible by running multiple tests in parallel. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9136 Test Plan: ``` gtest-parallel ./db_with_timestamp_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampCompressionSettings.PutAndGetWithCompaction/12 -r 100 -w 100 ``` Reviewed By: riversand963 Differential Revision: D32197734 Pulled By: jay-zhuang fbshipit-source-id: aeb0d6e9b37312f577e203ca81bb7a0f14d4e7ce	2021-11-05 17:57:50 -07:00
Levi Tamasi	3fbddb1d27	Refactor and unify blob file saving and the logic that finds the oldest live blob file (#9122 ) Summary: The patch refactors and unifies the logic in `VersionBuilder::SaveBlobFilesTo` and `VersionBuilder::GetMinOldestBlobFileNumber` by introducing a generic helper that can "merge" the list of `BlobFileMetaData` in the base version with the list of `MutableBlobFileMetaData` representing the updated state after applying a sequence of `VersionEdit`s. This serves as groundwork for subsequent changes that will enable us to determine whether a blob file is live after applying a sequence of edits without calling `VersionBuilder::SaveTo`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9122 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D32151472 Pulled By: ltamasi fbshipit-source-id: 11622b475866de823334b8bc21b0e99d913af97e	2021-11-05 16:50:52 -07:00
Yanqin Jin	5237b39d2e	Fix assertion error during compaction with write-prepared txn enabled (#9105 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9105 The user contract of SingleDelete is that a SingleDelete can only be issued to a key that exists and has NOT been updated. For example, an application can insert one key `key` and use a SingleDelete to delete it in the future. The `key` cannot be updated or removed using Delete. In reality, especially when write-prepared transaction is being used, things can get tricky. For example, a prepared transaction already writes `key` to the memtable after a successful Prepare(). Afterwards, should the transaction roll back, it will insert a Delete into the memtable to cancel out the prior Put. Consider the following sequence of operations. ``` // operation sequence 1 Begin txn Put(key) Prepare() Flush() Rollback txn Flush() ``` There will be two SSTs resulting from the above. One of them contains a PUT, while the second one contains a Delete. It is also known that releasing a snapshot can lead to an L0 containing only a SD for a particular key. Consider the following operations following the above block. ``` // operation sequence 2 db->Put(key) db->SingleDelete(key) Flush() ``` Operation sequence 2 can result in an L0 with only the SD. Should there be a snapshot for conflict checking created before operation sequence 1, then an attempt to compact the db may hit the assertion failure below, because ikey_.type is Delete (from a rollback). ``` else if (clear_and_output_next_key_) { assert(ikey_.type == kTypeValue || ikey_.type == kTypeBlobIndex); } ``` To fix the assertion failure, we can skip the SingleDelete if we detect an earlier Delete in the same snapshot interval. Reviewed By: ltamasi Differential Revision: D32056848 fbshipit-source-id: 23620a91e28562d91c45cf7e95f414b54b729748	2021-11-05 15:29:18 -07:00
Yanqin Jin	9b53f14a35	Fixed a bug in CompactionIterator when write-prepared transaction is used (#9060 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9060 RocksDB bottommost level compaction may zero out an internal key's sequence if the key's sequence is in the earliest_snapshot. In write-prepared transaction, checking the visibility of a certain sequence in a specific released snapshot may return a "snapshot released" result. Therefore, it is possible that, after a certain sequence of events, a PUT has its sequence zeroed out, but a subsequent SingleDelete of the same key will still be output with its original sequence. This violates the ascending order of keys and leads to incorrect results. The solution is to use an extra variable `last_key_seq_zeroed_` to track the information about visibility in earliest snapshot. With this variable, we can know for sure that a SingleDelete is in the earliest snapshot even if the said snapshot is released during compaction before processing the SD. Reviewed By: ltamasi Differential Revision: D31813016 fbshipit-source-id: d8cff59d6f34e0bdf282614034aaea99be9174e1	2021-11-03 15:55:00 -07:00
Jay Zhuang	29102641dd	Skip directory fsync for filesystem btrfs (#8903 ) Summary: Directory fsync might be expensive on btrfs and it may not be needed. Here are 4 directory fsync cases: 1. creating a new file: dir-fsync is not needed on btrfs, as long as the new file itself is synced. 2. renaming a file: dir-fsync is not needed if the renamed file is synced. So an API `FsyncAfterFileRename(filename, ...)` is provided to sync the file on btrfs. By default, it just calls dir-fsync. 3. deleting files: dir-fsync is forced by setting `IOOptions.force_dir_fsync = true` 4. renaming multiple files (like backup and checkpoint): dir-fsync is forced, the same as above. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8903 Test Plan: run tests on btrfs and non-btrfs Reviewed By: ajkr Differential Revision: D30885059 Pulled By: jay-zhuang fbshipit-source-id: dd2730b31580b0bcaedffc318a762d7dbf25de4a	2021-11-03 12:21:27 -07:00
Levi Tamasi	081722780b	Refactor the detailed consistency checks and the SST saving logic in VersionBuilder (#9099 ) Summary: The patch refactors the parts of `VersionBuilder` that deal with SST file comparisons. Specifically, it makes the following changes: * Turns `NewestFirstBySeqNo` and `BySmallestKey` from free-standing functions into function objects. Note: `BySmallestKey` has a pointer to the `InternalKeyComparator`, while `NewestFirstBySeqNo` is completely stateless. * Eliminates the wrapper `FileComparator`, which was essentially an unnecessary DIY virtual function call mechanism. * Refactors `CheckConsistencyDetails` and `SaveSSTFilesTo` using helper function templates that take comparator/checker function objects. Using static polymorphism eliminates the need to make runtime decisions about which comparator to use. * Extends some error messages returned by the consistency checks and makes them more uniform. * Removes some incomplete/redundant consistency checks from `VersionBuilder` and `FilePicker`. * Improves const correctness in several places. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9099 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D32027503 Pulled By: ltamasi fbshipit-source-id: 621326ae41f4f55f7ad6a91abbd6e666d5c7857c	2021-11-03 11:52:47 -07:00
Peter Dillinger	2b60621f16	Don't call OnTableFileCreated with OK for empty+deleted file (#9118 ) Summary: EventListener::OnTableFileCreated was previously called with OK status and file_size==0 in cases of no SST file contents written (because there was no content to add) and the empty file deleted before calling the listener. This could lead to a stress test assertion failure added in https://github.com/facebook/rocksdb/issues/9054. This changes the status to Aborted, to align with the API doc: "... if the file is successfully created. Now it will also be called on failure case. User can check info.status to see if it succeeded or not." For internal purposes, this case is considered "success" but for listener purposes, no SST file is (successfully) created. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9118 Test Plan: test case added + existing db_stress Reviewed By: ajkr, riversand963 Differential Revision: D32120232 Pulled By: pdillinger fbshipit-source-id: a804e2e0a52598018d3b182da97804d402ffcdfa	2021-11-03 08:43:27 -07:00
Peter Dillinger	21f8a57f2a	Fix TSAN report on MemPurge test (#9115 ) Summary: TSAN reported a data race on count variables in the MemPurgeBasic test. This suggests the test could fail if mempurges were slow enough that they don't complete before the count variables are checked, but injecting a long sleep into MemPurge (outside DB mutex) confirms that blocked writes ensure enough mempurges/flushes happen to make the test pass. All the possible different values on testing should be OK to make the test pass. So this change makes the variables atomic so that the up-to-date value is always read and the TSAN report is suppressed. I have also used `.exchange(0)` to make the checking less stateful by "popping off" all the accumulated counts. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9115 Test Plan: updated test, watch for any flakiness Reviewed By: riversand963 Differential Revision: D32114432 Pulled By: pdillinger fbshipit-source-id: c985609d39896a0d8f69ebc87b221e688609bdd8	2021-11-02 21:54:29 -07:00
mrambacher	f72c834eab	Make FileSystem a Customizable Class (#8649 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8649 Reviewed By: zhichao-cao Differential Revision: D32036059 Pulled By: mrambacher fbshipit-source-id: 4f1e7557ecac52eb849b83ae02b8d7d232112295	2021-11-02 09:07:11 -07:00
sdong	a2b9be42b6	Try to start TTL earlier when kMinOverlappingRatio is used (#8749 ) Summary: Right now, when options.ttl is set, compactions are triggered around the time when TTL is reached. This might cause extra compactions which are often bursty. This commit tries to mitigate it by picking those files earlier in the normal compaction picking process. This is only implemented using kMinOverlappingRatio with Leveled compaction as it is the default value and it is more complicated to change other styles. When a file is aged more than ttl/2, RocksDB starts to boost the compaction priority of files in the normal compaction picking process, hoping that by the time TTL is reached, very few extra compactions are needed. In order for this to work, another change is made: during a compaction, if an output level file is older than ttl/2, cut output files based on original boundary (if it is not in the last level). This is to make sure that after an old file is moved to the next level, and new data is merged from the upper level, the new data falling into this range isn't reset with old timestamp. Without this change, in many cases, most files from one level will keep having an old timestamp even if they have newer data, and we get stuck in that state. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8749 Test Plan: Add a unit test to test the boosting logic. Will add a unit test to test it end-to-end. Reviewed By: jay-zhuang Differential Revision: D30735261 fbshipit-source-id: 503c2d89250b22911eb99e72b379be154de3428e	2021-11-01 14:36:31 -07:00
leipeng	230c98f3ce	fix histogram NUM_FILES_IN_SINGLE_COMPACTION (#9026 ) Summary: currently histogram `NUM_FILES_IN_SINGLE_COMPACTION` just counted files in first level of compaction input, this fix counts files in all levels of compaction input. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9026 Reviewed By: ajkr Differential Revision: D31668241 Pulled By: jay-zhuang fbshipit-source-id: c02f6c4a5df9fbf0b7510036594811152e8738af	2021-11-01 12:57:27 -07:00
Levi Tamasi	b1c27a52d2	Add a consistency check that prevents the overflow of garbage in blob files (#9100 ) Summary: The number or total size of garbage blobs in any given blob file can never exceed the number or total size of all blobs in the file. (This would be a similar error to e.g. attempting to remove from the LSM tree an SST file that has already been removed.) The patch builds on https://github.com/facebook/rocksdb/issues/9085 and adds a consistency check to `VersionBuilder` that prevents the above from happening. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9100 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D32048982 Pulled By: ltamasi fbshipit-source-id: 6f7e0793bf534ad04c3359cc0f696b8e4e5ef81c	2021-11-01 12:32:14 -07:00
leipeng	2b70224f82	remove bad extra RecordTick(stats_, WRITE_WITH_WAL) (#9064 ) Summary: This PR fixes the incorrect ticker `WRITE_WITH_WAL`. `RecordTick(WRITE_WITH_WAL)` will be called later in `WriteToWAL` and `ConcurrentWriteToWAL`. Fixes: 1. Delete these two extra `RecordTick(WRITE_WITH_WAL)` calls 2. Fix the corresponding test case Pull Request resolved: https://github.com/facebook/rocksdb/pull/9064 Reviewed By: ajkr Differential Revision: D31944459 Pulled By: riversand963 fbshipit-source-id: f1aa8d2a4320456bc357bc5b0902032f7dcad086	2021-11-01 11:43:14 -07:00
leipeng	01bd86ad35	InternalStats::DumpCFMapStat: fix sum.w_amp (#9065 ) Summary: The sum `w_amp` becomes a very large number `(bytes_written + bytes_written_blob)` when there has been no flush or ingest. This PR sets the sum `w_amp` to zero in that case, which conforms to the per-level `w_amp` computation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9065 Reviewed By: ajkr Differential Revision: D31943994 Pulled By: riversand963 fbshipit-source-id: acbef5e331debebfad09e0e0d8d0885ebbc00609	2021-10-31 23:11:43 -07:00
Yanqin Jin	8e59a1dc9a	Attempt to deflake ListenerTest.MultiCF (#9084 ) Summary: EventListenerTest.MultiCF uses TestFlushListener which has members flushed_dbs_ and flushed_column_family_names_ that are not protected by locks. This implicitly indicates that we need to ensure the methods access these data structures in a single-threaded way. In other tests, e.g. MultiDBMultiListeners, we use TEST_WaitForFlushMemtable() to wait until all memtables of a given column family are flushed, hence no pending flush threads will concurrently call OnFlushCompleted() and cause a data race for flushed_dbs_. To fix a test failure, we should do the same for MultiCF. Example data race stack traces reported by TSAN ``` Read of size 8 at 0x7b6000002840 by main thread: #0 std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> >::size() const /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/stl_vector.h:655:40 #1 rocksdb::EventListenerTest_MultiCF_Test::TestBody() /home/circleci/project/db/listener_test.cc:380:7 Previous write of size 8 at 0x7b6000002840 by thread T2: #0 void std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> >::_M_emplace_back_aux<rocksdb::DB* const&>(rocksdb::DB* const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/vector.tcc:442:26 #1 std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> >::push_back(rocksdb::DB* const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/stl_vector.h:923:4 #2 rocksdb::TestFlushListener::OnFlushCompleted(rocksdb::DB*, rocksdb::FlushJobInfo const&) /home/circleci/project/db/listener_test.cc:255:18 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/9084 Test Plan: ./listener_test --gtest_filter=EventListenerTest.MultiCF Reviewed By: jay-zhuang Differential Revision: D31952259 Pulled By: riversand963 fbshipit-source-id: 94a7f29e4e9466ead42418944eb2247fc32bd499	2021-10-31 22:12:15 -07:00
Yanqin Jin	8f4f302316	Attempt to deflake DBFlushTest.FireOnFlushCompletedAfterCommittedResult (#9083 ) Summary: DBFlushTest.FireOnFlushCompletedAfterCommittedResult uses test sync points to coordinate interleaving of different threads. Before this PR, the test writes some data to memtable, triggers a manual flush, and triggers a second manual flush after a first bg flush thread starts executing. Though unlikely, it is possible for the second bg flush thread to run faster than the first bg flush thread and dequeue from the flush queue first. In this case, the original test will fail. The fix is to wait until the first bg flush thread dequeues from the flush queue before triggering the second manual flush. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9083 Test Plan: ./db_flush_test --gtest_filter=DBFlushTest.FireOnFlushCompletedAfterCommittedResult Reviewed By: jay-zhuang Differential Revision: D31951239 Pulled By: riversand963 fbshipit-source-id: f32d7cdabe6ad6808fd18e54e663936dc0a9edb4	2021-10-31 22:08:48 -07:00
Levi Tamasi	44d04582cb	Aggregate blob file related changes in VersionBuilder as VersionEdits are applied (#9085 ) Summary: The current VersionBuilder code on mainline keeps track of blob file related changes ("delta") induced by a series of `VersionEdit`s in the form of `BlobFileMetaDataDelta` objects. Specifically, `BlobFileMetaDataDelta` contains the amount of additional garbage generated by compactions, as well as the set of newly linked/unlinked SSTs. This is very handy for detecting trivial moves, since in that case the newly linked and unlinked SSTs cancel each other out. However, this representation does not allow us to easily tell whether a certain blob file is obsolete after applying a set of `VersionEdit`s or not. In order to solve this issue, the patch introduces `MutableBlobFileMetaData`, which, in addition to the delta, also contains the materialized state after applying a set of version edits (i.e. the total amount of garbage and the resulting set of linked SSTs). This will enable us to add further consistency checks and to improve certain pieces of functionality where knowing up front which blob files get obsoleted is beneficial. (Note: this patch is just the refactoring part; I plan to create separate PRs for the enhancements.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9085 Test Plan: Ran `make check` and the stress tests in BlobDB mode. Reviewed By: riversand963 Differential Revision: D31980867 Pulled By: ltamasi fbshipit-source-id: cc4286778b10900af720423d6b772c77f28a93e3	2021-10-29 17:47:02 -07:00
Yanqin Jin	fdf2a0d7eb	Fix a compaction bug for write-prepared txn (#9061 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9061 In write-prepared txn, checking a sequence's visibility in a released (old) snapshot may return "Snapshot released". Suppose we have two snapshots: ``` earliest_snap < earliest_write_conflict_snap ``` If we release `earliest_write_conflict_snap` but keep `earliest_snap` during bottommost level compaction, then it is possible that certain sequence of events can lead to a PUT being seq-zeroed followed by a SingleDelete of the same key. This violates the ascending order of keys, and will cause data inconsistency. Reviewed By: ltamasi Differential Revision: D31813017 fbshipit-source-id: dc68ba2541d1228489b93cf3edda5f37ed06f285	2021-10-29 15:23:17 -07:00
Peter Dillinger	a7d4bea43a	Implement XXH3 block checksum type (#9069 ) Summary: XXH3 - latest hash function that is extremely fast on large data, easily faster than crc32c on most any x86_64 hardware. In integrating this hash function, I have handled the compression type byte in a non-standard way to avoid using the streaming API (extra data movement and active code size because of hash function complexity). This approach got a thumbs-up from Yann Collet. Existing functionality change: * reject bad ChecksumType in options with InvalidArgument This change split off from https://github.com/facebook/rocksdb/issues/9058 because context-aware checksum is likely to be handled through different configuration than ChecksumType. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9069 Test Plan: tests updated, and substantially expanded. Unit tests now check that we don't accidentally change the values generated by the checksum algorithms ("schema test") and that we properly handle invalid/unrecognized checksum types in options or in file footer. DBTestBase::ChangeOptions (etc.) updated from two to one configuration changing from default CRC32c ChecksumType. The point of this test code is to detect possible interactions among features, and the likelihood of some bad interaction being detected by including configurations other than XXH3 and CRC32c--and then not detected by stress/crash test--is extremely low. Stress/crash test also updated (manual run long enough to see it accepts new checksum type). db_bench also updated for microbenchmarking checksums. ### Performance microbenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor) ./db_bench -benchmarks=crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3 crc32c : 0.200 micros/op 5005220 ops/sec; 19551.6 MB/s (4096 per op) xxhash : 0.807 micros/op 1238408 ops/sec; 4837.5 MB/s (4096 per op) xxhash64 : 0.421 micros/op 2376514 ops/sec; 9283.3 MB/s (4096 per op) xxh3 : 0.171 micros/op 5858391 ops/sec; 22884.3 MB/s (4096 per op) crc32c : 0.206 micros/op 4859566 ops/sec; 18982.7 MB/s (4096 per op) xxhash : 0.793 micros/op 1260850 ops/sec; 4925.2 MB/s (4096 per op) xxhash64 : 0.410 micros/op 2439182 ops/sec; 9528.1 MB/s (4096 per op) xxh3 : 0.161 micros/op 6202872 ops/sec; 24230.0 MB/s (4096 per op) crc32c : 0.203 micros/op 4924686 ops/sec; 19237.1 MB/s (4096 per op) xxhash : 0.839 micros/op 1192388 ops/sec; 4657.8 MB/s (4096 per op) xxhash64 : 0.424 micros/op 2357391 ops/sec; 9208.6 MB/s (4096 per op) xxh3 : 0.162 micros/op 6182678 ops/sec; 24151.1 MB/s (4096 per op) As you can see, especially once warmed up, xxh3 is fastest. ### Performance macrobenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor) Test for I in `seq 1 50`; do for CHK in 0 1 2 3 4; do TEST_TMPDIR=/dev/shm/rocksdb$CHK ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=$CHK 2>&1 | grep 'micros/op' | tee -a results-$CHK & done; wait; done Results (ops/sec) for FILE in results; do echo -n "$FILE "; awk '{ s += $5; c++; } END { print 1.0 s / c; }' < $FILE; done results-0 252118 # kNoChecksum results-1 251588 # kCRC32c results-2 251863 # kxxHash results-3 252016 # kxxHash64 results-4 252038 # kXXH3 Reviewed By: mrambacher Differential Revision: D31905249 Pulled By: pdillinger fbshipit-source-id: cb9b998ebe2523fc7c400eedf62124a78bf4b4d1	2021-10-28 22:15:17 -07:00
Andrew Kryczka	f24c39ab3d	Prevent corruption with parallel manual compactions and `change_level == true` (#9077 ) Summary: The bug can impact the following scenario. There must be two `CompactRange()`s, call them A and B. Compaction A must have `change_level=true`. Compactions A and B must run in parallel, and new data must be added while they run as well. Now, on to the details of the race condition. Compaction A must reach the refitting phase while B's next step is to trivial move new data (i.e., data that has been inserted behind A) down to the same level that A's refit targets (`CompactRangeOptions::target_level`). B must be unregistered (i.e., has not yet called `AddManualCompaction()` for the current `RunManualCompaction()`) while A invokes `DisableManualCompaction()`s to prepare for refitting. In the old code, B could still proceed to register a manual compaction, while A had disabled manual compaction. The next part of the race condition is B picks and schedules a trivial move while A has released the lock in refitting phase in order to persist the LSM state change (i.e., the log phase of `LogAndApply()`). That way, B does not see the refitted data when picking a trivial-move compaction. So it is susceptible to picking one that overlaps. Finally, B executes the picked trivial-move compaction. Trivial-move compactions are special in that they never check whether manual compaction is disabled. So the picked compaction causing overlap ends up being applied, leading to LSM corruption if `force_consistency_checks=false`, or entering read-only mode with `Status::Corruption` if `force_consistency_checks=true` (the default). The fix is just to prevent B from registering itself in `RunManualCompaction()` while manual compactions are disabled, consequently preventing any trivial move or other compaction from being picked/scheduled. Thanks to siying for finding the bug. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9077 Test Plan: The test does not go all the way in exposing the bug because it requires a compaction to be picked/scheduled while logging LSM state change for RefitLevel(). But the fix is to make such a compaction not picked/scheduled in the first place, so any repro of that scenario would end up hanging RefitLevel() logging. So instead I just verified no such compaction is registered in the scenario where `RefitLevel()` disables manual compactions. Reviewed By: siying Differential Revision: D31921908 Pulled By: ajkr fbshipit-source-id: 9bb5d0e847ad428211227f40830c685c209fbecb	2021-10-27 23:08:56 -07:00
Jonathan Albrecht	e970248602	Add support for building on s390x platform (#8962 ) Summary: This PR adds support for building on s390x including updating travis CI. It uses the previous work in https://github.com/facebook/rocksdb/pull/6168 and adds some more changes to get all current tests (make check and jni tests) to pass. The tests were run with snappy, lz4, bzip2 and zstd all compiled in. There are a few pieces still needed to get the travis build working that I don't think I can do. adamretter is this something you could help with? 1. A prebuilt https://rocksdb-deps.s3-us-west-2.amazonaws.com/cmake/cmake-3.14.5-Linux-s390x.deb package 2. A https://hub.docker.com/r/evolvedbinary/rocksjava s390x image Not sure if there is more required for travis. Happy to help in any way I can. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8962 Reviewed By: mrambacher Differential Revision: D31802198 Pulled By: pdillinger fbshipit-source-id: 683511466fa6b505f85ba5a9964a268c6151f0c2	2021-10-22 10:13:15 -07:00
Yanqin Jin	f72fd58565	Fix atomic flush waiting forever for MANIFEST write (#9034 ) Summary: In atomic flush, concurrent background flush threads will commit to the MANIFEST one by one, in the order of the IDs of their picked memtables for all included column families. Each time, a background flush thread decides whether to wait based on two criteria: - Is db stopped? If so, don't wait. - Am I the one to commit the currently earliest memtable? If so, don't wait and ready to go. When atomic flush was implemented, error writing to or syncing the MANIFEST would cause the db to be stopped. Therefore, this background thread does not have to check for the background error while waiting. If there has been such an error, `DBStopped()` would have been true, and this thread will not wait forever. After we improved error handling, RocksDB may map an IOError while writing to MANIFEST to a soft error, if there is no WAL. This requires the background threads to check for background error while waiting. Otherwise, a background flush thread may wait forever. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9034 Test Plan: make check Reviewed By: zhichao-cao Differential Revision: D31639225 Pulled By: riversand963 fbshipit-source-id: e9ab07c4d8f2eade238adeefe3e42dd9a5a3ebbd	2021-10-20 21:34:47 -07:00
leipeng	0a73ada7b5	remove unused local obj and simplify complex code (#9052 ) Summary: This PR does not change code semantics; it only makes these changes: 1. remove the unused local objects `nonmem_w` and `lfile` 2. remove the unnecessary null check before `delete ptr` 3. use `unique_ptr::reset` instead of `release` + `delete` Pull Request resolved: https://github.com/facebook/rocksdb/pull/9052 Reviewed By: zhichao-cao Differential Revision: D31801661 Pulled By: anand1976 fbshipit-source-id: 16a77d45da8c8833bf5bf3bce546bb3711b335df	2021-10-20 14:08:05 -07:00
leipeng	0c53b41856	db_impl_write.cc: use stats_ instead of immutable_db_options_.stats (#9053 ) Summary: This PR has no semantic changes, just to make code shorter. `stats_` has value same with `immutable_db_options_.stats`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9053 Reviewed By: zhichao-cao Differential Revision: D31801603 Pulled By: anand1976 fbshipit-source-id: cbd8fe478d3e90ae078ace49b4f2eb9bb028ccf6	2021-10-20 14:04:59 -07:00
Andrew Kryczka	4217d1bce7	Support `GetMapProperty()` with "rocksdb.dbstats" (#9057 ) Summary: This PR supports querying `GetMapProperty()` with "rocksdb.dbstats" to get the DB-level stats in a map format. It only reports cumulative stats over the DB lifetime and, as such, does not update the baseline for interval stats. Like other map properties, the string keys are not (yet) exposed in the public API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9057 Test Plan: new unit test Reviewed By: zhichao-cao Differential Revision: D31781495 Pulled By: ajkr fbshipit-source-id: 6f77d3aee8b4b1a015061b8c260a123859ceaf9b	2021-10-20 13:17:00 -07:00
sdong	c66b4429ff	Incremental Space Amp Compactions in Universal Style (#8655 ) Summary: This commit introduces incremental compaction in universal style for space amplification. This follows the first improvement mentioned in https://rocksdb.org/blog/2021/04/12/universal-improvements.html . The implementation simply picks up files about the size of max_compaction_bytes to compact and executes if the penalty is not too big. More optimizations can be done in the future, e.g. prioritizing between this compaction and other types. But for now, the feature is supposed to be functional and can often reduce the frequency of full compactions, although it can introduce a penalty. In order to cut files more efficiently so that more files from upper levels can be included, the SST file cutting threshold (for current file + overlapping parent level files) is set to 1.5X of target file size. A 2MB target file size will generate files like this: https://gist.github.com/siying/29d2676fba417404f3c95e6c013c7de8 The number of files indeed increases but it is not out of control. Two sets of write benchmarks are run: 1. For the ingestion rate limited scenario, we can see full compaction is mostly eliminated: https://gist.github.com/siying/959bc1186066906831cf4c808d6e0a19 . The write amp increased from 7.7 to 9.4, as expected. After applying file cutting, the number is improved to 8.9. In another benchmark, the write amp is even better with the incremental approach: https://gist.github.com/siying/d1c16c286d7c59c4d7bba718ca198163 2. For the ingestion rate unlimited scenario, incremental compaction turns out to be too expensive most of the time and is not executed, as expected. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8655 Test Plan: Add unit tests to the functionality. Reviewed By: ajkr Differential Revision: D31787034 fbshipit-source-id: ce813e63b15a61d5a56e97bf8902a1b28e011beb	2021-10-20 10:04:13 -07:00
sdong	f053851af6	Ignore non-overlapping levels when determining grandparent files (#9051 ) Summary: Right now, when picking a compaction, grandparent files are from output_level + 1. This usually works, but if the level doesn't have any overlapping file, it will be more efficient to go further down. This is because the files are likely to be trivially moved further and might create a violation of max_compaction_bytes. This situation can naturally happen and might happen even more with TTL compactions. There is no harm in fixing it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9051 Test Plan: Run existing tests and see that they pass. Also briefly run crash test. Reviewed By: ajkr Differential Revision: D31748829 fbshipit-source-id: 52b99ab4284dc816d22f34406d528a3c98ff6719	2021-10-19 12:48:18 -07:00
Peter Dillinger	ad5325a736	Experimental support for SST unique IDs (#8990 ) Summary: * New public header unique_id.h and function GetUniqueIdFromTableProperties which computes a universally unique identifier based on table properties of table files from recent RocksDB versions. * Generation of DB session IDs is refactored so that they are guaranteed unique in the lifetime of a process running RocksDB. (SemiStructuredUniqueIdGen, new test included.) Along with file numbers, this enables SST unique IDs to be guaranteed unique among SSTs generated in a single process, and "better than random" between processes. See https://github.com/pdillinger/unique_id * In addition to public API producing 'external' unique IDs, there is a function for producing 'internal' unique IDs, with functions for converting between the two. In short, the external ID is "safe" for things people might do with it, and the internal ID enables more "power user" features for the future. Specifically, the external ID goes through a hashing layer so that any subset of bits in the external ID can be used as a hash of the full ID, while also preserving uniqueness guarantees in the first 128 bits (bijective both on first 128 bits and on full 192 bits). Intended follow-up: * Use the internal unique IDs in cache keys. (Avoid conflicts with https://github.com/facebook/rocksdb/issues/8912) (The file offset can be XORed into the third 64-bit value of the unique ID.) * Publish the external unique IDs in FileStorageInfo (https://github.com/facebook/rocksdb/issues/8968) Pull Request resolved: https://github.com/facebook/rocksdb/pull/8990 Test Plan: Unit tests added, and checking of unique ids in stress test. NOTE in stress test we do not generate nearly enough files to thoroughly stress uniqueness, but the test trims off pieces of the ID to check for uniqueness so that we can infer (with some assumptions) stronger properties in the aggregate. Reviewed By: zhichao-cao, mrambacher Differential Revision: D31582865 Pulled By: pdillinger fbshipit-source-id: 1f620c4c86af9abe2a8d177b9ccf2ad2b9f48243	2021-10-18 23:32:01 -07:00
Giuseppe Ottaviano	f0841d4faf	Fix out-of-bounds access in MultiDBParallelOpenTest (#9046 ) Summary: `dbs` should not be cleared, as it is reused later when reopening the DBs, so we have an out-of-bounds access with `dbnames[dbnum]`. The values left in the vector don't need to be reset, as the db pointer is an out parameter for `DB::Open`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9046 Reviewed By: pdillinger Differential Revision: D31738263 Pulled By: ot fbshipit-source-id: c619e947b8d3dbc3d896f29971f093d3e3c794d3	2021-10-18 21:25:45 -07:00
Jay Zhuang	314de7e7de	Make `DB::Close()` thread-safe (#8970 ) Summary: If `DB::Close()` is called from multiple threads, resources could be released twice, which causes an exception or assertion failure. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8970 Test Plan: Test with a multi-threaded benchmark in which each thread tries to close the DB at the end. Reviewed By: pdillinger Differential Revision: D31242042 Pulled By: jay-zhuang fbshipit-source-id: a61276b1b61e07732e375554106946aea86a23eb	2021-10-18 20:32:35 -07:00
Jay Zhuang	53a0ab2bea	Deflaky ObsoleteFilesTest (#9049 ) Summary: WaitForFlushMemTable() may wait only for the memtable flush, not for the background flush job to finish, so the obsolete file may not be purged yet. `fcaa7ff638/db/db_impl/db_impl_compaction_flush.cc (L2200-L2203)` Use WaitForCompact() instead to wait for the background flush job. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9049 Test Plan: `gtest-parallel ./obsolete_files_test --gtest_filter=ObsoleteFilesTest.DeleteObsoleteOptionsFile -r 1000` Reviewed By: zhichao-cao Differential Revision: D31737343 Pulled By: jay-zhuang fbshipit-source-id: 82276ebeae7c7c75a733d3e1fd1c130d45e4761f	2021-10-18 15:15:23 -07:00
Peter Dillinger	3ffb3baa0b	Add (Live)FileStorageInfo API (#8968 ) Summary: New classes FileStorageInfo and LiveFileStorageInfo and 'experimental' function DB::GetLiveFilesStorageInfo, which is intended to largely replace several fragmented DB functions needed to create checkpoints and backups. This function is now used to create checkpoints and backups, because it fixes many (probably not all) of the prior complexities of checkpoint not having atomic access to DB metadata. This also ensures strong functional test coverage of the new API. Specifically, much of the old CheckpointImpl::CreateCustomCheckpoint has been migrated to and updated in DBImpl::GetLiveFilesStorageInfo, with the former now calling the latter. Also, the class FileStorageInfo in metadata.h compatibly replaces BackupFileInfo and serves as a new base class for SstFileMetaData. Some old fields of SstFileMetaData are still provided (for now) but deprecated. Although FileStorageInfo::directory is accurate when using db_paths and/or cf_paths, these have never been supported by Checkpoint nor BackupEngine and still are not. This change does now detect these cases and return NotSupported when appropriate. (More work needed for support.) Somehow this change broke ProgressCallbackDuringBackup, but the progress_callback logic was dubious to begin with because it would call the callback based on copy buffer size, not size actually copied. Logic and test updated to track size actually copied per-thread. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8968 Test Plan: tests updated. DB::GetLiveFilesStorageInfo mostly tested by use in CheckpointImpl. DBTest.SnapshotFiles updated to also test GetLiveFilesStorageInfo, including reading the data after DB close. Added CheckpointTest.CheckpointWithDbPath (NotSupported). Reviewed By: siying Differential Revision: D31242045 Pulled By: pdillinger fbshipit-source-id: b183d1ce9799e220daaefd6b3b5365d98de676c0	2021-10-16 10:04:32 -07:00
Giuseppe Ottaviano	4bfd415e34	Fix sequence number bump logic in multi-CF SST ingestion (#9005 ) Summary: The code in `IngestExternalFiles()` that bumps the DB's sequence number depending on what seqnos were assigned to the files has 3 bugs: 1) There is an assertion that the sequence number is increased in all the affected column families, but this is unnecessary, it is fine if some files can stick to a lower sequence number. It is very easy to hit the assertion: it is sufficient to insert 2 files in 2 CFs, one which overlaps the CF and one that doesn't (for example the CF is empty). The line added in the `IngestFilesIntoMultipleColumnFamilies_Success` test makes the assertion fail. 2) SetLastSequence() is called with the sum of all the bumps across CFs, but we should take the maximum instead, as all CFs start with the current seqno and bump it independently. 3) The code above is accidentally under a `#ifndef NDEBUG`, so it doesn't run in optimized builds, so some files may be assigned seqnos from the future. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9005 Test Plan: Added line in `IngestFilesIntoMultipleColumnFamilies_Success` that triggers the assertion, verified that the test (and all the others) pass after the fix. Reviewed By: ajkr Differential Revision: D31597892 Pulled By: ot fbshipit-source-id: c2d3237f90290df1178736ace8653a9623f5a770	2021-10-12 20:39:52 -07:00
Levi Tamasi	7cc52cd8f5	Update HISTORY for PR 8994 (#9017 ) Summary: Also, expand on/clarify a comment in `VersionStorageInfoTest`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9017 Reviewed By: riversand963 Differential Revision: D31566130 Pulled By: ltamasi fbshipit-source-id: 1d30c7af084c4de7b2030bc6c768838d65746010	2021-10-12 10:19:56 -07:00
Giuseppe Ottaviano	22d4dc5066	Fix race in WriteBufferManager (#9009 ) Summary: EndWriteStall has a data race: `queue_.empty()` is checked outside of the mutex, so once we enter the critical section another thread may already have cleared the list, and accessing the `front()` is undefined behavior (and causes interesting crashes under high concurrency). This PR fixes the bug, and also rewrites the logic to make it easier to reason about. It also fixes another subtle bug: if some writers are stalled and `SetBufferSize(0)` is called, which disables the WBM, the writers are not unblocked because of an early `enabled()` check in `EndWriteStall()`. It doesn't significantly change the locking behavior, as before writers won't lock unless entering a stall condition, and `FreeMem` almost always locks if stalling is allowed, but that is inevitable with the current design. Liveness is guaranteed by the fact that if some writes are blocked, eventually all writes will be blocked due to `stall_active_`, and eventually all memory is freed. While at it, do a couple of optimizations: - In `WBMStallInterface::Signal()` signal the CV only after releasing the lock. Signaling under the lock is a common pitfall, as it causes the woken-up thread to immediately go back to sleep because the mutex is still locked by the awaker. - Move all allocations and deallocations outside of the lock. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9009 Test Plan: ``` USE_CLANG=1 make -j64 all check ``` Reviewed By: akankshamahajan15 Differential Revision: D31550668 Pulled By: ot fbshipit-source-id: 5125387c3dc7ecaaa2b8bbc736e58c4156698580	2021-10-12 00:16:21 -07:00
Levi Tamasi	3e1bf771a3	Make it possible to force the garbage collection of the oldest blob files (#8994 ) Summary: The current BlobDB garbage collection logic works by relocating the valid blobs from the oldest blob files as they are encountered during compaction, and cleaning up blob files once they contain nothing but garbage. However, with sufficiently skewed workloads, it is theoretically possible to end up in a situation when few or no compactions get scheduled for the SST files that contain references to the oldest blob files, which can lead to increased space amp due to the lack of GC. In order to efficiently handle such workloads, the patch adds a new BlobDB configuration option called `blob_garbage_collection_force_threshold`, which signals to BlobDB to schedule targeted compactions for the SST files that keep alive the oldest batch of blob files if the overall ratio of garbage in the given blob files meets the threshold and all the given blob files are eligible for GC based on `blob_garbage_collection_age_cutoff`. (For example, if the new option is set to 0.9, targeted compactions will get scheduled if the sum of garbage bytes meets or exceeds 90% of the sum of total bytes in the oldest blob files, assuming all affected blob files are below the age-based cutoff.) The net result of these targeted compactions is that the valid blobs in the oldest blob files are relocated and the oldest blob files themselves cleaned up (since all SST files that rely on them get compacted away). These targeted compactions are similar to periodic compactions in the sense that they force certain SST files that otherwise would not get picked up to undergo compaction and also in the sense that instead of merging files from multiple levels, they target a single file. (Note: such compactions might still include neighboring files from the same level due to the need of having a "clean cut" boundary but they never include any files from any other level.) 
This functionality is currently only supported with the leveled compaction style and is inactive by default (since the default value is set to 1.0, i.e. 100%). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8994 Test Plan: Ran `make check` and tested using `db_bench` and the stress/crash tests. Reviewed By: riversand963 Differential Revision: D31489850 Pulled By: ltamasi fbshipit-source-id: 44057d511726a0e2a03c5d9313d7511b3f0c4eab	2021-10-11 18:03:01 -07:00
Andrew Kryczka	a282eff3d1	Protect existing files in `FaultInjectionTest{Env,FS}::ReopenWritableFile()` (#8995 ) Summary: `FaultInjectionTest{Env,FS}::ReopenWritableFile()` functions were accidentally deleting WALs from previous `db_stress` runs causing verification to fail. They were operating under the assumption that `ReopenWritableFile()` would delete any existing file. It was a reasonable assumption considering the `{Env,FileSystem}::ReopenWritableFile()` documentation stated that would happen. The only problem was neither the implementations we offer nor the "real" clients in RocksDB code followed that contract. So, this PR updates the contract as well as fixing the fault injection client usage. The fault injection change exposed that `ExternalSSTFileBasicTest.SyncFailure` was relying on a fault injection `Env` dropping unsynced data written by a regular `Env`. I changed that test to make its `SstFileWriter` use fault injection `Env`, and also implemented `LinkFile()` in fault injection so the unsynced data is tracked under the new name. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8995 Test Plan: - Verified it fixes the following failure: ``` $ ./db_stress --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/rocksdb_crashtest_whitebox --delpercent=5 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=100000 --max_key_len=3 --nooverwritepercent=1 --ops_per_thread=1000 --prefixpercent=0 --readpercent=60 --reopen=0 --target_file_size_base=1048576 --test_batches_snapshots=0 --write_buffer_size=1048576 --writepercent=35 --value_size_mult=33 -threads=1 ... 
$ ./db_stress --avoid_flush_during_recovery=1 --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/rocksdb_crashtest_whitebox --delpercent=5 --destroy_db_initially=0 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --iterpercent=10 --key_len_percent_dist=1,30,69 --max_bytes_for_level_base=4194304 --max_key=100000 --max_key_len=3 --nooverwritepercent=1 --open_files=-1 --open_metadata_write_fault_one_in=8 --open_write_fault_one_in=16 --ops_per_thread=1000 --prefix_size=-1 --prefixpercent=0 --readpercent=50 --sync=1 --target_file_size_base=1048576 --test_batches_snapshots=0 --write_buffer_size=1048576 --writepercent=35 --value_size_mult=33 -threads=1 ... Verification failed for column family 0 key 000000000000001300000000000000857878787878 (1143): Value not found: NotFound: Crash-recovery verification failed :( ... ``` - `make check -j48` Reviewed By: ltamasi Differential Revision: D31495388 Pulled By: ajkr fbshipit-source-id: 7886ccb6a07cb8b78ad7b6c1c341ccf40bb68385	2021-10-11 16:23:18 -07:00
Zhichao Cao	bcd049cd2d	Ingest external SST files with Temperature hints (#8949 ) Summary: Add the file temperature to `IngestExternalFileArg` such that when SST files are ingested, user is able to assign the temperature to each SST file. If the temperature vector is empty or its size does not match the file name vector size, all ingested SST files will be assigned with `Temperature::unKnown`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8949 Test Plan: add the new test and make check Reviewed By: siying Differential Revision: D31127852 Pulled By: zhichao-cao fbshipit-source-id: 141a81f0f7b473d88f4ab0cb2a21a114cbc6f83c	2021-10-08 10:32:24 -07:00
Andrew Kryczka	fcaa7ff638	Cancel manual compactions waiting on automatic compactions to drain (#8991 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8991 Test Plan: the new test hangs forever without this fix and passes with this fix. Reviewed By: hx235 Differential Revision: D31456419 Pulled By: ajkr fbshipit-source-id: a82c0e5560b6e6153089dccd8e46163c61b07bff	2021-10-07 15:23:55 -07:00
Kajetan Janiak	8717c26823	Warning about incompatible options with level_compaction_dynamic_level_bytes (#8329 ) Summary: This change introduces warnings instead of a silent override when trying to use level_compaction_dynamic_level_bytes with multiple cf_paths/db_paths. I have completed the CLA. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8329 Reviewed By: hx235 Differential Revision: D31399713 Pulled By: ajkr fbshipit-source-id: 29c6fe5258d1f739b4590ecd44aee44f55415595	2021-10-07 15:23:55 -07:00
Zhichao Cao	b632ed0c67	Add file temperature related counter and bytes stats to io_stats (#8710 ) Summary: For the tiered storage project, we need to know the block read count and read bytes of files with different temperatures. Add FileIOByTemperature to IOStatsContext and collect the bytes read and read count from different temperature files through the RandomAccessFileReader. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8710 Test Plan: make check, add the testing cases Reviewed By: siying Differential Revision: D30582400 Pulled By: zhichao-cao fbshipit-source-id: d83173de594374fc8404af5ce93a6a9be72c7141	2021-10-07 14:58:41 -07:00
Ramkumar Vadivelu	fe994bbd0b	Misc doc fixes (#8983 ) Summary: - Update few stale GitHub wiki link references from rocksdb.org - Update the API comments for ignore_range_deletions Pull Request resolved: https://github.com/facebook/rocksdb/pull/8983 Reviewed By: ajkr Differential Revision: D31355965 Pulled By: ramvadiv fbshipit-source-id: 245ac4a6913976dd82afa308bc4aae6bff3d788c	2021-10-07 11:22:17 -07:00
mrambacher	53e595d1f3	Cleanup multiple implementations of VectorIterator (#8901 ) Summary: There were three implementations of VectorIterator (util/vector_iterator, test_util/testutil.h and LoggingForwardVectorIterator). Merged them into one class to increase code coverage/testing and reduce duplication. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8901 Reviewed By: pdillinger Differential Revision: D31022673 Pulled By: mrambacher fbshipit-source-id: 8e3acbd2dfd60b4df609d02cc72846de2389d531	2021-10-06 07:48:31 -07:00
Stefan Roesch	a776406de3	Add file operation callbacks to SequentialFileReader (#8982 ) Summary: This change adds File IO Notifications to the SequentialFileReader. The SequentialFileReader is extended with a listener parameter. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8982 Test Plan: A new test EventListenerTest::OnWALOperationTest has been added. The test verifies that during restore the sequential file reader is called and the notifications are fired. Reviewed By: riversand963 Differential Revision: D31320844 Pulled By: shrfb fbshipit-source-id: 040b24da7c010d7c14ebb5c6460fae9a19b8c168	2021-10-05 10:51:59 -07:00
mrambacher	787229837e	Fix LITE mode builds on MacOs (#8981 ) Summary: On MacOS, there were errors building in LITE mode related to unused private member variables: In file included from ./db/compaction/compaction_job.h:20: ./db/blob/blob_file_completion_callback.h:87:19: error: private field ‘sst_file_manager_’ is not used [-Werror,-Wunused-private-field] SstFileManager* sst_file_manager_; ^ ./db/blob/blob_file_completion_callback.h:88:22: error: private field ‘mutex_’ is not used [-Werror,-Wunused-private-field] InstrumentedMutex* mutex_; ^ ./db/blob/blob_file_completion_callback.h:89:17: error: private field ‘error_handler_’ is not used [-Werror,-Wunused-private-field] ErrorHandler* error_handler_; This PR resolves those build issues by removing the values as members in LITE mode and fixing the constructor to ignore the input values in LITE mode (otherwise we get unused parameter warnings). Tested by validating compiles without warnings. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8981 Reviewed By: akankshamahajan15 Differential Revision: D31320141 Pulled By: mrambacher fbshipit-source-id: d67875ebbd39a9555e4f09b2d37159566dd8a085	2021-10-04 05:30:26 -07:00
Yanqin Jin	2cdaf5ca5b	Add additional checks for three existing unit tests (#8973 ) Summary: With test sync points, we can assert on the equality of iterator value in three existing unit tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8973 Test Plan: ``` gtest-parallel -r 1000 ./db_test2 --gtest_filter=DBTest2.IterRaceFlush2:DBTest2.IterRaceFlush1:DBTest2.IterRefreshRaceFlush ``` make check Reviewed By: akankshamahajan15 Differential Revision: D31256340 Pulled By: riversand963 fbshipit-source-id: a9440767ab383e0ec61bd43ffa8fbec4ba562ea2	2021-10-01 17:22:37 -07:00
anand76	532ff334d9	Don't ignore deletion rate limit if WAL dir is different (#8967 ) Summary: If WAL dir is different from the DB dir, we should still honor the SstFileManager deletion rate limit for SST files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8967 Test Plan: Add a new unit test in db_sst_test Reviewed By: pdillinger Differential Revision: D31220116 Pulled By: anand1976 fbshipit-source-id: bcde8a53a7d728e15e597fb5d07ee86c1b38bd28	2021-09-30 13:26:31 -07:00
Yanqin Jin	2acffecca1	Add comments for MultiGetBlob() and checks for MultiRead() (#8972 ) Summary: Add comments for MultiGetBlob() that input argument `offsets` must be sorted. In addition, add assertion for this condition in debug build. Repeat the same for RandomAccessFileReader::MultiRead(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8972 Test Plan: make check Reviewed By: pdillinger Differential Revision: D31253205 Pulled By: riversand963 fbshipit-source-id: 98758229b8052f3aeb319d5584026b4de2d220a2	2021-09-29 14:27:19 -07:00
mrambacher	13ae16c315	Cleanup includes in dbformat.h (#8930 ) Summary: This header file was including everything and the kitchen sink when it did not need to. This resulted in many places including this header when they needed other pieces instead. Cleaned up this header to only include what was needed and fixed up the remaining code to include what was now missing. Hopefully, this sort of code hygiene cleanup will speed up the builds... Pull Request resolved: https://github.com/facebook/rocksdb/pull/8930 Reviewed By: pdillinger Differential Revision: D31142788 Pulled By: mrambacher fbshipit-source-id: 6b45de3f300750c79f751f6227dece9cfd44085d	2021-09-29 04:04:40 -07:00
Jay Zhuang	6b34eb0ebc	Add remote compaction read/write bytes statistics (#8939 ) Summary: Add basic read/write bytes statistics on the primary side: `REMOTE_COMPACT_READ_BYTES` `REMOTE_COMPACT_WRITE_BYTES` Fixed existing statistics missing some IO for remote compaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8939 Test Plan: CI Reviewed By: ajkr Differential Revision: D31074672 Pulled By: jay-zhuang fbshipit-source-id: c57afdba369990185008ffaec7e3fe7c62e8902f	2021-09-28 14:00:37 -07:00
Hui Xiao	d6bd1a0291	Support "level_at_creation" in TablePropertiesCollectorFactory::Context (#8919 ) Summary: Context: Exposing the level of the sst file (i.e, table) where it is created in `TablePropertiesCollectorFactory::Context` allows users of `TablePropertiesCollectorFactory` to customize some implementation details of `TablePropertiesCollectorFactory` and `TablePropertiesCollector` based on the level of creation. For example, `TablePropertiesCollector::NeedCompact()` can return different values based on level of creation. - Declared an extra field `level_at_creation` in `TablePropertiesCollectorFactory::Context` - Allowed `level_at_creation` to be passed in as an argument in `IntTblPropCollectorFactory::CreateIntTblPropCollector()` and `UserKeyTablePropertiesCollectorFactory::CreateIntTblPropCollector()`, the latter of which is an internal wrapper of user's passed-in `TablePropertiesCollectorFactory::CreateTablePropertiesCollector()` used in table-building process - Called `IntTblPropCollectorFactory::CreateIntTblPropCollector()` with `level_at_creation` passed into both `BlockBasedTableBuilder` and `PlainTableBuilder` - `PlainTableBuilder` previously did not capture `level_at_creation` from `TableBuilderOptions` in `PlainTableFactory`. 
In order for it to call the method with this parameter, this PR also made `PlainTableBuilder` capture `level_at_creation` as a required parameter - Called `IntTblPropCollectorFactory::CreateIntTblPropCollector()` with `level_at_creation` in its overridden functions in its derived classes, including `RegularKeysStartWithAFactory::CreateIntTblPropCollector()` in `table_properties_collector_test.cc`, `SstFileWriterPropertiesCollectorFactory::CreateIntTblPropCollector()` in `sst_file_writer_collectors.h` Pull Request resolved: https://github.com/facebook/rocksdb/pull/8919 Test Plan: - Passed the added assertion for `context.level_at_creation` - Passed existing tests - Run `Make` to make sure adding a required parameter to `PlainTableBuilder`'s constructor does not break anything Reviewed By: anand1976 Differential Revision: D30951729 Pulled By: hx235 fbshipit-source-id: c4a0173b0d9344a4cf47e1b987d759c1c73cb474	2021-09-28 12:35:24 -07:00
mrambacher	7fd68b7c39	Make WalFilter, SstPartitionerFactory, FileChecksumGenFactory, and TableProperties Customizable (#8638 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8638 Reviewed By: zhichao-cao Differential Revision: D31024729 Pulled By: mrambacher fbshipit-source-id: 954c04ccab0b8dee64050a27aadf78ed119106c0	2021-09-28 05:32:02 -07:00
Akanksha Mahajan	78afb4d81e	Support SingleDelete for user-defined timestamps (#8921 ) Summary: Added support for SingleDelete for user-defined timestamps. Users can now Get and Iterate over keys deleted with SingleDelete. It also includes changes in CompactionIterator which preserves the same user key with different timestamps, unless the timestamp is below a certain threshold full_history_ts_low. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8921 Test Plan: Added new unit tests Reviewed By: riversand963 Differential Revision: D31098191 Pulled By: akankshamahajan15 fbshipit-source-id: 78a59ef4b4884ae324fcd10f56e62a27d5ee2f49	2021-09-27 11:51:07 -07:00
Peter Dillinger	0774d640c0	Fix some lint warnings reported on 6.25 (#8945 ) Summary: Fix some lint warnings Pull Request resolved: https://github.com/facebook/rocksdb/pull/8945 Test Plan: existing tests, linters Reviewed By: zhichao-cao Differential Revision: D31103824 Pulled By: pdillinger fbshipit-source-id: 4dd9b0c30fa50e588107ac6ed392b2dfb507a5d4	2021-09-27 11:43:20 -07:00
mrambacher	e0f697d2bd	Make SliceTransform into a Customizable class (#8641 ) Summary: Made SliceTransform into a Customizable class. It would be nice to write a test that stored and used a custom transform in an SST table. There is a set of tests (DBBlockFliterTest.PrefixExtractor*, SamePrefixTest.InDomainTest, PrefixTest.PrefixAndWholeKeyTest) that run the same with or without a SliceTransform/PrefixFilter. Is this expected? Pull Request resolved: https://github.com/facebook/rocksdb/pull/8641 Reviewed By: zhichao-cao Differential Revision: D31142793 Pulled By: mrambacher fbshipit-source-id: bb08672fccbfdc263dcae21f25a62307e1facda1	2021-09-27 07:43:47 -07:00
Yanqin Jin	b92cef2d1d	Sort per-file blob read requests by offset (#8953 ) Summary: `RandomAccessFileReader::MultiRead()` tries to merge requests in direct IO, assuming input IO requests are sorted by offsets. Add a test in direct IO mode. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8953 Test Plan: make check Reviewed By: ltamasi Differential Revision: D31183546 Pulled By: riversand963 fbshipit-source-id: 5d043ec68e2daa47a3149066150afd41ee3d73e6	2021-09-24 22:14:30 -07:00
Hui Xiao	58444eadda	Make RateLimiter::GetTotalPendingRequest() non pure virtual for backward compatibility (#8938 ) Summary: Context/Summary: https://github.com/facebook/rocksdb/pull/8890 added a public API `RateLimiter::GetTotalPendingRequest()` but mistakenly marked it as pure virtual, forcing RateLimiter's derived classes to implement this function and breaking backward compatibility. This PR makes `RateLimiter::GetTotalPendingRequest()` a non-pure virtual method by providing a trivial implementation in rate_limiter.h Pull Request resolved: https://github.com/facebook/rocksdb/pull/8938 Test Plan: Passing existing tests Reviewed By: pdillinger Differential Revision: D31100661 Pulled By: hx235 fbshipit-source-id: 06eff1005156a6e5a881e393b2c5b2ad706897d8	2021-09-21 21:29:26 -07:00
mrambacher	6924869867	Make SystemClock into a Customizable Class (#8636 ) Summary: Made SystemClock into a Customizable class, complete with CreateFromString. Cleaned up some of the existing SystemClock implementations that were redundant (NoSleep was the same as the internal one for MockEnv). Changed MockEnv construction to allow Clock to be passed to the Memory/MockFileSystem. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8636 Reviewed By: zhichao-cao Differential Revision: D30483360 Pulled By: mrambacher fbshipit-source-id: cd0e3a876c39f8c98fe13374c06e8edbd5b9f2a1	2021-09-21 09:23:48 -07:00
Jay Zhuang	1c290c785d	RemoteCompaction support Fallback to local compaction (#8709 ) Summary: Add support for fallback to local compaction, the user can return `CompactionServiceJobStatus::kUseLocal` to instruct RocksDB to run the compaction locally instead of waiting for the remote compaction result. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8709 Test Plan: unittest Reviewed By: ajkr Differential Revision: D30560163 Pulled By: jay-zhuang fbshipit-source-id: 65d8905a4a1bc185a68daa120997f21d3198dbe1	2021-09-18 00:25:04 -07:00
Yanqin Jin	b512f4bc76	Batch blob read IO for MultiGet (#8699 ) Summary: In batched `MultiGet()`, RocksDB batches blob read IO and uses `RandomAccessFileReader::MultiRead()` to read the blobs instead of issuing multiple `Read()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8699 Test Plan: ``` make check ``` Reviewed By: ltamasi Differential Revision: D31030861 Pulled By: riversand963 fbshipit-source-id: a0df6060cbfd54cff9515a4eee08807b1dbcb0c8	2021-09-17 19:23:13 -07:00
Akanksha Mahajan	d6aa8c49f8	Expose blob file information through the EventListener interface (#8675 ) Summary: 1. Extend FlushJobInfo and CompactionJobInfo with information about the blob files generated by flush/compaction jobs. This PR adds two structures, BlobFileInfo and BlobFileGarbageInfo, that contain the required information about blob files. 2. Notify the creation and deletion of blob files through OnBlobFileCreationStarted, OnBlobFileCreated, and OnBlobFileDeleted. 3. Test OnFile*Finish operation notifications with Blob Files. 4. Log the blob file creation/deletion events through EventLogger in the log file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8675 Test Plan: Add new unit tests in listener_test Reviewed By: ltamasi Differential Revision: D30412613 Pulled By: akankshamahajan15 fbshipit-source-id: ca51b63c6e8c8d0485a38c503572bc5a82bd5d07	2021-09-16 17:23:36 -07:00
Jay Zhuang	b97c53b629	Add compaction priority information in RemoteCompaction (#8707 ) Summary: Add compaction priority information in RemoteCompaction, which can be used to schedule high priority job first. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8707 Test Plan: unittest Reviewed By: ajkr Differential Revision: D30548401 Pulled By: jay-zhuang fbshipit-source-id: b30446511fb31b4583c49edd8565d496cf013a34	2021-09-16 15:09:35 -07:00
Peter Dillinger	f4a1d10668	Fix flaky WALTrashCleanupOnOpen (#8917 ) Summary: Test did not consider that slower deletion rate only kicks in after a file is deleted Fixes https://github.com/facebook/rocksdb/issues/7546 Pull Request resolved: https://github.com/facebook/rocksdb/pull/8917 Test Plan: no longer reproduces using buck test mode/dev //internal_repo_rocksdb/repo:db_sst_test -- --exact 'internal_repo_rocksdb/repo:db_sst_test - DBWALTestWithParam/DBWALTestWithParam.WALTrashCleanupOnOpen/0' --jobs 40 --stress-runs 600 --record-results Reviewed By: siying Differential Revision: D30949127 Pulled By: pdillinger fbshipit-source-id: 5d0607f8f548071b07410fe8f532b4618cd225e5	2021-09-15 21:31:20 -07:00
Peter Dillinger	2819c7840e	Fix PrepopulateBlockCache::kFlushOnly (#8750 ) Summary: kFlushOnly currently means "always" except in the case of remote compaction. This change makes it apply to flushes only. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8750 Test Plan: test updated Reviewed By: akankshamahajan15 Differential Revision: D30968034 Pulled By: pdillinger fbshipit-source-id: 5dbd24dde18852a0e937a540995fba9bfbe89037	2021-09-15 15:33:20 -07:00
sdong	12d798ac06	Always initialize ArenaWrappedDBIter::db_iter_ to nullptr (#8889 ) Summary: ArenaWrappedDBIter::db_iter_ should never be nullptr. However, when debugging a segfault, it's hard to distinguish between the pointer being uninitialized (not possible) and other corruption. Initialize it to nullptr to help distinguish the cases. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8889 Test Plan: Run existing unit tests. Reviewed By: pdillinger Differential Revision: D30814756 fbshipit-source-id: 4b1f36896a33dc203d4f1f424ded9554927d61ba	2021-09-14 14:33:15 -07:00
Andrew Kryczka	d648cb47b9	Adapt key-value checksum for timestamp-suffixed keys (#8914 ) Summary: After https://github.com/facebook/rocksdb/issues/8725, keys added to `WriteBatch` may be timestamp-suffixed, while `WriteBatch` has no awareness of the timestamp size. Therefore, `WriteBatch` can no longer calculate timestamp checksum separately from the rest of the key's checksum in all cases. This PR changes the definition of key in KV checksum to include the timestamp suffix. That way we do not need to worry about where the timestamp begins within the key. I believe the only practical effect of this change is now `AssignTimestamp()` requires recomputing the whole key checksum (`UpdateK()`) rather than just the timestamp portion (`UpdateT()`). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8914 Test Plan: run stress command that used to fail ``` $ ./db_stress --batch_protection_bytes_per_key=8 -clear_column_family_one_in=0 -test_batches_snapshots=1 ``` Reviewed By: riversand963 Differential Revision: D30925715 Pulled By: ajkr fbshipit-source-id: c143f7ccb46c0efb390ad57ef415c250d754deff	2021-09-14 13:14:39 -07:00
eharry	0b6be7eb68	Fix WAL log data corruption #8723 (#8746 ) Summary: Fix WAL log data corruption when using DBOptions.manual_wal_flush(true) and WriteOptions.sync(true) together (https://github.com/facebook/rocksdb/issues/8723) Pull Request resolved: https://github.com/facebook/rocksdb/pull/8746 Reviewed By: ajkr Differential Revision: D30758468 Pulled By: riversand963 fbshipit-source-id: 07c20899d5f2447dc77861b4845efc68a59aa4e8	2021-09-13 20:15:59 -07:00
Peter Dillinger	7bef598440	Bypass unused parameterization in ExternalSSTFileBasicTest.IngestExte… (#8910 ) Summary: Facebook infrastructure doesn't like continuously skipping tests, so fixing this permanently disabled parameterization to BYPASS instead of SKIP. (Internal ref: T100525285) Pull Request resolved: https://github.com/facebook/rocksdb/pull/8910 Test Plan: manual Reviewed By: anand1976 Differential Revision: D30905169 Pulled By: pdillinger fbshipit-source-id: e23d63d2aa800e54676269fad3a093cd3f9f222d	2021-09-13 12:18:15 -07:00
Levi Tamasi	306b779957	Use GetBlobFileSize instead of GetTotalBlobBytes in DB properties (#8902 ) Summary: The patch adjusts the definition of BlobDB's DB properties a bit by switching to `GetBlobFileSize` from `GetTotalBlobBytes`. The difference is that the value returned by `GetBlobFileSize` includes the blob file header and footer as well, and thus matches the on-disk size of blob files. In addition, the patch removes the `Version` number from the `blob_stats` property, and updates/extends the unit tests a little. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8902 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D30859542 Pulled By: ltamasi fbshipit-source-id: e3426d2d567bd1bd8c8636abdafaafa0743c854c	2021-09-13 10:47:16 -07:00
mrambacher	dafa584fd1	Change the File System File Wrappers to std::unique_ptr (#8618 ) Summary: This allows the wrapper classes to own the wrapped object and eliminates confusion as to ownership. Previously, many classes implemented their own ownership solutions. Fixes https://github.com/facebook/rocksdb/issues/8606 Pull Request resolved: https://github.com/facebook/rocksdb/pull/8618 Reviewed By: pdillinger Differential Revision: D30136064 Pulled By: mrambacher fbshipit-source-id: d0bf471df8818dbb1770a86335fe98f761cca193	2021-09-13 08:46:19 -07:00
Yanqin Jin	2a2b3e03a5	Allow WriteBatch to have keys with different timestamp sizes (#8725 ) Summary: In the past, for simplicity, we unnecessarily required all keys in the same write batch to be from column families whose timestamps' formats are the same. Specifically, we could not use the same write batch to write to two column families, one of which enables timestamp while the other disables it. The limitation is due to the member `timestamp_size_` that used to exist in each `WriteBatch` object. We pass a timestamp_size to the constructor of `WriteBatch`. Therefore, users can simply use the old `WriteBatch::Put()`, `WriteBatch::Delete()`, etc. APIs for write, while the internal implementation of `WriteBatch` will take care of memory allocation for timestamps. The above is not necessary. On the one hand, users can set up a memory buffer to store the user key and then contiguously append the timestamp to the user key. Then the user can pass this buffer to the `WriteBatch::Put(Slice&)` API. On the other hand, users can set up a SliceParts object, which is an array of Slices, and let the last Slice point to the memory buffer storing the timestamp. Then the user can pass the SliceParts object to the `WriteBatch::Put(SliceParts&)` API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8725 Test Plan: make check Reviewed By: ltamasi Differential Revision: D30654499 Pulled By: riversand963 fbshipit-source-id: 9d848c77ad3c9dd629aa5fc4e2bc16fb0687b4a2	2021-09-12 15:34:26 -07:00
Peter Dillinger	bda8d93ba9	Fix and detect headers with missing dependencies (#8893 ) Summary: It's always annoying to find a header does not include its own dependencies and only works when included after other includes. This change adds `make check-headers` which validates that each header can be included at the top of a file. Some headers are excluded e.g. because of platform or external dependencies. rocksdb_namespace.h had to be re-worked slightly to enable checking for failure to include it. (ROCKSDB_NAMESPACE is a valid namespace name.) Fixes mostly involve adding and cleaning up #includes, but for FileTraceWriter, a constructor was out-of-lined to make a forward declaration sufficient. This check is not currently run with `make check` but is added to CircleCI build-linux-unity since that one is already relatively fast. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8893 Test Plan: existing tests and resolving issues detected by new check Reviewed By: mrambacher Differential Revision: D30823300 Pulled By: pdillinger fbshipit-source-id: 9fff223944994c83c105e2e6496d24845dc8e572	2021-09-10 10:00:26 -07:00
mrambacher	dc0dc90cf5	Make Statistics a Customizable Class (#8637 ) Summary: Make the Statistics object into a Customizable object. Statistics can now be stored and created to/from the Options file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8637 Reviewed By: zhichao-cao Differential Revision: D30530550 Pulled By: mrambacher fbshipit-source-id: 5fc7d01d8431f37b2c205bbbd8342c9f697023bd	2021-09-10 09:47:39 -07:00
anand76	eea566864e	Support custom Env in db_sst_test and external_sst_file_basic_test (#8888 ) Summary: Support custom Env in these tests. Some custom Envs do not support reopening a file for write, either normal mode or Random RW mode. Added some additional checks in external_sst_file_basic_test to accommodate those Envs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8888 Reviewed By: riversand963 Differential Revision: D30824481 Pulled By: anand1976 fbshipit-source-id: c3ac7a628e6df29e94f42e370e679934a4f77eac	2021-09-08 21:21:49 -07:00
Zhiyi Zhang	0cb0fc6fd3	Add DB properties for BlobDB (#8734 ) Summary: RocksDB exposes certain internal statistics via the DB property interface. However, there are currently no properties related to BlobDB. For starters, we would like to add the following BlobDB properties: `rocksdb.num-blob-files`: number of blob files in the current Version (kind of like `num-files-at-level` but note this is not per level, since blob files are not part of the LSM tree). `rocksdb.blob-stats`: this could return the total number and size of all blob files, and potentially also the total amount of garbage (in bytes) in the blob files in the current Version. `rocksdb.total-blob-file-size`: the total size of all blob files (as a blob counterpart for `total-sst-file-size`) of all Versions. `rocksdb.live-blob-file-size`: the total size of all blob files in the current Version. `rocksdb.estimate-live-data-size`: this is actually an existing property that we can extend so it considers blob files as well. When it comes to blobs, we actually have an exact value for live bytes. Namely, live bytes can be computed simply as total bytes minus garbage bytes, summed over the entire set of blob files in the Version. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8734 Test Plan: ``` ➜ rocksdb git:(new_feature_blobDB_properties) ./db_blob_basic_test [==========] Running 16 tests from 2 test cases. [----------] Global test environment set-up. [----------] 10 tests from DBBlobBasicTest [ RUN ] DBBlobBasicTest.GetBlob [ OK ] DBBlobBasicTest.GetBlob (12 ms) [ RUN ] DBBlobBasicTest.MultiGetBlobs [ OK ] DBBlobBasicTest.MultiGetBlobs (11 ms) [ RUN ] DBBlobBasicTest.GetBlob_CorruptIndex [ OK ] DBBlobBasicTest.GetBlob_CorruptIndex (10 ms) [ RUN ] DBBlobBasicTest.GetBlob_InlinedTTLIndex [ OK ] DBBlobBasicTest.GetBlob_InlinedTTLIndex (12 ms) [ RUN ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber [ OK ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber (9 ms) [ RUN ] DBBlobBasicTest.GenerateIOTracing [ OK ] DBBlobBasicTest.GenerateIOTracing (11 ms) [ RUN ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile [ OK ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile (13 ms) [ RUN ] DBBlobBasicTest.GetMergeBlobWithPut [ OK ] DBBlobBasicTest.GetMergeBlobWithPut (11 ms) [ RUN ] DBBlobBasicTest.MultiGetMergeBlobWithPut [ OK ] DBBlobBasicTest.MultiGetMergeBlobWithPut (14 ms) [ RUN ] DBBlobBasicTest.BlobDBProperties [ OK ] DBBlobBasicTest.BlobDBProperties (21 ms) [----------] 10 tests from DBBlobBasicTest (124 ms total) [----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 (12 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 (10 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 (1011 ms) [ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 [ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 (1013 ms) [----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest (2066 ms total) [----------] Global test environment tear-down [==========] 16 tests from 2 test cases ran. (2190 ms total) [ PASSED ] 16 tests. ``` Reviewed By: ltamasi Differential Revision: D30690849 Pulled By: Zhiyi-Zhang fbshipit-source-id: a7567319487ad76bd1a2e24bf143afdbbd9e4346	2021-09-08 12:22:04 -07:00
mrambacher	beed86473a	Make MemTableRepFactory into a Customizable class (#8419 ) Summary: This PR does the following: -> Makes the MemTableRepFactory into a Customizable class and creatable/configurable via CreateFromString -> Makes the existing implementations compatible with configurations -> Moves the "SpecialRepFactory" test class into testutil, accessible via the ObjectRegistry or a NewSpecial API New tests were added to validate the functionality and all existing tests pass. db_bench and memtablerep_bench were hand-tested to verify the functionality in those tools. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8419 Reviewed By: zhichao-cao Differential Revision: D29558961 Pulled By: mrambacher fbshipit-source-id: 81b7229636e4e649a0c914e73ac7b0f8454c931c	2021-09-08 07:46:44 -07:00
Andrew Kryczka	941543721d	Bytes read stat for `VerifyChecksum()` and `VerifyFileChecksums()` APIs (#8741 ) Summary: - Clarified some comments on compatibility for adding new ticker stats - Added read I/O stats for `VerifyChecksum()` and `VerifyFileChecksums()` APIs Pull Request resolved: https://github.com/facebook/rocksdb/pull/8741 Test Plan: new unit test Reviewed By: zhichao-cao Differential Revision: D30708578 Pulled By: ajkr fbshipit-source-id: d06b961f7e199ae92c266b683e39870aa8f63449	2021-09-07 13:28:29 -07:00
Peter Dillinger	0ef88538c6	Improve support for using regexes (#8740 ) Summary: * Consolidate use of std::regex for testing to testharness.cc, to minimize Facebook linters constantly flagging uses in non-production code. * Improve syntax and error messages for asserting some string matches a regex in tests. * Add a public Regex wrapper class to encapsulate existing usage in ObjectRegistry. * Remove unnecessary include <regex> * Put warnings that use of Regex in production code could cause bad performance or stack overflow. Intended follow-up work: * Replace std::regex with another underlying implementation like RE2 * Improve ObjectRegistry interface in terms of possibly confusing literal string matching vs. regex and in terms of reporting invalid regex. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8740 Test Plan: tests updated, basic unit test for public Regex, and some manual testing of temporary changes to see example error messages: utilities/backupable/backupable_db_test.cc:917: Failure 000010_1162373755_138626.blob (child.name) does not match regex [0-9]+_[0-9]+_[0-9]+[.]blobHAHAHA (pattern) db/db_basic_test.cc:74: Failure R3SHSBA8C4U0CIMV2ZB0 (sid3) does not match regex [0-9A-Z]{20}HAHAHA Reviewed By: mrambacher Differential Revision: D30706246 Pulled By: pdillinger fbshipit-source-id: ba845e8f563ccad39bdb58f44f04e9da8f78c3fd	2021-09-07 13:05:23 -07:00
Peter Dillinger	4750421ece	Replace most typedef with using= (#8751 ) Summary: Old typedef syntax is confusing. Most but not all changes were made with `perl -pi -e 's/typedef (.*) ([a-zA-Z0-9_]+);/using $2 = $1;/g' list_of_files` followed by `make format`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8751 Test Plan: existing Reviewed By: zhichao-cao Differential Revision: D30745277 Pulled By: pdillinger fbshipit-source-id: 6f65f0631c3563382d43347896020413cc2366d9	2021-09-07 11:31:59 -07:00
Levi Tamasi	55ef8972fc	Support custom env in db_blob_{basic,compaction,corruption,index}_test (#8817 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8817 Test Plan: Ran `make check` and built/tested using internal custom environment. Reviewed By: riversand963 Differential Revision: D30768215 Pulled By: ltamasi fbshipit-source-id: cce96211d4c097612d20247f2e997358f40cc3d3	2021-09-07 11:13:56 -07:00
Akanksha Mahajan	e8a7001159	Update branch as "main" in tools/advisor/README.md (#8744 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8744 Reviewed By: ltamasi Differential Revision: D30716145 Pulled By: akankshamahajan15 fbshipit-source-id: c2fcaf9ddcae85a86c0f10496acab28cd795ff12	2021-09-01 20:26:28 -07:00
Peter Dillinger	c9cd5d25a8	Remove some unneeded code (#8736 ) Summary: * FullKey and ParseFullKey appear to serve no purpose in the public API (or anything else) so removed. Only use in one test updated. * NumberToString serves no purpose vs. ToString so removed, numerous calls updated * Remove unnecessary forward declarations in metadata.h by re-arranging class definitions. * Remove some unneeded semicolons Pull Request resolved: https://github.com/facebook/rocksdb/pull/8736 Test Plan: existing tests Reviewed By: mrambacher Differential Revision: D30700039 Pulled By: pdillinger fbshipit-source-id: 1e436a576f511a6ed8b4d97af7cc8216bc729af2	2021-09-01 14:28:58 -07:00
Peter Dillinger	13ded69484	Built-in support for generating unique IDs, bug fix (#8708 ) Summary: Env::GenerateUniqueId() works fine on Windows and on POSIX where /proc/sys/kernel/random/uuid exists. Our other implementation is flawed and easily produces collisions in a new multi-threaded test. As we rely more heavily on DB session ID uniqueness, this becomes a serious issue. This change combines several individually suitable entropy sources for reliable generation of random unique IDs, with the goal of uniqueness and portability, not cryptographic strength nor maximum speed. Specifically: * Moves code for getting UUIDs from the OS to port::GenerateRfcUuid rather than in Env implementation details. Callers are now told whether the operation fails or succeeds. * Adds an internal API GenerateRawUniqueId for generating high-quality 128-bit unique identifiers, by combining entropy from three "tracks": * Lots of info from default Env like time, process id, and hostname. * std::random_device * port::GenerateRfcUuid (when working) * Built-in implementations of Env::GenerateUniqueId() will now always produce an RFC 4122 UUID string, either from platform-specific API or by converting the output of GenerateRawUniqueId. DB session IDs now use GenerateRawUniqueId while DB IDs (not as critical) try to use port::GenerateRfcUuid but fall back on GenerateRawUniqueId with conversion to an RFC 4122 UUID. GenerateRawUniqueId is declared and defined under env/ rather than util/ or even port/ because of the Env dependency. Likely follow-up: enhance GenerateRawUniqueId to be faster after the first call and to guarantee uniqueness within the lifetime of a single process (imparting the same property onto DB session IDs). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8708 Test Plan: A new mini-stress test in env_test checks the various public and internal APIs for uniqueness, including each track of GenerateRawUniqueId individually. We can't hope to verify anywhere close to 128 bits of entropy, but it can at least detect flaws as bad as the old code. Serial execution of the new tests takes about 350 ms on my machine. Reviewed By: zhichao-cao, mrambacher Differential Revision: D30563780 Pulled By: pdillinger fbshipit-source-id: de4c9ff4b2f581cf784fcedb5f39f16e5185c364	2021-08-30 15:20:41 -07:00
Zaorang Yang	2bc914094d	Refactor with VersionBuilder (#8706 ) Summary: Introduce a new function to save sst files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8706 Reviewed By: jay-zhuang Differential Revision: D30544242 Pulled By: riversand963 fbshipit-source-id: 554755852daff7ae1c7864b0029f51b27099ee09	2021-08-27 12:15:08 -07:00
James Yin	7ddc096d7d	Fix typo in the comment of log_empty_ (#8711 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8711 Reviewed By: riversand963 Differential Revision: D30566761 Pulled By: jay-zhuang fbshipit-source-id: dd4690f5e2af2d263ed75ea1b9ed24692fe81362	2021-08-27 12:10:29 -07:00
anand76	ebaa3c8a59	Fix a race condition in DumpStats() during iteration of the ColumnFamilySet (#8714 ) Summary: DumpStats() iterates through the ColumnFamilySet. There is a potential race condition because it does not Ref the cfd, and the cfd could get destroyed during the iteration. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8714 Test Plan: make check Reviewed By: ltamasi Differential Revision: D30580199 Pulled By: anand1976 fbshipit-source-id: 60a3443ad0d4f7ac6a977dec780e6d2c1b70b850	2021-08-26 15:40:26 -07:00
Jay Zhuang	4afa24f8ae	Deflake test `CompactionJobTest.InputSerialization` (#8712 ) Summary: It's invalid to have an empty file name. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8712 Test Plan: ``` $ gtest-parallel ./compaction_job_test --gtest_filter=CompactionJobTest.InputSerialization -r 10000 ``` Reviewed By: pdillinger Differential Revision: D30566739 Pulled By: jay-zhuang fbshipit-source-id: 41e73175e3c95c4b73b4fdcd33470788d4e29d37	2021-08-26 09:27:37 -07:00
Yanqin Jin	f235f4b0a3	Fix a bug of secondary instance sequence going backward (#8653 ) Summary: Recent refactor of `ReactiveVersionSet::ReadAndApply()` uses `ManifestTailer` whose `Iterate()` method can cause the db's `last_sequence_` to go backward. Consequently, read requests can see out-dated data. For example, latest changes to the primary will not be seen on the secondary even after a `TryCatchUpWithPrimary()` if no new write batches are read from the WALs and no new MANIFEST entries are read from the MANIFEST. Fix the bug so that `VersionEditHandler::CheckIterationResult` will never decrease `last_sequence_`, `last_allocated_sequence_` and `last_published_sequence_`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8653 Test Plan: make check Reviewed By: jay-zhuang Differential Revision: D30272084 Pulled By: riversand963 fbshipit-source-id: c6a49c534b2509b93ef62d8936ed0acd5b860eaa	2021-08-24 18:18:36 -07:00
Peter Dillinger	318fe6941a	Add port::GetProcessID() (#8693 ) Summary: Useful in some places for object uniqueness across processes. Currently used for generating a host-wide identifier of Cache objects but expected to be used soon in some unique id generation code. `int64_t` is chosen for return type because POSIX uses signed integer type, usually `int`, for `pid_t` and Windows uses `DWORD`, which is `uint32_t`. Future work: avoid copy-pasted declarations in port_*.h, perhaps with port_common.h always included from port.h Pull Request resolved: https://github.com/facebook/rocksdb/pull/8693 Test Plan: manual for now Reviewed By: ajkr, anand1976 Differential Revision: D30492876 Pulled By: pdillinger fbshipit-source-id: 39fc2788623cc9f4787866bdb67a4d183dde7eef	2021-08-24 17:46:14 -07:00
Yanqin Jin	229350ef48	Allow iterate refresh for secondary instance (#8700 ) Summary: Test plan make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/8700 Reviewed By: zhichao-cao Differential Revision: D30523907 Pulled By: riversand963 fbshipit-source-id: 68928ab4dafb64ce80ab7bc69d83727a4713ab91	2021-08-24 15:40:56 -07:00
Andrew Kryczka	c521f22a1e	Deflake write-prepared and write-unprepared tests (#8696 ) Summary: The `JobContext::job_snapshot` referenced DB state but could have been deleted by a BG thread after the signal/unlock allowing shutdown to proceed. Then we would see an error like this (valgrind): ``` ==354104== Thread 2: ==354104== Invalid read of size 8 ==354104== at 0x694C4D: rocksdb::ManagedSnapshot::~ManagedSnapshot() (snapshot_impl.cc:20) ==354104== by 0x58F5BA: operator() (unique_ptr.h:81) ==354104== by 0x58F5BA: operator() (unique_ptr.h:75) ==354104== by 0x58F5BA: ~unique_ptr (unique_ptr.h:292) ==354104== by 0x58F5BA: rocksdb::JobContext::~JobContext() (job_context.h:221) ==354104== by 0x5F155E: rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction, rocksdb::Env::Priority) (db_impl_compaction_flush.cc:2696) ==354104== by 0x5F1BC2: rocksdb::DBImpl::BGWorkCompaction(void) (db_impl_compaction_flush.cc:2468) ==354104== by 0x83707A: operator() (std_function.h:688) ==354104== by 0x83707A: rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long) (threadpool_imp.cc:266) ==354104== by 0x8373ED: rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*) (threadpool_imp.cc:307) ==354104== by 0x492A800: execute_native_thread_routine (in /usr/local/fbcode/platform009/lib/libstdc++.so.6.0.28) ==354104== by 0x4A5020B: start_thread (in /usr/local/fbcode/platform009/lib/libpthread-2.30.so) ==354104== by 0x4CF281E: clone (in /usr/local/fbcode/platform009/lib/libc-2.30.so) ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/8696 Test Plan: unable to repro Reviewed By: pdillinger Differential Revision: D30505277 Pulled By: ajkr fbshipit-source-id: 5a99f34137cd14d06b0f624add6d37a70a61135d	2021-08-23 23:09:17 -07:00
Jay Zhuang	249b1078c9	Add extra information to RemoteCompaction APIs (#8680 ) Summary: Currently, we only provide job_id in RemoteCompaction APIs. The main problem with `job_id` is that it cannot uniquely identify a compaction job across DB instances or across sessions. This change provides the DB id and session id to the user, which will make building a cross-DB compaction service easier. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8680 Test Plan: unittest Reviewed By: ajkr Differential Revision: D30444859 Pulled By: jay-zhuang fbshipit-source-id: fdf107f4286564049637f154193c6d94c3c59448	2021-08-23 16:27:38 -07:00
Peter Dillinger	04db764831	Embed original file number in SST table properties (#8686 ) Summary: I very recently realized that with https://github.com/facebook/rocksdb/issues/8669 we cannot later add file numbers to external SST files (so that more can share db session ids for better uniqueness properties), because of forward compatibility. We would have a version of RocksDB that assumes session IDs are unique on external SST files and therefore can't really break that invariant in future files. This change adds a table property for "orig_file_number" which is populated by normal SST files and also external SST files generated by SstFileWriter. SstFileWriter now keeps a db_session_id for the life of the object and increments its own file numbers for embedding in table properties. (They are arguably "fake" file numbers because these numbers are not embedded in the file name.) While updating block_based_table_builder, I removed several unnecessary fields from Rep, because following the pattern would have created another unnecessary field. This change also updates block_based_table_reader to use this new property when available, which means that for newer SST files, we can determine the stable/original <db_session_id,file_number> unique identifier using just the file contents, not the file name. (It's a bit complicated; detailed comments in block_based_table_reader.) Also added DB host id to properties listing by sst_dump, which could be useful in debugging. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8686 Test Plan: majorly overhauled StableCacheKeys test for this change Reviewed By: zhichao-cao Differential Revision: D30457742 Pulled By: pdillinger fbshipit-source-id: 2e5ae7dddeb94fb9d8eac8a928486aed8b8cd445	2021-08-20 20:40:48 -07:00
Peter Dillinger	2a383f21f4	Add Bloom/Ribbon hybrid API support (#8679 ) Summary: This is essentially resurrection and fixing of the part of https://github.com/facebook/rocksdb/issues/8198 that was reverted in https://github.com/facebook/rocksdb/issues/8212, using data added in https://github.com/facebook/rocksdb/issues/8246. Basically, when configuring Ribbon filter, you can specify an LSM level before which Bloom will be used instead of Ribbon. But Bloom is only considered for Leveled and Universal compaction styles and files going into a known LSM level. This way, SST file writer, FIFO compaction, etc. use Ribbon filter as you would expect with NewRibbonFilterPolicy. So that this can be controlled with a single int value and so that flushes can be distinguished from intra-L0, we consider flush to go to level -1 for the purposes of this option. (Explained in API comment.) I also expect the most common and recommended Ribbon configuration to use Bloom during flush, to minimize slowing down writes and because according to my estimates, Ribbon only pays off if the structure lives in memory for more than an hour. Thus, I have changed the default for NewRibbonFilterPolicy to be this mild hybrid configuration. I don't really want to add something like NewHybridFilterPolicy because at least the mild hybrid configuration (Bloom for flush, Ribbon otherwise) should be considered a natural choice. C APIs also updated, but because they don't support overloading, rocksdb_filterpolicy_create_ribbon is kept pure ribbon for clarity and rocksdb_filterpolicy_create_ribbon_hybrid must be called for a hybrid configuration. While touching C API, I changed bits per key options from int to double. BuiltinFilterPolicy is needed so that LevelThresholdFilterPolicy doesn't inherit unused fields from BloomFilterPolicy. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8679 Test Plan: new + updated tests, including crash test Reviewed By: jay-zhuang Differential Revision: D30445797 Pulled By: pdillinger fbshipit-source-id: 6f5aeddfd6d79f7e55493b563c2d1d2d568892e1	2021-08-20 18:00:16 -07:00
Merlin Mao	baf22b4ee6	Add `IteratorTraceExecutionResult` for iterator related trace records. (#8687 ) Summary: - Allow getting `Valid()`, `status()`, `key()` and `value()` of an iterator from `IteratorTraceExecutionResult`. - Move lower bound and upper bound from `IteratorSeekQueryTraceRecord` to `IteratorQueryTraceRecord`. Added test in `DBTest2.TraceAndReplay`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8687 Reviewed By: zhichao-cao Differential Revision: D30457630 Pulled By: autopear fbshipit-source-id: be433099a25895b3aa6f0c00f95ad7b1d7489c1d	2021-08-20 15:35:56 -07:00
Akanksha Mahajan	5efec84c60	Fix blob callback in compaction and atomic flush (#8681 ) Summary: Pass BlobFileCompletionCallback in case of atomic flush and compaction job, where it is currently nullptr (the default parameter). BlobFileCompletionCallback is used in case of IntegratedBlobDB to report new blob files to SstFileManager. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8681 Test Plan: CircleCI jobs Reviewed By: ltamasi Differential Revision: D30445998 Pulled By: akankshamahajan15 fbshipit-source-id: ba48093843864faec57f1f365cce7b5a569c4021	2021-08-20 11:41:14 -07:00
Merlin Mao	ff8953380f	Add iterator's lower and upper bounds to `TraceRecord` (#8677 ) Summary: Trace file V2 added lower/upper bounds to `Iterator::Seek()` and `Iterator::SeekForPrev()`. They were not used anywhere during the execution of a `TraceRecord`. Now they are added to be used by `ReadOptions` during `Iterator::Seek()` and `Iterator::SeekForPrev()` if they are set. Added test cases in `DBTest2.TraceAndManualReplay`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8677 Reviewed By: zhichao-cao Differential Revision: D30438255 Pulled By: autopear fbshipit-source-id: 82563006be0b69155990e506a74951c18af8d288	2021-08-19 17:27:12 -07:00
Baptiste Lemaire	c625b8d017	Add condition on NotifyOnFlushComplete that FlushJob was not mempurge. Add event listeners to mempurge tests. (#8672 ) Summary: Previously, when a `FlushJob` was redirected to a MemPurge, the function `DBImpl::NotifyOnFlushComplete` was called, which created a series of issues because the JobInfo was not correctly collected from the memtables. This diff aims at correcting these two issues (`FlushJobInfo` collection in `FlushJob::MemPurge`, no call to `DBImpl::NotifyOnFlushComplete` after successful mempurge). Event listeners were added to the unit tests to handle these situations. Surprisingly, none of the crash tests caught this issue; I will try to add event listeners to crash tests in the future. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8672 Reviewed By: akankshamahajan15 Differential Revision: D30383109 Pulled By: bjlemaire fbshipit-source-id: 35a8d4295886923ee4049a6447f00022cb221c73	2021-08-18 17:40:01 -07:00
Merlin Mao	d10801e983	Allow Replayer to report the results of TraceRecords. (#8657 ) Summary: `Replayer::Execute()` can directly return the result (e.g., request latency, DB::Get() return code, returned value, etc.). `Replayer::Replay()` reports the results via a callback function. New interface: `TraceRecordResult` in "rocksdb/trace_record_result.h". `DBTest2.TraceAndReplay` and `DBTest2.TraceAndManualReplay` are updated accordingly. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8657 Reviewed By: ajkr Differential Revision: D30290216 Pulled By: autopear fbshipit-source-id: 3c8d4e6b180ec743de1a9d9dcaee86064c74f0d6	2021-08-18 17:06:14 -07:00
Peter Dillinger	b6269b078a	Stable cache keys on ingested SST files (#8669 ) Summary: Extends https://github.com/facebook/rocksdb/issues/8659 to work for ingested external SST files, even the same file ingested into different DBs sharing a block cache. Note: These new cache keys are currently only enabled when FileSystem does not provide GetUniqueId. For now, they are typically larger, so slightly less efficient. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8669 Test Plan: Extended unit test Reviewed By: zhichao-cao Differential Revision: D30398532 Pulled By: pdillinger fbshipit-source-id: 1f13e2af4b8bfff5741953a69466e9589fbc23c7	2021-08-18 11:33:03 -07:00
Yanqin Jin	2b367fa8cc	Fix bug caused by releasing snapshot(s) during compaction (#8608 ) Summary: In debug mode, we are seeing assertion failure as follows ``` db/compaction/compaction_iterator.cc:980: void rocksdb::CompactionIterator::PrepareOutput(): \ Assertion `ikey_.type != kTypeDeletion && ikey_.type != kTypeSingleDeletion' failed. ``` It is caused by releasing earliest snapshot during compaction between the execution of `NextFromInput()` and `PrepareOutput()`. In one case, as demonstrated in unit test `WritePreparedTransaction.ReleaseEarliestSnapshotDuringCompaction_WithSD2`, incorrect result may be returned by a following range scan if we disable assertion, as in opt compilation level: the SingleDelete marker's sequence number is zeroed out, but the preceding PUT is also outputted to the SST file after compaction. Due to the logic of DBIter, the PUT will not be skipped and will be returned by iterator in range scan. https://github.com/facebook/rocksdb/issues/8661 illustrates what happened. Fix by taking a more conservative approach: make compaction zero out sequence number only if key is in the earliest snapshot when the compaction starts. Another assertion failure is ``` Assertion `current_user_key_snapshot_ == last_snapshot' failed. ``` It's caused by releasing the snapshot between the PUT and SingleDelete during compaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8608 Test Plan: make check Reviewed By: jay-zhuang Differential Revision: D30145645 Pulled By: riversand963 fbshipit-source-id: 699f58e66faf70732ad53810ccef43935d3bbe81	2021-08-17 22:14:20 -07:00
Levi Tamasi	6878cedcc3	Add statistics support to integrated BlobDB (#8667 ) Summary: The patch adds statistics support to the integrated BlobDB implementation, namely the tickers `BLOB_DB_BLOB_FILE_BYTES_READ` and `BLOB_DB_GC_{NUM_KEYS,BYTES}_RELOCATED`, and the histograms `BLOB_DB_(DE)COMPRESSION_MICROS`. (Some other statistics, like `BLOB_DB_BLOB_FILE_BYTES_WRITTEN`, `BLOB_DB_BLOB_FILE_SYNCED`, `BLOB_DB_BLOB_FILE_{READ,WRITE,SYNC}_MICROS` were already supported.) Note that the vast majority of the old BlobDB's tickers/histograms are not really applicable to the new implementation, since they e.g. pertain to calling dedicated BlobDB APIs (which the integrated BlobDB does not have) or are tied to the legacy BlobDB's design of writing blob files synchronously when a write API is called. Such statistics are marked "legacy BlobDB only" in `statistics.h`. Fixes https://github.com/facebook/rocksdb/issues/8645 . Pull Request resolved: https://github.com/facebook/rocksdb/pull/8667 Test Plan: Ran `make check` and tested the new statistics using `db_bench`. Reviewed By: riversand963 Differential Revision: D30356884 Pulled By: ltamasi fbshipit-source-id: 5f8a833faee60401c5643c2f0a6c0415488190a4	2021-08-17 17:22:31 -07:00
Peter Dillinger	a207c27809	Stable cache keys using DB session ids in SSTs (#8659 ) Summary: Use DB session ids in SST table properties to make cache keys stable across DB re-open and copy / move / restore / etc. These new cache keys are currently only enabled when FileSystem does not provide GetUniqueId. For now, they are typically larger, so slightly less efficient. Relevant to https://github.com/facebook/rocksdb/issues/7405 This change has a minor regression in PersistentCache functionality: metaindex blocks are no longer cached in PersistentCache. Table properties blocks already were not but ideally should be. I didn't spend effort to fix & test these issues because we don't believe PersistentCache is used much if at all and expect SecondaryCache to replace it. (Though PRs are welcome.) FIXME: there is more to be fixed for stable cache keys on external SST files Pull Request resolved: https://github.com/facebook/rocksdb/pull/8659 Test Plan: new unit test added, which fails when disabling new functionality Reviewed By: zhichao-cao Differential Revision: D30297705 Pulled By: pdillinger fbshipit-source-id: e8539a5c8802a79340405629870f2e3fb3822d3a	2021-08-16 20:37:20 -07:00
Adam Retter	5de333fd99	Add db_test2 to ASSERT_STATUS_CHECKED (#8640 ) Summary: This is the `db_test2` parts of https://github.com/facebook/rocksdb/pull/7737 reworked on the latest HEAD. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8640 Reviewed By: akankshamahajan15 Differential Revision: D30303684 Pulled By: mrambacher fbshipit-source-id: 263e2f82d849bde4048b60aed8b31e7deed4706a	2021-08-16 08:10:32 -07:00
Jay Zhuang	c55460c734	Add property `LiveSstFilesSizeAtTemperature` for tiered storage (#8644 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8644 Reviewed By: siying, zhichao-cao Differential Revision: D30236535 Pulled By: jay-zhuang fbshipit-source-id: 1758d1c46d83a5087560fb63d53a016bf999da81	2021-08-15 14:17:45 -07:00
Baptiste Lemaire	e51be2c5a1	Improve MemPurge sampling (#8656 ) Summary: Previously, the `MemPurge` sampling function was assessing whether a random entry from a memtable was garbage or not by simply querying the given memtable (see https://github.com/facebook/rocksdb/issues/8628 for more details). In this diff, I am updating the sampling function by querying not only the memtable the entry was drawn from, but also all subsequent memtables that have a greater memtable ID. I also added the size of the value for KV entries in the payload/useful payload estimates (which was also one of the reasons why sampling was not as good as mempurging all the time in terms of L0 SST files reduction). Once these changes were made, I was able to clean obsolete objects and functions from the `MemtableList` struct, and did a bit of cleanup everywhere. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8656 Reviewed By: pdillinger Differential Revision: D30288583 Pulled By: bjlemaire fbshipit-source-id: 7646a545ec56f4715949daa59ab5eee74540feb3	2021-08-13 14:35:41 -07:00
Merlin Mao	f58d276764	Make TraceRecord and Replayer public (#8611 ) Summary: New public interfaces: `TraceRecord` and `TraceRecord::Handler`, available in "rocksdb/trace_record.h". `Replayer`, available in `rocksdb/utilities/replayer.h`. Users can use `DB::NewDefaultReplayer()` to create a Replayer that replays a trace file automatically or manually. Unit tests: - `./db_test2 --gtest_filter="DBTest2.TraceAndReplay"`: Updated with the internal API changes. - `./db_test2 --gtest_filter="DBTest2.TraceAndManualReplay"`: New for manual replay. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8611 Reviewed By: ajkr Differential Revision: D30266329 Pulled By: autopear fbshipit-source-id: 1ecb3cbbedae0f6a67c18f0cc82e002b4d81b6f8	2021-08-11 19:32:46 -07:00
Baptiste Lemaire	e3a96c4823	Memtable sampling for mempurge heuristic. (#8628 ) Summary: Changes the API of the MemPurge process: the `bool experimental_allow_mempurge` and `experimental_mempurge_policy` flags have been replaced by a `double experimental_mempurge_threshold` option. This change of API reflects another major change introduced in this PR: the MemPurgeDecider() function now works by sampling the memtables being flushed to estimate the overall amount of useful payload (payload minus the garbage), and then compares this useful payload estimate with the `double experimental_mempurge_threshold` value. Therefore, when the value of this flag is `0.0` (default value), mempurge is simply deactivated. On the other hand, a value of `DBL_MAX` would be equivalent to always going through a mempurge regardless of the garbage ratio estimate. At the moment, a `double experimental_mempurge_threshold` value other than `0.0` or `DBL_MAX` is only supported with the `SkipList` memtable representation. Regarding the sampling, this PR includes the introduction of a `MemTable::UniqueRandomSample` function that collects (approximately) random entries from the memtable by using the new `SkipList::Iterator::RandomSeek()` under the hood, or by iterating through each memtable entry, depending on the target sample size and the total number of entries. The unit tests have been readapted to support this new API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8628 Reviewed By: pdillinger Differential Revision: D30149315 Pulled By: bjlemaire fbshipit-source-id: 1feef5390c95db6f4480ab4434716533d3947f27	2021-08-10 18:09:03 -07:00
Levi Tamasi	f63331ebaf	Attempt to deflake DBTestXactLogIterator.TransactionLogIteratorCorruptedLog (#8627 ) Summary: The patch attempts to deflake `DBTestXactLogIterator.TransactionLogIteratorCorruptedLog` by disabling file deletions while retrieving the list of WAL files and truncating the first WAL file. This is to prevent the `PurgeObsoleteFiles` call triggered by `GetSortedWalFiles` from invalidating the result of `GetSortedWalFiles`. The patch also cleans up the test case a bit and changes it to using `test::TruncateFile` instead of calling the `truncate` syscall directly. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8627 Test Plan: `make check` Reviewed By: akankshamahajan15 Differential Revision: D30147002 Pulled By: ltamasi fbshipit-source-id: db11072a4ad8900a2f859cb5294e22b1888c23f6	2021-08-10 11:10:07 -07:00
Jay Zhuang	61f83dfeb7	Add a unit test for tiered storage universal compaction (#8631 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8631 Reviewed By: siying Differential Revision: D30200385 Pulled By: jay-zhuang fbshipit-source-id: 0fa2bb15e74ff81762d767f234078e0fe0106c55	2021-08-09 13:44:23 -07:00
sdong	e7c24168d8	Move old files to warm tier in FIFO compactions (#8310 ) Summary: Some FIFO users want to keep the data for longer, but the old data is rarely accessed. This feature allows users to configure FIFO compaction so that data older than a threshold is moved to a warm storage tier. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8310 Test Plan: Add several unit tests. Reviewed By: ajkr Differential Revision: D28493792 fbshipit-source-id: c14824ea634814dee5278b449ab5c98b6e0b5501	2021-08-09 12:51:14 -07:00
Roy Crihfield	d4b75d295f	Add more C bindings for OptimisticTransactionDB (#8526 ) Summary: * `rocksdb_optimistictransactiondb_checkpoint_object_create` * `rocksdb_optimistictransactiondb_write` Pull Request resolved: https://github.com/facebook/rocksdb/pull/8526 Reviewed By: ajkr Differential Revision: D30076822 Pulled By: jay-zhuang fbshipit-source-id: a59956a8d5449e75d39a8087fbb2bad148cf697d	2021-08-06 19:10:48 -07:00
Levi Tamasi	87882736ef	Fix the sorting of KeyContexts for batched MultiGet (#8633 ) Summary: `CompareKeyContext::operator()` on the trunk has a bug: when comparing column family IDs, `lhs` is used for both sides of the comparison. This results in the `KeyContext`s getting sorted solely based on key, which in turn means that keys with the same column family do not necessarily form a single range in the sorted list. This violates an assumption of the batched `MultiGet` logic, leading to the same column family showing up multiple times in the list of `MultiGetColumnFamilyData`. The end result is the code attempting to check out the thread-local `SuperVersion` for the same CF multiple times, causing an assertion violation in debug builds and memory corruption/crash in release builds. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8633 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D30169182 Pulled By: ltamasi fbshipit-source-id: a47710652df7e95b14b40fb710924c11a8478023	2021-08-06 16:27:42 -07:00
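The ordering problem described in #8633 can be reproduced in isolation. The following is a minimal sketch, not the RocksDB code: `KeyCtx`, `BuggyCompare`, `FixedCompare`, and `CountCfRuns` are hypothetical names introduced here for illustration, and only the two fields relevant to ordering are modeled. It shows that when the column family comparison reads `lhs` on both sides, the sort degenerates to key order and a column family can appear as several separate ranges.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a MultiGet key context.
struct KeyCtx {
  uint32_t cf_id;
  std::string key;
};

// Buggy shape from the commit message: both IDs are read from `lhs`, so the
// column family comparison never decides and the order falls back to the key.
struct BuggyCompare {
  bool operator()(const KeyCtx& lhs, const KeyCtx& rhs) const {
    uint32_t cfid1 = lhs.cf_id;
    uint32_t cfid2 = lhs.cf_id;  // bug: should read rhs.cf_id
    if (cfid1 != cfid2) return cfid1 < cfid2;
    return lhs.key < rhs.key;
  }
};

// Intended shape: order by column family ID first, then by key.
struct FixedCompare {
  bool operator()(const KeyCtx& lhs, const KeyCtx& rhs) const {
    if (lhs.cf_id != rhs.cf_id) return lhs.cf_id < rhs.cf_id;
    return lhs.key < rhs.key;
  }
};

// Sorts a copy of `v` and counts maximal runs of equal cf_id. Batched
// processing that assumes one range per column family needs this count to
// equal the number of distinct column families.
template <typename Cmp>
int CountCfRuns(std::vector<KeyCtx> v, Cmp cmp) {
  std::sort(v.begin(), v.end(), cmp);
  int runs = 0;
  for (std::size_t i = 0; i < v.size(); ++i) {
    if (i == 0 || v[i].cf_id != v[i - 1].cf_id) ++runs;
  }
  return runs;
}

// Usage example: four keys spread over two column families.
inline void CompareExample() {
  std::vector<KeyCtx> v = {{1, "a"}, {2, "b"}, {1, "c"}, {2, "d"}};
  assert(CountCfRuns(v, BuggyCompare()) == 4);  // cf 1 and cf 2 interleave
  assert(CountCfRuns(v, FixedCompare()) == 2);  // one range per column family
}
```

With the buggy comparator the same column family shows up in more than one run, which matches the symptom in the message: the same CF appearing multiple times in the per-CF list.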
Zaorang Yang	e95c570047	Fix the wrong comment of level compaction cf paths test (#8533 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8533 Reviewed By: ajkr Differential Revision: D29718067 fbshipit-source-id: b4b91c9271362e7a7d47ddbaf28f56fb537cc668	2021-08-06 15:27:12 -07:00
mrambacher	d057e8326d	Make MergeOperator+CompactionFilter/Factory into Customizable Classes (#8481 ) Summary: - Changed MergeOperator, CompactionFilter, and CompactionFilterFactory into Customizable classes. - Added Options/Configurable/Object Registration for TTL and Cassandra variants - Changed the StringAppend MergeOperators to accept a string delimiter rather than a simple char. Made the delimiter into a configurable option - Added tests for new functionality Pull Request resolved: https://github.com/facebook/rocksdb/pull/8481 Reviewed By: zhichao-cao Differential Revision: D30136050 Pulled By: mrambacher fbshipit-source-id: 271d1772835935b6773abaf018ee71e42f9491af	2021-08-06 08:27:25 -07:00
Akanksha Mahajan	fd2079938d	Dynamically configure BlockBasedTableOptions.prepopulate_block_cache (#8620 ) Summary: Dynamically configure BlockBasedTableOptions.prepopulate_block_cache using DB::SetOptions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8620 Test Plan: Added new unit test Reviewed By: anand1976 Differential Revision: D30091319 Pulled By: akankshamahajan15 fbshipit-source-id: fb586d1848a8dd525bba7b2f9eeac34f2fc6d82c	2021-08-05 19:44:51 -07:00
Levi Tamasi	9b25d26dc8	Attempt to deflake ObsoleteFilesTest.DeleteObsoleteOptionsFile (#8624 ) Summary: We've been seeing occasional crashes on CI while inserting into the vectors in `ObsoleteFilesTest.DeleteObsoleteOptionsFile`. The crashes don't reproduce locally (could be either a race or an object lifecycle issue) but the good news is that the vectors in question are not really used for anything meaningful by the test. (The assertion about the sizes of the two vectors being equal is guaranteed to hold, since the two sync points where they are populated are right after each other.) The patch simply removes the vectors from the test, alongside the associated callbacks and sync points. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8624 Test Plan: `make check` Reviewed By: akankshamahajan15 Differential Revision: D30118485 Pulled By: ltamasi fbshipit-source-id: 0a4c3d06584e84cd2b1dcc212d274fa1b89cb647	2021-08-05 18:36:16 -07:00
Andrew Kryczka	a685a701ca	Do not attempt to rename non-existent info log (#8622 ) Summary: Previously we attempted to rename "LOG" to "LOG.old.*" without checking its existence first. "LOG" had no reason to exist in a new DB. Errors in renaming a non-existent "LOG" were swallowed via `PermitUncheckedError()` so things worked. However the storage service's error monitoring was detecting all these benign rename failures. So it is better to fix it. Also with this PR we can now distinguish rename failure for other reasons and return them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8622 Test Plan: new unit test Reviewed By: akankshamahajan15 Differential Revision: D30115189 Pulled By: ajkr fbshipit-source-id: e2f337ffb2bd171be0203172abc8e16e7809b170	2021-08-04 17:25:00 -07:00
Yanqin Jin	0879c24040	Fix NotifyOnFlushCompleted() for atomic flush (#8585 ) Summary: PR https://github.com/facebook/rocksdb/issues/5908 added `flush_jobs_info_` to `FlushJob` to make sure `OnFlushCompleted()` is called after committing flush results to MANIFEST. However, `flush_jobs_info_` is not updated in atomic flush, causing `NotifyOnFlushCompleted()` to skip `OnFlushCompleted()`. This PR fixes this, in a similar way to https://github.com/facebook/rocksdb/issues/5908 that handles regular flush. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8585 Test Plan: make check Reviewed By: jay-zhuang Differential Revision: D29913720 Pulled By: riversand963 fbshipit-source-id: 4ff023c98372fa2c93188d4a5c8a4e9ffa0f4dda	2021-08-03 13:31:10 -07:00
Akanksha Mahajan	8b2f60b668	Cache warming blocks during flush (#8561 ) Summary: Insert warm blocks (data, uncompressed dict, index and filter blocks) during flush in Block cache which is enabled under option BlockBasedTableOptions.prepopulate_block_cache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8561 Test Plan: Added unit test Reviewed By: anand1976 Differential Revision: D29773411 Pulled By: akankshamahajan15 fbshipit-source-id: 6631123c10134340ef0bd7e90baafaa6deba0e66	2021-08-03 12:44:15 -07:00
Baptiste Lemaire	b278152261	Fix db stress crash mempurge (#8604 ) Summary: The db_stress crash was caused by a call to `IsFlushPending()` made by a stats function which triggered an `assert([false])`, which I didn't plan when I created the `trigger_flush` bool. It turns out that this bool variable is not useful: I created it because I thought the `imm_flush_needed` atomic bool would actually trigger a flush. It turns out that this bool is only checked in `IsFlushPending` - this is its only use - and a flush is triggered by either a background thread checking on the imm array, or by an explicit call to `SchedulePendingFlush` which creates a flush request, that is then added to a flush request queue. In this PR, I reverted the MemtableList::Add function to what it was before my changes. I tested the fix by running the exact command line that deterministically triggered the assert error (see below), which confirmed that this is where the error was coming from. I also ran `db_crashtest.py whitebox` and `blackbox` for a couple of hours locally before committing this PR. 
Experiment run: ```./db_stress --acquire_snapshot_one_in=0 --allow_concurrent_memtable_write=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=76.90653425292307 --bottommost_compression_type=disable --cache_index_and_filter_blocks=1 --cache_size=1048576 --checkpoint_one_in=1000000 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=0 --compaction_ttl=2 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=zstd --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --db=/dev/shm/rocksdb/rocksdb_crashtest_blackbox --db_write_buffer_size=0 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --enable_compaction_filter=1 --enable_pipelined_write=0 --expected_values_path=/dev/shm/rocksdb/rocksdb_crashtest_expected --experimental_allow_mempurge=1 --experimental_mempurge_policy=kAlternate --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000000 --format_version=2 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=14 --index_type=0 --iterpercent=0 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=False --long_running_snapshots=1 --mark_for_compaction_one_file_in=10 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtablerep=skip_list --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --open_files=-1 --open_metadata_write_fault_one_in=8 --open_read_fault_one_in=32 --open_write_fault_one_in=16 --ops_per_thread=100000000 
--optimize_filters_for_memory=1 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=0 --pause_background_one_in=1000000 --periodic_compaction_seconds=1000 --prefix_size=-1 --prefixpercent=0 --progress_reports=0 --read_fault_one_in=0 --readpercent=60 --recycle_log_file_num=1 --reopen=20 --set_options_one_in=0 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --subcompactions=3 --sync=1 --sync_fault_injection=False --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=1 --unpartitioned_pinning=3 --use_clock_cache=0 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=0 --use_full_merge_v1=1 --use_merge=0 --use_multiget=0 --use_ribbon_filter=1 --user_timestamp_size=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --write_buffer_size=33554432 --write_dbid_to_manifest=1 --writepercent=35``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/8604 Reviewed By: pdillinger Differential Revision: D30047295 Pulled By: bjlemaire fbshipit-source-id: b9e379bfa3d6b9bd2b275725fb0bca4bd81a3dbe	2021-08-02 20:26:35 -07:00
Levi Tamasi	3f7e929865	Fix a race in ColumnFamilyData::UnrefAndTryDelete (#8605 ) Summary: The `ColumnFamilyData::UnrefAndTryDelete` code currently on the trunk unlocks the DB mutex before destroying the `ThreadLocalPtr` holding the per-thread `SuperVersion` pointers when the only remaining reference is the back reference from `super_version_`. The idea behind this was to break the circular dependency between `ColumnFamilyData` and `SuperVersion`: when the penultimate reference goes away, `ColumnFamilyData` can clean up the `SuperVersion`, which can in turn clean up `ColumnFamilyData`. (Assuming there is a `SuperVersion` and it is not referenced by anything else.) However, unlocking the mutex throws a wrench in this plan by making it possible for another thread to jump in and take another reference to the `ColumnFamilyData`, keeping the object alive in a zombie `ThreadLocalPtr`-less state. This can cause issues like https://github.com/facebook/rocksdb/issues/8440 , https://github.com/facebook/rocksdb/issues/8382 , and might also explain the `was_last_ref` assertion failures from the `ColumnFamilySet` destructor we sometimes observe during close in our stress tests. Digging through the archives, this unlocking goes way back to 2014 (or earlier). The original rationale was that `SuperVersionUnrefHandle` used to lock the mutex so it can call `SuperVersion::Cleanup`; however, this logic turned out to be deadlock-prone. https://github.com/facebook/rocksdb/pull/3510 fixed the deadlock but left the unlocking in place. https://github.com/facebook/rocksdb/pull/6147 then introduced the circular dependency and associated cleanup logic described above (in order to enable iterators to keep the `ColumnFamilyData` for dropped column families alive), and moved the unlocking-relocking snippet to its present location in `UnrefAndTryDelete`. 
Finally, https://github.com/facebook/rocksdb/pull/7749 fixed a memory leak but apparently exacerbated the race by (otherwise correctly) switching to `UnrefAndTryDelete` in `SuperVersion::Cleanup`. The patch simply eliminates the unlocking and relocking, which has been unnecessary ever since https://github.com/facebook/rocksdb/issues/3510 made `SuperVersionUnrefHandle` lock-free. This closes the window during which another thread could increase the reference count, and hopefully fixes the issues above. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8605 Test Plan: Ran `make check` and stress tests locally. Reviewed By: pdillinger Differential Revision: D30051035 Pulled By: ltamasi fbshipit-source-id: 8fe559e4b4ad69fc142579f8bc393ef525918528	2021-08-02 18:12:11 -07:00
yangzaorang	8e91bd90d2	Fix an issue with initializing blob header buffer (#8537 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8537 Reviewed By: ajkr Differential Revision: D29838132 Pulled By: jay-zhuang fbshipit-source-id: e3e78d5f85f240a1800ace417a8b634f74488e41	2021-08-02 17:15:06 -07:00
mrambacher	ab7f7c9e49	Allow WAL dir to change with db dir (#8582 ) Summary: Prior to this change, the "wal_dir" DBOption would always be set (defaults to dbname) when the DBOptions were sanitized. Because of this setting in the options file, it was not possible to rename/relocate a database directory after it had been created and use the existing options file. After this change, the "wal_dir" option is only set under specific circumstances. Methods were added to the ImmutableDBOptions class to see if it is set and if it is set to something other than the dbname. Additionally, a method was added to retrieve the effective value of the WAL dir (either the option or the dbname/path). Tests were added to the core and ldb to test that a database could be created and renamed without issue. Additional tests for various permutations of wal_dir were also added. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8582 Reviewed By: pdillinger, autopear Differential Revision: D29881122 Pulled By: mrambacher fbshipit-source-id: 67d3d033dc8813d59917b0a3fba2550c0efd6dfb	2021-07-30 12:16:44 -07:00
Yanqin Jin	066b51126d	Several simple local code clean-ups (#8565 ) Summary: This PR tries to remove some unnecessary checks as well as unreachable code blocks to improve readability. An obvious non-public API method naming typo is also corrected. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8565 Test Plan: make check Reviewed By: lth Differential Revision: D29963984 Pulled By: riversand963 fbshipit-source-id: cc96e8f09890e5cfe9b20eadb63bdca5484c150a	2021-07-30 12:07:49 -07:00
Peter Dillinger	1d34cd797e	Fix insecure internal API for GetImpl (#8590 ) Summary: Calling the GetImpl function could leave reference to a local callback function in a field of a parameter struct. As this is performance-critical code, I'm not going to attempt to sanitize this code too much, but make the existing hack a bit cleaner by reverting what it overwrites in the input struct. Added SaveAndRestore utility class to make that easier. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8590 Test Plan: added unit test for SaveAndRestore; existing tests for GetImpl Reviewed By: riversand963 Differential Revision: D29947983 Pulled By: pdillinger fbshipit-source-id: 2f608853f970bc06724e834cc84dcc4b8599ddeb	2021-07-29 17:23:01 -07:00
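The save-and-restore idea mentioned in #8590 is a small RAII pattern. The sketch below is an illustration under assumptions, not the RocksDB `SaveAndRestore` class: `SaveAndRestoreSketch`, `GetParams`, and `LookupWithLocalCallback` are hypothetical names, and the callback is modeled as a plain pointer. It shows how a field of a caller-owned struct can be pointed at a local for the duration of a call and then put back, so no reference to the local survives the call.

```cpp
#include <cassert>
#include <utility>

// Illustrative RAII helper: remembers *object on construction and writes the
// saved value back on destruction.
template <typename T>
class SaveAndRestoreSketch {
 public:
  // Saves the current value and overwrites it with new_value.
  SaveAndRestoreSketch(T* object, T new_value)
      : object_(object), saved_(*object) {
    *object_ = std::move(new_value);
  }
  ~SaveAndRestoreSketch() { *object_ = std::move(saved_); }

  SaveAndRestoreSketch(const SaveAndRestoreSketch&) = delete;
  SaveAndRestoreSketch& operator=(const SaveAndRestoreSketch&) = delete;

 private:
  T* object_;
  T saved_;
};

// Hypothetical parameter struct with a callback-like field.
struct GetParams {
  const int* callback = nullptr;
};

// Temporarily points params->callback at a local, reads through it, and
// restores the caller's value before returning.
inline int LookupWithLocalCallback(GetParams* params) {
  int local_callback = 42;
  SaveAndRestoreSketch<const int*> guard(&params->callback, &local_callback);
  return *params->callback;  // evaluated before the guard is destroyed
}

// Usage example: the caller's pointer is intact after the call.
inline void SaveAndRestoreExample() {
  int callers_value = 7;
  GetParams params;
  params.callback = &callers_value;
  assert(LookupWithLocalCallback(&params) == 42);
  assert(params.callback == &callers_value);
}
```

Restoring in the destructor means every return path puts the field back, which is the property the commit message is after.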
sdong	e8f218cb68	DB::GetSortedWalFiles() to ensure file deletion is disabled (#8591 ) Summary: If DB::GetSortedWalFiles() runs without file deletion disabled, a file might get deleted in the middle and an error is returned to users. This makes the function hard to use. Fix it by disabling file deletion if it is not already disabled. Fix another minor issue of logging within DB mutex, which should not be done unless a major failure happens. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8591 Test Plan: Run all existing tests Reviewed By: pdillinger Differential Revision: D29969412 fbshipit-source-id: d5f42b5271608a35b9b07687ce18157d7447b0de	2021-07-29 11:51:08 -07:00
Peter Dillinger	0804b44fb6	Some fixes and enhancements to `ldb repair` (#8544 ) Summary: * Basic handling of SST file with just range tombstones rather than failing assertion about smallest_seqno <= largest_seqno * Adds --verbose option so that there exists a way to see the INFO output from Repairer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8544 Test Plan: unit test added, manual testing for --verbose Reviewed By: ajkr Differential Revision: D29954805 Pulled By: pdillinger fbshipit-source-id: 696af25805fc36cc178b04ba6045922a22625fd9	2021-07-28 16:44:14 -07:00
jimmycleary	e0ff365a76	Replace macros in compaction_iterator.cc with inline functions (#8592 ) Summary: Internal task T96186510. Created new inline member functions in `CompactionIterator`, `DefinitelyInSnapshot`, `DefinitelyNotInSnapshot`, and `InEarliestSnapshot` to replace the macros at the top of `compaction_iterator.cc`. Placed the definitions in `compaction_iterator.h` in accordance with Google's style guide for inline functions. Separated the declarations and definitions, and only placed the `inline` keyword on the definitions, in line with ISO CPP recommendations. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8592 Test Plan: Ran `make check`. Successful build and all tests appeared to pass. Reviewed By: riversand963 Differential Revision: D29966782 Pulled By: jimmycFB fbshipit-source-id: 3584290bbbabf862e9ab58852281f46d37f58be6	2021-07-28 14:53:29 -07:00
Peter Dillinger	74b7c0d249	Fix use-after-free on implicit temporary FileOptions (#8571 ) Summary: FileOptions has an implicit conversion from EnvOptions and some internal APIs take `const FileOptions&` and save the reference, which is counter to Google C++ guidelines, > Avoid defining functions that require a const reference parameter to outlive the call, because const reference parameters bind to temporaries. Instead, find a way to eliminate the lifetime requirement (for example, by copying the parameter), or pass it by const pointer and document the lifetime and non-null requirements. This is at least a problem for repair.cc, which passes an EnvOptions to TableCache(), which would save a reference to the temporary copy as FileOptions. This was unfortunately only caught as a side effect of changes in https://github.com/facebook/rocksdb/issues/8544. This change fixes the repair.cc case and updates the involved internal APIs that save a reference to use `const FileOptions*` instead. Unfortunately, I don't know how to get any of our sanitizers to reliably report bugs like this, so I can't rule out more existing in our codebase. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8571 Test Plan: Test that issues seen with https://github.com/facebook/rocksdb/issues/8544 are fixed (can reproduce on AWS EC2) Reviewed By: ajkr Differential Revision: D29943890 Pulled By: pdillinger fbshipit-source-id: 95f9c5251548777b4dc994c1a083dd2add5799c9	2021-07-27 21:49:14 -07:00
Peter Dillinger	e352bd5742	Fix missing Handle release in TableCache::GetRangeTombstoneIterator (#8589 ) Summary: This appears to be little used code so not a major bug, but is blocking https://github.com/facebook/rocksdb/issues/8544 Pull Request resolved: https://github.com/facebook/rocksdb/pull/8589 Test Plan: Added regression test to the end of DBRangeDelTest::TableEvictedDuringScan. Without this fix, ASAN reports memory leak. Reviewed By: ajkr Differential Revision: D29943623 Pulled By: pdillinger fbshipit-source-id: f7115fa6d4440aef83888ff609aa03d09216463b	2021-07-27 21:32:11 -07:00
mrambacher	3aee4fbd41	Make EventListener into a Customizable Class (#8473 ) Summary: - Added Type/CreateFromString - Added ability to load EventListeners to DBOptions - Since EventListeners did not previously have a Name(), defaulted to "". If there is no name, the listener cannot be loaded from the ObjectRegistry. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8473 Reviewed By: zhichao-cao Differential Revision: D29901488 Pulled By: mrambacher fbshipit-source-id: 2d3a4aa6db1562ac03e7ad41b360e3521d486254	2021-07-27 07:47:02 -07:00
Baptiste Lemaire	4361d6d163	Add simple heuristics for experimental mempurge. (#8583 ) Summary: Add `experimental_mempurge_policy` option flag and introduce two new `MemPurge` (Memtable Garbage Collection) policies: 'ALWAYS' and 'ALTERNATE'. Default value: ALTERNATE. `ALWAYS`: every flush will first go through a `MemPurge` process. If the output is too big to fit into a single memtable, then the mempurge is aborted and a regular flush process carries on. `ALWAYS` is designed for users that need to reduce the number of L0 SST files created to a strict minimum, and can afford a small dent in performance (possibly hits to CPU usage, read efficiency, and maximum burst write throughput). `ALTERNATE`: a flush is transformed into a `MemPurge` except if one of the memtables being flushed is the product of a previous `MemPurge`. `ALTERNATE` is a good tradeoff between reduction in number of L0 SST files created and performance. `ALTERNATE` performs particularly well for completely random garbage ratios, or garbage ratios anywhere in (0%,50%], and even higher when there is a wild variability in garbage ratios. This PR also includes support for `experimental_mempurge_policy` in `db_bench`. Testing was done locally by replacing all the `MemPurge` policies of the unit tests with `ALTERNATE`, as well as local testing with `db_crashtest.py` `whitebox` and `blackbox`. Overall, if an `ALWAYS` mempurge policy passes the tests, there is no reason why an `ALTERNATE` policy would fail, and therefore the mempurge policy was set to `ALWAYS` for all mempurge unit tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8583 Reviewed By: pdillinger Differential Revision: D29888050 Pulled By: bjlemaire fbshipit-source-id: e2cf26646d66679f6f5fb29842624615610759c1	2021-07-26 11:56:29 -07:00
leipeng	4171e3db9b	CompactionJob::Install(): fix log truncation (#8563 ) Summary: Event log info may be truncated because the default buffer size is 512. This PR changes the buffer size to 8192. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8563 Reviewed By: ajkr Differential Revision: D29838229 Pulled By: jay-zhuang fbshipit-source-id: 00c5dea3caff0641a209f02c972e92d65b505f50	2021-07-23 11:39:24 -07:00
Drewryz	3b27725245	Fix a minor issue with initializing the test path (#8555 ) Summary: The PerThreadDBPath has already specified a slash. It does not need to be specified when initializing the test path. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8555 Reviewed By: ajkr Differential Revision: D29758399 Pulled By: jay-zhuang fbshipit-source-id: 6d2b878523e3e8580536e2829cb25489844d9011	2021-07-23 08:38:45 -07:00
Baptiste Lemaire	c521a9ab2b	Retire superfluous functions introduced in earlier mempurge PRs. (#8558 ) Summary: The main challenge in making the memtable garbage collection prototype (nicknamed `mempurge`) was to not get rid of WAL files that contain unflushed (but mempurged) data. That was successfully guaranteed by not writing the VersionEdit to the MANIFEST file after a successful mempurge. By not writing VersionEdits to the `MANIFEST` file after a successful mempurge operation, we do not change the earliest log file number that contains unflushed data: `cfd->GetLogNumber()` (`cfd->SetLogNumber()` is only called in `VersionSet::ProcessManifestWrites`). As a result, a number of functions introduced earlier just for the mempurge operation are now obsolete/redundant (e.g.: `FlushJob::ExtractEarliestLogFileNumber`), and this PR aims at cleaning up all these now-unnecessary functions. In particular, we no longer need to store the earliest log file number in the `MemTable` struct itself. This PR therefore also reverts the `MemTable` struct to its original form. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8558 Test Plan: Already included in `db_flush_test.cc`. Reviewed By: anand1976 Differential Revision: D29764351 Pulled By: bjlemaire fbshipit-source-id: 0f43b260fa270251862512f397d3f24ee62e8437	2021-07-22 18:29:13 -07:00
Yanqin Jin	2e5388178f	Return error if trying to open secondary on missing or inaccessible primary (#8200 ) Summary: If the primary's CURRENT file is missing or inaccessible, the secondary should not hang trying repeatedly to switch to the next MANIFEST. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8200 Test Plan: make check Reviewed By: jay-zhuang Differential Revision: D27840627 Pulled By: riversand963 fbshipit-source-id: 071fed97cbab1bc5cdefd1dc235e5cd406c174e1	2021-07-22 15:48:58 -07:00
Peter Dillinger	84eef260de	Remove TaskLimiterToken::ReleaseOnce for fix (#8567 ) Summary: Rare TSAN and valgrind failures are caused by unnecessary reading of a field on the TaskLimiterToken::limiter_ for an assertion after the token has been released and the limiter destroyed. To simplify we can simply destroy the token before triggering DB shutdown (potentially destroying the limiter). This makes the ReleaseOnce logic unnecessary. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8567 Test Plan: watch for more failures in CI Reviewed By: ajkr Differential Revision: D29811795 Pulled By: pdillinger fbshipit-source-id: 135549ebb98fe4f176d1542ed85d5bd6350a40b3	2021-07-21 17:37:53 -07:00
Jay Zhuang	42eaa45c1b	Avoid updating option if there's no value updated (#8518 ) Summary: Try to avoid the expensive option-updating operation if `SetDBOptions()` does not change any option value. Skipping the update is not guaranteed; for example, changing `bytes_per_sync` to `0` may still trigger an update, as the value could be sanitized. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8518 Test Plan: added unittest Reviewed By: riversand963 Differential Revision: D29672639 Pulled By: jay-zhuang fbshipit-source-id: b7931de62ceea6f1bdff0d1209adf1197d3ed1f4	2021-07-21 13:45:59 -07:00
Zhichao Cao	87e82a41a9	Fix incorrect Status::NoSpace() status check (#8504 ) Summary: If we want to check whether a Status s is NoSpace() or not, we should check the subcode instead of using s==Status::NoSpace(). Fix some of the incorrect checks in the ErrorHandler. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8504 Test Plan: make check Reviewed By: anand1976 Differential Revision: D29601764 Pulled By: zhichao-cao fbshipit-source-id: cdab56a827891c23746bba9cbb53f169fe35f086	2021-07-20 18:09:51 -07:00
sdong	9e885939a3	Change code so that trimmed memtable history is released outside DB mutex (#8530 ) Summary: Currently, the code shows that we delete memtables immediately after they are trimmed from history. Although this should never happen, as the super version still holds the memtable and is only switched afterwards, it feels like good practice not to do it, and instead to clean it up in the standard way: put it in WriteContext and clean it up after releasing the DB mutex. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8530 Test Plan: Run all existing tests. Reviewed By: ajkr Differential Revision: D29703410 fbshipit-source-id: 21d8068ac6377de4b6fa7a89697195742659fde4	2021-07-16 19:28:48 -07:00
Peter Dillinger	df5dc73bec	Don't hold DB mutex for block cache entry stat scans (#8538 ) Summary: I previously didn't notice the DB mutex was being held during block cache entry stat scans, probably because I primarily checked for read performance regressions, because they require the block cache and are traditionally latency-sensitive. This change does some refactoring to avoid holding DB mutex and to avoid triggering and waiting for a scan in GetProperty("rocksdb.cfstats"). Some tests have to be updated because now the stats collector is populated in the Cache aggressively on DB startup rather than lazily. (I hope to clean up some of this added complexity in the future.) This change also ensures proper treatment of need_out_of_mutex for non-int DB properties. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8538 Test Plan: Added unit test logic that uses sync points to fail if the DB mutex is held during a scan, covering the various ways that a scan might be triggered. Performance test - the known impact to holding the DB mutex is on TransactionDB, and the easiest way to see the impact is to hack the scan code to almost always miss and take an artificially long time scanning. Here I've injected an unconditional 5s sleep at the call to ApplyToAllEntries. 
Before (hacked): $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 433.219 micros/op 2308 ops/sec; 0.1 MB/s ( transactions:78999 aborts:0) rocksdb.db.write.micros P50 : 16.135883 P95 : 36.622503 P99 : 66.036115 P100 : 5000614.000000 COUNT : 149677 SUM : 8364856 $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 448.802 micros/op 2228 ops/sec; 0.1 MB/s ( transactions:75999 aborts:0) rocksdb.db.write.micros P50 : 16.629221 P95 : 37.320607 P99 : 72.144341 P100 : 5000871.000000 COUNT : 143995 SUM : 13472323 Notice the 5s P100 write time.
After (hacked): $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 303.645 micros/op 3293 ops/sec; 0.1 MB/s ( transactions:98999 aborts:0) rocksdb.db.write.micros P50 : 16.061871 P95 : 33.978834 P99 : 60.018017 P100 : 616315.000000 COUNT : 187619 SUM : 4097407 $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 310.383 micros/op 3221 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0) rocksdb.db.write.micros P50 : 16.270026 P95 : 35.786844 P99 : 64.302878 P100 : 603088.000000 COUNT : 183819 SUM : 4095918 P100 write is now ~0.6s.
Not good, but it's the same even if I completely bypass all the scanning code: $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 311.365 micros/op 3211 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0) rocksdb.db.write.micros P50 : 16.274362 P95 : 36.221184 P99 : 68.809783 P100 : 649808.000000 COUNT : 183819 SUM : 4156767 $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op' randomtransaction : 308.395 micros/op 3242 ops/sec; 0.1 MB/s ( transactions:97999 aborts:0) rocksdb.db.write.micros P50 : 16.106222 P95 : 37.202403 P99 : 67.081875 P100 : 598091.000000 COUNT : 185714 SUM : 4098832 No substantial difference. Reviewed By: siying Differential Revision: D29738847 Pulled By: pdillinger fbshipit-source-id: 1c5c155f5a1b62e4fea0fd4eeb515a8b7474027b	2021-07-16 14:13:08 -07:00
Mark Rambacher	42ba60b3ba	Make EncryptionProvider and BlockCipher into Customizable objects (#8354 ) Summary: Made the EncryptionProvider and BlockCipher classes inherit from Customizable. Added/fixed the CreateFromString method to these classes to create instances from builtin or registered classes. Added tests to verify that instances can be registered and retrieved as appropriate. Added the ability to configure the builtin (CTR, ROT13) classes from configurable properties. Added the appropriate tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8354 Reviewed By: zhichao-cao Differential Revision: D29558949 Pulled By: mrambacher fbshipit-source-id: c20286b32d179777e060f51a58943e9b0cf81d04	2021-07-16 07:58:51 -07:00
Baptiste Lemaire	206845c057	Mempurge support for wal (#8528 ) Summary: In this PR, `mempurge` is made compatible with the Write Ahead Log: in case of recovery, the DB is now capable of recovering the data that was "mempurged" and kept in the `imm()` list of immutable memtables. The twist was to add a uint64_t to the `memtable` struct to store the number of the earliest log file containing entries from the `memtable`. When a `Flush` operation is replaced with a `MemPurge`, the `VersionEdit` (which usually contains the new min log file number to pick up for recovery and the level 0 file path of the newly created SST file) is no longer appended to the manifest log, and every time the `deleteWal` method is called, a check is made on the list of immutable memtables. This PR also includes a unit test that verifies that no data is lost upon Reopening of the database when the mempurge feature is activated. This extensive unit test includes two column families, with valid data contained in the imm() at time of "crash"/reopening (recovery). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8528 Reviewed By: pdillinger Differential Revision: D29701097 Pulled By: bjlemaire fbshipit-source-id: 072a900fb6ccc1edcf5eef6caf88f3060238edf9	2021-07-15 17:49:13 -07:00
longlijian	803a40d412	Delete legacy code not used any more. (#8508 ) Summary: The function removed in this PR is only declared and does not have any references. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8508 Reviewed By: mrambacher Differential Revision: D29649033 Pulled By: jay-zhuang fbshipit-source-id: df98143b73d6c184a2a60c9f7ea2548a065ee35d	2021-07-14 16:04:56 -07:00
hongrubb	870033291a	Fix Get() return status when block cache is disabled (#8485 ) Summary: This PR is for https://github.com/facebook/rocksdb/issues/8453 We need to update `s = biter.status();` when `biter.status().IsIncomplete()` is true. Doing this fixes the problem described in the issue. Besides, we still need to update `db_statistics` in `get_context.ReportCounters()` before returning. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8485 Reviewed By: jay-zhuang Differential Revision: D29604835 Pulled By: ajkr fbshipit-source-id: c7f2f1cd058223ce1b507ec05d57cf264b9c9710	2021-07-13 18:13:24 -07:00
bjlemaire	955b80e84f	Add WARN/INFO for mempurge output status. (#8514 ) Summary: The MemPurge output status can either be an Abort if the mempurge is aborted due to the new_mem memtable reaching more than the target capacity (currently 60%), or for other reasons. As a result, in the log, we want to differentiate between an abort status, which in this PR only leads to a ROCKS_LOG_INFO, and any other status, which in this PR leads to a ROCKS_LOG_WARN. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8514 Reviewed By: pdillinger Differential Revision: D29662446 Pulled By: bjlemaire fbshipit-source-id: c9bec8e238ebc7ecb14fbbddf580e6887e281c16	2021-07-12 10:42:14 -07:00

4853 Commits