rocksdb

Author	SHA1	Message	Date
Cheng Chang	3f9b75604d	Fix wrong level args (#7346 ) Summary: The level args should be output level instead of input levels. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7346 Test Plan: make check Reviewed By: ajkr Differential Revision: D23506373 Pulled By: cheng-chang fbshipit-source-id: b2f701d44c13581c5c10c4dbebded4fcd354d641	2020-09-03 23:17:37 -07:00
Hiep	d0c1a01c1b	Avoid converting MERGES to PUTS when allow_ingest_behind is true (#7166 ) Summary: - Closes https://github.com/facebook/rocksdb/issues/6490 - Currently MERGEs are converted to PUTs at bottom or compaction has reached the beginning of the key, this can wrongly cover a PUT future base case. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7166 Test Plan: - Automated: `make all check` - Manual: With `allow_ingest_behind = true`, add Merge operations to a key then run compaction. Then run ingesting external files to make sure the base case is probably compacted with existing Merges. Reviewed By: cheng-chang Differential Revision: D23325425 Pulled By: ajkr fbshipit-source-id: 3eb415eb7b381b5453e45245393566153b1abb68	2020-09-03 14:39:58 -07:00
Zhichao Cao	d51f88c9e4	Pass SST file checksum information through OnTableFileCreated (#7108 ) Summary: When SST file is created, application is able to know the file information through OnTableFileCreated callback in LogAndNotifyTableFileCreationFinished. Since file checksum information can be useful for application when the SST file is created, we add file_checksum and file_checksum_func_name information to TableFileCreationInfo, which will be passed through OnTableFileCreated. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7108 Test Plan: make check, listener_test. Reviewed By: ajkr Differential Revision: D22470240 Pulled By: zhichao-cao fbshipit-source-id: 92c20344d9b986eadfe3480f3769bf4add0dbaae	2020-08-25 10:46:11 -07:00
Jay Zhuang	ac7dcfda10	Add missing ComputeCompactionScore() for a new universal manual compaction (#7281 ) Summary: Seems it's only causing assert failure during compaction pick, but in production code, the problematic compactions are excluded at a later step. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7281 Reviewed By: akankshamahajan15 Differential Revision: D23228000 Pulled By: jay-zhuang fbshipit-source-id: 2e4055aeebe0f5a2b07e299e0a2d51a1ad2e216d	2020-08-19 17:42:08 -07:00
Andrew Kryczka	a1aa3f8385	Disable manual compaction during `ReFitLevel()` (#7250 ) Summary: Manual compaction with `CompactRangeOptions::change_levels` set could refit to a level targeted by another manual compaction. If force_consistency_checks were disabled, it could be possible for overlapping files to be written at that target level. This PR prevents the possibility by calling `DisableManualCompaction()` prior to `ReFitLevel()`. It also improves the manual compaction disabling mechanism to wait for pending manual compactions to complete before returning, and support disabling from multiple threads. Fixes https://github.com/facebook/rocksdb/issues/6432. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7250 Test Plan: crash test command that repro'd the bug reliably: ``` $ TEST_TMPDIR=/dev/shm python tools/db_crashtest.py blackbox --simple -target_file_size_base=524288 -write_buffer_size=1048576 -clear_column_family_one_in=0 -reopen=0 -max_key=10000000 -column_families=1 -max_background_compactions=8 -compact_range_one_in=100000 -compression_type=none -compaction_style=1 -num_levels=5 -universal_min_merge_width=4 -universal_max_merge_width=8 -level0_file_num_compaction_trigger=12 -rate_limiter_bytes_per_sec=1048576000 -universal_max_size_amplification_percent=100 --duration=3600 --interval=60 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --enable_compaction_filter=0 ``` Reviewed By: ltamasi Differential Revision: D23090800 Pulled By: ajkr fbshipit-source-id: afcbcd51b42ce76789fdb907d8b9ada790709c13	2020-08-14 11:29:52 -07:00
Akanksha Mahajan	1f9f630b27	Store FileSystemPtr object that contains FileSystem ptr (#7180 ) Summary: As part of the IOTracing project, this PR 1. Caches "FileSystemPtr" object(wrapper class that returns file system pointer based on tracing enabled) instead of "FileSystem" pointer. 2. FileSystemPtr object is created using FileSystem pointer and IOTracer pointer. 3. IOTracer shared_ptr is created in DBImpl and it is passed to different classes through constructor. 4. When tracing is enabled through DB::StartIOTrace, FileSystemPtr returns FileSystemTracingWrapper pointer for tracing purpose and when it is disabled underlying FileSystem pointer is returned. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7180 Test Plan: make check -j64 COMPILE_WITH_TSAN=1 make check -j64 Reviewed By: anand1976 Differential Revision: D22987117 Pulled By: akankshamahajan15 fbshipit-source-id: 6073617e4c2d5bc363914f3a1f55ae3b0a58fbf1	2020-08-12 17:31:23 -07:00
anand76	f308da5273	Fix delete triggered compaction for single level universal (#7224 ) Summary: Delete triggered compaction (DTC) for universal compaction style with ```num_levels = 1``` has been disabled for sometime due to a data correctness bug. This PR re-enables it with a bug fix. A file marked for compaction can be picked, along with all L0 files after it as the compaction input. We stop adding files to the input once we encounter a file already being compacted (the original bug failed to check the compaction status of the files). Tests: Add unit tests to ```compaction_picker_test.cc``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/7224 Reviewed By: ajkr Differential Revision: D23031845 Pulled By: anand1976 fbshipit-source-id: 9de3cab5f9774cede666c2c48d309a7d9b88a505	2020-08-10 12:19:17 -07:00
jsteemann	5e1808d515	fix typo: paraniod -> paranoid (#7163 ) Summary: Rename "paraniod" to "paranoid" in a few places. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7163 Reviewed By: ajkr Differential Revision: D22678242 fbshipit-source-id: 28b1011a736d0a95612676f7e1b9500a70c324b4	2020-08-06 14:25:34 -07:00
Andrew Kryczka	a4a4a2dabd	dedup ReadOptions in iterator hierarchy (#7210 ) Summary: Previously, a `ReadOptions` object was stored in every `BlockBasedTableIterator` and every `LevelIterator`. This redundancy consumes extra memory, resulting in the `Arena` making more allocations, and iteration observing worse cache performance. This PR migrates callers of `NewInternalIterator()` and `MakeInputIterator()` to provide a `ReadOptions` object guaranteed to outlive the returned iterator. When the iterator's lifetime will be managed by the user, this lifetime guarantee is achieved by storing the `ReadOptions` value in `ArenaWrappedDBIter`. Then, sub-iterators of `NewInternalIterator()` and `MakeInputIterator()` can hold a reference-to-const `ReadOptions`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7210 Test Plan: - `make check` under ASAN and valgrind - benchmark: on a DB with 2 L0 files and 3 L1+ levels, this PR reduced `Arena` allocation 4792 -> 4160 bytes. Reviewed By: anand1976 Differential Revision: D22861323 Pulled By: ajkr fbshipit-source-id: 54aebb3e89c872eeab0f5793b4b6e42878d093ce	2020-08-03 15:23:04 -07:00
Aaron Kabcenell	56ed601df3	Compaction Read/Write Stats by Compaction Type (#7165 ) Summary: Adds compaction statistics (total bytes read and written) for compactions that occur for delete-triggered, periodic, and TTL compaction reasons. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7165 Test Plan: TTL and periodic can be checked by runnning db_bench with the options activated: /db_bench --benchmarks="fillrandom,stats" --statistics --num=10000000 -base_background_compactions=16 -periodic_compaction_seconds=1 ./db_bench --benchmarks="fillrandom,stats" --statistics --num=10000000 -base_background_compactions=16 -fifo_compaction_ttl=1 Setting the time to one second causes non-zero bytes read/written for those compaction reasons. Disabling them or setting them to times longer than the test run length causes the stats to return to zero as expected. Delete-triggered compaction counting is tested in DBTablePropertiesTest.DeletionTriggeredCompactionMarking Reviewed By: ajkr Differential Revision: D22693050 Pulled By: akabcenell fbshipit-source-id: d15cef4d94576f703015c8942d5f0d492f69401d	2020-07-29 13:39:29 -07:00
Tomas Kolda	cd4592c220	SST Partitioner interface that allows to split SST files (#6957 ) Summary: SST Partitioner interface that allows to split SST files during compactions. It basically instruct compaction to create a new file when needed. When one is using well defined prefixes and prefixed way of defining tables it is good to define also partitioning so that promotion of some SST file does not cover huge key space on next level (worst case complete space). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6957 Reviewed By: ajkr Differential Revision: D22461239 fbshipit-source-id: 9ce07bba08b3ba89c2d45630520368f704d1316e	2020-07-24 13:44:49 -07:00
Jay Zhuang	b0c5ecd6b3	Make max_subcompactions dynamically changeable (#7159 ) Summary: Make `max-subcompactions` dynamically changeable by passing the `DBOption` to Compaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7159 Reviewed By: siying Differential Revision: D22671238 Pulled By: jay-zhuang fbshipit-source-id: 311ca9f6bb606965544d8708616d358cfed5be42	2020-07-22 18:32:52 -07:00
mrambacher	d44cbc5314	Add hash of key/value checks when paranoid_file_checks=true (#7134 ) Summary: When paraoid_files_checks=true, a rolling key-value hash is generated and compared to what is written to the file. If the values do not match, the SST file is rejected. Code put in place for the check for both flush and compaction jobs. Corresponding test added to corruption_test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7134 Reviewed By: cheng-chang Differential Revision: D22646149 fbshipit-source-id: 8fde1984a1a11edd3bd82a413acffc5ea7aa683f	2020-07-22 11:04:40 -07:00
Zhichao Cao	a10f12eda1	Auto resume the DB from Retryable IO Error (#6765 ) Summary: In current codebase, in write path, if Retryable IO Error happens, SetBGError is called. The retryable IO Error is converted to hard error and DB is in read only mode. User or application needs to resume it. In this PR, if Retryable IO Error happens in one DB, SetBGError will create a new thread to call Resume (auto resume). otpions.max_bgerror_resume_count controls if auto resume is enabled or not (if max_bgerror_resume_count<=0, auto resume will not be enabled). options.bgerror_resume_retry_interval controls the time interval to call Resume again if the previous resume fails due to the Retryable IO Error. If non-retryable error happens during resume, auto resume will terminate. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6765 Test Plan: Added the unit test cases in error_handler_fs_test and pass make asan_check Reviewed By: anand1976 Differential Revision: D21916789 Pulled By: zhichao-cao fbshipit-source-id: acb8b5e5dc3167adfa9425a5b7fc104f6b95cb0b	2020-07-15 11:03:58 -07:00
Yanqin Jin	27735dea9a	Report corrupted keys during compaction (#7124 ) Summary: Currently, RocksDB lets compaction to go through even in case of corrupted keys, the number of which is reported in CompactionJobStats. However, RocksDB does not check this value. We should let compaction run in a stricter mode. Temporarily disable two tests that allow corrupted keys in compaction. With this PR, the two tests will assert(false) and terminate. Still need to investigate what is the recommended google-test way of doing it. Death test (EXPECT_DEATH) in gtest has warnings now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7124 Test Plan: make check Reviewed By: ajkr Differential Revision: D22530722 Pulled By: riversand963 fbshipit-source-id: 6a5a6a992028c6d4f92cb74693c92db462ae4ad6	2020-07-14 17:18:17 -07:00
Jay Zhuang	00de699096	Replace reinterpret_cast with static_cast_with_check (#7067 ) Summary: Replace `reinterpret_cast` with `static_cast_with_check` for `DBImpl` and `ColumnFamilyHandleImpl`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7067 Reviewed By: siying Differential Revision: D22361587 Pulled By: jay-zhuang fbshipit-source-id: dfe9e8f3af39c3d27cc372c55ab9ad905eb0a5a1	2020-07-02 19:25:41 -07:00
Burton Li	5be2cb6948	Compaction filter support for BlobDB (#6850 ) Summary: Added compaction filter support for BlobDB non-TTL values. Same as vanilla RocksDB, user compaction filter applies to all k/v pairs of the compaction for non-TTL values. It honors `min_blob_size`, which potentially results value transitions between inlined data and stored-in-blob data when size of value is changed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6850 Reviewed By: siying Differential Revision: D22263487 Pulled By: ltamasi fbshipit-source-id: 8fc03f8cde2a5c831e63b436b3dbf1b7f90939e8	2020-06-29 17:32:14 -07:00
Yanqin Jin	d47c871190	Fix data race to VersionSet::io_status_ (#7034 ) Summary: After https://github.com/facebook/rocksdb/issues/6949 , VersionSet::io_status_ can be concurrently accessed by multiple threads without lock, causing tsan test to fail. For example, a bg flush thread resets io_status_ before calling LogAndApply(), while another thread already in the process of LogAndApply() reads io_status_. This is a bug. We do not have to reset io_status_ each time we call LogAndApply(). io_status_ is part of the state of VersionSet, and it indicates the outcome of preceding MANIFEST/CURRENT files IO operations. Its value should be updated only when: 1. MANIFEST/CURRENT files IO fail for the first time. 2. MANIFEST/CURRENT files IO succeed as part of recovering from a prior failure without process restart, e.g. calling Resume(). Test Plan (devserver): COMPILE_WITH_TSAN=1 make check COMPILE_WITH_TSAN=1 make db_test2 ./db_test2 --gtest_filter=DBTest2.CompactionStall Pull Request resolved: https://github.com/facebook/rocksdb/pull/7034 Reviewed By: zhichao-cao Differential Revision: D22247137 Pulled By: riversand963 fbshipit-source-id: 77b83e05390f3ee3cd2d96d3fdd6fe4f225e3216	2020-06-27 08:57:31 -07:00
Yanqin Jin	e66199d848	First step towards handling MANIFEST write error (#6949 ) Summary: This PR provides preliminary support for handling IO error during MANIFEST write. File write/sync is not guaranteed to be atomic. If we encounter an IOError while writing/syncing to the MANIFEST file, we cannot be sure about the state of the MANIFEST file. The version edits may or may not have reached the file. During cleanup, if we delete the newly-generated SST files referenced by the pending version edit(s), but the version edit(s) actually are persistent in the MANIFEST, then next recovery attempt will process the version edits(s) and then fail since the SST files have already been deleted. One approach is to truncate the MANIFEST after write/sync error, so that it is safe to delete the SST files. However, file truncation may not be supported on certain file systems. Therefore, we take the following approach. If an IOError is detected during MANIFEST write/sync, we disable file deletions for the faulty database. Depending on whether the IOError is retryable (set by underlying file system), either RocksDB or application can call `DB::Resume()`, or simply shutdown and restart. During `Resume()`, RocksDB will try to switch to a new MANIFEST and write all existing in-memory version storage in the new file. If this succeeds, then RocksDB may proceed. If all recovery is completed, then file deletions will be re-enabled. Note that multiple threads can call `LogAndApply()` at the same time, though only one of them will be going through the process MANIFEST write, possibly batching the version edits of other threads. When the leading MANIFEST writer finishes, all of the MANIFEST writing threads in this batch will have the same IOError. They will all call `ErrorHandler::SetBGError()` in which file deletion will be disabled. Possible future directions: - Add an `ErrorContext` structure so that it is easier to pass more info to `ErrorHandler`. Currently, as in this example, a new `BackgroundErrorReason` has to be added. Test plan (dev server): make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/6949 Reviewed By: anand1976 Differential Revision: D22026020 Pulled By: riversand963 fbshipit-source-id: f3c68a2ef45d9b505d0d625c7c5e0c88495b91c8	2020-06-24 19:07:08 -07:00
Zitan Chen	94d04529de	Store DB identity and DB session ID in SST files (#6983 ) Summary: `db_id` and `db_session_id` are now part of the table properties for all formats and stored in SST files. This adds about 99 bytes to each new SST file. The `TablePropertiesNames` for these two identifiers are `rocksdb.creating.db.identity` and `rocksdb.creating.session.identity`. In addition, SST files generated from SstFileWriter and Repairer have DB identity “SST Writer” and “DB Repairer”, respectively. Their DB session IDs are generated in the same way as `DB::GetDbSessionId`. A table property test is added. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6983 Test Plan: make check and some manual tests. Reviewed By: zhichao-cao Differential Revision: D22048826 Pulled By: gg814 fbshipit-source-id: afdf8c11424a6f509b5c0b06dafad584a80103c9	2020-06-17 10:57:40 -07:00
Andrew Kryczka	02db03af8d	make L0 index/filter pinned memory usage predictable (#6911 ) Summary: Memory pinned by `pin_l0_filter_and_index_blocks_in_cache` needs to be predictable based on user config. This PR makes sure we do not pin extra memory for large files generated by intra-L0 (see https://github.com/facebook/rocksdb/issues/6889). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6911 Test Plan: unit test Reviewed By: siying Differential Revision: D21835818 Pulled By: ajkr fbshipit-source-id: a11a088549d06bed8aacc2548d266e5983f0ead4	2020-06-09 16:51:23 -07:00
sdong	afa3518839	Revert "Update googletest from 1.8.1 to 1.10.0 (#6808 )" (#6923 ) Summary: This reverts commit `8d87e9cea1`. Based on offline discussions, it's too early to upgrade to gtest 1.10, as it prevents some developers from using an older version of gtest to integrate to some other systems. Revert it for now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6923 Reviewed By: pdillinger Differential Revision: D21864799 fbshipit-source-id: d0726b1ff649fc911b9378f1763316200bd363fc	2020-06-03 15:55:03 -07:00
Zhichao Cao	2adb7e3768	Fix potential overflow of unsigned type in for loop (#6902 ) Summary: x.size() -1 or y - 1 can overflow to an extremely large value when x.size() pr y is 0 when they are unsigned type. The end condition of i in the for loop will be extremely large, potentially causes segment fault. Fix them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6902 Test Plan: pass make asan_check Reviewed By: ajkr Differential Revision: D21843767 Pulled By: zhichao-cao fbshipit-source-id: 5b8b88155ac5a93d86246d832e89905a783bb5a1	2020-06-02 15:05:07 -07:00
Adam Retter	8d87e9cea1	Update googletest from 1.8.1 to 1.10.0 (#6808 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6808 Reviewed By: anand1976 Differential Revision: D21483984 Pulled By: pdillinger fbshipit-source-id: 70c5eff2bd54ddba469761d95e4cd4611fb8e598	2020-06-01 20:33:42 -07:00
Levi Tamasi	e3f953a863	Make it possible to look up files by number in VersionStorageInfo (#6862 ) Summary: Does what it says on the can: the patch adds a hash map to `VersionStorageInfo` that maps file numbers to file locations, i.e. (level, position in level) pairs. This will enable stricter consistency checks in `VersionBuilder`. The patch also fixes all the unit tests that used duplicate file numbers in a version (which would trigger an assertion with the new code). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6862 Test Plan: `make check` `make whitebox_crash_test` Reviewed By: riversand963 Differential Revision: D21670446 Pulled By: ltamasi fbshipit-source-id: 2eac249945cf33d8fb8597b26bfff5221e1a861a	2020-05-28 10:03:06 -07:00
anand76	50d63a2af0	Fix LITE build failure in compaction_picker_test (#6839 ) Summary: Fix LITE build. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6839 Test Plan: LITE=1 make check Reviewed By: riversand963 Differential Revision: D21535808 Pulled By: anand1976 fbshipit-source-id: fcad961eca08e13fb0c256c92d18c3c1f1165f22	2020-05-13 10:47:15 -07:00
anand76	94265234de	Fix race due to delete triggered compaction in Universal compaction mode (#6799 ) Summary: Delete triggered compaction in universal compaction mode was causing a corruption when scheduled in parallel with other compactions. 1. When num_levels = 1, a file marked for compaction may be picked along with all older files in L0, without checking if any of them are already being compaction. This can cause unpredictable results like resurrection of older versions of keys or deleted keys. 2. When num_levels > 1, a delete triggered compaction would not get scheduled if it overlaps with a running regular compaction. However, the reverse is not true. This is due to the fact that in ```UniversalCompactionBuilder::CalculateSortedRuns```, it assumes that entire sorted runs are picked for compaction and only checks the first file in a sorted run to determine conflicts. This is violated by a delete triggered compaction as it works on a subset of a sorted run. Fix the bug for num_levels > 1, and disable the feature for now when num_levels = 1. After disabling this feature, files would still get marked for compaction, but no compaction would get scheduled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6799 Reviewed By: pdillinger Differential Revision: D21431286 Pulled By: anand1976 fbshipit-source-id: ae9f0bdb1d6ae2f10284847db731c23f43af164a	2020-05-07 17:32:17 -07:00
Cheng Chang	91bc0130fa	Refactor level compaction picker (#6804 ) Summary: 1. refactor out PickFileToCompact to remove duplicated logic. 2. remove redundant checks of `start_level_inputs_.empty()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6804 Test Plan: make check Reviewed By: siying Differential Revision: D21390053 Pulled By: cheng-chang fbshipit-source-id: 185d5987a08bfdaf63f0f245310c6da69878d415	2020-05-05 11:09:29 -07:00
Peter Dillinger	b810e62b39	Clarifying comments in db.h (#6768 ) Summary: And fix a confusingly worded log message Pull Request resolved: https://github.com/facebook/rocksdb/pull/6768 Reviewed By: anand1976 Differential Revision: D21284527 Pulled By: pdillinger fbshipit-source-id: f03c1422c229a901c3a65e524740452349626164	2020-04-28 15:26:03 -07:00
Mike Kolupaev	e45673dece	Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621 ) Summary: Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype. Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling. It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas. Note that the deferred value loading only happens for internal iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621 Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats. Reviewed By: siying Differential Revision: D20786930 Pulled By: al13n321 fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee	2020-04-15 17:40:44 -07:00
Yanqin Jin	0c05624d50	Compaction with timestamp: input boundaries (#6645 ) Summary: Towards making compaction logic compatible with user timestamp. When computing boundaries and overlapping ranges for inputs of compaction, We need to compare SSTs by user key without timestamp. Test plan (devserver): ``` make check ``` Several individual tests: ``` ./version_set_test --gtest_filter=VersionStorageInfoTimestampTest.GetOverlappingInputs ./db_with_timestamp_compaction_test ./db_with_timestamp_basic_test ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/6645 Reviewed By: ltamasi Differential Revision: D20960012 Pulled By: riversand963 fbshipit-source-id: ad377fa9eb481bf7a8a3e1824aaade48cdc653a4	2020-04-10 16:05:49 -07:00
Akanksha Mahajan	a0faff126d	Report kFilesMarkedForCompaction for delete triggered compactions (#6680 ) Summary: Summary : Set manual_compaction false in case of DeleteTriggeredCompaction object so that kFilesMarkedForComapaction can be reported. Added a DeletionTriggeredUniversalCompactionMarking test case for Deletion Triggered compaction in case of Universal Compaction. Test Plan : make check -j64 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6680 Test Plan: make check -j64 Reviewed By: anand1976 Differential Revision: D20945946 Pulled By: akankshamahajan15 fbshipit-source-id: af84e417bd7127652aaae9143c560d1ab3815d25	2020-04-10 15:30:38 -07:00
Ziyue Yang	03a781a90c	Add pipelined & parallel compression optimization (#6262 ) Summary: This PR adds support for pipelined & parallel compression optimization for `BlockBasedTableBuilder`. This optimization makes block building, block compression and block appending a pipeline, and uses multiple threads to accelerate block compression. Users can set `CompressionOptions::parallel_threads` greater than 1 to enable compression parallelism. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6262 Reviewed By: ajkr Differential Revision: D20651306 fbshipit-source-id: 62125590a9c15b6d9071def9dc72589c1696a4cb	2020-04-01 16:40:18 -07:00
Yanqin Jin	b5818f87f0	Fix clang analyze error (#6622 ) Summary: As title. https://github.com/facebook/rocksdb/issues/6612 caused clang analyze to fail with the error: ``` db/compaction/compaction_picker_fifo.cc:105:39: warning: Called C++ object pointer is null cf_name.c_str(), f->fd.GetNumber(), creation_time); ^~~~~~~~~~~~~~~~~ ./logging/logging.h:59:36: note: expanded from macro 'ROCKS_LOG_BUFFER' ##__VA_ARGS__) ^~~~~~~~~~~ 1 warning generated. ``` Test Plan (devserver): USE_CLANG=1 make analyze Pull Request resolved: https://github.com/facebook/rocksdb/pull/6622 Reviewed By: ltamasi Differential Revision: D20787407 Pulled By: riversand963 fbshipit-source-id: a5de4910cc1aa0d3481a73ec114578925bfd63f7	2020-04-01 10:01:38 -07:00
sdong	80979f81c7	Make options.bottommost_compression, compression_opts and bottommost_compression_opts dynamically changeable. (#6615 ) Summary: These three options should be made dynamically changeable. Simply add them to MutableCFOptions and made the change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6615 Test Plan: Add a unit test to make sure that SetOptions() can change the options. Reviewed By: riversand963 Differential Revision: D20755951 fbshipit-source-id: 8165f4fd7a7a665cc7fb049698935022a5d2e7ff	2020-03-31 12:11:42 -07:00
Yanqin Jin	18cf0de640	Use flush time for the props.creation_time for FIFO compaction (#6612 ) Summary: For FIFO compaction, we use flush time instead of oldest key time as the creation time. This is to prevent FIFO compaction dropping files whose oldest key time is older than TTL but which has newer keys than TTL. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6612 Test Plan: make check Reviewed By: siying Differential Revision: D20748217 Pulled By: riversand963 fbshipit-source-id: 3f7b00a847020760537cdddd12f6fe039e5bc663	2020-03-30 18:59:17 -07:00
Zhichao Cao	e8d332d97e	Use FileChecksumGenFactory for SST file checksum (#6600 ) Summary: In the current implementation, sst file checksum is calculated by a shared checksum function object, which may make some checksum function hard to be applied here such as SHA1. In this implementation, each sst file will have its own checksum generator obejct, created by FileChecksumGenFactory. User needs to implement its own FilechecksumGenerator and Factory to plugin the in checksum calculation method. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6600 Test Plan: tested with make asan_check Reviewed By: riversand963 Differential Revision: D20717670 Pulled By: zhichao-cao fbshipit-source-id: 2a74c1c280ac11a07a1980185b43b671acaa71c6	2020-03-29 15:58:46 -07:00
Zhichao Cao	4246888101	Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487 ) Summary: In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status. The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487 Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check Reviewed By: anand1976 Differential Revision: D20685017 Pulled By: zhichao-cao fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0	2020-03-27 16:04:43 -07:00
Levi Tamasi	c15e85bdcb	Move BlobDB related files under db/ to db/blob/ (#6519 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6519 Test Plan: ``` make all make check ``` Differential Revision: D20400691 Pulled By: ltamasi fbshipit-source-id: 20ef911cf1c2c92c7f71ef0b493f9be64f2eef94	2020-03-12 11:00:56 -07:00
Yanqin Jin	fd1da22111	Support options.max_open_files != -1 with FIFO compaction (#6503 ) Summary: Allow user to specify options.max_open_files != -1 with FIFO compaction. If max_open_files != -1, not all table files are kept open. In the past, FIFO style compaction requires all table files to be open in order to read file creation time from table properties. Later, we added file creation time to MANIFEST, making it possible to read file creation time without opening file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6503 Test Plan: make check Differential Revision: D20353758 Pulled By: riversand963 fbshipit-source-id: ba5c61a648419e47e9ef6d74e0e280e3ee24f296	2020-03-09 18:45:06 -07:00
Zhichao Cao	8d73137ae8	Replace Directory with FSDirectory in DB (#6468 ) Summary: In the current code base, we can use Directory from Env to manage directory (e.g, Fsync()). The PR https://github.com/facebook/rocksdb/issues/5761 introduce the File System as a new Env API. So we further replace the Directory class in DB with FSDirectory such that we can have more IO information from IOStatus returned by FSDirectory. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6468 Test Plan: pass make asan_check Differential Revision: D20195261 Pulled By: zhichao-cao fbshipit-source-id: 93962cb9436852bfcfb76e086d9e7babd461cbe1	2020-03-02 16:16:26 -08:00
sdong	fdf882ded2	Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433 ) Summary: When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433 Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag. Differential Revision: D19977691 fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e	2020-02-20 12:09:57 -08:00
Zhichao Cao	4369f2c7bb	Checksum for each SST file and stores in MANIFEST (#6216 ) Summary: In the current code base, RocksDB generate the checksum for each block and verify the checksum at usage. Current PR enable SST file checksum. After a SST file is generated by Flush or Compaction, RocksDB generate the SST file checksum and store the checksum value and checksum method name in the vs_info and MANIFEST as part for the FileMetadata. Added the enable_sst_file_checksum to Options to enable or disable file checksum. Added sst_file_checksum to Options such that user can plugin their own SST file checksum calculate method via overriding the SstFileChecksum class. The checksum information inlcuding uint32_t checksum value and a checksum name (string). A new tool is added to LDB such that user can dump out a list of file checksum information from MANIFEST. If user enables the file checksum but does not provide the sst_file_checksum instance, RocksDB will use the default crc32checksum implemented in table/sst_file_checksum_crc32c.h Pull Request resolved: https://github.com/facebook/rocksdb/pull/6216 Test Plan: Added the testing case in table_test and ldb_cmd_test to verify checksum is correct in different level. Pass make asan_check. Differential Revision: D19171461 Pulled By: zhichao-cao fbshipit-source-id: b2e53479eefc5bb0437189eaa1941670e5ba8b87	2020-02-10 15:52:52 -08:00
Levi Tamasi	f34782a67d	Fix the "records dropped" statistics (#6325 ) Summary: The earlier code used two conflicting definitions for the number of input records going into a compaction, one based on the `rocksdb.num.entries` table property and one based on `CompactionIterationStats`. The first one is correct and in line with how output records are counted, while the second one incorrectly ignores input records in various cases when the `CompactionIterator` advances or reseeks the input iterator (this can happen, amongst other cases, when dealing with `SingleDelete`s, regular `Delete`s, `Merge`s, and compaction filters). This can result in the code undercounting the input records and computing an incorrect value for "records dropped" during the compaction. The patch fixes this by switching over to the correct (table property based) input record count for "records dropped". Pull Request resolved: https://github.com/facebook/rocksdb/pull/6325 Test Plan: Tested using `make check` and `db_bench`. Differential Revision: D19525491 Pulled By: ltamasi fbshipit-source-id: 4340b0b2f41546db8e356db70ca02199e48fa636	2020-01-23 15:27:22 -08:00
Sagar Vemuri	cfa585611d	Consider all compaction input files to compute the oldest ancestor time (#6279 ) Summary: Look at all compaction input files to compute the oldest ancestor time. In https://github.com/facebook/rocksdb/issues/5992 we changed how creation_time (aka oldest-ancestor-time) table property of compaction output files is computed from max(creation-time-of-all-compaction-inputs) to min(creation-time-of-all-inputs). This exposed a bug where, during compaction, the creation_time:s of only the L0 compaction inputs were being looked at, and all other input levels were being ignored. This PR fixes the issue. Some TTL compactions when using Level-Style compactions might not have run due to this bug. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6279 Test Plan: Enhanced the unit tests to validate that the correct time is propagated to the compaction outputs. Differential Revision: D19337812 Pulled By: sagar0 fbshipit-source-id: edf8a72f11e405e93032ff5f45590816debe0bb4	2020-01-10 19:02:42 -08:00
Levi Tamasi	02aa22957a	Set CompactionIterator::valid_ to false when PrepareBlobOutput indicates error Summary: With https://github.com/facebook/rocksdb/pull/6121, errors returned by `PrepareBlobValue` result in `CompactionIterator::status_` being set to `Corruption` or `IOError` as appropriate, however, `valid_` is not set to `false`. The error is eventually propagated in `CompactionJob::ProcessKeyValueCompaction` but only after the main loop completes. Setting `valid_` to `false` upon errors enables us to terminate the loop early and fail the compaction sooner. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6170 Test Plan: Ran `make check` and used `db_bench` in BlobDB mode. fbshipit-source-id: a2ca88a3ca71115e2605bd34a4c795d8a28bef27	2019-12-17 10:20:16 -08:00
anand76	afa2420c2b	Introduce a new storage specific Env API (#5761 ) Summary: The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc. This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO. The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before. This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection. The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761 Differential Revision: D18868376 Pulled By: anand1976 fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f	2019-12-13 14:48:41 -08:00
Levi Tamasi	583c6953d8	Move out valid blobs from the oldest blob files during compaction (#6121 ) Summary: The patch adds logic that relocates live blobs from the oldest N non-TTL blob files as they are encountered during compaction (assuming the BlobDB configuration option `enable_garbage_collection` is `true`), where N is defined as the number of immutable non-TTL blob files multiplied by the value of a new BlobDB configuration option called `garbage_collection_cutoff`. (The default value of this parameter is 0.25, that is, by default the valid blobs residing in the oldest 25% of immutable non-TTL blob files are relocated.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/6121 Test Plan: Added unit test and tested using the BlobDB mode of `db_bench`. Differential Revision: D18785357 Pulled By: ltamasi fbshipit-source-id: 8c21c512a18fba777ec28765c88682bb1a5e694e	2019-12-13 10:13:05 -08:00
Jermy Li	c2029f9716	Support concurrent CF iteration and drop (#6147 ) Summary: It's easy to cause coredump when closing ColumnFamilyHandle with unreleased iterators, especially iterators release is controlled by java GC when using JNI. This patch fixed concurrent CF iteration and drop, we let iterators(actually SuperVersion) hold a ColumnFamilyData reference to prevent the CF from being released too early. fixed https://github.com/facebook/rocksdb/issues/5982 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6147 Differential Revision: D18926378 fbshipit-source-id: 1dff6d068c603d012b81446812368bfee95a5e15	2019-12-12 19:04:48 -08:00
奏之章	3717a88289	Fix UniversalCompaction trivial move bug (#6067 ) Summary: `curr.level` is `c->inputs_` index, not real level. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6067 Differential Revision: D18935726 fbshipit-source-id: 4354e6e9cd900ca56c96e9d770f0ab6634e45daf	2019-12-11 11:27:53 -08:00
sdong	aa1857e2df	Support options.max_open_files = -1 with periodic_compaction_seconds (#6090 ) Summary: options.periodic_compaction_seconds isn't supported when options.max_open_files != -1. It's because that the information of file creation time is stored in table properties and are not guaranteed to be loaded unless options.max_open_files = -1. Relax this constraint by storing the information in manifest. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6090 Test Plan: Pass all existing tests; Modify an existing test to force the manifest value to take 0 to simulate backward compatibility case; manually open the DB generated with the change by release 4.2. Differential Revision: D18702268 fbshipit-source-id: 13e0bd94f546498a04f3dc5fc0d9dff5125ec9eb	2019-11-26 21:39:56 -08:00
sdong	77eab5c85a	Make default value of options.ttl to be 30 days when it is supported. (#6073 ) Summary: By default options.ttl is disabled. We believe a better default will be 30 days, which means deleted data the database will be removed from SST files slightly after 30 days, for most of the cases. Make the default UINT64_MAX - 1 to indicate that it is not overridden by users. Change periodic_compaction_seconds to be UINT64_MAX - 1 to UINT64_MAX too to be consistent. Also fix a small bug in the previous periodic_compaction_seconds default code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6073 Test Plan: Add unit tests for it. Differential Revision: D18669626 fbshipit-source-id: 957cd4374cafc1557d45a0ba002010552a378cc8	2019-11-26 10:00:32 -08:00
sdong	d8c28e692a	Support options.ttl with options.max_open_files = -1 (#6060 ) Summary: Previously, options.ttl cannot be set with options.max_open_files = -1, because it makes use of creation_time field in table properties, which is not available unless max_open_files = -1. With this commit, the information will be stored in manifest and when it is available, will be used instead. Note that, this change will break forward compatibility for release 5.1 and older. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6060 Test Plan: Extend existing test case to options.max_open_files != -1, and simulate backward compatility in one test case by forcing the value to be 0. Differential Revision: D18631623 fbshipit-source-id: 30c232a8672de5432ce9608bb2488ecc19138830	2019-11-22 21:23:00 -08:00
Little-Wallace	e50b64bdba	fix unstable unittest caused by #5958 (#6061 ) Summary: Signed-off-by: Little-Wallace <bupt2013211450@gmail.com> This PR is to fix unstable unit test added by (https://github.com/facebook/rocksdb/pull/5958). I set SYNC_POINT in PickCompaction before. If IntraL0Compaction was trigger, the compact job which compact sst to base level would start instantly. If the compaction thread run faster than unittest main thread, we may observe the number of files in L0 reduce. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6061 Differential Revision: D18642301 fbshipit-source-id: 3e4da2ee963532b6e142336951ea3f47d46df148	2019-11-21 15:24:01 -08:00
Peter Dillinger	0306e01233	Fixes for g++ 4.9.2 compatibility (#6053 ) Summary: Taken from merryChris in https://github.com/facebook/rocksdb/issues/6043 Stackoverflow ref on {{}} vs. {}: https://stackoverflow.com/questions/26947704/implicit-conversion-failure-from-initializer-list Note to reader: .clear() does not empty out an ostringstream, but .str("") suffices because we don't have to worry about clearing error flags. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6053 Test Plan: make check, manual run of filter_bench Differential Revision: D18602259 Pulled By: pdillinger fbshipit-source-id: f6190f83b8eab4e80e7c107348839edabe727841	2019-11-19 15:43:37 -08:00
Little-Wallace	ec3e3c3e02	Fix corruption with intra-L0 on ingested files (#5958 ) Summary: ## Problem Description Our process was abort when it call `CheckConsistency`. And the information in `stderr` show that "`L0 files seqno 3001491972 3004797440 vs. 3002875611 3004524421` ". Here are the causes of the accident I investigated. * RocksDB will call `CheckConsistency` whenever `MANIFEST` file is update. It will check sequence number interval of every file, except files which were ingested. * When one file is ingested into RocksDB, it will be assigned the value of global sequence number, and the minimum and maximum seqno of this file are equal, which are both equal to global sequence number. * `CheckConsistency` determines whether the file is ingested by whether the smallest and largest seqno of an sstable file are equal. * If IntraL0Compaction picks one sst which was ingested just now and compacted it into another sst, the `smallest_seqno` of this new file will be smaller than his `largest_seqno`. * If more than one ingested file was ingested before memtable schedule flush, and they all compact into one new sstable file by `IntraL0Compaction`. The sequence interval of this new file will be included in the interval of the memtable. So `CheckConsistency` will return a `Corruption`. * If a sstable was ingested after the memtable was schedule to flush, which would assign a larger seqno to it than memtable. Then the file was compacted with other files (these files were all flushed before the memtable) in L0 into one file. This compaction start before the flush job of memtable start, but completed after the flush job finish. So this new file produced by the compaction (we call it s1) would have a larger interval of sequence number than the file produced by flush (we call it s2). But there was still some data in s1 written into RocksDB before the s2, so it's possible that some data in s2 was cover by old data in s1. Of course, it would also make a `Corruption` because of overlap of seqno. There is the relationship of the files: > s1.smallest_seqno < s2.smallest_seqno < s2.largest_seqno < s1.largest_seqno So I skip pick sst file which was ingested in function `FindIntraL0Compaction ` ## Reason Here is my bug report: https://github.com/facebook/rocksdb/issues/5913 There are two situations that can cause the check to fail. ### First situation： - First we ingest five external sst into Rocksdb, and they happened to be ingested in L0. and there had been some data in memtable, which make the smallest sequence number of memtable is less than which of sst that we ingest. - If there had been one compaction job which compacted sst from L0 to L1, `LevelCompactionPicker` would trigger a `IntraL0Compaction` which would compact this five sst from L0 to L0. We call this sst A, which was merged from five ingested sst. - Then some data was put into memtable, and memtable was flushed to L0. We called this sst B. - RocksDB check consistency , and find the `smallest_seqno` of B is less than that of A and crash. Because A was merged from five sst, the smallest sequence number of it was less than the biggest sequece number of itself, so RocksDB could not tell if A was produce by ingested. ### Secondary situaion - First we have flushed many sst in L0, we call them [s1, s2, s3]. - There is an immutable memtable request to be flushed, but because flush thread is busy, so it has not been picked. we call it m1. And at the moment, one sst is ingested into L0. We call it s4. Because s4 is ingested after m1 became immutable memtable, so it has a larger log sequence number than m1. - m1 is flushed in L0. because it is small, this flush job finish quickly. we call it s5. - [s1, s2, s3, s4] are compacted into one sst to L0, by IntraL0Compaction. We call it s6. - compacted 4@0 files to L0 - When s6 is added into manifest, the corruption happened. because the largest sequence number of s6 is equal to s4, and they are both larger than that of s5. But because s1 is older than m1, so the smallest sequence number of s6 is smaller than that of s5. - s6.smallest_seqno < s5.smallest_seqno < s5.largest_seqno < s6.largest_seqno Pull Request resolved: https://github.com/facebook/rocksdb/pull/5958 Differential Revision: D18601316 fbshipit-source-id: 5fe54b3c9af52a2e1400728f565e895cde1c7267	2019-11-19 15:09:11 -08:00
Sagar Vemuri	c17384fea4	Cascade TTL Compactions to move expired key ranges to bottom levels faster (#5992 ) Summary: When users use Level-Compaction-with-TTL by setting `cf_options.ttl`, the ttl-expired data could take n*ttl time to reach the bottom level (where n is the number of levels) due to how the `creation_time` table property was calculated for the newly created files during compaction. The creation time of new files was set to a max of all compaction-input-files-creation-times which essentially resulted in resetting the ttl as the key range moves across levels. This behavior is now fixed by changing the `creation_time` to be based on minimum of all compaction-input-files-creation-times; this will cause cascading compactions across levels for the ttl-expired data to move to the bottom level, resulting in getting rid of tombstones/deleted-data faster. This will help start cascading compactions to move the expired key range to the bottom-most level faster. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5992 Test Plan: `make check` Differential Revision: D18257883 Pulled By: sagar0 fbshipit-source-id: 00df0bb8d0b7e14d9fc239df2cba8559f3e54cbc	2019-11-11 14:09:01 -08:00
sdong	982a7532a7	Add two test cases for single sorted universal periodic compaction (#6002 ) Summary: It's useful to add test coverage for universal compaction's periodic compaction. Add two tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6002 Test Plan: Run the two tests Differential Revision: D18363544 fbshipit-source-id: bbd04b54057315f64f959709006412db1f76d170	2019-11-07 11:14:14 -08:00
sdong	0d91a981e9	Fix assertion in universal compaction periodic compaction (#6000 ) Summary: We recently added periodic compaction to universal compaction. An old assertion that we can't onlyl compact the last sorted run triggered. However, with periodic compaction, it is possible that we only compact the last sorted run, so the assertion now became stricter than needed. Relaxing this assertion. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6000 Test Plan: This should be a low risk change. Will observe whether stress test will pass after it. Differential Revision: D18285396 fbshipit-source-id: 9a6863debdf104c40a7f6c46ab62d84cdf5d8592	2019-11-01 18:33:12 -07:00
sdong	aa6f7d0995	Support periodic compaction in universal compaction (#5970 ) Summary: Previously, periodic compaction is not supported in universal compaction. Add the support using following approach: if any file is marked as qualified for periodid compaction, trigger a full compaction. If a full compaction is prevented by files being compacted, try to compact the higher levels than files currently being compacted. If in this way we can only compact the last sorted run and none of the file to be compacted qualifies for periodic compaction, skip the compact. This is to prevent the same single level compaction from being executed again and again. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5970 Test Plan: Add several test cases. Differential Revision: D18147097 fbshipit-source-id: 8ecc308154d9aca96fb192c51fbceba3947550c1	2019-10-31 11:31:37 -07:00
Maysam Yabandeh	dccaf9f03c	Turn compaction asserts to runtime check (#5935 ) Summary: Compaction iterator has many assert statements that are active only during test runs. Some rare bugs would show up only at runtime could violate the assert condition but go unnoticed since assert statements are not compiled in release mode. Turning the assert statements to runtime check sone pors and cons: Pros: - A bug that would result into incorrect data would be detected early before the incorrect data is written to the disk. Cons: - Runtime overhead: which should be negligible since compaction cpu is the minority in the overall cpu usage - The assert statements might already being violated at runtime, and turning them to runtime failure might result into reliability issues. The patch takes a conservative step in this direction by logging the assert violations at runtime. If we see any violation reported in logs, we investigate. Otherwise, we can go ahead turning them to runtime error. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5935 Differential Revision: D18229697 Pulled By: maysamyabandeh fbshipit-source-id: f1890eca80ccd7cca29737f1825badb9aa8038a8	2019-10-30 13:48:38 -07:00
Yanqin Jin	5ef27dea33	Fix clang analyzer error (#5924 ) Summary: Without this PR, clang analyzer complains. ``` $USE_CLANG=1 make analyze db/compaction/compaction_job_test.cc:161:20: warning: The left operand of '==' is a garbage value if (key.type == kTypeBlobIndex) { ~~~~~~~~ ^ 1 warning generated. ``` Test Plan (on devserver) ``` $USE_CLANG=1 make analyze ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5924 Differential Revision: D17923226 Pulled By: riversand963 fbshipit-source-id: 9d1eb769b5e0de7cb3d89dc90d1cfa895db7fdc8	2019-10-14 22:14:24 -07:00
Levi Tamasi	5f025ea832	BlobDB GC: add SST <-> oldest blob file referenced mapping (#5903 ) Summary: This is groundwork for adding garbage collection support to BlobDB. The patch adds logic that keeps track of the oldest blob file referred to by each SST file. The oldest blob file is identified during flush/ compaction (similarly to how the range of keys covered by the SST is identified), and persisted in the manifest as a custom field of the new file edit record. Blob indexes with TTL are ignored for the purposes of identifying the oldest blob file (since such blob files are cleaned up by the TTL logic in BlobDB). Pull Request resolved: https://github.com/facebook/rocksdb/pull/5903 Test Plan: Added new unit tests; also ran db_bench in BlobDB mode, inspected the manifest using ldb, and confirmed (by scanning the SST files using sst_dump) that the value of the oldest blob file number field matches the contents of the file for each SST. Differential Revision: D17859997 Pulled By: ltamasi fbshipit-source-id: 21662c137c6259a6af70446faaf3a9912c550e90	2019-10-14 15:21:01 -07:00
sdong	e8263dbdaa	Apply formatter to recent 200+ commits. (#5830 ) Summary: Further apply formatter to more recent commits. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5830 Test Plan: Run all existing tests. Differential Revision: D17488031 fbshipit-source-id: 137458fd94d56dd271b8b40c522b03036943a2ab	2019-09-20 12:04:26 -07:00
sdong	c06b54d0c6	Apply formatter on recent 45 commits. (#5827 ) Summary: Some recent commits might not have passed through the formatter. I formatted recent 45 commits. The script hangs for more commits so I stopped there. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5827 Test Plan: Run all existing tests. Differential Revision: D17483727 fbshipit-source-id: af23113ee63015d8a43d89a3bc2c1056189afe8f	2019-09-19 12:34:17 -07:00
Maysam Yabandeh	6ec6a4a9a4	Remove snap_refresh_nanos option (#5826 ) Summary: The snap_refresh_nanos option didn't bring much benefit. Remove the feature to simplify the code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5826 Differential Revision: D17467147 Pulled By: maysamyabandeh fbshipit-source-id: 4f950b046990d0d1292d7fc04c2ccafaf751c7f0	2019-09-18 20:26:04 -07:00
andrew	622683000c	Allow users to stop manual compactions (#3971 ) Summary: Manual compaction may bring in very high load because sometime the amount of data involved in a compaction could be large, which may affect online service. So it would be good if the running compaction making the server busy can be stopped immediately. In this implementation, stopping manual compaction condition is only checked in slow process. We let deletion compaction and trivial move go through. Pull Request resolved: https://github.com/facebook/rocksdb/pull/3971 Test Plan: add tests at more spots. Differential Revision: D17369043 fbshipit-source-id: 575a624fb992ce0bb07d9443eb209e547740043c	2019-09-16 21:01:47 -07:00
sdong	9bd5fce6e8	Refactor UniversalCompactionPicker code a little bit (#5639 ) Summary: Several functions of UniversalCompactionPicker share most of the parameters. Move these functions to a class with those shared arguments as class members. Hopefully this will make code slightly easier to maintain. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5639 Test Plan: Run all existing test. Differential Revision: D16996403 fbshipit-source-id: fffafd1897ab132b420b1dec073542cffb5c44de	2019-09-16 10:51:11 -07:00
sdong	b931f84e56	Divide file_reader_writer.h and .cc (#5803 ) Summary: file_reader_writer.h and .cc contain several files and helper function, and it's hard to navigate. Separate it to multiple files and put them under file/ Pull Request resolved: https://github.com/facebook/rocksdb/pull/5803 Test Plan: Build whole project using make and cmake. Differential Revision: D17374550 fbshipit-source-id: 10efca907721e7a78ed25bbf74dc5410dea05987	2019-09-16 10:33:51 -07:00
Wilfried Goesgens	fbab9913e2	upgrade gtest 1.7.0 => 1.8.1 for json result writing Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5332 Differential Revision: D17242232 fbshipit-source-id: c0d4646556a1335e51ac7382b986ca7f6ced7b64	2019-09-09 11:24:11 -07:00
Vijay Nadimpalli	979fbdc696	Persistent globally unique DB ID in manifest (#5725 ) Summary: Each DB has a globally unique ID. A DB can be physically copied around, or backed-up and restored, and the users should be identify the same DB. This unique ID right now is stored as plain text in file IDENTITY under the DB directory. This approach introduces at least two problems: (1) the file is not checksumed; (2) the source of truth of a DB is the manifest file, which can be copied separately from IDENTITY file, causing the DB ID to be wrong. The goal of this PR is solve this problem by moving the DB ID to manifest. To begin with we will write to both identity file and manifest. Write to Manifest is controlled via the flag write_dbid_to_manifest in Options and default is false. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5725 Test Plan: Added unit tests. Differential Revision: D16963840 Pulled By: vjnadimpalli fbshipit-source-id: 8a86a4c8c82c716003c40fd6b9d2d758030d92e9	2019-09-03 08:52:24 -07:00
Yanqin Jin	672befea2a	Fix assertion failure in FIFO compaction with TTL (#5754 ) Summary: Before this PR, the following sequence of events can cause assertion failure as shown below. Stack trace (partial): ``` (gdb) bt 2 0x00007f59b350ad15 in __assert_fail_base (fmt=<optimized out>, assertion=assertion@entry=0x9f8390 "mark_as_compacted ? !inputs_[i][j]->being_compacted : inputs_[i][j]->being_compacted", file=file@entry=0x9e347c "db/compaction/compaction.cc", line=line@entry=395, function=function@entry=0xa21ec0 <rocksdb::Compaction::MarkFilesBeingCompacted(bool)::__PRETTY_FUNCTION__> "void rocksdb::Compaction::MarkFilesBeingCompacted(bool)") at assert.c:92 3 0x00007f59b350adc3 in __GI___assert_fail (assertion=assertion@entry=0x9f8390 "mark_as_compacted ? !inputs_[i][j]->being_compacted : inputs_[i][j]->being_compacted", file=file@entry=0x9e347c "db/compaction/compaction.cc", line=line@entry=395, function=function@entry=0xa21ec0 <rocksdb::Compaction::MarkFilesBeingCompacted(bool)::__PRETTY_FUNCTION__> "void rocksdb::Compaction::MarkFilesBeingCompacted(bool)") at assert.c:101 4 0x0000000000492ccd in rocksdb::Compaction::MarkFilesBeingCompacted (this=<optimized out>, mark_as_compacted=<optimized out>) at db/compaction/compaction.cc:394 5 0x000000000049467a in rocksdb::Compaction::Compaction (this=0x7f59af013000, vstorage=0x7f581af53030, _immutable_cf_options=..., _mutable_cf_options=..., _inputs=..., _output_level=<optimized out>, _target_file_size=0, _max_compaction_bytes=0, _output_path_id=0, _compression=<incomplete type>, _compression_opts=..., _max_subcompactions=0, _grandparents=..., _manual_compaction=false, _score=4, _deletion_compaction=true, _compaction_reason=rocksdb::CompactionReason::kFIFOTtl) at db/compaction/compaction.cc:241 6 0x00000000004af9bc in rocksdb::FIFOCompactionPicker::PickTTLCompaction (this=0x7f59b31a6900, cf_name=..., mutable_cf_options=..., vstorage=0x7f581af53030, log_buffer=log_buffer@entry=0x7f59b1bfa930) at db/compaction/compaction_picker_fifo.cc:101 7 0x00000000004b0771 in rocksdb::FIFOCompactionPicker::PickCompaction (this=0x7f59b31a6900, cf_name=..., mutable_cf_options=..., vstorage=0x7f581af53030, log_buffer=0x7f59b1bfa930) at db/compaction/compaction_picker_fifo.cc:201 8 0x00000000004838cc in rocksdb::ColumnFamilyData::PickCompaction (this=this@entry=0x7f59b31b3700, mutable_options=..., log_buffer=log_buffer@entry=0x7f59b1bfa930) at db/column_family.cc:933 9 0x00000000004f3645 in rocksdb::DBImpl::BackgroundCompaction (this=this@entry=0x7f59b3176000, made_progress=made_progress@entry=0x7f59b1bfa6bf, job_context=job_context@entry=0x7f59b1bfa760, log_buffer=log_buffer@entry=0x7f59b1bfa930, prepicked_compaction=prepicked_compaction@entry=0x0, thread_pri=rocksdb::Env::LOW) at db/db_impl/db_impl_compaction_flush.cc:2541 10 0x00000000004f5e2a in rocksdb::DBImpl::BackgroundCallCompaction (this=this@entry=0x7f59b3176000, prepicked_compaction=prepicked_compaction@entry=0x0, bg_thread_pri=bg_thread_pri@entry=rocksdb::Env::LOW) at db/db_impl/db_impl_compaction_flush.cc:2312 11 0x00000000004f648e in rocksdb::DBImpl::BGWorkCompaction (arg=<optimized out>) at db/db_impl/db_impl_compaction_flush.cc:2087 ``` This can be caused by the following sequence of events. ``` Time \| thr bg_compact_thr1 bg_compact_thr2 \| write \| flush \| mark all l0 as being compacted \| write \| flush \| add cf to queue again \| mark all l0 as being \| compacted, fail the \| assertion V ``` Test plan (on devserver) Since bg_compact_thr1 and bg_compact_thr2 are two threads executing the same code, it is difficult to use sync point dependency to coordinate their execution. Therefore, I choose to use db_stress. ``` $TEST_TMPDIR=/dev/shm/rocksdb ./db_stress --periodic_compaction_seconds=1 --max_background_compactions=20 --format_version=2 --memtablerep=skip_list --max_write_buffer_number=3 --cache_index_and_filter_blocks=1 --reopen=20 --recycle_log_file_num=0 --acquire_snapshot_one_in=10000 --delpercent=4 --log2_keys_per_lock=22 --compaction_ttl=1 --block_size=16384 --use_multiget=1 --compact_files_one_in=1000000 --target_file_size_multiplier=2 --clear_column_family_one_in=0 --max_bytes_for_level_base=10485760 --use_full_merge_v1=1 --target_file_size_base=2097152 --checkpoint_one_in=1000000 --mmap_read=0 --compression_type=zstd --writepercent=35 --readpercent=45 --subcompactions=4 --use_merge=0 --write_buffer_size=4194304 --test_batches_snapshots=0 --db=/dev/shm/rocksdb/rocksdb_crashtest_whitebox --use_direct_reads=0 --compact_range_one_in=1000000 --open_files=-1 --destroy_db_initially=0 --progress_reports=0 --compression_zstd_max_train_bytes=0 --snapshot_hold_ops=100000 --enable_pipelined_write=0 --nooverwritepercent=1 --compression_max_dict_bytes=0 --max_key=1000000 --prefixpercent=5 --flush_one_in=1000000 --ops_per_thread=40000 --index_block_restart_interval=7 --cache_size=1048576 --compaction_style=2 --verify_checksum=1 --delrangepercent=1 --use_direct_io_for_flush_and_compaction=0 ``` This should see no assertion failure. Last but not least, ``` $COMPILE_WITH_ASAN=1 make -j32 all $make check ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5754 Differential Revision: D17109791 Pulled By: riversand963 fbshipit-source-id: 25fc46101235add158554e096540b72c324be078	2019-08-30 12:42:01 -07:00
Maysam Yabandeh	7bc18e2727	Disable snapshot refresh feature when snap_refresh_nanos is 0 (#5724 ) Summary: The comments of snap_refresh_nanos advertise that the snapshot refresh feature will be disabled when the option is set to 0. This contract is however not honored in the code: https://github.com/facebook/rocksdb/pull/5278 The patch fixes that and also adds an assert to ensure that the feature is not used when the option is zero. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5724 Differential Revision: D16918185 Pulled By: maysamyabandeh fbshipit-source-id: fec167287df7d85093e087fc39c0eb243e3bbd7e	2019-08-20 11:40:07 -07:00
Eli Pozniansky	4834dab578	Improve CPU Efficiency of ApproximateSize (part 2) (#5609 ) Summary: In some cases, we don't have to get really accurate number. Something like 10% off is fine, we can create a new option for that use case. In this case, we can calculate size for full files first, and avoid estimation inside SST files if full files got us a huge number. For example, if we already covered 100GB of data, we should be able to skip partial dives into 10 SST files of 30MB. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5609 Differential Revision: D16433481 Pulled By: elipoz fbshipit-source-id: 5830b31e1c656d0fd3a00d7fd2678ddc8f6e601b	2019-07-31 08:50:00 -07:00
haoyuhuang	705b8eecb4	Add more callers for table reader. (#5454 ) Summary: This PR adds more callers for table readers. These information are only used for block cache analysis so that we can know which caller accesses a block. 1. It renames the BlockCacheLookupCaller to TableReaderCaller as passing the caller from upstream requires changes to table_reader.h and TableReaderCaller is a more appropriate name. 2. It adds more table reader callers in table/table_reader_caller.h, e.g., kCompactionRefill, kExternalSSTIngestion, and kBuildTable. This PR is long as it requires modification of interfaces in table_reader.h, e.g., NewIterator. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5454 Test Plan: make clean && COMPILE_WITH_ASAN=1 make check -j32. Differential Revision: D15819451 Pulled By: HaoyuHuang fbshipit-source-id: b6caa704c8fb96ddd15b9a934b7e7ea87f88092d	2019-06-20 14:31:48 -07:00
haoyuhuang	bb4178066d	Integrate block cache tracer into db_impl (#5433 ) Summary: This PR integrates the block cache tracer class into db_impl.cc. db_impl.cc contains a member variable of AtomicBlockCacheTraceWriter class and passes its reference to the block_based_table_reader. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5433 Differential Revision: D15728016 Pulled By: HaoyuHuang fbshipit-source-id: 23d5659e8c82d556833dcc1a5558aac8c1f7db71	2019-06-13 15:43:10 -07:00
haoyuhuang	5efa0d6b0d	Create a BlockCacheLookupContext to enable fine-grained block cache tracing. (#5421 ) Summary: BlockCacheLookupContext only contains the caller for now. We will trace block accesses at five places: 1. BlockBasedTable::GetFilter. 2. BlockBasedTable::GetUncompressedDict. 3. BlockBasedTable::MaybeReadAndLoadToCache. (To trace access on data, index, and range deletion block.) 4. BlockBasedTable::Get. (To trace the referenced key and whether the referenced key exists in a fetched data block.) 5. BlockBasedTable::MultiGet. (To trace the referenced key and whether the referenced key exists in a fetched data block.) We create the context at: 1. BlockBasedTable::Get. (kUserGet) 2. BlockBasedTable::MultiGet. (kUserMGet) 3. BlockBasedTable::NewIterator. (either kUserIterator, kCompaction, or external SST ingestion calls this function.) 4. BlockBasedTable::Open. (kPrefetch) 5. Index/Filter::CacheDependencies. (kPrefetch) 6. BlockBasedTable::ApproximateOffsetOf. (kCompaction or kUserApproximateSize). I loaded 1 million key-value pairs into the database and ran the readrandom benchmark with a single thread. I gave the block cache 10 GB to make sure all reads hit the block cache after warmup. The throughput is comparable. Throughput of this PR: 231334 ops/s. Throughput of the master branch: 238428 ops/s. Experiment setup: RocksDB: version 6.2 Date: Mon Jun 10 10:42:51 2019 CPU: 24 * Intel Core Processor (Skylake) CPUCache: 16384 KB Keys: 20 bytes each Values: 100 bytes each (100 bytes after compression) Entries: 1000000 Prefix: 20 bytes Keys per prefix: 0 RawSize: 114.4 MB (estimated) FileSize: 114.4 MB (estimated) Write rate: 0 bytes/second Read rate: 0 ops/second Compression: NoCompression Compression sampling rate: 0 Memtablerep: skip_list Perf Level: 1 Load command: ./db_bench --benchmarks="fillseq" --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --statistics --cache_index_and_filter_blocks --cache_size=10737418240 --disable_auto_compactions=1 --disable_wal=1 --compression_type=none --min_level_to_compress=-1 --compression_ratio=1 --num=1000000 Run command: ./db_bench --benchmarks="readrandom,stats" --use_existing_db --threads=1 --duration=120 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --statistics --cache_index_and_filter_blocks --cache_size=10737418240 --disable_auto_compactions=1 --disable_wal=1 --compression_type=none --min_level_to_compress=-1 --compression_ratio=1 --num=1000000 --duration=120 TODOs: 1. Create a caller for external SST file ingestion and differentiate the callers for iterator. 2. Integrate tracer to trace block cache accesses. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5421 Differential Revision: D15704258 Pulled By: HaoyuHuang fbshipit-source-id: 4aa8a55f8cb1576ffb367bfa3186a91d8f06d93a	2019-06-10 15:33:27 -07:00
Zhongyi Xie	d68f9f4580	simplify include directive involving inttypes (#5402 ) Summary: When using `PRIu64` type of printf specifier, current code base does the following: ``` #ifndef __STDC_FORMAT_MACROS #define __STDC_FORMAT_MACROS #endif #include <inttypes.h> ``` However, this can be simplified to ``` #include <cinttypes> ``` as long as flag `-std=c++11` is used. This should solve issues like https://github.com/facebook/rocksdb/issues/5159 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5402 Differential Revision: D15701195 Pulled By: miasantreble fbshipit-source-id: 6dac0a05f52aadb55e9728038599d3d2e4b59d03	2019-06-06 13:56:07 -07:00
Siying Dong	000b9ec217	Move some logging related files to logging/ (#5387 ) Summary: Many logging related source files are under util/. It will be more structured if they are together. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5387 Differential Revision: D15579036 Pulled By: siying fbshipit-source-id: 3850134ed50b8c0bb40a0c8ae1f184fa4081303f	2019-05-31 17:23:59 -07:00
Vijay Nadimpalli	cae22c53fb	Make format Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5395 Differential Revision: D15581698 Pulled By: vjnadimpalli fbshipit-source-id: f415972f16e784b1361714c202b97defcab46767	2019-05-31 15:24:43 -07:00
Vijay Nadimpalli	49c5a12dbe	Organizing rocksdb/db directory Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5390 Differential Revision: D15579388 Pulled By: vjnadimpalli fbshipit-source-id: 5bfc95e31554b8ff05b97b76d6534113f527f366	2019-05-31 11:57:01 -07:00

1 2 3 4

181 Commits