rocksdb

Author	SHA1	Message	Date
leipeng	c712b68f5b	Fix num files in single compaction for universal compaction (#9168 ) Summary: https://github.com/facebook/rocksdb/issues/9026 fixed histogram NUM_FILES_IN_SINGLE_COMPACTION for level compaction, but missed the fix for universal compaction. This PR fixes NUM_FILES_IN_SINGLE_COMPACTION for universal compaction. Quote from https://github.com/facebook/rocksdb/issues/9026: > currently histogram `NUM_FILES_IN_SINGLE_COMPACTION` just counted files in first level of compaction input, this fix counts files in all levels of compaction input. Thanks to ajkr for pointing out this missed fix! Pull Request resolved: https://github.com/facebook/rocksdb/pull/9168 Reviewed By: akankshamahajan15 Differential Revision: D32434494 Pulled By: ajkr fbshipit-source-id: 93ea092af4afbd8dce67898ffb350cf26b065ed2	2021-11-30 15:11:21 -08:00
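The difference between the old and fixed histogram values can be illustrated with a small standalone sketch. `InputLevel` and both helper functions are hypothetical stand-ins invented for this illustration; they are not RocksDB types or APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one level of compaction input (not a RocksDB type).
struct InputLevel {
  int level;
  std::vector<int> file_numbers;
};

// Behavior described as the bug: only files in the first input level are counted.
inline std::size_t NumFilesFirstLevelOnly(const std::vector<InputLevel>& inputs) {
  return inputs.empty() ? 0 : inputs.front().file_numbers.size();
}

// Behavior described as the fix: files in all input levels are counted.
inline std::size_t NumFilesAllLevels(const std::vector<InputLevel>& inputs) {
  std::size_t total = 0;
  for (const auto& in : inputs) {
    total += in.file_numbers.size();
  }
  return total;
}
```

For a compaction with three files at one level and two at the next, the first function reports 3 while the second reports 5.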
Peter Dillinger	e8b5d05e93	HISTORY for #9208 (#9227 ) Summary: Update HISTORY for bug fix. This is going into 6.27 initial release. (Technically 6.27.1) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9227 Test Plan: n/a Reviewed By: ajkr Differential Revision: D32727912 Pulled By: pdillinger fbshipit-source-id: 75e7a81749a188a590d44ef47e261eaaa8667152	2021-11-30 15:01:59 -08:00
Yanqin Jin	8101643611	Update HISTORY and version.h for 6.27 release (#9192 ) Summary: As title. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9192 Reviewed By: ltamasi Differential Revision: D32578141 Pulled By: riversand963 fbshipit-source-id: 16216451c87e383ca8fd309acf15106e46172aaa	2021-11-19 22:11:56 -08:00
Levi Tamasi	3a9f557451	Update HISTORY for PR 9187 (#9191 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9191 Reviewed By: riversand963 Differential Revision: D32577939 Pulled By: ltamasi fbshipit-source-id: 3c52067a0c3e9219c1aafdb711718dfcce5dedf5	2021-11-19 20:07:11 -08:00
Yanqin Jin	43ac7a2774	Fix an assertion failure when ManifestTailer switches to new Manifest in multi-cf mode (#9143 ) Summary: The original unit test failed to test the case of multi-cf mode switching to a new manifest. The assertion failure will trigger when the primary instance reopens and the secondary continues to tail the newly-created MANIFEST. Fix the assertion failure and update existing unit tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9143 Test Plan: make check Reviewed By: ltamasi Differential Revision: D32574233 Pulled By: riversand963 fbshipit-source-id: 857ddbe994019091276458abebcf8e2b65340468	2021-11-19 19:53:40 -08:00
Jay Zhuang	6cde8d2190	Deprecating `iter_start_seqnum` and `preserve_deletes` (#9091 ) Summary: `ReadOptions::iter_start_seqnum` and `DBOptions::preserve_deletes` are deprecated, please try using user defined timestamp feature instead. The feature is used to support differential snapshots, but not well maintained (https://github.com/facebook/rocksdb/issues/6837, https://github.com/facebook/rocksdb/issues/8472) and the interface is not user friendly which returns an internal key from the iterator. The user defined timestamp feature is a more flexible feature to support similar usecase, please switch to that if you have such usecase. The deprecated feature will be removed in a future release. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9091 Test Plan: check LOG Fix https://github.com/facebook/rocksdb/issues/9090 Reviewed By: ajkr Differential Revision: D32071750 Pulled By: jay-zhuang fbshipit-source-id: b882c4668dd1bf26ce03c4c192f1bba584bf6104	2021-11-19 16:55:45 -08:00
Yanqin Jin	1e8322c0f5	Fix a bug in FlushJob picking more memtables beyond synced WALs (#9142 ) Summary: After RocksDB 6.19 and before this PR, RocksDB FlushJob may pick more memtables to flush beyond synced WALs. This can be problematic if there are multiple column families, since it can prematurely advance the flushed column family's log_number. Should subsequent attempts fail to sync the latest WALs and the database goes through a recovery, it may detect corrupted WAL number below the flushed column family's log number and complain about column family inconsistency. To fix, we record the maximum memtable ID of the column family being flushed. Then we call SyncClosedLogs() so that all closed WALs at the time when memtable ID is recorded will be synced. I also disabled a unit test temporarily due to reasons described in https://github.com/facebook/rocksdb/issues/9151 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9142 Test Plan: make check Reviewed By: ajkr Differential Revision: D32299956 Pulled By: riversand963 fbshipit-source-id: 0da75888177d91905cf8c9d00605b73afb5970a7	2021-11-19 09:56:00 -08:00
Andrew Kryczka	8cf4294e25	Adhere to per-DB concurrency limit when bottom-pri compactions exist (#9179 ) Summary: - Fixed bug where bottom-pri manual compactions were counting towards `bg_compaction_scheduled_` instead of `bg_bottom_compaction_scheduled_`. It seems to have no negative effect. - Fixed bug where automatic compaction scheduling did not consider `bg_bottom_compaction_scheduled_`. Now automatic compactions cannot be scheduled that exceed the per-DB compaction concurrency limit (`max_compactions`) when some existing compactions are bottommost. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9179 Test Plan: new unit test for manual/automatic. Also verified the existing automatic/automatic test ("ConcurrentBottomPriLowPriCompactions") hanged until changing it to explicitly enable concurrency. Reviewed By: riversand963 Differential Revision: D32488048 Pulled By: ajkr fbshipit-source-id: 20c4c0693678e81e43f85ed3cc3402fcf26e3310	2021-11-18 17:31:50 -08:00
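A minimal sketch of the scheduling check this commit describes, assuming a simplified counter struct. The field names mirror `bg_compaction_scheduled_`, `bg_bottom_compaction_scheduled_` and `max_compactions` from the message, but the struct and both functions are hypothetical and not the actual RocksDB scheduler code.

```cpp
#include <cassert>

// Hypothetical, simplified view of the per-DB counters named in the commit.
struct CompactionCounters {
  int bg_compaction_scheduled = 0;         // low-pri compactions scheduled
  int bg_bottom_compaction_scheduled = 0;  // bottom-pri compactions scheduled
  int max_compactions = 1;                 // per-DB concurrency limit
};

// Check that ignores bottom-pri compactions (the behavior described as buggy).
inline bool CanScheduleIgnoringBottomPri(const CompactionCounters& c) {
  return c.bg_compaction_scheduled < c.max_compactions;
}

// Check that counts both pools against the per-DB limit (the described fix).
inline bool CanScheduleAutomaticCompaction(const CompactionCounters& c) {
  return c.bg_compaction_scheduled + c.bg_bottom_compaction_scheduled <
         c.max_compactions;
}
```

With a limit of 2, one low-pri and one bottom-pri compaction already scheduled, the first check would still admit another compaction while the second would not.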
Akanksha Mahajan	4a7c1dc375	Add listener API that notifies on IOError (#9177 ) Summary: Add a new API in listener.h that notifies about IOErrors on Read/Write/Append/Flush etc. The API reports about IOStatus, filename, Operation name, offset and length. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9177 Test Plan: Added new unit tests Reviewed By: anand1976 Differential Revision: D32470627 Pulled By: akankshamahajan15 fbshipit-source-id: 189a717033590ae227b3beae8b1e7e185e4cdc12	2021-11-18 17:11:19 -08:00
Peter Dillinger	230660be73	Improve / clean up meta block code & integrity (#9163 ) Summary: * Checksums are now checked on meta blocks unless specifically suppressed or not applicable (e.g. plain table). (Was other way around.) This means a number of cases that were not checking checksums now are, including direct read TableProperties in Version::GetTableProperties (fixed in meta_blocks ReadTableProperties), reading any block from PersistentCache (fixed in BlockFetcher), read TableProperties in SstFileDumper (ldb/sst_dump/BackupEngine) before table reader open, maybe more. * For that to work, I moved the global_seqno+TableProperties checksum logic to the shared table/ code, because that is used by many utilities such as SstFileDumper. * Also for that to work, we have to know when we're dealing with a block that has a checksum (trailer), so added that capability to Footer based on magic number, and from there BlockFetcher. * Knowledge of trailer presence has also fixed a problem where other table formats were reading blocks including bytes for a non-existent trailer--and awkwardly kind-of not using them, e.g. no shared code checking checksums. (BlockFetcher compression type was populated incorrectly.) Now we only read what is needed. * Minimized code duplication and differing/incompatible/awkward abstractions in meta_blocks.{cc,h} (e.g. SeekTo in metaindex block without parsing block handle) * Moved some meta block handling code from table_properties. * Moved some code specific to block-based table from shared table/ code to BlockBasedTable class. The checksum stuff means we can't completely separate it, but things that don't need to be in shared table/ code should not be. * Use unique_ptr rather than raw ptr in more places. (Note: you can std::move from unique_ptr to shared_ptr.) Without enhancements to GetPropertiesOfAllTablesTest (see below), net reduction of roughly 100 lines of code. 
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9163 Test Plan: existing tests and * Enhanced DBTablePropertiesTest.GetPropertiesOfAllTablesTest to verify that checksums are now checked on direct read of table properties by TableCache (new test would fail before this change) * Also enhanced DBTablePropertiesTest.GetPropertiesOfAllTablesTest to test putting table properties under old meta name * Also generally enhanced that same test to actually test what it was supposed to be testing already, by kicking things out of table cache when we don't want them there. Reviewed By: ajkr, mrambacher Differential Revision: D32514757 Pulled By: pdillinger fbshipit-source-id: 507964b9311d186ae8d1131182290cbd97a99fa9	2021-11-18 11:43:44 -08:00
Hui Xiao	74544d582f	Account Bloom/Ribbon filter construction memory in global memory limit (#9073 ) Summary: Note: This PR is the 4th part of a bigger PR stack (https://github.com/facebook/rocksdb/pull/9073) and will rebase/merge only after the first three PRs (https://github.com/facebook/rocksdb/pull/9070, https://github.com/facebook/rocksdb/pull/9071, https://github.com/facebook/rocksdb/pull/9130) merge. Context: Similar to https://github.com/facebook/rocksdb/pull/8428, this PR is to track memory usage during (new) Bloom Filter (i.e,FastLocalBloom) and Ribbon Filter (i.e, Ribbon128) construction, moving toward the goal of [single global memory limit using block cache capacity](https://github.com/facebook/rocksdb/wiki/Projects-Being-Developed#improving-memory-efficiency). It also constrains the size of the banding portion of Ribbon Filter during construction by falling back to Bloom Filter if that banding is, at some point, larger than the available space in the cache under `LRUCacheOptions::strict_capacity_limit=true`. The option to turn on this feature is `BlockBasedTableOptions::reserve_table_builder_memory = true` which by default is set to `false`. We [decided](https://github.com/facebook/rocksdb/pull/9073#discussion_r741548409) not to have separate option for separate memory user in table building therefore their memory accounting are all bundled under one general option. Summary: - Reserved/released cache for creation/destruction of three main memory users with the passed-in `FilterBuildingContext::cache_res_mgr` during filter construction: - hash entries (i.e`hash_entries`.size(), we bucket-charge hash entries during insertion for performance), - banding (Ribbon Filter only, `bytes_coeff_rows` +`bytes_result_rows` + `bytes_backtrack`), - final filter (i.e, `mutable_buf`'s size). 
- Implementation details: in order to use `CacheReservationManager::CacheReservationHandle` to account final filter's memory, we have to store the `CacheReservationManager` object and `CacheReservationHandle` for final filter in `XXPH3BitsFilterBuilder` as well as explicitly delete the filter bits builder when done with the final filter in block based table. - Added option to run `filter_bench` with this memory reservation feature Pull Request resolved: https://github.com/facebook/rocksdb/pull/9073 Test Plan: - Added new tests in `db_bloom_filter_test` to verify filter construction peak cache reservation under combination of `BlockBasedTable::Rep::FilterType` (e.g, `kFullFilter`, `kPartitionedFilter`), `BloomFilterPolicy::Mode`(e.g, `kFastLocalBloom`, `kStandard128Ribbon`, `kDeprecatedBlock`) and `BlockBasedTableOptions::reserve_table_builder_memory` - To address the concern for slow test: tests with memory reservation under `kFullFilter` + `kStandard128Ribbon` and `kPartitionedFilter` take around 3000 - 6000 ms and others take around 1500 - 2000 ms, in total adding 20000 - 25000 ms to the test suite running locally - Added new test in `bloom_test` to verify Ribbon Filter fallback on large banding in FullFilter - Added test in `filter_bench` to verify that this feature does not significantly slow down Bloom/Ribbon Filter construction speed. 
Local result averaged over 20 run as below: - FastLocalBloom - baseline `./filter_bench -impl=2 -quick -runs 20 | grep 'Build avg'`: - Build avg ns/key: 29.56295 (DEBUG_LEVEL=1), 29.98153 (DEBUG_LEVEL=0) - new feature (expected to be similar as above)`./filter_bench -impl=2 -quick -runs 20 -reserve_table_builder_memory=true | grep 'Build avg'`: - Build avg ns/key: 30.99046 (DEBUG_LEVEL=1), 30.48867 (DEBUG_LEVEL=0) - new feature of RibbonFilter with fallback (expected to be similar as above) `./filter_bench -impl=2 -quick -runs 20 -reserve_table_builder_memory=true -strict_capacity_limit=true | grep 'Build avg'` : - Build avg ns/key: 31.146975 (DEBUG_LEVEL=1), 30.08165 (DEBUG_LEVEL=0) - Ribbon128 - baseline `./filter_bench -impl=3 -quick -runs 20 | grep 'Build avg'`: - Build avg ns/key: 129.17585 (DEBUG_LEVEL=1), 130.5225 (DEBUG_LEVEL=0) - new feature (expected to be similar as above) `./filter_bench -impl=3 -quick -runs 20 -reserve_table_builder_memory=true | grep 'Build avg' `: - Build avg ns/key: 131.61645 (DEBUG_LEVEL=1), 132.98075 (DEBUG_LEVEL=0) - new feature of RibbonFilter with fallback (expected to be a lot faster than above due to fallback) `./filter_bench -impl=3 -quick -runs 20 -reserve_table_builder_memory=true -strict_capacity_limit=true | grep 'Build avg'` : - Build avg ns/key: 52.032965 (DEBUG_LEVEL=1), 52.597825 (DEBUG_LEVEL=0) - And the warning message of `"Cache reservation for Ribbon filter banding failed due to cache full"` is indeed logged to console. Reviewed By: pdillinger Differential Revision: D31991348 Pulled By: hx235 fbshipit-source-id: 9336b2c60f44d530063da518ceaf56dac5f9df8e	2021-11-18 09:42:20 -08:00
Andrew Kryczka	2225f063d4	Remove incremental ID from background thread pool names (#9165 ) Summary: `pthread_setname_np()` fails on attempts to assign oversized names like "rocksdb:bottom10", which resulted in some thread name updates being lost. We do not need the ID suffix so I removed it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9165 Test Plan: ``` $ TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -max_background_flushes=123 -max_background_compactions=456 -num_bottom_pri_threads=789 -duration=60 ``` While above is running: ``` $ ps -o 'comm' -Lp `pidof db_bench` | grep '^rocksdb:' | sort | uniq -c 789 rocksdb:bottom 123 rocksdb:high 456 rocksdb:low ``` Reviewed By: pdillinger Differential Revision: D32415077 Pulled By: ajkr fbshipit-source-id: a0e013101e26a78bc5eca73509293ef4bf22254f	2021-11-16 18:26:12 -08:00
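Why "rocksdb:bottom10" is oversized can be checked with a short sketch. On Linux, `pthread_setname_np()` restricts thread names to 16 bytes including the terminating null byte. The helper below is a hypothetical illustration that only checks the length; it does not call `pthread_setname_np()` and is not RocksDB code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Linux limit for thread names, counting the terminating null byte.
constexpr std::size_t kMaxThreadNameWithNul = 16;

// Hypothetical helper: true if `name` would fit within the Linux limit.
inline bool FitsLinuxThreadName(const std::string& name) {
  return name.size() + 1 <= kMaxThreadNameWithNul;
}
```

"rocksdb:bottom" is 14 characters and fits, a single-digit suffix still fits at 15, but a two-digit suffix reaches 16 characters plus the null byte and is rejected.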
Zhichao Cao	b694cd0e0d	Add tiered storage related read bytes stats to Statistic (#9123 ) Summary: Add the 3 read bytes counter to the Statistic, which will be used by storage tiering and get the information for files with different temperature. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9123 Test Plan: added new testing cases. Reviewed By: siying Differential Revision: D32154745 Pulled By: zhichao-cao fbshipit-source-id: b7905d6dae469a72428742364ec07b634b6f15da	2021-11-16 15:17:17 -08:00
Peter Dillinger	f8c685c4fc	Check for and disallow shared key space in block caches (#9172 ) Summary: We have three layers of block cache that often use the same key but map to different physical data: * BlockBasedTableOptions::block_cache * BlockBasedTableOptions::block_cache_compressed * BlockBasedTableOptions::persistent_cache If any two of these happen to share an underlying implementation and key space (insertion into one shows up in another), then memory safety is broken. The simplest case is block_cache == block_cache_compressed. (Credit mrambacher for asking about this case in a review.) With this change, we explicitly check for overlap and preemptively and safely fail with a Status code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9172 Test Plan: test added. Crashes without new check Reviewed By: anand1976 Differential Revision: D32465659 Pulled By: pdillinger fbshipit-source-id: 3876b45b6dce6167e5a7a642725ddc86b96f8e40	2021-11-16 11:16:05 -08:00
Hui Xiao	cff7819dff	Fix BackupEngine's internal callers of GenericRateLimiter::Request() not honoring bytes <= GetSingleBurstBytes() (#9063 ) Summary: Context: Some existing internal calls of `GenericRateLimiter::Request()` in backupable_db.cc and newly added internal calls in https://github.com/facebook/rocksdb/pull/8722/ do not make sure `bytes <= GetSingleBurstBytes()` as required by rate_limiter https://github.com/facebook/rocksdb/blob/master/include/rocksdb/rate_limiter.h#L47. Impacts of this bug include: (1) In debug build, when `GenericRateLimiter::Request()` requests bytes greater than `GenericRateLimiter:: kMinRefillBytesPerPeriod = 100` byte, process will crash due to assertion failure. See https://github.com/facebook/rocksdb/pull/9063#discussion_r737034133 and for possible scenario (2) In production build, although there will not be the above crash due to disabled assertion, the bug can lead to a request of small bytes being blocked for a long time by a request of same priority with insanely large bytes from a different thread. See updated https://github.com/facebook/rocksdb/wiki/Rate-Limiter ("Notice that although....the maximum bytes that can be granted in a single request have to be bounded...") for more info. There is an on-going effort to move rate-limiting to file wrapper level so rate limiting in `BackupEngine` and this PR might be made obsolete in the future. 
Summary: - Implemented loop-calling `GenericRateLimiter::Request()` with `bytes <= GetSingleBurstBytes()` as a static private helper function `BackupEngineImpl::LoopRateLimitRequestHelper` -- Considering make this a util function in `RateLimiter` later or do something with `RateLimiter::RequestToken()` - Replaced buggy internal callers with this helper function wherever requested byte is not pre-limited by `GetSingleBurstBytes()` - Removed the minimum refill bytes per period enforced by `GenericRateLimiter` since it is useless and prevents testing `GenericRateLimiter` for extreme case with small refill bytes per period. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9063 Test Plan: - Added a new test that failed the assertion before this change and now passes - It exposed bugs in [the write during creation in `CopyOrCreateFile()`](`df7cc66e17/utilities/backupable/backupable_db.cc (L2034-L2043)`), [the read of table properties in `GetFileDbIdentities()`](`df7cc66e17/utilities/backupable/backupable_db.cc (L2372-L2378)`), [some read of metadata in `BackupMeta::LoadFromFile()`](`df7cc66e17/utilities/backupable/backupable_db.cc (L2726)`) - Passing Existing tests Reviewed By: ajkr Differential Revision: D31824535 Pulled By: hx235 fbshipit-source-id: d2b3dea7a64e2a4b1e6a59fca322f0800a4fcbcc	2021-11-16 09:52:16 -08:00
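The loop-calling approach can be sketched as follows. `ToyRateLimiter` and `LoopRateLimitRequest` are hypothetical names made up for this illustration; the toy class only records request sizes and does not throttle, and it is not the `GenericRateLimiter` or `BackupEngineImpl::LoopRateLimitRequestHelper` implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Toy limiter (assumption): records each request and enforces the
// precondition bytes <= GetSingleBurstBytes() with an assertion.
class ToyRateLimiter {
 public:
  explicit ToyRateLimiter(int64_t single_burst_bytes)
      : single_burst_bytes_(single_burst_bytes) {
    assert(single_burst_bytes_ > 0);
  }
  int64_t GetSingleBurstBytes() const { return single_burst_bytes_; }
  void Request(int64_t bytes) {
    assert(bytes <= single_burst_bytes_);
    requests_.push_back(bytes);
  }
  const std::vector<int64_t>& requests() const { return requests_; }

 private:
  int64_t single_burst_bytes_;
  std::vector<int64_t> requests_;
};

// Splits a large request into chunks no larger than the single-burst size.
inline void LoopRateLimitRequest(ToyRateLimiter& limiter, int64_t total_bytes) {
  int64_t remaining = total_bytes;
  while (remaining > 0) {
    const int64_t chunk = std::min(remaining, limiter.GetSingleBurstBytes());
    limiter.Request(chunk);
    remaining -= chunk;
  }
}
```

With a single-burst size of 100 bytes, a 250-byte request becomes three requests of 100, 100 and 50 bytes, so no individual call violates the precondition.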
Yanqin Jin	2035798834	Update TransactionUtil::CheckKeyForConflict to also use timestamps (#9162 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9162 Existing TransactionUtil::CheckKeyForConflict() performs only seq-based conflict checking. If user-defined timestamp is enabled, it should perform conflict checking based on timestamps too. Update TransactionUtil::CheckKey-related methods to verify the timestamp of the latest version of a key is smaller than the read timestamp. Note that CheckKeysForConflict() is not updated since it's used only by optimistic transaction, and we do not plan to update it in this upcoming batch of diffs. Existing GetLatestSequenceForKey() returns the sequence of the latest version of a specific user key. Since we support user-defined timestamp, we need to update this method to also return the timestamp (if enabled) of the latest version of the key. This will be needed for snapshot validation. Reviewed By: ltamasi Differential Revision: D31567960 fbshipit-source-id: 2e4a14aed267435a9aa91bc632d2411c01946d44	2021-11-15 12:52:18 -08:00
Andrew Kryczka	9bb13c56b3	Use system-wide thread ID in info log lines (#9164 ) Summary: This makes it easier to debug with tools like `ps`. The change only applies to builds with glibc 2.30+ and _GNU_SOURCE extensions enabled. We could adopt it in more cases by using the syscall but this is enough for our build. Replaces https://github.com/facebook/rocksdb/issues/2973. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9164 Test Plan: - ran some benchmarks and correlated logged thread IDs with those shown by `ps -L`. - verified no noticeable regression in throughput for log heavy (more than 700k log lines and over 5k / second) scenario. Benchmark command: ``` $ TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom -compression_type=none -max_bytes_for_level_multiplier=2 -write_buffer_size=262144 -num_levels=7 -max_bytes_for_level_base=2097152 -target_file_size_base=524288 -level_compaction_dynamic_level_bytes=true -max_background_jobs=12 -num=20000000 ``` Results before: 15.9MB/s, 15.8MB/s, 16.0MB/s Results after: 16.3MB/s, 16.3MB/s, 15.8MB/s - Rely on CI to test the fallback behavior Reviewed By: riversand963 Differential Revision: D32399660 Pulled By: ajkr fbshipit-source-id: c24d44fdf7782faa616ef0a0964eaca3539d9c24	2021-11-12 19:46:06 -08:00
Akanksha Mahajan	17ce1ca48b	Reuse internal auto readahead_size at each Level (except L0) for Iterations (#9056 ) Summary: RocksDB does auto-readahead for iterators on noticing more than two sequential reads for a table file if user doesn't provide readahead_size. The readahead starts at 8KB and doubles on every additional read up to max_auto_readahead_size. However at each level, if iterator moves over next file, readahead_size starts again from 8KB. This PR introduces a new ReadOption "adaptive_readahead" which when set true will maintain readahead_size at each level. So when iterator moves from one file to another, new file's readahead_size will continue from previous file's readahead_size instead of scratch. However if reads are not sequential it will fall back to 8KB (default) with no prefetching for that block. 1. If block is found in cache but it was eligible for prefetch (block wasn't in RocksDB's prefetch buffer), readahead_size will decrease by 8KB. 2. It maintains readahead_size for L1 - Ln levels. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9056 Test Plan: Added new unit tests Ran db_bench for "readseq, seekrandom, seekrandomwhilewriting, readrandom" with --adaptive_readahead=true and there was no regression if new feature is enabled. Reviewed By: anand1976 Differential Revision: D31773640 Pulled By: akankshamahajan15 fbshipit-source-id: 7332d16258b846ae5cea773009195a5af58f8f98	2021-11-10 16:20:04 -08:00
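The readahead growth and carry-over described above can be modeled with a small state tracker. `ReadaheadTracker` and its method names are hypothetical and only model the size arithmetic (start at 8KB, double up to a maximum, optionally carry over across files). The cache-hit adjustment is left out, and this is not the RocksDB prefetching code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

constexpr std::size_t kInitialReadahead = 8 * 1024;  // 8KB starting size

// Hypothetical model of the per-level readahead size.
class ReadaheadTracker {
 public:
  ReadaheadTracker(std::size_t max_auto_readahead_size, bool adaptive_readahead)
      : max_(max_auto_readahead_size), adaptive_(adaptive_readahead) {
    assert(max_ >= kInitialReadahead);
  }

  std::size_t current() const { return current_; }

  // Each additional sequential read doubles the size, capped at the maximum.
  void OnSequentialRead() { current_ = std::min(current_ * 2, max_); }

  // A non-sequential read falls back to the initial size.
  void OnNonSequentialRead() { current_ = kInitialReadahead; }

  // Moving to the next file keeps the size only when adaptive mode is on.
  void OnNextFile() {
    if (!adaptive_) {
      current_ = kInitialReadahead;
    }
  }

 private:
  std::size_t max_;
  bool adaptive_;
  std::size_t current_ = kInitialReadahead;
};
```

After two sequential reads the size is 32KB. In adaptive mode it stays 32KB when the iterator moves to the next file, while in the default mode it restarts at 8KB.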
slk	937fbcbddc	Track per-SST user-defined timestamp information in MANIFEST (#9092 ) Summary: Track per-SST user-defined timestamp information in MANIFEST https://github.com/facebook/rocksdb/issues/8957 RocksDB supports the user-defined timestamp feature. An application can specify a timestamp when writing each k-v pair. When data is flushed from memory to disk files called SST files, the file creation activity is committed to MANIFEST. This commit is for tracking timestamp info in the MANIFEST for each file. The changes involved are as follows: 1) Track max/min timestamp in FileMetaData, and fix involved code. 2) Add NewFileCustomTag::kMinTimestamp and NewFileCustomTag::kMaxTimestamp in NewFileCustomTag ( in the kNewFile4 part ), and support involved code such as VersionEdit Encode and Decode etc. 3) Add unit test code for VersionEdit EncodeDecodeNewFile4, and fix involved test code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9092 Reviewed By: ajkr, akankshamahajan15 Differential Revision: D32252323 Pulled By: riversand963 fbshipit-source-id: d2642898d6e3ad1fef0eb866b98045408bd4e162	2021-11-10 10:49:04 -08:00
Yanqin Jin	a113cecfc9	Fix a bug in timestamp-related GC (#9116 ) Summary: For multiple versions (ts + seq) of the same user key, if they cross the boundary of `full_history_ts_low_`, we should retain the version that is visible to the `full_history_ts_low_`. Namely, we keep the internal key with the largest timestamp smaller than `full_history_ts_low`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9116 Test Plan: make check Reviewed By: ltamasi Differential Revision: D32261514 Pulled By: riversand963 fbshipit-source-id: e10f47c254c04c05261440051e4f50cb7d95474e	2021-11-09 13:08:55 -08:00
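The retention rule stated in this commit can be sketched on plain integer timestamps. `RetainedTimestamps` is a hypothetical function for illustration. It assumes that versions at or above `full_history_ts_low` are all kept and it ignores sequence numbers, snapshots and deletion markers, which the real compaction logic must also consider.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Given the timestamps of all versions of one user key, returns the
// timestamps retained under the simplified rule: keep every version at or
// above full_history_ts_low, plus the single newest version below it.
inline std::vector<std::uint64_t> RetainedTimestamps(
    std::vector<std::uint64_t> timestamps, std::uint64_t full_history_ts_low) {
  // Newest first, as versions of a key are ordered in an LSM tree.
  std::sort(timestamps.begin(), timestamps.end(),
            std::greater<std::uint64_t>());
  std::vector<std::uint64_t> retained;
  bool kept_one_below = false;
  for (std::uint64_t ts : timestamps) {
    if (ts >= full_history_ts_low) {
      retained.push_back(ts);
    } else if (!kept_one_below) {
      // Largest timestamp smaller than full_history_ts_low: still visible
      // to a read at full_history_ts_low, so it must survive.
      retained.push_back(ts);
      kept_one_below = true;
    }
  }
  return retained;
}
```

For versions at timestamps 10, 20, 30 and 40 with `full_history_ts_low` = 25, the versions at 40, 30 and 20 survive and the version at 10 can be dropped.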
Yanqin Jin	5237b39d2e	Fix assertion error during compaction with write-prepared txn enabled (#9105 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9105 The user contract of SingleDelete is that: a SingleDelete can only be issued to a key that exists and has NOT been updated. For example, application can insert one key `key`, and uses a SingleDelete to delete it in the future. The `key` cannot be updated or removed using Delete. In reality, especially when write-prepared transaction is being used, things can get tricky. For example, a prepared transaction already writes `key` to the memtable after a successful Prepare(). Afterwards, should the transaction rollback, it will insert a Delete into the memtable to cancel out the prior Put. Consider the following sequence of operations. ``` // operation sequence 1 Begin txn Put(key) Prepare() Flush() Rollback txn Flush() ``` There will be two SSTs resulting from above. One of them contains a PUT, while the second one contains a Delete. It is also known that releasing a snapshot can lead to an L0 containing only a SD for a particular key. Consider the following operations following the above block. ``` // operation sequence 2 db->Put(key) db->SingleDelete(key) Flush() ``` The operation sequence 2 can result in an L0 with only the SD. Should there be a snapshot for conflict checking created before operation sequence 1, then an attempt to compact the db may hit the assertion failure below, because ikey_.type is Delete (from a rollback). ``` else if (clear_and_output_next_key_) { assert(ikey_.type == kTypeValue || ikey_.type == kTypeBlobIndex); } ``` To fix the assertion failure, we can skip the SingleDelete if we detect an earlier Delete in the same snapshot interval. Reviewed By: ltamasi Differential Revision: D32056848 fbshipit-source-id: 23620a91e28562d91c45cf7e95f414b54b729748	2021-11-05 15:29:18 -07:00
Hui Xiao	1ababeb76a	Deallocate payload of BlockBasedTableBuilder::Rep::FilterBlockBuilder earlier for Full/PartitionedFilter (#9070 ) Summary: Note: This PR is the 1st part of a bigger PR stack (https://github.com/facebook/rocksdb/pull/9073). Context: Previously, the payload (i.e, filter data) within `BlockBasedTableBuilder::Rep::FilterBlockBuilder` object is not deallocated until `BlockBasedTableBuilder` is deallocated, even though it is no longer useful after its related `filter_content` has been written. - Transferred the payload (i.e, the filter data) out of `BlockBasedTableBuilder::Rep::FilterBlockBuilder` object - For PartitionedFilter: - Unified `filters` and `filter_gc` lists into one `std::deque<FilterEntry> filters` by adding a new field `last_filter_entry_key` and storing the `std::unique_ptr filter_data` with the `Slice filter` in the same entry - Reset `last_filter_data` in the case where `filters` is empty, which should be fine as by then we would've finished using all the `Slice filter` - Deallocated the payload by going out of scope as soon as we're done with using the `filter_content` associated with the payload - This is an internal interface change at the level of `FilterBlockBuilder::Finish()`, which leads to touching the inherited interface in `BlockBasedFilterBlockBuilder`. But for that, the payload transferring is ignored. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9070 Test Plan: - The main focus is to catch segmentation faults during `FilterBlockBuilder::Finish()` and `BlockBasedTableBuilder::Finish()` and interface mismatch. Relying on existing CI tests is enough as `assert(false)` was temporarily added to verify the new logic of transferring ownership indeed runs Reviewed By: pdillinger Differential Revision: D31884933 Pulled By: hx235 fbshipit-source-id: f73ecfbea13788d4fc058013ace27230110b52f4	2021-11-04 13:35:38 -07:00
Hui Xiao	a64c8ca7a8	Sanitize negative request bytes in GenericRateLimiter::Request and clarify API (#9112 ) Summary: Context: Surprisingly, there isn't any sanitization against negative `int64_t bytes` in `GenericRateLimiter::Request(int64_t bytes, const Env::IOPriority pri, Statistics* stats)`. A negative `bytes` can be passed in and incorrectly increases `available_bytes_` by subtracting the negative `bytes` from `available_bytes_`, such as [here](https://github.com/facebook/rocksdb/blob/main/util/rate_limiter.cc#L138) and [here](https://github.com/facebook/rocksdb/blob/main/util/rate_limiter.cc#L283), which are incorrect behaviors. - Sanitized negative request bytes by rounding it up to 0 - Added notes to public and internal API Pull Request resolved: https://github.com/facebook/rocksdb/pull/9112 Test Plan: - Rely on existing tests Reviewed By: ajkr Differential Revision: D32085364 Pulled By: hx235 fbshipit-source-id: b1b6066b2dd5ffc7bcbfb07069ca65a33578251b	2021-11-04 10:11:53 -07:00
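The effect of an unsanitized negative request, and of rounding it up to 0, can be shown with a toy token bucket. `ToyBucket` and the three functions are hypothetical names for this sketch and are not the `GenericRateLimiter` code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Toy token bucket (assumption): only tracks the remaining byte budget.
struct ToyBucket {
  int64_t available_bytes;
};

// Without sanitization, subtracting a negative value grows the budget.
inline void ConsumeUnsanitized(ToyBucket& bucket, int64_t bytes) {
  bucket.available_bytes -= bytes;
}

// Negative requests are rounded up to 0, as the commit describes.
inline int64_t SanitizeRequestBytes(int64_t bytes) {
  return std::max<int64_t>(bytes, 0);
}

inline void ConsumeSanitized(ToyBucket& bucket, int64_t bytes) {
  bucket.available_bytes -= SanitizeRequestBytes(bytes);
}
```

Starting from a budget of 100 bytes, an unsanitized request of -50 leaves 150 bytes available, while the sanitized path leaves the budget at 100.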
Yanqin Jin	9b53f14a35	Fixed a bug in CompactionIterator when write-prepared transaction is used (#9060 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9060 RocksDB bottommost level compaction may zero out an internal key's sequence if the key's sequence is in the earliest_snapshot. In write-prepared transaction, checking the visibility of a certain sequence in a specific released snapshot may return a "snapshot released" result. Therefore, it is possible, after a certain sequence of events, a PUT has its sequence zeroed out, but a subsequent SingleDelete of the same key will still be output with its original sequence. This violates the ascending order of keys and leads to incorrect result. The solution is to use an extra variable `last_key_seq_zeroed_` to track the information about visibility in earliest snapshot. With this variable, we can know for sure that a SingleDelete is in the earliest snapshot even if the said snapshot is released during compaction before processing the SD. Reviewed By: ltamasi Differential Revision: D31813016 fbshipit-source-id: d8cff59d6f34e0bdf282614034aaea99be9174e1	2021-11-03 15:55:00 -07:00
Jay Zhuang	29102641dd	Skip directory fsync for filesystem btrfs (#8903 ) Summary: Directory fsync might be expensive on btrfs and it may not be needed. Here are 4 directory fsync cases: 1. creating a new file: dir-fsync is not needed on btrfs, as long as the new file itself is synced. 2. renaming a file: dir-fsync is not needed if the renamed file is synced. So an API `FsyncAfterFileRename(filename, ...)` is provided to sync the file on btrfs. By default, it just calls dir-fsync. 3. deleting files: dir-fsync is forced by setting `IOOptions.force_dir_fsync = true` 4. renaming multiple files (like backup and checkpoint): dir-fsync is forced, the same as above. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8903 Test Plan: run tests on btrfs and non btrfs Reviewed By: ajkr Differential Revision: D30885059 Pulled By: jay-zhuang fbshipit-source-id: dd2730b31580b0bcaedffc318a762d7dbf25de4a	2021-11-03 12:21:27 -07:00
Peter Dillinger	2b60621f16	Don't call OnTableFileCreated with OK for empty+deleted file (#9118 ) Summary: EventListener::OnTableFileCreated was previously called with OK status and file_size==0 in cases of no SST file contents written (because there was no content to add) and the empty file deleted before calling the listener. This could lead to a stress test assertion failure added in https://github.com/facebook/rocksdb/issues/9054. This changes the status to Aborted, to align with the API doc: "... if the file is successfully created. Now it will also be called on failure case. User can check info.status to see if it succeeded or not." For internal purposes, this case is considered "success" but for listener purposes, no SST file is (successfully) created. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9118 Test Plan: test case added + existing db_stress Reviewed By: ajkr, riversand963 Differential Revision: D32120232 Pulled By: pdillinger fbshipit-source-id: a804e2e0a52598018d3b182da97804d402ffcdfa	2021-11-03 08:43:27 -07:00
Peter Dillinger	82afa01815	Some API clarifications (#9080 ) Summary: * Clarify that RocksDB is not exception safe on many of our callback and extension interfaces * Clarify FSRandomAccessFile::MultiRead implementations must accept non-sorted inputs (see https://github.com/facebook/rocksdb/issues/8953) * Clarify ConcurrentTaskLimiter and SstFileManager are not (currently) extensible interfaces * Mark WriteBufferManager as `final`, so it is then clearly not a callback interface, even though it smells like one * Clarify TablePropertiesCollector Status returns are mostly ignored Pull Request resolved: https://github.com/facebook/rocksdb/pull/9080 Test Plan: comments only (except WriteBufferManager final) Reviewed By: ajkr Differential Revision: D31968782 Pulled By: pdillinger fbshipit-source-id: 11b648ce3ce3c5e5bdc02d2eafc7ea4b864bd1d2	2021-11-02 20:30:07 -07:00
mrambacher	f72c834eab	Make FileSystem a Customizable Class (#8649 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8649 Reviewed By: zhichao-cao Differential Revision: D32036059 Pulled By: mrambacher fbshipit-source-id: 4f1e7557ecac52eb849b83ae02b8d7d232112295	2021-11-02 09:07:11 -07:00
Levi Tamasi	cfc57f55b5	Mention PR 9100 in HISTORY.md (#9111 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9111 Reviewed By: riversand963 Differential Revision: D32076310 Pulled By: ltamasi fbshipit-source-id: 81e30a02ded87c0f1a42985db42e80b62235ba11	2021-11-01 15:22:32 -07:00
sdong	a2b9be42b6	Try to start TTL earlier when kMinOverlappingRatio is used (#8749 ) Summary: Right now, when options.ttl is set, compactions are triggered around the time when TTL is reached. This might cause extra compactions which are often bursty. This commit tries to mitigate it by picking those files earlier in the normal compaction picking process. This is only implemented using kMinOverlappingRatio with Leveled compaction as it is the default value and it is more complicated to change other styles. When a file is aged more than ttl/2, RocksDB starts to boost the compaction priority of files in the normal compaction picking process, in the hope that, by the time TTL is reached, very few extra compactions are needed. In order for this to work, another change is made: during a compaction, if an output level file is older than ttl/2, cut output files based on original boundary (if it is not in the last level). This is to make sure that after an old file is moved to the next level, and new data is merged from the upper level, the new data falling into this range isn't reset with old timestamp. Without this change, in many cases, most files from one level will keep having old timestamp, even if they have newer data, and we get stuck in that state. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8749 Test Plan: Add a unit test to test the boosting logic. Will add a unit test to test it end-to-end. Reviewed By: jay-zhuang Differential Revision: D30735261 fbshipit-source-id: 503c2d89250b22911eb99e72b379be154de3428e	2021-11-01 14:36:31 -07:00
leipeng	230c98f3ce	fix histogram NUM_FILES_IN_SINGLE_COMPACTION (#9026 ) Summary: currently histogram `NUM_FILES_IN_SINGLE_COMPACTION` just counted files in first level of compaction input, this fix counts files in all levels of compaction input. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9026 Reviewed By: ajkr Differential Revision: D31668241 Pulled By: jay-zhuang fbshipit-source-id: c02f6c4a5df9fbf0b7510036594811152e8738af	2021-11-01 12:57:27 -07:00
leipeng	2b70224f82	remove bad extra RecordTick(stats_, WRITE_WITH_WAL) (#9064 ) Summary: This PR fix wrong ticker `WRITE_WITH_WAL`. `RecordTick(WRITE_WITH_WAL)` will be called later in `WriteToWAL` and `ConcurrentWriteToWAL`. Fixes: 1. Delete these two extra `RecordTick(WRITE_WITH_WAL)` 2. Fix corresponding test case Pull Request resolved: https://github.com/facebook/rocksdb/pull/9064 Reviewed By: ajkr Differential Revision: D31944459 Pulled By: riversand963 fbshipit-source-id: f1aa8d2a4320456bc357bc5b0902032f7dcad086	2021-11-01 11:43:14 -07:00
Yanqin Jin	fdf2a0d7eb	Fix a compaction bug for write-prepared txn (#9061 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9061 In write-prepared txn, checking a sequence's visibility in a released (old) snapshot may return "Snapshot released". Suppose we have two snapshots: ``` earliest_snap < earliest_write_conflict_snap ``` If we release `earliest_write_conflict_snap` but keep `earliest_snap` during bottommost level compaction, then it is possible that certain sequence of events can lead to a PUT being seq-zeroed followed by a SingleDelete of the same key. This violates the ascending order of keys, and will cause data inconsistency. Reviewed By: ltamasi Differential Revision: D31813017 fbshipit-source-id: dc68ba2541d1228489b93cf3edda5f37ed06f285	2021-10-29 15:23:17 -07:00
Peter Dillinger	a7d4bea43a	Implement XXH3 block checksum type (#9069 ) Summary: XXH3 - latest hash function that is extremely fast on large data, easily faster than crc32c on most any x86_64 hardware. In integrating this hash function, I have handled the compression type byte in a non-standard way to avoid using the streaming API (extra data movement and active code size because of hash function complexity). This approach got a thumbs-up from Yann Collet. Existing functionality change: * reject bad ChecksumType in options with InvalidArgument This change split off from https://github.com/facebook/rocksdb/issues/9058 because context-aware checksum is likely to be handled through different configuration than ChecksumType. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9069 Test Plan: tests updated, and substantially expanded. Unit tests now check that we don't accidentally change the values generated by the checksum algorithms ("schema test") and that we properly handle invalid/unrecognized checksum types in options or in file footer. DBTestBase::ChangeOptions (etc.) updated from two to one configuration changing from default CRC32c ChecksumType. The point of this test code is to detect possible interactions among features, and the likelihood of some bad interaction being detected by including configurations other than XXH3 and CRC32c--and then not detected by stress/crash test--is extremely low. Stress/crash test also updated (manual run long enough to see it accepts new checksum type). db_bench also updated for microbenchmarking checksums. 
### Performance microbenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)
./db_bench -benchmarks=crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3
crc32c   : 0.200 micros/op 5005220 ops/sec; 19551.6 MB/s (4096 per op)
xxhash   : 0.807 micros/op 1238408 ops/sec; 4837.5 MB/s (4096 per op)
xxhash64 : 0.421 micros/op 2376514 ops/sec; 9283.3 MB/s (4096 per op)
xxh3     : 0.171 micros/op 5858391 ops/sec; 22884.3 MB/s (4096 per op)
crc32c   : 0.206 micros/op 4859566 ops/sec; 18982.7 MB/s (4096 per op)
xxhash   : 0.793 micros/op 1260850 ops/sec; 4925.2 MB/s (4096 per op)
xxhash64 : 0.410 micros/op 2439182 ops/sec; 9528.1 MB/s (4096 per op)
xxh3     : 0.161 micros/op 6202872 ops/sec; 24230.0 MB/s (4096 per op)
crc32c   : 0.203 micros/op 4924686 ops/sec; 19237.1 MB/s (4096 per op)
xxhash   : 0.839 micros/op 1192388 ops/sec; 4657.8 MB/s (4096 per op)
xxhash64 : 0.424 micros/op 2357391 ops/sec; 9208.6 MB/s (4096 per op)
xxh3     : 0.162 micros/op 6182678 ops/sec; 24151.1 MB/s (4096 per op)
As you can see, especially once warmed up, xxh3 is fastest.
### Performance macrobenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)
Test
for I in `seq 1 50`; do for CHK in 0 1 2 3 4; do TEST_TMPDIR=/dev/shm/rocksdb$CHK ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=$CHK 2>&1 | grep 'micros/op' | tee -a results-$CHK & done; wait; done
Results (ops/sec)
for FILE in results*; do echo -n "$FILE "; awk '{ s += $5; c++; } END { print 1.0 * s / c; }' < $FILE; done
results-0 252118 # kNoChecksum
results-1 251588 # kCRC32c
results-2 251863 # kxxHash
results-3 252016 # kxxHash64
results-4 252038 # kXXH3
Reviewed By: mrambacher Differential Revision: D31905249 Pulled By: pdillinger fbshipit-source-id: cb9b998ebe2523fc7c400eedf62124a78bf4b4d1	2021-10-28 22:15:17 -07:00
Andrew Kryczka	f24c39ab3d	Prevent corruption with parallel manual compactions and `change_level == true` (#9077 ) Summary: The bug can impact the following scenario. There must be two `CompactRange()`s, call them A and B. Compaction A must have `change_level=true`. Compactions A and B must run in parallel, and new data must be added while they run as well. Now, on to the details of the race condition. Compaction A must reach the refitting phase while B's next step is to trivial move new data (i.e., data that has been inserted behind A) down to the same level that A's refit targets (`CompactRangeOptions::target_level`). B must be unregistered (i.e., has not yet called `AddManualCompaction()` for the current `RunManualCompaction()`) while A invokes `DisableManualCompaction()`s to prepare for refitting. In the old code, B could still proceed to register a manual compaction, while A had disabled manual compaction. The next part of the race condition is B picks and schedules a trivial move while A has released the lock in refitting phase in order to persist the LSM state change (i.e., the log phase of `LogAndApply()`). That way, B does not see the refitted data when picking a trivial-move compaction. So it is susceptible to picking one that overlaps. Finally, B executes the picked trivial-move compaction. Trivial-move compactions are special in that they never check whether manual compaction is disabled. So the picked compaction causing overlap ends up being applied, leading to LSM corruption if `force_consistency_checks=false`, or entering read-only mode with `Status::Corruption` if `force_consistency_checks=true` (the default). The fix is just to prevent B from registering itself in `RunManualCompaction()` while manual compactions are disabled, consequently preventing any trivial move or other compaction from being picked/scheduled. Thanks to siying for finding the bug. 
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9077 Test Plan: The test does not go all the way in exposing the bug because it requires a compaction to be picked/scheduled while logging LSM state change for RefitLevel(). But the fix is to make such a compaction not picked/scheduled in the first place, so any repro of that scenario would end up hanging RefitLevel() logging. So instead I just verified no such compaction is registered in the scenario where `RefitLevel()` disables manual compactions. Reviewed By: siying Differential Revision: D31921908 Pulled By: ajkr fbshipit-source-id: 9bb5d0e847ad428211227f40830c685c209fbecb	2021-10-27 23:08:56 -07:00
Yanqin Jin	f72fd58565	Fix atomic flush waiting forever for MANIFEST write (#9034 ) Summary: In atomic flush, concurrent background flush threads will commit to the MANIFEST one by one, in the order of the IDs of their picked memtables for all included column families. Each time, a background flush thread decides whether to wait based on two criteria: - Is db stopped? If so, don't wait. - Am I the one to commit the currently earliest memtable? If so, don't wait and ready to go. When atomic flush was implemented, error writing to or syncing the MANIFEST would cause the db to be stopped. Therefore, this background thread does not have to check for the background error while waiting. If there has been such an error, `DBStopped()` would have been true, and this thread will not wait forever. After we improved error handling, RocksDB may map an IOError while writing to MANIFEST to a soft error, if there is no WAL. This requires the background threads to check for background error while waiting. Otherwise, a background flush thread may wait forever. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9034 Test Plan: make check Reviewed By: zhichao-cao Differential Revision: D31639225 Pulled By: riversand963 fbshipit-source-id: e9ab07c4d8f2eade238adeefe3e42dd9a5a3ebbd	2021-10-20 21:34:47 -07:00
sdong	633f069c29	Update Release Version to 6.26 (#9059 ) Summary: Before cutting release branch 6.26, update version.h and release notes Pull Request resolved: https://github.com/facebook/rocksdb/pull/9059 Reviewed By: ajkr Differential Revision: D31805126 fbshipit-source-id: ae85ccf06ec756fa21163161f53fd0b728e6e32e	2021-10-20 15:32:01 -07:00
Andrew Kryczka	4217d1bce7	Support `GetMapProperty()` with "rocksdb.dbstats" (#9057 ) Summary: This PR supports querying `GetMapProperty()` with "rocksdb.dbstats" to get the DB-level stats in a map format. It only reports cumulative stats over the DB lifetime and, as such, does not update the baseline for interval stats. Like other map properties, the string keys are not (yet) exposed in the public API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9057 Test Plan: new unit test Reviewed By: zhichao-cao Differential Revision: D31781495 Pulled By: ajkr fbshipit-source-id: 6f77d3aee8b4b1a015061b8c260a123859ceaf9b	2021-10-20 13:17:00 -07:00
sdong	c66b4429ff	Incremental Space Amp Compactions in Universal Style (#8655 ) Summary: This commit introduces incremental compaction in universal style for space amplification. This follows the first improvement mentioned in https://rocksdb.org/blog/2021/04/12/universal-improvements.html . The implementation simply picks up files about size of max_compaction_bytes to compact and execute if the penalty is not too big. More optimizations can be done in the future, e.g. prioritizing between this compaction and other types. But for now, the feature is supposed to be functional and can often reduce frequency of full compactions, although it can introduce penalty. In order to cut files more efficiently so that more files from upper levels can be included, SST file cutting threshold (for current file + overlapping parent level files) is set to 1.5X of target file size. A 2MB target file size will generate files like this: https://gist.github.com/siying/29d2676fba417404f3c95e6c013c7de8 Number of files indeed increases but it is not out of control. Two sets of write benchmarks were run: 1. For ingestion rate limited scenario, we can see full compaction is mostly eliminated: https://gist.github.com/siying/959bc1186066906831cf4c808d6e0a19 . The write amp increased from 7.7 to 9.4, as expected. After applying file cutting, the number is improved to 8.9. In another benchmark, the write amp is even better with the incremental approach: https://gist.github.com/siying/d1c16c286d7c59c4d7bba718ca198163 2. For ingestion rate unlimited scenario, incremental compaction turns out to be too expensive most of the time and is not executed, as expected. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8655 Test Plan: Add unit tests to the functionality. Reviewed By: ajkr Differential Revision: D31787034 fbshipit-source-id: ce813e63b15a61d5a56e97bf8902a1b28e011beb	2021-10-20 10:04:13 -07:00
Zhichao Cao	6d93b87588	Add lowest_used_cache_tier to ImmutableDBOptions to enable or disable Secondary Cache (#9050 ) Summary: Currently, if Secondary Cache is provided to the lru cache, it is used by default. We add CacheTier to advanced_options.h to describe the cache tier we used. Add a `lowest_used_cache_tier` option to `DBOptions` (immutable) and pass it to BlockBasedTableReader to decide if secondary cache will be used or not. By default it is `CacheTier::kNonVolatileTier`, which means we always use both block cache (kVolatileTier) and secondary cache (kNonVolatileTier). By setting it to `CacheTier::kVolatileTier`, the DB will not use the secondary cache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9050 Test Plan: added new tests Reviewed By: anand1976 Differential Revision: D31744769 Pulled By: zhichao-cao fbshipit-source-id: a0575ebd23e1c6dfcfc2b4c8578764e73b15bce6	2021-10-19 15:54:23 -07:00
Jay Zhuang	f20b07cebb	Add "Java API Changes" section in HISTORY (#9055 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9055 Reviewed By: ajkr Differential Revision: D31765398 Pulled By: jay-zhuang fbshipit-source-id: 77ed67d69415c9fbbfc1132b15310b293e3939c6	2021-10-19 15:23:06 -07:00
Peter Dillinger	b234a3f569	Improve data block construction performance (#9040 ) Summary: ... by bypassing tracking of last_key in BlockBuilder when last_key is already known (for BlockBasedTableBuilder::data_block). I tried extracting a base class of BlockBuilder without the last_key tracking at all, but that became complicated by NewFlushBlockPolicy() in the public API referencing BlockBuilder, which would need to be the base class, and I don't want to replace nearly all the internal references to BlockBuilder. Possible follow-up: * Investigate / consider using AddWithLastKey in more places This improvement should stack with https://github.com/facebook/rocksdb/issues/9039 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9040 Test Plan: TEST_TMPDIR=/dev/shm/rocksdb1 ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=50000000 Compiled with DEBUG_LEVEL=0 Test vs. control runs simultaneously for better accuracy, units = ops/sec Run 1: 278929 vs. 267799 (+4.2%) Run 2: 281836 vs. 267432 (+5.4%) Run 3: 278279 vs. 270454 (+2.9%) (This benchmark is chosen to have detectable signal-to-noise, not to represent expected improvement percent on real workloads.) Reviewed By: mrambacher Differential Revision: D31706033 Pulled By: pdillinger fbshipit-source-id: 8a50fe6fefdd67b6d7665ffa687bbdcf5ad0d5ec	2021-10-19 12:36:21 -07:00
Alan Paxton	8d615a2b1d	New-style blob option bindings, Java option getter and improve/fix option parsing (#8999 ) Summary: Implementation of https://github.com/facebook/rocksdb/issues/8221, plus/including extension of Java options API to allow the get() of options from RocksDB. The extension allows more comprehensive testing of options at the Java side, by validating that the options are set at the C++ side. Variations on methods: MutableColumnFamilyOptions.MutableColumnFamilyOptionsBuilder getOptions() MutableDBOptions.MutableDBOptionsBuilder getDBOptions() retrieve the options via RocksDB C++ interfaces, and parse the resulting string into one of the Java-style option objects. This necessitated generalising the parsing of option strings in Java, which now parses the full range of option strings returned by the C++ interface, rather than a useful subset. This necessitates the list-separator being changed to :(colon) from , (comma). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8999 Reviewed By: jay-zhuang Differential Revision: D31655487 Pulled By: ltamasi fbshipit-source-id: c38e98145c81c61dc38238b0df580db176ce4efd	2021-10-19 09:21:52 -07:00
Peter Dillinger	ad5325a736	Experimental support for SST unique IDs (#8990 ) Summary: * New public header unique_id.h and function GetUniqueIdFromTableProperties which computes a universally unique identifier based on table properties of table files from recent RocksDB versions. * Generation of DB session IDs is refactored so that they are guaranteed unique in the lifetime of a process running RocksDB. (SemiStructuredUniqueIdGen, new test included.) Along with file numbers, this enables SST unique IDs to be guaranteed unique among SSTs generated in a single process, and "better than random" between processes. See https://github.com/pdillinger/unique_id * In addition to public API producing 'external' unique IDs, there is a function for producing 'internal' unique IDs, with functions for converting between the two. In short, the external ID is "safe" for things people might do with it, and the internal ID enables more "power user" features for the future. Specifically, the external ID goes through a hashing layer so that any subset of bits in the external ID can be used as a hash of the full ID, while also preserving uniqueness guarantees in the first 128 bits (bijective both on first 128 bits and on full 192 bits). Intended follow-up: * Use the internal unique IDs in cache keys. (Avoid conflicts with https://github.com/facebook/rocksdb/issues/8912) (The file offset can be XORed into the third 64-bit value of the unique ID.) * Publish the external unique IDs in FileStorageInfo (https://github.com/facebook/rocksdb/issues/8968) Pull Request resolved: https://github.com/facebook/rocksdb/pull/8990 Test Plan: Unit tests added, and checking of unique ids in stress test. NOTE in stress test we do not generate nearly enough files to thoroughly stress uniqueness, but the test trims off pieces of the ID to check for uniqueness so that we can infer (with some assumptions) stronger properties in the aggregate. 
Reviewed By: zhichao-cao, mrambacher Differential Revision: D31582865 Pulled By: pdillinger fbshipit-source-id: 1f620c4c86af9abe2a8d177b9ccf2ad2b9f48243	2021-10-18 23:32:01 -07:00
Jay Zhuang	314de7e7de	Make `DB::Close()` thread-safe (#8970 ) Summary: If `DB::Close()` is called in multi-thread env, the resource could be double released, which causes exception or assert. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8970 Test Plan: Test with multi-thread benchmark, with each thread try to close the DB at the end. Reviewed By: pdillinger Differential Revision: D31242042 Pulled By: jay-zhuang fbshipit-source-id: a61276b1b61e07732e375554106946aea86a23eb	2021-10-18 20:32:35 -07:00
Alan Paxton	86cf7266c3	keyMayExist() supports ByteBuffer (#9013 ) Summary: closes https://github.com/facebook/rocksdb/issues/7917 Implemented ByteBuffer API variants of Java keyMayExist() uniformly with and without column families, read options and return data values. Implemented 2 supporting C++ JNI methods. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9013 Reviewed By: mrambacher Differential Revision: D31665989 Pulled By: jay-zhuang fbshipit-source-id: 8adc1730217dba38d6fa7b31d788650a33e28af1	2021-10-18 17:20:07 -07:00
Peter Dillinger	3ffb3baa0b	Add (Live)FileStorageInfo API (#8968 ) Summary: New classes FileStorageInfo and LiveFileStorageInfo and 'experimental' function DB::GetLiveFilesStorageInfo, which is intended to largely replace several fragmented DB functions needed to create checkpoints and backups. This function is now used to create checkpoints and backups, because it fixes many (probably not all) of the prior complexities of checkpoint not having atomic access to DB metadata. This also ensures strong functional test coverage of the new API. Specifically, much of the old CheckpointImpl::CreateCustomCheckpoint has been migrated to and updated in DBImpl::GetLiveFilesStorageInfo, with the former now calling the latter. Also, the class FileStorageInfo in metadata.h compatibly replaces BackupFileInfo and serves as a new base class for SstFileMetaData. Some old fields of SstFileMetaData are still provided (for now) but deprecated. Although FileStorageInfo::directory is accurate when using db_paths and/or cf_paths, these have never been supported by Checkpoint nor BackupEngine and still are not. This change does now detect these cases and return NotSupported when appropriate. (More work needed for support.) Somehow this change broke ProgressCallbackDuringBackup, but the progress_callback logic was dubious to begin with because it would call the callback based on copy buffer size, not size actually copied. Logic and test updated to track size actually copied per-thread. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8968 Test Plan: tests updated. DB::GetLiveFilesStorageInfo mostly tested by use in CheckpointImpl. DBTest.SnapshotFiles updated to also test GetLiveFilesStorageInfo, including reading the data after DB close. Added CheckpointTest.CheckpointWithDbPath (NotSupported). Reviewed By: siying Differential Revision: D31242045 Pulled By: pdillinger fbshipit-source-id: b183d1ce9799e220daaefd6b3b5365d98de676c0	2021-10-16 10:04:32 -07:00
Andrew Kryczka	ffc48b6cad	Update HISTORY.md for #9009 (#9036 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9036 Reviewed By: zhichao-cao Differential Revision: D31640901 Pulled By: ajkr fbshipit-source-id: 0b1e6e36094a74bb7906af44e29ecbeaa258de58	2021-10-14 09:36:32 -07:00
Giuseppe Ottaviano	4bfd415e34	Fix sequence number bump logic in multi-CF SST ingestion (#9005 ) Summary: The code in `IngestExternalFiles()` that bumps the DB's sequence number depending on what seqnos were assigned to the files has 3 bugs: 1) There is an assertion that the sequence number is increased in all the affected column families, but this is unnecessary, it is fine if some files can stick to a lower sequence number. It is very easy to hit the assertion: it is sufficient to insert 2 files in 2 CFs, one which overlaps the CF and one that doesn't (for example the CF is empty). The line added in the `IngestFilesIntoMultipleColumnFamilies_Success` test makes the assertion fail. 2) SetLastSequence() is called with the sum of all the bumps across CFs, but we should take the maximum instead, as all CFs start with the current seqno and bump it independently. 3) The code above is accidentally under a `#ifndef NDEBUG`, so it doesn't run in optimized builds, so some files may be assigned seqnos from the future. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9005 Test Plan: Added line in `IngestFilesIntoMultipleColumnFamilies_Success` that triggers the assertion, verified that the test (and all the others) pass after the fix. Reviewed By: ajkr Differential Revision: D31597892 Pulled By: ot fbshipit-source-id: c2d3237f90290df1178736ace8653a9623f5a770	2021-10-12 20:39:52 -07:00
Levi Tamasi	7cc52cd8f5	Update HISTORY for PR 8994 (#9017 ) Summary: Also, expand on/clarify a comment in `VersionStorageInfoTest`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9017 Reviewed By: riversand963 Differential Revision: D31566130 Pulled By: ltamasi fbshipit-source-id: 1d30c7af084c4de7b2030bc6c768838d65746010	2021-10-12 10:19:56 -07:00

1105 Commits
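
Illustration for commit 4bfd415e34 (#9005): the message says `SetLastSequence()` was called with the sum of the per-CF sequence bumps, while the maximum is correct because every column family starts from the same current seqno and bumps it independently. The sketch below only contrasts the two computations. The function names and the `bumps_per_cf` vector are hypothetical and are not taken from the RocksDB source.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not RocksDB code): the old behaviour described in the
// commit message, which adds up the bumps from every column family.
uint64_t LastSeqUsingSum(uint64_t current_seq,
                         const std::vector<uint64_t>& bumps_per_cf) {
  uint64_t total = 0;
  for (uint64_t bump : bumps_per_cf) {
    total += bump;
  }
  return current_seq + total;
}

// Hypothetical helper (not RocksDB code): the fixed behaviour. Each column
// family bumps independently from the same starting seqno, so the DB-wide
// last sequence only needs to cover the largest single bump.
uint64_t LastSeqUsingMax(uint64_t current_seq,
                         const std::vector<uint64_t>& bumps_per_cf) {
  uint64_t max_bump = 0;
  for (uint64_t bump : bumps_per_cf) {
    max_bump = std::max(max_bump, bump);
  }
  // Guard against wrap-around in this toy example.
  assert(max_bump <= UINT64_MAX - current_seq);
  return current_seq + max_bump;
}
```

With a current seqno of 100 and per-CF bumps of 2, 0 and 3, the sum variant yields 105 and the max variant yields 103. A bump of 0 corresponds to the case from the commit message where a file does not overlap its column family and can keep a lower sequence number.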