rocksdb

Author	SHA1	Message	Date
Levi Tamasi	68d8b28389	Log the amount of blob garbage generated by compactions in the MANIFEST (#8450 ) Summary: The patch builds on `BlobGarbageMeter` and `BlobCountingIterator` (introduced in https://github.com/facebook/rocksdb/issues/8426 and https://github.com/facebook/rocksdb/issues/8443 respectively) and ties it all together. It measures the amount of garbage generated by a compaction and logs the corresponding `BlobFileGarbage` records as part of the compaction job's `VersionEdit`. Note: in order to have accurate results, `kRemoveAndSkipUntil` for compaction filters is implemented using iteration. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8450 Test Plan: Ran `make check` and the crash test script. Reviewed By: jay-zhuang Differential Revision: D29338207 Pulled By: ltamasi fbshipit-source-id: 4381c432ac215139439f6d6fb801a6c0e4d8c128	2021-06-24 16:11:56 -07:00
Levi Tamasi	d44ef2ed4d	Remove obsolete method VersionSet::VerifyCompactionFileConsistency (#8449 ) Summary: `VersionSet::VerifyCompactionFileConsistency` was superseded by the LSM tree consistency checks introduced in https://github.com/facebook/rocksdb/pull/6901, which are more comprehensive, more efficient, and are performed unconditionally even in release builds. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8449 Test Plan: `make check` Reviewed By: ajkr Differential Revision: D29337441 Pulled By: ltamasi fbshipit-source-id: a05324f88e3400e27e6a00406c878a6276e0c9cc	2021-06-23 13:28:34 -07:00
Levi Tamasi	6adc39e1bf	Add an internal iterator that can measure the inflow of blobs (#8443 ) Summary: Follow-up to https://github.com/facebook/rocksdb/issues/8426 . The patch adds a new kind of `InternalIterator` that wraps another one and passes each key-value encountered to `BlobGarbageMeter` as inflow. This iterator will be used as an input iterator for compactions when the input SSTs reference blob files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8443 Test Plan: `make check` Reviewed By: jay-zhuang Differential Revision: D29311987 Pulled By: ltamasi fbshipit-source-id: b4493b4c0c0c2e3c2ecc33c8969a5ef02de5d9d8	2021-06-23 10:25:47 -07:00
Jay Zhuang	f89423a57a	Revert "Revert "Snapshot release triggered compaction without multiple tombstones (#8357 )" (#8410 )" (#8438 ) Summary: This reverts commit `25be1ed66a`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8438 Test Plan: Run the impacted mysql test 40 times Reviewed By: ajkr Differential Revision: D29286247 Pulled By: jay-zhuang fbshipit-source-id: d3bd056971a19a8b012d5d0295fa045c012b3c04	2021-06-22 11:10:03 -07:00
Levi Tamasi	cbb3b25915	Print blob file checksums as hex (#8437 ) Summary: Currently, blob file checksums are incorrectly dumped as raw bytes in the `ldb manifest_dump` output (i.e. they are not printed as hex). The patch fixes this and also updates some test cases to reflect that the checksum value field in `BlobFileAddition` and `SharedBlobFileMetaData` contains the raw checksum and not a hex string. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8437 Test Plan: `make check` Tested using `ldb manifest_dump` Reviewed By: akankshamahajan15 Differential Revision: D29284170 Pulled By: ltamasi fbshipit-source-id: d11cfb3435b14cd73c8a3d3eb14fa0f9fa1d2228	2021-06-22 09:49:44 -07:00
Jay Zhuang	54d73d6429	Fix DeleteFilesInRange may cause inconsistent compaction error (#8434 ) Summary: `DeleteFilesInRange()` marks the files to be deleted as `being_compacted` before deleting them, which may cause ongoing compactions to report a corruption exception, or to hit an ASSERT in debug builds. This patch adds the missing `ComputeCompactionScore()` call when `being_compacted` is set. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8434 Test Plan: Unittest Reviewed By: ajkr Differential Revision: D29276127 Pulled By: jay-zhuang fbshipit-source-id: f5b223e3c1fc6d821e100e3f3442bc70c1d50cf7	2021-06-22 09:17:37 -07:00
Levi Tamasi	065bea1587	Add a class for measuring the amount of garbage generated during compaction (#8426 ) Summary: This is part of an alternative approach to https://github.com/facebook/rocksdb/issues/8316. Unlike that approach, this one relies on key-values getting processed one by one during compaction, and does not involve persistence. Specifically, the patch adds a class `BlobGarbageMeter` that can track the number and total size of blobs in a (sub)compaction's input and output on a per-blob file basis. This information can then be used to compute the amount of additional garbage generated by the compaction for any given blob file by subtracting the "outflow" from the "inflow." Note: this patch only adds `BlobGarbageMeter` and associated unit tests. I plan to hook up this class to the input and output of `CompactionIterator` in a subsequent PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8426 Test Plan: `make check` Reviewed By: jay-zhuang Differential Revision: D29242250 Pulled By: ltamasi fbshipit-source-id: 597e50ad556540e413a50e804ba15bc044d809bb	2021-06-21 22:25:30 -07:00
mwish	19a89267ca	typo: fix typo in db/write_thread's state (#8423 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8423 Reviewed By: mrambacher Differential Revision: D29232587 Pulled By: jay-zhuang fbshipit-source-id: 04d4937cf0605cbf341a920d1305369a7b8f0574	2021-06-18 17:14:51 -07:00
Zhichao Cao	82a70e1470	Trace MultiGet Keys and CF_IDs to the trace file (#8421 ) Summary: Tracing the MultiGet information including timestamp, keys, and CF_IDs to the trace file for analyzing and replay. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8421 Test Plan: make check, add test to trace_analyzer_test Reviewed By: anand1976 Differential Revision: D29221195 Pulled By: zhichao-cao fbshipit-source-id: 30c677d6c39ab31ef4bbdf7e0d1fa1fd79f295ff	2021-06-18 15:04:05 -07:00
Baptiste Lemaire	e817bc9628	Added memtable garbage statistics (#8411 ) Summary: 2 new statistics counters are added to RocksDB: `MEMTABLE_PAYLOAD_BYTES_AT_FLUSH` and `MEMTABLE_GARBAGE_BYTES_AT_FLUSH`. The former tracks how many raw bytes of useful data are present on the memtable at flush time, whereas the latter tracks how many of these raw bytes are considered garbage, meaning that they ended up not being imported on the SSTables resulting from the flush operations. Unit test: run `make db_flush_test -j$(nproc); ./db_flush_test` to run the unit test. This executable includes 3 tests, which test support and correct stat calculations for workloads with inserts, deletes, and DeleteRanges. The parameters are set such that the workloads are performed on a single memtable, and a single SSTable is created as a result of the flush operation. The flush operation is manually called in the test file. The tests verify that the values of these 2 statistics counters introduced in this PR can be exactly predicted, showing that we have a full understanding of the underlying operations. Performance testing: `./db_bench -statistics -benchmarks=fillrandom -num=10000000` repeated 10 times. Timing done using "date" function in a bash script. _Results_: Original Rocksdb fork: mean 66.6 sec, std 1.18 sec. This feature branch: mean 67.4 sec, std 1.35 sec. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8411 Reviewed By: akankshamahajan15 Differential Revision: D29150629 Pulled By: bjlemaire fbshipit-source-id: 7b3c2e86d50c6aa34fa50fd134282eacb543a5b1	2021-06-18 04:57:27 -07:00
Akanksha Mahajan	5ba1b6e549	Cache warming data blocks during flush (#8242 ) Summary: This PR prepopulates warm/hot data blocks which are already in memory into block cache at the time of flush. On a flush, the data blocks that are in memory (in memtables) get flushed to the device. If using Direct IO, additional IO is incurred to read this data back into memory again, which is avoided by enabling the newly added option. Right now, this is enabled only for flush for data blocks. We plan to expand this option to cover compactions in the future and for other types of blocks. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8242 Test Plan: Add new unit test Reviewed By: anand1976 Differential Revision: D28521703 Pulled By: akankshamahajan15 fbshipit-source-id: 7219d6958821cedce689a219c3963a6f1a9d5f05	2021-06-17 21:56:47 -07:00
anand76	575ea26ec9	Don't log a warning if file system doesn't support ReopenWritableFile() (#8414 ) Summary: RocksDB logs a warning if WAL truncation on DB open fails. It's possible that on some file systems, truncation is not required and they would return ```Status::NotSupported()``` for ```ReopenWritableFile```. Don't log a warning in such cases. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8414 Reviewed By: akankshamahajan15 Differential Revision: D29181738 Pulled By: anand1976 fbshipit-source-id: 6e01e9117e1e4c1d67daa4dcee7fa59d06e057a7	2021-06-17 12:05:40 -07:00
mrambacher	d5bd0039b9	Rename ImmutableOptions variables (#8409 ) Summary: This is the next part of the ImmutableOptions cleanup. After changing the use of ImmutableCFOptions to ImmutableOptions, there were places in the code that did something like "ImmutableOptions* immutable_cf_options", where "cf" referred to the "old" type. This change simply renames the variables to match the current type. No new functionality is introduced. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8409 Reviewed By: pdillinger Differential Revision: D29166248 Pulled By: mrambacher fbshipit-source-id: 96de97f8e743f5c5160f02246e3ed8269556dc6f	2021-06-16 16:51:38 -07:00
Andrew Kryczka	25be1ed66a	Revert "Snapshot release triggered compaction without multiple tombstones (#8357 )" (#8410 ) Summary: This reverts commit `9167ece586`. It was found to reliably trip a compaction picking conflict assertion in a MyRocks unit test. We don't understand why yet so reverting in the meantime. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8410 Test Plan: `make check -j48` Reviewed By: jay-zhuang Differential Revision: D29150300 Pulled By: ajkr fbshipit-source-id: 2de8664f355d6da015e84e5fec2e3f90f49741c8	2021-06-15 18:15:15 -07:00
mrambacher	281ac9c89e	Add CreateFrom methods to Env/FileSystem (#8174 ) Summary: - Added CreateFromString method to Env and FileSystem to replace LoadEnv/Load. This method/signature is a precursor to making these classes extend Customizable. - Added CreateFromSystem to Env. This method standardizes creating an Env from the environment variables. Previously, some places would check TEST_ENV_URI and others would also check TEST_FS_URI. Now the code is more common/standardized. - Added CreateFromFlags to Env. This method allows an Env to be created from string options (such as GFLAGS options) in a more standard way. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8174 Reviewed By: zhichao-cao Differential Revision: D28999603 Pulled By: mrambacher fbshipit-source-id: 88e6911e7e91f908458a7fe10a20e93ecbc275fb	2021-06-15 03:43:48 -07:00
Hui Xiao	dcddc1065e	Make CompactionService derived from Customizable (#8395 ) Summary: (1)Make CompactionService derived from Customizable by defining two extra functions that are needed, as described in customizable.h comment section (2)Revise the MyTestCompactionService class in compaction_service_test.cc to satisfy the class inheritance requirement (3)Specify namespace of ToString() in compaction_service_test.cc to avoid function collision with CompactionService's ancestor classes Test did: make -j24 compaction_service_test ./compaction_service_test Pull Request resolved: https://github.com/facebook/rocksdb/pull/8395 Reviewed By: jay-zhuang Differential Revision: D29076068 Pulled By: hx235 fbshipit-source-id: c130100fa466939b3137e917f5fdc4b2ae8e37d4	2021-06-14 11:41:57 -07:00
Peter Dillinger	d5a46c40e5	Pin CacheEntryStatsCollector to fix performance bug (#8385 ) Summary: If the block Cache is full with strict_capacity_limit=false, then our CacheEntryStatsCollector could be immediately evicted on release, so iterating through column families with shared block cache could trigger re-scan for each CF. This change fixes that problem by pinning the CacheEntryStatsCollector from InternalStats so that it's not evicted. I had originally thought that this object could participate in LRU like everything else, but even though a re-load+re-scan only touches memory, it can be orders of magnitude more expensive than other cache misses. One service in Facebook has scans that take ~20s over 100GB block cache that is mostly 4KB entries. (The up-side of this bug and https://github.com/facebook/rocksdb/issues/8369 is that we had a natural experiment on the effect on some service metrics even with block cache scans running continuously in the background--a kind of worst case scenario. Metrics like latency were not affected enough to trigger warnings.) Other smaller fixes: 20s is already a sizable portion of 600s stats dump period, or 180s default max age to force re-scan, so added logic to ensure that (for each block cache) we don't spend more than 0.2% of our background thread time scanning it. Nevertheless, "foreground" requests for cache entry stats (calls to `db->GetMapProperty(DB::Properties::kBlockCacheEntryStats)`) are permitted to consume more CPU. Renamed field to cache_entry_stats_ to match code style. This change is intended for patching in 6.21 release. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8385 Test Plan: unit test expanded to cover new logic (detect regression), some manual testing with db_bench Reviewed By: ajkr Differential Revision: D29042759 Pulled By: pdillinger fbshipit-source-id: 236faa902397f50038c618f50fbc8cf3f277308c	2021-06-14 08:15:11 -07:00
Jay Zhuang	d60ae5b1c7	Fix flaky ManualCompactionMax test (#8396 ) Summary: Recalculate the total size after generating new SST files. Newly generated files might have a different size than the previous time, which could cause the test to fail. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8396 Test Plan: ``` gtest-parallel ./db_compaction_test --gtest_filter=DBCompactionTest.ManualCompactionMax -r 1000 -w 100 ``` Reviewed By: akankshamahajan15 Differential Revision: D29083299 Pulled By: jay-zhuang fbshipit-source-id: 49d4bd619cefc0f9a1f452f8759ff4c2ba1b6fdb	2021-06-14 08:11:40 -07:00
Levi Tamasi	146263887f	Disable subcompactions for user-defined timestamps (#8393 ) Summary: The subcompaction boundary picking logic does not currently guarantee that all user keys that differ only by timestamp get processed by the same subcompaction. This can cause issues with the `CompactionIterator` state machine: for instance, one subcompaction that processes a subset of such KVs might drop a tombstone based on the KVs it sees, while in reality the tombstone might not have been eligible to be optimized out. (See also https://github.com/facebook/rocksdb/issues/6645, which adjusted the way compaction inputs are picked for the same reason.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/8393 Test Plan: Ran `make check` and the crash test script with timestamps enabled. Reviewed By: jay-zhuang Differential Revision: D29071635 Pulled By: ltamasi fbshipit-source-id: f6c72442122b4e581871e096fabe3876a9e8a5a6	2021-06-12 12:09:25 -07:00
Peter Dillinger	b3dbeadc34	Fix double-dumping CF stats to log (#8380 ) Summary: DBImpl::DumpStats is supposed to do this: Dump DB stats to LOG For each CF, dump CFStatsNoFileHistogram to LOG For each CF, dump CFFileHistogram to LOG Instead, due to a longstanding bug from 2017 (https://github.com/facebook/rocksdb/issues/2126), it would dump CFStats, which includes both CFStatsNoFileHistogram and CFFileHistogram, in both loops, resulting in near-duplicate output. This fixes the bug. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8380 Test Plan: Manual inspection of LOG after db_bench Reviewed By: jay-zhuang Differential Revision: D29017535 Pulled By: pdillinger fbshipit-source-id: 3010604c4a629a80347f129cd746ce9b0d0cbda6	2021-06-11 17:06:09 -07:00
Zhichao Cao	58162835d1	All the NoSpace() errors will be handled by regular SetBGError and RecoverFromNoSpace() (#8376 ) Summary: In the current logic, any IO Error with retryable flag == true will be handled by the special logic and in most cases, StartRecoverFromRetryableBGIOError will be called to do the auto resume. If a NoSpace error with the retryable flag set occurs during a WAL write, it is mapped as a hard error, which will trigger the auto recovery. During the recovery process, if writes continue and append to the WAL, the write process sees that bg_error is set to HardError and it calls WriteStatusCheck(), which calls SetBGError() with Status (not IOStatus). This will redirect to the regular SetBGError interface, in which recovery_error_ will be set to the corresponding error. With recovery_error_ set, the auto resume thread created in StartRecoverFromRetryableBGIOError will keep failing as long as the user keeps trying to write. To fix this issue, all NoSpace errors (whether or not the retryable flag is set) are redirected to the regular SetBGError, and RecoverFromNoSpace() does the recovery job, which calls SstFileManager::StartErrorRecovery(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8376 Test Plan: make check and added the new testing case Reviewed By: anand1976 Differential Revision: D29071828 Pulled By: zhichao-cao fbshipit-source-id: 7171d7e14cc4620fdab49b7eff7a2fe9a89942c2	2021-06-11 14:48:28 -07:00
Akanksha Mahajan	3897ce3125	Support for Merge in Integrated BlobDB with base values (#8292 ) Summary: This PR adds support for the Merge operation in Integrated BlobDB with base values (i.e. DB::Put). Merged values can be retrieved through DB::Get, DB::MultiGet, DB::GetMergeOperands and Iterator operations. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8292 Test Plan: Add new unit tests Reviewed By: ltamasi Differential Revision: D28415896 Pulled By: akankshamahajan15 fbshipit-source-id: e9b3478bef51d2f214fb88c31ed3c8d2f4a531ff	2021-06-10 12:58:37 -07:00
Baptiste Lemaire	d61a449364	Fixed manifest_dump issues when printing keys and values containing null characters (#8378 ) Summary: Changed fprintf function to fputc in ApplyVersionEdit, and replaced null characters with whitespaces. Added unit test in ldb_test.py - verifies that manifest_dump --verbose output is correct when keys and values containing null characters are inserted. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8378 Reviewed By: pdillinger Differential Revision: D29034584 Pulled By: bjlemaire fbshipit-source-id: 50833687a8a5f726e247c38457eadc3e6dbab862	2021-06-10 12:55:20 -07:00
Zhichao Cao	f44e69c64a	Use DbSessionId as cache key prefix when secondary cache is enabled (#8360 ) Summary: Currently, we either use the file system inode or a monotonically incrementing runtime ID as the block cache key prefix. However, if we use a monotonically incrementing runtime ID (in the case that the file system does not support inode id generation), in some cases, it cannot ensure uniqueness (e.g., we have secondary cache migrated from host to host). We use DbSessionID (20 bytes) + current file number (at most 10 bytes) as the new cache block key prefix when the secondary cache is enabled. This accommodates scenarios such as transfer of cache state across hosts. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8360 Test Plan: add the test to lru_cache_test Reviewed By: pdillinger Differential Revision: D29006215 Pulled By: zhichao-cao fbshipit-source-id: 6cff686b38d83904667a2bd39923cd030df16814	2021-06-10 11:02:43 -07:00
Levi Tamasi	db325a5904	Add a clipping internal iterator (#8327 ) Summary: Logically, subcompactions process a key range [start, end); however, the way this is currently implemented is that the `CompactionIterator` for any given subcompaction keeps processing key-values until it actually outputs a key that is out of range, which is then discarded. Instead of doing this, the patch introduces a new type of internal iterator called `ClippingIterator` which wraps another internal iterator and "clips" its range of key-values so that any KVs returned are strictly in the [start, end) interval. This does eliminate a (minor) inefficiency by stopping processing in subcompactions exactly at the limit; however, the main motivation is related to BlobDB: namely, we need this to be able to measure the amount of garbage generated by a subcompaction precisely and prevent off-by-one errors. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8327 Test Plan: `make check` Reviewed By: siying Differential Revision: D28761541 Pulled By: ltamasi fbshipit-source-id: ee0e7229f04edabbc7bed5adb51771fbdc287f69	2021-06-09 15:41:16 -07:00
Peter Dillinger	2f93a3b809	Fix a major performance bug in 6.21 for cache entry stats (#8369 ) Summary: In final polishing of https://github.com/facebook/rocksdb/issues/8297 (after most manual testing), I broke my own caching layer by sanitizing an input parameter with std::min(0, x) instead of std::max(0, x). I resisted unit testing the timing part of the result caching because historically, these tests are either flaky or difficult to write, and this was not a correctness issue. This bug is essentially unnoticeable with a small number of column families but can explode background work with a large number of column families. This change fixes the logical error, removes some unnecessary related optimization, and adds mock time/sleeps to the unit test to ensure we can cache hit within the age limit. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8369 Test Plan: added time testing logic to existing unit test Reviewed By: ajkr Differential Revision: D28950892 Pulled By: pdillinger fbshipit-source-id: e79cd4ff3eec68fd0119d994f1ed468c38026c3b	2021-06-08 05:03:32 -07:00
David Devecsery	80a59a03a7	Cancel compact range (#8351 ) Summary: Added the ability to cancel an in-progress range compaction by storing to an atomic "canceled" variable pointed to within the CompactRangeOptions structure. Tested via two tests added to db_tests2.cc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8351 Reviewed By: ajkr Differential Revision: D28808894 Pulled By: ddevec fbshipit-source-id: cb321361c9e23b084b188bb203f11c375a22c2dd	2021-06-07 11:41:31 -07:00
Andrew Kryczka	9167ece586	Snapshot release triggered compaction without multiple tombstones (#8357 ) Summary: This is a duplicate of https://github.com/facebook/rocksdb/issues/4948 by mzhaom to fix tests after rebase. This change is a follow-up to https://github.com/facebook/rocksdb/issues/4927, which made this possible by allowing tombstone dropping/seqnum zeroing optimizations on the last key in the compaction. Now the `largest_seqno != 0` condition suffices to prevent snapshot release triggered compaction from entering an infinite loop. The issues caused by the extraneous condition `level_and_file.second->num_deletions > 1` are: - files could have `largest_seqno > 0` forever making it impossible to tell they cannot contain any covering keys - it doesn't trigger compaction when there are many overwritten keys. Some MyRocks use case actually doesn't use Delete but instead calls Put with empty value to "delete" keys, so we'd like to be able to trigger compaction in this case too. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8357 Test Plan: - make check Reviewed By: jay-zhuang Differential Revision: D28855340 Pulled By: ajkr fbshipit-source-id: a261b51eecafec492499e6d01e8e43112f801798	2021-06-04 00:21:40 -07:00
PiyushDatta	2655477c67	Fix "Interval WAL" bytes to say GB instead of MB (#8350 ) Summary: Reference: https://github.com/facebook/rocksdb/issues/7201 Before fix: `/tmp/rocksdb_test_file/LOG.old.1622492586055679:Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 MB, 0.00 MB/s` After fix: `/tmp/rocksdb_test_file/LOG:Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s` Tests: ``` Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete ETA: 0s Left: 0 AVG: 0.05s local:0/7720/100%/0.0s rm -rf /dev/shm/rocksdb.CLRh /usr/bin/python3 tools/check_all_python.py No syntax errors in 34 .py files /usr/bin/python3 tools/ldb_test.py Running testCheckConsistency... .Running testColumnFamilies... .Running testCountDelimDump... .Running testCountDelimIDump... .Running testDumpLiveFiles... .Running testDumpLoad... Warning: 7 bad lines ignored. .Running testGetProperty... .Running testHexPutGet... .Running testIDumpBasics... .Running testIngestExternalSst... .Running testInvalidCmdLines... .Running testListColumnFamilies... .Running testManifestDump... .Running testMiscAdminTask... Sequence,Count,ByteSize,Physical Offset,Key(s) .Running testSSTDump... .Running testSimpleStringPutGet... .Running testStringBatchPut... .Running testTtlPutGet... .Running testWALDump... . ---------------------------------------------------------------------- Ran 19 tests in 15.945s OK sh tools/rocksdb_dump_test.sh make check-format make[1]: Entering directory '/home/piydatta/Documents/rocksdb' $DEBUG_LEVEL is 1 Makefile:176: Warning: Compiling in debug mode. Don't use the resulting binary in production build_tools/format-diff.sh -c Checking format of uncommitted changes... ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/8350 Reviewed By: jay-zhuang Differential Revision: D28790567 Pulled By: zhichao-cao fbshipit-source-id: dcb1e4c124361156435122f21f0a288335b2c8c8	2021-06-01 15:19:21 -07:00
Peter (Stig) Edwards	c75ef03e58	Do not truncate WAL if in read_only mode (#8313 ) Summary: I noticed ```openat``` system call with ```O_WRONLY``` flag and ```sync_file_range``` and ```truncate``` on WAL file when using ```rocksdb::DB::OpenForReadOnly``` by way of ```db_bench --readonly=true --benchmarks=readseq --use_existing_db=1 --num=1 ...``` Noticed in ```strace``` after seeing the last modification time of the WAL file change after each run (with ```--readonly=true```). I think introduced by `7d7f14480e` from https://github.com/facebook/rocksdb/pull/8122 I added a test to catch the WAL file being truncated and the modification time on it changing. I am not sure if a mock filesystem with mock clock could be used to avoid having to sleep 1.1s. The test could also check the set of files is the same and that the sizes are also unchanged. Before: ``` [ RUN ] DBBasicTest.ReadOnlyReopenMtimeUnchanged db/db_basic_test.cc:182: Failure Expected equality of these values: file_mtime_after_readonly_reopen Which is: 1621611136 file_mtime_before_readonly_reopen Which is: 1621611135 file is: 000010.log [ FAILED ] DBBasicTest.ReadOnlyReopenMtimeUnchanged (1108 ms) ``` After: ``` [ RUN ] DBBasicTest.ReadOnlyReopenMtimeUnchanged [ OK ] DBBasicTest.ReadOnlyReopenMtimeUnchanged (1108 ms) ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/8313 Reviewed By: pdillinger Differential Revision: D28656925 Pulled By: jay-zhuang fbshipit-source-id: ea9e215cb53e7c830e76bc5fc75c45e21f12a1d6	2021-05-27 10:27:55 -07:00
sdong	a607b88240	SequenceIterWrapper should use internal comparator (#8328 ) Summary: https://github.com/facebook/rocksdb/pull/8288 introduces a bug: SequenceIterWrapper should do next for seek key using internal key comparator rather than user comparator. Fix it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8328 Test Plan: Pass all existing tests Reviewed By: ltamasi Differential Revision: D28647263 fbshipit-source-id: 4081d684fd8a86d248c485ef8a1563c7af136447	2021-05-24 12:46:38 -07:00
Jay Zhuang	55853de661	Fix clang-analyze: use uninitiated variable (#8325 ) Summary: Error: ``` db/db_compaction_test.cc:5211:47: warning: The left operand of '*' is a garbage value uint64_t total = (l1_avg_size + l2_avg_size * 10) * 10; ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/8325 Test Plan: `$ make analyze` Reviewed By: pdillinger Differential Revision: D28620916 Pulled By: jay-zhuang fbshipit-source-id: f6d58ab84eefbcc905cda45afb9522b0c6d230f8	2021-05-21 19:06:47 -07:00
Zhichao Cao	7303d02bdf	Use new Insert and Lookup APIs in table reader to support secondary cache (#8315 ) Summary: Secondary cache is implemented to achieve the secondary cache tier for block cache. New Insert and Lookup APIs are introduced in https://github.com/facebook/rocksdb/issues/8271 . To support and use the secondary cache in block based table reader, this PR introduces the corresponding callback functions that will be used in secondary cache, and update the Insert and Lookup APIs accordingly. benchmarking: ./db_bench --benchmarks="fillrandom" -num=1000000 -key_size=32 -value_size=256 -use_direct_io_for_flush_and_compaction=true -db=/tmp/rocks_t/db -partition_index_and_filters=true ./db_bench -db=/tmp/rocks_t/db -use_existing_db=true -benchmarks=readrandom -num=1000000 -key_size=32 -value_size=256 -use_direct_reads=true -cache_size=1073741824 -cache_numshardbits=5 -cache_index_and_filter_blocks=true -read_random_exp_range=17 -statistics -partition_index_and_filters=true -stats_dump_period_sec=30 -reads=50000000 master benchmarking results: readrandom : 3.923 micros/op 254881 ops/sec; 33.4 MB/s (23849796 of 50000000 found) rocksdb.db.get.micros P50 : 2.820992 P95 : 5.636716 P99 : 16.450553 P100 : 8396.000000 COUNT : 50000000 SUM : 179947064 Current PR benchmarking results readrandom : 4.083 micros/op 244925 ops/sec; 32.1 MB/s (23849796 of 50000000 found) rocksdb.db.get.micros P50 : 2.967687 P95 : 5.754916 P99 : 15.665912 P100 : 8213.000000 COUNT : 50000000 SUM : 187250053 About 3.8% throughput reduction. P50: 5.2% increasing, P95, 2.09% increasing, P99 4.77% improvement Pull Request resolved: https://github.com/facebook/rocksdb/pull/8315 Test Plan: added the testing case Reviewed By: anand1976 Differential Revision: D28599774 Pulled By: zhichao-cao fbshipit-source-id: 098c4df0d7327d3a546df7604b2f1602f13044ed	2021-05-21 18:29:12 -07:00
Peter Dillinger	3469d60fcc	Add table properties for number of entries added to filters (#8323 ) Summary: With Ribbon filter work and possible variance in actual bits per key (or prefix; general term "entry") to achieve certain FP rates, I've received a request to be able to track actual bits per key in generated filters. This change adds a num_filter_entries table property, which can be combined with filter_size to get bits per key (entry). This can vary from num_entries in at least these ways: * Different versions of same key are only counted once in filters. * With prefix filters, several user keys map to the same filter entry. * A single filter can include both prefixes and user keys. Note that FilterBlockBuilder::NumAdded() didn't do anything useful except distinguish empty from non-empty. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8323 Test Plan: basic unit test included, others updated Reviewed By: jay-zhuang Differential Revision: D28596210 Pulled By: pdillinger fbshipit-source-id: 529a111f3c84501e5a470bc84705e436ee68c376	2021-05-21 17:11:32 -07:00
Jay Zhuang	6c86543590	Fix manual compaction `max_compaction_bytes` under-calculated issue (#8269 ) Summary: Fix a bug where, for manual compaction, `max_compaction_bytes` only limits the SST files from the input level, but not the overlapping files on the output level. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8269 Test Plan: `make check` Reviewed By: ajkr Differential Revision: D28231044 Pulled By: jay-zhuang fbshipit-source-id: 9d7d03004f30cc4b1b9819830141436907554b7c	2021-05-21 14:03:44 -07:00
sdong	2f1984dd45	Compare memtable insert and flush count (#8288 ) Summary: When a memtable is flushed, it will validate the number of entries it reads, and compare that number with how many entries were inserted into the memtable. This serves as one sanity check against memory corruption. This change will also allow more counters to be added in the future for better validation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8288 Test Plan: Pass all existing tests Reviewed By: ajkr Differential Revision: D28369194 fbshipit-source-id: 7ff870380c41eab7f99eee508550dcdce32838ad	2021-05-20 16:07:28 -07:00
Jay Zhuang	94b4faa0f1	Deflake ExternalSSTFileTest.PickedLevelBug (#8307 ) Summary: The test wants to make sure there's no compaction during `AddFile` (between `DBImpl::AddFile:MutexLock` and `DBImpl::AddFile:MutexUnlock`), but the mutex could be unlocked by `EnterUnbatched()`. Move the lock start point after bumping the ingest file number. Also fix the deadlock when ASSERT fails. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8307 Reviewed By: ajkr Differential Revision: D28479849 Pulled By: jay-zhuang fbshipit-source-id: b3c50f66aa5d5f59c5c27f815bfea189c4cd06cb	2021-05-20 09:29:57 -07:00
Jay Zhuang	3786181a90	Add remote compaction public API (#8300 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8300 Reviewed By: ajkr Differential Revision: D28464726 Pulled By: jay-zhuang fbshipit-source-id: 49e9f4fb791808a6cbf39a7b1a331373f645fc5e	2021-05-19 21:41:31 -07:00
Peter Dillinger	311a544c2a	Use deleters to label cache entries and collect stats (#8297 ) Summary: This change gathers and publishes statistics about the kinds of items in block cache. This is especially important for profiling relative usage of cache by index vs. filter vs. data blocks. It works by iterating over the cache during periodic stats dump (InternalStats, stats_dump_period_sec) or on demand when DB::Get(Map)Property(kBlockCacheEntryStats), except that for efficiency and sharing among column families, saved data from the last scan is used when the data is not considered too old. The new information can be seen in info LOG, for example: Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0 Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%) And also through DB::GetProperty and GetMapProperty (here using ldb just for demonstration): $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats rocksdb.block-cache-entry-stats.bytes.data-block: 0 rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-block: 0 rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0 rocksdb.block-cache-entry-stats.bytes.index-block: 178992 rocksdb.block-cache-entry-stats.bytes.misc: 0 rocksdb.block-cache-entry-stats.bytes.other-block: 0 rocksdb.block-cache-entry-stats.bytes.write-buffer: 0 rocksdb.block-cache-entry-stats.capacity: 8388608 rocksdb.block-cache-entry-stats.count.data-block: 0 rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-block: 0 rocksdb.block-cache-entry-stats.count.filter-meta-block: 0 rocksdb.block-cache-entry-stats.count.index-block: 215 
rocksdb.block-cache-entry-stats.count.misc: 1 rocksdb.block-cache-entry-stats.count.other-block: 0 rocksdb.block-cache-entry-stats.count.write-buffer: 0 rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290 rocksdb.block-cache-entry-stats.percent.data-block: 0.000000 rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000 rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000 rocksdb.block-cache-entry-stats.percent.index-block: 2.133751 rocksdb.block-cache-entry-stats.percent.misc: 0.000000 rocksdb.block-cache-entry-stats.percent.other-block: 0.000000 rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000 rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052 rocksdb.block-cache-entry-stats.secs_since_last_collection: 0 Solution detail - We need some way to flag what kind of blocks each entry belongs to, preferably without changing the Cache API. One of the complications is that Cache is a general interface that could have other users that don't adhere to whichever convention we decide on for keys and values. Or we would pay for an extra field in the Handle that would only be used for this purpose. This change uses a back-door approach, the deleter, to indicate the "role" of a Cache entry (in addition to the value type, implicitly). This has the added benefit of ensuring proper code origin whenever we recognize a particular role for a cache entry; if the entry came from some other part of the code, it will use an unrecognized deleter, which we simply attribute to the "Misc" role. An internal API makes for simple instantiation and automatic registration of Cache deleters for a given value type and "role". Another internal API, CacheEntryStatsCollector, solves the problem of caching the results of a scan and sharing them, to ensure scans are neither excessive nor redundant so as not to harm Cache performance. 
Because code is added to BlocklikeTraits, it is pulled out of block_based_table_reader.cc into its own file. This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option (could still be added), and with actual stat gathering. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297 Test Plan: manual testing with db_bench, and a couple of basic unit tests Reviewed By: ltamasi Differential Revision: D28488721 Pulled By: pdillinger fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb	2021-05-19 16:51:13 -07:00
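The "deleter as role tag" idea from #8297 can be sketched in isolation. This is a toy model under stated assumptions, not RocksDB's internal API: the `Deleter` signature, the `Role` enum, the block structs, and `RoleOf` are all hypothetical names chosen for illustration.

```cpp
#include <cstddef>
#include <string>

// Illustrative only: none of these names are RocksDB APIs.
using Deleter = void (*)(void* value);

enum class Role { kDataBlock, kFilterBlock, kMisc };

struct DataBlock { std::string bytes; };
struct FilterBlock { std::string bits; };

// One deleter function per (value type, role) pair. The function's address
// doubles as a tag that identifies the role of the cache entry.
void DeleteDataBlock(void* v) { delete static_cast<DataBlock*>(v); }
void DeleteFilterBlock(void* v) { delete static_cast<FilterBlock*>(v); }
// A deleter from "some other part of the code" that was never registered.
void DeleteUnregistered(void* v) { delete static_cast<std::string*>(v); }

struct Registration {
  Deleter deleter;
  Role role;
};

// Maps a deleter back to its role; unrecognized deleters fall into kMisc.
Role RoleOf(Deleter d) {
  static const Registration kRegistry[] = {
      {&DeleteDataBlock, Role::kDataBlock},
      {&DeleteFilterBlock, Role::kFilterBlock},
  };
  for (const Registration& r : kRegistry) {
    if (r.deleter == d) {
      return r.role;
    }
  }
  return Role::kMisc;
}
```

A stats scan in this model would call `RoleOf` on each entry's deleter and accumulate count and charge per role, which is why no extra field is needed in the cache handle.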
anand76	9d61a0856d	Sync ingested files only if reopen is supported by the FS (#8296 ) Summary: Some file systems (especially distributed FS) do not support reopening a file for writing. The ExternalSstFileIngestionJob calls ReopenWritableFile in order to sync the ingested file, which typically makes sense only on a local file system with a page cache (i.e., Posix). So this change tries to sync the ingested file only if ReopenWritableFile doesn't return Status::NotSupported(). Tests: Add a new unit test in external_sst_file_basic_test Pull Request resolved: https://github.com/facebook/rocksdb/pull/8296 Reviewed By: jay-zhuang Differential Revision: D28420865 Pulled By: anand1976 fbshipit-source-id: 380e7f5ff95324997f7a59864a9ac96ebbd0100c	2021-05-18 19:33:55 -07:00
Stanislav Tkach	83d1a66598	Expose CompressionOptions::parallel_threads through C API (#8302 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8302 Reviewed By: jay-zhuang Differential Revision: D28499262 Pulled By: ajkr fbshipit-source-id: 7b17b79af871d874dfca76db9bca0d640a6cd854	2021-05-17 22:53:04 -07:00
Levi Tamasi	d83542ca83	Make it possible to apply only a subrange of table property collectors (#8298 ) Summary: This patch does two things: 1) Introduces some aliases in order to eliminate/prevent long-winded type names w/r/t the internal table property collectors (see e.g. `std::vector<std::unique_ptr<IntTblPropCollectorFactory>>`). 2) Makes it possible to apply only a subrange of table property collectors during table building by turning `TableBuilderOptions::int_tbl_prop_collector_factories` from a pointer to a `vector` into a range (i.e. a pair of iterators). Rationale: I plan to introduce a BlobDB related table property collector, which should only be applied during table creation if blob storage is enabled at the moment (which can be changed dynamically). This change will make it possible to include/exclude the BlobDB related collector as needed without having to introduce a second `vector` of collectors in `ColumnFamilyData` with pretty much the same contents. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8298 Test Plan: `make check` Reviewed By: jay-zhuang Differential Revision: D28430910 Pulled By: ltamasi fbshipit-source-id: a81d28f2c59495865300f43deb2257d2e6977c8e	2021-05-17 18:28:39 -07:00
sdong	0ed8cb666d	Write file temperature information to manifest (#8284 ) Summary: As a part of tiered storage, writing temperature information to the manifest is needed so that after DB recovery, RocksDB still has the tiering information, to implement some further necessary functionalities. Also fix some issues in simulated hybrid FS. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8284 Test Plan: Add a new unit test to validate that the information is indeed written and read back. Reviewed By: zhichao-cao Differential Revision: D28335801 fbshipit-source-id: 56aeb2e6ea090be0200181dd968c8a7278037def	2021-05-17 15:15:23 -07:00
anand76	feb06e83b2	Initial support for secondary cache in LRUCache (#8271 ) Summary: Defined the abstract interface for a secondary cache in include/rocksdb/secondary_cache.h, and updated LRUCacheOptions to take a std::shared_ptr<SecondaryCache>. An item is initially inserted into the LRU (primary) cache. When it ages out and is evicted from memory, it is inserted into the secondary cache. On an LRU cache miss and successful lookup in the secondary cache, the item is promoted to the LRU cache. Only synchronous lookup is supported currently. The secondary cache would be used to implement a persistent (flash cache) or compressed cache. Tests: Results from cache_bench and db_bench don't show any regression due to these changes. cache_bench results before and after this change - Command ```./cache_bench -ops_per_thread=10000000 -threads=1``` Before ```Complete in 40.688 s; QPS = 245774``` ```Complete in 40.486 s; QPS = 246996``` ```Complete in 42.019 s; QPS = 237989``` After ```Complete in 40.672 s; QPS = 245869``` ```Complete in 44.622 s; QPS = 224107``` ```Complete in 42.445 s; QPS = 235599``` db_bench results before this change, and with this change + https://github.com/facebook/rocksdb/issues/8213 and https://github.com/facebook/rocksdb/issues/8191 - Commands ```./db_bench --benchmarks="fillseq,compact" -num=30000000 -key_size=32 -value_size=256 -use_direct_io_for_flush_and_compaction=true -db=/home/anand76/nvm_cache/db -partition_index_and_filters=true``` ```./db_bench -db=/home/anand76/nvm_cache/db -use_existing_db=true -benchmarks=readrandom -num=30000000 -key_size=32 -value_size=256 -use_direct_reads=true -cache_size=1073741824 -cache_numshardbits=6 -cache_index_and_filter_blocks=true -read_random_exp_range=17 -statistics -partition_index_and_filters=true -threads=16 -duration=300``` Before ``` DB path: [/home/anand76/nvm_cache/db] readrandom : 80.702 micros/op 198104 ops/sec; 54.4 MB/s (3708999 of 3708999 found) ``` ``` DB path: [/home/anand76/nvm_cache/db] readrandom : 
87.124 micros/op 183625 ops/sec; 50.4 MB/s (3439999 of 3439999 found) ``` After ``` DB path: [/home/anand76/nvm_cache/db] readrandom : 77.653 micros/op 206025 ops/sec; 56.6 MB/s (3866999 of 3866999 found) ``` ``` DB path: [/home/anand76/nvm_cache/db] readrandom : 84.962 micros/op 188299 ops/sec; 51.7 MB/s (3535999 of 3535999 found) ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/8271 Reviewed By: zhichao-cao Differential Revision: D28357511 Pulled By: anand1976 fbshipit-source-id: d1cfa236f00e649a18c53328be10a8062a4b6da2	2021-05-13 22:58:40 -07:00
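The insert, evict, and promote flow described in #8271 can be modeled with a toy two-tier cache. This is a sketch under simplifying assumptions, not RocksDB's `LRUCache` or `SecondaryCache` interface: the class name `TwoTierCache`, its methods, the entry-count capacity, and the choice to erase an item from the secondary tier on promotion are all hypothetical.

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Toy model only: capacity is counted in entries, values are strings, and the
// "secondary cache" is an unbounded in-memory map.
class TwoTierCache {
 public:
  explicit TwoTierCache(size_t primary_capacity)
      : capacity_(primary_capacity) {}

  // New items always enter the primary (LRU) tier first.
  void Insert(const std::string& key, const std::string& value) {
    auto it = index_.find(key);
    if (it != index_.end()) {
      lru_.erase(it->second);
      index_.erase(it);
    }
    lru_.emplace_front(key, value);
    index_[key] = lru_.begin();
    while (lru_.size() > capacity_) {
      // The least recently used item ages out into the secondary tier.
      secondary_[lru_.back().first] = lru_.back().second;
      index_.erase(lru_.back().first);
      lru_.pop_back();
    }
  }

  // Synchronous lookup: primary first, then secondary with promotion.
  bool Lookup(const std::string& key, std::string* value) {
    auto it = index_.find(key);
    if (it != index_.end()) {
      lru_.splice(lru_.begin(), lru_, it->second);  // mark as most recent
      *value = it->second->second;
      return true;
    }
    auto sit = secondary_.find(key);
    if (sit == secondary_.end()) {
      return false;
    }
    *value = sit->second;
    secondary_.erase(sit);  // simplification: promotion moves the item
    Insert(key, *value);
    return true;
  }

  bool InPrimary(const std::string& key) const {
    return index_.count(key) > 0;
  }
  bool InSecondary(const std::string& key) const {
    return secondary_.count(key) > 0;
  }

 private:
  using Entry = std::pair<std::string, std::string>;
  size_t capacity_;
  std::list<Entry> lru_;  // front = most recently used
  std::unordered_map<std::string, std::list<Entry>::iterator> index_;
  std::unordered_map<std::string, std::string> secondary_;
};
```

With a primary capacity of one entry, inserting `a` and then `b` demotes `a` to the secondary tier; a later lookup of `a` promotes it back and demotes `b` in turn.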
Jay Zhuang	d15fbae449	Refactor Option obj address from char* to void* (#8295 ) Summary: And replace `reinterpret_cast` with `static_cast` or no cast. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8295 Test Plan: `make check` Reviewed By: mrambacher Differential Revision: D28420303 Pulled By: jay-zhuang fbshipit-source-id: 645be123a0df624dc2bea37cd54a35403fc494fa	2021-05-13 14:29:42 -07:00
Jay Zhuang	a79b46c503	Add De/Serialization for CompactionInput/Result (#8247 ) Summary: The functions will be used for remote compaction parameter input and result. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8247 Test Plan: `make check` Reviewed By: ajkr Differential Revision: D28104680 Pulled By: jay-zhuang fbshipit-source-id: c0a5178e6277125118384278efea2acbf90aa6cb	2021-05-12 12:36:43 -07:00
Peter Dillinger	78a309bf86	New Cache API for gathering statistics (#8225 ) Summary: Adds a new Cache::ApplyToAllEntries API that we expect to use (in follow-up PRs) for efficiently gathering block cache statistics. Notable features vs. old ApplyToAllCacheEntries: * Includes key and deleter (in addition to value and charge). We could have passed in a Handle but then more virtual function calls would be needed to get the "fields" of each entry. We expect to use the 'deleter' to identify the origin of entries, perhaps even more. * Heavily tuned to minimize latency impact on operating cache. It does this by iterating over small sections of each cache shard while cycling through the shards. * Supports tuning roughly how many entries to operate on for each lock acquire and release, to control the impact on the latency of other operations without excessive lock acquire & release. The right balance can depend on the cost of the callback. Good default seems to be around 256. * There should be no need to disable thread safety. (I would expect uncontended locks to be sufficiently fast.) I have enhanced cache_bench to validate this approach: * Reports a histogram of ns per operation, so we can look at the distribution of times, not just throughput (average). * Can add a thread for simulated "gather stats" which calls ApplyToAllEntries at a specified interval. We also generate a histogram of time to run ApplyToAllEntries. To make the iteration over some entries of each shard work as cleanly as possible, even with resize between next set of entries, I have re-arranged which hash bits are used for sharding and which for indexing within a shard. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8225 Test Plan: A couple of unit tests are added, but primary validation is manual, as the primary risk is to performance. 
The primary validation is using cache_bench to ensure that neither the minor hashing changes nor the simulated stats gathering significantly impact QPS or latency distribution. Note that adding op latency histogram seriously impacts the benchmark QPS, so for a fair baseline, we need the cache_bench changes (except remove simulated stat gathering to make it compile). In short, we don't see any reproducible difference in ops/sec or op latency unless we are gathering stats nearly continuously. Test uses 10GB block cache with 8KB values to be somewhat realistic in the number of items to iterate over. Baseline typical output: ``` Complete in 92.017 s; Rough parallel ops/sec = 869401 Thread ops/sec = 54662 Operation latency (ns): Count: 80000000 Average: 11223.9494 StdDev: 29.61 Min: 0 Median: 7759.3973 Max: 9620500 Percentiles: P50: 7759.40 P75: 14190.73 P99: 46922.75 P99.9: 77509.84 P99.99: 217030.58 ------------------------------------------------------ [ 0, 1 ] 68 0.000% 0.000% ( 2900, 4400 ] 89 0.000% 0.000% ( 4400, 6600 ] 33630240 42.038% 42.038% ######## ( 6600, 9900 ] 18129842 22.662% 64.700% ##### ( 9900, 14000 ] 7877533 9.847% 74.547% ## ( 14000, 22000 ] 15193238 18.992% 93.539% #### ( 22000, 33000 ] 3037061 3.796% 97.335% # ( 33000, 50000 ] 1626316 2.033% 99.368% ( 50000, 75000 ] 421532 0.527% 99.895% ( 75000, 110000 ] 56910 0.071% 99.966% ( 110000, 170000 ] 16134 0.020% 99.986% ( 170000, 250000 ] 5166 0.006% 99.993% ( 250000, 380000 ] 3017 0.004% 99.996% ( 380000, 570000 ] 1337 0.002% 99.998% ( 570000, 860000 ] 805 0.001% 99.999% ( 860000, 1200000 ] 319 0.000% 100.000% ( 1200000, 1900000 ] 231 0.000% 100.000% ( 1900000, 2900000 ] 100 0.000% 100.000% ( 2900000, 4300000 ] 39 0.000% 100.000% ( 4300000, 6500000 ] 16 0.000% 100.000% ( 6500000, 9800000 ] 7 0.000% 100.000% ``` New, gather_stats=false. 
Median thread ops/sec of 5 runs: ``` Complete in 92.030 s; Rough parallel ops/sec = 869285 Thread ops/sec = 54458 Operation latency (ns): Count: 80000000 Average: 11298.1027 StdDev: 42.18 Min: 0 Median: 7722.0822 Max: 6398720 Percentiles: P50: 7722.08 P75: 14294.68 P99: 47522.95 P99.9: 85292.16 P99.99: 228077.78 ------------------------------------------------------ [ 0, 1 ] 109 0.000% 0.000% ( 2900, 4400 ] 793 0.001% 0.001% ( 4400, 6600 ] 34054563 42.568% 42.569% ######### ( 6600, 9900 ] 17482646 21.853% 64.423% #### ( 9900, 14000 ] 7908180 9.885% 74.308% ## ( 14000, 22000 ] 15032072 18.790% 93.098% #### ( 22000, 33000 ] 3237834 4.047% 97.145% # ( 33000, 50000 ] 1736882 2.171% 99.316% ( 50000, 75000 ] 446851 0.559% 99.875% ( 75000, 110000 ] 68251 0.085% 99.960% ( 110000, 170000 ] 18592 0.023% 99.983% ( 170000, 250000 ] 7200 0.009% 99.992% ( 250000, 380000 ] 3334 0.004% 99.997% ( 380000, 570000 ] 1393 0.002% 99.998% ( 570000, 860000 ] 700 0.001% 99.999% ( 860000, 1200000 ] 293 0.000% 100.000% ( 1200000, 1900000 ] 196 0.000% 100.000% ( 1900000, 2900000 ] 69 0.000% 100.000% ( 2900000, 4300000 ] 32 0.000% 100.000% ( 4300000, 6500000 ] 10 0.000% 100.000% ``` New, gather_stats=true, 1 second delay between scans. Scans take about 1 second here so it's spending about 50% time scanning. Still the effect on ops/sec and latency seems to be in the noise. 
Median thread ops/sec of 5 runs: ``` Complete in 91.890 s; Rough parallel ops/sec = 870608 Thread ops/sec = 54551 Operation latency (ns): Count: 80000000 Average: 11311.2629 StdDev: 45.28 Min: 0 Median: 7686.5458 Max: 10018340 Percentiles: P50: 7686.55 P75: 14481.95 P99: 47232.60 P99.9: 79230.18 P99.99: 232998.86 ------------------------------------------------------ [ 0, 1 ] 71 0.000% 0.000% ( 2900, 4400 ] 291 0.000% 0.000% ( 4400, 6600 ] 34492060 43.115% 43.116% ######### ( 6600, 9900 ] 16727328 20.909% 64.025% #### ( 9900, 14000 ] 7845828 9.807% 73.832% ## ( 14000, 22000 ] 15510654 19.388% 93.220% #### ( 22000, 33000 ] 3216533 4.021% 97.241% # ( 33000, 50000 ] 1680859 2.101% 99.342% ( 50000, 75000 ] 439059 0.549% 99.891% ( 75000, 110000 ] 60540 0.076% 99.967% ( 110000, 170000 ] 14649 0.018% 99.985% ( 170000, 250000 ] 5242 0.007% 99.991% ( 250000, 380000 ] 3260 0.004% 99.995% ( 380000, 570000 ] 1599 0.002% 99.997% ( 570000, 860000 ] 1043 0.001% 99.999% ( 860000, 1200000 ] 471 0.001% 99.999% ( 1200000, 1900000 ] 275 0.000% 100.000% ( 1900000, 2900000 ] 143 0.000% 100.000% ( 2900000, 4300000 ] 60 0.000% 100.000% ( 4300000, 6500000 ] 27 0.000% 100.000% ( 6500000, 9800000 ] 7 0.000% 100.000% ( 9800000, 14000000 ] 1 0.000% 100.000% Gather stats latency (us): Count: 46 Average: 980387.5870 StdDev: 60911.18 Min: 879155 Median: 1033777.7778 Max: 1261431 Percentiles: P50: 1033777.78 P75: 1120666.67 P99: 1261431.00 P99.9: 1261431.00 P99.99: 1261431.00 ------------------------------------------------------ ( 860000, 1200000 ] 45 97.826% 97.826% #################### ( 1200000, 1900000 ] 1 2.174% 100.000% Most recent cache entry stats: Number of entries: 1295133 Total charge: 9.88 GB Average key size: 23.4982 Average charge: 8.00 KB Unique deleters: 3 ``` Reviewed By: mrambacher Differential Revision: D28295742 Pulled By: pdillinger fbshipit-source-id: bbc4a552f91ba0fe10e5cc025c42cef5a81f2b95	2021-05-11 16:17:10 -07:00
mrambacher	9f2d255aed	Add ObjectRegistry to ConfigOptions (#8166 ) Summary: This change enables a couple of things: - Different ConfigOptions can have different registry/factory associated with it, thereby allowing things like a "Test" ConfigOptions versus a "Production" - The ObjectRegistry is created fewer times and can be re-used The ConfigOptions can also be initialized/constructed from a DBOptions, in which case it will grab some of its settings (Env, Logger) from the DBOptions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8166 Reviewed By: zhichao-cao Differential Revision: D27657952 Pulled By: mrambacher fbshipit-source-id: ae1d6200bb7ab127405cdeefaba43c7fe694dfdd	2021-05-11 06:47:22 -07:00
mrambacher	ff463742b5	Add Merge Operator support to WriteBatchWithIndex (#8135 ) Summary: The WBWI has two differing modes of operation dependent on the value of the constructor parameter `overwrite_key`. Currently, regardless of the parameter, neither mode performs as expected when using Merge. This PR remedies this by correctly invoking the appropriate Merge Operator before returning results from the WBWI. Examples of issues that exist which are solved by this PR: ## Example 1 with `overwrite_key=false` Currently, from an empty database, the following sequence: ``` Put('k1', 'v1') Merge('k1', 'v2') Get('k1') ``` Incorrectly yields `v2`, that is to say that the Merge behaves like a Put. ## Example 2 with `overwrite_key=true` Currently, from an empty database, the following sequence: ``` Put('k1', 'v1') Merge('k1', 'v2') Get('k1') ``` Incorrectly yields `ERROR: kMergeInProgress`. ## Example 3 with `overwrite_key=false` Currently, with a database containing `('k1' -> 'v1')`, the following sequence: ``` Merge('k1', 'v2') GetFromBatchAndDB('k1') ``` Incorrectly yields `v1,v2` ## Example 4 with `overwrite_key=true` Currently, with a database containing `('k1' -> 'v1')`, the following sequence: ``` Merge('k1', 'v1') GetFromBatchAndDB('k1') ``` Incorrectly yields `ERROR: kMergeInProgress`. ## Example 5 with `overwrite_key=false` Currently, from an empty database, the following sequence: ``` Put('k1', 'v1') Merge('k1', 'v2') GetFromBatchAndDB('k1') ``` Incorrectly yields `v1,v2` ## Example 6 with `overwrite_key=true` Currently, from an empty database, `('k1' -> 'v1')`, the following sequence: ``` Put('k1', 'v1') Merge('k1', 'v2') GetFromBatchAndDB('k1') ``` Incorrectly yields `ERROR: kMergeInProgress`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8135 Reviewed By: pdillinger Differential Revision: D27657938 Pulled By: mrambacher fbshipit-source-id: 0fbda6bbc66bedeba96a84786d90141d776297df	2021-05-10 12:50:25 -07:00
Andrew Kryczka	a639c02f8e	Allow applying `CompactionFilter` outside of compaction (#8243 ) Summary: From HISTORY.md release note: - Allow `CompactionFilter`s to apply in more table file creation scenarios such as flush and recovery. For compatibility, `CompactionFilter`s by default apply during compaction. Users can customize this behavior by overriding `CompactionFilterFactory::ShouldFilterTableFileCreation()`. - Removed unused structure `CompactionFilterContext` Pull Request resolved: https://github.com/facebook/rocksdb/pull/8243 Test Plan: added unit tests Reviewed By: pdillinger Differential Revision: D28088089 Pulled By: ajkr fbshipit-source-id: 0799be7908e3b39fea09fc3f1ab00e13ad817fae	2021-05-07 16:01:40 -07:00
sdong	a4919d6b62	Cap automatic arena block size to 1 MB (#7907 ) Summary: Larger arena block size does provide the benefit of reducing allocation overhead, however it may cause other troubles. For example, allocator is more likely not to allocate them to physical memory and trigger page fault. Weighing the risk, we cap the arena block size to 1MB. Users can always use a larger value if they want. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7907 Test Plan: Run all existing tests Reviewed By: pdillinger Differential Revision: D26135269 fbshipit-source-id: b7f55afd03e6ee1d8715f90fa11b6c33944e9ea8	2021-05-07 13:15:34 -07:00
sdong	e19908cba6	Refactor kill point (#8241 ) Summary: Refactor kill point into one single class, rather than several extern variables. The intention was to drop unflushed data before killing to simulate some job, and I tried to pass a pointer to the fault injection fs to the killing class, but it ended up being harder than I thought. Perhaps we'll need to do this in another way. But I thought the refactoring itself is good, so I am sending it out. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8241 Test Plan: make release and run crash test for a while. Reviewed By: anand1976 Differential Revision: D28078486 fbshipit-source-id: f9182c1455f52e6851c13f88a21bade63bcec45f	2021-05-05 15:50:29 -07:00
mrambacher	8948dc8524	Make ImmutableOptions struct that inherits from ImmutableCFOptions and ImmutableDBOptions (#8262 ) Summary: The ImmutableCFOptions contained a bunch of fields that belonged to the ImmutableDBOptions. This change cleans that up by introducing an ImmutableOptions struct. Following the pattern of Options struct, this class inherits from the DB and CFOption structs (of the Immutable form). Only one structural change (the ImmutableCFOptions::fs was changed to a shared_ptr from a raw one) is in this PR. All of the other changes involve moving the member variables from the ImmutableCFOptions into the ImmutableOptions and changing member variables or function parameters as required for compilation purposes. Follow-on PRs may do a further clean-up of the code, such as renaming variables (such as "ImmutableOptions cf_options") and potentially eliminating un-needed function parameters (there is no longer a need to pass both an ImmutableDBOptions and an ImmutableOptions to a function). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8262 Reviewed By: pdillinger Differential Revision: D28226540 Pulled By: mrambacher fbshipit-source-id: 18ae71eadc879dedbe38b1eb8e6f9ff5c7147dbf	2021-05-05 14:00:17 -07:00
Andrew Kryczka	0f42e50fec	Fix `GetLiveFiles()` returning OPTIONS-000000 (#8268 ) Summary: See release note in HISTORY.md. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8268 Test Plan: unit test repro Reviewed By: siying Differential Revision: D28227901 Pulled By: ajkr fbshipit-source-id: faf61d13b9e43a761e3d5dcf8203923126b51339	2021-05-05 12:54:46 -07:00
Andrew Kryczka	c70bae1b05	Fix ConcurrentTaskLimiter token release for shutdown (#8253 ) Summary: Previously the shutdown process did not properly wait for all `compaction_thread_limiter` tokens to be released before proceeding to delete the DB's C++ objects. When this happened, we saw tests like "DBCompactionTest.CompactionLimiter" flake with the following error: ``` virtual rocksdb::ConcurrentTaskLimiterImpl::~ConcurrentTaskLimiterImpl(): Assertion `outstanding_tasks_ == 0' failed. ``` There is a case where a token can still be alive even after the shutdown process has waited for BG work to complete. In particular, this happens because the shutdown process only waits for flush/compaction scheduled/unscheduled counters to all reach zero. These counters are decremented in `BackgroundCallCompaction()` functions. However, tokens are released in `BGWorkCompaction()` functions, which actually wrap the `BackgroundCallCompaction()` function. A simple sleep could repro the race condition: ``` $ diff --git a/db/db_impl/db_impl_compaction_flush.cc b/db/db_impl/db_impl_compaction_flush.cc index 806bc548a..ba59efa89 100644 --- a/db/db_impl/db_impl_compaction_flush.cc +++ b/db/db_impl/db_impl_compaction_flush.cc @@ -2442,6 +2442,7 @@ void DBImpl::BGWorkCompaction(void* arg) { static_cast<PrepickedCompaction*>(ca.prepicked_compaction); static_cast_with_check<DBImpl>(ca.db)->BackgroundCallCompaction( prepicked_compaction, Env::Priority::LOW); + sleep(1); delete prepicked_compaction; } $ ./db_compaction_test --gtest_filter=DBCompactionTest.CompactionLimiter db_compaction_test: util/concurrent_task_limiter_impl.cc:24: virtual rocksdb::ConcurrentTaskLimiterImpl::~ConcurrentTaskLimiterImpl(): Assertion `outstanding_tasks_ == 0' failed. Received signal 6 (Aborted) #0 /usr/local/fbcode/platform007/lib/libc.so.6(gsignal+0xcf) [0x7f02673c30ff] ?? ??:0 #1 /usr/local/fbcode/platform007/lib/libc.so.6(abort+0x134) [0x7f02673ac934] ?? ??:0 ... 
``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/8253 Test Plan: sleeps to expose race conditions Reviewed By: akankshamahajan15 Differential Revision: D28168064 Pulled By: ajkr fbshipit-source-id: 9e5167c74398d323e7975980c5cc00f450631160	2021-05-04 17:27:24 -07:00
Andrew Kryczka	c2a3424de5	Deflake DBTest.L0L1L2AndUpHitCounter (#8259 ) Summary: Previously we saw flakes on platforms like arm on CircleCI, such as the following: ``` Note: Google Test filter = DBTest.L0L1L2AndUpHitCounter [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from DBTest [ RUN ] DBTest.L0L1L2AndUpHitCounter db/db_test.cc:5345: Failure Expected: (TestGetTickerCount(options, GET_HIT_L0)) > (100), actual: 30 vs 100 [ FAILED ] DBTest.L0L1L2AndUpHitCounter (150 ms) [----------] 1 test from DBTest (150 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test case ran. (150 ms total) [ PASSED ] 0 tests. [ FAILED ] 1 test, listed below: [ FAILED ] DBTest.L0L1L2AndUpHitCounter ``` The test was totally non-deterministic, e.g., flush/compaction timing would affect how many files on each level. Furthermore, it depended heavily on platform-specific details, e.g., by having a 32KB memtable, it could become full with a very different number of entries depending on the platform. This PR rewrites the test to build a deterministic LSM with one file per level. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8259 Reviewed By: mrambacher Differential Revision: D28178100 Pulled By: ajkr fbshipit-source-id: 0a03b26e8d23c29d8297c1bccb1b115dce33bdcd	2021-05-04 11:02:59 -07:00
sdong	c3ff14e2c1	Hint temperature of bottommost level files to FileSystem (#8222 ) Summary: As the first part of the effort of placing different files on different storage types, this change introduces several things: (1) An experimental interface in FileSystem that specifies the temperature of a newly created file. (2) A test FileSystemWrapper, SimulatedHybridFileSystem, that simulates HDD for a file of "warm" temperature. (3) A simple experimental feature ColumnFamilyOptions.bottommost_temperature. RocksDB would pass this value to FileSystem when creating any bottommost file. (4) A db_bench parameter that applies (2) and (3) to db_bench. The motivation of the change is to introduce minimal changes that allow us to evolve tiered storage development. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8222 Test Plan: ./db_bench --benchmarks=fillrandom --write_buffer_size=2000000 -max_bytes_for_level_base=20000000 -level_compaction_dynamic_level_bytes --reads=100 -compaction_readahead_size=20000000 --reads=100000 -num=10000000 followed by ./db_bench --benchmarks=readrandom,stats --write_buffer_size=2000000 -max_bytes_for_level_base=20000000 -simulate_hybrid_fs_file=/tmp/warm_file_list -level_compaction_dynamic_level_bytes -compaction_readahead_size=20000000 --reads=500 --threads=16 -use_existing_db --num=10000000 and see results as expected. Reviewed By: ajkr Differential Revision: D28003028 fbshipit-source-id: 4724896d5205730227ba2f17c3fecb11261744ce	2021-05-03 13:34:04 -07:00
Peter Dillinger	d2ca04e3ed	Add more LSM info to FilterBuildingContext (#8246 ) Summary: Add `num_levels`, `is_bottommost`, and table file creation `reason` to `FilterBuildingContext`, in anticipation of more powerful Bloom-like filter support. To support this, added `is_bottommost` and `reason` to `TableBuilderOptions`, which allowed removing `reason` parameter from `rocksdb::BuildTable`. I attempted to remove `skip_filters` from `TableBuilderOptions`, because filter construction decisions should arise from options, not one-off parameters. I could not completely remove it because the public API for SstFileWriter takes a `skip_filters` parameter, and translating this into an option change would mean awkwardly replacing the table_factory if it is BlockBasedTableFactory with new filter_policy=nullptr option. I marked this public skip_filters option as deprecated because of this oddity. (skip_filters on the read side probably makes sense.) At least `skip_filters` is now largely hidden for users of `TableBuilderOptions` and is no longer used for implementing the optimize_filters_for_hits option. Bringing the logic for that option closer to handling of FilterBuildingContext makes it more obvious that these two are using the same notion of "bottommost." (Planned: configuration options for Bloom-like filters that generalize `optimize_filters_for_hits`) Recommended follow-up: Try to get away from "bottommost level" naming of things, which is inaccurate (see VersionStorageInfo::RangeMightExistAfterSortedRun), and move to "bottommost run" or just "bottommost." Pull Request resolved: https://github.com/facebook/rocksdb/pull/8246 Test Plan: extended an existing unit test to exercise and check various filter building contexts. 
Also, existing tests for optimize_filters_for_hits validate some of the "bottommost" handling, which is now closely connected to FilterBuildingContext::is_bottommost through TableBuilderOptions::is_bottommost Reviewed By: mrambacher Differential Revision: D28099346 Pulled By: pdillinger fbshipit-source-id: 2c1072e29c24d4ac404c761a7b7663292372600a	2021-04-30 13:50:13 -07:00
Peter Dillinger	85becd94c1	Refactor: use TableBuilderOptions to reduce parameter lists (#8240 ) Summary: Greatly reduced the not-quite-copy-paste giant parameter lists of rocksdb::NewTableBuilder, rocksdb::BuildTable, BlockBasedTableBuilder::Rep ctor, and BlockBasedTableBuilder ctor. Moved weird separate parameter `uint32_t column_family_id` of TableFactory::NewTableBuilder into TableBuilderOptions. Re-ordered parameters to TableBuilderOptions ctor, so that `uint64_t target_file_size` is not randomly placed between uint64_t timestamps (was easy to mix up). Replaced a couple of fields of BlockBasedTableBuilder::Rep with a FilterBuildingContext. The motivation for this change is making it easier to pass along more data into new fields in FilterBuildingContext (follow-up PR). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8240 Test Plan: ASAN make check Reviewed By: mrambacher Differential Revision: D28075891 Pulled By: pdillinger fbshipit-source-id: fddb3dbb8260a0e8bdcbb51b877ebabf9a690d4f	2021-04-29 07:00:50 -07:00
anand76	0db4cde6e2	Fix a memory leak in c_test (#8237 ) Summary: Don't call ```rocksdb_cache_disown_data()``` as it causes the memory allocated for ```shards_``` to be leaked. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8237 Reviewed By: jay-zhuang Differential Revision: D28039061 Pulled By: anand1976 fbshipit-source-id: c3464efe2c006b93b4be87030116a12a124598c4	2021-04-28 12:29:33 -07:00
Duarte Nunes	3949731de3	Add WAL flush API to C client (#8226 ) Summary: The C client is missing the `manual_wal_flush` option and the `flush_wal` API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8226 Reviewed By: ajkr Differential Revision: D28000869 Pulled By: jay-zhuang fbshipit-source-id: ed44937e7e7e75bc0dfa870a14147fbeef0c38f8	2021-04-27 14:56:23 -07:00
Sahir Hoda	13c655a887	New C API to expose NewCompactOnDeletionCollectorFactory (#8233 ) Summary: New C API rocksdb_options_add_compact_on_deletion_collector_factory to expose NewCompactOnDeletionCollectorFactory Pull Request resolved: https://github.com/facebook/rocksdb/pull/8233 Reviewed By: mrambacher Differential Revision: D28018381 Pulled By: anand1976 fbshipit-source-id: 674c9ed902c91ff0d9f09e7a60c5f37b907604c6	2021-04-27 10:14:04 -07:00
mrambacher	0ca6d6297f	Rename variables in ImmutableCFOptions to avoid conflicts with ImmutableDBOptions (#8227 ) Summary: Renaming ImmutableCFOptions::info_log and statistics to logger and stats. This is stage 2 in creating an ImmutableOptions class. It is necessary because the names match those in ImmutableOptions and have different types. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8227 Reviewed By: jay-zhuang Differential Revision: D28000967 Pulled By: mrambacher fbshipit-source-id: 3bf2aa04e8f1e8724d825b7deacf41080c14420b	2021-04-26 12:43:45 -07:00
Sahir Hoda	d65d7d657d	Expose JemallocNodumpAllocator to C API (#8178 ) Summary: Add new C APIs to create the JemallocNodumpAllocator and set it on a Cache object. `make test` passes with and without `DISABLE_JEMALLOC=1`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8178 Reviewed By: jay-zhuang Differential Revision: D27944631 Pulled By: ajkr fbshipit-source-id: 2531729aa285a8985c58f22f093c4d53029c4a7b	2021-04-22 22:22:34 -07:00
mrambacher	01e460d538	Make types of Immutable/Mutable Options fields match that of the underlying Option (#8176 ) Summary: This PR is a first step at attempting to clean up some of the Mutable/Immutable Options code. With this change, a DBOption and a ColumnFamilyOption can be reconstructed from their Mutable and Immutable equivalents, respectively. readrandom tests do not show any performance degradation versus master (though both are slightly slower than the current 6.19 release). There are still fields in the ImmutableCFOptions that are not CF options but DB options. Eventually, I would like to move those into an ImmutableOptions (= ImmutableDBOptions+ImmutableCFOptions). But that will be part of a future PR to minimize changes and disruptions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8176 Reviewed By: pdillinger Differential Revision: D27954339 Pulled By: mrambacher fbshipit-source-id: ec6b805ba9afe6e094bffdbd76246c2d99aa9fad	2021-04-22 20:43:54 -07:00
Jay Zhuang	f0fca2b1d5	Add internal compaction API for Secondary instance (#8171 ) Summary: Add a compaction API for secondary instances, which compacts the files to a secondary DB path without installing them into the LSM tree. The API will be used for remote compaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8171 Test Plan: `make check` Reviewed By: ajkr Differential Revision: D27694545 Pulled By: jay-zhuang fbshipit-source-id: 8ff3ec1bffdb2e1becee994918850c8902caf731	2021-04-22 13:02:28 -07:00
Zhichao Cao	09a9ec3ac0	Fix the false positive alert of CF consistency check in WAL recovery (#8207 ) Summary: Currently, when recovering information from the WAL, RocksDB performs a consistency check for each column family if one WAL file is corrupted and PointInTimeRecovery is set. However, it reports a false positive alert of "SST file is ahead of WALs" when a CF's current log number is greater than the corrupted WAL number (i.e., the CF contains data beyond the corrupted WAL) due to a new column family being created during a flush. In this case, a new (empty) WAL is created during the flush. Also, for some reason (e.g., a storage issue, or a crash before SyncCloseLog is called), the old WAL is corrupted. The new CF has no data and therefore has no consistency issue. Fix: when checking cfd->GetLogNumber() > corrupted_wal_number, also check cfd->GetLiveSstFilesSize() > 0, so that CFs with no SST file data skip the check. Note a potential inconsistency that this fix may ignore: an empty CF can also be caused by write+delete. In this case, no SST files are generated after the flush, but the CF still has its log in the WAL. When the WAL is corrupted, the DB might be inconsistent. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8207 Test Plan: added unit test, make crash_test Reviewed By: riversand963 Differential Revision: D27898839 Pulled By: zhichao-cao fbshipit-source-id: 931fc2d8b92dd00b4169bf84b94e712fd688a83e	2021-04-22 10:28:37 -07:00
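The combined condition described in the fix above can be sketched as a small predicate. This is a minimal illustration, not the RocksDB source: `ColumnFamilyState` and `IsAheadOfCorruptedWal` are hypothetical names, and the two fields only stand in for the values returned by `cfd->GetLogNumber()` and `cfd->GetLiveSstFilesSize()`.

```cpp
#include <cstdint>

// Hypothetical, simplified stand-in for the per-column-family state
// that the consistency check reads during WAL recovery.
struct ColumnFamilyState {
  std::uint64_t log_number;           // models cfd->GetLogNumber()
  std::uint64_t live_sst_files_size;  // models cfd->GetLiveSstFilesSize()
};

// Returns true when the CF should be reported as "SST file is ahead of WALs".
// A CF whose log number is past the corrupted WAL but which has no live SST
// data (e.g. a CF created during a flush) is no longer reported.
constexpr bool IsAheadOfCorruptedWal(const ColumnFamilyState& cf,
                                     std::uint64_t corrupted_wal_number) {
  return cf.log_number > corrupted_wal_number && cf.live_sst_files_size > 0;
}
```

With this predicate, a freshly created CF (`log_number` 7, no SST data) is skipped when WAL 5 is corrupted, while a CF with the same log number and live SST data is still flagged.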
Yanqin Jin	314352761f	Ignore comparator name mismatch in ldb manifest dump (#8216 ) Summary: RocksDB allows user-specified custom comparators which may not be known to `ldb`, a built-in tool for checking/mutating the database. Therefore, column family comparator names mismatch encountered during manifest dump should not prevent the dumping from proceeding. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8216 Test Plan: ``` make check ``` Also manually do the following ``` KEEP_DB=1 ./db_with_timestamp_basic_test ./ldb --db=<db> manifest_dump --verbose ``` The ldb should succeed and print something like: ``` ... --------------- Column family "default" (ID 0) -------------- log number: 6 comparator: <TestComparator>, but the comparator object is not available. ... ``` Reviewed By: ltamasi Differential Revision: D27927581 Pulled By: riversand963 fbshipit-source-id: f610b2c842187d17f575362070209ee6b74ec6d4	2021-04-21 20:43:10 -07:00
sdong	4985cea141	Add comment to DisableManualCompaction() (#8186 ) Summary: Add comment to DisableManualCompaction() which was missing. Also explicitly return from DBImpl::CompactRange() to avoid memtable flush when manual compaction is disabled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8186 Test Plan: Run existing unit tests. Reviewed By: jay-zhuang Differential Revision: D27744517 fbshipit-source-id: 449548a48905903b888dc9612bd17480f6596a71	2021-04-21 15:23:46 -07:00
Akanksha Mahajan	596e9008e4	Stall writes in WriteBufferManager when memory_usage exceeds buffer_size (#7898 ) Summary: When WriteBufferManager is shared across DBs and column families to maintain memory usage under a limit, OOMs have been observed when flush cannot finish but writes continuously insert to memtables. In order to avoid OOMs, when memory usage goes beyond buffer_limit_ and a DB tries to write, this change stalls incoming writers until flush is completed and memory_usage drops. Design: Stall condition: When total memory usage exceeds WriteBufferManager::buffer_size_ (memory_usage() >= buffer_size_), WriteBufferManager::ShouldStall() returns true. DBImpl first blocks incoming/future writers by calling write_thread_.BeginWriteStall() (which adds a dummy stall object to the writer's queue). Then the DB is blocked in state State::Blocked (the current write doesn't go through). The WBMStallInterface object maintained by every DB instance is added to the queue of WriteBufferManager. If multiple DBs try to write during this stall, they will also be blocked when the check of WriteBufferManager::ShouldStall() returns true. End Stall condition: When flush is finished and memory usage goes down, the stall ends only if memory waiting to be flushed is less than buffer_size/2. This lower limit gives time for flush to complete and avoids continuous stalling if memory usage remains close to buffer_size. WriteBufferManager::EndWriteStall() is called, which removes all instances from its queue and signals them to continue. Their state is changed to State::Running and they are unblocked. DBImpl then signals all incoming writers of that DB to continue by calling write_thread_.EndWriteStall() (which removes the dummy stall object from the queue). Each DB instance creates a WBMStallInterface, which is an interface to block and signal DBs during a stall. When a DB needs to be blocked or signalled by WriteBufferManager, its state_for_wbm_ state is changed accordingly (RUNNING or BLOCKED). 
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7898 Test Plan: Added a new test db/db_write_buffer_manager_test.cc Reviewed By: anand1976 Differential Revision: D26093227 Pulled By: akankshamahajan15 fbshipit-source-id: 2bbd982a3fb7033f6de6153aa92a221249861aae	2021-04-21 13:54:02 -07:00
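The begin/end stall thresholds described above form a simple hysteresis, which can be sketched as a state-transition function. This is only an illustration under stated assumptions: `NextStallState` is a hypothetical helper, not a RocksDB function, and it uses one `memory_usage` figure for both thresholds, whereas the commit message distinguishes total memory usage from memory waiting to be flushed.

```cpp
#include <cstddef>

// Hypothetical helper modeling the two thresholds from the commit message.
// Simplification (an assumption of this sketch): a single memory_usage value
// is compared against both the start and the end threshold.
//  - not stalled: start stalling once memory_usage >= buffer_size
//  - stalled:     keep stalling until memory_usage <  buffer_size / 2
constexpr bool NextStallState(bool currently_stalled,
                              std::size_t memory_usage,
                              std::size_t buffer_size) {
  return currently_stalled ? memory_usage >= buffer_size / 2
                           : memory_usage >= buffer_size;
}
```

The gap between the two thresholds is what prevents writers from flapping between stalled and running while memory usage hovers near `buffer_size`.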
Andrew Kryczka	905dd17b35	Fix seqno in ingested file boundary key metadata (#8209 ) Summary: Fixes https://github.com/facebook/rocksdb/issues/6245. Adapted from https://github.com/facebook/rocksdb/issues/8201 and https://github.com/facebook/rocksdb/issues/8205. Previously we were writing the ingested file's smallest/largest internal keys with sequence number zero, or `kMaxSequenceNumber` in case of range tombstone. The former (sequence number zero) is incorrect and can lead to files being incorrectly ordered. The fix in this PR is to overwrite boundary keys that have sequence number zero with the ingested file's assigned sequence number. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8209 Test Plan: repro unit test Reviewed By: riversand963 Differential Revision: D27885678 Pulled By: ajkr fbshipit-source-id: 4a9f2c6efdfff81c3a9923e915ea88b250ee7b6a	2021-04-20 14:00:21 -07:00
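The boundary-key rule described above can be sketched as follows. This is a minimal model, not the ingestion code itself: `BoundarySeqno` is a hypothetical helper, and the value given for `kMaxSequenceNumber` is assumed to match RocksDB's 56-bit maximum sequence number.

```cpp
#include <cstdint>

// Assumed to mirror RocksDB's maximum sequence number (56 bits).
constexpr std::uint64_t kMaxSequenceNumber = (0x1ull << 56) - 1;

// Hypothetical helper: pick the sequence number to record in an ingested
// file's smallest/largest internal key. A recorded value of zero is replaced
// by the sequence number assigned to the ingested file; any other value
// (such as kMaxSequenceNumber for a range tombstone boundary) is kept.
constexpr std::uint64_t BoundarySeqno(std::uint64_t recorded_seqno,
                                      std::uint64_t assigned_seqno) {
  return recorded_seqno == 0 ? assigned_seqno : recorded_seqno;
}
```

For a file assigned sequence number 42, a point-key boundary recorded with zero becomes 42, while a range tombstone boundary keeps `kMaxSequenceNumber`.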
Jay Zhuang	a89740fbc6	Fix unittest no space issue (#8204 ) Summary: Unit tests report "no space" from time to time, which can be reproduced on a small-memory machine with SHM. It is caused by large WAL files generated during the test, which are preallocated but not truncated during close(). This change adds the missing APIs to set preallocation. It also adds the ARM test as a nightly build, since the test runs for more than 1 hour. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8204 Test Plan: test on small memory arm machine Reviewed By: mrambacher Differential Revision: D27873145 Pulled By: jay-zhuang fbshipit-source-id: f797c429d6bc13cbcc673bc03fcc72adda55f506	2021-04-20 08:42:28 -07:00
Yanqin Jin	a376c22066	Handle rename() failure in non-local FS (#8192 ) Summary: In a distributed environment, a file `rename()` operation can succeed on server (remote) side, but the client can somehow return non-ok status to RocksDB. Possible reasons include network partition, connection issue, etc. This happens in `rocksdb::SetCurrentFile()`, which can be called in `LogAndApply() -> ProcessManifestWrites()` if RocksDB tries to switch to a new MANIFEST. We currently always delete the new MANIFEST if an error occurs. This is problematic in distributed world. If the server-side successfully updates the CURRENT file via renaming, then a subsequent `DB::Open()` will try to look for the new MANIFEST and fail. As a fix, we can track the execution result of IO operations on the new MANIFEST. - If IO operations on the new MANIFEST fail, then we know the CURRENT must point to the original MANIFEST. Therefore, it is safe to remove the new MANIFEST. - If IO operations on the new MANIFEST all succeed, but somehow we end up in the clean up code block, then we do not know whether CURRENT points to the new or old MANIFEST. (For local POSIX-compliant FS, it should still point to old MANIFEST, but it does not matter if we keep the new MANIFEST.) Therefore, we keep the new MANIFEST. - Any future `LogAndApply()` will switch to a new MANIFEST and update CURRENT. - If process reopens the db immediately after the failure, then the CURRENT file can point to either the new MANIFEST or the old one, both of which exist. Therefore, recovery can succeed and ignore the other. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8192 Test Plan: make check Reviewed By: zhichao-cao Differential Revision: D27804648 Pulled By: riversand963 fbshipit-source-id: 9c16f2a5ce41bc6aadf085e48449b19ede8423e4	2021-04-19 18:11:13 -07:00
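The cleanup rule described above reduces to a single decision on the tracked IO result. The sketch below is illustrative only: `NewManifestAction` and `DecideNewManifestCleanup` are hypothetical names, not identifiers from RocksDB.

```cpp
// Hypothetical outcome of the error-path cleanup for a newly written MANIFEST.
enum class NewManifestAction { kDelete, kKeep };

// new_manifest_io_ok: whether every IO operation on the new MANIFEST succeeded
// before the error path was reached.
//  - IO failed:    CURRENT must still point to the original MANIFEST,
//                  so the new MANIFEST can be removed safely.
//  - IO succeeded: CURRENT may point to either MANIFEST (for example, a remote
//                  rename() that succeeded but was reported as failed),
//                  so the new MANIFEST is kept.
constexpr NewManifestAction DecideNewManifestCleanup(bool new_manifest_io_ok) {
  return new_manifest_io_ok ? NewManifestAction::kKeep
                            : NewManifestAction::kDelete;
}
```

Keeping the file in the ambiguous case costs only some disk space; a later `LogAndApply()` switches to yet another MANIFEST and updates CURRENT.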
Levi Tamasi	0c6e4674a6	Fix a data race related to DB properties (#8206 ) Summary: Historically, the DB properties `rocksdb.cur-size-active-mem-table`, `rocksdb.cur-size-all-mem-tables`, and `rocksdb.size-all-mem-tables` called the method `MemTable::ApproximateMemoryUsage` for mutable memtables, which is not safe without synchronization. This resulted in data races with memtable inserts. The patch changes the code handling these properties to use `MemTable::ApproximateMemoryUsageFast` instead, which returns a cached value backed by an atomic variable. Two test cases had to be updated for this change. `MemoryTest.MemTableAndTableReadersTotal` was fixed by increasing the value size used so each value ends up in its own memtable, which was the original intention (note: the test has been broken in the sense that the test code didn't consider that memtable sizes below 64 KB get increased to 64 KB by `SanitizeOptions`, and has been passing only by accident). `DBTest.MemoryUsageWithMaxWriteBufferSizeToMaintain` relies on completely up-to-date values and thus was changed to use `ApproximateMemoryUsage` directly instead of going through the DB properties. Note: this should be safe in this case since there's only a single thread involved. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8206 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D27866811 Pulled By: ltamasi fbshipit-source-id: 7bd754d0565e0a65f1f7f0e78ffc093beef79394	2021-04-19 16:38:02 -07:00
Yanqin Jin	b0e20194ea	Handle blob files when options.best_efforts_recovery is true (#8180 ) Summary: If `options.best_efforts_recovery == true`, RocksDB currently tolerates missing table files and recovers to the latest version without missing table files (not considering WAL). It is necessary to handle blob files as well to make the feature more complete. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8180 Test Plan: make check Reviewed By: ltamasi Differential Revision: D27840556 Pulled By: riversand963 fbshipit-source-id: 041685d0dc2e7779ac4f0374c07a8a327704aa5e	2021-04-19 11:56:14 -07:00
mrambacher	4c41e51c07	Add Blob Options to C API (#8148 ) Summary: Added the Blob option settings from the AdvancedColumnFamilyOptions to the C API. There are no tests for getting/setting options in the C API currently, hence no specific test plans. Should there be some? Pull Request resolved: https://github.com/facebook/rocksdb/pull/8148 Reviewed By: ltamasi Differential Revision: D27568495 Pulled By: mrambacher fbshipit-source-id: 3a52b784467ea2c4bc58be5f75c5d41f0a5c55d6	2021-04-16 05:56:00 -07:00
Akanksha Mahajan	00803d619c	Fix flaky failure in DBSSTest.DBWithSstFileManagerForBlobFilesWithGC (#8196 ) Summary: Updated the test to wait until all trash files are deleted by SSTFileManager in the background. Since deletion runs in the background, the number of files deleted might not always be as expected. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8196 Reviewed By: jay-zhuang Differential Revision: D27812273 Pulled By: akankshamahajan15 fbshipit-source-id: d3ace1db34f91254b52fa455e09844d02801f58e	2021-04-15 20:18:57 -07:00
Akanksha Mahajan	83031e7343	Fix for LITE mode failure on MacOS (#8189 ) Summary: Fix for failure to build in LITE mode on MacOs from BlobFileCompletionCallback unused private fields. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8189 Reviewed By: jay-zhuang Differential Revision: D27768341 Pulled By: akankshamahajan15 fbshipit-source-id: 14d31d7a9b52d308d9f9f27feff1977c5550622f	2021-04-15 09:45:02 -07:00
Akanksha Mahajan	296b47db25	Extend file_checksum_dump ldb command and DB::GetLiveFilesChecksumInfo to blob files (#8179 ) Summary: Extend the DB::GetLiveFilesChecksumInfo API to blob files. This API is also used by the file_checksum_dump ldb command to dump the checksums of SST files; the command now dumps blob file checksums as well. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8179 Test Plan: Add new unit test Reviewed By: zhichao-cao Differential Revision: D27714965 Pulled By: akankshamahajan15 fbshipit-source-id: d8b7343ea845a64c83800336d88cced7152a8c92	2021-04-15 09:38:13 -07:00
Yanqin Jin	b1f62be10e	Use the right level (L0) for files written during WAL recovery (#8187 ) Summary: As the name of `DBImpl::WriteLevel0TableForRecovery` suggests, the resulting table file should be placed on L0. However, the argument `level` passed to `BuildTable()` is -1. We need to correct this since the level information will be useful to determine file placement. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8187 Test Plan: make check Reviewed By: ltamasi Differential Revision: D27748570 Pulled By: riversand963 fbshipit-source-id: e1cd23128a8de31f14b1edc2ea92754c154e4f10	2021-04-14 23:40:22 -07:00
Justin Chapman	d89483098f	Assert unlimited max_open_files for FIFO compaction. (#8172 ) Summary: Resolves https://github.com/facebook/rocksdb/issues/8014 - Add an assertion on `DB::Open` to ensure `db_options.max_open_files` is unlimited if FIFO Compaction is being used. - This is to align with what the docs mention and to prevent premature data deletion. - Update tests to work with this assertion. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8172 Test Plan: ```bash $ make check -j$(nproc) Generated TARGETS Summary: - 6 libs - 0 binarys - 180 tests ``` Reviewed By: ajkr Differential Revision: D27768792 Pulled By: thejchap fbshipit-source-id: cf6350535e3a3577fec72bcba75b3c094dc7a6f3	2021-04-14 12:05:47 -07:00
Sahir Hoda	139778dfb3	Expose Cache::DisownData in C API (#8160 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8160 Reviewed By: riversand963 Differential Revision: D27672474 Pulled By: ajkr fbshipit-source-id: fdbbc3398f0b1d4cef6b68636e5caf369c34b3a7	2021-04-09 10:39:11 -07:00
Giuseppe Ottaviano	48cd7a3aae	Fix flush reason attribution (#8150 ) Summary: Current flush reason attribution is misleading or incorrect (depending on what the original intention was): - Flush due to WAL reaching its maximum size is attributed to `kWriteBufferManager` - Flushes due to full write buffer and write buffer manager are not distinguishable, both are attributed to `kWriteBufferFull` This changes the first to a new flush reason `kWALFull`, and splits the second between `kWriteBufferManager` and `kWriteBufferFull`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8150 Reviewed By: zhichao-cao Differential Revision: D27569645 Pulled By: ot fbshipit-source-id: 7e3c8ca186a6e71976e6b8e937297eebd4b769cc	2021-04-07 23:18:37 -07:00
Peter Dillinger	a4e82a3cca	Fix read-only DB writing to filesystem with write_dbid_to_manifest (#8164 ) Summary: Fixing another crash test failure in the case of write_dbid_to_manifest=true and reading a backup as read-only DB. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8164 Test Plan: enhanced unit test for backup as read-only DB, ran blackbox_crash_test more with elevated backup_one_in Reviewed By: zhichao-cao Differential Revision: D27622237 Pulled By: pdillinger fbshipit-source-id: 680d0f99ddb465a601737f2e3f2c80efd47384fb	2021-04-07 10:26:47 -07:00
Peter Dillinger	879357fdb0	Make backups openable as read-only DBs (#8142 ) Summary: A current limitation of backups is that you don't know the exact database state of when the backup was taken. With this new feature, you can at least inspect the backup's DB state without restoring it by opening it as a read-only DB. Rather than add something like OpenAsReadOnlyDB to the BackupEngine API, which would inhibit opening stackable DB implementations read-only (if/when their APIs support it), we instead provide a DB name and Env that can be used to open as a read-only DB. Possible follow-up work: * Add a version of GetBackupInfo for a single backup. * Let CreateNewBackup return the BackupID of the newly-created backup. Implementation details: Refactored ChrootFileSystem to split off new base class RemapFileSystem, which allows more general remapping of files. We use this base class to implement BackupEngineImpl::RemapSharedFileSystem. To minimize API impact, I decided to just add these fields `name_for_open` and `env_for_open` to those set by GetBackupInfo when include_file_details=true. Creating the RemapSharedFileSystem adds a bit to the memory consumption, perhaps unnecessarily in some cases, but this has been mitigated by (a) only initialize the RemapSharedFileSystem lazily when GetBackupInfo with include_file_details=true is called, and (b) using the existing `shared_ptr<FileInfo>` objects to hold most of the mapping data. To enhance API safety, RemapSharedFileSystem is wrapped by new ReadOnlyFileSystem which rejects any attempts to write. This uncovered a couple of places in which DB::OpenForReadOnly would write to the filesystem, so I fixed these. Added a release note because this affects logging. Additional minor refactoring in backupable_db.cc to support the new functionality. 
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8142 Test Plan: new test (run with ASAN and UBSAN), added to stress test and ran it for a while with amplified backup_one_in Reviewed By: ajkr Differential Revision: D27535408 Pulled By: pdillinger fbshipit-source-id: 04666d310aa0261ef6b2385c43ca793ce1dfd148	2021-04-06 14:37:53 -07:00
Yanqin Jin	09528f9fa1	Fix a bug for SeekForPrev with partitioned filter and prefix (#8137 ) Summary: According to https://github.com/facebook/rocksdb/issues/5907, each filter partition "should include the bloom of the prefix of the last key in the previous partition" so that SeekForPrev() in prefix mode can return correct result. The prefix of the last key in the previous partition does not necessarily have the same prefix as the first key in the current partition. Regardless of the first key in current partition, the prefix of the last key in the previous partition should be added. The existing code, however, does not follow this. Furthermore, there is another issue: when finishing current filter partition, `FullFilterBlockBuilder::AddPrefix()` is called for the first key in next filter partition, which effectively overwrites `last_prefix_str_` prematurely. Consequently, when the filter block builder proceeds to the next partition, `last_prefix_str_` will be the prefix of its first key, leaving no way of adding the bloom of the prefix of the last key of the previous partition. Prefix extractor is FixedLength.2. ``` [ filter part 1 ] [ filter part 2 ] abc d ``` When SeekForPrev("abcd"), checking the filter partition will land on filter part 2 because "abcd" > "abc" but smaller than "d". If the filter in filter part 2 happens to return false for the test for "ab", then SeekForPrev("abcd") will build incorrect iterator tree in non-total-order mode. Also fix a unit test which starts to fail following this PR. `InDomain` should not fail due to assertion error when checking on an arbitrary key. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8137 Test Plan: ``` make check ``` Without this fix, the following command will fail pretty soon. 
``` ./db_stress --acquire_snapshot_one_in=10000 --avoid_flush_during_recovery=0 \ --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 \ --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=17 \ --bottommost_compression_type=disable --cache_index_and_filter_blocks=1 --cache_size=1048576 \ --checkpoint_one_in=0 --checksum_type=kxxHash64 --clear_column_family_one_in=0 \ --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_ttl=0 \ --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 \ --compression_parallel_threads=1 --compression_type=zstd --compression_zstd_max_train_bytes=0 \ --continuous_verification_interval=0 --db=/dev/shm/rocksdb/rocksdb_crashtest_whitebox \ --db_write_buffer_size=8388608 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --enable_blob_files=0 \ --enable_compaction_filter=0 --enable_pipelined_write=1 --file_checksum_impl=big --flush_one_in=1000000 \ --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 \ --get_sorted_wal_files_one_in=0 --index_block_restart_interval=4 --index_type=2 --ingest_external_file_one_in=0 \ --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True \ --log2_keys_per_lock=10 --long_running_snapshots=1 --mark_for_compaction_one_file_in=0 \ --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000000 --max_key_len=3 \ --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=3 \ --max_write_buffer_size_to_maintain=8388608 --memtablerep=skip_list --mmap_read=1 --mock_direct_io=False \ --nooverwritepercent=0 --open_files=500000 --ops_per_thread=20000000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=1 --partition_pinning=0 --pause_background_one_in=1000000 \ --periodic_compaction_seconds=0 --prefixpercent=5 --progress_reports=0 
--read_fault_one_in=0 --read_only=0 \ --readpercent=45 --recycle_log_file_num=0 --reopen=20 --secondary_catch_up_one_in=0 \ --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 \ --sst_file_manager_bytes_per_truncate=0 --subcompactions=2 --sync=0 --sync_fault_injection=False \ --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=0 \ --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_blob_db=0 --use_block_based_filter=0 \ --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 \ --use_multiget=0 --use_ribbon_filter=0 --use_txn=0 --user_timestamp_size=8 --verify_checksum=1 \ --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --write_buffer_size=4194304 \ --write_dbid_to_manifest=1 --writepercent=35 ``` Reviewed By: pdillinger Differential Revision: D27553054 Pulled By: riversand963 fbshipit-source-id: 60e391e4a2d8d98a9a3172ec5d6176b90ec3de98	2021-04-06 12:14:08 -07:00
Yanqin Jin	dd3fbbbf95	Use separate db dir for different tests hoping to remove flakiness (#8147 ) Summary: DBWALTestWithParam relies on `SstFileManager` to have the expected behavior. However, if this test shares db directories with other DBSSTTest, then the SstFileManager may see non-empty data, thus will change its behavior to be different from expectation, introducing flakiness. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8147 Test Plan: make check Reviewed By: jay-zhuang Differential Revision: D27553362 Pulled By: riversand963 fbshipit-source-id: a2d86343e8e2220bc553b6695ce87dd21a97ddec	2021-04-03 11:48:56 -07:00
Peter Dillinger	0fccc6225e	Fix db_test2 parallelism (#8145 ) Summary: With thread/process-specific dirs. (Errors seen in FB infra.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/8145 Test Plan: see in FB infra tests Reviewed By: riversand963 Differential Revision: D27542355 Pulled By: pdillinger fbshipit-source-id: b3c8e66f91a6a6b3a775f6fc0c3cf71e63c29ade	2021-04-02 13:38:04 -07:00
Akanksha Mahajan	689b13e639	Add request_id in IODebugContext. (#8045 ) Summary: Add request_id in IODebugContext which will be populated by underlying FileSystem for IOTracing purposes. Update IOTracer to trace request_id in the tracing records. Provided API IODebugContext::SetRequestId which will set the request_id and enable tracing for request_id. The API hides the implementation and underlying file system needs to call this API directly. Update DB::StartIOTrace API and remove the redundant Env* from the arguments, as it is not used and the DB already has an Env that is passed down to IOTracer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8045 Test Plan: Update unit test. Differential Revision: D26899871 Pulled By: akankshamahajan15 fbshipit-source-id: 56adef52ee5af0fb3060b607c3af1ec01635fa2b	2021-04-01 13:14:51 -07:00
rockeet	5025c7ec09	version_set_test.cc: remove a redundant obj copy (#7880 ) Summary: Remove redundant obj copy Pull Request resolved: https://github.com/facebook/rocksdb/pull/7880 Reviewed By: akankshamahajan15 Differential Revision: D26921119 Pulled By: riversand963 fbshipit-source-id: f227da688b067870a069e728a67799a8a95fee99	2021-04-01 11:28:54 -07:00
Andrew Kryczka	c43a37a922	Fix compression dictionary sampling with dedicated range tombstone SSTs (#8141 ) Summary: Return early in case there are zero data blocks when `BlockBasedTableBuilder::EnterUnbuffered()` is called. This crash can only be triggered by applying dictionary compression to SST files that contain only range tombstones. It cannot be triggered by a low buffer limit alone since we only consider entering unbuffered mode after buffering a data block causing the limit to be breached, or `Finish()`ing the file. It also cannot be triggered by a totally empty file because those go through `Abandon()` rather than `Finish()` so unbuffered mode is never entered. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8141 Test Plan: added a unit test that repro'd the "Floating point exception" Reviewed By: riversand963 Differential Revision: D27495640 Pulled By: ajkr fbshipit-source-id: a463cfba476919dc5c5c380800a75a86c31ffa23	2021-04-01 05:08:17 -07:00
darionyaphet	a3a943bf63	Merge checks into one (#8138 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8138 Reviewed By: zhichao-cao Differential Revision: D27475616 Pulled By: riversand963 fbshipit-source-id: d2815eed578a90c53d6a4e0dc4aaa232516eb4f8	2021-03-31 19:13:10 -07:00
Andrew Kryczka	1ba2b8a568	Add sample_for_compression results to table properties (#8139 ) Summary: Added `TableProperties::{fast,slow}_compression_estimated_data_size`. These properties are present in block-based tables when `ColumnFamilyOptions::sample_for_compression > 0` and the necessary compression library is supported when the file is generated. They contain estimates of what `TableProperties::data_size` would be if the "fast"/"slow" compression library had been used instead. One limitation is we do not record exactly which "fast" (LZ4 or Snappy) or "slow" (ZSTD or Zlib) compression library produced the result. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8139 Test Plan: - new unit test - ran `db_bench` with `sample_for_compression=1`; verified the `data_size` property matches the `{slow,fast}_compression_estimated_data_size` when the same compression type is used for the output file compression and the sampled compression Reviewed By: riversand963 Differential Revision: D27454338 Pulled By: ajkr fbshipit-source-id: 9529293de93ddac7f03b2e149d746e9f634abac4	2021-03-31 18:21:50 -07:00
Zhichao Cao	335c5a6be5	Fix error_handler_fs_test failure due to statistics (#8136 ) Summary: Fix the error_handler_fs_test failure due to statistics; the test fails because of multi-threaded execution, where the resume behavior can differ. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8136 Test Plan: make check Reviewed By: akankshamahajan15 Differential Revision: D27448828 Pulled By: zhichao-cao fbshipit-source-id: b94255c45e9e66e93334b5ca2e4e1bfcba23fc20	2021-03-30 21:44:44 -07:00
sherriiiliu	e6534900bd	Fix possible hang issue in ~DBImpl() when flush is scheduled in LOW pool (#8125 ) Summary: In DBImpl::CloseHelper, we wait for bg_compaction_scheduled_ and bg_flush_scheduled_ to drop to 0. Unschedule is called prior to cancel any unscheduled flushes/compactions. It is assumed that anything in the high priority is a flush, and anything in the low priority pool is a compaction. This assumption, however, is broken when the high-pri pool is full. As a result, bg_compaction_scheduled_ can go < 0 and bg_flush_scheduled_ will remain > 0 and DB can be in hang state. The fix is, we decrement the `bg_{flush,compaction,bottom_compaction}_scheduled_` inside the `Unschedule{Flush,Compaction,BottomCompaction}Callback()`s. DB `mutex_` will make the counts atomic in `Unschedule`. Related discussion: https://github.com/facebook/rocksdb/issues/7928 Pull Request resolved: https://github.com/facebook/rocksdb/pull/8125 Test Plan: Added new test case which hangs without the fix. Reviewed By: jay-zhuang Differential Revision: D27390043 Pulled By: ajkr fbshipit-source-id: 78a367fba9a59ac5607ad24bd1c46dc16d5ec110	2021-03-30 18:35:20 -07:00
Jay Zhuang	a037bb35e9	Compaction should not move data to up level (#8116 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8116 Reviewed By: ajkr, mrambacher Differential Revision: D27353828 Pulled By: jay-zhuang fbshipit-source-id: 42703fb01b04d92cc097d7979e64798448852e88	2021-03-29 17:10:42 -07:00
anand76	7d7f14480e	Always truncate the latest WAL file on DB Open (#8122 ) Summary: Currently, we only truncate the latest alive WAL files when the DB is opened. If the latest WAL file is empty or was flushed during Open, it's not truncated since the file will be deleted later on in the Open path. However, before deletion, a new WAL file is created, and if the process crash loops between the new WAL file creation and deletion of the old WAL file, the preallocated space will keep accumulating and eventually use up all disk space. To prevent this, always truncate the latest WAL file, even if it's empty or the data was flushed. Tests: Add unit tests to db_wal_test Pull Request resolved: https://github.com/facebook/rocksdb/pull/8122 Reviewed By: riversand963 Differential Revision: D27366132 Pulled By: anand1976 fbshipit-source-id: f923cc03ef033ccb32b140d36c6a63a8152f0e8e	2021-03-28 10:00:08 -07:00
anand76	c5f52714fb	Use malloc in rocksdb_transaction_get_snapshot (#8114 ) Summary: The snapshot structure returned by rocksdb_transaction_get_snapshot is supposed to be freed by calling rocksdb_free(), so allocate using malloc rather than new. Fixes https://github.com/facebook/rocksdb/issues/6112 Pull Request resolved: https://github.com/facebook/rocksdb/pull/8114 Reviewed By: akankshamahajan15 Differential Revision: D27362923 Pulled By: anand1976 fbshipit-source-id: e93a8b1ffe26dafbe22529907f72b796ae971214	2021-03-26 15:51:34 -07:00
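The allocation rule behind the fix above can be illustrated with a minimal sketch. It assumes, as the commit message implies, that the release function is a thin wrapper over `free()`. The names `snapshot_handle_t`, `sketch_get_snapshot`, and `sketch_free` are hypothetical stand-ins and not the RocksDB C API; the only point is that memory released with `free()` should come from `malloc()` rather than `new`.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical handle type standing in for the C API's snapshot struct.
struct snapshot_handle_t {
  const void* rep;  // opaque pointer owned elsewhere
};

// Allocate with malloc() so that a free()-based release function is valid.
snapshot_handle_t* sketch_get_snapshot(const void* rep) {
  auto* handle =
      static_cast<snapshot_handle_t*>(std::malloc(sizeof(snapshot_handle_t)));
  if (handle != nullptr) {
    handle->rep = rep;
  }
  return handle;
}

// Hypothetical counterpart of a free()-based release function.
void sketch_free(void* ptr) { std::free(ptr); }

// Usage sketch: allocate, inspect, and release a handle.
void DemoSnapshotHandle() {
  int dummy = 0;
  snapshot_handle_t* handle = sketch_get_snapshot(&dummy);
  assert(handle != nullptr && handle->rep == &dummy);
  sketch_free(handle);
}
```

Pairing `new` with `free()` (or `malloc()` with `delete`) is undefined behavior in C++, which is why the allocator has to match the documented release call.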
Zhichao Cao	7f27767efa	Remove disabled tests (#8123 ) Summary: Remove disabled tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/8123 Test Plan: make check Reviewed By: ltamasi Differential Revision: D27367066 Pulled By: zhichao-cao fbshipit-source-id: 71fa1d492d9b0144decff0a1d0e0ef25c0ecc4ba	2021-03-26 12:49:00 -07:00
Levi Tamasi	303cb23a0f	Introduce a ThreadGuard class and use it in ExternalSSTFileTest.PickedLevelBug (#8112 ) Summary: The patch adds a resource management/RAII class called `ThreadGuard`, which can be used to ensure that the managed thread is joined when the `ThreadGuard` is destroyed, regardless of whether it is due to the object going out of scope, an early return, an exception etc. This is important because if an `std::thread` object is destroyed without having been joined (or detached) first, the process is aborted (via `std::terminate`). For now, `ThreadGuard` is only used in the test case `ExternalSSTFileTest.PickedLevelBug`; however, it could come in handy elsewhere in the codebase as well (both in test code and "real" code). Case in point: in the `PickedLevelBug` test case, with the earlier code we could end up in the above situation when the following assertion (which is before the threads are joined) is triggered: ``` ASSERT_FALSE(bg_compact_started.load()); ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/8112 Test Plan: ``` make check gtest-parallel --repeat=10000 ./external_sst_file_test --gtest_filter="*PickedLevelBug" ``` Reviewed By: riversand963 Differential Revision: D27343185 Pulled By: ltamasi fbshipit-source-id: 2a8c3aa68bc78cc03ec0dbae909fb25c2cd15c69	2021-03-25 22:08:58 -07:00
Zhichao Cao	af80a78ba4	Fix flush no wal IO error bug (#8107 ) Summary: There is a bug in the current code base, introduced in https://github.com/facebook/rocksdb/issues/8049 , where we still treat the case of only an SST file write IO error as a hard error. Fix it by removing the logic. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8107 Test Plan: make check, error_handler_fs_test Reviewed By: anand1976 Differential Revision: D27321422 Pulled By: zhichao-cao fbshipit-source-id: c014afc1553ca66b655e3bbf9d0bf6eb417ccf94	2021-03-25 21:42:50 -07:00
storagezhang	711881bc25	Fix some typos in comments (#8066 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8066 Reviewed By: jay-zhuang Differential Revision: D27280799 Pulled By: mrambacher fbshipit-source-id: 68f91f5af4ffe0a84be581961bf9366887f47702	2021-03-25 21:18:08 -07:00
Andrew Kryczka	c20a7cd6c7	Apply `sample_for_compression` to all block-based tables (#8105 ) Summary: Previously it only applied to block-based tables generated by flush. This restriction was undocumented and blocked a new use case. Now compression sampling applies to all block-based tables we generate when it is enabled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8105 Test Plan: new unit test Reviewed By: riversand963 Differential Revision: D27317275 Pulled By: ajkr fbshipit-source-id: cd9fcc5178d6515e8cb59c6facb5ac01893cb5b0	2021-03-25 15:00:45 -07:00
Jay Zhuang	45c65d6dcf	Use thread-safe `strerror_r()` to get error message (#8087 ) Summary: `strerror()` is not thread-safe, using `strerror_r()` instead. The API could be different on the different platforms, used the code from `0deef031cb/folly/String.cpp (L457)` Pull Request resolved: https://github.com/facebook/rocksdb/pull/8087 Reviewed By: mrambacher Differential Revision: D27267151 Pulled By: jay-zhuang fbshipit-source-id: 4b8856d1ec069d5f239b764750682c56e5be9ddb	2021-03-24 23:07:27 -07:00
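The platform difference mentioned above (`strerror_r()` returns `int` in the XSI variant and `char*` in the GNU variant) can be absorbed with overload resolution. The following is a minimal sketch of that general trick on a POSIX system; the namespace `sketch` and the names `PickMessage` and `ErrnoToString` are hypothetical, and this is not necessarily the code the patch uses.

```cpp
#include <string.h>

#include <cassert>
#include <cerrno>
#include <string>

namespace sketch {

// XSI variant: strerror_r() returns 0 on success and fills `buf`.
inline const char* PickMessage(int rc, const char* buf) {
  return rc == 0 ? buf : "unknown error";
}

// GNU variant: strerror_r() returns a pointer to the message, which may or
// may not be `buf`.
inline const char* PickMessage(const char* msg, const char* /*buf*/) {
  return msg;
}

// Thread-safe errno-to-string conversion; the overload is selected at
// compile time from the return type of strerror_r() on this platform.
inline std::string ErrnoToString(int err) {
  char buf[256];
  buf[0] = '\0';
  return std::string(PickMessage(strerror_r(err, buf, sizeof(buf)), buf));
}

// Usage sketch: a known errno value yields a non-empty message.
inline void DemoErrnoToString() {
  assert(!ErrnoToString(ENOENT).empty());
}

}  // namespace sketch
```

Because the result is copied into a `std::string` before the local buffer goes out of scope, the caller never holds a pointer into shared or stack storage.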
Connor	f06b761185	Fix unexpected compaction error for compact files (#8024 ) Summary: When doing CompactFiles on files from multiple levels (num_level > 2) with L0 included, the compaction would fail like this. ![image](https://user-images.githubusercontent.com/13497871/109975371-8b601280-7d35-11eb-830f-f732dc1f9246.png) The reason is that `VerifyCompactionFileConsistency` checks that the levels between L0 and the base level are empty, but it wrongly regards a compaction triggered by `CompactFiles` as an L0 -> base level compaction. The condition was committed several years ago and is no longer correct. ```c++ if (vstorage->compaction_style_ == kCompactionStyleLevel && c->start_level() == 0 && c->num_input_levels() > 2U) ``` So this PR just deletes the incorrect check. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8024 Test Plan: make check Reviewed By: jay-zhuang Differential Revision: D26907060 Pulled By: ajkr fbshipit-source-id: 538cef32faf464cd422e3f8de236ea3e58880c2b	2021-03-24 21:18:03 -07:00
Akanksha Mahajan	41e554da2b	Fix Race condition in db_sst_test (#8092 ) Summary: Fix race condition in DBSSTTest.DBWithMaxSpaceAllowedWithBlobFiles where background flush thread updates delete_blob_file but in test thread Flush() already completes after getting bg_error and delete_blob_file remains false. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8092 Test Plan: Ran ASAN job few times on CircleCI Reviewed By: riversand963 Differential Revision: D27275815 Pulled By: akankshamahajan15 fbshipit-source-id: 2939ad1671403881573bbe07c71aa474c5019130	2021-03-23 17:38:52 -07:00
Yanqin Jin	9f7c02dad5	Move compacted_db_impl.[c\|h] to db/db_impl (#8082 ) Summary: As title. All core db implementations should stay in db_impl. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8082 Test Plan: make check Reviewed By: ajkr Differential Revision: D27211442 Pulled By: riversand963 fbshipit-source-id: e0953fde75064740e899aaff7989ff033b7f5232	2021-03-23 13:49:26 -07:00
storagezhang	c8b0842bcd	Remove unused variable (#8067 ) Summary: Remove unused variable `Slice blob_to_write` in `db/blob/blob_file_cache_test.cc`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8067 Reviewed By: zhichao-cao Differential Revision: D27107693 Pulled By: riversand963 fbshipit-source-id: 9bfd4d296a6a1714ad5c1fa5bb231a0c52dbd56d	2021-03-19 12:13:59 -07:00
storagezhang	d9be6556aa	Include C++ standard library headers instead of C compatibility headers (#8068 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8068 Reviewed By: zhichao-cao Differential Revision: D27147685 Pulled By: riversand963 fbshipit-source-id: 5428b1c0142ecae17c977fba31a6d49b52983d1c	2021-03-19 12:09:47 -07:00
storagezhang	c706324208	Add default in switch (#8065 ) Summary: switch may not cover all branch in `db/c.cc`: ```c++ void rocksdb_options_set_access_hint_on_compaction_start( rocksdb_options_t* opt, int v) { switch(v) { case 0: opt->rep.access_hint_on_compaction_start = ROCKSDB_NAMESPACE::Options::NONE; break; case 1: opt->rep.access_hint_on_compaction_start = ROCKSDB_NAMESPACE::Options::NORMAL; break; case 2: opt->rep.access_hint_on_compaction_start = ROCKSDB_NAMESPACE::Options::SEQUENTIAL; break; case 3: opt->rep.access_hint_on_compaction_start = ROCKSDB_NAMESPACE::Options::WILLNEED; break; } } ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/8065 Reviewed By: riversand963 Differential Revision: D27102892 Pulled By: zhichao-cao fbshipit-source-id: ad1d20d192712878e61597311ba75b55df0066d7	2021-03-19 11:57:52 -07:00
Zhichao Cao	dd0447ae2c	Add new Append API with DataVerificationInfo to Env WritableFile (#8071 ) Summary: Add the new Append and PositionedAppend API to env WritableFile. User is able to benefit from the write checksum handoff API when using the legacy Env classes. FileSystem already implemented the checksum handoff API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8071 Test Plan: make check, added new unit test. Reviewed By: anand1976 Differential Revision: D27177043 Pulled By: zhichao-cao fbshipit-source-id: 430c8331fc81099fa6d00f4fff703b68b9e8080e	2021-03-19 11:44:13 -07:00
Yanqin Jin	7ee41a5d25	Fix a test failure when built with ASSERT_STATUS_CHECKED=1 (#8075 ) Summary: As title. Test plan ASSERT_STATUS_CHECKED=1 make -j20 backupable_db_test error_handler_fs_test ./backupable_db_test ./error_handler_fs_test Pull Request resolved: https://github.com/facebook/rocksdb/pull/8075 Reviewed By: zhichao-cao Differential Revision: D27173832 Pulled By: riversand963 fbshipit-source-id: 37dac50f7c89127804ff2572abddd4174642de30	2021-03-18 21:52:48 -07:00
Zhichao Cao	c810947184	Separate handling of WAL Sync io error with SST flush io error (#8049 ) Summary: In the previous codebase, if the WAL is used, all retryable IO errors are treated as hard errors, so writes are stalled. In this PR, the retryable IO error from WAL sync is separated from the SST file flush IO error. If WAL sync is OK and a retryable IO error only happens during SST flush, the error is mapped to a soft error, so the user can continue inserting into the memtable and appending to the WAL. This also resolves the bug that, if WAL sync fails, the memtable status is not rolled back because PickMemtable is called earlier than calling and checking SyncClosedLog. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8049 Test Plan: added new unit test, make check Reviewed By: anand1976 Differential Revision: D26965529 Pulled By: zhichao-cao fbshipit-source-id: f5fecb66602212523c92ee49d7edcb6065982410	2021-03-18 14:33:16 -07:00
Peter Dillinger	e7a60d01b2	Revamp WriteController (#8064 ) Summary: WriteController had a number of issues: * It could introduce a delay of 1ms even if the write rate never exceeded the configured delayed_write_rate. * The DB-wide delayed_write_rate could be exceeded in a number of ways with multiple column families: * Wiping all pending delay "debts" when another column family joins the delay with GetDelayToken(). * Resetting last_refill_time_ to (now + sleep amount) means each column family can write with delayed_write_rate for large writes. * Updating bytes_left_ for a partial refill without updating last_refill_time_ would essentially give out random bonuses, especially to medium-sized writes. Now the code is much simpler, with these issues fixed. See comments in the new code and new (replacement) tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8064 Test Plan: new tests, better than old tests Reviewed By: mrambacher Differential Revision: D27064936 Pulled By: pdillinger fbshipit-source-id: 497c23fe6819340b8f3d440bd634d8a2bc47323f	2021-03-18 09:47:31 -07:00
Zhichao Cao	08ec5e7321	Add the statistics and info log for Error handler (#8050 ) Summary: Add statistics and info log for error handler: counters for bg error, bg io error, bg retryable io error, auto resume, auto resume total retry, and auto resume success; histogram for auto resume retry count in each recovery call. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8050 Test Plan: make check and add test to error_handler_fs_test Reviewed By: anand1976 Differential Revision: D26990565 Pulled By: zhichao-cao fbshipit-source-id: 49f71e8ea4e9db8b189943976404205b56ab883f	2021-03-17 22:38:13 -07:00
Akanksha Mahajan	27d57a035e	Use SST file manager to track blob files as well (#8037 ) Summary: Extend support to track blob files in SST File manager. This PR notifies SstFileManager whenever a new blob file is created (via OnAddFile) and whenever an obsolete blob file is deleted (via OnDeleteFile), and deletes files via ScheduleFileDeletion. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8037 Test Plan: Add new unit tests Reviewed By: ltamasi Differential Revision: D26891237 Pulled By: akankshamahajan15 fbshipit-source-id: 04c69ccfda2a73782fd5c51982dae58dd11979b6	2021-03-17 20:44:49 -07:00
Yanqin Jin	0304352882	Fix a bug in key comparison when index type is kBinarySearchWithFirstKey (#8062 ) Summary: When timestamp is enabled, key comparison should take this into account. In `BlockBasedTableReader::Get()`, `BlockBasedTableReader::MultiGet()`, assume the target key is `key`, and the timestamp upper bound is `ts`. The highest key in current block is (key, ts1), while the lowest key in next block is (key, ts2). If ``` ts1 > ts > ts2 ``` then ``` (key, ts1) < (key, ts) < (key, ts2) ``` It can be shown that if `Compare()` is used, then we will mistakenly skip the next block. Instead, we should use `CompareWithoutTimestamp()`. The majority of this PR makes some existing tests in `db_with_timestamp_basic_test.cc` parameterized so that different index types can be tested. A new unit test is also added for more coverage. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8062 Test Plan: make check Reviewed By: ltamasi Differential Revision: D27057557 Pulled By: riversand963 fbshipit-source-id: c1062fa7c159ed600a1ad7e461531d52265021f1	2021-03-15 17:44:52 -07:00
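The ordering argument in the entry above can be checked with a small sketch. It assumes only what the message states: for equal user keys, a larger timestamp sorts first. The type `TsKey` and the free functions in namespace `sketch` are hypothetical and far simpler than RocksDB's `Comparator` interface.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

namespace sketch {

// Hypothetical key: a user key plus a user-defined timestamp.
struct TsKey {
  std::string user_key;
  uint64_t ts;
};

// Compares the user key only and ignores the timestamp.
inline int CompareWithoutTimestamp(const TsKey& a, const TsKey& b) {
  const int r = a.user_key.compare(b.user_key);
  return r < 0 ? -1 : (r > 0 ? 1 : 0);
}

// Full comparison: user key ascending, then timestamp descending.
inline int Compare(const TsKey& a, const TsKey& b) {
  const int r = CompareWithoutTimestamp(a, b);
  if (r != 0) return r;
  if (a.ts > b.ts) return -1;
  if (a.ts < b.ts) return 1;
  return 0;
}

// Usage sketch with ts1 > ts > ts2, as in the commit message.
inline void DemoTimestampOrdering() {
  const TsKey last_in_block{"key", 30};   // (key, ts1)
  const TsKey target{"key", 20};          // (key, ts)
  const TsKey first_in_next{"key", 10};   // (key, ts2)
  // With timestamps, the target falls strictly between the two blocks.
  assert(Compare(last_in_block, target) < 0);
  assert(Compare(target, first_in_next) < 0);
  // Without timestamps, all three refer to the same user key.
  assert(CompareWithoutTimestamp(target, first_in_next) == 0);
}

}  // namespace sketch
```

The full comparison places the target strictly before the first key of the next block even though that block holds an older version of the same user key, which is why a lookup has to compare user keys without the timestamp when deciding whether the next block is relevant.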
Yanqin Jin	85d4f2c8b3	Move a test file to a better location (#8054 ) Summary: As title. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8054 Test Plan: make check Reviewed By: mrambacher Differential Revision: D27017955 Pulled By: riversand963 fbshipit-source-id: 829497d507bc89afbe982f8a8cf3555e52fd7098	2021-03-15 15:03:27 -07:00
mrambacher	3dff28cf9b	Use SystemClock* instead of std::shared_ptr<SystemClock> in lower level routines (#8033 ) Summary: For performance purposes, the lower level routines were changed to use a SystemClock* instead of a std::shared_ptr<SystemClock>. The shared ptr has some performance degradation on certain hardware classes. For most of the system, there is no risk of the pointer being deleted/invalid because the shared_ptr will be stored elsewhere. For example, the ImmutableDBOptions stores the Env which has a std::shared_ptr<SystemClock> in it. The SystemClock* within the ImmutableDBOptions is essentially a "short cut" to gain access to this constant resource. There were a few classes (PeriodicWorkScheduler?) where the "short cut" property did not hold. In those cases, the shared pointer was preserved. Using db_bench readrandom perf_level=3 on my EC2 box, this change performed as well or better than 6.17: 6.17: readrandom : 28.046 micros/op 854902 ops/sec; 61.3 MB/s (355999 of 355999 found) 6.18: readrandom : 32.615 micros/op 735306 ops/sec; 52.7 MB/s (290999 of 290999 found) PR: readrandom : 27.500 micros/op 871909 ops/sec; 62.5 MB/s (367999 of 367999 found) (Note that the times for 6.18 are prior to revert of the SystemClock). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8033 Reviewed By: pdillinger Differential Revision: D27014563 Pulled By: mrambacher fbshipit-source-id: ad0459eba03182e454391b5926bf5cdd45657b67	2021-03-15 04:34:11 -07:00
Andrew Kryczka	b8f40f7f7b	Deflake tests of compaction based on compensated file size (#8036 ) Summary: CompactionDeletionTriggerReopen was observed to be flaky recently: https://app.circleci.com/pipelines/github/facebook/rocksdb/6030/workflows/787af4f3-b9f7-4645-8e8d-1fb0ebf05539/jobs/101451. I went through it and the related tests and arrived at different conclusions on what constraints we can expect on DB size. Some constraints got looser and some got tighter. The particular constraint that flaked got a lot looser so at least the flake linked above would have been prevented. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8036 Reviewed By: riversand963 Differential Revision: D26862566 Pulled By: ajkr fbshipit-source-id: 3512b86b4fb41aeecae32e1c7382c03916d88d88	2021-03-14 20:25:42 -07:00
Levi Tamasi	b708b166dc	Fix a harmless data race affecting two test cases (#8055 ) Summary: `DBTest.GetLiveBlobFiles` and `ObsoleteFilesTest.BlobFiles` both modify the current `Version` in their setup phase, implicitly assuming that no other threads would touch the `Version` while this is happening. The periodic stats dumper thread violates this assumption; the patch fixes this by disabling it in the affected test cases. (Note: the data race is harmless in the sense that it only affects test code.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/8055 Test Plan: ``` COMPILE_WITH_TSAN=1 make db_test -j24 gtest-parallel --repeat=10000 ./db_test --gtest_filter="GetLiveBlobFiles" COMPILE_WITH_TSAN=1 make obsolete_files_test -j24 gtest-parallel --repeat=10000 ./obsolete_files_test --gtest_filter="BlobFiles" ``` Reviewed By: riversand963 Differential Revision: D27022715 Pulled By: ltamasi fbshipit-source-id: b6cc77ed63d8bc1cbe0603522ff1a572182fc9ab	2021-03-12 16:44:35 -08:00
Peter Dillinger	119dda2195	Instantiate tests DBIteratorTestForPinnedData (#8051 ) Summary: a trial gtest upgrade discovered some parameterized tests missing instantiation. By some miracle, they still pass. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8051 Test Plan: thisisthetest Reviewed By: mrambacher Differential Revision: D27003684 Pulled By: pdillinger fbshipit-source-id: cde1cab1551fb282f67d462d46574bd30bd5e61f	2021-03-12 12:31:29 -08:00
Yanqin Jin	82b3888433	Enable backward iterator for keys with user-defined timestamp (#8035 ) Summary: This PR does the following: - Enable backward iteration for keys with user-defined timestamp. Note that merge, single delete, range delete are not supported yet. - Introduces a new helper API `Comparator::EqualWithoutTimestamp()`. - Fix a typo in `SetTimestamp()`. - Add/update unit tests Run db_bench (built with DEBUG_LEVEL=0) to demonstrate that no overhead is introduced for CPU-intensive workloads with a lot of `Prev()`. Also provided results of iterating keys with timestamps. 1. Disable timestamp, run: ``` ./db_bench -db=/dev/shm/rocksdb -disable_wal=1 -benchmarks=fillseq,seekrandom[-W1-X6] -reverse_iterator=1 -seek_nexts=5 ``` Results: > Baseline > - seekrandom [AVG 6 runs] : 96115 ops/sec; 53.2 MB/sec > - seekrandom [MEDIAN 6 runs] : 98075 ops/sec; 54.2 MB/sec > > This PR > - seekrandom [AVG 6 runs] : 95521 ops/sec; 52.8 MB/sec > - seekrandom [MEDIAN 6 runs] : 96338 ops/sec; 53.3 MB/sec 2. Enable timestamp, run: ``` ./db_bench -user_timestamp_size=8 -db=/dev/shm/rocksdb -disable_wal=1 -benchmarks=fillseq,seekrandom[-W1-X6] -reverse_iterator=1 -seek_nexts=5 ``` Result: > Baseline: not supported > > This PR > - seekrandom [AVG 6 runs] : 90514 ops/sec; 50.1 MB/sec > - seekrandom [MEDIAN 6 runs] : 90834 ops/sec; 50.2 MB/sec Pull Request resolved: https://github.com/facebook/rocksdb/pull/8035 Reviewed By: ltamasi Differential Revision: D26926668 Pulled By: riversand963 fbshipit-source-id: 95330cc2242397c03e09d29e5417dfb0adc98ef5	2021-03-10 11:15:46 -08:00
Yanqin Jin	64517d184a	Make secondary instance use ManifestTailer (#7998 ) Summary: This PR - adds a class `ManifestTailer` that inherits from `VersionEditHandlerPointInTime`. `ManifestTailer::Iterate()` can be called multiple times to tail the primary instance's MANIFEST and apply the changes to the secondary, - updates the implementation of `ReactiveVersionSet::ReadAndApply` to use this class, - removes unused code in version_set.cc, - updates existing tests, e.g. removing deleted sync points from unit tests, - adds a new test to address the bug in https://github.com/facebook/rocksdb/issues/7815. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7998 Test Plan: make check Existing and newly-added tests in version_set_test.cc and db_secondary_test.cc Reviewed By: jay-zhuang Differential Revision: D26926641 Pulled By: riversand963 fbshipit-source-id: 8d4dd15db0ba863c213f743e33b5a207e948c980	2021-03-10 10:59:44 -08:00
qinzuoyan	6fad38ebe8	Fix compile error (#7908 ) Summary: OS: Ubuntu 14.04 Compiler: GCC 4.9.4 Compile error: ``` db/forward_iterator.cc:996:62: error: declaration of ‘key’ shadows a member of 'this' [-Werror=shadow] auto cmp = [&](const FileMetaData* f, const Slice& key) -> bool { ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/7908 Reviewed By: jay-zhuang Differential Revision: D26899986 Pulled By: ajkr fbshipit-source-id: 66b0b97aefd0f13a085e063491f8207366a9f848	2021-03-09 20:53:33 -08:00
Ed rodriguez	7381dad1b1	make:Fix c header prototypes (#7994 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7994 Reviewed By: jay-zhuang Differential Revision: D26904603 Pulled By: ajkr fbshipit-source-id: 0af92a51de895b40c7faaa4f0870b3f63279fe21	2021-03-09 20:44:23 -08:00
Peter Dillinger	0028e3398b	Make format_version=5 new default (#8017 ) Summary: Haven't seen any production issues with new Bloom filter and it's now > 1 year old (added in 6.6.0). Updated check_format_compatible.sh and HISTORY.md Pull Request resolved: https://github.com/facebook/rocksdb/pull/8017 Test Plan: tests updated (or prior bugs fixed) Reviewed By: ajkr Differential Revision: D26762197 Pulled By: pdillinger fbshipit-source-id: 0e755c46b443087c1544da0fd545beb9c403d1c2	2021-03-09 12:42:53 -08:00
fanrui03	67d72fb5dc	Fix checkpoint stuck (#7921 ) Summary: ## 1. Bug description: When RocksDB creates a Checkpoint, it may get stuck in the `WaitUntilFlushWouldNotStallWrites` method. ## 2. Simple analysis of the reasons: ### 2.1 Configuration parameters: ```yaml Compaction Style : Universal max_write_buffer_number : 4 min_write_buffer_number_to_merge : 3 ``` Checkpoint is usually very fast. When the Checkpoint is executed, `WaitUntilFlushWouldNotStallWrites` is called. If there are 2 Immutable MemTables, which is fewer than `min_write_buffer_number_to_merge`, they will not be flushed, but execution will still enter this code. ```c++ // method: GetWriteStallConditionAndCause if (mutable_cf_options.max_write_buffer_number > 3 && num_unflushed_memtables >= mutable_cf_options.max_write_buffer_number - 1) { return {WriteStallCondition::kDelayed, WriteStallCause::kMemtableLimit}; } ``` code link: `fbed72f03c/db/column_family.cc (L847)` Checkpoint assumes there is a FlushJob, but there is none, so it waits forever. ### 2.2 Solution: Tighten the condition: wait only when the number of Immutable MemTables >= `min_write_buffer_number_to_merge`. If there are better solutions, please correct me. ### 2.3 Code that can reproduce the problem: https://github.com/1996fanrui/fanrui-learning/blob/flink-1.12/module-java/src/main/java/com/dream/rocksdb/RocksDBCheckpointStuck.java ## 3. Interesting point This bug is triggered only when `the number of sorted runs >= level0_file_num_compaction_trigger`, because there is a break in WaitUntilFlushWouldNotStallWrites. ```c++ if (cfd->imm()->NumNotFlushed() < cfd->ioptions()->min_write_buffer_number_to_merge && vstorage->l0_delay_trigger_count() < mutable_cf_options.level0_file_num_compaction_trigger) { break; } ``` code link: `fbed72f03c/db/db_impl/db_impl_compaction_flush.cc (L1974)` Universal may have `l0_delay_trigger_count() >= level0_file_num_compaction_trigger`, so this bug is triggered. 
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7921 Reviewed By: jay-zhuang Differential Revision: D26900559 Pulled By: ajkr fbshipit-source-id: 133c1252dad7393753f04a47590b68c7d8e670df	2021-03-09 02:21:25 -08:00
Andrew Kryczka	0ff0b625a1	Deflake DBTest2.PartitionedIndexUserToInternalKey on ppc64le (#8044 ) Summary: For some reason I still cannot figure out, the manual flush in this test was sometimes producing a third tiny file. I saw it a bunch of times on ppc64le, but even running a qemu system with that architecture (and playing with various other options) could not repro. However we did get an instrumented Travis run to confirm the problem is indeed a third tiny file - https://travis-ci.org/github/facebook/rocksdb/jobs/761986592. We can avoid it by filling memtables less full and using manual flush. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8044 Reviewed By: akankshamahajan15 Differential Revision: D26892635 Pulled By: ajkr fbshipit-source-id: 775c04176931cf01d07cc78fb82cfe3a11beebcf	2021-03-08 14:47:56 -08:00
Levi Tamasi	cb25bc1128	Update compaction statistics to include the amount of data read from blob files (#8022 ) Summary: The patch does the following: 1) Exposes the amount of data (number of bytes) read from blob files from `BlobFileReader::GetBlob` / `Version::GetBlob`. 2) Tracks the total number and size of blobs read from blob files during a compaction (due to garbage collection or compaction filter usage) in `CompactionIterationStats` and propagates this data to `InternalStats::CompactionStats` / `CompactionJobStats`. 3) Updates the formulae for write amplification calculations to include the amount of data read from blob files. 4) Extends the compaction stats dump with a new column `Rblob(GB)` and a new line containing the total number and size of blob files in the current `Version` to complement the information about the shape and size of the LSM tree that's already there. 5) Updates `CompactionJobStats` so that the number of files and amount of data written by a compaction are broken down per file type (i.e. table/blob file). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8022 Test Plan: Ran `make check` and `db_bench`. Reviewed By: riversand963 Differential Revision: D26801199 Pulled By: ltamasi fbshipit-source-id: 28a5f072048a702643b28cb5971b4099acabbfb2	2021-03-04 00:43:48 -08:00
Yanqin Jin	72d1e258cd	Possibly bump NUMBER_OF_RESEEKS_IN_ITERATION (#8015 ) Summary: When changing db iterator direction, we may perform a reseek. Therefore, we should bump the NUMBER_OF_RESEEKS_IN_ITERATION counter. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8015 Test Plan: make check Reviewed By: ltamasi Differential Revision: D26755415 Pulled By: riversand963 fbshipit-source-id: 211f51f1a454bcda768fc46c0dce51edeb7f05fe	2021-03-02 22:41:04 -08:00
Levi Tamasi	a46f080cce	Break down the amount of data written during flushes/compactions per file type (#8013 ) Summary: The patch breaks down the "bytes written" (as well as the "number of output files") compaction statistics into two, so the values are logged separately for table files and blob files in the info log, and are shown in separate columns (`Write(GB)` for table files, `Wblob(GB)` for blob files) when the compaction statistics are dumped. This will also come in handy for fixing the write amplification statistics, which currently do not consider the amount of data read from blob files during compaction. (This will be fixed by an upcoming patch.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/8013 Test Plan: Ran `make check` and `db_bench`. Reviewed By: riversand963 Differential Revision: D26742156 Pulled By: ltamasi fbshipit-source-id: 31d18ee8f90438b438ca7ed1ea8cbd92114442d5	2021-03-02 09:48:00 -08:00
Akanksha Mahajan	f19612970d	Support retrieving checksums for blob files from the MANIFEST when checkpointing (#8003 ) Summary: The checkpointing logic supports passing file level checksums to the copy_file_cb callback function which is used by the backup code for detecting corruption during file copies. However, this is currently implemented only for table files. This PR extends the checksum retrieval to blob files as well. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8003 Test Plan: Add new test units Reviewed By: ltamasi Differential Revision: D26680701 Pulled By: akankshamahajan15 fbshipit-source-id: 1bd1e2464df6e9aa31091d35b8c72786d94cd1c5	2021-03-01 20:07:07 -08:00
Yanqin Jin	c370d8aa12	Remove unused/incorrect fwd declaration (#8002 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8002 Reviewed By: anand1976 Differential Revision: D26659354 Pulled By: riversand963 fbshipit-source-id: 6b464dbea9fd8240ead8cc5af393f0b78e8f9dd1	2021-02-25 23:07:31 -08:00
Yanqin Jin	cef4a6c49f	Compaction filter support for (new) BlobDB (#7974 ) Summary: Allow applications to implement a custom compaction filter and pass it to BlobDB. The compaction filter's custom logic can operate on blobs. To do so, the application needs to subclass the `CompactionFilter` abstract class and implement the `FilterV2()` method. Optionally, a method called `ShouldFilterBlobByKey()` can be implemented if the application's custom logic relies solely on the key to make a decision without reading the blob, thus saving extra IO. Examples can be found in db/blob/db_blob_compaction_test.cc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7974 Test Plan: make check Reviewed By: ltamasi Differential Revision: D26509280 Pulled By: riversand963 fbshipit-source-id: 59f9ae5614c4359de32f4f2b16684193cc537b39	2021-02-25 16:32:35 -08:00
sherriiiliu	e017af15c1	Fix testcase failures on windows (#7992 ) Summary: Fixed 5 test case failures found on Windows 10/Windows Server 2016 1. In `flush_job_test`, the DestroyDir function fails in the destructor because some file handles are still being held by VersionSet. This happens on Windows Server 2016, so we need to manually reset the versions_ pointer to release all file handles. 2. In `StatsHistoryTest.InMemoryStatsHistoryPurging` test, the capping memory cost of stats_history_size on Windows becomes 14000 bytes with latest changes, not just 13000 bytes. 3. In `SSTDumpToolTest.RawOutput` test, the output file handle is not closed at the end. 4. In `FullBloomTest.OptimizeForMemory` test, ROCKSDB_MALLOC_USABLE_SIZE is undefined on Windows so `total_mem` is always equal to `total_size`. The internal memory fragmentation assertion does not apply in this case. 5. In `BlockFetcherTest.FetchAndUncompressCompressedDataBlock` test, XPRESS cannot reach 87.5% compression ratio with the original CreateTable method, so I append extra zeros to the string value to improve the compression ratio. Besides, since XPRESS allocates memory internally and thus does not support custom allocator verification, we skip the allocator verification for XPRESS. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7992 Reviewed By: jay-zhuang Differential Revision: D26615283 Pulled By: ajkr fbshipit-source-id: 3632612f84b99e2b9c77c403b112b6bedf3b125d	2021-02-23 14:35:06 -08:00
Akanksha Mahajan	46cf5fbfdd	Extend VerifyFileChecksums API for blob files (#7979 ) Summary: Extend VerifyFileChecksums API to verify blob files in case of use_file_checksum. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7979 Test Plan: New unit test db_blob_corruption_test Reviewed By: ltamasi Differential Revision: D26534040 Pulled By: akankshamahajan15 fbshipit-source-id: 7dc5951a3df9d265ea1265e0122b43c966856ade	2021-02-22 22:09:22 -08:00
Andrew Kryczka	d904233d2f	Limit buffering for collecting samples for compression dictionary (#7970 ) Summary: For dictionary compression, we need to collect some representative samples of the data to be compressed, which we use to either generate or train (when `CompressionOptions::zstd_max_train_bytes > 0`) a dictionary. Previously, the strategy was to buffer all the data blocks during flush, and up to the target file size during compaction. That strategy allowed us to randomly pick samples from as wide a range as possible that'd be guaranteed to land in a single output file. However, some users try to make huge files in memory-constrained environments, where this strategy can cause OOM. This PR introduces an option, `CompressionOptions::max_dict_buffer_bytes`, that limits how much data blocks are buffered before we switch to unbuffered mode (which means creating the per-SST dictionary, writing out the buffered data, and compressing/writing new blocks as soon as they are built). It is not strict as we currently buffer more than just data blocks -- also keys are buffered. But it does make a step towards giving users predictable memory usage. Related changes include: - Changed sampling for dictionary compression to select unique data blocks when there is limited availability of data blocks - Made use of `BlockBuilder::SwapAndReset()` to save an allocation+memcpy when buffering data blocks for building a dictionary - Changed `ParseBoolean()` to accept an input containing characters after the boolean. This is necessary since, with this PR, a value for `CompressionOptions::enabled` is no longer necessarily the final component in the `CompressionOptions` string. 
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7970 Test Plan: - updated `CompressionOptions` unit tests to verify limit is respected (to the extent expected in the current implementation) in various scenarios of flush/compaction to bottommost/non-bottommost level - looked at jemalloc heap profiles right before and after switching to unbuffered mode during flush/compaction. Verified memory usage in buffering is proportional to the limit set. Reviewed By: pdillinger Differential Revision: D26467994 Pulled By: ajkr fbshipit-source-id: 3da4ef9fba59974e4ef40e40c01611002c861465	2021-02-19 14:09:54 -08:00
mrambacher	4bc9df9459	Fix handling of Mutable options; Allow DB::SetOptions to update mutable TableFactory Options (#7936 ) Summary: Added a "only_mutable_options" flag to the ConfigOptions. When set, the Configurable methods will only look at/update options that are marked as kMutable. Fixed DB::SetOptions to allow for the update of any mutable TableFactory options. Fixes https://github.com/facebook/rocksdb/issues/7385. Added tests for the new flag. Updated HISTORY.md Pull Request resolved: https://github.com/facebook/rocksdb/pull/7936 Reviewed By: akankshamahajan15 Differential Revision: D26389646 Pulled By: mrambacher fbshipit-source-id: 6dc247f6e999fa2814059ebbd0af8face109fea0	2021-02-19 10:29:02 -08:00
Zhichao Cao	b0fd1cc45a	Introduce a new trace file format (v 0.2) for better extension (#7977 ) Summary: The trace file record and payload encoding is fixed, which requires complex backward compatibility handling. This PR introduces a new trace file format, which makes it easier to add new entries to the payload and does not have backward compatibility issues. V 0.1 is still supported in this PR. Added the tracing for lower_bound and upper_bound for iterator. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7977 Test Plan: make check. tested with old trace file in replay and analyzing. Reviewed By: anand1976 Differential Revision: D26529948 Pulled By: zhichao-cao fbshipit-source-id: ebb75a127ce3c07c25a1ccc194c551f917896a76	2021-02-18 23:05:35 -08:00
Jay Zhuang	59ba104e4a	Fix txn `MultiGet()` return un-committed data with snapshot (#7963 ) Summary: TransactionDB uses read callback to filter out un-committed data before a snapshot. But `MultiGet()` API doesn't use that at all, which causes returning unwanted data. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7963 Test Plan: Added unittest to reproduce Reviewed By: anand1976 Differential Revision: D26455851 Pulled By: jay-zhuang fbshipit-source-id: 265276698cf9d8c4cd79e3250ef10d14375bac55	2021-02-18 08:49:00 -08:00
Akanksha Mahajan	6a85aea5b1	Bug fix for status overridden by Status::NotFound in db_impl_readonly (#7972 ) Summary: Bug fix for status returned being overridden by Status::NotFound in DBImpl::OpenForReadOnlyCheckExistence. This was causing some service owners to misinterpret the actual error and take appropriate steps. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7972 Reviewed By: riversand963 Differential Revision: D26499598 Pulled By: akankshamahajan15 fbshipit-source-id: 05e9fedbe2a2e0e53135760f8ff578a2816d2b8e	2021-02-17 19:35:57 -08:00
Akanksha Mahajan	ea8bb82fc7	Add support for IOTracing in blob files (#7958 ) Summary: Add support for IOTracing in blob files Pull Request resolved: https://github.com/facebook/rocksdb/pull/7958 Test Plan: Add a new test and checked manually the trace_file for blob files being recorded during read and write. Reviewed By: ltamasi Differential Revision: D26415950 Pulled By: akankshamahajan15 fbshipit-source-id: 49c2859b3a4f8307e7cb69a92704403a4da46d44	2021-02-16 09:49:10 -08:00
Jay Zhuang	9df78a94f1	Disable flaky error_handler_fs_test that could hang (#7964 ) Summary: The test hangs on `95013df278/db/error_handler_fs_test.cc (L947)`. It seems db.mutex_ is locked twice in the test: `cf160b98e1/db/db_impl/db_impl_compaction_flush.cc (L3208)` `0a9a05ae12/db/db_impl/db_impl.cc (L469)` As it's just a test issue, disable it for now until the test is fixed. The hang could be reproduced by: `gtest-parallel ./error_handler_fs_test --gtest_filter=DBErrorHandlingFSTest.CompactionWriteFileScopeError -r 1000` Pull Request resolved: https://github.com/facebook/rocksdb/pull/7964 Reviewed By: zhichao-cao Differential Revision: D26447325 Pulled By: jay-zhuang fbshipit-source-id: 72f6a346458e059d10e9cc3347bd6bde040cf89e	2021-02-15 09:45:23 -08:00
Zhichao Cao	d1c510baec	Handoff checksum Implementation (#7523 ) Summary: In PR https://github.com/facebook/rocksdb/issues/7419 , we introduced the new Append and PositionedAppend APIs to WritableFile at File System, which enable RocksDB to pass the data verification information (e.g., checksum of the data) to the lower layer. In this PR, we use the new API in WritableFileWriter, such that the file created via WritableFileWriter can pass the checksum to the storage layer. To control which file types should apply the checksum handoff, we add checksum_handoff_file_types to DBOptions. Users can use this option to control which file types (currently supported file types: kLogFile, kTableFile, kDescriptorFile) should use the new Append and PositionedAppend APIs to hand off the verification information. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7523 Test Plan: add new unit test, pass make check/ make asan_check Reviewed By: pdillinger Differential Revision: D24313271 Pulled By: zhichao-cao fbshipit-source-id: aafd69091ae85c3318e3e17cbb96fe7338da11d0	2021-02-10 22:20:32 -08:00
Jay Zhuang	cf160b98e1	Add full_history_ts_low option to compaction (#7884 ) Summary: The full_history_ts_low is used for user-defined timestamp GC compaction, which is introduced in https://github.com/facebook/rocksdb/issues/7740, https://github.com/facebook/rocksdb/issues/7657 and https://github.com/facebook/rocksdb/issues/7655. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7884 Reviewed By: ltamasi Differential Revision: D25982553 Pulled By: jay-zhuang fbshipit-source-id: 36303d412d65b5d8166b6da24fa21ad85adbabee	2021-02-08 13:45:48 -08:00
Levi Tamasi	974458891c	Revert "Turn on memtable bloom filter by default. (#6584 )" (#7939 ) Summary: This reverts commit `ee79a28963`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7939 Reviewed By: siying Differential Revision: D26298564 Pulled By: ltamasi fbshipit-source-id: 6d663516e82e6de436f8d5317932ca9a98e152bd	2021-02-06 22:34:30 -08:00
sdong	ee79a28963	Turn on memtable bloom filter by default. (#6584 ) Summary: Memtable bloom filter is useful in many use cases. A default value on with conservative 1.5% memory can benefit more use cases than use cases impacted. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6584 Test Plan: Run all existing tests. Reviewed By: pdillinger Differential Revision: D20626739 fbshipit-source-id: 1dd45532b932139552519b8c2682bd954550c2f9	2021-02-05 12:59:46 -08:00
Stanislav Tkach	3feee6db17	Add get/set deadline and io_timeout C functions (read options) (#7914 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7914 Reviewed By: jay-zhuang Differential Revision: D26184409 Pulled By: ajkr fbshipit-source-id: 8e30faac5223ec80c22e2b617af67775322065d8	2021-02-04 17:00:58 -08:00
Levi Tamasi	e5311a8ea4	Fix a SingleDelete related optimization for blob indexes (#7904 ) Summary: There is a small `SingleDelete` related optimization in the `CompactionIterator` code: when a `SingleDelete`-`Put` pair is preserved solely for the purposes of transaction conflict checking, the value itself gets cleared. (This is referred to as "optimization 3" in the `CompactionIterator` code.) Though the rest of the code got updated to support `SingleDelete`'ing blob indexes, this chunk was apparently missed, resulting in an assertion failure (or `ROCKS_LOG_FATAL` in release builds) when triggered. Note: in addition to clearing the value, we also need to update the type of the KV to regular value when dealing with blob indexes here. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7904 Test Plan: `make check` Reviewed By: ajkr Differential Revision: D26118009 Pulled By: ltamasi fbshipit-source-id: 6bf78043d20265e2b15c2e1ab8865025040c42ae	2021-01-29 12:41:25 -08:00

4586 Commits