rocksdb

Author	SHA1	Message	Date
Andrew Kryczka	0f42e50fec	Fix `GetLiveFiles()` returning OPTIONS-000000 (#8268 ) Summary: See release note in HISTORY.md. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8268 Test Plan: unit test repro Reviewed By: siying Differential Revision: D28227901 Pulled By: ajkr fbshipit-source-id: faf61d13b9e43a761e3d5dcf8203923126b51339	2021-05-05 12:54:46 -07:00
Peter Dillinger	3b981eaa1d	Fix use-after-free threading bug in ClockCache (#8261 ) Summary: In testing for https://github.com/facebook/rocksdb/issues/8225 I found cache_bench would crash with -use_clock_cache, as well as db_bench -use_clock_cache, but not single-threaded. Smaller cache size hits failure much faster. ASAN reported the failure as calling malloc_usable_size on the `key` pointer of a ClockCache handle after it was reportedly freed. On detailed inspection I found this bad sequence of operations for a cache entry: state=InCache=1,refs=1 [thread 1] Start ClockCacheShard::Unref (from Release, no mutex) [thread 1] Decrement ref count state=InCache=1,refs=0 [thread 1] Suspend before CalcTotalCharge (no mutex) [thread 2] Start UnsetInCache (from Insert, mutex held) [thread 2] clear InCache bit state=InCache=0,refs=0 [thread 2] Calls RecycleHandle (based on pre-updated state) [thread 2] Returns to Insert which calls Cleanup which deletes `key` [thread 1] Resume ClockCacheShard::Unref [thread 1] Read `key` in CalcTotalCharge To fix this, I've added a field to the handle to store the metadata charge so that we can efficiently remember everything we need from the handle in Unref. We must not read from the handle again if we decrement the count to zero with InCache=1, which means we don't own the entry and someone else could eject/overwrite it immediately. Note before this change, on amd64 sizeof(Handle) == 56 even though there are only 48 bytes of data. Grouping together the uint32_t fields would cut it down to 48, but I've added another uint32_t, which takes it back up to 56. Not a big deal. Also fixed DisownData to cooperate with ASAN as in LRUCache. 
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8261 Test Plan: Manual + adding use_clock_cache to db_crashtest.py Base performance ./cache_bench -use_clock_cache Complete in 17.060 s; QPS = 2458513 New performance ./cache_bench -use_clock_cache Complete in 17.052 s; QPS = 2459695 Any difference is easily buried in small noise. Crash test shows still more bug(s) in ClockCache, so I'm expecting to disable ClockCache from production code in a follow-up PR (if we can't find and fix the bug(s)) Reviewed By: mrambacher Differential Revision: D28207358 Pulled By: pdillinger fbshipit-source-id: aa7a9322afc6f18f30e462c75dbbe4a1206eb294	2021-05-04 22:18:00 -07:00
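The rule behind the ClockCache fix above (copy what you need from a shared entry before the decrement that may give up ownership) can be sketched as follows. This is a minimal illustration, not RocksDB's ClockCache code: the `Entry` type, its fields, and the `Unref` signature are all hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical cache entry; the field names are illustrative only.
struct Entry {
  std::atomic<uint32_t> refs{0};
  std::size_t charge = 0;    // charge for the cached value
  uint32_t meta_charge = 0;  // metadata charge stored in the entry itself
};

// Drops one reference. Everything the caller needs afterwards is copied out
// BEFORE the decrement, because once the count reaches zero another thread
// may recycle the entry and free what it points to.
std::size_t Unref(Entry* e, bool* was_last_ref) {
  const std::size_t total_charge = e->charge + e->meta_charge;  // snapshot
  const uint32_t prev = e->refs.fetch_sub(1, std::memory_order_acq_rel);
  assert(prev > 0);
  *was_last_ref = (prev == 1);
  // Do not read from *e past this point unless the caller still owns it.
  return total_charge;
}
```

The snapshot costs two extra loads on every release, which is consistent with the commit's measurement that any performance difference is lost in noise.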
Andrew Kryczka	c70bae1b05	Fix ConcurrentTaskLimiter token release for shutdown (#8253 ) Summary: Previously the shutdown process did not properly wait for all `compaction_thread_limiter` tokens to be released before proceeding to delete the DB's C++ objects. When this happened, we saw tests like "DBCompactionTest.CompactionLimiter" flake with the following error: ``` virtual rocksdb::ConcurrentTaskLimiterImpl::~ConcurrentTaskLimiterImpl(): Assertion `outstanding_tasks_ == 0' failed. ``` There is a case where a token can still be alive even after the shutdown process has waited for BG work to complete. In particular, this happens because the shutdown process only waits for flush/compaction scheduled/unscheduled counters to all reach zero. These counters are decremented in `BackgroundCallCompaction()` functions. However, tokens are released in `BGWorkCompaction()` functions, which actually wrap the `BackgroundCallCompaction()` function. A simple sleep could repro the race condition: ``` $ diff --git a/db/db_impl/db_impl_compaction_flush.cc b/db/db_impl/db_impl_compaction_flush.cc index 806bc548a..ba59efa89 100644 --- a/db/db_impl/db_impl_compaction_flush.cc +++ b/db/db_impl/db_impl_compaction_flush.cc @@ -2442,6 +2442,7 @@ void DBImpl::BGWorkCompaction(void* arg) { static_cast<PrepickedCompaction*>(ca.prepicked_compaction); static_cast_with_check<DBImpl>(ca.db)->BackgroundCallCompaction( prepicked_compaction, Env::Priority::LOW); + sleep(1); delete prepicked_compaction; } $ ./db_compaction_test --gtest_filter=DBCompactionTest.CompactionLimiter db_compaction_test: util/concurrent_task_limiter_impl.cc:24: virtual rocksdb::ConcurrentTaskLimiterImpl::~ConcurrentTaskLimiterImpl(): Assertion `outstanding_tasks_ == 0' failed. Received signal 6 (Aborted) #0 /usr/local/fbcode/platform007/lib/libc.so.6(gsignal+0xcf) [0x7f02673c30ff] ?? ??:0 #1 /usr/local/fbcode/platform007/lib/libc.so.6(abort+0x134) [0x7f02673ac934] ?? ??:0 ... 
``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/8253 Test Plan: sleeps to expose race conditions Reviewed By: akankshamahajan15 Differential Revision: D28168064 Pulled By: ajkr fbshipit-source-id: 9e5167c74398d323e7975980c5cc00f450631160	2021-05-04 17:27:24 -07:00
Andrew Kryczka	c2a3424de5	Deflake DBTest.L0L1L2AndUpHitCounter (#8259 ) Summary: Previously we saw flakes on platforms like arm on CircleCI, such as the following: ``` Note: Google Test filter = DBTest.L0L1L2AndUpHitCounter [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from DBTest [ RUN ] DBTest.L0L1L2AndUpHitCounter db/db_test.cc:5345: Failure Expected: (TestGetTickerCount(options, GET_HIT_L0)) > (100), actual: 30 vs 100 [ FAILED ] DBTest.L0L1L2AndUpHitCounter (150 ms) [----------] 1 test from DBTest (150 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test case ran. (150 ms total) [ PASSED ] 0 tests. [ FAILED ] 1 test, listed below: [ FAILED ] DBTest.L0L1L2AndUpHitCounter ``` The test was totally non-deterministic, e.g., flush/compaction timing would affect how many files on each level. Furthermore, it depended heavily on platform-specific details, e.g., by having a 32KB memtable, it could become full with a very different number of entries depending on the platform. This PR rewrites the test to build a deterministic LSM with one file per level. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8259 Reviewed By: mrambacher Differential Revision: D28178100 Pulled By: ajkr fbshipit-source-id: 0a03b26e8d23c29d8297c1bccb1b115dce33bdcd	2021-05-04 11:02:59 -07:00
Jay Zhuang	8a92564a82	Update CircleCI MacOS Xcode version to 11.3.0 (#8256 ) Summary: To fix CircleCI pyenv installation failure. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8256 Reviewed By: ajkr Differential Revision: D28191772 Pulled By: jay-zhuang fbshipit-source-id: 2bbb1d5ded473e510c11c8ed27884c4ad073973f	2021-05-04 10:34:31 -07:00
sdong	c3ff14e2c1	Hint temperature of bottommost level files to FileSystem (#8222 ) Summary: As the first part of the effort to place different files on different storage types, this change introduces several things: (1) An experimental interface in FileSystem that specifies the temperature of a newly created file. (2) A test FileSystemWrapper, SimulatedHybridFileSystem, that simulates HDD for a file of "warm" temperature. (3) A simple experimental feature ColumnFamilyOptions.bottommost_temperature. RocksDB would pass this value to FileSystem when creating any bottommost file. (4) A db_bench parameter that applies (2) and (3) to db_bench. The motivation of the change is to introduce minimal changes that allow us to evolve tiered storage development. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8222 Test Plan: ./db_bench --benchmarks=fillrandom --write_buffer_size=2000000 -max_bytes_for_level_base=20000000 -level_compaction_dynamic_level_bytes --reads=100 -compaction_readahead_size=20000000 --reads=100000 -num=10000000 followed by ./db_bench --benchmarks=readrandom,stats --write_buffer_size=2000000 -max_bytes_for_level_base=20000000 -simulate_hybrid_fs_file=/tmp/warm_file_list -level_compaction_dynamic_level_bytes -compaction_readahead_size=20000000 --reads=500 --threads=16 -use_existing_db --num=10000000 and see results as expected. Reviewed By: ajkr Differential Revision: D28003028 fbshipit-source-id: 4724896d5205730227ba2f17c3fecb11261744ce	2021-05-03 13:34:04 -07:00
Peter Dillinger	d2ca04e3ed	Add more LSM info to FilterBuildingContext (#8246 ) Summary: Add `num_levels`, `is_bottommost`, and table file creation `reason` to `FilterBuildingContext`, in anticipation of more powerful Bloom-like filter support. To support this, added `is_bottommost` and `reason` to `TableBuilderOptions`, which allowed removing `reason` parameter from `rocksdb::BuildTable`. I attempted to remove `skip_filters` from `TableBuilderOptions`, because filter construction decisions should arise from options, not one-off parameters. I could not completely remove it because the public API for SstFileWriter takes a `skip_filters` parameter, and translating this into an option change would mean awkwardly replacing the table_factory if it is BlockBasedTableFactory with new filter_policy=nullptr option. I marked this public skip_filters option as deprecated because of this oddity. (skip_filters on the read side probably makes sense.) At least `skip_filters` is now largely hidden for users of `TableBuilderOptions` and is no longer used for implementing the optimize_filters_for_hits option. Bringing the logic for that option closer to handling of FilterBuildingContext makes it more obvious that these two are using the same notion of "bottommost." (Planned: configuration options for Bloom-like filters that generalize `optimize_filters_for_hits`) Recommended follow-up: Try to get away from "bottommost level" naming of things, which is inaccurate (see VersionStorageInfo::RangeMightExistAfterSortedRun), and move to "bottommost run" or just "bottommost." Pull Request resolved: https://github.com/facebook/rocksdb/pull/8246 Test Plan: extended an existing unit test to exercise and check various filter building contexts. 
Also, existing tests for optimize_filters_for_hits validate some of the "bottommost" handling, which is now closely connected to FilterBuildingContext::is_bottommost through TableBuilderOptions::is_bottommost Reviewed By: mrambacher Differential Revision: D28099346 Pulled By: pdillinger fbshipit-source-id: 2c1072e29c24d4ac404c761a7b7663292372600a	2021-04-30 13:50:13 -07:00
Peter Dillinger	85becd94c1	Refactor: use TableBuilderOptions to reduce parameter lists (#8240 ) Summary: Greatly reduced the not-quite-copy-paste giant parameter lists of rocksdb::NewTableBuilder, rocksdb::BuildTable, BlockBasedTableBuilder::Rep ctor, and BlockBasedTableBuilder ctor. Moved weird separate parameter `uint32_t column_family_id` of TableFactory::NewTableBuilder into TableBuilderOptions. Re-ordered parameters to TableBuilderOptions ctor, so that `uint64_t target_file_size` is not randomly placed between uint64_t timestamps (was easy to mix up). Replaced a couple of fields of BlockBasedTableBuilder::Rep with a FilterBuildingContext. The motivation for this change is making it easier to pass along more data into new fields in FilterBuildingContext (follow-up PR). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8240 Test Plan: ASAN make check Reviewed By: mrambacher Differential Revision: D28075891 Pulled By: pdillinger fbshipit-source-id: fddb3dbb8260a0e8bdcbb51b877ebabf9a690d4f	2021-04-29 07:00:50 -07:00
Akanksha Mahajan	a0e0feca62	Improve BlockPrefetcher to prefetch only for sequential scans (#7394 ) Summary: BlockPrefetcher is used by iterators to prefetch data if they anticipate more data to be used in future and this is valid for forward sequential scans. But BlockPrefetcher tracks only num_file_reads_ and not if reads are sequential. This presents problem for MultiGet with large number of keys when it reseeks index iterator and data block. FilePrefetchBuffer can end up doing large readahead for reseeks as readahead size increases exponentially once readahead is enabled. Same issue is with BlockBasedTableIterator. Add previous length and offset read as well in BlockPrefetcher (creates FilePrefetchBuffer) and FilePrefetchBuffer (does prefetching of data) to determine if reads are sequential and then prefetch. Update the last block read after cache hit to take reads from cache also in account. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7394 Test Plan: Add new unit test case Reviewed By: anand1976 Differential Revision: D23737617 Pulled By: akankshamahajan15 fbshipit-source-id: 8e6917c25ed87b285ee495d1b68dc623d71205a3	2021-04-28 12:53:46 -07:00
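The sequential-read check described in the BlockPrefetcher commit above can be sketched as follows: remember the previous read's offset and length, and treat a read as sequential only when it starts where the previous one ended. The `SequentialReadTracker` class and its method are hypothetical names for illustration; the treatment of the very first read (not sequential) is an assumption of this sketch, not something the commit message states.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper that remembers the previous read so a caller can
// decide whether readahead is worthwhile.
class SequentialReadTracker {
 public:
  // Records a read and reports whether it began exactly where the previous
  // read ended. The first read is never considered sequential here.
  bool RecordRead(uint64_t offset, std::size_t len) {
    assert(len > 0);  // zero-length reads are not expected in this sketch
    const bool sequential = has_prev_ && offset == prev_offset_ + prev_len_;
    prev_offset_ = offset;
    prev_len_ = len;
    has_prev_ = true;
    return sequential;
  }

 private:
  uint64_t prev_offset_ = 0;
  std::size_t prev_len_ = 0;
  bool has_prev_ = false;
};
```

A caller would grow its readahead size only while `RecordRead` keeps returning true, so a MultiGet that reseeks to scattered offsets never triggers a large readahead.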
anand76	0db4cde6e2	Fix a memory leak in c_test (#8237 ) Summary: Don't call ```rocksdb_cache_disown_data()``` as it causes the memory allocated for ```shards_``` to be leaked. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8237 Reviewed By: jay-zhuang Differential Revision: D28039061 Pulled By: anand1976 fbshipit-source-id: c3464efe2c006b93b4be87030116a12a124598c4	2021-04-28 12:29:33 -07:00
anand76	8fe33a0a9f	Change CircleCI Windows to previous known good image (#8220 ) Summary: This is to try to resolve the VS2015 install failure in CircleCI Windows builds. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8220 Reviewed By: jay-zhuang Differential Revision: D28061834 Pulled By: anand1976 fbshipit-source-id: b2663eb60babee603669a2c2cb55f182df1cc7b1	2021-04-28 11:30:30 -07:00
sdong	cde69a7cfd	db_stress to add --open_metadata_write_fault_one_in (#8235 ) Summary: DB Stress to add --open_metadata_write_fault_one_in which would randomly fail in some file metadata modification operations during DB Open, including file creation, close, renaming and directory sync. Some operations can fail before and after the operations take place. If DB open fails, db_stress would retry without the failure ingestion, and DB is expected to open successfully. This option is enabled in crash test in half of the time. Some follow up changes would allow write failures in open time, and ingesting those failures in non-DB open cases. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8235 Test Plan: Run stress tests for a while and see failures got triggered. This can reproduce the bug fixed by https://github.com/facebook/rocksdb/pull/8192 and a similar one that fails when fsyncing parent directory. Reviewed By: anand1976 Differential Revision: D28010944 fbshipit-source-id: 36a96da4dc3633e5f7680cef3ea0a900fcdb5558	2021-04-28 10:58:05 -07:00
Duarte Nunes	3949731de3	Add WAL flush API to C client (#8226 ) Summary: The C client is missing the `manual_wal_flush` option and the `flush_wal` API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8226 Reviewed By: ajkr Differential Revision: D28000869 Pulled By: jay-zhuang fbshipit-source-id: ed44937e7e7e75bc0dfa870a14147fbeef0c38f8	2021-04-27 14:56:23 -07:00
Akanksha Mahajan	65abb0cf71	Add 6.18, 6.19 and 6.20 to check_format_compatible.sh (#8236 ) Summary: Add 6.18, 6.19 and 6.20 to check_format_compatible.sh Pull Request resolved: https://github.com/facebook/rocksdb/pull/8236 Test Plan: ./tools/check_format_compatible.sh (tested without 2.7.fb as it was failing as mentioned in the script) Reviewed By: mrambacher Differential Revision: D28019160 Pulled By: akankshamahajan15 fbshipit-source-id: b59a7c5c14cb4c115926e9ae7c74ea586b22c9ed	2021-04-27 10:24:27 -07:00
Sahir Hoda	13c655a887	New C API to expose NewCompactOnDeletionCollectorFactory (#8233 ) Summary: New C API rocksdb_options_add_compact_on_deletion_collector_factory to expose NewCompactOnDeletionCollectorFactory Pull Request resolved: https://github.com/facebook/rocksdb/pull/8233 Reviewed By: mrambacher Differential Revision: D28018381 Pulled By: anand1976 fbshipit-source-id: 674c9ed902c91ff0d9f09e7a60c5f37b907604c6	2021-04-27 10:14:04 -07:00
mrambacher	0ca6d6297f	Rename variables in ImmutableCFOptions to avoid conflicts with ImmutableDBOptions (#8227 ) Summary: Renaming ImmutableCFOptions::info_log and statistics to logger and stats. This is stage 2 in creating an ImmutableOptions class. It is necessary because the names match those in ImmutableOptions and have different types. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8227 Reviewed By: jay-zhuang Differential Revision: D28000967 Pulled By: mrambacher fbshipit-source-id: 3bf2aa04e8f1e8724d825b7deacf41080c14420b	2021-04-26 12:43:45 -07:00
Mr-Leshiy	c2c7d5e916	Fix cast-function-type warning (#8230 ) Summary: Fixing the cast-function-type warning which appears during the following build: ```bash cmake .. -DFAIL_ON_WARNINGS=ON -DCMAKE_C_COMPILER=x86_64-w64-mingw32-gcc -DCMAKE_CXX_COMPILER=x86_64-w64-mingw32-g++ -DCMAKE_SYSTEM_NAME=Windows make rocksdb ``` Here is the log: ``` /home/leshiy/Work/rocksdb/port/win/env_win.cc: In constructor ‘rocksdb::port::WinClock::WinClock()’: /home/leshiy/Work/rocksdb/port/win/env_win.cc:92:9: error: cast between incompatible function types from ‘FARPROC’ {aka ‘long long int (*)()’} to ‘rocksdb::port::WinClock::FnGetSystemTimePreciseAsFileTime’ {aka ‘void (*)(_FILETIME*)’} [-Werror=cast-function-type] 92 | (FnGetSystemTimePreciseAsFileTime)GetProcAddress( | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 93 | module, "GetSystemTimePreciseAsFileTime"); | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ cc1plus: all warnings being treated as errors make[2]: *** [CMakeFiles/rocksdb.dir/build.make:4337: CMakeFiles/rocksdb.dir/port/win/env_win.cc.obj] Error 1 make[1]: *** [CMakeFiles/Makefile2:83: CMakeFiles/rocksdb.dir/all] Error 2 make: *** [Makefile:91: all] Error 2 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/8230 Reviewed By: jay-zhuang Differential Revision: D28000215 Pulled By: mrambacher fbshipit-source-id: 874782cf48f70470e3fbd9097585bf42e810ca61	2021-04-26 10:13:55 -07:00
Adam Retter	2760c2aef8	WBWI Internal Move implementation from .h into .cpp (#8229 ) Summary: Moves some of the structural refactoring from https://github.com/facebook/rocksdb/pull/8135 into this PR. This just cleans up the code by moving implementation out of the .h file and into the .cc file. Should be considered for merge before both https://github.com/facebook/rocksdb/pull/7214 and https://github.com/facebook/rocksdb/pull/8135 Pull Request resolved: https://github.com/facebook/rocksdb/pull/8229 Reviewed By: jay-zhuang Differential Revision: D27999669 Pulled By: mrambacher fbshipit-source-id: 6eccecbf1f11bb9f5a173e86d1e7bc448bc96071	2021-04-26 09:48:22 -07:00
Adam Retter	69c986825e	Fix javadoc for keyMayExist (#8232 ) Summary: Closes https://github.com/facebook/rocksdb/issues/6985 Pull Request resolved: https://github.com/facebook/rocksdb/pull/8232 Reviewed By: jay-zhuang Differential Revision: D27999779 Pulled By: mrambacher fbshipit-source-id: a37c88d93bde2692b8be9e46e673dda7bea701b2	2021-04-26 08:34:10 -07:00
mrambacher	6bab3a34e9	Move RegisterOptions into the Configurable API (#8223 ) Summary: As previously coded, a Configurable extension would need access to code not in the public API. This change moves RegisterOptions into the Configurable class and therefore available to public extensions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8223 Reviewed By: anand1976 Differential Revision: D27960188 Pulled By: mrambacher fbshipit-source-id: ac88b19397183df633902def5b5701b9b65fbf40	2021-04-26 03:13:24 -07:00
Saketh Are	cc1c3ee54e	Eliminate double-buffering of keys in block_based_table_builder (#8219 ) Summary: The block_based_table_builder buffers some blocks in memory to construct a good compression dictionary. Before this commit, the keys from each block were buffered separately for convenience. However, the buffered block data implicitly contains all keys. This commit eliminates the redundant key buffers and reduces memory usage. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8219 Reviewed By: ajkr Differential Revision: D27945851 Pulled By: saketh-are fbshipit-source-id: caf3cac1217201e080a1e24b542bedf20973afee	2021-04-23 12:45:02 -07:00
Sahir Hoda	d65d7d657d	Expose JemallocNodumpAllocator to C API (#8178 ) Summary: Add new C APIs to create the JemallocNodumpAllocator and set it on a Cache object. `make test` passes with and without `DISABLE_JEMALLOC=1`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8178 Reviewed By: jay-zhuang Differential Revision: D27944631 Pulled By: ajkr fbshipit-source-id: 2531729aa285a8985c58f22f093c4d53029c4a7b	2021-04-22 22:22:34 -07:00
mrambacher	01e460d538	Make types of Immutable/Mutable Options fields match that of the underlying Option (#8176 ) Summary: This PR is a first step at attempting to clean up some of the Mutable/Immutable Options code. With this change, a DBOption and a ColumnFamilyOption can be reconstructed from their Mutable and Immutable equivalents, respectively. readrandom tests do not show any performance degradation versus master (though both are slightly slower than the current 6.19 release). There are still fields in the ImmutableCFOptions that are not CF options but DB options. Eventually, I would like to move those into an ImmutableOptions (= ImmutableDBOptions+ImmutableCFOptions). But that will be part of a future PR to minimize changes and disruptions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8176 Reviewed By: pdillinger Differential Revision: D27954339 Pulled By: mrambacher fbshipit-source-id: ec6b805ba9afe6e094bffdbd76246c2d99aa9fad	2021-04-22 20:43:54 -07:00
Jay Zhuang	f0fca2b1d5	Add internal compaction API for Secondary instance (#8171 ) Summary: Add a compaction API for the secondary instance, which compacts the files to a secondary DB path without installing them to the LSM tree. The API will be used for remote compaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8171 Test Plan: `make check` Reviewed By: ajkr Differential Revision: D27694545 Pulled By: jay-zhuang fbshipit-source-id: 8ff3ec1bffdb2e1becee994918850c8902caf731	2021-04-22 13:02:28 -07:00
Hans Holmberg	e85d8a6517	Add ZenFS to plugin list (#8218 ) Summary: Add ZenFS, a file system for zoned block devices, to PLUGINS.md Pull Request resolved: https://github.com/facebook/rocksdb/pull/8218 Reviewed By: jay-zhuang Differential Revision: D27944376 Pulled By: ajkr fbshipit-source-id: c9ea2e9814001ccd7c56d7ef4d38e20dfeb48d1e	2021-04-22 11:12:40 -07:00
Zhichao Cao	09a9ec3ac0	Fix the false positive alert of CF consistency check in WAL recovery (#8207 ) Summary: In current RocksDB, when recovering information from the WAL, we do the consistency check for each column family when one WAL file is corrupted and PointInTimeRecovery is set. However, it reports a false positive alert on "SST file is ahead of WALs" when one CF's current log number is greater than the corrupted WAL number (the CF contains data beyond the corrupted WAL) due to a new column family creation during flush. In this case, a new WAL is created (it is empty) during a flush. Also, for some reason (e.g., a storage issue or a crash before SyncCloseLog is called), the old WAL is corrupted. The new CF has no data; therefore, it does not have the consistency issue. Fix: when checking cfd->GetLogNumber() > corrupted_wal_number, also check cfd->GetLiveSstFilesSize() > 0, so CFs with no SST file data skip the check here. Note a potential inconsistency ignored because of this fix: an empty CF can also be caused by write+delete. In this case, after flush, no SST files are generated. However, this CF still has log entries in the WAL. When the WAL is corrupted, the DB might be inconsistent. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8207 Test Plan: added unit test, make crash_test Reviewed By: riversand963 Differential Revision: D27898839 Pulled By: zhichao-cao fbshipit-source-id: 931fc2d8b92dd00b4169bf84b94e712fd688a83e	2021-04-22 10:28:37 -07:00
mrambacher	47b424f4bd	Add check to cmake to see if we need to link against -latomic (#8183 ) Summary: For some compilers/environments (e.g. Clang, riscv64), we need to link against -latomic. Check if this is a requirement and add the library to the third-party libs if it is. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8183 Reviewed By: pdillinger Differential Revision: D27773564 Pulled By: mrambacher fbshipit-source-id: 68e15d823144f83fb02221c7bf5b1e43323419bf	2021-04-22 08:29:08 -07:00
Yanqin Jin	314352761f	Ignore comparator name mismatch in ldb manifest dump (#8216 ) Summary: RocksDB allows user-specified custom comparators which may not be known to `ldb`, a built-in tool for checking/mutating the database. Therefore, column family comparator names mismatch encountered during manifest dump should not prevent the dumping from proceeding. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8216 Test Plan: ``` make check ``` Also manually do the following ``` KEEP_DB=1 ./db_with_timestamp_basic_test ./ldb --db=<db> manifest_dump --verbose ``` The ldb should succeed and print something like: ``` ... --------------- Column family "default" (ID 0) -------------- log number: 6 comparator: <TestComparator>, but the comparator object is not available. ... ``` Reviewed By: ltamasi Differential Revision: D27927581 Pulled By: riversand963 fbshipit-source-id: f610b2c842187d17f575362070209ee6b74ec6d4	2021-04-21 20:43:10 -07:00
sdong	4985cea141	Add comment to DisableManualCompaction() (#8186 ) Summary: Add comment to DisableManualCompaction() which was missing. Also explicitly return from DBImpl::CompactRange() to avoid memtable flush when manual compaction is disabled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8186 Test Plan: Run existing unit tests. Reviewed By: jay-zhuang Differential Revision: D27744517 fbshipit-source-id: 449548a48905903b888dc9612bd17480f6596a71	2021-04-21 15:23:46 -07:00
Akanksha Mahajan	596e9008e4	Stall writes in WriteBufferManager when memory_usage exceeds buffer_size (#7898 ) Summary: When WriteBufferManager is shared across DBs and column families to maintain memory usage under a limit, OOMs have been observed when flush cannot finish but writes continuously insert to memtables. In order to avoid OOMs, when memory usage goes beyond buffer_size_ and a DB tries to write, this change will stall incoming writers until flush is completed and memory_usage drops. Design: Stall condition: When total memory usage exceeds WriteBufferManager::buffer_size_ (memory_usage() >= buffer_size_), WriteBufferManager::ShouldStall() returns true. DBImpl first blocks incoming/future writers by calling write_thread_.BeginWriteStall() (which adds a dummy stall object to the writer's queue). Then the DB is blocked in state State::Blocked (the current write doesn't go through). The WBMStallInterface object maintained by every DB instance is added to the queue of WriteBufferManager. If multiple DBs try to write during this stall, they will also be blocked when WriteBufferManager::ShouldStall() returns true. End Stall condition: When flush is finished and memory usage goes down, the stall will end only if memory waiting to be flushed is less than buffer_size/2. This lower limit gives time for flush to complete and avoids continuous stalling if memory usage remains close to buffer_size. WriteBufferManager::EndWriteStall() is called, which removes all instances from its queue and signals them to continue. Their state is changed to State::Running and they are unblocked. DBImpl then signals all incoming writers of that DB to continue by calling write_thread_.EndWriteStall() (which removes the dummy stall object from the queue). Each DB instance creates a WBMStallInterface, which is an interface to block and signal DBs during a stall. When a DB needs to be blocked or signalled by WriteBufferManager, its state_for_wbm_ state is changed accordingly (RUNNING or BLOCKED). 
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7898 Test Plan: Added a new test db/db_write_buffer_manager_test.cc Reviewed By: anand1976 Differential Revision: D26093227 Pulled By: akankshamahajan15 fbshipit-source-id: 2bbd982a3fb7033f6de6153aa92a221249861aae	2021-04-21 13:54:02 -07:00
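The begin/end stall conditions in the WriteBufferManager commit above form a hysteresis: stalling starts at `buffer_size` and ends only below `buffer_size/2`. A minimal sketch of just that threshold logic follows. The `StallGate` class is hypothetical, and it simplifies the commit's "memory waiting to be flushed" into a single memory-usage figure; the queueing and signalling of writers is omitted entirely.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stall gate: starts stalling at buffer_size, and stops only
// once usage has dropped below half of buffer_size.
class StallGate {
 public:
  explicit StallGate(std::size_t buffer_size) : buffer_size_(buffer_size) {
    assert(buffer_size > 0);
  }

  // Feeds the current memory usage and returns whether writers should be
  // stalled after taking it into account.
  bool Update(std::size_t memory_usage) {
    if (!stalled_ && memory_usage >= buffer_size_) {
      stalled_ = true;   // begin stall
    } else if (stalled_ && memory_usage < buffer_size_ / 2) {
      stalled_ = false;  // end stall only after a substantial drop
    }
    return stalled_;
  }

 private:
  const std::size_t buffer_size_;
  bool stalled_ = false;
};
```

The gap between the two thresholds is what prevents writers from being stalled and released repeatedly while usage hovers near the limit.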
Peter Dillinger	95f6add746	Revert Ribbon starting level support from #8198 (#8212 ) Summary: This partially reverts commit `10196d7edc`. The problem with this change is because of important filter use cases: FIFO compaction and SST writer. FIFO "compaction" always uses level 0 so would only use Ribbon filters if specifically including level 0 for the Ribbon filter policy. SST writer sets level_at_creation=-1 to indicate unknown level, and this would be treated the same as level 0 unless fixed. We are keeping the part about committing to permanent schema, which is only changes to API comments and HISTORY.md. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8212 Test Plan: CI Reviewed By: jay-zhuang Differential Revision: D27896468 Pulled By: pdillinger fbshipit-source-id: 50a775f7cba5d64fb729d9b982e355864020596e	2021-04-20 19:46:40 -07:00
Andrew Gallagher	2e5de5a2c3	Cleanup include (#8208 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8208 Make include of "file_system.h" use the same include path as everywhere else. Reviewed By: riversand963, akankshamahajan15 Differential Revision: D27881606 fbshipit-source-id: fc1e076229fde21041a813c655ce017b5070c8b3	2021-04-20 14:57:27 -07:00
Andrew Kryczka	905dd17b35	Fix seqno in ingested file boundary key metadata (#8209 ) Summary: Fixes https://github.com/facebook/rocksdb/issues/6245. Adapted from https://github.com/facebook/rocksdb/issues/8201 and https://github.com/facebook/rocksdb/issues/8205. Previously we were writing the ingested file's smallest/largest internal keys with sequence number zero, or `kMaxSequenceNumber` in case of range tombstone. The former (sequence number zero) is incorrect and can lead to files being incorrectly ordered. The fix in this PR is to overwrite boundary keys that have sequence number zero with the ingested file's assigned sequence number. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8209 Test Plan: repro unit test Reviewed By: riversand963 Differential Revision: D27885678 Pulled By: ajkr fbshipit-source-id: 4a9f2c6efdfff81c3a9923e915ea88b250ee7b6a	2021-04-20 14:00:21 -07:00
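The boundary-key rule from the ingestion fix above can be sketched as follows: a boundary key whose sequence number is still the placeholder zero takes the file's assigned sequence number, while any other value (such as the maximum-sequence-number sentinel used for range tombstones) is left alone. `BoundaryKey`, `FixBoundarySeqno`, and `kMaxSeqno` are hypothetical names, and the sentinel's numeric value here is chosen only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Illustrative sentinel standing in for the maximum sequence number.
constexpr uint64_t kMaxSeqno = (uint64_t{1} << 56) - 1;

// Hypothetical boundary key: a user key plus its sequence number.
struct BoundaryKey {
  std::string user_key;
  uint64_t seqno;
};

// Overwrites a zero sequence number with the one assigned to the ingested
// file; non-zero values are kept as they are.
void FixBoundarySeqno(BoundaryKey* key, uint64_t assigned_seqno) {
  assert(key != nullptr);
  if (key->seqno == 0) {
    key->seqno = assigned_seqno;
  }
}
```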
Levi Tamasi	1b99947e99	Mention PR 8206 in HISTORY.md (#8210 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8210 Reviewed By: akankshamahajan15 Differential Revision: D27887612 Pulled By: ltamasi fbshipit-source-id: 0db8d0b6047334dc47fe30a98804449043454386	2021-04-20 12:07:40 -07:00
Jay Zhuang	a89740fbc6	Fix unittest no space issue (#8204 ) Summary: Unit tests report no space from time to time, which can be reproduced on a small-memory machine with SHM. It's caused by large WAL files generated during the test, which are preallocated but not truncated during close(). This adds the missing APIs to set preallocation. It also adds the arm test as a nightly build, as the test runs for more than 1 hour. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8204 Test Plan: test on small memory arm machine Reviewed By: mrambacher Differential Revision: D27873145 Pulled By: jay-zhuang fbshipit-source-id: f797c429d6bc13cbcc673bc03fcc72adda55f506	2021-04-20 08:42:28 -07:00
Jay Zhuang	a345b4d60d	Move arm build from travis to circleci (#8203 ) Summary: Moving ARM build from travis to CircleCI. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8203 Test Plan: CI Reviewed By: ajkr Differential Revision: D27861753 Pulled By: jay-zhuang fbshipit-source-id: 5e36a67f6fbb921c2ed80b284ba2de485411937b	2021-04-19 20:07:02 -07:00
Yanqin Jin	a376c22066	Handle rename() failure in non-local FS (#8192 ) Summary: In a distributed environment, a file `rename()` operation can succeed on server (remote) side, but the client can somehow return non-ok status to RocksDB. Possible reasons include network partition, connection issue, etc. This happens in `rocksdb::SetCurrentFile()`, which can be called in `LogAndApply() -> ProcessManifestWrites()` if RocksDB tries to switch to a new MANIFEST. We currently always delete the new MANIFEST if an error occurs. This is problematic in distributed world. If the server-side successfully updates the CURRENT file via renaming, then a subsequent `DB::Open()` will try to look for the new MANIFEST and fail. As a fix, we can track the execution result of IO operations on the new MANIFEST. - If IO operations on the new MANIFEST fail, then we know the CURRENT must point to the original MANIFEST. Therefore, it is safe to remove the new MANIFEST. - If IO operations on the new MANIFEST all succeed, but somehow we end up in the clean up code block, then we do not know whether CURRENT points to the new or old MANIFEST. (For local POSIX-compliant FS, it should still point to old MANIFEST, but it does not matter if we keep the new MANIFEST.) Therefore, we keep the new MANIFEST. - Any future `LogAndApply()` will switch to a new MANIFEST and update CURRENT. - If process reopens the db immediately after the failure, then the CURRENT file can point to either the new MANIFEST or the old one, both of which exist. Therefore, recovery can succeed and ignore the other. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8192 Test Plan: make check Reviewed By: zhichao-cao Differential Revision: D27804648 Pulled By: riversand963 fbshipit-source-id: 9c16f2a5ce41bc6aadf085e48449b19ede8423e4	2021-04-19 18:11:13 -07:00
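The cleanup decision described in the `rename()` commit above can be sketched as a small predicate: on the failure path, the new MANIFEST is deleted only when IO on it failed, because only then is CURRENT known to still point at the old MANIFEST. The struct and function names below are hypothetical and do not mirror RocksDB's actual code.

```cpp
#include <cassert>

// Hypothetical summary of an attempted switch to a new MANIFEST.
struct ManifestSwitchResult {
  bool new_manifest_io_ok;  // all IO on the new MANIFEST succeeded
  bool set_current_ok;      // the client saw the CURRENT rename succeed
};

// Called only on the failure (cleanup) path. Returns true when the new
// MANIFEST can be removed safely.
bool ShouldDeleteNewManifest(const ManifestSwitchResult& r) {
  // If both steps had succeeded we would not be cleaning up at all.
  assert(!(r.new_manifest_io_ok && r.set_current_ok));
  // IO failed: CURRENT must still name the old MANIFEST, so delete the new one.
  // IO succeeded but rename reported failure: the rename may have taken
  // effect remotely, so keep the new MANIFEST.
  return !r.new_manifest_io_ok;
}
```

Keeping an extra MANIFEST in the ambiguous case costs little: as the commit notes, a later `LogAndApply()` switches to yet another MANIFEST, and recovery ignores whichever file CURRENT does not reference.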
Levi Tamasi	0c6e4674a6	Fix a data race related to DB properties (#8206 ) Summary: Historically, the DB properties `rocksdb.cur-size-active-mem-table`, `rocksdb.cur-size-all-mem-tables`, and `rocksdb.size-all-mem-tables` called the method `MemTable::ApproximateMemoryUsage` for mutable memtables, which is not safe without synchronization. This resulted in data races with memtable inserts. The patch changes the code handling these properties to use `MemTable::ApproximateMemoryUsageFast` instead, which returns a cached value backed by an atomic variable. Two test cases had to be updated for this change. `MemoryTest.MemTableAndTableReadersTotal` was fixed by increasing the value size used so each value ends up in its own memtable, which was the original intention (note: the test has been broken in the sense that the test code didn't consider that memtable sizes below 64 KB get increased to 64 KB by `SanitizeOptions`, and has been passing only by accident). `DBTest.MemoryUsageWithMaxWriteBufferSizeToMaintain` relies on completely up-to-date values and thus was changed to use `ApproximateMemoryUsage` directly instead of going through the DB properties. Note: this should be safe in this case since there's only a single thread involved. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8206 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D27866811 Pulled By: ltamasi fbshipit-source-id: 7bd754d0565e0a65f1f7f0e78ffc093beef79394	2021-04-19 16:38:02 -07:00
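The idea of serving a cached value backed by an atomic variable, as the commit above describes, can be sketched as follows. This is a simplified illustration under stated assumptions, not RocksDB's `MemTable`: the class `MemTableSketch`, its `Add` method, and its members are hypothetical, and the sketch assumes a single writer thread updates the exact counter.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a memtable; not the RocksDB class.
class MemTableSketch {
 public:
  // Writer path (assumed to be called by one inserting thread at a time):
  // update the exact counter, then publish it to the atomic cache.
  void Add(std::size_t bytes) {
    exact_usage_ += bytes;
    cached_usage_.store(exact_usage_, std::memory_order_relaxed);
  }

  // Reader path: safe to call from any thread without extra locking,
  // because it only touches the atomic. The value may be slightly stale.
  std::size_t ApproximateMemoryUsageFast() const {
    return cached_usage_.load(std::memory_order_relaxed);
  }

 private:
  std::size_t exact_usage_ = 0;               // touched by the writer only
  std::atomic<std::size_t> cached_usage_{0};  // shared with readers
};
```

Readers such as a property handler would call only `ApproximateMemoryUsageFast()`, trading exactness for freedom from data races, which matches the commit's note that a test needing completely up-to-date values had to bypass the properties.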
Yanqin Jin	b0e20194ea	Handle blob files when options.best_efforts_recovery is true (#8180 ) Summary: If `options.best_efforts_recovery == true`, RocksDB currently tolerates missing table files and recovers to the latest version without missing table files (not considering WAL). It is necessary to handle blob files as well to make the feature more complete. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8180 Test Plan: make check Reviewed By: ltamasi Differential Revision: D27840556 Pulled By: riversand963 fbshipit-source-id: 041685d0dc2e7779ac4f0374c07a8a327704aa5e	2021-04-19 11:56:14 -07:00
Akanksha Mahajan	c377c2ba15	Fix flaky test BackupableDBTest.FileSizeForIncremental (#8197) Summary: The test was flaky because, with kUseDbSessionId naming, blob files use the naming scheme kLegacyCrc32cAndFileSize, so the expected number of files can vary due to collisions. This change disables blobdb for this test case. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8197 Reviewed By: pdillinger Differential Revision: D27836997 Pulled By: akankshamahajan15 fbshipit-source-id: 5eb21a5f4acae3d6b730a9e1b207264fbc18cb80	2021-04-18 16:18:35 -07:00
Akanksha Mahajan	531a5f88a1	Update release version to 6.20 (#8199 ) Summary: Update release version to 6.20 Pull Request resolved: https://github.com/facebook/rocksdb/pull/8199 Test Plan: No code change Reviewed By: ajkr Differential Revision: D27838750 Pulled By: akankshamahajan15 fbshipit-source-id: f02f722fc6bdd37d626d47a0e932bbecea3507a8	2021-04-16 20:15:36 -07:00
Peter Dillinger	10196d7edc	Ribbon long-term support, starting level support (#8198 ) Summary: Since the Ribbon filter schema seems good (compatible back to 6.15.0), this change commits to long term support of the SST schema, even though we expect the API for enabling Ribbon to change (still called NewExperimentalRibbonFilterPolicy). This also adds support for "hybrid" configuration in which some levels use Bloom (higher levels, lower numbered) for speed and the rest use Ribbon (lower levels, higher numbered) for memory space efficiency. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8198 Test Plan: unit test added, crash test support Reviewed By: jay-zhuang Differential Revision: D27831232 Pulled By: pdillinger fbshipit-source-id: 90e528677689474d293ed6710b42ba89fbd5b5ab	2021-04-16 15:43:08 -07:00
Adam Retter	90e245697f	Fix Windows strcmp for Unicode (#8190) Summary: The code for strcmp that was present does not work when compiled for Windows Unicode file paths. Needs backporting to: * 6.17.fb * 6.18.fb * 6.19.fb Pull Request resolved: https://github.com/facebook/rocksdb/pull/8190 Reviewed By: akankshamahajan15 Differential Revision: D27765588 Pulled By: jay-zhuang fbshipit-source-id: 89f8a5ac61fd7edc758340dfd335b0a5f96dae6e	2021-04-16 12:11:16 -07:00
mrambacher	c871142988	Fix Makefile when multiple targets are invoked (#8195 ) Summary: - Fixes the makefile to do the right thing when invoking multiple targets (e.g. make shared_lib install-shared). - Fixes the building of db_stress in shared lib mode. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8195 Reviewed By: pdillinger Differential Revision: D27803452 Pulled By: mrambacher fbshipit-source-id: 7c285d267770a359eb47f25855affdf58687e0e4	2021-04-16 08:34:59 -07:00
mrambacher	4c41e51c07	Add Blob Options to C API (#8148) Summary: Added the Blob option settings from AdvancedColumnFamilyOptions to the C API. There are currently no tests for getting/setting options in the C API, hence no specific test plan. Should there be some? Pull Request resolved: https://github.com/facebook/rocksdb/pull/8148 Reviewed By: ltamasi Differential Revision: D27568495 Pulled By: mrambacher fbshipit-source-id: 3a52b784467ea2c4bc58be5f75c5d41f0a5c55d6	2021-04-16 05:56:00 -07:00
Akanksha Mahajan	00803d619c	Fix flaky failure in DBSSTest.DBWithSstFileManagerForBlobFilesWithGC (#8196) Summary: Updated the test to wait until all trash files are deleted by SSTFileManager in the background. Because deletion runs in the background, the number of files deleted might not always be as expected. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8196 Reviewed By: jay-zhuang Differential Revision: D27812273 Pulled By: akankshamahajan15 fbshipit-source-id: d3ace1db34f91254b52fa455e09844d02801f58e	2021-04-15 20:18:57 -07:00
Akanksha Mahajan	83031e7343	Fix for LITE mode failure on MacOS (#8189) Summary: Fix the failure to build in LITE mode on MacOS caused by unused private fields in BlobFileCompletionCallback. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8189 Reviewed By: jay-zhuang Differential Revision: D27768341 Pulled By: akankshamahajan15 fbshipit-source-id: 14d31d7a9b52d308d9f9f27feff1977c5550622f	2021-04-15 09:45:02 -07:00
Akanksha Mahajan	296b47db25	Extend file_checksum_dump ldb command and DB::GetLiveFilesChecksumInfo to blob files (#8179) Summary: Extend the DB::GetLiveFilesChecksumInfo API to blob files. This API is also used by the file_checksum_dump ldb command to dump the checksums of SST files; the command now dumps the checksums of blob files as well. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8179 Test Plan: Add new unit test Reviewed By: zhichao-cao Differential Revision: D27714965 Pulled By: akankshamahajan15 fbshipit-source-id: d8b7343ea845a64c83800336d88cced7152a8c92	2021-04-15 09:38:13 -07:00
Yanqin Jin	b1f62be10e	Use the right level (L0) for files written during WAL recovery (#8187 ) Summary: As the name of `DBImpl::WriteLevel0TableForRecovery` suggests, the resulting table file should be placed on L0. However, the argument `level` passed to `BuildTable()` is -1. We need to correct this since the level information will be useful to determine file placement. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8187 Test Plan: make check Reviewed By: ltamasi Differential Revision: D27748570 Pulled By: riversand963 fbshipit-source-id: e1cd23128a8de31f14b1edc2ea92754c154e4f10	2021-04-14 23:40:22 -07:00
Justin Chapman	d89483098f	Assert unlimited max_open_files for FIFO compaction. (#8172 ) Summary: Resolves https://github.com/facebook/rocksdb/issues/8014 - Add an assertion on `DB::Open` to ensure `db_options.max_open_files` is unlimited if FIFO Compaction is being used. - This is to align with what the docs mention and to prevent premature data deletion. - Update tests to work with this assertion. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8172 Test Plan: ```bash $ make check -j$(nproc) Generated TARGETS Summary: - 6 libs - 0 binarys - 180 tests ``` Reviewed By: ajkr Differential Revision: D27768792 Pulled By: thejchap fbshipit-source-id: cf6350535e3a3577fec72bcba75b3c094dc7a6f3	2021-04-14 12:05:47 -07:00

10070 Commits