rocksdb

Author	SHA1	Message	Date
Adam Simpkins	c06c4c01c5	Fix many bugs in log statement arguments (#5089 ) Summary: Annotate all of the logging functions to inform the compiler that these use printf-style formatting arguments. This allows the compiler to emit warnings if the format arguments are incorrect. This also fixes many problems reported now that format string checking is enabled. Many of these are simply mix-ups in the argument type (e.g, int vs uint64_t), but in several cases the wrong number of arguments were being passed in which can cause the code to crash. The primary motivation for this was to fix the log message in `DBImpl::SwitchMemtable()` which caused a segfault due to an extra %s format parameter with no argument supplied. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5089 Differential Revision: D14574795 Pulled By: simpkins fbshipit-source-id: 0921b03f0743652bf4ae21e414ff54b3bb65422a	2019-04-04 12:12:11 -07:00
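The annotation described in this commit can be illustrated with a minimal standalone sketch. It is not RocksDB's logging code: `FormatLog` and `SKETCH_PRINTF_FORMAT` are hypothetical names, and the attribute syntax shown is the GCC/Clang one, assumed here for illustration.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical macro (not from RocksDB). On GCC and Clang it marks a function
// as taking a printf-style format string at parameter `fmt_index`, with the
// variadic arguments starting at parameter `first_arg`.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_PRINTF_FORMAT(fmt_index, first_arg) \
  __attribute__((__format__(__printf__, fmt_index, first_arg)))
#else
#define SKETCH_PRINTF_FORMAT(fmt_index, first_arg)
#endif

// Hypothetical logging helper: formats into a string instead of a log file.
std::string FormatLog(const char* format, ...) SKETCH_PRINTF_FORMAT(1, 2);

std::string FormatLog(const char* format, ...) {
  assert(format != nullptr);
  char buf[256];
  va_list ap;
  va_start(ap, format);
  const int n = std::vsnprintf(buf, sizeof(buf), format, ap);
  va_end(ap);
  if (n < 0) {
    return std::string();  // encoding error reported by vsnprintf
  }
  // vsnprintf returns the length the full output would have had, so clamp it
  // to what actually fits in the buffer.
  const std::size_t len = static_cast<std::size_t>(n) < sizeof(buf)
                              ? static_cast<std::size_t>(n)
                              : sizeof(buf) - 1;
  return std::string(buf, len);
}
```

With the annotation in place, a call such as `FormatLog("%s %s", "only-one")` should draw a `-Wformat` warning at compile time on GCC or Clang instead of reading a missing argument at run time.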
Siying Dong	89ab1381f8	Apply automatic formatting to some files (#5114 ) Summary: Following files were run through automatic formatter: db/db_impl.cc db/db_impl.h db/db_impl_compaction_flush.cc db/db_impl_debug.cc db/db_impl_files.cc db/db_impl_readonly.h db/db_impl_write.cc db/dbformat.cc db/dbformat.h table/block.cc table/block.h table/block_based_filter_block.cc table/block_based_filter_block.h table/block_based_filter_block_test.cc table/block_based_table_builder.cc table/block_based_table_reader.cc table/block_based_table_reader.h table/block_builder.cc table/block_builder.h table/block_fetcher.cc table/block_prefix_index.cc table/block_prefix_index.h table/block_test.cc table/format.cc table/format.h I could easily run all the files, but I don't want people to feel that I'm doing it for lines of code changes :) Pull Request resolved: https://github.com/facebook/rocksdb/pull/5114 Differential Revision: D14633040 Pulled By: siying fbshipit-source-id: 3f346cb53bf21e8c10704400da548dfce1e89a52	2019-03-27 16:24:45 -07:00
Siying Dong	2b4d5ceb47	Remove some "using std::..." from header files. (#5113 ) Summary: The code convention we follow, Google C++ Style, discourages aliases in header files, especially public headers: https://google.github.io/styleguide/cppguide.html#Aliases Remove some of them. Might have removed some from .cc files as well, to be consistent. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5113 Differential Revision: D14633030 Pulled By: siying fbshipit-source-id: b990edc919d5de60295992284f980195e501d424	2019-03-27 10:28:21 -07:00
Siying Dong	48e7effa79	Avoid to go through every CF for every ReleaseSnapshot() (#5090 ) Summary: With https://github.com/facebook/rocksdb/pull/3009 we go through every CF to check whether a bottommost compaction needs to be triggered. This is done within the DB mutex. What we do within the DB mutex may heavily influence the write throughput we can achieve, so we always want to minimize work there. Here we try to avoid this for-loop by first checking a global threshold. Most of the time, the CF loop can be avoided. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5090 Differential Revision: D14582684 Pulled By: siying fbshipit-source-id: 968f6d9bb6affe1a5ebc4910b418300b076f166f	2019-03-25 19:18:04 -07:00
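The "check a global threshold first" idea can be sketched as follows. This is an illustration under assumptions, not the code from the PR: `CfSketch`, `DbSketch`, and all member names are hypothetical, and each column family is reduced to a single sequence-number threshold above which it has work to do.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Sentinel meaning "this column family has nothing waiting on a snapshot".
constexpr std::uint64_t kNoThreshold =
    std::numeric_limits<std::uint64_t>::max();

// Hypothetical per-column-family state.
struct CfSketch {
  std::uint64_t threshold = kNoThreshold;
  bool marked = false;  // stands in for "bottommost compaction scheduled"
};

// Hypothetical DB state; in the real system this would be guarded by a mutex.
struct DbSketch {
  std::vector<CfSketch> cfs;
  std::uint64_t global_threshold = kNoThreshold;  // min over all CF thresholds
  int cf_loops = 0;  // counts how often the per-CF loop actually ran

  void AddCf(std::uint64_t threshold) {
    CfSketch cf;
    cf.threshold = threshold;
    cfs.push_back(cf);
    global_threshold = std::min(global_threshold, threshold);
  }

  void OnReleaseSnapshot(std::uint64_t oldest_snapshot) {
    // Fast path: no column family can have work, so skip the loop entirely.
    if (oldest_snapshot <= global_threshold) {
      return;
    }
    ++cf_loops;
    std::uint64_t new_global = kNoThreshold;
    for (CfSketch& cf : cfs) {
      if (oldest_snapshot > cf.threshold) {
        cf.marked = true;
        cf.threshold = kNoThreshold;
      }
      new_global = std::min(new_global, cf.threshold);
    }
    // Thresholds only ever move up, so the global minimum cannot decrease.
    assert(new_global >= global_threshold);
    global_threshold = new_global;
  }
};
```

The slow path recomputes the global minimum while it is already visiting every column family, so keeping the fast-path check valid costs no extra pass.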
Zhongyi Xie	a291f3a1e5	Collect compaction stats by priority and dump to info LOG (#5050 ) Summary: In order to better understand compaction done by different priority thread pool, we now collect compaction stats by priority and also print them to info LOG through stats dump. ``` Compaction Stats [default] Priority Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Low 0/0 0.00 KB 0.0 16.8 11.3 5.5 5.6 0.1 0.0 0.0 406.4 136.1 42.24 34.96 45 0.939 13M 8865K High 0/0 0.00 KB 0.0 0.0 0.0 0.0 11.4 11.4 0.0 0.0 0.0 76.2 153.00 35.74 12185 0.013 0 0 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5050 Differential Revision: D14408583 Pulled By: miasantreble fbshipit-source-id: e53746586ea27cb8abc9fec35805bd80ed30f608	2019-03-19 17:28:19 -07:00
Siying Dong	5e298f865b	Add two more StatsLevel (#5027 ) Summary: Statistics cost too much CPU for some use cases. Add two stats levels so that people can choose to skip two types of expensive stats, timers and histograms. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5027 Differential Revision: D14252765 Pulled By: siying fbshipit-source-id: 75ecec9eaa44c06118229df4f80c366115346592	2019-02-28 10:27:59 -08:00
Yanqin Jin	c5a64cffd2	Avoid fsync on the same directory in atomic flush (#4817 ) Summary: In `DBImpl::AtomicFlushMemTablesToOutputFiles`, we need to call fsync only once on the same data directory. If two column families share a common directory for their data, we call fsync only once. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4817 Differential Revision: D13543689 Pulled By: riversand963 fbshipit-source-id: 4701d77c96a47802fbf6cb9f3337ee65d46b95f5	2019-02-12 12:28:36 -08:00
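A minimal sketch of the deduplication step, assuming data directories are identified by path strings. `DistinctDirs` is a hypothetical helper for illustration; the actual change may well compare directory objects rather than paths.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Returns each directory once, in first-seen order, so that a caller can
// issue exactly one fsync per directory even when several column families
// share it.
std::vector<std::string> DistinctDirs(
    const std::vector<std::string>& cf_dirs) {
  std::unordered_set<std::string> seen;
  std::vector<std::string> distinct;
  for (const std::string& dir : cf_dirs) {
    // insert() reports whether the element was newly added.
    if (seen.insert(dir).second) {
      distinct.push_back(dir);
    }
  }
  return distinct;
}
```

A caller would then loop over the returned vector and sync each entry once, instead of syncing once per column family.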
Jay Zhuang	c9a52cbdc8	Fix potential DB hang while using CompactFiles (#4940 ) Summary: CompactFiles() may block auto compaction, which could cause a DB hang when it reaches level0_stop_writes_trigger. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4940 Differential Revision: D13929648 Pulled By: cooldoger fbshipit-source-id: 10842df38df3bebf862cd1a120a88ce961fdd381	2019-02-05 11:23:38 -08:00
Yanqin Jin	842cdc11dd	Use correct FileMeta for atomic flush result install (#4932 ) Summary: 1. this commit fixes our handling of a combination of two separate edge cases. If a flush job does not pick any memtable to flush (because another flush job has already picked the same memtables), and the column family assigned to the flush job is dropped right before RocksDB calls rocksdb::InstallMemtableAtomicFlushResults, our original code passes a FileMetaData object whose file number is 0, failing the assertion in rocksdb::InstallMemtableAtomicFlushResults (assert(m->GetFileNumber() > 0)). 2. Also piggyback a small change: since we already create a local copy of column family's mutable CF options to eliminate potential race condition with `SetOptions` call, we might as well use the local copy in other function calls in the same scope. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4932 Differential Revision: D13901322 Pulled By: riversand963 fbshipit-source-id: b936580af7c127ea0c6c19ea10cd5fcede9fb0f9	2019-01-31 14:49:51 -08:00
Maysam Yabandeh	35e5689e11	Take snapshots once for all cf flushes (#4934 ) Summary: FlushMemTablesToOutputFiles calls FlushMemTableToOutputFile for each column family. The patch moves the take-snapshot logic to outside FlushMemTableToOutputFile so that it does it once for all the flushes. This also addresses a deadlock issue for resetting the managed snapshot of job_snapshot in the 2nd call to FlushMemTableToOutputFile. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4934 Differential Revision: D13900747 Pulled By: maysamyabandeh fbshipit-source-id: f3cd650c5fff24cf95c1aaf8a10c149d42bf042c	2019-01-31 12:21:59 -08:00
Yi Wu	5d4fddfa52	WritePrepared: Fix visible key compacted out by compaction (#4883 ) Summary: With WritePrepared transaction, flush/compaction can contain uncommitted keys, and those keys can get committed during compaction. If a snapshot is taken before the key is committed, it should not see the key. On the other hand, compaction grabs the list of snapshots at its beginning, and only considers those snapshots to dedup keys. Consider the case: ``` seq = 1: put "foo" = "bar" seq = 2: transaction T: delete "foo", prepare seq = 3: compaction start seq = 4: take snapshot S seq = 5: transaction T: commit. ... seq = N: compaction iterator reached key "foo". ``` When compaction starts, the list of snapshots is empty. Compaction doesn't take snapshot S into account. When it reaches "foo", transaction T is committed. Compaction may think the value "foo=bar" is not visible to any snapshot (which is wrong), and compact the value out. The fix is to explicitly take a snapshot before compaction grabs the list of snapshots. Compaction will then have to keep keys visible to this snapshot. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4883 Differential Revision: D13668775 Pulled By: maysamyabandeh fbshipit-source-id: 1cab9615f94b7d3e8522cc3d44c3a14c7d4720e4	2019-01-15 21:34:38 -08:00
Yanqin Jin	301da345ae	Make a copy of MutableCFOptions to avoid race condition (#4876 ) Summary: If we do not do this, then reading MutableCFOptions may have a race condition with SetOptions which modifies MutableCFOptions. Also reserve space in advance for vectors to avoid reallocation changing the address of its elements. Test plan ``` $make clean && make -j32 all check $make clean && COMPILE_WITH_TSAN=1 make -j32 all check $make clean && COMPILE_WITH_ASAN=1 make -j32 all check ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4876 Differential Revision: D13644500 Pulled By: riversand963 fbshipit-source-id: 4b8112c5c819d5a2922bb61ad1521b3d2fb2fd47	2019-01-11 17:43:37 -08:00
Yanqin Jin	a07175af65	Refactor atomic flush result installation to MANIFEST (#4791 ) Summary: as titled. Since different bg flush threads can flush different sets of column families (due to column family creation and drop), we decide not to let one thread perform atomic flush result installation for other threads. Bg flush threads will install their atomic flush results sequentially to MANIFEST, using a conditional variable, i.e. atomic_flush_install_cv_ to coordinate. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4791 Differential Revision: D13498930 Pulled By: riversand963 fbshipit-source-id: dd7482fc41f4bd22dad1e1ef7d4764ef424688d7	2019-01-03 20:56:24 -08:00
Yanqin Jin	ec68091d19	Remove an unused parameter (#4816 ) Summary: The `flush_reason` parameter in `DBImpl::InstallSuperVersionAndScheduleWork` is not used. Remove it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4816 Differential Revision: D13543218 Pulled By: riversand963 fbshipit-source-id: 8fc75d49462ce092e85aef0fe0c50936140db153	2019-01-02 09:59:13 -08:00
Yanqin Jin	671a7eb36f	Avoid switching empty memtable in certain cases (#4792 ) Summary: In certain cases, we do not perform memtable switching if the active memtable of the column family is empty. Two exceptions: 1. In manual flush, if cached_recoverable_state_empty_ is false, then we need to switch memtable due to requirement of transaction. 2. In switch WAL, we need to switch memtable anyway because we have to seal the memtable if the WAL on which it depends will be closed. This change can potentially delay the occurrence of write stalls because the number of memtables increases more slowly. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4792 Differential Revision: D13499501 Pulled By: riversand963 fbshipit-source-id: 91c9b17ae753578578039f3851667d93610005e1	2018-12-18 16:47:23 -08:00
Yanqin Jin	4fce44fc8b	Improve flushing multiple column families (#4708 ) Summary: If one column family is dropped, we should simply skip it and continue to flush other active ones. Currently we use Status::ShutdownInProgress to notify caller of column families being dropped. In the future, we should consider using a different Status code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4708 Differential Revision: D13378954 Pulled By: riversand963 fbshipit-source-id: 42f248cdf2d32d4c0f677cd39012694b8f1328ca	2018-12-13 15:12:40 -08:00
DorianZheng	2670fe8c73	Get `CompactionJobInfo` from CompactFiles Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4716 Differential Revision: D13207677 Pulled By: ajkr fbshipit-source-id: d0ccf5a66df6cbb07288b0c5ebad81fd9df3926b	2018-12-13 14:21:24 -08:00
Burton Li	a8b9891f95	Concurrent task limiter for compaction thread control (#4332 ) Summary: This PR aims to resolve the issue in: https://github.com/facebook/rocksdb/issues/3972#issue-330771918 We have a RocksDB instance created with leveled compaction and multiple column families (CFs); some CFs use HDD to store big, less frequently accessed data and others use SSD. When there is continuous write traffic to all CFs, the compaction thread pool is mostly occupied by the slow HDD compactions, which prevents full utilization of SSD bandwidth. Since atomic writes and transactions are needed across CFs, splitting it into multiple RocksDB instances is not an option for us. With the compaction thread control, we got a 30%+ HDD write throughput gain and much smoother SSD writes, since fewer write stalls happen. ConcurrentTaskLimiter can be shared by multiple CFs across RocksDB instances, so the feature works not only for multi-CF scenarios but also for multi-instance scenarios that need per-tenant disk IO resource control. The usage is straightforward, e.g.: // // Enable compaction thread limiter thru ColumnFamilyOptions // std::shared_ptr<ConcurrentTaskLimiter> ctl(NewConcurrentTaskLimiter("foo_limiter", 4)); Options options; ColumnFamilyOptions cf_opt(options); cf_opt.compaction_thread_limiter = ctl; ... // // Compaction thread limiter can be tuned or disabled on-the-fly // ctl->SetMaxOutstandingTask(12); // enlarge to 12 tasks ... ctl->ResetMaxOutstandingTask(); // disable (bypass) thread limiter ctl->SetMaxOutstandingTask(-1); // Same as above ... 
ctl->SetMaxOutstandingTask(0); // full throttle (0 task) // // Sharing compaction thread limiter among CFs (to resolve multiple storage perf issue) // std::shared_ptr<ConcurrentTaskLimiter> ctl_ssd(NewConcurrentTaskLimiter("ssd_limiter", 8)); std::shared_ptr<ConcurrentTaskLimiter> ctl_hdd(NewConcurrentTaskLimiter("hdd_limiter", 4)); Options options; ColumnFamilyOptions cf_opt_ssd1(options); ColumnFamilyOptions cf_opt_ssd2(options); ColumnFamilyOptions cf_opt_hdd1(options); ColumnFamilyOptions cf_opt_hdd2(options); ColumnFamilyOptions cf_opt_hdd3(options); // SSD CFs cf_opt_ssd1.compaction_thread_limiter = ctl_ssd; cf_opt_ssd2.compaction_thread_limiter = ctl_ssd; // HDD CFs cf_opt_hdd1.compaction_thread_limiter = ctl_hdd; cf_opt_hdd2.compaction_thread_limiter = ctl_hdd; cf_opt_hdd3.compaction_thread_limiter = ctl_hdd; ... // // The limiter is disabled by default (or set to nullptr explicitly) // Options options; ColumnFamilyOptions cf_opt(options); cf_opt.compaction_thread_limiter = nullptr; Pull Request resolved: https://github.com/facebook/rocksdb/pull/4332 Differential Revision: D13226590 Pulled By: siying fbshipit-source-id: 14307aec55b8bd59c8223d04aa6db3c03d1b0c1d	2018-12-13 13:18:28 -08:00
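The limiter semantics described in this message (a negative limit bypasses the limiter, a limit of 0 admits nothing) can be sketched with two atomic counters. This is a hypothetical standalone class written for illustration, not the `ConcurrentTaskLimiter` implementation from the PR; `TaskLimiterSketch` and its method names are assumptions.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical limiter: at most `max_outstanding` tasks may hold a slot at
// once. A negative limit disables the check; a limit of 0 admits no task.
class TaskLimiterSketch {
 public:
  explicit TaskLimiterSketch(int max_outstanding)
      : max_outstanding_(max_outstanding) {}

  // The limit can be changed while tasks are running.
  void SetMaxOutstanding(int limit) { max_outstanding_.store(limit); }

  // Tries to reserve a slot; returns false when the limiter is saturated.
  bool TryAcquire() {
    const int limit = max_outstanding_.load();
    int current = outstanding_.load();
    while (limit < 0 || current < limit) {
      // On failure compare_exchange_weak reloads `current`, and the loop
      // condition is evaluated again against the fresh value.
      if (outstanding_.compare_exchange_weak(current, current + 1)) {
        return true;
      }
    }
    return false;
  }

  // Gives back a slot previously obtained from TryAcquire().
  void Release() {
    const int previous = outstanding_.fetch_sub(1);
    assert(previous > 0);
    (void)previous;
  }

 private:
  std::atomic<int> max_outstanding_;
  std::atomic<int> outstanding_{0};
};
```

Sharing one such object between several column families, as the message does with `ctl_ssd` and `ctl_hdd`, amounts to having their compaction schedulers call `TryAcquire()` on the same instance.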
DorianZheng	4862720e08	Expose column family id to `FlushJobInfo` Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4772 Differential Revision: D13428923 Pulled By: ajkr fbshipit-source-id: e351e9c5eea97816db25429e129357a8af90712a	2018-12-11 20:33:42 -08:00
Yanqin Jin	b96fccb1e6	Move a function to critical section (#4752 ) Summary: Test plan ``` $make clean && make -j32 all check ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4752 Differential Revision: D13344705 Pulled By: riversand963 fbshipit-source-id: fc3a43174d09d70ccc2b09decd78e1da1b6ba9d1	2018-12-05 13:12:09 -08:00
Yanqin Jin	8d7bc76f36	Fix a flaky test DBFlushTest.SyncFail (#4633 ) Summary: There is a race condition in DBFlushTest.SyncFail, as illustrated below. ``` time thread1 bg_flush_thread \| Flush(wait=false, cfd) \| refs_before=cfd->current()->TEST_refs() PickMemtable calls cfd->current()->Ref() V ``` The race condition between thread1 getting the ref count of cfd's current version and bg_flush_thread incrementing the cfd's current version makes it possible for later assertion on refs_before to fail. Therefore, we add test sync points to enforce the order and assert on the ref count before and after PickMemtable is called in bg_flush_thread. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4633 Differential Revision: D12967131 Pulled By: riversand963 fbshipit-source-id: a99d2bacb7869ec5d8d03b24ef2babc0e6ae1a3b	2018-11-29 13:39:56 -08:00
Yanqin Jin	147697420a	Rollback memtable flush upon atomic flush fail (#4641 ) Summary: This fixes an assertion. An atomic flush can have multiple flush jobs. Some of them may fail. If any of them fails, we need to rollback all of them. For the flush jobs that do fail, we already call `RollbackMemTableFlush` in `FlushJob::Run`. The tricky part is for flush jobs that have completed successfully. We need to call `RollbackMemTableFlush` for them as well. The newly added DBAtomicFlushTest.AtomicFlushRollbackSomeJobs will SigAbort without the corresponding change in AtomicFlushMemTablesToOutputFiles. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4641 Differential Revision: D12943649 Pulled By: riversand963 fbshipit-source-id: c66a4a664a1e0938e938fd41edc5a70c34cdd868	2018-11-14 20:54:17 -08:00
Soli	a478682260	Fix #3840 : only `SyncClosedLogs` for multiple CFs (#4460 ) Summary: Call `SyncClosedLogs()` only if there are more than one column families. Update several unit tests (in `fault_injection_test` and `db_flush_test`) correspondingly. See #3840 for more info. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4460 Differential Revision: D12896377 Pulled By: riversand963 fbshipit-source-id: f49afdaec32568f12f001219a3aec1dfde3b32bf	2018-11-13 11:32:16 -08:00
DorianZheng	0f88160f67	Fix `CompactFiles` bug (#4665 ) Summary: `CompactFiles` gets `SuperVersion` before `WaitForIngestFile`, while `IngestExternalFile` may add files that overlap with `input_file_names` The timeline of execution flow is as follow: Let's say that level N has two file [1,2] and [5,6] ``` timeline user_thread1 user_thread2 t0 \| CompactFiles([1, 2], [5, 6]) begin t1 \| GetReferencedSuperVersion() t2 \| IngestExternalFile([3,4]) to level N begin t3 \| CompactFiles resume V ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4665 Differential Revision: D13030674 Pulled By: ajkr fbshipit-source-id: 8be19477fd6e505032267a979d32f3097cc3be51	2018-11-12 14:32:18 -08:00
Yanqin Jin	05dec0c7c7	Remove redundant member var and set options (#4631 ) Summary: In the past, both `DBImpl::atomic_flush_` and `DBImpl::immutable_db_options_.atomic_flush` exist. However, we fail to set `immutable_db_options_.atomic_flush`, but use `DBImpl::atomic_flush_` which is set correctly. This does not lead to incorrect behavior, but is a duplicate of information. Since `immutable_db_options_` is always there and has `atomic_flush`, we should use it as source of truth and remove `DBImpl::atomic_flush_`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4631 Differential Revision: D12928371 Pulled By: riversand963 fbshipit-source-id: f85a811959d3828aad4a3a1b05f71facf19c636d	2018-11-12 12:24:26 -08:00
Yi Wu	859dbda6e3	Fix DBTest.SoftLimit flakiness (#4658 ) Summary: The flakiness can be reproduced with the following patch: ``` --- a/db/db_impl_compaction_flush.cc +++ b/db/db_impl_compaction_flush.cc @@ -2013,6 +2013,9 @@ void DBImpl::BackgroundCallFlush() { if (job_context.HaveSomethingToDelete()) { PurgeObsoleteFiles(job_context); } + static int f_count = 0; + printf("clean flush job context %d\n", ++f_count); + env_->SleepForMicroseconds(1000000); job_context.Clean(); mutex_.Lock(); } ``` The issue is that FlushMemtable with opt.wait=true does not wait for `OnStallConditionsChanged` to be called. The event listener is triggered on `JobContext::Clean`, which happens after the flush result is installed. At the time we check for stall condition after flushing memtable, the job context cleanup may not be finished. To fix the flakiness, we use sync point to create a custom WaitForFlush that waits for context cleanup. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4658 Differential Revision: D13007301 Pulled By: yiwu-arbug fbshipit-source-id: d98395ee7b0ad4c62e83e8d0e9b6028058c61712	2018-11-09 16:45:19 -08:00
Sagar Vemuri	dc3528077a	Update all unique/shared_ptr instances to be qualified with namespace std (#4638 ) Summary: Ran the following commands to recursively change all the files under RocksDB: ``` find . -type f -name "*.cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} + find . -type f -name "*.cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} + find . -type f -name "*.cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} + find . -type f -name "*.cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} + ``` Running `make format` updated some formatting on the files touched. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638 Differential Revision: D12934992 Pulled By: sagar0 fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8	2018-11-09 11:19:58 -08:00
Andrew Kryczka	5c794d94c4	Prevent manual flush hanging in read-only mode (#4615 ) Summary: The logic to wait for stall conditions to clear before beginning a manual flush didn't take into account whether the DB was in read-only mode. In read-only mode the stall conditions would never clear since no background work is happening, so the wait would be never-ending. It's probably better to return an error to the user. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4615 Differential Revision: D12888008 Pulled By: ajkr fbshipit-source-id: 1c474b42a7ac38d9fd0d0e2340ff1d53e684d83c	2018-11-01 15:27:06 -07:00
Andrew Kryczka	b8f68bac38	Prevent manual compaction hanging in read-only mode (#4611 ) Summary: A background compaction with pre-picked files (i.e., either a manual compaction or a bottom-pri compaction) fails when the DB is in read-only mode. In the failure handling, we forgot to unregister the compaction and the files it covered. Then subsequent manual compactions could conflict with this zombie compaction (possibly Halloween related) and wait forever for it to finish. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4611 Differential Revision: D12871217 Pulled By: ajkr fbshipit-source-id: 9d24e921d5bbd2ee8c2c9536a30abfa42a220c6e	2018-10-31 17:24:36 -07:00
Yanqin Jin	5b4c709fad	Enable atomic flush (#4023 ) Summary: Adds a DB option `atomic_flush` to control whether to enable this feature. This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752). Pull Request resolved: https://github.com/facebook/rocksdb/pull/4023 Differential Revision: D8518381 Pulled By: riversand963 fbshipit-source-id: 1e3bb33e99bb102876a31b378d93b0138ff6634f	2018-10-26 15:08:43 -07:00
Yanqin Jin	e633983cf1	Add support to flush multiple CFs atomically (#4262 ) Summary: Leverage existing `FlushJob` to implement atomic flush of multiple column families. This PR depends on other PRs and is a subset of #3752 . This PR itself is not sufficient in fulfilling atomic flush. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4262 Differential Revision: D9283109 Pulled By: riversand963 fbshipit-source-id: 65401f913e4160b0a61c0be6cd02adc15dad28ed	2018-10-15 20:01:17 -07:00
Peter Pei	09814f2cfc	support OnCompactionBegin (#4431 ) Summary: fix #4288 Add `OnCompactionBegin` support to `rocksdb::EventListener`. Currently, we only have these three callbacks: - OnFlushBegin - OnFlushCompleted - OnCompactionCompleted As paolococchi requested in #4288 , and ajkr agreed, we should also support `OnCompactionBegin`. This PR is a try to implement the support of `OnCompactionBegin`. Hope it is useful to you. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4431 Differential Revision: D10055515 Pulled By: yiwu-arbug fbshipit-source-id: 39c0f95f8e9ff1c7ca3a10787502a17f258d2334	2018-10-10 17:32:27 -07:00
jsteemann	141ef7f8d3	avoid copying when iterating using range-based for (#4459 ) Summary: this avoids a few copies of std::string and other structs in the context of range-based for loops. instead of copying the values for each iteration, use a const reference to avoid copying. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4459 Differential Revision: D10282045 Pulled By: sagar0 fbshipit-source-id: 5012e910dca279abd2be847e1fb432d96274edfb	2018-10-09 17:15:51 -07:00
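The difference this commit targets can be made visible with a small standalone sketch. `CountedName` is a hypothetical type whose copy constructor bumps a counter, so the two loop forms can be compared directly; none of these names come from RocksDB.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical element type that records every copy made of it.
struct CountedName {
  std::string value;
  int* copy_counter;

  CountedName(std::string v, int* counter)
      : value(std::move(v)), copy_counter(counter) {}
  CountedName(const CountedName& other)
      : value(other.value), copy_counter(other.copy_counter) {
    ++*copy_counter;
  }
};

// `auto name` copies each element (including its std::string) per iteration.
std::size_t TotalLengthByValue(const std::vector<CountedName>& names) {
  std::size_t total = 0;
  for (auto name : names) {
    total += name.value.size();
  }
  return total;
}

// `const auto& name` binds a reference to the element; nothing is copied.
std::size_t TotalLengthByRef(const std::vector<CountedName>& names) {
  std::size_t total = 0;
  for (const auto& name : names) {
    total += name.value.size();
  }
  return total;
}
```

Both functions compute the same result; only the by-value loop increments the counter, once per element.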
Zhongyi Xie	cac87fcf57	move dump stats to a separate thread (#4382 ) Summary: Currently statistics are supposed to be dumped to info log at intervals of `options.stats_dump_period_sec`. However the implementation choice was to bind it with compaction thread, meaning if the database has been serving very light traffic, the stats may not get dumped at all. We decided to separate stats dumping into a new timed thread using `TimerQueue`, which is already used in blob_db. This will allow us schedule new timed tasks with more deterministic behavior. Tested with db_bench using `--stats_dump_period_sec=20` in command line: > LOG:2018/09/17-14:07:45.575025 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS ------- LOG:2018/09/17-14:08:05.643286 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS ------- LOG:2018/09/17-14:08:25.691325 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS ------- LOG:2018/09/17-14:08:45.740989 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS ------- LOG content: > 2018/09/17-14:07:45.575025 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS ------- 2018/09/17-14:07:45.575080 7fe99fbfe700 [WARN] [db/db_impl.cc:606] DB Stats Uptime(secs): 20.0 total, 20.0 interval Cumulative writes: 4447K writes, 4447K keys, 4447K commit groups, 1.0 writes per commit group, ingest: 5.57 GB, 285.01 MB/s Cumulative WAL: 4447K writes, 0 syncs, 4447638.00 writes per sync, written: 5.57 GB, 285.01 MB/s Cumulative stall: 00:00:0.012 H:M:S, 0.1 percent Interval writes: 4447K writes, 4447K keys, 4447K commit groups, 1.0 writes per commit group, ingest: 5700.71 MB, 285.01 MB/s Interval WAL: 4447K writes, 0 syncs, 4447638.00 writes per sync, written: 5.57 MB, 285.01 MB/s Interval stall: 00:00:0.012 H:M:S, 0.1 percent Compaction Stats [default] Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Pull Request resolved: 
https://github.com/facebook/rocksdb/pull/4382 Differential Revision: D9933051 Pulled By: miasantreble fbshipit-source-id: 6d12bb1e4977674eea4bf2d2ac6d486b814bb2fa	2018-10-08 22:54:43 -07:00
DorianZheng	e0f05754ba	Expose column family id to OnCompactionCompleted (#4466 ) Summary: PTAL Pull Request resolved: https://github.com/facebook/rocksdb/pull/4466 Differential Revision: D10241358 Pulled By: yiwu-arbug fbshipit-source-id: 99664eb286860a6c8844d50efeb0ef6f0e10dd1e	2018-10-08 14:24:16 -07:00
Anand Ananthabhotla	a27fce408e	Auto recovery from out of space errors (#4164 ) Summary: This commit implements automatic recovery from a Status::NoSpace() error during background operations such as write callback, flush and compaction. The broad design is as follows - 1. Compaction errors are treated as soft errors and don't put the database in read-only mode. A compaction is delayed until enough free disk space is available to accommodate the compaction outputs, which is estimated based on the input size. This means that users can continue to write, and we rely on the WriteController to delay or stop writes if the compaction debt becomes too high due to a persistent low disk space condition. 2. Errors during write callback and flush are treated as hard errors, i.e. the database is put in read-only mode and goes back to read-write only after certain recovery actions are taken. 3. Both types of recovery rely on the SstFileManagerImpl to poll for sufficient disk space. We assume that there is a 1-1 mapping between an SFM and the underlying OS storage container. For cases where multiple DBs are hosted on a single storage container, the user is expected to allocate a single SFM instance and use the same one for all the DBs. If no SFM is specified by the user, DBImpl::Open() will allocate one, but this will be one per DB and each DB will recover independently. The recovery implemented by SFM is as follows - a) On the first occurrence of an out of space error during compaction, subsequent compactions will be delayed until the disk free space check indicates enough available space. The required space is computed as the sum of input sizes. 
b) The free space check requirement will be removed once the amount of free space is greater than the size reserved by in-progress compactions when the first error occurred. c) If the out of space error is a hard error, a background thread in SFM will poll for sufficient headroom before triggering the recovery of the database and putting it back in read-write mode. The headroom is calculated as the sum of the write_buffer_size of all the DB instances associated with the SFM. 4. EventListener callbacks will be called at the start and completion of automatic recovery. Users can disable the auto recovery in the start callback, and later initiate it manually by calling DB::Resume() Todo: 1. More extensive testing 2. Add disk full condition to db_stress (follow-on PR) Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164 Differential Revision: D9846378 Pulled By: anand1976 fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a	2018-09-15 13:43:04 -07:00
Mikhail Antonov	927f274939	Avoiding write stall caused by manual flushes (#4297 ) Summary: At the moment it seems possible to cause a write stall by calling flush (either manually via DB::Flush(), or from Backup Engine directly calling FlushMemTable()) while a background flush may already be happening. One way to fix it: in DBImpl::CompactRange() we already check for a possible stall and delay the flush if needed before we actually proceed to call FlushMemTable(). We can simply move this delay logic to a separate method and call it from FlushMemTable. This is a draft patch, for a first look; need to check tests/update SyncPoints and most certainly would need to add allow_write_stall method to FlushOptions(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/4297 Differential Revision: D9420705 Pulled By: mikhail-antonov fbshipit-source-id: f81d206b55e1d7b39e4dc64242fdfbceeea03fcc	2018-08-29 12:12:55 -07:00
Yanqin Jin	7daae512d2	Refactor flush request queueing and processing (#3952 ) Summary: RocksDB currently queues individual column family for flushing. This is not sufficient to support the needs of some applications that want to enforce order/dependency between column families, given that multiple foreground and background activities can trigger flushing in RocksDB. This PR aims to address this limitation. Each flush request is described as a `FlushRequest` that can contain multiple column families. A background flushing thread pops one flush request from the queue at a time and processes it. This PR does not enable atomic_flush yet, but is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752). Pull Request resolved: https://github.com/facebook/rocksdb/pull/3952 Differential Revision: D8529933 Pulled By: riversand963 fbshipit-source-id: 78908a21e389a3a3f7de2a79bae0cd13af5f3539	2018-08-24 13:27:35 -07:00
Yanqin Jin	1f802773bc	Update JobContext. (#3949 ) Summary: In the past, we assume that a job modifies a single column family. Therefore, a job can create at most one superversion since each superversion corresponds to one column family. This assumption leads to the fact that a `JobContext` has only one member variable called `superversion_context`. Now we want to support group flush of column families, indicating that each job can create multiple superversions. Therefore, we need to make the following change to accommodate this new feature. Add a vector of `SuperVersionContext` to `JobContext` to support installing superversions for multiple column families in one job context. This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752). Pull Request resolved: https://github.com/facebook/rocksdb/pull/3949 Differential Revision: D8864895 Pulled By: riversand963 fbshipit-source-id: 5937a48817276370d3c8172db9c8aafc826d97ca	2018-08-03 17:42:34 -07:00
Yanqin Jin	54de56844d	Remove random writes from SST file ingestion (#4172 ) Summary: RocksDB used to store global_seqno in external SST files written by SstFileWriter. During file ingestion, RocksDB uses `pwrite` to update the `global_seqno`. Since random write is not supported in some non-POSIX compliant file systems, external SST file ingestion is not supported on these file systems. To address this limitation, we no longer update `global_seqno` during file ingestion. Later RocksDB uses the MANIFEST and other information in table properties to deduce global seqno for externally-ingested SST files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4172 Differential Revision: D8961465 Pulled By: riversand963 fbshipit-source-id: 4382ec85270a96be5bc0cf33758ca2b167b05071	2018-07-27 16:12:23 -07:00
DorianZheng	f5e46354d2	Protect external file when ingesting (#4099 ) Summary: If a crash happens after a hard link is established, the Recover function may reuse the file number that has already been assigned to the internal file, and this will overwrite the external file. To protect the external file, we have to make sure the file number will never be reused. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4099 Differential Revision: D9034092 Pulled By: riversand963 fbshipit-source-id: 3f1a737440b86aa2ef01673e5013aacbb7c33e28	2018-07-27 14:13:12 -07:00
Anand Ananthabhotla	52d4c9b7f6	Allow DB resume after background errors (#3997 ) Summary: Currently, if RocksDB encounters errors during a write operation (user requested or BG operations), it sets DBImpl::bg_error_ and fails subsequent writes. This PR allows the DB to be resumed for certain classes of errors. It consists of 3 parts - 1. Introduce Status::Severity in rocksdb::Status to indicate whether a given error can be recovered from or not 2. Refactor the error handling code so that setting bg_error_ and deciding on severity is in one place 3. Provide an API for the user to clear the error and resume the DB instance This whole change is broken up into multiple PRs. Initially, we only allow clearing the error for Status::NoSpace() errors during background flush/compaction. Subsequent PRs will expand this to include more errors and foreground operations such as Put(), and implement a polling mechanism for out-of-space errors. Closes https://github.com/facebook/rocksdb/pull/3997 Differential Revision: D8653831 Pulled By: anand1976 fbshipit-source-id: 6dc835c76122443a7668497c0226b4f072bc6afd	2018-06-28 12:34:40 -07:00
Andrew Kryczka	3d7dc75b36	Bottommost level-based compactions in bottom-pri pool Summary: This feature was introduced for universal compaction in `cc01985d`. At that point we thought it'd be used only to prevent long-running universal full compactions from blocking short-lived upper-level compactions. Now we have a level compaction user who could benefit from it since they use more expensive compression algorithm in the bottom level. So enable it for level. Closes https://github.com/facebook/rocksdb/pull/3835 Differential Revision: D7957179 Pulled By: ajkr fbshipit-source-id: 177285d2cef3b650b6a4d81dc5db84bc441c9fe4	2018-05-14 14:57:15 -07:00
Siying Dong	d59549298f	Skip deleted WALs during recovery Summary: This patch records the min log number to keep to the manifest while flushing SST files, so that they and any WAL older than them are ignored during recovery. This is to avoid scenarios where there is a gap between the WAL files that are fed to the recovery procedure. The gap could happen through, for example, out-of-order WAL deletion. Such a gap could cause problems in 2PC recovery where the prepare and commit entries are placed into two separate WALs, and a gap in the WALs could result in not processing the WAL with the commit entry, hence breaking the 2PC recovery logic. Before the commit, for the 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest logs with outstanding prepare entries, or prepare entries whose respective commit or abort are in memtable. With the commit, the same calculation is done while we apply the SST flush. Just before installing the flush file, we precompute the earliest log file to keep after the flush finishes using the same logic (but skipping the memtables just flushed), and record this information in the manifest entry for this new flushed SST file. This pre-computed value is also remembered in memory, and will later be used to determine whether a log file can be deleted. This value is unlikely to change until the next flush because the commit entry will stay in memtable. (In WritePrepared, we could have removed the older log files as soon as all prepared entries are committed. It's not yet done anyway. Even if we do it, the only thing we lose with this new approach is earlier log deletion between two flushes, which is not guaranteed to happen anyway because the obsolete file clean-up function is only executed after flush or compaction) This min log number to keep is stored in the manifest using the safely-ignore customized field of the AddFile entry, in order to guarantee that a DB generated using a newer release can be opened by previous releases no older than 4.2.
Closes https://github.com/facebook/rocksdb/pull/3765 Differential Revision: D7747618 Pulled By: siying fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729	2018-05-03 15:43:09 -07:00
Huachao Huang	ed7a95b28c	Add max_subcompactions as a compaction option Summary: Sometimes we want to compact files as fast as possible, but don't want to set a large `max_subcompactions` in the `DBOptions` by default. I add a `max_subcompactions` option to `CompactionOptions` so that we can choose a proper concurrency dynamically. Closes https://github.com/facebook/rocksdb/pull/3775 Differential Revision: D7792357 Pulled By: ajkr fbshipit-source-id: 94f54c3784dce69e40a229721a79a97e80cd6a6c	2018-04-27 11:57:39 -07:00
Yanqin Jin	7dfbe33532	Rename pending_compaction_ to queued_for_compaction_. Summary: We use `queued_for_flush_` to indicate a column family has been added to the flush queue. Similarly and to be consistent in our naming, we need to use `queued_for_compaction_` to indicate a column family has been added to the compaction queue. In the past we used `pending_compaction_` which can also be ambiguous. Closes https://github.com/facebook/rocksdb/pull/3781 Differential Revision: D7790063 Pulled By: riversand963 fbshipit-source-id: 6786b11a4fcaea36dc9b4672233dbe042f921804	2018-04-27 11:12:01 -07:00
Yanqin Jin	513b5ce618	Rename pending_flush_ to queued_for_flush_. Summary: With ColumnFamilyData::pending_flush_, we have the following code snippet in DBImpl::SchedulePendingFlush ``` if (!cfd->pending_flush() && cfd->imm()->IsFlushPending()) { ... } ``` `Pending` is ambiguous, and I feel `queued_for_flush` is a better name, especially for the sake of readability. Closes https://github.com/facebook/rocksdb/pull/3777 Differential Revision: D7783066 Pulled By: riversand963 fbshipit-source-id: f1bd8c8bfe5eafd2c94da0d8566c9b2b6bb57229	2018-04-26 21:12:51 -07:00
Siying Dong	63c965cdb4	Sync parent directory after deleting a file in delete scheduler Summary: sync parent directory after deleting a file in delete scheduler. Otherwise, trim speed may not be as smooth as what we want. Closes https://github.com/facebook/rocksdb/pull/3767 Differential Revision: D7760136 Pulled By: siying fbshipit-source-id: ec131d53b61953f09c60d67e901e5eeb2716b05f	2018-04-26 13:58:20 -07:00
Zhongyi Xie	e1e826b980	check return status for Sync() and Append() calls to avoid corruption Summary: Right now in `SyncClosedLogs`, `CopyFile`, and `AddRecord`, where `Sync` and `Append` are invoked in a loop, the error statuses are not checked. This could lead to potential corruption as later calls will overwrite the error status. Closes https://github.com/facebook/rocksdb/pull/3740 Differential Revision: D7678848 Pulled By: miasantreble fbshipit-source-id: 4b0b412975989dfe80348f73217b9c4122a4bd77	2018-04-19 14:13:46 -07:00
David Lai	3be9b36453	comment unused parameters to turn on -Wunused-parameter flag Summary: This PR comments out the rest of the unused arguments, which allows us to turn on the -Wunused-parameter flag. This is the second part of a codemod relating to https://github.com/facebook/rocksdb/pull/3557. Closes https://github.com/facebook/rocksdb/pull/3662 Differential Revision: D7426121 Pulled By: Dayvedde fbshipit-source-id: 223994923b42bd4953eb016a0129e47560f7e352	2018-04-12 17:59:16 -07:00

103 Commits