rocksdb

Author	SHA1	Message	Date
Peter Pei	09814f2cfc	support OnCompactionBegin (#4431 ) Summary: fix #4288 Add `OnCompactionBegin` support to `rocksdb::EventListener`. Currently, we only have these three callbacks: - OnFlushBegin - OnFlushCompleted - OnCompactionCompleted As paolococchi requested in #4288 , and ajkr agreed, we should also support `OnCompactionBegin`. This PR is a try to implement the support of `OnCompactionBegin`. Hope it is useful to you. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4431 Differential Revision: D10055515 Pulled By: yiwu-arbug fbshipit-source-id: 39c0f95f8e9ff1c7ca3a10787502a17f258d2334	2018-10-10 17:32:27 -07:00
Andrew Kryczka	faa70fc575	DeleteRange regression tests using public API (#4476 ) Summary: I wrote a couple tests using the public API to expose/prevent the bugs we talked. In particular, - When files have overlapping endpoints and a range tombstone spans them, ensure the largest key does not reappear to readers. This was happening due to a bug that skipped writing range tombstones to an output file when their begin key exactly matched the file's largest key. - When a tombstone spans multiple atomic compaction units, ensure newer keys do not disappear by being compacted beneath it. This happened due to a range tombstone appearing untruncated to readers when it spanned files with overlapping endpoints, even if it extended into files without overlapping endpoints (i.e., different atomic compaction units). Pull Request resolved: https://github.com/facebook/rocksdb/pull/4476 Differential Revision: D10286001 Pulled By: ajkr fbshipit-source-id: bb5ca51d0f90812fb37bfe1d01aec93f7eda55aa	2018-10-10 12:30:11 -07:00
Abhishek Madan	9c6fea7fe1	Update HISTORY.md, fix unity_test failure (#4479 ) Summary: Follow-up to https://github.com/facebook/rocksdb/pull/4432. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4479 Differential Revision: D10304151 Pulled By: abhimadan fbshipit-source-id: 3608b95c324702ca26791f95cb26dae1d49efbe7	2018-10-10 12:09:56 -07:00
Anand Ananthabhotla	854a4be03f	Handle mixed slowdown/no_slowdown writer properly (#4475 ) Summary: There is a bug when the write queue leader is blocked on a write delay/stop, and the queue has writers with WriteOptions::no_slowdown set to true. They are not woken up until the write stall is cleared. The fix introduces a dummy writer inserted at the tail to indicate a write stall and prevent further inserts into the queue, and a condition variable that writers who can tolerate slowdown wait on before adding themselves to the queue. The leader calls WriteThread::BeginWriteStall() to add the dummy writer and then walk the queue to fail any writers with no_slowdown set. Once the stall clears, the leader calls WriteThread::EndWriteStall() to remove the dummy writer and signal the condition variable. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4475 Differential Revision: D10285827 Pulled By: anand1976 fbshipit-source-id: 747465e5e7f07a829b1fb0bc1afcd7b93f4ab1a9	2018-10-09 22:52:40 -07:00
jsteemann	141ef7f8d3	avoid copying when iterating using range-based for (#4459 ) Summary: this avoids a few copies of std::string and other structs in the context of range-based for loops. instead of copying the values for each iteration, use a const reference to avoid copying. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4459 Differential Revision: D10282045 Pulled By: sagar0 fbshipit-source-id: 5012e910dca279abd2be847e1fb432d96274edfb	2018-10-09 17:15:51 -07:00
jsteemann	517d3b8b77	fix typo in error message, twice (#4457 ) Summary: Fixes a typo in error messages returned by Iterator::GetProperty(...) Pull Request resolved: https://github.com/facebook/rocksdb/pull/4457 Differential Revision: D10281965 Pulled By: sagar0 fbshipit-source-id: 1cd3c665f467ef06cdfd9f482692e6f8568f3d22	2018-10-09 17:07:27 -07:00
Abhishek Madan	3a4bd36fed	Truncate range tombstones by leveraging InternalKeys (#4432 ) Summary: To more accurately truncate range tombstones at SST boundaries, we now represent them in RangeDelAggregator using InternalKeys, which are end-key-exclusive as they were before this change. During compaction, "atomic compaction unit boundaries" (the range of keys contained in neighbouring and overlaping SSTs) are propagated down to RangeDelAggregator to truncate range tombstones at those boundariies instead. See https://github.com/facebook/rocksdb/pull/4432#discussion_r221072219 and https://github.com/facebook/rocksdb/pull/4432#discussion_r221138683 for motivating examples. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4432 Differential Revision: D10263952 Pulled By: abhimadan fbshipit-source-id: 2fe85ff8a02b3a6a2de2edfe708012797a7bd579	2018-10-09 15:19:38 -07:00
Zhongyi Xie	283a700f5d	add locking around calls to RecalculateWriteStallConditions in column_family_test (#4474 ) Summary: this should fix the current failing TSAN jobs: The callstack for TSAN: > WARNING: ThreadSanitizer: data race (pid=87440) Read of size 8 at 0x7d580000fce0 by thread T22 (mutexes: write M548703): #0 rocksdb::InternalStats::DumpCFStatsNoFileHistogram(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) db/internal_stats.cc:1204 (column_family_test+0x00000080eca7) #1 rocksdb::InternalStats::DumpCFStats(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) db/internal_stats.cc:1169 (column_family_test+0x0000008106d0) #2 rocksdb::InternalStats::HandleCFStats(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, rocksdb::Slice) db/internal_stats.cc:578 (column_family_test+0x000000810720) #3 rocksdb::InternalStats::GetStringProperty(rocksdb::DBPropertyInfo const&, rocksdb::Slice const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) db/internal_stats.cc:488 (column_family_test+0x00000080670c) #4 rocksdb::DBImpl::DumpStats() db/db_impl.cc:625 (column_family_test+0x00000070ce9a) > Previous write of size 8 at 0x7d580000fce0 by main thread: #0 rocksdb::InternalStats::AddCFStats(rocksdb::InternalStats::InternalCFStatsType, unsigned long) db/internal_stats.h:324 (column_family_test+0x000000693bbf) #1 rocksdb::ColumnFamilyData::RecalculateWriteStallConditions(rocksdb::MutableCFOptions const&) db/column_family.cc:818 (column_family_test+0x000000693bbf) #2 rocksdb::ColumnFamilyTest_WriteStallSingleColumnFamily_Test::TestBody() db/column_family_test.cc:2563 (column_family_test+0x0000005e5a49) Pull Request resolved: https://github.com/facebook/rocksdb/pull/4474 Differential Revision: D10262099 Pulled By: miasantreble fbshipit-source-id: 1247973a3ca32e399b4575d3401dd5439c39efc5	2018-10-09 14:10:13 -07:00
Zhongyi Xie	cac87fcf57	move dump stats to a separate thread (#4382 ) Summary: Currently statistics are supposed to be dumped to info log at intervals of `options.stats_dump_period_sec`. However the implementation choice was to bind it with compaction thread, meaning if the database has been serving very light traffic, the stats may not get dumped at all. We decided to separate stats dumping into a new timed thread using `TimerQueue`, which is already used in blob_db. This will allow us schedule new timed tasks with more deterministic behavior. Tested with db_bench using `--stats_dump_period_sec=20` in command line: > LOG:2018/09/17-14:07:45.575025 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS ------- LOG:2018/09/17-14:08:05.643286 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS ------- LOG:2018/09/17-14:08:25.691325 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS ------- LOG:2018/09/17-14:08:45.740989 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS ------- LOG content: > 2018/09/17-14:07:45.575025 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS ------- 2018/09/17-14:07:45.575080 7fe99fbfe700 [WARN] [db/db_impl.cc:606] DB Stats Uptime(secs): 20.0 total, 20.0 interval Cumulative writes: 4447K writes, 4447K keys, 4447K commit groups, 1.0 writes per commit group, ingest: 5.57 GB, 285.01 MB/s Cumulative WAL: 4447K writes, 0 syncs, 4447638.00 writes per sync, written: 5.57 GB, 285.01 MB/s Cumulative stall: 00:00:0.012 H:M:S, 0.1 percent Interval writes: 4447K writes, 4447K keys, 4447K commit groups, 1.0 writes per commit group, ingest: 5700.71 MB, 285.01 MB/s Interval WAL: 4447K writes, 0 syncs, 4447638.00 writes per sync, written: 5.57 MB, 285.01 MB/s Interval stall: 00:00:0.012 H:M:S, 0.1 percent Compaction Stats [default] Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Pull Request resolved: https://github.com/facebook/rocksdb/pull/4382 Differential Revision: D9933051 Pulled By: miasantreble fbshipit-source-id: 6d12bb1e4977674eea4bf2d2ac6d486b814bb2fa	2018-10-08 22:54:43 -07:00
DorianZheng	27090ae8f6	Fix DBImpl::GetColumnFamilyHandleUnlocked race condition (#4391 ) Summary: - Fix DBImpl API race condition The timeline of execution flow is as follow: ``` timeline user_thread1 user_thread2 t1 \| cfh = GetColumnFamilyHandleUnlocked(0) t2 \| id1 = cfh->GetID() t3 \| GetColumnFamilyHandleUnlocked(1) t4 \| id2 = cfh->GetID() V ``` The original implementation return a pointer to a stateful variable, so that the return `ColumnFamilyHandle` will be changed when another thread calls `GetColumnFamilyHandleUnlocked` with different `column family id` - Expose ColumnFamily ID to compaction event listener - Fix the return status of `DBImpl::GetLatestSequenceForKey` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4391 Differential Revision: D10221243 Pulled By: yiwu-arbug fbshipit-source-id: dec60ee9ff0c8261a2f2413a8506ec1063991993	2018-10-08 14:24:16 -07:00
DorianZheng	e0f05754ba	Expose column family id to OnCompactionCompleted (#4466 ) Summary: The controller you requested could not be found. PTAL Pull Request resolved: https://github.com/facebook/rocksdb/pull/4466 Differential Revision: D10241358 Pulled By: yiwu-arbug fbshipit-source-id: 99664eb286860a6c8844d50efeb0ef6f0e10dd1e	2018-10-08 14:24:16 -07:00
DorianZheng	7487a7628c	Fix return status of DBImpl::GetLatestSequenceForKey Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4467 Differential Revision: D10241418 Pulled By: yiwu-arbug fbshipit-source-id: f6adbe7292b2c934e14971c7432b3eb115c35026	2018-10-08 14:22:05 -07:00
Maysam Yabandeh	21b51dfec4	Add inline comments to flush job (#4464 ) Summary: It also renames InstallMemtableFlushResults to MaybeInstallMemtableFlushResults to clarify its contract. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4464 Differential Revision: D10224918 Pulled By: maysamyabandeh fbshipit-source-id: 04e3f2d8542002cb9f8010cb436f5152751b3cbe	2018-10-05 15:41:17 -07:00
Maysam Yabandeh	1fb6805527	Fix snprintf buffer overflow bug (#4465 ) Summary: The contract of snprintf says that it returns "The number of characters that would have been written if n had been sufficiently large" http://www.cplusplus.com/reference/cstdio/snprintf/ The existing code however was assuming that the return value is the actual number of written bytes and uses that to reposition the starting point on the next call to snprintf. This leads to buffer overflow when the last call to snprintf has filled up the buffer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4465 Differential Revision: D10224080 Pulled By: maysamyabandeh fbshipit-source-id: 40f44e122d15b0db439812a0a361167cf012de3e	2018-10-05 14:50:51 -07:00
Dmitry Alimov	e13d8dcbbb	Fix typos in comments (#4456 ) Summary: Fix some typos in the comments Pull Request resolved: https://github.com/facebook/rocksdb/pull/4456 Differential Revision: D10209214 Pulled By: miasantreble fbshipit-source-id: dff857ba60396bc95126e635db96d7dc8330d2cb	2018-10-04 20:46:50 -07:00
Zhongyi Xie	ce1fc5af09	fix unused param `allocator` in compression.h (#4453 ) Summary: this should fix currently failing contrun test: rocksdb-contrun-no_compression, rocksdb-contrun-tsan, rocksdb-contrun-tsan_crash Pull Request resolved: https://github.com/facebook/rocksdb/pull/4453 Differential Revision: D10202626 Pulled By: miasantreble fbshipit-source-id: 850b07f14f671b5998c22d8239e2a55b2fc1e355	2018-10-04 13:24:22 -07:00
JiYou	a1f6142f38	VersionSet: GetOverlappingInputs() fix overflow and optimize. (#4385 ) Summary: This fix is for `level == 0` in `GetOverlappingInputs()`: - In `GetOverlappingInputs()`, if `level == 0`, it has potential risk of overflow if `i == 0`. - Optmize process when `expand = true`, the expected complexity can be reduced to O(n). Signed-off-by: JiYou <jiyou09@gmail.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/4385 Differential Revision: D10181001 Pulled By: riversand963 fbshipit-source-id: 46eef8a1d1605c9329c164e6471cd5c5b6de16b5	2018-10-03 18:40:59 -07:00
Yanqin Jin	4e58b2ea3d	Check for compression lib support before test exec (#4443 ) Summary: Before running CompactFilesTest.SentinelCompressionType, we should check whether zlib and snappy are supported. CompactFilesTest.SentinelCompressionType is a newly added test. Compilation and linking with different options, e.g. COMPILE_WITH_TSAN, COMPILE_WITH_ASAN, etc. lead to generation of different binaries. On the one hand, it's not clear why zlib or snappy is present under ASAN, but not under TSAN. On the other hand, changing the compilation flags for TSAN or ASAN seems a bigger change worth much more attention. To unblock the cont-runs, I suggest that we simply add these two checks at the beginning of the test, as we did for GeneralTableTest.ApproximateOffsetOfCompressed in table/table_test.cc. Future actions include invesigating the absence of zlib and snappy when compiling with TSAN, i.e. COMPILE_WITH_TSAN=1, if necessary. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4443 Differential Revision: D10140935 Pulled By: riversand963 fbshipit-source-id: 62f96d1e685386accd2ef0b98f6f754d3fd67b3e	2018-10-02 10:42:01 -07:00
Yanqin Jin	be5cc4c7b8	Remove a race condition between lsdir and rm (#4440 ) Summary: In DBCompactionTestWithParam::ManualLevelCompactionOutputPathId, there is a race condition between `DBTestBase::GetSstFileCount` and `DBImpl::PurgeObsoleteFiles`. The following graph explains why. ``` Timeline db_compact_test_t bg_flush_t bg_compact_t \| [initiate bg flush and \| start waiting] \| flush \| DeleteObsoleteFiles \| [waken up by bg_flush_t which \| signaled in DeleteObsoleteFiles] \| \| [initiate compaction and \| start waiting] \| \| [compact, \| set manual.done to true] \| [signal at the end of \| BackgroundCallFlush] \| \| [waken up by bg_flush_t \| which signaled before \| returning from \| BackgroundCallFlush] \| \| Check manual.done is true \| \| GetSstFileCount <-- race condition --> PurgeObsoleteFiles V ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4440 Differential Revision: D10122628 Pulled By: riversand963 fbshipit-source-id: 3ede73c39fee6ad804dc6ac1ed84759c7e63977f	2018-10-01 11:57:55 -07:00
Andrew Kryczka	ac6f435a9a	Fix CompactFiles support for kDisableCompressionOption (#4438 ) Summary: Previously `CompactFiles` with `CompressionType::kDisableCompressionOption` caused program to crash on assertion failure. This PR fixes the crash by adding support for that setting. Now, that setting will cause RocksDB to choose compression according to the column family's options. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4438 Differential Revision: D10115761 Pulled By: ajkr fbshipit-source-id: a553c6fa76fa5b6f73b0d165d95640da6f454122	2018-10-01 01:18:10 -07:00
JiYou	75ca13875c	FindFile: use std::lower_bound reduce the repeated code. (#4372 ) Summary: `FindFile()` and `FindFileInRange()` actually works as the same of `std::lower_bound()`. Use `std::lower_bound()` to reduce the repeated code. - change `FindFile()` and `FindFileInRange()` to use `std::lower_bound()` Signed-off-by: JiYou <jiyou09@gmail.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/4372 Differential Revision: D9919677 Pulled By: ajkr fbshipit-source-id: f74aaa30e2f80e410e299c5a5bca4eaf2a7a26de	2018-09-27 10:35:00 -07:00
Yi Wu	dc813e4b85	Improve log handling when recover without flush (#4405 ) Summary: Improve log handling when avoid_flush_during_recovery=true. 1. restore total_log_size_ after recovery, by summing up existing log sizes. Fixes #4253. 2. truncate the last existing log, since this log can contain preallocated space and it will be a waste to keep the space. It avoids a crash loop of user application cause a lot of log with non-trivial size being created and ultimately take up all disk space. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4405 Differential Revision: D9953933 Pulled By: yiwu-arbug fbshipit-source-id: 967780fee8acec7f358b6eb65190fb4684f82e56	2018-09-26 10:37:48 -07:00
Nikhil Benesch	17edc82a4b	Handle tombstones at the same seqno in the CollapsedRangeDelMap (#4424 ) Summary: The CollapsedRangeDelMap was entirely mishandling tombstones at the same sequence number when the tombstones did not have identical start and end keys. Such tombstones are common since `90fc40690`, which causes tombstones to be split during compactions. For example, if the tombstone [a, c) @ 1 lies across a compaction boundary at b, it will be split into [a, b) @ 1 and [b, c) @ 1. Without this patch, the collapsed range deletion map would look like this: a -> 1 b -> 1 c -> 0 Notice how the b -> 1 entry is redundant. When the tombstones overlap, the problem is even worse. Consider tombstones [a, c) @ 1 and [b, d) @ 1, which produces this map without this patch: a -> 1 b -> 1 c -> 0 d -> 0 This map is corrupt, as a map can never contain adjacent sentinel (zero) entries. When the iterator advances from b to c, it will notice that c is a sentinel enty and skip to d--but d is also a sentinel entry! Asking what tombstone this iterator points to will trigger an assertion, as it is not pointing to a valid tombstone. /cc ajkr Pull Request resolved: https://github.com/facebook/rocksdb/pull/4424 Differential Revision: D10039248 Pulled By: abhimadan fbshipit-source-id: 6d737c1e88d60e80cf27286726627ba44463e7f4	2018-09-25 14:50:31 -07:00
Abhishek Madan	3c350a7cf0	Improve RangeDelAggregator benchmarks (#4395 ) Summary: Improve time measurements for AddTombstones to only include the call and not the VectorIterator setup. Also add a new add_tombstones_per_run flag to call AddTombstones multiple times per aggregator, which will help simulate more realistic workloads. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4395 Differential Revision: D9996811 Pulled By: abhimadan fbshipit-source-id: 5865a95c323fbd9b3606493013664b4890fe5a02	2018-09-21 16:13:08 -07:00
Anand Ananthabhotla	72712f4e28	Allow dynamic modification of window size and deletion trigger (#4403 ) Summary: Make the CompactOnDeletionCollectorFactory class public, and provide methods to update the window size and deletion trigger params. These will take effect on subsequent created SST files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4403 Differential Revision: D9976857 Pulled By: anand1976 fbshipit-source-id: 31dbf0511c12fa2bb9b2a7ba620079e0ee09cf48	2018-09-20 15:15:28 -07:00
Andrew Kryczka	990b52e95b	Unit test for custom comparator RangeDelAggregator (#4388 ) Summary: Add a unit test for range collapsing when non-default comparator is used. This exposes the bug fixed in #4386. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4388 Differential Revision: D9918252 Pulled By: ajkr fbshipit-source-id: 99501b96b251eab41791a7e33b27055ee36c5c39	2018-09-18 12:13:20 -07:00
jsteemann	27221b0cc2	use specified comparator in CollapsedRangeDelMap (#4386 ) Summary: The Comparator passed to CollapsedRangeDelMap was not used for operator less of the std::map `rep_` object contained in CollapsedRangeDelMap. So the map was always sorted using the default ByteWiseComparator, which seems wrong. Passing the specified Comparator through for usage in that map object fixes actual problems we were seeing with RangeDelete operations that do not delete keys as expected when using a custom Comparator. I found that the tests in current master crash when I run them locally, both with and without my patch, at the very same location. I therefore don't know if the patch breaks something else, but it seems to fix RangeDeletion issues in our product that uses RocksDB. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4386 Differential Revision: D9916506 Pulled By: ajkr fbshipit-source-id: 27bff8c775831f089dde8c5289df7343d88b2d66	2018-09-18 09:28:30 -07:00
Maysam Yabandeh	65ac72edd9	Fix bug in partition filters with format_version=4 (#4381 ) Summary: Value delta encoding in format_version 4 requires the differences between the size of two consecutive handles to be sent to BlockBuilder::Add. This applies not only to indexes on blocks but also the indexes on indexes and filters in partitioned indexes and filters respectively. The patch fixes a bug where the partitioned filters would encode the entire size of the handle rather than the difference of the size with the last size. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4381 Differential Revision: D9879505 Pulled By: maysamyabandeh fbshipit-source-id: 27a22e49b482b927fbd5629dc310c46d63d4b6d1	2018-09-17 17:28:15 -07:00
Abhishek Madan	1626f6ab6b	Add RangeDelAggregator microbenchmarks (#4363 ) Summary: To measure the results of upcoming DeleteRange v2 work, this commit adds simple benchmarks for RangeDelAggregator. It measures the average time for AddTombstones and ShouldDelete calls. Using this to compare the results before #4014 and on the latest master (using the default arguments) produces the following results: Before #4014: ``` ======================= Results: ======================= AddTombstones: 1356.28 us ShouldDelete: 0.401732 us ``` Latest master: ``` ======================= Results: ======================= AddTombstones: 740.82 us ShouldDelete: 0.383271 us ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4363 Differential Revision: D9881676 Pulled By: abhimadan fbshipit-source-id: 793e7d61aa4b9d47eb917bbcc03f08695b5e5442	2018-09-17 14:58:31 -07:00
Anand Ananthabhotla	30c21df97c	Fix regression test failures introduced by PR #4164 (#4375 ) Summary: 1. Add override keyword to overridden virtual functions in EventListener 2. Fix a memory corruption that can happen during DB shutdown when in read-only mode due to a background write error 3. Fix uninitialized buffers in error_handler_test.cc that cause valgrind to complain Pull Request resolved: https://github.com/facebook/rocksdb/pull/4375 Differential Revision: D9875779 Pulled By: anand1976 fbshipit-source-id: 022ede1edc01a9f7e21ecf4c61ef7d46545d0640	2018-09-17 13:14:07 -07:00
Anand Ananthabhotla	a27fce408e	Auto recovery from out of space errors (#4164 ) Summary: This commit implements automatic recovery from a Status::NoSpace() error during background operations such as write callback, flush and compaction. The broad design is as follows - 1. Compaction errors are treated as soft errors and don't put the database in read-only mode. A compaction is delayed until enough free disk space is available to accomodate the compaction outputs, which is estimated based on the input size. This means that users can continue to write, and we rely on the WriteController to delay or stop writes if the compaction debt becomes too high due to persistent low disk space condition 2. Errors during write callback and flush are treated as hard errors, i.e the database is put in read-only mode and goes back to read-write only fater certain recovery actions are taken. 3. Both types of recovery rely on the SstFileManagerImpl to poll for sufficient disk space. We assume that there is a 1-1 mapping between an SFM and the underlying OS storage container. For cases where multiple DBs are hosted on a single storage container, the user is expected to allocate a single SFM instance and use the same one for all the DBs. If no SFM is specified by the user, DBImpl::Open() will allocate one, but this will be one per DB and each DB will recover independently. The recovery implemented by SFM is as follows - a) On the first occurance of an out of space error during compaction, subsequent compactions will be delayed until the disk free space check indicates enough available space. The required space is computed as the sum of input sizes. b) The free space check requirement will be removed once the amount of free space is greater than the size reserved by in progress compactions when the first error occured c) If the out of space error is a hard error, a background thread in SFM will poll for sufficient headroom before triggering the recovery of the database and putting it in write-only mode. The headroom is calculated as the sum of the write_buffer_size of all the DB instances associated with the SFM 4. EventListener callbacks will be called at the start and completion of automatic recovery. Users can disable the auto recov ery in the start callback, and later initiate it manually by calling DB::Resume() Todo: 1. More extensive testing 2. Add disk full condition to db_stress (follow-on PR) Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164 Differential Revision: D9846378 Pulled By: anand1976 fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a	2018-09-15 13:43:04 -07:00
Sagar Vemuri	3db584059c	Remove sync point from Block destructor (#4370 ) Summary: AddressSanitizer: heap-use-after-free in std::__atomic_base<bool>::load(std::memory_order) const ==1798517==ABORTING ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4370 Differential Revision: D9844146 Pulled By: sagar0 fbshipit-source-id: 18a2970b1d504b4f6c8fb04857f26e0f32124dd1	2018-09-15 00:12:57 -07:00
Dmitri Smirnov	879998b369	Adjust c test and fix windows compilation issues Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4369 Differential Revision: D9844200 Pulled By: sagar0 fbshipit-source-id: 0d9f5f73b28234eaac55d3551ce4e2dc177af138	2018-09-14 20:57:22 -07:00
JiYou	82e8e9e26b	VersionBuilder: optmize SaveTo() to linear time. (#4366 ) Summary: Because `base_files` and `added_files` both are sorted, using a merge operation to these two sorted arrays is more effective. The complexity is reduced to linear time. - optmize the merge complexity. - move the `NDEBUG` of sorted `added_files` out of merge process. Signed-off-by: JiYou <jiyou09@gmail.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/4366 Differential Revision: D9833592 Pulled By: ajkr fbshipit-source-id: dd32b67ebdca4c20e5e9546ab8082cecefe99fd0	2018-09-14 19:43:04 -07:00
Yanqin Jin	8959063c9c	Store the return value of Fsync for check Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4361 Differential Revision: D9803723 Pulled By: riversand963 fbshipit-source-id: 5a0d4cd3e57fd195571dcd5822895ee00547fa6a	2018-09-14 13:29:56 -07:00
Yanqin Jin	82057b0d8f	Improve type conversion (#4367 ) Summary: Use `static_cast<type>(var)` instead of `(type)var`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4367 Differential Revision: D9833391 Pulled By: riversand963 fbshipit-source-id: 3d33fc2c290e7e0f3d1d45b256a881d1bc5a7df2	2018-09-14 11:12:52 -07:00
Andrew Kryczka	c94523ee56	Delete code for WAL reader to start at nonzero offset (#4362 ) Summary: The code is dead in RocksDB as `log::Reader::initial_offset_` is always zero. We should delete it so we don't have to maintain it like in #4359. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4362 Differential Revision: D9817829 Pulled By: ajkr fbshipit-source-id: 474a2c679e5bd273b40608f3a5332931d9eefe6d	2018-09-13 17:13:03 -07:00
Vitaly Isaev	0bd2ede10e	Memory usage stats in C API (#4340 ) Summary: Please consider this small PR providing access to the `MemoryUsage::GetApproximateMemoryUsageByType` function in plain C API. Actually I'm working on Go application and now trying to investigate the reasons of high memory consumption (#4313). Go [wrappers](https://github.com/tecbot/gorocksdb) are built on the top of Rocksdb C API. According to the #706, `MemoryUsage::GetApproximateMemoryUsageByType` is considered as the best option to get database internal memory usage stats, but it wasn't supported in C API yet. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4340 Differential Revision: D9655135 Pulled By: ajkr fbshipit-source-id: a3d2f3f47c143ae75862fbcca2f571ea1b49e14a	2018-09-13 14:27:31 -07:00
Dan Melnic	ca92fc71a4	Initialize uninitialized std::atomic variables Summary: Initialize uninitialized std::atomic variables Reviewed By: yfeldblum Differential Revision: D9758050 fbshipit-source-id: 865d89eddafc81f3cab6f11e2ebb669f7ff70d04	2018-09-12 08:58:05 -07:00
Abhishek Madan	c86a22ac43	Restrict RangeDelAggregator's tombstone end-key truncation (#4356 ) Summary: `RangeDelAggregator::AddTombstones` contained an assertion which stated that, if a range tombstone extended past the largest key in the sstable, then `FileMetaData::largest` must have a sentinel sequence number of `kMaxSequenceNumber`, which implies that the tombstone's end key is safe to truncate. However, `largest` will not be a sentinel key when the next sstable in the level's smallest key is equal to the current sstable's largest key, which caused the assertion to fail. The assertion must hold for the truncation to be safe, so it has been moved to an additional check on end-key truncation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4356 Differential Revision: D9760891 Pulled By: abhimadan fbshipit-source-id: 7c20c3885cd919dcd14f291f88fd27aa33defebc	2018-09-10 17:42:43 -07:00
Maysam Yabandeh	3f5282268f	Skip concurrency control during recovery of pessimistic txn (#4346 ) Summary: TransactionOptions::skip_concurrency_control allows pessimistic transactions to skip the overhead of concurrency control. This could be as an optimization if the application knows that the transaction would not have any conflict with concurrent transactions. It is currently used during recovery assuming (i) application guarantees no conflict between prepared transactions in the WAL (ii) application guarantees that recovered transactions will be rolled back/commit before new transactions start. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4346 Differential Revision: D9759149 Pulled By: maysamyabandeh fbshipit-source-id: f896e84fa58b0b584be904c7fd3883a41ea3215b	2018-09-10 16:57:53 -07:00
Anand Ananthabhotla	ced618cf39	Fix a lint error due to unspecified move evaluation order (#4348 ) Summary: In C++ 11, the order of argument and move evaluation in a statement such as below is unspecified - foo(a.b).bar(std::move(a)) The compiler is free to evaluate std::move(a) first, and then a.b is unspecified. In C++ 17, this will be safe if a draft proposal around function chaining rules is accepted. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4348 Differential Revision: D9688810 Pulled By: anand1976 fbshipit-source-id: e4651d0ca03dcf007e50371a0fc72c0d1e710fb4	2018-09-06 14:42:57 -07:00
cngzhnp	64324e329e	Support pragma once in all header files and cleanup some warnings (#4339 ) Summary: As you know, almost all compilers support "pragma once" keyword instead of using include guards. To be keep consistency between header files, all header files are edited. Besides this, try to fix some warnings about loss of data. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4339 Differential Revision: D9654990 Pulled By: ajkr fbshipit-source-id: c2cf3d2d03a599847684bed81378c401920ca848	2018-09-05 18:13:31 -07:00
Andrew Kryczka	1a88c43751	Reduce empty SST creation/deletion in compaction (#4336 ) Summary: This is a followup to #4311. Checking `!RangeDelAggregator::IsEmpty()` before opening a dedicated range tombstone SST did not properly prevent empty SSTs from being generated. That's because it relies on `CollapsedRangeDelMap::Size`, which had an underflow bug when the map was empty. This PR fixes that underflow bug. Also fixed an uninitialized variable in db_stress. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4336 Differential Revision: D9600080 Pulled By: ajkr fbshipit-source-id: bc6980ca79d2cd01b825ebc9dbccd51c1a70cfc7	2018-08-31 12:28:52 -07:00
Mikhail Antonov	927f274939	Avoiding write stall caused by manual flushes (#4297 ) Summary: Basically at the moment it seems it's possible to cause write stall by calling flush (either manually vis DB::Flush(), or from Backup Engine directly calling FlushMemTable() while background flush may be already happening. One of the ways to fix it is that in DBImpl::CompactRange() we already check for possible stall and delay flush if needed before we actually proceed to call FlushMemTable(). We can simply move this delay logic to separate method and call it from FlushMemTable. This is draft patch, for first look; need to check tests/update SyncPoints and most certainly would need to add allow_write_stall method to FlushOptions(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/4297 Differential Revision: D9420705 Pulled By: mikhail-antonov fbshipit-source-id: f81d206b55e1d7b39e4dc64242fdfbceeea03fcc	2018-08-29 12:12:55 -07:00
Andrew Kryczka	42733637e1	Sync CURRENT file during checkpoint (#4322 ) Summary: For the CURRENT file forged during checkpoint, we were forgetting to `fsync` or `fdatasync` it after its creation. This PR fixes it. Differential Revision: D9525939 Pulled By: ajkr fbshipit-source-id: a505483644026ee3f501cfc0dcbe74832165b2e3	2018-08-28 12:43:18 -07:00
Yanqin Jin	198459ce17	Fix an inaccurate comment (#4315 ) Summary: According to `4848bd0c4e/db/log_reader.cc (L355)`, the original text is misleading when describing the layout of RecyclableLogHeader. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4315 Differential Revision: D9505284 Pulled By: riversand963 fbshipit-source-id: 79994c37a69e7003f03453e7efc0186feeafa609	2018-08-24 18:13:20 -07:00
Shrikanth Shankar	4848bd0c4e	Drop unnecessary deletion markers during compaction (issue - 3842) (#4289 ) Summary: This PR fixes issue 3842. We drop deletion markers iff 1. We are the bottom most level AND 2. All other occurrences of the key are in the same snapshot range as the delete I've also enhanced db_stress_test to add an option that does a full compare of the keys. This is done by a single thread (thread # 0). For tests I've run (so far) make check -j64 db_stress db_stress --acquire_snapshot_one_in=1000 --ops_per_thread=100000 /* to verify that new code doesnt break existing tests / ./db_stress --compare_full_db_state_snapshot=true --acquire_snapshot_one_in=1000 --ops_per_thread=100000 / to verify new test code */ Pull Request resolved: https://github.com/facebook/rocksdb/pull/4289 Differential Revision: D9491165 Pulled By: shrikanthshankar fbshipit-source-id: ce144834f31736c189aaca81bed356ba990331e2	2018-08-24 15:17:54 -07:00
Yanqin Jin	7daae512d2	Refactor flush request queueing and processing (#3952 ) Summary: RocksDB currently queues individual column family for flushing. This is not sufficient to support the needs of some applications that want to enforce order/dependency between column families, given that multiple foreground and background activities can trigger flushing in RocksDB. This PR aims to address this limitation. Each flush request is described as a `FlushRequest` that can contain multiple column families. A background flushing thread pops one flush request from the queue at a time and processes it. This PR does not enable atomic_flush yet, but is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752). Pull Request resolved: https://github.com/facebook/rocksdb/pull/3952 Differential Revision: D8529933 Pulled By: riversand963 fbshipit-source-id: 78908a21e389a3a3f7de2a79bae0cd13af5f3539	2018-08-24 13:27:35 -07:00
Andrew Kryczka	17f9a181d5	Reduce empty SST creation/deletion during compaction (#4311 ) Summary: I have a PR to start calling `OnTableFileCreated` for empty SSTs: #4307. However, it is a behavior change so should not go into a patch release. This PR adds back a check to make sure range deletions at least exist before starting file creation. This PR should be safe to backport to earlier versions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4311 Differential Revision: D9493734 Pulled By: ajkr fbshipit-source-id: f0d43cda4cfd904f133cfe3a6eb622f52a9ccbe8	2018-08-24 12:27:57 -07:00
Andrew Kryczka	ee234e83e3	Invoke OnTableFileCreated for empty SSTs (#4307 ) Summary: The API comment on `OnTableFileCreationStarted` (`b6280d01f9/include/rocksdb/listener.h (L331-L333)`) led users to believe a call to `OnTableFileCreationStarted` will always be matched with a call to `OnTableFileCreated`. However, we were skipping the `OnTableFileCreated` call in one case: no error happens but also no file is generated since there's no data. This PR adds the call to `OnTableFileCreated` for that case. The filename will be "(nil)" and the size will be zero. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4307 Differential Revision: D9485201 Pulled By: ajkr fbshipit-source-id: 2f077ec7913f128487aae2624c69a50762394df6	2018-08-23 18:27:30 -07:00
Gauresh Rane	ad789e4e0d	Adding a method for memtable class for memtable getting flushed. (#4304 ) Summary: Memtables are selected for flushing by the flush job. Currently we have listener which is invoked when memtables for a column family are flushed. That listener does not indicate which memtable was flushed in the notification. If clients want to know if particular data in the memtable was retired, there is no straight forward way to know this. This method will help users who implement memtablerep factory and extend interface for memtablerep, to know if the data in the memtable was retired. Another option that was tried, was to depend on memtable destructor to be called after flush to mark that data was persisted. This works all the time but sometimes there can huge delays between actual flush happening and memtable getting destroyed. Hence, if anyone who is waiting for data to persist will have to wait that longer. It is expected that anyone who is implementing this method to have return quickly as it blocks RocksDB. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4304 Reviewed By: riversand963 Differential Revision: D9472312 Pulled By: gdrane fbshipit-source-id: 8e693308dee749586af3a4c5d4fcf1fa5276ea4d	2018-08-23 17:14:25 -07:00
Yanqin Jin	bb5dcea98e	Add path to WritableFileWriter. (#4039 ) Summary: We want to sample the file I/O issued by RocksDB and report the function calls. This requires us to include the file paths otherwise it's hard to tell what has been going on. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4039 Differential Revision: D8670178 Pulled By: riversand963 fbshipit-source-id: 97ee806d1c583a2983e28e213ee764dc6ac28f7a	2018-08-23 10:12:58 -07:00
Zhongyi Xie	f1f5ba085f	add missing counters in readonly mode (#4260 ) Summary: User reported (https://github.com/facebook/rocksdb/issues/4168) that when opening RocksDB in read-only mode, some statistics are not correctly reported. After some investigation, we believe the following counters are indeed not reported during Get() call in a read-only DB: rocksdb.memtable.hit rocksdb.memtable.miss rocksdb.number.keys.read rocksdb.bytes.read As well as histogram rocksdb.bytes.per.read and perf context get_read_bytes This PR will add the necessary counter reporting logic in the Get() call path Pull Request resolved: https://github.com/facebook/rocksdb/pull/4260 Differential Revision: D9476431 Pulled By: miasantreble fbshipit-source-id: 7ab409d4e59df05d09ae8b69fe75554e5aa240d6	2018-08-22 22:43:13 -07:00
Siying Dong	d5612b43de	Two code changes to make "clang analyze" happy (#4292 ) Summary: Clang analyze is not happy in two pieces of code, with "Potential memory leak". No idea what the problem but slightly changing the code makes clang happy. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4292 Differential Revision: D9413555 Pulled By: siying fbshipit-source-id: 9428c9d3664530c72129feefd135ee63d8386137	2018-08-20 17:43:41 -07:00
Yanqin Jin	d116a1725d	Update recovery code for version edits group commit. (#3945 ) Summary: During recovery, RocksDB is able to handle version edits that belong to group commits. This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752) Pull Request resolved: https://github.com/facebook/rocksdb/pull/3945 Differential Revision: D8529122 Pulled By: riversand963 fbshipit-source-id: 57cb0f9cc55ecca684a837742d6626dc9c07f37e	2018-08-20 14:58:00 -07:00
Mikhail Antonov	889a0553c8	VerifyChecksum() API should preserve options Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4275 Reviewed By: yiwu-arbug Differential Revision: D9369766 Pulled By: mikhail-antonov fbshipit-source-id: d91b64c34cc1976b324a260767fce343fa32afde	2018-08-16 16:42:29 -07:00
Andrey Zagrebin	aeed4f0749	#3865 fix performance regression introduced by MergeOperator.ShouldMerge (#4266 ) Summary: This PR addresses issue #3865 and implements the following approach to fix it: - adds `MergeContext::GetOperandsDirectionForward` and `MergeContext::GetOperandsDirectionBackward` to query merge operands in a specific order - `MergeContext::GetOperands` becomes a shortcut for `MergeContext::GetOperandsDirectionForward` - pass `MergeContext::GetOperandsDirectionBackward` to `MergeOperator::ShouldMerge` and document the order Pull Request resolved: https://github.com/facebook/rocksdb/pull/4266 Differential Revision: D9360750 Pulled By: sagar0 fbshipit-source-id: 20cb73ff017760b062ecdcf4382560767086e092	2018-08-16 10:58:05 -07:00
jsteemann	33ad9060d3	fix compilation with g++ option `-Wsuggest-override` (#4272 ) Summary: Fixes compilation warnings (which are turned into compilation errors by default) when compiling with g++ option `-Wsuggest-override`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4272 Differential Revision: D9322556 Pulled By: siying fbshipit-source-id: abd57a29ec8f544bee77c0bb438f31be830b7244	2018-08-14 15:13:10 -07:00
Huachao Huang	d916a1105a	c-api: add some missing options Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4267 Differential Revision: D9309505 Pulled By: anand1976 fbshipit-source-id: eb9fee8037f4ff24dc1cdd5cc5ef41c231a03e1f	2018-08-13 18:42:30 -07:00
Siying Dong	f3d91a0b57	Add a unit test to verify iterators release data blocks after using them (#4170 ) Summary: Add a unit test to check that iterators release data blocks after it has moved away from it. Verify the same for compaction input iterators. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4170 Differential Revision: D8962513 Pulled By: siying fbshipit-source-id: 05a5b604d7d29887fb488f2cda7286f554a14407	2018-08-13 17:43:14 -07:00
Anand Ananthabhotla	4ea56b1bd0	Revert changes in PR #4003 (#4263 ) Summary: Revert this change. Not generating the OnTableFileCreated() notification for a 0 byte SST on flush breaks the assumption that every OnTableFileCreationStarted() notification is followed by a corresponding OnTableFileCreated(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/4263 Differential Revision: D9285623 Pulled By: anand1976 fbshipit-source-id: 808c3dcd498b4b4f4ed4be947a29a24b2296aa8d	2018-08-11 16:57:36 -07:00
Zhichao Cao	6d75319d95	Add tracing function of Seek() and SeekForPrev() to trace_replay (#4228 ) Summary: In the current trace_and replay, Get an WriteBatch are traced. This pull request track down the Seek() and SeekForPrev() to the trace file. <target_key, timestamp, column_family_id> are write to the file. Replay of Iterator is not supported in the current implementation. Tested with trace_analyzer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4228 Differential Revision: D9201381 Pulled By: zhichao-cao fbshipit-source-id: 6f9cc9cb3c20260af741bee065ec35c5c96354ab	2018-08-10 17:57:40 -07:00
Zhichao Cao	76d77205da	Remove the redundant condition inclusion to avoid confusion (#4254 ) Summary: The pair of ROCKSDB_LITE condition inclusion is redundant, it is already inside the #ifndef ROCKSDB_LITE. Remove them to void confusion. Tested by make asan_check. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4254 Differential Revision: D9281652 Pulled By: zhichao-cao fbshipit-source-id: 06bf7641ede71391f21f6a3fe37fbd13f0e2a43a	2018-08-10 17:43:33 -07:00
Maysam Yabandeh	caf0f53a74	Index value delta encoding (#3983 ) Summary: Given that index value is a BlockHandle, which is basically an <offset, size> pair we can apply delta encoding on the values. The first value at each index restart interval encoded the full BlockHandle but the rest encode only the size. Refer to IndexBlockIter::DecodeCurrentValue for the detail of the encoding. This reduces the index size which helps using the block cache more efficiently. The feature is enabled with using format_version 4. The feature comes with a bit of cpu overhead which should be paid back by the higher cache hits due to smaller index block size. Results with sysbench read-only using 4k blocks and using 16 index restart interval: Format 2: 19585 rocksdb read-only range=100 Format 3: 19569 rocksdb read-only range=100 Format 4: 19352 rocksdb read-only range=100 Pull Request resolved: https://github.com/facebook/rocksdb/pull/3983 Differential Revision: D8361343 Pulled By: maysamyabandeh fbshipit-source-id: f882ee082322acac32b0072e2bdbb0b5f854e651	2018-08-09 16:58:40 -07:00
Yanqin Jin	de7f423a82	Add SST ingestion to ldb (#4205 ) Summary: We add two subcommands `write_extern_sst` and `ingest_extern_sst` to ldb. This PR avoids changing existing code because we hope to cherry-pick to earlier releases to support compatibility check for external SST file ingestion. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4205 Differential Revision: D9112711 Pulled By: riversand963 fbshipit-source-id: 7cae88380d4de86da8440230e87eca66755648e4	2018-08-09 14:29:11 -07:00
Zhongyi Xie	b15379dcea	fix use-after-free error involving a temporary string (#4240 ) Summary: In the current code, `error_msg` is pointing to the inner buffer of a temporary std::string object. When `error_msg` is used to construct the error message, that array is already released. This PR will fix this bug by copying the string to a local variable. Fixes https://github.com/facebook/rocksdb/issues/4239 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4240 Differential Revision: D9204334 Pulled By: miasantreble fbshipit-source-id: 0ac599e166ae0a4ec413e32d8b8853d7c5fba878	2018-08-09 11:13:10 -07:00
Maysam Yabandeh	d8d66c937e	Simplify DBWithMaxSpaceAllowedRandomized (#4235 ) Summary: The test has become complicated over the years and hard to reason about the corner cases that makes the test flaky. The patch simplifies the test and also fixes some probable synchronization issues. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4235 Differential Revision: D9187995 Pulled By: maysamyabandeh fbshipit-source-id: 53c7b060f14367e5a9e361014578c26debfe3d27	2018-08-08 07:27:46 -07:00
Huachao Huang	badfd70a3e	types: add kEntryBlobIndex for TablePropertiesCollector (#4233 ) Summary: So that we can act accordingly on blob index entries Pull Request resolved: https://github.com/facebook/rocksdb/pull/4233 Differential Revision: D9190205 Pulled By: yiwu-arbug fbshipit-source-id: e5b84d5b41e44fa7a76762f1f7b0305369bb3a0c	2018-08-06 18:27:44 -07:00
Yi Wu	4cb7068c1e	BlobDB: Fix VisibleToActiveSnapshot() (#4236 ) Summary: There are two issues with `VisibleToActiveSnapshot`: 1. If there are no snapshots, `oldest_snapshot` will be 0 and `VisibleToActiveSnapshot` will always return true. Since the method is used to decide whether it is safe to delete obsolete files, obsolete file won't be able to delete in this case. 2. The `auto` keyword of `auto snapshots = db_impl_->snapshots()` translate to a copy of `const SnapshotList` instead of a reference. Since copy constructor of `SnapshotList` is not defined, using the copy may yield unexpected result. Issue 2 actually hide issue 1 from being catch by tests. During test `snapshots.empty()` can return false while it should actually be empty, and `snapshots.oldest()` return an invalid address, making `oldest_snapshot` being some random large number. The issue was originally reported by BlobDB early adopter at Kuaishou. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4236 Differential Revision: D9188706 Pulled By: yiwu-arbug fbshipit-source-id: a0f2624b927cf9bf28c1bb534784fee5d106f5ea	2018-08-06 16:57:42 -07:00
Jingguo Yao	ceb5fea1e3	Improve FullFilterBitsReader::HashMayMatch's doc (#4202 ) Summary: HashMayMatch is related to AddKey() instead of CreateFilter(). Also applies some minor Fixes #4191 #4200 #3910 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4202 Differential Revision: D9180945 Pulled By: maysamyabandeh fbshipit-source-id: 6f07b81c5bb9bda5c0273475b486ba8a030471e6	2018-08-06 11:13:18 -07:00
Yanqin Jin	1f802773bc	Update JobContext. (#3949 ) Summary: In the past, we assume that a job modifies a single column family. Therefore, a job can create at most one superversion since each superversion corresponds to one column family. This assumption leads to the fact that a `JobContext` has only one member variable called `superversion_context`. Now we want to support group flush of column families, indicating that each job can create multiple superversions. Therefore, we need to make the following change to accommodate this new feature. Add a vector of `SuperVersionContext` to `JobContext` to support installing superversions for multiple column families in one job context. This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752). Pull Request resolved: https://github.com/facebook/rocksdb/pull/3949 Differential Revision: D8864895 Pulled By: riversand963 fbshipit-source-id: 5937a48817276370d3c8172db9c8aafc826d97ca	2018-08-03 17:42:34 -07:00
Yanqin Jin	22368965a0	Modify verification logic of ObsoleteOptionsFileTest (#4218 ) Summary: The current verification logic does not consider the case in which multiple threads (foreground and background) may execute `PurgeObsoleteFiles` function simultaneously. Each invocation will trigger the callback adding elements to a vector. Then we verify the elements in the vector, which can fail sometimes. The solution is to give up checking the elements. Instead, we check the number of OPTIONS file in the database dir. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4218 Differential Revision: D9128727 Pulled By: riversand963 fbshipit-source-id: 2b13b705fb21bc0ddd41940c4ec9b6b0c8d88224	2018-08-03 13:57:40 -07:00
DorianZheng	f9373e2d5c	Make sure to call ReleaseFileNumberFromPendingOutputs Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4219 Differential Revision: D9144294 Pulled By: riversand963 fbshipit-source-id: e46b72e5f8a149dc7a0512e38edcd0ddb0150f30	2018-08-02 18:57:34 -07:00
Andrew Kryczka	f8f6983f89	Skip range deletions at seqno zero when collapsing (#4216 ) Summary: `CollapsedRangeDelMap` internally uses seqno zero as a sentinel value to denote a gap between range tombstones or the end of range tombstones. It therefore expects to never have consecutive sentinel tombstones. However, since `DeleteRange` is now supported in `SstFileWriter`, an ingested file may contain range tombstones, and that ingested file may be assigned global seqno zero. When such tombstones are added to the collapsed map, they resemble sentinel tombstones due to having seqno zero. Then, the invariant mentioned above about never having consecutive sentinel tombstones can be violated. The symptom of this violation was dereferencing the `end()` iterator (#4204). The fix in this PR is to not add range tombstones with seqno zero to the collapsed map. They're not needed anyways since they can't possibly cover anything (in case of a key and a range tombstone with the same seqno, the key is visible). Pull Request resolved: https://github.com/facebook/rocksdb/pull/4216 Differential Revision: D9121716 Pulled By: ajkr fbshipit-source-id: f5b78a70bea9527354603ea7ac8542a7e2b6a210	2018-08-01 12:12:02 -07:00
Sagar Vemuri	12b6cdeed3	Trace and Replay for RocksDB (#3837 ) Summary: A framework for tracing and replaying RocksDB operations. A binary trace file is created by capturing the DB operations, and it can be replayed back at the same rate using db_bench. - Column-families are supported - Multi-threaded tracing is supported. - TraceReader and TraceWriter are exposed to the user, so that tracing to various destinations can be enabled (say, to other messaging/logging services). By default, a FileTraceReader and FileTraceWriter are implemented to capture to a file and replay from it. - This is not yet ideal to be enabled in production due to large performance overhead, but it can be safely tried out in a shadow setup, say, for analyzing RocksDB operations. Currently supported DB operations: - Writes: -- Put -- Merge -- Delete -- SingleDelete -- DeleteRange -- Write - Reads: -- Get (point lookups) Pull Request resolved: https://github.com/facebook/rocksdb/pull/3837 Differential Revision: D7974837 Pulled By: sagar0 fbshipit-source-id: 8ec65aaf336504bc1f6ed0feae67f6ed5ef97a72	2018-08-01 00:27:08 -07:00
Yanqin Jin	54de56844d	Remove random writes from SST file ingestion (#4172 ) Summary: RocksDB used to store global_seqno in external SST files written by SstFileWriter. During file ingestion, RocksDB uses `pwrite` to update the `global_seqno`. Since random write is not supported in some non-POSIX compliant file systems, external SST file ingestion is not supported on these file systems. To address this limitation, we no longer update `global_seqno` during file ingestion. Later RocksDB uses the MANIFEST and other information in table properties to deduce global seqno for externally-ingested SST files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4172 Differential Revision: D8961465 Pulled By: riversand963 fbshipit-source-id: 4382ec85270a96be5bc0cf33758ca2b167b05071	2018-07-27 16:12:23 -07:00
DorianZheng	f5e46354d2	Protect external file when ingesting (#4099 ) Summary: If crash happen after a hard link established, Recover function may reuse the file number that has already assigned to the internal file, and this will overwrite the external file. To protect the external file, we have to make sure the file number will never being reused. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4099 Differential Revision: D9034092 Pulled By: riversand963 fbshipit-source-id: 3f1a737440b86aa2ef01673e5013aacbb7c33e28	2018-07-27 14:13:12 -07:00
Siying Dong	fd45495cf5	DBImpl::IngestExternalFile() should grab mutex when releasing file number in failure case (#4189 ) Summary: `995fcf7573` has a bug: ReleaseFileNumberFromPendingOutputs() added is not protected by the DB mutex. Fix it by grabbing the lock for this operation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4189 Differential Revision: D9015447 Pulled By: siying fbshipit-source-id: b8506e09a96c3f95a6fe32b5ca5fcdb9bee88937	2018-07-26 11:12:29 -07:00
Siying Dong	2a81633da2	Fix bug when seeking backward against an out-of-bound iterator (#4187 ) Summary: `92ee3350e0` introduces an out-of-bound check in BlockBasedTableIterator::Valid(). However, this flag is not reset when re-seeking in backward direction. This caused the iterator to be invalide by mistake. Fix it by always resetting the out-of-bound flag in every seek. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4187 Differential Revision: D8996600 Pulled By: siying fbshipit-source-id: b6235ea614f71381e50e7904c4fb036300604ac1	2018-07-25 17:14:01 -07:00
Manuel Ung	ea212e5316	WriteUnPrepared: Implement unprepared batches for transactions (#4104 ) Summary: This adds support for writing unprepared batches based on size defined in `TransactionOptions::max_write_batch_size`. This is done by overriding methods that modify data (Put/Delete/SingleDelete/Merge) and checking first if write batch size has exceeded threshold. If so, the write batch is written to DB as an unprepared batch. Support for Commit/Rollback for unprepared batch is added as well. This has been done by simply extending the WritePrepared Commit/Rollback logic to take care of all unprep_seq numbers either when updating prepare heap, or adding to commit map. For updating the commit map, this logic exists inside `WriteUnpreparedCommitEntryPreReleaseCallback`. A test change was also made to have transactions unregister themselves when committing without prepare. This is because with write unprepared, there may be unprepared entries (which act similarly to prepared entries) already when a commit is done without prepare. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4104 Differential Revision: D8785717 Pulled By: lth fbshipit-source-id: c02006e281ec1ce00f628e2a7beec0ee73096a91	2018-07-24 00:13:18 -07:00
Zhongyi Xie	f95a5b2464	Avoid unnecessary big for-loop when reporting ticker stats stored in GetContext (#3490 ) Summary: Currently in `Version::Get` when reporting ticker stats stored in `GetContext`, there is a big for-loop through all `Ticker` which adds unnecessary cost to overall CPU usage. We can optimize by storing only ticker values that are used in `Get()` calls in a new struct `GetContextStats` since only a small fraction of all tickers are used in `Get()` calls. For comparison, with the new approach we only need to visit 17 values while old approach will require visiting 100+ `Ticker` Pull Request resolved: https://github.com/facebook/rocksdb/pull/3490 Differential Revision: D6969154 Pulled By: miasantreble fbshipit-source-id: fc27072965a3a94125a3e6883d20dafcf5b84029	2018-07-20 16:58:13 -07:00
Siying Dong	a5e851e113	Reformatting some recent changes (#4161 ) Summary: Lint is not happy with some new code recently committed. Format them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4161 Differential Revision: D8940582 Pulled By: siying fbshipit-source-id: c9b43b1ef8c88b5e923911058b44eb77234b36b7	2018-07-20 14:43:38 -07:00
Siying Dong	8425c8bd4d	BlockBasedTableReader: automatically adjust tail prefetch size (#4156 ) Summary: Right now we use one hard-coded prefetch size to prefetch data from the tail of the SST files. However, this may introduce a waste for some use cases, while not efficient for others. Introduce a way to adjust this prefetch size by tracking 32 recent times, and pick a value with which the wasted read is less than 10% Pull Request resolved: https://github.com/facebook/rocksdb/pull/4156 Differential Revision: D8916847 Pulled By: siying fbshipit-source-id: 8413f9eb3987e0033ed0bd910f83fc2eeaaf5758	2018-07-20 14:43:37 -07:00
Yanqin Jin	2736752b33	Fix a bug in MANIFEST group commit (#4157 ) Summary: PR #3944 introduces group commit of `VersionEdit` in MANIFEST. The implementation has a bug. When updating the log file number of each column family, we must consider only `VersionEdit`s that operate on the same column family. Otherwise, a column family may accidentally set its log file number higher than actual value, indicating that log files with smaller file number will be ignored, thus causing some updates to be lost. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4157 Differential Revision: D8916650 Pulled By: riversand963 fbshipit-source-id: 8f456cf688f17bf35ad87b38e30e899aa162f201	2018-07-19 17:27:56 -07:00
Dmitri Smirnov	78ab11cd71	Return new operator for Status allocations for Windows (#4128 ) Summary: Windows requires new/delete for memory allocations to be overriden. Refactor to be less intrusive. Differential Revision: D8878047 Pulled By: siying fbshipit-source-id: 35f2b5fec2f88ea48c9be926539c6469060aab36	2018-07-19 15:09:06 -07:00
Sagar Vemuri	f3801528c1	Disable DBFlushTest.SyncFail and DBTest.GroupCommitTest on Travis (#4154 ) Summary: I am temporarily disabling DBFlushTest.SyncFail and DBTest.GroupCommitTest tests on Travis until we figure out the root-cause. These tests will still continue to run locally though. I haven't been able to reproduce these failures locally so far (even on a [local Travis environment](https://docs.travis-ci.com/user/common-build-problems/#Troubleshooting-Locally-in-a-Docker-Image) ). These tests are failing way too frequently causing everyone to wonder why their PR failed on travis, and waste time in debugging. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4154 Differential Revision: D8907258 Pulled By: sagar0 fbshipit-source-id: f40068b16e9245fb3791b6a4796435d1ce1ed205	2018-07-18 18:43:11 -07:00
Siying Dong	37e0fdc824	DBSSTTest.DeleteSchedulerMultipleDBPaths data race (#4146 ) Summary: Fix a minor data race in DBSSTTest.DeleteSchedulerMultipleDBPaths reported by TSAN Pull Request resolved: https://github.com/facebook/rocksdb/pull/4146 Differential Revision: D8880945 Pulled By: siying fbshipit-source-id: 25c632f685757735c59ad4ff26b2f346a443a446	2018-07-17 17:57:46 -07:00
Yi Wu	d538ebdff0	Fix write get stuck when pipelined write is enabled (#4143 ) Summary: Fix the issue when pipelined write is enabled, writers can get stuck indefinitely and not able to finish the write. It can show with the following example: Assume there are 4 writers W1, W2, W3, W4 (W1 is the first, W4 is the last). T1: all writers pending in WAL writer queue: WAL writer queue: W1, W2, W3, W4 memtable writer queue: empty T2. W1 finish WAL writer and move to memtable writer queue: WAL writer queue: W2, W3, W4, memtable writer queue: W1 T3. W2 and W3 finish WAL write as a batch group. W2 enter ExitAsBatchGroupLeader and move the group to memtable writer queue, but before wake up next leader. WAL writer queue: W4 memtable writer queue: W1, W2, W3 T4. W1, W2, W3 finish memtable write as a batch group. Note that W2 still in the previous ExitAsBatchGroupLeader, although W1 have done memtable write for W2. WAL writer queue: W4 memtable writer queue: empty T5. The thread corresponding to W3 create another writer W3' with the same address as W3. WAL writer queue: W4, W3' memtable writer queue: empty T6. W2 continue with ExitAsBatchGroupLeader. Because the address of W3' is the same as W3, the last writer in its group, it thinks there are no pending writers, so it reset newest_writer_ to null, emptying the queue. W4 and W3' are deleted from the queue and will never be wake up. The issue exists since pipelined write was introduced in 5.5.0. Closes #3704 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4143 Differential Revision: D8871599 Pulled By: yiwu-arbug fbshipit-source-id: 3502674e51066a954a0660257e24ac588f815e2a	2018-07-17 17:27:51 -07:00
Siying Dong	ddc07b40fc	Remove managed iterator Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4124 Differential Revision: D8829910 Pulled By: siying fbshipit-source-id: f3e952ccf3a631071a5d77c48e327046f8abb560	2018-07-17 14:43:18 -07:00
Siying Dong	995fcf7573	Pending output file number should be released after bulkload failure (#4145 ) Summary: If bulkload fails for an input error, the pending output file number wasn't released. This bug can cause all future files with larger number than the current number won't be deleted, even they are compacted. This commit fixes the bug. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4145 Differential Revision: D8877900 Pulled By: siying fbshipit-source-id: 080be92a23d43305ca1e13fe1c06eb4cd0b01466	2018-07-17 14:13:16 -07:00
Maysam Yabandeh	b55da012f6	Refactor IndexBlockIter (#4141 ) Summary: Refactor IndexBlockIter to reduce conditional branches on key_includes_seq_. IndexBlockIter::Prev is also separated from DataBlockIter::Prev, not to cache the prev entries as they are of less importance when iterating over the index block. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4141 Differential Revision: D8866437 Pulled By: maysamyabandeh fbshipit-source-id: fdac76880426fc2be7d3c6354c09ab98f6657d4b	2018-07-16 17:13:10 -07:00
Sagar Vemuri	991120fa10	Allow ttl to be changed dynamically (#4133 ) Summary: Allow ttl to be changed dynamically. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4133 Differential Revision: D8845440 Pulled By: sagar0 fbshipit-source-id: c8c87ae643b3a8c4123e4c037c4645efc094a2d3	2018-07-16 14:27:53 -07:00
Nathan VanBenschoten	ef7815b803	Support range deletion tombstones in IngestExternalFile SSTs (#3778 ) Summary: Fixes #3391. This change adds a `DeleteRange` method to `SstFileWriter` and adds support for ingesting SSTs with range deletion tombstones. This is important for applications that need to atomically ingest SSTs while clearing out any existing keys in a given key range. Pull Request resolved: https://github.com/facebook/rocksdb/pull/3778 Differential Revision: D8821836 Pulled By: anand1976 fbshipit-source-id: ca7786c1947ff129afa703dab011d524c7883844	2018-07-13 22:43:09 -07:00
Peter Mattis	90fc40690a	Relax VersionStorageInfo::GetOverlappingInputs check (#4050 ) Summary: Do not consider the range tombstone sentinel key as causing 2 adjacent sstables in a level to overlap. When a range tombstone's end key is the largest key in an sstable, the sstable's end key is so to a "sentinel" value that is the smallest key in the next sstable with a sequence number of kMaxSequenceNumber. This "sentinel" is guaranteed to not overlap in internal-key space with the next sstable. Unfortunately, GetOverlappingFiles uses user-keys to determine overlap and was thus considering 2 adjacent sstables in a level to overlap if they were separated by this sentinel key. This in turn would cause compactions to be larger than necessary. Note that this conflicts with https://github.com/facebook/rocksdb/pull/2769 and cases `DBRangeDelTest.CompactionTreatsSplitInputLevelDeletionAtomically` to fail. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4050 Differential Revision: D8844423 Pulled By: ajkr fbshipit-source-id: df3f9f1db8f4cff2bff77376b98b83c2ae1d155b	2018-07-13 17:42:38 -07:00
Yanqin Jin	21171615c1	Reduce execution time of IngestFileWithGlobalSeqnoRandomized (#4131 ) Summary: Make `ExternalSSTFileTest.IngestFileWithGlobalSeqnoRandomized` run faster. `make format` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4131 Differential Revision: D8839952 Pulled By: riversand963 fbshipit-source-id: 4a7e842fde1cde4dc902e928a1cf511322578521	2018-07-13 17:27:39 -07:00
Maysam Yabandeh	8581a93a6b	Per-thread unique test db names (#4135 ) Summary: The patch makes sure that two parallel test threads will operate on different db paths. This enables using open source tools such as gtest-parallel to run the tests of a file in parallel. Example: ``` ~/gtest-parallel/gtest-parallel ./table_test``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4135 Differential Revision: D8846653 Pulled By: maysamyabandeh fbshipit-source-id: 799bad1abb260e3d346bcb680d2ae207a852ba84	2018-07-13 17:27:39 -07:00
Fosco Marotto	8527012bb6	Converted db/merge_test.cc to use gtest (#4114 ) Summary: Picked up a task to convert this to use the gtest framework. It can't be this simple, can it? It works, but should all the std::cout be removed? ``` [$] ~/git/rocksdb [gft !]: ./merge_test [==========] Running 2 tests from 1 test case. [----------] Global test environment set-up. [----------] 2 tests from MergeTest [ RUN ] MergeTest.MergeDbTest Test read-modify-write counters... a: 3 1 2 a: 3 b: 1225 3 Compaction started ... Compaction ended a: 3 b: 1225 Test merge-based counters... a: 3 1 2 a: 3 b: 1225 3 Test merge in memtable... a: 3 1 2 a: 3 b: 1225 3 Test Partial-Merge Test merge-operator not set after reopen [ OK ] MergeTest.MergeDbTest (93 ms) [ RUN ] MergeTest.MergeDbTtlTest Opening database with TTL Test read-modify-write counters... a: 3 1 2 a: 3 b: 1225 3 Compaction started ... Compaction ended a: 3 b: 1225 Test merge-based counters... a: 3 1 2 a: 3 b: 1225 3 Test merge in memtable... Opening database with TTL a: 3 1 2 a: 3 b: 1225 3 Test Partial-Merge Opening database with TTL Opening database with TTL Opening database with TTL Opening database with TTL Test merge-operator not set after reopen [ OK ] MergeTest.MergeDbTtlTest (97 ms) [----------] 2 tests from MergeTest (190 ms total) [----------] Global test environment tear-down [==========] 2 tests from 1 test case ran. (190 ms total) [ PASSED ] 2 tests. ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4114 Differential Revision: D8822886 Pulled By: gfosco fbshipit-source-id: c299d008e883c3bb911d2b357a2e9e4423f8e91a	2018-07-13 14:13:07 -07:00
Anand Ananthabhotla	e3eba52a5d	Re-enable kUniversalSubcompactions option_config (#4125 ) Summary: 1. Move kUniversalSubcompactions up before kEnd in db_test_util.h, so tests that cycle through all the option_configs include this 2. Skip kUniversalSubcompactions wherever kUniversalCompaction and kUniversalCompactionMultilevel are skipped Related to #3935 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4125 Differential Revision: D8828637 Pulled By: anand1976 fbshipit-source-id: 650dee15fd27d85281cf9bb4ca8ab460e04cac6f	2018-07-13 11:13:01 -07:00
Tamir Duberstein	7bee48bdbd	Add GCC 8 to Travis (#3433 ) Summary: - Avoid `strdup` to use jemalloc on Windows - Use `size_t` for consistency - Add GCC 8 to Travis - Add CMAKE_BUILD_TYPE=Release to Travis Pull Request resolved: https://github.com/facebook/rocksdb/pull/3433 Differential Revision: D6837948 Pulled By: sagar0 fbshipit-source-id: b8543c3a4da9cd07ee9a33f9f4623188e233261f	2018-07-13 10:58:06 -07:00
Yanqin Jin	90ebf1a257	Reduce execution time of a test. (#4127 ) Summary: Reduce the number of key ranges in `ExternalSSTFileTest.OverlappingRanges` so that the test completes in shorter time to avoid timeouts. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4127 Differential Revision: D8827851 Pulled By: riversand963 fbshipit-source-id: a16387b0cc92a7c872b1c50f0cfbadc463afc9db	2018-07-12 17:42:03 -07:00
Yanqin Jin	dbeaa0d397	Reduce #iterations to shorten execution time. (#4123 ) Summary: Reduce #iterations from 5000 to 1000 so that `ExternalSSTFileTest.CompactDuringAddFileRandom` can finish faster. On the one hand, 5000 iterations does not seem to improve the quality of unit test in comparison with 1000. On the other hand, long running tests should belong to stress tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4123 Differential Revision: D8822514 Pulled By: riversand963 fbshipit-source-id: 0f439b8d5ccd9a4aed84638f8bac16382de17245	2018-07-12 14:42:39 -07:00
Nikhil Benesch	5f3088d565	Range deletion performance improvements + cleanup (#4014 ) Summary: This fixes the same performance issue that #3992 fixes but with much more invasive cleanup. I'm more excited about this PR because it paves the way for fixing another problem we uncovered at Cockroach where range deletion tombstones can cause massive compactions. For example, suppose L4 contains deletions from [a, c) and [x, z) and no other keys, and L5 is entirely empty. L6, however, is full of data. When compacting L4 -> L5, we'll end up with one file that spans, massively, from [a, z). When we go to compact L5 -> L6, we'll have to rewrite all of L6! If, instead of range deletions in L4, we had keys a, b, x, y, and z, RocksDB would have been smart enough to create two files in L5: one for a and b and another for x, y, and z. With the changes in this PR, it will be possible to adjust the compaction logic to split tombstones/start new output files when they would span too many files in the grandparent level. ajkr please take a look when you have a minute! Pull Request resolved: https://github.com/facebook/rocksdb/pull/4014 Differential Revision: D8773253 Pulled By: ajkr fbshipit-source-id: ec62fa85f648fdebe1380b83ed997f9baec35677	2018-07-12 14:42:39 -07:00
Nikhil Benesch	5cd8240b86	Test range deletions with more configurations (#4021 ) Summary: Run the basic range deletion tests against the standard set of configurations. This testing exposed that files with hash indexes and partitioned indexes were not handling the case where the file contained only range deletions--i.e., where the index was empty. Additionally file a TODO about the fact that range deletions are broken when allow_mmap_reads = true is set. /cc ajkr nvanbenschoten Best viewed with ?w=1: https://github.com/facebook/rocksdb/pull/4021/files?w=1 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4021 Differential Revision: D8811860 Pulled By: ajkr fbshipit-source-id: 3cc07e6d6210a2a00b932866481b3d5c59775343	2018-07-11 15:57:49 -07:00
Yanqin Jin	331cb63641	SetOptions Backup Race Condition (#4108 ) Summary: Prior to this PR, there was a race condition between `DBImpl::SetOptions` and `BackupEngine::CreateNewBackup`, as illustrated below. ``` Time thread 1 thread 2 \| CreateNewBackup -> GetLiveFiles \| SetOptions -> RenameTempFileToOptionsFile \| SetOptions -> RenameTempFileToOptionsFile \| SetOptions -> RenameTempFileToOptionsFile // unlink oldest OPTIONS file \| copy the oldest OPTIONS // IO error! V ``` Proposed fix is to check the value of `DBImpl::disable_obsolete_files_deletion_` before calling `DeleteObsoleteOptionsFiles`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4108 Differential Revision: D8796360 Pulled By: riversand963 fbshipit-source-id: 02045317f793ea4c7d4400a5bf333b8502fa3e82	2018-07-11 14:57:46 -07:00
Manuel Ung	b9846370e9	WriteUnPrepared: Add support for recovering WriteUnprepared transactions (#4078 ) Summary: This adds support for recovering WriteUnprepared transactions through the following changes: - The information in `RecoveredTransaction` is extended so that it can reference multiple batches. - `MarkBeginPrepare` is extended with a bool indicating whether it is an unprepared begin, and this is passed down to `InsertRecoveredTransaction` to indicate whether the current transaction is prepared or not. - `WriteUnpreparedTxnDB::Initialize` is overridden so that it will rollback unprepared transactions from the recovered transactions. This can be done without updating the prepare heap/commit map, because this is before the DB has finished initializing, and after writing the rollback batch, those data structures should not contain information about the rolled back transaction anyway. Commit/Rollback of live transactions is still unimplemented and will come later. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4078 Differential Revision: D8703382 Pulled By: lth fbshipit-source-id: 7e0aada6c23bd39299f1f20d6c060492e0e6b60a	2018-07-06 17:59:13 -07:00
Yanqin Jin	db7ae0a485	Fix a map lookup that may throw exception. (#4098 ) Summary: `std::map::at(key)` throws std::out_of_range if key does not exist. Current code does not handle this. Although this case is unlikely, I feel it's safe to use `std::map::find`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4098 Differential Revision: D8753865 Pulled By: riversand963 fbshipit-source-id: 9a9ba43badb0fb5e0d24cd87903931fd12f3f8ec	2018-07-06 16:12:49 -07:00
Huachao Huang	35b83327a7	compaction: fix max_subcompactions option for CompactRange (#4082 ) Summary: The max_subcompactions option was introduced in https://github.com/facebook/rocksdb/pull/3775. Closes https://github.com/facebook/rocksdb/pull/4082 Differential Revision: D8743258 Pulled By: ajkr fbshipit-source-id: d60ee75769dfc19ab6f8754e4ff3a267848f1ed9	2018-07-05 20:12:56 -07:00
Zhongyi Xie	b3efb1cbe0	fix clang analyzer warnings (#4072 ) Summary: clang analyze is giving the following warnings: > db/compaction_job.cc:1178:16: warning: Called C++ object pointer is null } else if (meta->smallest.size() > 0) { ^~~~~~~~~~~~~~~~~~~~~ db/compaction_job.cc:1201:33: warning: Access to field 'marked_for_compaction' results in a dereference of a null pointer (loaded from variable 'meta') meta->marked_for_compaction = sub_compact->builder->NeedCompact(); ~~~~ db/version_set.cc:2770:26: warning: Called C++ object pointer is null uint32_t cf_id = last_writer->cfd->GetID(); ^~~~~~~~~~~~~~~~~~~~~~~~~ Closes https://github.com/facebook/rocksdb/pull/4072 Differential Revision: D8685852 Pulled By: miasantreble fbshipit-source-id: b0e2fd9dfc1cbba2317723e09886384b9b1c9085	2018-06-28 19:12:35 -07:00
Manuel Ung	8ad63a4b86	WriteUnPrepared: Add new WAL marker kTypeBeginUnprepareXID (#4069 ) Summary: This adds a new WAL marker of type kTypeBeginUnprepareXID. Also, DBImpl now contains a field called batch_per_txn (meaning one WriteBatch per transaction, or possibly multiple WriteBatches). This would also indicate that this DB is using WriteUnprepared policy. Recovery code would be able to make use of this extra field on DBImpl in a separate diff. For now, it is just used to determine whether the WAL is compatible or not. Closes https://github.com/facebook/rocksdb/pull/4069 Differential Revision: D8675099 Pulled By: lth fbshipit-source-id: ca27cae1738e46d65f2bb92860fc759deb874749	2018-06-28 18:58:29 -07:00
Anand Ananthabhotla	52d4c9b7f6	Allow DB resume after background errors (#3997 ) Summary: Currently, if RocksDB encounters errors during a write operation (user requested or BG operations), it sets DBImpl::bg_error_ and fails subsequent writes. This PR allows the DB to be resumed for certain classes of errors. It consists of 3 parts - 1. Introduce Status::Severity in rocksdb::Status to indicate whether a given error can be recovered from or not 2. Refactor the error handling code so that setting bg_error_ and deciding on severity is in one place 3. Provide an API for the user to clear the error and resume the DB instance This whole change is broken up into multiple PRs. Initially, we only allow clearing the error for Status::NoSpace() errors during background flush/compaction. Subsequent PRs will expand this to include more errors and foreground operations such as Put(), and implement a polling mechanism for out-of-space errors. Closes https://github.com/facebook/rocksdb/pull/3997 Differential Revision: D8653831 Pulled By: anand1976 fbshipit-source-id: 6dc835c76122443a7668497c0226b4f072bc6afd	2018-06-28 12:34:40 -07:00
Yanqin Jin	26d67e357e	Support group commits of version edits (#3944 ) Summary: This PR supports the group commit of multiple version edit entries corresponding to different column families. Column family drop/creation still cannot be grouped. This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752). Closes https://github.com/facebook/rocksdb/pull/3944 Differential Revision: D8432536 Pulled By: riversand963 fbshipit-source-id: 8f11bd05193b6c0d9272d82e44b676abfac113cb	2018-06-28 12:34:39 -07:00
Maysam Yabandeh	0a5b5d88b2	Remove ReadOnly part of PinnableSliceAndMmapReads from Lite (#4070 ) Summary: Lite does not support readonly DBs. Closes https://github.com/facebook/rocksdb/pull/4070 Differential Revision: D8677858 Pulled By: maysamyabandeh fbshipit-source-id: 536887d2363ee2f5d8e1ea9f1a511e643a1707fa	2018-06-28 08:42:17 -07:00
Zhongyi Xie	14f409c0f1	PrefixMayMatch: remove unnecessary check for prefix_extractor_ (#4067 ) Summary: with https://github.com/facebook/rocksdb/pull/3601 and https://github.com/facebook/rocksdb/pull/3899, `prefix_extractor_` is not really being used in block based filter and full filter's version of `PrefixMayMatch` because now `prefix_extractor` is passed as an argument. Also it is now possible that prefix_extractor_ may be initialized to nullptr when a non-standard prefix_extractor is used and also for ROCKSDB_LITE. Removing these checks should not break any existing tests. Closes https://github.com/facebook/rocksdb/pull/4067 Differential Revision: D8669002 Pulled By: miasantreble fbshipit-source-id: 0e701ba912b8a26734fadb72d15bb1b266b6176a	2018-06-27 20:42:43 -07:00
Zhichao Cao	1f6efabe23	Add bottommost_compression_opts to for bottommost_compression (#3985 ) Summary: …ression For `CompressionType` we have options `compression` and `bottommost_compression`. Thus, to make the compression options consitent with the compression type when bottommost_compression is enabled, we add the bottommost_compression_opts Closes https://github.com/facebook/rocksdb/pull/3985 Reviewed By: riversand963 Differential Revision: D8385911 Pulled By: zhichao-cao fbshipit-source-id: 07bc533dd61bcf1cef5927d8d62901c13d38d5fc	2018-06-27 17:42:38 -07:00
Maysam Yabandeh	235ab9dd32	Pin mmap files in ReadOnlyDB (#4053 ) Summary: https://github.com/facebook/rocksdb/pull/3881 fixed a bug where PinnableSlice pin mmap files which could be deleted with background compaction. This is however a non-issue for ReadOnlyDB when there is no compaction running and max_open_files is -1. This patch reenables the pinning feature for that case. Closes https://github.com/facebook/rocksdb/pull/4053 Differential Revision: D8662546 Pulled By: maysamyabandeh fbshipit-source-id: 402962602eb0f644e17822748332999c3af029fd	2018-06-27 17:13:34 -07:00
Manuel Ung	a16e00b7b9	WriteUnPrepared Txn: Disable seek to snapshot optimization (#3955 ) Summary: This is implemented by extending ReadCallback with another function `MaxUnpreparedSequenceNumber` which returns the largest visible sequence number for the current transaction, if there is uncommitted data written to DB. Otherwise, it returns zero, indicating no uncommitted data. There are the places where reads had to be modified. - Get and Seek/Next was just updated to seek to max(snapshot_seq, MaxUnpreparedSequenceNumber()) instead, and iterate until a key was visible. - Prev did not need need updates since it did not use the Seek to sequence number optimization. Assuming that locks were held when writing unprepared keys, and ValidateSnapshot runs, there should only be committed keys and unprepared keys of the current transaction, all of which are visible. Prev will simply iterate to get the last visible key. - Reseeking to skip keys optimization was also disabled for write unprepared, since it's possible to hit the max_skip condition even while reseeking. There needs to be some way to resolve infinite looping in this case. Closes https://github.com/facebook/rocksdb/pull/3955 Differential Revision: D8286688 Pulled By: lth fbshipit-source-id: 25e42f47fdeb5f7accea0f4fd350ef35198caafe	2018-06-27 12:23:07 -07:00
Nikhil Benesch	17339dc2f3	Add table property tracking number of range deletions (#4016 ) Summary: Add a new table property, rocksdb.num.range-deletions, which tracks the number of range deletions in a block-based table. Range deletions are no longer counted in rocksdb.num.entries; as discovered in PR #3778, there are various code paths that implicitly assume that rocksdb.num.entries counts only true keys, not range deletions. /cc ajkr nvanbenschoten Closes https://github.com/facebook/rocksdb/pull/4016 Differential Revision: D8527575 Pulled By: ajkr fbshipit-source-id: 92e7edbe78fda53756a558013c9fb496e7764fd7	2018-06-26 20:27:35 -07:00
Zhongyi Xie	408205a36b	use user_key and iterate_upper_bound to determine compatibility of bloom filters (#3899 ) Summary: Previously in https://github.com/facebook/rocksdb/pull/3601 bloom filter will only be checked if `prefix_extractor` in the mutable_cf_options matches the one found in the SST file. This PR relaxes the requirement by checking if all keys in the range [user_key, iterate_upper_bound) all share the same prefix after transforming using the BF in the SST file. If so, the bloom filter is considered compatible and will continue to be looked at. Closes https://github.com/facebook/rocksdb/pull/3899 Differential Revision: D8157459 Pulled By: miasantreble fbshipit-source-id: 18d17cba56a1005162f8d5db7a27aba277089c41	2018-06-26 15:57:26 -07:00
Andrew Kryczka	a8e503e545	Fix universal compaction scheduling conflict with CompactFiles (#4055 ) Summary: Universal size-amp-triggered compaction was pulling the final sorted run into the compaction without checking whether any of its files are already being compacted. When all compactions are automatic, it is safe since it verifies the second-last sorted run is not already being compacted, which implies the last sorted run is also not being compacted (in automatic compaction multiple sorted runs are always compacted together). But with manual compaction, files in the last sorted run can be compacted independently, so the last sorted run also must be checked. We were seeing the below assertion failure in `db_stress`. Also the test case included in this PR repros the failure. ``` db_universal_compaction_test: db/compaction.cc:312: void rocksdb::Compaction::MarkFilesBeingCompacted(bool): Assertion `mark_as_compacted ? !inputs_[i][j]->being_compacted : inputs_[i][j]->being_compacted' failed. Aborted (core dumped) ``` Closes https://github.com/facebook/rocksdb/pull/4055 Differential Revision: D8630094 Pulled By: ajkr fbshipit-source-id: ac3b30a874678b76e113d4f6c42c1260411b08f8	2018-06-26 10:44:56 -07:00
Sagar Vemuri	189f0c27aa	Make BlockBasedTableIterator compaction-aware (#4048 ) Summary: Pass in `for_compaction` to `BlockBasedTableIterator` via `BlockBasedTableReader::NewIterator`. In `7103559f49`, `for_compaction` was set in `BlockBasedTable::Rep` via `BlockBasedTable::SetupForCompaction`. In hindsight it was not the right decision; it also caused TSAN to complain. Closes https://github.com/facebook/rocksdb/pull/4048 Differential Revision: D8601056 Pulled By: sagar0 fbshipit-source-id: 30127e898c15c38c1080d57710b8c5a6d64a0ab3	2018-06-25 13:19:27 -07:00
Maysam Yabandeh	80ade9ad83	Pin top-level index on partitioned index/filter blocks (#4037 ) Summary: Top-level index in partitioned index/filter blocks are small and could be pinned in memory. So far we use that by cache_index_and_filter_blocks to false. This however make it difficult to keep account of the total memory usage. This patch introduces pin_top_level_index_and_filter which in combination with cache_index_and_filter_blocks=true keeps the top-level index in cache and yet pinned them to avoid cache misses and also cache lookup overhead. Closes https://github.com/facebook/rocksdb/pull/4037 Differential Revision: D8596218 Pulled By: maysamyabandeh fbshipit-source-id: 3a5f7f9ca6b4b525b03ff6bd82354881ae974ad2	2018-06-22 15:27:46 -07:00
Zhongyi Xie	795e663df0	option for timing measurement of non-blocking ops during compaction (#4029 ) Summary: For example calling CompactionFilter is always timed and gives the user no way to disable. This PR will disable the timer if `Statistics::stats_level_` (which is part of DBOptions) is `kExceptDetailedTimers` Closes https://github.com/facebook/rocksdb/pull/4029 Differential Revision: D8583670 Pulled By: miasantreble fbshipit-source-id: 913be9fe433ae0c06e88193b59d41920a532307f	2018-06-21 21:28:05 -07:00
Yanqin Jin	524c6e6b72	Add file name info to SequentialFileReader. (#4026 ) Summary: We potentially need this information for tracing, profiling and diagnosis. Closes https://github.com/facebook/rocksdb/pull/4026 Differential Revision: D8555214 Pulled By: riversand963 fbshipit-source-id: 4263e06c00b6d5410b46aa46eb4e358ff2161dd2	2018-06-21 08:42:24 -07:00
Siying Dong	92ee3350e0	BlockBasedTableIterator to keep BlockIter after out of upper bound (#4004 ) Summary: `b555ed30a4` makes the BlockBasedTableIterator to be invalidated if the current position if over the upper bound. However, this can bring performance regression to the case of multiple Seek()s hitting the same data block but all out of upper bound. For example, if an SST file has a data block containing following keys : {a, z} The user sets the upper bound to be "x", and it executed following queries: Seek("b") Seek("c") Seek("d") Before the upper bound optimization, these queries always come to this same current data block of the iterator, but now inside each Seek() the data block is read from the block cache but is returned again. To prevent this regression case, we keep the current data block iterator if it is upper bound. Closes https://github.com/facebook/rocksdb/pull/4004 Differential Revision: D8463192 Pulled By: siying fbshipit-source-id: 8710628b30acde7063a097c3184d6c4333a8ef81	2018-06-19 09:57:11 -07:00
Tomas Kolda	c766887458	Fix ExternalSSTFileTest::OverlappingRanges test on Solaris Sparc (#4012 ) Summary: Fix of #4011 Closes https://github.com/facebook/rocksdb/pull/4012 Differential Revision: D8499173 Pulled By: sagar0 fbshipit-source-id: cbb2b90c544ed364a3640ea65835d577b2dbc5df	2018-06-18 14:57:37 -07:00
Zhongyi Xie	80bc35927c	Should only decode restart points for uncompressed blocks (#3996 ) Summary: The Block object assumes contents are uncompressed. Block's constructor tries to read the number of restarts, but does not get an accurate number when its contents are compressed, which is causing issues like https://github.com/facebook/rocksdb/issues/3843. This PR address this issue by skipping reconstruction of restart points when blocks are known to be compressed. Somehow the restart points can be read directly when Snappy is used and some tests (for example https://github.com/facebook/rocksdb/blob/master/db/db_block_cache_test.cc#L196) expects blocks to be fully constructed even when Snappy compression is used, so here we keep the restart point logic for Snappy. Closes https://github.com/facebook/rocksdb/pull/3996 Differential Revision: D8416186 Pulled By: miasantreble fbshipit-source-id: 002c0b62b9e5d89fb7736563d354ce0023c8cb28	2018-06-15 19:26:58 -07:00
Anand Ananthabhotla	c48764ba47	Don't generate a notification for a 0 size SST (#4003 ) Summary: Don't call the OnTableFileCreated listener callback when a 0 size SST file gets created by Flush. Doing so causes an assertion failure in db_stress. It is also not correct behavior as we call env->DeleteFile() for such files right before the notification. Closes https://github.com/facebook/rocksdb/pull/4003 Differential Revision: D8461385 Pulled By: anand1976 fbshipit-source-id: ae92d4f921c2e2cff981ad58f4929ed8b609f35d	2018-06-15 17:57:24 -07:00
zhichao-cao	3fbc865cd5	Add kOptionsStatistics to GetProperty() (#3966 ) Summary: Add a new DB property to DB::GetProperty(), which returns the option.statistics. Test is updated to pass. Closes https://github.com/facebook/rocksdb/pull/3966 Differential Revision: D8311139 Pulled By: zhichao-cao fbshipit-source-id: ea78f4727358c807b0e5a0ea62e09defb10ad9ac	2018-06-15 17:28:01 -07:00
奏之章	f23fed19a1	Delay verify compaction output table (#3979 ) Summary: Verify table will load SST into `TableCache` it occupy memory & `TableCache`‘s capacity ... but no logic use them it's unnecessary ... so , we verify them after all sub compact finished Closes https://github.com/facebook/rocksdb/pull/3979 Differential Revision: D8389946 Pulled By: ajkr fbshipit-source-id: 54bd4f474f9e7b3accf39c3068b1f36a27ec4c49	2018-06-15 12:42:53 -07:00
Fenggang Wu	fbe3b9e2b6	Udpate db_universal_compaction_test according to PR #3970 (#3995 ) Summary: The SST file sizes changed slightly after the improvement of PR #3970 which reduces the size of the properties block. Before PR #3970 a size ratio compaction included all of the first four flushed files but it only includes two files after. We increase the size_ratio universal compaction option to make that compaction include all four files again. Closes https://github.com/facebook/rocksdb/pull/3995 Differential Revision: D8426925 Pulled By: fgwu fbshipit-source-id: 1429c38672e9f4fb4d4881fd4b06db45c4861d62	2018-06-15 10:42:21 -07:00
Siying Dong	d82f1421b4	Fix regression bug of Prev() with upper bound (#3989 ) Summary: A recent change pushed down the upper bound checking to child iterators. However, this causes the logic of following sequence wrong: Seek(key); if (!Valid()) SeekToLast(); Because !Valid() may be caused by upper bounds, rather than the end of the iterator. In this case SeekToLast() points to totally wrong places. This can cause wrong results, infinite loops, or segfault in some cases. This sequence is called when changing direction from forward to backward. And this by itself also implicitly happen during reseeking optimization in Prev(). Fix this bug by using SeekForPrev() rather than this sequuence, as what is already done in prefix extrator case. Closes https://github.com/facebook/rocksdb/pull/3989 Differential Revision: D8385422 Pulled By: siying fbshipit-source-id: 429e869990cfd2dc389421e0836fc496bed67bb4	2018-06-12 16:57:36 -07:00
Maysam Yabandeh	b73652169e	Extend format 3 to partitioned index/filters (#3958 ) Summary: format_version 3 changes the format of index blocks by storing user keys instead of the internal keys, which saves 8-bytes per key. This patch extends the format to top-level indexes in partitioned index/filters. Closes https://github.com/facebook/rocksdb/pull/3958 Differential Revision: D8294615 Pulled By: maysamyabandeh fbshipit-source-id: 17666cc16b8076c363972e2308e31547e835f0fe	2018-06-06 16:58:16 -07:00
Andrew Kryczka	4420df4b0e	Check conflict at output level in CompactFiles (#3926 ) Summary: CompactFiles checked whether the existing files conflicted with the chosen compaction. But it missed checking whether future files would conflict, i.e., when another compaction was simultaneously writing new files to the same range at the same output level. Closes https://github.com/facebook/rocksdb/pull/3926 Differential Revision: D8218996 Pulled By: ajkr fbshipit-source-id: 21cb00a6fed4c8c62d3ed2ff810962e6bdc2fdfb	2018-06-05 14:14:05 -07:00
Zhongyi Xie	f1592a06c2	run make format for PR 3838 (#3954 ) Summary: PR https://github.com/facebook/rocksdb/pull/3838 made some changes that triggers lint warnings. Run `make format` to fix formatting as suggested by siying . Also piggyback two changes: 1) fix singleton destruction order for windows and posix env 2) fix two clang warnings Closes https://github.com/facebook/rocksdb/pull/3954 Differential Revision: D8272041 Pulled By: miasantreble fbshipit-source-id: 7c4fd12bd17aac13534520de0c733328aa3c6c9f	2018-06-05 12:58:02 -07:00
Maysam Yabandeh	d0c38c0c8c	Extend some tests to format_version=3 (#3942 ) Summary: format_version=3 changes the format of SST index. This is however not being tested currently since tests only work with the default format_version which is currently 2. The patch extends the most related tests to also test for format_version=3. Closes https://github.com/facebook/rocksdb/pull/3942 Differential Revision: D8238413 Pulled By: maysamyabandeh fbshipit-source-id: 915725f55753dd8e9188e802bf471c23645ad035	2018-06-04 20:13:00 -07:00
Zhongyi Xie	50d7ac0ea3	Fix test for rocksdb_lite: hide incompatible option kDirectIO Summary: Previous commit https://github.com/facebook/rocksdb/pull/3935 unhide a few test options which includes kDirectIO. However it's not supported by RocksDB lite. Need to hide this option from the lite build. Closes https://github.com/facebook/rocksdb/pull/3943 Differential Revision: D8242757 Pulled By: miasantreble fbshipit-source-id: 1edfad3a5d01a46bfb7eedee765981ebe02c500a	2018-06-01 20:42:36 -07:00
Andrew Kryczka	fea2b1dfb2	Copy Get() result when file reads use mmap Summary: For iterator reads, a `SuperVersion` is pinned to preserve a snapshot of SST files, and `Block`s are pinned to allow `key()` and `value()` to return pointers directly into a RocksDB memory region. This works for both non-mmap reads, where the block owns the memory region, and mmap reads, where the file owns the memory region. For point reads with `PinnableSlice`, only the `Block` object is pinned. This works for non-mmap reads because the block owns the memory region, so even if the file is deleted after compaction, the memory region survives. However, for mmap reads, file deletion causes the memory region to which the `PinnableSlice` refers to be unmapped. The result is usually a segfault upon accessing the `PinnableSlice`, although sometimes it returned wrong results (I repro'd this a bunch of times with `db_stress`). This PR copies the value into the `PinnableSlice` when it comes from mmap'd memory. We can tell whether the `Block` owns its memory using `Block::cachable()`, which is unset when reads do not use the provided buffer as is the case with mmap file reads. When that is false we ensure the result of `Get()` is copied. This feels like a short-term solution as ideally we'd have the `PinnableSlice` pin the mmap'd memory so we can do zero-copy reads. It seemed hard so I chose this approach to fix correctness in the meantime. Closes https://github.com/facebook/rocksdb/pull/3881 Differential Revision: D8076288 Pulled By: ajkr fbshipit-source-id: 31d78ec010198723522323dbc6ea325122a46b08	2018-06-01 16:57:58 -07:00
straw	89b37081a1	add c api rocksdb_sstfilewriter_file_size Summary: Closes https://github.com/facebook/rocksdb/pull/3922 Differential Revision: D8208528 Pulled By: ajkr fbshipit-source-id: d384fe53cf526f2aadc7b79a423ce36dbd3ff224	2018-06-01 09:43:59 -07:00
Maysam Yabandeh	44cf84932f	Fix the bug of some test scenarios being put after kEnd Summary: DBTestBase::OptionConfig includes the scenarios that unit tests could iterate over them by calling ChangeOptions(). Some of the options have been mistakenly put after kEnd which makes them essentially invisible to ChangeOptions() caller. This patch fixes it except for kUniversalSubcompactions which is left as TODO since it would break some unit tests. Closes https://github.com/facebook/rocksdb/pull/3935 Differential Revision: D8230748 Pulled By: maysamyabandeh fbshipit-source-id: edddb8fffcd161af1809fef24798ce118f8593db	2018-05-31 19:28:00 -07:00
QingpingWang	2807678b11	c api set bottommost level compaction Summary: Closes https://github.com/facebook/rocksdb/pull/3928 Differential Revision: D8224962 Pulled By: ajkr fbshipit-source-id: 3caf463509a935bff46530f27232a85ae7e4e484	2018-05-31 17:30:50 -07:00
Siying Dong	82089d59c3	DBImpl::FindObsoleteFiles() not to call GetChildren() on the same path Summary: DBImpl::FindObsoleteFiles() may call GetChildren() multiple times if different CFs are on the same path. Fix it. Closes https://github.com/facebook/rocksdb/pull/3885 Differential Revision: D8084634 Pulled By: siying fbshipit-source-id: b471fbc251f6a05e9243304dc14c0831060cc0b0	2018-05-31 12:58:33 -07:00
maoyouxiang	a35451eaa4	fix deadlock with enable_pipelined_write=true and max_successive_merges > 0 Summary: fix this https://github.com/facebook/rocksdb/issues/3916 Closes https://github.com/facebook/rocksdb/pull/3923 Differential Revision: D8215192 Pulled By: yiwu-arbug fbshipit-source-id: a4c2f839a91d92dc70906d2b7c6de0fe014a2422	2018-05-31 11:13:14 -07:00
Siying Dong	4dd80debd0	Remove tests from ROCKSDB_VALGRIND_RUN Summary: In order to make valgrind check test to pass in a day, remove some tests that run prohibitively slow under valgrind. Closes https://github.com/facebook/rocksdb/pull/3924 Differential Revision: D8210184 Pulled By: siying fbshipit-source-id: 5b06fb08f3cf57571d422d05a0dbddc9f9376f7a	2018-05-30 16:15:16 -07:00
Anand Ananthabhotla	a736255de8	Delete triggered compaction for universal style Summary: This is still WIP, but I'm hoping for early feedback on the overall approach. This patch implements deletion triggered compaction, which till now only worked for leveled, for universal style. SST files are marked for compaction by the CompactOnDeletionCollertor table property. This is expected to be used when free disk space is low and the user wants to reclaim space by deleting a bunch of keys. The deletions are expected to be dense. In such a situation, we want to avoid a full compaction due to its space overhead. The strategy used in this case is similar to leveled. We pick one file from the set of files marked for compaction. We then expand the inputs to a clean cut on the same level, and then pick overlapping files from the next non-mepty level. Picking files from the next level can cause the key range to expand, and we opportunistically expand inputs in the source level to include files wholly in this key range. The main side effect of this is that it breaks the property of no time range overlap between levels. This shouldn't break any functionality. Closes https://github.com/facebook/rocksdb/pull/3860 Differential Revision: D8124397 Pulled By: anand1976 fbshipit-source-id: bfa2a9dd6817930e991b35d3a8e7e61304ed3dcf	2018-05-29 15:44:34 -07:00
Yanqin Jin	cf826de3ed	Fix compilation error when OPT="-DROCKSDB_LITE". Summary: Closes https://github.com/facebook/rocksdb/pull/3917 Differential Revision: D8187733 Pulled By: riversand963 fbshipit-source-id: e4aa179cd0791ca77167e357f99de9afd4aef910	2018-05-29 12:28:59 -07:00
奏之章	1c1bafa668	Fix VersionStorageInfo::EstimateLiveDataSize seg fault Summary: `HandleEstimateLiveDataSize`'s `need_out_of_mutex` is true `402b7aa07f/db/internal_stats.cc (L412-L413)` so , is will ref a `SuperVersion` `402b7aa07f/db/db_impl.cc (L1896-L1908)` so , the param `version` of `InternalStats::HandleEstimateLiveDataSize` is safe , but `cfd_->current()` is not safe ! `402b7aa07f/db/internal_stats.cc (L790-L795)` the `cfd_->current()` maybe invalid ... here's mongo-rocks crash backtrace ``` mongod(mongo::printStackTrace(std::basic_ostream<char, std::char_traits<char> >&)+0x41) [0x7fe3a3137c51] mongod(+0x2152E89) [0x7fe3a3136e89] mongod(+0x21534F6) [0x7fe3a31374f6] libpthread.so.0(+0xF5E0) [0x7fe39f5e45e0] mongod(rocksdb::InternalKeyComparator::Compare(rocksdb::Slice const&, rocksdb::Slice const&) const+0x17) [0x7fe3a22375a7] mongod(rocksdb::VersionStorageInfo::EstimateLiveDataSize() const+0x3AA) [0x7fe3a228daba] mongod(rocksdb::InternalStats::HandleEstimateLiveDataSize(unsigned long, rocksdb::DBImpl, rocksdb::Version)+0x20) [0x7fe3a2250d70] mongod(rocksdb::DBImpl::GetIntPropertyInternal(rocksdb::ColumnFamilyData, rocksdb::DBPropertyInfo const&, bool, unsigned long*)+0xEF) [0x7fe3a21e3dbf] ``` Closes https://github.com/facebook/rocksdb/pull/3912 Differential Revision: D8179944 Pulled By: yiwu-arbug fbshipit-source-id: 26f314a8f98f4c2dc4348745d759f26f0e8d95e1	2018-05-28 11:27:08 -07:00
Maysam Yabandeh	402b7aa07f	Exclude seq from index keys Summary: Index blocks have the same format as data blocks. The keys therefore similarly to the keys in the data blocks are internal keys, which means that in addition to the user key it also has 8 bytes that encodes sequence number and value type. This extra 8 bytes however is not necessary in index blocks since the index keys act as an separator between two data blocks. The only exception is when the last key of a block and the first key of the next block share the same user key, in which the sequence number is required to act as a separator. The patch excludes the sequence from index keys only if the above special case does not happen for any of the index keys. It then records that in the property block. The reader looks at the property block to see if it should expect sequence numbers in the keys of the index block.s Closes https://github.com/facebook/rocksdb/pull/3894 Differential Revision: D8118775 Pulled By: maysamyabandeh fbshipit-source-id: 915479f028b5799ca91671d67455ecdefbd873bd	2018-05-25 18:42:43 -07:00
Yanqin Jin	aa53579d6c	Fix segfault caused by object premature destruction Summary: Please refer to earlier discussion in [issue 3609](https://github.com/facebook/rocksdb/issues/3609). There was also an alternative fix in [PR 3888](https://github.com/facebook/rocksdb/pull/3888), but the proposed solution requires complex change. To summarize the cause of the problem. Upon creation of a column family, a `BlockBasedTableFactory` object is `new`ed and encapsulated by a `std::shared_ptr`. Since there is no other `std::shared_ptr` pointing to this `BlockBasedTableFactory`, when the column family is dropped, the `ColumnFamilyData` is `delete`d, causing the destructor of `std::shared_ptr`. Since there is no other `std::shared_ptr`, the underlying memory is also freed. Later when the db exits, it releases all the table readers, including the table readers that have been operating on the dropped column family. This needs to access the `table_options` owned by `BlockBasedTableFactory` that has already been deleted. Therefore, a segfault is raised. Previous workaround is to purge all obsolete files upon `ColumnFamilyData` destruction, which leads to a force release of table readers of the dropped column family. However this does not work when the user disables file deletion. Our solution in this PR is making a copy of `table_options` in `BlockBasedTable::Rep`. This solution increases memory copy and usage, but is much simpler. Test plan ``` $ make -j16 $ ./column_family_test --gtest_filter=ColumnFamilyTest.CreateDropAndDestroy:ColumnFamilyTest.CreateDropAndDestroyWithoutFileDeletion ``` Expected behavior: All tests should pass. Closes https://github.com/facebook/rocksdb/pull/3898 Differential Revision: D8149421 Pulled By: riversand963 fbshipit-source-id: eaecc2e064057ef607fbdd4cc275874f866c3438	2018-05-25 11:57:51 -07:00
QingpingWang	070319f7bb	add flush_before_backup parameter to c api rocksdb_backup_engine_create_new_backup Summary: Add flush_before_backup to rocksdb_backup_engine_create_new_backup. make c api able to control the flush before backup behavior. Closes https://github.com/facebook/rocksdb/pull/3897 Differential Revision: D8157676 Pulled By: ajkr fbshipit-source-id: 88998c62f89f087bf8672398fd7ddafabbada505	2018-05-24 22:28:52 -07:00

1 2 3 4 5 ...

3420 Commits