rocksdb

Author	SHA1	Message	Date
Yi Wu	84a04af9a9	TableProperty::oldest_key_time defaults to 0 Summary: We don't propagate TableProperty::oldest_key_time on compaction and just write the default value to SST files. It is more natural to default the value to 0. Also revert db_sst_test back to before #2842. Closes https://github.com/facebook/rocksdb/pull/3079 Differential Revision: D6165702 Pulled By: yiwu-arbug fbshipit-source-id: ca3ce5928d96ae79a5beb12bb7d8c640a71478a0	2017-10-27 15:00:05 -07:00
Yi Wu	66a2c44ef4	Add DB::Properties::kEstimateOldestKeyTime Summary: With FIFO compaction we would like to get the oldest data time for monitoring. The problem is we don't have timestamp for each key in the DB. As an approximation, we expose the earliest of sst file "creation_time" property. My plan is to override the property with a more accurate value with blob db, where we actually have timestamp. Closes https://github.com/facebook/rocksdb/pull/2842 Differential Revision: D5770600 Pulled By: yiwu-arbug fbshipit-source-id: 03833c8f10bbfbee62f8ea5c0d03c0cafb5d853a	2017-10-23 15:27:27 -07:00
Dmitri Smirnov	ebab2e2d42	Enable MSVC W4 with a few exceptions. Fix warnings and bugs Summary: Closes https://github.com/facebook/rocksdb/pull/3018 Differential Revision: D6079011 Pulled By: yiwu-arbug fbshipit-source-id: 988a721e7e7617967859dba71d660fc69f4dff57	2017-10-19 10:57:12 -07:00
Yi Wu	60c09f5fbb	print more table_options to info log Summary: print more table_options to info log Closes https://github.com/facebook/rocksdb/pull/3003 Differential Revision: D6054490 Pulled By: yiwu-arbug fbshipit-source-id: 8e6f96e08bdc906077b6c62ade419d7cb739110f	2017-10-13 14:42:26 -07:00
Yi Wu	31d3e41810	PinnableSlice move assignment Summary: Allow `std::move(pinnable_slice)`. Closes https://github.com/facebook/rocksdb/pull/2997 Differential Revision: D6036782 Pulled By: yiwu-arbug fbshipit-source-id: 583fb0419a97e437ff530f4305822341cd3381fa	2017-10-12 18:28:24 -07:00
Yi Wu	fb4ae4d810	fix DBImpl::NewInternalIterator super-version leak on failure Summary: Close #2955 Closes https://github.com/facebook/rocksdb/pull/2960 Differential Revision: D5962872 Pulled By: yiwu-arbug fbshipit-source-id: a6472d5c015bea3dc476c572ff5a5c90259e6059	2017-10-11 14:57:43 -07:00
Manuel Ung	88ed1f6ea6	Allow upgrades from nullptr to some merge operator Summary: Currently, RocksDB does not allow reopening a preexisting DB with no merge operator defined, with a merge operator defined. This means that if a DB ever want to add a merge operator, there's no way to do so currently. Fix this by adding a new verification type `kByNameAllowFromNull` which will allow old values to be nullptr, and new values to be non-nullptr. Closes https://github.com/facebook/rocksdb/pull/2958 Differential Revision: D5961131 Pulled By: lth fbshipit-source-id: 06179bebd0d90db3d43690b5eb7345e2d5bab1eb	2017-10-04 09:57:23 -07:00
Yi Wu	d1cab2b64e	Add ValueType::kTypeBlobIndex Summary: Add kTypeBlobIndex value type, which will be used by blob db only, to insert a (key, blob_offset) KV pair. The purpose is to 1. Make it possible to open existing rocksdb instance as blob db. Existing value will be of kTypeIndex type, while value inserted by blob db will be of kTypeBlobIndex. 2. Make rocksdb able to detect if the db contains value written by blob db, if so return error. 3. Make it possible to have blob db optionally store value in SST file (with kTypeValue type) or as a blob value (with kTypeBlobIndex type). The root db (DBImpl) basically pretended kTypeBlobIndex are normal value on write. On Get if is_blob is provided, return whether the value read is of kTypeBlobIndex type, or return Status::NotSupported() status if is_blob is not provided. On scan allow_blob flag is pass and if the flag is true, return wether the value is of kTypeBlobIndex type via iter->IsBlob(). Changes on blob db side will be in a separate patch. Closes https://github.com/facebook/rocksdb/pull/2886 Differential Revision: D5838431 Pulled By: yiwu-arbug fbshipit-source-id: 3c5306c62bc13bb11abc03422ec5cbcea1203cca	2017-10-03 09:11:23 -07:00
Zhongyi Xie	593d3de371	No need for Restart Interval for meta blocks Summary: In SST files, restart interval helps us search in data blocks. However, some meta blocks will be read sequentially, so there's no need for restart points. Restart interval will introduce extra space in the block (https://github.com/facebook/rocksdb/blob/master/table/block_builder.cc#L80). We will see if we can remove this redundant space. (Maybe set restart interval to infinite.) Closes https://github.com/facebook/rocksdb/pull/2940 Differential Revision: D5930139 Pulled By: miasantreble fbshipit-source-id: 92b1b23c15cffa90378343ac846b713623b19c21	2017-09-29 20:26:20 -07:00
Maysam Yabandeh	ab0542f5ec	Fix for when block.cache_handle is nullptr Summary: When using with compressed cache it is possible that the status is ok but the block is not actually added to the block cache. The patch takes this case into account. Closes https://github.com/facebook/rocksdb/pull/2945 Differential Revision: D5937613 Pulled By: maysamyabandeh fbshipit-source-id: 5428cf1115e5046b3d01ab78d26cb181122af4c6	2017-09-29 07:56:55 -07:00
Sagar Vemuri	93c2b91740	Introduce conditional merge-operator invocation in point lookups Summary: For every merge operand encountered for a key in the read path we now have the ability to decide whether to look further (to retrieve more merge operands for the key) or stop and invoke the merge operator to return the value. The user needs to override `ShouldMerge()` method with a condition to terminate search when true to avail this facility. This has a couple of advantages: 1. It helps in limiting the number of merge operands that are looked at to compute a value as part of a user Get operation. 2. It allows to peek at a merge key-value to see if further merge operands need to look at. Example: Limiting the number of merge operands that are looked at: Lets say you have 10 merge operands for a key spread over various levels. If you only want RocksDB to look at the latest two merge operands instead of all 10 to compute the value, it is now possible with this PR. You can set the condition in `ShouldMerge()` to return true when the size of the operand list is 2. Look at the example implementation in the unit test. Without this PR, a Get might look at all the 10 merge operands in different levels before invoking the merge-operator. Added a new unit test. Made sure that there is no perf regression by running benchmarks. Command line to Load data: ``` TEST_TMPDIR=/dev/shm ./db_bench --benchmarks="mergerandom" --merge_operator="uint64add" --num=10000000 ... mergerandom : 12.861 micros/op 77757 ops/sec; 8.6 MB/s ( updates:10000000) ``` ReadRandomMergeRandom bechmark results: Command line: ``` TEST_TMPDIR=/dev/shm ./db_bench --benchmarks="readrandommergerandom" --merge_operator="uint64add" --num=10000000 ``` Base -- Without this code change (on commit fc7476b): ``` readrandommergerandom : 38.586 micros/op 25916 ops/sec; (reads:3001599 merges:6998401 total:10000000 hits:842235 maxlength:8) ``` With this code change: ``` readrandommergerandom : 38.653 micros/op 25870 ops/sec; (reads:3001599 merges:6998401 total:10000000 hits:842235 maxlength:8) ``` Closes https://github.com/facebook/rocksdb/pull/2923 Differential Revision: D5898239 Pulled By: sagar0 fbshipit-source-id: daefa325019f77968639a75c851d46352c2303ef	2017-09-28 15:58:49 -07:00
Andrew Kryczka	c2f6e45aa3	prevent nullptr dereference in table reader error case Summary: A user encountered segfault on the call to `CacheDependencies()`, probably because `NewIndexIterator()` failed before populating `*index_entry`. Let's avoid the call in that case. Closes https://github.com/facebook/rocksdb/pull/2939 Differential Revision: D5928611 Pulled By: ajkr fbshipit-source-id: 484be453dbb00e5e160e9c6a1bc933df7d80f574	2017-09-28 00:12:34 -07:00
Siying Dong	edcbb36944	Three code-level optimization to Iterator::Next() Summary: Three small optimizations: (1) iter_->IsKeyPinned() shouldn't be called if read_options.pin_data is not true. This may trigger function call all the way down the iterator tree. (2) reuse the iterator key object in DBIter::FindNextUserEntryInternal(). The constructor of the class has some overheads. (3) Move the switching direction logic in MergingIterator::Next() to a separate function. These three in total improves readseq performance by about 3% in my benchmark setting. Closes https://github.com/facebook/rocksdb/pull/2880 Differential Revision: D5829252 Pulled By: siying fbshipit-source-id: 991aea10c6d6c3b43769cb4db168db62954ad1e3	2017-09-14 17:57:31 -07:00
Siying Dong	64b6452e0c	Make InternalKeyComparator final and directly use it in merging iterator Summary: Merging iterator invokes InternalKeyComparator.Compare() frequently to heap merge. By making InternalKeyComparator final and merging iterator to directly use InternalKeyComparator rather than through Iterator interface, we can give compiler a choice to avoid one more virtual function call if possible. I ran readseq benchmark in memory-only use case to make sure the performance at least doesn't regress. I have to disable the final key word in debug build, as a hack test class depends on overriding the class. Closes https://github.com/facebook/rocksdb/pull/2860 Differential Revision: D5800461 Pulled By: siying fbshipit-source-id: ab876f22a09bb5c560740911412336e0e25ccb53	2017-09-11 12:04:21 -07:00
Maysam Yabandeh	f46464d383	write-prepared txn: call IsInSnapshot Summary: This patch instruments the read path to verify each read value against an optional ReadCallback class. If the value is rejected, the reader moves on to the next value. The WritePreparedTxn makes use of this feature to skip sequence numbers that are not in the read snapshot. Closes https://github.com/facebook/rocksdb/pull/2850 Differential Revision: D5787375 Pulled By: maysamyabandeh fbshipit-source-id: 49d808b3062ab35e7ae98ad388f659757794184c	2017-09-11 09:14:48 -07:00
Andrew Kryczka	7fbb9eccaf	support disabling checksum in block-based table Summary: store a zero as the checksum when disabled since it's easier to keep block trailer a fixed length. Closes https://github.com/facebook/rocksdb/pull/2781 Differential Revision: D5694702 Pulled By: ajkr fbshipit-source-id: 69cea9da415778ba2b600dfd9d0dfc8cb5188ecd	2017-08-23 19:40:47 -07:00
Andrew Kryczka	19cc66dc4f	fix clang bug in block-based table reader Summary: This is the warning that clang considers a bug and has been causing it to fail: ``` table/block_based_table_reader.cc:240:27: warning: Potential leak of memory pointed to by 'block.value' for (; biter.Valid(); biter.Next()) { ^~~~~ ``` Actually clang just doesn't have enough knowledge to statically determine it's safe. We can teach it using an assert. Closes https://github.com/facebook/rocksdb/pull/2779 Differential Revision: D5691225 Pulled By: ajkr fbshipit-source-id: 3f0d545bf44636953b30ee5243c63239e8f16d8e	2017-08-23 15:12:05 -07:00
Andrew Kryczka	234f33a3f9	allow nullptr Slice only as sentinel Summary: Allow `Slice` holding nullptr as a sentinel value but not in comparisons. This new restriction eliminates the need for the manual checks in 39ef900551a4d88c8546ca086baaba76730e6162, while still conforming to glibc's `memcmp` API. Thanks siying for the idea. Users may need to migrate, so mentioned it in HISTORY.md. Closes https://github.com/facebook/rocksdb/pull/2777 Differential Revision: D5686016 Pulled By: ajkr fbshipit-source-id: 03a2ca3fd9a0ebade9d0d5686c81d59a9534f563	2017-08-23 10:56:06 -07:00
Maysam Yabandeh	1dfcdb15f9	Extend pin_l0 to filter partitions Summary: This is the continuation of https://github.com/facebook/rocksdb/pull/2661 for filter partitions. When pin_l0 is set (along with cache_xxx), then open table open the filter partitions are loaded into the cache and pinned there. Closes https://github.com/facebook/rocksdb/pull/2766 Differential Revision: D5671098 Pulled By: maysamyabandeh fbshipit-source-id: 174f24018f1d7f1129621e7380287b65b67d2115	2017-08-23 07:56:08 -07:00
Maysam Yabandeh	1efc600ddf	Preload l0 index partitions Summary: This fixes the existing logic for pinning l0 index partitions. The patch preloads the partitions into block cache and pin them if they belong to level 0 and pin_l0 is set. The drawback is that it does many small IOs when preloading all the partitions into the cache is direct io is enabled. Working for a solution for that. Closes https://github.com/facebook/rocksdb/pull/2661 Differential Revision: D5554010 Pulled By: maysamyabandeh fbshipit-source-id: 1e6f32a3524d71355c77d4138516dcfb601ca7b2	2017-08-18 10:56:20 -07:00
Siying Dong	666a005f9b	Support prefetch last 512KB with direct I/O in block based file reader Summary: Right now, if direct I/O is enabled, prefetching the last 512KB cannot be applied, except compaction inputs or readahead is enabled for iterators. This can create a lot of I/O for HDD cases. To solve the problem, the 512KB is prefetched in block based table if direct I/O is enabled. The prefetched buffer is passed in totegher with random access file reader, so that we try to read from the buffer before reading from the file. This can be extended in the future to support flexible user iterator readahead too. Closes https://github.com/facebook/rocksdb/pull/2708 Differential Revision: D5593091 Pulled By: siying fbshipit-source-id: ee36ff6d8af11c312a2622272b21957a7b5c81e7	2017-08-11 12:16:45 -07:00
Aaron G	7848f0b24c	add VerifyChecksum() to db.h Summary: We need a tool to check any sst file corruption in the db. It will check all the sst files in current version and read all the blocks (data, meta, index) with checksum verification. If any verification fails, the function will return non-OK status. Closes https://github.com/facebook/rocksdb/pull/2498 Differential Revision: D5324269 Pulled By: lightmark fbshipit-source-id: 6f8a272008b722402a772acfc804524c9d1a483b	2017-08-09 15:58:13 -07:00
Siying Dong	21696ba502	Replace dynamic_cast<> Summary: Replace dynamic_cast<> so that users can choose to build with RTTI off, so that they can save several bytes per object, and get tiny more memory available. Some nontrivial changes: 1. Add Comparator::GetRootComparator() to get around the internal comparator hack 2. Add the two experiemental functions to DB 3. Add TableFactory::GetOptionString() to avoid unnecessary casting to get the option string 4. Since 3 is done, move the parsing option functions for table factory to table factory files too, to be symmetric. Closes https://github.com/facebook/rocksdb/pull/2645 Differential Revision: D5502723 Pulled By: siying fbshipit-source-id: fd13cec5601cf68a554d87bfcf056f2ffa5fbf7c	2017-07-28 16:27:16 -07:00
Andrew Kryczka	710411aea6	fix asan/valgrind for TableCache cleanup Summary: Breaking commit: d12691b86fb788f0ee7180db626c4ea2445fa976 In the above commit, I moved the `TableCache` cleanup logic from `Version` destructor into `PurgeObsoleteFiles`. I missed cleaning up `TableCache` entries for the current `Version` during DB destruction. This PR adds that logic to `VersionSet` destructor. One unfortunate side effect is now we're potentially deleting `TableReader`s after `column_family_set_.reset()`, which means we can't call `BlockBasedTableReader::Close` a second time as the block cache might already be destroyed. Closes https://github.com/facebook/rocksdb/pull/2662 Differential Revision: D5515108 Pulled By: ajkr fbshipit-source-id: 2cb820e19aa813e0d258d17f76b2d7b6b7ee0b18	2017-07-27 20:28:04 -07:00
Aaron Gao	8f553d3c52	remove unnecessary internal_comparator param in newIterator Summary: solved https://github.com/facebook/rocksdb/issues/2604 Closes https://github.com/facebook/rocksdb/pull/2648 Differential Revision: D5504875 Pulled By: lightmark fbshipit-source-id: c14bb62ccbdc9e7bda9cd914cae4ea0765d882ee	2017-07-27 14:30:42 -07:00
Sagar Vemuri	72502cf227	Revert "comment out unused parameters" Summary: This reverts the previous commit 1d7048c5985e60be8e356663ec3cb6d020adb44d, which broke the build. Did a `git revert 1d7048c`. Closes https://github.com/facebook/rocksdb/pull/2627 Differential Revision: D5476473 Pulled By: sagar0 fbshipit-source-id: 4756ff5c0dfc88c17eceb00e02c36176de728d06	2017-07-21 18:26:26 -07:00
Victor Gao	1d7048c598	comment out unused parameters Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually. Reviewed By: igorsugak Differential Revision: D5454343 fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2	2017-07-21 14:57:44 -07:00
Siying Dong	ae28634e9f	Remove some left-over BSD headers Summary: Closes https://github.com/facebook/rocksdb/pull/2608 Differential Revision: D5444797 Pulled By: siying fbshipit-source-id: 690581d03f37822e059a16085088e8e2d8a45016	2017-07-18 11:56:57 -07:00
Sushma Devendrappa	0655b58582	enable PinnableSlice for RowCache Summary: This patch enables using PinnableSlice for RowCache, changes include not releasing the cache handle immediately after lookup in TableCache::Get, instead pass a Cleanble function which does Cache::RleaseHandle. Closes https://github.com/facebook/rocksdb/pull/2492 Differential Revision: D5316216 Pulled By: maysamyabandeh fbshipit-source-id: d2a684bd7e4ba73772f762e58a82b5f4fbd5d362	2017-07-17 15:08:30 -07:00
Daniel Black	cbaab30449	table/block.h: change memset Summary: In gcc-7 the following is an error identified by -Werror=class-memaccess In file included from ./table/get_context.h:14:0, from db/version_set.cc:43: ./table/block.h: In constructor ‘rocksdb::BlockReadAmpBitmap::BlockReadAmpBitmap(size_t, size_t, rocksdb::Statistics)’: ./table/block.h:73:53: error: ‘void memset(void, int, size_t)’ clearing an object of type ‘struct std::atomic<unsigned int>’ with no trivial copy-assignment; use value-initialization instead [-Werror=class-memaccess] memset(bitmap_, 0, bitmap_size kBytesPersEntry); ^ In file included from ./db/version_set.h:23:0, from db/version_set.cc:12: /toolchain/include/c++/8.0.0/atomic:684:12: note: ‘struct std::atomic<unsigned int>’ declared here struct atomic<unsigned int> : __atomic_base<unsigned int> ^~~~~~~~~~~~~~~~~~~~ As a solution the default initializer can be applied in list context. Signed-off-by: Daniel Black <daniel.black@au.ibm.com> Closes https://github.com/facebook/rocksdb/pull/2561 Differential Revision: D5398714 Pulled By: siying fbshipit-source-id: d883fb88ec7535eee60d551038fe91f14488be36	2017-07-17 10:41:56 -07:00
Yedidya Feldblum	f1a056e005	CodeMod: Prefer ADD_FAILURE() over EXPECT_TRUE(false), et cetera Summary: CodeMod: Prefer `ADD_FAILURE()` over `EXPECT_TRUE(false)`, et cetera. The tautologically-conditioned and tautologically-contradicted boolean expectations/assertions have better alternatives: unconditional passes and failures. Reviewed By: Orvid Differential Revision: D5432398 Tags: codemod, codemod-opensource fbshipit-source-id: d16b447e8696a6feaa94b41199f5052226ef6914	2017-07-16 21:26:02 -07:00
Siying Dong	3c327ac2d0	Change RocksDB License Summary: Closes https://github.com/facebook/rocksdb/pull/2589 Differential Revision: D5431502 Pulled By: siying fbshipit-source-id: 8ebf8c87883daa9daa54b2303d11ce01ab1f6f75	2017-07-15 16:11:23 -07:00
奏之章	70440f7a63	Add virtual func IsDeleteRangeSupported Summary: this modify allows third-party tables able to support delete range Closes https://github.com/facebook/rocksdb/pull/2035 Differential Revision: D5407973 Pulled By: ajkr fbshipit-source-id: 82e364b7dd5a198660788d59543f15b8f95cc418	2017-07-12 16:58:45 -07:00
Daniel Black	7a0b5de771	Gcc 7 ignored quantifiers Summary: The casting seemed to cause a problem. I think this might increase it to unsigned long. Closes https://github.com/facebook/rocksdb/pull/2562 Differential Revision: D5406842 Pulled By: siying fbshipit-source-id: 736adef31448229a58a1a48bdbe77792f36736e8	2017-07-12 09:41:15 -07:00
Maysam Yabandeh	98669b5356	init filters_in_partition_ Summary: Valgrind reports that it is not initialized. Closes https://github.com/facebook/rocksdb/pull/2541 Differential Revision: D5376084 Pulled By: maysamyabandeh fbshipit-source-id: 55c312f4f506863aa0d25ff92c8c34b57f48b860	2017-07-06 08:48:53 -07:00
Maysam Yabandeh	0013bf14ef	fix asan and valgrind leak report in test Summary: Closes https://github.com/facebook/rocksdb/pull/2537 Differential Revision: D5371433 Pulled By: maysamyabandeh fbshipit-source-id: 90d3e8bb1a8576f48b1ddf1bdbba5512b5986ba0	2017-07-05 19:11:39 -07:00
Maysam Yabandeh	f6b9d9355e	Fix clang error in PartitionedFilterBlockBuilder Summary: Closes https://github.com/facebook/rocksdb/pull/2536 Differential Revision: D5371271 Pulled By: maysamyabandeh fbshipit-source-id: f1355ac658a79c9982a24986f0925c9e24fc39d5	2017-07-05 11:16:04 -07:00
Maysam Yabandeh	45b9bb0331	Cut filter partition based on metadata_block_size Summary: Currently metadata_block_size controls only index partition size. With this patch a partition is cut after any of index or filter partitions reaches metadata_block_size. Closes https://github.com/facebook/rocksdb/pull/2452 Differential Revision: D5275651 Pulled By: maysamyabandeh fbshipit-source-id: 5057e4424b4c8902043782e6bf8c38f0c4f25160	2017-07-02 10:42:12 -07:00
Mike Kolupaev	397ab11152	Improve Status message for block checksum mismatches Summary: We've got some DBs where iterators return Status with message "Corruption: block checksum mismatch" all the time. That's not very informative. It would be much easier to investigate if the error message contained the file name - then we would know e.g. how old the corrupted file is, which would be very useful for finding the root cause. This PR adds file name, offset and other stuff to some block corruption-related status messages. It doesn't improve all the error messages, just a few that were easy to improve. I'm mostly interested in "block checksum mismatch" and "Bad table magic number" since they're the only corruption errors that I've ever seen in the wild. Closes https://github.com/facebook/rocksdb/pull/2507 Differential Revision: D5345702 Pulled By: al13n321 fbshipit-source-id: fc8023d43f1935ad927cef1b9c55481ab3cb1339	2017-06-28 21:27:01 -07:00
Sagar Vemuri	1cd45cd1b3	FIFO Compaction with TTL Summary: Introducing FIFO compactions with TTL. FIFO compaction is based on size only which makes it tricky to enable in production as use cases can have organic growth. A user requested an option to drop files based on the time of their creation instead of the total size. To address that request: - Added a new TTL option to FIFO compaction options. - Updated FIFO compaction score to take TTL into consideration. - Added a new table property, creation_time, to keep track of when the SST file is created. - Creation_time is set as below: - On Flush: Set to the time of flush. - On Compaction: Set to the max creation_time of all the files involved in the compaction. - On Repair and Recovery: Set to the time of repair/recovery. - Old files created prior to this code change will have a creation_time of 0. - FIFO compaction with TTL is enabled when ttl > 0. All files older than ttl will be deleted during compaction. i.e. `if (file.creation_time < (current_time - ttl)) then delete(file)`. This will enable cases where you might want to delete all files older than, say, 1 day. - FIFO compaction will fall back to the prior way of deleting files based on size if: - the creation_time of all files involved in compaction is 0. - the total size (of all SST files combined) does not drop below `compaction_options_fifo.max_table_files_size` even if the files older than ttl are deleted. This feature is not supported if max_open_files != -1 or with table formats other than Block-based. Test Plan: Added tests. Benchmark results: Base: FIFO with max size: 100MB :: ``` svemuri@dev15905 ~/rocksdb (fifo-compaction) $ TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=readwhilewriting --num=5000000 --threads=16 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=100 readwhilewriting : 1.924 micros/op 519858 ops/sec; 13.6 MB/s (1176277 of 5000000 found) ``` With TTL (a low one for testing) :: ``` svemuri@dev15905 ~/rocksdb (fifo-compaction) $ TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=readwhilewriting --num=5000000 --threads=16 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=100 --fifo_compaction_ttl=20 readwhilewriting : 1.902 micros/op 525817 ops/sec; 13.7 MB/s (1185057 of 5000000 found) ``` Example Log lines: ``` 2017/06/26-15:17:24.609249 7fd5a45ff700 (Original Log Time 2017/06/26-15:17:24.609177) [db/compaction_picker.cc:1471] [default] FIFO compaction: picking file 40 with creation time 1498515423 for deletion 2017/06/26-15:17:24.609255 7fd5a45ff700 (Original Log Time 2017/06/26-15:17:24.609234) [db/db_impl_compaction_flush.cc:1541] [default] Deleted 1 files ... 2017/06/26-15:17:25.553185 7fd5a61a5800 [DEBUG] [db/db_impl_files.cc:309] [JOB 0] Delete /dev/shm/dbbench/000040.sst type=2 #40 -- OK 2017/06/26-15:17:25.553205 7fd5a61a5800 EVENT_LOG_v1 {"time_micros": 1498515445553199, "job": 0, "event": "table_file_deletion", "file_number": 40} ``` SST Files remaining in the dbbench dir, after db_bench execution completed: ``` svemuri@dev15905 ~/rocksdb (fifo-compaction) $ ls -l /dev/shm//dbbench/*.sst -rw-r--r--. 1 svemuri users 30749887 Jun 26 15:17 /dev/shm//dbbench/000042.sst -rw-r--r--. 1 svemuri users 30768779 Jun 26 15:17 /dev/shm//dbbench/000044.sst -rw-r--r--. 1 svemuri users 30757481 Jun 26 15:17 /dev/shm//dbbench/000046.sst ``` Closes https://github.com/facebook/rocksdb/pull/2480 Differential Revision: D5305116 Pulled By: sagar0 fbshipit-source-id: 3e5cfcf5dd07ed2211b5b37492eb235b45139174	2017-06-27 17:11:48 -07:00
Siying Dong	5c97a7c066	Unit Tests for sync, range sync and file close failures Summary: Closes https://github.com/facebook/rocksdb/pull/2454 Differential Revision: D5255320 Pulled By: siying fbshipit-source-id: 0080830fa8eb5da6de25e17ba68aee91018c7913	2017-06-26 13:27:58 -07:00
Maysam Yabandeh	499ebb3ab5	Optimize for serial commits in 2PC Summary: Throughput: 46k tps in our sysbench settings (filling the details later) The idea is to have the simplest change that gives us a reasonable boost in 2PC throughput. Major design changes: 1. The WAL file internal buffer is not flushed after each write. Instead it is flushed before critical operations (WAL copy via fs) or when FlushWAL is called by MySQL. Flushing the WAL buffer is also protected via mutex_. 2. Use two sequence numbers: last seq, and last seq for write. Last seq is the last visible sequence number for reads. Last seq for write is the next sequence number that should be used to write to WAL/memtable. This allows to have a memtable write be in parallel to WAL writes. 3. BatchGroup is not used for writes. This means that we can have parallel writers which changes a major assumption in the code base. To accommodate for that i) allow only 1 WriteImpl that intends to write to memtable via mem_mutex_--which is fine since in 2PC almost all of the memtable writes come via group commit phase which is serial anyway, ii) make all the parts in the code base that assumed to be the only writer (via EnterUnbatched) to also acquire mem_mutex_, iii) stat updates are protected via a stat_mutex_. Note: the first commit has the approach figured out but is not clean. Submitting the PR anyway to get the early feedback on the approach. If we are ok with the approach I will go ahead with this updates: 0) Rebase with Yi's pipelining changes 1) Currently batching is disabled by default to make sure that it will be consistent with all unit tests. Will make this optional via a config. 2) A couple of unit tests are disabled. They need to be updated with the serial commit of 2PC taken into account. 3) Replacing BatchGroup with mem_mutex_ got a bit ugly as it requires releasing mutex_ beforehand (the same way EnterUnbatched does). This needs to be cleaned up. Closes https://github.com/facebook/rocksdb/pull/2345 Differential Revision: D5210732 Pulled By: maysamyabandeh fbshipit-source-id: 78653bd95a35cd1e831e555e0e57bdfd695355a4	2017-06-24 14:11:29 -07:00
Maysam Yabandeh	0ac4afb975	Sanitize partitioning options Summary: We currently do not support partitioning filters if indexes are not partitioned. The patch makes sure that these two are consistent. Closes https://github.com/facebook/rocksdb/pull/2455 Differential Revision: D5275644 Pulled By: maysamyabandeh fbshipit-source-id: b61701ac8914c2206d06f5e33ff6f67b24406d1d	2017-06-23 18:30:01 -07:00
Maysam Yabandeh	6f4154d693	record index partition properties Summary: When Partitioning index/filter is enabled the user might need to check the index block size as well as the top-level index size via sst_dump. This patch records i) number of partitions, ii) top-level index size and make it accessible through sst_dump. The number of partitions for filters is the same as that of indexes. The top-level index for filters has a similar size to top-level index for indexes, so it is not repeated. Closes https://github.com/facebook/rocksdb/pull/2437 Differential Revision: D5224225 Pulled By: maysamyabandeh fbshipit-source-id: 5324598c75793523aef1bb7ee225a5475e95a9cb	2017-06-13 11:21:32 -07:00
Siying Dong	5582123dee	Sample number of reads per SST file Summary: We estimate number of reads per SST files, by updating the counter per file in sampled read requests. This information can later be used to trigger compactions to improve read performacne. Closes https://github.com/facebook/rocksdb/pull/2417 Differential Revision: D5193528 Pulled By: siying fbshipit-source-id: b4241c5ad0eaf444b61afb53f8e6290d9f5da2df	2017-06-12 07:12:08 -07:00
Maysam Yabandeh	cc5f9339ee	Fix concurrency issue with filter_block_set_ Summary: filter_block_set_ access must also be protected with mutex. Closes https://github.com/facebook/rocksdb/pull/2413 Differential Revision: D5193159 Pulled By: maysamyabandeh fbshipit-source-id: 6987fc219d9a65c20b9c7e52151aef4b8e4882e6	2017-06-06 12:56:52 -07:00
Aaron Gao	7f6c02dda1	using ThreadLocalPtr to hide ROCKSDB_SUPPORT_THREAD_LOCAL from public… Summary: … headers https://github.com/facebook/rocksdb/pull/2199 should not reference RocksDB-specific macros (like ROCKSDB_SUPPORT_THREAD_LOCAL in this case) to public headers, `iostats_context.h` and `perf_context.h`. We shouldn't do that because users have to provide these compiler flags when building their binary with RocksDB. We should hide the thread local global variable inside our implementation and just expose a function api to retrieve these variables. It may break some users for now but good for long term. make check -j64 Closes https://github.com/facebook/rocksdb/pull/2380 Differential Revision: D5177896 Pulled By: lightmark fbshipit-source-id: 6fcdfac57f2e2dcfe60992b7385c5403f6dcb390	2017-06-02 17:26:19 -07:00
Mike Kolupaev	138b87eae4	Fix interaction between CompactionFilter::Decision::kRemoveAndSkipUnt… Summary: Fixes the following scenario: 1. Set prefix extractor. Enable bloom filters, with `whole_key_filtering = false`. Use compaction filter that sometimes returns `kRemoveAndSkipUntil`. 2. Do a compaction. 3. Compaction creates an iterator with `total_order_seek = false`, calls `SeekToFirst()` on it, then repeatedly calls `Next()`. 4. At some point compaction filter returns `kRemoveAndSkipUntil`. 5. Compaction calls `Seek(skip_until)` on the iterator. The key that it seeks to happens to have prefix that doesn't match the bloom filter. Since `total_order_seek = false`, iterator becomes invalid, and compaction thinks that it has reached the end. The rest of the compaction input is silently discarded. The fix is to make compaction iterator use `total_order_seek = true`. The implementation for PlainTable is quite awkward. I've made `kRemoveAndSkipUntil` officially incompatible with PlainTable. If you try to use them together, compaction will fail, and DB will enter read-only mode (`bg_error_`). That's not a very graceful way to communicate a misconfiguration, but the alternatives don't seem worth the implementation time and complexity. To be able to check in advance that `kRemoveAndSkipUntil` is not going to be used with PlainTable, we'd need to extend the interface of either `CompactionFilter` or `InternalIterator`. It seems unlikely that anyone will ever want to use `kRemoveAndSkipUntil` with PlainTable: PlainTable probably has very few users, and `kRemoveAndSkipUntil` has only one user so far: us (logdevice). Closes https://github.com/facebook/rocksdb/pull/2349 Differential Revision: D5110388 Pulled By: lightmark fbshipit-source-id: ec29101a99d9dcd97db33923b87f72bce56cc17a	2017-06-02 15:11:38 -07:00
Andrew Kryczka	a4d9c02511	Pass CF ID to MemTableRepFactory Summary: Some users want to monitor column family activity in their custom memtable implementations. Previously there was no way to figure out with which column family a memtable is associated. This diff: - adds an overload to MemTableRepFactory::CreateMemTableRep() that provides the CF ID. For compatibility, its default implementation calls the old overload. - updates MemTable to create MemTableRep's using the new overload. Closes https://github.com/facebook/rocksdb/pull/2346 Differential Revision: D5108061 Pulled By: ajkr fbshipit-source-id: 3a1921214a348dd8ea0f54e1cab3b71c3d46d616	2017-06-02 12:12:06 -07:00
Aaron Gao	f7bb1a0060	support merge and delete in file ingestion Summary: Previously sst_file_writer only supports kTypeValue, we need kTypeMerge and kTypeDeletion also as user requested. Closes https://github.com/facebook/rocksdb/pull/2361 Differential Revision: D5139402 Pulled By: lightmark fbshipit-source-id: 092a60756d01692539d817a3765ebfd58a8d7f88	2017-05-26 12:11:21 -07:00

1 2 3 4 5 ...

676 Commits