rocksdb

Author	SHA1	Message	Date
Islam AbdelRahman	67f37cf198	Allow user to specify a CF for SST files generated by SstFileWriter Summary: Allow user to explicitly specify that the generated file by SstFileWriter will be ingested in a specific CF. This allow us to persist the CF id in the generated file Closes https://github.com/facebook/rocksdb/pull/1615 Differential Revision: D4270422 Pulled By: IslamAbdelRahman fbshipit-source-id: 7fb954e	2016-12-05 14:24:16 -08:00
Andrew Kryczka	fd43ee09da	Range deletion microoptimizations Summary: - Made RangeDelAggregator's InternalKeyComparator member a reference-to-const so we don't need to copy-construct it. Also added InternalKeyComparator to ImmutableCFOptions so we don't need to construct one for each DBIter. - Made MemTable::NewRangeTombstoneIterator and the table readers' NewRangeTombstoneIterator() functions return nullptr instead of NewEmptyInternalIterator to avoid the allocation. Updated callers accordingly. Closes https://github.com/facebook/rocksdb/pull/1548 Differential Revision: D4208169 Pulled By: ajkr fbshipit-source-id: 2fd65cf	2016-11-21 12:24:13 -08:00
Andrew Kryczka	fe349db57b	Remove Arena in RangeDelAggregator Summary: The Arena construction/destruction introduced significant overhead to read-heavy workload just by creating empty vectors for its blocks, so avoid it in RangeDelAggregator. Closes https://github.com/facebook/rocksdb/pull/1547 Differential Revision: D4207781 Pulled By: ajkr fbshipit-source-id: 9d1c130	2016-11-19 14:24:12 -08:00
Siying Dong	a4eb7387b2	Allow plain table to store index on file with bloom filter disabled Summary: Currently plain table bloom filter is required if storing metadata on file. Remove the constraint. Closes https://github.com/facebook/rocksdb/pull/1525 Differential Revision: D4190977 Pulled By: siying fbshipit-source-id: be60442	2016-11-17 11:09:13 -08:00
Andrew Kryczka	327085b7b2	fix valgrind Summary: Closes https://github.com/facebook/rocksdb/pull/1526 Differential Revision: D4191257 Pulled By: ajkr fbshipit-source-id: d09dc76	2016-11-16 12:09:11 -08:00
Andrew Kryczka	48e8baebc0	Decouple data iterator and range deletion iterator in TableCache Summary: Previously we used TableCache::NewIterator() for multiple purposes (data block iterator and range deletion iterator), and returned non-ok status in the data block iterator. In one case where the caller only used the range deletion block iterator (`9e7cf3469b/db/version_set.cc (L965-L973)`), we didn't check/free the data block iterator containing non-ok status, which caused a valgrind error. So, this diff decouples creation of data block and range deletion block iterators, and updates the callers accordingly. Both functions can return non-ok status in an InternalIterator. Since the non-ok status is returned in an iterator that the callers will definitely use, it should be more usable/less error-prone. Closes https://github.com/facebook/rocksdb/pull/1513 Differential Revision: D4181423 Pulled By: ajkr fbshipit-source-id: 835b8f5	2016-11-15 17:24:28 -08:00
Islam AbdelRahman	5ed650857d	Fix SstFileWriter destructor Summary: If user did not call SstFileWriter::Finish() or called Finish() but it failed. We need to abandon the builder, to avoid destructing it while it's open Closes https://github.com/facebook/rocksdb/pull/1502 Differential Revision: D4171660 Pulled By: IslamAbdelRahman fbshipit-source-id: ab6f434	2016-11-12 20:11:19 -08:00
Andrew Kryczka	307a4e80c8	sst_dump support for range deletion Summary: Change DumpTable() so we can see the range deletion meta-block. Closes https://github.com/facebook/rocksdb/pull/1505 Differential Revision: D4172227 Pulled By: ajkr fbshipit-source-id: ae35665	2016-11-12 09:39:23 -08:00
Andrew Kryczka	815f54afad	Insert range deletion meta-block into block cache Summary: This handles two issues: (1) range deletion iterator sometimes outlives the table reader that created it, in which case the block must not be destroyed during table reader destruction; and (2) we prefer to read these range tombstone meta-blocks from file fewer times. - Extracted cache-populating logic from NewDataBlockIterator() into a separate function: MaybeLoadDataBlockToCache() - Use MaybeLoadDataBlockToCache() to load range deletion meta-block and pin it through the reader's lifetime. This code reuse works since range deletion meta-block has same format as data blocks. - Use NewDataBlockIterator() to create range deletion iterators, which uses block cache if enabled, otherwise reads the block from file. Either way, the underlying block won't disappear until after the iterator is destroyed. Closes https://github.com/facebook/rocksdb/pull/1459 Differential Revision: D4123175 Pulled By: ajkr fbshipit-source-id: 8f64281	2016-11-05 09:24:26 -07:00
Islam AbdelRahman	5c5d01ae74	Fix wrong comment (Maximum supported block size) Summary: We can support SST files >2GB but we don't support blocks >2GB Closes https://github.com/facebook/rocksdb/pull/1465 Differential Revision: D4132140 Pulled By: yiwu-arbug fbshipit-source-id: 63bf12d	2016-11-04 11:24:14 -07:00
Andrew Kryczka	f998c9790f	DeleteRange Get support Summary: During Get()/MultiGet(), build up a RangeDelAggregator with range tombstones as we search through live memtable, immutable memtables, and SST files. This aggregator is then used by memtable.cc's SaveValue() and GetContext::SaveValue() to check whether keys are covered. added tests for Get on memtables/files; end-to-end tests mainly in https://reviews.facebook.net/D64761 Closes https://github.com/facebook/rocksdb/pull/1456 Differential Revision: D4111271 Pulled By: ajkr fbshipit-source-id: 6e388d4	2016-11-03 18:54:20 -07:00
Andrew Kryczka	40a2e406f8	DeleteRange flush support Summary: Changed BuildTable() (used for flush) to (1) add range tombstones to the aggregator, which is used by CompactionIterator to determine which keys can be removed; and (2) add aggregator's range tombstones to the table that is output for the flush. Closes https://github.com/facebook/rocksdb/pull/1438 Differential Revision: D4100025 Pulled By: ajkr fbshipit-source-id: cb01a70	2016-10-31 20:54:18 -07:00
Willem Jan Withagen	0aab5e55f0	FreeBSD: malloc_usable_size is in <malloc_np.h> (#1428 ) Signed-off-by: Willem Jan Withagen <wjw@digiware.nl>	2016-10-28 10:44:52 -07:00
Aaron Gao	9de2f75216	revert Prev() in MergingIterator to use previous code in non-prefix-seek mode Summary: Siying suggested to keep old code for normal mode prev() for safety Test Plan: make check -j64 Reviewers: yiwu, andrewkr, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D65439	2016-10-24 13:13:01 -07:00
Aaron Gao	59a7c0337b	Change ioptions to store user_comparator, fix bug Summary: change ioptions.comparator to user_comparator instread of internal_comparator. Also change Comparator* to InternalKeyComparator* to make its type explicitly. Test Plan: make all check -j64 Reviewers: andrewkr, sdong, yiwu Reviewed By: yiwu Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D65121	2016-10-21 11:31:42 -07:00
Andrew Kryczka	a0ba0aa877	Fix uninitialized variable gcc error for MyRocks Summary: make sure seq_ is properly initialized even if ParseInternalKey() fails. Test Plan: run myrocks release tests Reviewers: lightmark, mung, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D65199	2016-10-19 10:59:46 -07:00
Islam AbdelRahman	b88f8e87c5	Support SST files with Global sequence numbers [reland] Summary: reland https://reviews.facebook.net/D62523 - Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno` - Update TableProperties to be aware of the offset of each property in the file - Update BlockBasedTableReader and Block to be able to honor the sequence number in `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence number in the file Something worth mentioning is that we don't update the seqno in the index block since and when doing a binary search, the reason for that is that it's guaranteed that SST files with global seqno will have only one user_key and each key will have seqno=0 encoded in it, This mean that this key is greater than any other key with seqno> 0. That mean that we can actually keep the current logic for these blocks Test Plan: unit tests Reviewers: sdong, yhchiang Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D65211	2016-10-18 16:59:37 -07:00
Andrew Kryczka	6fbe96baf8	Compaction Support for Range Deletion Summary: This diff introduces RangeDelAggregator, which takes ownership of iterators provided to it via AddTombstones(). The tombstones are organized in a two-level map (snapshot stripe -> begin key -> tombstone). Tombstone creation avoids data copy by holding Slices returned by the iterator, which remain valid thanks to pinning. For compaction, we create a hierarchical range tombstone iterator with structure matching the iterator over compaction input data. An aggregator based on that iterator is used by CompactionIterator to determine which keys are covered by range tombstones. In case of merge operand, the same aggregator is used by MergeHelper. Upon finishing each file in the compaction, relevant range tombstones are added to the output file's range tombstone metablock and file boundaries are updated accordingly. To check whether a key is covered by range tombstone, RangeDelAggregator::ShouldDelete() considers tombstones in the key's snapshot stripe. When this function is used outside of compaction, it also checks newer stripes, which can contain covering tombstones. Currently the intra-stripe check involves a linear scan; however, in the future we plan to collapse ranges within a stripe such that binary search can be used. RangeDelAggregator::AddToBuilder() adds all range tombstones in the table's key-range to a new table's range tombstone meta-block. Since range tombstones may fall in the gap between files, we may need to extend some files' key-ranges. The strategy is (1) first file extends as far left as possible and other files do not extend left, (2) all files extend right until either the start of the next file or the end of the last range tombstone in the gap, whichever comes first. One other notable change is adding release/move semantics to ScopedArenaIterator such that it can be used to transfer ownership of an arena-allocated iterator, similar to how unique_ptr is used for malloc'd data. Depends on D61473 Test Plan: compaction_iterator_test, mock_table, end-to-end tests in D63927 Reviewers: sdong, IslamAbdelRahman, wanning, yhchiang, lightmark Reviewed By: lightmark Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D62205	2016-10-18 12:04:56 -07:00
Aaron Gao	21e8daced5	fix assertion failure in Prev() Summary: fix assertion failure in db_stress. It happens because of prefix seek key is larger than merge iterator key when they have the same user key Test Plan: ./db_stress --max_background_compactions=1 --max_write_buffer_number=3 --sync=0 --reopen=20 --write_buffer_size=33554432 --delpercent=5 --log2_keys_per_lock=10 --block_size=16384 --allow_concurrent_memtable_write=0 --test_batches_snapshots=0 --max_bytes_for_level_base=67108864 --progress_reports=0 --mmap_read=0 --writepercent=35 --disable_data_sync=0 --readpercent=50 --subcompactions=4 --ops_per_thread=20000000 --memtablerep=skip_list --prefix_size=0 --target_file_size_multiplier=1 --column_families=1 --threads=32 --disable_wal=0 --open_files=500000 --destroy_db_initially=0 --target_file_size_base=16777216 --nooverwritepercent=1 --iterpercent=10 --max_key=100000000 --prefixpercent=0 --use_clock_cache=false --kill_random_test=888887 --cache_size=1048576 --verify_checksum=1 Reviewers: sdong, andrewkr, yiwu, yhchiang Reviewed By: yhchiang Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D65025	2016-10-13 17:36:48 -07:00
Dmitri Smirnov	e489270980	Fix scoped arena iterator (#1387 )	2016-10-12 11:16:16 -07:00
Aaron Gao	447f17127c	new Prev() prefix support using SeekForPrev() Summary: 1) The previous solution for Prev() prefix support is not clean. Since I add api SeekForPrev(), now the Prev() can be symmetric to Next(). and we do not need SeekToLast() to be called in Prev() any more. Also, Next() will Seek(prefix_seek_key_) to solve the problem of possible inconsistency between db_iter and merge_iter when there is merge_operator. And prefix_seek_key is only refreshed when change direction to forward. 2) This diff also solves the bug of Iterator::SeekToLast() with iterate_upper_bound_ with prefix extractor. add test cases for the above two cases. There are some tests for the SeekToLast() in Prev(), I will clean them later. Test Plan: make all check Reviewers: IslamAbdelRahman, andrewkr, yiwu, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D63933	2016-10-11 13:54:26 -07:00
Yi Wu	991b585ee0	More block cache tickers Summary: Adding several missing block cache tickers. Test Plan: make all check Reviewers: IslamAbdelRahman, yhchiang, lightmark Reviewed By: lightmark Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64881	2016-10-11 11:59:05 -07:00
Islam AbdelRahman	d062328977	Revert "Support SST files with Global sequence numbers" This reverts commit `ab01da5437`.	2016-10-07 14:05:12 -07:00
Islam AbdelRahman	ab01da5437	Support SST files with Global sequence numbers Summary: - Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno` - Update TableProperties to be aware of the offset of each property in the file - Update BlockBasedTableReader and Block to be able to honor the sequence number in `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence number in the file Something worth mentioning is that we don't update the seqno in the index block since and when doing a binary search, the reason for that is that it's guaranteed that SST files with global seqno will have only one user_key and each key will have seqno=0 encoded in it, This mean that this key is greater than any other key with seqno> 0. That mean that we can actually keep the current logic for these blocks Test Plan: unit tests Reviewers: andrewkr, yhchiang, yiwu, sdong Reviewed By: sdong Subscribers: hcz, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D62523	2016-10-03 16:12:39 -07:00
Andrew Kryczka	6009c473c7	Store range tombstones in memtable Summary: - Store range tombstones in a separate MemTableRep instantiated with ColumnFamilyOptions::memtable_factory - MemTable::NewRangeTombstoneIterator() returns a MemTableIterator over the separate MemTableRep - Part of the read path is not implemented yet (i.e., MemTable::Get()) Test Plan: see unit tests Reviewers: wanning Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D62217	2016-09-30 09:06:43 -07:00
Aaron Gao	f517d9dd09	Add SeekForPrev() to Iterator Summary: Add new Iterator API, `SeekForPrev`: find the last key that <= target key support prefix_extractor support prefix_same_as_start support upper_bound not supported in iterators without Prev() Also add tests in db_iter_test and db_iterator_test Pass all tests Cheers! Test Plan: make all check -j64 Reviewers: andrewkr, yiwu, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64149	2016-09-27 18:20:57 -07:00
panfengfeng	7afbb7420b	solve the problem of table_factory_to_write_=nullptr (#1342 )	2016-09-20 10:11:51 -07:00
rockeet	4c3f4496b5	Add TableBuilderOptions::level and relevant changes (#1335 )	2016-09-17 22:30:43 -07:00
sdong	3edb9461b7	Avoid hard-coded sleep in EnvPosixTestWithParam.TwoPools Summary: EnvPosixTestWithParam.TwoPools relies on explicit sleeping, so it sometimes fail. Fix it. Test Plan: Run tests with high parallelism many times and make sure the test passes. Reviewers: yiwu, andrewkr Reviewed By: andrewkr Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D63417	2016-09-16 17:45:12 -07:00
Yi Wu	81747f1be6	Refactor MutableCFOptions Summary: * Change constructor of MutableCFOptions to depends only on ColumnFamilyOptions. * Move `max_subcompactions`, `compaction_options_fifo` and `compaction_pri` to ImmutableCFOptions to make it clear that they are immutable. Test Plan: existing unit tests. Reviewers: yhchiang, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D63945	2016-09-13 21:11:59 -07:00
Islam AbdelRahman	1cca091298	Temporarily revert Prev() prefix support Summary: Temporarily revert commits for supporting prefix Prev() to unblock MyRocks and RocksDB release These are the commits reverted - `6a14d55bd9` - `b18f9c9eac` - `db74b1a219` - `2482d5fb45` Test Plan: make check -j64 Reviewers: sdong, lightmark Reviewed By: lightmark Subscribers: andrewkr, dhruba, yoshinorim Differential Revision: https://reviews.facebook.net/D63789	2016-09-08 14:45:32 -07:00
sdong	607628d349	Support ZSTD with finalized format Summary: ZSTD 1.0.0 is coming. We can finally add a support of ZSTD without worrying about compatibility. Still keep ZSTDNotFinal for compatibility reason. Test Plan: Run all tests. Run db_bench with ZSTD version with RocksDB built with ZSTD 1.0 and older. Reviewers: andrewkr, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: cyan, igor, IslamAbdelRahman, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D63141	2016-09-06 12:22:16 -07:00
Injun Song	ce1be2ce37	Fix build error on Windows (AppVeyor) (#1315 ) Add 'cf_options' to source list and db_imple.cc fix casting	2016-09-06 08:41:43 -07:00
sdong	f7669b40ba	Fix Windows Build Summary: Fix two Windows build problems. Test Plan: Build on Windows and run all Linux tests. Reviewers: IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D63189	2016-09-02 17:10:28 -07:00
Yi Wu	a88677d2cf	Remove ImmutableCFOptions from public API Summary: There's no reference to ImmutableCFOptions elsewhere in /include/rocksdb. ImmutableCFOptions was introduced in this commit (`5665e5e285`) but later its reference in /include/rocksdb/table.h is removed. Test Plan: make all check Reviewers: IslamAbdelRahman, sdong, yhchiang Reviewed By: yhchiang Subscribers: yhchiang, andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D63177	2016-09-02 14:16:31 -07:00
Aaron Gao	4590b53a4b	add stats to Cache::LookUp() Summary: basically for SimCache stats. I find most times it is hard to pass Statistics* to SimCache constructor. Test Plan: make all check Reviewers: andrewkr, sdong, yiwu Reviewed By: yiwu Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D62193	2016-09-01 13:50:39 -07:00
Islam AbdelRahman	8ce1b8440a	Fix Travis on Mac Summary: not sure why travis complain about this line, works fine on my mac Test Plan: run on my mac Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D63045	2016-08-31 15:10:12 -07:00
Aaron Gao	db74b1a219	fix bug in merge_iterator when data race happens Summary: core dump when run `./db_stress --max_background_compactions=1 --max_write_buffer_number=3 --sync=0 --reopen=20 --write_buffer_size=33554432 --delpercent=5 --log2_keys_per_lock=10 --block_size=16384 --allow_concurrent_memtable_write=1 --test_batches_snapshots=0 --max_bytes_for_level_base=67108864 --progress_reports=0 --mmap_read=1 --kill_prefix_blacklist=WritableFileWriter::Append,WritableFileWriter::WriteBuffered --writepercent=35 --disable_data_sync=0 --readpercent=50 --subcompactions=3 --ops_per_thread=20000000 --memtablerep=skip_list --prefix_size=0 --target_file_size_multiplier=1 --column_families=1 --db=/dev/shm/rocksdb/rocksdb_crashtest_whitebox --threads=32 --disable_wal=0 --open_files=500000 --destroy_db_initially=0 --target_file_size_base=16777216 --nooverwritepercent=1 --iterpercent=10 --max_key=100000000 --prefixpercent=0 --use_clock_cache=false --kill_random_test=189 --cache_size=1048576 --verify_checksum=1` Actually the relevant flag is `--threads`, data race when --thread > 1 cause problem. It is possible that multiple threads read/write memtable simultaneously. After one thread calls Prev(), another thread may insert a new key just between the current key and the key next, which may cause the assert(current_ == CurrentForward()) failure when the first thread calls Next() again if in prefix seek mode Test Plan: rerun db_stress with >1 thread / make all check -j64 Reviewers: sdong, andrewkr, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D62979	2016-08-30 22:19:42 -07:00
Aaron Gao	b18f9c9eac	add nullptr check to internal_prefix_transform Summary: patch for D62361 Test Plan: make all check Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D62883	2016-08-30 13:48:31 -07:00
Aaron Gao	2482d5fb45	support Prev() in prefix seek mode Summary: As title, make sure Prev() works as expected with Next() when the current iter->key() in the range of the same prefix in prefix seek mode Test Plan: make all check -j64 (add prefix_test with PrefixSeekModePrev test case) Reviewers: andrewkr, sdong, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: yoshinorim, andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D61419	2016-08-29 20:55:39 -07:00
Islam AbdelRahman	b49b92cf28	Introduce Read amplification bitmap (read amp statistics) Summary: Add ReadOptions::read_amp_bytes_per_bit option which allow us to create a bitmap for every data block we read the bitmap will contain (block_size / read_amp_bytes_per_bit) bits. We will use this bitmap to mark which bytes have been used of the block so we can calculate the read amplification Test Plan: added new tests Reviewers: andrewkr, yhchiang, sdong Reviewed By: sdong Subscribers: yiwu, leveldb, march, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D58707	2016-08-26 18:55:58 -07:00
Aaron Gao	c7004840d2	store prefix_extractor_name in table Summary: Make sure prefix extractor name is stored in SST files and if DB is opened with a prefix extractor of a different name, prefix bloom is skipped when read the file. Also add unit tests for that. Test Plan: before change: ``` Note: Google Test filter = BlockBasedTableTest.SkipPrefixBloomFilter [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from BlockBasedTableTest [ RUN ] BlockBasedTableTest.SkipPrefixBloomFilter table/table_test.cc:1421: Failure Value of: db_iter->Valid() Actual: false Expected: true [ FAILED ] BlockBasedTableTest.SkipPrefixBloomFilter (1 ms) [----------] 1 test from BlockBasedTableTest (1 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test case ran. (1 ms total) [ PASSED ] 0 tests. [ FAILED ] 1 test, listed below: [ FAILED ] BlockBasedTableTest.SkipPrefixBloomFilter 1 FAILED TEST ``` after: ``` Note: Google Test filter = BlockBasedTableTest.SkipPrefixBloomFilter [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from BlockBasedTableTest [ RUN ] BlockBasedTableTest.SkipPrefixBloomFilter [ OK ] BlockBasedTableTest.SkipPrefixBloomFilter (0 ms) [----------] 1 test from BlockBasedTableTest (0 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test case ran. (0 ms total) [ PASSED ] 1 test. ``` Reviewers: sdong, andrewkr, yiwu, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D61215	2016-08-26 11:46:32 -07:00
Aaron Gao	cec2c6436b	fix data race in NewIndexIterator() in block_based_table_reader.cc Summary: fixed data race described in https://github.com/facebook/rocksdb/issues/1267 and add regression test Test Plan: ./table_test --gtest_filter=BlockBasedTableTest.NewIndexIteratorLeak make all check -j64 core dump before fix. ok after fix. Reviewers: andrewkr, sdong Reviewed By: sdong Subscribers: igor, andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D62361	2016-08-23 18:20:41 -07:00
Yi Wu	4a16c32ece	Option to cache index/filter blocks with priority Summary: Add option to block based table to insert index/filter blocks to block cache with priority. Combined with LRUCache with high_pri_pool_ratio, we can reserved space for index/filter blocks, make them less likely to be evicted. Depends on D61977. Test Plan: See unit test. Reviewers: lightmark, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, march, leveldb Differential Revision: https://reviews.facebook.net/D62241	2016-08-23 13:44:13 -07:00
Andrew Kryczka	ecf9003860	Fix bug in printing values for block-based table Summary: value is not an InternalKey, we do not need to decode it Test Plan: setup: $ ldb put --create_if_missing=true k v $ ldb put --db=./tmp --create_if_missing k v $ ldb compact --db=./tmp before: $ sst_dump --command=raw --file=./tmp/000004.sst ... terminate called after throwing an instance of 'std::length_error' after: $ ./sst_dump --command=raw --file=./tmp/000004.sst $ cat tmp/000004_dump.txt ... ASCII k : v ... Reviewers: sdong, yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D62301	2016-08-22 10:27:50 -07:00
Islam AbdelRahman	6a17b07ca8	Add TablePropertiesCollector support in SstFileWriter Summary: Update SstFileWriter to use user TablePropertiesCollectors that are passed in Options Test Plan: unittests Reviewers: sdong Reviewed By: sdong Subscribers: jkedgar, andrewkr, hermanlee4, dhruba, yoshinorim Differential Revision: https://reviews.facebook.net/D62253	2016-08-19 16:17:56 -07:00
Wanning Jiang	78837f5d61	TableBuilder / TableReader support for range deletion Summary: 1. Range Deletion Tombstone structure 2. Modify Add() in table_builder to make it usable for adding range del tombstones 3. Expose NewTombstoneIterator() API in table_reader Test Plan: table_test.cc (now BlockBasedTableBuilder::Add() only accepts InternalKey. I make table_test only pass InternalKey to BlockBasedTableBuidler. Also test writing/reading range deletion tombstones in table_test ) Reviewers: sdong, IslamAbdelRahman, lightmark, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D61473	2016-08-19 15:10:31 -07:00
Philipp Unterbrunner	deda159b55	Added min/max/avg data block size output to sst_dump Summary: Added min/max/avg data block size output to sst_dump. Output was added to the end of BlockBasedTable::DumpDataBlocks, so it appears after the data block details, at the very end of the dump file. Test Plan: ``` ./db_bench --benchmarks=fillrandom ./sst_dump --file=/tmp/rocksdbtest-xyz/dbbench/000007.sst --command=raw tail -n 6 /tmp/rocksdbtest-xyz/dbbench/000007_dump.txt ``` ``` Data Block Summary: -------------------------------------- # data blocks: 11336 min data block size: 903 max data block size: 2268 avg data block size: 2245.363356 ``` Reviewers: IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D61815	2016-08-12 16:34:11 -07:00
Islam AbdelRahman	b693ba68b5	Minor PinnedIteratorsManager Refactoring Summary: This diff include these simple change - Rename ReleasePinnedIterators to ReleasePinnedData - Rename PinIteratorIfNeeded to PinIterator - Use std::vector directly in PinnedIteratorsManager instead of std::unique_ptr<std::vector> - Generalize PinnedIteratorsManager by adding PinPtr which can pin any pointer Test Plan: existing tests Reviewers: sdong, yiwu, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D61305	2016-08-11 11:54:17 -07:00
omegaga	d51dc96a79	Experiments on column-aware encodings Summary: Experiments on column-aware encodings. Supported features: 1) extract data blocks from SST file and encode with specified encodings; 2) Decode encoded data back into row format; 3) Directly extract data blocks and write in row format (without prefix encoding); 4) Get column distribution statistics for column format; 5) Dump data blocks separated by columns in human-readable format. There is still on-going work on this diff. More refactoring is necessary. Test Plan: Wrote tests in `column_aware_encoding_test.cc`. More tests should be added. Reviewers: sdong Reviewed By: sdong Subscribers: arahut, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D60027	2016-08-01 14:50:19 -07:00
omegaga	e70020e4f6	Only cache level 0 indexes and filter when opening table reader Summary: In T8216281 we decided to disable prefetching the index and filter during opening table handlers during startup (max_open_files = -1). Test Plan: Rely on `IndexAndFilterBlocksOfNewTableAddedToCache` to guarantee L0 indexes and filters are still cached and change `PinL0IndexAndFilterBlocksTest` to make sure other levels are not cached (maybe add one more test to test we don't cache other levels?) Reviewers: sdong, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59913	2016-07-20 11:23:31 -07:00
Islam AbdelRahman	68a8e6b8fa	Introduce FullMergeV2 (eliminate memcpy from merge operators) Summary: This diff update the code to pin the merge operator operands while the merge operation is done, so that we can eliminate the memcpy cost, to do that we need a new public API for FullMerge that replace the std::deque<std::string> with std::vector<Slice> This diff is stacked on top of D56493 and D56511 In this diff we - Update FullMergeV2 arguments to be encapsulated in MergeOperationInput and MergeOperationOutput which will make it easier to add new arguments in the future - Replace std::deque<std::string> with std::vector<Slice> to pass operands - Replace MergeContext std::deque with std::vector (based on a simple benchmark I ran https://gist.github.com/IslamAbdelRahman/78fc86c9ab9f52b1df791e58943fb187) - Allow FullMergeV2 output to be an existing operand ``` [Everything in Memtable \| 10K operands \| 10 KB each \| 1 operand per key] DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=10000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000 [FullMergeV2] readseq : 0.607 micros/op 1648235 ops/sec; 16121.2 MB/s readseq : 0.478 micros/op 2091546 ops/sec; 20457.2 MB/s readseq : 0.252 micros/op 3972081 ops/sec; 38850.5 MB/s readseq : 0.237 micros/op 4218328 ops/sec; 41259.0 MB/s readseq : 0.247 micros/op 4043927 ops/sec; 39553.2 MB/s [master] readseq : 3.935 micros/op 254140 ops/sec; 2485.7 MB/s readseq : 3.722 micros/op 268657 ops/sec; 2627.7 MB/s readseq : 3.149 micros/op 317605 ops/sec; 3106.5 MB/s readseq : 3.125 micros/op 320024 ops/sec; 3130.1 MB/s readseq : 4.075 micros/op 245374 ops/sec; 2400.0 MB/s ``` ``` [Everything in Memtable \| 10K operands \| 10 KB each \| 10 operand per key] DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=1000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000 [FullMergeV2] readseq : 3.472 micros/op 288018 ops/sec; 2817.1 MB/s readseq : 2.304 micros/op 434027 ops/sec; 4245.2 MB/s readseq : 1.163 micros/op 859845 ops/sec; 8410.0 MB/s readseq : 1.192 micros/op 838926 ops/sec; 8205.4 MB/s readseq : 1.250 micros/op 800000 ops/sec; 7824.7 MB/s [master] readseq : 24.025 micros/op 41623 ops/sec; 407.1 MB/s readseq : 18.489 micros/op 54086 ops/sec; 529.0 MB/s readseq : 18.693 micros/op 53495 ops/sec; 523.2 MB/s readseq : 23.621 micros/op 42335 ops/sec; 414.1 MB/s readseq : 18.775 micros/op 53262 ops/sec; 521.0 MB/s ``` ``` [Everything in Block cache \| 10K operands \| 10 KB each \| 1 operand per key] [FullMergeV2] $ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions readseq : 14.741 micros/op 67837 ops/sec; 663.5 MB/s readseq : 1.029 micros/op 971446 ops/sec; 9501.6 MB/s readseq : 0.974 micros/op 1026229 ops/sec; 10037.4 MB/s readseq : 0.965 micros/op 1036080 ops/sec; 10133.8 MB/s readseq : 0.943 micros/op 1060657 ops/sec; 10374.2 MB/s [master] readseq : 16.735 micros/op 59755 ops/sec; 584.5 MB/s readseq : 3.029 micros/op 330151 ops/sec; 3229.2 MB/s readseq : 3.136 micros/op 318883 ops/sec; 3119.0 MB/s readseq : 3.065 micros/op 326245 ops/sec; 3191.0 MB/s readseq : 3.014 micros/op 331813 ops/sec; 3245.4 MB/s ``` ``` [Everything in Block cache \| 10K operands \| 10 KB each \| 10 operand per key] DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10-operands-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions [FullMergeV2] readseq : 24.325 micros/op 41109 ops/sec; 402.1 MB/s readseq : 1.470 micros/op 680272 ops/sec; 6653.7 MB/s readseq : 1.231 micros/op 812347 ops/sec; 7945.5 MB/s readseq : 1.091 micros/op 916590 ops/sec; 8965.1 MB/s readseq : 1.109 micros/op 901713 ops/sec; 8819.6 MB/s [master] readseq : 27.257 micros/op 36687 ops/sec; 358.8 MB/s readseq : 4.443 micros/op 225073 ops/sec; 2201.4 MB/s readseq : 5.830 micros/op 171526 ops/sec; 1677.7 MB/s readseq : 4.173 micros/op 239635 ops/sec; 2343.8 MB/s readseq : 4.150 micros/op 240963 ops/sec; 2356.8 MB/s ``` Test Plan: COMPILE_WITH_ASAN=1 make check -j64 Reviewers: yhchiang, andrewkr, sdong Reviewed By: sdong Subscribers: lovro, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57075	2016-07-20 09:49:03 -07:00
John Alexander	9430333f84	New Statistics to track Compression/Decompression (#1197 ) * Added new statistics and refactored to allow ioptions to be passed around as required to access environment and statistics pointers (and, as a convenient side effect, info_log pointer). * Prevent incrementing compression counter when compression is turned off in options. * Prevent incrementing compression counter when compression is turned off in options. * Added two more supported compression types to test code in db_test.cc * Prevent incrementing compression counter when compression is turned off in options. * Added new StatsLevel that excludes compression timing. * Fixed casting error in coding.h * Fixed CompressionStatsTest for new StatsLevel. * Removed unused variable that was breaking the Linux build	2016-07-19 09:44:03 -07:00
Jay Edgar	efd013d6d8	Miscellaneous performance improvements Summary: I was investigating performance issues in the SstFileWriter and found all of the following: - The SstFileWriter::Add() function created a local InternalKey every time it was called generating a allocation and free each time. Changed to have an InternalKey member variable that can be reset with the new InternalKey::Set() function. - In SstFileWriter::Add() the smallest_key and largest_key values were assigned the result of a ToString() call, but it is simpler to just assign them directly from the user's key. - The Slice class had no move constructor so each time one was returned from a function a new one had to be allocated, the old data copied to the new, and the old one was freed. I added the move constructor which also required a copy constructor and assignment operator. - The BlockBuilder::CurrentSizeEstimate() function calculates the current estimate size, but was being called 2 or 3 times for each key added. I changed the class to maintain a running estimate (equal to the original calculation) so that the function can return an already calculated value. - The code in BlockBuilder::Add() that calculated the shared bytes between the last key and the new key duplicated what Slice::difference_offset does, so I replaced it with the standard function. - BlockBuilder::Add() had code to copy just the changed portion into the last key value (and asserted that it now matched the new key). It is more efficient just to copy the whole new key over. - Moved this same code up into the 'if (use_delta_encoding_)' since the last key value is only needed when delta encoding is on. - FlushBlockBySizePolicy::BlockAlmostFull calculated a standard deviation value each time it was called, but this information would only change if block_size of block_size_deviation changed, so I created a member variable to hold the value to avoid the calculation each time. - Each PutVarint??() function has a buffer and calls std::string::append(). Two or three calls in a row could share a buffer and a single call to std::string::append(). Some of these will be helpful outside of the SstFileWriter. I'm not 100% the addition of the move constructor is appropriate as I wonder why this wasn't done before - maybe because of compiler compatibility? I tried it on gcc 4.8 and 4.9. Test Plan: The changes should not affect the results so the existing tests should all still work and no new tests were added. The value of the changes was seen by manually testing the SstFileWriter class through MyRocks and adding timing code to identify problem areas. Reviewers: sdong, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59607	2016-07-12 14:15:32 -07:00
Yi Wu	296545a2c7	Fix clang analyzer errors Summary: Fixing erros reported by clang static analyzer. * Removing some unused variables. * Adding assertions to fix false positives reported by clang analyzer. * Adding `__clang_analyzer__` macro to suppress false positive warnings. Test Plan: USE_CLANG=1 OPT=-g make analyze -j64 Reviewers: andrewkr, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D60549	2016-07-08 17:50:51 -07:00
sdong	32df9733d1	Add options.write_buffer_manager: control total memtable size across DB instances Summary: Add option write_buffer_manager to help users control total memory spent on memtables across multiple DB instances. Test Plan: Add a new unit test. Reviewers: yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: adela, benj, sumeet, muthu, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59925	2016-07-05 18:11:25 -07:00
Islam AbdelRahman	88a2776db5	Update SstFileWriter to use bottommost_compression if avaliable Summary: SstFileWriter ignore Options::bottommost_compression, update it to use bottommost_compression if available Test Plan: make check -j64 verified used compression using ./sst_dump Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, yoshinorim Differential Revision: https://reviews.facebook.net/D59841	2016-06-20 11:26:25 -07:00
Islam AbdelRahman	8366e10ffc	Fix clang build Summary: Fix clang build Test Plan: USE_CLANG=1 make check -j64 Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59667	2016-06-15 00:24:33 -07:00
Islam AbdelRahman	812dbfb483	Optimize BlockIter::Prev() by caching decoded entries Summary: Right now the way we do BlockIter::Prev() is like this - Go to the beginning of the restart interval - Keep moving forward (and decoding keys using ParseNextKey()) until we reach the desired key This can be optimized by caching the decoded entries in the first pass and reusing them in consecutive BlockIter::Prev() calls Before caching ``` DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readreverse" --db="/dev/shm/bench_prev_opt/" --use_existing_db --disable_auto_compactions DB path: [/dev/shm/bench_prev_opt/] readreverse : 0.413 micros/op 2423972 ops/sec; 268.2 MB/s DB path: [/dev/shm/bench_prev_opt/] readreverse : 0.414 micros/op 2413867 ops/sec; 267.0 MB/s DB path: [/dev/shm/bench_prev_opt/] readreverse : 0.410 micros/op 2440881 ops/sec; 270.0 MB/s DB path: [/dev/shm/bench_prev_opt/] readreverse : 0.414 micros/op 2417298 ops/sec; 267.4 MB/s DB path: [/dev/shm/bench_prev_opt/] readreverse : 0.413 micros/op 2421682 ops/sec; 267.9 MB/s ``` After caching ``` DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readreverse" --db="/dev/shm/bench_prev_opt/" --use_existing_db --disable_auto_compactions DB path: [/dev/shm/bench_prev_opt/] readreverse : 0.324 micros/op 3088955 ops/sec; 341.7 MB/s DB path: [/dev/shm/bench_prev_opt/] readreverse : 0.335 micros/op 2980999 ops/sec; 329.8 MB/s DB path: [/dev/shm/bench_prev_opt/] readreverse : 0.341 micros/op 2929681 ops/sec; 324.1 MB/s DB path: [/dev/shm/bench_prev_opt/] readreverse : 0.344 micros/op 2908490 ops/sec; 321.8 MB/s DB path: [/dev/shm/bench_prev_opt/] readreverse : 0.338 micros/op 2958404 ops/sec; 327.3 MB/s ``` Test Plan: COMPILE_WITH_ASAN=1 make check -j64 Reviewers: andrewkr, yiwu, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, yoshinorim Differential Revision: https://reviews.facebook.net/D59463	2016-06-14 12:27:46 -07:00
Islam AbdelRahman	7c919deccc	Reuse TimedFullMerge instead of FullMerge + instrumentation Summary: We have alot of code duplication whenever we call FullMerge we keep duplicating the instrumentation and statistics code This is a simple diff to refactor the code to use TimedFullMerge instead of FullMerge Test Plan: COMPILE_WITH_ASAN=1 make check -j64 Reviewers: andrewkr, yhchiang, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59577	2016-06-13 16:17:26 -07:00
Nadav Rotem	7360db39e6	Add a check mode to verify compressed block can be decompressed back Summary: Try to decompress compressed blocks when a special flag is set. assert and crash in debug builds if we can't decompress the just-compressed input. Test Plan: Run unit-tests. Reviewers: dhruba, andrewkr, sdong, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59145	2016-06-10 18:20:54 -07:00
sdong	5009b5326b	BlockBasedTable::FullFilterKeyMayMatch() Should skip prefix bloom if full key bloom exists Summary: Currently, if users define both of full key bloom and prefix bloom in SST files. During Get(), if full key bloom shows the key may exist, we still go ahead and check prefix bloom. This is wasteful. If bloom filter for full keys exists, we should always ignore prefix bloom in Get(). Test Plan: Run existing tests Reviewers: yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57825	2016-06-10 16:27:56 -07:00
Aaron Gao	e532877940	Add statistics field to show total size of index and filter blocks in block cache Summary: With `table_options.cache_index_and_filter_blocks = true`, index and filter blocks are stored in block cache. Then people are curious how much of the block cache total size is used by indexes and bloom filters. It will be nice we have a way to report that. It can help people tune performance and plan for optimized hardware setting. We add several enum values for db Statistics. BLOCK_CACHE_INDEX/FILTER_BYTES_INSERT - BLOCK_CACHE_INDEX/FILTER_BYTES_ERASE = current INDEX/FILTER total block size in bytes. Test Plan: write a test case called `DBBlockCacheTest.IndexAndFilterBlocksStats`. The result is: ``` [gzh@dev9927.prn1 ~/local/rocksdb] make db_block_cache_test -j64 && ./db_block_cache_test --gtest_filter=DBBlockCacheTest.IndexAndFilterBlocksStats Makefile:101: Warning: Compiling in debug mode. Don't use the resulting binary in production GEN util/build_version.cc make: `db_block_cache_test' is up to date. Note: Google Test filter = DBBlockCacheTest.IndexAndFilterBlocksStats [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from DBBlockCacheTest [ RUN ] DBBlockCacheTest.IndexAndFilterBlocksStats [ OK ] DBBlockCacheTest.IndexAndFilterBlocksStats (689 ms) [----------] 1 test from DBBlockCacheTest (689 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test case ran. (689 ms total) [ PASSED ] 1 test. ``` Reviewers: IslamAbdelRahman, andrewkr, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D58677	2016-06-03 10:47:47 -07:00
sdong	1d725ca51d	Deprecate BlockBasedTableOptions.hash_index_allow_collision=false. Summary: Deprecate this one option and delete code and tests that are now superfluous. Test Plan: all tests pass Reviewers: igor, yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: msalib, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D55317	2016-05-20 17:52:27 -07:00
Aaron Gao	43afd72bee	[rocksdb] make more options dynamic Summary: make more ColumnFamilyOptions dynamic: - compression - soft_pending_compaction_bytes_limit - hard_pending_compaction_bytes_limit - min_partial_merge_operands - report_bg_io_stats - paranoid_file_checks Test Plan: Add sanity check in `db_test.cc` for all above options except for soft_pending_compaction_bytes_limit and hard_pending_compaction_bytes_limit. All passed. Reviewers: andrewkr, sdong, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D57519	2016-05-17 13:11:56 -07:00
krad	a08c8c851a	Added PersistentCache abstraction Summary: Added a new abstraction to cache page to RocksDB designed for the read cache use. RocksDB current block cache is more of an object cache. For the persistent read cache project, what we need is a page cache equivalent. This changes adds a cache abstraction to RocksDB to cache pages called PersistentCache. PersistentCache can cache uncompressed pages or raw pages (content as in filesystem). The user can choose to operate PersistentCache either in COMPRESSED or UNCOMPRESSED mode. Blame Rev: Test Plan: Run unit tests Reviewers: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D55707	2016-05-15 22:17:18 -07:00
Ashish Shenoy	fa3536d202	Store SST file compression algorithm as a TableProperty Summary: Store SST file compression algorithm as a TableProperty. Test Plan: Modified and ran the table_test UT that checks for TableProperties Reviewers: IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: lgalanis, andrewkr, dhruba, IslamAbdelRahman Differential Revision: https://reviews.facebook.net/D58017	2016-05-12 09:47:16 -07:00
sdong	7ccb8d6ef3	BlockBasedTable::Get() not to use prefix bloom if read_options.total_order_seek = true Summary: This is to provide a way for users to skip prefix bloom in point look-up. Test Plan: Add a new unit test scenario. Reviewers: IslamAbdelRahman Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57747	2016-05-06 10:16:11 -07:00
Dmitri Smirnov	e7899c6618	Fix build issue. (#1103 )	2016-04-28 11:39:12 -07:00
Li Peng	6d4832a998	Merge pull request #1101 from flyd1005/wip-fix-typo fix typos and remove duplicated words	2016-04-28 02:30:44 -07:00
Andrew Kryczka	843d2e3137	Shared dictionary compression using reference block Summary: This adds a new metablock containing a shared dictionary that is used to compress all data blocks in the SST file. The size of the shared dictionary is configurable in CompressionOptions and defaults to 0. It's currently only used for zlib/lz4/lz4hc, but the block will be stored in the SST regardless of the compression type if the user chooses a nonzero dictionary size. During compaction, computes the dictionary by randomly sampling the first output file in each subcompaction. It pre-computes the intervals to sample by assuming the output file will have the maximum allowable length. In case the file is smaller, some of the pre-computed sampling intervals can be beyond end-of-file, in which case we skip over those samples and the dictionary will be a bit smaller. After the dictionary is generated using the first file in a subcompaction, it is loaded into the compression library before writing each block in each subsequent file of that subcompaction. On the read path, gets the dictionary from the metablock, if it exists. Then, loads that dictionary into the compression library before reading each block. Test Plan: new unit test Reviewers: yhchiang, IslamAbdelRahman, cyan, sdong Reviewed By: sdong Subscribers: andrewkr, yoshinorim, kradhakrishnan, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D52287	2016-04-27 17:36:03 -07:00
Islam AbdelRahman	d719b095dc	Introduce PinnedIteratorsManager (Reduce PinData() overhead / Refactor PinData) Summary: While trying to reuse PinData() / ReleasePinnedData() .. to optimize away some memcpys I realized that there is a significant overhead for using PinData() / ReleasePinnedData if they were called many times. This diff refactor the pinning logic by introducing PinnedIteratorsManager a centralized component that will be created once and will be notified whenever we need to Pin an Iterator. This implementation have much less overhead than the original implementation Test Plan: make check -j64 COMPILE_WITH_ASAN=1 make check -j64 Reviewers: yhchiang, sdong, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56493	2016-04-26 12:41:07 -07:00
Dhruba Borthakur	995353e46a	Fix null-pointer-dereference detected by Infer (https://github.com/facebook/infer ) Test Plan: make check Reviewers: leveldb, sdong Reviewed By: sdong Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57165	2016-04-25 20:09:36 +01:00
Islam AbdelRahman	5bd4022fec	Add comparator, merge operator, property collectors to SST file properties (again) Summary: This is the original diff that I have landed and reverted and now I want to land again https://reviews.facebook.net/D34269 For old SST files we will show ``` comparator name: N/A merge operator name: N/A property collectors names: N/A ``` For new SST files with no merge operator name and with no property collectors ``` comparator name: leveldb.BytewiseComparator merge operator name: nullptr property collectors names: [] ``` for new SST files with these properties ``` comparator name: leveldb.BytewiseComparator merge operator name: UInt64AddOperator property collectors names: [DummyPropertiesCollector1,DummyPropertiesCollector2] ``` Test Plan: unittests Reviewers: andrewkr, yhchiang, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56487	2016-04-21 10:16:28 -07:00
Dmitri Smirnov	ee221d2de0	Introduce XPRESS compresssion on Windows. (#1081 ) Comparable with Snappy on comp ratio. Implemented using Windows API, does not require external package. Avaiable since Windows 8 and server 2012. Use -DXPRESS=1 with CMake to enable.	2016-04-19 22:54:24 -07:00
Islam AbdelRahman	6356b4d516	Fix nullptr dereference in adaptive_table Summary: @dulmarod Ran infer on RocksDB and found that we dereference nullptr in adaptive_table https://fb.facebook.com/groups/rocksdb/permalink/1046374415411173/ Test Plan: make check -j64 Reviewers: sdong, yhchiang, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56973	2016-04-19 13:57:05 -07:00
sdong	535af525d6	BlockBasedTable::PrefixMayMatch() to skip index checking if we can't find a filter block. Summary: In the case where we can't find a filter block, there is not much benefit of doing the binary search and see whether the index key has the prefix. With the change, we blindly return true if we can't get the filter. It also fixes missing row cases for reverse comparator with full bloom. Test Plan: Add a test case that used to fail. Reviewers: yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: kradhakrishnan, yiwu, hermanlee4, yoshinorim, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56697	2016-04-13 19:06:48 -07:00
sdong	dff4c48ede	BlockBasedTable::PrefixMayMatch: no need to find data block after full bloom checking Summary: Full block checking should be a good enough indication of prefix existance. No need to further check data block. This also fixes wrong results when using prefix bloom and reverse bitwise comparator. Test Plan: Will add a unit test. Reviewers: yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: hermanlee4, yoshinorim, yiwu, kradhakrishnan, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56625	2016-04-12 16:25:54 -07:00
sdong	30d72ee43c	PrefixTest.PrefixAndWholeKeyTest should run against a different directory from prefix_test Summary: PrefixTest.PrefixAndWholeKeyTest runs against the same directory as prefix_test, which sometimes fail parallel tests. Fix it. Test Plan: Run it in parallel and see it doesn't fail anymore. Reviewers: andrewkr Reviewed By: andrewkr Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56541	2016-04-11 13:02:56 -07:00
Andrew Kryczka	0e3cc2cf1f	Add column family info to TableProperties::ToString() Summary: This is used at least by `sst_dump --show_properties` Test Plan: - default CF ``` $ ./sst_dump --show_properties --file=./tmp-db/000007.sst \| grep 'column family' column family ID: 0 column family name: default ``` - custom CF ``` $ ./sst_dump --show_properties --file=./tmp-db/000012.sst \| grep 'column family' column family ID: 1 column family name: col-fam-1 ``` - no CF ``` $ ./sst_dump --show_properties --file=./tmp-db/000017.sst \| grep 'column family' column family ID: N/A column family name: N/A ``` Reviewers: IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D56499	2016-04-08 18:50:18 -07:00
sdong	1518b733eb	Change default number of cache shard bit to be 6 and max_file_opening_threads to be 16. Summary: Cache shard bit 4 is sometimes too small and 6 is a more common value picked by users. Make that default. It shouldn't hurt much to change options.max_file_opening_threads default to be 16, which will reduce the worst case DB open time. Test Plan: Run all existing tests. Reviewers: IslamAbdelRahman, yhchiang, kradhakrishnan, igor, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, MarkCallaghan, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D55047	2016-04-07 13:55:10 -07:00
Andrew Kryczka	2391ef7214	Embed column family name in SST file Summary: Added the column family name to the properties block. This property is omitted only if the property is unavailable, such as when RepairDB() writes SST files. In a next diff, I will change RepairDB to use this new property for deciding to which column family an existing SST file belongs. If this property is missing, it will add it to the "unknown" column family (same as its existing behavior). Test Plan: New unit test: $ ./db_table_properties_test --gtest_filter=DBTablePropertiesTest.GetColumnFamilyNameProperty Reviewers: IslamAbdelRahman, yhchiang, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D55605	2016-04-06 23:10:32 -07:00
Marton Trencseni	9b51987521	Adding pin_l0_filter_and_index_blocks_in_cache feature and related fixes. Summary: When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache. What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention. Test Plan: 'export TEST_TMPDIR=/dev/shm/ && DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32' is OK. I didn't run the Java tests, I don't have Java set up on my devserver. Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56133	2016-04-01 10:42:39 -07:00
Laurent Demailly	21700a5106	to/from hex refactor Summary: Expose the inverse of ToString(hex=true) on Slice: Slice::DecodeHex Refactor the other implementation of to/from hex in ldb_cmd.h to use the Slice version (Difference between the 2 is whether 0x is expected/produced in front of the hex string or not) Eliminated support for invalid odd length hex string - this is now invalid instead of having 1/2 byte set Added (inverse of HexToString) test for LDBCommand::StringToHex which also indirectly tests Slice::ToString(true) After moving the original implementation from ldb_cmd.h, updated it to much simpler/efficient version (originally/inspired from https://github.com/facebook/wdt/blob/master/util/EncryptionUtils.cpp#L140-L169 ) Test Plan: run tests Reviewers: uddipta, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56121	2016-03-30 14:36:48 -07:00
sdong	b1fafcaca6	Revert "Adding pin_l0_filter_and_index_blocks_in_cache feature." This reverts commit `522de4f59e`. It has bug of index block cleaning up.	2016-03-21 11:50:42 -07:00
Marton Trencseni	522de4f59e	Adding pin_l0_filter_and_index_blocks_in_cache feature. Summary: When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache. What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention. When the table reader is destroyed, it releases the pinned blocks (if there were any). This has to happen before the cache is destroyed, so I had to introduce a TableReader::Close(), to guarantee the order of destruction. Test Plan: Added two unit tests for this. Existing unit tests run fine (default is pin_l0_filter_and_index_blocks_in_cache=false). DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32 Mac: OK. Linux: with D55287 patched in it's OK. Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D54801	2016-03-17 22:40:01 +00:00
Edouard A	02e62ebbc8	Fixes warnings and ensure correct int behavior on 32-bit platforms.	2016-03-16 22:57:57 +01:00
sdong	b2ae5950ba	Index Reader should not be reused after DB restart Summary: In block based table reader, wow we put index reader to block cache, which can be retrieved after DB restart. However, index reader may reference internal comparator, which can be destroyed after DB restarts, causing problems. Fix it by making cache key identical per table reader. Test Plan: Add a new test which failed with out the commit but now pass. Reviewers: IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: maro, yhchiang, kradhakrishnan, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D55287	2016-03-14 10:04:09 -07:00
Yi Wu	f71fc77b7c	Cache to have an option to fail Cache::Insert() when full Summary: Cache to have an option to fail Cache::Insert() when full. Update call sites to check status and handle error. I totally have no idea what's correct behavior of all the call sites when they encounter error. Please let me know if you see something wrong or more unit test is needed. Test Plan: make check -j32, see tests pass. Reviewers: anthony, yhchiang, andrewkr, IslamAbdelRahman, kradhakrishnan, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D54705	2016-03-10 17:35:19 -08:00
sdong	e79ad9e184	Add Iterator Property rocksdb.iterator.version_number Summary: We want to provide a way to detect whether an iterator is stale and needs to be recreated. Add a iterator property to return version number. Test Plan: Add two unit tests for it. Reviewers: IslamAbdelRahman, yhchiang, anthony, kradhakrishnan, andrewkr Reviewed By: andrewkr Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D54921	2016-03-02 16:23:59 -08:00
sdong	74b660702e	Rename iterator property "rocksdb.iterator.is.key.pinned" => "rocksdb.iterator.is-key-pinned" Summary: Rename iterator property to folow property naming convention. Test Plan: Run all existing tests. Reviewers: andrewkr, anthony, yhchiang, kradhakrishnan, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D54957	2016-03-01 13:47:12 -08:00
sdong	1f5954147b	Introduce Iterator::GetProperty() and replace Iterator::IsKeyPinned() Summary: Add Iterator::GetProperty(), a way for users to communicate with iterator, and turn Iterator::IsKeyPinned() with it. As a follow-up, I'll ask a property as the version number attached to the iterator Test Plan: Rerun existing tests and add a negative test case. Reviewers: yhchiang, andrewkr, kradhakrishnan, anthony, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D54783	2016-02-29 14:01:31 -08:00
Yueh-Hsuan Chiang	291ae4c206	Revert "Revert "Fixed the bug when both whole_key_filtering and prefix_extractor are set."" Summary: This reverts commit `73c31377bb`, which mistakenly reverts `73c31377bb` that fixes a bug when both whole_key_filtering and prefix_extractor are set Test Plan: revert the patch Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong Reviewed By: sdong Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D52707	2016-02-22 16:33:26 -08:00
Baraa Hamodi	21e95811d1	Updated all copyright headers to the new format.	2016-02-09 15:12:00 -08:00
Yueh-Hsuan Chiang	4a8cbf4e31	Allows Get and MultiGet to read directly from SST files. Summary: Add kSstFileTier to ReadTier, which allows Get and MultiGet to read only directly from SST files and skip mem-tables. kSstFileTier = 0x2 // data in SST files. // Note that this ReadTier currently only supports // Get and MultiGet and does not support iterators. Test Plan: add new test in db_test. Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong Reviewed By: sdong Subscribers: igor, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D53511	2016-02-09 11:20:22 -08:00
Islam AbdelRahman	8e6172bc57	Add BlockBasedTableOptions::index_block_restart_interval Summary: Add a new option to BlockBasedTableOptions that will allow us to change the restart interval for the index block Test Plan: unit tests Reviewers: yhchiang, anthony, andrewkr, sdong Reviewed By: sdong Subscribers: march, dhruba Differential Revision: https://reviews.facebook.net/D53721	2016-02-05 10:22:37 -08:00
Islam AbdelRahman	0c433cd1eb	Fix issue in Iterator::Seek when using Block based filter block with prefix_extractor Summary: Similar to D53385 we need to check InDomain before checking the filter block. Test Plan: unit tests Reviewers: yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D53421	2016-01-26 14:47:42 -08:00
Islam AbdelRahman	b0afcdeeac	Fix bug in block based tables with full filter block and prefix_extractor Summary: Right now when we are creating a BlockBasedTable with fill filter block we add to the filter all the prefixes that are InDomain() based on the prefix_extractor the problem is that when we read a key from the file, we check the filter block for the prefix whether or not it's InDomain() Test Plan: unit tests Reviewers: yhchiang, rven, anthony, kradhakrishnan, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D53385	2016-01-26 11:07:08 -08:00
Islam AbdelRahman	2fbc59a348	Disallow SstFileWriter from creating empty sst files Summary: SstFileWriter may create an sst file with no entries Right now this will fail when being ingested using DB::AddFile() saying that the keys are corrupted Test Plan: make check Reviewers: yhchiang, rven, anthony, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D52815	2016-01-25 13:47:07 -08:00
Islam AbdelRahman	d9bca1e14c	Reduce iterator deletion overhead Summary: After introducing Iterator::PinData(), we have extra overhead of deleting the pinned iterators that we track in a std::set This patch avoid inserting to the std::set if we have only one iterator (normal use case when no iterators are pinned) Before this change ``` DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="newiterator" --db="/tmp/rocksdbtest-8616/dbbench" --use_existing_db --disable_auto_compactions newiterator : 1.006 micros/op 994013 ops/sec; newiterator : 0.994 micros/op 1006295 ops/sec; newiterator : 0.990 micros/op 1010422 ops/sec; ``` After change ``` DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="newiterator" --db="/tmp/rocksdbtest-8616/dbbench" --use_existing_db --disable_auto_compactions newiterator : 0.754 micros/op 1326588 ops/sec; newiterator : 0.759 micros/op 1317394 ops/sec; newiterator : 0.691 micros/op 1446704 ops/sec; ``` Test Plan: make check -j64 Reviewers: yhchiang, rven, anthony, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D52761	2016-01-11 16:48:15 -08:00

1 2 3 4 5 ...

632 Commits