rocksdb

Author	SHA1	Message	Date
Wanning Jiang	78837f5d61	TableBuilder / TableReader support for range deletion Summary: 1. Range Deletion Tombstone structure 2. Modify Add() in table_builder to make it usable for adding range del tombstones 3. Expose NewTombstoneIterator() API in table_reader Test Plan: table_test.cc (now BlockBasedTableBuilder::Add() only accepts InternalKey. I make table_test only pass InternalKey to BlockBasedTableBuidler. Also test writing/reading range deletion tombstones in table_test ) Reviewers: sdong, IslamAbdelRahman, lightmark, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D61473	2016-08-19 15:10:31 -07:00
Philipp Unterbrunner	deda159b55	Added min/max/avg data block size output to sst_dump Summary: Added min/max/avg data block size output to sst_dump. Output was added to the end of BlockBasedTable::DumpDataBlocks, so it appears after the data block details, at the very end of the dump file. Test Plan: ``` ./db_bench --benchmarks=fillrandom ./sst_dump --file=/tmp/rocksdbtest-xyz/dbbench/000007.sst --command=raw tail -n 6 /tmp/rocksdbtest-xyz/dbbench/000007_dump.txt ``` ``` Data Block Summary: -------------------------------------- # data blocks: 11336 min data block size: 903 max data block size: 2268 avg data block size: 2245.363356 ``` Reviewers: IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D61815	2016-08-12 16:34:11 -07:00
Islam AbdelRahman	b693ba68b5	Minor PinnedIteratorsManager Refactoring Summary: This diff include these simple change - Rename ReleasePinnedIterators to ReleasePinnedData - Rename PinIteratorIfNeeded to PinIterator - Use std::vector directly in PinnedIteratorsManager instead of std::unique_ptr<std::vector> - Generalize PinnedIteratorsManager by adding PinPtr which can pin any pointer Test Plan: existing tests Reviewers: sdong, yiwu, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D61305	2016-08-11 11:54:17 -07:00
omegaga	d51dc96a79	Experiments on column-aware encodings Summary: Experiments on column-aware encodings. Supported features: 1) extract data blocks from SST file and encode with specified encodings; 2) Decode encoded data back into row format; 3) Directly extract data blocks and write in row format (without prefix encoding); 4) Get column distribution statistics for column format; 5) Dump data blocks separated by columns in human-readable format. There is still on-going work on this diff. More refactoring is necessary. Test Plan: Wrote tests in `column_aware_encoding_test.cc`. More tests should be added. Reviewers: sdong Reviewed By: sdong Subscribers: arahut, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D60027	2016-08-01 14:50:19 -07:00
omegaga	e70020e4f6	Only cache level 0 indexes and filter when opening table reader Summary: In T8216281 we decided to disable prefetching the index and filter during opening table handlers during startup (max_open_files = -1). Test Plan: Rely on `IndexAndFilterBlocksOfNewTableAddedToCache` to guarantee L0 indexes and filters are still cached and change `PinL0IndexAndFilterBlocksTest` to make sure other levels are not cached (maybe add one more test to test we don't cache other levels?) Reviewers: sdong, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59913	2016-07-20 11:23:31 -07:00
Islam AbdelRahman	68a8e6b8fa	Introduce FullMergeV2 (eliminate memcpy from merge operators) Summary: This diff update the code to pin the merge operator operands while the merge operation is done, so that we can eliminate the memcpy cost, to do that we need a new public API for FullMerge that replace the std::deque<std::string> with std::vector<Slice> This diff is stacked on top of D56493 and D56511 In this diff we - Update FullMergeV2 arguments to be encapsulated in MergeOperationInput and MergeOperationOutput which will make it easier to add new arguments in the future - Replace std::deque<std::string> with std::vector<Slice> to pass operands - Replace MergeContext std::deque with std::vector (based on a simple benchmark I ran https://gist.github.com/IslamAbdelRahman/78fc86c9ab9f52b1df791e58943fb187) - Allow FullMergeV2 output to be an existing operand ``` [Everything in Memtable \| 10K operands \| 10 KB each \| 1 operand per key] DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=10000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000 [FullMergeV2] readseq : 0.607 micros/op 1648235 ops/sec; 16121.2 MB/s readseq : 0.478 micros/op 2091546 ops/sec; 20457.2 MB/s readseq : 0.252 micros/op 3972081 ops/sec; 38850.5 MB/s readseq : 0.237 micros/op 4218328 ops/sec; 41259.0 MB/s readseq : 0.247 micros/op 4043927 ops/sec; 39553.2 MB/s [master] readseq : 3.935 micros/op 254140 ops/sec; 2485.7 MB/s readseq : 3.722 micros/op 268657 ops/sec; 2627.7 MB/s readseq : 3.149 micros/op 317605 ops/sec; 3106.5 MB/s readseq : 3.125 micros/op 320024 ops/sec; 3130.1 MB/s readseq : 4.075 micros/op 245374 ops/sec; 2400.0 MB/s ``` ``` [Everything in Memtable \| 10K operands \| 10 KB each \| 10 operand per key] DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=1000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000 [FullMergeV2] readseq : 3.472 micros/op 288018 ops/sec; 2817.1 MB/s readseq : 2.304 micros/op 434027 ops/sec; 4245.2 MB/s readseq : 1.163 micros/op 859845 ops/sec; 8410.0 MB/s readseq : 1.192 micros/op 838926 ops/sec; 8205.4 MB/s readseq : 1.250 micros/op 800000 ops/sec; 7824.7 MB/s [master] readseq : 24.025 micros/op 41623 ops/sec; 407.1 MB/s readseq : 18.489 micros/op 54086 ops/sec; 529.0 MB/s readseq : 18.693 micros/op 53495 ops/sec; 523.2 MB/s readseq : 23.621 micros/op 42335 ops/sec; 414.1 MB/s readseq : 18.775 micros/op 53262 ops/sec; 521.0 MB/s ``` ``` [Everything in Block cache \| 10K operands \| 10 KB each \| 1 operand per key] [FullMergeV2] $ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions readseq : 14.741 micros/op 67837 ops/sec; 663.5 MB/s readseq : 1.029 micros/op 971446 ops/sec; 9501.6 MB/s readseq : 0.974 micros/op 1026229 ops/sec; 10037.4 MB/s readseq : 0.965 micros/op 1036080 ops/sec; 10133.8 MB/s readseq : 0.943 micros/op 1060657 ops/sec; 10374.2 MB/s [master] readseq : 16.735 micros/op 59755 ops/sec; 584.5 MB/s readseq : 3.029 micros/op 330151 ops/sec; 3229.2 MB/s readseq : 3.136 micros/op 318883 ops/sec; 3119.0 MB/s readseq : 3.065 micros/op 326245 ops/sec; 3191.0 MB/s readseq : 3.014 micros/op 331813 ops/sec; 3245.4 MB/s ``` ``` [Everything in Block cache \| 10K operands \| 10 KB each \| 10 operand per key] DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10-operands-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions [FullMergeV2] readseq : 24.325 micros/op 41109 ops/sec; 402.1 MB/s readseq : 1.470 micros/op 680272 ops/sec; 6653.7 MB/s readseq : 1.231 micros/op 812347 ops/sec; 7945.5 MB/s readseq : 1.091 micros/op 916590 ops/sec; 8965.1 MB/s readseq : 1.109 micros/op 901713 ops/sec; 8819.6 MB/s [master] readseq : 27.257 micros/op 36687 ops/sec; 358.8 MB/s readseq : 4.443 micros/op 225073 ops/sec; 2201.4 MB/s readseq : 5.830 micros/op 171526 ops/sec; 1677.7 MB/s readseq : 4.173 micros/op 239635 ops/sec; 2343.8 MB/s readseq : 4.150 micros/op 240963 ops/sec; 2356.8 MB/s ``` Test Plan: COMPILE_WITH_ASAN=1 make check -j64 Reviewers: yhchiang, andrewkr, sdong Reviewed By: sdong Subscribers: lovro, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57075	2016-07-20 09:49:03 -07:00
John Alexander	9430333f84	New Statistics to track Compression/Decompression (#1197 ) * Added new statistics and refactored to allow ioptions to be passed around as required to access environment and statistics pointers (and, as a convenient side effect, info_log pointer). * Prevent incrementing compression counter when compression is turned off in options. * Prevent incrementing compression counter when compression is turned off in options. * Added two more supported compression types to test code in db_test.cc * Prevent incrementing compression counter when compression is turned off in options. * Added new StatsLevel that excludes compression timing. * Fixed casting error in coding.h * Fixed CompressionStatsTest for new StatsLevel. * Removed unused variable that was breaking the Linux build	2016-07-19 09:44:03 -07:00
Jay Edgar	efd013d6d8	Miscellaneous performance improvements Summary: I was investigating performance issues in the SstFileWriter and found all of the following: - The SstFileWriter::Add() function created a local InternalKey every time it was called generating a allocation and free each time. Changed to have an InternalKey member variable that can be reset with the new InternalKey::Set() function. - In SstFileWriter::Add() the smallest_key and largest_key values were assigned the result of a ToString() call, but it is simpler to just assign them directly from the user's key. - The Slice class had no move constructor so each time one was returned from a function a new one had to be allocated, the old data copied to the new, and the old one was freed. I added the move constructor which also required a copy constructor and assignment operator. - The BlockBuilder::CurrentSizeEstimate() function calculates the current estimate size, but was being called 2 or 3 times for each key added. I changed the class to maintain a running estimate (equal to the original calculation) so that the function can return an already calculated value. - The code in BlockBuilder::Add() that calculated the shared bytes between the last key and the new key duplicated what Slice::difference_offset does, so I replaced it with the standard function. - BlockBuilder::Add() had code to copy just the changed portion into the last key value (and asserted that it now matched the new key). It is more efficient just to copy the whole new key over. - Moved this same code up into the 'if (use_delta_encoding_)' since the last key value is only needed when delta encoding is on. - FlushBlockBySizePolicy::BlockAlmostFull calculated a standard deviation value each time it was called, but this information would only change if block_size of block_size_deviation changed, so I created a member variable to hold the value to avoid the calculation each time. - Each PutVarint??() function has a buffer and calls std::string::append(). Two or three calls in a row could share a buffer and a single call to std::string::append(). Some of these will be helpful outside of the SstFileWriter. I'm not 100% the addition of the move constructor is appropriate as I wonder why this wasn't done before - maybe because of compiler compatibility? I tried it on gcc 4.8 and 4.9. Test Plan: The changes should not affect the results so the existing tests should all still work and no new tests were added. The value of the changes was seen by manually testing the SstFileWriter class through MyRocks and adding timing code to identify problem areas. Reviewers: sdong, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59607	2016-07-12 14:15:32 -07:00
Yi Wu	296545a2c7	Fix clang analyzer errors Summary: Fixing erros reported by clang static analyzer. * Removing some unused variables. * Adding assertions to fix false positives reported by clang analyzer. * Adding `__clang_analyzer__` macro to suppress false positive warnings. Test Plan: USE_CLANG=1 OPT=-g make analyze -j64 Reviewers: andrewkr, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D60549	2016-07-08 17:50:51 -07:00
sdong	32df9733d1	Add options.write_buffer_manager: control total memtable size across DB instances Summary: Add option write_buffer_manager to help users control total memory spent on memtables across multiple DB instances. Test Plan: Add a new unit test. Reviewers: yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: adela, benj, sumeet, muthu, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59925	2016-07-05 18:11:25 -07:00
Islam AbdelRahman	88a2776db5	Update SstFileWriter to use bottommost_compression if avaliable Summary: SstFileWriter ignore Options::bottommost_compression, update it to use bottommost_compression if available Test Plan: make check -j64 verified used compression using ./sst_dump Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, yoshinorim Differential Revision: https://reviews.facebook.net/D59841	2016-06-20 11:26:25 -07:00
Islam AbdelRahman	8366e10ffc	Fix clang build Summary: Fix clang build Test Plan: USE_CLANG=1 make check -j64 Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59667	2016-06-15 00:24:33 -07:00
Islam AbdelRahman	812dbfb483	Optimize BlockIter::Prev() by caching decoded entries Summary: Right now the way we do BlockIter::Prev() is like this - Go to the beginning of the restart interval - Keep moving forward (and decoding keys using ParseNextKey()) until we reach the desired key This can be optimized by caching the decoded entries in the first pass and reusing them in consecutive BlockIter::Prev() calls Before caching ``` DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readreverse" --db="/dev/shm/bench_prev_opt/" --use_existing_db --disable_auto_compactions DB path: [/dev/shm/bench_prev_opt/] readreverse : 0.413 micros/op 2423972 ops/sec; 268.2 MB/s DB path: [/dev/shm/bench_prev_opt/] readreverse : 0.414 micros/op 2413867 ops/sec; 267.0 MB/s DB path: [/dev/shm/bench_prev_opt/] readreverse : 0.410 micros/op 2440881 ops/sec; 270.0 MB/s DB path: [/dev/shm/bench_prev_opt/] readreverse : 0.414 micros/op 2417298 ops/sec; 267.4 MB/s DB path: [/dev/shm/bench_prev_opt/] readreverse : 0.413 micros/op 2421682 ops/sec; 267.9 MB/s ``` After caching ``` DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readreverse" --db="/dev/shm/bench_prev_opt/" --use_existing_db --disable_auto_compactions DB path: [/dev/shm/bench_prev_opt/] readreverse : 0.324 micros/op 3088955 ops/sec; 341.7 MB/s DB path: [/dev/shm/bench_prev_opt/] readreverse : 0.335 micros/op 2980999 ops/sec; 329.8 MB/s DB path: [/dev/shm/bench_prev_opt/] readreverse : 0.341 micros/op 2929681 ops/sec; 324.1 MB/s DB path: [/dev/shm/bench_prev_opt/] readreverse : 0.344 micros/op 2908490 ops/sec; 321.8 MB/s DB path: [/dev/shm/bench_prev_opt/] readreverse : 0.338 micros/op 2958404 ops/sec; 327.3 MB/s ``` Test Plan: COMPILE_WITH_ASAN=1 make check -j64 Reviewers: andrewkr, yiwu, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, yoshinorim Differential Revision: https://reviews.facebook.net/D59463	2016-06-14 12:27:46 -07:00
Islam AbdelRahman	7c919deccc	Reuse TimedFullMerge instead of FullMerge + instrumentation Summary: We have alot of code duplication whenever we call FullMerge we keep duplicating the instrumentation and statistics code This is a simple diff to refactor the code to use TimedFullMerge instead of FullMerge Test Plan: COMPILE_WITH_ASAN=1 make check -j64 Reviewers: andrewkr, yhchiang, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59577	2016-06-13 16:17:26 -07:00
Nadav Rotem	7360db39e6	Add a check mode to verify compressed block can be decompressed back Summary: Try to decompress compressed blocks when a special flag is set. assert and crash in debug builds if we can't decompress the just-compressed input. Test Plan: Run unit-tests. Reviewers: dhruba, andrewkr, sdong, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59145	2016-06-10 18:20:54 -07:00
sdong	5009b5326b	BlockBasedTable::FullFilterKeyMayMatch() Should skip prefix bloom if full key bloom exists Summary: Currently, if users define both of full key bloom and prefix bloom in SST files. During Get(), if full key bloom shows the key may exist, we still go ahead and check prefix bloom. This is wasteful. If bloom filter for full keys exists, we should always ignore prefix bloom in Get(). Test Plan: Run existing tests Reviewers: yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57825	2016-06-10 16:27:56 -07:00
Aaron Gao	e532877940	Add statistics field to show total size of index and filter blocks in block cache Summary: With `table_options.cache_index_and_filter_blocks = true`, index and filter blocks are stored in block cache. Then people are curious how much of the block cache total size is used by indexes and bloom filters. It will be nice we have a way to report that. It can help people tune performance and plan for optimized hardware setting. We add several enum values for db Statistics. BLOCK_CACHE_INDEX/FILTER_BYTES_INSERT - BLOCK_CACHE_INDEX/FILTER_BYTES_ERASE = current INDEX/FILTER total block size in bytes. Test Plan: write a test case called `DBBlockCacheTest.IndexAndFilterBlocksStats`. The result is: ``` [gzh@dev9927.prn1 ~/local/rocksdb] make db_block_cache_test -j64 && ./db_block_cache_test --gtest_filter=DBBlockCacheTest.IndexAndFilterBlocksStats Makefile:101: Warning: Compiling in debug mode. Don't use the resulting binary in production GEN util/build_version.cc make: `db_block_cache_test' is up to date. Note: Google Test filter = DBBlockCacheTest.IndexAndFilterBlocksStats [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from DBBlockCacheTest [ RUN ] DBBlockCacheTest.IndexAndFilterBlocksStats [ OK ] DBBlockCacheTest.IndexAndFilterBlocksStats (689 ms) [----------] 1 test from DBBlockCacheTest (689 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test case ran. (689 ms total) [ PASSED ] 1 test. ``` Reviewers: IslamAbdelRahman, andrewkr, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D58677	2016-06-03 10:47:47 -07:00
sdong	1d725ca51d	Deprecate BlockBasedTableOptions.hash_index_allow_collision=false. Summary: Deprecate this one option and delete code and tests that are now superfluous. Test Plan: all tests pass Reviewers: igor, yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: msalib, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D55317	2016-05-20 17:52:27 -07:00
Aaron Gao	43afd72bee	[rocksdb] make more options dynamic Summary: make more ColumnFamilyOptions dynamic: - compression - soft_pending_compaction_bytes_limit - hard_pending_compaction_bytes_limit - min_partial_merge_operands - report_bg_io_stats - paranoid_file_checks Test Plan: Add sanity check in `db_test.cc` for all above options except for soft_pending_compaction_bytes_limit and hard_pending_compaction_bytes_limit. All passed. Reviewers: andrewkr, sdong, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D57519	2016-05-17 13:11:56 -07:00
krad	a08c8c851a	Added PersistentCache abstraction Summary: Added a new abstraction to cache page to RocksDB designed for the read cache use. RocksDB current block cache is more of an object cache. For the persistent read cache project, what we need is a page cache equivalent. This changes adds a cache abstraction to RocksDB to cache pages called PersistentCache. PersistentCache can cache uncompressed pages or raw pages (content as in filesystem). The user can choose to operate PersistentCache either in COMPRESSED or UNCOMPRESSED mode. Blame Rev: Test Plan: Run unit tests Reviewers: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D55707	2016-05-15 22:17:18 -07:00
Ashish Shenoy	fa3536d202	Store SST file compression algorithm as a TableProperty Summary: Store SST file compression algorithm as a TableProperty. Test Plan: Modified and ran the table_test UT that checks for TableProperties Reviewers: IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: lgalanis, andrewkr, dhruba, IslamAbdelRahman Differential Revision: https://reviews.facebook.net/D58017	2016-05-12 09:47:16 -07:00
sdong	7ccb8d6ef3	BlockBasedTable::Get() not to use prefix bloom if read_options.total_order_seek = true Summary: This is to provide a way for users to skip prefix bloom in point look-up. Test Plan: Add a new unit test scenario. Reviewers: IslamAbdelRahman Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57747	2016-05-06 10:16:11 -07:00
Dmitri Smirnov	e7899c6618	Fix build issue. (#1103 )	2016-04-28 11:39:12 -07:00
Li Peng	6d4832a998	Merge pull request #1101 from flyd1005/wip-fix-typo fix typos and remove duplicated words	2016-04-28 02:30:44 -07:00
Andrew Kryczka	843d2e3137	Shared dictionary compression using reference block Summary: This adds a new metablock containing a shared dictionary that is used to compress all data blocks in the SST file. The size of the shared dictionary is configurable in CompressionOptions and defaults to 0. It's currently only used for zlib/lz4/lz4hc, but the block will be stored in the SST regardless of the compression type if the user chooses a nonzero dictionary size. During compaction, computes the dictionary by randomly sampling the first output file in each subcompaction. It pre-computes the intervals to sample by assuming the output file will have the maximum allowable length. In case the file is smaller, some of the pre-computed sampling intervals can be beyond end-of-file, in which case we skip over those samples and the dictionary will be a bit smaller. After the dictionary is generated using the first file in a subcompaction, it is loaded into the compression library before writing each block in each subsequent file of that subcompaction. On the read path, gets the dictionary from the metablock, if it exists. Then, loads that dictionary into the compression library before reading each block. Test Plan: new unit test Reviewers: yhchiang, IslamAbdelRahman, cyan, sdong Reviewed By: sdong Subscribers: andrewkr, yoshinorim, kradhakrishnan, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D52287	2016-04-27 17:36:03 -07:00
Islam AbdelRahman	d719b095dc	Introduce PinnedIteratorsManager (Reduce PinData() overhead / Refactor PinData) Summary: While trying to reuse PinData() / ReleasePinnedData() .. to optimize away some memcpys I realized that there is a significant overhead for using PinData() / ReleasePinnedData if they were called many times. This diff refactor the pinning logic by introducing PinnedIteratorsManager a centralized component that will be created once and will be notified whenever we need to Pin an Iterator. This implementation have much less overhead than the original implementation Test Plan: make check -j64 COMPILE_WITH_ASAN=1 make check -j64 Reviewers: yhchiang, sdong, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56493	2016-04-26 12:41:07 -07:00
Dhruba Borthakur	995353e46a	Fix null-pointer-dereference detected by Infer (https://github.com/facebook/infer ) Test Plan: make check Reviewers: leveldb, sdong Reviewed By: sdong Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57165	2016-04-25 20:09:36 +01:00
Islam AbdelRahman	5bd4022fec	Add comparator, merge operator, property collectors to SST file properties (again) Summary: This is the original diff that I have landed and reverted and now I want to land again https://reviews.facebook.net/D34269 For old SST files we will show ``` comparator name: N/A merge operator name: N/A property collectors names: N/A ``` For new SST files with no merge operator name and with no property collectors ``` comparator name: leveldb.BytewiseComparator merge operator name: nullptr property collectors names: [] ``` for new SST files with these properties ``` comparator name: leveldb.BytewiseComparator merge operator name: UInt64AddOperator property collectors names: [DummyPropertiesCollector1,DummyPropertiesCollector2] ``` Test Plan: unittests Reviewers: andrewkr, yhchiang, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56487	2016-04-21 10:16:28 -07:00
Dmitri Smirnov	ee221d2de0	Introduce XPRESS compresssion on Windows. (#1081 ) Comparable with Snappy on comp ratio. Implemented using Windows API, does not require external package. Avaiable since Windows 8 and server 2012. Use -DXPRESS=1 with CMake to enable.	2016-04-19 22:54:24 -07:00
Islam AbdelRahman	6356b4d516	Fix nullptr dereference in adaptive_table Summary: @dulmarod Ran infer on RocksDB and found that we dereference nullptr in adaptive_table https://fb.facebook.com/groups/rocksdb/permalink/1046374415411173/ Test Plan: make check -j64 Reviewers: sdong, yhchiang, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56973	2016-04-19 13:57:05 -07:00
sdong	535af525d6	BlockBasedTable::PrefixMayMatch() to skip index checking if we can't find a filter block. Summary: In the case where we can't find a filter block, there is not much benefit of doing the binary search and see whether the index key has the prefix. With the change, we blindly return true if we can't get the filter. It also fixes missing row cases for reverse comparator with full bloom. Test Plan: Add a test case that used to fail. Reviewers: yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: kradhakrishnan, yiwu, hermanlee4, yoshinorim, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56697	2016-04-13 19:06:48 -07:00
sdong	dff4c48ede	BlockBasedTable::PrefixMayMatch: no need to find data block after full bloom checking Summary: Full block checking should be a good enough indication of prefix existance. No need to further check data block. This also fixes wrong results when using prefix bloom and reverse bitwise comparator. Test Plan: Will add a unit test. Reviewers: yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: hermanlee4, yoshinorim, yiwu, kradhakrishnan, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56625	2016-04-12 16:25:54 -07:00
sdong	30d72ee43c	PrefixTest.PrefixAndWholeKeyTest should run against a different directory from prefix_test Summary: PrefixTest.PrefixAndWholeKeyTest runs against the same directory as prefix_test, which sometimes fail parallel tests. Fix it. Test Plan: Run it in parallel and see it doesn't fail anymore. Reviewers: andrewkr Reviewed By: andrewkr Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56541	2016-04-11 13:02:56 -07:00
Andrew Kryczka	0e3cc2cf1f	Add column family info to TableProperties::ToString() Summary: This is used at least by `sst_dump --show_properties` Test Plan: - default CF ``` $ ./sst_dump --show_properties --file=./tmp-db/000007.sst \| grep 'column family' column family ID: 0 column family name: default ``` - custom CF ``` $ ./sst_dump --show_properties --file=./tmp-db/000012.sst \| grep 'column family' column family ID: 1 column family name: col-fam-1 ``` - no CF ``` $ ./sst_dump --show_properties --file=./tmp-db/000017.sst \| grep 'column family' column family ID: N/A column family name: N/A ``` Reviewers: IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D56499	2016-04-08 18:50:18 -07:00
sdong	1518b733eb	Change default number of cache shard bit to be 6 and max_file_opening_threads to be 16. Summary: Cache shard bit 4 is sometimes too small and 6 is a more common value picked by users. Make that default. It shouldn't hurt much to change options.max_file_opening_threads default to be 16, which will reduce the worst case DB open time. Test Plan: Run all existing tests. Reviewers: IslamAbdelRahman, yhchiang, kradhakrishnan, igor, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, MarkCallaghan, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D55047	2016-04-07 13:55:10 -07:00
Andrew Kryczka	2391ef7214	Embed column family name in SST file Summary: Added the column family name to the properties block. This property is omitted only if the property is unavailable, such as when RepairDB() writes SST files. In a next diff, I will change RepairDB to use this new property for deciding to which column family an existing SST file belongs. If this property is missing, it will add it to the "unknown" column family (same as its existing behavior). Test Plan: New unit test: $ ./db_table_properties_test --gtest_filter=DBTablePropertiesTest.GetColumnFamilyNameProperty Reviewers: IslamAbdelRahman, yhchiang, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D55605	2016-04-06 23:10:32 -07:00
Marton Trencseni	9b51987521	Adding pin_l0_filter_and_index_blocks_in_cache feature and related fixes. Summary: When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache. What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention. Test Plan: 'export TEST_TMPDIR=/dev/shm/ && DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32' is OK. I didn't run the Java tests, I don't have Java set up on my devserver. Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56133	2016-04-01 10:42:39 -07:00
Laurent Demailly	21700a5106	to/from hex refactor Summary: Expose the inverse of ToString(hex=true) on Slice: Slice::DecodeHex Refactor the other implementation of to/from hex in ldb_cmd.h to use the Slice version (Difference between the 2 is whether 0x is expected/produced in front of the hex string or not) Eliminated support for invalid odd length hex string - this is now invalid instead of having 1/2 byte set Added (inverse of HexToString) test for LDBCommand::StringToHex which also indirectly tests Slice::ToString(true) After moving the original implementation from ldb_cmd.h, updated it to much simpler/efficient version (originally/inspired from https://github.com/facebook/wdt/blob/master/util/EncryptionUtils.cpp#L140-L169 ) Test Plan: run tests Reviewers: uddipta, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56121	2016-03-30 14:36:48 -07:00
sdong	b1fafcaca6	Revert "Adding pin_l0_filter_and_index_blocks_in_cache feature." This reverts commit `522de4f59e`. It has bug of index block cleaning up.	2016-03-21 11:50:42 -07:00
Marton Trencseni	522de4f59e	Adding pin_l0_filter_and_index_blocks_in_cache feature. Summary: When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache. What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention. When the table reader is destroyed, it releases the pinned blocks (if there were any). This has to happen before the cache is destroyed, so I had to introduce a TableReader::Close(), to guarantee the order of destruction. Test Plan: Added two unit tests for this. Existing unit tests run fine (default is pin_l0_filter_and_index_blocks_in_cache=false). DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32 Mac: OK. Linux: with D55287 patched in it's OK. Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D54801	2016-03-17 22:40:01 +00:00
Edouard A	02e62ebbc8	Fixes warnings and ensure correct int behavior on 32-bit platforms.	2016-03-16 22:57:57 +01:00
sdong	b2ae5950ba	Index Reader should not be reused after DB restart Summary: In block based table reader, wow we put index reader to block cache, which can be retrieved after DB restart. However, index reader may reference internal comparator, which can be destroyed after DB restarts, causing problems. Fix it by making cache key identical per table reader. Test Plan: Add a new test which failed with out the commit but now pass. Reviewers: IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: maro, yhchiang, kradhakrishnan, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D55287	2016-03-14 10:04:09 -07:00
Yi Wu	f71fc77b7c	Cache to have an option to fail Cache::Insert() when full Summary: Cache to have an option to fail Cache::Insert() when full. Update call sites to check status and handle error. I totally have no idea what's correct behavior of all the call sites when they encounter error. Please let me know if you see something wrong or more unit test is needed. Test Plan: make check -j32, see tests pass. Reviewers: anthony, yhchiang, andrewkr, IslamAbdelRahman, kradhakrishnan, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D54705	2016-03-10 17:35:19 -08:00
sdong	e79ad9e184	Add Iterator Property rocksdb.iterator.version_number Summary: We want to provide a way to detect whether an iterator is stale and needs to be recreated. Add a iterator property to return version number. Test Plan: Add two unit tests for it. Reviewers: IslamAbdelRahman, yhchiang, anthony, kradhakrishnan, andrewkr Reviewed By: andrewkr Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D54921	2016-03-02 16:23:59 -08:00
sdong	74b660702e	Rename iterator property "rocksdb.iterator.is.key.pinned" => "rocksdb.iterator.is-key-pinned" Summary: Rename iterator property to folow property naming convention. Test Plan: Run all existing tests. Reviewers: andrewkr, anthony, yhchiang, kradhakrishnan, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D54957	2016-03-01 13:47:12 -08:00
sdong	1f5954147b	Introduce Iterator::GetProperty() and replace Iterator::IsKeyPinned() Summary: Add Iterator::GetProperty(), a way for users to communicate with iterator, and turn Iterator::IsKeyPinned() with it. As a follow-up, I'll ask a property as the version number attached to the iterator Test Plan: Rerun existing tests and add a negative test case. Reviewers: yhchiang, andrewkr, kradhakrishnan, anthony, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D54783	2016-02-29 14:01:31 -08:00
Yueh-Hsuan Chiang	291ae4c206	Revert "Revert "Fixed the bug when both whole_key_filtering and prefix_extractor are set."" Summary: This reverts commit `73c31377bb`, which mistakenly reverts `73c31377bb` that fixes a bug when both whole_key_filtering and prefix_extractor are set Test Plan: revert the patch Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong Reviewed By: sdong Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D52707	2016-02-22 16:33:26 -08:00
Baraa Hamodi	21e95811d1	Updated all copyright headers to the new format.	2016-02-09 15:12:00 -08:00
Yueh-Hsuan Chiang	4a8cbf4e31	Allows Get and MultiGet to read directly from SST files. Summary: Add kSstFileTier to ReadTier, which allows Get and MultiGet to read only directly from SST files and skip mem-tables. kSstFileTier = 0x2 // data in SST files. // Note that this ReadTier currently only supports // Get and MultiGet and does not support iterators. Test Plan: add new test in db_test. Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong Reviewed By: sdong Subscribers: igor, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D53511	2016-02-09 11:20:22 -08:00
Islam AbdelRahman	8e6172bc57	Add BlockBasedTableOptions::index_block_restart_interval Summary: Add a new option to BlockBasedTableOptions that will allow us to change the restart interval for the index block Test Plan: unit tests Reviewers: yhchiang, anthony, andrewkr, sdong Reviewed By: sdong Subscribers: march, dhruba Differential Revision: https://reviews.facebook.net/D53721	2016-02-05 10:22:37 -08:00
Islam AbdelRahman	0c433cd1eb	Fix issue in Iterator::Seek when using Block based filter block with prefix_extractor Summary: Similar to D53385 we need to check InDomain before checking the filter block. Test Plan: unit tests Reviewers: yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D53421	2016-01-26 14:47:42 -08:00
Islam AbdelRahman	b0afcdeeac	Fix bug in block based tables with full filter block and prefix_extractor Summary: Right now when we are creating a BlockBasedTable with fill filter block we add to the filter all the prefixes that are InDomain() based on the prefix_extractor the problem is that when we read a key from the file, we check the filter block for the prefix whether or not it's InDomain() Test Plan: unit tests Reviewers: yhchiang, rven, anthony, kradhakrishnan, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D53385	2016-01-26 11:07:08 -08:00
Islam AbdelRahman	2fbc59a348	Disallow SstFileWriter from creating empty sst files Summary: SstFileWriter may create an sst file with no entries Right now this will fail when being ingested using DB::AddFile() saying that the keys are corrupted Test Plan: make check Reviewers: yhchiang, rven, anthony, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D52815	2016-01-25 13:47:07 -08:00
Islam AbdelRahman	d9bca1e14c	Reduce iterator deletion overhead Summary: After introducing Iterator::PinData(), we have extra overhead of deleting the pinned iterators that we track in a std::set This patch avoid inserting to the std::set if we have only one iterator (normal use case when no iterators are pinned) Before this change ``` DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="newiterator" --db="/tmp/rocksdbtest-8616/dbbench" --use_existing_db --disable_auto_compactions newiterator : 1.006 micros/op 994013 ops/sec; newiterator : 0.994 micros/op 1006295 ops/sec; newiterator : 0.990 micros/op 1010422 ops/sec; ``` After change ``` DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="newiterator" --db="/tmp/rocksdbtest-8616/dbbench" --use_existing_db --disable_auto_compactions newiterator : 0.754 micros/op 1326588 ops/sec; newiterator : 0.759 micros/op 1317394 ops/sec; newiterator : 0.691 micros/op 1446704 ops/sec; ``` Test Plan: make check -j64 Reviewers: yhchiang, rven, anthony, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D52761	2016-01-11 16:48:15 -08:00
sdong	df7e3b6229	Include <array> in table/plain_table_key_coding.h Summary: <array> is not included in table/plain_table_key_coding.h. It may be the cause of one CLANG build failure. Test Plan: Build it Reviewers: yhchiang, rven, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D52725	2016-01-08 17:34:53 -08:00
sdong	9a8e3f73ed	plain table reader: non-mmap mode to keep two recent buffers Summary: In plain table reader's non-mmap mode, we only keep the most recent read buffer. However, for binary search, it is likely we come back to a location to read. To avoid one pread in such a case, we keep two read buffers. It should cover most of the cases. Test Plan: 1. run tests 2. check the optimization works through strace when running ./table_reader_bench -mmap_read=false --num_keys2=1 -num_keys1=5000 -table_factory=plain_table --iterator --through_db Reviewers: anthony, rven, kradhakrishnan, igor, yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D51171	2016-01-08 10:53:57 -08:00
Islam AbdelRahman	f3fb39814d	Fix BlockBasedTableTest.NoopTransformSeek failure Summary: table_test is failing because we are creating a temp InternalComparator 14:27:28 [ RUN ] BlockBasedTableTest.NoopTransformSeek 14:27:28 pure virtual method called 14:27:28 terminate called without an active exception 14:27:28 /bin/sh: line 7: 2346261 Aborted (core dumped) ./$t Test Plan: make table_test -j64 && ./table_test --gtest_filter="BlockBasedTableTest.NoopTransformSeek" Reviewers: igor, sdong, anthony, rven Reviewed By: rven Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D52671	2016-01-07 09:48:29 -08:00
Yueh-Hsuan Chiang	73c31377bb	Revert "Fixed the bug when both whole_key_filtering and prefix_extractor are set." Summary: This patch reverts commit `57605d7ef3` as it will cause BlockBasedTableTest.NoopTransformSeek test crashes in some environment. Test Plan: revert the patch Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D52623	2016-01-06 23:33:41 -08:00
Yueh-Hsuan Chiang	57605d7ef3	Fixed the bug when both whole_key_filtering and prefix_extractor are set. Summary: When both whole_key_filtering and prefix_extractor are set, RocksDB will mistakenly encode prefix + whole key into the database instead of simply whole key when BlockBasedTable is used. This patch fixes this bug. Test Plan: Add a test in table_test Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D52233	2016-01-06 23:01:23 -08:00
Igor Canadi	ba83447363	Merge pull request #923 from petermattis/pmattis/prefix-may-match Fix index seeking in BlockTableReader::PrefixMayMatch.	2016-01-06 14:02:10 -08:00
Reid Horuff	da032495d3	Optimize GetLatestSequenceForKey Summary: DBImpl::GetLatestSequenceForKey() can do memcpy's to load a value that will never be used. This can be optimized by changing all the Get() functions called to optionally not fetch the value (and only fetch the sequencenumber). Test Plan: optimistic_transaction_test and transaction_test Reviewers: anthony Reviewed By: anthony Subscribers: leveldb, dhruba, hermanlee4 Differential Revision: https://reviews.facebook.net/D52227	2016-01-06 13:43:22 -08:00
Peter Mattis	260c29762a	Fix index seeking in BlockTableReader::PrefixMayMatch. PrefixMayMatch previously seeked in the prefix index using an internal key with a sequence number of 0. This would cause the prefix index seek to fall off the end if the last key in the index had a user-key greater than or equal to the key being looked for. Falling off the end of the index in turn results in PrefixMayMatch returning false if the index is in memory.	2016-01-06 16:39:56 -05:00
Jay Edgar	7699439b7c	Prevent the user from setting block_restart_interval to less than 1 Summary: If block_restart_interval gets set to less than 1 an assert will be triggered in BlockBuilder::BlockBuilder(). This prevents the user from doing this by silently setting any value less than 1 to 1. Test Plan: Added a test (in BlockBasedTableTest in table_test) that checks invalid values to make sure that they are reset to the expected values. The block_restart_interval value is checked along with block_size_deviation which also silently sets the value if it is outside a specific range. Reviewers: yoshinorim, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D52509	2016-01-04 14:13:18 -08:00
Nathan Bronson	ac16663bd6	use -Werror=missing-field-initializers, to closer match MyRocks build Summary: myrocks seems to build rocksdb using -Wmissing-field-initializers (and treats warnings as errors). This diff adds that flag to the rocksdb build, and fixes the compilation failures that result. I have not checked for any other differences in the build flags for rocksdb build as part of myrocks. Test Plan: make check Reviewers: sdong, rven Reviewed By: rven Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D52443	2015-12-30 14:56:18 -08:00
Nathan Bronson	7d87f02799	support for concurrent adds to memtable Summary: This diff adds support for concurrent adds to the skiplist memtable implementations. Memory allocation is made thread-safe by the addition of a spinlock, with small per-core buffers to avoid contention. Concurrent memtable writes are made via an additional method and don't impose a performance overhead on the non-concurrent case, so parallelism can be selected on a per-batch basis. Write thread synchronization is an increasing bottleneck for higher levels of concurrency, so this diff adds --enable_write_thread_adaptive_yield (default off). This feature causes threads joining a write batch group to spin for a short time (default 100 usec) using sched_yield, rather than going to sleep on a mutex. If the timing of the yield calls indicates that another thread has actually run during the yield then spinning is avoided. This option improves performance for concurrent situations even without parallel adds, although it has the potential to increase CPU usage (and the heuristic adaptation is not yet mature). Parallel writes are not currently compatible with inplace updates, update callbacks, or delete filtering. Enable it with --allow_concurrent_memtable_write (and --enable_write_thread_adaptive_yield). Parallel memtable writes are performance neutral when there is no actual parallelism, and in my experiments (SSD server-class Linux and varying contention and key sizes for fillrandom) they are always a performance win when there is more than one thread. Statistics are updated earlier in the write path, dropping the number of DB mutex acquisitions from 2 to 1 for almost all cases. This diff was motivated and inspired by Yahoo's cLSM work. It is more conservative than cLSM: RocksDB's write batch group leader role is preserved (along with all of the existing flush and write throttling logic) and concurrent writers are blocked until all memtable insertions have completed and the sequence number has been advanced, to preserve linearizability. My test config is "db_bench -benchmarks=fillrandom -threads=$T -batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -disable_auto_compactions --max_write_buffer_number=8 -max_background_flushes=8 --disable_wal --write_buffer_size=160000000 --block_size=16384 --allow_concurrent_memtable_write" on a two-socket Xeon E5-2660 @ 2.2Ghz with lots of memory and an SSD hard drive. With 1 thread I get ~440Kops/sec. Peak performance for 1 socket (numactl -N1) is slightly more than 1Mops/sec, at 16 threads. Peak performance across both sockets happens at 30 threads, and is ~900Kops/sec, although with fewer threads there is less performance loss when the system has background work. Test Plan: 1. concurrent stress tests for InlineSkipList and DynamicBloom 2. make clean; make check 3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench 4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench 5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench 6. make clean; OPT=-DROCKSDB_LITE make check 7. verify no perf regressions when disabled Reviewers: igor, sdong Reviewed By: sdong Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba Differential Revision: https://reviews.facebook.net/D50589	2015-12-25 11:03:40 -08:00
Siying Dong	298ba27ae2	Merge pull request #846 from yuslepukhin/enble_c4244_lossofdata Enable MS compiler warning c4244.	2015-12-23 22:59:42 -08:00
Andrew Kryczka	e089db40f9	Skip bottom-level filter block caching when hit-optimized Summary: When Get() or NewIterator() trigger file loads, skip caching the filter block if (1) optimize_filters_for_hits is set and (2) the file is on the bottommost level. Also skip checking filters under the same conditions, which means that for a preloaded file or a file that was trivially-moved to the bottom level, its filter block will eventually expire from the cache. - added parameters/instance variables in various places in order to propagate the config ("skip_filters") from version_set to block_based_table_reader - in BlockBasedTable::Rep, this optimization prevents filter from being loaded when the file is opened simply by setting filter_policy = nullptr - in BlockBasedTable::Get/BlockBasedTable::NewIterator, this optimization prevents filter from being used (even if it was loaded already) by setting filter = nullptr Test Plan: updated unit test: $ ./db_test --gtest_filter=DBTest.OptimizeFiltersForHits will also run 'make check' Reviewers: sdong, igor, paultuckfield, anthony, rven, kradhakrishnan, IslamAbdelRahman, yhchiang Reviewed By: yhchiang Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D51633	2015-12-23 10:15:07 -08:00
Reid Horuff	97ea8afaaf	compaction assertion triggering test fix for sequence zeroing assertion trip	2015-12-18 16:08:31 -08:00
Islam AbdelRahman	521da3abb3	Fix BlockBasedTableTest.BlockCacheLeak valgrind failure Summary: I added this line in my previous patch D48999 (which is incorrect) We should not release the iterator since releasing it will evict the blocks from cache Test Plan: Run the test under valgrind make check Reviewers: rven, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D52161	2015-12-18 11:17:21 -08:00
Islam AbdelRahman	aececc209e	Introduce ReadOptions::pin_data (support zero copy for keys) Summary: This patch update the Iterator API to introduce new functions that allow users to keep the Slices returned by key() valid as long as the Iterator is not deleted ReadOptions::pin_data : If true keep loaded blocks in memory as long as the iterator is not deleted Iterator::IsKeyPinned() : If true, this mean that the Slice returned by key() is valid as long as the iterator is not deleted Also add a new option BlockBasedTableOptions::use_delta_encoding to allow users to disable delta_encoding if needed. Benchmark results (using https://phabricator.fb.com/P20083553) ``` // $ du -h /home/tec/local/normal.4K.Snappy/db10077 // 6.1G /home/tec/local/normal.4K.Snappy/db10077 // $ du -h /home/tec/local/zero.8K.LZ4/db10077 // 6.4G /home/tec/local/zero.8K.LZ4/db10077 // Benchmarks for shard db10077 // _build/opt/rocks/benchmark/rocks_copy_benchmark \ // --normal_db_path="/home/tec/local/normal.4K.Snappy/db10077" \ // --zero_db_path="/home/tec/local/zero.8K.LZ4/db10077" // First run // ============================================================================ // rocks/benchmark/RocksCopyBenchmark.cpp relative time/iter iters/s // ============================================================================ // BM_StringCopy 1.73s 576.97m // BM_StringPiece 103.74% 1.67s 598.55m // ============================================================================ // Match rate : 1000000 / 1000000 // Second run // ============================================================================ // rocks/benchmark/RocksCopyBenchmark.cpp relative time/iter iters/s // ============================================================================ // BM_StringCopy 611.99ms 1.63 // BM_StringPiece 203.76% 300.35ms 3.33 // ============================================================================ // Match rate : 1000000 / 1000000 ``` Test Plan: Unit tests Reviewers: sdong, igor, anthony, yhchiang, rven Reviewed By: rven Subscribers: dhruba, lovro, adsharma Differential Revision: https://reviews.facebook.net/D48999	2015-12-16 12:08:30 -08:00
Dmitri Smirnov	236fe21c92	Enable MS compiler warning c4244. Mostly due to the fact that there are differences in sizes of int,long on 64 bit systems vs GNU.	2015-12-11 16:47:34 -08:00
agiardullo	3bfd3d39a3	Use SST files for Transaction conflict detection Summary: Currently, transactions can fail even if there is no actual write conflict. This is due to relying on only the memtables to check for write-conflicts. Users have to tune memtable settings to try to avoid this, but it's hard to figure out exactly how to tune these settings. With this diff, TransactionDB will use both memtables and SST files to determine if there are any write conflicts. This relies on the fact that BlockBasedTable stores sequence numbers for all writes that happen after any open snapshot. Also, D50295 is needed to prevent SingleDelete from disappearing writes (the TODOs in this test code will be fixed once the other diff is approved and merged). Note that Optimistic transactions will still rely on tuning memtable settings as we do not want to read from SST while on the write thread. Also, memtable settings can still be used to reduce how often TransactionDB needs to read SST files. Test Plan: unit tests, db bench Reviewers: rven, yhchiang, kradhakrishnan, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba, leveldb, yoshinorim Differential Revision: https://reviews.facebook.net/D50475	2015-12-11 12:34:11 -08:00
sdong	9d0b8f19d9	plain table reader: avoid re-read the same position for index and data in non-mmap mode Summary: In non-mmap mode, plain table reader can issue two pread() for index checking and reading the actual data, although it's for the same location. By reusing the key decoder, we reuse the buffer used for the two to avoid it. Test Plan: Run unit tests. Run table_reader_bench and see from strace the repeat read cases to disappear. Reviewers: anthony, yhchiang, rven, kradhakrishnan, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D50949	2015-11-18 16:49:08 -08:00
sdong	6170fec251	Fix build broken by previous commit of "option helper refactor" Summary: The commit of option helper refactor broken the build: (1) a git merge problem (2) some uncaught compiler warning Fix it. Test Plan: Make sure "make all" passes Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, yhchiang Reviewed By: yhchiang Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D50943	2015-11-17 16:52:54 -08:00
Siying Dong	3a6643c2fd	Merge pull request #805 from SherlockNoMad/OptionHelperFix Option Helper Refactoring	2015-11-17 16:24:52 -08:00
Islam AbdelRahman	a163cc2d5a	Lint everything Summary: ``` arc2 lint --everything ``` run the linter on the whole code repo to fix exisitng lint issues Test Plan: make check -j64 Reviewers: sdong, rven, anthony, kradhakrishnan, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D50769	2015-11-16 12:56:21 -08:00
Yueh-Hsuan Chiang	e11f676e34	Add OptionsUtil::LoadOptionsFromFile() API Summary: This patch adds OptionsUtil::LoadOptionsFromFile() and OptionsUtil::LoadLatestOptionsFromDB(), which allow developers to construct DBOptions and ColumnFamilyOptions from a RocksDB options file. Note that most pointer-typed options such as merge_operator will not be constructed. With this API, developers no longer need to remember all the options in order to reopen an existing rocksdb instance like the following: DBOptions db_options; std::vector<std::string> cf_names; std::vector<ColumnFamilyOptions> cf_opts; // Load primitive-typed options from an existing DB OptionsUtil::LoadLatestOptionsFromDB( dbname, &db_options, &cf_names, &cf_opts); // Initialize necessary pointer-typed options cf_opts[0].merge_operator.reset(new MyMergeOperator()); ... // Construct the vector of ColumnFamilyDescriptor std::vector<ColumnFamilyDescriptor> cf_descs; for (size_t i = 0; i < cf_opts.size(); ++i) { cf_descs.emplace_back(cf_names[i], cf_opts[i]); } // Open the DB DB* db = nullptr; std::vector<ColumnFamilyHandle*> cf_handles; auto s = DB::Open(db_options, dbname, cf_descs, &handles, &db); Test Plan: Augment existing tests in column_family_test options_test db_test Reviewers: igor, IslamAbdelRahman, sdong, anthony Reviewed By: anthony Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D49095	2015-11-12 06:52:43 -08:00
Islam AbdelRahman	838676c17b	Revert "Adding new table properties" Summary: Reverting https://reviews.facebook.net/D34269 for now after I landed it a flaky test started continuously failing, I am almost sure this patch is not related to the test but I will revert it until I figure out why it's failing Test Plan: make check Reviewers: kradhakrishnan Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D50385	2015-11-06 16:49:38 -08:00
Islam AbdelRahman	5b9ce1a323	Merge pull request #820 from yuslepukhin/enable_compiler_warnings Enable Windows warnings C4307 C4309 C4512 C4701	2015-11-06 12:08:25 -08:00
Dmitri Smirnov	20f57b1715	Enable Windows warnings C4307 C4309 C4512 C4701 Enable C4307 'operator' : integral constant overflow Longs and ints on Windows are 32-bit hence the overflow Enable C4309 'conversion' : truncation of constant value Enable C4512 'class' : assignment operator could not be generated Enable C4701 Potentially uninitialized local variable 'name' used	2015-11-06 11:34:06 -08:00
Islam AbdelRahman	8be568a9c2	Adding new table properties Summary: This diff introduce new table properties that will be written for block based tables These properties are - comparator name - merge operator name - property collectors names Test Plan: - Added a new unit test to verify that these tests are written/read correctly - Running all other tests right now (wont land until all tests finish) Reviewers: rven, kradhakrishnan, igor, sdong, anthony, yhchiang Reviewed By: yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D34269	2015-11-06 11:19:01 -08:00
Islam AbdelRahman	f31442fb5c	Merge pull request #803 from SherlockNoMad/SkipFlush Add Option to Skip Flushing in TableBuilder	2015-11-02 14:56:11 -08:00
SherlockNoMad	df7ed91ef9	Fix white space at end of line	2015-11-02 14:12:29 -08:00
SherlockNoMad	ccc8c10c0c	Move skip_table_builder_flush to BlockBasedTableOption	2015-10-30 18:33:01 -07:00
SherlockNoMad	84992d6475	Option Helper Refactoring	2015-10-30 15:58:46 -07:00
SherlockNoMad	550af4ee68	Fix Travis Build Error	2015-10-29 22:41:57 -07:00
SherlockNoMad	a6dd0831d5	Add Option to Skip Flushing in TableBuilder	2015-10-29 22:10:25 -07:00
SherlockNoMad	b69b9b624e	Support PlainTableOption in option_helper	2015-10-28 23:01:33 -07:00
Dmitri Smirnov	5c8f2ee786	Fix MockTable ID storage On Windows two tests fail that use MockTable: flush_job_test and compaction_job_test with the following message: compaction_job_test_je.exe : Assertion failed: result.size() == 4, file c:\dev\rocksdb\rocksdb\table\mock_table.cc, line 110 Investigation reveals that this failure occurs when a 4 byte ID written to a beginning of the physically open file (main contents remains in a in-memory map) can not be read back. The reason for the failure is that the ID is written directly to a WritableFile bypassing WritableFileWriter. The side effect of that is that pending_sync_ never becomes true so the file is never flushed, however, the direct cause of the failure is that the filesize_ member of the WritableFileWriter remains zero. At Close() the file is truncated to that size and the file becomes empty so the ID can not be read back.	2015-10-28 10:53:14 -07:00
Venkatesh Radhakrishnan	a98fbacfa0	Moving memtable related files from util to a new directory memtable Summary: We are cleaning up dependencies. This diff takes a first step at moving memtable files to their own directory called memtable. In future diffs, we will move other memtable files from db to memtable. Test Plan: make check Reviewers: sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D48915	2015-10-16 14:10:33 -07:00
sdong	35ad531be3	Seperate InternalIterator from Iterator Summary: Separate a new class InternalIterator from class Iterator, when the look-up is done internally, which also means they operate on key with sequence ID and type. This change will enable potential future optimizations but for now InternalIterator's functions are still the same as Iterator's. At the same time, separate the cleanup function to a separate class and let both of InternalIterator and Iterator inherit from it. Test Plan: Run all existing tests. Reviewers: igor, yhchiang, anthony, kradhakrishnan, IslamAbdelRahman, rven Reviewed By: rven Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D48549	2015-10-13 15:32:13 -07:00
Islam AbdelRahman	1fe78a4073	Fix tests failing in ROCKSDB_LITE Summary: Fix tests that compile under ROCKSDB_LITE but currently failing. table_test: RandomizedLongDB test is using internal stats which is not supported in ROCKSDB_LITE compaction_job_test: Using CompactionJobStats which is not supported perf_context_test: KeyComparisonCount test try to open DB in ReadOnly mode which is not supported Test Plan: run the tests under ROCKSDB_LITE Reviewers: yhchiang, sdong, igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48585	2015-10-13 10:32:05 -07:00
Alexey Maykov	3d07b815f6	Passing table properties to compaction callback Summary: It would be nice to have and access to table properties in compaction callbacks. In MyRocks project, it will make possible to update optimizer statistics online. Test Plan: ran the unit test. Ran myrocks with the new way of collecting stats. Reviewers: igor, rven, yhchiang Reviewed By: yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48267	2015-10-09 18:10:55 -07:00
sdong	776bd8d5eb	Pass column family ID to table property collector Summary: Pass column family ID through TablePropertiesCollectorFactory::CreateTablePropertiesCollector() so that users can identify which column family this file is for and handle it differently. Test Plan: Add unit test scenarios in tests related to table properties collectors to verify the information passed in is correct. Reviewers: rven, yhchiang, anthony, kradhakrishnan, igor, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: yoshinorim, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D48411	2015-10-09 14:36:51 -07:00
Islam AbdelRahman	51fa7ecec5	Bytes read/written from cache statistics Summary: Add 2 new counters BLOCK_CACHE_BYTES_WRITE, BLOCK_CACHE_BYTES_READ to keep track of how many bytes were written to the cache and how many bytes that we read from cache Test Plan: make check Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48195	2015-10-07 15:17:20 -07:00
dyniusz	a065cdb388	bloom hit/miss stats for SST and memtable Summary: hit and miss bloom filter stats for memtable and SST stats added to perf_context struct key matches and prefix matches combined into one stat Test Plan: unit test veryfing the functionality added, see BloomStatsTest in db_test.cc for details Reviewers: yhchiang, igor, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D47859	2015-10-07 11:23:20 -07:00
Igor Canadi	d80ce7f99a	Compaction filter on merge operands Summary: Since Andres' internship is over, I took over https://reviews.facebook.net/D42555 and rebased and simplified it a bit. The behavior in this diff is a bit simpler than in D42555: * only merge operators are passed through FilterMergeValue(). If fitler function returns true, the merge operator is ignored * compaction filter is not called on: 1) results of merge operations and 2) base values that are getting merged with merge operands (the second case was also true in previous diff) Do we also need a compaction filter to get called on merge results? Test Plan: make && make check Reviewers: lovro, tnovak, rven, yhchiang, sdong Reviewed By: sdong Subscribers: noetzli, kolmike, leveldb, dhruba, sdong Differential Revision: https://reviews.facebook.net/D47847	2015-10-07 09:30:03 -07:00
sdong	a70d08ec07	Fix the bug of using freed memory introduced by recent plain table reader patch Summary: Recent patch introduced a bug that if non-mmap mode is used, in prefix encoding case, there is a resizing of cur_key_ within the same prefix, we still read prefix from the released buffer. It fails ASAN tests and this commit fixes it. Test Plan: Run the ASAN tests for the failing test case. Reviewers: IslamAbdelRahman, yhchiang, anthony, igor, kradhakrishnan, rven Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D47457	2015-09-23 16:16:26 -07:00
Islam AbdelRahman	f03b5c987b	Add experimental DB::AddFile() to plug sst files into empty DB Summary: This is an initial version of bulk load feature This diff allow us to create sst files, and then bulk load them later, right now the restrictions for loading an sst file are (1) Memtables are empty (2) Added sst files have sequence number = 0, and existing values in database have sequence number = 0 (3) Added sst files values are not overlapping Test Plan: unit testing Reviewers: igor, ott, sdong Reviewed By: sdong Subscribers: leveldb, ott, dhruba Differential Revision: https://reviews.facebook.net/D39081	2015-09-23 12:42:43 -07:00
sdong	df34aea331	PlainTableReader to support non-mmap mode Summary: PlainTableReader now only allows mmap-mode. Add the support to non-mmap mode for more flexibility. Refactor the codes to move all logic of reading data to PlainTableKeyDecoder, and consolidate the calls to Read() call and ReadVarint32() call. Implement the calls for both of mmap and non-mmap case seperately. For non-mmap mode, make copy of keys in several places when we need to move the buffer after reading the keys. Test Plan: Add the mode of non-mmap case in plain_table_db_test. Run it in valgrind mode too. Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D47187	2015-09-23 11:41:07 -07:00
sdong	d746eaad5e	RandomAccessFileReader should not inherit RandomAccessFile Summary: RandomAccessFileReader unnecessarily inherited RandomAccessFile, which can introduce unnecessarily extra costs. Remove it. Test Plan: Run all existing tests Reviewers: yhchiang, anthony, igor, kradhakrishnan, rven, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D47409	2015-09-23 11:00:41 -07:00
Andres Noetzli	014fd55adc	Support for SingleDelete() Summary: This patch fixes #7460559. It introduces SingleDelete as a new database operation. This operation can be used to delete keys that were never overwritten (no put following another put of the same key). If an overwritten key is single deleted the behavior is undefined. Single deletion of a non-existent key has no effect but multiple consecutive single deletions are not allowed (see limitations). In contrast to the conventional Delete() operation, the deletion entry is removed along with the value when the two are lined up in a compaction. Note: The semantics are similar to @igor's prototype that allowed to have this behavior on the granularity of a column family ( https://reviews.facebook.net/D42093 ). This new patch, however, is more aggressive when it comes to removing tombstones: It removes the SingleDelete together with the value whenever there is no snapshot between them while the older patch only did this when the sequence number of the deletion was older than the earliest snapshot. Most of the complex additions are in the Compaction Iterator, all other changes should be relatively straightforward. The patch also includes basic support for single deletions in db_stress and db_bench. Limitations: - Not compatible with cuckoo hash tables - Single deletions cannot be used in combination with merges and normal deletions on the same key (other keys are not affected by this) - Consecutive single deletions are currently not allowed (and older version of this patch supported this so it could be resurrected if needed) Test Plan: make all check Reviewers: yhchiang, sdong, rven, anthony, yoshinorim, igor Reviewed By: igor Subscribers: maykov, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D43179	2015-09-17 11:42:56 -07:00
Igor Canadi	81a61d75dc	Skipped tests shouldn't be failures [part 2] Summary: Missed one file in the previous commit Test Plan: compiles Reviewers: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D47055	2015-09-15 22:59:53 -07:00
Islam AbdelRahman	45e9e4f0bb	Refactor NewTableReader to accept TableReaderOptions Summary: Refactoring NewTableReader to accept TableReaderOptions This will make it easier to add new options in the future, for example in this diff https://reviews.facebook.net/D46071 Test Plan: run existing tests Reviewers: igor, yhchiang, anthony, rven, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D46179	2015-09-11 11:36:33 -07:00
Andres Noetzli	6bdc484fd8	Added Equal method to Comparator interface Summary: In some cases, equality comparisons can be done more efficiently than three-way comparisons. There are quite a few places in the code where we only care about equality. This patch adds an Equal() method that defaults to using the Compare() method. Test Plan: make clean all check Reviewers: rven, anthony, yhchiang, igor, sdong Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D46233	2015-09-08 15:30:49 -07:00
Igor Canadi	14456aea52	Fix compile Summary: There was a merge conflict with https://reviews.facebook.net/D45993 Test Plan: make check Reviewers: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D46065	2015-09-02 16:05:53 -07:00
Igor Canadi	76f286cc82	Optimize bloom filter cache misses Summary: This optimizes the case when (cache_index_and_filter_blocks=1) and bloom filter is not present in the cache. Previously we did: 1. Read meta block from file 2. Read the filter position from the meta block 3. Read the filter Now, we pre-load the filter position on Table::Open(), so we can skip steps (1) and (2) on bloom filter cache miss. Instead of 2 IOs, we do only 1. Test Plan: make check Reviewers: sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D46047	2015-09-02 15:36:47 -07:00
Andres Noetzli	3c9cef1eed	Unified maps with Comparator for sorting, other cleanup Summary: This diff is a collection of cleanups that were initially part of D43179. Additionally it adds a unified way of defining key-value maps that use a Comparator for sorting (this was previously implemented in four different places). Test Plan: make clean check all Reviewers: rven, anthony, yhchiang, sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D45993	2015-09-02 13:58:22 -07:00
sdong	7a0dbdf3ac	Add ZSTD (not final format) compression type Summary: Add ZSTD compression type. The same way as adding LZ4. Test Plan: run all tests. Generate files in db_bench. Make sure reads succeed. But the SST files cannot be opened in older versions. Also some other adhoc tests. Reviewers: rven, anthony, IslamAbdelRahman, kradhakrishnan, igor Reviewed By: igor Subscribers: MarkCallaghan, maykov, yoshinorim, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D45747	2015-08-28 11:01:13 -07:00
Yueh-Hsuan Chiang	6996de87af	Expose per-level aggregated table properties via GetProperty() Summary: This patch adds "rocksdb.aggregated-table-properties" and "rocksdb.aggregated-table-properties-at-levelN", the former returns the aggreated table properties of a column family, while the later returns the aggregated table properties of the specified level N. Test Plan: Added tests in db_test Reviewers: igor, sdong, IslamAbdelRahman, anthony Reviewed By: anthony Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D45087	2015-08-25 12:03:54 -07:00
sdong	c852968465	db_iter_test: add more test cases for the data race bug Summary: Add more test cases of data race causing wrong iterating results. Tag tests not passing as DISABLED_ Test Plan: Run the tests Reviewers: igor, rven, IslamAbdelRahman, anthony, kradhakrishnan, yhchiang Reviewed By: yhchiang Subscribers: tnovak, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D44907	2015-08-21 12:14:12 -07:00
sdong	4637207120	Add test case to repro the mispositional iterator in a low-chance data race case Summary: Iterator has a bug: if a child iterator reaches its end, and user issues a Prev(), and just before SeekToLast() of the child iterator is called, some extra rows is added in the end, the position of iterator can be misplaced. Test Plan: Run the tests with or without valgrind Reviewers: rven, yhchiang, IslamAbdelRahman, anthony Reviewed By: anthony Subscribers: tnovak, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D43671	2015-08-12 10:50:52 -07:00
Andres Notzli	68f934355a	Better CompactionJob testing Summary: Changed compaction_job_test to support better/more thorough tests and added two tests. Also changed MockFileContents to order using InternalKeyComparator. Test Plan: make compaction_job_test && ./compaction_job_test; make all && make check Reviewers: sdong, rven, igor, yhchiang, anthony Reviewed By: anthony Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D42837	2015-08-07 21:59:51 -07:00
Andres Notzli	c465071029	Removing duplicate code Summary: While working on https://reviews.facebook.net/D43179 , I found duplicate code in the tests. This patch removes it. Test Plan: make clean all check Reviewers: igor, sdong, rven, anthony, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D43263	2015-08-05 07:33:27 -07:00
agiardullo	064294081b	Improved FileExists API Summary: Add new CheckFileExists method. Considered changing the FileExists api but didn't want to break anyone's builds. Test Plan: unit tests Reviewers: yhchiang, igor, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D42003	2015-07-20 17:20:40 -07:00
Islam AbdelRahman	4853e228ef	Make table_test runnable in ROCKSDB_LITE Summary: Remove plain table tests from table_test since plain table is not supported in ROCKSDB_LITE Test Plan: table_test Reviewers: igor, sdong, yhchiang Reviewed By: yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D42153	2015-07-20 11:09:14 -07:00
Islam AbdelRahman	cf6a7bebc8	Block cuckoo table tests in ROCKSDB_LITE Summary: Cuckoo table is not supported in ROCKSDB_LITE, blocking it's tests Test Plan: cuckoo_table_builder_test cuckoo_table_db_test cuckoo_table_reader_test Reviewers: sdong, igor, yhchiang Reviewed By: yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D42141	2015-07-20 10:50:46 -07:00
sdong	6e9fbeb27c	Move rate_limiter, write buffering, most perf context instrumentation and most random kill out of Env Summary: We want to keep Env a think layer for better portability. Less platform dependent codes should be moved out of Env. In this patch, I create a wrapper of file readers and writers, and put rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. It will make it easier to maintain multiple Env in the future. Test Plan: Run all existing unit tests. Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D42321	2015-07-17 16:58:18 -07:00
Andres Notzli	1d20fa9d0f	Fixed and simplified merge_helper Summary: MergeUntil was not reporting a success when merging an operand with a Value/Deletion despite the comments in MergeHelper and CompactionJob indicating otherwise. This lead to operands being written to the compaction output unnecessarily: M1 M2 M3 P M4 M5 --> (P+M1+M2+M3) M2 M3 M4 M5 (before the diff) M1 M2 M3 P M4 M5 --> (P+M1+M2+M3) M4 M5 (after the diff) In addition, the code handling Values/Deletion was basically identical. This patch unifies the code. Finally, this patch also adds testing for merge_helper. Test Plan: make && make check Reviewers: sdong, rven, yhchiang, tnovak, igor Reviewed By: igor Subscribers: tnovak, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D42351	2015-07-17 09:27:24 -07:00
Dmitri Smirnov	d1a457181d	Ensure Windows build w/o port/port.h in public headers - Remove make file defines from public headers and use _WIN32 because it is compiler defined - use __GNUC__ and __clang__ to guard non-portable attributes - add #include "port/port.h" to some new .cc files. - minor changes in CMakeLists to reflect recent changes	2015-07-16 12:10:16 -07:00
lovro	e1c99e10c1	Replace std::priority_queue in MergingIterator with custom heap, take 2 Summary: Repeat of `b6655a679d` (reverted in `b7a2369fb2`) with a proper fix for the issue that `57d216ea65` was trying to fix. Test Plan: make check for i in $(seq 100); do ./db_stress --test_batches_snapshots=1 --threads=32 --write_buffer_size=4194304 --destroy_db_initially=0 --reopen=20 --readpercent=45 --prefixpercent=5 --writepercent=35 --delpercent=5 --iterpercent=10 --db=/tmp/rocksdb_crashtest_KdCI5F --max_key=100000000 --mmap_read=0 --block_size=16384 --cache_size=1048576 --open_files=500000 --verify_checksum=1 --sync=0 --progress_reports=0 --disable_wal=0 --disable_data_sync=1 --target_file_size_base=2097152 --target_file_size_multiplier=2 --max_write_buffer_number=3 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --filter_deletes=0 --memtablerep=prefix_hash --prefix_size=7 --ops_per_thread=200 \|\| break; done Reviewers: anthony, sdong, igor, yhchiang Reviewed By: igor, yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D41391	2015-07-15 03:34:40 -07:00
sdong	f9728640f3	"make format" against last 10 commits Summary: This helps Windows port to format their changes, as discussed. Might have formatted some other codes too becasue last 10 commits include more. Test Plan: Build it. Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D41961	2015-07-13 13:50:18 -07:00
sdong	76d3cd3286	Fix public API dependency on internal codes and dependency on MAX_INT32 Summary: Public API depends on port/port.h which is wrong. Fix it. Also with gcc 4.8.1 build was broken as MAX_INT32 was not recognized. Fix it by using ::max in linux. Test Plan: Build it and try to build an external project on top of it. Reviewers: anthony, yhchiang, kradhakrishnan, igor Reviewed By: igor Subscribers: yoshinorim, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D41745	2015-07-11 10:32:11 -07:00
Siying Dong	e41cbd9c2f	Merge pull request #646 from yuslepukhin/ms_win_port Windows Port from Microsoft	2015-07-10 15:53:39 -07:00
sdong	041b6f95a2	perf_context: report time spent on reading index and bloom blocks Summary: Add a perf context counter to help users figure out time spent on reading indexes and bloom filter blocks. Test Plan: Will write a unit test Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D41433	2015-07-10 14:45:42 -07:00
Dmitri Smirnov	c903ccc4c2	Merge from github/master	2015-07-09 18:01:08 -07:00
unknown	5c79132335	Revert the changes related to Options, as requested to seperate them into a different patch.	2015-07-09 11:31:42 -07:00
Dmitri Smirnov	ef4b87f1b2	Commit both PR and internal code review changes	2015-07-07 16:58:20 -07:00
Yueh-Hsuan Chiang	b7a2369fb2	Revert "Replace std::priority_queue in MergingIterator with custom heap" Summary: This patch reverts "Replace std::priority_queue in MergingIterator with custom heap" (commit commit `b6655a679d`) as it causes db_stress failure. Test Plan: ./db_stress --test_batches_snapshots=1 --threads=32 --write_buffer_size=4194304 --destroy_db_initially=0 --reopen=20 --readpercent=45 --prefixpercent=5 --writepercent=35 --delpercent=5 --iterpercent=10 --db=/tmp/rocksdb_crashtest_KdCI5F --max_key=100000000 --mmap_read=0 --block_size=16384 --cache_size=1048576 --open_files=500000 --verify_checksum=1 --sync=0 --progress_reports=0 --disable_wal=0 --disable_data_sync=1 --target_file_size_base=2097152 --target_file_size_multiplier=2 --max_write_buffer_number=3 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --filter_deletes=0 --memtablerep=prefix_hash --prefix_size=7 --ops_per_thread=200 --kill_random_test=97 Reviewers: igor, anthony, lovro, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D41343	2015-07-07 14:45:20 -07:00
Yueh-Hsuan Chiang	57d216ea65	Remove assert(current_ == CurrentReverse()) in MergingIterator::Prev() Summary: Remove assert(current_ == CurrentReverse()) in MergingIterator::Prev() because it is possible to have some keys larger than the seek-key inserted between Seek() and SeekToLast(), which makes current_ not equal to CurrentReverse(). Test Plan: db_stress Reviewers: igor, sdong, IslamAbdelRahman, anthony Reviewed By: anthony Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D41331	2015-07-07 12:45:06 -07:00
Yueh-Hsuan Chiang	685582a0b4	Revert two diffs related to DBIter::FindPrevUserKey() Summary: This diff reverts the following two previous diffs related to DBIter::FindPrevUserKey(), which makes db_stress unstable. We should bake a better fix for this. * "Fix a comparison in DBIter::FindPrevUserKey()" `ec70fea4c4`. * "Fixed endless loop in DBIter::FindPrevUserKey()" `acee2b08a2`. Test Plan: db_stress Reviewers: anthony, igor, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D41301	2015-07-07 11:36:24 -07:00
lovro	b6655a679d	Replace std::priority_queue in MergingIterator with custom heap Summary: While profiling compaction in our service I noticed a lot of CPU (~15% of compaction) being spent in MergingIterator and key comparison. Looking at the code I found MergingIterator was (understandably) using std::priority_queue for the multiway merge. Keys in our dataset include sequence numbers that increase with time. Adjacent keys in an L0 file are very likely to be adjacent in the full database. Consequently, compaction will often pick a chunk of rows from the same L0 file before switching to another one. It would be great to avoid the O(log K) operation per row while compacting. This diff replaces std::priority_queue with a custom binary heap implementation. It has a "replace top" operation that is cheap when the new top is the same as the old one (i.e. the priority of the top entry is decreased but it still stays on top). Test Plan: make check To test the effect on performance, I generated databases with data patterns that mimic what I describe in the summary (rows have a mostly increasing sequence number). I see a 10-15% CPU decrease for compaction (and a matching throughput improvement on tmpfs). The exact improvement depends on the number of L0 files and the amount of locality. Performance on randomly distributed keys seems on par with the old code. Reviewers: kailiu, sdong, igor Reviewed By: igor Subscribers: yoshinorim, dhruba, tnovak Differential Revision: https://reviews.facebook.net/D29133	2015-07-06 04:24:09 -07:00
Yueh-Hsuan Chiang	acee2b08a2	Fixed endless loop in DBIter::FindPrevUserKey() Summary: Fixed endless loop in DBIter::FindPrevUserKey() Test Plan: ./db_stress --test_batches_snapshots=1 --threads=32 --write_buffer_size=4194304 --destroy_db_initially=0 --reopen=20 --readpercent=45 --prefixpercent=5 --writepercent=35 --delpercent=5 --iterpercent=10 --db=/tmp/rocksdb_crashtest_KdCI5F --max_key=100000000 --mmap_read=0 --block_size=16384 --cache_size=1048576 --open_files=500000 --verify_checksum=1 --sync=0 --progress_reports=0 --disable_wal=0 --disable_data_sync=1 --target_file_size_base=2097152 --target_file_size_multiplier=2 --max_write_buffer_number=3 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --filter_deletes=0 --memtablerep=prefix_hash --prefix_size=7 --ops_per_thread=200 --kill_random_test=97 Reviewers: tnovak, igor, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D41085	2015-07-02 16:10:31 -07:00
Dmitri Smirnov	9dbde7277c	Merge remote-tracking branch 'origin' into ms_win_port	2015-07-02 11:34:22 -07:00
Dmitri Smirnov	18285c1e2f	Windows Port from Microsoft Summary: Make RocksDb build and run on Windows to be functionally complete and performant. All existing test cases run with no regressions. Performance numbers are in the pull-request. Test plan: make all of the existing unit tests pass, obtain perf numbers. Co-authored-by: Praveen Rao praveensinghrao@outlook.com Co-authored-by: Sherlock Huang baihan.huang@gmail.com Co-authored-by: Alex Zinoviev alexander.zinoviev@me.com Co-authored-by: Dmitri Smirnov dmitrism@microsoft.com	2015-07-01 16:13:56 -07:00
sdong	05e2831966	Allocate LevelFileIteratorState and LevelFileNumIterator from DB iterator's arena Summary: Try to allocate LevelFileIteratorState and LevelFileNumIterator from DB iterator's arena, instead of calling malloc and free. Test Plan: valgrind check Reviewers: rven, yhchiang, anthony, kradhakrishnan, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D40929	2015-06-30 17:30:38 -07:00
Igor Canadi	0a019d74a0	Use malloc_usable_size() for accounting block cache size Summary: Currently, when we insert something into block cache, we say that the block cache capacity decreased by the size of the block. However, size of the block might be less than the actual memory used by this object. For example, 4.5KB block will actually use 8KB of memory. So even if we configure block cache to 10GB, our actually memory usage of block cache will be 20GB! This problem showed up a lot in testing and just recently also showed up in MongoRocks production where we were using 30GB more memory than expected. This diff will fix the problem. Instead of counting the block size, we will count memory used by the block. That way, a block cache configured to be 10GB will actually use only 10GB of memory. I'm using non-portable function and I couldn't find info on portability on Google. However, it seems to work on Linux, which will cover majority of our use-cases. Test Plan: 1. fill up mongo instance with 80GB of data 2. restart mongo with block cache size configured to 10GB 3. do a table scan in mongo 4. memory usage before the diff: 12GB. memory usage after the diff: 10.5GB Reviewers: sdong, MarkCallaghan, rven, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D40635	2015-06-26 11:48:09 -07:00
Giuseppe Ottaviano	782a1590f9	Implement a table-level row cache Summary: Implementation of a table-level row cache. It only caches point queries done through the `DB::Get` interface, queries done through the `Iterator` interface will completely skip the cache. Supports snapshots and merge operations. Test Plan: Ran `make valgrind_check commit-prereq` Reviewers: igor, philipp, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D39849	2015-06-23 10:25:45 -07:00
sdong	6df589b446	Add TablePropertiesCollector::NeedCompact() to suggest DB to further compact output files Summary: It is experimental. Allow users to return from a call back function TablePropertiesCollector::NeedCompact(), based on the data in the file. It can be used to allow users to suggest DB to clear up delete tombstones faster. Test Plan: Add a unit test. Reviewers: igor, yhchiang, kradhakrishnan, rven Reviewed By: rven Subscribers: yoshinorim, MarkCallaghan, maykov, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D39585	2015-06-05 20:18:21 -07:00
agiardullo	dc9d70de65	Optimistic Transactions Summary: Optimistic transactions supporting begin/commit/rollback semantics. Currently relies on checking the memtable to determine if there are any collisions at commit time. Not yet implemented would be a way of enuring the memtable has some minimum amount of history so that we won't fail to commit when the memtable is empty. You should probably start with transaction.h to get an overview of what is currently supported. Test Plan: Added a new test, but still need to look into stress testing. Reviewers: yhchiang, igor, rven, sdong Reviewed By: sdong Subscribers: adamretter, MarkCallaghan, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D33435	2015-05-29 14:36:35 -07:00
Igor Canadi	dbd95b7532	Add more table properties to EventLogger Summary: Example output: {"time_micros": 1431463794310521, "job": 353, "event": "table_file_creation", "file_number": 387, "file_size": 86937, "table_info": {"data_size": "81801", "index_size": "9751", "filter_size": "0", "raw_key_size": "23448", "raw_average_key_size": "24.000000", "raw_value_size": "990571", "raw_average_value_size": "1013.890481", "num_data_blocks": "245", "num_entries": "977", "filter_policy_name": "", "kDeletedKeys": "0"}} Also fixed a bug where BuildTable() in recovery was passing Env::IOHigh argument into paranoid_checks_file parameter. Test Plan: make check + check out the output in the log Reviewers: sdong, rven, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38343	2015-05-12 15:53:55 -07:00
clark.kang	6ede020dc4	fix typos	2015-04-25 18:14:27 +09:00
sdong	98a44559d5	Build for CYGWIN Summary: Make it build for CYGWIN. Need to define "-std=gnu++11" instead of "-std=c++11" and use some replacement functions. Test Plan: Build it and run some unit tests in CYGWIN Reviewers: yhchiang, rven, anthony, kradhakrishnan, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D37605	2015-04-23 21:33:44 -07:00
sdong	fcb206b667	SyncPoint to allow a callback with an argument and use it to get DBTest.DynamicLevelCompressionPerLevel2 more straight-forward Summary: Allow users to give a callback function with parameter using sync point, so more complicated verification can be done in tests. Use it in DBTest.DynamicLevelCompressionPerLevel2 so that failures will be more easy to debug. Test Plan: Run all tests. Run DBTest.DynamicLevelCompressionPerLevel2 with valgrind check. Reviewers: rven, yhchiang, anthony, kradhakrishnan, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D36999	2015-04-14 16:18:50 -07:00
Igor Canadi	5e067a7b19	Clean up compression logging Summary: Now we add warnings when user configures compression and the compression is not supported. Test Plan: Configured compression to non-supported values. Observed messages in my log: 2015/03/26-12:17:57.586341 7ffb8a496840 [WARN] Compression type chosen for level 2 is not supported: LZ4. RocksDB will not compress data on level 2. 2015/03/26-12:19:10.768045 7f36f15c5840 [WARN] Compression type chosen is not supported: LZ4. RocksDB will not compress data. Reviewers: rven, sdong, yhchiang Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D35979	2015-04-06 12:50:44 -07:00
sdong	a45e7581b7	Avoid naming conflict of EntryType Summary: Fix build break on travis build: $ OPT=-DTRAVIS V=1 make unity && make clean && OPT=-DTRAVIS V=1 make db_test && ./db_test ...... In file included from unity.cc:65:0: ./table/plain_table_key_coding.cc: In member function ‘rocksdb::Status rocksdb::PlainTableKeyDecoder::NextPrefixEncodingKey(const char, const char, rocksdb::ParsedInternalKey, rocksdb::Slice, size_t, bool)’: ./table/plain_table_key_coding.cc:224:3: error: reference to ‘EntryType’ is ambiguous EntryType entry_type; ^ In file included from ./db/table_properties_collector.h:9:0, from ./db/builder.h:11, from ./db/builder.cc:10, from unity.cc:1: ./include/rocksdb/table_properties.h:81:6: note: candidates are: enum rocksdb::EntryType enum EntryType { ^ In file included from unity.cc:65:0: ./table/plain_table_key_coding.cc:16:6: note: enum rocksdb::{anonymous}::EntryType enum EntryType : unsigned char { ^ ./table/plain_table_key_coding.cc:231:51: error: ‘entry_type’ was not declared in this scope const char* pos = DecodeSize(key_ptr, limit, &entry_type, &size); ^ make: *** [unity.o] Error 1 Test Plan: OPT=-DTRAVIS V=1 make unity And make sure it doesn't break anymore. Reviewers: yhchiang, kradhakrishnan, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D36549	2015-04-06 11:49:13 -07:00
sdong	953a885ebf	A new call back to TablePropertiesCollector to allow users know the entry is add, delete or merge Summary: Currently users have no idea a key is add, delete or merge from TablePropertiesCollector call back. Add a new function to add it. Also refactor the codes so that (1) make table property collector and internal table property collector two separate data structures with the later one now exposed (2) table builders only receive internal table properties Test Plan: Add cases in table_properties_collector_test to cover both of old and new ways of using TablePropertiesCollector. Reviewers: yhchiang, igor.sugak, rven, igor Reviewed By: rven, igor Subscribers: meyering, yoshinorim, maykov, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D35373	2015-04-06 10:27:21 -07:00
Anurag Indu	1e57f2bf2b	Fix build Test Plan: Running make all Reviewers: sdong Reviewed By: sdong Subscribers: rven, yhchiang, igor, meyering, dhruba Differential Revision: https://reviews.facebook.net/D35889	2015-03-24 17:00:28 -07:00
Anurag Indu	211ca26aee	Fixing build issue Summary: Fixing issues with get context function. Test Plan: Run make commit-prereq Reviewers: sdong, meyering, yhchiang Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D35853	2015-03-24 16:27:24 -07:00
Anurag Indu	3d1a924ff3	Adding stats for the merge and filter operation Summary: We have addded new stats and perf_context for measuring the merge and filter operation time consumption. We have bounded all the merge operations within the GUARD statment and collected the total time for these operations in the DB. Test Plan: WIP Reviewers: rven, yhchiang, kradhakrishnan, igor, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D34377	2015-03-24 14:42:04 -07:00

1 2 3 4 5 ...

636 Commits