rocksdb

Author	SHA1	Message	Date
anand76	6c7b1a0cc7	Batched MultiGet API for multiple column families (#5816 ) Summary: Add a new API that allows a user to call MultiGet specifying multiple keys belonging to different column families. This is mainly useful for users who want to do a consistent read of keys across column families, with the added performance benefits of batching and returning values using PinnableSlice. As part of this change, the code in the original multi-column family MultiGet for acquiring the super versions has been refactored into a separate function that can be used by both, the batching and the non-batching versions of MultiGet. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5816 Test Plan: make check make asan_check asan_crash_test Differential Revision: D18408676 Pulled By: anand1976 fbshipit-source-id: 933e7bec91dd70e7b633be4ff623a1116cc28c8d	2019-11-12 13:52:55 -08:00
sdong	e8263dbdaa	Apply formatter to recent 200+ commits. (#5830 ) Summary: Further apply formatter to more recent commits. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5830 Test Plan: Run all existing tests. Differential Revision: D17488031 fbshipit-source-id: 137458fd94d56dd271b8b40c522b03036943a2ab	2019-09-20 12:04:26 -07:00
Yanqin Jin	a9c5e8e944	Refactor deletefile_test.cc (#5822 ) Summary: Make DeleteFileTest inherit DBTestBase to avoid code duplication. Test Plan (on devserver) ``` $make deletefile_test $./deletefile_test ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5822 Differential Revision: D17456750 Pulled By: riversand963 fbshipit-source-id: 224e97967da7b98838a98981cd5095d3230a814f	2019-09-18 16:58:21 -07:00
Zhongyi Xie	de3fb9a6ff	exclude TEST_ENV_URI from rocksdb lite (#5686 ) Summary: PR https://github.com/facebook/rocksdb/pull/5676 added some test coverage for `TEST_ENV_URI`, which unfortunately isn't supported in lite mode, causing some test failures for rocksdb lite. For example, ``` db/db_test_util.cc: In constructor ‘rocksdb::DBTestBase::DBTestBase(std::__cxx11::string)’: db/db_test_util.cc:57:16: error: ‘ObjectRegistry’ has not been declared Status s = ObjectRegistry::NewInstance()->NewSharedObject(test_env_uri, ^ ``` This PR fixes these errors by excluding the new code from test functions for lite mode. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5686 Differential Revision: D16749000 Pulled By: miasantreble fbshipit-source-id: e8b3088c31a78b3dffc5fe7814261909d2c3e369	2019-08-10 19:15:05 -07:00
Yanqin Jin	5d9a67e718	Support loading custom objects in unit tests (#5676 ) Summary: Most existing RocksDB unit tests run on `Env::Default()`. It will be useful to port the unit tests to non-default environments, e.g. `HdfsEnv`, etc. This pull request is one step towards this goal. If RocksDB unit tests are built with a static library exposing a function `RegisterCustomObjects()`, then it is possible to implement custom object registrar logic in the library. RocksDB unit test can call `RegisterCustomObjects()` at the beginning. By default, `ROCKSDB_UNITTESTS_WITH_CUSTOM_OBJECTS_FROM_STATIC_LIBS` is not defined, thus this PR has no impact on existing RocksDB because `RegisterCustomObjects()` is a noop. Test plan (on devserver): ``` $make clean && COMPILE_WITH_ASAN=1 make -j32 all $make check ``` All unit tests must pass. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5676 Differential Revision: D16679157 Pulled By: riversand963 fbshipit-source-id: aca571af3fd0525277cdc674248d0fe06e060f9d	2019-08-09 15:12:08 -07:00
Maysam Yabandeh	f383641a1d	Unordered Writes (#5218 ) Summary: Performing unordered writes in rocksdb when unordered_write option is set to true. When enabled the writes to memtable are done without joining any write thread. This offers much higher write throughput since the upcoming writes would not have to wait for the slowest memtable write to finish. The tradeoff is that the writes visible to a snapshot might change over time. If the application cannot tolerate that, it should implement its own mechanisms to work around that. Using TransactionDB with WRITE_PREPARED write policy is one way to achieve that. Doing so increases the max throughput by 2.2x without however compromising the snapshot guarantees. The patch is prepared based on an original by siying Existing unit tests are extended to include unordered_write option. Benchmark Results: ``` TEST_TMPDIR=/dev/shm/ ./db_bench_unordered --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions --unordered_write=1 ``` With WAL - Vanilla RocksDB: 78.6 MB/s - WRITER_PREPARED with unordered_write: 177.8 MB/s (2.2x) - unordered_write: 368.9 MB/s (4.7x with relaxed snapshot guarantees) Without WAL - Vanilla RocksDB: 111.3 MB/s - WRITER_PREPARED with unordered_write: 259.3 MB/s MB/s (2.3x) - unordered_write: 645.6 MB/s (5.8x with relaxed snapshot guarantees) - WRITER_PREPARED with unordered_write disable concurrency control: 185.3 MB/s MB/s (2.35x) Limitations: - The feature is not yet extended to `max_successive_merges` > 0. The feature is also incompatible with `enable_pipelined_write` = true as well as with `allow_concurrent_memtable_write` = false. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5218 Differential Revision: D15219029 Pulled By: maysamyabandeh fbshipit-source-id: 38f2abc4af8780148c6128acdba2b3227bc81759	2019-05-13 17:47:21 -07:00
anand76	fefd4b98c5	Introduce a new MultiGet batching implementation (#5011 ) Summary: This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching. Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to - 1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch() 2. Bloom filter cachelines can be prefetched, hiding the cache miss latency The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress. Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32). Batch Sizes 1 \| 2 \| 4 \| 8 \| 16 \| 32 Random pattern (Stride length 0) 4.158 \| 4.109 \| 4.026 \| 4.05 \| 4.1 \| 4.074 - Get 4.438 \| 4.302 \| 4.165 \| 4.122 \| 4.096 \| 4.075 - MultiGet (no batching) 4.461 \| 4.256 \| 4.277 \| 4.11 \| 4.182 \| 4.14 - MultiGet (w/ batching) Good locality (Stride length 16) 4.048 \| 3.659 \| 3.248 \| 2.99 \| 2.84 \| 2.753 4.429 \| 3.728 \| 3.406 \| 3.053 \| 2.911 \| 2.781 4.452 \| 3.45 \| 2.833 \| 2.451 \| 2.233 \| 2.135 Good locality (Stride length 256) 4.066 \| 3.786 \| 3.581 \| 3.447 \| 3.415 \| 3.232 4.406 \| 4.005 \| 3.644 \| 3.49 \| 3.381 \| 3.268 4.393 \| 3.649 \| 3.186 \| 2.882 \| 2.676 \| 2.62 Medium locality (Stride length 4096) 4.012 \| 3.922 \| 3.768 \| 3.61 \| 3.582 \| 3.555 4.364 \| 4.057 \| 3.791 \| 3.65 \| 3.57 \| 3.465 4.479 \| 3.758 \| 3.316 \| 3.077 \| 2.959 \| 2.891 dbbench command used (on a DB with 4 levels, 12 million keys)- TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011 Differential Revision: D14348703 Pulled By: anand1976 fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b	2019-04-11 14:28:26 -07:00
Siying Dong	cf3a671733	Remove cuckoo hash memtable (#4953 ) Summary: Cuckoo Hash is less useful than we initially expected. Remove it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4953 Differential Revision: D13979264 Pulled By: siying fbshipit-source-id: 2a60afdaa989f045357398b43a1cc5d46f4492ed	2019-02-07 16:15:27 -08:00
Anand Ananthabhotla	b9d6eccac1	Lock free MultiGet (#4754 ) Summary: Avoid locking the DB mutex in order to reference SuperVersions. Instead, we get the thread local cached SuperVersion for each column family in the list. It depends on finding a sequence number that overlaps with all the open memtables. We start with the latest published sequence number, and if any of the memtables is sealed before we can get all the SuperVersions, the process is repeated. After a few times, give up and lock the DB mutex. Tests: 1. Unit tests 2. make check 3. db_bench - TEST_TMPDIR=/dev/shm ./db_bench -use_existing_db=true -benchmarks=readrandom -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=5000000 -reads=1000000 -threads=32 -compression_type=none -cache_size=1048576000 -batch_size=1 -bloom_bits=1 readrandom : 0.167 micros/op 5983920 ops/sec; 426.2 MB/s (1000000 of 1000000 found) Multireadrandom with batch size 1: multireadrandom : 0.176 micros/op 5684033 ops/sec; (1000000 of 1000000 found) Pull Request resolved: https://github.com/facebook/rocksdb/pull/4754 Differential Revision: D13363550 Pulled By: anand1976 fbshipit-source-id: 6243e8de7dbd9c8bb490a8eca385da0c855b1dd4	2019-01-02 11:42:54 -08:00
Siying Dong	f0dda35d7d	Preload some files even if options.max_open_files (#3340 ) Summary: Choose to preload some files if options.max_open_files != -1. This can slightly narrow the gap of performance between options.max_open_files is -1 and a large number. To avoid a significant regression to DB reopen speed if options.max_open_files != -1. Limit the files to preload in DB open time to 16. Pull Request resolved: https://github.com/facebook/rocksdb/pull/3340 Differential Revision: D6686945 Pulled By: siying fbshipit-source-id: 8ec11bbdb46e3d0cdee7b6ad5897a09c5a07869f	2018-12-28 18:02:28 -08:00
Abhishek Madan	81b6b09f6b	Remove v1 RangeDelAggregator (#4778 ) Summary: Now that v2 is fully functional, the v1 aggregator is removed. The v2 aggregator has been renamed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4778 Differential Revision: D13495930 Pulled By: abhimadan fbshipit-source-id: 9d69500a60a283e79b6c4fa938fc68a8aa4d40d6	2018-12-17 17:33:46 -08:00
Abhishek Madan	abf931afa6	Add compaction logic to RangeDelAggregatorV2 (#4758 ) Summary: RangeDelAggregatorV2 now supports ShouldDelete calls on snapshot stripes and creation of range tombstone compaction iterators. RangeDelAggregator is no longer used on any non-test code path, and will be removed in a future commit. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4758 Differential Revision: D13439254 Pulled By: abhimadan fbshipit-source-id: fe105bcf8e3d4a2df37a622d5510843cd71b0401	2018-12-17 13:20:51 -08:00
Abhishek Madan	8fe1e06ca0	Clean up FragmentedRangeTombstoneList (#4692 ) Summary: Removed `one_time_use` flag, which removed the need for some tests, and changed all `NewRangeTombstoneIterator` methods to return `FragmentedRangeTombstoneIterators`. These changes also led to removing `RangeDelAggregatorV2::AddUnfragmentedTombstones` and one of the `MemTableListVersion::AddRangeTombstoneIterators` methods. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4692 Differential Revision: D13106570 Pulled By: abhimadan fbshipit-source-id: cbab5432d7fc2d9cdfd8d9d40361a1bffaa8f845	2018-11-28 15:29:02 -08:00
Abhishek Madan	457f77b9ff	Introduce RangeDelAggregatorV2 (#4649 ) Summary: The old RangeDelAggregator did expensive pre-processing work to create a collapsed, binary-searchable representation of range tombstones. With FragmentedRangeTombstoneIterator, much of this work is now unnecessary. RangeDelAggregatorV2 takes advantage of this by seeking in each iterator to find a covering tombstone in ShouldDelete, while doing minimal work in AddTombstones. The old RangeDelAggregator is still used during flush/compaction for now, though RangeDelAggregatorV2 will support those uses in a future PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4649 Differential Revision: D13146964 Pulled By: abhimadan fbshipit-source-id: be29a4c020fc440500c137216fcc1cf529571eb3	2018-11-21 10:56:45 -08:00
Andrew Kryczka	ea9454700a	Backup engine support for direct I/O reads (#4640 ) Summary: Use the `DBOptions` that the backup engine already holds to figure out the right `EnvOptions` to use when reading the DB files. This means that, if a user opened a DB instance with `use_direct_reads=true`, then using `BackupEngine` to back up that DB instance will use direct I/O to read files when calculating checksums and copying. Currently the WALs and manifests would still be read using buffered I/O to prevent mixing direct I/O reads with concurrent buffered I/O writes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4640 Differential Revision: D13015268 Pulled By: ajkr fbshipit-source-id: 77006ad6f3e00ce58374ca4793b785eea0db6269	2018-11-13 11:17:25 -08:00
Sagar Vemuri	dc3528077a	Update all unique/shared_ptr instances to be qualified with namespace std (#4638 ) Summary: Ran the following commands to recursively change all the files under RocksDB: ``` find . -type f -name ".cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} + find . -type f -name ".cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} + find . -type f -name ".cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} + find . -type f -name ".cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} + ``` Running `make format` updated some formatting on the files touched. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638 Differential Revision: D12934992 Pulled By: sagar0 fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8	2018-11-09 11:19:58 -08:00
Yanqin Jin	de18a2d82e	Update test to cover a new case in file ingestion (#4614 ) Summary: The new case is directIO = true, write_global_seqno = false in which we no longer write global_seqno to the external SST file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4614 Differential Revision: D12885001 Pulled By: riversand963 fbshipit-source-id: 7541bdc608b3a0c93d3c3c435da1b162b36673d4	2018-11-01 16:23:49 -07:00
Bo Hou	cd9404bb77	xxhash 64 support Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4607 Reviewed By: siying Differential Revision: D12836696 Pulled By: jsjhoubo fbshipit-source-id: 7122ccb712d0b0f1cd998aa4477e0da1401bd870	2018-11-01 15:44:06 -07:00
Yanqin Jin	5b4c709fad	Enable atomic flush (#4023 ) Summary: Adds a DB option `atomic_flush` to control whether to enable this feature. This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752). Pull Request resolved: https://github.com/facebook/rocksdb/pull/4023 Differential Revision: D8518381 Pulled By: riversand963 fbshipit-source-id: 1e3bb33e99bb102876a31b378d93b0138ff6634f	2018-10-26 15:08:43 -07:00
Maysam Yabandeh	65ac72edd9	Fix bug in partition filters with format_version=4 (#4381 ) Summary: Value delta encoding in format_version 4 requires the differences between the size of two consecutive handles to be sent to BlockBuilder::Add. This applies not only to indexes on blocks but also the indexes on indexes and filters in partitioned indexes and filters respectively. The patch fixes a bug where the partitioned filters would encode the entire size of the handle rather than the difference of the size with the last size. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4381 Differential Revision: D9879505 Pulled By: maysamyabandeh fbshipit-source-id: 27a22e49b482b927fbd5629dc310c46d63d4b6d1	2018-09-17 17:28:15 -07:00
Maysam Yabandeh	caf0f53a74	Index value delta encoding (#3983 ) Summary: Given that index value is a BlockHandle, which is basically an <offset, size> pair we can apply delta encoding on the values. The first value at each index restart interval encoded the full BlockHandle but the rest encode only the size. Refer to IndexBlockIter::DecodeCurrentValue for the detail of the encoding. This reduces the index size which helps using the block cache more efficiently. The feature is enabled with using format_version 4. The feature comes with a bit of cpu overhead which should be paid back by the higher cache hits due to smaller index block size. Results with sysbench read-only using 4k blocks and using 16 index restart interval: Format 2: 19585 rocksdb read-only range=100 Format 3: 19569 rocksdb read-only range=100 Format 4: 19352 rocksdb read-only range=100 Pull Request resolved: https://github.com/facebook/rocksdb/pull/3983 Differential Revision: D8361343 Pulled By: maysamyabandeh fbshipit-source-id: f882ee082322acac32b0072e2bdbb0b5f854e651	2018-08-09 16:58:40 -07:00
Maysam Yabandeh	8581a93a6b	Per-thread unique test db names (#4135 ) Summary: The patch makes sure that two parallel test threads will operate on different db paths. This enables using open source tools such as gtest-parallel to run the tests of a file in parallel. Example: ``` ~/gtest-parallel/gtest-parallel ./table_test``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4135 Differential Revision: D8846653 Pulled By: maysamyabandeh fbshipit-source-id: 799bad1abb260e3d346bcb680d2ae207a852ba84	2018-07-13 17:27:39 -07:00
Anand Ananthabhotla	e3eba52a5d	Re-enable kUniversalSubcompactions option_config (#4125 ) Summary: 1. Move kUniversalSubcompactions up before kEnd in db_test_util.h, so tests that cycle through all the option_configs include this 2. Skip kUniversalSubcompactions wherever kUniversalCompaction and kUniversalCompactionMultilevel are skipped Related to #3935 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4125 Differential Revision: D8828637 Pulled By: anand1976 fbshipit-source-id: 650dee15fd27d85281cf9bb4ca8ab460e04cac6f	2018-07-13 11:13:01 -07:00
Maysam Yabandeh	b73652169e	Extend format 3 to partitioned index/filters (#3958 ) Summary: format_version 3 changes the format of index blocks by storing user keys instead of the internal keys, which saves 8-bytes per key. This patch extends the format to top-level indexes in partitioned index/filters. Closes https://github.com/facebook/rocksdb/pull/3958 Differential Revision: D8294615 Pulled By: maysamyabandeh fbshipit-source-id: 17666cc16b8076c363972e2308e31547e835f0fe	2018-06-06 16:58:16 -07:00
Maysam Yabandeh	d0c38c0c8c	Extend some tests to format_version=3 (#3942 ) Summary: format_version=3 changes the format of SST index. This is however not being tested currently since tests only work with the default format_version which is currently 2. The patch extends the most related tests to also test for format_version=3. Closes https://github.com/facebook/rocksdb/pull/3942 Differential Revision: D8238413 Pulled By: maysamyabandeh fbshipit-source-id: 915725f55753dd8e9188e802bf471c23645ad035	2018-06-04 20:13:00 -07:00
Zhongyi Xie	50d7ac0ea3	Fix test for rocksdb_lite: hide incompatible option kDirectIO Summary: Previous commit https://github.com/facebook/rocksdb/pull/3935 unhide a few test options which includes kDirectIO. However it's not supported by RocksDB lite. Need to hide this option from the lite build. Closes https://github.com/facebook/rocksdb/pull/3943 Differential Revision: D8242757 Pulled By: miasantreble fbshipit-source-id: 1edfad3a5d01a46bfb7eedee765981ebe02c500a	2018-06-01 20:42:36 -07:00
Phani Shekhar Mantripragada	446b32cfc3	Support for Column family specific paths. Summary: In this change, an option to set different paths for different column families is added. This option is set via cf_paths setting of ColumnFamilyOptions. This option will work in a similar fashion to db_paths setting. Cf_paths is a vector of Dbpath values which contains a pair of the absolute path and target size. Multiple levels in a Column family can go to different paths if cf_paths has more than one path. To maintain backward compatibility, if cf_paths is not specified for a column family, db_paths setting will be used. Note that, if db_paths setting is also not specified, RocksDB already has code to use db_name as the only path. Changes : 1) A new member "cf_paths" is added to ImmutableCfOptions. This is set, based on cf_paths setting of ColumnFamilyOptions and db_paths setting of ImmutableDbOptions. This member is used to identify the path information whenever files are accessed. 2) Validation checks are added for cf_paths setting based on existing checks for db_paths setting. 3) DestroyDB, PurgeObsoleteFiles etc. are edited to support multiple cf_paths. 4) Unit tests are added appropriately. Closes https://github.com/facebook/rocksdb/pull/3102 Differential Revision: D6951697 Pulled By: ajkr fbshipit-source-id: 60d2262862b0a8fd6605b09ccb0da32bb331787d	2018-04-05 19:58:20 -07:00
Andrew Kryczka	5d68243e61	Comment out unused variables Summary: Submitting on behalf of another employee. Closes https://github.com/facebook/rocksdb/pull/3557 Differential Revision: D7146025 Pulled By: ajkr fbshipit-source-id: 495ca5db5beec3789e671e26f78170957704e77e	2018-03-05 13:13:41 -08:00
Igor Sugak	aba3409740	Back out "[codemod] - comment out unused parameters" Reviewed By: igorsugak fbshipit-source-id: 4a93675cc1931089ddd574cacdb15d228b1e5f37	2018-02-22 12:43:17 -08:00
David Lai	f4a030ce81	- comment out unused parameters Reviewed By: everiq, igorsugak Differential Revision: D7046710 fbshipit-source-id: 8e10b1f1e2aecebbfb229c742e214db887e5a461	2018-02-22 09:44:23 -08:00
Yi Wu	dc360df81e	Fix multiple build failures Summary: * Fix DBTest.CompactRangeWithEmptyBottomLevel lite build failure * Fix DBTest.AutomaticConflictsWithManualCompaction failure introduce by #3366 * Fix BlockBasedTableTest::IndexUncompressed should be disabled if snappy is disabled * Fix ASAN failure with DBBasicTest::DBClose test Closes https://github.com/facebook/rocksdb/pull/3373 Differential Revision: D6732313 Pulled By: yiwu-arbug fbshipit-source-id: 1eb9b9d9a8d795f56188fa9770db9353f6fdedc5	2018-01-16 17:30:39 -08:00
Maysam Yabandeh	78c2eedb4f	fix release order in validateNumberOfEntries Summary: ScopedArenaIterator should be defined after range_del_agg so that it destructs the assigned iterator, which depends on range_del_agg, before it range_del_agg is already destructed. Closes https://github.com/facebook/rocksdb/pull/3281 Differential Revision: D6592332 Pulled By: maysamyabandeh fbshipit-source-id: 89a15d8ed13d0fc856b0c47dce3d91778738dbac	2017-12-18 14:27:28 -08:00
Maysam Yabandeh	857adf388f	WritePrepared Txn: Refactor conf params Summary: Summary of changes: - Move seq_per_batch out of Options - Rename concurrent_prepare to two_write_queues - Add allocate_seq_only_for_data_ Closes https://github.com/facebook/rocksdb/pull/3136 Differential Revision: D6304458 Pulled By: maysamyabandeh fbshipit-source-id: 08e685bfa82bbc41b5b1c5eb7040a8ca6e05e58c	2017-11-10 17:28:12 -08:00
Mikhail Antonov	7fe3b32896	Added support for differential snapshots Summary: The motivation for this PR is to add to RocksDB support for differential (incremental) snapshots, as snapshot of the DB changes between two points in time (one can think of it as diff between to sequence numbers, or the diff D which can be thought of as an SST file or just set of KVs that can be applied to sequence number S1 to get the database to the state at sequence number S2). This feature would be useful for various distributed storages layers built on top of RocksDB, as it should help reduce resources (time and network bandwidth) needed to recover and rebuilt DB instances as replicas in the context of distributed storages. From the API standpoint that would like client app requesting iterator between (start seqnum) and current DB state, and reading the "diff". This is a very draft PR for initial review in the discussion on the approach, i'm going to rework some parts and keep updating the PR. For now, what's done here according to initial discussions: Preserving deletes: - We want to be able to optionally preserve recent deletes for some defined period of time, so that if a delete came in recently and might need to be included in the next incremental snapshot it would't get dropped by a compaction. This is done by adding new param to Options (preserve deletes flag) and new variable to DB Impl where we keep track of the sequence number after which we don't want to drop tombstones, even if they are otherwise eligible for deletion. - I also added a new API call for clients to be able to advance this cutoff seqnum after which we drop deletes; i assume it's more flexible to let clients control this, since otherwise we'd need to keep some kind of timestamp < -- > seqnum mapping inside the DB, which sounds messy and painful to support. Clients could make use of it by periodically calling GetLatestSequenceNumber(), noting the timestamp, doing some calculation and figuring out by how much we need to advance the cutoff seqnum. - Compaction codepath in compaction_iterator.cc has been modified to avoid dropping tombstones with seqnum > cutoff seqnum. Iterator changes: - couple params added to ReadOptions, to optionally allow client to request internal keys instead of user keys (so that client can get the latest value of a key, be it delete marker or a put), as well as min timestamp and min seqnum. TableCache changes: - I modified table_cache code to be able to quickly exclude SST files from iterators heep if creation_time on the file is less then iter_start_ts as passed in ReadOptions. That would help a lot in some DB settings (like reading very recent data only or using FIFO compactions), but not so much for universal compaction with more or less long iterator time span. What's left: - Still looking at how to best plug that inside DBIter codepath. So far it seems that FindNextUserKeyInternal only parses values as UserKeys, and iter->key() call generally returns user key. Can we add new API to DBIter as internal_key(), and modify this internal method to optionally set saved_key_ to point to the full internal key? I don't need to store actual seqnum there, but I do need to store type. Closes https://github.com/facebook/rocksdb/pull/2999 Differential Revision: D6175602 Pulled By: mikhail-antonov fbshipit-source-id: c779a6696ee2d574d86c69cec866a3ae095aa900	2017-11-01 18:56:43 -07:00
zach shipko	386a57e6ef	Fix build on OpenBSD Summary: A few simple changes to allow RocksDB to be built on OpenBSD. Let me know if any further changes are needed. Closes https://github.com/facebook/rocksdb/pull/3061 Differential Revision: D6138800 Pulled By: ajkr fbshipit-source-id: a13a17b5dc051e6518bd56a8c5efd1d24dd81b0c	2017-10-24 13:27:38 -07:00
Dmitri Smirnov	ebab2e2d42	Enable MSVC W4 with a few exceptions. Fix warnings and bugs Summary: Closes https://github.com/facebook/rocksdb/pull/3018 Differential Revision: D6079011 Pulled By: yiwu-arbug fbshipit-source-id: 988a721e7e7617967859dba71d660fc69f4dff57	2017-10-19 10:57:12 -07:00
Kamalalochana Subbaiah	e612e31740	Updated CRC32 Power Optimization Changes Summary: Support for PowerPC Architecture Detecting AltiVec Support Closes https://github.com/facebook/rocksdb/pull/2716 Differential Revision: D5606836 Pulled By: siying fbshipit-source-id: 720262453b1546e5fdbbc668eff56848164113f3	2017-08-31 14:16:30 -07:00
Siying Dong	e7697b8ce8	Fix LITE unit tests Summary: Closes https://github.com/facebook/rocksdb/pull/2649 Differential Revision: D5505778 Pulled By: siying fbshipit-source-id: 7e935603ede3d958ea087ed6b8cfc4121e8797bc	2017-07-26 21:11:47 -07:00
Siying Dong	c281b44829	Revert "CRC32 Power Optimization Changes" Summary: This reverts commit `2289d38115`. Closes https://github.com/facebook/rocksdb/pull/2652 Differential Revision: D5506163 Pulled By: siying fbshipit-source-id: 105e31dd9d99090453a6b9f32c165206cd3affa3	2017-07-26 19:31:36 -07:00
Kamalalochana Subbaiah	2289d38115	CRC32 Power Optimization Changes Summary: Support for PowerPC Architecture Detecting AltiVec Support Closes https://github.com/facebook/rocksdb/pull/2353 Differential Revision: D5210948 Pulled By: siying fbshipit-source-id: 859a8c063d37697addd89ba2b8a14e5efd5d24bf	2017-07-26 09:42:29 -07:00
Sagar Vemuri	72502cf227	Revert "comment out unused parameters" Summary: This reverts the previous commit `1d7048c598`, which broke the build. Did a `git revert 1d7048c`. Closes https://github.com/facebook/rocksdb/pull/2627 Differential Revision: D5476473 Pulled By: sagar0 fbshipit-source-id: 4756ff5c0dfc88c17eceb00e02c36176de728d06	2017-07-21 18:26:26 -07:00
Victor Gao	1d7048c598	comment out unused parameters Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually. Reviewed By: igorsugak Differential Revision: D5454343 fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2	2017-07-21 14:57:44 -07:00
Sushma Devendrappa	0655b58582	enable PinnableSlice for RowCache Summary: This patch enables using PinnableSlice for RowCache, changes include not releasing the cache handle immediately after lookup in TableCache::Get, instead pass a Cleanble function which does Cache::RleaseHandle. Closes https://github.com/facebook/rocksdb/pull/2492 Differential Revision: D5316216 Pulled By: maysamyabandeh fbshipit-source-id: d2a684bd7e4ba73772f762e58a82b5f4fbd5d362	2017-07-17 15:08:30 -07:00
Siying Dong	3c327ac2d0	Change RocksDB License Summary: Closes https://github.com/facebook/rocksdb/pull/2589 Differential Revision: D5431502 Pulled By: siying fbshipit-source-id: 8ebf8c87883daa9daa54b2303d11ce01ab1f6f75	2017-07-15 16:11:23 -07:00
Ewout Prangsma	51778612c9	Encryption at rest support Summary: This PR adds support for encrypting data stored by RocksDB when written to disk. It adds an `EncryptedEnv` override of the `Env` class with matching overrides for sequential&random access files. The encryption itself is done through a configurable `EncryptionProvider`. This class creates is asked to create `BlockAccessCipherStream` for a file. This is where the actual encryption/decryption is being done. Currently there is a Counter mode implementation of `BlockAccessCipherStream` with a `ROT13` block cipher (NOTE the `ROT13` is for demo purposes only!!). The Counter operation mode uses an initial counter & random initialization vector (IV). Both are created randomly for each file and stored in a 4K (default size) block that is prefixed to that file. The `EncryptedEnv` implementation is such that clients of the `Env` class do not see this prefix (nor data, nor in filesize). The largest part of the prefix block is also encrypted, and there is room left for implementation specific settings/values/keys in there. To test the encryption, the `DBTestBase` class has been extended to consider a new environment variable called `ENCRYPTED_ENV`. If set, the test will setup a encrypted instance of the `Env` class to use for all tests. Typically you would run it like this: ``` ENCRYPTED_ENV=1 make check_some ``` There is also an added test that checks that some data inserted into the database is or is not "visible" on disk. With `ENCRYPTED_ENV` active it must not find plain text strings, with `ENCRYPTED_ENV` unset, it must find the plain text strings. Closes https://github.com/facebook/rocksdb/pull/2424 Differential Revision: D5322178 Pulled By: sdwilsh fbshipit-source-id: 253b0a9c2c498cc98f580df7f2623cbf7678a27f	2017-06-26 16:56:24 -07:00
Maysam Yabandeh	499ebb3ab5	Optimize for serial commits in 2PC Summary: Throughput: 46k tps in our sysbench settings (filling the details later) The idea is to have the simplest change that gives us a reasonable boost in 2PC throughput. Major design changes: 1. The WAL file internal buffer is not flushed after each write. Instead it is flushed before critical operations (WAL copy via fs) or when FlushWAL is called by MySQL. Flushing the WAL buffer is also protected via mutex_. 2. Use two sequence numbers: last seq, and last seq for write. Last seq is the last visible sequence number for reads. Last seq for write is the next sequence number that should be used to write to WAL/memtable. This allows to have a memtable write be in parallel to WAL writes. 3. BatchGroup is not used for writes. This means that we can have parallel writers which changes a major assumption in the code base. To accommodate for that i) allow only 1 WriteImpl that intends to write to memtable via mem_mutex_--which is fine since in 2PC almost all of the memtable writes come via group commit phase which is serial anyway, ii) make all the parts in the code base that assumed to be the only writer (via EnterUnbatched) to also acquire mem_mutex_, iii) stat updates are protected via a stat_mutex_. Note: the first commit has the approach figured out but is not clean. Submitting the PR anyway to get the early feedback on the approach. If we are ok with the approach I will go ahead with this updates: 0) Rebase with Yi's pipelining changes 1) Currently batching is disabled by default to make sure that it will be consistent with all unit tests. Will make this optional via a config. 2) A couple of unit tests are disabled. They need to be updated with the serial commit of 2PC taken into account. 3) Replacing BatchGroup with mem_mutex_ got a bit ugly as it requires releasing mutex_ beforehand (the same way EnterUnbatched does). This needs to be cleaned up. Closes https://github.com/facebook/rocksdb/pull/2345 Differential Revision: D5210732 Pulled By: maysamyabandeh fbshipit-source-id: 78653bd95a35cd1e831e555e0e57bdfd695355a4	2017-06-24 14:11:29 -07:00
Yi Wu	ad19eb8686	Fixing blob db sequence number handling Summary: Blob db rely on base db returning sequence number through write batch after DB::Write(). However after recent changes to the write path, DB::Writ()e no longer return sequence number in some cases. Fixing it by have WriteBatchInternal::InsertInto() always encode sequence number into write batch. Stacking on #2375. Closes https://github.com/facebook/rocksdb/pull/2385 Differential Revision: D5148358 Pulled By: yiwu-arbug fbshipit-source-id: 8bda0aa07b9334ed03ed381548b39d167dc20c33	2017-05-31 10:56:45 -07:00
Andrew Kryczka	bb01c1880c	Introduce max_background_jobs mutable option Summary: - `max_background_flushes` and `max_background_compactions` are still supported for backwards compatibility - `base_background_compactions` is completely deprecated. Now we just throttle to one background compaction when there's no pressure. - `max_background_jobs` is added to automatically partition the concurrent background jobs into flushes vs compactions. Currently it's very simple as we just allocate one-fourth of the jobs to flushes, and the remaining can be used for compactions. - The test cases that set `base_background_compactions > 1` needed to be updated. I just grab the pressure token such that the desired number of compactions can be scheduled. Closes https://github.com/facebook/rocksdb/pull/2205 Differential Revision: D4937461 Pulled By: ajkr fbshipit-source-id: df52cbbd497e13bbc9a60560a5ac2a2526b3f1f9	2017-05-24 11:29:08 -07:00
Aaron Gao	3e86c0f07c	disable direct reads for log and manifest and add direct io to tests Summary: Disable direct reads for log and manifest. Direct reads should not affect sequential_file Also add kDirectIO for option_config_ in db_test_util Closes https://github.com/facebook/rocksdb/pull/2337 Differential Revision: D5100261 Pulled By: lightmark fbshipit-source-id: 0ebfd13b93fa1b8f9acae514ac44f8125a05868b	2017-05-22 18:41:28 -07:00
Maysam Yabandeh	e7731d119a	Configure index partition size Summary: Allow the users to specify the target index partition size. With this patch an index partition is cut before its estimated in-memory size goes above the configured value for metadata_block_size. The filter partitions are still cut right after an index partition is cut. Closes https://github.com/facebook/rocksdb/pull/2041 Differential Revision: D4780216 Pulled By: maysamyabandeh fbshipit-source-id: 95a0831	2017-03-28 12:09:12 -07:00
Maysam Yabandeh	8b0097b49b	Readers for partition filter Summary: This is the last split of this pull request: https://github.com/facebook/rocksdb/pull/1891 which includes the reader part as well as the tests. Closes https://github.com/facebook/rocksdb/pull/1961 Differential Revision: D4672216 Pulled By: maysamyabandeh fbshipit-source-id: 6a2b829	2017-03-22 09:24:15 -07:00
Maysam Yabandeh	e6725e8c8d	Fix some bugs in MockEnv Summary: Fixing some bugs in MockEnv so it be actually used. Closes https://github.com/facebook/rocksdb/pull/1914 Differential Revision: D4609923 Pulled By: maysamyabandeh fbshipit-source-id: ca25735	2017-03-13 09:54:11 -07:00
Maysam Yabandeh	534581a356	Fix a bug in tests in options operator= Summary: Note: Using the default operator= is an unsafe approach for Options since it destructs shared_ptr in the same order of their creation, in contrast to destructors which destructs them in the opposite order of creation. One particular problme is that the cache destructor might invoke callback functions that use Option members such as statistics. To work around this problem, we manually call destructor of table_facotry which eventually clears the block cache. Closes https://github.com/facebook/rocksdb/pull/1950 Differential Revision: D4655473 Pulled By: maysamyabandeh fbshipit-source-id: 6c4bbff	2017-03-05 18:09:09 -08:00
Siying Dong	d5b607a43f	Make db_wal_test slightly faster Summary: Avoid to run db_wal_test in all the DB test options, and some small changes. Closes https://github.com/facebook/rocksdb/pull/1921 Differential Revision: D4622054 Pulled By: siying fbshipit-source-id: 890fd64	2017-02-28 17:39:10 -08:00
Siying Dong	1ba2804b7f	Remove XFunc tests Summary: Xfunc is hardly used. Remove it to keep the code simple. Closes https://github.com/facebook/rocksdb/pull/1905 Differential Revision: D4603220 Pulled By: siying fbshipit-source-id: 731f96d	2017-02-23 12:09:11 -08:00
Maysam Yabandeh	69d5262c81	Two-level Indexes Summary: Partition Index blocks and use a Partition-index as a 2nd level index. The two-level index can be used by setting BlockBasedTableOptions::kTwoLevelIndexSearch as the index type and configuring BlockBasedTableOptions::index_per_partition t15539501 Closes https://github.com/facebook/rocksdb/pull/1814 Differential Revision: D4473535 Pulled By: maysamyabandeh fbshipit-source-id: bffb87e	2017-02-06 16:39:12 -08:00
Islam AbdelRahman	3ce091fd73	Add KEEP_DB env var option Summary: When debugging tests, it's useful to preserve the DB to investigate it and check the logs This will allow us to set KEEP_DB=1 to preserve the DB Closes https://github.com/facebook/rocksdb/pull/1759 Differential Revision: D4393826 Pulled By: IslamAbdelRahman fbshipit-source-id: 1bff689	2017-01-17 13:54:20 -08:00
Aaron Gao	972f96b3fb	direct io write support Summary: rocksdb direct io support ``` [gzh@dev11575.prn2 ~/rocksdb] ./db_bench -benchmarks=fillseq --num=1000000 Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags RocksDB: version 5.0 Date: Wed Nov 23 13:17:43 2016 CPU: 40 * Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz CPUCache: 25600 KB Keys: 16 bytes each Values: 100 bytes each (50 bytes after compression) Entries: 1000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 110.6 MB (estimated) FileSize: 62.9 MB (estimated) Write rate: 0 bytes/second Compression: Snappy Memtablerep: skip_list Perf Level: 1 WARNING: Assertions are enabled; benchmarks unnecessarily slow ------------------------------------------------ Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags DB path: [/tmp/rocksdbtest-112628/dbbench] fillseq : 4.393 micros/op 227639 ops/sec; 25.2 MB/s [gzh@dev11575.prn2 ~/roc Closes https://github.com/facebook/rocksdb/pull/1564 Differential Revision: D4241093 Pulled By: lightmark fbshipit-source-id: 98c29e3	2016-12-22 13:09:19 -08:00
Yi Wu	c270735861	Iterator should be in corrupted status if merge operator return false Summary: Iterator should be in corrupted status if merge operator return false. Also add test to make sure if max_successive_merges is hit during write, data will not be lost. Closes https://github.com/facebook/rocksdb/pull/1665 Differential Revision: D4322695 Pulled By: yiwu-arbug fbshipit-source-id: b327b05	2016-12-16 11:09:16 -08:00
Daniel Black	cfc34d7c4e	Missing break in case in DBTestBase::CurrentOptions Summary: Found by gcc-7 compile error. This appeared to be a fault as these options seems too different. Closes https://github.com/facebook/rocksdb/pull/1667 Differential Revision: D4324174 Pulled By: yiwu-arbug fbshipit-source-id: 0f65383	2016-12-13 18:39:14 -08:00
Maysam Yabandeh	182b940e70	Add WriteOptions.no_slowdown Summary: If the WriteOptions.no_slowdown flag is set AND we need to wait or sleep for the write request, then fail immediately with Status::Incomplete(). Closes https://github.com/facebook/rocksdb/pull/1527 Differential Revision: D4191405 Pulled By: maysamyabandeh fbshipit-source-id: 7f3ce3f	2016-11-21 18:09:13 -08:00
Andrew Kryczka	fd43ee09da	Range deletion microoptimizations Summary: - Made RangeDelAggregator's InternalKeyComparator member a reference-to-const so we don't need to copy-construct it. Also added InternalKeyComparator to ImmutableCFOptions so we don't need to construct one for each DBIter. - Made MemTable::NewRangeTombstoneIterator and the table readers' NewRangeTombstoneIterator() functions return nullptr instead of NewEmptyInternalIterator to avoid the allocation. Updated callers accordingly. Closes https://github.com/facebook/rocksdb/pull/1548 Differential Revision: D4208169 Pulled By: ajkr fbshipit-source-id: 2fd65cf	2016-11-21 12:24:13 -08:00
Siying Dong	972e3ff295	Enable allow_concurrent_memtable_write and enable_write_thread_adaptive_yield by default Summary: Closes https://github.com/facebook/rocksdb/pull/1496 Differential Revision: D4168080 Pulled By: siying fbshipit-source-id: 056ae62	2016-11-16 09:39:09 -08:00
Andrew Kryczka	9e7cf3469b	DeleteRange user iterator support Summary: Note: reviewed in https://reviews.facebook.net/D65115 - DBIter maintains a range tombstone accumulator. We don't cleanup obsolete tombstones yet, so if the user seeks back and forth, the same tombstones would be added to the accumulator multiple times. - DBImpl::NewInternalIterator() (used to make DBIter's underlying iterator) adds memtable/L0 range tombstones, L1+ range tombstones are added on-demand during NewSecondaryIterator() (see D62205) - DBIter uses ShouldDelete() when advancing to check whether keys are covered by range tombstones Closes https://github.com/facebook/rocksdb/pull/1464 Differential Revision: D4131753 Pulled By: ajkr fbshipit-source-id: be86559	2016-11-04 12:09:22 -07:00
Islam AbdelRahman	869ae5d786	Support IngestExternalFile (remove AddFile restrictions) Summary: Changes in the diff API changes: - Introduce IngestExternalFile to replace AddFile (I think this make the API more clear) - Introduce IngestExternalFileOptions (This struct will encapsulate the options for ingesting the external file) - Deprecate AddFile() API Logic changes: - If our file overlap with the memtable we will flush the memtable - We will find the first level in the LSM tree that our file key range overlap with the keys in it - We will find the lowest level in the LSM tree above the the level we found in step 2 that our file can fit in and ingest our file in it - We will assign a global sequence number to our new file - Remove AddFile restrictions by using global sequence numbers Other changes: - Refactor all AddFile logic to be encapsulated in ExternalSstFileIngestionJob Test Plan: unit tests (still need to add more) addfile_stress (https://reviews.facebook.net/D65037) Reviewers: yiwu, andrewkr, lightmark, yhchiang, sdong Reviewed By: sdong Subscribers: jkedgar, hcz, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D65061	2016-10-20 17:05:32 -07:00
Islam AbdelRahman	b88f8e87c5	Support SST files with Global sequence numbers [reland] Summary: reland https://reviews.facebook.net/D62523 - Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno` - Update TableProperties to be aware of the offset of each property in the file - Update BlockBasedTableReader and Block to be able to honor the sequence number in `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence number in the file Something worth mentioning is that we don't update the seqno in the index block since and when doing a binary search, the reason for that is that it's guaranteed that SST files with global seqno will have only one user_key and each key will have seqno=0 encoded in it, This mean that this key is greater than any other key with seqno> 0. That mean that we can actually keep the current logic for these blocks Test Plan: unit tests Reviewers: sdong, yhchiang Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D65211	2016-10-18 16:59:37 -07:00
Islam AbdelRahman	d11c09d9e2	Eliminate memcpy from ForwardIterator Summary: This diff update ForwardIterator to support pinning keys and values, which will allow DBIter to take advantage of that and eliminate memcpy when executing merge operators This diff is stacked on D61305 Test Plan: existing tests (updated them to test tailing iterator) new test Reviewers: andrewkr, yhchiang, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D60009	2016-08-11 19:10:16 -07:00
sdong	56dd034115	read_options.background_purge_on_iterator_cleanup to cover forward iterator and log file closing too. Summary: With read_options.background_purge_on_iterator_cleanup=true, File deletion and closing can still happen in forward iterator, or WAL file closing. Cover those cases too. Test Plan: I am adding unit tests. Reviewers: andrewkr, IslamAbdelRahman, yiwu Reviewed By: yiwu Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D61503	2016-08-10 13:16:41 -07:00
Islam AbdelRahman	ccecf3f4fb	UniversalCompaction should ignore sorted runs being compacted (when compacting for file num) Summary: If we have total number of sorted runs greater than level0_file_num_compaction_trigger, Universal compaction will always issue a compaction even if the number of sorted runs that are not being compacted is less than level0_file_num_compaction_trigger. This diff changes this behaviour to relay on the `number of sorted runs not being compacted` instead of `total number of sorted runs` Test Plan: New unit test Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D61533	2016-08-10 12:37:43 -07:00
Islam AbdelRahman	68a8e6b8fa	Introduce FullMergeV2 (eliminate memcpy from merge operators) Summary: This diff update the code to pin the merge operator operands while the merge operation is done, so that we can eliminate the memcpy cost, to do that we need a new public API for FullMerge that replace the std::deque<std::string> with std::vector<Slice> This diff is stacked on top of D56493 and D56511 In this diff we - Update FullMergeV2 arguments to be encapsulated in MergeOperationInput and MergeOperationOutput which will make it easier to add new arguments in the future - Replace std::deque<std::string> with std::vector<Slice> to pass operands - Replace MergeContext std::deque with std::vector (based on a simple benchmark I ran https://gist.github.com/IslamAbdelRahman/78fc86c9ab9f52b1df791e58943fb187) - Allow FullMergeV2 output to be an existing operand ``` [Everything in Memtable \| 10K operands \| 10 KB each \| 1 operand per key] DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=10000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000 [FullMergeV2] readseq : 0.607 micros/op 1648235 ops/sec; 16121.2 MB/s readseq : 0.478 micros/op 2091546 ops/sec; 20457.2 MB/s readseq : 0.252 micros/op 3972081 ops/sec; 38850.5 MB/s readseq : 0.237 micros/op 4218328 ops/sec; 41259.0 MB/s readseq : 0.247 micros/op 4043927 ops/sec; 39553.2 MB/s [master] readseq : 3.935 micros/op 254140 ops/sec; 2485.7 MB/s readseq : 3.722 micros/op 268657 ops/sec; 2627.7 MB/s readseq : 3.149 micros/op 317605 ops/sec; 3106.5 MB/s readseq : 3.125 micros/op 320024 ops/sec; 3130.1 MB/s readseq : 4.075 micros/op 245374 ops/sec; 2400.0 MB/s ``` ``` [Everything in Memtable \| 10K operands \| 10 KB each \| 10 operand per key] DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=1000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000 [FullMergeV2] readseq : 3.472 micros/op 288018 ops/sec; 2817.1 MB/s readseq : 2.304 micros/op 434027 ops/sec; 4245.2 MB/s readseq : 1.163 micros/op 859845 ops/sec; 8410.0 MB/s readseq : 1.192 micros/op 838926 ops/sec; 8205.4 MB/s readseq : 1.250 micros/op 800000 ops/sec; 7824.7 MB/s [master] readseq : 24.025 micros/op 41623 ops/sec; 407.1 MB/s readseq : 18.489 micros/op 54086 ops/sec; 529.0 MB/s readseq : 18.693 micros/op 53495 ops/sec; 523.2 MB/s readseq : 23.621 micros/op 42335 ops/sec; 414.1 MB/s readseq : 18.775 micros/op 53262 ops/sec; 521.0 MB/s ``` ``` [Everything in Block cache \| 10K operands \| 10 KB each \| 1 operand per key] [FullMergeV2] $ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions readseq : 14.741 micros/op 67837 ops/sec; 663.5 MB/s readseq : 1.029 micros/op 971446 ops/sec; 9501.6 MB/s readseq : 0.974 micros/op 1026229 ops/sec; 10037.4 MB/s readseq : 0.965 micros/op 1036080 ops/sec; 10133.8 MB/s readseq : 0.943 micros/op 1060657 ops/sec; 10374.2 MB/s [master] readseq : 16.735 micros/op 59755 ops/sec; 584.5 MB/s readseq : 3.029 micros/op 330151 ops/sec; 3229.2 MB/s readseq : 3.136 micros/op 318883 ops/sec; 3119.0 MB/s readseq : 3.065 micros/op 326245 ops/sec; 3191.0 MB/s readseq : 3.014 micros/op 331813 ops/sec; 3245.4 MB/s ``` ``` [Everything in Block cache \| 10K operands \| 10 KB each \| 10 operand per key] DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10-operands-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions [FullMergeV2] readseq : 24.325 micros/op 41109 ops/sec; 402.1 MB/s readseq : 1.470 micros/op 680272 ops/sec; 6653.7 MB/s readseq : 1.231 micros/op 812347 ops/sec; 7945.5 MB/s readseq : 1.091 micros/op 916590 ops/sec; 8965.1 MB/s readseq : 1.109 micros/op 901713 ops/sec; 8819.6 MB/s [master] readseq : 27.257 micros/op 36687 ops/sec; 358.8 MB/s readseq : 4.443 micros/op 225073 ops/sec; 2201.4 MB/s readseq : 5.830 micros/op 171526 ops/sec; 1677.7 MB/s readseq : 4.173 micros/op 239635 ops/sec; 2343.8 MB/s readseq : 4.150 micros/op 240963 ops/sec; 2356.8 MB/s ``` Test Plan: COMPILE_WITH_ASAN=1 make check -j64 Reviewers: yhchiang, andrewkr, sdong Reviewed By: sdong Subscribers: lovro, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57075	2016-07-20 09:49:03 -07:00
Aaron Gao	dda6c72ac8	Add DestroyColumnFamilyHandle(ColumnFamilyHandle) to db.h Summary: add DestroyColumnFamilyHandle(ColumnFamilyHandle) to close column family instead of deleting cfh* User should call this to close a cf and then we can detect the deletion in this function. Test Plan: make all check -j64 Reviewers: andrewkr, yiwu, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D60765	2016-07-13 17:59:25 -07:00
Aaron Gao	8e6b38d895	update DB::AddFile to ingest list of sst files Summary: DB::AddFile(std::string file_path) API that allow them to ingest an SST file created using SstFileWriter We want to update this interface to be able to accept a list of files that will be ingested, DB::AddFile(std::vector<std::string> file_path_list). Test Plan: Add test case `AddExternalSstFileList` in `DBSSTTest`. To make sure: 1. files key ranges are not overlapping with each other 2. each file key range dont overlap with the DB key range 3. make sure no snapshots are held Reviewers: andrewkr, sdong, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D58587	2016-07-11 10:43:12 -07:00
Islam AbdelRahman	fa813f7478	Update DB::AddFile() to ingest the file to the lowest possible level Summary: DB::AddFile() right now always add the ingested file to L0 update the logic to add the file to the lowest possible level Test Plan: unit tests Reviewers: jkedgar, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, yoshinorim Differential Revision: https://reviews.facebook.net/D59637	2016-06-21 17:57:59 -07:00
sdong	7b79238b65	Deprectate filter_deletes Summary: filter_deltes is not a frequently used feature. Remove it. Test Plan: Run all test suites. Reviewers: igor, yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59427	2016-06-17 10:30:47 -07:00
sdong	6a14f7a976	Change several option defaults Summary: Changing several option defaults: options.max_open_files changes from 5000 to -1 options.base_background_compactions changes from max_background_compactions to 1 options.wal_recovery_mode changes from kTolerateCorruptedTailRecords to kTolerateCorruptedTailRecords options.compaction_pri changes from kByCompensatedSize to kByCompensatedSize Test Plan: Write unit tests to see OldDefaults() works as expected. Reviewers: IslamAbdelRahman, yhchiang, igor Reviewed By: igor Subscribers: MarkCallaghan, yiwu, kradhakrishnan, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56427	2016-04-28 17:50:58 -07:00
Andrew Kryczka	73a847ef89	Add per-level compression ratio property Summary: This is needed so we can measure compression ratio improvements achieved by D52287. The property compares raw data size against the total file size for a given level. If the level is empty it should return 0.0. Test Plan: new unit test Reviewers: IslamAbdelRahman, yhchiang, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D56967	2016-04-20 18:46:54 -07:00
Yi Wu	792762c42c	Split db_test.cc Summary: Split db_test.cc into several files. Moving several helper functions into DBTestBase. Test Plan: make check Reviewers: sdong, yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: dhruba, andrewkr, kradhakrishnan, yhchiang, leveldb, sdong Differential Revision: https://reviews.facebook.net/D56715	2016-04-18 09:42:50 -07:00
sdong	4b6833aec1	Rename options.compaction_measure_io_stats to options.report_bg_io_stats and include flush too. Summary: It is useful to print out IO stats in flush jobs too. Extend options.compaction_measure_io_stats to flush jobs and raname it. Test Plan: Try db_bench and see the stats are printed out. Reviewers: yhchiang Reviewed By: yhchiang Subscribers: kradhakrishnan, yiwu, IslamAbdelRahman, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56769	2016-04-15 10:22:18 -07:00
sdong	2feafa3db9	Change some RocksDB default options Summary: Change some RocksDB default options to make it more friendly to server workloads. Test Plan: Run all existing tests Reviewers: yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: sumeet, muthu, benj, MarkCallaghan, igor, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D55941	2016-03-31 17:12:18 -07:00
Islam AbdelRahman	df9ba6df62	Introduce SstFileManager::SetMaxAllowedSpaceUsage() to cap disk space usage Summary: Introude SstFileManager::SetMaxAllowedSpaceUsage() that can be used to limit the maximum space usage allowed for RocksDB. When this limit is exceeded WriteImpl() will fail and return Status::Aborted() Test Plan: unit testing Reviewers: yhchiang, anthony, andrewkr, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D53763	2016-02-17 15:20:23 -08:00
Baraa Hamodi	21e95811d1	Updated all copyright headers to the new format.	2016-02-09 15:12:00 -08:00
Islam AbdelRahman	8e6172bc57	Add BlockBasedTableOptions::index_block_restart_interval Summary: Add a new option to BlockBasedTableOptions that will allow us to change the restart interval for the index block Test Plan: unit tests Reviewers: yhchiang, anthony, andrewkr, sdong Reviewed By: sdong Subscribers: march, dhruba Differential Revision: https://reviews.facebook.net/D53721	2016-02-05 10:22:37 -08:00
Nathan Bronson	9c2cf9479b	Fix for --allow_concurrent_memtable_write with batching Summary: Concurrent memtable adds were incorrectly computing the last sequence number for a write batch group when the write batches were not solitary. This is the cause of https://github.com/facebook/mysql-5.6/issues/155 Test Plan: 1. unit tests 2. new unit test 3. parallel db_bench stress tests with batch size of 10 and asserts enabled Reviewers: igor, sdong Reviewed By: sdong Subscribers: IslamAbdelRahman, MarkCallaghan, dhruba Differential Revision: https://reviews.facebook.net/D53595	2016-02-01 20:41:57 -08:00
Islam AbdelRahman	d6c838f1e1	Add SstFileManager (component tracking all SST file in DBs and control the deletion rate) Summary: Add a new class SstFileTracker that will be notified whenever a DB add/delete/move and sst file, it will also replace DeleteScheduler SstFileTracker can be used later to abort writes when we exceed a specific size Test Plan: unit tests Reviewers: rven, anthony, yhchiang, sdong Reviewed By: sdong Subscribers: igor, lovro, march, dhruba Differential Revision: https://reviews.facebook.net/D50469	2016-01-28 18:35:01 -08:00
sdong	da33dfe188	Parameterize DBTest.Randomized Summary: Break down DBTest.Randomized to multiple gtest tests based on config type Test Plan: Run the test and all tests. Make sure configurations are correctly set Reviewers: yhchiang, IslamAbdelRahman, rven, kradhakrishnan, andrewkr, anthony Reviewed By: anthony Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D53247	2016-01-25 14:49:08 -08:00
Siying Dong	fa3dbf203f	Merge pull request #853 from Vaisman/enable_C4267_warning Enable C4267 warning	2015-12-08 17:59:24 -08:00
krad	d3bb572da6	Build break fix. Summary: Skip list now cannot estimate memory across allocators consistently and hence triggers flush at different time. This breaks certain unit tests. The fix is to adopt key count instead of size for flush. Test Plan: Ran test on dev box and mac (where it used to fail) Reviewers: sdong CC: leveldb@ Task ID: #9273334 Blame Rev:	2015-12-04 11:45:51 -08:00
sdong	ef8ed3681c	Fix DBTest.SuggestCompactRangeTest for disable jemalloc case Summary: DBTest.SuggestCompactRangeTest fails for the case when jemalloc is disabled, including ASAN and valgrind builds. It is caused by the improvement of skip list, which allocates different size of nodes for a new records. Fix it by using a special mem table that triggers a flush by number of entries. In that way the behavior will be consistent for all allocators. Test Plan: Run the test with both of DISABLE_JEMALLOC=1 and 0 Reviewers: anthony, rven, yhchiang, kradhakrishnan, igor, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D51423	2015-11-30 16:40:47 -08:00
Vasili Svirski	41b32c6059	Enable C4267 warning * conversion from 'size_t' to 'type', by add static_cast Tested: * by build solution on Windows, Linux locally, * run tests * build CI system successful	2015-11-24 16:33:09 +03:00
sdong	189b3e03df	Fix uninitilizeded SpecialEnv::time_elapse_only_sleep_ Summary: SpecialEnv::time_elapse_only_sleep_ is not initialized, which might cause some test failures. Fix it. Test Plan: Run some unit tests. Since tests already broken. Might want to commit it sooner. Reviewers: IslamAbdelRahman, yhchiang, anthony Reviewed By: anthony Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D50937	2015-11-17 16:22:17 -08:00
Yueh-Hsuan Chiang	e114f0abb8	Enable RocksDB to persist Options file. Summary: This patch allows rocksdb to persist options into a file on DB::Open, SetOptions, and Create / Drop ColumnFamily. Options files are created under the same directory as the rocksdb instance. In addition, this patch also adds a fail_if_missing_options_file in DBOptions that makes any function call return non-ok status when it is not able to persist options properly. // If true, then DB::Open / CreateColumnFamily / DropColumnFamily // / SetOptions will fail if options file is not detected or properly // persisted. // // DEFAULT: false bool fail_if_missing_options_file; Options file names are formatted as OPTIONS-<number>, and RocksDB will always keep the latest two options files. Test Plan: Add options_file_test. options_test column_family_test Reviewers: igor, IslamAbdelRahman, sdong, anthony Reviewed By: anthony Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48285	2015-11-10 22:58:01 -08:00
Nathan Bronson	b81b430987	Switch to thread-local random for skiplist Summary: Using a TLS random instance for skiplist makes it smaller (useful for hash_skiplist_rep) and prepares skiplist for concurrent adds. This diff also modifies the branching factor math to avoid an unnecessary division. This diff has the effect of changing the sequence of skip list node height choices made by tests, so it has the potential to cause unit test failures for tests that implicitly rely on the exact structure of the skip list. Tests that try to exactly trigger a compaction are likely suspects for this problem (these tests have always been brittle to changes in the skiplist details). I've minimizes this risk by reseeding the main thread's Random at the beginning of each test, increasing the universal compaction size_ratio limit from 101% to 105% for some tests, and verifying that the tests pass many times. Test Plan: for i in `seq 0 9`; do make check; done Reviewers: sdong, igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D50439	2015-11-09 19:25:22 -08:00
Siying Dong	138876a62c	Merge pull request #746 from ceph/wip-recycle Add Options.recycle_log_file_num for Recycling WAL Files	2015-10-26 15:01:28 -07:00
Dmitri Smirnov	3c750b59ae	No need to #ifdef test only code on windows	2015-10-22 15:15:37 -07:00
Sage Weil	7188052107	db_test_util: add recycle_log_files to set of tested options Signed-off-by: Sage Weil <sage@redhat.com>	2015-10-18 21:24:32 -04:00
Islam AbdelRahman	952ad994a9	Fix db_test under ROCKSDB_LITE Summary: This diff exclude alot of tests in db_test that are not compiling / failing under ROCKSD_LITE Test Plan: OPT=-DROCKSDB_LITE make check -j64 make check -j64 Reviewers: yhchiang, igor, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48771	2015-10-15 10:59:31 -07:00
Islam AbdelRahman	f55d3009c0	Make db_test_util compile under ROCKSDB_LITE Summary: db_test_util is used in multiple test files but it dont compile under ROCKSDB_LITE Test Plan: make check make static_lib OPT=-DROCKSDB_LITE make db_wal_test Reviewers: igor, yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48579	2015-10-13 17:33:23 -07:00
sdong	35ad531be3	Seperate InternalIterator from Iterator Summary: Separate a new class InternalIterator from class Iterator, when the look-up is done internally, which also means they operate on key with sequence ID and type. This change will enable potential future optimizations but for now InternalIterator's functions are still the same as Iterator's. At the same time, separate the cleanup function to a separate class and let both of InternalIterator and Iterator inherit from it. Test Plan: Run all existing tests. Reviewers: igor, yhchiang, anthony, kradhakrishnan, IslamAbdelRahman, rven Reviewed By: rven Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D48549	2015-10-13 15:32:13 -07:00
Islam AbdelRahman	c64ae05b1c	Move TEST_NewInternalIterator to NewInternalIterator Summary: Long time ago we add InternalDumpCommand to ldb_tool https://reviews.facebook.net/D11517 This command is using TEST_NewInternalIterator although it's not a test. This patch move TEST_NewInternalIterator outside of db_impl_debug.cc Test Plan: make check make static_lib Reviewers: yhchiang, igor, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48561	2015-10-12 17:22:37 -07:00
Venkatesh Radhakrishnan	f1fdf5205b	Clean up dependency: Move db_test_util.* to db directory Summary: As part of tech debt week, we are cleaning up dependencies. This diff moves db_test_util.[h,cc] from util to db directory. Test Plan: make check Reviewers: igor, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D48543	2015-10-12 13:05:42 -07:00

1 2 3

150 Commits