rocksdb

Author	SHA1	Message	Date
Stanislau Hlebik	30c81e7717	Removing NewTotalOrderPlainTableFactory Summary: Seems like NewTotalOrderPlainTableFactory is useless and is semantically incorrect. Total order mode indicator is prefix_extractor == nullptr, but NewTotalOrderPlainTableFactory doesn't set it to be nullptr. That's why some tests in plain_table_db_tests is incorrect. Test Plan: make all check Reviewers: sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19587	2014-07-10 11:32:04 -07:00
Igor Canadi	f0a8be253e	JSON (Document) API sketch Summary: This is a rough sketch of our new document API. Would like to get some thoughts and comments about the high-level architecture and API. I didn't optimize for performance at all. Leaving some low-hanging fruit so that we can be happy when we fix them! :) Currently, bunch of features are not supported at all. Indexes can be only specified when creating database. There is no query planner whatsoever. This will all be added in due time. Test Plan: Added a simple unit test Reviewers: haobo, yhchiang, dhruba, sdong, ljin Reviewed By: ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D18747	2014-07-10 09:31:42 -07:00
Feng Zhu	222cf2555a	change the init parameter for FileDescriptor Summary: fix a bug in improve_file_key_search, change the parameter for FileDescriptor Test Plan: make all check Reviewers: sdong Reviewed By: sdong Differential Revision: https://reviews.facebook.net/D19611	2014-07-09 23:40:03 -07:00
Lei Jin	8a7d1fe616	disable rate limiter test Summary: The test is not stable because it relies on disk and only runs for a short period of time. So misisng a compaction/flush would greatly affect the rate. I am disabling it for now. What do you guys think? Test Plan: make Reviewers: yhchiang, igor, sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19599	2014-07-09 22:46:15 -07:00
Feng Zhu	f697cad15c	create compressed_levels_ in Version, allocate its space using arena. Make Version::Get, Version::FindFile faster Summary: Define CompressedFileMetaData that just contains fd, smallest_slice, largest_slice. Create compressed_levels_ in Version, the space is allocated using arena Thus increase the file meta data locality, speed up "Get" and "FindFile" benchmark with in-memory tmpfs, could have 4% improvement under "random read" and 2% improvement under "read while writing" benchmark command: ./db_bench --db=/mnt/db/rocksdb --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --block_size=4096 --cache_size=17179869184 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=33554432 --max_bytes_for_level_base=1073741824 --disable_wal=0 --sync=0 --disable_data_sync=1 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_grandparent_overlap_factor=10 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --perf_level=0 --benchmarks=readwhilewriting,readwhilewriting,readwhilewriting --use_existing_db=1 --num=52428800 --threads=1 —writes_per_second=81920 Read Random: From 1.8363 ms/op, improve to 1.7587 ms/op. Read while writing: From 2.985 ms/op, improve to 2.924 ms/op. Test Plan: make all check Reviewers: ljin, haobo, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba, igor Differential Revision: https://reviews.facebook.net/D19419	2014-07-09 22:14:39 -07:00
Yueh-Hsuan Chiang	70828557ef	Some fixes on size compensation logic for deletion entry in compaction Summary: This patch include two fixes: 1. newly created Version will now takes the aggregated stats for average-value-size from the latest Version. 2. compensated size of a file is now computed only for newly created / loaded file, this addresses the issue where files are already sorted by their compensated file size but might sometimes observe some out-of-order due to later update on compensated file size. Test Plan: export ROCKSDB_TESTS=CompactionDele ./db_test Reviewers: ljin, igor, sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19557	2014-07-09 12:46:08 -07:00
Lei Jin	ef1aad97f9	fix one more internal_stats issue Summary: stall count is wrong Test Plan: make release Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19539	2014-07-08 15:29:13 -07:00
Lei Jin	73d7147096	make rate limiter test more reliable Summary: Randomize keys so that compaction actually happens. Change the config so that compaction happens more aggressively. The test takes longer time, but the results are more stable shown by iostat Test Plan: ran it Reviewers: igor, yhchiang Reviewed By: yhchiang Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19533	2014-07-08 15:15:00 -07:00
Lei Jin	8a9cc7885c	report correct interval amplification Summary: as title Test Plan: make release Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19515	2014-07-08 12:48:10 -07:00
Lei Jin	534357ca3a	integrate rate limiter into rocksdb Summary: Add option and plugin rate limiter for PosixWritableFile. The rate limiter only applies to flush and compaction. WAL and MANIFEST are excluded from this enforcement. Test Plan: db_test Reviewers: igor, yhchiang, sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19425	2014-07-08 12:31:49 -07:00
Lei Jin	b278ae8e50	Apply fractional cascading in ForwardIterator::Seek() Summary: Use search hint to reduce FindFile range thus avoid comparison For a small DB with 50M keys, perf_context counter shows it reduces comparison from 2B to 1.3B for a 15-minute run. No perf change was observed for 1 seek thread, but quite good improvement was seen for 32 seek threads, when CPU was busy. will post detail results when ready Test Plan: db_bench and db_test Reviewers: haobo, sdong, dhruba, igor Reviewed By: igor Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D18879	2014-07-08 11:40:42 -07:00
Igor Canadi	1b95bf734d	Merge pull request #197 from rdallman/update-options C API: update options w/ convenience funcs & fifo compaction	2014-07-08 11:40:11 -07:00
Reed Allman	fd3fb4b0bf	C API: update options w/ convenience funcs & fifo compaction	2014-07-08 10:57:45 -07:00
Reed Allman	e9b18b6b89	C API: bugfix column_family_comact_range	2014-07-07 21:48:49 -07:00
Igor Canadi	4adf64e068	Fix compile issue	2014-07-07 14:54:11 -07:00
Igor Canadi	8a03935f8c	Fix valgrind error in c_test Summary: External contribution caused some valgrind errors: `1a34aaaef0` This diff fixes them Test Plan: ran valgrind Reviewers: sdong, yhchiang, ljin Reviewed By: ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19485	2014-07-07 14:41:54 -07:00
Evan Shaw	13a130cc00	C API: Add test for compaction filter factories Also refactored the compaction filter tests to share some code and ensure that options were getting reset so future test results aren't confused.	2014-07-08 08:12:36 +12:00
Evan Shaw	3f7104d7c5	C API: Allow setting compaction filter factory	2014-07-08 08:12:36 +12:00
Evan Shaw	91bede79cc	C API: Add support for compaction filter factories (v1)	2014-07-08 08:12:36 +12:00
Radheshyam Balasundaram	f0660d5253	Adding NUMA support to db_bench tests Summary: Changes: - Adding numa_aware flag to db_bench.cc - Using numa.h library to bind memory and cpu of threads to a fixed NUMA node Result: There seems to be no significant change in the micros/op time with numa_aware enabled. I also tried this with other implementations, including a combination of pthread_setaffinity_np, sched_setaffinity and set_mempolicy methods. It'd be great if someone could point out where I'm going wrong and if we can achieve a better micors/op. Test Plan: Ran db_bench tests using following command: ./db_bench --db=/mnt/tmp --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --block_size=4096 --cache_size=17179869184 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=/mnt/tmp --sync=0 --disable_data_sync=1 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_grandparent_overlap_factor=10 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --perf_level=0 --duration=300 --benchmarks=readwhilewriting --use_existing_db=1 --num=157286400 --threads=24 --writes_per_second=10240 --numa_aware=[False/True] The tests were run in private devserver with 24 cores and the db was prepopulated using filluniquerandom test. The tests resulted in 0.145 us/op with numa_aware=False and 0.161 us/op with numa_aware=True. Reviewers: sdong, yhchiang, ljin, igor Reviewed By: ljin, igor Subscribers: igor, leveldb Differential Revision: https://reviews.facebook.net/D19353	2014-07-07 10:53:31 -07:00
Igor Canadi	0bc5fa9f40	Merge pull request #190 from edsrzf/c-api-writebatch-serialized C API: support constructing write batch from serialized representation	2014-07-07 10:17:43 -07:00
Reed Allman	1a34aaaef0	C API: column family support	2014-07-07 01:41:01 -07:00
Evan Shaw	9fc23d0c56	C API: support constructing write batch from serialized representation	2014-07-06 10:36:33 +12:00
Yueh-Hsuan Chiang	7b85c1e900	Improve SimpleWriteTimeoutTest to avoid false alarm. Summary: SimpleWriteTimeoutTest has two parts: 1) insert two large key/values to make memtable full and expect both of them are successful; 2) insert another key / value and expect it to be timed-out. Previously we also set a timeout in the first step, but this might sometimes cause false alarm. This diff makes the first two writes run without timeout setting. Test Plan: export ROCKSDB_TESTS=Time make db_test Reviewers: sdong, ljin Reviewed By: ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19461	2014-07-04 00:02:12 -07:00
Yueh-Hsuan Chiang	d33657a4a5	Fixed a warning in release mode. Summary: Removed a variable that is only used in assertion check. Test Plan: make release Reviewers: ljin, sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19455	2014-07-03 17:19:17 -07:00
Yueh-Hsuan Chiang	90a6aca48e	Finer report I/O stats about Flush and Compaction. Summary: This diff allows the I/O stats about Flush and Compaction to be reported in a more accurate way. Instead of measuring the size of a file, it measure I/O cost in per read / write basis. Test Plan: make all check Reviewers: sdong, igor, ljin Reviewed By: ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19383	2014-07-03 16:28:03 -07:00
Yueh-Hsuan Chiang	d4d338de33	Add timeout_hint_us to WriteOptions and introduce Status::TimeOut. Summary: This diff adds timeout_hint_us to WriteOptions. If it's non-zero, then 1) writes associated with this options MAY be aborted when it has been waiting for longer than the specified time. If an abortion happens, associated writes will return Status::TimeOut. 2) the stall time of the associated write caused by flush or compaction will be limited by timeout_hint_us. The default value of timeout_hint_us is 0 (i.e., OFF.) The statistics of timeout writes will be recorded in WRITE_TIMEDOUT. Test Plan: export ROCKSDB_TESTS=WriteTimeoutAndDelayTest make db_test ./db_test Reviewers: igor, ljin, haobo, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D18837	2014-07-03 15:47:02 -07:00
Igor Canadi	4203431e71	Fix mac os compile error	2014-07-03 23:03:24 +02:00
sdong	2459f7ec4e	Support Multiple DB paths (without having an interface to expose to users) Summary: In this patch, we allow RocksDB to support multiple DB paths internally. No user interface is supported yet so this patch is silent to users. Test Plan: make all check Reviewers: igor, haobo, ljin, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D18921	2014-07-02 21:14:44 -07:00
Igor Canadi	f146cab261	Centralize compression decision to compaction picker Summary: Before this diff, we're deciding enable_compression in CompactionPicker and then we're deciding final compression type in DBImpl. This is kind of confusing. After the diff, the final compression type will be decided in CompactionPicker. The reason for this is that I want CompactFiles() to specify output compression type, so that people can mix and match compression styles in their compaction algorithms. This diff makes it much easier to do that. Test Plan: make check Reviewers: dhruba, haobo, sdong, yhchiang, ljin Reviewed By: ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19137	2014-07-02 20:40:57 +02:00
sdong	1d05006740	Re-commit the correct part (WalDir) of the revision: Commit 6634844dba962b9a150646382f4d6531d1f2440b by sdong Two small fixes in db_test Summary: Two fixes: (1) WalDir to pick a directory under TmpDir to allow two tests running in parallel without impacting each other (2) kBlockBasedTableWithWholeKeyHashIndex is disabled by mistake (I assume). Enable it. Test Plan: ./db_test Reviewers: yhchiang, ljin Reviewed By: ljin Subscribers: nkg-, igor, dhruba, haobo, leveldb Differential Revision: https://reviews.facebook.net/D19389	2014-07-01 18:54:50 -07:00
sdong	30b20604db	Revert "Two small fixes in db_test" This reverts commit 6634844dba962b9a150646382f4d6531d1f2440b.	2014-07-01 17:41:38 -07:00
sdong	9c332aa11a	HashLinkList memtable switches a bucket to a skip list to reduce performance outliers Summary: In this patch, we enhance HashLinkList memtable to reduce performance outliers when a bucket contains too many entries. We switch to skip list for this case to enable binary search. Add threshold_use_skiplist parameter to determine when a bucket needs to switch to skip list. The new data structure is documented in comments in the codes. Test Plan: make all check set threshold_use_skiplist in several tests Reviewers: yhchiang, haobo, ljin Reviewed By: yhchiang, ljin Subscribers: nkg-, xjin, dhruba, yhchiang, leveldb Differential Revision: https://reviews.facebook.net/D19299	2014-07-01 17:14:15 -07:00
sdong	6634844dba	Two small fixes in db_test Summary: Two fixes: (1) WalDir to pick a directory under TmpDir to allow two tests running in parallel without impacting each other (2) kBlockBasedTableWithWholeKeyHashIndex is disabled by mistake (I assume). Enable it. Test Plan: ./db_test Reviewers: yhchiang, ljin Reviewed By: ljin Subscribers: nkg-, igor, dhruba, haobo, leveldb Differential Revision: https://reviews.facebook.net/D19389	2014-07-01 16:58:03 -07:00
Igor Canadi	f5d4df1c02	Fix compile error	2014-07-01 10:55:03 +02:00
Igor Canadi	a2e0d890ed	No need for files_by_size_ in universal compaction Summary: files_by_size_ is sorted by time in case of universal compaction. However, Version::files_ is also sorted by time. So no need for files_by_size_ Test Plan: 1) make check with the change 2) make check with `assert(last_index == c->input_version_->files_[level].size() - 1);` in compaction picker Reviewers: dhruba, haobo, yhchiang, sdong, ljin Reviewed By: ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19125	2014-07-01 08:55:04 +02:00
Feng Zhu	5656367416	use arena to allocate memtable's bloomfilter and hashskiplist's buckets_ Summary: Bloomfilter and hashskiplist's buckets_ allocated by memtable's arena DynamicBloom: pass arena via constructor, allocate space in SetTotalBits HashSkipListRep: allocate space of buckets_ using arena. do not delete it in deconstructor because arena would take care of it. Several test files are changed. Test Plan: make all check Reviewers: ljin, haobo, yhchiang, sdong Reviewed By: sdong Subscribers: igor, dhruba Differential Revision: https://reviews.facebook.net/D19335	2014-06-30 15:54:31 -07:00
sdong	dd337bc0b2	In logging format, use PRIu64 instead of casting Summary: Code cleaning up, since we are already using __STDC_FORMAT_MACROS in printing uint64_t, change other places. Only logging is changed. Test Plan: make all check Reviewers: ljin Reviewed By: ljin Subscribers: dhruba, yhchiang, haobo, leveldb Differential Revision: https://reviews.facebook.net/D19113	2014-06-27 16:34:15 -07:00
Stanislau Hlebik	a3594867ba	Cache some conditions for DBImpl::MakeRoomForWrite Summary: Task 4580155. Some conditions in DBImpl::MakeRoomForWrite can be cached in ColumnFamilyData, because theirs value can be changed only during compaction, adding new memtable and/or add recalculation of compaction score. These conditions are: cfd->imm()->size() == cfd->options()->max_write_buffer_number - 1 cfd->current()->NumLevelFiles(0) >= cfd->options()->level0_stop_writes_trigger cfd->options()->soft_rate_limit > 0.0 && (score = cfd->current()->MaxCompactionScore()) > cfd->options()->soft_rate_limit cfd->options()->hard_rate_limit > 1.0 && (score = cfd->current()->MaxCompactionScore()) > cfd->options()->hard_rate_limit P.S. As it's my first diff, Siying suggested to add everybody as a reviewers for this diff. Sorry, if I forgot someone or add someone by mistake. Test Plan: make all check Reviewers: haobo, xjin, dhruba, yhchiang, zagfox, ljin, sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19311	2014-06-26 16:45:27 -07:00
sdong	19de6a7aad	Remove MemTableRep::GetIterator(const Slice& slice) Summary: It seems to me that when ever function MemTableRep::GetIterator(const Slice& slice) is used, we can use MemTableRep::GetDynamicPrefixIterator() instead. Just delete it to simplify the codes. Test Plan: make all check Reviewers: yhchiang, ljin Reviewed By: ljin Subscribers: xjin, dhruba, haobo, leveldb Differential Revision: https://reviews.facebook.net/D19281	2014-06-25 14:09:29 -07:00
Yueh-Hsuan Chiang	8898a0a0d1	Reorder the member variables of FileMetaData to improve cache locality. Summary: Move stats related member variables of FileMetaData to the bottom to improve cache locality of normal DB operations. Test Plan: make Reviewers: haobo, ljin, sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19287	2014-06-24 19:22:11 -06:00
Yueh-Hsuan Chiang	e813f5b6d9	Allow compaction to reclaim storage more effectively. Summary: This diff allows compaction to reclaim storage more effectively. In the current design, compactions are mainly triggered based on the file sizes. However, since deletion entries does not have value, files which have many deletion entries are less likely to be compacted. As a result, it may took a while to make deletion entries to be compacted. This diff address issue by compensating the size of deletion entries during compaction process: the size of each deletion entry in the compaction process is augmented by 2x average value size. The diff applies to both leveled and universal compacitons. Test Plan: develop CompactionDeletionTrigger make db_test ./db_test Reviewers: haobo, igor, ljin, sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19029	2014-06-24 16:37:06 -06:00
Yueh-Hsuan Chiang	faa8d21922	Improve an assertion in RandomGenerator::Generate() in db_bench. Summary: RandomGenerator::Generate() currently has an assertion len < data_.size(). However, it is actually fine to have len == data_.size(). This diff change the assertion to len <= data_.size(). Test Plan: make db_bench ./db_bench Reviewers: haobo, sdong, ljin Reviewed By: ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19269	2014-06-24 15:29:28 -06:00
Lei Jin	3b0dc76699	db_bench: measure the real latency of write/delete Summary: as title Test Plan: make release Reviewers: haobo, sdong, yhchiang Reviewed By: yhchiang Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19227	2014-06-23 13:23:02 -07:00
Lei Jin	a1b5650a75	db_bench: sanity check on compression ratio Summary: as requested by mark Test Plan: make release Reviewers: sdong, haobo Reviewed By: haobo Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19221	2014-06-23 10:46:16 -07:00
Igor Canadi	d4a8423334	Remove seek compaction Summary: As discussed in our internal group, we don't get much use of seek compaction at the moment, while it's making code more complicated and slower in some cases. This diff removes seek compaction and (hopefully) all code that was introduced to support seek compaction. There is one test case that relied on didIO information. I'll try to find another way to implement it. Test Plan: make check Reviewers: sdong, haobo, yhchiang, ljin, dhruba Reviewed By: ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19161	2014-06-20 10:23:02 +02:00
Igor Canadi	107e08baa7	Use same sorting for all level 0 files Summary: We decided that one of the long term goals is to unify level and universal compaction. As a small first step, I'm unifying level 0 sorting methods. Previously, we used to sort level 0 files in level compaction by file number and in universal compaction by sequence number. But it turns out that in level compaction, sorting by file number is exactly the same as sorting by sequence number. Test Plan: Ran make check with bunch of asserts to verify the sorting order is exactly the same. Also, make check with this patch Reviewers: haobo, yhchiang, ljin, dhruba, sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19131	2014-06-20 09:12:14 +02:00
Haobo Xu	7a9dd5f214	[RocksDB] Make block based table hash index more adaptive Summary: Currently, RocksDB returns error if a db written with prefix hash index, is later opened without providing a prefix extractor. This is uncessarily harsh. Without a prefix extractor, we could always fallback to the normal binary index. Test Plan: unit test, also manually veried LOG that fallback did occur. Reviewers: sdong, ljin Reviewed By: ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19191	2014-06-19 16:40:32 -07:00
Yueh-Hsuan Chiang	4f5ccfd179	Fixed a potential write hang Summary: Currently, when something badly happen in the DB::Write() while the write-queue contains more than one element, the current design seems to forget to clean up the queue as well as wake-up all the writers, this potentially makes rocksdb hang on writes. Test Plan: make all check Reviewers: sdong, ljin, igor, haobo Reviewed By: haobo Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19167	2014-06-19 14:53:03 -07:00
Igor Canadi	bae495740d	Merge pull request #179 from edsrzf/c-api-compaction-filter Support for compaction filters in the C API	2014-06-19 21:22:46 +02:00

1 2 3 4 5 ...

1037 Commits