rocksdb

Author	SHA1	Message	Date
Kai Liu	219b35be6a	Generalize footer reading from file Summary: Generalizing this process will help us to re-use the code for plain table Test Plan: ran ./table_test	2013-12-04 16:35:48 -08:00
Kai Liu	5dec7acd91	Introducing the concept of NULL block handle	2013-12-04 15:45:14 -08:00
Kai Liu	3a0e98d558	Parameterize table magic number Summary: As we are having different types of tables and they all might share the same structure in block-based table: [metaindex block] [index block] [Footer] To be able to identify differnt types of tables, we need to parameterize the "magic number" in the `Footer`. Test Plan: make check	2013-12-04 15:17:40 -08:00
Siying Dong	f040e536e4	[RocksDB Performance Branch] A more customized index in PlainTableReader Summary: PlainTableReader to use a more customized hash table. This patch assumes the SST file is smaller than 2GB: (1) Every bucket uses 32-bit integer (2) no key is stored in bucket (3) use the first bit of the bucket value to distinguish it points to the file offset or a second level index. This index schema fits the use case that most of prefixes have very small number of keys Test Plan: plain_table_db_test Reviewers: haobo, kailiu, dhruba Reviewed By: haobo CC: nkg-, leveldb Differential Revision: https://reviews.facebook.net/D14343	2013-12-04 13:43:45 -08:00
Kai Liu	1966b63137	Merge branch 'master' into perf	2013-11-27 11:47:40 -08:00
Haobo Xu	5b825d6964	[RocksDB] Use raw pointer instead of shared pointer when passing Statistics object internally Summary: liveness of the statistics object is already ensured by the shared pointer in DB options. There's no reason to pass again shared pointer among internal functions. Raw pointer is sufficient and efficient. Test Plan: make check Reviewers: dhruba, MarkCallaghan, igor Reviewed By: dhruba CC: leveldb, reconnect.grayhat Differential Revision: https://reviews.facebook.net/D14289	2013-11-25 10:38:15 -08:00
Siying Dong	dfa1460d88	[For Performance Branch] Bloom filter in PlainTableIterator::Seek() - Update 1 Summary: Address @haobo's comments in D14277 Test Plan: ./indexed_table_db_test Reviewers: haobo CC: Task ID: # Blame Rev:	2013-11-21 23:33:45 -08:00
Siying Dong	718488abc5	Add BloomFilter to PlainTableIterator::Seek() Summary: This patch adds a simple bloom filter in PlainTableIterator::Seek() Test Plan: N/A Reviewers: CC: Task ID: # Blame Rev:	2013-11-21 22:26:39 -08:00
kailiu	0c93df912e	Improve the readability of the TableProperties::ToString()	2013-11-21 17:54:23 -08:00
Siying Dong	3e35aa6412	Revert "Allow users to profile a query and see bottleneck of the query" This reverts commit 3d8ac31d7168c916d6f2f0729eb627b07d8f082b.	2013-11-21 17:40:39 -08:00
Siying Dong	b135d01e7b	Allow users to profile a query and see bottleneck of the query Summary: Provide a framework to profile a query in detail to figure out latency bottleneck. Currently, in Get(), Put() and iterators, 2-3 simple timing is used. We can easily add more profile counters to the framework later. Test Plan: Enable this profiling in seveal existing tests. Reviewers: haobo, dhruba, kailiu, emayanke, vamsi, igor CC: leveldb Differential Revision: https://reviews.facebook.net/D14001 Conflicts: table/merger.cc	2013-11-21 17:39:19 -08:00
Siying Dong	3d8ac31d71	Allow users to profile a query and see bottleneck of the query Summary: Provide a framework to profile a query in detail to figure out latency bottleneck. Currently, in Get(), Put() and iterators, 2-3 simple timing is used. We can easily add more profile counters to the framework later. Test Plan: Enable this profiling in seveal existing tests. Reviewers: haobo, dhruba, kailiu, emayanke, vamsi, igor CC: leveldb Differential Revision: https://reviews.facebook.net/D14001	2013-11-21 16:29:57 -08:00
kailiu	7b10fe9fac	Fix a memory leak happened in table_test	2013-11-20 20:49:23 -08:00
Siying Dong	b59d4d5a50	A Simple Plain Table Summary: A Simple plain table format. No block structure. When creating the table reader, scanning the full table to create indexes. Test Plan:Add unit test Reviewers:haobo,dhruba,kailiu CC: Task ID: # Blame Rev:	2013-11-20 18:44:22 -08:00
kailiu	56589ab81f	Add TableOptions for BlockBasedTableFactory We are having more and more options to specify for this table so it makes sense to have a TableOptions for future extension.	2013-11-20 18:42:12 -08:00
Siying Dong	15b31b57df	MergingIterator.Seek() to lazily initialize MinHeap Summary: For the use cases that prefix filtering is enabled, initializing heaps when doing MergingIterator.Seek() might introduce non-negligible costs. This patch makes it lazily done. Test Plan: make all check Reviewers: haobo,dhruba,kailiu CC: Task ID: # Blame Rev:	2013-11-20 17:29:35 -08:00
kailiu	1c8b819be2	Fix a memory leak happened in table_test	2013-11-20 13:45:32 -08:00
kailiu	6eb5649800	Move flush_block_policy from Options to TableFactory Summary: Previously we introduce a `flush_block_policy_factory` in Options, however, that options is strongly releated to Table based tables. It will make more sense to move it to block based table's own factory class. Test Plan: make check to pass existing tests Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14211	2013-11-19 22:00:48 -08:00
Igor Canadi	469a9f32a7	Fix two nasty use-after-free-bugs Summary: These bugs were caught by ASAN crash test. 1. The first one, in table/filter_block.cc is very nasty. We first reference entries_ and store the reference to Slice prev. Then, we call entries_.append(), which can change the reference. The Slice prev now points to junk. 2. The second one is a bug in a test, so it's not very serious. Once we set read_opts.prefix, we never clear it, so some other function might still reference it. Test Plan: asan crash test now runs more than 5 mins. Before, it failed immediately. I will run the full one, but the full one takes quite some time (5 hours) Reviewers: dhruba, haobo, kailiu Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14223	2013-11-19 21:01:48 -08:00
kailiu	1415f8820d	Improve the "table stats" Summary: The primary motivation of the changes is to make it easier to figure out the inside of the tables. * rename "table stats" to "table properties" since now we have more than "integers" to store in the property block. * Add filter block size to the basic table properties. * Whenever a table is built, we'll log the table properties (the sample output is in Test Plan). * Make an api to expose deleted keys. Test Plan: Passed all existing test. and the sample output of table stats: ================================================================== Basic Properties ------------------------------------------------------------------ # data blocks: 1 # entries: 1 raw key size: 9 raw average key size: 9 raw value size: 9 raw average value size: 0 data block size: 25 index block size: 27 filter block size: 18 (estimated) table size: 70 filter policy: rocksdb.BuiltinBloomFilter ================================================================== User collected properties: InternalKeyPropertiesCollector ------------------------------------------------------------------ kDeletedKeys: 1 ================================================================== Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14187	2013-11-19 16:29:42 -08:00
kailiu	97d8e573a6	make util/env_posix.cc work under mac Summary: This diff invoves some more complicated issues in the posix environment. Test Plan: works under mac os. will need to verify dev box. Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14061	2013-11-16 23:44:39 -08:00
Kai Liu	7604e2f70c	Make the options in table_builder/block_builder less misleading Summary: By original design, the regular `block options` and index `block options` in table_builder is mutable. We can use ChangeOptions to change the options directly. However, with my last change, `BlockBuilder` no longer hold the reference to the index_block_options -- as a result, any changes made after the creation of index block builder will be of no effect. But still the code is very error-prone and developers can easily fall into the trap without aware of it. To avoid this problem from happening in the future, I deleted the `ChangeOptions` and the `index_block_options`, as well as many other changes to make it less misleading. Test Plan: make make check make release Reviewers: dhruba, haobo Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D13707	2013-11-16 23:21:15 -08:00
Siying Dong	55baa3d955	Add an option to table_reader_bench to access the table from DB And Iterating non-existing prefix case. Summary: This patch adds an option to table_reader_bench that queries run against DB level (which has one table). It is useful if user wants to see the extra costs DB level introduces. Test Plan: Run the benchmark with and without the new parameter Reviewers: haobo, dhruba, kailiu Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D13863	2013-11-15 22:23:12 -08:00
Igor Canadi	e0ad0f26b8	Fix bloom filters Summary: https://reviews.facebook.net/D13167 broke bloom filters. If filter is not in cache, we want to return true (safe thing). Am I right? Test Plan: when benchmarking https://reviews.facebook.net/D14031 I got different results when using bloom filters vs. when not using them. This fixed the issue. I will also be putting this change to the other diff, but that one will probably be in review for longer time. Reviewers: kailiu, dhruba, haobo Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D14085	2013-11-14 14:05:15 -08:00
Kai Liu	80bb81c6fe	Add the correct table_factory for tables in table_tests	2013-11-12 23:54:31 -08:00
Kai Liu	88ba331c1a	Add the index/filter block cache Summary: This diff leverage the existing block cache and extend it to cache index/filter block. Test Plan: Added new tests in db_test and table_test The correctness is checked by: 1. make check 2. make valgrind_check Performance is test by: 1. 10 times of build_tools/regression_build_test.sh on two versions of rocksdb before/after the code change. Test results suggests no significant difference between them. For the two key operatons `overwrite` and `readrandom`, the average iops are both 20k and ~260k, with very small variance). 2. db_stress. Reviewers: dhruba Reviewed By: dhruba CC: leveldb, haobo, xjin Differential Revision: https://reviews.facebook.net/D13167	2013-11-12 22:46:51 -08:00
kailiu	21587760b9	Fixing the warning messages captured under mac os # Consider using `git commit -m 'One line title' && arc diff`. # You will save time by running lint and unit in the background. Summary: The work to make sure mac os compiles rocksdb is not completed yet. But at least we can start cleaning some warnings captured only by g++ from mac os.. Test Plan: ran make in mac os Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14049	2013-11-12 20:05:28 -08:00
Kai Liu	0ef628537c	Don't not suggest flushing data when data block is still empty Summary: This diff fix the bug when the Options::block_size is too small.	2013-11-11 21:05:16 -08:00
Kai Liu	551ecfa416	Move down the time consuming tests in table_test Summary: it helps us to better check the tests we really care. Test Plan: make	2013-11-10 01:17:32 -08:00
Kai Liu	fd075d6edd	Provide mechanism to configure when to flush the block Summary: Allow block based table to configure the way flushing the blocks. This feature will allow us to add support for prefix-aligned block. Test Plan: make check Reviewers: dhruba, haobo, sdong, igor Reviewed By: sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D13875	2013-11-07 21:27:21 -08:00
Dhruba Borthakur	292c2b3357	Fix stress test failure when using mmap-reads. Summary: The mmap-read file->Read() does not use the scratch buffer to read in file-contents. Test Plan: ./db_stress --test_batches_snapshots=1 --ops_per_thread=100000000 --threads=32 --write_buffer_size=4194304 --destroy_db_initially=0 --reopen=0 --readpercent=45 --prefixpercent=5 --writepercent=35 --delpercent=5 --iterpercent=10 --db=/tmp/dhruba --max_key=100000000 --disable_seek_compaction=0 --mmap_read=1 --block_size=16384 --cache_size=1048576 --open_files=500000 --verify_checksum=1 --sync=1 --disable_wal=0 --disable_data_sync=0 --target_file_size_base=2097152 --target_file_size_multiplier=2 --max_write_buffer_number=3 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --filter_deletes=0 Reviewers: haobo, kailiu Reviewed By: kailiu CC: leveldb, kailiu, emayanke Differential Revision: https://reviews.facebook.net/D13923	2013-11-06 15:40:26 -08:00
Igor Canadi	36409e0016	Fix slow no-io iterator Summary: This fixes #3130525. Dhruba's suggestion and Tnovak's implementation :) The issue was with SkipEmptyDataBlocksForward(), but I also changed SkipEmptyDataBlocksBackward(). Is that OK? Test Plan: Run the logdevice test Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13911	2013-11-06 14:11:52 -08:00
Dhruba Borthakur	b4ad5e89ae	Implement a compressed block cache. Summary: Rocksdb can now support a uncompressed block cache, or a compressed block cache or both. Lookups first look for a block in the uncompressed cache, if it is not found only then it is looked up in the compressed cache. If it is found in the compressed cache, then it is uncompressed and inserted into the uncompressed cache. It is possible that the same block resides in the compressed cache as well as the uncompressed cache at the same time. Both caches have their own individual LRU policy. Test Plan: Unit test case attached. Reviewers: kailiu, sdong, haobo, leveldb Reviewed By: haobo CC: xjin, haobo Differential Revision: https://reviews.facebook.net/D12675	2013-11-01 14:31:35 -07:00
Siying Dong	82b7e37f6e	Fix a bug of table_reader_bench Summary: Iterator benchmark case is timed incorrectly. Fix it Test Plan: Run the benchmark Reviewers: haobo, dhruba Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D13845	2013-10-31 15:26:06 -07:00
Siying Dong	7caadf2e52	A very simple benchmark to measure Table implemenation's Get() And Iterator performance Summary: It is a very simple benchmark to measure a Table implementation's Get() and iterator performance if all the data is in memory. Test Plan: N/A Reviewers: dhruba, haobo, kailiu Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D13743	2013-10-31 13:38:54 -07:00
Naman Gupta	b4fab3be2a	Merge branch 'master' of github.com:facebook/rocksdb into inplace	2013-10-31 11:51:03 -07:00
Naman Gupta	fe25070242	In-place updates for equal keys and similar sized values Summary: Currently for each put, a fresh memory is allocated, and a new entry is added to the memtable with a new sequence number irrespective of whether the key already exists in the memtable. This diff is an attempt to update the value inplace for existing keys. It currently handles a very simple case: 1. Key already exists in the current memtable. Does not inplace update values in immutable memtable or snapshot 2. Latest value type is a 'put' ie kTypeValue 3. New value size is less than existing value, to avoid reallocating memory TODO: For a put of an existing key, deallocate memory take by values, for other value types till a kTypeValue is found, ie. remove kTypeMerge. TODO: Update the transaction log, to allow consistent reload of the memtable. Test Plan: Added a unit test verifying the inplace update. But some other unit tests broken due to invalid sequence number checks. WIll fix them next. Reviewers: xinyaohu, sumeet, haobo, dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12423 Automatic commit by arc	2013-10-31 11:27:12 -07:00
Siying Dong	f03b2df010	Follow-up Cleaning-up After D13521 Summary: This patch is to address @haobo's comments on D13521: 1. rename Table to be TableReader and make its factory function to be GetTableReader 2. move the compression type selection logic out of TableBuilder but to compaction logic 3. more accurate comments 4. Move stat name constants into BlockBasedTable implementation. 5. remove some uncleaned codes in simple_table_db_test Test Plan: pass test suites. Reviewers: haobo, dhruba, kailiu Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D13785	2013-10-30 10:52:33 -07:00
Kai Liu	79d8dad331	Change a typo in method signature	2013-10-28 21:23:17 -07:00
Siying Dong	d4eec30ed0	Make "Table" pluggable Summary: This patch makes Table and TableBuilder a abstract class and make all the implementation of the current table into BlockedBasedTable and BlockedBasedTable Builder. Test Plan: Make db_test.cc to work with block based table. Add a new test simple_table_db_test.cc where a different simple table format is implemented. Reviewers: dhruba, haobo, kailiu, emayanke, vamsi Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13521	2013-10-28 17:54:09 -07:00
Kai Liu	994575c134	Support user-defined table stats collector Summary: 1. Added a new option that support user-defined table stats collection. 2. Added a deleted key stats collector in `utilities` Test Plan: Added a unit test for newly added code. Also ran make check to make sure other tests are not broken. Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13491	2013-10-28 15:45:14 -07:00
Kai Liu	1ca86f0391	Fix a bug that index block's restart_block_interval is not 1 Summary: This bug may affect the seek performance. Test Plan: make make check Also gdb into some index block builder to make sure the restart_block_interval is `1`.	2013-10-28 10:51:34 -07:00
Kai Liu	43ee5e2b3a	Fix the valgrind error in newly added unittests for table stats Summary: Previous the newly added test called NewBloomFilter without releasing it at the end of the test, which resulted in memory leak and was detected by valgrind. Test Plan: Ran valgrind test.	2013-10-20 22:02:05 -07:00
Siying Dong	9edda37027	Universal Compaction to Have a Size Percentage Threshold To Decide Whether to Compress Summary: This patch adds a option for universal compaction to allow us to only compress output files if the files compacted previously did not yet reach a specified ratio, to save CPU costs in some cases. Compression is always skipped for flushing. This is because the size information is not easy to evaluate for flushing case. We can improve it later. Test Plan: add test DBTest.UniversalCompactionCompressRatio1 and DBTest.UniversalCompactionCompressRatio12 Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13467	2013-10-17 13:33:39 -07:00
Kai Liu	aac44226a0	Add bloom filter to predefined table stats Summary: As title. Test Plan: Updated the unit tests to make sure new statistic is correctly written/read. Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D13497	2013-10-17 11:43:06 -07:00
Dhruba Borthakur	9cd221094c	Add appropriate LICENSE and Copyright message. Summary: Add appropriate LICENSE and Copyright message. Test Plan: make check Reviewers: CC: Task ID: # Blame Rev:	2013-10-16 17:48:41 -07:00
Kai Liu	86ef6c3f74	Add statistics to sst file Summary: So far we only have key/value pairs as well as bloom filter stored in the sst file. It will be great if we are able to store more metadata about this table itself, for example, the entry size, bloom filter name, etc. This diff is the first step of this effort. It allows table to keep the basic statistics mentioned in http://fburl.com/14995441, as well as allowing writing user-collected stats to stats block. After this diff, we will figure out the interface of how to allow user to collect their interested statistics. Test Plan: 1. Added several unit tests. 2. Ran `make check` to ensure it doesn't break other tests. Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D13419	2013-10-14 15:56:13 -07:00
Kai Liu	1f8ade6bd6	Fix a bug in table builder Summary: In talbe.cc, when reading the metablock, it uses BytewiseComparator(); However in table_builder.cc, we use r->options.comparator. After tracing the creation of r->options.comparator, I found this comparator is an InternalKeyComparator, which wraps the user defined comparator(details can be found in DBImpl::SanitizeOptions(). I encountered this problem when adding metadata about "bloom filter" before. With different comparator, we may fail to do the binary sort. Current code works well since there is only one entry in meta block. Test Plan: make all check I've also tested this change in https://reviews.facebook.net/D8283 before. Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D13335	2013-10-07 18:53:19 -07:00
Dhruba Borthakur	4463b11cad	Migrate names of properties from 'leveldb' prefix to 'rocksdb' prefix. Summary: Migrate names of properties from 'leveldb' prefix to 'rocksdb' prefix. Test Plan: make check Reviewers: emayanke, haobo Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D13311	2013-10-06 00:14:26 -07:00
Dhruba Borthakur	a143ef9b38	Change namespace from leveldb to rocksdb Summary: Change namespace from leveldb to rocksdb. This allows a single application to link in open-source leveldb code as well as rocksdb code into the same process. Test Plan: compile rocksdb Reviewers: emayanke Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D13287	2013-10-04 11:59:26 -07:00

1 2 3

118 Commits