rocksdb

Author	SHA1	Message	Date
kshair	25ae380784	Fix comment spelling (#7960 ) Summary: terated -> treated Pull Request resolved: https://github.com/facebook/rocksdb/pull/7960 Reviewed By: ajkr Differential Revision: D26677005 Pulled By: zhichao-cao fbshipit-source-id: 6221305afb263aa60f674a4113aa30cb8f3914e6	2021-03-29 10:37:24 -07:00
kshair	d2e9eab1ea	Fix mis-spelling (#8001 ) Summary: concurrnet -> concurrent Pull Request resolved: https://github.com/facebook/rocksdb/pull/8001 Reviewed By: ajkr Differential Revision: D26659381 Pulled By: riversand963 fbshipit-source-id: 890d102d1cf836ed3b183da66d3d56a3158017d0	2021-03-09 01:19:18 -08:00
sdong	fdf882ded2	Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433 ) Summary: When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433 Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag. Differential Revision: D19977691 fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e	2020-02-20 12:09:57 -08:00
Lingjing You	1a928c22a0	Add insert hints for each writebatch (#5728 ) Summary: Add insert hints for each writebatch so that they can be used in concurrent write, and add write option to enable it. Bench result (qps): `./db_bench --benchmarks=fillseq -allow_concurrent_memtable_write=true -num=4000000 -batch-size=1 -threads=1 -db=/data3/ylj/tmp -write_buffer_size=536870912 -num_column_families=4` master:
| batch size \ thread num | 1 | 2 | 4 | 8 |
| ----------------------- | ------- | ------- | ------- | ------- |
| 1 | 387883 | 220790 | 308294 | 490998 |
| 10 | 1397208 | 978911 | 1275684 | 1733395 |
| 100 | 2045414 | 1589927 | 1798782 | 2681039 |
| 1000 | 2228038 | 1698252 | 1839877 | 2863490 |
fillseq with writebatch hint:
| batch size \ thread num | 1 | 2 | 4 | 8 |
| ----------------------- | ------- | ------- | ------- | ------- |
| 1 | 286005 | 223570 | 300024 | 466981 |
| 10 | 970374 | 813308 | 1399299 | 1753588 |
| 100 | 1962768 | 1983023 | 2676577 | 3086426 |
| 1000 | 2195853 | 2676782 | 3231048 | 3638143 |
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5728 Differential Revision: D17297240 fbshipit-source-id: b053590a6d77871f1ef2f911a7bd013b3899b26c	2019-09-12 17:15:18 -07:00
Siying Dong	1f7f5a5a79	Run automatic formatter against public header files (#5115 ) Summary: Automatically format public headers so it looks more consistent. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5115 Differential Revision: D14632854 Pulled By: siying fbshipit-source-id: ce9929ea62f9dcd65c69660b23eed1931cb0ae84	2019-03-27 13:24:25 -07:00
Siying Dong	cf3a671733	Remove cuckoo hash memtable (#4953 ) Summary: Cuckoo Hash is less useful than we initially expected. Remove it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4953 Differential Revision: D13979264 Pulled By: siying fbshipit-source-id: 2a60afdaa989f045357398b43a1cc5d46f4492ed	2019-02-07 16:15:27 -08:00
Gauresh Rane	ad789e4e0d	Adding a method for memtable class for memtable getting flushed. (#4304 ) Summary: Memtables are selected for flushing by the flush job. Currently we have listener which is invoked when memtables for a column family are flushed. That listener does not indicate which memtable was flushed in the notification. If clients want to know if particular data in the memtable was retired, there is no straight forward way to know this. This method will help users who implement memtablerep factory and extend interface for memtablerep, to know if the data in the memtable was retired. Another option that was tried, was to depend on memtable destructor to be called after flush to mark that data was persisted. This works all the time but sometimes there can huge delays between actual flush happening and memtable getting destroyed. Hence, if anyone who is waiting for data to persist will have to wait that longer. It is expected that anyone who is implementing this method to have return quickly as it blocks RocksDB. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4304 Reviewed By: riversand963 Differential Revision: D9472312 Pulled By: gdrane fbshipit-source-id: 8e693308dee749586af3a4c5d4fcf1fa5276ea4d	2018-08-23 17:14:25 -07:00
Radoslaw Zarzynski	09b6bf828a	InlineSkiplist: don't decode keys unnecessarily during comparisons Summary: Summary ======== `InlineSkipList<>::Insert` takes the `key` parameter as a C-string. Then, it performs multiple comparisons with it requiring the `GetLengthPrefixedSlice()` to be spawn in `MemTable::KeyComparator::operator()(const char* prefix_len_key1, const char* prefix_len_key2)` on the same data over and over. The patch tries to optimize that. Rough performance comparison ===== Big keys, no compression. ``` $ ./db_bench --writes 20000000 --benchmarks="fillrandom" --compression_type none -key_size 256 (...) fillrandom : 4.222 micros/op 236836 ops/sec; 80.4 MB/s ``` ``` $ ./db_bench --writes 20000000 --benchmarks="fillrandom" --compression_type none -key_size 256 (...) fillrandom : 4.064 micros/op 246059 ops/sec; 83.5 MB/s ``` TODO ====== In ~~a separated~~ this PR: - [x] Go outside the write path. Maybe even eradicate the C-string-taking variant of `KeyIsAfterNode` entirely. - [x] Try to cache the transformations applied by `KeyComparator` & friends in situations where we havy many comparisons with the same key. Closes https://github.com/facebook/rocksdb/pull/3516 Differential Revision: D7059300 Pulled By: ajkr fbshipit-source-id: 6f027dbb619a488129f79f79b5f7dbe566fb2dbb	2018-03-23 12:14:30 -07:00
Bruce Mitchener	a3a3f5497c	Fix some typos in comments and docs. Summary: Closes https://github.com/facebook/rocksdb/pull/3568 Differential Revision: D7170953 Pulled By: siying fbshipit-source-id: 9cfb8dd88b7266da920c0e0c1e10fb2c5af0641c	2018-03-08 10:27:25 -08:00
Andrew Kryczka	5d68243e61	Comment out unused variables Summary: Submitting on behalf of another employee. Closes https://github.com/facebook/rocksdb/pull/3557 Differential Revision: D7146025 Pulled By: ajkr fbshipit-source-id: 495ca5db5beec3789e671e26f78170957704e77e	2018-03-05 13:13:41 -08:00
Igor Sugak	aba3409740	Back out "[codemod] - comment out unused parameters" Reviewed By: igorsugak fbshipit-source-id: 4a93675cc1931089ddd574cacdb15d228b1e5f37	2018-02-22 12:43:17 -08:00
David Lai	f4a030ce81	- comment out unused parameters Reviewed By: everiq, igorsugak Differential Revision: D7046710 fbshipit-source-id: 8e10b1f1e2aecebbfb229c742e214db887e5a461	2018-02-22 09:44:23 -08:00
Maysam Yabandeh	8eb1d445c3	Unbreak MemTableRep API change Summary: The MemTableRep API was broken by this commit: `813719e952` This patch reverts the changes and instead adds InsertKey (and etc.) overloads to extend the MemTableRep API without breaking the existing classes that inherit from it. Closes https://github.com/facebook/rocksdb/pull/3513 Differential Revision: D7004134 Pulled By: maysamyabandeh fbshipit-source-id: e568d91fe1e17dd76c0c1f6c7dd51a18633b1c4f	2018-02-15 17:27:24 -08:00
Maysam Yabandeh	813719e952	WritePrepared Txn: Duplicate Keys, Memtable part Summary: Currently DB does not accept duplicate keys (keys with the same user key and the same sequence number). If Memtable returns false when receiving such keys, we can benefit from this signal to properly increase the sequence number in the rare cases when we have a duplicate key in the write batch written to DB under WritePrepared transactions. Closes https://github.com/facebook/rocksdb/pull/3418 Differential Revision: D6822412 Pulled By: maysamyabandeh fbshipit-source-id: adea3ce5073131cd38ed52b16bea0673b1a19e77	2018-01-31 18:57:07 -08:00
topilski	b9873162f0	Fixed get version on windows, moved throwing exceptions into cc file. Summary: Fixes for msys2 and mingw, hide exceptions into cpp file. Closes https://github.com/facebook/rocksdb/pull/3377 Differential Revision: D6746707 Pulled By: yiwu-arbug fbshipit-source-id: 456b38df80bc48b8386a2cf87f669b5a4f9999a4	2018-01-18 14:56:56 -08:00
Sagar Vemuri	72502cf227	Revert "comment out unused parameters" Summary: This reverts the previous commit `1d7048c598`, which broke the build. Did a `git revert 1d7048c`. Closes https://github.com/facebook/rocksdb/pull/2627 Differential Revision: D5476473 Pulled By: sagar0 fbshipit-source-id: 4756ff5c0dfc88c17eceb00e02c36176de728d06	2017-07-21 18:26:26 -07:00
Victor Gao	1d7048c598	comment out unused parameters Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually. Reviewed By: igorsugak Differential Revision: D5454343 fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2	2017-07-21 14:57:44 -07:00
Siying Dong	3c327ac2d0	Change RocksDB License Summary: Closes https://github.com/facebook/rocksdb/pull/2589 Differential Revision: D5431502 Pulled By: siying fbshipit-source-id: 8ebf8c87883daa9daa54b2303d11ce01ab1f6f75	2017-07-15 16:11:23 -07:00
Siying Dong	95b0e89b5d	Improve write buffer manager (and allow the size to be tracked in block cache) Summary: Improve write buffer manager in several ways: 1. Size is tracked when arena block is allocated, rather than every allocation, so that it can better track actual memory usage and the tracking overhead is slightly lower. 2. We start to trigger memtable flush when 7/8 of the memory cap hits, instead of 100%, and make 100% much harder to hit. 3. Allow a cache object to be passed into buffer manager and the size allocated by memtable can be costed there. This can help users have one single memory cap across block cache and memtable. Closes https://github.com/facebook/rocksdb/pull/2350 Differential Revision: D5110648 Pulled By: siying fbshipit-source-id: b4238113094bf22574001e446b5d88523ba00017	2017-06-02 14:26:56 -07:00
Andrew Kryczka	a4d9c02511	Pass CF ID to MemTableRepFactory Summary: Some users want to monitor column family activity in their custom memtable implementations. Previously there was no way to figure out with which column family a memtable is associated. This diff: - adds an overload to MemTableRepFactory::CreateMemTableRep() that provides the CF ID. For compatibility, its default implementation calls the old overload. - updates MemTable to create MemTableRep's using the new overload. Closes https://github.com/facebook/rocksdb/pull/2346 Differential Revision: D5108061 Pulled By: ajkr fbshipit-source-id: 3a1921214a348dd8ea0f54e1cab3b71c3d46d616	2017-06-02 12:12:06 -07:00
Yi Wu	dfb6fe6755	Unified InlineSkipList::Insert algorithm with hinting Summary: This PR is based on nbronson's diff with small modifications to wire it up with existing interface. Comparing to previous version, this approach works better for inserting keys in decreasing order or updating the same key, and impose less restriction to the prefix extractor. ---- Summary from original diff ---- This diff introduces a single InlineSkipList::Insert that unifies the existing sequential insert optimization (prev_), concurrent insertion, and insertion using externally-managed insertion point hints. There's a deep symmetry between insertion hints (cursors) and the concurrent algorithm. In both cases we have partial information from the recent past that is likely but not certain to be accurate. This diff introduces the struct InlineSkipList::Splice, which encodes predecessor and successor information in the same form that was previously only used within a single call to InsertConcurrently. Splice holds information about an insertion point that can be used to levera Closes https://github.com/facebook/rocksdb/pull/1561 Differential Revision: D4217283 Pulled By: yiwu-arbug fbshipit-source-id: 33ee437	2016-11-22 14:09:13 -08:00
Yi Wu	1ea79a78c9	Optimize sequential insert into memtable - Part 1: Interface Summary: Currently our skip-list have an optimization to speedup sequential inserts from a single stream, by remembering the last insert position. We extend the idea to support sequential inserts from multiple streams, and even tolerate small reordering wihtin each stream. This PR is the interface part adding the following: - Add `memtable_insert_prefix_extractor` to allow specifying prefix for each key. - Add `InsertWithHint()` interface to memtable, to allow underlying implementation to return a hint of insert position, which can be later pass back to optimize inserts. - Memtable will maintain a map from prefix to hints and pass the hint via `InsertWithHint()` if `memtable_insert_prefix_extractor` is non-null. Closes https://github.com/facebook/rocksdb/pull/1419 Differential Revision: D4079367 Pulled By: yiwu-arbug fbshipit-source-id: 3555326	2016-11-13 19:09:18 -08:00
Aaron Gao	f517d9dd09	Add SeekForPrev() to Iterator Summary: Add new Iterator API, `SeekForPrev`: find the last key that <= target key support prefix_extractor support prefix_same_as_start support upper_bound not supported in iterators without Prev() Also add tests in db_iter_test and db_iterator_test Pass all tests Cheers! Test Plan: make all check -j64 Reviewers: andrewkr, yiwu, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64149	2016-09-27 18:20:57 -07:00
Baraa Hamodi	21e95811d1	Updated all copyright headers to the new format.	2016-02-09 15:12:00 -08:00
sdong	b1887c5dd9	Explictly fail when memtable doesn't support concurrent insert Summary: If users turn on concurrent insert but the memtable doesn't support it, they might see unexcepted crash. Fix it by explicitly fail. Test Plan: Run different setting of stress_test and make sure it fails correctly. Will add a unit test too. Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, andrewkr, ngbronson Reviewed By: ngbronson Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D53895	2016-02-05 14:15:50 -08:00
Nathan Bronson	7d87f02799	support for concurrent adds to memtable Summary: This diff adds support for concurrent adds to the skiplist memtable implementations. Memory allocation is made thread-safe by the addition of a spinlock, with small per-core buffers to avoid contention. Concurrent memtable writes are made via an additional method and don't impose a performance overhead on the non-concurrent case, so parallelism can be selected on a per-batch basis. Write thread synchronization is an increasing bottleneck for higher levels of concurrency, so this diff adds --enable_write_thread_adaptive_yield (default off). This feature causes threads joining a write batch group to spin for a short time (default 100 usec) using sched_yield, rather than going to sleep on a mutex. If the timing of the yield calls indicates that another thread has actually run during the yield then spinning is avoided. This option improves performance for concurrent situations even without parallel adds, although it has the potential to increase CPU usage (and the heuristic adaptation is not yet mature). Parallel writes are not currently compatible with inplace updates, update callbacks, or delete filtering. Enable it with --allow_concurrent_memtable_write (and --enable_write_thread_adaptive_yield). Parallel memtable writes are performance neutral when there is no actual parallelism, and in my experiments (SSD server-class Linux and varying contention and key sizes for fillrandom) they are always a performance win when there is more than one thread. Statistics are updated earlier in the write path, dropping the number of DB mutex acquisitions from 2 to 1 for almost all cases. This diff was motivated and inspired by Yahoo's cLSM work. 
It is more conservative than cLSM: RocksDB's write batch group leader role is preserved (along with all of the existing flush and write throttling logic) and concurrent writers are blocked until all memtable insertions have completed and the sequence number has been advanced, to preserve linearizability. My test config is "db_bench -benchmarks=fillrandom -threads=$T -batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -disable_auto_compactions --max_write_buffer_number=8 -max_background_flushes=8 --disable_wal --write_buffer_size=160000000 --block_size=16384 --allow_concurrent_memtable_write" on a two-socket Xeon E5-2660 @ 2.2Ghz with lots of memory and an SSD hard drive. With 1 thread I get ~440Kops/sec. Peak performance for 1 socket (numactl -N1) is slightly more than 1Mops/sec, at 16 threads. Peak performance across both sockets happens at 30 threads, and is ~900Kops/sec, although with fewer threads there is less performance loss when the system has background work. Test Plan: 1. concurrent stress tests for InlineSkipList and DynamicBloom 2. make clean; make check 3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench 4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench 5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench 6. make clean; OPT=-DROCKSDB_LITE make check 7. verify no perf regressions when disabled Reviewers: igor, sdong Reviewed By: sdong Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba Differential Revision: https://reviews.facebook.net/D50589	2015-12-25 11:03:40 -08:00
sdong	40f562e747	Allow GetApproximateSize() to include mem table size if it is skip list memtable Summary: Add an option in GetApproximateSize() so that the result will include estimated sizes in mem tables. To implement it, implement an estimated count from the beginning to a key in skip list. The approach is to count to find the entry, how many Next() is issued from each level, and sum them with a weight that is <branching factor> ^ <level>. Test Plan: Add a test case Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D40119	2015-06-16 18:13:23 -07:00
agiardullo	84c5bd7eb9	Add thread-safety documentation to MemTable and related classes Summary: Other than making some class members private, this is a documentation-only change Test Plan: unit tests Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D36567	2015-04-08 21:10:35 -07:00
Jonah Cohen	a14b7873ee	Enforce write buffer memory limit across column families Summary: Introduces a new class for managing write buffer memory across column families. We supplement ColumnFamilyOptions::write_buffer_size with ColumnFamilyOptions::write_buffer, a shared pointer to a WriteBuffer instance that enforces memory limits before flushing out to disk. Test Plan: Added SharedWriteBuffer unit test to db_test.cc Reviewers: sdong, rven, ljin, igor Reviewed By: igor Subscribers: tnovak, yhchiang, dhruba, xjin, MarkCallaghan, yoshinorim Differential Revision: https://reviews.facebook.net/D22581	2014-12-02 12:09:20 -08:00
Tomislav Novak	88edfd90ae	SkipListRep::LookaheadIterator Summary: This diff introduces the `lookahead` argument to `SkipListFactory()`. This is an optimization for the tailing use case which includes many seeks. E.g. consider the following operations on a skip list iterator: Seek(x), Next(), Next(), Seek(x+2), Next(), Seek(x+3), Next(), Next(), ... If `lookahead` is positive, `SkipListRep` will return an iterator which also keeps track of the previously visited node. Seek() then first does a linear search starting from that node (up to `lookahead` steps). As in the tailing example above, this may require fewer than ~log(n) comparisons as with regular skip list search. Test Plan: Added a new benchmark (`fillseekseq`) which simulates the usage pattern. It first writes N records (with consecutive keys), then measures how much time it takes to read them by calling `Seek()` and `Next()`. $ time ./db_bench -num 10000000 -benchmarks fillseekseq -prefix_size 1 \ -key_size 8 -write_buffer_size $[1024*1024*1024] -value_size 50 \ -seekseq_next 2 -skip_list_lookahead=0 [...] DB path: [/dev/shm/rocksdbtest/dbbench] fillseekseq : 0.389 micros/op 2569047 ops/sec; real 0m21.806s user 0m12.106s sys 0m9.672s $ time ./db_bench [...] -skip_list_lookahead=2 [...] DB path: [/dev/shm/rocksdbtest/dbbench] fillseekseq : 0.153 micros/op 6540684 ops/sec; real 0m19.469s user 0m10.192s sys 0m9.252s Reviewers: ljin, sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb, march, lovro Differential Revision: https://reviews.facebook.net/D23997	2014-10-07 11:48:23 -07:00
sdong	9c332aa11a	HashLinkList memtable switches a bucket to a skip list to reduce performance outliers Summary: In this patch, we enhance HashLinkList memtable to reduce performance outliers when a bucket contains too many entries. We switch to skip list for this case to enable binary search. Add threshold_use_skiplist parameter to determine when a bucket needs to switch to skip list. The new data structure is documented in comments in the codes. Test Plan: make all check set threshold_use_skiplist in several tests Reviewers: yhchiang, haobo, ljin Reviewed By: yhchiang, ljin Subscribers: nkg-, xjin, dhruba, yhchiang, leveldb Differential Revision: https://reviews.facebook.net/D19299	2014-07-01 17:14:15 -07:00
sdong	19de6a7aad	Remove MemTableRep::GetIterator(const Slice& slice) Summary: It seems to me that when ever function MemTableRep::GetIterator(const Slice& slice) is used, we can use MemTableRep::GetDynamicPrefixIterator() instead. Just delete it to simplify the codes. Test Plan: make all check Reviewers: yhchiang, ljin Reviewed By: ljin Subscribers: xjin, dhruba, haobo, leveldb Differential Revision: https://reviews.facebook.net/D19281	2014-06-25 14:09:29 -07:00
sdong	8c265c08f1	HashLinkList to log distribution of number of entries aross buckets Summary: Add two parameters of hash linked list to log distribution of number of entries across all buckets, and a sample row when there are too many entries in one single bucket. Test Plan: Turn it on in plain_table_db_test and see the logs. Reviewers: haobo, ljin Reviewed By: ljin Subscribers: leveldb, nkg-, dhruba, yhchiang Differential Revision: https://reviews.facebook.net/D19095	2014-06-17 17:55:36 -07:00
sdong	df9069d23f	In DB::NewIterator(), try to allocate the whole iterator tree in an arena Summary: In this patch, try to allocate the whole iterator tree starting from DBIter from an arena 1. ArenaWrappedDBIter is created when serves as the entry point of an iterator tree, with an arena in it. 2. Add an option to create iterator from arena for following iterators: DBIter, MergingIterator, MemtableIterator, all mem table's iterators, all table reader's iterators and two level iterator. 3. MergeIteratorBuilder is created to incrementally build the tree of internal iterators. It is passed to mem table list and version set and add iterators to it. Limitations: (1) Only DB::NewIterator() without tailing uses the arena. Other cases, including readonly DB and compactions are still from malloc (2) Two level iterator itself is allocated in arena, but not iterators inside it. Test Plan: make all check Reviewers: ljin, haobo Reviewed By: haobo Subscribers: leveldb, dhruba, yhchiang, igor Differential Revision: https://reviews.facebook.net/D18513	2014-06-02 17:44:57 -07:00
sdong	3a171dcb51	Pass logger to memtable rep and TLB page allocation error logged to info logs Summary: TLB page allocation errors are now logged to info logs, instead of stderr. In order to do that, mem table rep's factory functions take a info logger now. Test Plan: make all check Reviewers: haobo, igor, yhchiang Reviewed By: yhchiang CC: leveldb, yhchiang, dhruba Differential Revision: https://reviews.facebook.net/D18471	2014-05-05 16:43:37 -07:00
sdong	4a7c747064	Revert "Revert "Allow allocating dynamic bloom, plain table indexes and hash linked list from huge page TLB"" And make the default 0 for hash linked list memtable This reverts commit `d69dc64be7`.	2014-05-04 13:56:29 -07:00
Igor Canadi	d69dc64be7	Revert "Allow allocating dynamic bloom, plain table indexes and hash linked list from huge page TLB" This reverts commit `7dafa3a1d7`.	2014-05-04 08:37:09 -07:00
sdong	7dafa3a1d7	Allow allocating dynamic bloom, plain table indexes and hash linked list from huge page TLB Summary: Add an option to allocate a piece of memory from huge page TLB. Add options to trigger it in dynamic bloom, plain table indexes andhash linked list hash table. Test Plan: make all check Reviewers: haobo, ljin Reviewed By: haobo CC: nkg-, dhruba, leveldb, igor, yhchiang Differential Revision: https://reviews.facebook.net/D18357	2014-04-30 11:02:26 -07:00
Yueh-Hsuan Chiang	9d9d2965cb	Add a new mem-table representation based on cuckoo hash. Summary: = Major Changes = * Add a new mem-table representation, HashCuckooRep, which is based cuckoo hash. Cuckoo hash uses multiple hash functions. This allows each key to have multiple possible locations in the mem-table. - Put: When insert a key, it will try to find whether one of its possible locations is vacant and store the key. If none of its possible locations are available, then it will kick out a victim key and store at that location. The kicked-out victim key will then be stored at a vacant space of its possible locations or kick-out another victim. In this diff, the kick-out path (known as cuckoo-path) is found using BFS, which guarantees to be the shortest. - Get: Simply tries all possible locations of a key --- this guarantees worst-case constant time complexity. - Time complexity: O(1) for Get, and average O(1) for Put if the fullness of the mem-table is below 80%. - Default using two hash functions, the number of hash functions used by the cuckoo-hash may dynamically increase if it fails to find a short-enough kick-out path. - Currently, HashCuckooRep does not support iteration and snapshots, as our current main purpose of this is to optimize point access. = Minor Changes = * Add IsSnapshotSupported() to DB to indicate whether the current DB supports snapshots. If it returns false, then DB::GetSnapshot() will always return nullptr. Test Plan: Run existing tests. Will develop a test specifically for cuckoo hash in the next diff. Reviewers: sdong, haobo Reviewed By: sdong CC: leveldb, dhruba, igor Differential Revision: https://reviews.facebook.net/D16155	2014-04-29 17:13:46 -07:00
Lei Jin	3995e801ab	kill ReadOptions.prefix and .prefix_seek Summary: also add an override option total_order_iteration if you want to use full iterator with prefix_extractor Test Plan: make all check Reviewers: igor, haobo, sdong, yhchiang Reviewed By: haobo CC: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D17805	2014-04-25 12:21:34 -07:00
Igor Canadi	588bca2020	RocksDBLite Summary: Introducing RocksDBLite! Removes all the non-essential features and reduces the binary size. This effort should help our adoption on mobile. Binary size when compiling for IOS (`TARGET_OS=IOS m static_lib`) is down to 9MB from 15MB (without stripping) Test Plan: compiles :) Reviewers: dhruba, haobo, ljin, sdong, yhchiang Reviewed By: yhchiang CC: leveldb Differential Revision: https://reviews.facebook.net/D17835	2014-04-15 13:39:26 -07:00
Lei Jin	c90d446ee7	make hash_link_list Node's key space consecutively followed at the end Summary: per sdong's request, this will help processor prefetch on n->key case. Test Plan: make all check Reviewers: sdong, haobo, igor Reviewed By: sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D17415	2014-04-04 15:37:28 -07:00
Lei Jin	8d007b4aaf	Consolidate SliceTransform object ownership Summary: (1) Fix SanitizeOptions() to also check HashLinkList. The current dynamic case just happens to work because the 2 classes have the same layout. (2) Do not delete SliceTransform object in HashSkipListFactory and HashLinkListFactory destructor. Reason: SanitizeOptions() enforces prefix_extractor and SliceTransform to be the same object when HashFactory is used. This makes the behavior strange: when HashFactory is used, prefix_extractor will be released by RocksDB. If other memtable factory is used, prefix_extractor should be released by user. Test Plan: db_bench && make asan_check Reviewers: haobo, igor, sdong Reviewed By: igor CC: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D16587	2014-03-10 12:56:46 -07:00
Siying Dong	33042669f6	Reduce malloc of iterators in Get() code paths Summary: This patch optimized Get() code paths by avoiding malloc of iterators. Iterator creation is moved to mem table rep implementations, where a callback is called when any key is found. This is the same practice as what we do in (SST) table readers. db_bench result for readrandom following a writeseq, with no compression, single thread and tmpfs, we see throughput improved to 144958 from 139027, about 3%. Test Plan: make all check Reviewers: dhruba, haobo, igor Reviewed By: haobo CC: leveldb, yhchiang Differential Revision: https://reviews.facebook.net/D14685	2014-02-11 10:32:51 -08:00
Dhruba Borthakur	0982c38020	Fix compilation error with gcc 4.7 Summary: Fix compilation error with gcc 4.7 Test Plan: make clean make Reviewers: CC: Task ID: # Blame Rev:	2014-02-07 13:52:54 -08:00
Yueh-Hsuan Chiang	3ce8d9a988	Add support for plain table format to sst_dump. Summary: This diff enables the command line tool `sst_dump` to work for sst files under plain table format. Changes include: * In tools/sst_dump.cc: - add support for plain table format - display prefix_extractor information when --show_properties is on * In table/format.cc - Now the table magic number of a Footer can be later initialized via ReadFooterFromFile(). * In table/meta_bocks: - add function ReadTableMagicNumber() that reads the magic number of the specified file. Minor fixes: - remove a duplicate #include in table/table_test.cc - fix a commentary typo in include/rocksdb/memtablerep.h - fix lint errors. Test Plan: Runs sst_dump with both block-based and plain-table format files with different arguments, specifically those with --show-properties and --from. * sample output: https://reviews.facebook.net/P261 Reviewers: kailiu, sdong, xjin CC: leveldb Differential Revision: https://reviews.facebook.net/D15903	2014-02-07 11:15:00 -08:00
kailiu	4e0298f23c	Clean up arena API Summary: Easy thing goes first. This patch moves arena to internal dir; based on which, the coming patch will deal with memtable_rep. Test Plan: make check Reviewers: haobo, sdong, dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D15615	2014-01-30 22:10:10 -08:00
kailiu	3170abd297	Remove unused classes Summary: This is a followup diff for https://reviews.facebook.net/D15447, which picks the most simple task: delete some unused memtable reps. Test Plan: make Reviewers: haobo, dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D15585	2014-01-29 16:40:36 -08:00
kailiu	a5e220f5ef	Merge branch 'master' into performance Conflicts: Makefile db/db_impl.cc db/db_test.cc db/memtable_list.cc db/memtable_list.h table/block_based_table_reader.cc table/table_test.cc util/cache.cc util/coding.cc	2014-01-28 10:35:55 -08:00
Siying Dong	b20486f294	[Performance Branch] HashLinkList to avoid to convert length prefixed string back to internal keys Summary: Converting from length prefixed buffer back to internal key costs some CPU but it is not necessary. In this patch, internal keys are pass though the functions so that we don't need to convert back to it. Test Plan: make all check Reviewers: haobo, kailiu Reviewed By: kailiu CC: igor, leveldb Differential Revision: https://reviews.facebook.net/D15393	2014-01-27 10:26:14 -08:00


71 Commits
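Commit fdf882ded2 turns the RocksDB namespace into a build-time flag. The sketch below shows the usual shape of such a flag, assuming an `#ifndef` fallback like the one RocksDB's public headers appear to use; `LibraryTag` and the `DEMO_STR` helpers are hypothetical names invented only for this illustration.

```cpp
#include <cassert>
#include <string>

// Fallback used when the build does not pass -DROCKSDB_NAMESPACE=<name>.
#ifndef ROCKSDB_NAMESPACE
#define ROCKSDB_NAMESPACE rocksdb
#endif

// Declarations live in the configurable namespace, so two builds compiled
// with different names can be linked into the same binary.
namespace ROCKSDB_NAMESPACE {
// Hypothetical symbol, used only to show where declarations end up.
inline std::string LibraryTag() { return "memtablerep-demo"; }
}  // namespace ROCKSDB_NAMESPACE

// Hypothetical helpers: stringify the macro's expanded value for inspection.
#define DEMO_STR_IMPL(x) #x
#define DEMO_STR(x) DEMO_STR_IMPL(x)
```

Compiling the same file with `-DROCKSDB_NAMESPACE=myrocks` would place `LibraryTag` in `namespace myrocks` instead, which is what lets two differently named builds coexist.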
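Commits 1ea79a78c9, dfb6fe6755 and 1a928c22a0 describe remembering an insert position per prefix (or per write batch) and passing it back on the next insert. A minimal sketch of that idea, assuming a `std::set` in place of the skip list and a fixed-length prefix in place of `memtable_insert_prefix_extractor`; `HintedSortedTable` is hypothetical and is not RocksDB's `InsertWithHint()` API.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in: std::set plays the skip list, and the first
// `prefix_len` bytes of a key play the prefix extractor.
class HintedSortedTable {
 public:
  explicit HintedSortedTable(std::size_t prefix_len) : prefix_len_(prefix_len) {}

  void Insert(const std::string& key) {
    const std::string prefix = key.substr(0, prefix_len_);
    std::set<std::string>::iterator pos;
    auto hint = hints_.find(prefix);
    if (hint == hints_.end()) {
      // First key for this prefix: ordinary O(log n) insert.
      pos = keys_.insert(key).first;
    } else {
      // Amortized O(1) when `key` belongs immediately before the hint;
      // otherwise std::set falls back to a normal logarithmic insert.
      pos = keys_.insert(hint->second, key);
    }
    // Remember the slot just past this key: the next key of an ascending
    // stream with the same prefix most likely belongs there.
    hints_[prefix] = std::next(pos);
  }

  std::size_t size() const { return keys_.size(); }
  const std::set<std::string>& keys() const { return keys_; }

 private:
  std::size_t prefix_len_;
  std::set<std::string> keys_;
  // One remembered insert position per prefix ("stream").
  std::map<std::string, std::set<std::string>::iterator> hints_;
};
```

A stale hint never breaks correctness in this sketch, because `std::set::insert(hint, value)` still places the key in order; it only loses the amortized constant-time fast path.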
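Commits a4d9c02511 and 8eb1d445c3 extend the MemTableRep interfaces by adding overloads whose default implementations forward to the existing methods, so subclasses written against the old API keep compiling and behaving as before. The pattern is sketched here with hypothetical, heavily simplified classes; the real `MemTableRepFactory::CreateMemTableRep()` takes more parameters.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical, simplified factory interface.
class RepFactory {
 public:
  virtual ~RepFactory() = default;

  // Original entry point that existing subclasses already override.
  virtual std::string Create() = 0;

  // Newer overload that also receives the column family ID. The default
  // forwards to the old overload, so old subclasses need no changes.
  virtual std::string Create(std::uint32_t /*column_family_id*/) {
    return Create();
  }
};

// A subclass written against the old API only.
class LegacyFactory : public RepFactory {
 public:
  using RepFactory::Create;  // keep the inherited overload visible
  std::string Create() override { return "legacy"; }
};

// A subclass that opts in to the extra information.
class CfAwareFactory : public RepFactory {
 public:
  using RepFactory::Create;
  std::string Create() override { return "cf-unknown"; }
  std::string Create(std::uint32_t cf_id) override {
    return "cf-" + std::to_string(cf_id);
  }
};
```

The caller always uses the new overload through the base class; only subclasses that care about the column family ID need to override it.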
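Commit 95b0e89b5d charges memory per arena block and starts flushing at 7/8 of the cap instead of 100%. A simplified sketch of just that accounting rule; `BufferBudget` is a hypothetical class, it ignores the block-cache costing the commit also adds, and whether the real check uses `>` or `>=` is an assumption.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical, simplified memory budget in the spirit of 95b0e89b5d.
class BufferBudget {
 public:
  explicit BufferBudget(std::size_t cap)
      : cap_(cap), flush_trigger_(cap - cap / 8) {}  // 7/8 of the cap

  // Charged once per arena block rather than per key, which keeps the
  // bookkeeping cheap and closer to what is really allocated.
  void ReserveBlock(std::size_t block_bytes) {
    used_.fetch_add(block_bytes, std::memory_order_relaxed);
  }
  void FreeBlock(std::size_t block_bytes) {
    used_.fetch_sub(block_bytes, std::memory_order_relaxed);
  }

  // Request a flush once usage passes 7/8 of the cap, so the hard cap
  // itself is rarely reached.
  bool ShouldFlush() const {
    return used_.load(std::memory_order_relaxed) > flush_trigger_;
  }

  std::size_t cap() const { return cap_; }
  std::size_t used() const { return used_.load(std::memory_order_relaxed); }

 private:
  const std::size_t cap_;
  const std::size_t flush_trigger_;
  std::atomic<std::size_t> used_{0};
};
```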
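Commit f517d9dd09 defines `SeekForPrev` as finding the last key that is <= the target. Over any ordered container that can step backwards, that is an upper bound followed by one step back. This sketch uses `std::set` as a stand-in for a RocksDB iterator; the free function `SeekForPrev` below is hypothetical, not the `Iterator` method.

```cpp
#include <cassert>
#include <iterator>
#include <set>
#include <string>

// Hypothetical free function: returns true and sets *out to the last key
// <= target, or returns false when every key is greater than target.
bool SeekForPrev(const std::set<std::string>& keys, const std::string& target,
                 std::string* out) {
  auto it = keys.upper_bound(target);    // first key strictly greater than target
  if (it == keys.begin()) return false;  // nothing is <= target
  *out = *std::prev(it);                 // one step back: last key <= target
  return true;
}
```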
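Commit 40f562e747 estimates how many skip-list entries precede a key by counting the `Next()` calls made on each level during the search and weighting each level by the branching factor raised to that level. Written out, assuming level 0 is the bottom level, c_l is the number of `Next()` calls on level l, b is the branching factor and h is the list height (these symbol names are chosen here for illustration):

```latex
\widehat{N}(k) = \sum_{\ell=0}^{h-1} c_\ell \, b^{\ell},
\qquad \text{e.g. } b = 4,\ (c_0, c_1, c_2) = (1, 2, 3)
\;\Rightarrow\; \widehat{N} = 1 + 2 \cdot 4 + 3 \cdot 16 = 57
```

Each step on level l skips roughly b^l bottom-level entries, which is why that step is weighted by b^l.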
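Commit 88edfd90ae adds a `lookahead` option so that a forward `Seek()` first walks at most `lookahead` nodes from the iterator's previous position before falling back to a full search. A sketch of that control flow, assuming a sorted `std::vector<int>` in place of the skip list (so the fallback is `std::lower_bound`); `LookaheadCursor` and its `full_searches()` counter are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

class LookaheadCursor {
 public:
  LookaheadCursor(std::vector<int> sorted, std::size_t lookahead)
      : data_(std::move(sorted)), lookahead_(lookahead) {}

  // Positions the cursor on the first element >= target and returns its
  // index (data_.size() when every element is smaller).
  std::size_t Seek(int target) {
    // The answer can only lie at or after pos_ if the element just before
    // pos_ is smaller than target.
    if (pos_ == 0 || data_[pos_ - 1] < target) {
      for (std::size_t step = 0; step <= lookahead_; ++step) {
        const std::size_t i = pos_ + step;
        if (i == data_.size() || data_[i] >= target) {
          pos_ = i;
          return pos_;
        }
      }
    }
    // Short scan failed (or target is behind us): regular O(log n) search.
    ++full_searches_;
    pos_ = static_cast<std::size_t>(
        std::lower_bound(data_.begin(), data_.end(), target) - data_.begin());
    return pos_;
  }

  void Next() {
    if (pos_ < data_.size()) ++pos_;
  }
  std::size_t full_searches() const { return full_searches_; }

 private:
  std::vector<int> data_;
  std::size_t lookahead_;
  std::size_t pos_ = 0;
  std::size_t full_searches_ = 0;
};
```

With the tailing pattern from the commit message (a Seek, a few Next calls, then a Seek slightly further ahead), most seeks in this sketch are resolved by the short scan and `full_searches()` stays low.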
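Commit 9c332aa11a lets a HashLinkList bucket switch to a skip list once it holds too many entries, controlled by `threshold_use_skiplist`. A hypothetical sketch of that upgrade, with `std::forward_list` as the per-bucket sorted linked list and `std::set` standing in for the skip list; the class name is invented, and whether the switch happens above or at the threshold is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <forward_list>
#include <functional>
#include <set>
#include <string>
#include <vector>

class HashBuckets {
 public:
  HashBuckets(std::size_t num_buckets, std::size_t threshold)
      : buckets_(num_buckets), threshold_(threshold) {}

  void Insert(const std::string& key) {
    Bucket& b = buckets_[Index(key)];
    if (b.upgraded) {
      b.tree.insert(key);  // already converted: O(log n) insert
      return;
    }
    // Keep the list sorted: advance `prev` to the last node < key.
    auto prev = b.list.before_begin();
    for (auto it = b.list.begin(); it != b.list.end() && *it < key; ++it) prev = it;
    b.list.insert_after(prev, key);
    // Too many entries for a linear scan: move them into the ordered tree.
    if (++b.count > threshold_) {
      b.tree.insert(b.list.begin(), b.list.end());
      b.list.clear();
      b.upgraded = true;
    }
  }

  bool Contains(const std::string& key) const {
    const Bucket& b = buckets_[Index(key)];
    if (b.upgraded) return b.tree.count(key) > 0;
    for (const std::string& k : b.list) {
      if (k == key) return true;
      if (k > key) break;  // sorted list: key cannot appear later
    }
    return false;
  }

  bool IsUpgraded(const std::string& key) const {
    return buckets_[Index(key)].upgraded;
  }

 private:
  struct Bucket {
    std::forward_list<std::string> list;  // sorted ascending
    std::set<std::string> tree;           // used after the upgrade
    std::size_t count = 0;
    bool upgraded = false;
  };

  std::size_t Index(const std::string& key) const {
    return std::hash<std::string>{}(key) % buckets_.size();
  }

  std::vector<Bucket> buckets_;
  std::size_t threshold_;
};
```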