rocksdb

Author	SHA1	Message	Date
Siying Dong	58e1956d50	[Only for Performance Branch] A Hacky patch to lazily generate memtable key for prefix-hashed memtables. Summary: For prefix mem tables, encoding mem table key may be unnecessary if the prefix doesn't have any key. This patch is a little bit hacky but I want to try out the performance gain of removing this lazy initialization. In longer term, we might want to revisit the way we abstract mem tables implementations. Test Plan: make all check Reviewers: haobo, igor, kailiu Reviewed By: igor CC: leveldb Differential Revision: https://reviews.facebook.net/D14265	2013-11-20 20:49:23 -08:00
Siying Dong	b59d4d5a50	A Simple Plain Table Summary: A Simple plain table format. No block structure. When creating the table reader, scanning the full table to create indexes. Test Plan:Add unit test Reviewers:haobo,dhruba,kailiu CC: Task ID: # Blame Rev:	2013-11-20 18:44:22 -08:00
kailiu	6eb5649800	Move flush_block_policy from Options to TableFactory Summary: Previously we introduce a `flush_block_policy_factory` in Options, however, that options is strongly releated to Table based tables. It will make more sense to move it to block based table's own factory class. Test Plan: make check to pass existing tests Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14211	2013-11-19 22:00:48 -08:00
kailiu	1415f8820d	Improve the "table stats" Summary: The primary motivation of the changes is to make it easier to figure out the inside of the tables. * rename "table stats" to "table properties" since now we have more than "integers" to store in the property block. * Add filter block size to the basic table properties. * Whenever a table is built, we'll log the table properties (the sample output is in Test Plan). * Make an api to expose deleted keys. Test Plan: Passed all existing test. and the sample output of table stats: ================================================================== Basic Properties ------------------------------------------------------------------ # data blocks: 1 # entries: 1 raw key size: 9 raw average key size: 9 raw value size: 9 raw average value size: 0 data block size: 25 index block size: 27 filter block size: 18 (estimated) table size: 70 filter policy: rocksdb.BuiltinBloomFilter ================================================================== User collected properties: InternalKeyPropertiesCollector ------------------------------------------------------------------ kDeletedKeys: 1 ================================================================== Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14187	2013-11-19 16:29:42 -08:00
Igor Canadi	f611aba559	Move the compiler back to 4.8.1 + more small fixes Summary: 1. Moved the compiler back to 4.8.1 and uses Centos 5.2 binaries if OS is Centos 5.2. 2. Fixes this issue: https://github.com/facebook/rocksdb/issues/7 3. We use lot of c++11 features, so we can't pretend we can compile without them. Makes it a first class dependency. 4. Fix blob_store_test, which failes on Ubuntu with "too many files opened" error 5. Removed dependency on port/port_chromium.h, which does not even exist on our system Test Plan: make clean; make check Reviewers: dhruba, kailiu Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D14145	2013-11-18 11:40:16 -08:00
kailiu	97d8e573a6	make util/env_posix.cc work under mac Summary: This diff invoves some more complicated issues in the posix environment. Test Plan: works under mac os. will need to verify dev box. Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14061	2013-11-16 23:44:39 -08:00
Pascal Borreli	443e04e62d	Fixed typos	2013-11-16 11:21:34 +00:00
Kai Liu	80bb81c6fe	Add the correct table_factory for tables in table_tests	2013-11-12 23:54:31 -08:00
Kai Liu	22e1b04deb	Quick fix for a string format Summary: Fix one more string format issue that throws warning in mac	2013-11-12 21:22:32 -08:00
Kai Liu	35460ccb53	Fix the string format issue Summary: mac and our dev server has totally differnt definition of uint64_t, therefore fixing the warning in mac has actually made code in linux uncompileable. Test Plan: make clean && make -j32	2013-11-12 21:05:39 -08:00
kailiu	21587760b9	Fixing the warning messages captured under mac os # Consider using `git commit -m 'One line title' && arc diff`. # You will save time by running lint and unit in the background. Summary: The work to make sure mac os compiles rocksdb is not completed yet. But at least we can start cleaning some warnings captured only by g++ from mac os.. Test Plan: ran make in mac os Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D14049	2013-11-12 20:05:28 -08:00
Igor Canadi	9bc4a26f56	Small changes in Deleting obsolete files Summary: @haobo's suggestions from https://reviews.facebook.net/D13827 Renaming some variables, deprecating purge_log_after_flush, changing for loop into auto for loop. I have not implemented deleting objects outside of mutex yet because it would require a big code change - we would delete object in db_impl, which currently does not know anything about object because it's defined in version_edit.h (FileMetaData). We should do it at some point, though. Test Plan: Ran deletefile_test Reviewers: haobo Reviewed By: haobo CC: leveldb, haobo Differential Revision: https://reviews.facebook.net/D14025	2013-11-12 11:53:26 -08:00
lovro	8a46ecd357	WriteBatch::Put() overload that gathers key and value from arrays of slices Summary: In our project, when writing to the database, we want to form the value as the concatenation of a small header and a larger payload. It's a shame to have to copy the payload just so we can give RocksDB API a linear view of the value. Since RocksDB makes a copy internally, it's easy to support gather writes. Test Plan: write_batch_test, new test case Reviewers: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13947	2013-11-08 16:34:32 -08:00
Igor Canadi	1510339e52	Speed up FindObsoleteFiles Summary: Here's one solution we discussed on speeding up FindObsoleteFiles. Keep a set of all files in DBImpl and update the set every time we create a file. I probably missed few other spots where we create a file. It might speed things up a bit, but makes code uglier. I don't really like it. Much better approach would be to abstract all file handling to a separate class. Think of it as layer between DBImpl and Env. Having a separate class deal with file namings and deletion would benefit both code cleanliness (especially with huge DBImpl) and speed things up. It will take a huge effort to do this, though. Let's discuss offline today. Test Plan: Ran ./db_stress, verified that files are getting deleted Reviewers: dhruba, haobo, kailiu, emayanke Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D13827	2013-11-08 15:23:46 -08:00
Igor Canadi	8b3379dc0a	Implementing DynamicIterator for TransformRepNoLock Summary: What @haobo done with TransformRep, now in TransformRepNoLock. Similar implementation, except that I made DynamicIterator a subclass of Iterator which makes me have less iterator initializations. Test Plan: ./prefix_test. Seeing huge savings vs. TransformRep again! Reviewers: dhruba, haobo, sdong, kailiu Reviewed By: haobo CC: leveldb, haobo Differential Revision: https://reviews.facebook.net/D13953	2013-11-08 00:31:09 -08:00
Kai Liu	fd075d6edd	Provide mechanism to configure when to flush the block Summary: Allow block based table to configure the way flushing the blocks. This feature will allow us to add support for prefix-aligned block. Test Plan: make check Reviewers: dhruba, haobo, sdong, igor Reviewed By: sdong CC: leveldb Differential Revision: https://reviews.facebook.net/D13875	2013-11-07 21:27:21 -08:00
Igor Canadi	444cf88a56	Flush the log outside of lock Summary: Added a new call LogFlush() that flushes the log contents to the OS buffers. We never call it with lock held. We call it once for every Read/Write and often in compaction/flush process so the frequency should not be a problem. Test Plan: db_test Reviewers: dhruba, haobo, kailiu, emayanke Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13935	2013-11-07 11:31:56 -08:00
Haobo Xu	fd2044883a	[RocksDB] Generalize prefix-aware iterator to be used for more than one Seek Summary: Added a prefix_seek flag in ReadOptions to indicate that Seek is prefix aware(might not return data with different prefix), and also not bound to a specific prefix. Multiple Seeks and range scans can be invoked on the same iterator. If a specific prefix is specified, this flag will be ignored. Just a quick prototype that works for PrefixHashRep, the new lockless memtable could be easily extended with this support too. Test Plan: test it on Leaf Reviewers: dhruba, kailiu, sdong, igor Reviewed By: igor CC: leveldb Differential Revision: https://reviews.facebook.net/D13929	2013-11-06 20:45:49 -08:00
shamdor	c2be2cba04	WAL log retention policy based on archive size. Summary: Archive cleaning will still happen every WAL_ttl seconds but archived logs will be deleted only if archive size is greater then a WAL_size_limit value. Empty archived logs will be deleted evety WAL_ttl. Test Plan: 1. Unit tests pass. 2. Benchmark. Reviewers: emayanke, dhruba, haobo, sdong, kailiu, igor Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D13869	2013-11-06 18:46:28 -08:00
Igor Canadi	95a8213f75	Log flush every 0 seconds Summary: We have to be able to catch last few log outputs before a crash Test Plan: no Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13917	2013-11-06 14:19:46 -08:00
Igor Canadi	be96f2498e	TransformRep - use array instead of unordered_map Summary: I'm sending this diff together with https://reviews.facebook.net/D13881 because it didn't allow me to send only the array one. Here I also replaced unordered_map with just an array of shared_ptrs. This elminated all the locks. I will run the new benchmark and post the results here. Test Plan: db_test Reviewers: dhruba, haobo Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D13893	2013-11-06 11:55:43 -08:00
Dhruba Borthakur	b4ad5e89ae	Implement a compressed block cache. Summary: Rocksdb can now support a uncompressed block cache, or a compressed block cache or both. Lookups first look for a block in the uncompressed cache, if it is not found only then it is looked up in the compressed cache. If it is found in the compressed cache, then it is uncompressed and inserted into the uncompressed cache. It is possible that the same block resides in the compressed cache as well as the uncompressed cache at the same time. Both caches have their own individual LRU policy. Test Plan: Unit test case attached. Reviewers: kailiu, sdong, haobo, leveldb Reviewed By: haobo CC: xjin, haobo Differential Revision: https://reviews.facebook.net/D12675	2013-11-01 14:31:35 -07:00
Piyush Garg	1e4375d2ef	Task #3071144 Enhance ldb (db dump tool for leveldb) to report row counters for each row type Summary: Added an option --count_delim=<char> which takes the given character as delimiter ('.' by default) and reports count of each row type found in the db Test Plan: 1. Created test in file (for DBDumperCommand) rocksdb/tools/ldb_test.py which puts various key value pair in db and checks the output using dump --count_delim ,--count_delim="." and --count_delim=",". 2. Created test in file (for InternalDumperCommand) rocksdb/tools/ldb_test.py which puts various key value pair in db and checks the output using dump --count_delim ,--count_delim="." and --count_delim=",". 3. Manually created a database with several keys of several type and verified by running the command ./ldb db=<path> dump --count_delim="<char>" ./ldb db=<path> idump --count_delim="<char>" Reviewers: vamsi, dhruba, emayanke, kailiu Reviewed By: vamsi CC: leveldb Differential Revision: https://reviews.facebook.net/D13815	2013-11-01 13:59:14 -07:00
Igor Canadi	b572e81f94	Flush Log every 5 seconds Summary: This might help with p99 performance, but does not solve the real problem. More discussion on #2947135 Test Plan: make check Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13809	2013-10-31 15:36:40 -07:00
Naman Gupta	b4fab3be2a	Merge branch 'master' of github.com:facebook/rocksdb into inplace	2013-10-31 11:51:03 -07:00
Naman Gupta	fe25070242	In-place updates for equal keys and similar sized values Summary: Currently for each put, a fresh memory is allocated, and a new entry is added to the memtable with a new sequence number irrespective of whether the key already exists in the memtable. This diff is an attempt to update the value inplace for existing keys. It currently handles a very simple case: 1. Key already exists in the current memtable. Does not inplace update values in immutable memtable or snapshot 2. Latest value type is a 'put' ie kTypeValue 3. New value size is less than existing value, to avoid reallocating memory TODO: For a put of an existing key, deallocate memory take by values, for other value types till a kTypeValue is found, ie. remove kTypeMerge. TODO: Update the transaction log, to allow consistent reload of the memtable. Test Plan: Added a unit test verifying the inplace update. But some other unit tests broken due to invalid sequence number checks. WIll fix them next. Reviewers: xinyaohu, sumeet, haobo, dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12423 Automatic commit by arc	2013-10-31 11:27:12 -07:00
Siying Dong	d4eec30ed0	Make "Table" pluggable Summary: This patch makes Table and TableBuilder a abstract class and make all the implementation of the current table into BlockedBasedTable and BlockedBasedTable Builder. Test Plan: Make db_test.cc to work with block based table. Add a new test simple_table_db_test.cc where a different simple table format is implemented. Reviewers: dhruba, haobo, kailiu, emayanke, vamsi Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13521	2013-10-28 17:54:09 -07:00
Kai Liu	994575c134	Support user-defined table stats collector Summary: 1. Added a new option that support user-defined table stats collection. 2. Added a deleted key stats collector in `utilities` Test Plan: Added a unit test for newly added code. Also ran make check to make sure other tests are not broken. Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13491	2013-10-28 15:45:14 -07:00
Igor Canadi	cb8a7302e4	Implement max_size in BlobStore Summary: I added max_size option in blobstore. Since we now know the maximum number of buckets we'll ever use, we can allocate an array of buckets and access its elements without use of any locks! Common case Get doesn't lock anything now. Benchmarks on 16KB block size show no impact on speed, though. Test Plan: unittests + benchmark Reviewers: dhruba, haobo, kailiu Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13641	2013-10-23 14:38:52 -07:00
Haobo Xu	2fb361ad98	[RocksDB] Add perf_context.wal_write_time to track time spent on writing the recovery log. Summary: as title Test Plan: make check; ./perf_context_test Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13629	2013-10-23 13:38:39 -07:00
Igor Canadi	30f1b97a06	Enable blobs to be fragmented Summary: I have implemented a FreeList version that supports fragmented blob chunks. Each block gets allocated and freed in FIFO order. Since the idea for the blocks to be big, we will not take a big hit of non-sequential IO. Free list is also faster, taking only O(k) size in both free and allocate instead of O(N) as before. See more info on the task: https://our.intern.facebook.com/intern/tasks/?t=2990558 Also, I'm taking Slice instead of const char * and size in Put function. Test Plan: unittests Reviewers: haobo, kailiu, dhruba, emayanke Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13569	2013-10-22 17:44:00 -07:00
Mayank Agarwal	9b50106f9a	Dbid feature Summary: Create a new type of file on startup if it doesn't already exist called DBID. This will store a unique number generated from boost library's uuid header file. The use-case is to identify the case of a db losing all its data and coming back up either empty or from an image(backup/live replica's recovery) the key point to note is that DBID is not stored in a backup or db snapshot It's preferable to use Boost for uuid because: 1) A non-standard way of generating uuid is not good 2) /proc/sys/kernel/random/uuid generates a uuid but only on linux environments and the solution would not be clean 3) c++ doesn't have any direct way to get a uuid 4) Boost is a very good library that was already having linkage in rocksdb from third-party Note: I had to update the TOOLCHAIN_REV in build files to get latest verison of boost from third-party as the older version had a bug. I had to put Wno-uninitialized in Makefile because boost-1.51 has an unitialized variable and rocksdb would not comiple otherwise. Latet open-source for boost is 1.54 but is not there in third-party. I have notified the concerned people in fbcode about it. @kailiu : While releasing to third-party, an additional dependency will need to be created for boost in TARGETS file. I can help identify. Test Plan: Expand db_test to test 2 cases 1) Restarting db with Id file present - verify that no change to Id 2)Restarting db with Id file deleted - verify that a different Id is there after reopen Also run make all check Reviewers: dhruba, haobo, kailiu, sdong Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13587	2013-10-22 12:23:34 -07:00
Igor Canadi	c674b42d52	Rephrasing the comment Summary: Per @haobo's request, rephrasing the comment for allocate Test Plan: It's a comment! Reviewers: haobo, kailiu Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D13575	2013-10-21 10:23:56 -07:00
Igor Canadi	bcc8557901	tmpfs does not support fallocate Summary: This caused Siying's unit test to fail. Test Plan: Unittest Reviewers: dhruba, kailiu, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13539	2013-10-17 22:15:57 -07:00
Vamsi Ponnekanti	6731997f64	[ldb compact is not allowing ttl flag] Summary: Allow ttl flag Test Plan: tested on my database that has merge operations and ttl Revert Plan: OK Task ID: #3038186 Reviewers: emayanke, dhruba, haobo Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D13503	2013-10-16 18:06:54 -07:00
Dhruba Borthakur	9cd221094c	Add appropriate LICENSE and Copyright message. Summary: Add appropriate LICENSE and Copyright message. Test Plan: make check Reviewers: CC: Task ID: # Blame Rev:	2013-10-16 17:48:41 -07:00
Igor Canadi	fc4616d898	External Value Store Summary: Developing a capability for storing values on external backing file(s). This is just a highly unoptimized first pass - supports: 1) Allocating some portion of external file to be used to store value 2) Freeing the range, enabling it to be reused by other values As next steps, I plan to: 1) Create some kind of stress testing. Once I can measure stuff, I can focus on optimizing. 2) Optimize locking. 3) Optimize freelist data structure. Currently we have O(n) for both freeing and allocation. 4) Figure out how to do recovery. Test Plan: Created a unit test. Reviewers: dhruba, haobo, kailiu Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13389	2013-10-16 17:33:49 -07:00
Dhruba Borthakur	b825df81e2	Fix error in previous commit of 'ftruncate' to 'fallocate'. Summary: Fix error in previous commit of 'ftruncate' to 'fallocate'. Test Plan: Reviewers: CC: Task ID: # Blame Rev:	2013-10-15 13:57:29 -07:00
Dhruba Borthakur	8457b74c26	Fix Unit test when run on tmpfs Summary: tmpfs might not support fallocate(). Fix unit test so that this does not cause a unit test to fail. Test Plan: ./env_test Reviewers: emayanke, igor, kailiu Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D13455	2013-10-15 12:03:09 -07:00
Mayank Agarwal	da2fd001a6	Fix rocksdb->levledb BytewiseComparator and inverted order of error in db/version_set.cc Summary: This is needed to make existing dbs be able to open and also because BytewiseComparator was not changed since leveldb. The inverted order in the error message caused confusion prebiously Test Plan: make; open existing db Reviewers: leveldb, dhruba Reviewed By: dhruba Differential Revision: https://reviews.facebook.net/D13449	2013-10-14 18:16:54 -07:00
sdong	f8509653ba	LRUCache to try to clean entries not referenced first. Summary: With this patch, when LRUCache.Insert() is called and the cache is full, it will first try to free up entries whose reference counter is 1 (would become 0 after remo\ ving from the cache). We do it in two passes, in the first pass, we only try to release those unreferenced entries. If we cannot free enough space after traversing t\ he first remove_scan_cnt_ entries, we start from the beginning again and remove those entries being used. Test Plan: add two unit tests to cover the codes Reviewers: dhruba, haobo, emayanke Reviewed By: emayanke CC: leveldb, emayanke, xjin Differential Revision: https://reviews.facebook.net/D13377	2013-10-11 09:26:21 -07:00
Siying Dong	40a1e31fa5	Minor: Fix a lint error in cache_test.cc Summary: As title. Fix an lint error: Lint: CppLint Error Single-argument constructor 'Value(int v)' may inadvertently be used as a type conversion constructor. Prefix the function with the 'explicit' keyword to avoid this, or add an /* implicit */ comment to suppress this warning. Test Plan: N/A Reviewers: emayanke, haobo, dhruba Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D13401	2013-10-10 13:47:25 -07:00
Igor Canadi	d2ca2bd183	Fixing build failure Summary: virtual NewRandomRWFile is not implemented on EnvHdfs, causing build failure. Test Plan: make clean; make all check Reviewers: dhruba, haobo, kailiu Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D13383	2013-10-10 01:01:16 -07:00
Igor Canadi	d0beadd456	Env class that can randomly read and write Summary: I have implemented basic simple use case that I need for External Value Store I'm working on. There is a potential for making this prettier by refactoring/combining WritableFile and RandomAccessFile, avoiding some copypasta. However, I decided to implement just the basic functionality, so I can continue working on the other diff. Test Plan: Added a unittest Reviewers: dhruba, haobo, kailiu Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D13365	2013-10-10 00:03:08 -07:00
Naman Gupta	cbf4a06427	Add option for storing transaction logs in a separate dir Summary: In some cases, you might not want to store the data log (write ahead log) files in the same dir as the sst files. An example use case is leaf, which stores sst files in tmpfs. And would like to save the log files in a separate dir (disk) to save memory. Test Plan: make all. Ran db_test test. A few test failing. P2785018. If you guys don't see an obvious problem with the code, maybe somebody from the rocksdb team could help me debug the issue here. Running this on leaf worked well. I could see logs stored on disk, and deleted appropriately after compactions. Obviously this is only one set of options. The unit tests cover different options. Seems like I'm missing some edge cases. Reviewers: dhruba, haobo, leveldb CC: xinyaohu, sumeet Differential Revision: https://reviews.facebook.net/D13239	2013-10-08 17:40:27 -07:00
Igor Canadi	fa46ddb41f	Move delete and free outside of crtical section Summary: Split Unref into two parts -> cheap and expensive. Try to call expensive Unref outside of critical section to decrease lock contention. Test Plan: unittests Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb, kailiu Differential Revision: https://reviews.facebook.net/D13299	2013-10-07 15:37:40 -07:00
Dhruba Borthakur	4463b11cad	Migrate names of properties from 'leveldb' prefix to 'rocksdb' prefix. Summary: Migrate names of properties from 'leveldb' prefix to 'rocksdb' prefix. Test Plan: make check Reviewers: emayanke, haobo Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D13311	2013-10-06 00:14:26 -07:00
Dhruba Borthakur	a143ef9b38	Change namespace from leveldb to rocksdb Summary: Change namespace from leveldb to rocksdb. This allows a single application to link in open-source leveldb code as well as rocksdb code into the same process. Test Plan: compile rocksdb Reviewers: emayanke Reviewed By: emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D13287	2013-10-04 11:59:26 -07:00
Haobo Xu	fa798e9e28	[Rocksdb] Submit mem table flush job in a different thread pool Summary: As title. This is just a quick hack and not ready for commit. fails a lot of unit test. I will test/debug it directly in ViewState shadow . Test Plan: Try it in shadow test. Reviewers: dhruba, xjin CC: leveldb Differential Revision: https://reviews.facebook.net/D12933	2013-10-03 14:37:19 -07:00
Haobo Xu	71046971f0	[RocksDB] Added perf counters to track skipped internal keys during iteration Summary: as title. unit test not polished. this is for a quick live test Test Plan: live Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13221	2013-10-02 10:48:41 -07:00
Haobo Xu	22bb7c754b	[RocksDB] print the name of options.memtable_factory in LOG so we know Summary: as title Test Plan: make check Reviewers: dhruba, emayanke Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D13179	2013-09-28 20:57:29 -07:00
Dhruba Borthakur	f1a60e5c3e	The vector rep implementation was segfaulting because of incorrect initialization of vector. Summary: The constructor for Vector memtable has a parameter called 'count' that specifies the capacity of the vector to be reserved at allocation time. It was incorrectly used to initialize the size of the vector. Test Plan: Enhanced db_test. Reviewers: haobo, xjin, emayanke Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D13083	2013-09-25 11:33:52 -07:00
Dhruba Borthakur	87d6eb2f6b	Implement apis in the Environment to clear out pages in the OS cache. Summary: Added a new api to the Environment that allows clearing out not-needed pages from the OS cache. This will be helpful when the compressed block cache replaces the OS cache. Test Plan: EnvPosixTest.InvalidateCache Reviewers: haobo Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D13041	2013-09-23 22:05:03 -07:00
Dhruba Borthakur	5e9f3a9aa7	Better locking in vectorrep that increases throughput to match speed of storage. Summary: There is a use-case where we want to insert data into rocksdb as fast as possible. Vector rep is used for this purpose. The background flush thread needs to flush the vectorrep to storage. It acquires the dblock then sorts the vector, releases the dblock and then writes the sorted vector to storage. This is suboptimal because the lock is held during the sort, which prevents new writes for occuring. This patch moves the sorting of the vector rep to outside the db mutex. Performance is now as fastas the underlying storage system. If you are doing buffered writes to rocksdb files, then you can observe throughput upwards of 200 MB/sec writes. This is an early draft and not yet ready to be reviewed. Test Plan: make check Task ID: # Blame Rev: Reviewers: haobo Reviewed By: haobo CC: leveldb, haobo Differential Revision: https://reviews.facebook.net/D12987	2013-09-19 21:48:10 -07:00
Rajat Goel	11c65021fb	Revert "Minor fixes found while trying to compile it using clang on Mac OS X" This reverts commit `5f2c136c32`.	2013-09-15 23:01:26 -07:00
Rajat Goel	5f2c136c32	Minor fixes found while trying to compile it using clang on Mac OS X	2013-09-15 22:06:14 -07:00
Haobo Xu	8866448001	[RocksDB] fix build env_test Summary: move the TwoPools test to the end of thread related tests. Otherwise, the SetBackgroundThreads call would increase the Low pool size and affect the result of other tests. Test Plan: make env_test; ./env_test Reviewers: dhruba, emayanke, xjin Reviewed By: xjin CC: leveldb Differential Revision: https://reviews.facebook.net/D12939	2013-09-13 21:13:20 -07:00
Dhruba Borthakur	4012ca1c7b	Added a parameter to limit the maximum space amplification for universal compaction. Summary: Added a new field called max_size_amplification_ratio in the CompactionOptionsUniversal structure. This determines the maximum percentage overhead of space amplification. The size amplification is defined to be the ratio between the size of the oldest file to the sum of the sizes of all other files. If the size amplification exceeds the specified value, then min_merge_width and max_merge_width are ignored and a full compaction of all files is done. A value of 10 means that the size a database that stores 100 bytes of user data could occupy 110 bytes of physical storage. Test Plan: Unit test DBTest.UniversalCompactionSpaceAmplification added. Reviewers: haobo, emayanke, xjin Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D12825	2013-09-13 16:27:18 -07:00
Haobo Xu	1565dab809	[RocksDB] Enhance Env to support two thread pools LOW and HIGH Summary: this is the ground work for separating memtable flush jobs to their own thread pool. Both SetBackgroundThreads and Schedule take a third parameter Priority to indicate which thread pool they are working on. The names LOW and HIGH are just identifiers for two different thread pools, and does not indicate real difference in 'priority'. We can set number of threads in the pools independently. The thread pool implementation is refactored. Test Plan: make check Reviewers: dhruba, emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D12885	2013-09-12 16:15:36 -07:00
Haobo Xu	0e422308aa	[RocksDB] Remove Log file immediately after memtable flush Summary: As title. The DB log file life cycle is tied up with the memtable it backs. Once the memtable is flushed to sst and committed, we should be able to delete the log file, without holding the mutex. This is part of the bigger change to avoid FindObsoleteFiles at runtime. It deals with log files. sst files will be dealt with later. Test Plan: make check; db_bench Reviewers: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11709	2013-09-12 11:54:44 -07:00
Haobo Xu	f2f4c8072f	[RocksDB] Added nano second stopwatch and new perf counters to track block read cost Summary: The pupose of this diff is to expose per user-call level precise timing of block read, so that we can answer questions like: a Get() costs me 100ms, is that somehow related to loading blocks from file system, or sth else? We will answer that with EXACTLY how many blocks have been read, how much time was spent on transfering the bytes from os, how much time was spent on checksum verification and how much time was spent on block decompression, just for that one Get. A nano second stopwatch was introduced to track time with higher precision. The cost/precision of the stopwatch is also measured in unit-test. On my dev box, retrieving one time instance costs about 30ns, on average. The deviation of timing results is good enough to track 100ns-1us level events. And the overhead could be safely ignored for 100us level events (10000 instances/s), for example, a viewstate thrift call. Test Plan: perf_context_test, also testing with viewstate shadow traffic. Reviewers: dhruba Reviewed By: dhruba CC: leveldb, xjin Differential Revision: https://reviews.facebook.net/D12351	2013-09-07 21:14:54 -07:00
Dhruba Borthakur	197034e4c3	An iterator may automatically invoke reseeks. Summary: An iterator invokes reseek if the number of sequential skips over the same userkey exceeds a configured number. This makes iter->Next() faster (bacause of fewer key compares) if a large number of adjacent internal keys in a table (sst or memtable) have the same userkey. Test Plan: Unit test DBTest.IterReseek. Reviewers: emayanke, haobo, xjin Reviewed By: xjin CC: leveldb, xjin Differential Revision: https://reviews.facebook.net/D11865	2013-09-06 11:50:53 -07:00
Xing Jin	42c109cc2e	New ldb command to convert compaction style Summary: Add new command "change_compaction_style" to ldb tool. For universal->level, it shows "nothing to do". For level->universal, it compacts all files into a single one and moves the file to level 0. Also add check for number of files at level 1+ when opening db with universal compaction style. Test Plan: 'make all check'. New unit test for internal convertion function. Also manully test various cmd like: ./ldb change_compaction_style --old_compaction_style=0 --new_compaction_style=1 --db=/tmp/leveldbtest-3088/db_test Reviewers: haobo, dhruba Reviewed By: haobo CC: vamsi, emayanke Differential Revision: https://reviews.facebook.net/D12603	2013-09-04 13:13:08 -07:00
Jim Paton	5c3b254ef2	Fix memory leak Summary: There is a memory leak because TransformRepFactory does not delete its SliceTransform pointer. This patch adds a delete to the destructor. Test Plan: make check make valgrind_check Reviewers: dhruba, emayanke, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12513	2013-08-23 15:39:49 -07:00
Dhruba Borthakur	1186192ed1	Replace include/leveldb with include/rocksdb. Summary: Replace include/leveldb with include/rocksdb. Test Plan: make clean; make check make clean; make release Differential Revision: https://reviews.facebook.net/D12489	2013-08-23 10:51:00 -07:00
Jim Paton	74781a0c49	Add three new MemTableRep's Summary: This patch adds three new MemTableRep's: UnsortedRep, PrefixHashRep, and VectorRep. UnsortedRep stores keys in an std::unordered_map of std::sets. When an iterator is requested, it dumps the keys into an std::set and iterates over that. VectorRep stores keys in an std::vector. When an iterator is requested, it creates a copy of the vector and sorts it using std::sort. The iterator accesses that new vector. PrefixHashRep stores keys in an unordered_map mapping prefixes to ordered sets. I also added one API change. I added a function MemTableRep::MarkImmutable. This function is called when the rep is added to the immutable list. It doesn't do anything yet, but it seems like that could be useful. In particular, for the vectorrep, it means we could elide the extra copy and just sort in place. The only reason I haven't done that yet is because the use of the ArenaAllocator complicates things (I can elaborate on this if needed). Test Plan: make -j32 check ./db_stress --memtablerep=vector ./db_stress --memtablerep=unsorted ./db_stress --memtablerep=prefixhash --prefix_size=10 Reviewers: dhruba, haobo, emayanke Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12117	2013-08-22 23:10:02 -07:00
Tyler Harter	a8f47a4006	Add options to dump. Summary: added options to Dump() I missed in D12027. I also ran a script to look for other missing options and found a couple which I added. Should we also print anything for "PrepareForBulkLoad", "memtable_factory", and "statistics"? Or should we leave those alone since it's not easy to print useful info for those? Test Plan: run anything and look at LOG file to make sure these are printed now. Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12219	2013-08-14 09:06:10 -07:00
Tyler Harter	f5f1842282	Prefix filters for scans (v4) Summary: Similar to v2 (db and table code understands prefixes), but use ReadOptions as in v3. Also, make the CreateFilter code faster and cleaner. Test Plan: make db_test; export LEVELDB_TESTS=PrefixScan; ./db_test Reviewers: dhruba Reviewed By: dhruba CC: haobo, emayanke Differential Revision: https://reviews.facebook.net/D12027	2013-08-13 14:04:56 -07:00
sumeet	3b81df34bd	Separate compaction filter for each compaction Summary: If we have same compaction filter for each compaction, application cannot know about the different compaction processes. Later on, we can put in more details in compaction filter for the application to consume and use it according to its needs. For e.g. In the universal compaction, we have a compaction process involving all the files while others don't involve all the files. Applications may want to collect some stats only when during full compaction. Test Plan: run existing unit tests Reviewers: haobo, dhruba Reviewed By: dhruba CC: xinyaohu, leveldb Differential Revision: https://reviews.facebook.net/D12057	2013-08-13 10:56:20 -07:00
Dhruba Borthakur	03bd4461ad	Merge branch 'performance' of github.com:facebook/rocksdb into performance	2013-08-12 10:31:21 -07:00
Dhruba Borthakur	f3967a5132	Merge remote-tracking branch 'origin' into performance	2013-08-12 09:58:50 -07:00
Haobo Xu	3a3b1c3e6c	[RocksDB] Improve manifest dump to print internal keys in hex for version edits. Summary: Currently, VersionEdit::DebugString always display internal keys in the original ascii format. This could cause manifest dump to be truncated if internal keys contain special charactors (like null). Also added an option --input_key_hex for ldb idump to indicate that the passed in user keys are in hex. Test Plan: run ldb manifest_dump Reviewers: dhruba, emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D12111	2013-08-08 16:19:01 -07:00
Xing Jin	17b8f786a3	Fix unit tests/bugs for universal compaction (first step) Summary: This is the first step to fix unit tests and bugs for universal compactiion. I added universal compaction option to ChangeOptions(), and fixed all unit tests calling ChangeOptions(). Some of these tests obviously assume more than 1 level and check file number/values in level 1 or above levels. I set kSkipUniversalCompaction for these tests. The major bug I found is manual compaction with universal compaction never stops. I have put a fix for it. I have also set universal compaction as the default compaction and found at least 20+ unit tests failing. I haven't looked into the details. The next step is to check all unit tests without calling ChangeOptions(). Test Plan: make all check Reviewers: dhruba, haobo Differential Revision: https://reviews.facebook.net/D12051	2013-08-07 14:05:44 -07:00
Dhruba Borthakur	f5fa26b6a9	Merge branch 'performance' of github.com:facebook/rocksdb into performance Conflicts: db/builder.cc db/db_impl.cc db/version_set.cc include/leveldb/statistics.h	2013-08-07 11:58:06 -07:00
Mayank Agarwal	1d7b4765c3	Expose base db object from ttl wrapper Summary: rocksdb replicaiton will need this when writing value+TS from master to slave 'as is' Test Plan: make Reviewers: dhruba, vamsi, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11919	2013-08-05 18:44:14 -07:00
Jim Paton	1036537c94	Add soft and hard rate limit support Summary: This diff adds support for both soft and hard rate limiting. The following changes are included: 1) Options.rate_limit is renamed to Options.hard_rate_limit. 2) Options.rate_limit_delay_milliseconds is renamed to Options.rate_limit_delay_max_milliseconds. 3) Options.soft_rate_limit is added. 4) If the maximum compaction score is > hard_rate_limit and rate_limit_delay_max_milliseconds == 0, then writes are delayed by 1 ms at a time until the max compaction score falls below hard_rate_limit. 5) If the max compaction score is > soft_rate_limit but <= hard_rate_limit, then writes are delayed by 0-1 ms depending on how close we are to hard_rate_limit. 6) Users can disable 4 by setting hard_rate_limit = 0. They can add a limit to the maximum amount of time waited by setting rate_limit_delay_max_milliseconds > 0. Thus, the old behavior can be preserved by setting soft_rate_limit = 0, which is the default. Test Plan: make -j32 check ./db_stress Reviewers: dhruba, haobo, MarkCallaghan Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12003	2013-08-05 15:43:49 -07:00
Dhruba Borthakur	711a30cb30	Merge branch 'master' into performance Conflicts: include/leveldb/options.h include/leveldb/statistics.h util/options.cc	2013-08-02 10:22:08 -07:00
Xing Jin	0f0a24e298	Make arena block size configurable Summary: Add an option for arena block size, default value 4096 bytes. Arena will allocate blocks with such size. I am not sure about passing parameter to skiplist in the new virtualized framework, though I talked to Jim a bit. So add Jim as reviewer. Test Plan: new unit test, I am running db_test. For passing paramter from configured option to Arena, I tried tests like: TEST(DBTest, Arena_Option) { std::string dbname = test::TmpDir() + "/db_arena_option_test"; DestroyDB(dbname, Options()); DB* db = nullptr; Options opts; opts.create_if_missing = true; opts.arena_block_size = 1000000; // tested 99, 999999 Status s = DB::Open(opts, dbname, &db); db->Put(WriteOptions(), "a", "123"); } and printed some debug info. The results look good. Any suggestion for such a unit-test? Reviewers: haobo, dhruba, emayanke, jpaton Reviewed By: dhruba CC: leveldb, zshao Differential Revision: https://reviews.facebook.net/D11799	2013-07-31 12:42:23 -07:00
Jim Paton	52d7ecfc78	Virtualize SkipList Interface Summary: This diff virtualizes the skiplist interface so that users can provide their own implementation of a backing store for MemTables. Eventually, the backing store will be responsible for its own synchronization, allowing users (and us) to experiment with different lockless implementations. Test Plan: make clean make -j32 check ./db_stress Reviewers: dhruba, emayanke, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11739	2013-07-23 14:42:27 -07:00
Mayank Agarwal	bf66c10b13	Use KeyMayExist for WriteBatch-Deletes Summary: Introduced KeyMayExist checking during writebatch-delete and removed from Outer Delete API because it uses writebatch-delete. Added code to skip getting Table from disk if not already present in table_cache. Some renaming of variables. Introduced KeyMayExistImpl which allows checking since specified sequence number in GetImpl useful to check partially written writebatch. Changed KeyMayExist to not be pure virtual and provided a default implementation. Expanded unit-tests in db_test to check appropriately. Ran db_stress for 1 hour with ./db_stress --max_key=100000 --ops_per_thread=10000000 --delpercent=50 --filter_deletes=1 --statistics=1. Test Plan: db_stress;make check Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb, xjin Differential Revision: https://reviews.facebook.net/D11745	2013-07-23 13:36:50 -07:00
Dhruba Borthakur	9357a53a7d	Fix merge problems with options. Summary: Fix merge problems with options. Test Plan: Reviewers: CC: Task ID: # Blame Rev:	2013-07-17 15:08:56 -07:00
Dhruba Borthakur	4a745a5666	Merge branch 'master' into performance Conflicts: db/version_set.cc include/leveldb/options.h util/options.cc	2013-07-17 15:05:57 -07:00
Mayank Agarwal	2a986919d6	Make rocksdb-deletes faster using bloom filter Summary: Wrote a new function in db_impl.c-CheckKeyMayExist that calls Get but with a new parameter turned on which makes Get return false only if bloom filters can guarantee that key is not in database. Delete calls this function and if the option- deletes_use_filter is turned on and CheckKeyMayExist returns false, the delete will be dropped saving: 1. Put of delete type 2. Space in the db,and 3. Compaction time Test Plan: make all check; will run db_stress and db_bench and enhance unit-test once the basic design gets approved Reviewers: dhruba, haobo, vamsi Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D11607	2013-07-11 12:11:11 -07:00
Haobo Xu	a8d5f8dde2	[RocksDB] Remove old readahead options Summary: As title. Test Plan: make check; db_bench Reviewers: dhruba, MarkCallaghan CC: leveldb Differential Revision: https://reviews.facebook.net/D11643	2013-07-09 11:22:33 -07:00
Dhruba Borthakur	116ec527f2	Renamed 'hybrid_compaction' tp be "Universal Compaction'. Summary: All the universal compaction parameters are encapsulated in a new file universal_compaction.h Test Plan: make check	2013-07-03 15:47:53 -07:00
Haobo Xu	92ca816a60	[RocksDB] Support internal key/value dump for ldb Summary: This diff added a command 'idump' to ldb tool, which dumps the internal key/value pairs. It could be useful for diagnosis and estimating the per user key 'overhead'. Also cleaned up the ldb code a bit where I touched. Test Plan: make check; ldb idump Reviewers: emayanke, sheki, dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11517	2013-07-03 10:41:31 -07:00
Dhruba Borthakur	47c4191fe8	Reduce write amplification by merging files in L0 back into L0 Summary: There is a new option called hybrid_mode which, when switched on, causes HBase style compactions. Files from L0 are compacted back into L0. This meat of this compaction algorithm is in PickCompactionHybrid(). All files reside in L0. That means all files have overlapping keys. Each file has a time-bound, i.e. each file contains a range of keys that were inserted around the same time. The start-seqno and the end-seqno refers to the timeframe when these keys were inserted. Files that have contiguous seqno are compacted together into a larger file. All files are ordered from most recent to the oldest. The current compaction algorithm starts to look for candidate files starting from the most recent file. It continues to add more files to the same compaction run as long as the sum of the files chosen till now is smaller than the next candidate file size. This logic needs to be debated and validated. The above logic should reduce write amplification to a large extent... will publish numbers shortly. Test Plan: dbstress runs for 6 hours with no data corruption (tested so far). Differential Revision: https://reviews.facebook.net/D11289	2013-06-30 20:07:04 -07:00
Mayank Agarwal	b858da709a	Simplify bucketing logic in ldb-ttl Summary: [start_time, end_time) is waht I'm following for the buckets and the whole time-range. Also cleaned up some code in db_ttl.* Not correcting the spacing/indenting convention for util/ldb_cmd.cc in this diff. Test Plan: python ldb_test.py, make ttl_test, Run mcrocksdb-backup tool, Run the ldb tool on 2 mcrocksdb production backups form sigmafio033.prn1 Reviewers: vamsi, haobo Reviewed By: vamsi Differential Revision: https://reviews.facebook.net/D11433	2013-06-21 09:49:24 -07:00
Mayank Agarwal	61f1baaedf	Introducing timeranged scan, timeranged dump in ldb. Also the ability to count in time-batches during Dump Summary: Scan and Dump commands in ldb use iterator. We need to also print timestamp for ttl databases for debugging. For this I create a TtlIterator class pointer in these functions and assign it the value of Iterator pointer which actually points to t TtlIterator object, and access the new function ValueWithTS which can return TS also. Buckets feature for dump command: gives a count of different key-values in the specified time-range distributed across the time-range partitioned according to bucket-size. start_time and end_time are specified in unixtimestamp and bucket in seconds on the user-commandline Have commented out 3 ines from ldb_test.py so that the test does not break right now. It breaks because timestamp is also printed now and I have to look at wildcards in python to compare properly. Test Plan: python tools/ldb_test.py Reviewers: vamsi, dhruba, haobo, sheki Reviewed By: vamsi CC: leveldb Differential Revision: https://reviews.facebook.net/D11403	2013-06-19 18:45:13 -07:00
Haobo Xu	96be2c4ee0	[RocksDB] Add mmap_read option for db_stress Summary: as title, also removed an incorrect assertion Test Plan: make check; db_stress --mmap_read=1; db_stress --mmap_read=0 Reviewers: dhruba, emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D11367	2013-06-19 10:28:32 -07:00
Abhishek Kona	5ef6bb8c37	[rocksdb][refactor] statistic printing code to one place Summary: $title Test Plan: db_bench --statistics=1 Reviewers: haobo Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D11373	2013-06-18 20:28:41 -07:00
Haobo Xu	3cc1af2062	[RocksDB] Option for incremental sync Summary: This diff added an option to control the incremenal sync frequency. db_bench has a new flag bytes_per_sync for easy tuning exercise. Test Plan: make check; db_bench Reviewers: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11295	2013-06-18 15:00:32 -07:00
Dhruba Borthakur	6acbe0fc45	Compact multiple memtables before flushing to storage. Summary: Merge multiple multiple memtables in memory before writing it out to a file in L0. There is a new config parameter min_write_buffer_number_to_merge that specifies the number of write buffers that should be merged together to a single file in storage. The system will not flush wrte buffers to storage unless at least these many buffers have accumulated in memory. The default value of this new parameter is 1, which means that a write buffer will be immediately flushed to disk as soon it is ready. Test Plan: make check Differential Revision: https://reviews.facebook.net/D11241	2013-06-18 14:28:04 -07:00
Abhishek Kona	00124683de	[rocksdb] do not trim range for level0 in manual compaction Summary: https://code.google.com/p/leveldb/issues/detail?can=1&q=178&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary&id=178 Ported the solution as is to RocksDB. Test Plan: moved the unit test as manual_compaction_test Reviewers: dhruba Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11331	2013-06-17 13:58:17 -07:00
Abhishek Kona	bff718d81c	[Rocksdb] Implement filluniquerandom Summary: Use a bit set to keep track of which random number is generated. Currently only supports single-threaded. All our perf tests are run with threads=1 Copied over bitset implementation from common/datastructures Test Plan: printed the generated keys, and verified all keys were present. Reviewers: MarkCallaghan, haobo, dhruba Reviewed By: MarkCallaghan CC: leveldb Differential Revision: https://reviews.facebook.net/D11247	2013-06-14 16:17:56 -07:00
Haobo Xu	778e179046	[RocksDB] Sync file to disk incrementally Summary: During compaction, we sync the output files after they are fully written out. This causes unnecessary blocking of the compaction thread and burstiness of the write traffic. This diff simply asks the OS to sync data incrementally as they are written, on the background. The hope is that, at the final sync, most of the data are already on disk and we would block less on the sync call. Thus, each compaction runs faster and we could use fewer number of compaction threads to saturate IO. In addition, the write traffic will be smoothed out, hopefully reducing the IO P99 latency too. Some quick tests show 10~20% improvement in per thread compaction throughput. Combined with posix advice on compaction read, just 5 threads are enough to almost saturate the udb flash bandwidth for 800 bytes write only benchmark. What's more promising is that, with saturated IO, iostat shows average wait time is actually smoother and much smaller. For the write only test 800bytes test: Before the change: await occillate between 10ms and 3ms After the change: await ranges 1-3ms Will test against read-modify-write workload too, see if high read latency P99 could be resolved. Will introduce a parameter to control the sync interval in a follow up diff after cleaning up EnvOptions. Test Plan: make check; db_bench; db_stress Reviewers: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11115	2013-06-12 12:53:59 -07:00
Haobo Xu	bdf1085944	[RocksDB] cleanup EnvOptions Summary: This diff simplifies EnvOptions by treating it as POD, similar to Options. - virtual functions are removed and member fields are accessed directly. - StorageOptions is removed. - Options.allow_readahead and Options.allow_readahead_compactions are deprecated. - Unused global variables are removed: useOsBuffer, useFsReadAhead, useMmapRead, useMmapWrite Test Plan: make check; db_stress Reviewers: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11175	2013-06-12 11:17:19 -07:00
Haobo Xu	043573b24f	[RocksDB] Include 64bit random number generator Summary: As title. Test Plan: make check; Reviewers: chip, MarkCallaghan CC: leveldb Differential Revision: https://reviews.facebook.net/D11061	2013-06-04 13:52:27 -07:00
Haobo Xu	d897d33bf1	[RocksDB] Introduce Fast Mutex option Summary: This diff adds an option to specify whether PTHREAD_MUTEX_ADAPTIVE_NP will be enabled for the rocksdb single big kernel lock. db_bench also have this option now. Quickly tested 8 thread cpu bound 100 byte random read. No fast mutex: ~750k/s ops With fast mutex: ~880k/s ops Test Plan: make check; db_bench; db_stress Reviewers: dhruba CC: MarkCallaghan, leveldb Differential Revision: https://reviews.facebook.net/D11031	2013-06-01 23:11:34 -07:00
Haobo Xu	ab8d2f6ab2	[RocksDB] [Performance] Allow different posix advice to be applied to the same table file Summary: Current posix advice implementation ties up the access pattern hint with the creation of a file. It is not possible to apply different advice for different access (random get vs compaction read), without keeping two open files for the same table. This patch extended the RandomeAccessFile interface to accept new access hint at anytime. Particularly, we are able to set different access hint on the same table file based on when/how the file is used. Two options are added to set the access hint, after the file is first opened and after the file is being compacted. Test Plan: make check; db_stress; db_bench Reviewers: dhruba Reviewed By: dhruba CC: MarkCallaghan, leveldb Differential Revision: https://reviews.facebook.net/D10905	2013-05-30 19:08:44 -07:00

1 2 3 4 5 ...

320 Commits