rocksdb

Author	SHA1	Message	Date
sdong	edf1cd497f	Not generating "__attribute__((__unused__))" for padding fields if it is not CLANG Summary: Adding "__attribute__((__unused__))" after padding fields will pass CLANG build but will fail gcc 4.8.1. Fix it by not generating it under GCC 4.8.1. Test Plan: Build under four combinations of USE_CLANG=0,1 and ROCKSDB_FBCODE_BUILD_WITH_481=0,1. Reviewers: yhchiang, rven, ngbronson, anthony, IslamAbdelRahman Reviewed By: anthony Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D52371	2015-12-28 18:37:23 -08:00
sdong	11672df19a	Fix CLANG errors introduced by `7d87f02799` Summary: Fix some CLANG errors introduced in `7d87f02799` Test Plan: Build with both of CLANG and gcc Reviewers: rven, yhchiang, kradhakrishnan, anthony, IslamAbdelRahman, ngbronson Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D52329	2015-12-28 10:00:58 -08:00
Siying Dong	7fafd52dce	Merge pull request #900 from shuzhang1989/hdfs_env_fix add a factory method for creating hdfs env	2015-12-28 09:28:04 -08:00
Shu Zhang	2b7c810db8	more format	2015-12-26 19:52:35 -08:00
Shu Zhang	b79ccbd573	indent	2015-12-26 19:50:28 -08:00
Nathan Bronson	7d87f02799	support for concurrent adds to memtable Summary: This diff adds support for concurrent adds to the skiplist memtable implementations. Memory allocation is made thread-safe by the addition of a spinlock, with small per-core buffers to avoid contention. Concurrent memtable writes are made via an additional method and don't impose a performance overhead on the non-concurrent case, so parallelism can be selected on a per-batch basis. Write thread synchronization is an increasing bottleneck for higher levels of concurrency, so this diff adds --enable_write_thread_adaptive_yield (default off). This feature causes threads joining a write batch group to spin for a short time (default 100 usec) using sched_yield, rather than going to sleep on a mutex. If the timing of the yield calls indicates that another thread has actually run during the yield then spinning is avoided. This option improves performance for concurrent situations even without parallel adds, although it has the potential to increase CPU usage (and the heuristic adaptation is not yet mature). Parallel writes are not currently compatible with inplace updates, update callbacks, or delete filtering. Enable it with --allow_concurrent_memtable_write (and --enable_write_thread_adaptive_yield). Parallel memtable writes are performance neutral when there is no actual parallelism, and in my experiments (SSD server-class Linux and varying contention and key sizes for fillrandom) they are always a performance win when there is more than one thread. Statistics are updated earlier in the write path, dropping the number of DB mutex acquisitions from 2 to 1 for almost all cases. This diff was motivated and inspired by Yahoo's cLSM work. 
It is more conservative than cLSM: RocksDB's write batch group leader role is preserved (along with all of the existing flush and write throttling logic) and concurrent writers are blocked until all memtable insertions have completed and the sequence number has been advanced, to preserve linearizability. My test config is "db_bench -benchmarks=fillrandom -threads=$T -batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -disable_auto_compactions --max_write_buffer_number=8 -max_background_flushes=8 --disable_wal --write_buffer_size=160000000 --block_size=16384 --allow_concurrent_memtable_write" on a two-socket Xeon E5-2660 @ 2.2Ghz with lots of memory and an SSD hard drive. With 1 thread I get ~440Kops/sec. Peak performance for 1 socket (numactl -N1) is slightly more than 1Mops/sec, at 16 threads. Peak performance across both sockets happens at 30 threads, and is ~900Kops/sec, although with fewer threads there is less performance loss when the system has background work. Test Plan: 1. concurrent stress tests for InlineSkipList and DynamicBloom 2. make clean; make check 3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench 4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench 5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench 6. make clean; OPT=-DROCKSDB_LITE make check 7. verify no perf regressions when disabled Reviewers: igor, sdong Reviewed By: sdong Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba Differential Revision: https://reviews.facebook.net/D50589	2015-12-25 11:03:40 -08:00
Shu Zhang	b4aa823661	format	2015-12-24 20:38:35 -08:00
Shu Zhang	4dfdd1d928	format	2015-12-24 20:32:29 -08:00
Siying Dong	298ba27ae2	Merge pull request #846 from yuslepukhin/enble_c4244_lossofdata Enable MS compiler warning c4244.	2015-12-23 22:59:42 -08:00
Siying Dong	7810aa802a	Merge pull request #899 from zhipeng-jia/fix_clang_warning Fix clang warnings	2015-12-23 22:58:52 -08:00
Zhipeng Jia	ec2664fefd	Fix clang compile error under Linux	2015-12-24 12:41:40 +08:00
Shu Zhang	4fd23fb130	add a factory method for creating hdfs env	2015-12-23 17:26:50 -08:00
sdong	15b8902264	Change default options.delayed_write_rate Summary: We now have a mechanism to further slowdown writes. Double default options.delayed_write_rate to try to keep the default behavior closer to it used to be. Test Plan: Run all tests. Reviewers: IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: yhchiang, kradhakrishnan, rven, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D52281	2015-12-23 14:51:55 -08:00
sdong	b9f77ba12b	When slowdown is triggered, reduce the write rate Summary: It's usually hard for users to set a value of options.delayed_write_rate. With this diff, after slowdown condition triggers, we greedily reduce write rate if estimated pending compaction bytes increase. If estimated compaction pending bytes drop, we increase the write rate. Test Plan: Add a unit test Test with db_bench setting: TEST_TMPDIR=/dev/shm/ ./db_bench --benchmarks=fillrandom -num=10000000 --soft_pending_compaction_bytes_limit=1000000000 --hard_pending_compaction_bytes_limit=3000000000 --delayed_write_rate=100000000 and make sure without the commit, write stop will happen, but with the commit, it will not happen. Reviewers: igor, anthony, rven, yhchiang, kradhakrishnan, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D52131	2015-12-23 11:33:15 -08:00
Igor Canadi	8ac7fb8377	Merge pull request #863 from zhangyybuaa/fix_hdfs_error Fix build error with hdfs	2015-12-22 09:27:51 +01:00
sdong	167fb919a5	ZSTD to use CompressionOptions.level Summary: Now ZSTD hard code level 1. Change it to use the compression level setting. Test Plan: Run it with hacked codes of sst_dump and show ZSTD compression sizes with different levels. Reviewers: rven, anthony, yhchiang, kradhakrishnan, igor, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: yoshinorim, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D52041	2015-12-16 16:58:04 -08:00
Islam AbdelRahman	aececc209e	Introduce ReadOptions::pin_data (support zero copy for keys) Summary: This patch update the Iterator API to introduce new functions that allow users to keep the Slices returned by key() valid as long as the Iterator is not deleted ReadOptions::pin_data : If true keep loaded blocks in memory as long as the iterator is not deleted Iterator::IsKeyPinned() : If true, this mean that the Slice returned by key() is valid as long as the iterator is not deleted Also add a new option BlockBasedTableOptions::use_delta_encoding to allow users to disable delta_encoding if needed. Benchmark results (using https://phabricator.fb.com/P20083553) ``` // $ du -h /home/tec/local/normal.4K.Snappy/db10077 // 6.1G /home/tec/local/normal.4K.Snappy/db10077 // $ du -h /home/tec/local/zero.8K.LZ4/db10077 // 6.4G /home/tec/local/zero.8K.LZ4/db10077 // Benchmarks for shard db10077 // _build/opt/rocks/benchmark/rocks_copy_benchmark \ // --normal_db_path="/home/tec/local/normal.4K.Snappy/db10077" \ // --zero_db_path="/home/tec/local/zero.8K.LZ4/db10077" // First run // ============================================================================ // rocks/benchmark/RocksCopyBenchmark.cpp relative time/iter iters/s // ============================================================================ // BM_StringCopy 1.73s 576.97m // BM_StringPiece 103.74% 1.67s 598.55m // ============================================================================ // Match rate : 1000000 / 1000000 // Second run // ============================================================================ // rocks/benchmark/RocksCopyBenchmark.cpp relative time/iter iters/s // ============================================================================ // BM_StringCopy 611.99ms 1.63 // BM_StringPiece 203.76% 300.35ms 3.33 // ============================================================================ // Match rate : 1000000 / 1000000 ``` Test Plan: Unit tests Reviewers: sdong, igor, anthony, yhchiang, rven 
Reviewed By: rven Subscribers: dhruba, lovro, adsharma Differential Revision: https://reviews.facebook.net/D48999	2015-12-16 12:08:30 -08:00
Venkatesh Radhakrishnan	030215bf01	Running manual compactions in parallel with other automatic or manual compactions in restricted cases Summary: This diff provides a framework for doing manual compactions in parallel with other compactions. We now have a deque of manual compactions. We also pass manual compactions as an argument from RunManualCompactions down to BackgroundCompactions, so that RunManualCompactions can be reentrant. Parallelism is controlled by the two routines ConflictingManualCompaction to allow/disallow new parallel/manual compactions based on already existing ManualCompactions. In this diff, by default manual compactions still have to run exclusive of other compactions. However, by setting the compaction option, exclusive_manual_compaction to false, it is possible to run other compactions in parallel with a manual compaction. However, we are still restricted to one manual compaction per column family at a time. All of these restrictions will be relaxed in future diffs. I will be adding more tests later. Test Plan: Rocksdb regression + new tests + valgrind Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong Reviewed By: sdong Subscribers: yoshinorim, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D47973	2015-12-14 11:20:34 -08:00
Dmitri Smirnov	aca403d2b5	Fix another rebase problems.	2015-12-11 17:33:40 -08:00
Dmitri Smirnov	236fe21c92	Enable MS compiler warning c4244. Mostly due to the fact that there are differences in sizes of int,long on 64 bit systems vs GNU.	2015-12-11 16:47:34 -08:00
Yueh-Hsuan Chiang	00d6edf6a0	Ensure the destruction order of PosixEnv and ThreadLocalPtr Summary: By default, RocksDB initializes the singletons of ThreadLocalPtr first, then initializes PosixEnv via static initializer. Destructor terminates objects in reverse order, so terminating PosixEnv (calling pthread_mutex_lock), then ThreadLocal (calling pthread_mutex_destroy). However, in certain case, application might initialize PosixEnv first, then ThreadLocalPtr. This will cause core dump at the end of the program (eg. https://github.com/facebook/mysql-5.6/issues/122) This patch fix this issue by ensuring the destruction order by moving the global static singletons to function static singletons. Since function static singletons are initialized when the function is first called, this property allows us invoke to enforce the construction of the static PosixEnv and the singletons of ThreadLocalPtr by calling the function where the ThreadLocalPtr singletons belongs right before we initialize the static PosixEnv. Test Plan: Verified in the MyRocks. Reviewers: yoshinorim, IslamAbdelRahman, rven, kradhakrishnan, anthony, sdong, MarkCallaghan Reviewed By: anthony Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D51789	2015-12-11 00:21:58 -08:00
charsyam	c30b499541	fix typos in comments	2015-12-11 01:54:48 +09:00
sdong	56e77f0967	Deprecate options.soft_rate_limit and add options.soft_pending_compaction_bytes_limit Summary: Deprecate options.soft_rate_limit, which is hard to tune, with options.soft_pending_compaction_bytes_limit, which would trigger the slowdown if estimated pending compaction bytes exceeds the threshold. The hope is to make it more straightforward to tune. Test Plan: Modify DBTest.SoftLimit to cover options.soft_pending_compaction_bytes_limit instead; run all unit tests. Reviewers: IslamAbdelRahman, yhchiang, rven, kradhakrishnan, igor, anthony Reviewed By: anthony Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D51117	2015-12-09 18:22:45 -08:00
sdong	d6e1035a1f	A new compaction picking priority that optimizes for write amplification for random updates. Summary: Introduce a compaction picking priority that picks files who contains the oldest rows to compact. This is a mode that slightly improves write amplification for random update cases. Test Plan: Add a unit test and run it in valgrind too. Reviewers: yhchiang, anthony, IslamAbdelRahman, rven, kradhakrishnan, MarkCallaghan, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D51459	2015-12-09 18:13:03 -08:00
yuslepukhin	49957f9a98	Prefer integer arithmetics The code had conversion to double then casting to size_t and then casting uint32_t which caused compiler warning (VS15).	2015-12-09 14:06:23 -08:00
Siying Dong	9c227923c6	Merge pull request #788 from OpenChannelSSD/to_fb_master2 Move posix threads into a library	2015-12-08 18:06:38 -08:00
Siying Dong	fa3dbf203f	Merge pull request #853 from Vaisman/enable_C4267_warning Enable C4267 warning	2015-12-08 17:59:24 -08:00
Siying Dong	56bbecc316	Merge pull request #867 from SherlockNoMad/CacheFix Replace malloc with new for LRU Cache Handle	2015-12-08 17:58:29 -08:00
Yueh-Hsuan Chiang	774b80e99e	Resubmit the fix for a race condition in persisting options Summary: This patch fix a race condition in persisting options which will cause a crash when: * Thread A obtain cf options and start to persist options based on that cf options. * Thread B kicks in and finish DropColumnFamily and delete cf_handle. * Thread A wakes up and tries to finish the persisting options and crashes. Test Plan: Add a test in column_family_test that can reproduce the crash Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong Reviewed By: sdong Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D51717	2015-12-08 17:01:02 -08:00
sdong	ea11923550	Upgrade to ZSTD 0.4.2 Summary: Change to call the new compression function. Test Plan: build and run db_bench with the compression to make sure it compresses. Reviewers: anthony, rven, kradhakrishnan, IslamAbdelRahman, igor, yhchiang Reviewed By: yhchiang Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D51603	2015-12-08 16:33:26 -08:00
sdong	770dea9325	Fix occasional failure of DBTest.DynamicCompactionOptions Summary: DBTest.DynamicCompactionOptions occasionally fails during valgrind run. We sent a sleeping task to block compaction thread pool but we don't wait for it to run. Test Plan: Run the test multiple times in an environment which can cause failure. Reviewers: rven, kradhakrishnan, igor, IslamAbdelRahman, anthony, yhchiang Reviewed By: yhchiang Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D51687	2015-12-07 18:38:39 -08:00
sdong	f307036bde	Revert "Fix a race condition in persisting options" This reverts commit `2fa3ed5180`. It breaks RocksDB lite build	2015-12-07 17:09:12 -08:00
Yueh-Hsuan Chiang	2fa3ed5180	Fix a race condition in persisting options Summary: This patch fix a race condition in persisting options which will cause a crash when: * Thread A obtain cf options and start to persist options based on that cf options. * Thread B kicks in and finish DropColumnFamily and delete cf_handle. * Thread A wakes up and tries to finish the persisting options and crashes. Test Plan: Add a test in column_family_test that can reproduce the crash Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D51609	2015-12-07 15:25:12 -08:00
Javier González	b2863017b1	Move posix threads into a library Summary: This patch moves all posix thread logic to a separate library. The motivation is to allow another environments to easily reuse posix threads. HDFS wraps already posix threads; this split would simplify this code. Test Plan: No new functionality is added to posix Env or the threading library, thus the current tests should suffice.	2015-12-07 12:03:38 +01:00
SherlockNoMad	3a98a7ae7f	Replace malloc with new for LRU Cache Handle	2015-12-04 15:12:07 -08:00
Zhang Yangyang	4687ced5db	fix ToString() not declared error	2015-12-02 21:45:28 +08:00
sdong	d27ea4c9e5	Initialize options.row_cache Summary: options.row_cache should already been initialized as null by default. Still try to set it following current convention, because one valgrind failure reports a failure related to it. Test Plan: Run all unit tests Reviewers: yhchiang, kradhakrishnan, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D51303	2015-11-30 10:30:35 -08:00
Nathan Bronson	9a9d4759b2	InlineSkipList part 3/3 - new skiplist type that colocates key and node Summary: This diff completes the creation of InlineSkipList<Cmp>, which is like SkipList<const char*, Cmp> but it always allocates the key contiguously with the node. This allows us to remove the pointer from the node to the key. As a result the memory usage of the skip list is reduced (by 1 to sizeof(void*) bytes depending on the padding required to align the key storage), cache locality is improved, and we halve the number of calls to the allocator. For skip lists whose keys are freshly-allocated const char*, InlineSkipList is strictly preferable to SkipList. This diff doesn't replace SkipList, however, because some of the use cases of SkipList in RocksDB are either character sequences that are not allocated at the same time as the skip list node allocation (for example hash_linklist_rep) or have different key types (for example write_batch_with_index). Taking advantage of inline allocation for those cases is left to future work. The perf win is biggest for small values. For single-threaded CPU-bound (32M fillrandom operations with no WAL log) with 16 byte keys and 0 byte values, the db_bench perf goes from ~310k ops/sec to ~410k ops/sec. For large values the improvement is less pronounced, but seems to be between 5% and 10% on the same configuration. Test Plan: make check Reviewers: igor, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D51123	2015-11-24 15:16:02 -08:00
Vasili Svirski	41b32c6059	Enable C4267 warning * conversion from 'size_t' to 'type', by add static_cast Tested: * by build solution on Windows, Linux locally, * run tests * build CI system successful	2015-11-24 16:33:09 +03:00
yuslepukhin	047bd22aae	Build on Visual Studio 2015 Update 1	2015-11-20 15:31:47 -08:00
Dmitri Smirnov	89bacb7e7d	Enable MS Warning C4804 : unsafe use of type 'bool' in operation	2015-11-18 16:23:19 -08:00
Islam AbdelRahman	4159ab8169	Merge pull request #839 from SherlockNoMad/memtableOption Support Memtable Factory Parse in option_helper.cc	2015-11-17 17:09:49 -08:00
sdong	6170fec251	Fix build broken by previous commit of "option helper refactor" Summary: The commit of option helper refactor broken the build: (1) a git merge problem (2) some uncaught compiler warning Fix it. Test Plan: Make sure "make all" passes Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, yhchiang Reviewed By: yhchiang Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D50943	2015-11-17 16:52:54 -08:00
Siying Dong	3a6643c2fd	Merge pull request #805 from SherlockNoMad/OptionHelperFix Option Helper Refactoring	2015-11-17 16:24:52 -08:00
SherlockNoMad	bd7be035e0	Support Memtable Factory Parse in option_helper.cc	2015-11-17 14:29:01 -08:00
Islam AbdelRahman	a163cc2d5a	Lint everything Summary: ``` arc2 lint --everything ``` run the linter on the whole code repo to fix existing lint issues Test Plan: make check -j64 Reviewers: sdong, rven, anthony, kradhakrishnan, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D50769	2015-11-16 12:56:21 -08:00
Yueh-Hsuan Chiang	e11f676e34	Add OptionsUtil::LoadOptionsFromFile() API Summary: This patch adds OptionsUtil::LoadOptionsFromFile() and OptionsUtil::LoadLatestOptionsFromDB(), which allow developers to construct DBOptions and ColumnFamilyOptions from a RocksDB options file. Note that most pointer-typed options such as merge_operator will not be constructed. With this API, developers no longer need to remember all the options in order to reopen an existing rocksdb instance like the following: DBOptions db_options; std::vector<std::string> cf_names; std::vector<ColumnFamilyOptions> cf_opts; // Load primitive-typed options from an existing DB OptionsUtil::LoadLatestOptionsFromDB( dbname, &db_options, &cf_names, &cf_opts); // Initialize necessary pointer-typed options cf_opts[0].merge_operator.reset(new MyMergeOperator()); ... // Construct the vector of ColumnFamilyDescriptor std::vector<ColumnFamilyDescriptor> cf_descs; for (size_t i = 0; i < cf_opts.size(); ++i) { cf_descs.emplace_back(cf_names[i], cf_opts[i]); } // Open the DB DB* db = nullptr; std::vector<ColumnFamilyHandle*> cf_handles; auto s = DB::Open(db_options, dbname, cf_descs, &cf_handles, &db); Test Plan: Augment existing tests in column_family_test options_test db_test Reviewers: igor, IslamAbdelRahman, sdong, anthony Reviewed By: anthony Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D49095	2015-11-12 06:52:43 -08:00
Yueh-Hsuan Chiang	e114f0abb8	Enable RocksDB to persist Options file. Summary: This patch allows rocksdb to persist options into a file on DB::Open, SetOptions, and Create / Drop ColumnFamily. Options files are created under the same directory as the rocksdb instance. In addition, this patch also adds a fail_if_missing_options_file in DBOptions that makes any function call return non-ok status when it is not able to persist options properly. // If true, then DB::Open / CreateColumnFamily / DropColumnFamily // / SetOptions will fail if options file is not detected or properly // persisted. // // DEFAULT: false bool fail_if_missing_options_file; Options file names are formatted as OPTIONS-<number>, and RocksDB will always keep the latest two options files. Test Plan: Add options_file_test. options_test column_family_test Reviewers: igor, IslamAbdelRahman, sdong, anthony Reviewed By: anthony Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48285	2015-11-10 22:58:01 -08:00
Dmitri Smirnov	5270b33bd3	Make use of portable `uint64_t` type to make possible file access in 64-bit. Currently, a signed off_t type is being used for the following interfaces for both offset and the length in bytes: * `Allocate` * `RangeSync` On Linux `off_t` is automatically either 32 or 64-bit depending on the platform. On Windows it is always a 32-bit signed long which limits file access and in particular space pre-allocation to effectively 2 Gb. Proposal is to replace off_t with uint64_t as a portable type always access files with 64-bit interfaces. May need to modify posix code but lack resources to test it.	2015-11-10 17:03:42 -08:00
Nathan Bronson	505accda38	remove constexpr from util/random.h for MSVC compat Summary: Scoped anonymous enums seem to be better supported than static constexpr at the moment, so this diff replaces the latter with the former. Also, this diff removes an incorrect inclusion of pthread.h. MSVC build was broken starting with D50439. Test Plan: 1. build 2. observe proper skiplist behavior by absence of pathological slowdown 3. push diff to tmp_try_windows branch to tickle AppVeyor 4. wait for contbuild before committing to master Reviewers: sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D50517	2015-11-10 16:41:23 -08:00

1129 Commits