rocksdb

Author	SHA1	Message	Date
Andrew Kryczka	a4d9c02511	Pass CF ID to MemTableRepFactory Summary: Some users want to monitor column family activity in their custom memtable implementations. Previously there was no way to figure out with which column family a memtable is associated. This diff: - adds an overload to MemTableRepFactory::CreateMemTableRep() that provides the CF ID. For compatibility, its default implementation calls the old overload. - updates MemTable to create MemTableRep's using the new overload. Closes https://github.com/facebook/rocksdb/pull/2346 Differential Revision: D5108061 Pulled By: ajkr fbshipit-source-id: 3a1921214a348dd8ea0f54e1cab3b71c3d46d616	2017-06-02 12:12:06 -07:00
Yi Wu	f68d88be51	Fix DBWriteTest::ReturnSequenceNumberMultiThreaded data race Summary: rocksdb::Random is not thread-safe. Have one Random for each thread instead. Closes https://github.com/facebook/rocksdb/pull/2400 Differential Revision: D5173919 Pulled By: yiwu-arbug fbshipit-source-id: 1a99c7b877f3893eb22355af49e321bcad4e53e6	2017-06-02 11:42:11 -07:00
Andrew Kryczka	215076ef06	Fix TSAN: avoid arena mode with range deletions Summary: The range deletion meta-block iterators weren't getting cleaned up properly since they don't support arena allocation. I didn't implement arena support since, in the general case, each iterator is used only once and separately from all other iterators, so there should be no benefit to data locality. Anyways, this diff fixes up #2370 by treating range deletion iterators as non-arena-allocated. Closes https://github.com/facebook/rocksdb/pull/2399 Differential Revision: D5171119 Pulled By: ajkr fbshipit-source-id: bef6f5c4c5905a124f4993945aed4bd86e2807d8	2017-06-01 22:26:49 -07:00
Andrew Kryczka	3a8a848a55	account for L0 size in estimated compaction bytes Summary: also changed the `>` in the comparison against `level0_file_num_compaction_trigger` into a `>=` since exactly `level0_file_num_compaction_trigger` can trigger a compaction from L0. Closes https://github.com/facebook/rocksdb/pull/2179 Differential Revision: D4915772 Pulled By: ajkr fbshipit-source-id: e38fec6253de6f9a40e61734615c6670d84038aa	2017-06-01 17:56:59 -07:00
Tamir Duberstein	0dc3040d54	db: avoid `#include`ing malloc and jemalloc simultaneously Summary: This fixes a compilation failure on Linux when the system libc is not glibc. jemalloc's configure script incorrectly assumes that glibc is always used on Linux systems, producing glibc-style signatures; when the system libc is e.g. musl, the following error is observed: ``` [ 0%] Building CXX object CMakeFiles/rocksdb.dir/db/db_impl.cc.o In file included from /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb.src/table/block.h:19:0, from /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb.src/db/db_impl.cc:77: /x-tools/x86_64-unknown-linux-musl/x86_64-unknown-linux-musl/sysroot/usr/include/malloc.h:19:8: error: declaration of 'size_t malloc_usable_size(void*)' has a different exception specifier size_t malloc_usable_size(void *); ^~~~~~~~~~~~~~~~~~ In file included from /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb.src/db/db_impl.cc:20:0: /go/native/x86_64-unknown-linux-musl/jemalloc/include/jemalloc/jemalloc.h:78:33: note: from previous declaration 'size_t malloc_usable_size(void*) throw ()' # define je_malloc_usable_size malloc_usable_size ^ /go/native/x86_64-unknown-linux-musl/jemalloc/include/jemalloc/jemalloc.h:239:41: note: in expansion of macro 'je_malloc_usable_size' JEMALLOC_EXPORT size_t JEMALLOC_NOTHROW je_malloc_usable_size( ^~~~~~~~~~~~~~~~~~~~~ CMakeFiles/rocksdb.dir/build.make:350: recipe for target 'CMakeFiles/rocksdb.dir/db/db_impl.cc.o' failed ``` This works around the issue by rearranging the sources such that jemalloc's headers are never in the same scope as the system's malloc header. The jemalloc issue has been reported as well, see: https://github.com/jemalloc/jemalloc/issues/778. cc tschottdorf Closes https://github.com/facebook/rocksdb/pull/2188 Differential Revision: D5163048 Pulled By: siying fbshipit-source-id: c553125458892def175c1be5682b0330d80b2a0d	2017-05-31 22:43:02 -07:00
Andrew Kryczka	9c9909bf7d	Support ingest file when range deletions exist Summary: Previously we returned NotSupported when ingesting files into a database containing any range deletions. This diff adds the support. - Flush if any memtable contains range deletions overlapping the to-be-ingested file - Place to-be-ingested file before any level that contains range deletions overlapping it. - Added support for `Version` to return iterators over range deletions in a given level. Previously, we piggybacked getting range deletions onto `Version`'s `Get()` / `AddIterator()` functions by passing them a `RangeDelAggregator*`. But file ingestion needs to get iterators over range deletions, not populate an aggregator (since the aggregator does collapsing and doesn't expose the actual ranges). Closes https://github.com/facebook/rocksdb/pull/2370 Differential Revision: D5127648 Pulled By: ajkr fbshipit-source-id: 816faeb9708adfa5287962bafdde717db56e3f1a	2017-05-31 13:57:19 -07:00
Yi Wu	ad19eb8686	Fixing blob db sequence number handling Summary: Blob db relies on base db returning sequence number through write batch after DB::Write(). However after recent changes to the write path, DB::Write() no longer returns sequence number in some cases. Fixing it by having WriteBatchInternal::InsertInto() always encode sequence number into write batch. Stacking on #2375. Closes https://github.com/facebook/rocksdb/pull/2385 Differential Revision: D5148358 Pulled By: yiwu-arbug fbshipit-source-id: 8bda0aa07b9334ed03ed381548b39d167dc20c33	2017-05-31 10:56:45 -07:00
Siying Dong	51ac91f586	Histogram of number of merge operands Summary: Add a histogram in statistics to help users understand how many merge operands they merge. Closes https://github.com/facebook/rocksdb/pull/2373 Differential Revision: D5139983 Pulled By: siying fbshipit-source-id: 61b9ba8ca83f358530a4833d68f0103b56a0e182	2017-05-31 07:41:44 -07:00
Tamir Duberstein	103d0692ea	Avoid unsupported attributes when not building with UBSAN Summary: yiwu-arbug see individual commits. Closes https://github.com/facebook/rocksdb/pull/2318 Differential Revision: D5141520 Pulled By: yiwu-arbug fbshipit-source-id: 7987c92ab4461eef36afce5a133d3a0ee0c96300	2017-05-30 11:13:01 -07:00
赵星宇	d03c34497c	update comment of GetNextFile Summary: Closes https://github.com/facebook/rocksdb/pull/2377 Differential Revision: D5141274 Pulled By: lightmark fbshipit-source-id: c237a285b73ad93488c080ea80c71a29a17f1be0	2017-05-26 15:12:13 -07:00
Aaron Gao	f7bb1a0060	support merge and delete in file ingestion Summary: Previously sst_file_writer only supported kTypeValue; we also need kTypeMerge and kTypeDeletion, as users requested. Closes https://github.com/facebook/rocksdb/pull/2361 Differential Revision: D5139402 Pulled By: lightmark fbshipit-source-id: 092a60756d01692539d817a3765ebfd58a8d7f88	2017-05-26 12:11:21 -07:00
Sagar Vemuri	7bb1f5d483	Increase of compaction threads should be logged at info level instead of a warning Summary: This log message shouldn't be a warning; some services are seeing high warning count due to this. The count for the below line is a few hundreds of millions, as per Logview: ``` [rocksdb/src/db/column_family.cc:729] [checkpoints] Increasing compaction threads because we have 2 level-0 files ``` Closes https://github.com/facebook/rocksdb/pull/2364 Differential Revision: D5123565 Pulled By: sagar0 fbshipit-source-id: a07ce499a4f82f0ebde9cda9f4948fb9df6a734c	2017-05-26 09:56:13 -07:00
Andrew Kryczka	a99fb9928f	fix column_family_test asan Summary: stop calling Close() at the end of tests holding a compaction pressure token since it causes the write controller to be deleted while it's still needed. these calls were pointless anyways since Close() is already called in the test's destructor. Closes https://github.com/facebook/rocksdb/pull/2367 Differential Revision: D5125906 Pulled By: ajkr fbshipit-source-id: 6cad8673e5546a82ff602ac0ba59cc3f68dbde46	2017-05-24 16:41:51 -07:00
Andrew Kryczka	bb01c1880c	Introduce max_background_jobs mutable option Summary: - `max_background_flushes` and `max_background_compactions` are still supported for backwards compatibility - `base_background_compactions` is completely deprecated. Now we just throttle to one background compaction when there's no pressure. - `max_background_jobs` is added to automatically partition the concurrent background jobs into flushes vs compactions. Currently it's very simple as we just allocate one-fourth of the jobs to flushes, and the remaining can be used for compactions. - The test cases that set `base_background_compactions > 1` needed to be updated. I just grab the pressure token such that the desired number of compactions can be scheduled. Closes https://github.com/facebook/rocksdb/pull/2205 Differential Revision: D4937461 Pulled By: ajkr fbshipit-source-id: df52cbbd497e13bbc9a60560a5ac2a2526b3f1f9	2017-05-24 11:29:08 -07:00
Siying Dong	41cbb72749	options.delayed_write_rate use the rate of rate_limiter by default. Summary: It's hard for RocksDB to come up with a good default of delayed write rate. Use the rate given by the rate limiter if it is available. This provides the I/O order of magnitude. Closes https://github.com/facebook/rocksdb/pull/2357 Differential Revision: D5115324 Pulled By: siying fbshipit-source-id: 341065ad2211c981fc804011c0f0e59a50c7e754	2017-05-24 09:58:24 -07:00
Igor Canadi	52d9e5f7b6	Fix column family seconds_up accounting Summary: `cf_stats_snapshot_.seconds_up` appears to be never updated, unlike `db_stats_snapshot_.seconds_up`, which is updated here: https://github.com/facebook/rocksdb/blob/master/db/internal_stats.cc#L883 This leads to wrong information in the log, for example: ``` Compaction Stats [default] .... Uptime(secs): 85591.2 total, 85591.2 interval ``` Even though DB's interval is correctly logged as 60 seconds: ``` DB Stats Uptime(secs): 85591.2 total, 637.8 interval ``` Closes https://github.com/facebook/rocksdb/pull/2338 Differential Revision: D5114131 Pulled By: sagar0 fbshipit-source-id: 85243a38213236ccbb601a7f7aaa8865eaa8083c	2017-05-23 17:14:04 -07:00
Sagar Vemuri	7d8207f1f2	Fix errors in clang-analyzer builds Summary: Fix build error in db_iter.cc when running clang-analyzer. ``` CC db/db_iter.o db/db_iter.cc:938:21: error: no matching constructor for initialization of 'rocksdb::ParsedInternalKey' ParsedInternalKey ikey(Slice(), 0, 0); ^ ~~~~~~~~~~~~~ ./db/dbformat.h:84:3: note: candidate constructor not viable: no known conversion from 'int' to 'rocksdb::ValueType' for 3rd argument ParsedInternalKey(const Slice& u, const SequenceNumber& seq, ValueType t) ^ ./db/dbformat.h:78:8: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 3 were provided struct ParsedInternalKey { ^ ./db/dbformat.h:78:8: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 3 were provided ./db/dbformat.h:83:3: note: candidate constructor not viable: requires 0 arguments, but 3 were provided ParsedInternalKey() { } // Intentionally left uninitialized (for speed) ^ 1 error generated. ``` Closes https://github.com/facebook/rocksdb/pull/2354 Differential Revision: D5115751 Pulled By: sagar0 fbshipit-source-id: b0e386d4e935e4725b07761c3ca5f7a8cbde3692	2017-05-23 15:11:42 -07:00
Andrew Kryczka	6cc9aef162	New API for background work in single thread pool Summary: Previously users could set `max_background_flushes=0` to force rocksdb to use a single thread pool for both background flushes and compactions. That'll no longer be possible since I'm going to deprecate `max_background_flushes` and `max_background_compactions` in favor of a single option. This diff introduces a new way to force a single thread pool: when high-pri pool has zero threads, all background jobs will be submitted to low-pri pool. Note the majority of the code change is adding `Env::GetBackgroundThreads()`, which is necessary to check whether the user has provided a zero-sized thread pool. Closes https://github.com/facebook/rocksdb/pull/2204 Differential Revision: D4936256 Pulled By: ajkr fbshipit-source-id: 929a07a0c0705f7766f5339cd013ff74e90d6e01	2017-05-23 11:12:27 -07:00
Yi Wu	9d0a07ed52	Fix rocksdb.estimate-num-keys DB property underflow Summary: rocksdb.estimate-num-keys is computed from `estimate_num_keys - 2 * estimate_num_deletes`. If `2 * estimate_num_deletes > estimate_num_keys` it will underflow. Fixing it. Closes https://github.com/facebook/rocksdb/pull/2348 Differential Revision: D5109272 Pulled By: yiwu-arbug fbshipit-source-id: e1bfb91346a59b7282a282b615002507e9d7c246	2017-05-23 10:42:59 -07:00
Aaron Gao	3e86c0f07c	disable direct reads for log and manifest and add direct io to tests Summary: Disable direct reads for log and manifest. Direct reads should not affect sequential_file. Also add kDirectIO for option_config_ in db_test_util. Closes https://github.com/facebook/rocksdb/pull/2337 Differential Revision: D5100261 Pulled By: lightmark fbshipit-source-id: 0ebfd13b93fa1b8f9acae514ac44f8125a05868b	2017-05-22 18:41:28 -07:00
Sagar Vemuri	228f49d20a	Fix data races caught by tsan Summary: This fixes the tsan build failures in: - write_callback_test - persistent_cache_test.* Closes https://github.com/facebook/rocksdb/pull/2339 Differential Revision: D5101190 Pulled By: sagar0 fbshipit-source-id: 537e19ed05272b1f34cfbf793aa822b2264a1643	2017-05-22 10:27:23 -07:00
Yi Wu	07bdcb91fe	New WriteImpl to pipeline WAL/memtable write Summary: PipelineWriteImpl is an alternative approach to WriteImpl. In WriteImpl, only one thread is allowed to write at the same time. This thread will do both WAL and memtable writes for all write threads in the write group. Pending writers wait in queue until the current writer finishes. In the pipeline write approach, two queues are maintained: one WAL writer queue and one memtable writer queue. All writers (regardless of whether they need to write WAL) will still need to first join the WAL writer queue, and after the house keeping work and WAL writing, they will need to join the memtable writer queue if needed. The benefit of this approach is that 1. Writers without memtable writes (e.g. the prepare phase of two phase commit) can exit the write thread once the WAL write is finished. They don't need to wait for memtable writes in case of group commit. 2. Pending writers only need to wait for the previous WAL writer to finish to be able to join the write thread, instead of also waiting for previous memtable writes. Merging #2056 and #2058 into this PR. Closes https://github.com/facebook/rocksdb/pull/2286 Differential Revision: D5054606 Pulled By: yiwu-arbug fbshipit-source-id: ee5b11efd19d3e39d6b7210937b11cefdd4d1c8d	2017-05-19 14:26:42 -07:00
Yi Wu	d746aead1a	Suppress clang-analyzer false positive Summary: Fixing two types of clang-analyzer false positives: * db is deleted and then reopened, and clang-analyzer thinks we are reusing the pointer after it has been deleted. Adding asserts to hint to clang-analyzer that the pointer is recreated. * ParsedInternalKey is (intentionally) uninitialized. Initialize the struct only when clang-analyzer is running. Closes https://github.com/facebook/rocksdb/pull/2334 Differential Revision: D5093801 Pulled By: yiwu-arbug fbshipit-source-id: f51355382098eb3da5ab9f64e094c6d03e6bdf7d	2017-05-19 10:56:28 -07:00
Siying Dong	217b866f47	column_family_test: EnvCounter::num_new_writable_file_ to be atomic Summary: TSAN shows warning of data race of EnvCounter::num_new_writable_file_. Make it atomic. Closes https://github.com/facebook/rocksdb/pull/2331 Differential Revision: D5089215 Pulled By: siying fbshipit-source-id: 15f6dcfb770a3310cbb6337c22482c8b330daffc	2017-05-18 13:56:12 -07:00
yizhu.sun	f5ba131bf8	Fixed some spelling mistakes Summary: Closes https://github.com/facebook/rocksdb/pull/2314 Differential Revision: D5079601 Pulled By: sagar0 fbshipit-source-id: ae5696fd735718f544435c64c3179c49b8c04349	2017-05-17 23:12:36 -07:00
hyunwoo	0ebdd70579	fixed typo Summary: fixed typo Closes https://github.com/facebook/rocksdb/pull/2312 Differential Revision: D5079631 Pulled By: sagar0 fbshipit-source-id: e4c8d1d89b244ee69e9dea1dd013227cc5241026	2017-05-17 16:41:49 -07:00
Mikhail Antonov	ba685a472a	Support ingest_behind for IngestExternalFile Summary: First cut for early review; there are a few conceptual points to answer and some code structure issues. For conceptual points - - restriction-wise, we're going to disallow ingest_behind if (use_seqno_zero_out=true || disable_auto_compaction=false), the user is responsible to properly open and close DB with required params - we wanted to ingest into reserved bottom most level. Should we fail fast if bottom level isn't empty, or should we attempt to ingest if file fits there key-ranges-wise? - Modifying AssignLevelForIngestedFile seems to be the place where we'd handle that. On code structure - going to refactor GenerateAndAddExternalFile call in the test class to allow passing instance of IngestionOptions, that's just going to incur lots of changes at callsites. Closes https://github.com/facebook/rocksdb/pull/2144 Differential Revision: D4873732 Pulled By: lightmark fbshipit-source-id: 81cb698106b68ef8797f564453651d50900e153a	2017-05-17 11:42:42 -07:00
boolean5	cb9392a094	add Transactions and Checkpoint to C API Summary: I've added functions to the C API to support Transactions as requested in #1637 and to support Checkpoint. I have also added the corresponding tests to c_test.c For now, the following is omitted: 1. Optimistic Transactions 2. The column family variation of functions Closes https://github.com/facebook/rocksdb/pull/2236 Differential Revision: D4989510 Pulled By: yiwu-arbug fbshipit-source-id: 518cb39f76d5e9ec9690d633fcdc014b98958071	2017-05-16 22:59:43 -07:00
hyunwoo	f720796e24	fixed typo Summary: fixed exisitng -> existing Closes https://github.com/facebook/rocksdb/pull/2305 Differential Revision: D5070169 Pulled By: yiwu-arbug fbshipit-source-id: 8c8450acf50757b767cf78b78314018395738d96	2017-05-16 11:07:58 -07:00
siddontang	1ca723dbd1	C API: support pinnable get Summary: Closes https://github.com/facebook/rocksdb/pull/2254 Differential Revision: D5053590 Pulled By: yiwu-arbug fbshipit-source-id: 2f365a031b3a2947b4fba21d26d4f8f52af9b9f0	2017-05-16 11:07:58 -07:00
赵星宇	4f9e69ccf4	fix log err Summary: Closes https://github.com/facebook/rocksdb/pull/2206 Differential Revision: D5054222 Pulled By: yiwu-arbug fbshipit-source-id: d8742bda1bf3e76d7b68eeb86df4608031b5cbc8	2017-05-15 16:15:38 -07:00
Yi Wu	3907c94ffb	Fix ColumnFamilyTest:BulkAddDrop Summary: Fix ColumnFamilyTest:BulkAddDrop not deleting CF handles at the end, causing ASAN failure. Closes https://github.com/facebook/rocksdb/pull/2275 Differential Revision: D5040724 Pulled By: yiwu-arbug fbshipit-source-id: 86cd4070c944d01173a3cc36462bb800698af192	2017-05-10 23:05:44 -07:00
Anirban Rahut	d85ff4953c	Blob storage pr Summary: The final pull request for Blob Storage. Closes https://github.com/facebook/rocksdb/pull/2269 Differential Revision: D5033189 Pulled By: yiwu-arbug fbshipit-source-id: 6356b683ccd58cbf38a1dc55e2ea400feecd5d06	2017-05-10 15:14:44 -07:00
Aaron Gao	492fc49a86	fix readampbitmap tests Summary: fix test failure of ReadAmpBitmap and ReadAmpBitmapLiveInCacheAfterDBClose. test ReadAmpBitmapLiveInCacheAfterDBClose individually and make check Closes https://github.com/facebook/rocksdb/pull/2271 Differential Revision: D5038133 Pulled By: lightmark fbshipit-source-id: 803cd6f45ccfdd14a9d9473c8af311033e164be8	2017-05-10 12:29:23 -07:00
Andrew Kryczka	be421b0b16	portable sched_getcpu calls Summary: - added a feature test in build_detect_platform to check whether sched_getcpu() is available. glibc offers it only on some platforms (e.g., linux but not mac); this way should be easier than maintaining a list of platforms on which it's available. - refactored PhysicalCoreID() to be simpler / less repetitive. ordered the conditional compilation clauses from most-to-least preferred Closes https://github.com/facebook/rocksdb/pull/2272 Differential Revision: D5038093 Pulled By: ajkr fbshipit-source-id: 81d7db3cc620250de220bdeb3194b2b3d7673de7	2017-05-10 12:29:23 -07:00
Aaron Gao	259a00eaca	unbiase readamp bitmap Summary: Consider BlockReadAmpBitmap with bytes_per_bit = 32. Suppose bytes [a, b) were used, while bytes [a-32, a) and [b+1, b+33) weren't used; more formally, the union of ranges passed to BlockReadAmpBitmap::Mark() contains [a, b) and doesn't intersect with [a-32, a) and [b+1, b+33). Then bits [floor(a/32), ceil(b/32)] will be set, and so the number of useful bytes will be estimated as (ceil(b/32) - floor(a/32)) * 32, which is on average equal to b-a+31. An extreme example: if we use 1 byte from each block, it'll be counted as 32 bytes from each block. It's easy to remove this bias by slightly changing the semantics of the bitmap. Currently each bit represents a byte range [i*32, (i+1)*32). This diff makes each bit represent a single byte: i*32 + X, where X is a random number in [0, 31] generated when the bitmap is created. So, e.g., if you read a single byte at random, with probability 31/32 it won't be counted at all, and with probability 1/32 it will be counted as 32 bytes; so, on average it's counted as 1 byte. But there is one exception: the last bit will always be set the old way.* (*) - assuming read_amp_bytes_per_bit = 32. Closes https://github.com/facebook/rocksdb/pull/2259 Differential Revision: D5035652 Pulled By: lightmark fbshipit-source-id: bd98b1b9b49fbe61f9e3781d07f624e3cbd92356	2017-05-10 01:49:52 -07:00
Islam AbdelRahman	4897eb250b	dont skip IO for filter blocks Summary: Based on my experience with linkbench, we should not skip loading bloom filter blocks when they are not available in block cache when using Iterator::Seek. Actually, I am not sure why this behavior existed in the first place. Closes https://github.com/facebook/rocksdb/pull/2255 Differential Revision: D5010721 Pulled By: maysamyabandeh fbshipit-source-id: 0af545a06ac4baeecb248706ec34d009c2480ca4	2017-05-09 09:52:02 -07:00
Changjian Gao	3f73d54bbd	Add C API to set max_file_opening_threads option Summary: Add `rocksdb_options_set_max_file_opening_threads()` API Closes https://github.com/facebook/rocksdb/pull/2184 Differential Revision: D4923090 Pulled By: lightmark fbshipit-source-id: c4ddce17733d999d426d02f7202b33a46ed6faed	2017-05-08 22:49:32 -07:00
Yi Wu	2cd00773c7	Add bulk create/drop column family API Summary: Adding DB::CreateColumnFamilies() and DB::DropColumnFamilies() to bulk create/drop column families. This is to address the problem that creating/dropping 1k column families takes minutes. The bottleneck is that we persist the options file for every single column family create/drop, and it parses the persisted options file for verification, which takes a lot of CPU time. The new APIs simply create/drop column families individually, and persist the options file once at the end. This reduces creating 1k column families to within ~0.1s. A further improvement could be to merge the manifest writes into one IO. Closes https://github.com/facebook/rocksdb/pull/2248 Differential Revision: D5001578 Pulled By: yiwu-arbug fbshipit-source-id: d4e00bda671451e0b314c13e12ad194b1704aa03	2017-05-07 23:20:46 -07:00
Tamir Duberstein	fdaefa0309	travis: add Windows cross-compilation Summary: - downcase includes for case-sensitive filesystems - give targets the same name (librocksdb) on all platforms With this patch it is possible to cross-compile RocksDB for Windows from a Linux host using mingw. cc yuslepukhin orgads Closes https://github.com/facebook/rocksdb/pull/2107 Differential Revision: D4849784 Pulled By: siying fbshipit-source-id: ad26ed6b4d393851aa6551e6aa4201faba82ef60	2017-05-05 23:20:01 -07:00
Aaron Gao	a30a696034	do not read next datablock if upperbound is reached Summary: Now if we have iterate_upper_bound set, we continue reading until we get a key >= upper_bound. For a lot of cases where neighboring data blocks have a user key gap between them, our index key will be a user key in the middle to get a shorter size. For example, if we have blocks: [a b c d][f g h] Then the index key for the first block will be 'e'. Then if the upper bound is any key between 'd' and 'e', for example, d1, d2, ..., d99999999999, we don't have to read the second block, and we also know that we have finished our iteration by reaching the last key that is smaller than the upper bound. This diff can reduce RA in most cases. Closes https://github.com/facebook/rocksdb/pull/2239 Differential Revision: D4990693 Pulled By: lightmark fbshipit-source-id: ab30ea2e3c6edf3fddd5efed3c34fcf7739827ff	2017-05-05 23:20:01 -07:00
Aaron Gao	2d42cf5ea9	Roundup read bytes in ReadaheadRandomAccessFile Summary: Fix alignment in ReadaheadRandomAccessFile Closes https://github.com/facebook/rocksdb/pull/2253 Differential Revision: D5012336 Pulled By: lightmark fbshipit-source-id: 10d2c829520cb787227ef653ef63d5d701725778	2017-05-05 12:14:14 -07:00
Siying Dong	264d3f540c	Allow IntraL0 compaction in FIFO Compaction Summary: Allow an option for users to do some compaction in FIFO compaction, to pay some write amplification for a smaller number of files. Closes https://github.com/facebook/rocksdb/pull/2163 Differential Revision: D4895953 Pulled By: siying fbshipit-source-id: a1ab608dd0627211f3e1f588a2e97159646e1231	2017-05-04 18:16:13 -07:00
Andrew Kryczka	8c3a180e83	Set lower-bound on dynamic level sizes Summary: Changed dynamic leveling to stop setting the base level's size bound below `max_bytes_for_level_base`. Behavior for config where `max_bytes_for_level_base == level0_file_num_compaction_trigger * write_buffer_size` and same amount of data in L0 and base-level: - Before #2027, compaction scoring would favor base-level due to dividing by size smaller than `max_bytes_for_level_base`. - After #2027, L0 and Lbase get equal scores. The disadvantage is L0 is often compacted before reaching the num files trigger since `write_buffer_size` can be bigger than the dynamically chosen base-level size. This increases write-amp. - After this diff, L0 and Lbase still get equal scores. Now it takes `level0_file_num_compaction_trigger` files of size `write_buffer_size` to trigger L0 compaction by size, fixing the write-amp problem above. Closes https://github.com/facebook/rocksdb/pull/2123 Differential Revision: D4861570 Pulled By: ajkr fbshipit-source-id: 467ddef56ed1f647c14d86bb018bcb044c39b964	2017-05-04 18:16:12 -07:00
Andrew Kryczka	7c1c8ce5ac	Avoid calling fallocate with UINT64_MAX Summary: When user doesn't set a limit on compaction output file size, let's use the sum of the input files' sizes. This will avoid passing UINT64_MAX as fallocate()'s length. Reported in #2249. Test setup: - command: `TEST_TMPDIR=/data/rocksdb-test/ strace -e fallocate ./db_compaction_test --gtest_filter=DBCompactionTest.ManualCompactionUnknownOutputSize` - filesystem: xfs before this diff: `fallocate(10, 01, 0, 1844674407370955160) = -1 ENOSPC (No space left on device)` after this diff: `fallocate(10, 01, 0, 1977) = 0` Closes https://github.com/facebook/rocksdb/pull/2252 Differential Revision: D5007275 Pulled By: ajkr fbshipit-source-id: 4491404a6ae8a41328aede2e2d6f4d9ac3e38880	2017-05-04 17:43:22 -07:00
Leonidas Galanis	a45e98a5b5	max_open_files dynamic set, follow up Summary: Followup to make 0x40000 a TableCache constant that indicates infinite capacity Closes https://github.com/facebook/rocksdb/pull/2247 Differential Revision: D5001349 Pulled By: lgalanis fbshipit-source-id: ce7bd2e54b0975bb9f8680fdaa0f8bb0e7ae81a2	2017-05-04 10:42:45 -07:00
Leonidas Galanis	e7ae4a3a02	Max open files mutable Summary: Makes max_open_files db option dynamically set-able by SetDBOptions. During the call of SetDBOptions we call SetCapacity on the table cache, which is a LRUCache. Closes https://github.com/facebook/rocksdb/pull/2185 Differential Revision: D4979189 Pulled By: yiwu-arbug fbshipit-source-id: ca7e8dc5e3619c79434f579be4847c0f7e56afda	2017-05-03 21:13:14 -07:00
siddontang	b551104e04	support PopSavePoint for WriteBatch Summary: Try to fix https://github.com/facebook/rocksdb/issues/1969 Closes https://github.com/facebook/rocksdb/pull/2170 Differential Revision: D4907333 Pulled By: yiwu-arbug fbshipit-source-id: 417b420ff668e6c2fd0dad42a94c57385012edc5	2017-05-03 10:57:45 -07:00
Siying Dong	af6fe69e4c	Fix an issue of manual / auto compaction data race Summary: A data race between a manual and an auto compaction can cause a scheduled automatic compaction to be cancelled and never rescheduled again. This may cause a condition of hanging forever. Fix this by always making sure the cancelled compaction is put back to the compaction queue. Closes https://github.com/facebook/rocksdb/pull/2238 Differential Revision: D4984591 Pulled By: siying fbshipit-source-id: 3ab153886403c7b991896dcb2158b96cac12f227	2017-05-02 15:11:59 -07:00
Siying Dong	aeaba07b2a	Remove an assert that causes TSAN failure. Summary: ColumnFamilyData::ConstructNewMemtable is called out of DB mutex, and it asserts current_ is not empty, but current_ should only be accessed inside DB mutex. Remove this assert to make TSAN happy. Closes https://github.com/facebook/rocksdb/pull/2235 Differential Revision: D4978531 Pulled By: siying fbshipit-source-id: 423685a7dae88ed3faaa9e1b9ccb3427ac704a4b	2017-05-01 16:35:15 -07:00
Siying Dong	d616ebea23	Add GPLv2 as an alternative license. Summary: Closes https://github.com/facebook/rocksdb/pull/2226 Differential Revision: D4967547 Pulled By: siying fbshipit-source-id: dd3b58ae1e7a106ab6bb6f37ab5c88575b125ab4	2017-04-27 18:06:12 -07:00
Aaron Gao	0ca3ead0cb	add GetRootDB() in DeleteFilesInRange Summary: In case users cast a subclass of db* into dbimpl* Closes https://github.com/facebook/rocksdb/pull/2222 Differential Revision: D4964486 Pulled By: lightmark fbshipit-source-id: 0ccdc08ee8e7a193dfbbe0218c3cbfd795662ca1	2017-04-27 14:33:17 -07:00
Dmitri Smirnov	cdad04b051	Remove double buffering on RandomRead on Windows. Summary: Remove double buffering on RandomRead on Windows. With more logic appearing in the file reader/writer, Read no longer obeys forwarding calls to the Windows implementation. Previously direct_io (unbuffered) was only available on Windows but now is supported as generic. We remove intermediate buffering on Windows. Remove random_access_max_buffer_size option which was Windows-specific. Non-zero values for that option introduced unnecessary lock contention. Remove Env::EnableReadAhead(), Env::ShouldForwardRawRequest() that are no longer necessary. Add aligned buffer reads for cases when requested reads exceed read ahead size. Closes https://github.com/facebook/rocksdb/pull/2105 Differential Revision: D4847770 Pulled By: siying fbshipit-source-id: 8ab48f8e854ab498a4fd398a6934859792a2788f	2017-04-27 12:30:05 -07:00
Siying Dong	e15382c09c	Disable two flaky tests Summary: Closes https://github.com/facebook/rocksdb/pull/2217 Differential Revision: D4959351 Pulled By: siying fbshipit-source-id: ce7c3a430bae0d15e06b3d5c958ebce969d08564	2017-04-26 17:13:46 -07:00
Andrew Kryczka	efc361ef7d	Add user stats Reset API Summary: It resets all the ticker and histogram stats to zero. Needed to change the locking a bit since Reset() is the only operation that manipulates multiple tickers/histograms together, and that operation should be seen as atomic by other operations that access tickers/histograms. Closes https://github.com/facebook/rocksdb/pull/2213 Differential Revision: D4952232 Pulled By: ajkr fbshipit-source-id: c0475c3e4c7b940120d53891b69c3091149a0679	2017-04-26 15:57:01 -07:00
Andrew Kryczka	f6a27d0bce	Extract statistics tests into separate file Summary: I'm going to add more DB tests for statistics as currently we have very few. I started a file dedicated to this purpose and moved the existing stats-specific tests there. Closes https://github.com/facebook/rocksdb/pull/2211 Differential Revision: D4951558 Pulled By: ajkr fbshipit-source-id: 05d11c35079c40ecabdfd2cf5556ccb761f694a4	2017-04-26 14:47:23 -07:00
Aaron Gao	7eddecce12	support bulk loading with universal compaction Summary: Support bulk load with universal compaction. More test cases to be added. Closes https://github.com/facebook/rocksdb/pull/2202 Differential Revision: D4935360 Pulled By: lightmark fbshipit-source-id: cc3ca1b6f42faa503207dab1408d6bcf393ee5b5	2017-04-26 13:41:32 -07:00
Aaron Gao	72c21fb3f2	call GetRootDB() before cast to DBImpl* in CancelAllBackgroundWork Summary: User could call this with wrapper class of DB or DBImpl Closes https://github.com/facebook/rocksdb/pull/2200 Differential Revision: D4935530 Pulled By: lightmark fbshipit-source-id: df9cb61d67d0f3bbcf62f714d77523a459a92883	2017-04-24 13:47:17 -07:00
Tomas Kolda	04d58970cb	AIX and Solaris Sparc Support Summary: Replacement of #2147 The change was squashed due to a lot of conflicts. Closes https://github.com/facebook/rocksdb/pull/2194 Differential Revision: D4929799 Pulled By: siying fbshipit-source-id: 5cd49c254737a1d5ac13f3c035f128e86524c581	2017-04-21 20:48:04 -07:00
Aaron Gao	cb885bccfe	set compaction_iterator earliest_snapshot to max if no snapshot Summary: It is a potential bug that will be triggered if we ingest files before inserting the first key into an empty db. 0 is a special value reserved to indicate the concept of non-existence. But not good for seqno in this case because 0 is a valid seqno for ingestion(bulk loading) Closes https://github.com/facebook/rocksdb/pull/2183 Differential Revision: D4919827 Pulled By: lightmark fbshipit-source-id: 237eea40f88bd6487b66806109d90065dc02c362	2017-04-20 19:56:59 -07:00
Andrew Kryczka	1dd7760513	Change L0 compaction score using level size Summary: The goal is to avoid the problem of small number of L0 files triggering compaction to base level (which increased write-amp), while still allowing L0 compaction-by-size (so intra-L0 compactions cause score to increase). Closes https://github.com/facebook/rocksdb/pull/2172 Differential Revision: D4908552 Pulled By: ajkr fbshipit-source-id: 4b170142b2b368e24bd7948b2a6f24c69fabf73d	2017-04-19 12:00:01 -07:00
Yi Wu	966ebb02f5	Hide event listeners from lite build Summary: Fixing lite build failure introduced by #2169. Closes https://github.com/facebook/rocksdb/pull/2174 Reviewed By: sagar0 Differential Revision: D4910619 Pulled By: yiwu-arbug fbshipit-source-id: 5213b7b7431cc258688793c8c28153025588d8d9	2017-04-18 17:26:19 -07:00
Siying Dong	c49d704656	Add DB:ResetStats() Summary: Add a function to allow users to reset internal stats without restarting the DB. Closes https://github.com/facebook/rocksdb/pull/2167 Differential Revision: D4907939 Pulled By: siying fbshipit-source-id: ab2dd85b88aabe9380da7485320a1d460d3e1f68	2017-04-18 16:56:48 -07:00
Yi Wu	0fcdccc33e	Blob storage helper methods Summary: Split out interfaces needed for blob storage from #1560, including * CompactionEventListener and OnFlushBegin listener interfaces. * Blob filename support. Closes https://github.com/facebook/rocksdb/pull/2169 Differential Revision: D4905463 Pulled By: yiwu-arbug fbshipit-source-id: 564e73448f1b7a367e5e46216a521e57ea9011b5	2017-04-18 12:42:38 -07:00
Yi Wu	e9e6e53247	Simplify write thread logic Summary: The concept of early exit in the write thread implementation is confusing. It means that if early exit is allowed, the batch group leader is not responsible for exiting the batch group; the last finished writer is. If we need to mark the log synced, or encounter a memtable insert error, early exit is disallowed. This patch removes the concept: * In all cases, the last finished writer (not necessarily the leader) is responsible for exiting the batch group. * In case of parallel memtable write, the leader also marks the log synced after memtable insert and before signaling finish (calling `CompleteParallelWorker()`). The purpose is to allow marking the log synced (which requires locking the mutex) to run in parallel with memtable inserts in other writers. * The last finished writer should handle memtable insert errors (update bg_error_) before exiting the batch group. Closes https://github.com/facebook/rocksdb/pull/2134 Differential Revision: D4869667 Pulled By: yiwu-arbug fbshipit-source-id: aec170847c85b90f4179d6a4608a4fe1361544e3	2017-04-13 16:12:04 -07:00
Aaron Gao	44fa8ece9b	change use_direct_writes to use_direct_io_for_flush_and_compaction Summary: Replace Options::use_direct_writes with Options::use_direct_io_for_flush_and_compaction Now if Options::use_direct_io_for_flush_and_compaction = true, we will enable direct io for both reads and writes for flush and compaction job. Whereas Options::use_direct_reads controls user reads like iterator and Get(). Closes https://github.com/facebook/rocksdb/pull/2117 Differential Revision: D4860912 Pulled By: lightmark fbshipit-source-id: d93575a8a5e780cf7e40797287edc425ee648c19	2017-04-13 16:12:04 -07:00
Igor Canadi	b6b9359ece	Fix BYTES_WRITTEN accounting Summary: BYTES_WRITTEN accounting doesn't work with disabled WAL. For example, this is what we get in the LOG: ``` Cumulative writes: 9794K writes, 228M keys, 9794K commit groups, 1.0 writes per commit group, ingest: 0.00 GB, 0.00 MB/s ``` WAL bytes are tracked in a different statistic: https://github.com/facebook/rocksdb/blob/master/db/internal_stats.h#L105. BYTES_WRITTEN should count all the writes. Closes https://github.com/facebook/rocksdb/pull/2133 Differential Revision: D4880615 Pulled By: yiwu-arbug fbshipit-source-id: 8fd0b223099f3f5ad7df79d4e737d313687fec69	2017-04-13 16:12:03 -07:00
Sagar Vemuri	6a6723ee1e	Move MergeOperatorPinning tests to be with other merge operator tests Summary: Moved MergeOperatorPinning tests from db_test2.cc to db_merge_operator_test.cc. [This is the same code as PR #2104 , which has already been reviewed, but I am creating a new PR as I cannot import from #2104 onto phabricator anymore even after rebasing. I'll close and discard #2104.] Closes https://github.com/facebook/rocksdb/pull/2125 Differential Revision: D4863312 Pulled By: sagar0 fbshipit-source-id: 0f71a7690aa09c1d03ee85ce2bc1d2d89e4f4399	2017-04-11 16:15:06 -07:00
Siying Dong	8f47a97512	File level histogram should be printed per CF, not per DB Summary: Currently level histogram is only printed out for DB stats and for default CF. This is confusing. Change to print for every CF instead. Closes https://github.com/facebook/rocksdb/pull/2126 Differential Revision: D4865373 Pulled By: siying fbshipit-source-id: 1c853e0ac66e00120ee931cabc9daf69ccc2d577	2017-04-11 08:42:03 -07:00
Manuel Ung	1f8b119ed6	Limit maximum memory used in the WriteBatch representation Summary: Extend TransactionOptions to include max_write_batch_size which determines the maximum size of the writebatch representation. If memory limit is exceeded, the operation will abort with subcode kMemoryLimit. Closes https://github.com/facebook/rocksdb/pull/2124 Differential Revision: D4861842 Pulled By: lth fbshipit-source-id: 46fd172ea67cc90bbba829bf0d70cfab2261c161	2017-04-10 15:42:26 -07:00
Maysam Yabandeh	20778f2f92	Adding comments to the write path Summary: also did minor refactoring Closes https://github.com/facebook/rocksdb/pull/2115 Differential Revision: D4855818 Pulled By: maysamyabandeh fbshipit-source-id: fbca6ac57e5c6677fffe8354f7291e596a50cb77	2017-04-10 12:43:34 -07:00
Sagar Vemuri	7124268a09	Reduce the number of params needed to construct DBIter Summary: DBIter, and in-turn NewDBIterator and NewArenaWrappedDBIterator, take a bunch of params. They can be reduced by passing in ReadOptions directly instead of passing in every new param separately. It also seems much cleaner as a bunch of the params towards the end seem to be optional. (Recently I introduced max_skippable_internal_keys, which added one more to the already huge count). Idea courtesy IslamAbdelRahman Closes https://github.com/facebook/rocksdb/pull/2116 Differential Revision: D4857128 Pulled By: sagar0 fbshipit-source-id: 7d239df094b94bd9ea79d145cdf825478ac037a8	2017-04-10 11:14:14 -07:00
Islam AbdelRahman	61730186df	dummy diff Summary: Closes https://github.com/facebook/rocksdb/pull/2114 Differential Revision: D4854860 Pulled By: IslamAbdelRahman fbshipit-source-id: b871c5b9ccc52d20f5ceacdd172dc70b1dbf9110	2017-04-07 17:07:37 -07:00
Ayappan	dd8f9e38e9	Fix compilation for GCC-5 Summary: Fixes this issue https://github.com/facebook/rocksdb/issues/2108 Closes https://github.com/facebook/rocksdb/pull/2109 Differential Revision: D4851965 Pulled By: yiwu-arbug fbshipit-source-id: 6ee807b	2017-04-07 10:54:12 -07:00
Siying Dong	ff97287016	Refactor compaction picker code Summary: 1. Move universal compaction picker to separate files compaction_picker_universal.cc and compaction_picker_universal.h. 2. Rename some functions to make the code easier to understand. 3. Move leveled compaction picking code to a dedicated class, so that we don't need to pass some common variables around when calling functions. It also allowed us to break down LevelCompactionPicker::PickCompaction() into smaller functions. Closes https://github.com/facebook/rocksdb/pull/2100 Differential Revision: D4845948 Pulled By: siying fbshipit-source-id: efa0ab4	2017-04-06 20:09:34 -07:00
Sagar Vemuri	343b59d6ee	Move various string utility functions into string_util Summary: This is an effort to club all string related utility functions into one common place, in string_util, so that it is easier for everyone to know what string processing functions are available. Right now they seem to be spread out across multiple modules, like logging and options_helper. Check the sub-commits for easier reviewing. Closes https://github.com/facebook/rocksdb/pull/2094 Differential Revision: D4837730 Pulled By: sagar0 fbshipit-source-id: 344278a	2017-04-06 14:54:12 -07:00
Yi Wu	df6f5a3772	Move memtable related files into memtable directory Summary: Move memtable related files into memtable directory. Closes https://github.com/facebook/rocksdb/pull/2087 Differential Revision: D4829242 Pulled By: yiwu-arbug fbshipit-source-id: ca70ab6	2017-04-06 14:09:13 -07:00
Tamir Duberstein	107c5f6a60	CMake: more MinGW fixes Summary: siying this is a resubmission of #2081 with the 4th commit fixed. From that commit message: > Note that the previous use of quotes in PLATFORM_{CC,CXX}FLAGS was incorrect and caused GCC to produce the incorrect define: > > #define ROCKSDB_JEMALLOC -DJEMALLOC_NO_DEMANGLE 1 > > This was the cause of the Linux build failure on the previous version of this change. I've tested this locally, and the Linux build succeeds now. Closes https://github.com/facebook/rocksdb/pull/2097 Differential Revision: D4839964 Pulled By: siying fbshipit-source-id: cc51322	2017-04-06 14:09:13 -07:00
Siying Dong	d2dce5611a	Move some files under util/ to separate dirs Summary: Move some files under util/ to new directories env/, monitoring/ options/ and cache/ Closes https://github.com/facebook/rocksdb/pull/2090 Differential Revision: D4833681 Pulled By: siying fbshipit-source-id: 2fd8bef	2017-04-05 19:09:16 -07:00
Islam AbdelRahman	c50e3750dc	Use a human readable size for level report Summary: Current ``` Compaction Stats [default] Level Files Size(MB} Score Read(GB} Rn(GB} Rnp1(GB} Write(GB} Wnew(GB} Moved(GB} W-Amp Rd(MB/s} Wr(MB/s} Comp(sec} Comp(cnt} Avg(sec} KeyIn KeyDrop ---------------------------------------------------------------------------------------------------------------------------------------------------------- L0 2/0 49.02 0.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 76.1 1 2 0.322 0 0 Sum 2/0 49.02 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 76.1 1 2 0.322 0 0 Int 0/0 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 76.1 1 2 0.322 0 0 ``` New ``` Compaction Stats [default] Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn Key Closes https://github.com/facebook/rocksdb/pull/2055 Differential Revision: D4804576 Pulled By: IslamAbdelRahman fbshipit-source-id: 719be6a	2017-04-05 17:24:19 -07:00
Siying Dong	ce64b8b719	Divide db/db_impl.cc Summary: db_impl.cc is too large to manage. Divide db_impl.cc into db/db_impl.cc, db/db_impl_compaction_flush.cc, db/db_impl_files.cc, db/db_impl_open.cc and db/db_impl_write.cc. Closes https://github.com/facebook/rocksdb/pull/2095 Differential Revision: D4838188 Pulled By: siying fbshipit-source-id: c5f3059	2017-04-05 17:24:19 -07:00
Andrew Kryczka	d659faad54	Level-based L0->L0 compaction Summary: Level-based L0->L0 compaction operates on spans of files that aren't currently being compacted. It reduces the number of L0 files, thus making write stall conditions harder to reach. - L0->L0 is triggered when base level is unavailable due to pending compactions - L0->L0 always outputs one file of at most `max_level0_burst_file_size` bytes. - Subcompactions are disabled for L0->L0 since we want to output one file. - Input files are chosen as the longest span of available files that will fit within the size limit. This minimizes number of files in L0. Closes https://github.com/facebook/rocksdb/pull/2027 Differential Revision: D4760318 Pulled By: ajkr fbshipit-source-id: 9d07183	2017-04-04 18:09:11 -07:00
Siying Dong	43010a929f	Revert "[rocksdb][PR] CMake: more MinGW fixes" fbshipit-source-id: 43b4529	2017-04-04 16:24:26 -07:00
Tamir Duberstein	3450ac8c1b	CMake: more MinGW fixes Summary: See individual commits. yuslepukhin siying Closes https://github.com/facebook/rocksdb/pull/2081 Differential Revision: D4824639 Pulled By: IslamAbdelRahman fbshipit-source-id: 2fc2b00	2017-04-04 15:09:17 -07:00
Aaron Gao	90cfd46458	update IterKey that can get user key and internal key explicitly Summary: to avoid future bugs caused by mixing up userkey/internalkey Closes https://github.com/facebook/rocksdb/pull/2084 Differential Revision: D4825889 Pulled By: lightmark fbshipit-source-id: 28411db	2017-04-04 14:24:20 -07:00
Yi Wu	9e44531803	Refactor WriteImpl (pipeline write part 1) Summary: Refactor WriteImpl() so when I plug in the pipeline write code (which is an alternative approach for WriteThread), some of the logic can be reused. I split out the following methods from WriteImpl(): * PreprocessWrite() * HandleWALFull() (previously MaybeFlushColumnFamilies()) * HandleWriteBufferFull() * WriteToWAL() Also adding a constructor to WriteThread::Writer, and moving WriteContext into db_impl.h. No real logic change in this patch. Closes https://github.com/facebook/rocksdb/pull/2042 Differential Revision: D4781014 Pulled By: yiwu-arbug fbshipit-source-id: d45ca18	2017-04-04 10:24:32 -07:00
Siying Dong	6ef8c620d3	Move auto_roll_logger and filename out of db/ Summary: It is confusing to have auto_roll_logger to stay under db/, which has nothing to do with database. Move filename together as it is a dependency. Closes https://github.com/facebook/rocksdb/pull/2080 Differential Revision: D4821141 Pulled By: siying fbshipit-source-id: ca7d768	2017-04-03 18:39:14 -07:00
Siying Dong	88cc81df5c	auto_roll_logger_test to move away from real sleep Summary: auto_roll_logger_test relies on a timing condition that some operations finish within 1 second. This caused flaky tests. Move away from real timing and sleep and use fake time to verify the time-based rolling. Closes https://github.com/facebook/rocksdb/pull/2066 Differential Revision: D4810647 Pulled By: siying fbshipit-source-id: c54d994	2017-04-03 11:39:09 -07:00
Nikhil Benesch	d25e28d584	replace sometimes-undefined uint type with unsigned int Summary: `uint` is nonstandard and not a built-in type on all compilers; replace it with the always-valid `unsigned int`. I assume this went unnoticed because it's inside an `#ifdef ROCKSDB_JEMALLOC`. Closes https://github.com/facebook/rocksdb/pull/2075 Differential Revision: D4820427 Pulled By: ajkr fbshipit-source-id: 0876561	2017-04-03 11:39:09 -07:00
Andrew Kryczka	a1d7e487b3	Add L0 write-amp to compaction level stats Summary: Previously it always showed 0.0 for L0 write-amp because we were dividing by bytes read from non-output level. For L0, we should instead divide by bytes ingested to the DB. Note the numerator (bytes written to L0) includes flush bytes. Closes https://github.com/facebook/rocksdb/pull/2078 Differential Revision: D4816902 Pulled By: ajkr fbshipit-source-id: 7dca31a	2017-04-03 11:24:10 -07:00
Orgad Shaneh	6401a8b76b	Fix build with MinGW Summary: There still are many warnings (most of them about invalid printf format for long long), but it builds if FAIL_ON_WARNINGS is disabled. Closes https://github.com/facebook/rocksdb/pull/2052 Differential Revision: D4807355 Pulled By: siying fbshipit-source-id: ef03786	2017-03-30 16:54:52 -07:00
Sagar Vemuri	c6d04f2ecf	Option to fail a request as incomplete when skipping too many internal keys Summary: Operations like Seek/Next/Prev sometimes take too long to complete when there are many internal keys to be skipped. Adding an option, max_skippable_internal_keys -- which could be used to set a threshold for the maximum number of keys that can be skipped, will help to address these cases where it is much better to fail a request (as incomplete) than to wait for a considerable time for the request to complete. This feature -- to fail an iterator seek request as incomplete, is disabled by default when max_skippable_internal_keys = 0. It is enabled only when max_skippable_internal_keys > 0. This feature is based on the discussion mentioned in the PR https://github.com/facebook/rocksdb/pull/1084. Closes https://github.com/facebook/rocksdb/pull/2000 Differential Revision: D4753223 Pulled By: sagar0 fbshipit-source-id: 1c973f7	2017-03-30 12:09:21 -07:00
Siying Dong	67d7623794	Expose the stalling information through DB::GetProperty() Summary: Add two DB properties: rocksdb.actual_delayed_write_rate and rocksdb.is_write_stopped, for people to know whether current writes are being throttled. Closes https://github.com/facebook/rocksdb/pull/2043 Differential Revision: D4782975 Pulled By: siying fbshipit-source-id: 6b2f5cf	2017-03-29 11:54:20 -07:00
Maysam Yabandeh	e7731d119a	Configure index partition size Summary: Allow the users to specify the target index partition size. With this patch an index partition is cut before its estimated in-memory size goes above the configured value for metadata_block_size. The filter partitions are still cut right after an index partition is cut. Closes https://github.com/facebook/rocksdb/pull/2041 Differential Revision: D4780216 Pulled By: maysamyabandeh fbshipit-source-id: 95a0831	2017-03-28 12:09:12 -07:00
Siying Dong	91b5feb37b	Fix Windows Build broken by a recent commit Summary: Closes https://github.com/facebook/rocksdb/pull/2032 Differential Revision: D4766260 Pulled By: siying fbshipit-source-id: 415daa4	2017-03-23 18:09:57 -07:00
Warren Falk	41ccae6d26	Add C API functions (and tests) for WriteBatchWithIndex Summary: I've added functions to the C API to support WriteBatchWithIndex as requested in #1833. I've also added unit tests to c_test I've implemented the WriteBatchWithIndex variation of every function available for regular WriteBatch. And added additional functions unique to WriteBatchWithIndex. For now, the following is omitted: 1. The ability to create WriteBatchWithIndex's custom batch-only iterator as I'm not sure what its purpose is. It should be possible to add later if anyone wants it. 2. The ability to create the batch with a fallback comparator, since it appears to be unnecessary. I believe the column family comparator will be used for this, meaning those using a custom comparator can just use the column family variations. Closes https://github.com/facebook/rocksdb/pull/1985 Differential Revision: D4760039 Pulled By: siying fbshipit-source-id: 393227e	2017-03-23 15:54:13 -07:00
Daniel Black	f4fce4751e	Fix clang compile error - [-Werror,-Wunused-lambda-capture] Summary: Errors were: db/version_set.cc:1535:20: error: lambda capture 'this' is not used [-Werror,-Wunused-lambda-capture] [this](const Fsize& f1, const Fsize& f2) -> bool { ^ db/version_set.cc:1541:20: error: lambda capture 'this' is not used [-Werror,-Wunused-lambda-capture] [this](const Fsize& f1, const Fsize& f2) -> bool { ^ db/db_test.cc:2983:27: error: lambda capture 'kNumPutsBeforeWaitForFlush' is not required to be captured for this use [-Werror,-Wunused-lambda-capture] auto gen_l0_kb = [this, kNumPutsBeforeWaitForFlush](int size) { ^ Closes https://github.com/facebook/rocksdb/pull/1972 Differential Revision: D4685991 Pulled By: siying fbshipit-source-id: 9125379	2017-03-22 18:09:10 -07:00
Siying Dong	15950fe3a0	Remove ASSERT_EQ(boolean, ...) Summary: Closes https://github.com/facebook/rocksdb/pull/2024 Differential Revision: D4755420 Pulled By: siying fbshipit-source-id: 7332ab1	2017-03-22 15:54:12 -07:00
Aaron Gao	3e56c7e0c4	make total_log_size_ atomic Summary: make total_log_size_ atomic to avoid overflow caused by data race. Closes https://github.com/facebook/rocksdb/pull/2019 Differential Revision: D4751391 Pulled By: siying fbshipit-source-id: fac01dd	2017-03-22 11:54:40 -07:00
Dmitri Smirnov	be723a8d8c	Optionally construct Post Processing Info map in MemTableInserter Summary: MemTableInserter default constructs Post processing info std::map. However, on Windows with 2015 STL the default constructed map still dynamically allocates one node which shows up on a profiler and we lose ~40% throughput on fillrandom benchmark. Solution: declare a map as std::aligned_storage and optionally construct. This addresses https://github.com/facebook/rocksdb/issues/1976 Before: ------------------------------------------------------------------- Initializing RocksDB Options from command-line flags DB path: [k:\data\BulkLoadRandom_10M_fillonly] fillrandom : 2.775 micros/op 360334 ops/sec; 280.4 MB/s Microseconds per write: Count: 10000000 Average: 2.7749 StdDev: 39.92 Min: 1 Median: 2.0826 Max: 26051 Percentiles: P50: 2.08 P75: 2.55 P99: 3.55 P99.9: 9.58 P99.99: 51.5**6 ------------------------------------------------------ After: Initializing RocksDB Options from command-line flags DB path: [k:\data\BulkLoadRandom_10M_fillon Closes https://github.com/facebook/rocksdb/pull/2011 Differential Revision: D4740823 Pulled By: siying fbshipit-source-id: 1daaa2c	2017-03-22 11:24:12 -07:00
Maysam Yabandeh	8b0097b49b	Readers for partition filter Summary: This is the last split of this pull request: https://github.com/facebook/rocksdb/pull/1891 which includes the reader part as well as the tests. Closes https://github.com/facebook/rocksdb/pull/1961 Differential Revision: D4672216 Pulled By: maysamyabandeh fbshipit-source-id: 6a2b829	2017-03-22 09:24:15 -07:00
Siying Dong	8f5bf04468	Flush triggered by DB write buffer size picks the oldest unflushed CF Summary: Previously, when the DB write buffer size limit triggered a flush, we always picked the CF with the most data in its memtable. That approach minimizes the total number of flushes. Change the behavior to always pick the oldest unflushed CF, which matches the behavior when max_total_wal_size is hit. This approach minimizes the size counted against max_total_wal_size. Closes https://github.com/facebook/rocksdb/pull/1987 Differential Revision: D4703214 Pulled By: siying fbshipit-source-id: 9ff8b09	2017-03-21 11:09:10 -07:00
Raza Hussain	6908e24b56	dynamic setting of stats_dump_period_sec through SetDBOption() Summary: Resolved the following issue: https://github.com/facebook/rocksdb/issues/1930 Closes https://github.com/facebook/rocksdb/pull/2004 Differential Revision: D4736764 Pulled By: yiwu-arbug fbshipit-source-id: 64fe0b7	2017-03-20 22:54:13 -07:00
Islam AbdelRahman	d52f334cbd	Break stalls when no bg work is happening Summary: Current stall will keep sleeping even if there is no Flush/Compactions to wait for, I changed the logic to break the stall if we are not flushing or compacting db_bench command used ``` # fillrandom # memtable size = 10MB # value size = 1 MB # num = 1000 # use /dev/shm ./db_bench --benchmarks="fillrandom,stats" --value_size=1048576 --write_buffer_size=10485760 --num=1000 --delayed_write_rate=XXXXX --db="/dev/shm/new_stall" \| grep "Cumulative stall" ``` ``` Current results # delayed_write_rate = 1000 Kb/sec Cumulative stall: 00:00:9.031 H:M:S # delayed_write_rate = 200 Kb/sec Cumulative stall: 00:00:22.314 H:M:S # delayed_write_rate = 100 Kb/sec Cumulative stall: 00:00:42.784 H:M:S # delayed_write_rate = 50 Kb/sec Cumulative stall: 00:01:23.785 H:M:S # delayed_write_rate = 25 Kb/sec Cumulative stall: 00:02:45.702 H:M:S ``` ``` New results # delayed_write_rate = 1000 Kb/sec Cumulative stall: 00:00:9.017 H:M:S # delayed_write_rate = 200 Kb/sec Cumulative stall: 00 Closes https://github.com/facebook/rocksdb/pull/1884 Differential Revision: D4585439 Pulled By: IslamAbdelRahman fbshipit-source-id: aed2198	2017-03-16 18:24:17 -07:00
Islam AbdelRahman	995618a821	Support SstFileManager::SetDeleteRateBytesPerSecond() Summary: Update DeleteScheduler component to support changing delete rate in runtime by introducing SstFileManager::SetDeleteRateBytesPerSecond() Closes https://github.com/facebook/rocksdb/pull/1994 Differential Revision: D4719906 Pulled By: IslamAbdelRahman fbshipit-source-id: e6b8d9e	2017-03-16 12:09:15 -07:00
Islam AbdelRahman	e19163688b	Add macros to include file name and line number during Logging Summary: current logging ``` 2017/03/14-14:20:30.393432 7fedde9f5700 (Original Log Time 2017/03/14-14:20:30.393414) [default] Level summary: base level 1 max bytes base 268435456 files[1 0 0 0 0 0 0] max score 0.25 2017/03/14-14:20:30.393438 7fedde9f5700 [JOB 2] Try to delete WAL files size 61417909, prev total WAL file size 73820858, number of live WAL files 2. 2017/03/14-14:20:30.393464 7fedde9f5700 [DEBUG] [JOB 2] Delete /dev/shm/old_logging//MANIFEST-000001 type=3 #1 -- OK 2017/03/14-14:20:30.393472 7fedde9f5700 [DEBUG] [JOB 2] Delete /dev/shm/old_logging//000003.log type=0 #3 -- OK 2017/03/14-14:20:31.427103 7fedd49f1700 [default] New memtable created with log file: #9. Immutable memtables: 0. 2017/03/14-14:20:31.427179 7fedde9f5700 [JOB 3] Syncing log #6 2017/03/14-14:20:31.427190 7fedde9f5700 (Original Log Time 2017/03/14-14:20:31.427170) Calling FlushMemTableToOutputFile with column family [default], flush slots available 1, compaction slots allowed 1, compaction slots scheduled 1 2017/03/14-14:20:31. Closes https://github.com/facebook/rocksdb/pull/1990 Differential Revision: D4708695 Pulled By: IslamAbdelRahman fbshipit-source-id: cb8968f	2017-03-15 19:39:12 -07:00
Maysam Yabandeh	11526252cc	Pinnableslice (2nd attempt) Summary: PinnableSlice Summary: Currently the point lookup values are copied to a string provided by the user. This incurs an extra memcpy cost. This patch allows doing point lookup via a PinnableSlice which pins the source memory location (instead of copying their content) and releases them after the content is consumed by the user. The old API of Get(string) is translated to the new API underneath. Here is the summary for improvements: value 100 byte: 1.8% regular, 1.2% merge values value 1k byte: 11.5% regular, 7.5% merge values value 10k byte: 26% regular, 29.9% merge values The improvement for merge could be more if we extend this approach to pin the merge output and delay the full merge operation until the user actually needs it. We have put that for future work. PS: Sometimes we observe a small decrease in performance when switching from t5452014 to this patch but with the old Get(string) API. The d Closes https://github.com/facebook/rocksdb/pull/1756 Differential Revision: D4391738 Pulled By: maysamyabandeh fbshipit-source-id: 6f3edd3	2017-03-13 11:54:10 -07:00
Sagar Vemuri	1ffbdfd9a7	Add a new SstFileWriter constructor without explicit comparator Summary: The comparator param in SstFileWriter constructor is redundant as it already exists as a field in options. So the current SstFileWriter constructor should be deprecated in favor of a new one which does not take a comparator. Note that the jni/java apis have not been touched yet. Closes https://github.com/facebook/rocksdb/pull/1978 Differential Revision: D4685629 Pulled By: sagar0 fbshipit-source-id: 372ce96	2017-03-13 11:39:13 -07:00
Reid Horuff	ebd5639b6d	Add ability to search for key prefix in sst_dump tool Summary: Add the flag --prefix to the sst_dump tool. This flag is similar to, and mutually exclusive with, the --from flag. --prefix=0x00FF will return all rows prefixed with 0x00FF. The --to flag may also be specified and will work as expected. These changes were used to help in debugging the power cycle corruption issue and were tested by scanning through a udb. Closes https://github.com/facebook/rocksdb/pull/1984 Differential Revision: D4691814 Pulled By: reidHoruff fbshipit-source-id: 027f261	2017-03-13 10:39:12 -07:00
Maysam Yabandeh	e6725e8c8d	Fix some bugs in MockEnv Summary: Fixing some bugs in MockEnv so it can actually be used. Closes https://github.com/facebook/rocksdb/pull/1914 Differential Revision: D4609923 Pulled By: maysamyabandeh fbshipit-source-id: ca25735	2017-03-13 09:54:11 -07:00
Andrew Kryczka	f2817fb7f9	avoid ASSERT_EQ(false, ...); Summary: lately it fails on travis due to a compiler bug (see https://github.com/google/googletest/issues/322#issuecomment-125645145). interestingly it seems to affect occurrences of `ASSERT_EQ(false, ...);` but not `ASSERT_EQ(true, ...);`. Closes https://github.com/facebook/rocksdb/pull/1958 Differential Revision: D4680742 Pulled By: ajkr fbshipit-source-id: 291fe41	2017-03-08 22:24:16 -08:00
Andrew Kryczka	5b11124e39	add max to histogram stats Summary: Domas enlightened me about p100 (i.e., max) stats. Let's add them to our histograms. Closes https://github.com/facebook/rocksdb/pull/1968 Differential Revision: D4678716 Pulled By: ajkr fbshipit-source-id: 65e7118	2017-03-08 22:24:15 -08:00
Andrew Kryczka	18fc1bc0e0	minor changes for rate limiter test flakiness Summary: the 50%+ drained constraint wasn't working consistently in some of our test environments, maybe their resources are too low. relax the constraints a bit. Closes https://github.com/facebook/rocksdb/pull/1970 Differential Revision: D4679419 Pulled By: ajkr fbshipit-source-id: 3789cd8	2017-03-08 17:54:11 -08:00
Aaron Gao	12ba00ea65	Reset DBIter::saved_key_ with proper user key anywhere before pass to DBIter::FindNextUserEntry Summary: fix db_iter bug introduced by [facebook#1413](https://github.com/facebook/rocksdb/pull/1413) Closes https://github.com/facebook/rocksdb/pull/1962 Differential Revision: D4672369 Pulled By: lightmark fbshipit-source-id: 6a22953	2017-03-08 17:24:11 -08:00
Sagar Vemuri	97edc72d39	Add a memtable-only iterator Summary: This PR is to support a way to iterate over all the keys that are just in memtables. Closes https://github.com/facebook/rocksdb/pull/1953 Differential Revision: D4663500 Pulled By: sagar0 fbshipit-source-id: 144e177	2017-03-07 11:54:10 -08:00
Leonidas Galanis	72202962f9	fix db_sst_test flakiness Summary: db_sst_test had been flaky occasionally in the following way: reached_max_space_on_compaction can in very rare cases be 0. This happens when the limit on maximum allowable space set using SetMaxAllowedSpaceUsage is hit during flush for all test db sizes (1, 2, 4, 8 and 10MB). The fix clears the error returned when the space limit is reached during flush. This ensures that the compaction callback will always be called. The runtime is increased slightly because the 1MB loop writes more data and hits the limit during multiple flushes until compaction is scheduled. Closes https://github.com/facebook/rocksdb/pull/1861 Differential Revision: D4557396 Pulled By: lgalanis fbshipit-source-id: ff778d1	2017-03-07 11:24:13 -08:00
Reid Horuff	58b12dfe37	Set logs as getting flushed before releasing lock, race condition fix Summary: Relating to #1903: In MaybeFlushColumnFamilies() we want to modify the 'getting_flushed' flag before releasing the db mutex when SwitchMemtable() is called. The following 2 actions need to be atomic in MaybeFlushColumnFamilies() - getting_flushed is false on oldest log - we determine that all CFs can be flushed to successfully release oldest log - we set getting_flushed = true on the oldest log. ------- - getting_flushed is false on oldest log - we determine that all CFs can NOT be flushed to successfully release oldest log - we set unable_to_flush_oldest_log_ = true on the oldest log. #### In the 2pc case: T1 enters function but is unable to flush all CFs to release log T1 sets unable_to_flush_oldest_log_ = true T1 begins flushing all CFs possible T2 enters function but is unable to flush all CFs to release log T2 sees unable_to_flush_oldest_log_ has been set so exits T3 enters function and will be able to flush all CFs to release oldest log T3 sets getting_flushed = true on oldes Closes https://github.com/facebook/rocksdb/pull/1909 Differential Revision: D4646235 Pulled By: reidHoruff fbshipit-source-id: c8d0447	2017-03-06 15:09:11 -08:00
Maysam Yabandeh	534581a356	Fix a bug in tests in options operator= Summary: Note: Using the default operator= is an unsafe approach for Options since it destructs shared_ptr in the same order of their creation, in contrast to destructors which destruct them in the opposite order of creation. One particular problem is that the cache destructor might invoke callback functions that use Option members such as statistics. To work around this problem, we manually call destructor of table_factory which eventually clears the block cache. Closes https://github.com/facebook/rocksdb/pull/1950 Differential Revision: D4655473 Pulled By: maysamyabandeh fbshipit-source-id: 6c4bbff	2017-03-05 18:09:09 -08:00
Andrew Kryczka	4561275c2d	fix rate limiter test flakiness Summary: fix when elapsed time spans non-integral number of intervals since the rate limiter may still be drained during a partial interval. Closes https://github.com/facebook/rocksdb/pull/1948 Differential Revision: D4651304 Pulled By: ajkr fbshipit-source-id: b1f9e70	2017-03-03 11:09:11 -08:00
Andrew Kryczka	7c80a6d7d1	Statistic for how often rate limiter is drained Summary: This is the metric I plan to use for adaptive rate limiting. The statistics are updated only if the rate limiter is drained by flush or compaction. I believe (but am not certain) that this is the normal case. The Statistics object is passed in RateLimiter::Request() to avoid requiring changes to client code, which would've been necessary if we passed it in the RateLimiter constructor. Closes https://github.com/facebook/rocksdb/pull/1946 Differential Revision: D4646489 Pulled By: ajkr fbshipit-source-id: d8e0161	2017-03-02 17:54:15 -08:00
Aaron Gao	6fb9013441	sanitize readahead when direct read enabled Summary: no readahead: readseq : 8.438 micros/op 118510 ops/sec; 13.1 MB/s sanitize to 10MB: readseq : 6.051 micros/op 165248 ops/sec; 18.3 MB/s Closes https://github.com/facebook/rocksdb/pull/1945 Differential Revision: D4645811 Pulled By: lightmark fbshipit-source-id: 5d63770	2017-03-02 17:24:11 -08:00
Islam AbdelRahman	f89b3893c0	Remove skip_table_builder_flush and default it to true Summary: This option is needed to be enabled for Direct IO and I cannot think of a reason where we need to disable it remove it and default it to true Closes https://github.com/facebook/rocksdb/pull/1944 Differential Revision: D4641088 Pulled By: IslamAbdelRahman fbshipit-source-id: d7085b9	2017-03-02 16:54:10 -08:00
Siying Dong	d5b607a43f	Make db_wal_test slightly faster Summary: Avoid running db_wal_test in all the DB test options, and some small changes. Closes https://github.com/facebook/rocksdb/pull/1921 Differential Revision: D4622054 Pulled By: siying fbshipit-source-id: 890fd64	2017-02-28 17:39:10 -08:00
Siying Dong	ba4c77bd6b	Divide external_sst_file_test Summary: Separate the platform dependent tests from external_sst_file_test. Only those tests need to run on platforms like OSX Closes https://github.com/facebook/rocksdb/pull/1923 Differential Revision: D4622461 Pulled By: siying fbshipit-source-id: d2d6f04	2017-02-28 14:24:11 -08:00
Aaron Gao	e877afa08b	Remove bulk loading and auto_roll_logger in rocksdb_lite Summary: shrink lite size Closes https://github.com/facebook/rocksdb/pull/1929 Differential Revision: D4622059 Pulled By: siying fbshipit-source-id: 050b796	2017-02-28 11:09:11 -08:00
Peter (Stig) Edwards	2ca2059f66	Get unique_ptr to use delete[] for char[] in DumpMallocStats Summary: Avoid mismatched free() / delete / delete [] in DumpMallocStats Closes https://github.com/facebook/rocksdb/pull/1927 Differential Revision: D4622045 Pulled By: siying fbshipit-source-id: 1131b30	2017-02-27 17:39:12 -08:00
Siying Dong	8ad0fcdf99	Separate small subset tests in DBTest Summary: Separate a small subset of tests in DBTest to DBBasicTest. Tests in DBTest don't have to run in CI tests on platforms like OSX, as long as they are covered by Linux. Closes https://github.com/facebook/rocksdb/pull/1924 Differential Revision: D4616702 Pulled By: siying fbshipit-source-id: 13e6549	2017-02-27 12:24:11 -08:00
Siying Dong	3b8ba703cb	Fix flaky DBTestUniversalCompaction.UniversalCompactionTrivialMoveTest2 Summary: A previous fix to DBTestUniversalCompaction.UniversalCompactionTrivialMoveTest2 didn't address the right problem. The problem is L0->L0 compaction is not trivial move in the scenario, not parallel compactions. Fix this. Closes https://github.com/facebook/rocksdb/pull/1911 Differential Revision: D4608955 Pulled By: siying fbshipit-source-id: 7a712cb	2017-02-23 18:39:13 -08:00
Siying Dong	8efb5ffa2a	[rocksdb][PR] Remove option min_partial_merge_operands and verify_checksums_in_comp… Summary: …action The two options, min_partial_merge_operands and verify_checksums_in_compaction, are seldom used. Remove them to reduce the total number of options. Also remove them from Java and C interface. Closes https://github.com/facebook/rocksdb/pull/1902 Differential Revision: D4601219 Pulled By: siying fbshipit-source-id: aad4cb2	2017-02-23 15:09:12 -08:00
Siying Dong	1ba2804b7f	Remove XFunc tests Summary: Xfunc is hardly used. Remove it to keep the code simple. Closes https://github.com/facebook/rocksdb/pull/1905 Differential Revision: D4603220 Pulled By: siying fbshipit-source-id: 731f96d	2017-02-23 12:09:11 -08:00
Aaron Gao	1ef5f50e84	detect logical sector size Summary: querying logical sector size from the device instead of hardcoding it for linux platform. Closes https://github.com/facebook/rocksdb/pull/1875 Differential Revision: D4591946 Pulled By: ajkr fbshipit-source-id: 4e9805c	2017-02-23 11:25:36 -08:00
Aaron Gao	0824934423	truncate patch Summary: omit the override for the previous commit Closes https://github.com/facebook/rocksdb/pull/1898 Differential Revision: D4598743 Pulled By: lightmark fbshipit-source-id: f98a378	2017-02-22 10:39:11 -08:00
Aaron Gao	286a36db7f	posix writablefile truncate Summary: we occasionally miss this call, so the file size will be wrong Closes https://github.com/facebook/rocksdb/pull/1894 Differential Revision: D4598446 Pulled By: lightmark fbshipit-source-id: 42b6ef5	2017-02-22 10:09:14 -08:00
Mike Kolupaev	18eeb7b90e	Fix interference between max_total_wal_size and db_write_buffer_size checks Summary: This is a trivial fix for OOMs we've seen a few days ago in logdevice. RocksDB get into the following state: (1) Write throughput is too high for flushes to keep up. Compactions are out of the picture - automatic compactions are disabled, and for manual compactions we don't care that much if they fall behind. We write to many CFs, with only a few L0 sst files in each, so compactions are not needed most of the time. (2) total_log_size_ is consistently greater than GetMaxTotalWalSize(). It doesn't get smaller since flushes are falling ever further behind. (3) Total size of memtables is way above db_write_buffer_size and keeps growing. But the write_buffer_manager_->ShouldFlush() is not checked because (2) prevents it (for no good reason, afaict; this is what this commit fixes). (4) Every call to WriteImpl() hits the MaybeFlushColumnFamilies() path. This keeps flushing the memtables one by one in order of increasing log file number. (5) No write stalling trigger is hit. We rely on max_write_buffer_number Closes https://github.com/facebook/rocksdb/pull/1893 Differential Revision: D4593590 Pulled By: yiwu-arbug fbshipit-source-id: af79c5f	2017-02-21 16:09:10 -08:00
Aaron Gao	2a0f3d0de1	level compaction expansion Summary: reimplement the compaction expansion on lower level. Consider the following case: input level file: 1[B E] 2[F G] 3[H I] 4 [J M] output level file: 5[A C] 6[D K] 7[L O] If we initially pick file 2, now we will compact file 2 and 6. But we can safely compact 2, 3 and 6 without expanding the output level. The previous code is messy and wrong. In this diff, I first determine the input range [a, b], and output range [c, d], then we get the range [e, f] = [min(a, c), max(b, d)] and put all eligible clean-cut files within [e, f] into this compaction. Note: clean-cut means the files don't have the same user key on the boundaries of some files that are not chosen in this compaction. Closes https://github.com/facebook/rocksdb/pull/1760 Differential Revision: D4395564 Pulled By: lightmark fbshipit-source-id: 2dc2c5c	2017-02-21 10:24:17 -08:00
Yi Wu	381fd32247	Remove timeout_hint_us from WriteOptions Summary: The option has been deprecated for two years and has no effect. Removing. Closes https://github.com/facebook/rocksdb/pull/1866 Differential Revision: D4555203 Pulled By: yiwu-arbug fbshipit-source-id: c48f627	2017-02-17 15:24:17 -08:00
Islam AbdelRahman	fce7a6e196	Fail IngestExternalFile when bg_error_ exists Summary: Fail IngestExternalFile() when bg_error_ exists Closes https://github.com/facebook/rocksdb/pull/1881 Differential Revision: D4580621 Pulled By: IslamAbdelRahman fbshipit-source-id: 1194913	2017-02-17 13:39:17 -08:00
Shu Zhang	756c5924e6	Allow adding external v1 sst file with no global seqno support Summary: This is a follow up fix for https://github.com/facebook/rocksdb/pull/1783. After it, we should be able to ingest external v1 sst files with no global seqno field. Closes https://github.com/facebook/rocksdb/pull/1874 Differential Revision: D4576194 Pulled By: IslamAbdelRahman fbshipit-source-id: 5b34a3e	2017-02-16 17:09:12 -08:00
Aaron Gao	db2b4eb50e	avoid direct io in rocksdb_lite Summary: fix lite bugs disable direct io in lite mode Closes https://github.com/facebook/rocksdb/pull/1870 Differential Revision: D4559866 Pulled By: yiwu-arbug fbshipit-source-id: 3761c51	2017-02-16 10:39:13 -08:00
Andrew Kryczka	43e9f01c20	Fix repair_test on ROCKSDB_LITE Summary: RepairDB isn't included in rocksdb lite, so don't test it. Closes https://github.com/facebook/rocksdb/pull/1873 Differential Revision: D4565094 Pulled By: ajkr fbshipit-source-id: 8cc0898	2017-02-15 11:24:12 -08:00
Xiaofei Du	7106a994fe	Use monotonic time points in write_controller.cc and rate_limiter.cc Summary: NowMicros() provides non-monotonic time. When wall clock is synchronized or changed, the non-monotonicity time points will affect write rate controllers. This patch changes write_controller.cc and rate_limiter.cc to use monotonic time points. Closes https://github.com/facebook/rocksdb/pull/1865 Differential Revision: D4561732 Pulled By: siying fbshipit-source-id: 95ece62	2017-02-14 18:24:24 -08:00
Yi Wu	c2247dc1c7	Make DBImpl::has_unpersisted_data_ atomic Summary: Seems to me `has_unpersisted_data_` is read from the read thread and written from the write thread concurrently without synchronization. Making it an atomic. I update the logic not because of seeing any problem with it, but because it just feels confusing. Closes https://github.com/facebook/rocksdb/pull/1869 Differential Revision: D4555837 Pulled By: yiwu-arbug fbshipit-source-id: eff2ab8	2017-02-13 18:54:13 -08:00
Sagar Vemuri	eb912a927e	Remove disableDataSync option Summary: Remove disableDataSync, and another similarly named disable_data_sync options. This is being done to simplify options, and also because the performance gains of this feature can be achieved by other methods. Closes https://github.com/facebook/rocksdb/pull/1859 Differential Revision: D4541292 Pulled By: sagar0 fbshipit-source-id: 5b3a6ca	2017-02-13 11:09:13 -08:00
Dmitri Smirnov	a5adda0642	Fix repair issues Summary: Record the first parsed sequence number as the minimum so we can find the true minimum otherwise everything is larger than zero. Fix the comparator name comparison. Closes https://github.com/facebook/rocksdb/pull/1858 Differential Revision: D4544365 Pulled By: ajkr fbshipit-source-id: 439cbc2	2017-02-10 10:54:12 -08:00
Andrew Kryczka	b48e4778be	Consolidate file cutting logic in compaction loop Summary: It was really annoying to have two places (top and bottom of compaction loop) where we cut output files. I had bugs in both DeleteRange and dictionary compression due to updating only one of the two. This diff consolidates the file-cutting logic to the bottom of the compaction loop. Keep in mind that my goal with input_status is to be consistent with the past behavior, even though I'm not sure it's ideal. Closes https://github.com/facebook/rocksdb/pull/1832 Differential Revision: D4503038 Pulled By: ajkr fbshipit-source-id: 7da5213	2017-02-08 16:24:17 -08:00
Maysam Yabandeh	c4a37dcb44	Print the missed last layer in cfstats Summary: Printing compaction stats used to operate on two variable: number_levels_: for printing the layer num_levels_to_check: for updating the compaction score After this commit: `361010d447` these two are mixed up and as a result the last layer might not be printed out: https://fb.facebook.com/groups/rocksdb.internal/permalink/1315716625143616/ number_levels_ was used to decide which layers to print: `672300f47f/db/internal_stats.cc (L753)` but after the patch it is based on the return value of DumpCFMapStats `361010d447/db/internal_stats.cc (L929)` which returns num_levels_to_check: `361010d447/db/internal_stats.cc (L917)` Closes https://github.com/facebook/rocksdb/pull/1853 Differential Revision: D4529280 Pulled By: maysamyabandeh fbshipit-source-id: 3fd9448	2017-02-08 10:39:15 -08:00
Maysam Yabandeh	69d5262c81	Two-level Indexes Summary: Partition Index blocks and use a Partition-index as a 2nd level index. The two-level index can be used by setting BlockBasedTableOptions::kTwoLevelIndexSearch as the index type and configuring BlockBasedTableOptions::index_per_partition t15539501 Closes https://github.com/facebook/rocksdb/pull/1814 Differential Revision: D4473535 Pulled By: maysamyabandeh fbshipit-source-id: bffb87e	2017-02-06 16:39:12 -08:00
Dmitri Smirnov	0a4cdde50a	Windows thread Summary: introduce new methods into a public threadpool interface, - allow submission of std::functions as they allow greater flexibility. - add Joining methods to the implementation to join scheduled and submitted jobs with an option to cancel jobs that did not start executing. - Remove ugly `#ifdefs` between pthread and std implementation, make it uniform. - introduce pimpl for a drop in replacement of the implementation - Introduce rocksdb::port::Thread typedef which is a replacement for std::thread. On Posix Thread defaults as before std::thread. - Implement WindowsThread that allocates memory in a more controllable manner than windows std::thread with a replaceable implementation. - should be no functionality changes. Closes https://github.com/facebook/rocksdb/pull/1823 Differential Revision: D4492902 Pulled By: siying fbshipit-source-id: c74cb11	2017-02-06 14:54:18 -08:00
Vitaliy Liptchinsky	1aaa898cf1	Adding GetApproximateMemTableStats method Summary: Added method that returns approx num of entries as well as size for memtables. Closes https://github.com/facebook/rocksdb/pull/1841 Differential Revision: D4511990 Pulled By: VitaliyLi fbshipit-source-id: 9a4576e	2017-02-06 14:54:16 -08:00
Siying Dong	036d668b19	Fix wrong result in data race case related to Get() Summary: In theory, Get() can get a wrong result, if it races in a special way with flush. The bug can be reproduced in DBTest2.GetRaceFlush. Fix this bug by getting snapshot after referencing the super version. Closes https://github.com/facebook/rocksdb/pull/1816 Differential Revision: D4475958 Pulled By: siying fbshipit-source-id: bd9e67a	2017-02-03 11:39:15 -08:00
Islam AbdelRahman	574b543f80	Rename merger.h -> merging_iterator.h Summary: merger.h was always a confusing name for me, simply give the file a better name Closes https://github.com/facebook/rocksdb/pull/1836 Differential Revision: D4505357 Pulled By: IslamAbdelRahman fbshipit-source-id: 07b28d8	2017-02-02 16:54:19 -08:00
oranagra	b96372dead	improving the C wrapper Summary: - rocksdb_property_int (so that we don't have to parse strings) - and rocksdb_set_options (to allow controlling options via strings) - a few other missing options exposed - a documentation comment fix Closes https://github.com/facebook/rocksdb/pull/1793 Differential Revision: D4456569 Pulled By: yiwu-arbug fbshipit-source-id: 9f1fac1	2017-01-27 17:39:16 -08:00
Siying Dong	04c4ec41d1	Change corruption_test to use 4 bits. Summary: In the patch in which the LRU cache was made to use dynamic shard bits, I changed to 2 shard bits to make the test happy. Looks like it is occasionally still unhappy. Change it to 4 shard bits. Closes https://github.com/facebook/rocksdb/pull/1815 Differential Revision: D4475849 Pulled By: siying fbshipit-source-id: 575ff00	2017-01-27 11:24:16 -08:00
Siying Dong	2d75cd40d3	NewLRUCache() to pick number of shard bits based on capacity if not given Summary: If the users use the NewLRUCache() without passing in the number of shard bits, instead of using hard-coded 6, we'll determine it based on capacity. Closes https://github.com/facebook/rocksdb/pull/1584 Differential Revision: D4242517 Pulled By: siying fbshipit-source-id: 86b0f18	2017-01-27 06:39:12 -08:00
Siying Dong	f25f1ec60b	Add test DBTest2.GetRaceFlush which can expose a data race bug Summary: A current data race issue in Get() and Flush() can cause a Get() to return wrong results when a flush happened in the middle. Disable the test for now. Closes https://github.com/facebook/rocksdb/pull/1813 Differential Revision: D4472310 Pulled By: siying fbshipit-source-id: 5755ebd	2017-01-26 16:39:14 -08:00
sdong	5dad9d6d28	Avoid logs_ operation out of DB mutex Summary: logs_.back() is called out of DB mutex, which can cause data race. We move the access into the DB mutex protection area. Closes https://github.com/facebook/rocksdb/pull/1774 Reviewed By: AsyncDBConnMarkedDownDBException Differential Revision: D4417472 Pulled By: AsyncDBConnMarkedDownDBException fbshipit-source-id: 2da1f1e	2017-01-25 15:54:13 -08:00
Islam AbdelRahman	a7b13919bf	Fix CompactFiles() bug when used with CompactionFilter using SuperVersion Summary: GetAndRefSuperVersion() should not be called again in the same thread before ReturnAndCleanupSuperVersion() is called. If we have a compaction filter that is using DB::Get, This will happen ``` CompactFiles() { GetAndRefSuperVersion() // -- first call .. CompactionFilter() { GetAndRefSuperVersion() // -- second call ReturnAndCleanupSuperVersion() } .. ReturnAndCleanupSuperVersion() } ``` We solve this issue in the same way Iterator is solving it, but using GetReferencedSuperVersion() This was discovered in https://github.com/facebook/mysql-5.6/issues/427 by alxyang Closes https://github.com/facebook/rocksdb/pull/1803 Differential Revision: D4460155 Pulled By: IslamAbdelRahman fbshipit-source-id: 5e54322	2017-01-25 14:09:13 -08:00
Andrew Kryczka	616a1464ea	Fix DeleteRange including sentinels in output files Summary: when writing RangeDelAggregator::AddToBuilder, I forgot that there are sentinel tombstones in the middle of the interval map since gaps between real tombstones are represented with sentinels. blame: #1614 Closes https://github.com/facebook/rocksdb/pull/1804 Differential Revision: D4460426 Pulled By: ajkr fbshipit-source-id: 69444b5	2017-01-25 11:09:12 -08:00
Islam AbdelRahman	03ca2ac8a9	Remove function from DBImpl that are not used anywhere Summary: GetAndRefSuperVersionUnlocked ReturnAndCleanupSuperVersionUnlocked GetColumnFamilyHandleUnlocked Are dead code that are not used any where Closes https://github.com/facebook/rocksdb/pull/1802 Differential Revision: D4459948 Pulled By: IslamAbdelRahman fbshipit-source-id: 30fa89d	2017-01-24 19:24:13 -08:00
Andrew Kryczka	b0029bc7fa	Test merge op covered by range deletion in memtable Summary: It's a test case for #1797. Also got rid of kTypeDeletion in the conditional since we treat it the same as kTypeRangeDeletion. Closes https://github.com/facebook/rocksdb/pull/1800 Differential Revision: D4451300 Pulled By: ajkr fbshipit-source-id: b39dda1	2017-01-24 13:39:11 -08:00
Andrew Kryczka	d438e1ec17	Test range deletion block outlives table reader Summary: This test ensures RangeDelAggregator can still access blocks even if it outlives the table readers that created them (detailed description in comments). I plan to optimize away the extra cache lookup we currently do in BlockBasedTable::NewRangeTombstoneIterator(), as it is ~5% CPU in my random read benchmark in a database with 1k tombstones. This test will help make sure nothing breaks in the process. Closes https://github.com/facebook/rocksdb/pull/1739 Differential Revision: D4375954 Pulled By: ajkr fbshipit-source-id: aef9357	2017-01-24 13:24:14 -08:00
Andrew Kryczka	9da4d542fe	Range deletions unsupported in tailing iterator Summary: change the iterator status to NotSupported as soon as a range tombstone is encountered by a ForwardIterator. Closes https://github.com/facebook/rocksdb/pull/1593 Differential Revision: D4246294 Pulled By: ajkr fbshipit-source-id: aef9f49	2017-01-23 13:39:12 -08:00
Hyeonseok Oh	f2b4939da4	fixed typo Summary: I fixed exisit -> exist Closes https://github.com/facebook/rocksdb/pull/1799 Differential Revision: D4451466 Pulled By: yiwu-arbug fbshipit-source-id: b447c3a	2017-01-23 12:54:13 -08:00
yinqiwen	973f1b78fd	memtable: delete merge value for range deletion Summary: Closes https://github.com/facebook/rocksdb/pull/1797 Differential Revision: D4448004 Pulled By: ajkr fbshipit-source-id: 3ffc27c	2017-01-23 12:24:14 -08:00
Vitaliy Liptchinsky	753ff84a3d	Fix get approx size Summary: Fixing GetApproximateSize bug for the case of computing stats for mem tables only. Closes https://github.com/facebook/rocksdb/pull/1795 Differential Revision: D4445507 Pulled By: IslamAbdelRahman fbshipit-source-id: 3905846	2017-01-20 15:54:12 -08:00
Jay Lee	537da370da	c: allow set savepoint to writebatch Summary: Allow set SavePoint to WriteBatch in C ABI. Closes https://github.com/facebook/rocksdb/pull/1698 Differential Revision: D4378556 Pulled By: yiwu-arbug fbshipit-source-id: afca746	2017-01-20 13:24:13 -08:00
Changli Gao	5ac97314e7	Fix std::out_of_range when DBOptions::keep_log_file_num is zero Summary: We should validate this option, otherwise we may see std::out_of_range thrown at: db/db_impl.cc:1124 1123 for (unsigned int i = 0; i <= end; i++) { 1124 std::string& to_delete = old_info_log_files.at(i); 1125 std::string full_path_to_delete = 1126 (immutable_db_options_.db_log_dir.empty() Closes https://github.com/facebook/rocksdb/pull/1722 Differential Revision: D4379495 Pulled By: yiwu-arbug fbshipit-source-id: e136552	2017-01-20 13:24:12 -08:00
Shu Zhang	3c0852d1da	Make ingest external file backward compatible Summary: Closes https://github.com/facebook/rocksdb/pull/1783 Differential Revision: D4443463 Pulled By: IslamAbdelRahman fbshipit-source-id: 39d21d6	2017-01-20 12:09:19 -08:00
Siying Dong	0e8dfd6062	Fix OptimizeForPointLookup() Summary: If users directly call OptimizeForPointLookup(), it is broken as the option isn't compatible with parallel memtable insert. Fix it by using memtable bloom filter instead. Closes https://github.com/facebook/rocksdb/pull/1791 Differential Revision: D4442836 Pulled By: siying fbshipit-source-id: bf6c9cd	2017-01-20 10:54:12 -08:00
Vitaliy Liptchinsky	e840213d6e	Change DB::GetApproximateSizes for more flexibility needed for MyRocks Summary: Added an option to GetApproximateSizes to exclude file stats, as MyRocks has those counted exactly and we need only stats from memtables. Closes https://github.com/facebook/rocksdb/pull/1787 Differential Revision: D4441111 Pulled By: IslamAbdelRahman fbshipit-source-id: c11f4c3	2017-01-20 09:39:11 -08:00
Yi Wu	9239103cd4	Flush job should release reference current version if sync log failed Summary: Fix the bug where, when sync log fails, FlushJob::Run() will not be executed and the reference to cfd->current() will not be released. Closes https://github.com/facebook/rocksdb/pull/1792 Differential Revision: D4441316 Pulled By: yiwu-arbug fbshipit-source-id: 5523e28	2017-01-19 23:09:15 -08:00
Islam AbdelRahman	da54d36a96	Disable IngestExternalFile in ReadOnly mode Summary: Disable IngestExternalFile() in read only mode Closes https://github.com/facebook/rocksdb/pull/1781 Differential Revision: D4439179 Pulled By: IslamAbdelRahman fbshipit-source-id: b7e46e7	2017-01-19 15:54:19 -08:00
Reid Horuff	5cf176ca15	Fix for 2PC causing WAL to grow too large Summary: Consider the following single column family scenario: prepare in log A commit in log B WAL is too large, flush all CFs to release log A CFA is on log B so we do not see CFA is depending on log A so no flush is requested To fix this we must also consider the log containing the prepare section when determining what log a CF is dependent on. Closes https://github.com/facebook/rocksdb/pull/1768 Differential Revision: D4403265 Pulled By: reidHoruff fbshipit-source-id: ce800ff	2017-01-19 15:39:12 -08:00
Andrew Kryczka	f9d18e22d2	Fix DeleteRange file boundary correctness issue with max_compaction_bytes Summary: Cockroachdb exposed this bug in #1778. The bug happens when a compaction's output files are ended due to exceeding max_compaction_bytes. In that case we weren't taking into account the next file's start key when deciding how far to extend the current file's max_key. This caused the non-overlapping key-range invariant to be violated. Note this was correctly handled for the usual case of cutting compaction output, which is file size exceeding max_output_file_size. I am not sure why these are two separate code paths, but we can consider refactoring it to prevent such errors in the future. Closes https://github.com/facebook/rocksdb/pull/1784 Differential Revision: D4430235 Pulled By: ajkr fbshipit-source-id: 80af748	2017-01-18 11:54:22 -08:00
Islam AbdelRahman	3ce091fd73	Add KEEP_DB env var option Summary: When debugging tests, it's useful to preserve the DB to investigate it and check the logs This will allow us to set KEEP_DB=1 to preserve the DB Closes https://github.com/facebook/rocksdb/pull/1759 Differential Revision: D4393826 Pulled By: IslamAbdelRahman fbshipit-source-id: 1bff689	2017-01-17 13:54:20 -08:00
Siying Dong	77b4806625	Fix 2PC with concurrent memtable insert Summary: If concurrent memtable insert is enabled, and one prepare command and a normal command are grouped into a commit group, the sequence ID will be calculated incorrectly. Closes https://github.com/facebook/rocksdb/pull/1730 Differential Revision: D4371081 Pulled By: siying fbshipit-source-id: cd40c6d	2017-01-17 11:24:28 -08:00
Mike Kolupaev	d18dd2c41f	Abort compactions more reliably when closing DB Summary: DB shutdown aborts running compactions by setting an atomic shutting_down=true that CompactionJob periodically checks. Without this PR it checks it before processing every _output_ value. If compaction filter filters everything out, the compaction is uninterruptible. This PR adds checks for shutting_down on every _input_ value (in CompactionIterator and MergeHelper). There's also some minor code cleanup along the way. Closes https://github.com/facebook/rocksdb/pull/1639 Differential Revision: D4306571 Pulled By: yiwu-arbug fbshipit-source-id: f050890	2017-01-11 15:09:21 -08:00
Changli Gao	9f246298e2	Performance: Iterate vector by reference Summary: Closes https://github.com/facebook/rocksdb/pull/1763 Differential Revision: D4398796 Pulled By: yiwu-arbug fbshipit-source-id: b82636d	2017-01-11 10:54:37 -08:00
Dmitri Smirnov	3c233ca4ea	Fix Windows environment issues Summary: Enable directIO on WritableFileImpl::Append with offset being current length of the file. Enable UniqueID tests on Windows, disable others but still let them compile. Unique tests are valuable to detect failures on different filesystems and upcoming ReFS. Clear output in WinEnv Getchildren. This is different from previous strategy, do not touch output on failure. Make sure DBTest.OpenWhenOpen works with windows error message Closes https://github.com/facebook/rocksdb/pull/1746 Differential Revision: D4385681 Pulled By: IslamAbdelRahman fbshipit-source-id: c07b702	2017-01-09 15:54:12 -08:00
Maysam Yabandeh	d0ba8ec8f9	Revert "PinnableSlice" Summary: This reverts commit `54d94e9c2c`. The pull request was landed by mistake. Closes https://github.com/facebook/rocksdb/pull/1755 Differential Revision: D4391678 Pulled By: maysamyabandeh fbshipit-source-id: 36d5149	2017-01-08 14:24:12 -08:00
Maysam Yabandeh	54d94e9c2c	PinnableSlice Summary: Currently the point lookup values are copied to a string provided by the user. This incurs an extra memcpy cost. This patch allows doing point lookup via a PinnableSlice which pins the source memory location (instead of copying their content) and releases them after the content is consumed by the user. The old API of Get(string) is translated to the new API underneath. Here is the summary for improvements: 1. value 100 byte: 1.8% regular, 1.2% merge values 2. value 1k byte: 11.5% regular, 7.5% merge values 3. value 10k byte: 26% regular, 29.9% merge values The improvement for merge could be more if we extend this approach to pin the merge output and delay the full merge operation until the user actually needs it. We have put that for future work. PS: Sometimes we observe a small decrease in performance when switching from t5452014 to this patch but with the old Get(string) API. The difference is a little and could be noise. More importantly it is safely cancelled Closes https://github.com/facebook/rocksdb/pull/1732 Differential Revision: D4374613 Pulled By: maysamyabandeh fbshipit-source-id: a077f1a	2017-01-08 13:54:13 -08:00
Andrew Kryczka	b104b87814	Maintain position in range deletions map Summary: When deletion-collapsing mode is enabled (i.e., for DBIter/CompactionIterator), we maintain position in the tombstone maps across calls to ShouldDelete(). Since iterators often access keys sequentially (or reverse-sequentially), scanning forward/backward from the last position can be faster than binary-searching the map for every key. - When Next() is invoked on an iterator, we use kForwardTraversal to scan forwards, if needed, until arriving at the range deletion containing the next key. - Similarly for Prev(), we use kBackwardTraversal to scan backwards in the range deletion map. - When the iterator seeks, we use kBinarySearch for repositioning - After tombstones are added or before the first ShouldDelete() invocation, the current position is set to invalid, which forces kBinarySearch to be used. - Non-iterator users (i.e., Get()) use kFullScan, which has the same behavior as before---scan the whole map for every key passed to ShouldDelete(). Closes https://github.com/facebook/rocksdb/pull/1701 Differential Revision: D4350318 Pulled By: ajkr fbshipit-source-id: 5129b76	2017-01-05 10:39:12 -08:00
siddontang	653ac1f9c6	C API: support total_order_mode Summary: Closes https://github.com/facebook/rocksdb/pull/1687 Differential Revision: D4349210 Pulled By: IslamAbdelRahman fbshipit-source-id: 32d0fbd	2017-01-03 18:39:14 -08:00
Adam Retter	85ac1a320a	Fix rocksdb::Status::getState Summary: This fixes the Java API for Status#getState use in Native code and also simplifies the implementation of rocksdb::Status::getState. Closes https://github.com/facebook/rocksdb/issues/1688 Closes https://github.com/facebook/rocksdb/pull/1714 Differential Revision: D4364181 Pulled By: yiwu-arbug fbshipit-source-id: 8e073b4	2017-01-03 18:39:14 -08:00
Islam AbdelRahman	76711b6e77	Make ExternalSSTFileTest::CompactionDeadlock more deterministic Summary: It's not always true that `ASSERT_EQ(running_threads.load(), 2);` Closes https://github.com/facebook/rocksdb/pull/1736 Differential Revision: D4374091 Pulled By: IslamAbdelRahman fbshipit-source-id: 4f70bbd	2017-01-03 18:09:20 -08:00
Islam AbdelRahman	c963460dbc	Fix tests under GCC_481 Summary: This fixes the issue with tests failing under GCC 481; I am not sure what the exact reason is Closes https://github.com/facebook/rocksdb/pull/1735 Differential Revision: D4374094 Pulled By: IslamAbdelRahman fbshipit-source-id: b3625bc	2017-01-03 17:54:12 -08:00
Vincent Lee	e425ec1162	utilities/backupable: backup should limit the copy size of wal. Summary: Since the backup work as snapshot, we should only copy the bytes of the wal while we get the alive files. Closes https://github.com/facebook/rocksdb/pull/1733 Differential Revision: D4373457 Pulled By: ajkr fbshipit-source-id: 389318f	2016-12-31 10:54:20 -08:00
Maysam Yabandeh	0712d541d1	Delegate Cleanables Summary: Cleanable objects perform the registered cleanups when they are destructed. However, we would sometimes rather delay this cleaning, for example when we are gathering the merge operands. The current approach is to create the Cleanable object on the heap (instead of on the stack) and delay deleting it. By allowing Cleanables to delegate their cleanups to another Cleanable object, we can delay the cleaning without needing to create the Cleanable object on the heap and keep it around. This patch applies this technique for the cleanups of BlockIter and shows improved performance for some in-memory benchmarks: +1.8% for merge workload, +6.4% for non-merge workload when the merge operator is specified. https://our.intern.facebook.com/intern/tasks?t=15168163 Non-merge benchmark: TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench --benchmarks=fillrandom --num=1000000 -value_size=100 -compression_type=none Reading random with no merge operator specified: TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench --benchmarks="read Closes https://github.com/facebook/rocksdb/pull/1711 Differential Revision: D4361163 Pulled By: maysamyabandeh fbshipit-source-id: 9801e07	2016-12-29 15:54:19 -08:00
Islam AbdelRahman	d58ef52ba6	Allow SstFileWriter to Fadvise the file away from page cache Summary: Add `fadvise_trigger` option to `SstFileWriter` If fadvise_trigger is passed with a non-zero value, SstFileWriter will invalidate the os page cache every `fadvise_trigger` bytes for the sst file Closes https://github.com/facebook/rocksdb/pull/1731 Differential Revision: D4371246 Pulled By: IslamAbdelRahman fbshipit-source-id: 91caff1	2016-12-29 15:09:19 -08:00
Siying Dong	17a4b75cc3	Always fsync the file after file copying Summary: File copying happens when creating checkpoints and bulkloading files from different FS partition. We should fsync the files when copying them to guarantee durability. A side effect will be that the dirty pages in file system buffers won't grow too large. Closes https://github.com/facebook/rocksdb/pull/1728 Differential Revision: D4371083 Pulled By: siying fbshipit-source-id: 579e14c	2016-12-28 19:09:16 -08:00
leipeng	a738af8f84	db/pinned_iterators_manager.h: bugfix Summary: std::unique(beg, end) returns an iterator of unique_end, data behind unique_end should not be accessed. Closes https://github.com/facebook/rocksdb/pull/1726 Differential Revision: D4371076 Pulled By: IslamAbdelRahman fbshipit-source-id: 5564450	2016-12-28 18:54:57 -08:00
Siying Dong	438f22bc56	Fix bug of Checkpoint loses recent transactions with 2PC Summary: If 2PC is enabled, checkpoint may not copy previous log files that contain uncommitted prepare records. In this diff we keep those files. Closes https://github.com/facebook/rocksdb/pull/1724 Differential Revision: D4368319 Pulled By: siying fbshipit-source-id: cc2c746	2016-12-28 12:24:16 -08:00
Aaron Gao	972f96b3fb	direct io write support Summary: rocksdb direct io support ``` [gzh@dev11575.prn2 ~/rocksdb] ./db_bench -benchmarks=fillseq --num=1000000 Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags RocksDB: version 5.0 Date: Wed Nov 23 13:17:43 2016 CPU: 40 * Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz CPUCache: 25600 KB Keys: 16 bytes each Values: 100 bytes each (50 bytes after compression) Entries: 1000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 110.6 MB (estimated) FileSize: 62.9 MB (estimated) Write rate: 0 bytes/second Compression: Snappy Memtablerep: skip_list Perf Level: 1 WARNING: Assertions are enabled; benchmarks unnecessarily slow ------------------------------------------------ Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags DB path: [/tmp/rocksdbtest-112628/dbbench] fillseq : 4.393 micros/op 227639 ops/sec; 25.2 MB/s [gzh@dev11575.prn2 ~/roc Closes https://github.com/facebook/rocksdb/pull/1564 Differential Revision: D4241093 Pulled By: lightmark fbshipit-source-id: 98c29e3	2016-12-22 13:09:19 -08:00
Islam AbdelRahman	989e644ed8	Remove sst_file_manager option from LITE Summary: Remove sst_file_manager option from LITE Closes https://github.com/facebook/rocksdb/pull/1690 Differential Revision: D4341331 Pulled By: IslamAbdelRahman fbshipit-source-id: 9f9328d	2016-12-21 17:54:21 -08:00
Islam AbdelRahman	1beef6569a	Fix c_test Summary: addfile phase in c_test could fail because in previous steps we did a DeleteRange. Fix the test by simply moving the addfile phase before DeleteRange Closes https://github.com/facebook/rocksdb/pull/1672 Differential Revision: D4328896 Pulled By: IslamAbdelRahman fbshipit-source-id: 1d946df	2016-12-21 17:39:14 -08:00
Andrew Kryczka	50e305de98	Collapse range deletions Summary: Added a tombstone-collapsing mode to RangeDelAggregator, which eliminates overlap in the TombstoneMap. In this mode, we can check whether a tombstone covers a user key using upper_bound() (i.e., binary search). However, the tradeoff is the overhead to add tombstones is now higher, so at first I've only enabled it for range scans (compaction/flush/user iterators), where we expect a high number of calls to ShouldDelete() for the same tombstones. Point queries like Get() will still use the linear scan approach. Also in this diff I changed RangeDelAggregator's TombstoneMap to use multimap with user keys instead of map with internal keys. Callers sometimes provided ParsedInternalKey directly, from which it would've required string copying to derive an internal key Slice with which we could search the map. Closes https://github.com/facebook/rocksdb/pull/1614 Differential Revision: D4270397 Pulled By: ajkr fbshipit-source-id: 93092c7	2016-12-19 16:54:12 -08:00
Yi Wu	5d1457dbbf	Dump persistent cache options Summary: Dump persistent cache options Closes https://github.com/facebook/rocksdb/pull/1679 Differential Revision: D4337019 Pulled By: yiwu-arbug fbshipit-source-id: 3812f8a	2016-12-19 14:09:12 -08:00
Daniel Black	342370f1d3	Simplify MemTable::Update Summary: As suggested by testn in #1650. The Add is at the end of the function. Having a fallthrough will result in it being added twice. Closes https://github.com/facebook/rocksdb/pull/1676 Differential Revision: D4331906 Pulled By: yiwu-arbug fbshipit-source-id: 895c4a0	2016-12-17 00:09:13 -08:00
Ding Ma	1a136c1f13	Expose file size Summary: Add a new function to SstFileWriter that tells the user how big their file is right now. Closes https://github.com/facebook/rocksdb/pull/1686 Differential Revision: D4338868 Pulled By: mdyuki1016 fbshipit-source-id: c1ee16a	2016-12-16 18:39:12 -08:00
Andrew Kryczka	fbff4628a9	Reduce compaction iterator status checks Summary: seems it's expensive to check status since the underlying merge iterator checks status of all its children. so only do it when it's really necessary to get the status before invoking Next(), i.e., when we're advancing to get the first key in the next file. Closes https://github.com/facebook/rocksdb/pull/1691 Differential Revision: D4343446 Pulled By: siying fbshipit-source-id: 70ab315	2016-12-16 17:39:09 -08:00
Daniel Black	816c1e30ca	gcc-7 requires include <functional> for std::function Summary: Fixes compile error: In file included from ./util/statistics.h:17:0, from ./util/stop_watch.h:8, from ./util/perf_step_timer.h:9, from ./util/iostats_context_imp.h:8, from ./util/posix_logger.h:27, from ./port/util_logger.h:18, from ./db/auto_roll_logger.h:15, from db/auto_roll_logger.cc:6: ./util/thread_local.h:65:16: error: 'function' in namespace 'std' does not name a template type typedef std::function<void(void*, void*)> FoldFunc; Closes https://github.com/facebook/rocksdb/pull/1656 Differential Revision: D4318702 Pulled By: yiwu-arbug fbshipit-source-id: 8c5d17a	2016-12-16 11:24:18 -08:00
Yi Wu	c270735861	Iterator should be in corrupted status if merge operator return false Summary: Iterator should be in corrupted status if merge operator return false. Also add test to make sure if max_successive_merges is hit during write, data will not be lost. Closes https://github.com/facebook/rocksdb/pull/1665 Differential Revision: D4322695 Pulled By: yiwu-arbug fbshipit-source-id: b327b05	2016-12-16 11:09:16 -08:00
siddontang	8f5d24ae68	C API: support get usage and pinned_usage for cache Summary: Closes https://github.com/facebook/rocksdb/pull/1671 Differential Revision: D4327453 Pulled By: yiwu-arbug fbshipit-source-id: bcdbc65	2016-12-15 17:24:17 -08:00
Daniel Black	cfc34d7c4e	Missing break in case in DBTestBase::CurrentOptions Summary: Found by gcc-7 compile error. This appeared to be a fault as these options seem too different. Closes https://github.com/facebook/rocksdb/pull/1667 Differential Revision: D4324174 Pulled By: yiwu-arbug fbshipit-source-id: 0f65383	2016-12-13 18:39:14 -08:00
Daniel Black	bfbcec2339	Gcc 7 error expansion to defined Summary: sorry if these gcc-7/clang-4 cleanups are getting tedious. Closes https://github.com/facebook/rocksdb/pull/1658 Differential Revision: D4318792 Pulled By: yiwu-arbug fbshipit-source-id: 8e85891	2016-12-13 18:39:14 -08:00
Daniel Black	67adc937b6	intentional fallthrough (prevents gcc-7/clang-4 error) Summary: db/memtable.cc: In member function 'void rocksdb::MemTable::Update(rocksdb::SequenceNumber, const rocksdb::Slice&, const rocksdb::Slice&)': db/memtable.cc:736:11: error: this statement may fall through [-Werror=implicit-fallthrough=] } ^ db/memtable.cc:738:9: note: here default: ^~~~~~~ cc1plus: all warnings being treated as errors closes #1650 Closes https://github.com/facebook/rocksdb/pull/1655 Differential Revision: D4318696 Pulled By: yiwu-arbug fbshipit-source-id: 1a8981c	2016-12-13 14:39:17 -08:00
Islam AbdelRahman	1a146f89c7	break Flush wait for dropped CF Summary: In FlushJob we don't do the Flush if the CF is dropped https://github.com/facebook/rocksdb/blob/master/db/flush_job.cc#L184-L188 but inside WaitForFlushMemTable we keep waiting forever even if the CF is dropped. Closes https://github.com/facebook/rocksdb/pull/1664 Differential Revision: D4321032 Pulled By: IslamAbdelRahman fbshipit-source-id: 6e2b25d	2016-12-13 14:09:12 -08:00
Yi Wu	36d42e65d0	Disable test to unblock travis build Summary: The two tests keep failing in travis. Disable them and will fix later. Closes https://github.com/facebook/rocksdb/pull/1648 Differential Revision: D4316389 Pulled By: yiwu-arbug fbshipit-source-id: 0a370e7	2016-12-13 11:54:14 -08:00
siddontang	b57dd9262a	C API: support writebatch delete range Summary: Seem that writebatch delete range can work now, so I add C API for later use. Btw, can we use this feature in production now? Closes https://github.com/facebook/rocksdb/pull/1647 Differential Revision: D4314534 Pulled By: ajkr fbshipit-source-id: e835165	2016-12-13 11:24:18 -08:00
Islam AbdelRahman	2ba59b5a1e	Disallow ingesting files into dropped CFs Summary: This PR updates IngestExternalFile to return an error if we try to ingest a file into a dropped CF. Right now, if IngestExternalFile wants to flush a memtable and it's ingesting a file into a dropped CF, it will wait forever, since flushing is not possible for the dropped CF. Closes https://github.com/facebook/rocksdb/pull/1657 Differential Revision: D4318657 Pulled By: IslamAbdelRahman fbshipit-source-id: ed6ea2b	2016-12-13 00:54:14 -08:00
Jonathan Lee	2cabdb8f44	Increase buffer size Summary: When compiling with GCC>=7.0.0, "db/internal_stats.cc" fails to compile as the data being written to the buffer potentially exceeds its size. This fix simply doubles the size of the buffer, thus accommodating the max possible data size. Closes https://github.com/facebook/rocksdb/pull/1635 Differential Revision: D4302162 Pulled By: yiwu-arbug fbshipit-source-id: c76ad59	2016-12-09 11:54:22 -08:00
Jonathan Lee	4a17b47bb5	Remove unnecessary header include Summary: Remove "util/testharness.h" from list of includes for "db/db_filesnapshot.cc", as it wasn't being used and thus caused an extraneous dependency on gtest. Closes https://github.com/facebook/rocksdb/pull/1634 Differential Revision: D4302146 Pulled By: yiwu-arbug fbshipit-source-id: e900c0b	2016-12-09 11:54:21 -08:00
Mike Kolupaev	8c2b921fdf	Fixed a crash in debug build in flush_job.cc Summary: It was doing `&range_del_iters[0]` on an empty vector. Even though the resulting pointer is never dereferenced, it's still bad for two reasons: * the practical reason: it crashes with `std::out_of_range` exception in our debug build, * the "C++ standard lawyer" reason: it's undefined behavior because, in `std::vector` implementation, it probably "dereferences" a null pointer, which is invalid even though it doesn't actually read the pointed memory, just converts a pointer into a reference (and then flush_job.cc converts it back to pointer); nullptr references are undefined behavior. Closes https://github.com/facebook/rocksdb/pull/1612 Differential Revision: D4265625 Pulled By: al13n321 fbshipit-source-id: db26fb9	2016-12-09 10:39:12 -08:00
Islam AbdelRahman	20ce081fae	Fix issue where IngestExternalFile inserts blocks in block cache with g_seqno=0 Summary: When we ingest an external file, we open it to read some metadata and the first/last key. While doing that, we insert blocks into the block cache with global_seqno = 0. If we move the file (rather than copy it) into the DB, we will use these blocks with the wrong seqno in the read path. Closes https://github.com/facebook/rocksdb/pull/1627 Differential Revision: D4293332 Pulled By: yiwu-arbug fbshipit-source-id: 3ce5523	2016-12-08 13:39:18 -08:00
zhangjinpeng1987	45c7ce1377	CompactRangeOptions C API Summary: Add C API for CompactRangeOptions. Closes https://github.com/facebook/rocksdb/pull/1596 Differential Revision: D4252339 Pulled By: yiwu-arbug fbshipit-source-id: f768f93	2016-12-07 17:54:14 -08:00
Andrew Kryczka	b821984d31	DeleteRange read path end-to-end tests Summary: Closes https://github.com/facebook/rocksdb/pull/1592 Differential Revision: D4246260 Pulled By: ajkr fbshipit-source-id: ce03fa2	2016-12-07 12:54:17 -08:00
Artemiy Kolesnikov	2f4fc539c6	Compaction::IsTrivialMove relaxing Summary: IsTrivialMove returns true if no input file overlaps with output_level+1 with more than max_compaction_bytes_ bytes. Closes https://github.com/facebook/rocksdb/pull/1619 Differential Revision: D4278338 Pulled By: yiwu-arbug fbshipit-source-id: 994c001	2016-12-07 11:54:11 -08:00
Islam AbdelRahman	ed8fbdb560	Add EventListener::OnExternalFileIngested() event Summary: Add EventListener::OnExternalFileIngested() to allow user to subscribe to external file ingestion events Closes https://github.com/facebook/rocksdb/pull/1623 Differential Revision: D4285844 Pulled By: IslamAbdelRahman fbshipit-source-id: 0b95a88	2016-12-06 14:09:17 -08:00
Mike Kolupaev	beb36d9c1e	Fixed CompactionFilter::Decision::kRemoveAndSkipUntil Summary: Embarrassingly enough, the first time I tried to use my new feature in logdevice it crashed with this assertion failure: db/pinned_iterators_manager.h:30: void rocksdb::PinnedIteratorsManager::StartPinning(): Assertion `pinning_enabled == false' failed The issue was that `pinned_iters_mgr_.StartPinning()` was called but `pinned_iters_mgr_.ReleasePinnedData()` wasn't. Closes https://github.com/facebook/rocksdb/pull/1611 Differential Revision: D4265622 Pulled By: al13n321 fbshipit-source-id: 747b10f	2016-12-05 15:24:11 -08:00
Islam AbdelRahman	67f37cf198	Allow user to specify a CF for SST files generated by SstFileWriter Summary: Allow user to explicitly specify that the generated file by SstFileWriter will be ingested in a specific CF. This allow us to persist the CF id in the generated file Closes https://github.com/facebook/rocksdb/pull/1615 Differential Revision: D4270422 Pulled By: IslamAbdelRahman fbshipit-source-id: 7fb954e	2016-12-05 14:24:16 -08:00
Anton Safonov	9053fe2a5c	Made delete_obsolete_files_period_micros option dynamic Summary: Made delete_obsolete_files_period_micros option dynamic. It can be updated using DB::SetDBOptions(). Closes https://github.com/facebook/rocksdb/pull/1595 Differential Revision: D4246569 Pulled By: tonek fbshipit-source-id: d23f560	2016-12-05 14:24:16 -08:00
Islam AbdelRahman	edde954e7b	fix clang build Summary: override is missing for FilterV2 Closes https://github.com/facebook/rocksdb/pull/1606 Differential Revision: D4263832 Pulled By: IslamAbdelRahman fbshipit-source-id: d8b337a	2016-12-01 18:39:10 -08:00
Islam AbdelRahman	e39d080871	Fix travis (compile for clang < 3.9) Summary: Travis fails because it uses clang 3.6, which doesn't recognize `__attribute__((__no_sanitize__("undefined")))` Closes https://github.com/facebook/rocksdb/pull/1601 Differential Revision: D4257175 Pulled By: IslamAbdelRahman fbshipit-source-id: fb4d1ab	2016-12-01 10:09:22 -08:00
fangchenliaohui	b77007df8b	Bug: parallel_group status updated in WriteThread::CompleteParallelWorker Summary: Multiple write threads may update the status of the parallel_group in WriteThread::CompleteParallelWorker if the status of the Writer is not ok. When copying the write status to the parallel_group, the write thread holds only the mutex of the writer it processed itself, which is useless. The thread should hold the mutex of the leader of the parallel_group instead. Closes https://github.com/facebook/rocksdb/pull/1598 Differential Revision: D4252335 Pulled By: siying fbshipit-source-id: 3864cf7	2016-12-01 09:54:11 -08:00
Mike Kolupaev	247d0979aa	Support for range skips in compaction filter Summary: This adds the ability for compaction filter to say "drop this key-value, and also drop everything up to key x". This will cause the compaction to seek input iterator to x, without reading the data. This can make compaction much faster when large consecutive chunks of data are filtered out. See the changes in include/rocksdb/compaction_filter.h for the new API. Along the way this diff also adds ability for compaction filter changing merge operands, similar to how it can change values; we're not going to use this feature, it just seemed easier and cleaner to implement it than to document that it's not implemented :) The diff is not as big as it may seem, about half of the lines are a test. Closes https://github.com/facebook/rocksdb/pull/1599 Differential Revision: D4252092 Pulled By: al13n321 fbshipit-source-id: 41e1e48	2016-12-01 07:09:15 -08:00
Panagiotis Ktistakis	96fcefbf1d	c api: expose option for dynamic level size target Summary: Closes https://github.com/facebook/rocksdb/pull/1587 Differential Revision: D4245923 Pulled By: yiwu-arbug fbshipit-source-id: 6ee7291	2016-11-30 11:24:14 -08:00
zhangjinpeng1987	00197cff39	Add C API to set base_background_compactions Summary: Add C API to set base_background_compactions Closes https://github.com/facebook/rocksdb/pull/1571 Differential Revision: D4245709 Pulled By: yiwu-arbug fbshipit-source-id: 792c6b8	2016-11-30 11:09:13 -08:00
Andrew Kryczka	5b219eccb5	deleterange end-to-end test improvements for lite/robustness Summary: Closes https://github.com/facebook/rocksdb/pull/1591 Differential Revision: D4246019 Pulled By: ajkr fbshipit-source-id: 0c4aa37	2016-11-29 12:24:13 -08:00
Andrew Kryczka	e333528991	DeleteRange write path end-to-end tests Summary: Closes https://github.com/facebook/rocksdb/pull/1578 Differential Revision: D4241171 Pulled By: ajkr fbshipit-source-id: ce5fd83	2016-11-29 11:09:22 -08:00
Siying Dong	7784980fcd	Fix mis-reporting of compaction read bytes to the base level Summary: In dynamic leveled compaction, when calculating read bytes, output level bytes may be wrongly calculated as input level inputs. Fix it. Closes https://github.com/facebook/rocksdb/pull/1475 Differential Revision: D4148412 Pulled By: siying fbshipit-source-id: f2f475a	2016-11-29 11:09:22 -08:00
Islam AbdelRahman	3c6b49ed66	Fix implicit conversion between int64_t to int Summary: Make conversion explicit, implicit conversion breaks the build Closes https://github.com/facebook/rocksdb/pull/1589 Differential Revision: D4245158 Pulled By: IslamAbdelRahman fbshipit-source-id: aaec00d	2016-11-29 10:54:15 -08:00
Siying Dong	b3b875657f	Remove unused assignment in db/db_iter.cc Summary: "make analyze" complains the assignment is not useful. Remove it. Closes https://github.com/facebook/rocksdb/pull/1581 Differential Revision: D4241697 Pulled By: siying fbshipit-source-id: 178f67a	2016-11-29 09:09:14 -08:00
Andrew Kryczka	4f6e89b1d0	Fix range deletion covering key in same SST file Summary: AddTombstones() needs to be before t->Get(), oops :'( Closes https://github.com/facebook/rocksdb/pull/1576 Differential Revision: D4241041 Pulled By: ajkr fbshipit-source-id: 781ceea	2016-11-28 22:54:13 -08:00
Islam AbdelRahman	a2bf265a39	Avoid intentional overflow in GetL0ThresholdSpeedupCompaction Summary: `99c052a34f` fixes integer overflow in GetL0ThresholdSpeedupCompaction() by checking if int become -ve. UBSAN will complain about that since this is still an overflow, we can fix the issue by simply using int64_t Closes https://github.com/facebook/rocksdb/pull/1582 Differential Revision: D4241525 Pulled By: IslamAbdelRahman fbshipit-source-id: b3ae21f	2016-11-28 18:39:13 -08:00
Islam AbdelRahman	52fd1ff2c2	disable UBSAN for functions with intentional -ve shift / overflow Summary: disable UBSAN for functions with intentional left shift on -ve number / overflow. These functions are rocksdb::Hash, FixedLengthColBufEncoder::Append, and FaultInjectionTest::Key. Closes https://github.com/facebook/rocksdb/pull/1577 Differential Revision: D4240801 Pulled By: IslamAbdelRahman fbshipit-source-id: 3e1caf6	2016-11-28 17:54:12 -08:00
Islam AbdelRahman	1886c435b9	Fix CompactionJob::Install division by zero Summary: Fix CompactionJob::Install division by zero Closes https://github.com/facebook/rocksdb/pull/1580 Differential Revision: D4240794 Pulled By: IslamAbdelRahman fbshipit-source-id: 7286721	2016-11-28 16:54:16 -08:00
Islam AbdelRahman	13e66a8f51	Fix compaction_job.cc division by zero Summary: Fix division by zero in compaction_job.cc Closes https://github.com/facebook/rocksdb/pull/1575 Differential Revision: D4240818 Pulled By: IslamAbdelRahman fbshipit-source-id: a8bc757	2016-11-28 16:39:13 -08:00
Andrew Kryczka	01eabf7375	Fix double-counted deletion stat Summary: Both the single deletion and the value are included in compaction outputs, so no need to update the stat for the value's deletion yet, otherwise it'd be double-counted. Closes https://github.com/facebook/rocksdb/pull/1574 Differential Revision: D4241181 Pulled By: ajkr fbshipit-source-id: c9aaa15	2016-11-28 15:54:12 -08:00
Andrew Kryczka	7ffb10fc1a	DeleteRange compaction statistics Summary: - "rocksdb.compaction.key.drop.range_del" - number of keys dropped during compaction due to a range tombstone covering them - "rocksdb.compaction.range_del.drop.obsolete" - number of range tombstones dropped due to compaction to bottom level and no snapshot saving them - s/CompactionIteratorStats/CompactionIterationStats/g since this class is no longer specific to CompactionIterator -- it's also updated for range tombstone iteration during compaction - Move the above class into a separate .h file to avoid circular dependency. Closes https://github.com/facebook/rocksdb/pull/1520 Differential Revision: D4187179 Pulled By: ajkr fbshipit-source-id: 10c2103	2016-11-28 11:54:12 -08:00
Mike Kolupaev	236d4c67e9	Less linear search in DBIter::Seek() when keys are overwritten a lot Summary: In one deployment we saw high latencies (presumably from slow iterator operations) and a lot of CPU time reported by perf with this stack: ``` rocksdb::MergingIterator::Next rocksdb::DBIter::FindNextUserEntryInternal rocksdb::DBIter::Seek ``` I think what's happening is: 1. we create a snapshot iterator, 2. we do lots of Put()s for the same key x; this creates lots of entries in memtable, 3. we seek the iterator to a key slightly smaller than x, 4. the seek walks over lots of entries in memtable for key x, skipping them because of high sequence numbers. CC IslamAbdelRahman Closes https://github.com/facebook/rocksdb/pull/1413 Differential Revision: D4083879 Pulled By: IslamAbdelRahman fbshipit-source-id: a83ddae	2016-11-28 10:24:11 -08:00
Siying Dong	cd7c4143d7	Improve Write Stalling System Summary: The current write stalling system lacks positive feedback if the restricted rate is already too low. Users sometimes get stuck at a very low slowdown value. With this diff, we add positive feedback (increasing the slowdown value) if we recover from the slowdown state back to normal. To keep the positive feedback from making the slowdown value too high, we issue negative feedback every time we are close to the stop condition. Experiments show it is easier to reach a relative balance than before. Also increase the level0_stop_writes_trigger default from 24 to 32. Since the level0_slowdown_writes_trigger default is 20, a stop trigger of 24 gives only four files as buffer time to slow down writes. In order to avoid stopping within four files once 20 files have accumulated, the slowdown value must be very low, which is almost the same as a stop. It also doesn't give enough time for the slowdown value to converge. Increasing it to 32 will smooth out the system. Closes https://github.com/facebook/rocksdb/pull/1562 Differential Revision: D4218519 Pulled By: siying fbshipit-source-id: 95e4088	2016-11-23 09:24:15 -08:00
Yi Wu	dfb6fe6755	Unified InlineSkipList::Insert algorithm with hinting Summary: This PR is based on nbronson's diff with small modifications to wire it up with the existing interface. Compared to the previous version, this approach works better for inserting keys in decreasing order or updating the same key, and imposes fewer restrictions on the prefix extractor. ---- Summary from original diff ---- This diff introduces a single InlineSkipList::Insert that unifies the existing sequential insert optimization (prev_), concurrent insertion, and insertion using externally-managed insertion point hints. There's a deep symmetry between insertion hints (cursors) and the concurrent algorithm. In both cases we have partial information from the recent past that is likely but not certain to be accurate. This diff introduces the struct InlineSkipList::Splice, which encodes predecessor and successor information in the same form that was previously only used within a single call to InsertConcurrently. Splice holds information about an insertion point that can be used to levera Closes https://github.com/facebook/rocksdb/pull/1561 Differential Revision: D4217283 Pulled By: yiwu-arbug fbshipit-source-id: 33ee437	2016-11-22 14:09:13 -08:00
Andrew Kryczka	734e4acafb	Eliminate redundant cache lookup with range deletion Summary: When we introduced range deletion block, TableCache::Get() and TableCache::NewIterator() each did two table cache lookups, one for range deletion block iterator and another for getting the table reader to which the Get()/NewIterator() is delegated. This extra cache lookup was very CPU-intensive (about 10% overhead in a read-heavy benchmark). We can avoid it by reusing the Cache::Handle created for range deletion block iterator to get the file reader. Closes https://github.com/facebook/rocksdb/pull/1537 Differential Revision: D4201167 Pulled By: ajkr fbshipit-source-id: d33ffd8	2016-11-21 21:24:11 -08:00
Maysam Yabandeh	182b940e70	Add WriteOptions.no_slowdown Summary: If the WriteOptions.no_slowdown flag is set AND we need to wait or sleep for the write request, then fail immediately with Status::Incomplete(). Closes https://github.com/facebook/rocksdb/pull/1527 Differential Revision: D4191405 Pulled By: maysamyabandeh fbshipit-source-id: 7f3ce3f	2016-11-21 18:09:13 -08:00
Karthikeyan Radhakrishnan	4118e13330	Persistent Cache: Expose stats to user via public API Summary: Exposing persistent cache stats (counters) to the user via public API. Closes https://github.com/facebook/rocksdb/pull/1485 Differential Revision: D4155274 Pulled By: siying fbshipit-source-id: 30a9f50	2016-11-21 17:39:13 -08:00
Andrew Kryczka	fd43ee09da	Range deletion microoptimizations Summary: - Made RangeDelAggregator's InternalKeyComparator member a reference-to-const so we don't need to copy-construct it. Also added InternalKeyComparator to ImmutableCFOptions so we don't need to construct one for each DBIter. - Made MemTable::NewRangeTombstoneIterator and the table readers' NewRangeTombstoneIterator() functions return nullptr instead of NewEmptyInternalIterator to avoid the allocation. Updated callers accordingly. Closes https://github.com/facebook/rocksdb/pull/1548 Differential Revision: D4208169 Pulled By: ajkr fbshipit-source-id: 2fd65cf	2016-11-21 12:24:13 -08:00
Andrew Kryczka	fe349db57b	Remove Arena in RangeDelAggregator Summary: The Arena construction/destruction introduced significant overhead to read-heavy workload just by creating empty vectors for its blocks, so avoid it in RangeDelAggregator. Closes https://github.com/facebook/rocksdb/pull/1547 Differential Revision: D4207781 Pulled By: ajkr fbshipit-source-id: 9d1c130	2016-11-19 14:24:12 -08:00
Andrew Kryczka	3f62215210	Lazily initialize RangeDelAggregator's map and pinning manager Summary: Since a RangeDelAggregator is created for each read request, these heap-allocating member variables were consuming significant CPU (~3% total) which slowed down request throughput. The map and pinning manager are only necessary when range deletions exist, so we can defer their initialization until the first range deletion is encountered. Currently lazy initialization is done for reads only since reads pass us a single snapshot, which is easier to store on the stack for later insertion into the map than the vector passed to us by flush or compaction. Note the Arena member variable is still expensive, I will figure out what to do with it in a subsequent diff. It cannot be lazily initialized because we currently use this arena even to allocate empty iterators, which is necessary even when no range deletions exist. Closes https://github.com/facebook/rocksdb/pull/1539 Differential Revision: D4203488 Pulled By: ajkr fbshipit-source-id: 3b36279	2016-11-18 17:09:11 -08:00
Andrew Kryczka	635a7bd1ad	refactor TableCache Get/NewIterator for single exit points Summary: these functions were too complicated to change with exit points everywhere, so refactored them. btw, please review urgently, this is a prereq to fix the 5.0 perf regression Closes https://github.com/facebook/rocksdb/pull/1534 Differential Revision: D4198972 Pulled By: ajkr fbshipit-source-id: 04ebfb7	2016-11-17 14:39:13 -08:00
Siying Dong	a4eb7387b2	Allow plain table to store index on file with bloom filter disabled Summary: Currently plain table bloom filter is required if storing metadata on file. Remove the constraint. Closes https://github.com/facebook/rocksdb/pull/1525 Differential Revision: D4190977 Pulled By: siying fbshipit-source-id: be60442	2016-11-17 11:09:13 -08:00

2979 Commits