rocksdb

Author	SHA1	Message	Date
Siying Dong	51ac91f586	Histogram of number of merge operands Summary: Add a histogram in statistics to help users understand how many merge operands they merge. Closes https://github.com/facebook/rocksdb/pull/2373 Differential Revision: D5139983 Pulled By: siying fbshipit-source-id: 61b9ba8ca83f358530a4833d68f0103b56a0e182	2017-05-31 07:41:44 -07:00
Tamir Duberstein	103d0692ea	Avoid unsupported attributes when not building with UBSAN Summary: yiwu-arbug see individual commits. Closes https://github.com/facebook/rocksdb/pull/2318 Differential Revision: D5141520 Pulled By: yiwu-arbug fbshipit-source-id: 7987c92ab4461eef36afce5a133d3a0ee0c96300	2017-05-30 11:13:01 -07:00
赵星宇	d03c34497c	update comment of GetNextFile Summary: Closes https://github.com/facebook/rocksdb/pull/2377 Differential Revision: D5141274 Pulled By: lightmark fbshipit-source-id: c237a285b73ad93488c080ea80c71a29a17f1be0	2017-05-26 15:12:13 -07:00
Aaron Gao	f7bb1a0060	support merge and delete in file ingestion Summary: Previously sst_file_writer only supports kTypeValue, we need kTypeMerge and kTypeDeletion also as user requested. Closes https://github.com/facebook/rocksdb/pull/2361 Differential Revision: D5139402 Pulled By: lightmark fbshipit-source-id: 092a60756d01692539d817a3765ebfd58a8d7f88	2017-05-26 12:11:21 -07:00
Sagar Vemuri	7bb1f5d483	Increase of compaction threads should be logged at info level instead of a warning Summary: This log message shouldn't be a warning; some services are seeing high warning count due to this. The count for the below line is a few hundreds of millions, as per Logview: ``` [rocksdb/src/db/column_family.cc:729] [checkpoints] Increasing compaction threads because we have 2 level-0 files ``` Closes https://github.com/facebook/rocksdb/pull/2364 Differential Revision: D5123565 Pulled By: sagar0 fbshipit-source-id: a07ce499a4f82f0ebde9cda9f4948fb9df6a734c	2017-05-26 09:56:13 -07:00
Andrew Kryczka	a99fb9928f	fix column_family_test asan Summary: stop calling Close() at the end of tests holding a compaction pressure token since it causes the write controller to be deleted while it's still needed. these calls were pointless anyways since Close() is already called in the test's destructor. Closes https://github.com/facebook/rocksdb/pull/2367 Differential Revision: D5125906 Pulled By: ajkr fbshipit-source-id: 6cad8673e5546a82ff602ac0ba59cc3f68dbde46	2017-05-24 16:41:51 -07:00
Andrew Kryczka	bb01c1880c	Introduce max_background_jobs mutable option Summary: - `max_background_flushes` and `max_background_compactions` are still supported for backwards compatibility - `base_background_compactions` is completely deprecated. Now we just throttle to one background compaction when there's no pressure. - `max_background_jobs` is added to automatically partition the concurrent background jobs into flushes vs compactions. Currently it's very simple as we just allocate one-fourth of the jobs to flushes, and the remaining can be used for compactions. - The test cases that set `base_background_compactions > 1` needed to be updated. I just grab the pressure token such that the desired number of compactions can be scheduled. Closes https://github.com/facebook/rocksdb/pull/2205 Differential Revision: D4937461 Pulled By: ajkr fbshipit-source-id: df52cbbd497e13bbc9a60560a5ac2a2526b3f1f9	2017-05-24 11:29:08 -07:00
Siying Dong	41cbb72749	options.delayed_write_rate use the rate of rate_limiter by default. Summary: It's hard for RocksDB to come up with a good default of delayed write rate. Use rate given by rate limiter if it is availalbe. This provides the I/O order of magnitude. Closes https://github.com/facebook/rocksdb/pull/2357 Differential Revision: D5115324 Pulled By: siying fbshipit-source-id: 341065ad2211c981fc804011c0f0e59a50c7e754	2017-05-24 09:58:24 -07:00
Igor Canadi	52d9e5f7b6	Fix column family seconds_up accounting Summary: `cf_stats_snapshot_.seconds_up` appears to be never updated, unlike `db_stats_snapshot_.seconds_up`, which is updated here: https://github.com/facebook/rocksdb/blob/master/db/internal_stats.cc#L883 This leads to wrong information in the log, for example: ``` Compaction Stats [default] .... Uptime(secs): 85591.2 total, 85591.2 interval ``` Even though DB's interval is correctly logged as 60 seconds: ``` DB Stats Uptime(secs): 85591.2 total, 637.8 interval ``` Closes https://github.com/facebook/rocksdb/pull/2338 Differential Revision: D5114131 Pulled By: sagar0 fbshipit-source-id: 85243a38213236ccbb601a7f7aaa8865eaa8083c	2017-05-23 17:14:04 -07:00
Sagar Vemuri	7d8207f1f2	Fix errors in clang-analyzer builds Summary: Fix build error in db_iter.cc when running clang-analyzer. ``` CC db/db_iter.o db/db_iter.cc:938:21: error: no matching constructor for initialization of 'rocksdb::ParsedInternalKey' ParsedInternalKey ikey(Slice(), 0, 0); ^ ~~~~~~~~~~~~~ ./db/dbformat.h:84:3: note: candidate constructor not viable: no known conversion from 'int' to 'rocksdb::ValueType' for 3rd argument ParsedInternalKey(const Slice& u, const SequenceNumber& seq, ValueType t) ^ ./db/dbformat.h:78:8: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 3 were provided struct ParsedInternalKey { ^ ./db/dbformat.h:78:8: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 3 were provided ./db/dbformat.h:83:3: note: candidate constructor not viable: requires 0 arguments, but 3 were provided ParsedInternalKey() { } // Intentionally left uninitialized (for speed) ^ 1 error generated. ``` Closes https://github.com/facebook/rocksdb/pull/2354 Differential Revision: D5115751 Pulled By: sagar0 fbshipit-source-id: b0e386d4e935e4725b07761c3ca5f7a8cbde3692	2017-05-23 15:11:42 -07:00
Andrew Kryczka	6cc9aef162	New API for background work in single thread pool Summary: Previously users could set `max_background_flushes=0` to force rocksdb to use a single thread pool for both background flushes and compactions. That'll no longer be possible since I'm going to deprecate `max_background_flushes` and `max_background_compactions` in favor of a single option. This diff introduces a new way to force a single thread pool: when high-pri pool has zero threads, all background jobs will be submitted to low-pri pool. Note the majority of the code change is adding `Env::GetBackgroundThreads()`, which is necessary to check whether the user has provided a zero-sized thread pool. Closes https://github.com/facebook/rocksdb/pull/2204 Differential Revision: D4936256 Pulled By: ajkr fbshipit-source-id: 929a07a0c0705f7766f5339cd013ff74e90d6e01	2017-05-23 11:12:27 -07:00
Yi Wu	9d0a07ed52	Fix rocksdb.estimate-num-keys DB property underflow Summary: rocksdb.estimate-num-keys is compute from `estimate_num_keys - 2 * estimate_num_deletes`. If `2 * estimate_num_deletes > estimate_num_keys` it will underflow. Fixing it. Closes https://github.com/facebook/rocksdb/pull/2348 Differential Revision: D5109272 Pulled By: yiwu-arbug fbshipit-source-id: e1bfb91346a59b7282a282b615002507e9d7c246	2017-05-23 10:42:59 -07:00
Aaron Gao	3e86c0f07c	disable direct reads for log and manifest and add direct io to tests Summary: Disable direct reads for log and manifest. Direct reads should not affect sequential_file Also add kDirectIO for option_config_ in db_test_util Closes https://github.com/facebook/rocksdb/pull/2337 Differential Revision: D5100261 Pulled By: lightmark fbshipit-source-id: 0ebfd13b93fa1b8f9acae514ac44f8125a05868b	2017-05-22 18:41:28 -07:00
Sagar Vemuri	228f49d20a	Fix data races caught by tsan Summary: This fixes the tsan build failures in: - write_callback_test - persistent_cache_test.* Closes https://github.com/facebook/rocksdb/pull/2339 Differential Revision: D5101190 Pulled By: sagar0 fbshipit-source-id: 537e19ed05272b1f34cfbf793aa822b2264a1643	2017-05-22 10:27:23 -07:00
Yi Wu	07bdcb91fe	New WriteImpl to pipeline WAL/memtable write Summary: PipelineWriteImpl is an alternative approach to WriteImpl. In WriteImpl, only one thread is allow to write at the same time. This thread will do both WAL and memtable writes for all write threads in the write group. Pending writers wait in queue until the current writer finishes. In the pipeline write approach, two queue is maintained: one WAL writer queue and one memtable writer queue. All writers (regardless of whether they need to write WAL) will still need to first join the WAL writer queue, and after the house keeping work and WAL writing, they will need to join memtable writer queue if needed. The benefit of this approach is that 1. Writers without memtable writes (e.g. the prepare phase of two phase commit) can exit write thread once WAL write is finish. They don't need to wait for memtable writes in case of group commit. 2. Pending writers only need to wait for previous WAL writer finish to be able to join the write thread, instead of wait also for previous memtable writes. Merging #2056 and #2058 into this PR. Closes https://github.com/facebook/rocksdb/pull/2286 Differential Revision: D5054606 Pulled By: yiwu-arbug fbshipit-source-id: ee5b11efd19d3e39d6b7210937b11cefdd4d1c8d	2017-05-19 14:26:42 -07:00
Yi Wu	d746aead1a	Suppress clang-analyzer false positive Summary: Fixing two types of clang-analyzer false positives: * db is deleted and then reopen, and clang-analyzer thinks we are reusing the pointer after it has been deleted. Adding asserts to hint clang-analyzer the pointer is recreated. * ParsedInternalKey is (intentionally) uninitialized. Initialize the struct only when clang-analyzer is running. Closes https://github.com/facebook/rocksdb/pull/2334 Differential Revision: D5093801 Pulled By: yiwu-arbug fbshipit-source-id: f51355382098eb3da5ab9f64e094c6d03e6bdf7d	2017-05-19 10:56:28 -07:00
Siying Dong	217b866f47	column_family_test: EnvCounter::num_new_writable_file_ to be atomic Summary: TSAN shows warning of data race of EnvCounter::num_new_writable_file_. Make it atomic. Closes https://github.com/facebook/rocksdb/pull/2331 Differential Revision: D5089215 Pulled By: siying fbshipit-source-id: 15f6dcfb770a3310cbb6337c22482c8b330daffc	2017-05-18 13:56:12 -07:00
yizhu.sun	f5ba131bf8	Fixed some spelling mistakes Summary: Closes https://github.com/facebook/rocksdb/pull/2314 Differential Revision: D5079601 Pulled By: sagar0 fbshipit-source-id: ae5696fd735718f544435c64c3179c49b8c04349	2017-05-17 23:12:36 -07:00
hyunwoo	0ebdd70579	fixed typo Summary: fixed typo Closes https://github.com/facebook/rocksdb/pull/2312 Differential Revision: D5079631 Pulled By: sagar0 fbshipit-source-id: e4c8d1d89b244ee69e9dea1dd013227cc5241026	2017-05-17 16:41:49 -07:00
Mikhail Antonov	ba685a472a	Support ingest_behind for IngestExternalFile Summary: First cut for early review; there are few conceptual points to answer and some code structure issues. For conceptual points - - restriction-wise, we're going to disallow ingest_behind if (use_seqno_zero_out=true \|\| disable_auto_compaction=false), the user is responsible to properly open and close DB with required params - we wanted to ingest into reserved bottom most level. Should we fail fast if bottom level isn't empty, or should we attempt to ingest if file fits there key-ranges-wise? - Modifying AssignLevelForIngestedFile seems the place we we'd handle that. On code structure - going to refactor GenerateAndAddExternalFile call in the test class to allow passing instance of IngestionOptions, that's just going to incur lots of changes at callsites. Closes https://github.com/facebook/rocksdb/pull/2144 Differential Revision: D4873732 Pulled By: lightmark fbshipit-source-id: 81cb698106b68ef8797f564453651d50900e153a	2017-05-17 11:42:42 -07:00
boolean5	cb9392a094	add Transactions and Checkpoint to C API Summary: I've added functions to the C API to support Transactions as requested in #1637 and to support Checkpoint. I have also added the corresponding tests to c_test.c For now, the following is omitted: 1. Optimistic Transactions 2. The column family variation of functions Closes https://github.com/facebook/rocksdb/pull/2236 Differential Revision: D4989510 Pulled By: yiwu-arbug fbshipit-source-id: 518cb39f76d5e9ec9690d633fcdc014b98958071	2017-05-16 22:59:43 -07:00
hyunwoo	f720796e24	fixed typo Summary: fixed exisitng -> existing Closes https://github.com/facebook/rocksdb/pull/2305 Differential Revision: D5070169 Pulled By: yiwu-arbug fbshipit-source-id: 8c8450acf50757b767cf78b78314018395738d96	2017-05-16 11:07:58 -07:00
siddontang	1ca723dbd1	C API: support pinnable get Summary: Closes https://github.com/facebook/rocksdb/pull/2254 Differential Revision: D5053590 Pulled By: yiwu-arbug fbshipit-source-id: 2f365a031b3a2947b4fba21d26d4f8f52af9b9f0	2017-05-16 11:07:58 -07:00
赵星宇	4f9e69ccf4	fix log err Summary: Closes https://github.com/facebook/rocksdb/pull/2206 Differential Revision: D5054222 Pulled By: yiwu-arbug fbshipit-source-id: d8742bda1bf3e76d7b68eeb86df4608031b5cbc8	2017-05-15 16:15:38 -07:00
Yi Wu	3907c94ffb	Fix ColumnFamilyTest:BulkAddDrop Summary: Fix ColumnFamilyTest:BulkAddDrop not deleted CF handles at the end, causing ASAN failure. Closes https://github.com/facebook/rocksdb/pull/2275 Differential Revision: D5040724 Pulled By: yiwu-arbug fbshipit-source-id: 86cd4070c944d01173a3cc36462bb800698af192	2017-05-10 23:05:44 -07:00
Anirban Rahut	d85ff4953c	Blob storage pr Summary: The final pull request for Blob Storage. Closes https://github.com/facebook/rocksdb/pull/2269 Differential Revision: D5033189 Pulled By: yiwu-arbug fbshipit-source-id: 6356b683ccd58cbf38a1dc55e2ea400feecd5d06	2017-05-10 15:14:44 -07:00
Aaron Gao	492fc49a86	fix readampbitmap tests Summary: fix test failure of ReadAmpBitmap and ReadAmpBitmapLiveInCacheAfterDBClose. test ReadAmpBitmapLiveInCacheAfterDBClose individually and make check Closes https://github.com/facebook/rocksdb/pull/2271 Differential Revision: D5038133 Pulled By: lightmark fbshipit-source-id: 803cd6f45ccfdd14a9d9473c8af311033e164be8	2017-05-10 12:29:23 -07:00
Andrew Kryczka	be421b0b16	portable sched_getcpu calls Summary: - added a feature test in build_detect_platform to check whether sched_getcpu() is available. glibc offers it only on some platforms (e.g., linux but not mac); this way should be easier than maintaining a list of platforms on which it's available. - refactored PhysicalCoreID() to be simpler / less repetitive. ordered the conditional compilation clauses from most-to-least preferred Closes https://github.com/facebook/rocksdb/pull/2272 Differential Revision: D5038093 Pulled By: ajkr fbshipit-source-id: 81d7db3cc620250de220bdeb3194b2b3d7673de7	2017-05-10 12:29:23 -07:00
Aaron Gao	259a00eaca	unbiase readamp bitmap Summary: Consider BlockReadAmpBitmap with bytes_per_bit = 32. Suppose bytes [a, b) were used, while bytes [a-32, a) and [b+1, b+33) weren't used; more formally, the union of ranges passed to BlockReadAmpBitmap::Mark() contains [a, b) and doesn't intersect with [a-32, a) and [b+1, b+33). Then bits [floor(a/32), ceil(b/32)] will be set, and so the number of useful bytes will be estimated as (ceil(b/32) - floor(a/32)) * 32, which is on average equal to b-a+31. An extreme example: if we use 1 byte from each block, it'll be counted as 32 bytes from each block. It's easy to remove this bias by slightly changing the semantics of the bitmap. Currently each bit represents a byte range [i32, (i+1)32). This diff makes each bit represent a single byte: i32 + X, where X is a random number in [0, 31] generated when bitmap is created. So, e.g., if you read a single byte at random, with probability 31/32 it won't be counted at all, and with probability 1/32 it will be counted as 32 bytes; so, on average it's counted as 1 byte. But there is one exception: the last bit will always set with the old way.* (*) - assuming read_amp_bytes_per_bit = 32. Closes https://github.com/facebook/rocksdb/pull/2259 Differential Revision: D5035652 Pulled By: lightmark fbshipit-source-id: bd98b1b9b49fbe61f9e3781d07f624e3cbd92356	2017-05-10 01:49:52 -07:00
Islam AbdelRahman	4897eb250b	dont skip IO for filter blocks Summary: Based on my experience with linkbench, We should not skip loading bloom filter blocks when they are not available in block cache when using Iterator::Seek Actually I am not sure why this behavior existed in the first place Closes https://github.com/facebook/rocksdb/pull/2255 Differential Revision: D5010721 Pulled By: maysamyabandeh fbshipit-source-id: 0af545a06ac4baeecb248706ec34d009c2480ca4	2017-05-09 09:52:02 -07:00
Changjian Gao	3f73d54bbd	Add C API to set max_file_opening_threads option Summary: Add `rocksdb_options_set_max_file_opening_threads()` API Closes https://github.com/facebook/rocksdb/pull/2184 Differential Revision: D4923090 Pulled By: lightmark fbshipit-source-id: c4ddce17733d999d426d02f7202b33a46ed6faed	2017-05-08 22:49:32 -07:00
Yi Wu	2cd00773c7	Add bulk create/drop column family API Summary: Adding DB::CreateColumnFamilie() and DB::DropColumnFamilies() to bulk create/drop column families. This is to address the problem creating/dropping 1k column families takes minutes. The bottleneck is we persist options files for every single column family create/drop, and it parses the persisted options file for verification, which take a lot CPU time. The new APIs simply create/drop column families individually, and persist options file once at the end. This improves create 1k column families to within ~0.1s. Further improvement can be merge manifest write to one IO. Closes https://github.com/facebook/rocksdb/pull/2248 Differential Revision: D5001578 Pulled By: yiwu-arbug fbshipit-source-id: d4e00bda671451e0b314c13e12ad194b1704aa03	2017-05-07 23:20:46 -07:00
Tamir Duberstein	fdaefa0309	travis: add Windows cross-compilation Summary: - downcase includes for case-sensitive filesystems - give targets the same name (librocksdb) on all platforms With this patch it is possible to cross-compile RocksDB for Windows from a Linux host using mingw. cc yuslepukhin orgads Closes https://github.com/facebook/rocksdb/pull/2107 Differential Revision: D4849784 Pulled By: siying fbshipit-source-id: ad26ed6b4d393851aa6551e6aa4201faba82ef60	2017-05-05 23:20:01 -07:00
Aaron Gao	a30a696034	do not read next datablock if upperbound is reached Summary: Now if we have iterate_upper_bound set, we continue read until get a key >= upper_bound. For a lot of cases that neighboring data blocks have a user key gap between them, our index key will be a user key in the middle to get a shorter size. For example, if we have blocks: [a b c d][f g h] Then the index key for the first block will be 'e'. then if upper bound is any key between 'd' and 'e', for example, d1, d2, ..., d99999999999, we don't have to read the second block and also know that we have done our iteration by reaching the last key that smaller the upper bound already. This diff can reduce RA in most cases. Closes https://github.com/facebook/rocksdb/pull/2239 Differential Revision: D4990693 Pulled By: lightmark fbshipit-source-id: ab30ea2e3c6edf3fddd5efed3c34fcf7739827ff	2017-05-05 23:20:01 -07:00
Aaron Gao	2d42cf5ea9	Roundup read bytes in ReadaheadRandomAccessFile Summary: Fix alignment in ReadaheadRandomAccessFile Closes https://github.com/facebook/rocksdb/pull/2253 Differential Revision: D5012336 Pulled By: lightmark fbshipit-source-id: 10d2c829520cb787227ef653ef63d5d701725778	2017-05-05 12:14:14 -07:00
Siying Dong	264d3f540c	Allow IntraL0 compaction in FIFO Compaction Summary: Allow an option for users to do some compaction in FIFO compaction, to pay some write amplification for fewer number of files. Closes https://github.com/facebook/rocksdb/pull/2163 Differential Revision: D4895953 Pulled By: siying fbshipit-source-id: a1ab608dd0627211f3e1f588a2e97159646e1231	2017-05-04 18:16:13 -07:00
Andrew Kryczka	8c3a180e83	Set lower-bound on dynamic level sizes Summary: Changed dynamic leveling to stop setting the base level's size bound below `max_bytes_for_level_base`. Behavior for config where `max_bytes_for_level_base == level0_file_num_compaction_trigger * write_buffer_size` and same amount of data in L0 and base-level: - Before #2027, compaction scoring would favor base-level due to dividing by size smaller than `max_bytes_for_level_base`. - After #2027, L0 and Lbase get equal scores. The disadvantage is L0 is often compacted before reaching the num files trigger since `write_buffer_size` can be bigger than the dynamically chosen base-level size. This increases write-amp. - After this diff, L0 and Lbase still get equal scores. Now it takes `level0_file_num_compaction_trigger` files of size `write_buffer_size` to trigger L0 compaction by size, fixing the write-amp problem above. Closes https://github.com/facebook/rocksdb/pull/2123 Differential Revision: D4861570 Pulled By: ajkr fbshipit-source-id: 467ddef56ed1f647c14d86bb018bcb044c39b964	2017-05-04 18:16:12 -07:00
Andrew Kryczka	7c1c8ce5ac	Avoid calling fallocate with UINT64_MAX Summary: When user doesn't set a limit on compaction output file size, let's use the sum of the input files' sizes. This will avoid passing UINT64_MAX as fallocate()'s length. Reported in #2249. Test setup: - command: `TEST_TMPDIR=/data/rocksdb-test/ strace -e fallocate ./db_compaction_test --gtest_filter=DBCompactionTest.ManualCompactionUnknownOutputSize` - filesystem: xfs before this diff: `fallocate(10, 01, 0, 1844674407370955160) = -1 ENOSPC (No space left on device)` after this diff: `fallocate(10, 01, 0, 1977) = 0` Closes https://github.com/facebook/rocksdb/pull/2252 Differential Revision: D5007275 Pulled By: ajkr fbshipit-source-id: 4491404a6ae8a41328aede2e2d6f4d9ac3e38880	2017-05-04 17:43:22 -07:00
Leonidas Galanis	a45e98a5b5	max_open_files dynamic set, follow up Summary: Followup to make 0x40000 a TableCache constant that indicates infinite capacity Closes https://github.com/facebook/rocksdb/pull/2247 Differential Revision: D5001349 Pulled By: lgalanis fbshipit-source-id: ce7bd2e54b0975bb9f8680fdaa0f8bb0e7ae81a2	2017-05-04 10:42:45 -07:00
Leonidas Galanis	e7ae4a3a02	Max open files mutable Summary: Makes max_open_files db option dynamically set-able by SetDBOptions. During the call of SetDBOptions we call SetCapacity on the table cache, which is a LRUCache. Closes https://github.com/facebook/rocksdb/pull/2185 Differential Revision: D4979189 Pulled By: yiwu-arbug fbshipit-source-id: ca7e8dc5e3619c79434f579be4847c0f7e56afda	2017-05-03 21:13:14 -07:00
siddontang	b551104e04	support PopSavePoint for WriteBatch Summary: Try to fix https://github.com/facebook/rocksdb/issues/1969 Closes https://github.com/facebook/rocksdb/pull/2170 Differential Revision: D4907333 Pulled By: yiwu-arbug fbshipit-source-id: 417b420ff668e6c2fd0dad42a94c57385012edc5	2017-05-03 10:57:45 -07:00
Siying Dong	af6fe69e4c	Fix an issue of manual / auto compaction data race Summary: A data race between a manual and an auto compaction can cause a scheduled automatic compaction to be cancelled and never rescheduled again. This may cause a condition of hanging forever. Fix this by always making sure the cancelled compaction is put back to the compaction queue. Closes https://github.com/facebook/rocksdb/pull/2238 Differential Revision: D4984591 Pulled By: siying fbshipit-source-id: 3ab153886403c7b991896dcb2158b96cac12f227	2017-05-02 15:11:59 -07:00
Siying Dong	aeaba07b2a	Remove an assert that causes TSAN failure. Summary: ColumnFamilyData::ConstructNewMemtable is called out of DB mutex, and it asserts current_ is not empty, but current_ should only be accessed inside DB mutex. Remove this assert to make TSAN happy. Closes https://github.com/facebook/rocksdb/pull/2235 Differential Revision: D4978531 Pulled By: siying fbshipit-source-id: 423685a7dae88ed3faaa9e1b9ccb3427ac704a4b	2017-05-01 16:35:15 -07:00
Siying Dong	d616ebea23	Add GPLv2 as an alternative license. Summary: Closes https://github.com/facebook/rocksdb/pull/2226 Differential Revision: D4967547 Pulled By: siying fbshipit-source-id: dd3b58ae1e7a106ab6bb6f37ab5c88575b125ab4	2017-04-27 18:06:12 -07:00
Aaron Gao	0ca3ead0cb	add GetRootDB() in DeleteFilesInRange Summary: In case users cast a subclass of db* into dbimpl* Closes https://github.com/facebook/rocksdb/pull/2222 Differential Revision: D4964486 Pulled By: lightmark fbshipit-source-id: 0ccdc08ee8e7a193dfbbe0218c3cbfd795662ca1	2017-04-27 14:33:17 -07:00
Dmitri Smirnov	cdad04b051	Remove double buffering on RandomRead on Windows. Summary: Remove double buffering on RandomRead on Windows. With more logic appear in file reader/write Read no longer obeys forwarding calls to Windows implementation. Previously direct_io (unbuffered) was only available on Windows but now is supported as generic. We remove intermediate buffering on Windows. Remove random_access_max_buffer_size option which was windows specific. Non-zero values for that opton introduced unnecessary lock contention. Remove Env::EnableReadAhead(), Env::ShouldForwardRawRequest() that are no longer necessary. Add aligned buffer reads for cases when requested reads exceed read ahead size. Closes https://github.com/facebook/rocksdb/pull/2105 Differential Revision: D4847770 Pulled By: siying fbshipit-source-id: 8ab48f8e854ab498a4fd398a6934859792a2788f	2017-04-27 12:30:05 -07:00
Siying Dong	e15382c09c	Disable two flaky tests Summary: Closes https://github.com/facebook/rocksdb/pull/2217 Differential Revision: D4959351 Pulled By: siying fbshipit-source-id: ce7c3a430bae0d15e06b3d5c958ebce969d08564	2017-04-26 17:13:46 -07:00
Andrew Kryczka	efc361ef7d	Add user stats Reset API Summary: It resets all the ticker and histogram stats to zero. Needed to change the locking a bit since Reset() is the only operation that manipulates multiple tickers/histograms together, and that operation should be seen as atomic by other operations that access tickers/histograms. Closes https://github.com/facebook/rocksdb/pull/2213 Differential Revision: D4952232 Pulled By: ajkr fbshipit-source-id: c0475c3e4c7b940120d53891b69c3091149a0679	2017-04-26 15:57:01 -07:00
Andrew Kryczka	f6a27d0bce	Extract statistics tests into separate file Summary: I'm going to add more DB tests for statistics as currently we have very few. I started a file dedicated to this purpose and moved the existing stats-specific tests there. Closes https://github.com/facebook/rocksdb/pull/2211 Differential Revision: D4951558 Pulled By: ajkr fbshipit-source-id: 05d11c35079c40ecabdfd2cf5556ccb761f694a4	2017-04-26 14:47:23 -07:00
Aaron Gao	7eddecce12	support bulk loading with universal compaction Summary: Support buck load with universal compaction. More test cases to be added. Closes https://github.com/facebook/rocksdb/pull/2202 Differential Revision: D4935360 Pulled By: lightmark fbshipit-source-id: cc3ca1b6f42faa503207dab1408d6bcf393ee5b5	2017-04-26 13:41:32 -07:00
Aaron Gao	72c21fb3f2	call GetRootDB() before cast to DBImpl* in CancelAllBackgroundWork Summary: User could call this with wrapper class of DB or DBImpl Closes https://github.com/facebook/rocksdb/pull/2200 Differential Revision: D4935530 Pulled By: lightmark fbshipit-source-id: df9cb61d67d0f3bbcf62f714d77523a459a92883	2017-04-24 13:47:17 -07:00
Tomas Kolda	04d58970cb	AIX and Solaris Sparc Support Summary: Replacement of #2147 The change was squashed due to a lot of conflicts. Closes https://github.com/facebook/rocksdb/pull/2194 Differential Revision: D4929799 Pulled By: siying fbshipit-source-id: 5cd49c254737a1d5ac13f3c035f128e86524c581	2017-04-21 20:48:04 -07:00
Aaron Gao	cb885bccfe	set compaction_iterator earliest_snapshot to max if no snapshot Summary: It is a potential bug that will be triggered if we ingest files before inserting the first key into an empty db. 0 is a special value reserved to indicate the concept of non-existence. But not good for seqno in this case because 0 is a valid seqno for ingestion(bulk loading) Closes https://github.com/facebook/rocksdb/pull/2183 Differential Revision: D4919827 Pulled By: lightmark fbshipit-source-id: 237eea40f88bd6487b66806109d90065dc02c362	2017-04-20 19:56:59 -07:00
Andrew Kryczka	1dd7760513	Change L0 compaction score using level size Summary: The goal is to avoid the problem of small number of L0 files triggering compaction to base level (which increased write-amp), while still allowing L0 compaction-by-size (so intra-L0 compactions cause score to increase). Closes https://github.com/facebook/rocksdb/pull/2172 Differential Revision: D4908552 Pulled By: ajkr fbshipit-source-id: 4b170142b2b368e24bd7948b2a6f24c69fabf73d	2017-04-19 12:00:01 -07:00
Yi Wu	966ebb02f5	Hide event listeners from lite build Summary: Fixing lite build failure introduce by #2169. Closes https://github.com/facebook/rocksdb/pull/2174 Reviewed By: sagar0 Differential Revision: D4910619 Pulled By: yiwu-arbug fbshipit-source-id: 5213b7b7431cc258688793c8c28153025588d8d9	2017-04-18 17:26:19 -07:00
Siying Dong	c49d704656	Add DB:ResetStats() Summary: Add a function to allow users to reset internal stats without restarting the DB. Closes https://github.com/facebook/rocksdb/pull/2167 Differential Revision: D4907939 Pulled By: siying fbshipit-source-id: ab2dd85b88aabe9380da7485320a1d460d3e1f68	2017-04-18 16:56:48 -07:00
Yi Wu	0fcdccc33e	Blob storage helper methods Summary: Split out interfaces needed for blob storage from #1560, including * CompactionEventListener and OnFlushBegin listener interfaces. * Blob filename support. Closes https://github.com/facebook/rocksdb/pull/2169 Differential Revision: D4905463 Pulled By: yiwu-arbug fbshipit-source-id: 564e73448f1b7a367e5e46216a521e57ea9011b5	2017-04-18 12:42:38 -07:00
Yi Wu	e9e6e53247	Simplify write thread logic Summary: The concept about early exit in write thread implementation is a confusing one. It means that if early exit is allowed, batch group leader will not responsible to exit the batch group, but the last finished writer do. In case we need to mark log synced, or encounter memtable insert error, early exit is disallowed. This patch remove such a concept by: * In all cases, the last finished writer (not necessary leader) is responsible to exit batch group. * In case of parallel memtable write, leader will also mark log synced after memtable insert and before signal finish (call `CompleteParallelWorker()`). The purpose is to allow mark log synced (which require locking mutex) can run in parallel to memtable insert in other writers. * The last finish writer should handle memtable insert error (update bg_error_) before exiting batch group. Closes https://github.com/facebook/rocksdb/pull/2134 Differential Revision: D4869667 Pulled By: yiwu-arbug fbshipit-source-id: aec170847c85b90f4179d6a4608a4fe1361544e3	2017-04-13 16:12:04 -07:00
Aaron Gao	44fa8ece9b	change use_direct_writes to use_direct_io_for_flush_and_compaction Summary: Replace Options::use_direct_writes with Options::use_direct_io_for_flush_and_compaction Now if Options::use_direct_io_for_flush_and_compaction = true, we will enable direct io for both reads and writes for flush and compaction job. Whereas Options::use_direct_reads controls user reads like iterator and Get(). Closes https://github.com/facebook/rocksdb/pull/2117 Differential Revision: D4860912 Pulled By: lightmark fbshipit-source-id: d93575a8a5e780cf7e40797287edc425ee648c19	2017-04-13 16:12:04 -07:00
Igor Canadi	b6b9359ece	Fix BYTES_WRITTEN accounting Summary: BYTES_WRITTEN accounting doesn't work with disabled WAL. For example, this is what we get in the LOG: ``` Cumulative writes: 9794K writes, 228M keys, 9794K commit groups, 1.0 writes per commit group, ingest: 0.00 GB, 0.00 MB/s ``` WAL bytes are tracked in a different statistic: https://github.com/facebook/rocksdb/blob/master/db/internal_stats.h#L105. BYTES_WRITTEN should count all the writes. Closes https://github.com/facebook/rocksdb/pull/2133 Differential Revision: D4880615 Pulled By: yiwu-arbug fbshipit-source-id: 8fd0b223099f3f5ad7df79d4e737d313687fec69	2017-04-13 16:12:03 -07:00
Sagar Vemuri	6a6723ee1e	Move MergeOperatorPinning tests to be with other merge operator tests Summary: Moved MergeOperatorPinning tests from db_test2.cc to db_merge_operator_test.cc. [This is the same code as PR #2104 , which has already been reviewed, but I am creating a new PR as I cannot import from #2104 onto phabricator anymore even after rebasing. I'll close and discard #2104.] Closes https://github.com/facebook/rocksdb/pull/2125 Differential Revision: D4863312 Pulled By: sagar0 fbshipit-source-id: 0f71a7690aa09c1d03ee85ce2bc1d2d89e4f4399	2017-04-11 16:15:06 -07:00
Siying Dong	8f47a97512	File level histogram should be printed per CF, not per DB Summary: Currently level histogram is only printed out for DB stats and for default CF. This is confusing. Change to print for every CF instead. Closes https://github.com/facebook/rocksdb/pull/2126 Differential Revision: D4865373 Pulled By: siying fbshipit-source-id: 1c853e0ac66e00120ee931cabc9daf69ccc2d577	2017-04-11 08:42:03 -07:00
Manuel Ung	1f8b119ed6	Limit maximum memory used in the WriteBatch representation Summary: Extend TransactionOptions to include max_write_batch_size which determines the maximum size of the writebatch representation. If memory limit is exceeded, the operation will abort with subcode kMemoryLimit. Closes https://github.com/facebook/rocksdb/pull/2124 Differential Revision: D4861842 Pulled By: lth fbshipit-source-id: 46fd172ea67cc90bbba829bf0d70cfab2261c161	2017-04-10 15:42:26 -07:00
Maysam Yabandeh	20778f2f92	Adding comments to the write path Summary: also did minor refactoring Closes https://github.com/facebook/rocksdb/pull/2115 Differential Revision: D4855818 Pulled By: maysamyabandeh fbshipit-source-id: fbca6ac57e5c6677fffe8354f7291e596a50cb77	2017-04-10 12:43:34 -07:00
Sagar Vemuri	7124268a09	Reduce the number of params needed to construct DBIter Summary: DBIter, and in-turn NewDBIterator and NewArenaWrappedDBIterator, take a bunch of params. They can be reduced by passing in ReadOptions directly instead of passing in every new param separately. It also seems much cleaner as a bunch of the params towards the end seem to be optional. (Recently I introduced max_skippable_internal_keys, which added one more to the already huge count). Idea courtesy IslamAbdelRahman Closes https://github.com/facebook/rocksdb/pull/2116 Differential Revision: D4857128 Pulled By: sagar0 fbshipit-source-id: 7d239df094b94bd9ea79d145cdf825478ac037a8	2017-04-10 11:14:14 -07:00
Islam AbdelRahman	61730186df	dummy diff Summary: Closes https://github.com/facebook/rocksdb/pull/2114 Differential Revision: D4854860 Pulled By: IslamAbdelRahman fbshipit-source-id: b871c5b9ccc52d20f5ceacdd172dc70b1dbf9110	2017-04-07 17:07:37 -07:00
Ayappan	dd8f9e38e9	Fix compilation for GCC-5 Summary: Fixes this issue https://github.com/facebook/rocksdb/issues/2108 Closes https://github.com/facebook/rocksdb/pull/2109 Differential Revision: D4851965 Pulled By: yiwu-arbug fbshipit-source-id: 6ee807b	2017-04-07 10:54:12 -07:00
Siying Dong	ff97287016	Refactor compaction picker code Summary: 1. Move universal compaction picker to separate files compaction_picker_universal.cc and compaction_picker_universal.h. 2. Rename some functions to make the code easier to understand. 3. Move leveled compaction picking code to a dedicated class, so that we we don't need to pass some common variable around when calling functions. It also allowed us to break down LevelCompactionPicker::PickCompaction() to smaller functions. Closes https://github.com/facebook/rocksdb/pull/2100 Differential Revision: D4845948 Pulled By: siying fbshipit-source-id: efa0ab4	2017-04-06 20:09:34 -07:00
Sagar Vemuri	343b59d6ee	Move various string utility functions into string_util Summary: This is an effort to club all string related utility functions into one common place, in string_util, so that it is easier for everyone to know what string processing functions are available. Right now they seem to be spread out across multiple modules, like logging and options_helper. Check the sub-commits for easier reviewing. Closes https://github.com/facebook/rocksdb/pull/2094 Differential Revision: D4837730 Pulled By: sagar0 fbshipit-source-id: 344278a	2017-04-06 14:54:12 -07:00
Yi Wu	df6f5a3772	Move memtable related files into memtable directory Summary: Move memtable related files into memtable directory. Closes https://github.com/facebook/rocksdb/pull/2087 Differential Revision: D4829242 Pulled By: yiwu-arbug fbshipit-source-id: ca70ab6	2017-04-06 14:09:13 -07:00
Tamir Duberstein	107c5f6a60	CMake: more MinGW fixes Summary: siying this is a resubmission of #2081 with the 4th commit fixed. From that commit message: > Note that the previous use of quotes in PLATFORM_{CC,CXX}FLAGS was incorrect and caused GCC to produce the incorrect define: > > #define ROCKSDB_JEMALLOC -DJEMALLOC_NO_DEMANGLE 1 > > This was the cause of the Linux build failure on the previous version of this change. I've tested this locally, and the Linux build succeeds now. Closes https://github.com/facebook/rocksdb/pull/2097 Differential Revision: D4839964 Pulled By: siying fbshipit-source-id: cc51322	2017-04-06 14:09:13 -07:00
Siying Dong	d2dce5611a	Move some files under util/ to separate dirs Summary: Move some files under util/ to new directories env/, monitoring/ options/ and cache/ Closes https://github.com/facebook/rocksdb/pull/2090 Differential Revision: D4833681 Pulled By: siying fbshipit-source-id: 2fd8bef	2017-04-05 19:09:16 -07:00
Islam AbdelRahman	c50e3750dc	Use a human readable size for level report Summary: Current ``` Compaction Stats [default] Level Files Size(MB} Score Read(GB} Rn(GB} Rnp1(GB} Write(GB} Wnew(GB} Moved(GB} W-Amp Rd(MB/s} Wr(MB/s} Comp(sec} Comp(cnt} Avg(sec} KeyIn KeyDrop ---------------------------------------------------------------------------------------------------------------------------------------------------------- L0 2/0 49.02 0.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 76.1 1 2 0.322 0 0 Sum 2/0 49.02 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 76.1 1 2 0.322 0 0 Int 0/0 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 76.1 1 2 0.322 0 0 ``` New ``` Compaction Stats [default] Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn Key Closes https://github.com/facebook/rocksdb/pull/2055 Differential Revision: D4804576 Pulled By: IslamAbdelRahman fbshipit-source-id: 719be6a	2017-04-05 17:24:19 -07:00
Siying Dong	ce64b8b719	Divide db/db_impl.cc Summary: db_impl.cc is too large to manage. Divide db_impl.cc into db/db_impl.cc, db/db_impl_compaction_flush.cc, db/db_impl_files.cc, db/db_impl_open.cc and db/db_impl_write.cc. Closes https://github.com/facebook/rocksdb/pull/2095 Differential Revision: D4838188 Pulled By: siying fbshipit-source-id: c5f3059	2017-04-05 17:24:19 -07:00
Andrew Kryczka	d659faad54	Level-based L0->L0 compaction Summary: Level-based L0->L0 compaction operates on spans of files that aren't currently being compacted. It reduces the number of L0 files, thus making write stall conditions harder to reach. - L0->L0 is triggered when base level is unavailable due to pending compactions - L0->L0 always outputs one file of at most `max_level0_burst_file_size` bytes. - Subcompactions are disabled for L0->L0 since we want to output one file. - Input files are chosen as the longest span of available files that will fit within the size limit. This minimizes number of files in L0. Closes https://github.com/facebook/rocksdb/pull/2027 Differential Revision: D4760318 Pulled By: ajkr fbshipit-source-id: 9d07183	2017-04-04 18:09:11 -07:00
Siying Dong	43010a929f	Revert "[rocksdb][PR] CMake: more MinGW fixes" fbshipit-source-id: 43b4529	2017-04-04 16:24:26 -07:00
Tamir Duberstein	3450ac8c1b	CMake: more MinGW fixes Summary: See individual commits. yuslepukhin siying Closes https://github.com/facebook/rocksdb/pull/2081 Differential Revision: D4824639 Pulled By: IslamAbdelRahman fbshipit-source-id: 2fc2b00	2017-04-04 15:09:17 -07:00
Aaron Gao	90cfd46458	update IterKey that can get user key and internal key explicitly Summary: to void future bug that caused by the mix of userkey/internalkey Closes https://github.com/facebook/rocksdb/pull/2084 Differential Revision: D4825889 Pulled By: lightmark fbshipit-source-id: 28411db	2017-04-04 14:24:20 -07:00
Yi Wu	9e44531803	Refactor WriteImpl (pipeline write part 1) Summary: Refactor WriteImpl() so when I plug-in the pipeline write code (which is an alternative approach for WriteThread), some of the logic can be reuse. I split out the following methods from WriteImpl(): * PreprocessWrite() * HandleWALFull() (previous MaybeFlushColumnFamilies()) * HandleWriteBufferFull() * WriteToWAL() Also adding a constructor to WriteThread::Writer, and move WriteContext into db_impl.h. No real logic change in this patch. Closes https://github.com/facebook/rocksdb/pull/2042 Differential Revision: D4781014 Pulled By: yiwu-arbug fbshipit-source-id: d45ca18	2017-04-04 10:24:32 -07:00
Siying Dong	6ef8c620d3	Move auto_roll_logger and filename out of db/ Summary: It is confusing to have auto_roll_logger to stay under db/, which has nothing to do with database. Move filename together as it is a dependency. Closes https://github.com/facebook/rocksdb/pull/2080 Differential Revision: D4821141 Pulled By: siying fbshipit-source-id: ca7d768	2017-04-03 18:39:14 -07:00
Siying Dong	88cc81df5c	auto_roll_logger_test to move away from real sleep Summary: auto_roll_logger_test relies on timing conditon that some operations finish within 1 seconds. This caused flaky tests. Move away from real timing and sleep and use fake time to verify the time-based rolling. Closes https://github.com/facebook/rocksdb/pull/2066 Differential Revision: D4810647 Pulled By: siying fbshipit-source-id: c54d994	2017-04-03 11:39:09 -07:00
Nikhil Benesch	d25e28d584	replace sometimes-undefined uint type with unsigned int Summary: `uint` is nonstandard and not a built-in type on all compilers; replace it with the always-valid `unsigned int`. I assume this went unnoticed because it's inside an `#ifdef ROCKDB_JEMALLOC`. Closes https://github.com/facebook/rocksdb/pull/2075 Differential Revision: D4820427 Pulled By: ajkr fbshipit-source-id: 0876561	2017-04-03 11:39:09 -07:00
Andrew Kryczka	a1d7e487b3	Add L0 write-amp to compaction level stats Summary: Previously it always showed 0.0 for L0 write-amp because we were dividing by bytes read from non-output level. For L0, we should instead divide by bytes ingested to the DB. Note the numerator (bytes written to L0) includes flush bytes. Closes https://github.com/facebook/rocksdb/pull/2078 Differential Revision: D4816902 Pulled By: ajkr fbshipit-source-id: 7dca31a	2017-04-03 11:24:10 -07:00
Orgad Shaneh	6401a8b76b	Fix build with MinGW Summary: There still are many warnings (most of them about invalid printf format for long long), but it builds if FAIL_ON_WARNINGS is disabled. Closes https://github.com/facebook/rocksdb/pull/2052 Differential Revision: D4807355 Pulled By: siying fbshipit-source-id: ef03786	2017-03-30 16:54:52 -07:00
Sagar Vemuri	c6d04f2ecf	Option to fail a request as incomplete when skipping too many internal keys Summary: Operations like Seek/Next/Prev sometimes take too long to complete when there are many internal keys to be skipped. Adding an option, max_skippable_internal_keys -- which could be used to set a threshold for the maximum number of keys that can be skipped, will help to address these cases where it is much better to fail a request (as incomplete) than to wait for a considerable time for the request to complete. This feature -- to fail an iterator seek request as incomplete, is disabled by default when max_skippable_internal_keys = 0. It is enabled only when max_skippable_internal_keys > 0. This feature is based on the discussion mentioned in the PR https://github.com/facebook/rocksdb/pull/1084. Closes https://github.com/facebook/rocksdb/pull/2000 Differential Revision: D4753223 Pulled By: sagar0 fbshipit-source-id: 1c973f7	2017-03-30 12:09:21 -07:00
Siying Dong	67d7623794	Expose the stalling information through DB::GetProperty() Summary: Add two DB properties: rocksdb.actual_delayed_write_rate and rocksdb.is_write_stooped, for people to know whether current writes are being throttled. Closes https://github.com/facebook/rocksdb/pull/2043 Differential Revision: D4782975 Pulled By: siying fbshipit-source-id: 6b2f5cf	2017-03-29 11:54:20 -07:00
Maysam Yabandeh	e7731d119a	Configure index partition size Summary: Allow the users to specify the target index partition size. With this patch an index partition is cut before its estimated in-memory size goes above the configured value for metadata_block_size. The filter partitions are still cut right after an index partition is cut. Closes https://github.com/facebook/rocksdb/pull/2041 Differential Revision: D4780216 Pulled By: maysamyabandeh fbshipit-source-id: 95a0831	2017-03-28 12:09:12 -07:00
Siying Dong	91b5feb37b	Fix Windows Build broken by a recent commit Summary: Closes https://github.com/facebook/rocksdb/pull/2032 Differential Revision: D4766260 Pulled By: siying fbshipit-source-id: 415daa4	2017-03-23 18:09:57 -07:00
Warren Falk	41ccae6d26	Add C API functions (and tests) for WriteBatchWithIndex Summary: I've added functions to the C API to support WriteBatchWithIndex as requested in #1833. I've also added unit tests to c_test I've implemented the WriteBatchWithIndex variation of every function available for regular WriteBatch. And added additional functions unique to WriteBatchWithIndex. For now, the following is omitted: 1. The ability to create WriteBatchWithIndex's custom batch-only iterator as I'm not sure what its purpose is. It should be possible to add later if anyone wants it. 2. The ability to create the batch with a fallback comparator, since it appears to be unnecessary. I believe the column family comparator will be used for this, meaning those using a custom comparator can just use the column family variations. Closes https://github.com/facebook/rocksdb/pull/1985 Differential Revision: D4760039 Pulled By: siying fbshipit-source-id: 393227e	2017-03-23 15:54:13 -07:00
Daniel Black	f4fce4751e	Fix clang compile error - [-Werror,-Wunused-lambda-capture] Summary: Errors where: db/version_set.cc:1535:20: error: lambda capture 'this' is not used [-Werror,-Wunused-lambda-capture] [this](const Fsize& f1, const Fsize& f2) -> bool { ^ db/version_set.cc:1541:20: error: lambda capture 'this' is not used [-Werror,-Wunused-lambda-capture] [this](const Fsize& f1, const Fsize& f2) -> bool { ^ db/db_test.cc:2983:27: error: lambda capture 'kNumPutsBeforeWaitForFlush' is not required to be captured for this use [-Werror,-Wunused-lambda-capture] auto gen_l0_kb = [this, kNumPutsBeforeWaitForFlush](int size) { ^ Closes https://github.com/facebook/rocksdb/pull/1972 Differential Revision: D4685991 Pulled By: siying fbshipit-source-id: 9125379	2017-03-22 18:09:10 -07:00
Siying Dong	15950fe3a0	Remove ASSERT_EQ(boolean, ...) Summary: Closes https://github.com/facebook/rocksdb/pull/2024 Differential Revision: D4755420 Pulled By: siying fbshipit-source-id: 7332ab1	2017-03-22 15:54:12 -07:00
Aaron Gao	3e56c7e0c4	make total_log_size_ atomic Summary: make total_log_size_ atomic to avoid overflow caused by data race. Closes https://github.com/facebook/rocksdb/pull/2019 Differential Revision: D4751391 Pulled By: siying fbshipit-source-id: fac01dd	2017-03-22 11:54:40 -07:00
Dmitri Smirnov	be723a8d8c	Optionally construct Post Processing Info map in MemTableInserter Summary: MemTableInserter default constructs Post processing info std::map. However, on Windows with 2015 STL the default constructed map still dynamically allocates one node which shows up on a profiler and we loose ~40% throughput on fillrandom benchmark. Solution: declare a map as std::aligned storage and optionally construct. This addresses https://github.com/facebook/rocksdb/issues/1976 Before: ------------------------------------------------------------------- Initializing RocksDB Options from command-line flags DB path: [k:\data\BulkLoadRandom_10M_fillonly] fillrandom : 2.775 micros/op 360334 ops/sec; 280.4 MB/s Microseconds per write: Count: 10000000 Average: 2.7749 StdDev: 39.92 Min: 1 Median: 2.0826 Max: 26051 Percentiles: P50: 2.08 P75: 2.55 P99: 3.55 P99.9: 9.58 P99.99: 51.5**6 ------------------------------------------------------ After: Initializing RocksDB Options from command-line flags DB path: [k:\data\BulkLoadRandom_10M_fillon Closes https://github.com/facebook/rocksdb/pull/2011 Differential Revision: D4740823 Pulled By: siying fbshipit-source-id: 1daaa2c	2017-03-22 11:24:12 -07:00
Maysam Yabandeh	8b0097b49b	Readers for partition filter Summary: This is the last split of this pull request: https://github.com/facebook/rocksdb/pull/1891 which includes the reader part as well as the tests. Closes https://github.com/facebook/rocksdb/pull/1961 Differential Revision: D4672216 Pulled By: maysamyabandeh fbshipit-source-id: 6a2b829	2017-03-22 09:24:15 -07:00
Siying Dong	8f5bf04468	Flush triggered by DB write buffer size picks the oldest unflushed CF Summary: Previously, when DB write buffer size triggers, we always pick the CF with most data in its memtable to flush. This approach can minimize total flush happens. Change the behavior to always pick the oldest unflushed CF, which makes it the same behavior when max_total_wal_size hits. This approach will minimize size used by max_total_wal_size. Closes https://github.com/facebook/rocksdb/pull/1987 Differential Revision: D4703214 Pulled By: siying fbshipit-source-id: 9ff8b09	2017-03-21 11:09:10 -07:00
Raza Hussain	6908e24b56	dynamic setting of stats_dump_period_sec through SetDBOption() Summary: Resolved the following issue: https://github.com/facebook/rocksdb/issues/1930 Closes https://github.com/facebook/rocksdb/pull/2004 Differential Revision: D4736764 Pulled By: yiwu-arbug fbshipit-source-id: 64fe0b7	2017-03-20 22:54:13 -07:00
Islam AbdelRahman	d52f334cbd	Break stalls when no bg work is happening Summary: Current stall will keep sleeping even if there is no Flush/Compactions to wait for, I changed the logic to break the stall if we are not flushing or compacting db_bench command used ``` # fillrandom # memtable size = 10MB # value size = 1 MB # num = 1000 # use /dev/shm ./db_bench --benchmarks="fillrandom,stats" --value_size=1048576 --write_buffer_size=10485760 --num=1000 --delayed_write_rate=XXXXX --db="/dev/shm/new_stall" \| grep "Cumulative stall" ``` ``` Current results # delayed_write_rate = 1000 Kb/sec Cumulative stall: 00:00:9.031 H:M:S # delayed_write_rate = 200 Kb/sec Cumulative stall: 00:00:22.314 H:M:S # delayed_write_rate = 100 Kb/sec Cumulative stall: 00:00:42.784 H:M:S # delayed_write_rate = 50 Kb/sec Cumulative stall: 00:01:23.785 H:M:S # delayed_write_rate = 25 Kb/sec Cumulative stall: 00:02:45.702 H:M:S ``` ``` New results # delayed_write_rate = 1000 Kb/sec Cumulative stall: 00:00:9.017 H:M:S # delayed_write_rate = 200 Kb/sec Cumulative stall: 00 Closes https://github.com/facebook/rocksdb/pull/1884 Differential Revision: D4585439 Pulled By: IslamAbdelRahman fbshipit-source-id: aed2198	2017-03-16 18:24:17 -07:00
Islam AbdelRahman	995618a821	Support SstFileManager::SetDeleteRateBytesPerSecond() Summary: Update DeleteScheduler component to support changing delete rate in runtime by introducing SstFileManager::SetDeleteRateBytesPerSecond() Closes https://github.com/facebook/rocksdb/pull/1994 Differential Revision: D4719906 Pulled By: IslamAbdelRahman fbshipit-source-id: e6b8d9e	2017-03-16 12:09:15 -07:00
Islam AbdelRahman	e19163688b	Add macros to include file name and line number during Logging Summary: current logging ``` 2017/03/14-14:20:30.393432 7fedde9f5700 (Original Log Time 2017/03/14-14:20:30.393414) [default] Level summary: base level 1 max bytes base 268435456 files[1 0 0 0 0 0 0] max score 0.25 2017/03/14-14:20:30.393438 7fedde9f5700 [JOB 2] Try to delete WAL files size 61417909, prev total WAL file size 73820858, number of live WAL files 2. 2017/03/14-14:20:30.393464 7fedde9f5700 [DEBUG] [JOB 2] Delete /dev/shm/old_logging//MANIFEST-000001 type=3 #1 -- OK 2017/03/14-14:20:30.393472 7fedde9f5700 [DEBUG] [JOB 2] Delete /dev/shm/old_logging//000003.log type=0 #3 -- OK 2017/03/14-14:20:31.427103 7fedd49f1700 [default] New memtable created with log file: #9. Immutable memtables: 0. 2017/03/14-14:20:31.427179 7fedde9f5700 [JOB 3] Syncing log #6 2017/03/14-14:20:31.427190 7fedde9f5700 (Original Log Time 2017/03/14-14:20:31.427170) Calling FlushMemTableToOutputFile with column family [default], flush slots available 1, compaction slots allowed 1, compaction slots scheduled 1 2017/03/14-14:20:31. Closes https://github.com/facebook/rocksdb/pull/1990 Differential Revision: D4708695 Pulled By: IslamAbdelRahman fbshipit-source-id: cb8968f	2017-03-15 19:39:12 -07:00
Maysam Yabandeh	11526252cc	Pinnableslice (2nd attempt) Summary: PinnableSlice Summary: Currently the point lookup values are copied to a string provided by the user. This incures an extra memcpy cost. This patch allows doing point lookup via a PinnableSlice which pins the source memory location (instead of copying their content) and releases them after the content is consumed by the user. The old API of Get(string) is translated to the new API underneath. Here is the summary for improvements: value 100 byte: 1.8% regular, 1.2% merge values value 1k byte: 11.5% regular, 7.5% merge values value 10k byte: 26% regular, 29.9% merge values The improvement for merge could be more if we extend this approach to pin the merge output and delay the full merge operation until the user actually needs it. We have put that for future work. PS: Sometimes we observe a small decrease in performance when switching from t5452014 to this patch but with the old Get(string) API. The d Closes https://github.com/facebook/rocksdb/pull/1756 Differential Revision: D4391738 Pulled By: maysamyabandeh fbshipit-source-id: 6f3edd3	2017-03-13 11:54:10 -07:00
Sagar Vemuri	1ffbdfd9a7	Add a new SstFileWriter constructor without explicit comparator Summary: The comparator param in SstFileWriter constructor is redundant as it already exists as a field in options. So the current SstFileWriter constructor should be deprecated in favor of a new one which does not take a comparator. Note that the jni/java apis have not been touched yet. Closes https://github.com/facebook/rocksdb/pull/1978 Differential Revision: D4685629 Pulled By: sagar0 fbshipit-source-id: 372ce96	2017-03-13 11:39:13 -07:00
Reid Horuff	ebd5639b6d	Add ability to search for key prefix in sst_dump tool Summary: Add the flag --prefix to the sst_dump tool This flag is similar to, and exclusive from, the --from flag. --prefix=0x00FF will return all rows prefixed with 0x00FF. The --to flag may also be specified and will work as expected. These changes were used to help in debugging the power cycle corruption issue and theses changes were tested by scanning through a udb. Closes https://github.com/facebook/rocksdb/pull/1984 Differential Revision: D4691814 Pulled By: reidHoruff fbshipit-source-id: 027f261	2017-03-13 10:39:12 -07:00
Maysam Yabandeh	e6725e8c8d	Fix some bugs in MockEnv Summary: Fixing some bugs in MockEnv so it be actually used. Closes https://github.com/facebook/rocksdb/pull/1914 Differential Revision: D4609923 Pulled By: maysamyabandeh fbshipit-source-id: ca25735	2017-03-13 09:54:11 -07:00
Andrew Kryczka	f2817fb7f9	avoid ASSERT_EQ(false, ...); Summary: lately it fails on travis due to a compiler bug (see https://github.com/google/googletest/issues/322#issuecomment-125645145). interestingly it seems to affect occurrences of `ASSERT_EQ(false, ...);` but not `ASSERT_EQ(true, ...);`. Closes https://github.com/facebook/rocksdb/pull/1958 Differential Revision: D4680742 Pulled By: ajkr fbshipit-source-id: 291fe41	2017-03-08 22:24:16 -08:00
Andrew Kryczka	5b11124e39	add max to histogram stats Summary: Domas enlightened me about p100 (i.e., max) stats. Let's add them to our histograms. Closes https://github.com/facebook/rocksdb/pull/1968 Differential Revision: D4678716 Pulled By: ajkr fbshipit-source-id: 65e7118	2017-03-08 22:24:15 -08:00
Andrew Kryczka	18fc1bc0e0	minor changes for rate limiter test flakiness Summary: the 50%+ drained constraint wasn't working consistently in some of our test environments, maybe their resources are too low. relax the constraints a bit. Closes https://github.com/facebook/rocksdb/pull/1970 Differential Revision: D4679419 Pulled By: ajkr fbshipit-source-id: 3789cd8	2017-03-08 17:54:11 -08:00
Aaron Gao	12ba00ea65	Reset DBIter::saved_key_ with proper user key anywhere before pass to DBIter::FindNextUserEntry Summary: fix db_iter bug introduced by [facebook#1413](https://github.com/facebook/rocksdb/pull/1413) Closes https://github.com/facebook/rocksdb/pull/1962 Differential Revision: D4672369 Pulled By: lightmark fbshipit-source-id: 6a22953	2017-03-08 17:24:11 -08:00
Sagar Vemuri	97edc72d39	Add a memtable-only iterator Summary: This PR is to support a way to iterate over all the keys that are just in memtables. Closes https://github.com/facebook/rocksdb/pull/1953 Differential Revision: D4663500 Pulled By: sagar0 fbshipit-source-id: 144e177	2017-03-07 11:54:10 -08:00
Leonidas Galanis	72202962f9	fix db_sst_test flakiness Summary: db_sst_test had been flaky occasionally in the following way: reached_max_space_on_compaction can in very rare cases be 0. This happens when the limit on maximum allowable space set using SetMaxAllowedSpaceUsage is hit during flush for all test db sizes (1,2,4,8 and 10MB).The fix clears the error returned when the the space limit is reached during flush. This ensures that the compaction call back will always be called. The runtime is increased slightly because the 1MB loop writes more data and hits the limit during multiple flushes until compaction is scheduled. Closes https://github.com/facebook/rocksdb/pull/1861 Differential Revision: D4557396 Pulled By: lgalanis fbshipit-source-id: ff778d1	2017-03-07 11:24:13 -08:00
Reid Horuff	58b12dfe37	Set logs as getting flushed before releasing lock, race condition fix Summary: Relating to #1903: In MaybeFlushColumnFamilies() we want to modify the 'getting_flushed' flag before releasing the db mutex when SwitchMemtable() is called. The following 2 actions need to be atomic in MaybeFlushColumnFamilies() - getting_flushed is false on oldest log - we determine that all CFs can be flushed to successfully release oldest log - we set getting_flushed = true on the oldest log. ------- - getting_flushed is false on oldest log - we determine that all CFs can NOT be flushed to successfully release oldest log - we set unable_to_flush_oldest_log_ = true on the oldest log. #### In the 2pc case: T1 enters function but is unable to flush all CFs to release log T1 sets unable_to_flush_oldest_log_ = true T1 begins flushing all CFs possible T2 enters function but is unable to flush all CFs to release log T2 sees unable_to_flush_oldes_log_ has been set so exits T3 enters function and will be able to flush all CFs to release oldest log T3 sets getting_flushed = true on oldes Closes https://github.com/facebook/rocksdb/pull/1909 Differential Revision: D4646235 Pulled By: reidHoruff fbshipit-source-id: c8d0447	2017-03-06 15:09:11 -08:00
Maysam Yabandeh	534581a356	Fix a bug in tests in options operator= Summary: Note: Using the default operator= is an unsafe approach for Options since it destructs shared_ptr in the same order of their creation, in contrast to destructors which destructs them in the opposite order of creation. One particular problme is that the cache destructor might invoke callback functions that use Option members such as statistics. To work around this problem, we manually call destructor of table_facotry which eventually clears the block cache. Closes https://github.com/facebook/rocksdb/pull/1950 Differential Revision: D4655473 Pulled By: maysamyabandeh fbshipit-source-id: 6c4bbff	2017-03-05 18:09:09 -08:00
Andrew Kryczka	4561275c2d	fix rate limiter test flakiness Summary: fix when elapsed time spans non-integral number of intervals since the rate limiter may still be drained during a partial interval. Closes https://github.com/facebook/rocksdb/pull/1948 Differential Revision: D4651304 Pulled By: ajkr fbshipit-source-id: b1f9e70	2017-03-03 11:09:11 -08:00
Andrew Kryczka	7c80a6d7d1	Statistic for how often rate limiter is drained Summary: This is the metric I plan to use for adaptive rate limiting. The statistics are updated only if the rate limiter is drained by flush or compaction. I believe (but am not certain) that this is the normal case. The Statistics object is passed in RateLimiter::Request() to avoid requiring changes to client code, which would've been necessary if we passed it in the RateLimiter constructor. Closes https://github.com/facebook/rocksdb/pull/1946 Differential Revision: D4646489 Pulled By: ajkr fbshipit-source-id: d8e0161	2017-03-02 17:54:15 -08:00
Aaron Gao	6fb9013441	sanitize readahead when direct read enabled Summary: no readahead: readseq : 8.438 micros/op 118510 ops/sec; 13.1 MB/s sanitize to 10MB: readseq : 6.051 micros/op 165248 ops/sec; 18.3 MB/s Closes https://github.com/facebook/rocksdb/pull/1945 Differential Revision: D4645811 Pulled By: lightmark fbshipit-source-id: 5d63770	2017-03-02 17:24:11 -08:00
Islam AbdelRahman	f89b3893c0	Remove skip_table_builder_flush and default it to true Summary: This option is needed to be enabled for Direct IO and I cannot think of a reason where we need to disable it remove it and default it to true Closes https://github.com/facebook/rocksdb/pull/1944 Differential Revision: D4641088 Pulled By: IslamAbdelRahman fbshipit-source-id: d7085b9	2017-03-02 16:54:10 -08:00
Siying Dong	d5b607a43f	Make db_wal_test slightly faster Summary: Avoid to run db_wal_test in all the DB test options, and some small changes. Closes https://github.com/facebook/rocksdb/pull/1921 Differential Revision: D4622054 Pulled By: siying fbshipit-source-id: 890fd64	2017-02-28 17:39:10 -08:00
Siying Dong	ba4c77bd6b	Divide external_sst_file_test Summary: Separate the platform dependent tests from external_sst_file_test. Only those tests need to run on platforms like OSX Closes https://github.com/facebook/rocksdb/pull/1923 Differential Revision: D4622461 Pulled By: siying fbshipit-source-id: d2d6f04	2017-02-28 14:24:11 -08:00
Aaron Gao	e877afa08b	Remove bulk loading and auto_roll_logger in rocksdb_lite Summary: shrink lite size Closes https://github.com/facebook/rocksdb/pull/1929 Differential Revision: D4622059 Pulled By: siying fbshipit-source-id: 050b796	2017-02-28 11:09:11 -08:00
Peter (Stig) Edwards	2ca2059f66	Get unique_ptr to use delete[] for char[] in DumpMallocStats Summary: Avoid mismatched free() / delete / delete [] in DumpMallocStats Closes https://github.com/facebook/rocksdb/pull/1927 Differential Revision: D4622045 Pulled By: siying fbshipit-source-id: 1131b30	2017-02-27 17:39:12 -08:00
Siying Dong	8ad0fcdf99	Separate small subset tests in DBTest Summary: Separate a smal subset of tests in DBTest to DBBasicTest. Tests in DBTest don't have to run in CI tests on platforms like OSX, as long as they are covered by Linux. Closes https://github.com/facebook/rocksdb/pull/1924 Differential Revision: D4616702 Pulled By: siying fbshipit-source-id: 13e6549	2017-02-27 12:24:11 -08:00
Siying Dong	3b8ba703cb	Fix flaky DBTestUniversalCompaction.UniversalCompactionTrivialMoveTest2 Summary: A previous fix to DBTestUniversalCompaction.UniversalCompactionTrivialMoveTest2 didn't address the right problem. The problem is L0->L0 compaction is not trivial move in the scenario, not parallel compactions. Fix this. Closes https://github.com/facebook/rocksdb/pull/1911 Differential Revision: D4608955 Pulled By: siying fbshipit-source-id: 7a712cb	2017-02-23 18:39:13 -08:00
Siying Dong	8efb5ffa2a	[rocksdb][PR] Remove option min_partial_merge_operands and verify_checksums_in_comp… Summary: …action The two options, min_partial_merge_operands and verify_checksums_in_compaction, are not seldom used. Remove them to reduce the total number of options. Also remove them from Java and C interface. Closes https://github.com/facebook/rocksdb/pull/1902 Differential Revision: D4601219 Pulled By: siying fbshipit-source-id: aad4cb2	2017-02-23 15:09:12 -08:00
Siying Dong	1ba2804b7f	Remove XFunc tests Summary: Xfunc is hardly used. Remove it to keep the code simple. Closes https://github.com/facebook/rocksdb/pull/1905 Differential Revision: D4603220 Pulled By: siying fbshipit-source-id: 731f96d	2017-02-23 12:09:11 -08:00
Aaron Gao	1ef5f50e84	detect logical sector size Summary: querying logical sector size from the device instead of hardcoding it for linux platform. Closes https://github.com/facebook/rocksdb/pull/1875 Differential Revision: D4591946 Pulled By: ajkr fbshipit-source-id: 4e9805c	2017-02-23 11:25:36 -08:00
Aaron Gao	0824934423	truncate patch Summary: omit the override for the previous commit Closes https://github.com/facebook/rocksdb/pull/1898 Differential Revision: D4598743 Pulled By: lightmark fbshipit-source-id: f98a378	2017-02-22 10:39:11 -08:00
Aaron Gao	286a36db7f	posix writablefile truncate Summary: we occasionally missing this call so the file size will be wrong Closes https://github.com/facebook/rocksdb/pull/1894 Differential Revision: D4598446 Pulled By: lightmark fbshipit-source-id: 42b6ef5	2017-02-22 10:09:14 -08:00
Mike Kolupaev	18eeb7b90e	Fix interference between max_total_wal_size and db_write_buffer_size checks Summary: This is a trivial fix for OOMs we've seen a few days ago in logdevice. RocksDB get into the following state: (1) Write throughput is too high for flushes to keep up. Compactions are out of the picture - automatic compactions are disabled, and for manual compactions we don't care that much if they fall behind. We write to many CFs, with only a few L0 sst files in each, so compactions are not needed most of the time. (2) total_log_size_ is consistently greater than GetMaxTotalWalSize(). It doesn't get smaller since flushes are falling ever further behind. (3) Total size of memtables is way above db_write_buffer_size and keeps growing. But the write_buffer_manager_->ShouldFlush() is not checked because (2) prevents it (for no good reason, afaict; this is what this commit fixes). (4) Every call to WriteImpl() hits the MaybeFlushColumnFamilies() path. This keeps flushing the memtables one by one in order of increasing log file number. (5) No write stalling trigger is hit. We rely on max_write_buffer_number Closes https://github.com/facebook/rocksdb/pull/1893 Differential Revision: D4593590 Pulled By: yiwu-arbug fbshipit-source-id: af79c5f	2017-02-21 16:09:10 -08:00
Aaron Gao	2a0f3d0de1	level compaction expansion Summary: reimplement the compaction expansion on lower level. Considering such a case: input level file: 1[B E] 2[F G] 3[H I] 4 [J M] output level file: 5[A C] 6[D K] 7[L O] If we initially pick file 2, now we will compact file 2 and 6. But we can safely compact 2, 3 and 6 without expanding the output level. The previous code is messy and wrong. In this diff, I first determine the input range [a, b], and output range [c, d], then we get the range [e,f] = [min(a, c), max(b, d] and put all eligible clean-cut files within [e, f] into this compaction. Note: clean-cut means the files don't have the same user key on the boundaries of some files that are not chosen in this compaction. Closes https://github.com/facebook/rocksdb/pull/1760 Differential Revision: D4395564 Pulled By: lightmark fbshipit-source-id: 2dc2c5c	2017-02-21 10:24:17 -08:00
Yi Wu	381fd32247	Remove timeout_hint_us from WriteOptions Summary: The option has been deprecated for two years and has no effect. Removing. Closes https://github.com/facebook/rocksdb/pull/1866 Differential Revision: D4555203 Pulled By: yiwu-arbug fbshipit-source-id: c48f627	2017-02-17 15:24:17 -08:00
Islam AbdelRahman	fce7a6e196	Fail IngestExternalFile when bg_error_ exists Summary: Fail IngestExternalFile() when bg_error_ exists Closes https://github.com/facebook/rocksdb/pull/1881 Differential Revision: D4580621 Pulled By: IslamAbdelRahman fbshipit-source-id: 1194913	2017-02-17 13:39:17 -08:00
Shu Zhang	756c5924e6	Allow adding external v1 sst file with no global seqno support Summary: This is a follow up fix for https://github.com/facebook/rocksdb/pull/1783. After it, we should be able to ingest external v1 sst files with no global seqno field. Closes https://github.com/facebook/rocksdb/pull/1874 Differential Revision: D4576194 Pulled By: IslamAbdelRahman fbshipit-source-id: 5b34a3e	2017-02-16 17:09:12 -08:00
Aaron Gao	db2b4eb50e	avoid direct io in rocksdb_lite Summary: fix lite bugs disable direct io in lite mode Closes https://github.com/facebook/rocksdb/pull/1870 Differential Revision: D4559866 Pulled By: yiwu-arbug fbshipit-source-id: 3761c51	2017-02-16 10:39:13 -08:00
Andrew Kryczka	43e9f01c20	Fix repair_test on ROCKSDB_LITE Summary: RepairDB isn't included in rocksdb lite, so don't test it. Closes https://github.com/facebook/rocksdb/pull/1873 Differential Revision: D4565094 Pulled By: ajkr fbshipit-source-id: 8cc0898	2017-02-15 11:24:12 -08:00
Xiaofei Du	7106a994fe	Use monotonic time points in write_controller.cc and rate_limiter.cc Summary: NowMicros() provides non-monotonic time. When wall clock is synchronized or changed, the non-monotonicity time points will affect write rate controllers. This patch changes write_controller.cc and rate_limiter.cc to use monotonic time points. Closes https://github.com/facebook/rocksdb/pull/1865 Differential Revision: D4561732 Pulled By: siying fbshipit-source-id: 95ece62	2017-02-14 18:24:24 -08:00
Yi Wu	c2247dc1c7	Make DBImpl::has_unpersisted_data_ atomic Summary: Seems to me `has_unpersisted_data_` is read from read thread and write from write thread concurrently without synchronization. Making it an atomic. I update the logic not because seeing any problem with it, but it just feel confusing. Closes https://github.com/facebook/rocksdb/pull/1869 Differential Revision: D4555837 Pulled By: yiwu-arbug fbshipit-source-id: eff2ab8	2017-02-13 18:54:13 -08:00
Sagar Vemuri	eb912a927e	Remove disableDataSync option Summary: Remove disableDataSync, and another similarly named disable_data_sync options. This is being done to simplify options, and also because the performance gains of this feature can be achieved by other methods. Closes https://github.com/facebook/rocksdb/pull/1859 Differential Revision: D4541292 Pulled By: sagar0 fbshipit-source-id: 5b3a6ca	2017-02-13 11:09:13 -08:00
Dmitri Smirnov	a5adda0642	Fix repair issues Summary: Record the first parsed sequence number as the minimum so we can find the true minimum otherwise everything is larger than zero. Fix the comparator name comparision. Closes https://github.com/facebook/rocksdb/pull/1858 Differential Revision: D4544365 Pulled By: ajkr fbshipit-source-id: 439cbc2	2017-02-10 10:54:12 -08:00
Andrew Kryczka	b48e4778be	Consolidate file cutting logic in compaction loop Summary: It was really annoying to have two places (top and bottom of compaction loop) where we cut output files. I had bugs in both DeleteRange and dictionary compression due to updating only one of the two. This diff consolidates the file-cutting logic to the bottom of the compaction loop. Keep in mind that my goal with input_status is to be consistent with the past behavior, even though I'm not sure it's ideal. Closes https://github.com/facebook/rocksdb/pull/1832 Differential Revision: D4503038 Pulled By: ajkr fbshipit-source-id: 7da5213	2017-02-08 16:24:17 -08:00
Maysam Yabandeh	c4a37dcb44	Print the missed last layer in cfstats Summary: Printing compaction stats used to operate on two variable: number_levels_: for printing the layer num_levels_to_check: for updating the compaction score After this commit: `361010d447` these two are mixed up and as a result the last layer might not be printed out: https://fb.facebook.com/groups/rocksdb.internal/permalink/1315716625143616/ number_levels_ was used to decide which layers to print: `672300f47f/db/internal_stats.cc (L753)` but after the patch it is based on the return value of DumpCFMapStats `361010d447/db/internal_stats.cc (L929)` which returns num_levels_to_check: `361010d447/db/internal_stats.cc (L917)` Closes https://github.com/facebook/rocksdb/pull/1853 Differential Revision: D4529280 Pulled By: maysamyabandeh fbshipit-source-id: 3fd9448	2017-02-08 10:39:15 -08:00
Maysam Yabandeh	69d5262c81	Two-level Indexes Summary: Partition Index blocks and use a Partition-index as a 2nd level index. The two-level index can be used by setting BlockBasedTableOptions::kTwoLevelIndexSearch as the index type and configuring BlockBasedTableOptions::index_per_partition t15539501 Closes https://github.com/facebook/rocksdb/pull/1814 Differential Revision: D4473535 Pulled By: maysamyabandeh fbshipit-source-id: bffb87e	2017-02-06 16:39:12 -08:00
Dmitri Smirnov	0a4cdde50a	Windows thread Summary: introduce new methods into a public threadpool interface, - allow submission of std::functions as they allow greater flexibility. - add Joining methods to the implementation to join scheduled and submitted jobs with an option to cancel jobs that did not start executing. - Remove ugly `#ifdefs` between pthread and std implementation, make it uniform. - introduce pimpl for a drop in replacement of the implementation - Introduce rocksdb::port::Thread typedef which is a replacement for std::thread. On Posix Thread defaults as before std::thread. - Implement WindowsThread that allocates memory in a more controllable manner than windows std::thread with a replaceable implementation. - should be no functionality changes. Closes https://github.com/facebook/rocksdb/pull/1823 Differential Revision: D4492902 Pulled By: siying fbshipit-source-id: c74cb11	2017-02-06 14:54:18 -08:00
Vitaliy Liptchinsky	1aaa898cf1	Adding GetApproximateMemTableStats method Summary: Added method that returns approx num of entries as well as size for memtables. Closes https://github.com/facebook/rocksdb/pull/1841 Differential Revision: D4511990 Pulled By: VitaliyLi fbshipit-source-id: 9a4576e	2017-02-06 14:54:16 -08:00
Siying Dong	036d668b19	Fix wrong result in data race case related to Get() Summary: In theory, Get() can get a wrong result, if it races in a special with with flush. The bug can be reproduced in DBTest2.GetRaceFlush. Fix this bug by getting snapshot after referencing the super version. Closes https://github.com/facebook/rocksdb/pull/1816 Differential Revision: D4475958 Pulled By: siying fbshipit-source-id: bd9e67a	2017-02-03 11:39:15 -08:00
Islam AbdelRahman	574b543f80	Rename merger.h -> merging_iterator.h Summary: merger.h was always a confusing name for me, simply give the file a better name Closes https://github.com/facebook/rocksdb/pull/1836 Differential Revision: D4505357 Pulled By: IslamAbdelRahman fbshipit-source-id: 07b28d8	2017-02-02 16:54:19 -08:00
oranagra	b96372dead	improving the C wrapper Summary: - rocksdb_property_int (so that we don't have to parse strings) - and rocksdb_set_options (to allow controlling options via strings) - a few other missing options exposed - a documentation comment fix Closes https://github.com/facebook/rocksdb/pull/1793 Differential Revision: D4456569 Pulled By: yiwu-arbug fbshipit-source-id: 9f1fac1	2017-01-27 17:39:16 -08:00
Siying Dong	04c4ec41d1	Change corruption_test to use 4 bits. Summary: In the patch which LRU cache was made use dynamic shard bits, I changed to 2 shard bits to make the test happy. Look like it is occasionally still unhappy. Change it to 4 shard bits. Closes https://github.com/facebook/rocksdb/pull/1815 Differential Revision: D4475849 Pulled By: siying fbshipit-source-id: 575ff00	2017-01-27 11:24:16 -08:00
Siying Dong	2d75cd40d3	NewLRUCache() to pick number of shard bits based on capacity if not given Summary: If the users use the NewLRUCache() without passing in the number of shard bits, instead of using hard-coded 6, we'll determine it based on capacity. Closes https://github.com/facebook/rocksdb/pull/1584 Differential Revision: D4242517 Pulled By: siying fbshipit-source-id: 86b0f18	2017-01-27 06:39:12 -08:00
Siying Dong	f25f1ec60b	Add test DBTest2.GetRaceFlush which can expose a data race bug Summary: A current data race issue in Get() and Flush() can cause a Get() to return wrong results when a flush happened in the middle. Disable the test for now. Closes https://github.com/facebook/rocksdb/pull/1813 Differential Revision: D4472310 Pulled By: siying fbshipit-source-id: 5755ebd	2017-01-26 16:39:14 -08:00
sdong	5dad9d6d28	Avoid logs_ operation out of DB mutex Summary: logs_.back() is called out of DB mutex, which can cause data race. We move the access into the DB mutex protection area. Closes https://github.com/facebook/rocksdb/pull/1774 Reviewed By: AsyncDBConnMarkedDownDBException Differential Revision: D4417472 Pulled By: AsyncDBConnMarkedDownDBException fbshipit-source-id: 2da1f1e	2017-01-25 15:54:13 -08:00
Islam AbdelRahman	a7b13919bf	Fix CompactFiles() bug when used with CompactionFilter using SuperVersion Summary: GetAndRefSuperVersion() should not be called again in the same thread before ReturnAndCleanupSuperVersion() is called. If we have a compaction filter that is using DB::Get, This will happen ``` CompactFiles() { GetAndRefSuperVersion() // -- first call .. CompactionFilter() { GetAndRefSuperVersion() // -- second call ReturnAndCleanupSuperVersion() } .. ReturnAndCleanupSuperVersion() } ``` We solve this issue in the same way Iterator is solving it, but using GetReferencedSuperVersion() This was discovered in https://github.com/facebook/mysql-5.6/issues/427 by alxyang Closes https://github.com/facebook/rocksdb/pull/1803 Differential Revision: D4460155 Pulled By: IslamAbdelRahman fbshipit-source-id: 5e54322	2017-01-25 14:09:13 -08:00
Andrew Kryczka	616a1464ea	Fix DeleteRange including sentinels in output files Summary: when writing RangeDelAggregator::AddToBuilder, I forgot that there are sentinel tombstones in the middle of the interval map since gaps between real tombstones are represented with sentinels. blame: #1614 Closes https://github.com/facebook/rocksdb/pull/1804 Differential Revision: D4460426 Pulled By: ajkr fbshipit-source-id: 69444b5	2017-01-25 11:09:12 -08:00
Islam AbdelRahman	03ca2ac8a9	Remove function from DBImpl that are not used anywhere Summary: GetAndRefSuperVersionUnlocked ReturnAndCleanupSuperVersionUnlocked GetColumnFamilyHandleUnlocked Are dead code that are not used any where Closes https://github.com/facebook/rocksdb/pull/1802 Differential Revision: D4459948 Pulled By: IslamAbdelRahman fbshipit-source-id: 30fa89d	2017-01-24 19:24:13 -08:00
Andrew Kryczka	b0029bc7fa	Test merge op covered by range deletion in memtable Summary: It's a test case for #1797. Also got rid of kTypeDeletion in the conditional since we treat it the same as kTypeRangeDeletion. Closes https://github.com/facebook/rocksdb/pull/1800 Differential Revision: D4451300 Pulled By: ajkr fbshipit-source-id: b39dda1	2017-01-24 13:39:11 -08:00
Andrew Kryczka	d438e1ec17	Test range deletion block outlives table reader Summary: This test ensures RangeDelAggregator can still access blocks even if it outlives the table readers that created them (detailed description in comments). I plan to optimize away the extra cache lookup we currently do in BlockBasedTable::NewRangeTombstoneIterator(), as it is ~5% CPU in my random read benchmark in a database with 1k tombstones. This test will help make sure nothing breaks in the process. Closes https://github.com/facebook/rocksdb/pull/1739 Differential Revision: D4375954 Pulled By: ajkr fbshipit-source-id: aef9357	2017-01-24 13:24:14 -08:00
Andrew Kryczka	9da4d542fe	Range deletions unsupported in tailing iterator Summary: change the iterator status to NotSupported as soon as a range tombstone is encountered by a ForwardIterator. Closes https://github.com/facebook/rocksdb/pull/1593 Differential Revision: D4246294 Pulled By: ajkr fbshipit-source-id: aef9f49	2017-01-23 13:39:12 -08:00
Hyeonseok Oh	f2b4939da4	fixed typo Summary: I fixed exisit -> exist Closes https://github.com/facebook/rocksdb/pull/1799 Differential Revision: D4451466 Pulled By: yiwu-arbug fbshipit-source-id: b447c3a	2017-01-23 12:54:13 -08:00
yinqiwen	973f1b78fd	memtable: delete merge value for range deleteion Summary: Closes https://github.com/facebook/rocksdb/pull/1797 Differential Revision: D4448004 Pulled By: ajkr fbshipit-source-id: 3ffc27c	2017-01-23 12:24:14 -08:00
Vitaliy Liptchinsky	753ff84a3d	Fix get approx size Summary: Fixing GetApproximateSize bug for the case of computing stats for mem tables only. Closes https://github.com/facebook/rocksdb/pull/1795 Differential Revision: D4445507 Pulled By: IslamAbdelRahman fbshipit-source-id: 3905846	2017-01-20 15:54:12 -08:00
Jay Lee	537da370da	c: allow set savepoint to writebatch Summary: Allow set SavePoint to WriteBatch in C ABI. Closes https://github.com/facebook/rocksdb/pull/1698 Differential Revision: D4378556 Pulled By: yiwu-arbug fbshipit-source-id: afca746	2017-01-20 13:24:13 -08:00
Changli Gao	5ac97314e7	Fix std::out_of_range when DBOptions::keep_log_file_num is zero Summary: We should validate this option, otherwise we may see std::out_of_range thrown at: db/db_impl.cc:1124 1123 for (unsigned int i = 0; i <= end; i++) { 1124 std::string& to_delete = old_info_log_files.at(i); 1125 std::string full_path_to_delete = 1126 (immutable_db_options_.db_log_dir.empty() Closes https://github.com/facebook/rocksdb/pull/1722 Differential Revision: D4379495 Pulled By: yiwu-arbug fbshipit-source-id: e136552	2017-01-20 13:24:12 -08:00
Shu Zhang	3c0852d1da	Make ingest external file backward compatible Summary: Closes https://github.com/facebook/rocksdb/pull/1783 Differential Revision: D4443463 Pulled By: IslamAbdelRahman fbshipit-source-id: 39d21d6	2017-01-20 12:09:19 -08:00
Siying Dong	0e8dfd6062	Fix OptimizeForPointLookup() Summary: If users directly call OptimizeForPointLookup(), it is broken as the option isn't compatible with parallel memtable insert. Fix it by using memtable bloomo filter instead. Closes https://github.com/facebook/rocksdb/pull/1791 Differential Revision: D4442836 Pulled By: siying fbshipit-source-id: bf6c9cd	2017-01-20 10:54:12 -08:00
Vitaliy Liptchinsky	e840213d6e	Change DB::GetApproximateSizes for more flexibility needed for MyRocks Summary: Added an option to GetApproximateSizes to exclude file stats, as MyRocks has those counted exactly and we need only stats from memtables. Closes https://github.com/facebook/rocksdb/pull/1787 Differential Revision: D4441111 Pulled By: IslamAbdelRahman fbshipit-source-id: c11f4c3	2017-01-20 09:39:11 -08:00
Yi Wu	9239103cd4	Flush job should release reference current version if sync log failed Summary: Fix the bug when sync log fail, FlushJob::Run() will not be execute and reference to cfd->current() will not be release. Closes https://github.com/facebook/rocksdb/pull/1792 Differential Revision: D4441316 Pulled By: yiwu-arbug fbshipit-source-id: 5523e28	2017-01-19 23:09:15 -08:00
Islam AbdelRahman	da54d36a96	Disable IngestExternalFile in ReadOnly mode Summary: Disable IngestExternalFile() in read only mode Closes https://github.com/facebook/rocksdb/pull/1781 Differential Revision: D4439179 Pulled By: IslamAbdelRahman fbshipit-source-id: b7e46e7	2017-01-19 15:54:19 -08:00
Reid Horuff	5cf176ca15	Fix for 2PC causing WAL to grow too large Summary: Consider the following single column family scenario: prepare in log A commit in log B WAL is too large, flush all CFs to releast log A CFA is on log B so we do not see CFA is depending on log A so no flush is requested To fix this we must also consider the log containing the prepare section when determining what log a CF is dependent on. Closes https://github.com/facebook/rocksdb/pull/1768 Differential Revision: D4403265 Pulled By: reidHoruff fbshipit-source-id: ce800ff	2017-01-19 15:39:12 -08:00
Andrew Kryczka	f9d18e22d2	Fix DeleteRange file boundary correctness issue with max_compaction_bytes Summary: Cockroachdb exposed this bug in #1778. The bug happens when a compaction's output files are ended due to exceeding max_compaction_bytes. In that case we weren't taking into account the next file's start key when deciding how far to extend the current file's max_key. This caused the non-overlapping key-range invariant to be violated. Note this was correctly handled for the usual case of cutting compaction output, which is file size exceeding max_output_file_size. I am not sure why these are two separate code paths, but we can consider refactoring it to prevent such errors in the future. Closes https://github.com/facebook/rocksdb/pull/1784 Differential Revision: D4430235 Pulled By: ajkr fbshipit-source-id: 80af748	2017-01-18 11:54:22 -08:00
Islam AbdelRahman	3ce091fd73	Add KEEP_DB env var option Summary: When debugging tests, it's useful to preserve the DB to investigate it and check the logs This will allow us to set KEEP_DB=1 to preserve the DB Closes https://github.com/facebook/rocksdb/pull/1759 Differential Revision: D4393826 Pulled By: IslamAbdelRahman fbshipit-source-id: 1bff689	2017-01-17 13:54:20 -08:00
Siying Dong	77b4806625	Fix 2PC with concurrent memtable insert Summary: If concurrent memtable insert is enabled, and one prepare command and a normal command are grouped into a commit group, the sequence ID will be calculated incorrectly. Closes https://github.com/facebook/rocksdb/pull/1730 Differential Revision: D4371081 Pulled By: siying fbshipit-source-id: cd40c6d	2017-01-17 11:24:28 -08:00
Mike Kolupaev	d18dd2c41f	Abort compactions more reliably when closing DB Summary: DB shutdown aborts running compactions by setting an atomic shutting_down=true that CompactionJob periodically checks. Without this PR it checks it before processing every _output_ value. If compaction filter filters everything out, the compaction is uninterruptible. This PR adds checks for shutting_down on every _input_ value (in CompactionIterator and MergeHelper). There's also some minor code cleanup along the way. Closes https://github.com/facebook/rocksdb/pull/1639 Differential Revision: D4306571 Pulled By: yiwu-arbug fbshipit-source-id: f050890	2017-01-11 15:09:21 -08:00
Changli Gao	9f246298e2	Performance: Iterate vector by reference Summary: Closes https://github.com/facebook/rocksdb/pull/1763 Differential Revision: D4398796 Pulled By: yiwu-arbug fbshipit-source-id: b82636d	2017-01-11 10:54:37 -08:00
Dmitri Smirnov	3c233ca4ea	Fix Windows environment issues Summary: Enable directIO on WritableFileImpl::Append with offset being current length of the file. Enable UniqueID tests on Windows, disable others but leeting them to compile. Unique tests are valuable to detect failures on different filesystems and upcoming ReFS. Clear output in WinEnv Getchildren.This is different from previous strategy, do not touch output on failure. Make sure DBTest.OpenWhenOpen works with windows error message Closes https://github.com/facebook/rocksdb/pull/1746 Differential Revision: D4385681 Pulled By: IslamAbdelRahman fbshipit-source-id: c07b702	2017-01-09 15:54:12 -08:00
Maysam Yabandeh	d0ba8ec8f9	Revert "PinnableSlice" Summary: This reverts commit `54d94e9c2c`. The pull request was landed by mistake. Closes https://github.com/facebook/rocksdb/pull/1755 Differential Revision: D4391678 Pulled By: maysamyabandeh fbshipit-source-id: 36d5149	2017-01-08 14:24:12 -08:00
Maysam Yabandeh	54d94e9c2c	PinnableSlice Summary: Currently the point lookup values are copied to a string provided by the user. This incures an extra memcpy cost. This patch allows doing point lookup via a PinnableSlice which pins the source memory location (instead of copying their content) and releases them after the content is consumed by the user. The old API of Get(string) is translated to the new API underneath. Here is the summary for improvements: 1. value 100 byte: 1.8% regular, 1.2% merge values 2. value 1k byte: 11.5% regular, 7.5% merge values 3. value 10k byte: 26% regular, 29.9% merge values The improvement for merge could be more if we extend this approach to pin the merge output and delay the full merge operation until the user actually needs it. We have put that for future work. PS: Sometimes we observe a small decrease in performance when switching from t5452014 to this patch but with the old Get(string) API. The difference is a little and could be noise. More importantly it is safely cancelled Closes https://github.com/facebook/rocksdb/pull/1732 Differential Revision: D4374613 Pulled By: maysamyabandeh fbshipit-source-id: a077f1a	2017-01-08 13:54:13 -08:00
Andrew Kryczka	b104b87814	Maintain position in range deletions map Summary: When deletion-collapsing mode is enabled (i.e., for DBIter/CompactionIterator), we maintain position in the tombstone maps across calls to ShouldDelete(). Since iterators often access keys sequentially (or reverse-sequentially), scanning forward/backward from the last position can be faster than binary-searching the map for every key. - When Next() is invoked on an iterator, we use kForwardTraversal to scan forwards, if needed, until arriving at the range deletion containing the next key. - Similarly for Prev(), we use kBackwardTraversal to scan backwards in the range deletion map. - When the iterator seeks, we use kBinarySearch for repositioning - After tombstones are added or before the first ShouldDelete() invocation, the current position is set to invalid, which forces kBinarySearch to be used. - Non-iterator users (i.e., Get()) use kFullScan, which has the same behavior as before---scan the whole map for every key passed to ShouldDelete(). Closes https://github.com/facebook/rocksdb/pull/1701 Differential Revision: D4350318 Pulled By: ajkr fbshipit-source-id: 5129b76	2017-01-05 10:39:12 -08:00
siddontang	653ac1f9c6	C API: support total_order_mode Summary: Closes https://github.com/facebook/rocksdb/pull/1687 Differential Revision: D4349210 Pulled By: IslamAbdelRahman fbshipit-source-id: 32d0fbd	2017-01-03 18:39:14 -08:00
Adam Retter	85ac1a320a	Fix rocksdb::Status::getState Summary: This fixes the Java API for Status#getState use in Native code and also simplifies the implementation of rocksdb::Status::getState. Closes https://github.com/facebook/rocksdb/issues/1688 Closes https://github.com/facebook/rocksdb/pull/1714 Differential Revision: D4364181 Pulled By: yiwu-arbug fbshipit-source-id: 8e073b4	2017-01-03 18:39:14 -08:00
Islam AbdelRahman	76711b6e77	Make ExternalSSTFileTest::CompactionDeadlock more deterministic Summary: It's not always true that `ASSERT_EQ(running_threads.load(), 2);` Closes https://github.com/facebook/rocksdb/pull/1736 Differential Revision: D4374091 Pulled By: IslamAbdelRahman fbshipit-source-id: 4f70bbd	2017-01-03 18:09:20 -08:00
Islam AbdelRahman	c963460dbc	Fix tests under GCC_481 Summary: This fix the issue with tests failing under GCC 481, I am not sure what is the exact reason Closes https://github.com/facebook/rocksdb/pull/1735 Differential Revision: D4374094 Pulled By: IslamAbdelRahman fbshipit-source-id: b3625bc	2017-01-03 17:54:12 -08:00
Vincent Lee	e425ec1162	utilities/backupable: backup should limit the copy size of wal. Summary: Since the backup work as snapshot, we should only copy the bytes of the wal while we get the alive files. Closes https://github.com/facebook/rocksdb/pull/1733 Differential Revision: D4373457 Pulled By: ajkr fbshipit-source-id: 389318f	2016-12-31 10:54:20 -08:00
Maysam Yabandeh	0712d541d1	Delegate Cleanables Summary: Cleanable objects will perform the registered cleanups when they are destructed. We however rather to delay this cleaning like when we are gathering the merge operands. Current approach is to create the Cleanable object on heap (instead of on stack) and delay deleting it. By allowing Cleanables to delegate their cleanups to another cleanable object we can delay the cleaning without however the need to craete the cleanable object on heap and keeping it around. This patch applies this technique for the cleanups of BlockIter and shows improved performance for some in-memory benchmarks: +1.8% for merge worklaod, +6.4% for non-merge workload when the merge operator is specified. https://our.intern.facebook.com/intern/tasks?t=15168163 Non-merge benchmark: TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench --benchmarks=fillrandom --num=1000000 -value_size=100 -compression_type=none Reading random with no merge operator specified: TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench --benchmarks="read Closes https://github.com/facebook/rocksdb/pull/1711 Differential Revision: D4361163 Pulled By: maysamyabandeh fbshipit-source-id: 9801e07	2016-12-29 15:54:19 -08:00
Islam AbdelRahman	d58ef52ba6	Allow SstFileWriter to Fadvise the file away from page cache Summary: Add `fadvise_trigger` option to `SstFileWriter` If fadvise_trigger is passed with a non-zero value, SstFileWriter will invalidate the os page cache every `fadvise_trigger` bytes for the sst file Closes https://github.com/facebook/rocksdb/pull/1731 Differential Revision: D4371246 Pulled By: IslamAbdelRahman fbshipit-source-id: 91caff1	2016-12-29 15:09:19 -08:00
Siying Dong	17a4b75cc3	Always fsync the file after file copying Summary: File copying happens when creating checkpoints and bulkloading files from different FS partition. We should fsync the files when copying them to guarantee durability. A side effect will be that the dirty pages in file system buffers won't grow too large. Closes https://github.com/facebook/rocksdb/pull/1728 Differential Revision: D4371083 Pulled By: siying fbshipit-source-id: 579e14c	2016-12-28 19:09:16 -08:00
leipeng	a738af8f84	db/pinned_iterators_manager.h: bugfix Summary: std::unique(beg, end) returns an iterator of unique_end, data behind unique_end should not be accessed. Closes https://github.com/facebook/rocksdb/pull/1726 Differential Revision: D4371076 Pulled By: IslamAbdelRahman fbshipit-source-id: 5564450	2016-12-28 18:54:57 -08:00
Siying Dong	438f22bc56	Fix bug of Checkpoint loses recent transactions with 2PC Summary: If 2PC is enabled, checkpoint may not copy previous log files that contain uncommitted prepare records. In this diff we keep those files. Closes https://github.com/facebook/rocksdb/pull/1724 Differential Revision: D4368319 Pulled By: siying fbshipit-source-id: cc2c746	2016-12-28 12:24:16 -08:00
Aaron Gao	972f96b3fb	direct io write support Summary: rocksdb direct io support ``` [gzh@dev11575.prn2 ~/rocksdb] ./db_bench -benchmarks=fillseq --num=1000000 Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags RocksDB: version 5.0 Date: Wed Nov 23 13:17:43 2016 CPU: 40 * Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz CPUCache: 25600 KB Keys: 16 bytes each Values: 100 bytes each (50 bytes after compression) Entries: 1000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 110.6 MB (estimated) FileSize: 62.9 MB (estimated) Write rate: 0 bytes/second Compression: Snappy Memtablerep: skip_list Perf Level: 1 WARNING: Assertions are enabled; benchmarks unnecessarily slow ------------------------------------------------ Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags DB path: [/tmp/rocksdbtest-112628/dbbench] fillseq : 4.393 micros/op 227639 ops/sec; 25.2 MB/s [gzh@dev11575.prn2 ~/roc Closes https://github.com/facebook/rocksdb/pull/1564 Differential Revision: D4241093 Pulled By: lightmark fbshipit-source-id: 98c29e3	2016-12-22 13:09:19 -08:00
Islam AbdelRahman	989e644ed8	Remove sst_file_manager option from LITE Summary: Remove sst_file_manager option from LITE Closes https://github.com/facebook/rocksdb/pull/1690 Differential Revision: D4341331 Pulled By: IslamAbdelRahman fbshipit-source-id: 9f9328d	2016-12-21 17:54:21 -08:00
Islam AbdelRahman	1beef6569a	Fix c_test Summary: addfile phase in c_test could fail because in previous steps we did a DeleteRange. Fix the test by simply moving the addfile phase before DeleteRange Closes https://github.com/facebook/rocksdb/pull/1672 Differential Revision: D4328896 Pulled By: IslamAbdelRahman fbshipit-source-id: 1d946df	2016-12-21 17:39:14 -08:00
Andrew Kryczka	50e305de98	Collapse range deletions Summary: Added a tombstone-collapsing mode to RangeDelAggregator, which eliminates overlap in the TombstoneMap. In this mode, we can check whether a tombstone covers a user key using upper_bound() (i.e., binary search). However, the tradeoff is the overhead to add tombstones is now higher, so at first I've only enabled it for range scans (compaction/flush/user iterators), where we expect a high number of calls to ShouldDelete() for the same tombstones. Point queries like Get() will still use the linear scan approach. Also in this diff I changed RangeDelAggregator's TombstoneMap to use multimap with user keys instead of map with internal keys. Callers sometimes provided ParsedInternalKey directly, from which it would've required string copying to derive an internal key Slice with which we could search the map. Closes https://github.com/facebook/rocksdb/pull/1614 Differential Revision: D4270397 Pulled By: ajkr fbshipit-source-id: 93092c7	2016-12-19 16:54:12 -08:00
Yi Wu	5d1457dbbf	Dump persistent cache options Summary: Dump persistent cache options Closes https://github.com/facebook/rocksdb/pull/1679 Differential Revision: D4337019 Pulled By: yiwu-arbug fbshipit-source-id: 3812f8a	2016-12-19 14:09:12 -08:00
Daniel Black	342370f1d3	Simplify MemTable::Update Summary: As suggested by testn in #1650 The Add is at the end of the function. Having a fallthough will result in it being added twice. Closes https://github.com/facebook/rocksdb/pull/1676 Differential Revision: D4331906 Pulled By: yiwu-arbug fbshipit-source-id: 895c4a0	2016-12-17 00:09:13 -08:00
Ding Ma	1a136c1f13	Expose file size Summary: add a new function to SstFileWriter that will tell the user how big is there file right now. Closes https://github.com/facebook/rocksdb/pull/1686 Differential Revision: D4338868 Pulled By: mdyuki1016 fbshipit-source-id: c1ee16a	2016-12-16 18:39:12 -08:00
Andrew Kryczka	fbff4628a9	Reduce compaction iterator status checks Summary: seems it's expensive to check status since the underlying merge iterator checks status of all its children. so only do it when it's really necessary to get the status before invoking Next(), i.e., when we're advancing to get the first key in the next file. Closes https://github.com/facebook/rocksdb/pull/1691 Differential Revision: D4343446 Pulled By: siying fbshipit-source-id: 70ab315	2016-12-16 17:39:09 -08:00
Daniel Black	816c1e30ca	gcc-7 requires include <functional> for std::function Summary: Fixes compile error: In file included from ./util/statistics.h:17:0, from ./util/stop_watch.h:8, from ./util/perf_step_timer.h:9, from ./util/iostats_context_imp.h:8, from ./util/posix_logger.h:27, from ./port/util_logger.h:18, from ./db/auto_roll_logger.h:15, from db/auto_roll_logger.cc:6: ./util/thread_local.h:65:16: error: 'function' in namespace 'std' does not name a template type typedef std::function<void(void, void)> FoldFunc; Closes https://github.com/facebook/rocksdb/pull/1656 Differential Revision: D4318702 Pulled By: yiwu-arbug fbshipit-source-id: 8c5d17a	2016-12-16 11:24:18 -08:00
Yi Wu	c270735861	Iterator should be in corrupted status if merge operator return false Summary: Iterator should be in corrupted status if merge operator return false. Also add test to make sure if max_successive_merges is hit during write, data will not be lost. Closes https://github.com/facebook/rocksdb/pull/1665 Differential Revision: D4322695 Pulled By: yiwu-arbug fbshipit-source-id: b327b05	2016-12-16 11:09:16 -08:00
siddontang	8f5d24ae68	C API: support get usage and pinned_usage for cache Summary: Closes https://github.com/facebook/rocksdb/pull/1671 Differential Revision: D4327453 Pulled By: yiwu-arbug fbshipit-source-id: bcdbc65	2016-12-15 17:24:17 -08:00
Daniel Black	cfc34d7c4e	Missing break in case in DBTestBase::CurrentOptions Summary: Found by gcc-7 compile error. This appeared to be a fault as these options seems too different. Closes https://github.com/facebook/rocksdb/pull/1667 Differential Revision: D4324174 Pulled By: yiwu-arbug fbshipit-source-id: 0f65383	2016-12-13 18:39:14 -08:00
Daniel Black	bfbcec2339	Gcc 7 error expansion to defined Summary: sorry if these gcc-7/clang-4 cleanups are getting tedious. Closes https://github.com/facebook/rocksdb/pull/1658 Differential Revision: D4318792 Pulled By: yiwu-arbug fbshipit-source-id: 8e85891	2016-12-13 18:39:14 -08:00
Daniel Black	67adc937b6	intentional fallthough (prevents gcc-7/clang-4 error) Summary: db/memtable.cc: In member function 'void rocksdb::MemTable::Update(rocksdb::SequenceNumber, const rocksdb::Slice&, const rocksdb::Slice&)': db/memtable.cc:736:11: error: this statement may fall through [-Werror=implicit-fallthrough=] } ^ db/memtable.cc:738:9: note: here default: ^~~~~~~ cc1plus: all warnings being treated as errors closes #1650 Closes https://github.com/facebook/rocksdb/pull/1655 Differential Revision: D4318696 Pulled By: yiwu-arbug fbshipit-source-id: 1a8981c	2016-12-13 14:39:17 -08:00
Islam AbdelRahman	1a146f89c7	break Flush wait for dropped CF Summary: In FlushJob we dont do the Flush if the CF is dropped https://github.com/facebook/rocksdb/blob/master/db/flush_job.cc#L184-L188 but inside WaitForFlushMemTable we keep waiting forever even if the CF is dropped. Closes https://github.com/facebook/rocksdb/pull/1664 Differential Revision: D4321032 Pulled By: IslamAbdelRahman fbshipit-source-id: 6e2b25d	2016-12-13 14:09:12 -08:00
Yi Wu	36d42e65d0	Disable test to unblock travis build Summary: The two tests keep failing in travis. Disable them and will fix later. Closes https://github.com/facebook/rocksdb/pull/1648 Differential Revision: D4316389 Pulled By: yiwu-arbug fbshipit-source-id: 0a370e7	2016-12-13 11:54:14 -08:00
siddontang	b57dd9262a	C API: support writebatch delete range Summary: Seem that writebatch delete range can work now, so I add C API for later use. Btw, can we use this feature in production now? Closes https://github.com/facebook/rocksdb/pull/1647 Differential Revision: D4314534 Pulled By: ajkr fbshipit-source-id: e835165	2016-12-13 11:24:18 -08:00
Islam AbdelRahman	2ba59b5a1e	Disallow ingesting files into dropped CFs Summary: This PR update IngestExternalFile to return an error if we try to ingest a file into a dropped CF. Right now if IngestExternalFile want to flush a memtable, and it's ingesting a file into a dropped CF, it will wait forever since flushing is not possible for the dropped CF Closes https://github.com/facebook/rocksdb/pull/1657 Differential Revision: D4318657 Pulled By: IslamAbdelRahman fbshipit-source-id: ed6ea2b	2016-12-13 00:54:14 -08:00
Jonathan Lee	2cabdb8f44	Increase buffer size Summary: When compiling with GCC>=7.0.0, "db/internal_stats.cc" fails to compile as the data being written to the buffer potentially exceeds its size. This fix simply doubles the size of the buffer, thus accommodating the max possible data size. Closes https://github.com/facebook/rocksdb/pull/1635 Differential Revision: D4302162 Pulled By: yiwu-arbug fbshipit-source-id: c76ad59	2016-12-09 11:54:22 -08:00
Jonathan Lee	4a17b47bb5	Remove unnecessary header include Summary: Remove "util/testharness.h" from list of includes for "db/db_filesnapshot.cc", as it wasn't being used and thus caused an extraneous dependency on gtest. Closes https://github.com/facebook/rocksdb/pull/1634 Differential Revision: D4302146 Pulled By: yiwu-arbug fbshipit-source-id: e900c0b	2016-12-09 11:54:21 -08:00
Mike Kolupaev	8c2b921fdf	Fixed a crash in debug build in flush_job.cc Summary: It was doing `&range_del_iters[0]` on an empty vector. Even though the resulting pointer is never dereferenced, it's still bad for two reasons: * the practical reason: it crashes with `std::out_of_range` exception in our debug build, * the "C++ standard lawyer" reason: it's undefined behavior because, in `std::vector` implementation, it probably "dereferences" a null pointer, which is invalid even though it doesn't actually read the pointed memory, just converts a pointer into a reference (and then flush_job.cc converts it back to pointer); nullptr references are undefined behavior. Closes https://github.com/facebook/rocksdb/pull/1612 Differential Revision: D4265625 Pulled By: al13n321 fbshipit-source-id: db26fb9	2016-12-09 10:39:12 -08:00
Islam AbdelRahman	20ce081fae	Fix issue where IngestExternalFile insert blocks in block cache with g_seqno=0 Summary: When we Ingest an external file we open it to read some metadata and first/last key during doing that we insert blocks into the block cache with global_seqno = 0 If we move the file (did not copy it) into the DB, we will use these blocks with the wrong seqno in the read path Closes https://github.com/facebook/rocksdb/pull/1627 Differential Revision: D4293332 Pulled By: yiwu-arbug fbshipit-source-id: 3ce5523	2016-12-08 13:39:18 -08:00
zhangjinpeng1987	45c7ce1377	CompactRangeOptions C API Summary: Add C API for CompactRangeOptions. Closes https://github.com/facebook/rocksdb/pull/1596 Differential Revision: D4252339 Pulled By: yiwu-arbug fbshipit-source-id: f768f93	2016-12-07 17:54:14 -08:00
Andrew Kryczka	b821984d31	DeleteRange read path end-to-end tests Summary: Closes https://github.com/facebook/rocksdb/pull/1592 Differential Revision: D4246260 Pulled By: ajkr fbshipit-source-id: ce03fa2	2016-12-07 12:54:17 -08:00
Artemiy Kolesnikov	2f4fc539c6	Compaction::IsTrivialMove relaxing Summary: IsTrivialMove returns true if no input file overlaps with output_level+1 with more than max_compaction_bytes_ bytes. Closes https://github.com/facebook/rocksdb/pull/1619 Differential Revision: D4278338 Pulled By: yiwu-arbug fbshipit-source-id: 994c001	2016-12-07 11:54:11 -08:00
Islam AbdelRahman	ed8fbdb560	Add EventListener::OnExternalFileIngested() event Summary: Add EventListener::OnExternalFileIngested() to allow user to subscribe to external file ingestion events Closes https://github.com/facebook/rocksdb/pull/1623 Differential Revision: D4285844 Pulled By: IslamAbdelRahman fbshipit-source-id: 0b95a88	2016-12-06 14:09:17 -08:00
Mike Kolupaev	beb36d9c1e	Fixed CompactionFilter::Decision::kRemoveAndSkipUntil Summary: Embarassingly enough, the first time I tried to use my new feature in logdevice it crashed with this assertion failure: db/pinned_iterators_manager.h:30: void rocksdb::PinnedIteratorsManager::StartPinning(): Assertion `pinning_enabled == false' failed The issue was that `pinned_iters_mgr_.StartPinning()` was called but `pinned_iters_mgr_.ReleasePinnedData()` wasn't. Closes https://github.com/facebook/rocksdb/pull/1611 Differential Revision: D4265622 Pulled By: al13n321 fbshipit-source-id: 747b10f	2016-12-05 15:24:11 -08:00
Islam AbdelRahman	67f37cf198	Allow user to specify a CF for SST files generated by SstFileWriter Summary: Allow user to explicitly specify that the generated file by SstFileWriter will be ingested in a specific CF. This allow us to persist the CF id in the generated file Closes https://github.com/facebook/rocksdb/pull/1615 Differential Revision: D4270422 Pulled By: IslamAbdelRahman fbshipit-source-id: 7fb954e	2016-12-05 14:24:16 -08:00
Anton Safonov	9053fe2a5c	Made delete_obsolete_files_period_micros option dynamic Summary: Made delete_obsolete_files_period_micros option dynamic. It can be updating using DB::SetDBOptions(). Closes https://github.com/facebook/rocksdb/pull/1595 Differential Revision: D4246569 Pulled By: tonek fbshipit-source-id: d23f560	2016-12-05 14:24:16 -08:00
Islam AbdelRahman	edde954e7b	fix clang build Summary: override is missing for FilterV2 Closes https://github.com/facebook/rocksdb/pull/1606 Differential Revision: D4263832 Pulled By: IslamAbdelRahman fbshipit-source-id: d8b337a	2016-12-01 18:39:10 -08:00
Islam AbdelRahman	e39d080871	Fix travis (compile for clang < 3.9) Summary: Travis fail because it uses clang 3.6 which don't recognize `__attribute__((__no_sanitize__("undefined")))` Closes https://github.com/facebook/rocksdb/pull/1601 Differential Revision: D4257175 Pulled By: IslamAbdelRahman fbshipit-source-id: fb4d1ab	2016-12-01 10:09:22 -08:00
fangchenliaohui	b77007df8b	Bug: paralle_group status updated in WriteThread::CompleteParallelWorker Summary: Multi-write thread may update the status of the parallel_group in WriteThread::CompleteParallelWorker if the status of Writer is not ok! When copy write status to the paralle_group, the write thread just hold the mutex of the the writer processed by itself. it is useless. The thread should held the the leader of the parallel_group instead. Closes https://github.com/facebook/rocksdb/pull/1598 Differential Revision: D4252335 Pulled By: siying fbshipit-source-id: 3864cf7	2016-12-01 09:54:11 -08:00
Mike Kolupaev	247d0979aa	Support for range skips in compaction filter Summary: This adds the ability for compaction filter to say "drop this key-value, and also drop everything up to key x". This will cause the compaction to seek input iterator to x, without reading the data. This can make compaction much faster when large consecutive chunks of data are filtered out. See the changes in include/rocksdb/compaction_filter.h for the new API. Along the way this diff also adds ability for compaction filter changing merge operands, similar to how it can change values; we're not going to use this feature, it just seemed easier and cleaner to implement it than to document that it's not implemented :) The diff is not as big as it may seem, about half of the lines are a test. Closes https://github.com/facebook/rocksdb/pull/1599 Differential Revision: D4252092 Pulled By: al13n321 fbshipit-source-id: 41e1e48	2016-12-01 07:09:15 -08:00
Panagiotis Ktistakis	96fcefbf1d	c api: expose option for dynamic level size target Summary: Closes https://github.com/facebook/rocksdb/pull/1587 Differential Revision: D4245923 Pulled By: yiwu-arbug fbshipit-source-id: 6ee7291	2016-11-30 11:24:14 -08:00
zhangjinpeng1987	00197cff39	Add C API to set base_backgroud_compactions Summary: Add C API to set base_backgroud_compactions Closes https://github.com/facebook/rocksdb/pull/1571 Differential Revision: D4245709 Pulled By: yiwu-arbug fbshipit-source-id: 792c6b8	2016-11-30 11:09:13 -08:00
Andrew Kryczka	5b219eccb5	deleterange end-to-end test improvements for lite/robustness Summary: Closes https://github.com/facebook/rocksdb/pull/1591 Differential Revision: D4246019 Pulled By: ajkr fbshipit-source-id: 0c4aa37	2016-11-29 12:24:13 -08:00
Andrew Kryczka	e333528991	DeleteRange write path end-to-end tests Summary: Closes https://github.com/facebook/rocksdb/pull/1578 Differential Revision: D4241171 Pulled By: ajkr fbshipit-source-id: ce5fd83	2016-11-29 11:09:22 -08:00
Siying Dong	7784980fcd	Fix mis-reporting of compaction read bytes to the base level Summary: In dynamic leveled compaction, when calculating read bytes, output level bytes may be wronglyl calculated as input level inputs. Fix it. Closes https://github.com/facebook/rocksdb/pull/1475 Differential Revision: D4148412 Pulled By: siying fbshipit-source-id: f2f475a	2016-11-29 11:09:22 -08:00
Islam AbdelRahman	3c6b49ed66	Fix implicit conversion between int64_t to int Summary: Make conversion explicit, implicit conversion breaks the build Closes https://github.com/facebook/rocksdb/pull/1589 Differential Revision: D4245158 Pulled By: IslamAbdelRahman fbshipit-source-id: aaec00d	2016-11-29 10:54:15 -08:00
Siying Dong	b3b875657f	Remove unused assignment in db/db_iter.cc Summary: "make analyze" complains the assignment is not useful. Remove it. Closes https://github.com/facebook/rocksdb/pull/1581 Differential Revision: D4241697 Pulled By: siying fbshipit-source-id: 178f67a	2016-11-29 09:09:14 -08:00
Andrew Kryczka	4f6e89b1d0	Fix range deletion covering key in same SST file Summary: AddTombstones() needs to be before t->Get(), oops :'( Closes https://github.com/facebook/rocksdb/pull/1576 Differential Revision: D4241041 Pulled By: ajkr fbshipit-source-id: 781ceea	2016-11-28 22:54:13 -08:00
Islam AbdelRahman	a2bf265a39	Avoid intentional overflow in GetL0ThresholdSpeedupCompaction Summary: `99c052a34f` fixes integer overflow in GetL0ThresholdSpeedupCompaction() by checking if int become -ve. UBSAN will complain about that since this is still an overflow, we can fix the issue by simply using int64_t Closes https://github.com/facebook/rocksdb/pull/1582 Differential Revision: D4241525 Pulled By: IslamAbdelRahman fbshipit-source-id: b3ae21f	2016-11-28 18:39:13 -08:00
Islam AbdelRahman	52fd1ff2c2	disable UBSAN for functions with intentional -ve shift / overflow Summary: disable UBSAN for functions with intentional left shift on -ve number / overflow These functions are rocksdb:: Hash FixedLengthColBufEncoder::Append FaultInjectionTest:: Key Closes https://github.com/facebook/rocksdb/pull/1577 Differential Revision: D4240801 Pulled By: IslamAbdelRahman fbshipit-source-id: 3e1caf6	2016-11-28 17:54:12 -08:00
Islam AbdelRahman	1886c435b9	Fix CompactionJob::Install division by zero Summary: Fix CompactionJob::Install division by zero Closes https://github.com/facebook/rocksdb/pull/1580 Differential Revision: D4240794 Pulled By: IslamAbdelRahman fbshipit-source-id: 7286721	2016-11-28 16:54:16 -08:00
Islam AbdelRahman	13e66a8f51	Fix compaction_job.cc division by zero Summary: Fix division by zero in compaction_job.cc Closes https://github.com/facebook/rocksdb/pull/1575 Differential Revision: D4240818 Pulled By: IslamAbdelRahman fbshipit-source-id: a8bc757	2016-11-28 16:39:13 -08:00
Andrew Kryczka	01eabf7375	Fix double-counted deletion stat Summary: Both the single deletion and the value are included in compaction outputs, so no need to update the stat for the value's deletion yet, otherwise it'd be double-counted. Closes https://github.com/facebook/rocksdb/pull/1574 Differential Revision: D4241181 Pulled By: ajkr fbshipit-source-id: c9aaa15	2016-11-28 15:54:12 -08:00
Andrew Kryczka	7ffb10fc1a	DeleteRange compaction statistics Summary: - "rocksdb.compaction.key.drop.range_del" - number of keys dropped during compaction due to a range tombstone covering them - "rocksdb.compaction.range_del.drop.obsolete" - number of range tombstones dropped due to compaction to bottom level and no snapshot saving them - s/CompactionIteratorStats/CompactionIterationStats/g since this class is no longer specific to CompactionIterator -- it's also updated for range tombstone iteration during compaction - Move the above class into a separate .h file to avoid circular dependency. Closes https://github.com/facebook/rocksdb/pull/1520 Differential Revision: D4187179 Pulled By: ajkr fbshipit-source-id: 10c2103	2016-11-28 11:54:12 -08:00
Mike Kolupaev	236d4c67e9	Less linear search in DBIter::Seek() when keys are overwritten a lot Summary: In one deployment we saw high latencies (presumably from slow iterator operations) and a lot of CPU time reported by perf with this stack: ``` rocksdb::MergingIterator::Next rocksdb::DBIter::FindNextUserEntryInternal rocksdb::DBIter::Seek ``` I think what's happening is: 1. we create a snapshot iterator, 2. we do lots of Put()s for the same key x; this creates lots of entries in memtable, 3. we seek the iterator to a key slightly smaller than x, 4. the seek walks over lots of entries in memtable for key x, skipping them because of high sequence numbers. CC IslamAbdelRahman Closes https://github.com/facebook/rocksdb/pull/1413 Differential Revision: D4083879 Pulled By: IslamAbdelRahman fbshipit-source-id: a83ddae	2016-11-28 10:24:11 -08:00
Siying Dong	cd7c4143d7	Improve Write Stalling System Summary: Current write stalling system has the problem of lacking of positive feedback if the restricted rate is already too low. Users sometimes stack in very low slowdown value. With the diff, we add a positive feedback (increasing the slowdown value) if we recover from slowdown state back to normal. To avoid the positive feedback to keep the slowdown value to be to high, we add issue a negative feedback every time we are close to the stop condition. Experiments show it is easier to reach a relative balance than before. Also increase level0_stop_writes_trigger default from 24 to 32. Since level0_slowdown_writes_trigger default is 20, stop trigger 24 only gives four files as the buffer time to slowdown writes. In order to avoid stop in four files while 20 files have been accumulated, the slowdown value must be very low, which is amost the same as stop. It also doesn't give enough time for the slowdown value to converge. Increase it to 32 will smooth out the system. Closes https://github.com/facebook/rocksdb/pull/1562 Differential Revision: D4218519 Pulled By: siying fbshipit-source-id: 95e4088	2016-11-23 09:24:15 -08:00
Yi Wu	dfb6fe6755	Unified InlineSkipList::Insert algorithm with hinting Summary: This PR is based on nbronson's diff with small modifications to wire it up with existing interface. Comparing to previous version, this approach works better for inserting keys in decreasing order or updating the same key, and impose less restriction to the prefix extractor. ---- Summary from original diff ---- This diff introduces a single InlineSkipList::Insert that unifies the existing sequential insert optimization (prev_), concurrent insertion, and insertion using externally-managed insertion point hints. There's a deep symmetry between insertion hints (cursors) and the concurrent algorithm. In both cases we have partial information from the recent past that is likely but not certain to be accurate. This diff introduces the struct InlineSkipList::Splice, which encodes predecessor and successor information in the same form that was previously only used within a single call to InsertConcurrently. Splice holds information about an insertion point that can be used to levera Closes https://github.com/facebook/rocksdb/pull/1561 Differential Revision: D4217283 Pulled By: yiwu-arbug fbshipit-source-id: 33ee437	2016-11-22 14:09:13 -08:00
Andrew Kryczka	734e4acafb	Eliminate redundant cache lookup with range deletion Summary: When we introduced range deletion block, TableCache::Get() and TableCache::NewIterator() each did two table cache lookups, one for range deletion block iterator and another for getting the table reader to which the Get()/NewIterator() is delegated. This extra cache lookup was very CPU-intensive (about 10% overhead in a read-heavy benchmark). We can avoid it by reusing the Cache::Handle created for range deletion block iterator to get the file reader. Closes https://github.com/facebook/rocksdb/pull/1537 Differential Revision: D4201167 Pulled By: ajkr fbshipit-source-id: d33ffd8	2016-11-21 21:24:11 -08:00
Maysam Yabandeh	182b940e70	Add WriteOptions.no_slowdown Summary: If the WriteOptions.no_slowdown flag is set AND we need to wait or sleep for the write request, then fail immediately with Status::Incomplete(). Closes https://github.com/facebook/rocksdb/pull/1527 Differential Revision: D4191405 Pulled By: maysamyabandeh fbshipit-source-id: 7f3ce3f	2016-11-21 18:09:13 -08:00
Karthikeyan Radhakrishnan	4118e13330	Persistent Cache: Expose stats to user via public API Summary: Exposing persistent cache stats (counters) to the user via public API. Closes https://github.com/facebook/rocksdb/pull/1485 Differential Revision: D4155274 Pulled By: siying fbshipit-source-id: 30a9f50	2016-11-21 17:39:13 -08:00
Andrew Kryczka	fd43ee09da	Range deletion microoptimizations Summary: - Made RangeDelAggregator's InternalKeyComparator member a reference-to-const so we don't need to copy-construct it. Also added InternalKeyComparator to ImmutableCFOptions so we don't need to construct one for each DBIter. - Made MemTable::NewRangeTombstoneIterator and the table readers' NewRangeTombstoneIterator() functions return nullptr instead of NewEmptyInternalIterator to avoid the allocation. Updated callers accordingly. Closes https://github.com/facebook/rocksdb/pull/1548 Differential Revision: D4208169 Pulled By: ajkr fbshipit-source-id: 2fd65cf	2016-11-21 12:24:13 -08:00
Andrew Kryczka	fe349db57b	Remove Arena in RangeDelAggregator Summary: The Arena construction/destruction introduced significant overhead to read-heavy workload just by creating empty vectors for its blocks, so avoid it in RangeDelAggregator. Closes https://github.com/facebook/rocksdb/pull/1547 Differential Revision: D4207781 Pulled By: ajkr fbshipit-source-id: 9d1c130	2016-11-19 14:24:12 -08:00
Andrew Kryczka	3f62215210	Lazily initialize RangeDelAggregator's map and pinning manager Summary: Since a RangeDelAggregator is created for each read request, these heap-allocating member variables were consuming significant CPU (~3% total) which slowed down request throughput. The map and pinning manager are only necessary when range deletions exist, so we can defer their initialization until the first range deletion is encountered. Currently lazy initialization is done for reads only since reads pass us a single snapshot, which is easier to store on the stack for later insertion into the map than the vector passed to us by flush or compaction. Note the Arena member variable is still expensive, I will figure out what to do with it in a subsequent diff. It cannot be lazily initialized because we currently use this arena even to allocate empty iterators, which is necessary even when no range deletions exist. Closes https://github.com/facebook/rocksdb/pull/1539 Differential Revision: D4203488 Pulled By: ajkr fbshipit-source-id: 3b36279	2016-11-18 17:09:11 -08:00
Andrew Kryczka	635a7bd1ad	refactor TableCache Get/NewIterator for single exit points Summary: these functions were too complicated to change with exit points everywhere, so refactored them. btw, please review urgently, this is a prereq to fix the 5.0 perf regression Closes https://github.com/facebook/rocksdb/pull/1534 Differential Revision: D4198972 Pulled By: ajkr fbshipit-source-id: 04ebfb7	2016-11-17 14:39:13 -08:00
Siying Dong	a4eb7387b2	Allow plain table to store index on file with bloom filter disabled Summary: Currently plain table bloom filter is required if storing metadata on file. Remove the constraint. Closes https://github.com/facebook/rocksdb/pull/1525 Differential Revision: D4190977 Pulled By: siying fbshipit-source-id: be60442	2016-11-17 11:09:13 -08:00
Yi Wu	36e4762ce0	Remove Ticker::SEQUENCE_NUMBER Summary: Remove the ticker count because: * Having to reset the ticker count in WriteImpl is ineffiecent; * It doesn't make sense to have it as a ticker count if multiple db instance share a statistics object. Closes https://github.com/facebook/rocksdb/pull/1531 Differential Revision: D4194442 Pulled By: yiwu-arbug fbshipit-source-id: e2110a9	2016-11-16 22:39:09 -08:00
Andrew Kryczka	760ef68a69	fix deleterange asan issue Summary: pinned_iters_mgr_ pins iterators allocated with arena_, so we should order the instance variable declarations such that the pinned iterators have their destructors executed before the arena is destroyed. Closes https://github.com/facebook/rocksdb/pull/1528 Differential Revision: D4191984 Pulled By: ajkr fbshipit-source-id: 1386f20	2016-11-16 14:09:07 -08:00
Siying Dong	972e3ff295	Enable allow_concurrent_memtable_write and enable_write_thread_adaptive_yield by default Summary: Closes https://github.com/facebook/rocksdb/pull/1496 Differential Revision: D4168080 Pulled By: siying fbshipit-source-id: 056ae62	2016-11-16 09:39:09 -08:00
Yi Wu	1543d5d92e	Report memory usage by memtable insert hints map. Summary: It is hard to measure acutal memory usage by std containers. Even providing a custom allocator will miss count some of the usage. Here we only do a wild guess on its memory usage. Closes https://github.com/facebook/rocksdb/pull/1511 Differential Revision: D4179945 Pulled By: yiwu-arbug fbshipit-source-id: 32ab929	2016-11-15 20:24:13 -08:00
Andrew Kryczka	48e8baebc0	Decouple data iterator and range deletion iterator in TableCache Summary: Previously we used TableCache::NewIterator() for multiple purposes (data block iterator and range deletion iterator), and returned non-ok status in the data block iterator. In one case where the caller only used the range deletion block iterator (`9e7cf3469b/db/version_set.cc (L965-L973)`), we didn't check/free the data block iterator containing non-ok status, which caused a valgrind error. So, this diff decouples creation of data block and range deletion block iterators, and updates the callers accordingly. Both functions can return non-ok status in an InternalIterator. Since the non-ok status is returned in an iterator that the callers will definitely use, it should be more usable/less error-prone. Closes https://github.com/facebook/rocksdb/pull/1513 Differential Revision: D4181423 Pulled By: ajkr fbshipit-source-id: 835b8f5	2016-11-15 17:24:28 -08:00
Andrew Kryczka	661e4c9267	DeleteRange unsupported in non-block-based tables Summary: Return an error from DeleteRange() (or Write() if the user is using the low-level WriteBatch API) if an unsupported table type is configured. Closes https://github.com/facebook/rocksdb/pull/1519 Differential Revision: D4185933 Pulled By: ajkr fbshipit-source-id: abcdf84	2016-11-15 15:24:16 -08:00
Andrew Kryczka	489d142808	DeleteRange interface Summary: Expose DeleteRange() interface since we think the implementation is functionally correct now. Closes https://github.com/facebook/rocksdb/pull/1503 Differential Revision: D4171921 Pulled By: ajkr fbshipit-source-id: 5e21c98	2016-11-15 15:24:16 -08:00
Islam AbdelRahman	eba99c28e4	Fix min_write_buffer_number_to_merge = 0 bug Summary: It's possible that we set min_write_buffer_number_to_merge to 0. This should never happen Closes https://github.com/facebook/rocksdb/pull/1515 Differential Revision: D4183356 Pulled By: yiwu-arbug fbshipit-source-id: c9d39d7	2016-11-15 13:54:08 -08:00
Artemiy Kolesnikov	91300d01f6	Dynamic max_total_wal_size option Summary: Closes https://github.com/facebook/rocksdb/pull/1509 Differential Revision: D4176426 Pulled By: yiwu-arbug fbshipit-source-id: b57689d	2016-11-14 22:54:17 -08:00
Andrew Kryczka	ec2f64794b	Consider subcompaction boundaries when updating file boundaries for range deletion Summary: Adjusted AddToBuilder() to take lower_bound and upper_bound, which serve two purposes: (1) only range deletions overlapping with the interval [lower_bound, upper_bound) will be added to the output file, and (2) the output file's boundaries will not be extended before lower_bound or after upper_bound. Our computation of lower_bound/upper_bound consider both subcompaction boundaries and previous/next files within the subcompaction. Test cases are here (level subcompactions: https://gist.github.com/ajkr/63c7eae3e9667c5ebdc0a7efb74ac332, and universal subcompactions: https://gist.github.com/ajkr/5a62af77c4ebe4052a1955c496d51fdb) but can't be included in this diff as they depend on committing the API first. They fail before this change and pass after. Closes https://github.com/facebook/rocksdb/pull/1501 Reviewed By: yhchiang Differential Revision: D4171685 Pulled By: ajkr fbshipit-source-id: ee99db8	2016-11-14 20:24:21 -08:00
Andrew Kryczka	3b192f6186	Handle full final subcompaction output file with range deletions Summary: This conditional should only open a new file that's dedicated to range deletions when it's the sole output of the subcompaction. Previously, we created such a file whenever the table builder was nullptr, which would've also been the case whenever the CompactionIterator's final key coincided with the final output table becoming full. Closes https://github.com/facebook/rocksdb/pull/1507 Differential Revision: D4174613 Pulled By: ajkr fbshipit-source-id: 9ffacea	2016-11-14 17:54:20 -08:00
Andrew Kryczka	6c57952002	Make range deletion inclusive-exclusive Summary: This makes it easier to implement future optimizations like range collapsing. Closes https://github.com/facebook/rocksdb/pull/1504 Differential Revision: D4172214 Pulled By: ajkr fbshipit-source-id: ac4942f	2016-11-14 17:39:13 -08:00
Yi Wu	1ea79a78c9	Optimize sequential insert into memtable - Part 1: Interface Summary: Currently our skip-list have an optimization to speedup sequential inserts from a single stream, by remembering the last insert position. We extend the idea to support sequential inserts from multiple streams, and even tolerate small reordering wihtin each stream. This PR is the interface part adding the following: - Add `memtable_insert_prefix_extractor` to allow specifying prefix for each key. - Add `InsertWithHint()` interface to memtable, to allow underlying implementation to return a hint of insert position, which can be later pass back to optimize inserts. - Memtable will maintain a map from prefix to hints and pass the hint via `InsertWithHint()` if `memtable_insert_prefix_extractor` is non-null. Closes https://github.com/facebook/rocksdb/pull/1419 Differential Revision: D4079367 Pulled By: yiwu-arbug fbshipit-source-id: 3555326	2016-11-13 19:09:18 -08:00
Yi Wu	df5eeb85ca	Optimize sequential insert into memtable - Part 2: Implementation Summary: Implement a insert hint into skip-list to hint insert position. This is to optimize for the write workload where there are multiple stream of sequential writes. For example, there is a stream of keys of a1, a2, a3... but also b1, b2, b2... Each stream are not neccessary strictly sequential, but can get reorder a little bit. User can specify a prefix extractor and the `SkipListRep` can thus maintan a hint for each of the stream for fast insert into memtable. This is the internal implementation part. See #1419 for the interface part. See inline comments for details. Closes https://github.com/facebook/rocksdb/pull/1449 Differential Revision: D4106781 Pulled By: yiwu-arbug fbshipit-source-id: f4d48c4	2016-11-13 13:09:16 -08:00
Islam AbdelRahman	5ed650857d	Fix SstFileWriter destructor Summary: If user did not call SstFileWriter::Finish() or called Finish() but it failed. We need to abandon the builder, to avoid destructing it while it's open Closes https://github.com/facebook/rocksdb/pull/1502 Differential Revision: D4171660 Pulled By: IslamAbdelRahman fbshipit-source-id: ab6f434	2016-11-12 20:11:19 -08:00
Lijun Tang	adb665e0bf	Allowed delayed_write_rate option to be dynamically set. Summary: Closes https://github.com/facebook/rocksdb/pull/1488 Differential Revision: D4157784 Pulled By: siying fbshipit-source-id: f150081	2016-11-12 15:54:11 -08:00
Maysam Yabandeh	361010d447	Exporting compaction stats in the form of a map Summary: Currently the compaction stats are printed to stdout. We want to export the compaction stats in a map format so that the upper layer apps (e.g., MySQL) could present the stats in any format required by the them. Closes https://github.com/facebook/rocksdb/pull/1477 Differential Revision: D4149836 Pulled By: maysamyabandeh fbshipit-source-id: b3df19f	2016-11-11 20:54:14 -08:00
Aaron Gao	b39b2ee12f	do not call get() in recovery mode Summary: This is a previous fix that has a typo Closes https://github.com/facebook/rocksdb/pull/1487 Differential Revision: D4157381 Pulled By: lightmark fbshipit-source-id: f079be8	2016-11-10 11:24:20 -08:00
Reid Horuff	1ca5f6d132	Fix 2PC Recovery SeqId Miscount Summary: Originally sequence ids were calculated, in recovery, based off of the first seqid found if the first log recovered. The working seqid was then incremented from that value based on every insertion that took place. This was faulty because of the potential for missing log files or inserts that skipped the WAL. The current recovery scheme grabs sequence from current recovering batch and increments using memtableinserter to track how many actual inserts take place. This works for 2PC batches as well scenarios where some logs are missing or inserts that skip the WAL. Closes https://github.com/facebook/rocksdb/pull/1486 Differential Revision: D4156064 Pulled By: reidHoruff fbshipit-source-id: a6da8d9	2016-11-10 11:09:22 -08:00
Andrew Kryczka	c90fef88b1	fix open failure with empty wal Summary: Closes https://github.com/facebook/rocksdb/pull/1490 Differential Revision: D4158821 Pulled By: IslamAbdelRahman fbshipit-source-id: 59b73f4	2016-11-09 22:24:26 -08:00
Andrew Kryczka	4e20c5da20	Store internal keys in TombstoneMap Summary: This fixes a correctness issue where ranges with same begin key would overwrite each other. This diff uses InternalKey as TombstoneMap's key such that all tombstones have unique keys even when their start keys overlap. We also update TombstoneMap to use an internal key comparator. End-to-end tests pass and are here (https://gist.github.com/ajkr/851ffe4c1b8a15a68d33025be190a7d9) but cannot be included yet since the DeleteRange() API is yet to be checked in. Note both tests failed before this fix. Closes https://github.com/facebook/rocksdb/pull/1484 Differential Revision: D4155248 Pulled By: ajkr fbshipit-source-id: 304b4b9	2016-11-09 15:09:18 -08:00
Yueh-Hsuan Chiang	a9fb346e4a	Fix RocksDB Lite build failure in c_test.cc Summary: Fix the following RocksDB Lite build failure in c_test.cc db/c_test.c:1051:3: error: implicit declaration of function 'fprintf' is invalid in C99 [-Werror,-Wimplicit-function-declaration] fprintf(stderr, "SKIPPED\n"); ^ db/c_test.c:1051:3: error: declaration of built-in function 'fprintf' requires inclusion of the header <stdio.h> [-Werror,-Wbuiltin-requires-header] db/c_test.c:1051:11: error: use of undeclared identifier 'stderr' fprintf(stderr, "SKIPPED\n"); ^ 3 errors generated. Closes https://github.com/facebook/rocksdb/pull/1479 Differential Revision: D4151160 Pulled By: yhchiang fbshipit-source-id: a471a30	2016-11-09 12:24:18 -08:00
Reid Horuff	d133b08f68	Use correct sequence number when creating memtable Summary: copied from: `5ebfd2623a` Opening existing RocksDB attempts recovery from log files, which uses wrong sequence number to create the memtable. This is a regression introduced in change `a400336`. This change includes a test demonstrating the problem, without the fix the test fails with "Operation failed. Try again.: Transaction could not check for conflicts for operation at SequenceNumber 1 as the MemTable only contains changes newer than SequenceNumber 2. Increasing the value of the max_write_buffer_number_to_maintain option could reduce the frequency of this error" This change is a joint effort by Peter 'Stig' Edwards thatsafunnyname and me. Closes https://github.com/facebook/rocksdb/pull/1458 Differential Revision: D4143791 Pulled By: reidHoruff fbshipit-source-id: 5a25033	2016-11-09 12:24:17 -08:00
Islam AbdelRahman	9bd191d2f4	Fix deadlock between (WriterThread/Compaction/IngestExternalFile) Summary: A deadlock is possible if this happen (1) Writer thread is stopped because it's waiting for compaction to finish (2) Compaction is waiting for current IngestExternalFile() calls to finish (3) IngestExternalFile() is waiting to be able to acquire the writer thread (4) WriterThread is held by stopped writes that are waiting for compactions to finish This patch fix the issue by not incrementing num_running_ingest_file_ except when we acquire the writer thread. This patch include a unittest to reproduce the described scenario Closes https://github.com/facebook/rocksdb/pull/1480 Differential Revision: D4151646 Pulled By: IslamAbdelRahman fbshipit-source-id: 09b39db	2016-11-09 10:54:10 -08:00
Islam AbdelRahman	193221e0a1	Fix Forward Iterator Seek()/SeekToFirst() Summary: In ForwardIterator::SeekInternal(), we may end up passing empty Slice representing an internal key to InternalKeyComparator::Compare. and when we try to extract the user key from this empty Slice, we will create a slice with size = 0 - 8 ( which will overflow and cause us to read invalid memory as well ) Scenarios to reproduce these issues are in the unit tests Closes https://github.com/facebook/rocksdb/pull/1467 Differential Revision: D4136660 Pulled By: lightmark fbshipit-source-id: 151e128	2016-11-08 13:54:31 -08:00
Aaron Gao	e48f3f8b9e	remove tabs and duplicate #include in c api Summary: fix lint error about tabs and duplicate includes. Closes https://github.com/facebook/rocksdb/pull/1476 Differential Revision: D4149646 Pulled By: lightmark fbshipit-source-id: 2e0a632	2016-11-08 13:54:31 -08:00
Jay Lee	a7875272d7	c: support seek_for_prev Summary: support seek_for_prev in c abi. Closes https://github.com/facebook/rocksdb/pull/1457 Differential Revision: D4135360 Pulled By: lightmark fbshipit-source-id: 61256b0	2016-11-08 12:54:13 -08:00
Andrew Kryczka	9e7cf3469b	DeleteRange user iterator support Summary: Note: reviewed in https://reviews.facebook.net/D65115 - DBIter maintains a range tombstone accumulator. We don't cleanup obsolete tombstones yet, so if the user seeks back and forth, the same tombstones would be added to the accumulator multiple times. - DBImpl::NewInternalIterator() (used to make DBIter's underlying iterator) adds memtable/L0 range tombstones, L1+ range tombstones are added on-demand during NewSecondaryIterator() (see D62205) - DBIter uses ShouldDelete() when advancing to check whether keys are covered by range tombstones Closes https://github.com/facebook/rocksdb/pull/1464 Differential Revision: D4131753 Pulled By: ajkr fbshipit-source-id: be86559	2016-11-04 12:09:22 -07:00
Andrew Kryczka	f998c9790f	DeleteRange Get support Summary: During Get()/MultiGet(), build up a RangeDelAggregator with range tombstones as we search through live memtable, immutable memtables, and SST files. This aggregator is then used by memtable.cc's SaveValue() and GetContext::SaveValue() to check whether keys are covered. added tests for Get on memtables/files; end-to-end tests mainly in https://reviews.facebook.net/D64761 Closes https://github.com/facebook/rocksdb/pull/1456 Differential Revision: D4111271 Pulled By: ajkr fbshipit-source-id: 6e388d4	2016-11-03 18:54:20 -07:00
zhangjinpeng1987	879f366366	Add C api for RateLimiter Summary: Add C api for RateLimiter. Closes https://github.com/facebook/rocksdb/pull/1455 Differential Revision: D4116362 Pulled By: yiwu-arbug fbshipit-source-id: cb05a8d	2016-11-03 11:09:17 -07:00
Yi Wu	437942e481	Add avoid_flush_during_shutdown DB option Summary: Add avoid_flush_during_shutdown DB option. Closes https://github.com/facebook/rocksdb/pull/1451 Differential Revision: D4108643 Pulled By: yiwu-arbug fbshipit-source-id: abdaf4d	2016-11-02 15:39:18 -07:00
Benoit Girard	2b16d664cb	Change max_bytes_for_level_multiplier to double Summary: Closes https://github.com/facebook/rocksdb/pull/1427 Differential Revision: D4094732 Pulled By: yiwu-arbug fbshipit-source-id: b9b79e9	2016-11-01 21:09:23 -07:00
Jay Lee	16fb04434f	expose IngestExternalFile to c abi Summary: IngestExternalFile is very useful when doing bulk load. This pr expose this API to c so many bindings can benefit from it too. Closes https://github.com/facebook/rocksdb/pull/1454 Differential Revision: D4113420 Pulled By: yiwu-arbug fbshipit-source-id: 307c6ae	2016-11-01 17:09:39 -07:00
Andrew Kryczka	40a2e406f8	DeleteRange flush support Summary: Changed BuildTable() (used for flush) to (1) add range tombstones to the aggregator, which is used by CompactionIterator to determine which keys can be removed; and (2) add aggregator's range tombstones to the table that is output for the flush. Closes https://github.com/facebook/rocksdb/pull/1438 Differential Revision: D4100025 Pulled By: ajkr fbshipit-source-id: cb01a70	2016-10-31 20:54:18 -07:00
Vladislav Vaintroub	d5555d95a3	Fix MSVC compile error in 32 bit compilation Summary: Passing std::atomic<uint64_t> variables to ASSERT_EQ() results in compile error C2718 'const T1': actual parameter with requested alignment of 8 won't be aligned. VS2015 defines std::atomic as specially aligned type ( with 'alignas'), however the compiler does not like declspec(align)ed function arguments. Worked around by casting std::atomic<uint64_t> types to uint64_t in ASSERT_EQ. Closes https://github.com/facebook/rocksdb/pull/1450 Differential Revision: D4106788 Pulled By: yiwu-arbug fbshipit-source-id: 5fb42c3	2016-10-31 17:24:18 -07:00
Siying Dong	da61f348d3	Print compression and Fast CRC support info as Header level Summary: Currently the compression suppport and fast CRC support information is printed as info level. They should be in the same level as options, which is header level. Also add ZSTD to this printing. Closes https://github.com/facebook/rocksdb/pull/1448 Differential Revision: D4106608 Pulled By: yiwu-arbug fbshipit-source-id: cb9a076	2016-10-31 16:09:13 -07:00
Siying Dong	c90c48d3c8	Show More DB Stats in info logs Summary: DB Stats now are truncated if there are too many CFs. Extend the buffer size to allow more to be printed out. Also, separate out malloc to another log line. Closes https://github.com/facebook/rocksdb/pull/1439 Differential Revision: D4100943 Pulled By: yiwu-arbug fbshipit-source-id: 79f7218	2016-10-29 16:09:18 -07:00
Siying Dong	1b295ac8ae	DBTest.GetThreadStatus: Wait for test results for longer Summary: The current 10 millisecond waiting for test results may not be sufficient in some test environments. Increase it to 60 seconds and check the results for every 1 milliseond. Already reviewed: https://reviews.facebook.net/D65457 Closes https://github.com/facebook/rocksdb/pull/1437 Differential Revision: D4099443 Pulled By: siying fbshipit-source-id: cf1f205	2016-10-29 16:09:18 -07:00
Aaron Gao	b50a81a2bb	Add a test for tailing_iterator Summary: A bug that tailingIterator->Seek(target) skips records. I think the bug is in the SeekInternal starting at lines 387: search_left_bound > search_right_bound There are only 2 cases this can happen: (1) target key is smaller than left most file (2) target key is larger than right most file The comment is wrong, there is another possibility that at the higher level there is a big gap such that the file in the lower level fits completely in the gap and then indexer->GetNextLevelIndex returns search_left_bound > search_right_bound I think pointing on the files after and before the gap. details: https://github.com/facebook/rocksdb/issues/1372 fixed this bug with test case added. Closes https://github.com/facebook/rocksdb/pull/1436 Reviewed By: IslamAbdelRahman Differential Revision: D4099313 Pulled By: lightmark fbshipit-source-id: 6a675b3	2016-10-28 18:24:14 -07:00
Siying Dong	04751d5345	L0 compression should follow options.compression_per_level if not empty Summary: Currently, we don't use options.compression_per_level[0] as the compression style for L0 compression type, unless it is None. This behavior doesn't look like on purpose. This diff will make sure L0 compress using the style of options.compression_per_level[0]. Reviewed and accepted in: https://reviews.facebook.net/D65607 Closes https://github.com/facebook/rocksdb/pull/1435 Differential Revision: D4099368 Pulled By: siying fbshipit-source-id: cfbbdcd	2016-10-28 17:39:20 -07:00
Andrew Kryczka	2946cadc46	Improve RangeDelAggregator documentation Summary: as requested in D62259 Closes https://github.com/facebook/rocksdb/pull/1434 Differential Revision: D4099047 Pulled By: ajkr fbshipit-source-id: a258cfb	2016-10-28 15:54:21 -07:00
Aaron Gao	bc429de490	revert fractional cascading in farward iterator Summary: As offline discussion with Siying, revert this since it has bug with seek. Test Plan: make check -j64 Reviewers: yiwu, andrewkr, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D65559	2016-10-28 10:25:39 -07:00
Andrew Kryczka	b9bc7a2aa4	Use skiplist rep for range tombstone memtable Summary: somehow missed committing this update in D62217 Test Plan: make check Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D65361	2016-10-27 10:07:28 -07:00
Siying Dong	9ee84067f6	Disable DBTest.RepeatedWritesToSameKey (#1420 ) Summary: The verification condition of the test DBTest.RepeatedWritesToSameKey doesn't hold anymore after `3ce3bb3da2`. Disable the test for now before we find a way to replace it. Test Plan: Run the test and make sure it is disabled.	2016-10-25 10:23:50 -07:00
Aaron Gao	9de2f75216	revert Prev() in MergingIterator to use previous code in non-prefix-seek mode Summary: Siying suggested to keep old code for normal mode prev() for safety Test Plan: make check -j64 Reviewers: yiwu, andrewkr, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D65439	2016-10-24 13:13:01 -07:00
sdong	24495186da	DBSSTTest.RateLimitedDelete: not to use real clock Summary: Using real clock causes failures of DBSSTTest.RateLimitedDelete in some cases. Turn away from the real time. Use fake time instead. Test Plan: Run the tests and all existing tests. Reviewers: yiwu, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D65145	2016-10-24 10:35:00 -07:00
sdong	1168cb810a	Fix a bug that may cause a deleted row to appear again Summary: The previous fix of reappearing of a deleted row `0ce258f9b3` missed a corner case, which can be reproduced using test CompactionPickerTest.OverlappingUserKeys7. Consider such an example: input level file: 1[B E] 2[F H] output level file: 3[A C] 4[D I] 5[I K] First file 2 is picked, which overlaps to file 4. 4 expands to 5. Now the all range is [D K] with 2 output level files. When we try to expand that, [D K] overlaps with file 1 and 2 in the input level, and 1 and 2 overlaps with 3 and 4 in the output level. So we end up with picking 3 and 4 in the output level. Without expanding, it also has 2 files, so we determine the output level doesn't change, although they are the different two files. The fix is to expand the output level files after we picked 3 and 4. In that case, there will be three output level files so we will abort the expanding. I also added two unit tests related to marked_for_compaction and being_compacted. They have been passing though. Test Plan: Run the new unit test, as well as all other tests. Reviewers: andrewkr, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: yoshinorim, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D65373	2016-10-24 09:49:07 -07:00
Edouard A	99c052a34f	Fix integer overflow in GetL0ThresholdSpeedupCompaction (#1378 )	2016-10-23 18:43:29 -07:00
Aaron Gao	59a7c0337b	Change ioptions to store user_comparator, fix bug Summary: change ioptions.comparator to user_comparator instread of internal_comparator. Also change Comparator* to InternalKeyComparator* to make its type explicitly. Test Plan: make all check -j64 Reviewers: andrewkr, sdong, yiwu Reviewed By: yiwu Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D65121	2016-10-21 11:31:42 -07:00
Islam AbdelRahman	869ae5d786	Support IngestExternalFile (remove AddFile restrictions) Summary: Changes in the diff API changes: - Introduce IngestExternalFile to replace AddFile (I think this make the API more clear) - Introduce IngestExternalFileOptions (This struct will encapsulate the options for ingesting the external file) - Deprecate AddFile() API Logic changes: - If our file overlap with the memtable we will flush the memtable - We will find the first level in the LSM tree that our file key range overlap with the keys in it - We will find the lowest level in the LSM tree above the the level we found in step 2 that our file can fit in and ingest our file in it - We will assign a global sequence number to our new file - Remove AddFile restrictions by using global sequence numbers Other changes: - Refactor all AddFile logic to be encapsulated in ExternalSstFileIngestionJob Test Plan: unit tests (still need to add more) addfile_stress (https://reviews.facebook.net/D65037) Reviewers: yiwu, andrewkr, lightmark, yhchiang, sdong Reviewed By: sdong Subscribers: jkedgar, hcz, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D65061	2016-10-20 17:05:32 -07:00
sdong	1d9dbef64e	Restrict running condition of UniversalCompactionTrivialMoveTest2 Summary: DBTestUniversalCompaction.UniversalCompactionTrivialMoveTest2 verifies non-trivial move is not triggered if we load data in sequential order. However, if there are multiple compaction threads, this conditon may not hold. Restrict the running condition to 1 compaction thread to make the test more robust. Test Plan: Run the test and make sure at least it doesn't regress normally. Reviewers: yhchiang, andrewkr, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D65277	2016-10-20 15:43:00 -07:00
sdong	fb2e412943	column_family_test: disable some tests in LITE Summary: Some tests in column_family_test depend on functions that are not available in LITE build, which sometimes cause flakiness. Disable them. Test Plan: Run those tests in LITE build. Reviewers: yiwu, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D65271	2016-10-19 15:55:56 -07:00
Aaron Gao	5af651db24	fix data race in compact_files_test Summary: fix data race Test Plan: compact_files_test Reviewers: sdong, yiwu, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D65259	2016-10-19 13:37:51 -07:00
Andrew Kryczka	a0ba0aa877	Fix uninitialized variable gcc error for MyRocks Summary: make sure seq_ is properly initialized even if ParseInternalKey() fails. Test Plan: run myrocks release tests Reviewers: lightmark, mung, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D65199	2016-10-19 10:59:46 -07:00
Islam AbdelRahman	b88f8e87c5	Support SST files with Global sequence numbers [reland] Summary: reland https://reviews.facebook.net/D62523 - Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno` - Update TableProperties to be aware of the offset of each property in the file - Update BlockBasedTableReader and Block to be able to honor the sequence number in `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence number in the file Something worth mentioning is that we don't update the seqno in the index block since and when doing a binary search, the reason for that is that it's guaranteed that SST files with global seqno will have only one user_key and each key will have seqno=0 encoded in it, This mean that this key is greater than any other key with seqno> 0. That mean that we can actually keep the current logic for these blocks Test Plan: unit tests Reviewers: sdong, yhchiang Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D65211	2016-10-18 16:59:37 -07:00
Aaron Gao	52c9808c3a	not split file in compaciton on level 0 Summary: we should not split file on level 0 in compaction because it will fail the following verification of seqno order on level 0 Test Plan: check with filldeterministic in db_bench Reviewers: yhchiang, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D65193	2016-10-18 16:30:34 -07:00
Aaron Gao	5e0d6b4cc9	fix db_stress assertion failure Summary: in rocksdb::DBIter::FindValueForCurrentKey(), last_not_merge_type could also be SingleDelete() which is omitted Test Plan: db_iter_test Reviewers: yhchiang, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D65187	2016-10-18 16:07:10 -07:00
sdong	b4d07123c4	SamePrefixTest.InDomainTest to clear the test directory before testing Summary: SamePrefixTest.InDomainTest may fail if the previous run of some test cases in prefix_test fail. Test Plan: Run the test Reviewers: lightmark, yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D65163	2016-10-18 14:01:10 -07:00
Islam AbdelRahman	aa09d03381	Avoid calling GetDBOptions() inside GetFromBatchAndDB() Summary: MyRocks hit a regression, @mung generated perf reports showing that the reason is the cost of calling `GetDBOptions()` inside `GetFromBatchAndDB()` This diff avoid calling `GetDBOptions` and use the `ImmutableDBOptions` instead Test Plan: make check -j64 Reviewers: sdong, yiwu Reviewed By: yiwu Subscribers: andrewkr, dhruba, mung Differential Revision: https://reviews.facebook.net/D65151	2016-10-18 13:19:26 -07:00
Andrew Kryczka	6fbe96baf8	Compaction Support for Range Deletion Summary: This diff introduces RangeDelAggregator, which takes ownership of iterators provided to it via AddTombstones(). The tombstones are organized in a two-level map (snapshot stripe -> begin key -> tombstone). Tombstone creation avoids data copy by holding Slices returned by the iterator, which remain valid thanks to pinning. For compaction, we create a hierarchical range tombstone iterator with structure matching the iterator over compaction input data. An aggregator based on that iterator is used by CompactionIterator to determine which keys are covered by range tombstones. In case of merge operand, the same aggregator is used by MergeHelper. Upon finishing each file in the compaction, relevant range tombstones are added to the output file's range tombstone metablock and file boundaries are updated accordingly. To check whether a key is covered by range tombstone, RangeDelAggregator::ShouldDelete() considers tombstones in the key's snapshot stripe. When this function is used outside of compaction, it also checks newer stripes, which can contain covering tombstones. Currently the intra-stripe check involves a linear scan; however, in the future we plan to collapse ranges within a stripe such that binary search can be used. RangeDelAggregator::AddToBuilder() adds all range tombstones in the table's key-range to a new table's range tombstone meta-block. Since range tombstones may fall in the gap between files, we may need to extend some files' key-ranges. The strategy is (1) first file extends as far left as possible and other files do not extend left, (2) all files extend right until either the start of the next file or the end of the last range tombstone in the gap, whichever comes first. One other notable change is adding release/move semantics to ScopedArenaIterator such that it can be used to transfer ownership of an arena-allocated iterator, similar to how unique_ptr is used for malloc'd data. Depends on D61473 Test Plan: compaction_iterator_test, mock_table, end-to-end tests in D63927 Reviewers: sdong, IslamAbdelRahman, wanning, yhchiang, lightmark Reviewed By: lightmark Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D62205	2016-10-18 12:04:56 -07:00
Andrew Kryczka	f47054015d	Handle WAL deletion when using avoid_flush_during_recovery Summary: Previously the WAL files that were avoided during recovery would never be considered for deletion. That was because alive_log_files_ was only populated when log files are created. This diff further populates alive_log_files_ with existing log files that aren't flushed during recovery, such that FindObsoleteFiles() can find them later. Depends on D64053. Test Plan: new unit test, verifies it fails before this change and passes after Reviewers: sdong, IslamAbdelRahman, yiwu Reviewed By: yiwu Subscribers: leveldb, dhruba, andrewkr Differential Revision: https://reviews.facebook.net/D64059	2016-10-14 12:59:51 -07:00
Yi Wu	e29d3b67c2	Make max_background_compactions and base_background_compactions dynamic changeable Summary: Add DB::SetDBOptions to dynamic change max_background_compactions and base_background_compactions. I'll add more dynamic changeable options soon. Test Plan: unit test. Reviewers: yhchiang, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64749	2016-10-14 12:25:39 -07:00
Aaron Gao	21e8daced5	fix assertion failure in Prev() Summary: fix assertion failure in db_stress. It happens because of prefix seek key is larger than merge iterator key when they have the same user key Test Plan: ./db_stress --max_background_compactions=1 --max_write_buffer_number=3 --sync=0 --reopen=20 --write_buffer_size=33554432 --delpercent=5 --log2_keys_per_lock=10 --block_size=16384 --allow_concurrent_memtable_write=0 --test_batches_snapshots=0 --max_bytes_for_level_base=67108864 --progress_reports=0 --mmap_read=0 --writepercent=35 --disable_data_sync=0 --readpercent=50 --subcompactions=4 --ops_per_thread=20000000 --memtablerep=skip_list --prefix_size=0 --target_file_size_multiplier=1 --column_families=1 --threads=32 --disable_wal=0 --open_files=500000 --destroy_db_initially=0 --target_file_size_base=16777216 --nooverwritepercent=1 --iterpercent=10 --max_key=100000000 --prefixpercent=0 --use_clock_cache=false --kill_random_test=888887 --cache_size=1048576 --verify_checksum=1 Reviewers: sdong, andrewkr, yiwu, yhchiang Reviewed By: yhchiang Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D65025	2016-10-13 17:36:48 -07:00
Yueh-Hsuan Chiang	040328a30d	Remove an assertion for single-delete in MergeHelper::MergeUntil Summary: Previously we have an assertion which triggers when we issue Merges after a single delete. However, merges after a single delete are unrelated to that single delete. Thus this behavior should be allowed. This will address a flakyness of db_stress. Test Plan: db_stress Reviewers: IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64923	2016-10-13 14:26:57 -07:00
Islam AbdelRahman	f26a139d89	Log successful AddFile Summary: Log successful AddFile Test Plan: visually check LOG file Reviewers: yiwu, andrewkr, lightmark, sdong Reviewed By: sdong Subscribers: andrewkr, jkedgar, dhruba Differential Revision: https://reviews.facebook.net/D65019	2016-10-13 11:56:27 -07:00
Islam AbdelRahman	5691a1d8a4	Fix compaction conflict with running compaction Summary: Issue scenario: (1) We have 3 files in L1 and we issue a compaction that will compact them into 1 file in L2 (2) While compaction (1) is running, we flush a file into L0 and trigger another compaction that decide to move this file to L1 and then move it again to L2 (this file don't overlap with any other files) (3) compaction (1) finishes and install the file it generated in L2, but this file overlap with the file we generated in (2) so we break the LSM consistency Looks like this issue can be triggered by using non-exclusive manual compaction or AddFile() Test Plan: unit tests Reviewers: sdong Reviewed By: sdong Subscribers: hermanlee4, jkedgar, andrewkr, dhruba, yoshinorim Differential Revision: https://reviews.facebook.net/D64947	2016-10-13 10:49:06 -07:00
Andrew Kryczka	017de666c7	fixup commit Summary: I accidentally left out these changes from my commit of D64053 due to messing up the merge conflict resolution. Test Plan: ./db_wal_test Reviewers: Subscribers: Tasks: Blame Revision: D64053	2016-10-13 08:48:40 -07:00
Andrew Kryczka	1b7af5fb1a	Redo handling of recycled logs in full purge Summary: This reverts commit `9e4aa798c3`, which doesn't handle all cases (see inline comment). I reimplemented the logic as suggested in the initial PR: https://github.com/facebook/rocksdb/pull/1313. This approach has two benefits: - All the parsing/filtering of full_scan_candidate_files is kept together in PurgeObsoleteFiles. - We only need to check whether log file is recycled in one place where we've already determined it's a log file Test Plan: new unit test, verified fails before the original fix, still passes now. Reviewers: IslamAbdelRahman, yiwu, sdong Reviewed By: yiwu, sdong Subscribers: leveldb, dhruba, andrewkr Differential Revision: https://reviews.facebook.net/D64053	2016-10-12 23:13:09 -07:00
Aaron Gao	447f17127c	new Prev() prefix support using SeekForPrev() Summary: 1) The previous solution for Prev() prefix support is not clean. Since I add api SeekForPrev(), now the Prev() can be symmetric to Next(). and we do not need SeekToLast() to be called in Prev() any more. Also, Next() will Seek(prefix_seek_key_) to solve the problem of possible inconsistency between db_iter and merge_iter when there is merge_operator. And prefix_seek_key is only refreshed when change direction to forward. 2) This diff also solves the bug of Iterator::SeekToLast() with iterate_upper_bound_ with prefix extractor. add test cases for the above two cases. There are some tests for the SeekToLast() in Prev(), I will clean them later. Test Plan: make all check Reviewers: IslamAbdelRahman, andrewkr, yiwu, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D63933	2016-10-11 13:54:26 -07:00
Islam AbdelRahman	2ad68b971a	Support running consistency checks in release mode Summary: We always run consistency checks when compiling in debug mode allow users to set Options::force_consistency_checks to true to be able to run such checks even when compiling in release mode Test Plan: make check -j64 make release Reviewers: lightmark, sdong, yiwu Reviewed By: yiwu Subscribers: hermanlee4, andrewkr, yoshinorim, jkedgar, dhruba Differential Revision: https://reviews.facebook.net/D64701	2016-10-07 17:21:45 -07:00
Islam AbdelRahman	d062328977	Revert "Support SST files with Global sequence numbers" This reverts commit `ab01da5437`.	2016-10-07 14:05:12 -07:00
Reid Horuff	2c1f95291d	Add facility to write only a portion of WriteBatch to WAL Summary: When constructing a write batch a client may now call MarkWalTerminationPoint() on that batch. No batch operations after this call will be added written to the WAL but will still be inserted into the Memtable. This facility is used to remove one of the three WriteImpl calls in 2PC transactions. This produces a ~1% perf improvement. ``` RocksDB - unoptimized 2pc, sync_binlog=1, disable_2pc=off INFO 2016-08-31 14:30:38,814 [main]: REQUEST PHASE COMPLETED. 75000000 requests done in 2619 seconds. Requests/second = 28628 RocksDB - optimized 2pc , sync_binlog=1, disable_2pc=off INFO 2016-08-31 16:26:59,442 [main]: REQUEST PHASE COMPLETED. 75000000 requests done in 2581 seconds. Requests/second = 29054 ``` Test Plan: Two unit tests added. Reviewers: sdong, yiwu, IslamAbdelRahman Reviewed By: yiwu Subscribers: hermanlee4, dhruba, andrewkr Differential Revision: https://reviews.facebook.net/D64599	2016-10-07 11:32:10 -07:00
Islam AbdelRahman	ab01da5437	Support SST files with Global sequence numbers Summary: - Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno` - Update TableProperties to be aware of the offset of each property in the file - Update BlockBasedTableReader and Block to be able to honor the sequence number in `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence number in the file Something worth mentioning is that we don't update the seqno in the index block since and when doing a binary search, the reason for that is that it's guaranteed that SST files with global seqno will have only one user_key and each key will have seqno=0 encoded in it, This mean that this key is greater than any other key with seqno> 0. That mean that we can actually keep the current logic for these blocks Test Plan: unit tests Reviewers: andrewkr, yhchiang, yiwu, sdong Reviewed By: sdong Subscribers: hcz, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D62523	2016-10-03 16:12:39 -07:00
Andrew Kryczka	6009c473c7	Store range tombstones in memtable Summary: - Store range tombstones in a separate MemTableRep instantiated with ColumnFamilyOptions::memtable_factory - MemTable::NewRangeTombstoneIterator() returns a MemTableIterator over the separate MemTableRep - Part of the read path is not implemented yet (i.e., MemTable::Get()) Test Plan: see unit tests Reviewers: wanning Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D62217	2016-09-30 09:06:43 -07:00
Aaron Gao	26388247aa	delete unused variable for PrevInterval() Summary: delete unused variable Test Plan: make check Reviewers: sdong, andrewkr, IslamAbdelRahman, tianx Reviewed By: tianx Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64509	2016-09-29 13:19:58 -07:00
Islam AbdelRahman	87dfc1d23e	Fix conflict between AddFile() and CompactRange() Summary: Fix the conflict bug between AddFile() and CompactRange() by - Make sure that no AddFile calls are running when asking CompactionPicker to pick compaction for manual compaction - If AddFile() run after we pick the compaction for the manual compaction it will be aware of it since we will add the manual compaction to running_compactions_ after picking it This will solve these 2 scenarios - If AddFile() is running, we will wait for it to finish before we pick a compaction for the manual compaction - If we already picked a manual compaction and then AddFile() started ... we ensure that it never ingest a file in a level that will overlap with the manual compaction Test Plan: unit tests Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, yoshinorim, jkedgar, dhruba Differential Revision: https://reviews.facebook.net/D64449	2016-09-28 15:42:06 -07:00
Aaron Gao	f517d9dd09	Add SeekForPrev() to Iterator Summary: Add new Iterator API, `SeekForPrev`: find the last key that <= target key support prefix_extractor support prefix_same_as_start support upper_bound not supported in iterators without Prev() Also add tests in db_iter_test and db_iterator_test Pass all tests Cheers! Test Plan: make all check -j64 Reviewers: andrewkr, yiwu, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64149	2016-09-27 18:20:57 -07:00
yiwu-arbug	eb3894cf42	Recompute compaction score on SetOptions (#1346 ) Summary: We didn't recompute compaction score on SetOptions, and end up not having compaction if no flush happens afterward. The PR fixing it. Test Plan: See unit test. Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D64167	2016-09-27 11:17:15 -07:00
Islam AbdelRahman	5c64fb67d2	Fix AddFile() conflict with compaction output [WaitForAddFile()] Summary: Since AddFile unlock/lock the mutex inside LogAndApply() we need to ensure that during this period other compactions cannot run since such compactions are not aware of the file we are ingesting and could create a compaction that overlap wit this file this diff add - WaitForAddFile() call that will ensure that no AddFile() calls are being processed right now - Call `WaitForAddFile()` in 3 locations -- When doing manual Compaction -- When starting automatic Compaction -- When doing CompactFiles() Test Plan: unit test Reviewers: lightmark, yiwu, andrewkr, sdong Reviewed By: sdong Subscribers: andrewkr, yoshinorim, jkedgar, dhruba Differential Revision: https://reviews.facebook.net/D64383	2016-09-27 00:14:55 -07:00
Islam AbdelRahman	9e9f5a0b92	Fix CompactFilesTest.ObsoleteFiles timeout (#1353 )	2016-09-26 10:39:07 -07:00
Aaron Gao	c2a62a4cb2	not cut compaction output when compact to level 0 Summary: we should not call ShouldStopBefore() in compaction when the compaction targets level 0. Otherwise, CheckConsistency will fail the assertion of seq number check on level 0. Test Plan: make all check -j64 I also manully test that using db_bench to compact files to level 0. Without this line change, the assertion files and multiple files are generated on level 0 after compaction. Reviewers: yhchiang, andrewkr, yiwu, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64269	2016-09-23 17:16:38 -07:00
Yi Wu	9ed928e7a9	Split DBOptions into ImmutableDBOptions and MutableDBOptions Summary: Use ImmutableDBOptions/MutableDBOptions internally and DBOptions only for user-facing APIs. MutableDBOptions is barely a placeholder for now. I'll start to move options to MutableDBOptions in following diffs. Test Plan: make all check Reviewers: yhchiang, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64065	2016-09-23 16:34:04 -07:00
yiwu-arbug	4bc8c88e6b	Recover same sequence id from WAL (#1350 ) Summary: Revert the behavior where we don't read sequence id from WAL, but increase it as we replay the log. We still keep the behave for 2PC for now but will fix later. This change fixes github issue 1339, where some writes come with WAL disabled and we may recover records with wrong sequence id. Test Plan: Added unit test. Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D64275	2016-09-23 16:15:14 -07:00
Aaron Gao	0a1bd9c509	add cfh deletion started listener Summary: add ColumnFamilyHandleDeletionStarted listener which can be called when user deletes handler. Test Plan: ./listener_test Reviewers: yiwu, IslamAbdelRahman, sdong, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D60717	2016-09-22 11:56:18 -07:00
Islam AbdelRahman	da5a9a65c1	Fix mac build	2016-09-21 20:22:09 -07:00
Islam AbdelRahman	abc0ae462b	Add AddFile() InternalStats for Total files/L0 files/total keys ingested Summary: Report more information about the ingested files in CF InternalStats - Total files - Total L0 files - Total keys There was also noticed that we were reporting files that failed to ingest, fix this bug Test Plan: print stats in tests Reviewers: sdong, andrewkr, lightmark Reviewed By: lightmark Subscribers: jkedgar, andrewkr, dhruba, yoshinorim Differential Revision: https://reviews.facebook.net/D63039	2016-09-21 14:24:08 -07:00
Aaron Gao	715256338a	forbid merge during recovery Summary: Mitigate regression bug of options.max_successive_merges hit during DB Recovery For https://reviews.facebook.net/D62625 Test Plan: make all check Reviewers: horuff, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D62655	2016-09-21 11:05:07 -07:00
Yi Wu	e4d3f5d9b8	Fix DBImpl::GetWalPreallocateBlockSize Mac build error Summary: Specify type param with std::min to resolve compile error on Mac. Test Plan: https://travis-ci.org/facebook/rocksdb/builds/161223845 Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64143	2016-09-20 10:17:28 -07:00
sdong	d78a4401b5	DBImpl::GetWalPreallocateBlockSize() should return size_t Summary: WritableFile::SetPreallocationBlockSize() requires parameter as size_t, and options used in DBImpl::GetWalPreallocateBlockSize() are all size_t. WritableFile::SetPreallocationBlockSize() should return size_t to avoid build break if size_t is not uint64_t. Test Plan: Run existing tests. Reviewers: andrewkr, IslamAbdelRahman, yiwu Reviewed By: yiwu Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D64137	2016-09-19 16:51:38 -07:00
sdong	b666f85445	Consider more factors when determining preallocation size of WAL files Summary: Currently the WAL file preallocation size is 1.1 * write_buffer_size. This, however, will be over-estimated if options.db_write_buffer_size or options.max_total_wal_size is set and is much smaller. Test Plan: Add a unit test. Reviewers: andrewkr, yiwu Reviewed By: yiwu Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D63957	2016-09-19 12:04:35 -07:00
rockeet	4c3f4496b5	Add TableBuilderOptions::level and relevant changes (#1335 )	2016-09-17 22:30:43 -07:00
Yi Wu	0a88f38b7e	Remove ColumnFamilyData::options() Summary: One more small refactor before I split DBOptions into mutable and immutable parts. Test Plan: existing unit tests. Reviewers: yhchiang, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64047	2016-09-16 15:09:14 -07:00
Yi Wu	8d9bf5c498	Fix DBOptionsTest.GetLatestOptions Summary: RandomInitCFOptions will allocate a new compaction filter, which we have to delete afterward. Test Plan: valgrind against the test Reviewers: IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64011	2016-09-15 14:57:32 -07:00
Yi Wu	40cfa3e021	Fix DBWALTest.RecoveryWithLogDataForSomeCFs with mac Summary: Seems there's no std::array on mac+clang. Use raw array instead. Test Plan: run ./db_wal_test on mac. Reviewers: andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64005	2016-09-15 13:44:33 -07:00
Andrew Kryczka	06b4785fec	Fix recovery for WALs without data for all CFs Summary: if one or more CFs had no data in the WAL, the log number that's used by FindObsoleteFiles() wasn't updated. We need to treat this case the same as if the data for that WAL had been flushed. Test Plan: new unit test Reviewers: IslamAbdelRahman, yiwu, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D63963	2016-09-15 11:40:48 -07:00
Andrew Kryczka	d7242ff4d5	Fix GetSortedWalFiles when log recycling enabled Summary: Previously the sequence number was mistakenly passed in an argument where the log number should go. This caused the reader to assume the old WAL format was used, which is incompatible with the WAL recycling format. Test Plan: new unit test, verified it fails before this change and passes afterwards. Reviewers: yiwu, lightmark, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D63987	2016-09-15 09:55:02 -07:00
Yi Wu	17f76fc564	DB::GetOptions() reflect dynamic changed options Summary: DB::GetOptions() reflect dynamic changed options. Test Plan: See the new unit test. Reviewers: yhchiang, sdong, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D63903	2016-09-14 22:10:28 -07:00
Yi Wu	81747f1be6	Refactor MutableCFOptions Summary: * Change constructor of MutableCFOptions to depends only on ColumnFamilyOptions. * Move `max_subcompactions`, `compaction_options_fifo` and `compaction_pri` to ImmutableCFOptions to make it clear that they are immutable. Test Plan: existing unit tests. Reviewers: yhchiang, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D63945	2016-09-13 21:11:59 -07:00
somnathr	9e4aa798c3	Summary: (#1313 ) If log recycling is enabled with the rocksdb (recycle_log_file_num=16) db->Writebatch is erroring out with keynotfound after ~5-6 hours of run (1M seq but can happen to any workload I guess).See my detailed bug report here (https://github.com/facebook/rocksdb/issues/1303). This commit is the fix for this, a check is been added not to delete the log file if it is already there in the recycle list. Test Plan: Unit tested it and ran the similar profile. Not reproducing anymore.	2016-09-12 16:53:42 -07:00
Adam Faulkner	a10e8a056d	Fix C api memtable rep bugs. (#1328 )	2016-09-12 15:31:42 -07:00
zhangjinpeng1987	b06b191362	add C api for set wal_recovery_mode (#1327 ) * add C api for set wal recovery mode * add test	2016-09-09 10:11:30 -07:00
Islam AbdelRahman	1cca091298	Temporarily revert Prev() prefix support Summary: Temporarily revert commits for supporting prefix Prev() to unblock MyRocks and RocksDB release These are the commits reverted - `6a14d55bd9` - `b18f9c9eac` - `db74b1a219` - `2482d5fb45` Test Plan: make check -j64 Reviewers: sdong, lightmark Reviewed By: lightmark Subscribers: andrewkr, dhruba, yoshinorim Differential Revision: https://reviews.facebook.net/D63789	2016-09-08 14:45:32 -07:00
Islam AbdelRahman	52ee07b021	Move AddFile() tests to external_sst_file_test.cc Summary: Simply move the tests Test Plan: make check -j64 Reviewers: andrewkr, lightmark, yiwu, yhchiang, kradhakrishnan, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D62529	2016-09-07 15:41:54 -07:00
Edouard A	66a91e2607	Add NoSpace subcode to IOError (#1320 ) Add a sub code to distinguish "out of space" errors from regular I/O errors	2016-09-07 12:37:45 -07:00
sdong	67036c0406	Fix Flaky ColumnFamilyTest.FlushCloseWALFiles Summary: In ColumnFamilyTest.FlushCloseWALFiles, there is a small window in which the flush has finished but the log writer is not yet closed, causing the assert failure. Fix it by explicitly waiting the flush job to finish. Test Plan: Run the test many times in high parallelism. Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D63423	2016-09-07 11:19:15 -07:00
sdong	607628d349	Support ZSTD with finalized format Summary: ZSTD 1.0.0 is coming. We can finally add a support of ZSTD without worrying about compatibility. Still keep ZSTDNotFinal for compatibility reason. Test Plan: Run all tests. Run db_bench with ZSTD version with RocksDB built with ZSTD 1.0 and older. Reviewers: andrewkr, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: cyan, igor, IslamAbdelRahman, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D63141	2016-09-06 12:22:16 -07:00
Injun Song	ce1be2ce37	Fix build error on Windows (AppVeyor) (#1315 ) Add 'cf_options' to source list and db_imple.cc fix casting	2016-09-06 08:41:43 -07:00

... 5 6 7 8 9 ...

3072 Commits