rocksdb

Author	SHA1	Message	Date
Maysam Yabandeh	e9f91a5176	Add a fetch_add variation to AddDBStats Summary: AddDBStats is in two steps of load and store, which is more efficient than fetch_add. This is however not thread-safe. Currently we have to protect concurrent access to AddDBStats with a mutex which is less efficient that fetch_add. This patch adds the option to do fetch_add when AddDBStats. The results for my 2pc benchmark on sysbench is: - vanilla: 68618 tps - removing mutex on AddDBStats (unsafe): 69767 tps - fetch_add for all AddDBStats: 69200 tps - fetch_add only for concurrently access AddDBStats (this patch): 69579 tps Closes https://github.com/facebook/rocksdb/pull/2505 Differential Revision: D5330656 Pulled By: maysamyabandeh fbshipit-source-id: af64d7bee135b0e86b4fac323a4f9d9113eaa383	2017-06-29 17:12:19 -07:00
zhangjinpeng1987	c1b375e96a	skip generating empty sst Summary: When a compaction job output nothing, there is no necessary to generate a empty sst file which will cause `VersionEdit::EncodeTo` failed. ref https://github.com/facebook/rocksdb/issues/2478 Closes https://github.com/facebook/rocksdb/pull/2503 Differential Revision: D5350799 Pulled By: ajkr fbshipit-source-id: df0b4fcf3507fe1c3c435208b762e75478e00143	2017-06-29 15:26:52 -07:00
Mike Kolupaev	397ab11152	Improve Status message for block checksum mismatches Summary: We've got some DBs where iterators return Status with message "Corruption: block checksum mismatch" all the time. That's not very informative. It would be much easier to investigate if the error message contained the file name - then we would know e.g. how old the corrupted file is, which would be very useful for finding the root cause. This PR adds file name, offset and other stuff to some block corruption-related status messages. It doesn't improve all the error messages, just a few that were easy to improve. I'm mostly interested in "block checksum mismatch" and "Bad table magic number" since they're the only corruption errors that I've ever seen in the wild. Closes https://github.com/facebook/rocksdb/pull/2507 Differential Revision: D5345702 Pulled By: al13n321 fbshipit-source-id: fc8023d43f1935ad927cef1b9c55481ab3cb1339	2017-06-28 21:27:01 -07:00
Siying Dong	18c63af6ef	Make "make analyze" happy Summary: "make analyze" is reporting some errors. It's complicated to look but it seems to me that they are all false positive. Anyway, I think cleaning them up is a good idea. Some of the changes are hacky but I don't know a better way. Closes https://github.com/facebook/rocksdb/pull/2508 Differential Revision: D5341710 Pulled By: siying fbshipit-source-id: 6070e430e0e41a080ef441e05e8ec827d45efab6	2017-06-28 15:42:27 -07:00
Maysam Yabandeh	01534db24c	Fix the reported asan issues Summary: This is to resolve the asan complains. In the meanwhile I am working on clarifying/revisiting the sync rules. Closes https://github.com/facebook/rocksdb/pull/2510 Differential Revision: D5338660 Pulled By: yiwu-arbug fbshipit-source-id: ce6f6e0826d43a2c0bfa4328a00c78f73cd6498a	2017-06-28 13:13:22 -07:00
Sagar Vemuri	1cd45cd1b3	FIFO Compaction with TTL Summary: Introducing FIFO compactions with TTL. FIFO compaction is based on size only which makes it tricky to enable in production as use cases can have organic growth. A user requested an option to drop files based on the time of their creation instead of the total size. To address that request: - Added a new TTL option to FIFO compaction options. - Updated FIFO compaction score to take TTL into consideration. - Added a new table property, creation_time, to keep track of when the SST file is created. - Creation_time is set as below: - On Flush: Set to the time of flush. - On Compaction: Set to the max creation_time of all the files involved in the compaction. - On Repair and Recovery: Set to the time of repair/recovery. - Old files created prior to this code change will have a creation_time of 0. - FIFO compaction with TTL is enabled when ttl > 0. All files older than ttl will be deleted during compaction. i.e. `if (file.creation_time < (current_time - ttl)) then delete(file)`. This will enable cases where you might want to delete all files older than, say, 1 day. - FIFO compaction will fall back to the prior way of deleting files based on size if: - the creation_time of all files involved in compaction is 0. - the total size (of all SST files combined) does not drop below `compaction_options_fifo.max_table_files_size` even if the files older than ttl are deleted. This feature is not supported if max_open_files != -1 or with table formats other than Block-based. Test Plan: Added tests. Benchmark results: Base: FIFO with max size: 100MB :: ``` svemuri@dev15905 ~/rocksdb (fifo-compaction) $ TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=readwhilewriting --num=5000000 --threads=16 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=100 readwhilewriting : 1.924 micros/op 519858 ops/sec; 13.6 MB/s (1176277 of 5000000 found) ``` With TTL (a low one for testing) :: ``` svemuri@dev15905 ~/rocksdb (fifo-compaction) $ TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=readwhilewriting --num=5000000 --threads=16 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=100 --fifo_compaction_ttl=20 readwhilewriting : 1.902 micros/op 525817 ops/sec; 13.7 MB/s (1185057 of 5000000 found) ``` Example Log lines: ``` 2017/06/26-15:17:24.609249 7fd5a45ff700 (Original Log Time 2017/06/26-15:17:24.609177) [db/compaction_picker.cc:1471] [default] FIFO compaction: picking file 40 with creation time 1498515423 for deletion 2017/06/26-15:17:24.609255 7fd5a45ff700 (Original Log Time 2017/06/26-15:17:24.609234) [db/db_impl_compaction_flush.cc:1541] [default] Deleted 1 files ... 2017/06/26-15:17:25.553185 7fd5a61a5800 [DEBUG] [db/db_impl_files.cc:309] [JOB 0] Delete /dev/shm/dbbench/000040.sst type=2 #40 -- OK 2017/06/26-15:17:25.553205 7fd5a61a5800 EVENT_LOG_v1 {"time_micros": 1498515445553199, "job": 0, "event": "table_file_deletion", "file_number": 40} ``` SST Files remaining in the dbbench dir, after db_bench execution completed: ``` svemuri@dev15905 ~/rocksdb (fifo-compaction) $ ls -l /dev/shm//dbbench/*.sst -rw-r--r--. 1 svemuri users 30749887 Jun 26 15:17 /dev/shm//dbbench/000042.sst -rw-r--r--. 1 svemuri users 30768779 Jun 26 15:17 /dev/shm//dbbench/000044.sst -rw-r--r--. 1 svemuri users 30757481 Jun 26 15:17 /dev/shm//dbbench/000046.sst ``` Closes https://github.com/facebook/rocksdb/pull/2480 Differential Revision: D5305116 Pulled By: sagar0 fbshipit-source-id: 3e5cfcf5dd07ed2211b5b37492eb235b45139174	2017-06-27 17:11:48 -07:00
Siying Dong	89468c01d4	Fix Windows build broken by `5c97a7c066` Summary: A typo conversion fails Windows build. Fix it. Closes https://github.com/facebook/rocksdb/pull/2500 Differential Revision: D5325962 Pulled By: siying fbshipit-source-id: 2cefdafc9afbc85f856f403af7c876b622400630	2017-06-26 17:28:22 -07:00
Ewout Prangsma	51778612c9	Encryption at rest support Summary: This PR adds support for encrypting data stored by RocksDB when written to disk. It adds an `EncryptedEnv` override of the `Env` class with matching overrides for sequential&random access files. The encryption itself is done through a configurable `EncryptionProvider`. This class creates is asked to create `BlockAccessCipherStream` for a file. This is where the actual encryption/decryption is being done. Currently there is a Counter mode implementation of `BlockAccessCipherStream` with a `ROT13` block cipher (NOTE the `ROT13` is for demo purposes only!!). The Counter operation mode uses an initial counter & random initialization vector (IV). Both are created randomly for each file and stored in a 4K (default size) block that is prefixed to that file. The `EncryptedEnv` implementation is such that clients of the `Env` class do not see this prefix (nor data, nor in filesize). The largest part of the prefix block is also encrypted, and there is room left for implementation specific settings/values/keys in there. To test the encryption, the `DBTestBase` class has been extended to consider a new environment variable called `ENCRYPTED_ENV`. If set, the test will setup a encrypted instance of the `Env` class to use for all tests. Typically you would run it like this: ``` ENCRYPTED_ENV=1 make check_some ``` There is also an added test that checks that some data inserted into the database is or is not "visible" on disk. With `ENCRYPTED_ENV` active it must not find plain text strings, with `ENCRYPTED_ENV` unset, it must find the plain text strings. Closes https://github.com/facebook/rocksdb/pull/2424 Differential Revision: D5322178 Pulled By: sdwilsh fbshipit-source-id: 253b0a9c2c498cc98f580df7f2623cbf7678a27f	2017-06-26 16:56:24 -07:00
Siying Dong	5c97a7c066	Unit Tests for sync, range sync and file close failures Summary: Closes https://github.com/facebook/rocksdb/pull/2454 Differential Revision: D5255320 Pulled By: siying fbshipit-source-id: 0080830fa8eb5da6de25e17ba68aee91018c7913	2017-06-26 13:27:58 -07:00
Siying Dong	d757355cbf	Fix bug that flush doesn't respond to fsync result Summary: With a regression bug was introduced two years ago, by `6e9fbeb27c` , we fail to check return status of fsync call. This can cause we miss the information from the file system and can potentially cause corrupted data which we could have been detected. Closes https://github.com/facebook/rocksdb/pull/2495 Reviewed By: ajkr Differential Revision: D5321949 Pulled By: siying fbshipit-source-id: c68117914bb40700198fc37d0e4c63163a8a1031	2017-06-26 12:41:48 -07:00
Maysam Yabandeh	8e6345d2df	Update rename of ParanoidCheck Summary: Closes https://github.com/facebook/rocksdb/pull/2494 Differential Revision: D5317902 Pulled By: maysamyabandeh fbshipit-source-id: 097330292180816b3d0c9f4cbbdb6f68f0180200	2017-06-24 18:12:03 -07:00
Maysam Yabandeh	499ebb3ab5	Optimize for serial commits in 2PC Summary: Throughput: 46k tps in our sysbench settings (filling the details later) The idea is to have the simplest change that gives us a reasonable boost in 2PC throughput. Major design changes: 1. The WAL file internal buffer is not flushed after each write. Instead it is flushed before critical operations (WAL copy via fs) or when FlushWAL is called by MySQL. Flushing the WAL buffer is also protected via mutex_. 2. Use two sequence numbers: last seq, and last seq for write. Last seq is the last visible sequence number for reads. Last seq for write is the next sequence number that should be used to write to WAL/memtable. This allows to have a memtable write be in parallel to WAL writes. 3. BatchGroup is not used for writes. This means that we can have parallel writers which changes a major assumption in the code base. To accommodate for that i) allow only 1 WriteImpl that intends to write to memtable via mem_mutex_--which is fine since in 2PC almost all of the memtable writes come via group commit phase which is serial anyway, ii) make all the parts in the code base that assumed to be the only writer (via EnterUnbatched) to also acquire mem_mutex_, iii) stat updates are protected via a stat_mutex_. Note: the first commit has the approach figured out but is not clean. Submitting the PR anyway to get the early feedback on the approach. If we are ok with the approach I will go ahead with this updates: 0) Rebase with Yi's pipelining changes 1) Currently batching is disabled by default to make sure that it will be consistent with all unit tests. Will make this optional via a config. 2) A couple of unit tests are disabled. They need to be updated with the serial commit of 2PC taken into account. 3) Replacing BatchGroup with mem_mutex_ got a bit ugly as it requires releasing mutex_ beforehand (the same way EnterUnbatched does). This needs to be cleaned up. Closes https://github.com/facebook/rocksdb/pull/2345 Differential Revision: D5210732 Pulled By: maysamyabandeh fbshipit-source-id: 78653bd95a35cd1e831e555e0e57bdfd695355a4	2017-06-24 14:11:29 -07:00
Andrew Kryczka	71f5bcb730	Introduce OnBackgroundError callback Summary: Some users want to prevent rocksdb from entering read-only mode in certain error cases. This diff gives them a callback, `OnBackgroundError`, that they can use to achieve it. - call `OnBackgroundError` every time we consider setting `bg_error_`. Use its result to assign `bg_error_` but not to change the function's return status. - classified calls using `BackgroundErrorReason` to give the callback some info about where the error happened - renamed `ParanoidCheck` to something more specific so we can provide a clear `BackgroundErrorReason` - unit tests for the most common cases: flush or compaction errors Closes https://github.com/facebook/rocksdb/pull/2477 Differential Revision: D5300190 Pulled By: ajkr fbshipit-source-id: a0ea4564249719b83428e3f4c6ca2c49e366e9b3	2017-06-22 19:41:50 -07:00
Siying Dong	6837a17621	Fix Data Race Between CreateColumnFamily() and GetAggregatedIntProperty() Summary: CreateColumnFamily() releases DB mutex after adding column family to the set and install super version (to write option file), so if users call GetAggregatedIntProperty() in the middle, then super version will be null and the process will crash. Fix it by skipping those column families without super version installed. Maybe we should also fix the problem of releasing the lock when reading option file, but it is more risky. so I'm doing a quick and safer fix and we can investigate it later. Closes https://github.com/facebook/rocksdb/pull/2475 Differential Revision: D5298053 Pulled By: siying fbshipit-source-id: 4b3c8f91c60400b163fcc6cda8a0c77723be0ef6	2017-06-22 15:56:47 -07:00
Andrew Kryczka	9467eb6141	Fix flush assertion with tsan Summary: DBImpl's instance variables should only be accessed with mutex held. I moved an assert later to uphold this rule. DBTest.LastWriteBufferDelay test was sporadically failing TSAN because it tried to flush around the same time the db was destroyed, so the variable was accessed simultaneously by two threads. Closes https://github.com/facebook/rocksdb/pull/2471 Differential Revision: D5286857 Pulled By: ajkr fbshipit-source-id: 435abd84efa601f667c254e320b0bb5a434b971f	2017-06-20 16:43:44 -07:00
Dmitri Smirnov	a21db161c9	Implement ReopenWritibaleFile on Windows and other fixes Summary: Make default impl return NoSupported so the db_blob tests exist in a meaningful manner. Replace std::thread to port::Thread Closes https://github.com/facebook/rocksdb/pull/2465 Differential Revision: D5275563 Pulled By: yiwu-arbug fbshipit-source-id: cedf1a18a2c05e20d768c1308b3f3224dbd70ab6	2017-06-20 10:31:13 -07:00
zhangjinpeng1987	c430d69eed	fix coredump for release nullptr Summary: Coredump will be triggered when ingest external sst file after delete range. ref https://github.com/facebook/rocksdb/issues/2398 Closes https://github.com/facebook/rocksdb/pull/2463 Differential Revision: D5275599 Pulled By: ajkr fbshipit-source-id: 0828dbc062ea8c74e913877cd63494fd3478a30d	2017-06-19 11:41:38 -07:00
hyunwoo	6b5a5dc5d8	fixed typo Summary: fixed typo Closes https://github.com/facebook/rocksdb/pull/2430 Differential Revision: D5242471 Pulled By: IslamAbdelRahman fbshipit-source-id: 832eb3a4c70221444ccd2ae63217823fec56c748	2017-06-13 16:58:01 -07:00
Andrew Kryczka	c217e0b9c7	Call RateLimiter for compaction reads Summary: Allow users to rate limit background work based on read bytes, written bytes, or sum of read and written bytes. Support these by changing the RateLimiter API, so no additional options were needed. Closes https://github.com/facebook/rocksdb/pull/2433 Differential Revision: D5216946 Pulled By: ajkr fbshipit-source-id: aec57a8357dbb4bfde2003261094d786d94f724e	2017-06-13 14:56:46 -07:00
Siying Dong	5d5a28a98c	Fix Clang release build broken by `5582123dee` Summary: `5582123dee` broken CLANG release build because of an unexpected change. Fix it. Closes https://github.com/facebook/rocksdb/pull/2443 Differential Revision: D5236297 Pulled By: siying fbshipit-source-id: 1b410adf13ded149c53e8235e9ea9f3130fb5403	2017-06-13 04:56:35 -07:00
Islam AbdelRahman	d713471da8	Limit trash directory to be 25% of total DB Summary: Update DeleteScheduler to delete files immediately if trash directory is >= 25% of DB size Closes https://github.com/facebook/rocksdb/pull/2436 Differential Revision: D5230384 Pulled By: IslamAbdelRahman fbshipit-source-id: 5cbda8ac536a3cc72c774641621edc02c8202482	2017-06-12 16:57:21 -07:00
Siying Dong	5582123dee	Sample number of reads per SST file Summary: We estimate number of reads per SST files, by updating the counter per file in sampled read requests. This information can later be used to trigger compactions to improve read performacne. Closes https://github.com/facebook/rocksdb/pull/2417 Differential Revision: D5193528 Pulled By: siying fbshipit-source-id: b4241c5ad0eaf444b61afb53f8e6290d9f5da2df	2017-06-12 07:12:08 -07:00
Siying Dong	db818d2d1a	Fix RocksDB Lite build with CLANG Summary: Closes https://github.com/facebook/rocksdb/pull/2419 Differential Revision: D5193976 Pulled By: siying fbshipit-source-id: 62d115edee6043237e9d6ad3c2a05481e162c9eb	2017-06-12 06:41:27 -07:00
Yi Wu	85dace2afa	Disable DBRangeDelTest::TailingIteratorRangeTombstoneUnsupported for ubsan Summary: UBSAN crashes when it run the test. Disabling it for UBSAN. Closes https://github.com/facebook/rocksdb/pull/2427 Differential Revision: D5210897 Pulled By: yiwu-arbug fbshipit-source-id: 2f5a876807c98d8db79ab9581965f7e6b29d4163	2017-06-08 12:43:01 -07:00
Siying Dong	52a7f38b19	WriteOptions.low_pri which can throttle low pri writes if needed Summary: If ReadOptions.low_pri=true and compaction is behind, the write will either return immediate or be slowed down based on ReadOptions.no_slowdown. Closes https://github.com/facebook/rocksdb/pull/2369 Differential Revision: D5127619 Pulled By: siying fbshipit-source-id: d30e1cff515890af0eff32dfb869d2e4c9545eb0	2017-06-05 15:02:35 -07:00
Yi Wu	dba9f3722b	Fix db_write_test clang/windows build failure Summary: Fix db_write_test clang/windows build failure. Explicitly cast size_t (unsigned long) to uint32_t (unsigned int). Closes https://github.com/facebook/rocksdb/pull/2407 Differential Revision: D5182995 Pulled By: yiwu-arbug fbshipit-source-id: aba225a9fccb12d5bfbdc2cd6efc11040706a9d2	2017-06-05 12:27:24 -07:00
hyunwoo	c7662a44a4	fixed typo Summary: fixed typo Closes https://github.com/facebook/rocksdb/pull/2376 Differential Revision: D5183630 Pulled By: ajkr fbshipit-source-id: 133cfd0445959e70aa2cd1a12151bf3c0c5c3ac5	2017-06-05 11:27:34 -07:00
Aaron Gao	7f6c02dda1	using ThreadLocalPtr to hide ROCKSDB_SUPPORT_THREAD_LOCAL from public… Summary: … headers https://github.com/facebook/rocksdb/pull/2199 should not reference RocksDB-specific macros (like ROCKSDB_SUPPORT_THREAD_LOCAL in this case) to public headers, `iostats_context.h` and `perf_context.h`. We shouldn't do that because users have to provide these compiler flags when building their binary with RocksDB. We should hide the thread local global variable inside our implementation and just expose a function api to retrieve these variables. It may break some users for now but good for long term. make check -j64 Closes https://github.com/facebook/rocksdb/pull/2380 Differential Revision: D5177896 Pulled By: lightmark fbshipit-source-id: 6fcdfac57f2e2dcfe60992b7385c5403f6dcb390	2017-06-02 17:26:19 -07:00
Mike Kolupaev	138b87eae4	Fix interaction between CompactionFilter::Decision::kRemoveAndSkipUnt… Summary: Fixes the following scenario: 1. Set prefix extractor. Enable bloom filters, with `whole_key_filtering = false`. Use compaction filter that sometimes returns `kRemoveAndSkipUntil`. 2. Do a compaction. 3. Compaction creates an iterator with `total_order_seek = false`, calls `SeekToFirst()` on it, then repeatedly calls `Next()`. 4. At some point compaction filter returns `kRemoveAndSkipUntil`. 5. Compaction calls `Seek(skip_until)` on the iterator. The key that it seeks to happens to have prefix that doesn't match the bloom filter. Since `total_order_seek = false`, iterator becomes invalid, and compaction thinks that it has reached the end. The rest of the compaction input is silently discarded. The fix is to make compaction iterator use `total_order_seek = true`. The implementation for PlainTable is quite awkward. I've made `kRemoveAndSkipUntil` officially incompatible with PlainTable. If you try to use them together, compaction will fail, and DB will enter read-only mode (`bg_error_`). That's not a very graceful way to communicate a misconfiguration, but the alternatives don't seem worth the implementation time and complexity. To be able to check in advance that `kRemoveAndSkipUntil` is not going to be used with PlainTable, we'd need to extend the interface of either `CompactionFilter` or `InternalIterator`. It seems unlikely that anyone will ever want to use `kRemoveAndSkipUntil` with PlainTable: PlainTable probably has very few users, and `kRemoveAndSkipUntil` has only one user so far: us (logdevice). Closes https://github.com/facebook/rocksdb/pull/2349 Differential Revision: D5110388 Pulled By: lightmark fbshipit-source-id: ec29101a99d9dcd97db33923b87f72bce56cc17a	2017-06-02 15:11:38 -07:00
Siying Dong	95b0e89b5d	Improve write buffer manager (and allow the size to be tracked in block cache) Summary: Improve write buffer manager in several ways: 1. Size is tracked when arena block is allocated, rather than every allocation, so that it can better track actual memory usage and the tracking overhead is slightly lower. 2. We start to trigger memtable flush when 7/8 of the memory cap hits, instead of 100%, and make 100% much harder to hit. 3. Allow a cache object to be passed into buffer manager and the size allocated by memtable can be costed there. This can help users have one single memory cap across block cache and memtable. Closes https://github.com/facebook/rocksdb/pull/2350 Differential Revision: D5110648 Pulled By: siying fbshipit-source-id: b4238113094bf22574001e446b5d88523ba00017	2017-06-02 14:26:56 -07:00
Andrew Kryczka	a4d9c02511	Pass CF ID to MemTableRepFactory Summary: Some users want to monitor column family activity in their custom memtable implementations. Previously there was no way to figure out with which column family a memtable is associated. This diff: - adds an overload to MemTableRepFactory::CreateMemTableRep() that provides the CF ID. For compatibility, its default implementation calls the old overload. - updates MemTable to create MemTableRep's using the new overload. Closes https://github.com/facebook/rocksdb/pull/2346 Differential Revision: D5108061 Pulled By: ajkr fbshipit-source-id: 3a1921214a348dd8ea0f54e1cab3b71c3d46d616	2017-06-02 12:12:06 -07:00
Yi Wu	f68d88be51	Fix DBWriteTest::ReturnSequenceNumberMultiThreaded data race Summary: rocksdb::Random is not thread-safe. Have one Random for each thread instead. Closes https://github.com/facebook/rocksdb/pull/2400 Differential Revision: D5173919 Pulled By: yiwu-arbug fbshipit-source-id: 1a99c7b877f3893eb22355af49e321bcad4e53e6	2017-06-02 11:42:11 -07:00
Andrew Kryczka	215076ef06	Fix TSAN: avoid arena mode with range deletions Summary: The range deletion meta-block iterators weren't getting cleaned up properly since they don't support arena allocation. I didn't implement arena support since, in the general case, each iterator is used only once and separately from all other iterators, so there should be no benefit to data locality. Anyways, this diff fixes up #2370 by treating range deletion iterators as non-arena-allocated. Closes https://github.com/facebook/rocksdb/pull/2399 Differential Revision: D5171119 Pulled By: ajkr fbshipit-source-id: bef6f5c4c5905a124f4993945aed4bd86e2807d8	2017-06-01 22:26:49 -07:00
Andrew Kryczka	3a8a848a55	account for L0 size in estimated compaction bytes Summary: also changed the `>` in the comparison against `level0_file_num_compaction_trigger` into a `>=` since exactly `level0_file_num_compaction_trigger` can trigger a compaction from L0. Closes https://github.com/facebook/rocksdb/pull/2179 Differential Revision: D4915772 Pulled By: ajkr fbshipit-source-id: e38fec6253de6f9a40e61734615c6670d84038aa	2017-06-01 17:56:59 -07:00
Tamir Duberstein	0dc3040d54	db: avoid `#include`ing malloc and jemalloc simultaneously Summary: This fixes a compilation failure on Linux when the system libc is not glibc. jemalloc's configure script incorrectly assumes that glibc is always used on Linux systems, producing glibc-style signatures; when the system libc is e.g. musl, the following error is observed: ``` [ 0%] Building CXX object CMakeFiles/rocksdb.dir/db/db_impl.cc.o In file included from /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb.src/table/block.h:19:0, from /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb.src/db/db_impl.cc:77: /x-tools/x86_64-unknown-linux-musl/x86_64-unknown-linux-musl/sysroot/usr/include/malloc.h:19:8: error: declaration of 'size_t malloc_usable_size(void)' has a different exception specifier size_t malloc_usable_size(void ); ^~~~~~~~~~~~~~~~~~ In file included from /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb.src/db/db_impl.cc:20:0: /go/native/x86_64-unknown-linux-musl/jemalloc/include/jemalloc/jemalloc.h:78:33: note: from previous declaration 'size_t malloc_usable_size(void*) throw ()' # define je_malloc_usable_size malloc_usable_size ^ /go/native/x86_64-unknown-linux-musl/jemalloc/include/jemalloc/jemalloc.h:239:41: note: in expansion of macro 'je_malloc_usable_size' JEMALLOC_EXPORT size_t JEMALLOC_NOTHROW je_malloc_usable_size( ^~~~~~~~~~~~~~~~~~~~~ CMakeFiles/rocksdb.dir/build.make:350: recipe for target 'CMakeFiles/rocksdb.dir/db/db_impl.cc.o' failed ``` This works around the issue by rearranging the sources such that jemalloc's headers are never in the same scope as the system's malloc header. The jemalloc issue has been reported as well, see: https://github.com/jemalloc/jemalloc/issues/778. cc tschottdorf Closes https://github.com/facebook/rocksdb/pull/2188 Differential Revision: D5163048 Pulled By: siying fbshipit-source-id: c553125458892def175c1be5682b0330d80b2a0d	2017-05-31 22:43:02 -07:00
Andrew Kryczka	9c9909bf7d	Support ingest file when range deletions exist Summary: Previously we returned NotSupported when ingesting files into a database containing any range deletions. This diff adds the support. - Flush if any memtable contains range deletions overlapping the to-be-ingested file - Place to-be-ingested file before any level that contains range deletions overlapping it. - Added support for `Version` to return iterators over range deletions in a given level. Previously, we piggybacked getting range deletions onto `Version`'s `Get()` / `AddIterator()` functions by passing them a `RangeDelAggregator*`. But file ingestion needs to get iterators over range deletions, not populate an aggregator (since the aggregator does collapsing and doesn't expose the actual ranges). Closes https://github.com/facebook/rocksdb/pull/2370 Differential Revision: D5127648 Pulled By: ajkr fbshipit-source-id: 816faeb9708adfa5287962bafdde717db56e3f1a	2017-05-31 13:57:19 -07:00
Yi Wu	ad19eb8686	Fixing blob db sequence number handling Summary: Blob db rely on base db returning sequence number through write batch after DB::Write(). However after recent changes to the write path, DB::Writ()e no longer return sequence number in some cases. Fixing it by have WriteBatchInternal::InsertInto() always encode sequence number into write batch. Stacking on #2375. Closes https://github.com/facebook/rocksdb/pull/2385 Differential Revision: D5148358 Pulled By: yiwu-arbug fbshipit-source-id: 8bda0aa07b9334ed03ed381548b39d167dc20c33	2017-05-31 10:56:45 -07:00
Siying Dong	51ac91f586	Histogram of number of merge operands Summary: Add a histogram in statistics to help users understand how many merge operands they merge. Closes https://github.com/facebook/rocksdb/pull/2373 Differential Revision: D5139983 Pulled By: siying fbshipit-source-id: 61b9ba8ca83f358530a4833d68f0103b56a0e182	2017-05-31 07:41:44 -07:00
Tamir Duberstein	103d0692ea	Avoid unsupported attributes when not building with UBSAN Summary: yiwu-arbug see individual commits. Closes https://github.com/facebook/rocksdb/pull/2318 Differential Revision: D5141520 Pulled By: yiwu-arbug fbshipit-source-id: 7987c92ab4461eef36afce5a133d3a0ee0c96300	2017-05-30 11:13:01 -07:00
赵星宇	d03c34497c	update comment of GetNextFile Summary: Closes https://github.com/facebook/rocksdb/pull/2377 Differential Revision: D5141274 Pulled By: lightmark fbshipit-source-id: c237a285b73ad93488c080ea80c71a29a17f1be0	2017-05-26 15:12:13 -07:00
Aaron Gao	f7bb1a0060	support merge and delete in file ingestion Summary: Previously sst_file_writer only supports kTypeValue, we need kTypeMerge and kTypeDeletion also as user requested. Closes https://github.com/facebook/rocksdb/pull/2361 Differential Revision: D5139402 Pulled By: lightmark fbshipit-source-id: 092a60756d01692539d817a3765ebfd58a8d7f88	2017-05-26 12:11:21 -07:00
Sagar Vemuri	7bb1f5d483	Increase of compaction threads should be logged at info level instead of a warning Summary: This log message shouldn't be a warning; some services are seeing high warning count due to this. The count for the below line is a few hundreds of millions, as per Logview: ``` [rocksdb/src/db/column_family.cc:729] [checkpoints] Increasing compaction threads because we have 2 level-0 files ``` Closes https://github.com/facebook/rocksdb/pull/2364 Differential Revision: D5123565 Pulled By: sagar0 fbshipit-source-id: a07ce499a4f82f0ebde9cda9f4948fb9df6a734c	2017-05-26 09:56:13 -07:00
Andrew Kryczka	a99fb9928f	fix column_family_test asan Summary: stop calling Close() at the end of tests holding a compaction pressure token since it causes the write controller to be deleted while it's still needed. these calls were pointless anyways since Close() is already called in the test's destructor. Closes https://github.com/facebook/rocksdb/pull/2367 Differential Revision: D5125906 Pulled By: ajkr fbshipit-source-id: 6cad8673e5546a82ff602ac0ba59cc3f68dbde46	2017-05-24 16:41:51 -07:00
Andrew Kryczka	bb01c1880c	Introduce max_background_jobs mutable option Summary: - `max_background_flushes` and `max_background_compactions` are still supported for backwards compatibility - `base_background_compactions` is completely deprecated. Now we just throttle to one background compaction when there's no pressure. - `max_background_jobs` is added to automatically partition the concurrent background jobs into flushes vs compactions. Currently it's very simple as we just allocate one-fourth of the jobs to flushes, and the remaining can be used for compactions. - The test cases that set `base_background_compactions > 1` needed to be updated. I just grab the pressure token such that the desired number of compactions can be scheduled. Closes https://github.com/facebook/rocksdb/pull/2205 Differential Revision: D4937461 Pulled By: ajkr fbshipit-source-id: df52cbbd497e13bbc9a60560a5ac2a2526b3f1f9	2017-05-24 11:29:08 -07:00
Siying Dong	41cbb72749	options.delayed_write_rate use the rate of rate_limiter by default. Summary: It's hard for RocksDB to come up with a good default of delayed write rate. Use rate given by rate limiter if it is availalbe. This provides the I/O order of magnitude. Closes https://github.com/facebook/rocksdb/pull/2357 Differential Revision: D5115324 Pulled By: siying fbshipit-source-id: 341065ad2211c981fc804011c0f0e59a50c7e754	2017-05-24 09:58:24 -07:00
Igor Canadi	52d9e5f7b6	Fix column family seconds_up accounting Summary: `cf_stats_snapshot_.seconds_up` appears to be never updated, unlike `db_stats_snapshot_.seconds_up`, which is updated here: https://github.com/facebook/rocksdb/blob/master/db/internal_stats.cc#L883 This leads to wrong information in the log, for example: ``` Compaction Stats [default] .... Uptime(secs): 85591.2 total, 85591.2 interval ``` Even though DB's interval is correctly logged as 60 seconds: ``` DB Stats Uptime(secs): 85591.2 total, 637.8 interval ``` Closes https://github.com/facebook/rocksdb/pull/2338 Differential Revision: D5114131 Pulled By: sagar0 fbshipit-source-id: 85243a38213236ccbb601a7f7aaa8865eaa8083c	2017-05-23 17:14:04 -07:00
Sagar Vemuri	7d8207f1f2	Fix errors in clang-analyzer builds Summary: Fix build error in db_iter.cc when running clang-analyzer. ``` CC db/db_iter.o db/db_iter.cc:938:21: error: no matching constructor for initialization of 'rocksdb::ParsedInternalKey' ParsedInternalKey ikey(Slice(), 0, 0); ^ ~~~~~~~~~~~~~ ./db/dbformat.h:84:3: note: candidate constructor not viable: no known conversion from 'int' to 'rocksdb::ValueType' for 3rd argument ParsedInternalKey(const Slice& u, const SequenceNumber& seq, ValueType t) ^ ./db/dbformat.h:78:8: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 3 were provided struct ParsedInternalKey { ^ ./db/dbformat.h:78:8: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 3 were provided ./db/dbformat.h:83:3: note: candidate constructor not viable: requires 0 arguments, but 3 were provided ParsedInternalKey() { } // Intentionally left uninitialized (for speed) ^ 1 error generated. ``` Closes https://github.com/facebook/rocksdb/pull/2354 Differential Revision: D5115751 Pulled By: sagar0 fbshipit-source-id: b0e386d4e935e4725b07761c3ca5f7a8cbde3692	2017-05-23 15:11:42 -07:00
Andrew Kryczka	6cc9aef162	New API for background work in single thread pool Summary: Previously users could set `max_background_flushes=0` to force rocksdb to use a single thread pool for both background flushes and compactions. That'll no longer be possible since I'm going to deprecate `max_background_flushes` and `max_background_compactions` in favor of a single option. This diff introduces a new way to force a single thread pool: when high-pri pool has zero threads, all background jobs will be submitted to low-pri pool. Note the majority of the code change is adding `Env::GetBackgroundThreads()`, which is necessary to check whether the user has provided a zero-sized thread pool. Closes https://github.com/facebook/rocksdb/pull/2204 Differential Revision: D4936256 Pulled By: ajkr fbshipit-source-id: 929a07a0c0705f7766f5339cd013ff74e90d6e01	2017-05-23 11:12:27 -07:00
Yi Wu	9d0a07ed52	Fix rocksdb.estimate-num-keys DB property underflow Summary: rocksdb.estimate-num-keys is compute from `estimate_num_keys - 2 * estimate_num_deletes`. If `2 * estimate_num_deletes > estimate_num_keys` it will underflow. Fixing it. Closes https://github.com/facebook/rocksdb/pull/2348 Differential Revision: D5109272 Pulled By: yiwu-arbug fbshipit-source-id: e1bfb91346a59b7282a282b615002507e9d7c246	2017-05-23 10:42:59 -07:00
Aaron Gao	3e86c0f07c	disable direct reads for log and manifest and add direct io to tests Summary: Disable direct reads for log and manifest. Direct reads should not affect sequential_file Also add kDirectIO for option_config_ in db_test_util Closes https://github.com/facebook/rocksdb/pull/2337 Differential Revision: D5100261 Pulled By: lightmark fbshipit-source-id: 0ebfd13b93fa1b8f9acae514ac44f8125a05868b	2017-05-22 18:41:28 -07:00
Sagar Vemuri	228f49d20a	Fix data races caught by tsan Summary: This fixes the tsan build failures in: - write_callback_test - persistent_cache_test.* Closes https://github.com/facebook/rocksdb/pull/2339 Differential Revision: D5101190 Pulled By: sagar0 fbshipit-source-id: 537e19ed05272b1f34cfbf793aa822b2264a1643	2017-05-22 10:27:23 -07:00
Yi Wu	07bdcb91fe	New WriteImpl to pipeline WAL/memtable write Summary: PipelineWriteImpl is an alternative approach to WriteImpl. In WriteImpl, only one thread is allow to write at the same time. This thread will do both WAL and memtable writes for all write threads in the write group. Pending writers wait in queue until the current writer finishes. In the pipeline write approach, two queue is maintained: one WAL writer queue and one memtable writer queue. All writers (regardless of whether they need to write WAL) will still need to first join the WAL writer queue, and after the house keeping work and WAL writing, they will need to join memtable writer queue if needed. The benefit of this approach is that 1. Writers without memtable writes (e.g. the prepare phase of two phase commit) can exit write thread once WAL write is finish. They don't need to wait for memtable writes in case of group commit. 2. Pending writers only need to wait for previous WAL writer finish to be able to join the write thread, instead of wait also for previous memtable writes. Merging #2056 and #2058 into this PR. Closes https://github.com/facebook/rocksdb/pull/2286 Differential Revision: D5054606 Pulled By: yiwu-arbug fbshipit-source-id: ee5b11efd19d3e39d6b7210937b11cefdd4d1c8d	2017-05-19 14:26:42 -07:00
Yi Wu	d746aead1a	Suppress clang-analyzer false positive Summary: Fixing two types of clang-analyzer false positives: * db is deleted and then reopen, and clang-analyzer thinks we are reusing the pointer after it has been deleted. Adding asserts to hint clang-analyzer the pointer is recreated. * ParsedInternalKey is (intentionally) uninitialized. Initialize the struct only when clang-analyzer is running. Closes https://github.com/facebook/rocksdb/pull/2334 Differential Revision: D5093801 Pulled By: yiwu-arbug fbshipit-source-id: f51355382098eb3da5ab9f64e094c6d03e6bdf7d	2017-05-19 10:56:28 -07:00
Siying Dong	217b866f47	column_family_test: EnvCounter::num_new_writable_file_ to be atomic Summary: TSAN shows warning of data race of EnvCounter::num_new_writable_file_. Make it atomic. Closes https://github.com/facebook/rocksdb/pull/2331 Differential Revision: D5089215 Pulled By: siying fbshipit-source-id: 15f6dcfb770a3310cbb6337c22482c8b330daffc	2017-05-18 13:56:12 -07:00
yizhu.sun	f5ba131bf8	Fixed some spelling mistakes Summary: Closes https://github.com/facebook/rocksdb/pull/2314 Differential Revision: D5079601 Pulled By: sagar0 fbshipit-source-id: ae5696fd735718f544435c64c3179c49b8c04349	2017-05-17 23:12:36 -07:00
hyunwoo	0ebdd70579	fixed typo Summary: fixed typo Closes https://github.com/facebook/rocksdb/pull/2312 Differential Revision: D5079631 Pulled By: sagar0 fbshipit-source-id: e4c8d1d89b244ee69e9dea1dd013227cc5241026	2017-05-17 16:41:49 -07:00
Mikhail Antonov	ba685a472a	Support ingest_behind for IngestExternalFile Summary: First cut for early review; there are few conceptual points to answer and some code structure issues. For conceptual points - - restriction-wise, we're going to disallow ingest_behind if (use_seqno_zero_out=true \|\| disable_auto_compaction=false), the user is responsible to properly open and close DB with required params - we wanted to ingest into reserved bottom most level. Should we fail fast if bottom level isn't empty, or should we attempt to ingest if file fits there key-ranges-wise? - Modifying AssignLevelForIngestedFile seems the place we we'd handle that. On code structure - going to refactor GenerateAndAddExternalFile call in the test class to allow passing instance of IngestionOptions, that's just going to incur lots of changes at callsites. Closes https://github.com/facebook/rocksdb/pull/2144 Differential Revision: D4873732 Pulled By: lightmark fbshipit-source-id: 81cb698106b68ef8797f564453651d50900e153a	2017-05-17 11:42:42 -07:00
boolean5	cb9392a094	add Transactions and Checkpoint to C API Summary: I've added functions to the C API to support Transactions as requested in #1637 and to support Checkpoint. I have also added the corresponding tests to c_test.c For now, the following is omitted: 1. Optimistic Transactions 2. The column family variation of functions Closes https://github.com/facebook/rocksdb/pull/2236 Differential Revision: D4989510 Pulled By: yiwu-arbug fbshipit-source-id: 518cb39f76d5e9ec9690d633fcdc014b98958071	2017-05-16 22:59:43 -07:00
hyunwoo	f720796e24	fixed typo Summary: fixed exisitng -> existing Closes https://github.com/facebook/rocksdb/pull/2305 Differential Revision: D5070169 Pulled By: yiwu-arbug fbshipit-source-id: 8c8450acf50757b767cf78b78314018395738d96	2017-05-16 11:07:58 -07:00
siddontang	1ca723dbd1	C API: support pinnable get Summary: Closes https://github.com/facebook/rocksdb/pull/2254 Differential Revision: D5053590 Pulled By: yiwu-arbug fbshipit-source-id: 2f365a031b3a2947b4fba21d26d4f8f52af9b9f0	2017-05-16 11:07:58 -07:00
赵星宇	4f9e69ccf4	fix log err Summary: Closes https://github.com/facebook/rocksdb/pull/2206 Differential Revision: D5054222 Pulled By: yiwu-arbug fbshipit-source-id: d8742bda1bf3e76d7b68eeb86df4608031b5cbc8	2017-05-15 16:15:38 -07:00
Yi Wu	3907c94ffb	Fix ColumnFamilyTest:BulkAddDrop Summary: Fix ColumnFamilyTest:BulkAddDrop not deleted CF handles at the end, causing ASAN failure. Closes https://github.com/facebook/rocksdb/pull/2275 Differential Revision: D5040724 Pulled By: yiwu-arbug fbshipit-source-id: 86cd4070c944d01173a3cc36462bb800698af192	2017-05-10 23:05:44 -07:00
Anirban Rahut	d85ff4953c	Blob storage pr Summary: The final pull request for Blob Storage. Closes https://github.com/facebook/rocksdb/pull/2269 Differential Revision: D5033189 Pulled By: yiwu-arbug fbshipit-source-id: 6356b683ccd58cbf38a1dc55e2ea400feecd5d06	2017-05-10 15:14:44 -07:00
Aaron Gao	492fc49a86	fix readampbitmap tests Summary: fix test failure of ReadAmpBitmap and ReadAmpBitmapLiveInCacheAfterDBClose. test ReadAmpBitmapLiveInCacheAfterDBClose individually and make check Closes https://github.com/facebook/rocksdb/pull/2271 Differential Revision: D5038133 Pulled By: lightmark fbshipit-source-id: 803cd6f45ccfdd14a9d9473c8af311033e164be8	2017-05-10 12:29:23 -07:00
Andrew Kryczka	be421b0b16	portable sched_getcpu calls Summary: - added a feature test in build_detect_platform to check whether sched_getcpu() is available. glibc offers it only on some platforms (e.g., linux but not mac); this way should be easier than maintaining a list of platforms on which it's available. - refactored PhysicalCoreID() to be simpler / less repetitive. ordered the conditional compilation clauses from most-to-least preferred Closes https://github.com/facebook/rocksdb/pull/2272 Differential Revision: D5038093 Pulled By: ajkr fbshipit-source-id: 81d7db3cc620250de220bdeb3194b2b3d7673de7	2017-05-10 12:29:23 -07:00
Aaron Gao	259a00eaca	unbiase readamp bitmap Summary: Consider BlockReadAmpBitmap with bytes_per_bit = 32. Suppose bytes [a, b) were used, while bytes [a-32, a) and [b+1, b+33) weren't used; more formally, the union of ranges passed to BlockReadAmpBitmap::Mark() contains [a, b) and doesn't intersect with [a-32, a) and [b+1, b+33). Then bits [floor(a/32), ceil(b/32)] will be set, and so the number of useful bytes will be estimated as (ceil(b/32) - floor(a/32)) * 32, which is on average equal to b-a+31. An extreme example: if we use 1 byte from each block, it'll be counted as 32 bytes from each block. It's easy to remove this bias by slightly changing the semantics of the bitmap. Currently each bit represents a byte range [i32, (i+1)32). This diff makes each bit represent a single byte: i32 + X, where X is a random number in [0, 31] generated when bitmap is created. So, e.g., if you read a single byte at random, with probability 31/32 it won't be counted at all, and with probability 1/32 it will be counted as 32 bytes; so, on average it's counted as 1 byte. But there is one exception: the last bit will always set with the old way.* (*) - assuming read_amp_bytes_per_bit = 32. Closes https://github.com/facebook/rocksdb/pull/2259 Differential Revision: D5035652 Pulled By: lightmark fbshipit-source-id: bd98b1b9b49fbe61f9e3781d07f624e3cbd92356	2017-05-10 01:49:52 -07:00
Islam AbdelRahman	4897eb250b	dont skip IO for filter blocks Summary: Based on my experience with linkbench, We should not skip loading bloom filter blocks when they are not available in block cache when using Iterator::Seek Actually I am not sure why this behavior existed in the first place Closes https://github.com/facebook/rocksdb/pull/2255 Differential Revision: D5010721 Pulled By: maysamyabandeh fbshipit-source-id: 0af545a06ac4baeecb248706ec34d009c2480ca4	2017-05-09 09:52:02 -07:00
Changjian Gao	3f73d54bbd	Add C API to set max_file_opening_threads option Summary: Add `rocksdb_options_set_max_file_opening_threads()` API Closes https://github.com/facebook/rocksdb/pull/2184 Differential Revision: D4923090 Pulled By: lightmark fbshipit-source-id: c4ddce17733d999d426d02f7202b33a46ed6faed	2017-05-08 22:49:32 -07:00
Yi Wu	2cd00773c7	Add bulk create/drop column family API Summary: Adding DB::CreateColumnFamilie() and DB::DropColumnFamilies() to bulk create/drop column families. This is to address the problem creating/dropping 1k column families takes minutes. The bottleneck is we persist options files for every single column family create/drop, and it parses the persisted options file for verification, which take a lot CPU time. The new APIs simply create/drop column families individually, and persist options file once at the end. This improves create 1k column families to within ~0.1s. Further improvement can be merge manifest write to one IO. Closes https://github.com/facebook/rocksdb/pull/2248 Differential Revision: D5001578 Pulled By: yiwu-arbug fbshipit-source-id: d4e00bda671451e0b314c13e12ad194b1704aa03	2017-05-07 23:20:46 -07:00
Tamir Duberstein	fdaefa0309	travis: add Windows cross-compilation Summary: - downcase includes for case-sensitive filesystems - give targets the same name (librocksdb) on all platforms With this patch it is possible to cross-compile RocksDB for Windows from a Linux host using mingw. cc yuslepukhin orgads Closes https://github.com/facebook/rocksdb/pull/2107 Differential Revision: D4849784 Pulled By: siying fbshipit-source-id: ad26ed6b4d393851aa6551e6aa4201faba82ef60	2017-05-05 23:20:01 -07:00
Aaron Gao	a30a696034	do not read next datablock if upperbound is reached Summary: Now if we have iterate_upper_bound set, we continue read until get a key >= upper_bound. For a lot of cases that neighboring data blocks have a user key gap between them, our index key will be a user key in the middle to get a shorter size. For example, if we have blocks: [a b c d][f g h] Then the index key for the first block will be 'e'. then if upper bound is any key between 'd' and 'e', for example, d1, d2, ..., d99999999999, we don't have to read the second block and also know that we have done our iteration by reaching the last key that smaller the upper bound already. This diff can reduce RA in most cases. Closes https://github.com/facebook/rocksdb/pull/2239 Differential Revision: D4990693 Pulled By: lightmark fbshipit-source-id: ab30ea2e3c6edf3fddd5efed3c34fcf7739827ff	2017-05-05 23:20:01 -07:00
Aaron Gao	2d42cf5ea9	Roundup read bytes in ReadaheadRandomAccessFile Summary: Fix alignment in ReadaheadRandomAccessFile Closes https://github.com/facebook/rocksdb/pull/2253 Differential Revision: D5012336 Pulled By: lightmark fbshipit-source-id: 10d2c829520cb787227ef653ef63d5d701725778	2017-05-05 12:14:14 -07:00
Siying Dong	264d3f540c	Allow IntraL0 compaction in FIFO Compaction Summary: Allow an option for users to do some compaction in FIFO compaction, to pay some write amplification for fewer number of files. Closes https://github.com/facebook/rocksdb/pull/2163 Differential Revision: D4895953 Pulled By: siying fbshipit-source-id: a1ab608dd0627211f3e1f588a2e97159646e1231	2017-05-04 18:16:13 -07:00
Andrew Kryczka	8c3a180e83	Set lower-bound on dynamic level sizes Summary: Changed dynamic leveling to stop setting the base level's size bound below `max_bytes_for_level_base`. Behavior for config where `max_bytes_for_level_base == level0_file_num_compaction_trigger * write_buffer_size` and same amount of data in L0 and base-level: - Before #2027, compaction scoring would favor base-level due to dividing by size smaller than `max_bytes_for_level_base`. - After #2027, L0 and Lbase get equal scores. The disadvantage is L0 is often compacted before reaching the num files trigger since `write_buffer_size` can be bigger than the dynamically chosen base-level size. This increases write-amp. - After this diff, L0 and Lbase still get equal scores. Now it takes `level0_file_num_compaction_trigger` files of size `write_buffer_size` to trigger L0 compaction by size, fixing the write-amp problem above. Closes https://github.com/facebook/rocksdb/pull/2123 Differential Revision: D4861570 Pulled By: ajkr fbshipit-source-id: 467ddef56ed1f647c14d86bb018bcb044c39b964	2017-05-04 18:16:12 -07:00
Andrew Kryczka	7c1c8ce5ac	Avoid calling fallocate with UINT64_MAX Summary: When user doesn't set a limit on compaction output file size, let's use the sum of the input files' sizes. This will avoid passing UINT64_MAX as fallocate()'s length. Reported in #2249. Test setup: - command: `TEST_TMPDIR=/data/rocksdb-test/ strace -e fallocate ./db_compaction_test --gtest_filter=DBCompactionTest.ManualCompactionUnknownOutputSize` - filesystem: xfs before this diff: `fallocate(10, 01, 0, 1844674407370955160) = -1 ENOSPC (No space left on device)` after this diff: `fallocate(10, 01, 0, 1977) = 0` Closes https://github.com/facebook/rocksdb/pull/2252 Differential Revision: D5007275 Pulled By: ajkr fbshipit-source-id: 4491404a6ae8a41328aede2e2d6f4d9ac3e38880	2017-05-04 17:43:22 -07:00
Leonidas Galanis	a45e98a5b5	max_open_files dynamic set, follow up Summary: Followup to make 0x40000 a TableCache constant that indicates infinite capacity Closes https://github.com/facebook/rocksdb/pull/2247 Differential Revision: D5001349 Pulled By: lgalanis fbshipit-source-id: ce7bd2e54b0975bb9f8680fdaa0f8bb0e7ae81a2	2017-05-04 10:42:45 -07:00
Leonidas Galanis	e7ae4a3a02	Max open files mutable Summary: Makes max_open_files db option dynamically set-able by SetDBOptions. During the call of SetDBOptions we call SetCapacity on the table cache, which is a LRUCache. Closes https://github.com/facebook/rocksdb/pull/2185 Differential Revision: D4979189 Pulled By: yiwu-arbug fbshipit-source-id: ca7e8dc5e3619c79434f579be4847c0f7e56afda	2017-05-03 21:13:14 -07:00
siddontang	b551104e04	support PopSavePoint for WriteBatch Summary: Try to fix https://github.com/facebook/rocksdb/issues/1969 Closes https://github.com/facebook/rocksdb/pull/2170 Differential Revision: D4907333 Pulled By: yiwu-arbug fbshipit-source-id: 417b420ff668e6c2fd0dad42a94c57385012edc5	2017-05-03 10:57:45 -07:00
Siying Dong	af6fe69e4c	Fix an issue of manual / auto compaction data race Summary: A data race between a manual and an auto compaction can cause a scheduled automatic compaction to be cancelled and never rescheduled again. This may cause a condition of hanging forever. Fix this by always making sure the cancelled compaction is put back to the compaction queue. Closes https://github.com/facebook/rocksdb/pull/2238 Differential Revision: D4984591 Pulled By: siying fbshipit-source-id: 3ab153886403c7b991896dcb2158b96cac12f227	2017-05-02 15:11:59 -07:00
Siying Dong	aeaba07b2a	Remove an assert that causes TSAN failure. Summary: ColumnFamilyData::ConstructNewMemtable is called out of DB mutex, and it asserts current_ is not empty, but current_ should only be accessed inside DB mutex. Remove this assert to make TSAN happy. Closes https://github.com/facebook/rocksdb/pull/2235 Differential Revision: D4978531 Pulled By: siying fbshipit-source-id: 423685a7dae88ed3faaa9e1b9ccb3427ac704a4b	2017-05-01 16:35:15 -07:00
Siying Dong	d616ebea23	Add GPLv2 as an alternative license. Summary: Closes https://github.com/facebook/rocksdb/pull/2226 Differential Revision: D4967547 Pulled By: siying fbshipit-source-id: dd3b58ae1e7a106ab6bb6f37ab5c88575b125ab4	2017-04-27 18:06:12 -07:00
Aaron Gao	0ca3ead0cb	add GetRootDB() in DeleteFilesInRange Summary: In case users cast a subclass of db* into dbimpl* Closes https://github.com/facebook/rocksdb/pull/2222 Differential Revision: D4964486 Pulled By: lightmark fbshipit-source-id: 0ccdc08ee8e7a193dfbbe0218c3cbfd795662ca1	2017-04-27 14:33:17 -07:00
Dmitri Smirnov	cdad04b051	Remove double buffering on RandomRead on Windows. Summary: Remove double buffering on RandomRead on Windows. With more logic appear in file reader/write Read no longer obeys forwarding calls to Windows implementation. Previously direct_io (unbuffered) was only available on Windows but now is supported as generic. We remove intermediate buffering on Windows. Remove random_access_max_buffer_size option which was windows specific. Non-zero values for that opton introduced unnecessary lock contention. Remove Env::EnableReadAhead(), Env::ShouldForwardRawRequest() that are no longer necessary. Add aligned buffer reads for cases when requested reads exceed read ahead size. Closes https://github.com/facebook/rocksdb/pull/2105 Differential Revision: D4847770 Pulled By: siying fbshipit-source-id: 8ab48f8e854ab498a4fd398a6934859792a2788f	2017-04-27 12:30:05 -07:00
Siying Dong	e15382c09c	Disable two flaky tests Summary: Closes https://github.com/facebook/rocksdb/pull/2217 Differential Revision: D4959351 Pulled By: siying fbshipit-source-id: ce7c3a430bae0d15e06b3d5c958ebce969d08564	2017-04-26 17:13:46 -07:00
Andrew Kryczka	efc361ef7d	Add user stats Reset API Summary: It resets all the ticker and histogram stats to zero. Needed to change the locking a bit since Reset() is the only operation that manipulates multiple tickers/histograms together, and that operation should be seen as atomic by other operations that access tickers/histograms. Closes https://github.com/facebook/rocksdb/pull/2213 Differential Revision: D4952232 Pulled By: ajkr fbshipit-source-id: c0475c3e4c7b940120d53891b69c3091149a0679	2017-04-26 15:57:01 -07:00
Andrew Kryczka	f6a27d0bce	Extract statistics tests into separate file Summary: I'm going to add more DB tests for statistics as currently we have very few. I started a file dedicated to this purpose and moved the existing stats-specific tests there. Closes https://github.com/facebook/rocksdb/pull/2211 Differential Revision: D4951558 Pulled By: ajkr fbshipit-source-id: 05d11c35079c40ecabdfd2cf5556ccb761f694a4	2017-04-26 14:47:23 -07:00
Aaron Gao	7eddecce12	support bulk loading with universal compaction Summary: Support buck load with universal compaction. More test cases to be added. Closes https://github.com/facebook/rocksdb/pull/2202 Differential Revision: D4935360 Pulled By: lightmark fbshipit-source-id: cc3ca1b6f42faa503207dab1408d6bcf393ee5b5	2017-04-26 13:41:32 -07:00
Aaron Gao	72c21fb3f2	call GetRootDB() before cast to DBImpl* in CancelAllBackgroundWork Summary: User could call this with wrapper class of DB or DBImpl Closes https://github.com/facebook/rocksdb/pull/2200 Differential Revision: D4935530 Pulled By: lightmark fbshipit-source-id: df9cb61d67d0f3bbcf62f714d77523a459a92883	2017-04-24 13:47:17 -07:00
Tomas Kolda	04d58970cb	AIX and Solaris Sparc Support Summary: Replacement of #2147 The change was squashed due to a lot of conflicts. Closes https://github.com/facebook/rocksdb/pull/2194 Differential Revision: D4929799 Pulled By: siying fbshipit-source-id: 5cd49c254737a1d5ac13f3c035f128e86524c581	2017-04-21 20:48:04 -07:00
Aaron Gao	cb885bccfe	set compaction_iterator earliest_snapshot to max if no snapshot Summary: It is a potential bug that will be triggered if we ingest files before inserting the first key into an empty db. 0 is a special value reserved to indicate the concept of non-existence. But not good for seqno in this case because 0 is a valid seqno for ingestion(bulk loading) Closes https://github.com/facebook/rocksdb/pull/2183 Differential Revision: D4919827 Pulled By: lightmark fbshipit-source-id: 237eea40f88bd6487b66806109d90065dc02c362	2017-04-20 19:56:59 -07:00
Andrew Kryczka	1dd7760513	Change L0 compaction score using level size Summary: The goal is to avoid the problem of small number of L0 files triggering compaction to base level (which increased write-amp), while still allowing L0 compaction-by-size (so intra-L0 compactions cause score to increase). Closes https://github.com/facebook/rocksdb/pull/2172 Differential Revision: D4908552 Pulled By: ajkr fbshipit-source-id: 4b170142b2b368e24bd7948b2a6f24c69fabf73d	2017-04-19 12:00:01 -07:00
Yi Wu	966ebb02f5	Hide event listeners from lite build Summary: Fixing lite build failure introduce by #2169. Closes https://github.com/facebook/rocksdb/pull/2174 Reviewed By: sagar0 Differential Revision: D4910619 Pulled By: yiwu-arbug fbshipit-source-id: 5213b7b7431cc258688793c8c28153025588d8d9	2017-04-18 17:26:19 -07:00
Siying Dong	c49d704656	Add DB:ResetStats() Summary: Add a function to allow users to reset internal stats without restarting the DB. Closes https://github.com/facebook/rocksdb/pull/2167 Differential Revision: D4907939 Pulled By: siying fbshipit-source-id: ab2dd85b88aabe9380da7485320a1d460d3e1f68	2017-04-18 16:56:48 -07:00
Yi Wu	0fcdccc33e	Blob storage helper methods Summary: Split out interfaces needed for blob storage from #1560, including * CompactionEventListener and OnFlushBegin listener interfaces. * Blob filename support. Closes https://github.com/facebook/rocksdb/pull/2169 Differential Revision: D4905463 Pulled By: yiwu-arbug fbshipit-source-id: 564e73448f1b7a367e5e46216a521e57ea9011b5	2017-04-18 12:42:38 -07:00
Yi Wu	e9e6e53247	Simplify write thread logic Summary: The concept about early exit in write thread implementation is a confusing one. It means that if early exit is allowed, batch group leader will not responsible to exit the batch group, but the last finished writer do. In case we need to mark log synced, or encounter memtable insert error, early exit is disallowed. This patch remove such a concept by: * In all cases, the last finished writer (not necessary leader) is responsible to exit batch group. * In case of parallel memtable write, leader will also mark log synced after memtable insert and before signal finish (call `CompleteParallelWorker()`). The purpose is to allow mark log synced (which require locking mutex) can run in parallel to memtable insert in other writers. * The last finish writer should handle memtable insert error (update bg_error_) before exiting batch group. Closes https://github.com/facebook/rocksdb/pull/2134 Differential Revision: D4869667 Pulled By: yiwu-arbug fbshipit-source-id: aec170847c85b90f4179d6a4608a4fe1361544e3	2017-04-13 16:12:04 -07:00
Aaron Gao	44fa8ece9b	change use_direct_writes to use_direct_io_for_flush_and_compaction Summary: Replace Options::use_direct_writes with Options::use_direct_io_for_flush_and_compaction Now if Options::use_direct_io_for_flush_and_compaction = true, we will enable direct io for both reads and writes for flush and compaction job. Whereas Options::use_direct_reads controls user reads like iterator and Get(). Closes https://github.com/facebook/rocksdb/pull/2117 Differential Revision: D4860912 Pulled By: lightmark fbshipit-source-id: d93575a8a5e780cf7e40797287edc425ee648c19	2017-04-13 16:12:04 -07:00
Igor Canadi	b6b9359ece	Fix BYTES_WRITTEN accounting Summary: BYTES_WRITTEN accounting doesn't work with disabled WAL. For example, this is what we get in the LOG: ``` Cumulative writes: 9794K writes, 228M keys, 9794K commit groups, 1.0 writes per commit group, ingest: 0.00 GB, 0.00 MB/s ``` WAL bytes are tracked in a different statistic: https://github.com/facebook/rocksdb/blob/master/db/internal_stats.h#L105. BYTES_WRITTEN should count all the writes. Closes https://github.com/facebook/rocksdb/pull/2133 Differential Revision: D4880615 Pulled By: yiwu-arbug fbshipit-source-id: 8fd0b223099f3f5ad7df79d4e737d313687fec69	2017-04-13 16:12:03 -07:00
Sagar Vemuri	6a6723ee1e	Move MergeOperatorPinning tests to be with other merge operator tests Summary: Moved MergeOperatorPinning tests from db_test2.cc to db_merge_operator_test.cc. [This is the same code as PR #2104 , which has already been reviewed, but I am creating a new PR as I cannot import from #2104 onto phabricator anymore even after rebasing. I'll close and discard #2104.] Closes https://github.com/facebook/rocksdb/pull/2125 Differential Revision: D4863312 Pulled By: sagar0 fbshipit-source-id: 0f71a7690aa09c1d03ee85ce2bc1d2d89e4f4399	2017-04-11 16:15:06 -07:00
Siying Dong	8f47a97512	File level histogram should be printed per CF, not per DB Summary: Currently level histogram is only printed out for DB stats and for default CF. This is confusing. Change to print for every CF instead. Closes https://github.com/facebook/rocksdb/pull/2126 Differential Revision: D4865373 Pulled By: siying fbshipit-source-id: 1c853e0ac66e00120ee931cabc9daf69ccc2d577	2017-04-11 08:42:03 -07:00
Manuel Ung	1f8b119ed6	Limit maximum memory used in the WriteBatch representation Summary: Extend TransactionOptions to include max_write_batch_size which determines the maximum size of the writebatch representation. If memory limit is exceeded, the operation will abort with subcode kMemoryLimit. Closes https://github.com/facebook/rocksdb/pull/2124 Differential Revision: D4861842 Pulled By: lth fbshipit-source-id: 46fd172ea67cc90bbba829bf0d70cfab2261c161	2017-04-10 15:42:26 -07:00

1 2 3 4 5 ...

2859 Commits