Commit Graph

2897 Commits

Author SHA1 Message Date
Andrew Kryczka
cc01985db0 Introduce bottom-pri thread pool for large universal compactions
Summary:
When we had a single thread pool for compactions, a thread could be busy for a long time (minutes) executing a compaction involving the bottom level. In multi-instance setups, the entire thread pool could be consumed by such bottom-level compactions. Then, top-level compactions (e.g., a few L0 files) would be blocked for a long time ("head-of-line blocking"). Such top-level compactions are critical to prevent compaction stalls as they can quickly reduce number of L0 files / sorted runs.

This diff introduces a bottom-priority queue for universal compactions including the bottom level. This alleviates the head-of-line blocking situation for fast, top-level compactions.

- Added `Env::Priority::BOTTOM` thread pool. This feature is only enabled if user explicitly configures it to have a positive number of threads.
- Changed `ThreadPoolImpl`'s default thread limit from one to zero. This change is invisible to users as we call `IncBackgroundThreadsIfNeeded` on the low-pri/high-pri pools during `DB::Open` with values of at least one. It is necessary, though, for bottom-pri to start with zero threads so the feature is disabled by default.
- Separated `ManualCompaction` into two parts in `PrepickedCompaction`. `PrepickedCompaction` is used for any compaction that's picked outside of its execution thread, either manual or automatic.
- Forward universal compactions involving last level to the bottom pool (worker thread's entry point is `BGWorkBottomCompaction`).
- Track `bg_bottom_compaction_scheduled_` so we can wait for bottom-level compactions to finish. We don't count them against the background jobs limits. So users of this feature will get an extra compaction for free.
Closes https://github.com/facebook/rocksdb/pull/2580

Differential Revision: D5422916

Pulled By: ajkr

fbshipit-source-id: a74bd11f1ea4933df3739b16808bb21fcd512333
2017-08-03 15:43:29 -07:00
Maysam Yabandeh
58410aee44 Fix the overflow bug in AwaitState
Summary:
https://github.com/facebook/rocksdb/issues/2559 reports an overflow in AwaitState. nbronson has debugged the issue and presented the fix, which is applied to this patch. Moreover this patch adds more comments to clarify the logic in AwaitState.

I tried with both 16 and 64 threads on update benchmark. The fix lowers cpu usage by 1.6 but also lowers the throughput by 1.6 and 2% respectively. Apparently the bug had favored using the spinning more often.

Benchmarks:
TEST_TMPDIR=/dev/shm/tmpdb time ./db_bench --benchmarks="fillrandom" --threads=16 --num=2000000
TEST_TMPDIR=/dev/shm/tmpdb time ./db_bench --use_existing_db=1 --benchmarks="updaterandom[X3]" --threads=16 --num=2000000
TEST_TMPDIR=/dev/shm/tmpdb time ./db_bench --use_existing_db=1 --benchmarks="updaterandom[X3]" --threads=64 --num=200000

Results
$ cat update-16t-bug.txt | tail -4
updaterandom [AVG    3 runs] : 234117 ops/sec;   51.8 MB/sec
updaterandom [MEDIAN 3 runs] : 233581 ops/sec;   51.7 MB/sec
3896.42user 1539.12system 6:50.61elapsed 1323%CPU (0avgtext+0avgdata 331308maxresident)k
0inputs+0outputs (0major+1281001minor)pagefaults 0swaps
$ cat update-16t-fixed.txt | tail -4
updaterandom [AVG    3 runs] : 230364 ops/sec;   51.0 MB/sec
updaterandom [MEDIAN 3 runs] : 226169 ops/sec;   50.0 MB/sec
3865.46user 1568.32system 6:57.63elapsed 1301%CPU (0avgtext+0avgdata 315012maxresident)k
0inputs+0outputs (0major+1342568minor)pagefaults 0swaps

$ cat update-64t-bug.txt | tail -4
updaterandom [AVG    3 runs] : 261878 ops/sec;   57.9 MB/sec
updaterandom [MEDIAN 3 runs] : 262859 ops/sec;   58.2 MB/sec
926.27user 578.06system 2:27.46elapsed 1020%CPU (0avgtext+0avgdata 475480maxresident)k
0inputs+0outputs (0major+1058728minor)pagefaults 0swaps
$ cat update-64t-fixed.txt | tail -4
updaterandom [AVG    3 runs] : 256699 ops/sec;   56.8 MB/sec
updaterandom [MEDIAN 3 runs] : 256380 ops/sec;   56.7 MB/sec
933.47user 575.37system 2:30.41elapsed 1003%CPU (0avgtext+0avgdata 482340maxresident)k
0inputs+0outputs (0major+1078557minor)pagefaults 0swaps
Closes https://github.com/facebook/rocksdb/pull/2679

Differential Revision: D5553732

Pulled By: maysamyabandeh

fbshipit-source-id: 98b72dc3a8e0f22ea29d4f7c7790af10c369c5bb
2017-08-03 10:43:28 -07:00
Maysam Yabandeh
c3d5c4d38a Refactor TransactionImpl
Summary:
This patch refactors TransactionImpl by separating the logic for pessimistic concurrency control from the implementation of how to write the data to rocksdb. The existing implementation is named WriteCommittedTxnImpl as it writes committed data to the db. A template named WritePreparedTxnImpl is also added which will be later completed to provide a an alternative implementation.
Closes https://github.com/facebook/rocksdb/pull/2676

Differential Revision: D5549998

Pulled By: maysamyabandeh

fbshipit-source-id: 16298e86b43ca4849324c1f35c731913c6d17bec
2017-08-03 08:57:22 -07:00
奏之章
3218edc573 Fix universal compaction bug
Summary:
this value ``` Compaction::is_trivial_move_ ``` uninitialized .
under universal compaction , we enable ```  CompactionOptionsUniversal::allow_trivial_move  ``` ,
9b11d4345a/db/compaction.cc (L245)
here is a disastrous bug , some sst trivial move to target level without overlap check ...
THEN , DATABASE DAMAGED , WE GOT A LEVEL WITH OVERLAP !
Closes https://github.com/facebook/rocksdb/pull/2634

Differential Revision: D5530722

Pulled By: siying

fbshipit-source-id: 425ab55bca5967110377d634258360bcf88c200e
2017-07-31 14:27:45 -07:00
Andrew Kryczka
6a36b3a7b9 fix db get/write stats
Summary:
we were passing `record_read_stats` (a bool) as the `hist_type` argument, which meant we were updating either `rocksdb.db.get.micros` (`hist_type == 0`) or `rocksdb.db.write.micros` (`hist_type == 1`) with wrong data.
Closes https://github.com/facebook/rocksdb/pull/2666

Differential Revision: D5520384

Pulled By: ajkr

fbshipit-source-id: 2f7c956aec32f8b58c5c18845ac478e0230c9516
2017-07-31 12:12:03 -07:00
Siying Dong
21696ba502 Replace dynamic_cast<>
Summary:
Replace dynamic_cast<> so that users can choose to build with RTTI off, so that they can save several bytes per object, and get tiny more memory available.
Some nontrivial changes:
1. Add Comparator::GetRootComparator() to get around the internal comparator hack
2. Add the two experiemental functions to DB
3. Add TableFactory::GetOptionString() to avoid unnecessary casting to get the option string
4. Since 3 is done, move the parsing option functions for table factory to table factory files too, to be symmetric.
Closes https://github.com/facebook/rocksdb/pull/2645

Differential Revision: D5502723

Pulled By: siying

fbshipit-source-id: fd13cec5601cf68a554d87bfcf056f2ffa5fbf7c
2017-07-28 16:27:16 -07:00
Sagar Vemuri
ac748c57ed Fix FIFO Compaction with TTL tests
Summary:
- FIFOCompactionWithTTLTest was flaky when run in parallel earlier, and hence it was disabled. Fixed it now.
- Also, faking sleep now instead of really sleeping to make tests more realistic by using TTLs like 1 hour and 1 day.
Closes https://github.com/facebook/rocksdb/pull/2650

Differential Revision: D5506038

Pulled By: sagar0

fbshipit-source-id: deb429a527f045e3e2c5138b547c3e8ac8586aa2
2017-07-28 14:42:59 -07:00
Andrew Kryczka
710411aea6 fix asan/valgrind for TableCache cleanup
Summary:
Breaking commit: d12691b86f

In the above commit, I moved the `TableCache` cleanup logic from `Version` destructor into `PurgeObsoleteFiles`. I missed cleaning up `TableCache` entries for the current `Version` during DB destruction.

This PR adds that logic to `VersionSet` destructor. One unfortunate side effect is now we're potentially deleting `TableReader`s after `column_family_set_.reset()`, which means we can't call `BlockBasedTableReader::Close` a second time as the block cache might already be destroyed.
Closes https://github.com/facebook/rocksdb/pull/2662

Differential Revision: D5515108

Pulled By: ajkr

fbshipit-source-id: 2cb820e19aa813e0d258d17f76b2d7b6b7ee0b18
2017-07-27 20:28:04 -07:00
Siying Dong
fca4d6da17 Build fewer tests in Travis platform_dependent tests
Summary:
platform_dependent tests in Travis now builds all tests, which is not needed. Only build those tests we need to run.
Closes https://github.com/facebook/rocksdb/pull/2647

Differential Revision: D5513954

Pulled By: siying

fbshipit-source-id: 4d540b146124e70dd25586c47939d19f93655b0a
2017-07-27 17:29:01 -07:00
Aaron Gao
8f553d3c52 remove unnecessary internal_comparator param in newIterator
Summary:
solved https://github.com/facebook/rocksdb/issues/2604
Closes https://github.com/facebook/rocksdb/pull/2648

Differential Revision: D5504875

Pulled By: lightmark

fbshipit-source-id: c14bb62ccbdc9e7bda9cd914cae4ea0765d882ee
2017-07-27 14:30:42 -07:00
Andrew Kryczka
d12691b86f move TableCache::EraseHandle outside of db mutex
Summary:
Post-compaction work holds onto db mutex for the longest time (found by tracing lock acquires/releases with LTTng and correlating timestamps with our info log). Further experimentation showed `TableCache::EraseHandle` is responsible for ~86% of time mutex is held. We can just release the handle outside the db mutex.
Closes https://github.com/facebook/rocksdb/pull/2654

Differential Revision: D5507126

Pulled By: ajkr

fbshipit-source-id: 703c01ddf2aea16bc0f9e33c08935d78aa6b781d
2017-07-27 12:14:41 -07:00
Siying Dong
e7697b8ce8 Fix LITE unit tests
Summary: Closes https://github.com/facebook/rocksdb/pull/2649

Differential Revision: D5505778

Pulled By: siying

fbshipit-source-id: 7e935603ede3d958ea087ed6b8cfc4121e8797bc
2017-07-26 21:11:47 -07:00
Siying Dong
c281b44829 Revert "CRC32 Power Optimization Changes"
Summary:
This reverts commit 2289d38115.
Closes https://github.com/facebook/rocksdb/pull/2652

Differential Revision: D5506163

Pulled By: siying

fbshipit-source-id: 105e31dd9d99090453a6b9f32c165206cd3affa3
2017-07-26 19:31:36 -07:00
Sagar Vemuri
9980de262c Fix FIFO compaction picker test
Summary:
A FIFO compaction picker test is accidentally testing against an instance of level compaction picker.
Closes https://github.com/facebook/rocksdb/pull/2641

Differential Revision: D5495390

Pulled By: sagar0

fbshipit-source-id: 301962736f629b1c499570fb504cdbe66bacb46f
2017-07-26 12:12:26 -07:00
Kamalalochana Subbaiah
2289d38115 CRC32 Power Optimization Changes
Summary:
Support for PowerPC Architecture
Detecting AltiVec Support
Closes https://github.com/facebook/rocksdb/pull/2353

Differential Revision: D5210948

Pulled By: siying

fbshipit-source-id: 859a8c063d37697addd89ba2b8a14e5efd5d24bf
2017-07-26 09:42:29 -07:00
Maysam Yabandeh
30b58cf71a Remove the orphan assert on !need_log_sync
Summary:
We initially had disabled support for write_options.sync when concurrent_prepare_ is set. We later added this support but the statement that asserts this combination is not used was left there. This patch cleans it up.
Closes https://github.com/facebook/rocksdb/pull/2642

Differential Revision: D5496101

Pulled By: maysamyabandeh

fbshipit-source-id: becbc503446f2a51bee24cc861958c090c724ec2
2017-07-25 18:41:52 -07:00
Yi Wu
fe1a5559f3 Fix flaky write_callback_test
Summary:
The test is failing occasionally on the assert: `ASSERT_TRUE(writer->state == WriteThread::State::STATE_INIT)`. This is because the test don't make the leader wait for long enough before updating state for its followers. The patch move the update to `threads_waiting` to the end of `WriteThread::JoinBatchGroup:Wait` callback to avoid this happening.

Also adding `WriteThread::JoinBatchGroup:Start` and have each thread wait there while another thread is linking to the linked-list. This is to make the check of `is_leader` more deterministic.

Also changing two while-loops of `compare_exchange_strong` to plain `fetch_add`, to make it look cleaner.
Closes https://github.com/facebook/rocksdb/pull/2640

Differential Revision: D5491525

Pulled By: yiwu-arbug

fbshipit-source-id: 6e897f122082bd6f98e6d51b31a25e5fd0a3fb82
2017-07-25 16:42:11 -07:00
Islam AbdelRahman
ea8ad4f678 Fix compaction div by zero logging
Summary:
We will divide by zero if `stats.micros` is zero, just add a simple check
This happens sometimes during running tests and UBSAN complains
Closes https://github.com/facebook/rocksdb/pull/2631

Differential Revision: D5481455

Pulled By: IslamAbdelRahman

fbshipit-source-id: 69aa24e64e21de15d9e2b8009adf01675fcc6598
2017-07-24 11:58:02 -07:00
kapitan-k
34112aeffd Added db paths to c
Summary: Closes https://github.com/facebook/rocksdb/pull/2613

Differential Revision: D5476064

Pulled By: sagar0

fbshipit-source-id: 6b30a9eacb93a945bbe499eafb90565fa9f1798b
2017-07-24 11:58:02 -07:00
Daniel Black
1d8aa2961c Gcc 7 ParsedInternalKey replace memset with clear function.
Summary:
I haven't looked to see if a class variable inside a loop like this is always initialised.
Closes https://github.com/facebook/rocksdb/pull/2602

Differential Revision: D5475937

Pulled By: IslamAbdelRahman

fbshipit-source-id: 8570b308f9a4b49e2a56ccc9e9b84d7c46568c15
2017-07-24 11:31:15 -07:00
Siying Dong
e67b35c076 Add Iterator::Refresh()
Summary:
Add and implement Iterator::Refresh(). When this function is called, if the super version doesn't change, update the sequence number of the iterator to the latest one and invalidate the iterator. If the super version changed, recreated the whole iterator. This can help users reuse the iterator more easily.
Closes https://github.com/facebook/rocksdb/pull/2621

Differential Revision: D5464500

Pulled By: siying

fbshipit-source-id: f548bd35e85c1efca2ea69273802f6704eba6ba9
2017-07-24 10:54:37 -07:00
Andrew Kryczka
a34b2e388e Fix caching of compaction picker's next index
Summary:
The previous implementation of caching `file_size` index made no sense. It only remembered the original span of locked files starting from beginning of `file_size`. We should remember the index after all compactions that have been considered but rejected. This will reduce the work we do while holding the db mutex.
Closes https://github.com/facebook/rocksdb/pull/2624

Differential Revision: D5468152

Pulled By: ajkr

fbshipit-source-id: ab92a4bffe76f9f174d861bb5812b974d1013400
2017-07-21 20:57:15 -07:00
Sagar Vemuri
72502cf227 Revert "comment out unused parameters"
Summary:
This reverts the previous commit 1d7048c598, which broke the build.

Did a `git revert 1d7048c`.
Closes https://github.com/facebook/rocksdb/pull/2627

Differential Revision: D5476473

Pulled By: sagar0

fbshipit-source-id: 4756ff5c0dfc88c17eceb00e02c36176de728d06
2017-07-21 18:26:26 -07:00
Victor Gao
1d7048c598 comment out unused parameters
Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually.

Reviewed By: igorsugak

Differential Revision: D5454343

fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2
2017-07-21 14:57:44 -07:00
Andrew Kryczka
a22b9cc6fe overlapping endpoint fixes in level compaction picker
Summary:
This diff addresses two problems. Both problems cause us to miss scheduling desirable compactions. One side effect is compaction picking can spam logs, as there's no delay after failed attempts to pick compactions.

1. If a compaction pulled in a locked input-level file due to user-key overlap, we would not consider picking another file from the same input level.
2. If a compaction pulled in a locked output-level file due to user-key overlap, we would not consider picking any other compaction on any level.

The code changes are dependent, which is why I solved both problems in a single diff.

- Moved input-level `ExpandInputsToCleanCut` into the loop inside `PickFileToCompact`. This gives two benefits: (1) if it fails, we will try the next-largest file on the same input level; (2) we get the fully-expanded input-level key-range with which we can check for pending compactions in output level.
- Added another call to `ExpandInputsToCleanCut` inside `PickFileToCompact`'s to check for compaction conflicts in output level.
- Deleted call to `IsRangeInCompaction` in `PickFileToCompact`, as `ExpandInputsToCleanCut` also correctly handles the case where original output-level files (i.e., ones not pulled in due to user-key overlap) are pending compaction.
Closes https://github.com/facebook/rocksdb/pull/2615

Differential Revision: D5454643

Pulled By: ajkr

fbshipit-source-id: ea3fb5477d83e97148951af3fd4558d2039e9872
2017-07-19 20:42:00 -07:00
Andrew Kryczka
ffd2a2eefd delete ExpandInputsToCleanCut failure log
Summary:
I decided not even to keep it as an INFO-level log as it is too normal for compactions to be skipped due to locked input files. Removing logging here makes us consistent with how we treat locked files that weren't pulled in due to overlap.

We may want some error handling on line 422, which should never happen when called by `LevelCompactionBuilder::PickCompaction`, as `SetupInitialFiles` skips compactions where overlap causes the output level to pull in locked files.
Closes https://github.com/facebook/rocksdb/pull/2617

Differential Revision: D5458502

Pulled By: ajkr

fbshipit-source-id: c2e5f867c0a77c1812ce4242ab3e085b3eee0bae
2017-07-19 20:42:00 -07:00
Maysam Yabandeh
36651d14ee Moving static AdaptationContext to outside function
Summary:
Moving static AdaptationContext to outside function to bypass tsan's false report with static initializers.

It is because with optimization enabled std::atomic is simplified to as a simple read with no locks. The existing lock produced by static initializer is __cxa_guard_acquire which is apparently not understood by tsan as it is different from normal locks (__gthrw_pthread_mutex_lock).

This is a known problem with tsan:
https://stackoverflow.com/questions/27464190/gccs-tsan-reports-a-data-race-with-a-thread-safe-static-local
https://stackoverflow.com/questions/42062557/c-multithreading-is-initialization-of-a-local-static-lambda-thread-safe

A workaround that I tried was to move the static variable outside the function. It is not a good coding practice since it gives global visibility to variable but it is a hackish workaround until g++ tsan is improved.
Closes https://github.com/facebook/rocksdb/pull/2598

Differential Revision: D5445281

Pulled By: yiwu-arbug

fbshipit-source-id: 6142bd934eb5852d8fd7ce027af593ba697ed41d
2017-07-18 16:58:22 -07:00
Siying Dong
ae28634e9f Remove some left-over BSD headers
Summary: Closes https://github.com/facebook/rocksdb/pull/2608

Differential Revision: D5444797

Pulled By: siying

fbshipit-source-id: 690581d03f37822e059a16085088e8e2d8a45016
2017-07-18 11:56:57 -07:00
Sushma Devendrappa
0655b58582 enable PinnableSlice for RowCache
Summary:
This patch enables using PinnableSlice for RowCache, changes include
not releasing the cache handle immediately after lookup in TableCache::Get, instead pass a Cleanble function which does Cache::RleaseHandle.
Closes https://github.com/facebook/rocksdb/pull/2492

Differential Revision: D5316216

Pulled By: maysamyabandeh

fbshipit-source-id: d2a684bd7e4ba73772f762e58a82b5f4fbd5d362
2017-07-17 15:08:30 -07:00
Yi Wu
00464a3140 Fix column_family_test with LITE build
Summary:
Fix column_family_test with LITE build. I need this patch to fix 5.6 branch.
Closes https://github.com/facebook/rocksdb/pull/2597

Differential Revision: D5437171

Pulled By: yiwu-arbug

fbshipit-source-id: 88b9dc5925a6b47af10c1b41bc5b07c4251a84b5
2017-07-17 15:08:24 -07:00
Yedidya Feldblum
f1a056e005 CodeMod: Prefer ADD_FAILURE() over EXPECT_TRUE(false), et cetera
Summary:
CodeMod: Prefer `ADD_FAILURE()` over `EXPECT_TRUE(false)`, et cetera.

The tautologically-conditioned and tautologically-contradicted boolean expectations/assertions have better alternatives: unconditional passes and failures.

Reviewed By: Orvid

Differential Revision:
D5432398

Tags: codemod, codemod-opensource

fbshipit-source-id: d16b447e8696a6feaa94b41199f5052226ef6914
2017-07-16 21:26:02 -07:00
Siying Dong
3c327ac2d0 Change RocksDB License
Summary: Closes https://github.com/facebook/rocksdb/pull/2589

Differential Revision: D5431502

Pulled By: siying

fbshipit-source-id: 8ebf8c87883daa9daa54b2303d11ce01ab1f6f75
2017-07-15 16:11:23 -07:00
奏之章
70440f7a63 Add virtual func IsDeleteRangeSupported
Summary:
this modify allows third-party tables able to support delete range
Closes https://github.com/facebook/rocksdb/pull/2035

Differential Revision: D5407973

Pulled By: ajkr

fbshipit-source-id: 82e364b7dd5a198660788d59543f15b8f95cc418
2017-07-12 16:58:45 -07:00
Sagar Vemuri
56656e12d6 Temporarily disable FIFOCompactionWithTTLTest
Summary:
FIFOCompactionWithTTLTests are flaky when run in parallel, as there is a time element involved to it. Temporarily disabling them while I investigate a more robust testing solution like, say,  mocking time.
Closes https://github.com/facebook/rocksdb/pull/2548

Differential Revision: D5386084

Pulled By: sagar0

fbshipit-source-id: 262886b25bdf091021d8553e780443a985e9bac4
2017-07-07 20:12:58 -07:00
Andrew Kryczka
33042573db Fix GetCurrentTime() initialization for valgrind
Summary:
Valgrind had false positive complaints about the initialization pattern for `GetCurrentTime()`'s argument in #2480. We can instead have the client initialize the time variable before calling `GetCurrentTime()`, and have `GetCurrentTime()` promise to only overwrite it in success case.
Closes https://github.com/facebook/rocksdb/pull/2526

Differential Revision: D5358689

Pulled By: ajkr

fbshipit-source-id: 857b189f24c19196f6bb299216f3e23e7bc4be42
2017-07-05 12:12:00 -07:00
Maysam Yabandeh
7604b463b5 Update the AddDBStats in LITE
Summary: Closes https://github.com/facebook/rocksdb/pull/2525

Differential Revision: D5356859

Pulled By: maysamyabandeh

fbshipit-source-id: f593adad2a8aab12dcd6ab25db076eca51d30d34
2017-06-30 10:56:50 -07:00
Maysam Yabandeh
1e34d07e18 Simplify and document sync rules for logs_ etc
Summary:
Adding/Correcting inline comments and clarify the sync rules. To make it simple to reason, the rules are a big general which ended up to some extra synchronizations. However such synchronizations are not on the fast path, and they are worth the simplicity.
Closes https://github.com/facebook/rocksdb/pull/2517

Differential Revision: D5348239

Pulled By: maysamyabandeh

fbshipit-source-id: ff2e59fb1e568c122d2cdbf598310f3613b7d212
2017-06-30 09:42:28 -07:00
Andrew Kryczka
d310e0f339 Regression test for empty dedicated range deletion file
Summary:
Issue: #2478
Fix: #2503

The bug happened when all of these conditions were satisfied:

- A subcompaction generates no keys
- `RangeDelAggregator::ShouldAddTombstones()` returns true because there's at least one non-obsoleted range deletion in its map
- None of the non-obsolete tombstones overlap with the subcompaction key-range

Under those conditions, we were creating a dedicated file for range deletions which was left empty, thus causing an error in VersionEdit.

I verified this test case fails before the #2503 fix and passes after.
Closes https://github.com/facebook/rocksdb/pull/2521

Differential Revision: D5352568

Pulled By: ajkr

fbshipit-source-id: f619cae39984ce9bb9b7a4e7a9ac0f2bb2ce43e9
2017-06-30 00:11:25 -07:00
Maysam Yabandeh
e9f91a5176 Add a fetch_add variation to AddDBStats
Summary:
AddDBStats is in two steps of load and store, which is more efficient than fetch_add. This is however not thread-safe. Currently we have to protect concurrent access to AddDBStats with a mutex which is less efficient that fetch_add.

This patch adds the option to do fetch_add when AddDBStats. The results for my 2pc benchmark on sysbench is:
- vanilla: 68618 tps
- removing mutex on AddDBStats (unsafe): 69767 tps
- fetch_add for all AddDBStats: 69200 tps
- fetch_add only for concurrently access AddDBStats (this patch): 69579 tps
Closes https://github.com/facebook/rocksdb/pull/2505

Differential Revision: D5330656

Pulled By: maysamyabandeh

fbshipit-source-id: af64d7bee135b0e86b4fac323a4f9d9113eaa383
2017-06-29 17:12:19 -07:00
zhangjinpeng1987
c1b375e96a skip generating empty sst
Summary:
When a compaction job output nothing, there is no necessary to generate a empty sst file which will cause `VersionEdit::EncodeTo` failed.
ref https://github.com/facebook/rocksdb/issues/2478
Closes https://github.com/facebook/rocksdb/pull/2503

Differential Revision: D5350799

Pulled By: ajkr

fbshipit-source-id: df0b4fcf3507fe1c3c435208b762e75478e00143
2017-06-29 15:26:52 -07:00
Mike Kolupaev
397ab11152 Improve Status message for block checksum mismatches
Summary:
We've got some DBs where iterators return Status with message "Corruption: block checksum mismatch" all the time. That's not very informative. It would be much easier to investigate if the error message contained the file name - then we would know e.g. how old the corrupted file is, which would be very useful for finding the root cause. This PR adds file name, offset and other stuff to some block corruption-related status messages.

It doesn't improve all the error messages, just a few that were easy to improve. I'm mostly interested in "block checksum mismatch" and "Bad table magic number" since they're the only corruption errors that I've ever seen in the wild.
Closes https://github.com/facebook/rocksdb/pull/2507

Differential Revision: D5345702

Pulled By: al13n321

fbshipit-source-id: fc8023d43f1935ad927cef1b9c55481ab3cb1339
2017-06-28 21:27:01 -07:00
Siying Dong
18c63af6ef Make "make analyze" happy
Summary:
"make analyze" is reporting some errors. It's complicated to look but it seems to me that they are all false positive. Anyway, I think cleaning them up is a good idea. Some of the changes are hacky but I don't know a better way.
Closes https://github.com/facebook/rocksdb/pull/2508

Differential Revision: D5341710

Pulled By: siying

fbshipit-source-id: 6070e430e0e41a080ef441e05e8ec827d45efab6
2017-06-28 15:42:27 -07:00
Maysam Yabandeh
01534db24c Fix the reported asan issues
Summary:
This is to resolve the asan complains. In the meanwhile I am working on clarifying/revisiting the sync rules.
Closes https://github.com/facebook/rocksdb/pull/2510

Differential Revision: D5338660

Pulled By: yiwu-arbug

fbshipit-source-id: ce6f6e0826d43a2c0bfa4328a00c78f73cd6498a
2017-06-28 13:13:22 -07:00
Sagar Vemuri
1cd45cd1b3 FIFO Compaction with TTL
Summary:
Introducing FIFO compactions with TTL.

FIFO compaction is based on size only which makes it tricky to enable in production as use cases can have organic growth. A user requested an option to drop files based on the time of their creation instead of the total size.

To address that request:
- Added a new TTL option to FIFO compaction options.
- Updated FIFO compaction score to take TTL into consideration.
- Added a new table property, creation_time, to keep track of when the SST file is created.
- Creation_time is set as below:
  - On Flush: Set to the time of flush.
  - On Compaction: Set to the max creation_time of all the files involved in the compaction.
  - On Repair and Recovery: Set to the time of repair/recovery.
  - Old files created prior to this code change will have a creation_time of 0.
- FIFO compaction with TTL is enabled when ttl > 0. All files older than ttl will be deleted during compaction. i.e. `if (file.creation_time < (current_time - ttl)) then delete(file)`. This will enable cases where you might want to delete all files older than, say, 1 day.
- FIFO compaction will fall back to the prior way of deleting files based on size if:
  - the creation_time of all files involved in compaction is 0.
  - the total size (of all SST files combined) does not drop below `compaction_options_fifo.max_table_files_size` even if the files older than ttl are deleted.

This feature is not supported if max_open_files != -1 or with table formats other than Block-based.

**Test Plan:**
Added tests.

**Benchmark results:**
Base: FIFO with max size: 100MB ::
```
svemuri@dev15905 ~/rocksdb (fifo-compaction) $ TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=readwhilewriting --num=5000000 --threads=16 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=100

readwhilewriting :       1.924 micros/op 519858 ops/sec;   13.6 MB/s (1176277 of 5000000 found)
```

With TTL (a low one for testing) ::
```
svemuri@dev15905 ~/rocksdb (fifo-compaction) $ TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=readwhilewriting --num=5000000 --threads=16 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=100 --fifo_compaction_ttl=20

readwhilewriting :       1.902 micros/op 525817 ops/sec;   13.7 MB/s (1185057 of 5000000 found)
```
Example Log lines:
```
2017/06/26-15:17:24.609249 7fd5a45ff700 (Original Log Time 2017/06/26-15:17:24.609177) [db/compaction_picker.cc:1471] [default] FIFO compaction: picking file 40 with creation time 1498515423 for deletion
2017/06/26-15:17:24.609255 7fd5a45ff700 (Original Log Time 2017/06/26-15:17:24.609234) [db/db_impl_compaction_flush.cc:1541] [default] Deleted 1 files
...
2017/06/26-15:17:25.553185 7fd5a61a5800 [DEBUG] [db/db_impl_files.cc:309] [JOB 0] Delete /dev/shm/dbbench/000040.sst type=2 #40 -- OK
2017/06/26-15:17:25.553205 7fd5a61a5800 EVENT_LOG_v1 {"time_micros": 1498515445553199, "job": 0, "event": "table_file_deletion", "file_number": 40}
```

SST Files remaining in the dbbench dir, after db_bench execution completed:
```
svemuri@dev15905 ~/rocksdb (fifo-compaction)  $ ls -l /dev/shm//dbbench/*.sst
-rw-r--r--. 1 svemuri users 30749887 Jun 26 15:17 /dev/shm//dbbench/000042.sst
-rw-r--r--. 1 svemuri users 30768779 Jun 26 15:17 /dev/shm//dbbench/000044.sst
-rw-r--r--. 1 svemuri users 30757481 Jun 26 15:17 /dev/shm//dbbench/000046.sst
```
Closes https://github.com/facebook/rocksdb/pull/2480

Differential Revision: D5305116

Pulled By: sagar0

fbshipit-source-id: 3e5cfcf5dd07ed2211b5b37492eb235b45139174
2017-06-27 17:11:48 -07:00
Siying Dong
89468c01d4 Fix Windows build broken by 5c97a7c066
Summary:
A typo conversion fails Windows build. Fix it.
Closes https://github.com/facebook/rocksdb/pull/2500

Differential Revision: D5325962

Pulled By: siying

fbshipit-source-id: 2cefdafc9afbc85f856f403af7c876b622400630
2017-06-26 17:28:22 -07:00
Ewout Prangsma
51778612c9 Encryption at rest support
Summary:
This PR adds support for encrypting data stored by RocksDB when written to disk.

It adds an `EncryptedEnv` override of the `Env` class with matching overrides for sequential&random access files.
The encryption itself is done through a configurable `EncryptionProvider`. This class creates is asked to create `BlockAccessCipherStream` for a file. This is where the actual encryption/decryption is being done.
Currently there is a Counter mode implementation of `BlockAccessCipherStream` with a `ROT13` block cipher (NOTE the `ROT13` is for demo purposes only!!).

The Counter operation mode uses an initial counter & random initialization vector (IV).
Both are created randomly for each file and stored in a 4K (default size) block that is prefixed to that file. The `EncryptedEnv` implementation is such that clients of the `Env` class do not see this prefix (nor data, nor in filesize).
The largest part of the prefix block is also encrypted, and there is room left for implementation specific settings/values/keys in there.

To test the encryption, the `DBTestBase` class has been extended to consider a new environment variable called `ENCRYPTED_ENV`. If set, the test will setup a encrypted instance of the `Env` class to use for all tests.
Typically you would run it like this:

```
ENCRYPTED_ENV=1 make check_some
```

There is also an added test that checks that some data inserted into the database is or is not "visible" on disk. With `ENCRYPTED_ENV` active it must not find plain text strings, with `ENCRYPTED_ENV` unset, it must find the plain text strings.
Closes https://github.com/facebook/rocksdb/pull/2424

Differential Revision: D5322178

Pulled By: sdwilsh

fbshipit-source-id: 253b0a9c2c498cc98f580df7f2623cbf7678a27f
2017-06-26 16:56:24 -07:00
Siying Dong
5c97a7c066 Unit Tests for sync, range sync and file close failures
Summary: Closes https://github.com/facebook/rocksdb/pull/2454

Differential Revision: D5255320

Pulled By: siying

fbshipit-source-id: 0080830fa8eb5da6de25e17ba68aee91018c7913
2017-06-26 13:27:58 -07:00
Siying Dong
d757355cbf Fix bug that flush doesn't respond to fsync result
Summary:
With a regression bug was introduced two years ago, by 6e9fbeb27c , we fail to check return status of fsync call. This can cause we miss the information from the file system and can potentially cause corrupted data which we could have been detected.
Closes https://github.com/facebook/rocksdb/pull/2495

Reviewed By: ajkr

Differential Revision: D5321949

Pulled By: siying

fbshipit-source-id: c68117914bb40700198fc37d0e4c63163a8a1031
2017-06-26 12:41:48 -07:00
Maysam Yabandeh
8e6345d2df Update rename of ParanoidCheck
Summary: Closes https://github.com/facebook/rocksdb/pull/2494

Differential Revision: D5317902

Pulled By: maysamyabandeh

fbshipit-source-id: 097330292180816b3d0c9f4cbbdb6f68f0180200
2017-06-24 18:12:03 -07:00
Maysam Yabandeh
499ebb3ab5 Optimize for serial commits in 2PC
Summary:
Throughput: 46k tps in our sysbench settings (filling the details later)

The idea is to have the simplest change that gives us a reasonable boost
in 2PC throughput.

Major design changes:
1. The WAL file internal buffer is not flushed after each write. Instead
it is flushed before critical operations (WAL copy via fs) or when
FlushWAL is called by MySQL. Flushing the WAL buffer is also protected
via mutex_.
2. Use two sequence numbers: last seq, and last seq for write. Last seq
is the last visible sequence number for reads. Last seq for write is the
next sequence number that should be used to write to WAL/memtable. This
allows to have a memtable write be in parallel to WAL writes.
3. BatchGroup is not used for writes. This means that we can have
parallel writers which changes a major assumption in the code base. To
accommodate for that i) allow only 1 WriteImpl that intends to write to
memtable via mem_mutex_--which is fine since in 2PC almost all of the memtable writes
come via group commit phase which is serial anyway, ii) make all the
parts in the code base that assumed to be the only writer (via
EnterUnbatched) to also acquire mem_mutex_, iii) stat updates are
protected via a stat_mutex_.

Note: the first commit has the approach figured out but is not clean.
Submitting the PR anyway to get the early feedback on the approach. If
we are ok with the approach I will go ahead with this updates:
0) Rebase with Yi's pipelining changes
1) Currently batching is disabled by default to make sure that it will be
consistent with all unit tests. Will make this optional via a config.
2) A couple of unit tests are disabled. They need to be updated with the
serial commit of 2PC taken into account.
3) Replacing BatchGroup with mem_mutex_ got a bit ugly as it requires
releasing mutex_ beforehand (the same way EnterUnbatched does). This
needs to be cleaned up.
Closes https://github.com/facebook/rocksdb/pull/2345

Differential Revision: D5210732

Pulled By: maysamyabandeh

fbshipit-source-id: 78653bd95a35cd1e831e555e0e57bdfd695355a4
2017-06-24 14:11:29 -07:00
Andrew Kryczka
71f5bcb730 Introduce OnBackgroundError callback
Summary:
Some users want to prevent rocksdb from entering read-only mode in certain error cases. This diff gives them a callback, `OnBackgroundError`, that they can use to achieve it.

- call `OnBackgroundError` every time we consider setting `bg_error_`. Use its result to assign `bg_error_` but not to change the function's return status.
- classified calls using `BackgroundErrorReason` to give the callback some info about where the error happened
- renamed `ParanoidCheck` to something more specific so we can provide a clear `BackgroundErrorReason`
- unit tests for the most common cases: flush or compaction errors
Closes https://github.com/facebook/rocksdb/pull/2477

Differential Revision: D5300190

Pulled By: ajkr

fbshipit-source-id: a0ea4564249719b83428e3f4c6ca2c49e366e9b3
2017-06-22 19:41:50 -07:00
Siying Dong
6837a17621 Fix Data Race Between CreateColumnFamily() and GetAggregatedIntProperty()
Summary:
CreateColumnFamily() releases DB mutex after adding column family to the set and install super version (to write option file), so if users call GetAggregatedIntProperty() in the middle, then super version will be null and the process will crash. Fix it by skipping those column families without super version installed.

Maybe we should also fix the problem of releasing the lock when reading option file, but it is more risky. so I'm doing a quick and safer fix and we can investigate it later.
Closes https://github.com/facebook/rocksdb/pull/2475

Differential Revision: D5298053

Pulled By: siying

fbshipit-source-id: 4b3c8f91c60400b163fcc6cda8a0c77723be0ef6
2017-06-22 15:56:47 -07:00
Andrew Kryczka
9467eb6141 Fix flush assertion with tsan
Summary:
DBImpl's instance variables should only be accessed with mutex held. I moved an assert later to uphold this rule.

DBTest.LastWriteBufferDelay test was sporadically failing TSAN because it tried to flush around the same time the db was destroyed, so the variable was accessed simultaneously by two threads.
Closes https://github.com/facebook/rocksdb/pull/2471

Differential Revision: D5286857

Pulled By: ajkr

fbshipit-source-id: 435abd84efa601f667c254e320b0bb5a434b971f
2017-06-20 16:43:44 -07:00
Dmitri Smirnov
a21db161c9 Implement ReopenWritibaleFile on Windows and other fixes
Summary:
Make default impl return NoSupported so the db_blob
  tests exist in a meaningful manner.
  Replace std::thread to port::Thread
Closes https://github.com/facebook/rocksdb/pull/2465

Differential Revision: D5275563

Pulled By: yiwu-arbug

fbshipit-source-id: cedf1a18a2c05e20d768c1308b3f3224dbd70ab6
2017-06-20 10:31:13 -07:00
zhangjinpeng1987
c430d69eed fix coredump for release nullptr
Summary:
Coredump will be triggered when ingest external sst file after delete range.
ref https://github.com/facebook/rocksdb/issues/2398
Closes https://github.com/facebook/rocksdb/pull/2463

Differential Revision: D5275599

Pulled By: ajkr

fbshipit-source-id: 0828dbc062ea8c74e913877cd63494fd3478a30d
2017-06-19 11:41:38 -07:00
hyunwoo
6b5a5dc5d8 fixed typo
Summary:
fixed typo
Closes https://github.com/facebook/rocksdb/pull/2430

Differential Revision: D5242471

Pulled By: IslamAbdelRahman

fbshipit-source-id: 832eb3a4c70221444ccd2ae63217823fec56c748
2017-06-13 16:58:01 -07:00
Andrew Kryczka
c217e0b9c7 Call RateLimiter for compaction reads
Summary:
Allow users to rate limit background work based on read bytes, written bytes, or sum of read and written bytes. Support these by changing the RateLimiter API, so no additional options were needed.
Closes https://github.com/facebook/rocksdb/pull/2433

Differential Revision: D5216946

Pulled By: ajkr

fbshipit-source-id: aec57a8357dbb4bfde2003261094d786d94f724e
2017-06-13 14:56:46 -07:00
Siying Dong
5d5a28a98c Fix Clang release build broken by 5582123dee
Summary:
5582123dee broken CLANG release build because of an unexpected change. Fix it.
Closes https://github.com/facebook/rocksdb/pull/2443

Differential Revision: D5236297

Pulled By: siying

fbshipit-source-id: 1b410adf13ded149c53e8235e9ea9f3130fb5403
2017-06-13 04:56:35 -07:00
Islam AbdelRahman
d713471da8 Limit trash directory to be 25% of total DB
Summary:
Update DeleteScheduler to delete files immediately if trash directory is >= 25% of DB size
Closes https://github.com/facebook/rocksdb/pull/2436

Differential Revision: D5230384

Pulled By: IslamAbdelRahman

fbshipit-source-id: 5cbda8ac536a3cc72c774641621edc02c8202482
2017-06-12 16:57:21 -07:00
Siying Dong
5582123dee Sample number of reads per SST file
Summary:
We estimate number of reads per SST files, by updating the counter per file in sampled read requests. This information can later be used to trigger compactions to improve read performacne.
Closes https://github.com/facebook/rocksdb/pull/2417

Differential Revision: D5193528

Pulled By: siying

fbshipit-source-id: b4241c5ad0eaf444b61afb53f8e6290d9f5da2df
2017-06-12 07:12:08 -07:00
Siying Dong
db818d2d1a Fix RocksDB Lite build with CLANG
Summary: Closes https://github.com/facebook/rocksdb/pull/2419

Differential Revision: D5193976

Pulled By: siying

fbshipit-source-id: 62d115edee6043237e9d6ad3c2a05481e162c9eb
2017-06-12 06:41:27 -07:00
Yi Wu
85dace2afa Disable DBRangeDelTest::TailingIteratorRangeTombstoneUnsupported for ubsan
Summary:
UBSAN crashes when it run the test. Disabling it for UBSAN.
Closes https://github.com/facebook/rocksdb/pull/2427

Differential Revision: D5210897

Pulled By: yiwu-arbug

fbshipit-source-id: 2f5a876807c98d8db79ab9581965f7e6b29d4163
2017-06-08 12:43:01 -07:00
Siying Dong
52a7f38b19 WriteOptions.low_pri which can throttle low pri writes if needed
Summary:
If ReadOptions.low_pri=true and compaction is behind, the write will either return immediate or be slowed down based on ReadOptions.no_slowdown.
Closes https://github.com/facebook/rocksdb/pull/2369

Differential Revision: D5127619

Pulled By: siying

fbshipit-source-id: d30e1cff515890af0eff32dfb869d2e4c9545eb0
2017-06-05 15:02:35 -07:00
Yi Wu
dba9f3722b Fix db_write_test clang/windows build failure
Summary:
Fix db_write_test clang/windows build failure. Explicitly cast size_t (unsigned long) to uint32_t (unsigned int).
Closes https://github.com/facebook/rocksdb/pull/2407

Differential Revision: D5182995

Pulled By: yiwu-arbug

fbshipit-source-id: aba225a9fccb12d5bfbdc2cd6efc11040706a9d2
2017-06-05 12:27:24 -07:00
hyunwoo
c7662a44a4 fixed typo
Summary:
fixed typo
Closes https://github.com/facebook/rocksdb/pull/2376

Differential Revision: D5183630

Pulled By: ajkr

fbshipit-source-id: 133cfd0445959e70aa2cd1a12151bf3c0c5c3ac5
2017-06-05 11:27:34 -07:00
Aaron Gao
7f6c02dda1 using ThreadLocalPtr to hide ROCKSDB_SUPPORT_THREAD_LOCAL from public…
Summary:
… headers

https://github.com/facebook/rocksdb/pull/2199 should not reference RocksDB-specific macros (like ROCKSDB_SUPPORT_THREAD_LOCAL in this case) to public headers, `iostats_context.h` and `perf_context.h`. We shouldn't do that because users have to provide these compiler flags when building their binary with RocksDB.

We should hide the thread local global variable inside our implementation and just expose a function api to retrieve these variables. It may break some users for now but good for long term.

make check -j64
Closes https://github.com/facebook/rocksdb/pull/2380

Differential Revision: D5177896

Pulled By: lightmark

fbshipit-source-id: 6fcdfac57f2e2dcfe60992b7385c5403f6dcb390
2017-06-02 17:26:19 -07:00
Mike Kolupaev
138b87eae4 Fix interaction between CompactionFilter::Decision::kRemoveAndSkipUnt…
Summary:
Fixes the following scenario:
 1. Set prefix extractor. Enable bloom filters, with `whole_key_filtering = false`. Use compaction filter that sometimes returns `kRemoveAndSkipUntil`.
 2. Do a compaction.
 3. Compaction creates an iterator with `total_order_seek = false`, calls `SeekToFirst()` on it, then repeatedly calls `Next()`.
 4. At some point compaction filter returns `kRemoveAndSkipUntil`.
 5. Compaction calls `Seek(skip_until)` on the iterator. The key that it seeks to happens to have prefix that doesn't match the bloom filter. Since `total_order_seek = false`, iterator becomes invalid, and compaction thinks that it has reached the end. The rest of the compaction input is silently discarded.

The fix is to make compaction iterator use `total_order_seek = true`.

The implementation for PlainTable is quite awkward. I've made `kRemoveAndSkipUntil` officially incompatible with PlainTable. If you try to use them together, compaction will fail, and DB will enter read-only mode (`bg_error_`). That's not a very graceful way to communicate a misconfiguration, but the alternatives don't seem worth the implementation time and complexity. To be able to check in advance that `kRemoveAndSkipUntil` is not going to be used with PlainTable, we'd need to extend the interface of either `CompactionFilter` or `InternalIterator`. It seems unlikely that anyone will ever want to use `kRemoveAndSkipUntil` with PlainTable: PlainTable probably has very few users, and `kRemoveAndSkipUntil` has only one user so far: us (logdevice).
Closes https://github.com/facebook/rocksdb/pull/2349

Differential Revision: D5110388

Pulled By: lightmark

fbshipit-source-id: ec29101a99d9dcd97db33923b87f72bce56cc17a
2017-06-02 15:11:38 -07:00
Siying Dong
95b0e89b5d Improve write buffer manager (and allow the size to be tracked in block cache)
Summary:
Improve write buffer manager in several ways:
1. Size is tracked when arena block is allocated, rather than every allocation, so that it can better track actual memory usage and the tracking overhead is slightly lower.
2. We start to trigger memtable flush when 7/8 of the memory cap hits, instead of 100%, and make 100% much harder to hit.
3. Allow a cache object to be passed into buffer manager and the size allocated by memtable can be costed there. This can help users have one single memory cap across block cache and memtable.
Closes https://github.com/facebook/rocksdb/pull/2350

Differential Revision: D5110648

Pulled By: siying

fbshipit-source-id: b4238113094bf22574001e446b5d88523ba00017
2017-06-02 14:26:56 -07:00
Andrew Kryczka
a4d9c02511 Pass CF ID to MemTableRepFactory
Summary:
Some users want to monitor column family activity in their custom memtable implementations. Previously there was no way to figure out with which column family a memtable is associated. This diff:

- adds an overload to MemTableRepFactory::CreateMemTableRep() that provides the CF ID. For compatibility, its default implementation calls the old overload.
- updates MemTable to create MemTableRep's using the new overload.
Closes https://github.com/facebook/rocksdb/pull/2346

Differential Revision: D5108061

Pulled By: ajkr

fbshipit-source-id: 3a1921214a348dd8ea0f54e1cab3b71c3d46d616
2017-06-02 12:12:06 -07:00
Yi Wu
f68d88be51 Fix DBWriteTest::ReturnSequenceNumberMultiThreaded data race
Summary:
rocksdb::Random is not thread-safe. Have one Random for each thread instead.
Closes https://github.com/facebook/rocksdb/pull/2400

Differential Revision: D5173919

Pulled By: yiwu-arbug

fbshipit-source-id: 1a99c7b877f3893eb22355af49e321bcad4e53e6
2017-06-02 11:42:11 -07:00
Andrew Kryczka
215076ef06 Fix TSAN: avoid arena mode with range deletions
Summary:
The range deletion meta-block iterators weren't getting cleaned up properly since they don't support arena allocation. I didn't implement arena support since, in the general case, each iterator is used only once and separately from all other iterators, so there should be no benefit to data locality.

Anyways, this diff fixes up #2370 by treating range deletion iterators as non-arena-allocated.
Closes https://github.com/facebook/rocksdb/pull/2399

Differential Revision: D5171119

Pulled By: ajkr

fbshipit-source-id: bef6f5c4c5905a124f4993945aed4bd86e2807d8
2017-06-01 22:26:49 -07:00
Andrew Kryczka
3a8a848a55 account for L0 size in estimated compaction bytes
Summary:
also changed the `>` in the comparison against `level0_file_num_compaction_trigger` into a `>=` since exactly `level0_file_num_compaction_trigger` can trigger a compaction from L0.
Closes https://github.com/facebook/rocksdb/pull/2179

Differential Revision: D4915772

Pulled By: ajkr

fbshipit-source-id: e38fec6253de6f9a40e61734615c6670d84038aa
2017-06-01 17:56:59 -07:00
Tamir Duberstein
0dc3040d54 db: avoid #includeing malloc and jemalloc simultaneously
Summary:
This fixes a compilation failure on Linux when the system libc is not
glibc. jemalloc's configure script incorrectly assumes that glibc is
always used on Linux systems, producing glibc-style signatures; when
the system libc is e.g. musl, the following error is observed:

```
  [  0%] Building CXX object CMakeFiles/rocksdb.dir/db/db_impl.cc.o
  In file included from /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb.src/table/block.h:19:0,
                   from /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb.src/db/db_impl.cc:77:
  /x-tools/x86_64-unknown-linux-musl/x86_64-unknown-linux-musl/sysroot/usr/include/malloc.h:19:8: error: declaration of 'size_t malloc_usable_size(void*)' has a different exception specifier
   size_t malloc_usable_size(void *);
          ^~~~~~~~~~~~~~~~~~
  In file included from /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb.src/db/db_impl.cc:20:0:
  /go/native/x86_64-unknown-linux-musl/jemalloc/include/jemalloc/jemalloc.h:78:33: note: from previous declaration 'size_t malloc_usable_size(void*) throw ()'
   #  define je_malloc_usable_size malloc_usable_size
                                   ^
  /go/native/x86_64-unknown-linux-musl/jemalloc/include/jemalloc/jemalloc.h:239:41: note: in expansion of macro 'je_malloc_usable_size'
   JEMALLOC_EXPORT size_t JEMALLOC_NOTHROW je_malloc_usable_size(
                                           ^~~~~~~~~~~~~~~~~~~~~
  CMakeFiles/rocksdb.dir/build.make:350: recipe for target 'CMakeFiles/rocksdb.dir/db/db_impl.cc.o' failed
```

This works around the issue by rearranging the sources such that
jemalloc's headers are never in the same scope as the system's malloc
header. The jemalloc issue has been reported as well, see:
https://github.com/jemalloc/jemalloc/issues/778.

cc tschottdorf
Closes https://github.com/facebook/rocksdb/pull/2188

Differential Revision: D5163048

Pulled By: siying

fbshipit-source-id: c553125458892def175c1be5682b0330d80b2a0d
2017-05-31 22:43:02 -07:00
Andrew Kryczka
9c9909bf7d Support ingest file when range deletions exist
Summary:
Previously we returned NotSupported when ingesting files into a database containing any range deletions. This diff adds the support.

- Flush if any memtable contains range deletions overlapping the to-be-ingested file
- Place to-be-ingested file before any level that contains range deletions overlapping it.
- Added support for `Version` to return iterators over range deletions in a given level. Previously, we piggybacked getting range deletions onto `Version`'s `Get()` / `AddIterator()` functions by passing them a `RangeDelAggregator*`. But file ingestion needs to get iterators over range deletions, not populate an aggregator (since the aggregator does collapsing and doesn't expose the actual ranges).
Closes https://github.com/facebook/rocksdb/pull/2370

Differential Revision: D5127648

Pulled By: ajkr

fbshipit-source-id: 816faeb9708adfa5287962bafdde717db56e3f1a
2017-05-31 13:57:19 -07:00
Yi Wu
ad19eb8686 Fixing blob db sequence number handling
Summary:
Blob db rely on base db returning sequence number through write batch after DB::Write(). However after recent changes to the write path, DB::Writ()e no longer return sequence number in some cases. Fixing it by have WriteBatchInternal::InsertInto() always encode sequence number into write batch.

Stacking on #2375.
Closes https://github.com/facebook/rocksdb/pull/2385

Differential Revision: D5148358

Pulled By: yiwu-arbug

fbshipit-source-id: 8bda0aa07b9334ed03ed381548b39d167dc20c33
2017-05-31 10:56:45 -07:00
Siying Dong
51ac91f586 Histogram of number of merge operands
Summary:
Add a histogram in statistics to help users understand how many merge operands they merge.
Closes https://github.com/facebook/rocksdb/pull/2373

Differential Revision: D5139983

Pulled By: siying

fbshipit-source-id: 61b9ba8ca83f358530a4833d68f0103b56a0e182
2017-05-31 07:41:44 -07:00
Tamir Duberstein
103d0692ea Avoid unsupported attributes when not building with UBSAN
Summary:
yiwu-arbug see individual commits.
Closes https://github.com/facebook/rocksdb/pull/2318

Differential Revision: D5141520

Pulled By: yiwu-arbug

fbshipit-source-id: 7987c92ab4461eef36afce5a133d3a0ee0c96300
2017-05-30 11:13:01 -07:00
赵星宇
d03c34497c update comment of GetNextFile
Summary: Closes https://github.com/facebook/rocksdb/pull/2377

Differential Revision: D5141274

Pulled By: lightmark

fbshipit-source-id: c237a285b73ad93488c080ea80c71a29a17f1be0
2017-05-26 15:12:13 -07:00
Aaron Gao
f7bb1a0060 support merge and delete in file ingestion
Summary:
Previously sst_file_writer only supports kTypeValue, we need kTypeMerge and kTypeDeletion also as user requested.
Closes https://github.com/facebook/rocksdb/pull/2361

Differential Revision: D5139402

Pulled By: lightmark

fbshipit-source-id: 092a60756d01692539d817a3765ebfd58a8d7f88
2017-05-26 12:11:21 -07:00
Sagar Vemuri
7bb1f5d483 Increase of compaction threads should be logged at info level instead of a warning
Summary:
This log message shouldn't be a warning; some services are seeing high warning count due to this.

The count for the below line is a few hundreds of millions, as per Logview:
```
[rocksdb/src/db/column_family.cc:729] [checkpoints] Increasing compaction threads because we have 2 level-0 files
```
Closes https://github.com/facebook/rocksdb/pull/2364

Differential Revision: D5123565

Pulled By: sagar0

fbshipit-source-id: a07ce499a4f82f0ebde9cda9f4948fb9df6a734c
2017-05-26 09:56:13 -07:00
Andrew Kryczka
a99fb9928f fix column_family_test asan
Summary:
stop calling Close() at the end of tests holding a compaction pressure token since it causes the write controller to be deleted while it's still needed. these calls were pointless anyways since Close() is already called in the test's destructor.
Closes https://github.com/facebook/rocksdb/pull/2367

Differential Revision: D5125906

Pulled By: ajkr

fbshipit-source-id: 6cad8673e5546a82ff602ac0ba59cc3f68dbde46
2017-05-24 16:41:51 -07:00
Andrew Kryczka
bb01c1880c Introduce max_background_jobs mutable option
Summary:
- `max_background_flushes` and `max_background_compactions` are still supported for backwards compatibility
- `base_background_compactions` is completely deprecated. Now we just throttle to one background compaction when there's no pressure.
- `max_background_jobs` is added to automatically partition the concurrent background jobs into flushes vs compactions. Currently it's very simple as we just allocate one-fourth of the jobs to flushes, and the remaining can be used for compactions.
- The test cases that set `base_background_compactions > 1` needed to be updated. I just grab the pressure token such that the desired number of compactions can be scheduled.
Closes https://github.com/facebook/rocksdb/pull/2205

Differential Revision: D4937461

Pulled By: ajkr

fbshipit-source-id: df52cbbd497e13bbc9a60560a5ac2a2526b3f1f9
2017-05-24 11:29:08 -07:00
Siying Dong
41cbb72749 options.delayed_write_rate use the rate of rate_limiter by default.
Summary:
It's hard for RocksDB to come up with a good default of delayed write rate. Use rate given by rate limiter if it is availalbe. This provides the I/O order of magnitude.
Closes https://github.com/facebook/rocksdb/pull/2357

Differential Revision: D5115324

Pulled By: siying

fbshipit-source-id: 341065ad2211c981fc804011c0f0e59a50c7e754
2017-05-24 09:58:24 -07:00
Igor Canadi
52d9e5f7b6 Fix column family seconds_up accounting
Summary:
`cf_stats_snapshot_.seconds_up` appears to be never updated, unlike `db_stats_snapshot_.seconds_up`, which is updated here: https://github.com/facebook/rocksdb/blob/master/db/internal_stats.cc#L883

This leads to wrong information in the log, for example:
```
** Compaction Stats [default] **

....

Uptime(secs): 85591.2 total, 85591.2 interval
```

Even though DB's interval is correctly logged as 60 seconds:
```
** DB Stats **
Uptime(secs): 85591.2 total, 637.8 interval
```
Closes https://github.com/facebook/rocksdb/pull/2338

Differential Revision: D5114131

Pulled By: sagar0

fbshipit-source-id: 85243a38213236ccbb601a7f7aaa8865eaa8083c
2017-05-23 17:14:04 -07:00
Sagar Vemuri
7d8207f1f2 Fix errors in clang-analyzer builds
Summary:
Fix build error in db_iter.cc when running clang-analyzer.
```
  CC       db/db_iter.o
db/db_iter.cc:938:21: error: no matching constructor for initialization of 'rocksdb::ParsedInternalKey'
  ParsedInternalKey ikey(Slice(), 0, 0);
                    ^    ~~~~~~~~~~~~~
./db/dbformat.h:84:3: note: candidate constructor not viable: no known conversion from 'int' to 'rocksdb::ValueType' for 3rd argument
  ParsedInternalKey(const Slice& u, const SequenceNumber& seq, ValueType t)
  ^
./db/dbformat.h:78:8: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 3 were provided
struct ParsedInternalKey {
       ^
./db/dbformat.h:78:8: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 3 were provided
./db/dbformat.h:83:3: note: candidate constructor not viable: requires 0 arguments, but 3 were provided
  ParsedInternalKey() { }  // Intentionally left uninitialized (for speed)
  ^
1 error generated.
```
Closes https://github.com/facebook/rocksdb/pull/2354

Differential Revision: D5115751

Pulled By: sagar0

fbshipit-source-id: b0e386d4e935e4725b07761c3ca5f7a8cbde3692
2017-05-23 15:11:42 -07:00
Andrew Kryczka
6cc9aef162 New API for background work in single thread pool
Summary:
Previously users could set `max_background_flushes=0` to force rocksdb to use a single thread pool for both background flushes and compactions. That'll no longer be possible since I'm going to deprecate `max_background_flushes` and `max_background_compactions` in favor of a single option. This diff introduces a new way to force a single thread pool: when high-pri pool has zero threads, all background jobs will be submitted to low-pri pool.

Note the majority of the code change is adding `Env::GetBackgroundThreads()`, which is necessary to check whether the user has provided a zero-sized thread pool.
Closes https://github.com/facebook/rocksdb/pull/2204

Differential Revision: D4936256

Pulled By: ajkr

fbshipit-source-id: 929a07a0c0705f7766f5339cd013ff74e90d6e01
2017-05-23 11:12:27 -07:00
Yi Wu
9d0a07ed52 Fix rocksdb.estimate-num-keys DB property underflow
Summary:
rocksdb.estimate-num-keys is compute from `estimate_num_keys - 2 * estimate_num_deletes`. If  `2 * estimate_num_deletes > estimate_num_keys` it will underflow. Fixing it.
Closes https://github.com/facebook/rocksdb/pull/2348

Differential Revision: D5109272

Pulled By: yiwu-arbug

fbshipit-source-id: e1bfb91346a59b7282a282b615002507e9d7c246
2017-05-23 10:42:59 -07:00
Aaron Gao
3e86c0f07c disable direct reads for log and manifest and add direct io to tests
Summary:
Disable direct reads for log and manifest. Direct reads should not affect sequential_file
Also add kDirectIO for option_config_ in db_test_util
Closes https://github.com/facebook/rocksdb/pull/2337

Differential Revision: D5100261

Pulled By: lightmark

fbshipit-source-id: 0ebfd13b93fa1b8f9acae514ac44f8125a05868b
2017-05-22 18:41:28 -07:00
Sagar Vemuri
228f49d20a Fix data races caught by tsan
Summary:
This fixes the tsan build failures in:
- write_callback_test
- persistent_cache_test.*
Closes https://github.com/facebook/rocksdb/pull/2339

Differential Revision: D5101190

Pulled By: sagar0

fbshipit-source-id: 537e19ed05272b1f34cfbf793aa822b2264a1643
2017-05-22 10:27:23 -07:00
Yi Wu
07bdcb91fe New WriteImpl to pipeline WAL/memtable write
Summary:
PipelineWriteImpl is an alternative approach to WriteImpl. In WriteImpl, only one thread is allow to write at the same time. This thread will do both WAL and memtable writes for all write threads in the write group. Pending writers wait in queue until the current writer finishes. In the pipeline write approach, two queue is maintained: one WAL writer queue and one memtable writer queue. All writers (regardless of whether they need to write WAL) will still need to first join the WAL writer queue, and after the house keeping work and WAL writing, they will need to join memtable writer queue if needed. The benefit of this approach is that
1. Writers without memtable writes (e.g. the prepare phase of two phase commit) can exit write thread once WAL write is finish. They don't need to wait for memtable writes in case of group commit.
2. Pending writers only need to wait for previous WAL writer finish to be able to join the write thread, instead of wait also for previous memtable writes.

Merging #2056 and #2058 into this PR.
Closes https://github.com/facebook/rocksdb/pull/2286

Differential Revision: D5054606

Pulled By: yiwu-arbug

fbshipit-source-id: ee5b11efd19d3e39d6b7210937b11cefdd4d1c8d
2017-05-19 14:26:42 -07:00
Yi Wu
d746aead1a Suppress clang-analyzer false positive
Summary:
Fixing two types of clang-analyzer false positives:
* db is deleted and then reopen, and clang-analyzer thinks we are reusing the pointer after it has been deleted. Adding asserts to hint clang-analyzer the pointer is recreated.
* ParsedInternalKey is (intentionally) uninitialized. Initialize the struct only when clang-analyzer is running.
Closes https://github.com/facebook/rocksdb/pull/2334

Differential Revision: D5093801

Pulled By: yiwu-arbug

fbshipit-source-id: f51355382098eb3da5ab9f64e094c6d03e6bdf7d
2017-05-19 10:56:28 -07:00
Siying Dong
217b866f47 column_family_test: EnvCounter::num_new_writable_file_ to be atomic
Summary:
TSAN shows warning of data race of EnvCounter::num_new_writable_file_. Make it atomic.
Closes https://github.com/facebook/rocksdb/pull/2331

Differential Revision: D5089215

Pulled By: siying

fbshipit-source-id: 15f6dcfb770a3310cbb6337c22482c8b330daffc
2017-05-18 13:56:12 -07:00
yizhu.sun
f5ba131bf8 Fixed some spelling mistakes
Summary: Closes https://github.com/facebook/rocksdb/pull/2314

Differential Revision: D5079601

Pulled By: sagar0

fbshipit-source-id: ae5696fd735718f544435c64c3179c49b8c04349
2017-05-17 23:12:36 -07:00
hyunwoo
0ebdd70579 fixed typo
Summary:
fixed typo
Closes https://github.com/facebook/rocksdb/pull/2312

Differential Revision: D5079631

Pulled By: sagar0

fbshipit-source-id: e4c8d1d89b244ee69e9dea1dd013227cc5241026
2017-05-17 16:41:49 -07:00
Mikhail Antonov
ba685a472a Support ingest_behind for IngestExternalFile
Summary:
First cut for early review; there are few conceptual points to answer and some code structure issues.

For conceptual points -

 - restriction-wise, we're going to disallow ingest_behind if (use_seqno_zero_out=true || disable_auto_compaction=false), the user is responsible to properly open and close DB with required params
 - we wanted to ingest into reserved bottom most level. Should we fail fast if bottom level isn't empty, or should we attempt to ingest if file fits there key-ranges-wise?
 - Modifying AssignLevelForIngestedFile seems the place we we'd handle that.

On code structure - going to refactor GenerateAndAddExternalFile call in the test class to allow passing instance of IngestionOptions, that's just going to incur lots of changes at callsites.
Closes https://github.com/facebook/rocksdb/pull/2144

Differential Revision: D4873732

Pulled By: lightmark

fbshipit-source-id: 81cb698106b68ef8797f564453651d50900e153a
2017-05-17 11:42:42 -07:00
boolean5
cb9392a094 add Transactions and Checkpoint to C API
Summary:
I've added functions to the C API to support Transactions as requested in #1637 and to support Checkpoint.

I have also added the corresponding tests to c_test.c

For now, the following is omitted:

1. Optimistic Transactions
2. The column family variation of functions
Closes https://github.com/facebook/rocksdb/pull/2236

Differential Revision: D4989510

Pulled By: yiwu-arbug

fbshipit-source-id: 518cb39f76d5e9ec9690d633fcdc014b98958071
2017-05-16 22:59:43 -07:00
hyunwoo
f720796e24 fixed typo
Summary:
fixed exisitng -> existing
Closes https://github.com/facebook/rocksdb/pull/2305

Differential Revision: D5070169

Pulled By: yiwu-arbug

fbshipit-source-id: 8c8450acf50757b767cf78b78314018395738d96
2017-05-16 11:07:58 -07:00
siddontang
1ca723dbd1 C API: support pinnable get
Summary: Closes https://github.com/facebook/rocksdb/pull/2254

Differential Revision: D5053590

Pulled By: yiwu-arbug

fbshipit-source-id: 2f365a031b3a2947b4fba21d26d4f8f52af9b9f0
2017-05-16 11:07:58 -07:00
赵星宇
4f9e69ccf4 fix log err
Summary: Closes https://github.com/facebook/rocksdb/pull/2206

Differential Revision: D5054222

Pulled By: yiwu-arbug

fbshipit-source-id: d8742bda1bf3e76d7b68eeb86df4608031b5cbc8
2017-05-15 16:15:38 -07:00
Yi Wu
3907c94ffb Fix ColumnFamilyTest:BulkAddDrop
Summary:
Fix ColumnFamilyTest:BulkAddDrop not deleted CF handles at the end, causing ASAN failure.
Closes https://github.com/facebook/rocksdb/pull/2275

Differential Revision: D5040724

Pulled By: yiwu-arbug

fbshipit-source-id: 86cd4070c944d01173a3cc36462bb800698af192
2017-05-10 23:05:44 -07:00