350 Commits

Author SHA1 Message Date
sdong
d72b31774e Slowdown when writing to the last write buffer
Summary: Now if inserting to mem table is much faster than writing to files, there is no mechanism users can rely on to avoid stopping for reaching options.max_write_buffer_number. With the commit, if there are more than four maximum write buffers configured, we slow down to the rate of options.delayed_write_rate while we reach the last one.

Test Plan:
1. Add a new unit test.
2. Run db_bench with

./db_bench --benchmarks=fillrandom --num=10000000 --max_background_flushes=6 --batch_size=32 -max_write_buffer_number=4 --delayed_write_rate=500000 --statistics

based on hard drive and see stopping is avoided with the commit.

Reviewers: yhchiang, IslamAbdelRahman, anthony, rven, kradhakrishnan, igor

Reviewed By: igor

Subscribers: MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52047
2015-12-17 10:49:08 -08:00
Islam AbdelRahman
aececc209e Introduce ReadOptions::pin_data (support zero copy for keys)
Summary:
This patch update the Iterator API to introduce new functions that allow users to keep the Slices returned by key() valid as long as the Iterator is not deleted

ReadOptions::pin_data : If true keep loaded blocks in memory as long as the iterator is not deleted
Iterator::IsKeyPinned() : If true, this mean that the Slice returned by key() is valid as long as the iterator is not deleted

Also add a new option BlockBasedTableOptions::use_delta_encoding to allow users to disable delta_encoding if needed.

Benchmark results (using https://phabricator.fb.com/P20083553)

```
// $ du -h /home/tec/local/normal.4K.Snappy/db10077
// 6.1G    /home/tec/local/normal.4K.Snappy/db10077

// $ du -h /home/tec/local/zero.8K.LZ4/db10077
// 6.4G    /home/tec/local/zero.8K.LZ4/db10077

// Benchmarks for shard db10077
// _build/opt/rocks/benchmark/rocks_copy_benchmark \
//      --normal_db_path="/home/tec/local/normal.4K.Snappy/db10077" \
//      --zero_db_path="/home/tec/local/zero.8K.LZ4/db10077"

// First run
// ============================================================================
// rocks/benchmark/RocksCopyBenchmark.cpp          relative  time/iter  iters/s
// ============================================================================
// BM_StringCopy                                                 1.73s  576.97m
// BM_StringPiece                                   103.74%      1.67s  598.55m
// ============================================================================
// Match rate : 1000000 / 1000000

// Second run
// ============================================================================
// rocks/benchmark/RocksCopyBenchmark.cpp          relative  time/iter  iters/s
// ============================================================================
// BM_StringCopy                                              611.99ms     1.63
// BM_StringPiece                                   203.76%   300.35ms     3.33
// ============================================================================
// Match rate : 1000000 / 1000000
```

Test Plan: Unit tests

Reviewers: sdong, igor, anthony, yhchiang, rven

Reviewed By: rven

Subscribers: dhruba, lovro, adsharma

Differential Revision: https://reviews.facebook.net/D48999
2015-12-16 12:08:30 -08:00
Venkatesh Radhakrishnan
030215bf01 Running manual compactions in parallel with other automatic or manual compactions in restricted cases
Summary:
This diff provides a framework for doing manual
compactions in parallel with other compactions. We now have a deque of manual compactions. We also pass manual compactions as an argument from RunManualCompactions down to
BackgroundCompactions, so that RunManualCompactions can be reentrant.
Parallelism is controlled by the two routines
ConflictingManualCompaction to allow/disallow new parallel/manual
compactions based on already existing ManualCompactions. In this diff, by default manual compactions still have to run exclusive of other compactions. However, by setting the compaction option, exclusive_manual_compaction to false, it is possible to run other compactions in parallel with a manual compaction. However, we are still restricted to one manual compaction per column family at a time. All of these restrictions will be relaxed in future diffs.
I will be adding more tests later.

Test Plan: Rocksdb regression + new tests + valgrind

Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong

Reviewed By: sdong

Subscribers: yoshinorim, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47973
2015-12-14 11:20:34 -08:00
agiardullo
3bfd3d39a3 Use SST files for Transaction conflict detection
Summary:
Currently, transactions can fail even if there is no actual write conflict.  This is due to relying on only the memtables to check for write-conflicts.  Users have to tune memtable settings to try to avoid this, but it's hard to figure out exactly how to tune these settings.

With this diff, TransactionDB will use both memtables and SST files to determine if there are any write conflicts.  This relies on the fact that BlockBasedTable stores sequence numbers for all writes that happen after any open snapshot.  Also, D50295 is needed to prevent SingleDelete from disappearing writes (the TODOs in this test code will be fixed once the other diff is approved and merged).

Note that Optimistic transactions will still rely on tuning memtable settings as we do not want to read from SST while on the write thread.  Also, memtable settings can still be used to reduce how often TransactionDB needs to read SST files.

Test Plan: unit tests, db bench

Reviewers: rven, yhchiang, kradhakrishnan, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb, yoshinorim

Differential Revision: https://reviews.facebook.net/D50475
2015-12-11 12:34:11 -08:00
charsyam
c30b499541 fix typos in comments 2015-12-11 01:54:48 +09:00
sdong
56e77f0967 Deprecate options.soft_rate_limit and add options.soft_pending_compaction_bytes_limit
Summary: Deprecate options.soft_rate_limit, which is hard to tune, with options.soft_pending_compaction_bytes_limit, which would trigger the slowdown if estimated pending compaction bytes exceeds the threshold. The hope is to make it more striaght-forward to tune.

Test Plan: Modify DBTest.SoftLimit to cover options.soft_pending_compaction_bytes_limit instead; run all unit tests.

Reviewers: IslamAbdelRahman, yhchiang, rven, kradhakrishnan, igor, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51117
2015-12-09 18:22:45 -08:00
sdong
d6e1035a1f A new compaction picking priority that optimizes for write amplification for random updates.
Summary: Introduce a compaction picking priority that picks files who contains the oldest rows to compact. This is a mode that slightly improves write amplification for random update cases.

Test Plan: Add a unit test and run it in valgrind too.

Reviewers: yhchiang, anthony, IslamAbdelRahman, rven, kradhakrishnan, MarkCallaghan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51459
2015-12-09 18:13:03 -08:00
Islam AbdelRahman
a163cc2d5a Lint everything
Summary:
```
arc2 lint --everything
```

run the linter on the whole code repo to fix exisitng lint issues

Test Plan: make check -j64

Reviewers: sdong, rven, anthony, kradhakrishnan, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D50769
2015-11-16 12:56:21 -08:00
Yueh-Hsuan Chiang
e114f0abb8 Enable RocksDB to persist Options file.
Summary:
This patch allows rocksdb to persist options into a file on
DB::Open, SetOptions, and Create / Drop ColumnFamily.
Options files are created under the same directory as the rocksdb
instance.

In addition, this patch also adds a fail_if_missing_options_file in DBOptions
that makes any function call return non-ok status when it is not able to
persist options properly.

  // If true, then DB::Open / CreateColumnFamily / DropColumnFamily
  // / SetOptions will fail if options file is not detected or properly
  // persisted.
  //
  // DEFAULT: false
  bool fail_if_missing_options_file;

Options file names are formatted as OPTIONS-<number>, and RocksDB
will always keep the latest two options files.

Test Plan:
Add options_file_test.

options_test
column_family_test

Reviewers: igor, IslamAbdelRahman, sdong, anthony

Reviewed By: anthony

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48285
2015-11-10 22:58:01 -08:00
Venkatesh Radhakrishnan
9d50afc3b9 Prefix-based iterating only shows keys in prefix
Summary:
MyRocks testing found an issue that while iterating over keys
that are outside the prefix, sometimes wrong results were seen for keys
outside the prefix. We now tighten the range of keys seen with a new
read option called prefix_seen_at_start. This remembers the starting
prefix and then compares it on a Next for equality of prefix. If they
are from a different prefix, it sets valid to false.

Test Plan: PrefixTest.PrefixValid

Reviewers: IslamAbdelRahman, sdong, yhchiang, anthony

Reviewed By: anthony

Subscribers: spetrunia, hermanlee4, yoshinorim, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D50211
2015-11-05 13:24:05 -08:00
SherlockNoMad
ccc8c10c0c Move skip_table_builder_flush to BlockBasedTableOption 2015-10-30 18:33:01 -07:00
SherlockNoMad
a6dd0831d5 Add Option to Skip Flushing in TableBuilder 2015-10-29 22:10:25 -07:00
Islam AbdelRahman
2872e0c8c2 Clean and expose CreateLoggerFromOptions
Summary:
CreateLoggerFromOptions have some parameters like  db_log_dir and env, these parameters are redundant since they already exist in DBOptions

this patch remove the redundant parameters and expose CreateLoggerFromOptions to users

Test Plan: make check

Reviewers: igor, anthony, yhchiang, rven, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, hermanlee4

Differential Revision: https://reviews.facebook.net/D49713
2015-10-29 18:07:37 -07:00
sdong
296c3a1f94 "make format" in some recent commits
Summary: Run "make format" for some recent commits.

Test Plan: Build and run tests

Reviewers: IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D49707
2015-10-29 17:11:14 -07:00
Siying Dong
6388e7f4e2 Merge pull request #798 from yuslepukhin/readahead_buffermanagement
Implement smart buffer management in Windows Env.
2015-10-29 15:02:59 -07:00
Dmitri Smirnov
6fbc4f9f3e Implement smart buffer management.
introduce a new DBOption random_access_max_buffer_size to limit
  the size of the random access buffer used for unbuffered access.
  Implement read ahead buffering when enabled.
  To that effect propagate compaction_readahead_size and the new option
  to the env options to make it available for the implementation.
  Add Hint() override so SetupForCompaction() call would call Hint()
  readahead can now be setup from both Hint() and EnableReadAhead()
  Add new option random_access_max_buffer_size support
  db_bench, options_helper to make it string parsable
  and the unit test.
2015-10-27 14:44:16 -07:00
Praveen Rao
4ce117c4d5 Merge branch 'master' into wal_filter 2015-10-26 19:03:34 -07:00
Praveen Rao
0c59691dde Handle multiple batches in single log record - allow app to return a new batch + allow app to return corrupted record status 2015-10-19 13:27:40 -07:00
Sage Weil
543c12ab06 options: add recycle_log_file_num option
Signed-off-by: Sage Weil <sage@redhat.com>
2015-10-18 21:21:24 -04:00
Praveen Rao
cc4d13e0a8 Put wal_filter under #ifndef ROCKSDB_LITE 2015-10-13 11:10:14 -07:00
Praveen Rao
eb24178553 merge from master 2015-10-12 17:24:21 -07:00
Praveen Rao
59a0c219bb Adding log filter to inspect and filter log records on recovery 2015-10-12 17:03:03 -07:00
Igor Canadi
9803e0d813 compaction_filter.h cleanup
Summary:
Two changes:
1. remove *V2 filter stuff. we deprecated that a while ago
2. clarify what happens when user sets max_subcompactions to bigger than 1

Test Plan: none

Reviewers: yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47871
2015-10-08 09:32:50 -07:00
Lakshmi Narayanan
4049bcde39 Added boolean variable to guard fallocate() calls
Summary:
Added boolean variable to guard fallocate() calls.
Set to false to prevent space leaks when tests fail.

Test Plan:
Compliles
Set to false and ran log device tests

Reviewers: sdong, lovro, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48027
2015-10-07 10:04:05 -07:00
sdong
f1b9f804e9 Add a mode to always pick the oldest file to compact for each level
Summary:
Add options.compaction_pri, which specifies the policy about which file to compact first.
kCompactionPriByLargestSeq will compact oldest files first.
Verified the behavior in db_bench but did not write unit tests yet. Also need to make it settable through option string and dynamically changeable.

Test Plan: Will write unit tests

Reviewers: igor, rven, anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, MarkCallaghan

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D45951
2015-09-21 17:21:59 -07:00
jsteemann
f8b770a942 fixed typos 2015-09-17 19:58:09 +02:00
Ari Ekmekji
2b683d4972 Add DBOption.max_subcompaction to option dump
Summary:
RocksDB options can be dumped to the log file, and
up to this point the max_subcompactions option was not included
in this dump. This fixes that.

Test Plan: makek all && make check

Reviewers: MarkCallaghan, igor, noetzli, anthony, yhchiang, sdong

Reviewed By: yhchiang, sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D46971
2015-09-15 11:41:46 -07:00
sdong
5de807ac16 Add options.hard_pending_compaction_bytes_limit to stop writes if compaction lagging behind
Summary: Add an option to stop writes if compaction lefts behind. If estimated pending compaction bytes is more than threshold specified by options.hard_pending_compaction_bytes_liimt, writes will stop until compactions are cleared to under the threshold.

Test Plan: Add unit test DBTest.HardLimit

Reviewers: rven, kradhakrishnan, anthony, IslamAbdelRahman, yhchiang, igor

Reviewed By: igor

Subscribers: MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D45999
2015-09-14 12:51:16 -07:00
Ari Ekmekji
3c37b3cccd Determine boundaries of subcompactions
Summary:
Up to this point, the subcompactions that make up a compaction
job have been divided based on the key range of the L1 files, and each
subcompaction has handled the key range of only one file. However
DBOption.max_subcompactions allows the user to designate how many
subcompactions at most to perform. This patch updates the
CompactionJob::GetSubcompactionBoundaries() to determine these
divisions accordingly based on that option and other input/system factors.

The current approach orders the starting and/or ending keys of certain
compaction input files and then generates a histogram to approximate the
size covered by the key range between each consecutive pair of keys. Then
it groups these ranges into groups so that the sizes are approximately equal
to one another. The approach has also been adapted to work for universal
compaction as well instead of just for level-based compaction as it was before.

These subcompactions are then executed in parallel by locally spawning
threads, one for each. The results are then aggregated and the compaction
completed.

Test Plan: make all && make check

Reviewers: yhchiang, anthony, igor, noetzli, sdong

Reviewed By: sdong

Subscribers: MarkCallaghan, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43269
2015-09-10 13:50:00 -07:00
Igor Canadi
ac9bcb55ce Set max_open_files based on ulimit
Summary: We should never set max_open_files to be bigger than the system's ulimit. Otherwise we will get "Too many open files" errors. See an example in this Travis run: https://travis-ci.org/facebook/rocksdb/jobs/79591566

Test Plan:
make check

I will also verify that max_max_open_files is reasonable.

Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46551
2015-09-10 10:49:28 -07:00
agiardullo
44b6e99e1b update max_write_buffer_number_to_maintain docblock
Summary: fix comment

Test Plan: n/a

Reviewers: sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46455
2015-09-09 13:36:38 -07:00
agiardullo
b5b2b75e52 better tuning of arena block size
Summary: Currently, if users didn't set options.arena_block_size, we set "result.arena_block_size = result.write_buffer_size / 10". It makes result.arena_block_size not a multiplier of 4KB, even if options.write_buffer_size is a multiplier of MBs. When calling malloc to arena_block_size, we may waste a small amount of memory for it. We now make the default to be /8 or /16 and align it to 4KB.

Test Plan: unit tests

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46467
2015-09-08 20:53:32 -07:00
sdong
7a0dbdf3ac Add ZSTD (not final format) compression type
Summary: Add ZSTD compression type. The same way as adding LZ4.

Test Plan: run all tests. Generate files in db_bench. Make sure reads succeed. But the SST files cannot be opened in older versions. Also some other adhoc tests.

Reviewers: rven, anthony, IslamAbdelRahman, kradhakrishnan, igor

Reviewed By: igor

Subscribers: MarkCallaghan, maykov, yoshinorim, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D45747
2015-08-28 11:01:13 -07:00
Igor Canadi
5f4166c90e ReadaheadRandomAccessFile -- userspace readahead
Summary:
ReadaheadRandomAccessFile acts as a transparent layer on top of RandomAccessFile. When a Read() request is issued, it issues a much bigger request to the OS and caches the result. When a new request comes in and we already have the data cached, it doesn't have to issue any requests to the OS.

We add ReadaheadRandomAccessFile layer only when file is read during compactions.

D45105 was incorrectly closed by Phabricator because I committed it to a separate branch (not master), so I'm resubmitting the diff.

Test Plan: make check

Reviewers: MarkCallaghan, sdong

Reviewed By: sdong

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D45123
2015-08-26 15:25:59 -07:00
Ari Ekmekji
b6def58f73 Changed 'num_subcompactions' to the more accurate 'max_subcompactions'
Summary:
Up until this point we had DbOptions.num_subcompactions, but
it is semantically more correct to call this max_subcompactions since
we will schedule *up to* DbOptions.max_subcompactions smaller compactions
at a time during a compaction job.

I also added a --subcompactions option to db_bench

Test Plan: make all   make check

Reviewers: sdong, igor, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D45069
2015-08-21 14:25:34 -07:00
sdong
9130873a13 Add options.new_table_reader_for_compaction_inputs
Summary: Currently compaction inputs share the same file descriptor and table reader as other foreground threads. It makes fadvise works less predictable. Add options.new_table_reader_for_compaction_inputs to enforce to create a new file descriptor and new table reader for it.

Test Plan: Add the option.

Reviewers: rven, anthony, kradhakrishnan, IslamAbdelRahman, igor, yhchiang

Reviewed By: igor

Subscribers: igor, MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D43311
2015-08-21 08:46:29 -07:00
sdong
603b6da8b8 Add options.compaction_measure_io_stats to print write I/O stats in compactions
Summary:
Add options.compaction_measure_io_stats to print out / pass to listener accumulated time spent on write calls. Example outputs in info logs:

2015/08/12-16:27:59.463944 7fd428bff700 (Original Log Time 2015/08/12-16:27:59.463922) EVENT_LOG_v1 {"time_micros": 1439422079463897, "job": 6, "event": "compaction_finished", "output_level": 1, "num_output_files": 4, "total_output_size": 6900525, "num_input_records": 111483, "num_output_records": 106877, "file_write_nanos": 15663206, "file_range_sync_nanos": 649588, "file_fsync_nanos": 349614797, "file_prepare_write_nanos": 1505812, "lsm_state": [2, 4, 0, 0, 0, 0, 0]}

Add two more counters in iostats_context.

Also add a parameter of db_bench.

Test Plan: Add a unit test. Also manually verify LOG outputs in db_bench

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D44115
2015-08-13 16:52:26 -07:00
Islam AbdelRahman
cee1e8a080 Parallelize LoadTableHandlers
Summary: Add a new option that all LoadTableHandlers to use multiple threads to load files on DB Open and Recover

Test Plan:
make check -j64
COMPILE_WITH_TSAN=1 make check -j64
DISABLE_JEMALLOC=1 make all valgrind_check -j64 (still running)

Reviewers: yhchiang, anthony, rven, kradhakrishnan, igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D43755
2015-08-11 12:19:56 -07:00
Mike Kolupaev
e06cf1a098 [wal changes 3/3] method in DB to sync WAL without blocking writers
Summary:
Subj. We really need this feature.

Previous diff D40899 has most of the changes to make this possible, this diff just adds the method.

Test Plan: `make check`, the new test fails without this diff; ran with ASAN, TSAN and valgrind.

Reviewers: igor, rven, IslamAbdelRahman, anthony, kradhakrishnan, tnovak, yhchiang, sdong

Reviewed By: sdong

Subscribers: MarkCallaghan, maykov, hermanlee4, yoshinorim, tnovak, dhruba

Differential Revision: https://reviews.facebook.net/D40905
2015-08-05 06:06:39 -07:00
Ari Ekmekji
5dc3e6881a Update Tests To Enable Subcompactions
Summary:
Updated DBTest DBCompactionTest and CompactionJobStatsTest
to run compaction-related tests once with subcompactions enabled and
once disabled using the TEST_P test type in the Google Test suite.

Test Plan: ./db_test  ./db_compaction-test  ./compaction_job_stats_test

Reviewers: sdong, igor, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43443
2015-08-04 22:19:07 -07:00
Islam AbdelRahman
c45a57b41e Support delete rate limiting
Summary:
Introduce DeleteScheduler that allow enforcing a rate limit on file deletion
Instead of deleting files immediately, files are moved to trash directory and deleted in a background thread that apply sleep penalty between deletes if needed.

I have updated PurgeObsoleteFiles and PurgeObsoleteWALFiles to use the delete_scheduler instead of env_->DeleteFile

Test Plan:
added delete_scheduler_test
existing unit tests

Reviewers: kradhakrishnan, anthony, rven, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D43221
2015-08-04 20:45:27 -07:00
Yueh-Hsuan Chiang
14d0bfa429 Add DBOptions::skip_sats_update_on_db_open
Summary:
UpdateAccumulatedStats() is used to optimize compaction decision
esp. when the number of deletion entries are high, but this function
can slowdown DBOpen esp. in disk environment.

This patch adds DBOptions::skip_sats_update_on_db_open, which skips
UpdateAccumulatedStats() in DB::Open() time when it's set to true.

Test Plan: Add DBCompactionTest.SkipStatsUpdateTest

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: tnovak, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42843
2015-08-04 13:48:16 -07:00
Ari Ekmekji
40c64434d4 Parallelize L0-L1 Compaction: Restructure Compaction Job
Summary:
As of now compactions involving files from Level 0 and Level 1 are single
threaded because the files in L0, although sorted, are not range partitioned like
the other levels. This means that during L0-L1 compaction each file from L1
needs to be merged with potentially all the files from L0.

This attempt to parallelize the L0-L1 compaction assigns a thread and a
corresponding iterator to each L1 file that then considers only the key range
found in that L1 file and only the L0 files that have those keys (and only the
specific portion of those L0 files in which those keys are found). In this way
the overlap is minimized and potentially eliminated between different iterators
focusing on the same files.

The first step is to restructure the compaction logic to break L0-L1 compactions
into multiple, smaller, sequential compactions. Eventually each of these smaller
jobs will be run simultaneously. Areas to pay extra attention to are

  # Correct aggregation of compaction job statistics across multiple threads
  # Proper opening/closing of output files (make sure each thread's is unique)
  # Keys that span multiple L1 files
  # Skewed distributions of keys within L0 files

Test Plan: Make and run db_test (newer version has separate compaction tests) and compaction_job_stats_test

Reviewers: igor, noetzli, anthony, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: MarkCallaghan, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42699
2015-08-03 11:32:14 -07:00
Siying Dong
3dbf4ba220 RangeSync not to sync last 1MB of the file
Summary:
From other ones' investigation:

"sync_file_range() behavior highly depends on kernel version and filesystem.

xfs does neighbor page flushing outside of the specified ranges. For example, sync_file_range(fd, 8192, 16384) does not only trigger flushing page #3 to #4, but also flushing many more dirty pages (i.e. up to page#16)... Ranges of the sync_file_range() should be far enough from write() offset (at least 1MB)."

Test Plan: make all check

Reviewers: igor, rven, kradhakrishnan, yhchiang, IslamAbdelRahman, anthony

Reviewed By: anthony

Subscribers: yoshinorim, MarkCallaghan, sumeet, domas, dhruba, leveldb, ljin

Differential Revision: https://reviews.facebook.net/D15807
2015-07-21 16:22:40 -07:00
Igor Canadi
35ca59364c Don't let flushes preempt compactions
Summary:
When we first started, max_background_flushes was 0 by default and compaction thread was executing flushes (since there was no flush thread). Then, we switched the default max_background_flushes to 1. However, we still support the case where there is no flush thread and flushes are done in compaction. This is making our code a bit more complicated. By not supporting this use-case we can make our code simpler.

We have a special case that when you set max_background_flushes to 0, we
schedule the flush to execute on the compaction thread.

Test Plan: make check (there might be some unit tests that depend on this behavior)

Reviewers: IslamAbdelRahman, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41931
2015-07-17 12:02:52 -07:00
Igor Canadi
a96fcd09b7 Deprecate CompactionFilterV2
Summary: It has been around for a while and it looks like it never found any uses in the wild. It's also complicating our compaction_job code quite a bit. We're deprecating it in 3.13, but will put it back in 3.14 if we actually find users that need this feature.

Test Plan: make check

Reviewers: noetzli, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42405
2015-07-17 18:59:11 +02:00
Igor Canadi
a9c5109515 Deprecate purge_redundant_kvs_while_flush
Summary: This option is guarding the feature implemented 2 and a half years ago: D8991. The feature was enabled by default back then and has been running without issues. There is no reason why any client would turn this feature off. I found no reference in fbcode.

Test Plan: none

Reviewers: sdong, yhchiang, anthony, dhruba

Reviewed By: dhruba

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42063
2015-07-14 13:07:02 +02:00
Igor Canadi
5aea98ddd8 Deprecate WriteOptions::timeout_hint_us
Summary:
In one of our recent meetings, we discussed deprecating features that are not being actively used. One of those features, at least within Facebook, is timeout_hint. The feature is really nicely implemented, but if nobody needs it, we should remove it from our code-base (until we get a valid use-case). Some arguments:
* Less code == better icache hit rate, smaller builds, simpler code
* The motivation for adding timeout_hint_us was to work-around RocksDB's stall issue. However, we're currently addressing the stall issue itself (see @sdong's recent work on stall write_rate), so we should never see sharp lock-ups in the future.
* Nobody is using the feature within Facebook's code-base. Googling for `timeout_hint_us` also doesn't yield any users.

Test Plan: make check

Reviewers: anthony, kradhakrishnan, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: sdong, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41937
2015-07-14 09:35:48 +02:00
Dmitri Smirnov
18285c1e2f Windows Port from Microsoft
Summary: Make RocksDb build and run on Windows to be functionally
 complete and performant. All existing test cases run with no
 regressions. Performance numbers are in the pull-request.

 Test plan: make all of the existing unit tests pass, obtain perf numbers.

 Co-authored-by: Praveen Rao praveensinghrao@outlook.com
 Co-authored-by: Sherlock Huang baihan.huang@gmail.com
 Co-authored-by: Alex Zinoviev alexander.zinoviev@me.com
 Co-authored-by: Dmitri Smirnov dmitrism@microsoft.com
2015-07-01 16:13:56 -07:00
Islam AbdelRahman
674b1181cf Bottommost level compaction option
Summary: Replace force_bottommost_level_compaction in CompactRangeOption with an option that allow the user to (always skip, always compact, compact if compaction filter is present) the bottommost level for level based compaction.

Test Plan: make check

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D40527
2015-06-23 13:32:40 -07:00