Commit Graph

1166 Commits

Author SHA1 Message Date
sdong
b9f77ba12b When slowdown is triggered, reduce the write rate
Summary: It's usually hard for users to set a value of options.delayed_write_rate. With this diff, after slowdown condition triggers, we greedily reduce write rate if estimated pending compaction bytes increase. If estimated compaction pending bytes drop, we increase the write rate.

Test Plan:
Add a unit test
Test with db_bench setting:
TEST_TMPDIR=/dev/shm/ ./db_bench --benchmarks=fillrandom -num=10000000 --soft_pending_compaction_bytes_limit=1000000000 --hard_pending_compaction_bytes_limit=3000000000 --delayed_write_rate=100000000

and make sure without the commit, write stop will happen, but with the commit, it will not happen.

Reviewers: igor, anthony, rven, yhchiang, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52131
2015-12-23 11:33:15 -08:00
Igor Canadi
8ac7fb8377 Merge pull request #863 from zhangyybuaa/fix_hdfs_error
Fix build error with hdfs
2015-12-22 09:27:51 +01:00
sdong
167fb919a5 ZSTD to use CompressionOptions.level
Summary: Now ZSTD hard code level 1. Change it to use the compression level setting.

Test Plan: Run it with hacked codes of sst_dump and show ZSTD compression sizes with different levels.

Reviewers: rven, anthony, yhchiang, kradhakrishnan, igor, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: yoshinorim, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52041
2015-12-16 16:58:04 -08:00
Islam AbdelRahman
aececc209e Introduce ReadOptions::pin_data (support zero copy for keys)
Summary:
This patch update the Iterator API to introduce new functions that allow users to keep the Slices returned by key() valid as long as the Iterator is not deleted

ReadOptions::pin_data : If true keep loaded blocks in memory as long as the iterator is not deleted
Iterator::IsKeyPinned() : If true, this mean that the Slice returned by key() is valid as long as the iterator is not deleted

Also add a new option BlockBasedTableOptions::use_delta_encoding to allow users to disable delta_encoding if needed.

Benchmark results (using https://phabricator.fb.com/P20083553)

```
// $ du -h /home/tec/local/normal.4K.Snappy/db10077
// 6.1G    /home/tec/local/normal.4K.Snappy/db10077

// $ du -h /home/tec/local/zero.8K.LZ4/db10077
// 6.4G    /home/tec/local/zero.8K.LZ4/db10077

// Benchmarks for shard db10077
// _build/opt/rocks/benchmark/rocks_copy_benchmark \
//      --normal_db_path="/home/tec/local/normal.4K.Snappy/db10077" \
//      --zero_db_path="/home/tec/local/zero.8K.LZ4/db10077"

// First run
// ============================================================================
// rocks/benchmark/RocksCopyBenchmark.cpp          relative  time/iter  iters/s
// ============================================================================
// BM_StringCopy                                                 1.73s  576.97m
// BM_StringPiece                                   103.74%      1.67s  598.55m
// ============================================================================
// Match rate : 1000000 / 1000000

// Second run
// ============================================================================
// rocks/benchmark/RocksCopyBenchmark.cpp          relative  time/iter  iters/s
// ============================================================================
// BM_StringCopy                                              611.99ms     1.63
// BM_StringPiece                                   203.76%   300.35ms     3.33
// ============================================================================
// Match rate : 1000000 / 1000000
```

Test Plan: Unit tests

Reviewers: sdong, igor, anthony, yhchiang, rven

Reviewed By: rven

Subscribers: dhruba, lovro, adsharma

Differential Revision: https://reviews.facebook.net/D48999
2015-12-16 12:08:30 -08:00
Venkatesh Radhakrishnan
030215bf01 Running manual compactions in parallel with other automatic or manual compactions in restricted cases
Summary:
This diff provides a framework for doing manual
compactions in parallel with other compactions. We now have a deque of manual compactions. We also pass manual compactions as an argument from RunManualCompactions down to
BackgroundCompactions, so that RunManualCompactions can be reentrant.
Parallelism is controlled by the two routines
ConflictingManualCompaction to allow/disallow new parallel/manual
compactions based on already existing ManualCompactions. In this diff, by default manual compactions still have to run exclusive of other compactions. However, by setting the compaction option, exclusive_manual_compaction to false, it is possible to run other compactions in parallel with a manual compaction. However, we are still restricted to one manual compaction per column family at a time. All of these restrictions will be relaxed in future diffs.
I will be adding more tests later.

Test Plan: Rocksdb regression + new tests + valgrind

Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong

Reviewed By: sdong

Subscribers: yoshinorim, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47973
2015-12-14 11:20:34 -08:00
Dmitri Smirnov
aca403d2b5 Fix another rebase problems. 2015-12-11 17:33:40 -08:00
Dmitri Smirnov
236fe21c92 Enable MS compiler warning c4244.
Mostly due to the fact that there are differences in sizes of int,long
  on 64 bit systems vs GNU.
2015-12-11 16:47:34 -08:00
Yueh-Hsuan Chiang
00d6edf6a0 Ensure the destruction order of PosixEnv and ThreadLocalPtr
Summary:
By default, RocksDB initializes the singletons of ThreadLocalPtr first, then initializes PosixEnv
via static initializer.  Destructor terminates objects in reverse order, so terminating PosixEnv
(calling pthread_mutex_lock), then ThreadLocal (calling pthread_mutex_destroy).

However, in certain case, application might initialize PosixEnv first, then ThreadLocalPtr.
This will cause core dump at the end of the program (eg. https://github.com/facebook/mysql-5.6/issues/122)

This patch fix this issue by ensuring the destruction order by moving the global static singletons
to function static singletons.  Since function static singletons are initialized when the function is first
called, this property allows us invoke to enforce the construction of the static PosixEnv and the
singletons of ThreadLocalPtr by calling the function where the ThreadLocalPtr singletons belongs
right before we initialize the static PosixEnv.

Test Plan: Verified in the MyRocks.

Reviewers: yoshinorim, IslamAbdelRahman, rven, kradhakrishnan, anthony, sdong, MarkCallaghan

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51789
2015-12-11 00:21:58 -08:00
charsyam
c30b499541 fix typos in comments 2015-12-11 01:54:48 +09:00
sdong
56e77f0967 Deprecate options.soft_rate_limit and add options.soft_pending_compaction_bytes_limit
Summary: Deprecate options.soft_rate_limit, which is hard to tune, with options.soft_pending_compaction_bytes_limit, which would trigger the slowdown if estimated pending compaction bytes exceeds the threshold. The hope is to make it more striaght-forward to tune.

Test Plan: Modify DBTest.SoftLimit to cover options.soft_pending_compaction_bytes_limit instead; run all unit tests.

Reviewers: IslamAbdelRahman, yhchiang, rven, kradhakrishnan, igor, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51117
2015-12-09 18:22:45 -08:00
sdong
d6e1035a1f A new compaction picking priority that optimizes for write amplification for random updates.
Summary: Introduce a compaction picking priority that picks files who contains the oldest rows to compact. This is a mode that slightly improves write amplification for random update cases.

Test Plan: Add a unit test and run it in valgrind too.

Reviewers: yhchiang, anthony, IslamAbdelRahman, rven, kradhakrishnan, MarkCallaghan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51459
2015-12-09 18:13:03 -08:00
yuslepukhin
49957f9a98 Prefer integer arithmetics
The code had conversion to double then casting to size_t
  and then casting uint32_t which caused compiler warning (VS15).
2015-12-09 14:06:23 -08:00
Siying Dong
9c227923c6 Merge pull request #788 from OpenChannelSSD/to_fb_master2
Move posix threads into a library
2015-12-08 18:06:38 -08:00
Siying Dong
fa3dbf203f Merge pull request #853 from Vaisman/enable_C4267_warning
Enable C4267 warning
2015-12-08 17:59:24 -08:00
Siying Dong
56bbecc316 Merge pull request #867 from SherlockNoMad/CacheFix
Replace malloc with new for LRU Cache Handle
2015-12-08 17:58:29 -08:00
Yueh-Hsuan Chiang
774b80e99e Resubmit the fix for a race condition in persisting options
Summary:
This patch fix a race condition in persisting options which will cause a crash when:

* Thread A obtain cf options and start to persist options based on that cf options.
* Thread B kicks in and finish DropColumnFamily and delete cf_handle.
* Thread A wakes up and tries to finish the persisting options and crashes.

Test Plan: Add a test in column_family_test that can reproduce the crash

Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51717
2015-12-08 17:01:02 -08:00
sdong
ea11923550 Upgrade to ZSTD 0.4.2
Summary: Change to call the new compression function.

Test Plan: build and run db_bench with the compression to make sure it compresses.

Reviewers: anthony, rven, kradhakrishnan, IslamAbdelRahman, igor, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51603
2015-12-08 16:33:26 -08:00
sdong
770dea9325 Fix occasional failure of DBTest.DynamicCompactionOptions
Summary: DBTest.DynamicCompactionOptions ocasionally fails during valgrind run. We sent a sleeping task to block compaction thread pool but we don't wait it to run.

Test Plan: Run the test multiple times in an environment which can cause failure.

Reviewers: rven, kradhakrishnan, igor, IslamAbdelRahman, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51687
2015-12-07 18:38:39 -08:00
sdong
f307036bde Revert "Fix a race condition in persisting options"
This reverts commit 2fa3ed5180. It breaks RocksDB lite build
2015-12-07 17:09:12 -08:00
Yueh-Hsuan Chiang
2fa3ed5180 Fix a race condition in persisting options
Summary:
This patch fix a race condition in persisting options which will cause a crash when:

* Thread A obtain cf options and start to persist options based on that cf options.
* Thread B kicks in and finish DropColumnFamily and delete cf_handle.
* Thread A wakes up and tries to finish the persisting options and crashes.

Test Plan: Add a test in column_family_test that can reproduce the crash

Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51609
2015-12-07 15:25:12 -08:00
Javier González
b2863017b1 Move posix threads into a library
Summary: This patch moves all posix thread logic to a separate library.
The motivation is to allow another environments to easily reuse posix
threads. HDFS wraps already posix threads; this split would simplify
this code.

Test Plan: No new functionality is added to posix Env or the threading
library, thus the current tests should suffice.
2015-12-07 12:03:38 +01:00
SherlockNoMad
3a98a7ae7f Replace malloc with new for LRU Cache Handle 2015-12-04 15:12:07 -08:00
Zhang Yangyang
4687ced5db fix ToString() not declared error 2015-12-02 21:45:28 +08:00
sdong
d27ea4c9e5 Initialize options.row_cache
Summary: options.row_cache should already been initialized as null by default. Still try to set it following current convention, because one valgrind failure reports a failure related to it.

Test Plan: Run all unit tests

Reviewers: yhchiang, kradhakrishnan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51303
2015-11-30 10:30:35 -08:00
Nathan Bronson
9a9d4759b2 InlineSkipList part 3/3 - new skiplist type that colocates key and node
Summary:
This diff completes the creation of InlineSkipList<Cmp>, which is like
SkipList<const char*, Cmp> but it always allocates the key contiguously
with the node.  This allows us to remove the pointer from the node
to the key.  As a result the memory usage of the skip list is reduced
(by 1 to sizeof(void*) bytes depending on the padding required to align
the key storage), cache locality is improved, and we halve the number
of calls to the allocator.

For skip lists whose keys are freshly-allocated const char*,
InlineSkipList is stricly preferrable to SkipList.  This diff doesn't
replace SkipList, however, because some of the use cases of SkipList in
RocksDB are either character sequences that are not allocated at the
same time as the skip list node allocation (for example
hash_linklist_rep) or have different key types (for example
write_batch_with_index).  Taking advantage of inline allocation for
those cases is left to future work.

The perf win is biggest for small values.  For single-threaded CPU-bound
(32M fillrandom operations with no WAL log) with 16 byte keys and 0 byte
values, the db_bench perf goes from ~310k ops/sec to ~410k ops/sec.  For
large values the improvement is less pronounced, but seems to be between
5% and 10% on the same configuration.

Test Plan: make check

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D51123
2015-11-24 15:16:02 -08:00
Vasili Svirski
41b32c6059 Enable C4267 warning
* conversion from 'size_t' to 'type', by add static_cast

Tested:
* by build solution on Windows, Linux locally,
* run tests
* build CI system successful
2015-11-24 16:33:09 +03:00
yuslepukhin
047bd22aae Build on Visual Studio 2015 Update 1 2015-11-20 15:31:47 -08:00
Dmitri Smirnov
89bacb7e7d Enable MS Warning C4804 : unsafe use of type 'bool' in operation 2015-11-18 16:23:19 -08:00
Islam AbdelRahman
4159ab8169 Merge pull request #839 from SherlockNoMad/memtableOption
Support Memtable Factory Parse in option_helper.cc
2015-11-17 17:09:49 -08:00
sdong
6170fec251 Fix build broken by previous commit of "option helper refactor"
Summary:
The commit of option helper refactor broken the build:
(1) a git merge problem
(2) some uncaught compiler warning
Fix it.

Test Plan: Make sure "make all" passes

Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D50943
2015-11-17 16:52:54 -08:00
Siying Dong
3a6643c2fd Merge pull request #805 from SherlockNoMad/OptionHelperFix
Option Helper Refactoring
2015-11-17 16:24:52 -08:00
SherlockNoMad
bd7be035e0 Support Memtable Factory Parse in option_helper.cc 2015-11-17 14:29:01 -08:00
Islam AbdelRahman
a163cc2d5a Lint everything
Summary:
```
arc2 lint --everything
```

run the linter on the whole code repo to fix exisitng lint issues

Test Plan: make check -j64

Reviewers: sdong, rven, anthony, kradhakrishnan, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D50769
2015-11-16 12:56:21 -08:00
Yueh-Hsuan Chiang
e11f676e34 Add OptionsUtil::LoadOptionsFromFile() API
Summary:
This patch adds OptionsUtil::LoadOptionsFromFile() and
OptionsUtil::LoadLatestOptionsFromDB(), which allow developers
to construct DBOptions and ColumnFamilyOptions from a RocksDB
options file.  Note that most pointer-typed options such as
merge_operator will not be constructed.

With this API, developers no longer need to remember all the
options in order to reopen an existing rocksdb instance like
the following:

  DBOptions db_options;
  std::vector<std::string> cf_names;
  std::vector<ColumnFamilyOptions> cf_opts;

  // Load primitive-typed options from an existing DB
  OptionsUtil::LoadLatestOptionsFromDB(
      dbname, &db_options, &cf_names, &cf_opts);

  // Initialize necessary pointer-typed options
  cf_opts[0].merge_operator.reset(new MyMergeOperator());
  ...

  // Construct the vector of ColumnFamilyDescriptor
  std::vector<ColumnFamilyDescriptor> cf_descs;
  for (size_t i = 0; i < cf_opts.size(); ++i) {
    cf_descs.emplace_back(cf_names[i], cf_opts[i]);
  }

  // Open the DB
  DB* db = nullptr;
  std::vector<ColumnFamilyHandle*> cf_handles;
  auto s = DB::Open(db_options, dbname, cf_descs,
                    &handles, &db);

Test Plan:
Augment existing tests in column_family_test
options_test
db_test

Reviewers: igor, IslamAbdelRahman, sdong, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D49095
2015-11-12 06:52:43 -08:00
Yueh-Hsuan Chiang
e114f0abb8 Enable RocksDB to persist Options file.
Summary:
This patch allows rocksdb to persist options into a file on
DB::Open, SetOptions, and Create / Drop ColumnFamily.
Options files are created under the same directory as the rocksdb
instance.

In addition, this patch also adds a fail_if_missing_options_file in DBOptions
that makes any function call return non-ok status when it is not able to
persist options properly.

  // If true, then DB::Open / CreateColumnFamily / DropColumnFamily
  // / SetOptions will fail if options file is not detected or properly
  // persisted.
  //
  // DEFAULT: false
  bool fail_if_missing_options_file;

Options file names are formatted as OPTIONS-<number>, and RocksDB
will always keep the latest two options files.

Test Plan:
Add options_file_test.

options_test
column_family_test

Reviewers: igor, IslamAbdelRahman, sdong, anthony

Reviewed By: anthony

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48285
2015-11-10 22:58:01 -08:00
Dmitri Smirnov
5270b33bd3 Make use of portable uint64_t type to make possible file access
in 64-bit.

  Currently, a signed off_t type is being used for the following
  interfaces for both offset and the length in bytes:
  * `Allocate`
  * `RangeSync`

  On Linux `off_t` is automatically either 32 or 64-bit depending on
  the platform. On Windows it is always a 32-bit signed long which
  limits file access and in particular space pre-allocation
  to effectively 2 Gb.

  Proposal is to replace off_t with uint64_t as a portable type
  always access files with 64-bit interfaces.

  May need to modify posix code but lack resources to test it.
2015-11-10 17:03:42 -08:00
Nathan Bronson
505accda38 remove constexpr from util/random.h for MSVC compat
Summary:
Scoped anonymous enums seem to be better supported than static
constexpr at the moment, so this diff replaces the latter with the former.
Also, this diff removes an incorrect inclusion of pthread.h.  MSVC build
was broken starting with D50439.

Test Plan:
1. build
2. observe proper skiplist behavior by absence of pathological slowdown
3. push diff to tmp_try_windows branch to tickle AppVeyor
4. wait for contbuild before committing to master

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D50517
2015-11-10 16:41:23 -08:00
Nathan Bronson
b81b430987 Switch to thread-local random for skiplist
Summary:
Using a TLS random instance for skiplist makes it smaller
(useful for hash_skiplist_rep) and prepares skiplist for concurrent
adds.  This diff also modifies the branching factor math to avoid an
unnecessary division.

This diff has the effect of changing the sequence of skip list node
height choices made by tests, so it has the potential to cause unit
test failures for tests that implicitly rely on the exact structure
of the skip list.  Tests that try to exactly trigger a compaction are
likely suspects for this problem (these tests have always been brittle to
changes in the skiplist details).  I've minimizes this risk by reseeding
the main thread's Random at the beginning of each test, increasing the
universal compaction size_ratio limit from 101% to 105% for some tests,
and verifying that the tests pass many times.

Test Plan: for i in `seq 0 9`; do make check; done

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D50439
2015-11-09 19:25:22 -08:00
Dmitri Smirnov
20f57b1715 Enable Windows warnings C4307 C4309 C4512 C4701
Enable C4307 'operator' : integral constant overflow
  Longs and ints on Windows are 32-bit hence the overflow
  Enable C4309 'conversion' : truncation of constant value
  Enable C4512 'class' : assignment operator could not be generated
  Enable C4701 Potentially uninitialized local variable 'name' used
2015-11-06 11:34:06 -08:00
Venkatesh Radhakrishnan
9d50afc3b9 Prefix-based iterating only shows keys in prefix
Summary:
MyRocks testing found an issue that while iterating over keys
that are outside the prefix, sometimes wrong results were seen for keys
outside the prefix. We now tighten the range of keys seen with a new
read option called prefix_seen_at_start. This remembers the starting
prefix and then compares it on a Next for equality of prefix. If they
are from a different prefix, it sets valid to false.

Test Plan: PrefixTest.PrefixValid

Reviewers: IslamAbdelRahman, sdong, yhchiang, anthony

Reviewed By: anthony

Subscribers: spetrunia, hermanlee4, yoshinorim, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D50211
2015-11-05 13:24:05 -08:00
Islam AbdelRahman
042fb053fd Fix clang
Summary: Fix build for clang

Test Plan:
USE_CLANG=1 make all -j64
make clean
make check -j64

Reviewers: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D50217
2015-11-04 21:02:20 -08:00
Yueh-Hsuan Chiang
183cadfc87 Add OptionsSanityCheckLevel
Summary:
This patch introduces OptionsSanityCheckLevel internally to enable
sanity check rocksdb options.

Utilities API will be added in the follow-up diffs.

Test Plan: Added more tests in options_test

Reviewers: igor, IslamAbdelRahman, sdong, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D49515
2015-11-04 18:53:30 -08:00
Islam AbdelRahman
f31442fb5c Merge pull request #803 from SherlockNoMad/SkipFlush
Add Option to Skip Flushing in TableBuilder
2015-11-02 14:56:11 -08:00
Dmitri Smirnov
a0163c0682 Do not disable compiler warnings:
C4101 'identifier' : unreferenced local variable
  C4189 'identifier' : local variable is initialized but not referenced
  C4100 'identifier' : unreferenced formal parameter
  C4296 'operator' : expression is always false
2015-11-02 14:11:28 -08:00
SherlockNoMad
ccc8c10c0c Move skip_table_builder_flush to BlockBasedTableOption 2015-10-30 18:33:01 -07:00
SherlockNoMad
84992d6475 Option Helper Refactoring 2015-10-30 15:58:46 -07:00
sdong
335e4ce8ca options_test: fix a bug of assertion
Summary: new_cf_opt.table_factory->Name() is char*, ASSERT_EQ doesn't work with char* directly. Construct a string using it.

Test Plan: Run the test that failed.

Reviewers: igor, kradhakrishnan, rven, IslamAbdelRahman, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D49767
2015-10-30 11:53:47 -07:00
Siying Dong
66a3a87ab3 Merge pull request #797 from SherlockNoMad/optionHelper
Support PlainTableOption in option_helper.cc
2015-10-30 09:44:11 -07:00
SherlockNoMad
a6dd0831d5 Add Option to Skip Flushing in TableBuilder 2015-10-29 22:10:25 -07:00
Islam AbdelRahman
2872e0c8c2 Clean and expose CreateLoggerFromOptions
Summary:
CreateLoggerFromOptions have some parameters like  db_log_dir and env, these parameters are redundant since they already exist in DBOptions

this patch remove the redundant parameters and expose CreateLoggerFromOptions to users

Test Plan: make check

Reviewers: igor, anthony, yhchiang, rven, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, hermanlee4

Differential Revision: https://reviews.facebook.net/D49713
2015-10-29 18:07:37 -07:00
sdong
296c3a1f94 "make format" in some recent commits
Summary: Run "make format" for some recent commits.

Test Plan: Build and run tests

Reviewers: IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D49707
2015-10-29 17:11:14 -07:00
Siying Dong
6388e7f4e2 Merge pull request #798 from yuslepukhin/readahead_buffermanagement
Implement smart buffer management in Windows Env.
2015-10-29 15:02:59 -07:00
SherlockNoMad
b69b9b624e Support PlainTableOption in option_helper 2015-10-28 23:01:33 -07:00
sdong
47414c6cd6 Move include/posix/io_posix.h to util/io_posix.h
Summary: include/posix/io_posix.h is not a public API. Although include/posix/ is not a public header directory, it is confusing to put non-public headers to under include/. Move it to util/ to be clearer.

Test Plan: Run all tests

Reviewers: rven, IslamAbdelRahman, anthony, kradhakrishnan, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D49611
2015-10-28 12:15:51 -07:00
sdong
2889df84cb Revert "Avoid to reply on ROCKSDB_FALLOCATE_PRESENT in include/posix/io_posix.h"
This reverts commit c37223c083.
2015-10-28 11:55:20 -07:00
Islam AbdelRahman
72d6e758b4 Fix WritableFileWriter::Append() return
Summary: It looks like WritableFileWriter::Append() was returning OK() even when there is an error

Test Plan: make check

Reviewers: sdong, yhchiang, anthony, rven, kradhakrishnan, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D49569
2015-10-27 21:04:00 -07:00
sdong
c37223c083 Avoid to reply on ROCKSDB_FALLOCATE_PRESENT in include/posix/io_posix.h
Summary: include/posix/io_posix.h should not depend on ROCKSDB_FALLOCATE_PRESENT. Remove it.

Test Plan: Build it with both of ROCKSDB_FALLOCATE_PRESENT defined and not defined.

Reviewers: rven, yhchiang, anthony, kradhakrishnan, IslamAbdelRahman, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D49563
2015-10-27 16:48:40 -07:00
Dmitri Smirnov
6fbc4f9f3e Implement smart buffer management.
introduce a new DBOption random_access_max_buffer_size to limit
  the size of the random access buffer used for unbuffered access.
  Implement read ahead buffering when enabled.
  To that effect propagate compaction_readahead_size and the new option
  to the env options to make it available for the implementation.
  Add Hint() override so SetupForCompaction() call would call Hint()
  readahead can now be setup from both Hint() and EnableReadAhead()
  Add new option random_access_max_buffer_size support
  db_bench, options_helper to make it string parsable
  and the unit test.
2015-10-27 14:44:16 -07:00
sdong
d6219e4d9b Mac build break caused by include/posix/io_posix.h not declearing errno,
Summary: Mac build breaks as include/posix/io_posix.h doesn't include errno. Move the exact function declaration to io_posix.cc

Test Plan: Run all test. Will run on Mac

Reviewers: rven, anthony, yhchiang, IslamAbdelRahman, igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D49551
2015-10-27 14:16:19 -07:00
Siying Dong
beb69d4511 Merge pull request #765 from PraveenSinghRao/wal_filter
Adding wal filter to inspect and filter wal records on recovery
2015-10-27 12:13:01 -07:00
sdong
ab0f3b964f crash_test to trigger some less frequent crash point more frequently
Summary: crash_test still has a very low chance to hit some crash point. Have another mode for covering them more likely.

Test Plan: Run crash_test and see db_stress is called with expected prameters.

Reviewers: kradhakrishnan, igor, anthony, rven, IslamAbdelRahman, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D49473
2015-10-27 12:06:06 -07:00
Praveen Rao
4ce117c4d5 Merge branch 'master' into wal_filter 2015-10-26 19:03:34 -07:00
sdong
44d4057d78 Avoid some includes in io_posix.h
Summary: IO Posix depends on too many .h files. Move most of them to .cc files.

Test Plan: make all

Reviewers: anthony, rven, IslamAbdelRahman, yhchiang, kradhakrishnan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D49479
2015-10-26 17:00:25 -07:00
Siying Dong
138876a62c Merge pull request #746 from ceph/wip-recycle
Add Options.recycle_log_file_num for Recycling WAL Files
2015-10-26 15:01:28 -07:00
Javier González
6e6dd5f6f9 Split posix storage backend into Env and library
Summary: This patch splits the posix storage backend into Env and
the actual *File implementations. The motivation is to allow other Envs
to use posix as a library. This enables a storage backend different from
posix to split its secondary storage between a normal file system
partition managed by posix, and it own media.

Test Plan: No new functionality is added to posix Env or the library,
thus the current tests should suffice.
2015-10-22 17:31:31 +02:00
Praveen Rao
2938c5c137 merge upstream changes 2015-10-19 15:21:33 -07:00
Igor Canadi
4e07c99a9a Fix iOS build
Summary: We don't yet have a CI build for iOS, so our iOS compile gets broken sometimes. Most of the errors are from assumption that size_t is 64-bit, while it's actually 32-bit on some (all?) iOS platforms. This diff fixes the compile.

Test Plan:
TARGET_OS=IOS make static_lib

Observe there are no warnings

Reviewers: sdong, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D49029
2015-10-19 13:40:44 -07:00
Praveen Rao
0c59691dde Handle multiple batches in single log record - allow app to return a new batch + allow app to return corrupted record status 2015-10-19 13:27:40 -07:00
Sage Weil
543c12ab06 options: add recycle_log_file_num option
Signed-off-by: Sage Weil <sage@redhat.com>
2015-10-18 21:21:24 -04:00
Sage Weil
1bcafb62f4 env: add ReuseWritableFile
Add an environment method to reuse an existing file.  Provide a generic
implementation that does a simple rename + open (writeable), and also a
posix variant that is more careful about error handling (if we fail to
open, do not rename, etc.).

Signed-off-by: Sage Weil <sage@redhat.com>
2015-10-18 21:21:24 -04:00
Alexey Maykov
e1a09a7703 Implementation for GetPropertiesOfTablesInRange
Summary: In MyRocks, it is sometimes important to get propeties only for the subset of the database. This diff implements the API in RocksDB.

Test Plan: ran the GetPropertiesOfTablesInRange

Reviewers: rven, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48651
2015-10-17 13:34:43 -07:00
sdong
277dea78f0 Add more kill points
Summary:
Add kill points in:
1. after creating a file
2. before writing a manifest record
3. before syncing manifest
4. before creating a new current file
5. after creating a new current file

Test Plan: Run all current tests.

Reviewers: yhchiang, igor, anthony, IslamAbdelRahman, rven, kradhakrishnan

Reviewed By: kradhakrishnan

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D48855
2015-10-16 14:35:12 -07:00
Venkatesh Radhakrishnan
a98fbacfa0 Moving memtable related files from util to a new directory memtable
Summary:
We are cleaning up dependencies.
This diff takes a first step at moving memtable files to their own
directory called memtable. In future diffs, we will move other memtable
files from db to memtable.

Test Plan: make check

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D48915
2015-10-16 14:10:33 -07:00
sdong
e1a5ff857b Allow users to disable some kill points in db_stress
Summary:
Give a name for every kill point, and allow users to disable some kill points based on prefixes. The kill points can be passed by db_stress through a command line paramter. This provides a way for users to boost the chance of triggering low frequency kill points
This allow follow up changes in crash test scripts to improve crash test coverage.

Test Plan:
Manually run db_stress with variable values of --kill_random_test and --kill_prefix_blacklist. Like this:
 --kill_random_test=2 --kill_prefix_blacklist=Posix,WritableFileWriter::Append,WritableFileWriter::WriteBuffered,WritableFileWriter::Sync

Reviewers: igor, kradhakrishnan, rven, IslamAbdelRahman, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D48735
2015-10-15 14:33:13 -07:00
Venkatesh Radhakrishnan
63e507c59c Move ldb and sst_dump from utils to tools.
Summary: As part of cleaning up dependencies for tech debt week, we are moving ldb and sst_dump tools from util to tools, since they are tools.

Test Plan: make check

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48747
2015-10-14 17:08:28 -07:00
sdong
dae49e829e Make DBTest.ReadLatencyHistogramByLevel more robust
Summary:
Two fixes:
1. Wait compaction after generating each L0 file so that we are sure there are one L0 file left.
2. https://reviews.facebook.net/D48423 increased from 500 keys to 700 keys but in verification phase we are still querying the first 500 keys. It is a bug to fix.

Test Plan: Run the test in the same environment that fails by chance of one in tens of times. It doesn't fail after 1000 times.

Reviewers: yhchiang, IslamAbdelRahman, igor, rven, kradhakrishnan

Reviewed By: rven, kradhakrishnan

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D48759
2015-10-14 16:08:55 -07:00
Venkatesh Radhakrishnan
e587dbe03a Move manual_compaction_test.cc from util to db
Summary: manual_compaction_test.cc incorrectly in util. Moved to db.

Test Plan: make check

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48687
2015-10-14 11:06:27 -07:00
sdong
35ad531be3 Seperate InternalIterator from Iterator
Summary:
Separate a new class InternalIterator from class Iterator, when the look-up is done internally, which also means they operate on key with sequence ID and type.

This change will enable potential future optimizations but for now InternalIterator's functions are still the same as Iterator's.
At the same time, separate the cleanup function to a separate class and let both of InternalIterator and Iterator inherit from it.

Test Plan: Run all existing tests.

Reviewers: igor, yhchiang, anthony, kradhakrishnan, IslamAbdelRahman, rven

Reviewed By: rven

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D48549
2015-10-13 15:32:13 -07:00
Praveen Rao
cc4d13e0a8 Put wal_filter under #ifndef ROCKSDB_LITE 2015-10-13 11:10:14 -07:00
Yueh-Hsuan Chiang
7062d0ea68 Make perf_context.db_mutex_lock_nanos and db_condition_wait_nanos only measures DB Mutex
Summary:
In the current implementation, perf_context.db_mutex_lock_nanos and
perf_context.db_condition_wait_nanos also include the mutex-wait time
other than DB Mutex.

This patch fix this issue by incrementing the counters only when it detects
a DB mutex.

Test Plan: perf_context_test

Reviewers: anthony, IslamAbdelRahman, sdong, igor

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D48555
2015-10-13 10:41:48 -07:00
Praveen Rao
eb24178553 merge from master 2015-10-12 17:24:21 -07:00
Islam AbdelRahman
c64ae05b1c Move TEST_NewInternalIterator to NewInternalIterator
Summary:
Long time ago we add InternalDumpCommand to ldb_tool https://reviews.facebook.net/D11517
This command is using TEST_NewInternalIterator although it's not a test. This patch move TEST_NewInternalIterator outside of db_impl_debug.cc

Test Plan:
make check
make static_lib

Reviewers: yhchiang, igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48561
2015-10-12 17:22:37 -07:00
Praveen Rao
59a0c219bb Adding log filter to inspect and filter log records on recovery 2015-10-12 17:03:03 -07:00
Venkatesh Radhakrishnan
f1fdf5205b Clean up dependency: Move db_test_util.* to db directory
Summary:
As part of tech debt week, we are cleaning up dependencies.
This diff moves db_test_util.[h,cc] from util to db directory.

Test Plan: make check

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D48543
2015-10-12 13:05:42 -07:00
Yueh-Hsuan Chiang
2379944093 Fixed an incorrect replace of const value in util/options_helper.cc
Summary: Fixed an incorrect replace of const value in util/options_helper.cc

Test Plan: options_test

Reviewers: igor, sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D48513
2015-10-11 17:42:22 -07:00
Yueh-Hsuan Chiang
0bb8ea56be [RocksDB Options File] Add TableOptions section and support BlockBasedTable
Summary:
Introduce TableOptions section and support BlockBasedTable in RocksDB
options file.  A TableOptions section has the following format:

  [TableOptions/<FactoryClassName> "<ColumnFamily Name>"]

which includes information about its TableFactory class and belonging
column family.  Below is an example TableOptions section of a
BlockBasedTableOptions that belongs to the default column family:

  [TableOptions/BlockBasedTable "default"]
    format_version=0
    whole_key_filtering=true
    block_size_deviation=10
    block_size=4096
    block_restart_interval=16
    filter_policy=nullptr
    no_block_cache=false
    checksum=kCRC32c
    cache_index_and_filter_blocks=false
    index_type=kBinarySearch
    hash_index_allow_collision=true
    flush_block_policy_factory=FlushBlockBySizePolicyFactory

Currently, Cache-type options (i.e., block_cache and block_cache_compressed)
are not supported.

Test Plan: options_test

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D48435
2015-10-11 12:17:42 -07:00
sdong
776bd8d5eb Pass column family ID to table property collector
Summary: Pass column family ID through TablePropertiesCollectorFactory::CreateTablePropertiesCollector() so that users can identify which column family this file is for and handle it differently.

Test Plan: Add unit test scenarios in tests related to table properties collectors to verify the information passed in is correct.

Reviewers: rven, yhchiang, anthony, kradhakrishnan, igor, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: yoshinorim, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D48411
2015-10-09 14:36:51 -07:00
Islam AbdelRahman
51fa7ecec5 Bytes read/written from cache statistics
Summary: Add 2 new counters BLOCK_CACHE_BYTES_WRITE, BLOCK_CACHE_BYTES_READ to keep track of how many bytes were written to the cache and how many bytes that we read from cache

Test Plan: make check

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48195
2015-10-07 15:17:20 -07:00
dyniusz
a065cdb388 bloom hit/miss stats for SST and memtable
Summary:
	hit and miss bloom filter stats for memtable and SST
	stats added to perf_context struct
	key matches and prefix matches combined into one stat

Test Plan: unit test veryfing the functionality added, see BloomStatsTest in db_test.cc for details

Reviewers: yhchiang, igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47859
2015-10-07 11:23:20 -07:00
Igor Canadi
40cdf797d2 Fix compile error on platforms without fallocate()
Summary:
If a platform doesn't have ROCKSDB_FALLOCATE_PRESENT, then compiler complains:

util/env_posix.cc:354:8: error: private field 'allow_fallocate_' is not used [-Werror,-Wunused-private-field]

This was caught by travis.

Test Plan: compiles with ROCKSDB_FALLOCATE_PRESENT.

Reviewers: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D48327
2015-10-07 11:02:23 -07:00
Lakshmi Narayanan
4049bcde39 Added boolean variable to guard fallocate() calls
Summary:
Added boolean variable to guard fallocate() calls.
Set to false to prevent space leaks when tests fail.

Test Plan:
Compliles
Set to false and ran log device tests

Reviewers: sdong, lovro, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48027
2015-10-07 10:04:05 -07:00
Igor Canadi
d80ce7f99a Compaction filter on merge operands
Summary:
Since Andres' internship is over, I took over https://reviews.facebook.net/D42555 and rebased and simplified it a bit.

The behavior in this diff is a bit simpler than in D42555:
* only merge operators are passed through FilterMergeValue(). If fitler function returns true, the merge operator is ignored
* compaction filter is *not* called on: 1) results of merge operations and 2) base values that are getting merged with merge operands (the second case was also true in previous diff)

Do we also need a compaction filter to get called on merge results?

Test Plan: make && make check

Reviewers: lovro, tnovak, rven, yhchiang, sdong

Reviewed By: sdong

Subscribers: noetzli, kolmike, leveldb, dhruba, sdong

Differential Revision: https://reviews.facebook.net/D47847
2015-10-07 09:30:03 -07:00
dyniusz
0267502655 Support for LevelDB SST with .ldb suffix
Summary:
	Handle SST files with both ".sst" and ".ldb" suffix.
	This enables user to migrate from leveldb to rocksdb.

Test Plan:
        Added unit test with DB operating on SSTs with names schema.
        See db/dc_test.cc:SSTsWithLdbSuffixHandling for details

Reviewers: yhchiang, sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D48003
2015-10-06 17:46:22 -07:00
Yueh-Hsuan Chiang
5c7bf56d35 [RocksDB Options] Support more options in RocksDBOptionParser for sanity check.
Summary:
RocksDBOptionsParser now supports CompressionType and the following
pointer-typed options in RocksDBOptionParser
for sanity check:
  prefix_extractor
  table_factory
  comparator
  compaction_filter
  compaction_filter_factory
  merge_operator
  memtable_factory

In the RocksDB Options file, only high level information about pointer-typed
options are serialized, and those information is only used for verification
/ sanity check purpose.

Test Plan: added more tests in options_test

Reviewers: igor, IslamAbdelRahman, sdong, anthony

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47925
2015-10-02 15:35:32 -07:00
Evan Shaw
7a23e4d8ca New amalgamation target
This commit adds two new targets to the Makefile: rocksdb.cc and rocksdb.h

These files, when combined with the c.h header, are a self-contained RocksDB
source distribution called an amalgamation. (The name comes from SQLite's, which
is similar in concept.)

The main benefit of an amalgamation is that it's very easy to drop into a
new project. It also compiles faster compared to compiling individual source
files and potentially gives the compiler more opportunity to make optimizations
since it can see all functions at once.

rocksdb.cc and rocksdb.h are generated by a new script, amalgamate.py.
A detailed description of how amalgamate.py works is in a comment at the top of
the file.

There are also some small changes to existing files to enable the amalgamation:
* Use quotes for includes in unity build
* Fix an old header inclusion in util/xfunc.cc
* Move some includes outside ifdef in util/env_hdfs.cc
* Separate out tool sources in Makefile so they won't be included in unity.cc
* Unity build now produces a static library

Closes #733
2015-10-01 08:29:31 +13:00
Dmitri Smirnov
9320ffd67a Improve CI build and fix Windows build breakage
Is there a way to enforce CMake additions for internal changes that seem to come
  w/o a PR?
2015-09-30 11:20:23 -07:00
Yueh-Hsuan Chiang
da1cf8a9bc Add a missing check for deprecated options in options_helper.cc
Summary: Add a missing check for deprecated options in options_helper.cc

Test Plan: options_test

Reviewers: sdong, anthony, IslamAbdelRahman, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47793
2015-09-29 17:58:00 -07:00
Yueh-Hsuan Chiang
1e73b11af7 Better handling of deprecated options in RocksDBOptionsParser
Summary:
Previously, we treat deprecated options as normal options in
RocksDBOptionsParser.  However, these deprecated options should
not be verified and serialized.

Test Plan: options_test

Reviewers: igor, sdong, IslamAbdelRahman, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47775
2015-09-29 17:13:02 -07:00
Yueh-Hsuan Chiang
a8b8295d18 Fixed a compile warning in options_test.cc under clang
Summary:
Fixed the following compile warning in options_test.cc under clang
util/options_test.cc:94:12: error: 'Skip' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override]
    Status Skip(uint64_t n) {
           ^
./include/rocksdb/env.h:368:18: note: overridden virtual function is here
  virtual Status Skip(uint64_t n) = 0;
                 ^

Test Plan: options_test

Reviewers: igor, sdong, anthony, IslamAbdelRahman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47763
2015-09-29 14:59:02 -07:00
Yueh-Hsuan Chiang
74b100ac17 RocksDB Options file format and its serialization / deserialization.
Summary:
This patch defines the format of RocksDB options file, which
follows the INI file format, and implements functions for its
serialization and deserialization.  An example RocksDB options
file can be found in examples/rocksdb_option_file_example.ini.

A typical RocksDB options file has three sections, which are
Version, DBOptions, and more than one CFOptions.  The RocksDB
options file in general follows the basic INI file format
with the following extensions / modifications:
 * Escaped characters
   We escaped the following characters:
    - \n -- line feed - new line
    - \r -- carriage return
    - \\ -- backslash \
    - \: -- colon symbol :
    - \# -- hash tag #
 * Comments
   We support # style comments.  Comments can appear at the ending
   part of a line.
 * Statements
   A statement is of the form option_name = value.
   Each statement contains a '=', where extra white-spaces
   are supported. However, we don't support multi-lined statement.
   Furthermore, each line can only contain at most one statement.
 * Section
   Sections are of the form [SecitonTitle "SectionArgument"],
   where section argument is optional.
 * List
   We use colon-separated string to represent a list.
   For instance, n1:n2:n3:n4 is a list containing four values.

Below is an example of a RocksDB options file:

[Version]
  rocksdb_version=4.0.0
  options_file_version=1.0
[DBOptions]
  max_open_files=12345
  max_background_flushes=301
[CFOptions "default"]
[CFOptions "the second column family"]
[CFOptions "the third column family"]

Test Plan: Added many tests in options_test.cc

Reviewers: igor, IslamAbdelRahman, sdong, anthony

Reviewed By: anthony

Subscribers: maykov, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46059
2015-09-29 14:42:40 -07:00