Commit Graph

1066 Commits

Author SHA1 Message Date
Siying Dong
7fafd52dce Merge pull request #900 from shuzhang1989/hdfs_env_fix
add a factory method for creating hdfs env
2015-12-28 09:28:04 -08:00
Nathan Bronson
7d87f02799 support for concurrent adds to memtable
Summary:
This diff adds support for concurrent adds to the skiplist memtable
implementations.  Memory allocation is made thread-safe by the addition of
a spinlock, with small per-core buffers to avoid contention.  Concurrent
memtable writes are made via an additional method and don't impose a
performance overhead on the non-concurrent case, so parallelism can be
selected on a per-batch basis.

Write thread synchronization is an increasing bottleneck for higher levels
of concurrency, so this diff adds --enable_write_thread_adaptive_yield
(default off).  This feature causes threads joining a write batch
group to spin for a short time (default 100 usec) using sched_yield,
rather than going to sleep on a mutex.  If the timing of the yield calls
indicates that another thread has actually run during the yield then
spinning is avoided.  This option improves performance for concurrent
situations even without parallel adds, although it has the potential to
increase CPU usage (and the heuristic adaptation is not yet mature).

Parallel writes are not currently compatible with
inplace updates, update callbacks, or delete filtering.
Enable it with --allow_concurrent_memtable_write (and
--enable_write_thread_adaptive_yield).  Parallel memtable writes
are performance neutral when there is no actual parallelism, and in
my experiments (SSD server-class Linux and varying contention and key
sizes for fillrandom) they are always a performance win when there is
more than one thread.

Statistics are updated earlier in the write path, dropping the number
of DB mutex acquisitions from 2 to 1 for almost all cases.

This diff was motivated and inspired by Yahoo's cLSM work.  It is more
conservative than cLSM: RocksDB's write batch group leader role is
preserved (along with all of the existing flush and write throttling
logic) and concurrent writers are blocked until all memtable insertions
have completed and the sequence number has been advanced, to preserve
linearizability.

My test config is "db_bench -benchmarks=fillrandom -threads=$T
-batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T
-level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999
-disable_auto_compactions --max_write_buffer_number=8
-max_background_flushes=8 --disable_wal --write_buffer_size=160000000
--block_size=16384 --allow_concurrent_memtable_write" on a two-socket
Xeon E5-2660 @ 2.2Ghz with lots of memory and an SSD hard drive.  With 1
thread I get ~440Kops/sec.  Peak performance for 1 socket (numactl
-N1) is slightly more than 1Mops/sec, at 16 threads.  Peak performance
across both sockets happens at 30 threads, and is ~900Kops/sec, although
with fewer threads there is less performance loss when the system has
background work.

Test Plan:
1. concurrent stress tests for InlineSkipList and DynamicBloom
2. make clean; make check
3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench
4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench
5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench
6. make clean; OPT=-DROCKSDB_LITE make check
7. verify no perf regressions when disabled

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba

Differential Revision: https://reviews.facebook.net/D50589
2015-12-25 11:03:40 -08:00
Shu Zhang
4fd23fb130 add a factory method for creating hdfs env 2015-12-23 17:26:50 -08:00
sdong
15b8902264 Change default options.delayed_write_rate
Summary: We now have a mechanism to further slowdown writes. Double default options.delayed_write_rate to try to keep the default behavior closer to it used to be.

Test Plan: Run all tests.

Reviewers: IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: yhchiang, kradhakrishnan, rven, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52281
2015-12-23 14:51:55 -08:00
sdong
b9f77ba12b When slowdown is triggered, reduce the write rate
Summary: It's usually hard for users to set a value of options.delayed_write_rate. With this diff, after slowdown condition triggers, we greedily reduce write rate if estimated pending compaction bytes increase. If estimated compaction pending bytes drop, we increase the write rate.

Test Plan:
Add a unit test
Test with db_bench setting:
TEST_TMPDIR=/dev/shm/ ./db_bench --benchmarks=fillrandom -num=10000000 --soft_pending_compaction_bytes_limit=1000000000 --hard_pending_compaction_bytes_limit=3000000000 --delayed_write_rate=100000000

and make sure without the commit, write stop will happen, but with the commit, it will not happen.

Reviewers: igor, anthony, rven, yhchiang, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52131
2015-12-23 11:33:15 -08:00
Dmitri Smirnov
dbb8260f7e Make Status moveable
Status is a class which is frequently returned by value from functions.
  Making it movable avoids 99% of the copies automatically
  on return by value.
2015-12-22 16:06:20 -08:00
Islam AbdelRahman
2bf9b968ca Fix lite_build
Summary: Fix compiling under ROCKSDB_LITE

Test Plan:
OPT="-DROCKSDB_LITE" make -j64 check
make check -j64

Reviewers: rven, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D52239
2015-12-22 11:58:13 -08:00
Islam AbdelRahman
d005c66faf Report compaction reason in CompactionListener
Summary:
Add CompactionReason to CompactionJobInfo
This will allow users to understand why compaction started which will help options tuning

Test Plan:
added new tests
make check -j64

Reviewers: yhchiang, anthony, kradhakrishnan, sdong, rven

Reviewed By: rven

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D51975
2015-12-22 11:37:19 -08:00
Alex Yang
33e09c0e19 add call to install superversion and schedule work in enableautocompactions
Summary:
This patch fixes https://github.com/facebook/mysql-5.6/issues/121

There is a recent change in rocksdb to disable auto compactions on startup: https://reviews.facebook.net/D51147. However, there is a small timing window where a column family needs to be compacted and schedules a compaction, but the scheduled compaction fails when it checks the disable_auto_compactions setting. The expectation is once the application is ready, it will call EnableAutoCompactions() to allow new compactions to go through. However, if the Column family is stalled because L0 is full, and no writes can go through, it is possible the column family may never have a new compaction request get scheduled. EnableAutoCompaction() should probably schedule an new flush and compaction event when it resets disable_auto_compaction.

Using InstallSuperVersionAndScheduleWork, we call SchedulePendingFlush,
SchedulePendingCompaction, as well as MaybeScheduleFlushOrcompaction on all the
column families to avoid the situation above.

This is still a first pass for feedback.
Could also just call SchedePendingFlush and SchedulePendingCompaction directly.

Test Plan:
Run on Asan build
cd _build-5.6-ASan/ && ./mysql-test/mtr --mem --big --testcase-timeout=36000 --suite-timeout=12000 --parallel=16 --suite=rocksdb,rocksdb_rpl,rocksdb_sys_vars --mysqld=--default-storage-engine=rocksdb --mysqld=--skip-innodb --mysqld=--default-tmp-storage-engine=MyISAM --mysqld=--rocksdb rocksdb_rpl.rpl_rocksdb_stress_crash --repeat=1000

Ensure that it no longer hangs during the test.

Reviewers: hermanlee4, yhchiang, anthony

Reviewed By: anthony

Subscribers: leveldb, yhchiang, dhruba

Differential Revision: https://reviews.facebook.net/D51747
2015-12-21 10:06:49 -08:00
agiardullo
eff309867e Do not use timed_mutex in TransactionDB
Summary: Stopped using std::timed_mutex as it has known issues in older versiong of gcc.  Ran into these problems when testing MongoRocks.

Test Plan: unit tests.  Manual mongo testing on gcc 4.8.

Reviewers: igor, yhchiang, rven, IslamAbdelRahman, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D52197
2015-12-18 17:26:02 -08:00
sdong
d72b31774e Slowdown when writing to the last write buffer
Summary: Now if inserting to mem table is much faster than writing to files, there is no mechanism users can rely on to avoid stopping for reaching options.max_write_buffer_number. With the commit, if there are more than four maximum write buffers configured, we slow down to the rate of options.delayed_write_rate while we reach the last one.

Test Plan:
1. Add a new unit test.
2. Run db_bench with

./db_bench --benchmarks=fillrandom --num=10000000 --max_background_flushes=6 --batch_size=32 -max_write_buffer_number=4 --delayed_write_rate=500000 --statistics

based on hard drive and see stopping is avoided with the commit.

Reviewers: yhchiang, IslamAbdelRahman, anthony, rven, kradhakrishnan, igor

Reviewed By: igor

Subscribers: MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52047
2015-12-17 10:49:08 -08:00
Venkatesh Radhakrishnan
6b2a3ac92c Add documentation for unschedFunction
Summary:
Documenting the unschedFunction parameter to Schedule as
requested by Michael Kolupaev.

Test Plan: build, unit test

Reviewers: sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: kolmike, dhruba

Differential Revision: https://reviews.facebook.net/D52089
2015-12-17 10:41:39 -08:00
Islam AbdelRahman
32ff05e971 Bump version to 4.4
Summary: Bump version to 4.4

Test Plan: none

Reviewers: sdong, rven, yhchiang, anthony, kradhakrishnan

Reviewed By: kradhakrishnan

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D52035
2015-12-16 14:32:58 -08:00
Islam AbdelRahman
aececc209e Introduce ReadOptions::pin_data (support zero copy for keys)
Summary:
This patch update the Iterator API to introduce new functions that allow users to keep the Slices returned by key() valid as long as the Iterator is not deleted

ReadOptions::pin_data : If true keep loaded blocks in memory as long as the iterator is not deleted
Iterator::IsKeyPinned() : If true, this mean that the Slice returned by key() is valid as long as the iterator is not deleted

Also add a new option BlockBasedTableOptions::use_delta_encoding to allow users to disable delta_encoding if needed.

Benchmark results (using https://phabricator.fb.com/P20083553)

```
// $ du -h /home/tec/local/normal.4K.Snappy/db10077
// 6.1G    /home/tec/local/normal.4K.Snappy/db10077

// $ du -h /home/tec/local/zero.8K.LZ4/db10077
// 6.4G    /home/tec/local/zero.8K.LZ4/db10077

// Benchmarks for shard db10077
// _build/opt/rocks/benchmark/rocks_copy_benchmark \
//      --normal_db_path="/home/tec/local/normal.4K.Snappy/db10077" \
//      --zero_db_path="/home/tec/local/zero.8K.LZ4/db10077"

// First run
// ============================================================================
// rocks/benchmark/RocksCopyBenchmark.cpp          relative  time/iter  iters/s
// ============================================================================
// BM_StringCopy                                                 1.73s  576.97m
// BM_StringPiece                                   103.74%      1.67s  598.55m
// ============================================================================
// Match rate : 1000000 / 1000000

// Second run
// ============================================================================
// rocks/benchmark/RocksCopyBenchmark.cpp          relative  time/iter  iters/s
// ============================================================================
// BM_StringCopy                                              611.99ms     1.63
// BM_StringPiece                                   203.76%   300.35ms     3.33
// ============================================================================
// Match rate : 1000000 / 1000000
```

Test Plan: Unit tests

Reviewers: sdong, igor, anthony, yhchiang, rven

Reviewed By: rven

Subscribers: dhruba, lovro, adsharma

Differential Revision: https://reviews.facebook.net/D48999
2015-12-16 12:08:30 -08:00
Venkatesh Radhakrishnan
030215bf01 Running manual compactions in parallel with other automatic or manual compactions in restricted cases
Summary:
This diff provides a framework for doing manual
compactions in parallel with other compactions. We now have a deque of manual compactions. We also pass manual compactions as an argument from RunManualCompactions down to
BackgroundCompactions, so that RunManualCompactions can be reentrant.
Parallelism is controlled by the two routines
ConflictingManualCompaction to allow/disallow new parallel/manual
compactions based on already existing ManualCompactions. In this diff, by default manual compactions still have to run exclusive of other compactions. However, by setting the compaction option, exclusive_manual_compaction to false, it is possible to run other compactions in parallel with a manual compaction. However, we are still restricted to one manual compaction per column family at a time. All of these restrictions will be relaxed in future diffs.
I will be adding more tests later.

Test Plan: Rocksdb regression + new tests + valgrind

Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong

Reviewed By: sdong

Subscribers: yoshinorim, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47973
2015-12-14 11:20:34 -08:00
SherlockNoMad
768a61486c Fix appVeyor Build problem 2015-12-11 21:10:49 -08:00
agiardullo
84f98792d6 Transaction::SetWriteOptions()
Summary: Add support to change write options after creating a transaction.  This is needed for MongoRocks.

Test Plan: added test

Reviewers: sdong, rven, kradhakrishnan, IslamAbdelRahman, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51867
2015-12-11 16:08:25 -08:00
agiardullo
3bfd3d39a3 Use SST files for Transaction conflict detection
Summary:
Currently, transactions can fail even if there is no actual write conflict.  This is due to relying on only the memtables to check for write-conflicts.  Users have to tune memtable settings to try to avoid this, but it's hard to figure out exactly how to tune these settings.

With this diff, TransactionDB will use both memtables and SST files to determine if there are any write conflicts.  This relies on the fact that BlockBasedTable stores sequence numbers for all writes that happen after any open snapshot.  Also, D50295 is needed to prevent SingleDelete from disappearing writes (the TODOs in this test code will be fixed once the other diff is approved and merged).

Note that Optimistic transactions will still rely on tuning memtable settings as we do not want to read from SST while on the write thread.  Also, memtable settings can still be used to reduce how often TransactionDB needs to read SST files.

Test Plan: unit tests, db bench

Reviewers: rven, yhchiang, kradhakrishnan, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb, yoshinorim

Differential Revision: https://reviews.facebook.net/D50475
2015-12-11 12:34:11 -08:00
Igor Canadi
64fa43843b Merge pull request #862 from ceph/wip-env
implement EnvMirror
2015-12-10 18:45:07 -08:00
Sage Weil
2074ddd625 env: add EnvMirror
This is an Env implementation that mirrors all storage-related methods on
two different backend Env's and verifies that they return the same
results (return status and read results).  This is useful for implementing
a new Env and verifying its correctness.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-12-10 21:32:45 -05:00
Yueh-Hsuan Chiang
a3ba5915c8 Correct a comment in include/rocksdb/cache.h
Summary: Correct a comment in include/rocksdb/cache.h

Test Plan: No code change.

Reviewers: igor, sdong, IslamAbdelRahman, rven, kradhakrishnan, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51831
2015-12-10 16:39:10 -08:00
charsyam
c30b499541 fix typos in comments 2015-12-11 01:54:48 +09:00
sdong
56e77f0967 Deprecate options.soft_rate_limit and add options.soft_pending_compaction_bytes_limit
Summary: Deprecate options.soft_rate_limit, which is hard to tune, with options.soft_pending_compaction_bytes_limit, which would trigger the slowdown if estimated pending compaction bytes exceeds the threshold. The hope is to make it more striaght-forward to tune.

Test Plan: Modify DBTest.SoftLimit to cover options.soft_pending_compaction_bytes_limit instead; run all unit tests.

Reviewers: IslamAbdelRahman, yhchiang, rven, kradhakrishnan, igor, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51117
2015-12-09 18:22:45 -08:00
sdong
d6e1035a1f A new compaction picking priority that optimizes for write amplification for random updates.
Summary: Introduce a compaction picking priority that picks files who contains the oldest rows to compact. This is a mode that slightly improves write amplification for random update cases.

Test Plan: Add a unit test and run it in valgrind too.

Reviewers: yhchiang, anthony, IslamAbdelRahman, rven, kradhakrishnan, MarkCallaghan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51459
2015-12-09 18:13:03 -08:00
agiardullo
e5c5f23814 Support marking snapshots for write-conflict checking - Take 2
Summary:
D51183 was reverted due to breaking the LITE build.

This diff is the same as D51183 but with a fix for the LITE BUILD(D51693)

Test Plan: run all unit tests

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51711
2015-12-08 16:47:31 -08:00
sdong
1d63c3d610 Revert "Support marking snapshots for write-conflict checking"
This reverts commit ec704aafdc for it broke RocksDB LITE build.
2015-12-08 09:27:17 -08:00
agiardullo
ec704aafdc Support marking snapshots for write-conflict checking
Summary:
D50475 enables using SST files for transaction write-conflict checking.  In order for this to work, we need to make sure not to compact out SingleDeletes when there is an earlier transaction snapshot(D50295).  If there is a long-held snapshot, this could reduce the benefit of the SingleDelete optimization.

This diff allows Transactions to mark snapshots as being used for write-conflict checking.  Then, during compaction, we will be able to optimize SingleDeletes better in the future.

This diff adds a flag to SnapshotImpl which is used by Transactions.  This diff also passes the earliest write-conflict snapshot's sequence number to CompactionIterator.  This diff does not actually change Compaction (after this diff is pushed, D50295 will be able to use this information).

Test Plan: no behavior change, ran existing tests

Reviewers: rven, kradhakrishnan, yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51183
2015-12-07 19:40:51 -08:00
Jay Edgar
b28b7c6dd9 Added callback notification when a snapshot is created
Summary: When SetSnapshot() is used the caller immediately knows a snapshot has been created, but when SetSnapshotOnNextOperation() is used the caller needs a way to get notified when that snapshot has been generated.  This creates an interface that the client can implement that will be called at the time the snapshot is created.

Test Plan: Added a new SetSnapshotOnNextOperationWithNotification test into the transaction_test.

Reviewers: sdong, anthony

Reviewed By: anthony

Subscribers: yoshinorim, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51177
2015-12-04 10:20:36 -08:00
Alex Yang
e8180f9901 added public api to schedule flush/compaction, code to prevent race with db::open
Summary:
Fixes T8781168.

Added a new function EnableAutoCompactions in db.h to be publicly
avialable.  This allows compaction to be re-enabled after disabling it via
SetOptions

Refactored code to set the dbptr earlier on in TransactionDB::Open and DB::Open
Temporarily disable auto_compaction in TransactionDB::Open until dbptr is set to
prevent race condition.

Test Plan:
Ran make all check

verified fix on myrocks side:
was able to reproduce the seg fault with
../tools/mysqltest.sh --mem --force rocksdb.drop_table

method was to manually sleep the thread after DB::Open but before TransactionDB ptr was
assigned in transaction_db_impl.cc:
  DB::Open(db_options, dbname, column_families_copy, handles, &db);
  clock_t goal = (60000 * 10) + clock();
  while (goal > clock());
  ...dbptr(aka rdb) gets assigned below

verified my changes fixed the issue.

Also added unit test 'ToggleAutoCompaction' in transaction_test.cc

Reviewers: hermanlee4, anthony

Reviewed By: anthony

Subscribers: alex, dhruba

Differential Revision: https://reviews.facebook.net/D51147
2015-12-03 22:59:44 -08:00
Islam AbdelRahman
72930485b5 Fix clang build
Summary: Fix clang

Test Plan: make check

Reviewers: sdong, yhchiang, rven

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D51417
2015-11-30 10:03:07 -08:00
sdong
6bbfa1874b BackupDB to have a mode to use file size in file name
Summary: Getting file size from all the backup files can take a long time. In some cases, the sizes are available in file names. We allow a mode to get those sizes from file name.

Test Plan:
Make some unit tests in backupable_db_test to run in such a mode.
Make sure RocksDB Lite builds too.

Reviewers: IslamAbdelRahman, rven, yhchiang, kradhakrishnan, anthony, igor

Reviewed By: igor

Subscribers: muthu, asameet, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51243
2015-11-25 11:55:37 -08:00
Igor Canadi
f3ea00bc85 Merge pull request #856 from ceph/wip-env
EnvWrapper: add ReuseWritableFile
2015-11-25 11:38:09 -08:00
Sage Weil
4cedd6b038 EnvWrapper: add ReuseWritableFile
This was missed when ReuseWritableFile was added to Env in
1bcafb62f4.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-11-25 14:30:05 -05:00
Venkatesh Radhakrishnan
81be49c755 Have a way for compaction filter to ignore snapshots
Summary:
Provide an API for compaction filter to specify that it needs
to be applied even if there are snapshots.

Test Plan: DBTestCompactionFilter.CompactionFilterIgnoreSnapshot

Reviewers: yhchiang, IslamAbdelRahman, sdong, anthony

Reviewed By: anthony

Subscribers: yoshinorim, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51087
2015-11-20 15:57:26 -08:00
SherlockNoMad
bd7be035e0 Support Memtable Factory Parse in option_helper.cc 2015-11-17 14:29:01 -08:00
sdong
9bc9c93bd4 Move to version 4.3
Summary: RocksDB 4.2 is already cut. Move to 4.3

Test Plan: Not needed

Reviewers: IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D50799
2015-11-16 14:29:08 -08:00
Islam AbdelRahman
a163cc2d5a Lint everything
Summary:
```
arc2 lint --everything
```

run the linter on the whole code repo to fix exisitng lint issues

Test Plan: make check -j64

Reviewers: sdong, rven, anthony, kradhakrishnan, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D50769
2015-11-16 12:56:21 -08:00
Yueh-Hsuan Chiang
d781da8164 Add CheckOptionsCompatibility() API to options_util
Summary:
Add CheckOptionsCompatibility() API to options_util that returns
Status::OK if the input DBOptions and ColumnFamilyDescriptors
are compatible with the latest options stored in the specified DB path.

Test Plan: Added tests in options_util_test

Reviewers: igor, anthony, IslamAbdelRahman, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D50649
2015-11-12 16:52:51 -08:00
Yueh-Hsuan Chiang
e11f676e34 Add OptionsUtil::LoadOptionsFromFile() API
Summary:
This patch adds OptionsUtil::LoadOptionsFromFile() and
OptionsUtil::LoadLatestOptionsFromDB(), which allow developers
to construct DBOptions and ColumnFamilyOptions from a RocksDB
options file.  Note that most pointer-typed options such as
merge_operator will not be constructed.

With this API, developers no longer need to remember all the
options in order to reopen an existing rocksdb instance like
the following:

  DBOptions db_options;
  std::vector<std::string> cf_names;
  std::vector<ColumnFamilyOptions> cf_opts;

  // Load primitive-typed options from an existing DB
  OptionsUtil::LoadLatestOptionsFromDB(
      dbname, &db_options, &cf_names, &cf_opts);

  // Initialize necessary pointer-typed options
  cf_opts[0].merge_operator.reset(new MyMergeOperator());
  ...

  // Construct the vector of ColumnFamilyDescriptor
  std::vector<ColumnFamilyDescriptor> cf_descs;
  for (size_t i = 0; i < cf_opts.size(); ++i) {
    cf_descs.emplace_back(cf_names[i], cf_opts[i]);
  }

  // Open the DB
  DB* db = nullptr;
  std::vector<ColumnFamilyHandle*> cf_handles;
  auto s = DB::Open(db_options, dbname, cf_descs,
                    &handles, &db);

Test Plan:
Augment existing tests in column_family_test
options_test
db_test

Reviewers: igor, IslamAbdelRahman, sdong, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D49095
2015-11-12 06:52:43 -08:00
Yueh-Hsuan Chiang
e114f0abb8 Enable RocksDB to persist Options file.
Summary:
This patch allows rocksdb to persist options into a file on
DB::Open, SetOptions, and Create / Drop ColumnFamily.
Options files are created under the same directory as the rocksdb
instance.

In addition, this patch also adds a fail_if_missing_options_file in DBOptions
that makes any function call return non-ok status when it is not able to
persist options properly.

  // If true, then DB::Open / CreateColumnFamily / DropColumnFamily
  // / SetOptions will fail if options file is not detected or properly
  // persisted.
  //
  // DEFAULT: false
  bool fail_if_missing_options_file;

Options file names are formatted as OPTIONS-<number>, and RocksDB
will always keep the latest two options files.

Test Plan:
Add options_file_test.

options_test
column_family_test

Reviewers: igor, IslamAbdelRahman, sdong, anthony

Reviewed By: anthony

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48285
2015-11-10 22:58:01 -08:00
Siying Dong
7ed2c3e45b Merge pull request #823 from yuslepukhin/fix_off_t_type
Make use of portable `uint64_t` type to make possible 64-bit file access
2015-11-10 18:57:17 -08:00
Dmitri Smirnov
5270b33bd3 Make use of portable uint64_t type to make possible file access
in 64-bit.

  Currently, a signed off_t type is being used for the following
  interfaces for both offset and the length in bytes:
  * `Allocate`
  * `RangeSync`

  On Linux `off_t` is automatically either 32 or 64-bit depending on
  the platform. On Windows it is always a 32-bit signed long which
  limits file access and in particular space pre-allocation
  to effectively 2 Gb.

  Proposal is to replace off_t with uint64_t as a portable type
  always access files with 64-bit interfaces.

  May need to modify posix code but lack resources to test it.
2015-11-10 17:03:42 -08:00
Nathan Bronson
631863c63b track WriteBatch contents
Summary:
Parallel writes will only be possible for certain combinations of
flags and WriteBatch contents.  Traversing the WriteBatch at write time
to check these conditions would be expensive, but it is very cheap to
keep track of when building WriteBatch-es.  When loading WriteBatch-es
during recovery, a deferred computation state is used so that the flags
never need to be computed.

Test Plan:
1. add asserts and EXPECT_EQ-s
2. make check

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D50337
2015-11-10 16:56:06 -08:00
Yueh-Hsuan Chiang
f3ca28ab03 Correct the comment of GetApproximateMemoryUsageByType
Summary: Correct the comment of GetApproximateMemoryUsageByType.

Test Plan: No code change.

Reviewers: igor, sdong, anthony, IslamAbdelRahman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D50409
2015-11-08 09:02:35 -08:00
Islam AbdelRahman
838676c17b Revert "Adding new table properties"
Summary:
Reverting https://reviews.facebook.net/D34269 for now
after I landed it a flaky test started continuously failing, I am almost sure this patch is not related to the test but I will revert it until I figure out why it's failing

Test Plan: make check

Reviewers: kradhakrishnan

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D50385
2015-11-06 16:49:38 -08:00
Islam AbdelRahman
8be568a9c2 Adding new table properties
Summary:
This diff introduce new table properties that will be written for block based tables
These properties are
  - comparator name
  - merge operator name
  - property collectors names

Test Plan:
  - Added a new unit test to verify that these tests are written/read correctly
  - Running all other tests right now (wont land until all tests finish)

Reviewers: rven, kradhakrishnan, igor, sdong, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D34269
2015-11-06 11:19:01 -08:00
agiardullo
fe789c5f2b Document SingleDelete
Summary: Docuemented what is currently supported by SingleDelete based on its current implementation.

Test Plan: n/a

Reviewers: sdong, kradhakrishnan, yhchiang, rven

Reviewed By: rven

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D50205
2015-11-05 17:24:07 -08:00
Venkatesh Radhakrishnan
9d50afc3b9 Prefix-based iterating only shows keys in prefix
Summary:
MyRocks testing found an issue that while iterating over keys
that are outside the prefix, sometimes wrong results were seen for keys
outside the prefix. We now tighten the range of keys seen with a new
read option called prefix_seen_at_start. This remembers the starting
prefix and then compares it on a Next for equality of prefix. If they
are from a different prefix, it sets valid to false.

Test Plan: PrefixTest.PrefixValid

Reviewers: IslamAbdelRahman, sdong, yhchiang, anthony

Reviewed By: anthony

Subscribers: spetrunia, hermanlee4, yoshinorim, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D50211
2015-11-05 13:24:05 -08:00
Yueh-Hsuan Chiang
7d7ee2b654 Add Memory Insight support to utilities
Summary:
This patch introduces utilities/memory, which currently includes
GetApproximateMemoryUsageByType that reports different types of
rocksdb memory usage given a list of input DBs.

The API also take care of the case where Cache could be shared
across multiple column families / multiple db instances.

Currently, it reports memory usage of memtable, table-readers
and cache.

Test Plan: utilities/memory/memory_test.cc

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D49257
2015-11-03 17:52:17 -08:00
Yueh-Hsuan Chiang
3ecbab0040 Add GetAggregatedIntProperty(): returns the aggregated value from all CFs
Summary:
This patch adds GetAggregatedIntProperty() that returns the aggregated
value from all CFs

Test Plan: Added a test in db_test

Reviewers: igor, sdong, anthony, IslamAbdelRahman, rven

Reviewed By: rven

Subscribers: rven, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D49497
2015-11-03 15:54:18 -08:00
Satnam Singh
93a9667223 Merge branch 'master' of github.com:facebook/rocksdb 2015-11-03 15:36:14 -08:00
Satnam Singh
c9aef3c41c Add RocksDb/GeoDb Iterator interface
Summary:
This diff is a first step towards an iterator based interface for the
SearchRadial method which replaces a vector of GeoObjects with an
iterator for GeoObjects. This diff works by just wrapping the iterator
for the encapsulated vector of GeoObjects. A future diff could extend
this approach by defining an interator in terms of the underlying
iteration in SearchRadial which would then remove the need to have
an in-memory representation for all the matching GeoObjects.
Fixes T8421387

Test Plan:
The existing tests have been modified to work with the new
interface.

Reviewers: IslamAbdelRahman, kradhakrishnan, dhruba, igor

Reviewed By: igor

Subscribers: igor, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D50031
2015-11-03 15:20:58 -08:00
Islam AbdelRahman
f31442fb5c Merge pull request #803 from SherlockNoMad/SkipFlush
Add Option to Skip Flushing in TableBuilder
2015-11-02 14:56:11 -08:00
SherlockNoMad
ccc8c10c0c Move skip_table_builder_flush to BlockBasedTableOption 2015-10-30 18:33:01 -07:00
Islam AbdelRahman
ff4499e297 Update DB::AddFile() to have less restrictions
Summary:
Update DB::AddFile() restrictions to be
  - Key range in loaded table file don't overlap with existing keys or tombstones in DB.
  - No other writes happen during AddFile call.

The updated AddFile() will verify that the file key range don't overlap with any keys or tombstones in the DB, and then add the file to L0

Test Plan: unit tests

Reviewers: igor, rven, anthony, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: adsharma, ameyag, dhruba

Differential Revision: https://reviews.facebook.net/D49233
2015-10-30 16:38:10 -07:00
Siying Dong
66a3a87ab3 Merge pull request #797 from SherlockNoMad/optionHelper
Support PlainTableOption in option_helper.cc
2015-10-30 09:44:11 -07:00
SherlockNoMad
a6dd0831d5 Add Option to Skip Flushing in TableBuilder 2015-10-29 22:10:25 -07:00
Islam AbdelRahman
2872e0c8c2 Clean and expose CreateLoggerFromOptions
Summary:
CreateLoggerFromOptions have some parameters like  db_log_dir and env, these parameters are redundant since they already exist in DBOptions

this patch remove the redundant parameters and expose CreateLoggerFromOptions to users

Test Plan: make check

Reviewers: igor, anthony, yhchiang, rven, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, hermanlee4

Differential Revision: https://reviews.facebook.net/D49713
2015-10-29 18:07:37 -07:00
sdong
296c3a1f94 "make format" in some recent commits
Summary: Run "make format" for some recent commits.

Test Plan: Build and run tests

Reviewers: IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D49707
2015-10-29 17:11:14 -07:00
Siying Dong
6388e7f4e2 Merge pull request #798 from yuslepukhin/readahead_buffermanagement
Implement smart buffer management in Windows Env.
2015-10-29 15:02:59 -07:00
SherlockNoMad
b69b9b624e Support PlainTableOption in option_helper 2015-10-28 23:01:33 -07:00
sdong
47414c6cd6 Move include/posix/io_posix.h to util/io_posix.h
Summary: include/posix/io_posix.h is not a public API. Although include/posix/ is not a public header directory, it is confusing to put non-public headers to under include/. Move it to util/ to be clearer.

Test Plan: Run all tests

Reviewers: rven, IslamAbdelRahman, anthony, kradhakrishnan, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D49611
2015-10-28 12:15:51 -07:00
sdong
2889df84cb Revert "Avoid to reply on ROCKSDB_FALLOCATE_PRESENT in include/posix/io_posix.h"
This reverts commit c37223c083.
2015-10-28 11:55:20 -07:00
Siying Dong
d0a18c2840 Merge pull request #786 from aloukissas/unused_param
Fix unused parameter warnings in db.h
2015-10-27 16:51:22 -07:00
sdong
c37223c083 Avoid to reply on ROCKSDB_FALLOCATE_PRESENT in include/posix/io_posix.h
Summary: include/posix/io_posix.h should not depend on ROCKSDB_FALLOCATE_PRESENT. Remove it.

Test Plan: Build it with both of ROCKSDB_FALLOCATE_PRESENT defined and not defined.

Reviewers: rven, yhchiang, anthony, kradhakrishnan, IslamAbdelRahman, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D49563
2015-10-27 16:48:40 -07:00
Dmitri Smirnov
6fbc4f9f3e Implement smart buffer management.
introduce a new DBOption random_access_max_buffer_size to limit
  the size of the random access buffer used for unbuffered access.
  Implement read ahead buffering when enabled.
  To that effect propagate compaction_readahead_size and the new option
  to the env options to make it available for the implementation.
  Add Hint() override so SetupForCompaction() call would call Hint()
  readahead can now be setup from both Hint() and EnableReadAhead()
  Add new option random_access_max_buffer_size support
  db_bench, options_helper to make it string parsable
  and the unit test.
2015-10-27 14:44:16 -07:00
sdong
d6219e4d9b Mac build break caused by include/posix/io_posix.h not declearing errno,
Summary: Mac build breaks as include/posix/io_posix.h doesn't include errno. Move the exact function declaration to io_posix.cc

Test Plan: Run all test. Will run on Mac

Reviewers: rven, anthony, yhchiang, IslamAbdelRahman, igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D49551
2015-10-27 14:16:19 -07:00
Praveen Rao
4ce117c4d5 Merge branch 'master' into wal_filter 2015-10-26 19:03:34 -07:00
Praveen Rao
32cdec634e Fail recovery if filter provides more records than original and corresponding unit-test, fix naming conventions 2015-10-26 18:11:18 -07:00
sdong
44d4057d78 Avoid some includes in io_posix.h
Summary: IO Posix depends on too many .h files. Move most of them to .cc files.

Test Plan: make all

Reviewers: anthony, rven, IslamAbdelRahman, yhchiang, kradhakrishnan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D49479
2015-10-26 17:00:25 -07:00
Alex Loukissas
b0980ff748 Fix unused parameter warnings. 2015-10-26 16:32:14 -07:00
Siying Dong
138876a62c Merge pull request #746 from ceph/wip-recycle
Add Options.recycle_log_file_num for Recycling WAL Files
2015-10-26 15:01:28 -07:00
sdong
d691111146 include/posix/io_posix.h should have a once declartion
Summary: include/posix/io_posix.h doesn't not prevent multiple includes. Need to fix it. It is also breaking unity build.

Test Plan: Run unity build and see error go away.

Reviewers: rven, igor, IslamAbdelRahman, kradhakrishnan, anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D49281
2015-10-23 07:45:00 -07:00
Siying Dong
8c11c5dee8 Merge pull request #768 from OpenChannelSSD/to_fb_master2
Split posix storage backend into Env and library
2015-10-22 13:33:51 -07:00
Javier González
6e6dd5f6f9 Split posix storage backend into Env and library
Summary: This patch splits the posix storage backend into Env and
the actual *File implementations. The motivation is to allow other Envs
to use posix as a library. This enables a storage backend different from
posix to split its secondary storage between a normal file system
partition managed by posix, and it own media.

Test Plan: No new functionality is added to posix Env or the library,
thus the current tests should suffice.
2015-10-22 17:31:31 +02:00
Alexey Maykov
980a82ee2f Fix a bug in GetApproximateSizes
Summary: Need to pass through the memtable parameter.

Test Plan: built, tested through myrocks

Reviewers: igor, sdong, rven

Reviewed By: rven

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D49167
2015-10-21 18:34:39 -07:00
Praveen Rao
2938c5c137 merge upstream changes 2015-10-19 15:21:33 -07:00
Islam AbdelRahman
e3b1d23d3e Bump version to 4.2
Summary: Bump the version to 4.2 ( the unreleased version ), so that when fbcode_unittests run it can differentiate between old and new APIs

Test Plan: make check

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D49041
2015-10-19 15:05:59 -07:00
Praveen Rao
0c59691dde Handle multiple batches in single log record - allow app to return a new batch + allow app to return corrupted record status 2015-10-19 13:27:40 -07:00
agiardullo
cfaa33f9a5 Update transaction iterator documentation
Summary: Remove warning about an issue that was resolved.  Turns out the issue was a false-alarm.

Test Plan: n/a

Reviewers: igor, yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D49011
2015-10-19 13:08:18 -07:00
Alexey Maykov
f18acd8875 Fixed the clang compilation failure
Summary: As above.

Test Plan: USE_CLANG=1 make check -j

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48981
2015-10-19 10:38:50 -07:00
Sage Weil
543c12ab06 options: add recycle_log_file_num option
Signed-off-by: Sage Weil <sage@redhat.com>
2015-10-18 21:21:24 -04:00
Sage Weil
1bcafb62f4 env: add ReuseWritableFile
Add an environment method to reuse an existing file.  Provide a generic
implementation that does a simple rename + open (writeable), and also a
posix variant that is more careful about error handling (if we fail to
open, do not rename, etc.).

Signed-off-by: Sage Weil <sage@redhat.com>
2015-10-18 21:21:24 -04:00
Alexey Maykov
e1a09a7703 Implementation for GetPropertiesOfTablesInRange
Summary: In MyRocks, it is sometimes important to get propeties only for the subset of the database. This diff implements the API in RocksDB.

Test Plan: ran the GetPropertiesOfTablesInRange

Reviewers: rven, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48651
2015-10-17 13:34:43 -07:00
Yueh-Hsuan Chiang
ad471453e8 Allow GetProperty to report the number of currently running flushes / compactions.
Summary:
Add rocksdb.num-running-compactions and rocksdb.num-running-flushes
to GetIntProperty() that reports the number of currently running
compactions / flushes.

Test Plan: augmented existing tests in db_test

Reviewers: igor, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D48693
2015-10-17 00:16:36 -07:00
Jay Edgar
8f143e03fb Add ClearSnapshot()
Summary:
MyRocks needs the ability to clear a snapshot for Read Committed support

Test Plan: transaction_test

Reviewers: anthony

Reviewed By: anthony

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48861
2015-10-16 11:53:30 -07:00
sdong
666fb5df43 Remove DefaultCompactionFilterFactory.
Summary: DefaultCompactionFilterFactory is not used anymore after recent changes. Remove it.

Test Plan: Just run existing tests.

Reviewers: rven, igor, yhchiang, kradhakrishnan, anthony, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D48699
2015-10-14 11:02:30 -07:00
Yueh-Hsuan Chiang
29a47cd2b8 Include the time unit in the comment of perf_context timers
Summary: Include the time unit in the comment of perf_context timers

Test Plan: make perf_context_test

Reviewers: igor, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D48663
2015-10-13 17:24:13 -07:00
sdong
35ad531be3 Seperate InternalIterator from Iterator
Summary:
Separate a new class InternalIterator from class Iterator, when the look-up is done internally, which also means they operate on key with sequence ID and type.

This change will enable potential future optimizations but for now InternalIterator's functions are still the same as Iterator's.
At the same time, separate the cleanup function to a separate class and let both of InternalIterator and Iterator inherit from it.

Test Plan: Run all existing tests.

Reviewers: igor, yhchiang, anthony, kradhakrishnan, IslamAbdelRahman, rven

Reviewed By: rven

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D48549
2015-10-13 15:32:13 -07:00
Praveen Rao
cc4d13e0a8 Put wal_filter under #ifndef ROCKSDB_LITE 2015-10-13 11:10:14 -07:00
Praveen Rao
eb24178553 merge from master 2015-10-12 17:24:21 -07:00
Praveen Rao
59a0c219bb Adding log filter to inspect and filter log records on recovery 2015-10-12 17:03:03 -07:00
Yueh-Hsuan Chiang
0bb8ea56be [RocksDB Options File] Add TableOptions section and support BlockBasedTable
Summary:
Introduce TableOptions section and support BlockBasedTable in RocksDB
options file.  A TableOptions section has the following format:

  [TableOptions/<FactoryClassName> "<ColumnFamily Name>"]

which includes information about its TableFactory class and belonging
column family.  Below is an example TableOptions section of a
BlockBasedTableOptions that belongs to the default column family:

  [TableOptions/BlockBasedTable "default"]
    format_version=0
    whole_key_filtering=true
    block_size_deviation=10
    block_size=4096
    block_restart_interval=16
    filter_policy=nullptr
    no_block_cache=false
    checksum=kCRC32c
    cache_index_and_filter_blocks=false
    index_type=kBinarySearch
    hash_index_allow_collision=true
    flush_block_policy_factory=FlushBlockBySizePolicyFactory

Currently, Cache-type options (i.e., block_cache and block_cache_compressed)
are not supported.

Test Plan: options_test

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D48435
2015-10-11 12:17:42 -07:00
Alexey Maykov
3d07b815f6 Passing table properties to compaction callback
Summary: It would be nice to have and access to table properties in compaction callbacks. In MyRocks project, it will make possible to update optimizer statistics online.

Test Plan: ran the unit test. Ran myrocks with the new way of collecting stats.

Reviewers: igor, rven, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48267
2015-10-09 18:10:55 -07:00
agiardullo
def74f8763 Deferred snapshot creation in transactions
Summary: Support for Transaction::CreateSnapshotOnNextOperation().  This is to fix a write-conflict race-condition that Yoshinori was running into when testing MyRocks with LinkBench.

Test Plan: New tests

Reviewers: yhchiang, spetrunia, rven, igor, yoshinorim, sdong

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D48099
2015-10-09 15:46:16 -07:00
agiardullo
c5f3707d42 DisableIndexing() for Transactions
Summary:
MyRocks reported some perfomance issues when inserting many keys into a transaction due to the cost of inserting new keys into WriteBatchWithIndex.  Frequently, they don't even need the keys to be indexed as they don't need to read them back.  DisableIndexing() can be used to avoid the cost of indexing.

I also plan on eventually investigating if we can improve WriteBatchWithIndex performance.  But even if we improved the perf here, it is still beneficial to be able to disable the indexing all together for large transactions.

Test Plan: unit test

Reviewers: igor, rven, yoshinorim, spetrunia, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D48471
2015-10-09 15:36:09 -07:00
sdong
776bd8d5eb Pass column family ID to table property collector
Summary: Pass column family ID through TablePropertiesCollectorFactory::CreateTablePropertiesCollector() so that users can identify which column family this file is for and handle it differently.

Test Plan: Add unit test scenarios in tests related to table properties collectors to verify the information passed in is correct.

Reviewers: rven, yhchiang, anthony, kradhakrishnan, igor, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: yoshinorim, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D48411
2015-10-09 14:36:51 -07:00
sdong
000836a880 CompactionFilter::Context to contain column family ID
Summary: Add the column family ID to compaction filter context, so it is easier for compaction filter to apply different logic for different column families.

Test Plan: Add a unit test to verify the column family ID passed is correct.

Reviewers: rven, yhchiang, igor

Reviewed By: igor

Subscribers: yoshinorim, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D48357
2015-10-08 11:27:38 -07:00
Yueh-Hsuan Chiang
3a0bf873b5 Change RocksDB version to 4.1
Summary: Change RocksDB version to 4.1

Test Plan: no code change.

Reviewers: sdong, anthony, IslamAbdelRahman, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D48387
2015-10-08 11:15:18 -07:00
Igor Canadi
9803e0d813 compaction_filter.h cleanup
Summary:
Two changes:
1. remove *V2 filter stuff. we deprecated that a while ago
2. clarify what happens when user sets max_subcompactions to bigger than 1

Test Plan: none

Reviewers: yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47871
2015-10-08 09:32:50 -07:00
Islam AbdelRahman
51fa7ecec5 Bytes read/written from cache statistics
Summary: Add 2 new counters BLOCK_CACHE_BYTES_WRITE, BLOCK_CACHE_BYTES_READ to keep track of how many bytes were written to the cache and how many bytes that we read from cache

Test Plan: make check

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48195
2015-10-07 15:17:20 -07:00
dyniusz
a065cdb388 bloom hit/miss stats for SST and memtable
Summary:
	hit and miss bloom filter stats for memtable and SST
	stats added to perf_context struct
	key matches and prefix matches combined into one stat

Test Plan: unit test veryfing the functionality added, see BloomStatsTest in db_test.cc for details

Reviewers: yhchiang, igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47859
2015-10-07 11:23:20 -07:00
Lakshmi Narayanan
4049bcde39 Added boolean variable to guard fallocate() calls
Summary:
Added boolean variable to guard fallocate() calls.
Set to false to prevent space leaks when tests fail.

Test Plan:
Compliles
Set to false and ran log device tests

Reviewers: sdong, lovro, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48027
2015-10-07 10:04:05 -07:00
Igor Canadi
d80ce7f99a Compaction filter on merge operands
Summary:
Since Andres' internship is over, I took over https://reviews.facebook.net/D42555 and rebased and simplified it a bit.

The behavior in this diff is a bit simpler than in D42555:
* only merge operators are passed through FilterMergeValue(). If fitler function returns true, the merge operator is ignored
* compaction filter is *not* called on: 1) results of merge operations and 2) base values that are getting merged with merge operands (the second case was also true in previous diff)

Do we also need a compaction filter to get called on merge results?

Test Plan: make && make check

Reviewers: lovro, tnovak, rven, yhchiang, sdong

Reviewed By: sdong

Subscribers: noetzli, kolmike, leveldb, dhruba, sdong

Differential Revision: https://reviews.facebook.net/D47847
2015-10-07 09:30:03 -07:00
Islam AbdelRahman
9babaeed16 Update dump_tool and undump_tool to accept Options
Summary:
Refactor dump_tool and undump_tool so that it's possible to use them with customized options
for example setting a specific comparator similar to what Dragon is doing with the LdbTool

https://phabricator.fb.com/diffusion/FBCODE/browse/master/dragon/tools/Ldb.cpp

Test Plan:
compiles
used it to dump / undump a dragon shard

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba, adsharma

Differential Revision: https://reviews.facebook.net/D47853
2015-10-05 19:49:48 -07:00
Igor Canadi
115427ef63 Add APIs PauseBackgroundWork() and ContinueBackgroundWork()
Summary:
To support a new MongoDB capability, we need to make sure that we don't do any IO for a short period of time. For background, see:
* https://jira.mongodb.org/browse/SERVER-20704
* https://jira.mongodb.org/browse/SERVER-18899

To implement that, I add a new API calls PauseBackgroundWork() and ContinueBackgroundWork() which reuse the capability we already have in place for RefitLevel() function.

Test Plan: Added a new test in db_test. Made sure that test fails when PauseBackgroundWork() is commented out.

Reviewers: IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47901
2015-10-02 13:17:34 -07:00
agiardullo
03b08ba9a9 Return MergeInProgress when fetching from transactions or WBWI with overwrite_key
Summary:
WriteBatchWithIndex::GetFromBatchAndDB only works correctly for overwrite_key=false.  Transactions use overwrite_key=true (since WriteBatchWithIndex::GetIteratorWithBase only works when overwrite_key=true).  So currently, Transactions could return incorrectly merged results when calling Get/GetForUpdate().

Until a permanent fix can be put in place, Transaction::Get[ForUpdate] and WriteBatchWithIndex::GetFromBatch[AndDB] will now return MergeInProgress if the most recent write to a key in the batch is a Merge.

Test Plan: more tests

Reviewers: sdong, yhchiang, rven, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47817
2015-09-30 11:14:42 -07:00
Yueh-Hsuan Chiang
74b100ac17 RocksDB Options file format and its serialization / deserialization.
Summary:
This patch defines the format of RocksDB options file, which
follows the INI file format, and implements functions for its
serialization and deserialization.  An example RocksDB options
file can be found in examples/rocksdb_option_file_example.ini.

A typical RocksDB options file has three sections, which are
Version, DBOptions, and more than one CFOptions.  The RocksDB
options file in general follows the basic INI file format
with the following extensions / modifications:
 * Escaped characters
   We escaped the following characters:
    - \n -- line feed - new line
    - \r -- carriage return
    - \\ -- backslash \
    - \: -- colon symbol :
    - \# -- hash tag #
 * Comments
   We support # style comments.  Comments can appear at the ending
   part of a line.
 * Statements
   A statement is of the form option_name = value.
   Each statement contains a '=', where extra white-spaces
   are supported. However, we don't support multi-lined statement.
   Furthermore, each line can only contain at most one statement.
 * Section
   Sections are of the form [SecitonTitle "SectionArgument"],
   where section argument is optional.
 * List
   We use colon-separated string to represent a list.
   For instance, n1:n2:n3:n4 is a list containing four values.

Below is an example of a RocksDB options file:

[Version]
  rocksdb_version=4.0.0
  options_file_version=1.0
[DBOptions]
  max_open_files=12345
  max_background_flushes=301
[CFOptions "default"]
[CFOptions "the second column family"]
[CFOptions "the third column family"]

Test Plan: Added many tests in options_test.cc

Reviewers: igor, IslamAbdelRahman, sdong, anthony

Reviewed By: anthony

Subscribers: maykov, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46059
2015-09-29 14:42:40 -07:00
Igor Canadi
1eff1834b2 Remove non-existing functions. Closes #680
Summary: See https://github.com/facebook/rocksdb/issues/680

Test Plan: compiles

Reviewers: sdong, yhchiang, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47721
2015-09-29 09:34:42 -07:00
agiardullo
afe0dc539b SingleDelete support for Transactions
Summary: Transactional SingleDelete is needed for MyRocks.  Note: This diff requires D47529.

Test Plan: Added some new tests in this diff as well as more tests added in D47529

Reviewers: rven, sdong, igor, yhchiang

Reviewed By: yhchiang

Subscribers: yoshinorim, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47535
2015-09-28 12:14:26 -07:00
agiardullo
25fd743d75 Fix SingleDelete support in WriteBatchWithIndex
Summary: Fixed some  bugs in using SingleDelete on a WriteBatchWithIndex and added some tests.

Test Plan: new tests

Reviewers: sdong, yhchiang, rven, kradhakrishnan, IslamAbdelRahman, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47529
2015-09-25 12:23:07 -07:00
Islam AbdelRahman
f03b5c987b Add experimental DB::AddFile() to plug sst files into empty DB
Summary:
This is an initial version of bulk load feature

This diff allow us to create sst files, and then bulk load them later, right now the restrictions for loading an sst file are
(1) Memtables are empty
(2) Added sst files have sequence number = 0, and existing values in database have sequence number = 0
(3) Added sst files values are not overlapping

Test Plan: unit testing

Reviewers: igor, ott, sdong

Reviewed By: sdong

Subscribers: leveldb, ott, dhruba

Differential Revision: https://reviews.facebook.net/D39081
2015-09-23 12:42:43 -07:00
sdong
f1b9f804e9 Add a mode to always pick the oldest file to compact for each level
Summary:
Add options.compaction_pri, which specifies the policy about which file to compact first.
kCompactionPriByLargestSeq will compact oldest files first.
Verified the behavior in db_bench but did not write unit tests yet. Also need to make it settable through option string and dynamically changeable.

Test Plan: Will write unit tests

Reviewers: igor, rven, anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, MarkCallaghan

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D45951
2015-09-21 17:21:59 -07:00
Igor Canadi
199744f4c4 Merge pull request #728 from jsteemann/fix-missing-include-header
add missing header required for std::function
2015-09-21 19:50:00 +02:00
jsteemann
669b892f97 add missing header required for std::function
otherwise Visual Studio will have trouble compiling this file
2015-09-18 22:18:40 +02:00
jsteemann
624ef456dd fixed formatting. thanks @4tXJ7f for pointing me at make format 2015-09-18 22:03:47 +02:00
jsteemann
4704833357 pass input string to WriteBatch() by const reference
this may lead to copying less data (in case compilers don't
optimize away copying the string by themselves)
2015-09-18 20:20:32 +02:00
Andres Noetzli
014fd55adc Support for SingleDelete()
Summary:
This patch fixes #7460559. It introduces SingleDelete as a new database
operation. This operation can be used to delete keys that were never
overwritten (no put following another put of the same key). If an overwritten
key is single deleted the behavior is undefined. Single deletion of a
non-existent key has no effect but multiple consecutive single deletions are
not allowed (see limitations).

In contrast to the conventional Delete() operation, the deletion entry is
removed along with the value when the two are lined up in a compaction. Note:
The semantics are similar to @igor's prototype that allowed to have this
behavior on the granularity of a column family (
https://reviews.facebook.net/D42093 ). This new patch, however, is more
aggressive when it comes to removing tombstones: It removes the SingleDelete
together with the value whenever there is no snapshot between them while the
older patch only did this when the sequence number of the deletion was older
than the earliest snapshot.

Most of the complex additions are in the Compaction Iterator, all other changes
should be relatively straightforward. The patch also includes basic support for
single deletions in db_stress and db_bench.

Limitations:
- Not compatible with cuckoo hash tables
- Single deletions cannot be used in combination with merges and normal
  deletions on the same key (other keys are not affected by this)
- Consecutive single deletions are currently not allowed (and older version of
  this patch supported this so it could be resurrected if needed)

Test Plan: make all check

Reviewers: yhchiang, sdong, rven, anthony, yoshinorim, igor

Reviewed By: igor

Subscribers: maykov, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43179
2015-09-17 11:42:56 -07:00
jsteemann
f8b770a942 fixed typos 2015-09-17 19:58:09 +02:00
Dmytro Okhonko
31a27a3606 Callback for informing backup downloading added
Summary:
In case of huge db backup infromation about progress of downloading would help.
New callback parameter in CreateNewBackup() function will trigger whenever a some amount of data downloaded.
Task: 8057631

Test Plan:
ProgressCallbackDuringBackup test that cover new functionality added to BackupableDBTest tests.
other test succeed as well.

Reviewers: Guenena, benj, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46575
2015-09-15 17:08:30 -07:00
Ari Ekmekji
2b683d4972 Add DBOption.max_subcompaction to option dump
Summary:
RocksDB options can be dumped to the log file, and
up to this point the max_subcompactions option was not included
in this dump. This fixes that.

Test Plan: makek all && make check

Reviewers: MarkCallaghan, igor, noetzli, anthony, yhchiang, sdong

Reviewed By: yhchiang, sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D46971
2015-09-15 11:41:46 -07:00
Yoshinori Matsunobu
4886073174 Adding Slice::difference_offset() function
Summary:
There are some use cases in MyRocks to compare two slices
and to return the first byte where they differ. It may be
useful to add it as a RocksDB Slice function.

Test Plan: db_test

Reviewers: sdong, rven, igor

Reviewed By: igor

Subscribers: jkedgar, dhruba

Differential Revision: https://reviews.facebook.net/D46935
2015-09-15 10:32:42 -07:00
sdong
5de807ac16 Add options.hard_pending_compaction_bytes_limit to stop writes if compaction lagging behind
Summary: Add an option to stop writes if compaction lefts behind. If estimated pending compaction bytes is more than threshold specified by options.hard_pending_compaction_bytes_liimt, writes will stop until compactions are cleared to under the threshold.

Test Plan: Add unit test DBTest.HardLimit

Reviewers: rven, kradhakrishnan, anthony, IslamAbdelRahman, yhchiang, igor

Reviewed By: igor

Subscribers: MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D45999
2015-09-14 12:51:16 -07:00
Siying Dong
592f6bf782 Merge pull request #716 from yuslepukhin/refactor_file_reader_writer_win
Refactor to support file_reader_writer on Windows.
2015-09-14 12:29:01 -07:00
Ari Ekmekji
03ddce9a01 Add counters for L0 stall while L0-L1 compaction is taking place
Summary:
Although there are currently counters to keep track of the
stall caused by having too many L0 files, there is no distinction as
to whether when that stall occurs either (A) L0-L1 compaction is taking
place to try and mitigate it, or (B) no L0-L1 compaction has been scheduled
at the moment. This diff adds a counter for (A) so that the nature of L0
stalls can be better understood.

Test Plan: make all && make check

Reviewers: sdong, igor, anthony, noetzli, yhchiang

Reviewed By: yhchiang

Subscribers: MarkCallaghan, dhruba

Differential Revision: https://reviews.facebook.net/D46749
2015-09-14 11:03:37 -07:00
Dmitri Smirnov
ddc8b44998 Address code review comments both GH and internal
Fix compilation issues on GCC/CLANG
 Address Windows Release test build issues due to Sync
2015-09-11 17:36:48 -07:00
Manuel Ung
aeb4612685 Add counters for seek/next/prev
Summary:
There are currently no statistics on seeks, only on gets. This adds the following counters:

rocksdb.number.db.seek
rocksdb.number.db.next
rocksdb.number.db.prev
(number of calls)

rocksdb.db.iterate.bytes.read
(number of bytes read from key + value using seek/next/prev)

rocksdb.number.keys.seek.found
rocksdb.number.keys.next.found
rocksdb.number.keys.prev.found
(number of calls where seek/next/prev found a value)

Test Plan:
./db_bench -statistics -benchmarks fillrandom,seekrandom -seek_nexts 5
./db_bench -statistics -benchmarks fillrandom,seekrandom -seek_nexts 5 -reverse_iterator

Reviewers: yhchiang, rven, kradhakrishnan, IslamAbdelRahman, MarkCallaghan, sdong, igor

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D46605
2015-09-11 11:37:44 -07:00
Islam AbdelRahman
45e9e4f0bb Refactor NewTableReader to accept TableReaderOptions
Summary:
Refactoring NewTableReader to accept TableReaderOptions
This will make it easier to add new options in the future, for example in this diff https://reviews.facebook.net/D46071

Test Plan: run existing tests

Reviewers: igor, yhchiang, anthony, rven, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D46179
2015-09-11 11:36:33 -07:00
Dmitri Smirnov
30e82d5c41 Refactor to support file_reader_writer on Windows.
Summary. A change https://reviews.facebook.net/differential/diff/224721/
  Has attempted to move common functionality out of platform dependent
  code to a new facility called file_reader_writer.
  This includes:
  - perf counters
  - Buffering
  - RateLimiting

  However, the change did not attempt to refactor Windows code.
  To mitigate, we introduce new quering interfaces such as UseOSBuffer(),
  GetRequiredBufferAlignment() and ReaderWriterForward()
  for pure forwarding where required.
  Introduce WritableFile got a new method Truncate(). This is to communicate
  to the file as to how much data it has on close.
   - When space is pre-allocated on Linux it is filled with zeros implicitly,
    no such thing exist on Windows so we must truncate file on close.
   - When operating in unbuffered mode the last page is filled with zeros but we still want to truncate.

   Previously, Close() would take care of it but now buffer management is shifted to the wrappers and the file has
   no idea about the file true size.

   This means that Close() on the wrapper level must always include
   Truncate() as well as wrapper __dtor should call Close() and
   against double Close().
   Move buffered/unbuffered write logic to the wrapper.
   Utilize Aligned buffer class.
   Adjust tests and implement Truncate() where necessary.
   Come up with reasonable defaults for new virtual interfaces.
   Forward calls for RandomAccessReadAhead class to avoid double
   buffering and locking (double locking in unbuffered mode on WIndows).
2015-09-11 09:57:02 -07:00
Ari Ekmekji
3c37b3cccd Determine boundaries of subcompactions
Summary:
Up to this point, the subcompactions that make up a compaction
job have been divided based on the key range of the L1 files, and each
subcompaction has handled the key range of only one file. However
DBOption.max_subcompactions allows the user to designate how many
subcompactions at most to perform. This patch updates the
CompactionJob::GetSubcompactionBoundaries() to determine these
divisions accordingly based on that option and other input/system factors.

The current approach orders the starting and/or ending keys of certain
compaction input files and then generates a histogram to approximate the
size covered by the key range between each consecutive pair of keys. Then
it groups these ranges into groups so that the sizes are approximately equal
to one another. The approach has also been adapted to work for universal
compaction as well instead of just for level-based compaction as it was before.

These subcompactions are then executed in parallel by locally spawning
threads, one for each. The results are then aggregated and the compaction
completed.

Test Plan: make all && make check

Reviewers: yhchiang, anthony, igor, noetzli, sdong

Reviewed By: sdong

Subscribers: MarkCallaghan, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43269
2015-09-10 13:50:00 -07:00
Igor Canadi
ac9bcb55ce Set max_open_files based on ulimit
Summary: We should never set max_open_files to be bigger than the system's ulimit. Otherwise we will get "Too many open files" errors. See an example in this Travis run: https://travis-ci.org/facebook/rocksdb/jobs/79591566

Test Plan:
make check

I will also verify that max_max_open_files is reasonable.

Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46551
2015-09-10 10:49:28 -07:00
Islam AbdelRahman
f3f2032c46 Release RocksDB 4.0.0
Summary: Release RocksDB 4.0.0

Test Plan: no test

Reviewers: sdong, yhchiang, anthony, rven, kradhakrishnan, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D46515
2015-09-09 16:01:03 -07:00
agiardullo
44b6e99e1b update max_write_buffer_number_to_maintain docblock
Summary: fix comment

Test Plan: n/a

Reviewers: sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46455
2015-09-09 13:36:38 -07:00
agiardullo
aa6eed0c1e Transaction stats
Summary: Added funtions to fetch the number of locked keys in a transaction, the number of pending puts/merge/deletes, and the elapsed time

Test Plan: unit tests

Reviewers: yoshinorim, jkedgar, rven, sdong, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45417
2015-09-09 13:35:53 -07:00
agiardullo
b5b2b75e52 better tuning of arena block size
Summary: Currently, if users didn't set options.arena_block_size, we set "result.arena_block_size = result.write_buffer_size / 10". It makes result.arena_block_size not a multiplier of 4KB, even if options.write_buffer_size is a multiplier of MBs. When calling malloc to arena_block_size, we may waste a small amount of memory for it. We now make the default to be /8 or /16 and align it to 4KB.

Test Plan: unit tests

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46467
2015-09-08 20:53:32 -07:00
agiardullo
5e94f68f35 TransactionDB Custom Locking API
Summary:
Prototype of API to allow MyRocks to override default Mutex/CondVar used by transactions with their own implementations.  They would simply need to pass their own implementations of Mutex/CondVar to the templated TransactionDB::Open().

Default implementation of TransactionDBMutex/TransactionDBCondVar provided (but the code is not currently changed to use this).

Let me know if this API makes sense or if it should be changed

Test Plan: n/a

Reviewers: yhchiang, rven, igor, sdong, spetrunia

Reviewed By: spetrunia

Subscribers: maykov, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43761
2015-09-08 17:03:57 -07:00
Andres Noetzli
6bdc484fd8 Added Equal method to Comparator interface
Summary:
In some cases, equality comparisons can be done more efficiently than three-way
comparisons. There are quite a few places in the code where we only care about
equality. This patch adds an Equal() method that defaults to using the
Compare() method.

Test Plan: make clean all check

Reviewers: rven, anthony, yhchiang, igor, sdong

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46233
2015-09-08 15:30:49 -07:00
Mayank Pundir
d9f42aa60d Adding a verifyBackup method to BackupEngine
Summary: This diff adds a verifyBackup method to BackupEngine. The method verifies the name and size of each file in the backup.

Test Plan: Unit test cases created and passing.

Reviewers: igor, benj

Subscribers: zelaine.fong, yhchiang, sdong, lgalanis, dhruba, AaronFeldman

Differential Revision: https://reviews.facebook.net/D46029
2015-09-03 17:27:21 -07:00
Paul Marinescu
50dc5f0c5a Replace BackupRateLimiter with GenericRateLimiter
Summary: BackupRateLimiter removed and uses replaced with the existing GenericRateLimiter

Test Plan:
make all check

make clean
USE_CLANG=1 make all

make clean
OPT=-DROCKSDB_LITE make release

Reviewers: leveldb, igor

Reviewed By: igor

Subscribers: igor, dhruba

Differential Revision: https://reviews.facebook.net/D46095
2015-09-03 17:00:09 -07:00
agiardullo
0f1aab6c12 Add SetLockTimeout for Transactions
Summary: MyRocks wants to be able to change the lock timeout of a transaction that has already started.  Expose existing SetLockTimeout function to users.

Test Plan: unit test

Reviewers: spetrunia, rven, sdong, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45987
2015-09-02 20:07:19 -07:00
Andres Noetzli
3c9cef1eed Unified maps with Comparator for sorting, other cleanup
Summary:
This diff is a collection of cleanups that were initially part of D43179.
Additionally it adds a unified way of defining key-value maps that use a
Comparator for sorting (this was previously implemented in four different
places).

Test Plan: make clean check all

Reviewers: rven, anthony, yhchiang, sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45993
2015-09-02 13:58:22 -07:00
Dmitri Smirnov
f14c3363e1 Make WinEnv::NowMicros return system time
Previous change for the function
  555ca3e7b7 (diff-bdc04e0404c2db4fd3ac5118a63eaa4a)
  made use of the QueryPerformanceCounter to return microseconds values that do not repeat
  as std::chrono::system_clock returned values that made auto_roll_logger_test fail.

 The interface documentation does not state that we need to return
 system time describing the return value as a number of microsecs since some
 moment in time. However, because on Linux it is implemented using gettimeofday
 various pieces of code (such as GenericRateLimiter) took advantage of that
 and make use of NowMicros() as a system timestamp. Thus the previous change
 broke rate_limiter_test on Windows.

 In addition, the interface name NowMicros() suggests that it is actually
 a timestamp so people use it as such.

 This change makes use of the new system call on Windows that returns
 system time with required precision. This change preserves the fix
 for  auto_roll_logger_test and fixes rate_limiter_test.

 Note that DBTest.RateLimitingTest still fails due to a separately reported issue.
2015-09-02 11:12:07 -07:00
agiardullo
77a28615ec Support static Status messages
Summary: Provide a way to specify a detailed static error message for a Status without incurring a memcpy.  Let me know what people think of this approach.

Test Plan: added simple test

Reviewers: igor, yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D44259
2015-08-31 16:13:29 -07:00
sdong
7a0dbdf3ac Add ZSTD (not final format) compression type
Summary: Add ZSTD compression type. The same way as adding LZ4.

Test Plan: run all tests. Generate files in db_bench. Make sure reads succeed. But the SST files cannot be opened in older versions. Also some other adhoc tests.

Reviewers: rven, anthony, IslamAbdelRahman, kradhakrishnan, igor

Reviewed By: igor

Subscribers: MarkCallaghan, maykov, yoshinorim, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D45747
2015-08-28 11:01:13 -07:00
Javier González
0886f4f66a Helper functions to support direct IO
Summary:
This patch adds the helper functions and variables to allow a backend
implementing WritableFile to support direct IO when persisting a
memtable.

Test Plan:
Since there is no upstream implementation of WritableFile supporting
direct IO, the new behavior is disabled.

Tests should be provided by the backend implementing WritableFile.
2015-08-27 08:36:56 +02:00
Yueh-Hsuan Chiang
1fb2abae2d ColumnFamilyOptions serialization / deserialization.
Summary:
This patch adds GetStringFromColumnFamilyOptions(), the inverse function
of the existing GetColumnFamilyOptionsFromString(), and improves
the implementation of GetColumnFamilyOptionsFromString().

Test Plan: Add a test in options_test.cc

Reviewers: igor, sdong, anthony, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: noetzli, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45009
2015-08-26 16:13:56 -07:00
Igor Canadi
5f4166c90e ReadaheadRandomAccessFile -- userspace readahead
Summary:
ReadaheadRandomAccessFile acts as a transparent layer on top of RandomAccessFile. When a Read() request is issued, it issues a much bigger request to the OS and caches the result. When a new request comes in and we already have the data cached, it doesn't have to issue any requests to the OS.

We add ReadaheadRandomAccessFile layer only when file is read during compactions.

D45105 was incorrectly closed by Phabricator because I committed it to a separate branch (not master), so I'm resubmitting the diff.

Test Plan: make check

Reviewers: MarkCallaghan, sdong

Reviewed By: sdong

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D45123
2015-08-26 15:25:59 -07:00
Yueh-Hsuan Chiang
9ccf1bd3e2 Correct the comment for GetProperty() API.
Summary:
"rocksdb.aggregated-table-properties" and "rocksdb.aggregated-table-properties-at-level<N>"
should belong to GetProperty() instead of GetIntProperty(), but the comment mistakenly
classifies them to GetIntProperty().

This patch fix this comment error.

Test Plan: no code change.

Reviewers: sdong, anthony, IslamAbdelRahman, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D45561
2015-08-25 16:45:23 -07:00
Andres Notzli
09d982f9e0 Fix compact_files_example
Summary:
See task #7983654. The example was triggering an assert in compaction job
because the compaction was not marked as manual. With this patch,
CompactionPicker::FormCompaction() marks compactions as manual. This patch
also fixes a couple of typos, adds optimistic_transaction_example to
.gitignore and librocksdb as a dependency for examples. Adding librocksdb as
a dependency makes sure that the examples are built with the latest changes
in librocksdb.

Test Plan: make clean && cd examples && make all && ./compact_files_example

Reviewers: rven, sdong, anthony, igor, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45117
2015-08-25 12:29:44 -07:00
Yueh-Hsuan Chiang
6996de87af Expose per-level aggregated table properties via GetProperty()
Summary:
This patch adds "rocksdb.aggregated-table-properties"
and "rocksdb.aggregated-table-properties-at-levelN", the former
returns the aggreated table properties of a column family,
while the later returns the aggregated table properties
of the specified level N.

Test Plan: Added tests in db_test

Reviewers: igor, sdong, IslamAbdelRahman, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45087
2015-08-25 12:03:54 -07:00