Commit Graph

941 Commits

Author SHA1 Message Date
Dmitri Smirnov
ddc8b44998 Address code review comments both GH and internal
Fix compilation issues on GCC/CLANG
 Address Windows Release test build issues due to Sync
2015-09-11 17:36:48 -07:00
Manuel Ung
aeb4612685 Add counters for seek/next/prev
Summary:
There are currently no statistics on seeks, only on gets. This adds the following counters:

rocksdb.number.db.seek
rocksdb.number.db.next
rocksdb.number.db.prev
(number of calls)

rocksdb.db.iterate.bytes.read
(number of bytes read from key + value using seek/next/prev)

rocksdb.number.keys.seek.found
rocksdb.number.keys.next.found
rocksdb.number.keys.prev.found
(number of calls where seek/next/prev found a value)

Test Plan:
./db_bench -statistics -benchmarks fillrandom,seekrandom -seek_nexts 5
./db_bench -statistics -benchmarks fillrandom,seekrandom -seek_nexts 5 -reverse_iterator

Reviewers: yhchiang, rven, kradhakrishnan, IslamAbdelRahman, MarkCallaghan, sdong, igor

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D46605
2015-09-11 11:37:44 -07:00
Islam AbdelRahman
45e9e4f0bb Refactor NewTableReader to accept TableReaderOptions
Summary:
Refactoring NewTableReader to accept TableReaderOptions
This will make it easier to add new options in the future, for example in this diff https://reviews.facebook.net/D46071

Test Plan: run existing tests

Reviewers: igor, yhchiang, anthony, rven, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D46179
2015-09-11 11:36:33 -07:00
Dmitri Smirnov
30e82d5c41 Refactor to support file_reader_writer on Windows.
Summary. A change https://reviews.facebook.net/differential/diff/224721/
  Has attempted to move common functionality out of platform dependent
  code to a new facility called file_reader_writer.
  This includes:
  - perf counters
  - Buffering
  - RateLimiting

  However, the change did not attempt to refactor Windows code.
  To mitigate, we introduce new quering interfaces such as UseOSBuffer(),
  GetRequiredBufferAlignment() and ReaderWriterForward()
  for pure forwarding where required.
  Introduce WritableFile got a new method Truncate(). This is to communicate
  to the file as to how much data it has on close.
   - When space is pre-allocated on Linux it is filled with zeros implicitly,
    no such thing exist on Windows so we must truncate file on close.
   - When operating in unbuffered mode the last page is filled with zeros but we still want to truncate.

   Previously, Close() would take care of it but now buffer management is shifted to the wrappers and the file has
   no idea about the file true size.

   This means that Close() on the wrapper level must always include
   Truncate() as well as wrapper __dtor should call Close() and
   against double Close().
   Move buffered/unbuffered write logic to the wrapper.
   Utilize Aligned buffer class.
   Adjust tests and implement Truncate() where necessary.
   Come up with reasonable defaults for new virtual interfaces.
   Forward calls for RandomAccessReadAhead class to avoid double
   buffering and locking (double locking in unbuffered mode on WIndows).
2015-09-11 09:57:02 -07:00
Ari Ekmekji
3c37b3cccd Determine boundaries of subcompactions
Summary:
Up to this point, the subcompactions that make up a compaction
job have been divided based on the key range of the L1 files, and each
subcompaction has handled the key range of only one file. However
DBOption.max_subcompactions allows the user to designate how many
subcompactions at most to perform. This patch updates the
CompactionJob::GetSubcompactionBoundaries() to determine these
divisions accordingly based on that option and other input/system factors.

The current approach orders the starting and/or ending keys of certain
compaction input files and then generates a histogram to approximate the
size covered by the key range between each consecutive pair of keys. Then
it groups these ranges into groups so that the sizes are approximately equal
to one another. The approach has also been adapted to work for universal
compaction as well instead of just for level-based compaction as it was before.

These subcompactions are then executed in parallel by locally spawning
threads, one for each. The results are then aggregated and the compaction
completed.

Test Plan: make all && make check

Reviewers: yhchiang, anthony, igor, noetzli, sdong

Reviewed By: sdong

Subscribers: MarkCallaghan, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43269
2015-09-10 13:50:00 -07:00
Igor Canadi
ac9bcb55ce Set max_open_files based on ulimit
Summary: We should never set max_open_files to be bigger than the system's ulimit. Otherwise we will get "Too many open files" errors. See an example in this Travis run: https://travis-ci.org/facebook/rocksdb/jobs/79591566

Test Plan:
make check

I will also verify that max_max_open_files is reasonable.

Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46551
2015-09-10 10:49:28 -07:00
Islam AbdelRahman
f3f2032c46 Release RocksDB 4.0.0
Summary: Release RocksDB 4.0.0

Test Plan: no test

Reviewers: sdong, yhchiang, anthony, rven, kradhakrishnan, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D46515
2015-09-09 16:01:03 -07:00
agiardullo
44b6e99e1b update max_write_buffer_number_to_maintain docblock
Summary: fix comment

Test Plan: n/a

Reviewers: sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46455
2015-09-09 13:36:38 -07:00
agiardullo
aa6eed0c1e Transaction stats
Summary: Added funtions to fetch the number of locked keys in a transaction, the number of pending puts/merge/deletes, and the elapsed time

Test Plan: unit tests

Reviewers: yoshinorim, jkedgar, rven, sdong, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45417
2015-09-09 13:35:53 -07:00
agiardullo
b5b2b75e52 better tuning of arena block size
Summary: Currently, if users didn't set options.arena_block_size, we set "result.arena_block_size = result.write_buffer_size / 10". It makes result.arena_block_size not a multiplier of 4KB, even if options.write_buffer_size is a multiplier of MBs. When calling malloc to arena_block_size, we may waste a small amount of memory for it. We now make the default to be /8 or /16 and align it to 4KB.

Test Plan: unit tests

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46467
2015-09-08 20:53:32 -07:00
agiardullo
5e94f68f35 TransactionDB Custom Locking API
Summary:
Prototype of API to allow MyRocks to override default Mutex/CondVar used by transactions with their own implementations.  They would simply need to pass their own implementations of Mutex/CondVar to the templated TransactionDB::Open().

Default implementation of TransactionDBMutex/TransactionDBCondVar provided (but the code is not currently changed to use this).

Let me know if this API makes sense or if it should be changed

Test Plan: n/a

Reviewers: yhchiang, rven, igor, sdong, spetrunia

Reviewed By: spetrunia

Subscribers: maykov, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43761
2015-09-08 17:03:57 -07:00
Andres Noetzli
6bdc484fd8 Added Equal method to Comparator interface
Summary:
In some cases, equality comparisons can be done more efficiently than three-way
comparisons. There are quite a few places in the code where we only care about
equality. This patch adds an Equal() method that defaults to using the
Compare() method.

Test Plan: make clean all check

Reviewers: rven, anthony, yhchiang, igor, sdong

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46233
2015-09-08 15:30:49 -07:00
Mayank Pundir
d9f42aa60d Adding a verifyBackup method to BackupEngine
Summary: This diff adds a verifyBackup method to BackupEngine. The method verifies the name and size of each file in the backup.

Test Plan: Unit test cases created and passing.

Reviewers: igor, benj

Subscribers: zelaine.fong, yhchiang, sdong, lgalanis, dhruba, AaronFeldman

Differential Revision: https://reviews.facebook.net/D46029
2015-09-03 17:27:21 -07:00
Paul Marinescu
50dc5f0c5a Replace BackupRateLimiter with GenericRateLimiter
Summary: BackupRateLimiter removed and uses replaced with the existing GenericRateLimiter

Test Plan:
make all check

make clean
USE_CLANG=1 make all

make clean
OPT=-DROCKSDB_LITE make release

Reviewers: leveldb, igor

Reviewed By: igor

Subscribers: igor, dhruba

Differential Revision: https://reviews.facebook.net/D46095
2015-09-03 17:00:09 -07:00
agiardullo
0f1aab6c12 Add SetLockTimeout for Transactions
Summary: MyRocks wants to be able to change the lock timeout of a transaction that has already started.  Expose existing SetLockTimeout function to users.

Test Plan: unit test

Reviewers: spetrunia, rven, sdong, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45987
2015-09-02 20:07:19 -07:00
Andres Noetzli
3c9cef1eed Unified maps with Comparator for sorting, other cleanup
Summary:
This diff is a collection of cleanups that were initially part of D43179.
Additionally it adds a unified way of defining key-value maps that use a
Comparator for sorting (this was previously implemented in four different
places).

Test Plan: make clean check all

Reviewers: rven, anthony, yhchiang, sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45993
2015-09-02 13:58:22 -07:00
Dmitri Smirnov
f14c3363e1 Make WinEnv::NowMicros return system time
Previous change for the function
  555ca3e7b7 (diff-bdc04e0404c2db4fd3ac5118a63eaa4a)
  made use of the QueryPerformanceCounter to return microseconds values that do not repeat
  as std::chrono::system_clock returned values that made auto_roll_logger_test fail.

 The interface documentation does not state that we need to return
 system time describing the return value as a number of microsecs since some
 moment in time. However, because on Linux it is implemented using gettimeofday
 various pieces of code (such as GenericRateLimiter) took advantage of that
 and make use of NowMicros() as a system timestamp. Thus the previous change
 broke rate_limiter_test on Windows.

 In addition, the interface name NowMicros() suggests that it is actually
 a timestamp so people use it as such.

 This change makes use of the new system call on Windows that returns
 system time with required precision. This change preserves the fix
 for  auto_roll_logger_test and fixes rate_limiter_test.

 Note that DBTest.RateLimitingTest still fails due to a separately reported issue.
2015-09-02 11:12:07 -07:00
agiardullo
77a28615ec Support static Status messages
Summary: Provide a way to specify a detailed static error message for a Status without incurring a memcpy.  Let me know what people think of this approach.

Test Plan: added simple test

Reviewers: igor, yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D44259
2015-08-31 16:13:29 -07:00
sdong
7a0dbdf3ac Add ZSTD (not final format) compression type
Summary: Add ZSTD compression type. The same way as adding LZ4.

Test Plan: run all tests. Generate files in db_bench. Make sure reads succeed. But the SST files cannot be opened in older versions. Also some other adhoc tests.

Reviewers: rven, anthony, IslamAbdelRahman, kradhakrishnan, igor

Reviewed By: igor

Subscribers: MarkCallaghan, maykov, yoshinorim, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D45747
2015-08-28 11:01:13 -07:00
Javier González
0886f4f66a Helper functions to support direct IO
Summary:
This patch adds the helper functions and variables to allow a backend
implementing WritableFile to support direct IO when persisting a
memtable.

Test Plan:
Since there is no upstream implementation of WritableFile supporting
direct IO, the new behavior is disabled.

Tests should be provided by the backend implementing WritableFile.
2015-08-27 08:36:56 +02:00
Yueh-Hsuan Chiang
1fb2abae2d ColumnFamilyOptions serialization / deserialization.
Summary:
This patch adds GetStringFromColumnFamilyOptions(), the inverse function
of the existing GetColumnFamilyOptionsFromString(), and improves
the implementation of GetColumnFamilyOptionsFromString().

Test Plan: Add a test in options_test.cc

Reviewers: igor, sdong, anthony, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: noetzli, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45009
2015-08-26 16:13:56 -07:00
Igor Canadi
5f4166c90e ReadaheadRandomAccessFile -- userspace readahead
Summary:
ReadaheadRandomAccessFile acts as a transparent layer on top of RandomAccessFile. When a Read() request is issued, it issues a much bigger request to the OS and caches the result. When a new request comes in and we already have the data cached, it doesn't have to issue any requests to the OS.

We add ReadaheadRandomAccessFile layer only when file is read during compactions.

D45105 was incorrectly closed by Phabricator because I committed it to a separate branch (not master), so I'm resubmitting the diff.

Test Plan: make check

Reviewers: MarkCallaghan, sdong

Reviewed By: sdong

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D45123
2015-08-26 15:25:59 -07:00
Yueh-Hsuan Chiang
9ccf1bd3e2 Correct the comment for GetProperty() API.
Summary:
"rocksdb.aggregated-table-properties" and "rocksdb.aggregated-table-properties-at-level<N>"
should belong to GetProperty() instead of GetIntProperty(), but the comment mistakenly
classifies them to GetIntProperty().

This patch fix this comment error.

Test Plan: no code change.

Reviewers: sdong, anthony, IslamAbdelRahman, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D45561
2015-08-25 16:45:23 -07:00
Andres Notzli
09d982f9e0 Fix compact_files_example
Summary:
See task #7983654. The example was triggering an assert in compaction job
because the compaction was not marked as manual. With this patch,
CompactionPicker::FormCompaction() marks compactions as manual. This patch
also fixes a couple of typos, adds optimistic_transaction_example to
.gitignore and librocksdb as a dependency for examples. Adding librocksdb as
a dependency makes sure that the examples are built with the latest changes
in librocksdb.

Test Plan: make clean && cd examples && make all && ./compact_files_example

Reviewers: rven, sdong, anthony, igor, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45117
2015-08-25 12:29:44 -07:00
Yueh-Hsuan Chiang
6996de87af Expose per-level aggregated table properties via GetProperty()
Summary:
This patch adds "rocksdb.aggregated-table-properties"
and "rocksdb.aggregated-table-properties-at-levelN", the former
returns the aggreated table properties of a column family,
while the later returns the aggregated table properties
of the specified level N.

Test Plan: Added tests in db_test

Reviewers: igor, sdong, IslamAbdelRahman, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45087
2015-08-25 12:03:54 -07:00
Ari Ekmekji
b6def58f73 Changed 'num_subcompactions' to the more accurate 'max_subcompactions'
Summary:
Up until this point we had DbOptions.num_subcompactions, but
it is semantically more correct to call this max_subcompactions since
we will schedule *up to* DbOptions.max_subcompactions smaller compactions
at a time during a compaction job.

I also added a --subcompactions option to db_bench

Test Plan: make all   make check

Reviewers: sdong, igor, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D45069
2015-08-21 14:25:34 -07:00
sdong
9130873a13 Add options.new_table_reader_for_compaction_inputs
Summary: Currently compaction inputs share the same file descriptor and table reader as other foreground threads. It makes fadvise works less predictable. Add options.new_table_reader_for_compaction_inputs to enforce to create a new file descriptor and new table reader for it.

Test Plan: Add the option.

Reviewers: rven, anthony, kradhakrishnan, IslamAbdelRahman, igor, yhchiang

Reviewed By: igor

Subscribers: igor, MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D43311
2015-08-21 08:46:29 -07:00
sdong
07d2d34160 Add a counter about estimated pending compaction bytes
Summary:
Add a counter of estimated bytes the DB needs to compact for all the compactions to finish. Expose it as a DB Property.
In the future, we can use threshold of this counter to replace soft rate limit and hard rate limit. A single threshold of estimated compaction debt in bytes will be easier for users to reason about when should slow down and stopping than more abstract soft and hard rate limits.

Test Plan: Add unit tests

Reviewers: IslamAbdelRahman, yhchiang, rven, kradhakrishnan, anthony, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D44205
2015-08-20 22:17:10 -07:00
Islam AbdelRahman
027ca5b2cd Total SST files size DB Property
Summary: Add a new DB property that calculate the total size of files used by all RocksDB Versions

Test Plan: Unittests for the new property

Reviewers: igor, yhchiang, anthony, rven, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D44799
2015-08-20 11:47:19 -07:00
Islam AbdelRahman
3fd70b05b8 Rate limit deletes issued by DestroyDB
Summary: Update DestroyDB so that all SST files in the first path id go through DeleteScheduler instead of being deleted immediately

Test Plan: added a unittest

Reviewers: igor, yhchiang, anthony, kradhakrishnan, rven, sdong

Reviewed By: sdong

Subscribers: jeanxu2012, dhruba

Differential Revision: https://reviews.facebook.net/D44955
2015-08-19 15:02:17 -07:00
Yueh-Hsuan Chiang
df79eafcb3 Introduce GetIntProperty("rocksdb.size-all-mem-tables")
Summary:
Currently, GetIntProperty("rocksdb.cur-size-all-mem-tables") only returns
the memory usage by those memtables which have not yet been flushed.

This patch introduces GetIntProperty("rocksdb.size-all-mem-tables"),
which includes the memory usage by all the memtables, includes those
have been flushed but pinned by iterators.

Test Plan: Added a test in db_test

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D44229
2015-08-19 13:32:09 -07:00
Yueh-Hsuan Chiang
9ec9571593 DBOptions serialization and deserialization
Summary:
This patch implements DBOptions deserialization and improve
the current implementation of DBOptions serialization by
using a static structure that stores the offset of each
DBOptions member variables to perform serialization and
deserialization instead of using tons of if-then-branch
to determine the mapping between string and variables.

Test Plan: Added test in options_test.cc

Reviewers: igor, anthony, sdong, IslamAbdelRahman

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D44097
2015-08-18 13:30:18 -07:00
Ari Ekmekji
f0da6977a3 [Parallel L0-L1 Compaction Prep]: Giving Subcompactions Their Own State
Summary:
In prepration for running multiple threads at the same time during
a compaction job, this patch assigns each subcompaction its own state
(instead of sharing the one global CompactionState). Each subcompaction then
uses this state to update its statistics, keep track of its snapshots, etc.
during the course of execution. Then at the end of all the executions the
statistics are aggregated across the subcompactions so that the final result
is the same as if only one larger compaction had run.

Test Plan: ./db_test  ./db_compaction_test  ./compaction_job_test

Reviewers: sdong, anthony, igor, noetzli, yhchiang

Reviewed By: yhchiang

Subscribers: MarkCallaghan, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43239
2015-08-18 11:06:23 -07:00
Andres Notzli
f32a572099 Simplify querying of merge results
Summary:
While working on supporting mixing merge operators with
single deletes ( https://reviews.facebook.net/D43179 ),
I realized that returning and dealing with merge results
can be made simpler. Submitting this as a separate diff
because it is not directly related to single deletes.

Before, callers of merge helper had to retrieve the merge
result in one of two ways depending on whether the merge
was successful or not (success = result of merge was single
kTypeValue). For successful merges, the caller could query
the resulting key/value pair and for unsuccessful merges,
the result could be retrieved in the form of two deques of
keys and values. However, with single deletes, a successful merge
does not return a single key/value pair (if merge
operands are merged with a single delete, we have to generate
a value and keep the original single delete around to make
sure that we are not accidentially producing a key overwrite).
In addition, the two existing call sites of the merge
helper were taking the same actions independently from whether
the merge was successful or not, so this patch simplifies that.

Test Plan: make clean all check

Reviewers: rven, sdong, yhchiang, anthony, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43353
2015-08-17 17:34:38 -07:00
sdong
603b6da8b8 Add options.compaction_measure_io_stats to print write I/O stats in compactions
Summary:
Add options.compaction_measure_io_stats to print out / pass to listener accumulated time spent on write calls. Example outputs in info logs:

2015/08/12-16:27:59.463944 7fd428bff700 (Original Log Time 2015/08/12-16:27:59.463922) EVENT_LOG_v1 {"time_micros": 1439422079463897, "job": 6, "event": "compaction_finished", "output_level": 1, "num_output_files": 4, "total_output_size": 6900525, "num_input_records": 111483, "num_output_records": 106877, "file_write_nanos": 15663206, "file_range_sync_nanos": 649588, "file_fsync_nanos": 349614797, "file_prepare_write_nanos": 1505812, "lsm_state": [2, 4, 0, 0, 0, 0, 0]}

Add two more counters in iostats_context.

Also add a parameter of db_bench.

Test Plan: Add a unit test. Also manually verify LOG outputs in db_bench

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D44115
2015-08-13 16:52:26 -07:00
Islam AbdelRahman
dc9d5634fd Change master to 3.14
Summary: Change master version to 3.14

Test Plan: simple change

Reviewers: sdong, yhchiang, kradhakrishnan, rven, anthony, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D44187
2015-08-13 11:17:10 -07:00
Islam AbdelRahman
3bd9db420e [Cleanup] Remove RandomRWFile
Summary: RandomRWFile is not used anywhere in out code base, this patch remove RandomRWFile

Test Plan:
make check -j64
USE_CLANG=1 make all -j64
OPT=-DROCKSDB_LITE make release -j64

Reviewers: sdong, yhchiang, anthony, kradhakrishnan, rven, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D44091
2015-08-12 10:18:59 -07:00
agiardullo
c3466eab07 Have Transactions use WriteBatch::RollbackToSavePoint
Summary:
Clean up transactions to use the new RollbackToSavePoint api in WriteBatchWithIndex.

Note, this diff depends on Pessimistic Transactions diff and ManagedSnapshot diff (D40869 and D43293).

Test Plan: unit tests

Reviewers: rven, yhchiang, kradhakrishnan, spetrunia, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43371
2015-08-11 17:53:30 -07:00
agiardullo
0db807ec28 Transaction error statuses
Summary:
Based on feedback from spetrunia, we should better differentiate error statuses for transaction failures.

https://github.com/MySQLOnRocksDB/mysql-5.6/issues/86#issuecomment-124605954

Test Plan: unit tests

Reviewers: rven, kradhakrishnan, spetrunia, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43323
2015-08-11 17:52:56 -07:00
agiardullo
c2f2cb0214 Pessimistic Transactions
Summary:
Initial implementation of Pessimistic Transactions.  This diff contains the api changes discussed in D38913.  This diff is pretty large, so let me know if people would prefer to meet up to discuss it.

MyRocks folks:  please take a look at the API in include/rocksdb/utilities/transaction[_db].h and let me know if you have any issues.

Also, you'll notice a couple of TODOs in the implementation of RollbackToSavePoint().  After chatting with Siying, I'm going to send out a separate diff for an alternate implementation of this feature that implements the rollback inside of WriteBatch/WriteBatchWithIndex.  We can then decide which route is preferable.

Next, I'm planning on doing some perf testing and then integrating this diff into MongoRocks for further testing.

Test Plan: Unit tests, db_bench parallel testing.

Reviewers: igor, rven, sdong, yhchiang, yoshinorim

Reviewed By: sdong

Subscribers: hermanlee4, maykov, spetrunia, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D40869
2015-08-11 17:52:23 -07:00
Islam AbdelRahman
cee1e8a080 Parallelize LoadTableHandlers
Summary: Add a new option that all LoadTableHandlers to use multiple threads to load files on DB Open and Recover

Test Plan:
make check -j64
COMPILE_WITH_TSAN=1 make check -j64
DISABLE_JEMALLOC=1 make all valgrind_check -j64 (still running)

Reviewers: yhchiang, anthony, rven, kradhakrishnan, igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D43755
2015-08-11 12:19:56 -07:00
Andres Notzli
68f934355a Better CompactionJob testing
Summary:
Changed compaction_job_test to support better/more thorough
tests and added two tests. Also changed MockFileContents
to order using InternalKeyComparator.

Test Plan: make compaction_job_test && ./compaction_job_test; make all && make check

Reviewers: sdong, rven, igor, yhchiang, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42837
2015-08-07 21:59:51 -07:00
agiardullo
16ea1c7d1c simple ManagedSnapshot wrapper
Summary: Implemented this simple wrapper for something else I was working on.  Seemed like it makes sense to expose it instead of burying it in some random code.

Test Plan: added test

Reviewers: rven, kradhakrishnan, sdong, yhchiang

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43293
2015-08-06 17:59:05 -07:00
sdong
c7742452eb Add Statistics.getHistogramString() to print more detailed outputs of a histogram
Summary:
Provide a way for users to know more detailed ditribution of a histogram metrics. Example outputs:

Manually add statement
  fprintf(stdout, "%s\n", dbstats->getHistogramString(SST_READ_MICROS).c_str());
Will print out something like:

Count: 989151  Average: 1.7659  StdDev: 1.52
Min: 0.0000  Median: 1.2071  Max: 860.0000
Percentiles: P50: 1.21 P75: 1.70 P99: 5.12 P99.9: 13.67 P99.99: 21.70
------------------------------------------------------
[       0,       1 )   390839  39.513%  39.513% ########
[       1,       2 )   500918  50.641%  90.154% ##########
[       2,       3 )    79358   8.023%  98.177% ##
[       3,       4 )     6297   0.637%  98.813%
[       4,       5 )     1712   0.173%  98.986%
[       5,       6 )     1134   0.115%  99.101%
[       6,       7 )     1222   0.124%  99.224%
[       7,       8 )     1529   0.155%  99.379%
[       8,       9 )     1264   0.128%  99.507%
[       9,      10 )      988   0.100%  99.607%
[      10,      12 )     1378   0.139%  99.746%
[      12,      14 )     1828   0.185%  99.931%
[      14,      16 )      410   0.041%  99.972%
[      16,      18 )       72   0.007%  99.980%
[      18,      20 )       67   0.007%  99.986%
[      20,      25 )      106   0.011%  99.997%
[      25,      30 )       24   0.002%  99.999%
[      30,      35 )        1   0.000% 100.000%
[     250,     300 )        2   0.000% 100.000%
[     300,     350 )        1   0.000% 100.000%
[     800,     900 )        1   0.000% 100.000%

Test Plan: Manually add a print in db_bench and make sure it prints out as expected. Will add some codes to cover the function

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D43611
2015-08-05 20:05:56 -07:00
Poornima Chozhiyath Raman
960d936e83 Add function 'GetInfoLogList()'
Summary: The list of info log files of a db can be obtained using the new function.

Test Plan: New test in db_test.cc passed.

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: IslamAbdelRahman, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D41715
2015-08-05 16:16:46 -07:00
sdong
3ae386eafe Add statistic histogram "rocksdb.sst.read.micros"
Summary: Measure read latency histogram and put in statistics. Compaction inputs are excluded from it when possible (unfortunately usually no possible as we usually take table reader from table cache.

Test Plan:
Run db_bench and it shows the stats, like:

rocksdb.sst.read.micros statistics Percentiles :=> 50 : 1.238522 95 : 2.529740 99 : 3.912180

Reviewers: kradhakrishnan, rven, anthony, IslamAbdelRahman, MarkCallaghan, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D43275
2015-08-05 13:02:33 -07:00
Andres Notzli
c465071029 Removing duplicate code
Summary:
While working on https://reviews.facebook.net/D43179 , I found
duplicate code in the tests. This patch removes it.

Test Plan: make clean all check

Reviewers: igor, sdong, rven, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43263
2015-08-05 07:33:27 -07:00
Mike Kolupaev
e06cf1a098 [wal changes 3/3] method in DB to sync WAL without blocking writers
Summary:
Subj. We really need this feature.

Previous diff D40899 has most of the changes to make this possible, this diff just adds the method.

Test Plan: `make check`, the new test fails without this diff; ran with ASAN, TSAN and valgrind.

Reviewers: igor, rven, IslamAbdelRahman, anthony, kradhakrishnan, tnovak, yhchiang, sdong

Reviewed By: sdong

Subscribers: MarkCallaghan, maykov, hermanlee4, yoshinorim, tnovak, dhruba

Differential Revision: https://reviews.facebook.net/D40905
2015-08-05 06:06:39 -07:00
Ari Ekmekji
5dc3e6881a Update Tests To Enable Subcompactions
Summary:
Updated DBTest DBCompactionTest and CompactionJobStatsTest
to run compaction-related tests once with subcompactions enabled and
once disabled using the TEST_P test type in the Google Test suite.

Test Plan: ./db_test  ./db_compaction-test  ./compaction_job_stats_test

Reviewers: sdong, igor, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43443
2015-08-04 22:19:07 -07:00
Islam AbdelRahman
c45a57b41e Support delete rate limiting
Summary:
Introduce DeleteScheduler that allow enforcing a rate limit on file deletion
Instead of deleting files immediately, files are moved to trash directory and deleted in a background thread that apply sleep penalty between deletes if needed.

I have updated PurgeObsoleteFiles and PurgeObsoleteWALFiles to use the delete_scheduler instead of env_->DeleteFile

Test Plan:
added delete_scheduler_test
existing unit tests

Reviewers: kradhakrishnan, anthony, rven, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D43221
2015-08-04 20:45:27 -07:00
Yueh-Hsuan Chiang
24daff6d7a Fix a typo and update HISTORY.md for NewCompactOnDeletionCollectorFactory().
Summary: Fix a typo and update HISTORY.md for NewCompactOnDeletionCollectorFactory().

Test Plan: no code change.

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43521
2015-08-04 13:51:05 -07:00
Yueh-Hsuan Chiang
14d0bfa429 Add DBOptions::skip_sats_update_on_db_open
Summary:
UpdateAccumulatedStats() is used to optimize compaction decision
esp. when the number of deletion entries are high, but this function
can slowdown DBOpen esp. in disk environment.

This patch adds DBOptions::skip_sats_update_on_db_open, which skips
UpdateAccumulatedStats() in DB::Open() time when it's set to true.

Test Plan: Add DBCompactionTest.SkipStatsUpdateTest

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: tnovak, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42843
2015-08-04 13:48:16 -07:00
Yueh-Hsuan Chiang
26894303c1 Add CompactOnDeletionCollector in utilities/table_properties_collectors.
Summary:
This diff adds CompactOnDeletionCollector in utilities/table_properties_collectors,
which applies a sliding window to a sst file and mark this file as need-compaction
when it observe enough deletion entries within the consecutive keys covered by
the sliding window.

Test Plan: compact_on_deletion_collector_test

Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yoshinorim, sdong

Reviewed By: sdong

Subscribers: maykov, dhruba

Differential Revision: https://reviews.facebook.net/D41175
2015-08-03 20:42:55 -07:00
Ari Ekmekji
40c64434d4 Parallelize L0-L1 Compaction: Restructure Compaction Job
Summary:
As of now compactions involving files from Level 0 and Level 1 are single
threaded because the files in L0, although sorted, are not range partitioned like
the other levels. This means that during L0-L1 compaction each file from L1
needs to be merged with potentially all the files from L0.

This attempt to parallelize the L0-L1 compaction assigns a thread and a
corresponding iterator to each L1 file that then considers only the key range
found in that L1 file and only the L0 files that have those keys (and only the
specific portion of those L0 files in which those keys are found). In this way
the overlap is minimized and potentially eliminated between different iterators
focusing on the same files.

The first step is to restructure the compaction logic to break L0-L1 compactions
into multiple, smaller, sequential compactions. Eventually each of these smaller
jobs will be run simultaneously. Areas to pay extra attention to are

  # Correct aggregation of compaction job statistics across multiple threads
  # Proper opening/closing of output files (make sure each thread's is unique)
  # Keys that span multiple L1 files
  # Skewed distributions of keys within L0 files

Test Plan: Make and run db_test (newer version has separate compaction tests) and compaction_job_stats_test

Reviewers: igor, noetzli, anthony, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: MarkCallaghan, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42699
2015-08-03 11:32:14 -07:00
agiardullo
8161bdb5a0 WriteBatch Save Points
Summary:
Support RollbackToSavePoint() in WriteBatch and WriteBatchWithIndex.  Support for partial transaction rollback is needed for MyRocks.

An alternate implementation of Transaction::RollbackToSavePoint() exists in D40869.  However, the other implementation is messier because it is implemented outside of WriteBatch.  This implementation is much cleaner and also exposes a potentially useful feature to WriteBatch.

Test Plan: Added unit tests

Reviewers: IslamAbdelRahman, kradhakrishnan, maykov, yoshinorim, hermanlee4, spetrunia, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42723
2015-07-29 16:54:23 -07:00
Andres Notzli
e95c59cd2f Count number of corrupt keys during compaction
Summary:
For task #7771355, we would like to log the number of corrupt keys
during a compaction. This patch implements and tests the count
as part of CompactionJobStats.

Test Plan: make && make check

Reviewers: rven, igor, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42921
2015-07-28 16:41:40 -07:00
Yueh-Hsuan Chiang
14f4137602 Correct the comment of DB::GetApproximateSizes
Summary: Correct the comment of DB::GetApproximateSizes

Test Plan: no code change

Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42939
2015-07-24 16:29:05 -07:00
agiardullo
6002801e09 Abandon ROCKSDB_WARNING
Summary:
My latest fix to pragma_error.h caused compilation errors for another internal project.  I am now unable to figure out how to get pragma_error working on all platforms and build environments (nor am I able to test any other options).

Seems like the best option is to get rid of this macro.  include/utilities has been deprecated for a year now, so lets just deal with a breaking change in 3.13 to remove these files.  And I guess we'll have to live with having an extra convenience.h.

Thoughts?

Test Plan: build

Reviewers: igor, yhchiang, kradhakrishnan, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42597
2015-07-22 12:56:56 -07:00
Andres Notzli
06aebca592 Report live data size estimate
Summary:
Fixes T6548822. Added a new function for estimating the size of the live data
as proposed in the task. The value can be accessed through the property
rocksdb.estimate-live-data-size.

Test Plan:
There are two unit tests in version_set_test and a simple test in db_test.
make version_set_test && ./version_set_test;
make db_test && ./db_test gtest_filter=GetProperty

Reviewers: rven, igor, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41493
2015-07-21 21:33:20 -07:00
Siying Dong
3dbf4ba220 RangeSync not to sync last 1MB of the file
Summary:
From other ones' investigation:

"sync_file_range() behavior highly depends on kernel version and filesystem.

xfs does neighbor page flushing outside of the specified ranges. For example, sync_file_range(fd, 8192, 16384) does not only trigger flushing page #3 to #4, but also flushing many more dirty pages (i.e. up to page#16)... Ranges of the sync_file_range() should be far enough from write() offset (at least 1MB)."

Test Plan: make all check

Reviewers: igor, rven, kradhakrishnan, yhchiang, IslamAbdelRahman, anthony

Reviewed By: anthony

Subscribers: yoshinorim, MarkCallaghan, sumeet, domas, dhruba, leveldb, ljin

Differential Revision: https://reviews.facebook.net/D15807
2015-07-21 16:22:40 -07:00
agiardullo
064294081b Improved FileExists API
Summary: Add new CheckFileExists method.  Considered changing the FileExists api but didn't want to break anyone's builds.

Test Plan: unit tests

Reviewers: yhchiang, igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42003
2015-07-20 17:20:40 -07:00
Islam AbdelRahman
aa8ac6445b Skip unsupported tests in ROCKSDB_LITE
Summary:
Skipping these tests in ROCKSDB_LITE since they are not supported
json_document_test
wal_manager_test
ttl_test
sst_dump_test
deletefile_test
compact_files_test
prefix_test
checkpoint_test

Test Plan:
json_document_test
wal_manager_test
ttl_test
sst_dump_test
deletefile_test
compact_files_test
prefix_test
checkpoint_test

Reviewers: igor, sdong, yhchiang, kradhakrishnan, anthony

Reviewed By: anthony

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D42573
2015-07-20 11:24:54 -07:00
sdong
6e9fbeb27c Move rate_limiter, write buffering, most perf context instrumentation and most random kill out of Env
Summary: We want to keep Env a think layer for better portability. Less platform dependent codes should be moved out of Env. In this patch, I create a wrapper of file readers and writers, and put rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. It will make it easier to maintain multiple Env in the future.

Test Plan: Run all existing unit tests.

Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D42321
2015-07-17 16:58:18 -07:00
Igor Canadi
35ca59364c Don't let flushes preempt compactions
Summary:
When we first started, max_background_flushes was 0 by default and compaction thread was executing flushes (since there was no flush thread). Then, we switched the default max_background_flushes to 1. However, we still support the case where there is no flush thread and flushes are done in compaction. This is making our code a bit more complicated. By not supporting this use-case we can make our code simpler.

We have a special case that when you set max_background_flushes to 0, we
schedule the flush to execute on the compaction thread.

Test Plan: make check (there might be some unit tests that depend on this behavior)

Reviewers: IslamAbdelRahman, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41931
2015-07-17 12:02:52 -07:00
agiardullo
79373c372d Fix ROCKSDB_WARNING
Summary:
ROCKSDB_WARNING is only defined if either ROCKSDB_PLATFORM_POSIX or OS_WIN is defined.  This works well for building rocksdb with its own build scripts.  But this won't work when an outside project(like mongodb) doesn't define ROCKSDB_PLATFORM_POSIX.

This fix defines ROCKSDB_WARNING for all platforms.  No idea if its defined correctly on non-posix,non-windows platforms but this is no worse that the current situation where this macro is missing on unexpected platforms.

This fix should hopefully fix anyone whose build broke now that we've switched from using #warning to Pragma (to support windows).  Unfortunately, while mongo-rocks compiles, it ignores the Pragma and doesn't print a warning.  I have not been able to figure out a way to implement this portably on all platforms.

Of course, an alternate solution would be to just get rid of ROCKSDB_WARNING and live with include file redirects indefinitely.  Thoughts?

Test Plan: build rocks, build mongorocks

Reviewers: igor, kradhakrishnan, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42477
2015-07-17 11:04:55 -07:00
Igor Canadi
a96fcd09b7 Deprecate CompactionFilterV2
Summary: It has been around for a while and it looks like it never found any uses in the wild. It's also complicating our compaction_job code quite a bit. We're deprecating it in 3.13, but will put it back in 3.14 if we actually find users that need this feature.

Test Plan: make check

Reviewers: noetzli, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42405
2015-07-17 18:59:11 +02:00
Siying Dong
d730c36777 Merge pull request #657 from yuslepukhin/ensure_clean_public_headers
Ensure Windows build w/o port/port.h in public headers
2015-07-16 17:47:30 -07:00
Dmitri Smirnov
d1a457181d Ensure Windows build w/o port/port.h in public headers
- Remove make file defines from public headers and use _WIN32 because it is compiler defined
 - use __GNUC__ and __clang__ to guard non-portable attributes
 - add #include "port/port.h" to some new .cc files.
 - minor changes in CMakeLists to reflect recent changes
2015-07-16 12:10:16 -07:00
agiardullo
43e9825625 Fix mongo build -take 2
Summary: quick fix for now.  will figure out a better fix soon

Test Plan: build

Reviewers: sdong, igor, kradhakrishnan

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42381
2015-07-15 18:24:02 -07:00
agiardullo
d8263d9584 Unbreak mongo build
Summary: Unbreak build for mongo.  later, we'll have to figure out if there is a platform independent way to implement ROCKSDB_WARNING

Test Plan: build

Reviewers: igor, sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42369
2015-07-15 17:56:37 -07:00
Igor Canadi
12c5528a89 Bump to RocksDB 3.13
Summary: master is now 3.13-to-be. I'm testing some internal stuff that has #ifdef based on RocksDB version.

Test Plan: none

Reviewers: sdong, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42249
2015-07-15 16:03:20 -07:00
agiardullo
81d072623c move convenience.h out of utilities
Summary: Moved convenience.h out of utilities to remove a dependency on utilities in db.

Test Plan: unit tests.  Also compiled a link to the old location to verify the _Pragma works.

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42201
2015-07-15 14:51:51 -07:00
Igor Canadi
8a9fca2619 Better error handling in BackupEngine
Summary:
Couple of changes here:
* NewBackupEngine() and NewReadOnlyBackupEngine() are now removed. They were deprecated since RocksDB 3.8. Changing these to new functions should be pretty straight-forward. As a followup, I'll fix all fbcode callsights
* Instead of initializing backup engine in the constructor, we initialize it in a separate function now. That way, we can catch all errors and return appropriate status code.
* We catch all errors during initializations and return them to the client properly.
* Added new tests to backupable_db_test, to make sure that we can't open BackupEngine when there are Env errors.
* Transitioned backupable_db_test to use BackupEngine rather than BackupableDB. From the two available APIs, judging by the current use-cases, it looks like BackupEngine API won. It's much more flexible since it doesn't require StackableDB.

Test Plan: Added a new unit test to backupable_db_test

Reviewers: yhchiang, sdong, AaronFeldman

Reviewed By: AaronFeldman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41925
2015-07-14 20:51:36 +02:00
Igor Canadi
a9c5109515 Deprecate purge_redundant_kvs_while_flush
Summary: This option is guarding the feature implemented 2 and a half years ago: D8991. The feature was enabled by default back then and has been running without issues. There is no reason why any client would turn this feature off. I found no reference in fbcode.

Test Plan: none

Reviewers: sdong, yhchiang, anthony, dhruba

Reviewed By: dhruba

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42063
2015-07-14 13:07:02 +02:00
Igor Canadi
5aea98ddd8 Deprecate WriteOptions::timeout_hint_us
Summary:
In one of our recent meetings, we discussed deprecating features that are not being actively used. One of those features, at least within Facebook, is timeout_hint. The feature is really nicely implemented, but if nobody needs it, we should remove it from our code-base (until we get a valid use-case). Some arguments:
* Less code == better icache hit rate, smaller builds, simpler code
* The motivation for adding timeout_hint_us was to work-around RocksDB's stall issue. However, we're currently addressing the stall issue itself (see @sdong's recent work on stall write_rate), so we should never see sharp lock-ups in the future.
* Nobody is using the feature within Facebook's code-base. Googling for `timeout_hint_us` also doesn't yield any users.

Test Plan: make check

Reviewers: anthony, kradhakrishnan, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: sdong, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41937
2015-07-14 09:35:48 +02:00
Ari Ekmekji
8bca83e5dd Add tombstone information in CompactionJobStats
Summary:
Added new statistics in CompactionJobStats to keep track of
deletion entries and the expiration of those entries. Updated these
fields in compaction_job.cc as compaction took place and wrote a new
test in compaction_job_stats_test.cc to verify accuracy.

Test Plan:
Wrote new test DeletionStatsTest in
compaction_job_stats_test.cc to verify

Reviewers: sdong, igor, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D41355
2015-07-13 15:51:38 -07:00
sdong
f9728640f3 "make format" against last 10 commits
Summary: This helps Windows port to format their changes, as discussed. Might have formatted some other codes too becasue last 10 commits include more.

Test Plan: Build it.

Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D41961
2015-07-13 13:50:18 -07:00
sdong
76d3cd3286 Fix public API dependency on internal codes and dependency on MAX_INT32
Summary:
Public API depends on port/port.h which is wrong. Fix it.
Also with gcc 4.8.1 build was broken as MAX_INT32 was not recognized. Fix it by using ::max in linux.

Test Plan: Build it and try to build an external project on top of it.

Reviewers: anthony, yhchiang, kradhakrishnan, igor

Reviewed By: igor

Subscribers: yoshinorim, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D41745
2015-07-11 10:32:11 -07:00
sdong
a6e38fd170 Fix a uncleaned counter in PerfContext::Reset()
Summary:
new_table_iterator_nanos is not cleaned in PerfContext::Reset() while new_table_block_iter_nanos is cleaned twice. Fix it.
Also fix a comment.

Test Plan: Build and db_bench with --perf_context to see the value shown.

Reviewers: kradhakrishnan, anthony, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D41721
2015-07-10 16:27:32 -07:00
Siying Dong
e41cbd9c2f Merge pull request #646 from yuslepukhin/ms_win_port
Windows Port from Microsoft
2015-07-10 15:53:39 -07:00
Dmitri Smirnov
296de4ae6d Address review comments
Rule of five: add destructor
 Add a note to COMMIT.md for 3rd party json.
2015-07-10 15:21:04 -07:00
sdong
041b6f95a2 perf_context: report time spent on reading index and bloom blocks
Summary: Add a perf context counter to help users figure out time spent on reading indexes and bloom filter blocks.

Test Plan: Will write a unit test

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D41433
2015-07-10 14:45:42 -07:00
Dmitri Smirnov
c903ccc4c2 Merge from github/master 2015-07-09 18:01:08 -07:00
unknown
5c79132335 Revert the changes related to Options, as requested to seperate them into
a different patch.
2015-07-09 11:31:42 -07:00
Dmitri Smirnov
ef4b87f1b2 Commit both PR and internal code review changes 2015-07-07 16:58:20 -07:00
Poornima Chozhiyath Raman
c0b23dd5b0 Enabling trivial move in universal compaction
Summary: This change enables trivial move if all the input files are non onverlapping while doing Universal Compaction.

Test Plan: ./compaction_picker_test and db_test ran successfully with the new testcases.

Reviewers: sdong

Reviewed By: sdong

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D40875
2015-07-07 14:18:55 -07:00
Dmitri Smirnov
d2f0912bd3 Merge the latest changes from github/master 2015-07-02 17:23:41 -07:00
Ari Ekmekji
35cd75c379 Introduce InfoLogLevel::HEADER_LEVEL
Summary:
 Introduced a new category in the enum InfoLogLevel in env.h.
 Modifed Log() in env.cc to use the Header()
 when the InfoLogLevel == HEADER_LEVEL.
 Updated tests in auto_roll_logger_test to ensure
 the header is handled properly in these cases.

Test Plan: Augment existing tests in auto_roll_logger_test

Reviewers: igor, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41067
2015-07-02 17:14:39 -07:00
Dmitri Smirnov
feb99c31a4 Merge remote-tracking branch 'origin' into ms_win_port 2015-07-02 13:14:18 -07:00
agiardullo
4159f5b87b Prepare 3.12
Summary: About to cut release

Test Plan: none

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41061
2015-07-02 12:20:36 -07:00
Aaron Feldman
a69bc91e37 Multithreaded backup and restore in BackupEngineImpl
Summary:
Add a new field: BackupableDBOptions.max_background_copies.
CreateNewBackup() and RestoreDBFromBackup() will use this number of threads to perform copies.
If there is a backup rate limit, then max_background_copies must be 1.
Update backupable_db_test.cc to test multi-threaded backup and restore.
Update backupable_db_test.cc to test backups when the backup environment is not the same as the database environment.

Test Plan:
Run ./backupable_db_test
Run valgrind ./backupable_db_test
Run with TSAN and ASAN

Reviewers: yhchiang, rven, anthony, sdong, igor

Reviewed By: igor

Subscribers: yhchiang, anthony, sdong, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D40725
2015-07-02 11:35:51 -07:00
Dmitri Smirnov
9dbde7277c Merge remote-tracking branch 'origin' into ms_win_port 2015-07-02 11:34:22 -07:00
Dmitri Smirnov
ca2fe2c1b6 Address GCC compilation issues
invalid suffix on literal
 no return statement in function returning non-void CuckooStep::operator=
 extra qualification ‘rocksdb::spatial::Variant::
 dereferencing type-punned pointer will break strict-aliasing rules
2015-07-01 17:04:08 -07:00
Dmitri Smirnov
18285c1e2f Windows Port from Microsoft
Summary: Make RocksDb build and run on Windows to be functionally
 complete and performant. All existing test cases run with no
 regressions. Performance numbers are in the pull-request.

 Test plan: make all of the existing unit tests pass, obtain perf numbers.

 Co-authored-by: Praveen Rao praveensinghrao@outlook.com
 Co-authored-by: Sherlock Huang baihan.huang@gmail.com
 Co-authored-by: Alex Zinoviev alexander.zinoviev@me.com
 Co-authored-by: Dmitri Smirnov dmitrism@microsoft.com
2015-07-01 16:13:56 -07:00
Igor Canadi
4cbc4e6f88 Call merge operators with empty values
Summary: It's not really nice to call user's API with garbage data in new_value. This diff makes sure that new_value is empty before calling the merge operator.

Test Plan: Added assert to Merge operator in merge_test

Reviewers: sdong, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D40773
2015-06-26 11:35:46 -07:00
Yueh-Hsuan Chiang
48da7a9cad Improve the comment for BYTES_READ in statistics.
Summary:
BYTES_READ only count the number of logical bytes read from
the DB::Get() function.  It neither includes all logical bytes read
nor indicates IO read bytes.

This patch improves the comment for BYTES_READ.

Test Plan: Only change comment.

Reviewers: sdong, rven, anthony, kradhakrishnan, IslamAbdelRahman, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D40599
2015-06-24 15:00:51 -07:00
Islam AbdelRahman
674b1181cf Bottommost level compaction option
Summary: Replace force_bottommost_level_compaction in CompactRangeOption with an option that allow the user to (always skip, always compact, compact if compaction filter is present) the bottommost level for level based compaction.

Test Plan: make check

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D40527
2015-06-23 13:32:40 -07:00
Giuseppe Ottaviano
782a1590f9 Implement a table-level row cache
Summary:
Implementation of a table-level row cache.
It only caches point queries done through the `DB::Get` interface, queries done through the `Iterator` interface will completely skip the cache.

Supports snapshots and merge operations.

Test Plan: Ran `make valgrind_check commit-prereq`

Reviewers: igor, philipp, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D39849
2015-06-23 10:25:45 -07:00
krad
de85e4cadf Introduce WAL recovery consistency levels
Summary:
The "one size fits all" approach with WAL recovery will only introduce inconvenience for our varied clients as we go forward. The current recovery is a bit heuristic. We introduce the following levels of consistency while replaying the WAL.

1. RecoverAfterRestart (kTolerateCorruptedTailRecords)

This mocks the current recovery mode.

2. RecoverAfterCleanShutdown (kAbsoluteConsistency)

This is ideal for unit test and cases where the store is shutdown cleanly. We tolerate no corruption or incomplete writes.

3. RecoverPointInTime (kPointInTimeRecovery)

This is ideal when using devices with controller cache or file systems which can loose data on restart. We recover upto the point were is no corruption or incomplete write.

4. RecoverAfterDisaster (kSkipAnyCorruptRecord)

This is ideal mode to recover data. We tolerate corruption and incomplete writes, and we hop over those sections that we cannot make sense of salvaging as many records as possible.

Test Plan:
(1) Run added unit test to cover all levels.
(2) Run make check.

Reviewers: leveldb, sdong, igor

Subscribers: yoshinorim, dhruba

Differential Revision: https://reviews.facebook.net/D38487
2015-06-22 15:28:12 -07:00
krad
7015fd81c4 Add read_nanos to IOStatsContext.
Summary: MyRocks need a mechanism to track read outliers. We need to expose this
stat.

Test Plan: None

Reviewers: sdong

CC: leveldb

Task ID: #7152512

Blame Rev:
2015-06-22 11:09:35 -07:00
Yueh-Hsuan Chiang
eade498bda Block utilities/write_batch_with_index in ROCKSDB_LITE
Summary:
Block utilities/write_batch_with_index in ROCKSDB_LITE as we
don't include anly utilities in ROCKSDB_LITE

Test Plan: write_batch_with_index_test

Reviewers: rven, anthony, kradhakrishnan, IslamAbdelRahman, igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D40347
2015-06-18 15:54:05 -07:00
Igor Canadi
760e9a94de Fail DB::Open() when the requested compression is not available
Summary:
Currently RocksDB silently ignores this issue and doesn't compress the data. Based on discussion, we agree that this is pretty bad because it can cause confusion for our users.

This patch fails DB::Open() if we don't support the compression that is specified in the options.

Test Plan: make check with LZ4 not present. If Snappy is not present all tests will just fail because Snappy is our default library. We should make Snappy the requirement, since without it our default DB::Open() fails.

Reviewers: sdong, MarkCallaghan, rven, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39687
2015-06-18 14:55:05 -07:00
Aaron Feldman
69bb210d58 Add Cache.GetPinnedUsageUsage()
Summary:
  Add the funcion Cache.GetPinnedUsage() to return the memory size of entries
  that are in use by the system (that is, all the entries not in the LRU list).

Test Plan:
  Run ./cache_test and examine PinnedUsageTest.

Reviewers: tnovak, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D40305
2015-06-18 13:56:31 -07:00
Islam AbdelRahman
4eabbdb7ec Skip bottommost level compaction if possible
Summary:
This is https://reviews.facebook.net/D39999 but after introducing an option to force compaction the bottom most level

Changes in this patch
- Introduce force_bottommost_level_compaction to CompactRangeOptions that force compacting bottommost level during compaction
- Skip bottommost level compaction if we dont have a compaction filter and force_bottommost_level_compaction options is not set

Although tests pass on my machine but I suspect that there maybe some tests that I am not aware of that  should use force_bottommost_level_compaction to pass in a deterministic way

Test Plan:
make check
adding new tests

Reviewers: igor, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D40059
2015-06-18 11:03:31 -07:00
Igor Canadi
4b8bb62f0a Don't dump DBOptions for each column family
Summary: Currently we dump DBOptions for each column family options we dump. This leads to duplicate lines in our LOG file. This diff fixes that.

Test Plan: Check out the LOG

Reviewers: sdong, rven, yhchiang

Reviewed By: yhchiang

Subscribers: IslamAbdelRahman, yoshinorim, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39729
2015-06-18 10:15:54 -07:00
Islam AbdelRahman
12e030a992 Use CompactRangeOptions for CompactRange
Summary:
This diff update DB::CompactRange to use RangeCompactionOptions instead of using multiple parameters
Old CompactRange is still available but deprecated

Test Plan:
make all check
make rocksdbjava
USE_CLANG=1 make all
OPT=-DROCKSDB_LITE make release

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D40209
2015-06-17 14:36:14 -07:00
Holodov Alexander
84a9c6a53a add comment 2015-05-16 15:29:39 +04:00
Holodov Alexander
eeb44366ba C api: human-readable statistics 2015-05-16 12:34:28 +04:00
sdong
40f562e747 Allow GetApproximateSize() to include mem table size if it is skip list memtable
Summary:
Add an option in GetApproximateSize() so that the result will include estimated sizes in mem tables.
To implement it, implement an estimated count from the beginning to a key in skip list. The approach is to count to find the entry, how many Next() is issued from each level, and sum them with a weight that is <branching factor> ^ <level>.

Test Plan: Add a test case

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D40119
2015-06-16 18:13:23 -07:00
sdong
7842920be5 Slow down writes by bytes written
Summary:
We slow down data into the database to the rate of options.delayed_write_rate (a new option) with this patch.

The thread synchronization approach I take is to still synchronize write controller by DB mutex and GetDelay() is inside DB mutex. Try to minimize the frequency of getting time in GetDelay(). I verified it through db_bench and it seems to work

hard_rate_limit is deprecated.

options.delayed_write_rate is still not dynamically changeable. Need to work on it as a follow-up.

Test Plan: Add new unit tests in db_test

Reviewers: yhchiang, rven, kradhakrishnan, anthony, MarkCallaghan, igor

Reviewed By: igor

Subscribers: ikabiljo, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D36351
2015-06-11 20:42:18 -07:00
Islam AbdelRahman
d6ce0f7c61 Add largest sequence to FlushJobInfo
Summary:
Adding largest sequence number to FlushJobInfo
and passing flushed file metadata to NotifyOnFlushCompleted which include alot of other values that we may want to expose in FlushJobInfo

Test Plan: make check

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D39927
2015-06-11 15:22:22 -07:00
Yueh-Hsuan Chiang
3eddd1abe9 Add Env::GetThreadID(), which returns the ID of the current thread.
Summary:
Add Env::GetThreadID(), which returns the ID of the current thread.

In addition, make GetThreadList() and InfoLog use same unique ID for the same thread.

Test Plan:
db_test
listener_test

Reviewers: igor, rven, IslamAbdelRahman, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D39735
2015-06-11 14:18:02 -07:00
Igor Canadi
821cff114e Re-generate WriteEntry on WBWIIterator::Entry()
Summary:
[This is the resubmit of D39813. Tests were failing, so I reverted the diff. I found the bug and I'm now resubmitting]

If we don't do this, any calls to Entry() after WBWI mutation will result in undefined behavior. We need to re-fetch the offset from the skip list and regenerate the new pointer (because string's base pointer can change while mutating).

Test Plan: COMPILE_WITH_ASAN=1 make write_batch_with_index_test && ./write_batch_with_index_test

Reviewers: sdong

Reviewed By: sdong

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D39897
2015-06-10 12:57:38 -07:00
Igor Canadi
75222d130e Revert "Fix compile"
This reverts commit 51440f83ec.

Revert "Re-generate WriteEntry on WBWIIterator::Entry()"

This reverts commit 4949ef08db.
2015-06-10 11:05:27 -07:00
Igor Canadi
47f1e72125 Merge pull request #630 from rdallman/c-wb-logdata
C: add WriteBatch.PutLogData support
2015-06-10 10:50:28 -07:00
Igor Canadi
4949ef08db Re-generate WriteEntry on WBWIIterator::Entry()
Summary: If we don't do this, any calls to Entry() after WBWI mutation will result in undefined behavior. We need to re-fetch the offset from the skip list and regenerate the new pointer (because string's base pointer can change while mutating).

Test Plan: COMPILE_WITH_ASAN=1 make write_batch_with_index_test && ./write_batch_with_index_test

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39813
2015-06-10 10:35:19 -07:00
Reed Allman
735df66552 C: add WriteBatch.PutLogData support 2015-06-10 00:12:33 -07:00
Islam AbdelRahman
643bbbf081 Use nullptr for default compaction_filter_factory
Summary:
Replacing the default value for compaction_filter_factory and compaction_filter_factory_v2 to be nullptr instead of DefaultCompactionFilterFactory / DefaultCompactionFilterFactoryV2
The reason for this is to be able to determine easily if we have compaction filter factory or not without depending on RTTI

Test Plan: make check

Reviewers: yoshinorim, ott, igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D39693
2015-06-08 16:34:26 -07:00
Igor Canadi
133130a4f5 Merge pull request #625 from rdallman/c-slice-parts-support
C: add support for WriteBatch SliceParts params
2015-06-08 13:14:44 -04:00
Igor Canadi
de4d172d0f Merge pull request #622 from rdallman/c-multiget
C: add MultiGet support
2015-06-08 13:13:54 -04:00
sdong
6df589b446 Add TablePropertiesCollector::NeedCompact() to suggest DB to further compact output files
Summary:
It is experimental. Allow users to return from a call back function TablePropertiesCollector::NeedCompact(), based on the data in the file.
It can be used to allow users to suggest DB to clear up delete tombstones faster.

Test Plan: Add a unit test.

Reviewers: igor, yhchiang, kradhakrishnan, rven

Reviewed By: rven

Subscribers: yoshinorim, MarkCallaghan, maykov, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D39585
2015-06-05 20:18:21 -07:00
Yueh-Hsuan Chiang
2e764f06ea [API Change] Improve EventListener::OnFlushCompleted interface
Summary:
EventListener::OnFlushCompleted() now passes a structure instead
of a list of parameters.  This minimizes the API change in the
future.

Test Plan:
listener_test
compact_files_test
example/compact_files_example

Reviewers: kradhakrishnan, sdong, IslamAbdelRahman, rven, igor

Reviewed By: rven, igor

Subscribers: IslamAbdelRahman, rven, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39543
2015-06-05 12:28:51 -07:00
Yueh-Hsuan Chiang
bb808eaddb Changed the CompactionJobStats::output_key_prefix type from char[] to string.
Summary:
Keys in RocksDB can be arbitrary byte strings.  However, in the current
CompactionJobStats, smallest_output_key_prefix and largest_output_key_prefix
are of type char[] without having a length, which is insufficient to handle
non-null terminated strings.

This patch change their type to std::string.

Test Plan: compaction_job_stats_test

Reviewers: igor, rven, IslamAbdelRahman, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39537
2015-06-04 12:31:12 -07:00
Yueh-Hsuan Chiang
0b3172d071 Add EventListener::OnTableFileDeletion()
Summary:
Add EventListener::OnTableFileDeletion(), which will be
called when a table file is deleted.

Test Plan: Extend three existing tests in db_test to verify the deleted files.

Reviewers: rven, anthony, kradhakrishnan, igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38931
2015-06-03 19:57:01 -07:00
Reed Allman
211a195d41 C: add MultiGet support 2015-06-03 17:57:42 -07:00
Reed Allman
5dc174e11a C: add support for WriteBatch SliceParts params 2015-06-03 17:08:00 -07:00
Igor Canadi
408cc4b8e0 Revert "Merge pull request #621 from rdallman/c-slice-parts-support"
This reverts commit 78382d4ba7, reversing
changes made to ca8b85ac04.
2015-06-03 13:34:07 -04:00
Igor Canadi
78382d4ba7 Merge pull request #621 from rdallman/c-slice-parts-support
C: add support for WriteBatch SliceParts params
2015-06-03 13:16:14 -04:00
agiardullo
ca8b85ac04 better document max_write_buffer_number_to_maintain
Test Plan: compile

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39285
2015-06-02 21:24:19 -07:00
Yueh-Hsuan Chiang
fe5c6321cb Allow EventListener::OnCompactionCompleted to return CompactionJobStats.
Summary:
Allow EventListener::OnCompactionCompleted to return CompactionJobStats,
which contains useful information about a compaction.

Example CompactionJobStats returned by OnCompactionCompleted():
    smallest_output_key_prefix 05000000
    largest_output_key_prefix 06990000
    elapsed_time 42419
    num_input_records 300
    num_input_files 3
    num_input_files_at_output_level 2
    num_output_records 200
    num_output_files 1
    actual_bytes_input 167200
    actual_bytes_output 110688
    total_input_raw_key_bytes 5400
    total_input_raw_value_bytes 300000
    num_records_replaced 100
    is_manual_compaction 1

Test Plan: Developed a mega test in db_test which covers 20 variables in CompactionJobStats.

Reviewers: rven, igor, anthony, sdong

Reviewed By: sdong

Subscribers: tnovak, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38463
2015-06-02 17:07:16 -07:00
Yueh-Hsuan Chiang
8d8d4e45ba Fixed ROCKSDB_LITE compile error due to the missing of TableFileCreationInfo
Summary: Fixed ROCKSDB_LITE compile error due to the missing of TableFileCreationInfo

Test Plan: make OPT=-DROCKSDB_LITE shared_lib -j32

Reviewers: sdong, igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39405
2015-06-02 14:27:46 -07:00
Yueh-Hsuan Chiang
fc83821270 Add EventListener::OnTableFileCreated()
Summary:
Add EventListener::OnTableFileCreated(), which will be called
when a table file is created.  This patch is part of the
EventLogger and EventListener integration.

Test Plan: Augment existing test in db/listener_test.cc

Reviewers: anthony, kradhakrishnan, rven, igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38865
2015-06-02 14:12:23 -07:00
sdong
ac81130fa0 Fix Bug: CompactRange() doesn't change to correct level caused by using wrong level
Summary: In previous change https://reviews.facebook.net/D39099 , while renaming parameters, use a wrong parameter, causing CompactRange() to compact not wrong level.

Test Plan: Run "DBTest.MigrateToDynamicLevelMaxBytesBase" which failed with the patch.

Reviewers: rven, yhchiang, kradhakrishnan, igor, anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D39393
2015-06-02 10:00:58 -07:00
Mike Kolupaev
ec7a944360 more times in perf_context and iostats_context
Summary:
We occasionally get write stalls (>1s Write() calls) on HDD under read load. The following timers explain almost all of the stalls:
 - perf_context.db_mutex_lock_nanos
 - perf_context.db_condition_wait_nanos
 - iostats_context.open_time
 - iostats_context.allocate_time
 - iostats_context.write_time
 - iostats_context.range_sync_time
 - iostats_context.logger_time

In my experiments each of these occasionally takes >1s on write path under some workload. There are rare cases when Write() takes long but none of these takes long.

Test Plan: Added code to our application to write the listed timings to log for slow writes. They usually add up to almost exactly the time Write() call took.

Reviewers: rven, yhchiang, sdong

Reviewed By: sdong

Subscribers: march, dhruba, tnovak

Differential Revision: https://reviews.facebook.net/D39177
2015-06-02 02:07:58 -07:00
sdong
4266d4fd90 Allow users to migrate to options.level_compaction_dynamic_level_bytes=true using CompactRange()
Summary: In DB::CompactRange(), change parameter "reduce_level" to "change_level". Users can compact all data to the last level if needed. By doing it, users can migrate the DB to options.level_compaction_dynamic_level_bytes=true.

Test Plan: Add a unit test for it.

Reviewers: yhchiang, anthony, kradhakrishnan, igor, rven

Reviewed By: rven

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D39099
2015-06-01 18:21:14 -07:00
Mike Kolupaev
2ecac9f96d add rocksdb::WritableFileWrapper similar to rocksdb::EnvWrapper
Summary: It used to be no good (known to me) non-intrusive way to wrap WritableFile - you can't call protected virtual methods of the wrapped pointer to WritableFile. This diff adds a convenience class WritableFileWrapper that makes wrapping WritableFile both possible and easy.

Test Plan: `make clean; make -j release`, `make clean; OPT=-DROCKSDB_LITE make release`, `make clean; USE_CLANG=1 make -j all`.

Reviewers: sdong, yhchiang, rven

Reviewed By: rven

Subscribers: dhruba, tnovak, march

Differential Revision: https://reviews.facebook.net/D39147
2015-06-01 11:22:36 -07:00
Igor Canadi
a187e66ad0 Merge pull request #617 from rdallman/wb-merge-sliceparts
WriteBatch.Merge w/ SliceParts support
2015-05-31 13:34:34 -04:00
agiardullo
dc9d70de65 Optimistic Transactions
Summary: Optimistic transactions supporting begin/commit/rollback semantics.  Currently relies on checking the memtable to determine if there are any collisions at commit time.  Not yet implemented would be a way of enuring the memtable has some minimum amount of history so that we won't fail to commit when the memtable is empty.  You should probably start with transaction.h to get an overview of what is currently supported.

Test Plan: Added a new test, but still need to look into stress testing.

Reviewers: yhchiang, igor, rven, sdong

Reviewed By: sdong

Subscribers: adamretter, MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D33435
2015-05-29 14:36:35 -07:00
Yueh-Hsuan Chiang
ebfdb3c7f6 Fixed a compile error in ROCKSDB_LITE
Summary: Fixed a compile error in ROCKSDB_LITE

Test Plan: make db_stress OPT=-DROCKSDB_LITE -j32

Reviewers: sdong, igor, anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39201
2015-05-29 13:21:09 -07:00
Yueh-Hsuan Chiang
9ffc8ba024 Include EventListener in stress test.
Summary: Include EventListener in stress test.

Test Plan: make blackbox_crash_test whitebox_crash_test

Reviewers: anthony, igor, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39105
2015-05-29 13:17:49 -07:00
Reed Allman
21cd6b7ad8 C: add support for WriteBatch SliceParts params 2015-05-29 10:23:43 -07:00
Reed Allman
a0635ba3f6 WriteBatch.Merge w/ SliceParts support
also hooked up WriteBatchInternal
2015-05-29 04:30:03 -07:00
agiardullo
c815351038 Support saving history in memtable_list
Summary:
For transactions, we are using the memtables to validate that there are no write conflicts.  But after flushing, we don't have any memtables, and transactions could fail to commit.  So we want to someone keep around some extra history to use for conflict checking.  In addition, we want to provide a way to increase the size of this history if too many transactions fail to commit.

After chatting with people, it seems like everyone prefers just using Memtables to store this history (instead of a separate history structure).  It seems like the best place for this is abstracted inside the memtable_list.  I decide to create a separate list in MemtableListVersion as using the same list complicated the flush/installalflushresults logic too much.

This diff adds a new parameter to control how much memtable history to keep around after flushing.  However, it sounds like people aren't too fond of adding new parameters.  So I am making the default size of flushed+not-flushed memtables be set to max_write_buffers.  This should not change the maximum amount of memory used, but make it more likely we're using closer the the limit.  (We are now postponing deleting flushed memtables until the max_write_buffer limit is reached).  So while we might use more memory on average, we are still obeying the limit set (and you could argue it's better to go ahead and use up memory now instead of waiting for a write stall to happen to test this limit).

However, if people are opposed to this default behavior, we can easily set it to 0 and require this parameter be set in order to use transactions.

Test Plan: Added a xfunc test to play around with setting different values of this parameter in all tests.  Added testing in memtablelist_test and planning on adding more testing here.

Reviewers: sdong, rven, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D37443
2015-05-28 16:34:24 -07:00
Yueh-Hsuan Chiang
672dda9b3b [API Change] Move listeners from ColumnFamilyOptions to DBOptions
Summary: Move listeners from ColumnFamilyOptions to DBOptions

Test Plan:
listener_test
compact_files_test

Reviewers: rven, anthony, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39087
2015-05-28 13:21:39 -07:00
Reed Allman
9c38ce1d02 C: extra bbto / noop slice transform 2015-05-22 22:56:28 -07:00
Igor Canadi
4cb4d546cd Set stats_dump_period_sec to 600 by default
Summary: Having stats in our LOG more often will help a lot with perf debugging.

Test Plan: none

Reviewers: sdong, MarkCallaghan

Reviewed By: MarkCallaghan

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38781
2015-05-21 14:22:16 -04:00
Yueh-Hsuan Chiang
e2c1d4b57f [Public API Change] Make DB::GetDbIdentity() be const function.
Summary: Make DB::GetDbIdentity() be const function.

Test Plan: make db_test

Reviewers: igor, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38745
2015-05-21 11:01:48 -07:00
Igor Canadi
4a855c0799 Add an option wal_bytes_per_sync to control sync_file_range for WAL files
Summary:
sync_file_range is not always asyncronous and thus can block writes if we do this for WAL in the foreground thread. See more here: http://yoshinorimatsunobu.blogspot.com/2014/03/how-syncfilerange-really-works.html

Some users don't want us to call sync_file_range on WALs. Some other do.
Thus, I'm adding a separate option wal_bytes_per_sync to control calling
sync_file_range on WAL files. bytes_per_sync will apply only to table
files now.

Test Plan: no more sync_file_range for WAL as evidenced by strace

Reviewers: yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38253
2015-05-18 17:03:59 -07:00
sdong
6fa7085121 CompactRange skips levels 1 to base_level -1 for dynamic level base size
Summary: CompactRange() now is much more expensive for dynamic level base size as it goes through all the levels. Skip those not used levels between level 0 an base level.

Test Plan: Run all unit tests

Reviewers: yhchiang, rven, anthony, kradhakrishnan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D37125
2015-05-18 10:54:11 -07:00
Yueh-Hsuan Chiang
3f0867c0fe Allow GetThreadList to report Flush properties.
Summary:
Allow GetThreadList to report Flush properties, which includes:
* job id
* number of bytes that has been written since flush started.
* total size of input mem-tables

Test Plan:
./db_bench --threads=30 --num=1000000 --benchmarks=fillrandom --thread_status_per_interval=100 --value_size=1000

Sample output from db_bench which tracks same flush job

          ThreadID ThreadType       cfName            Operation   ElapsedTime                                         Stage        State OperationProperties
   140213879898240   High Pri      default                Flush       5789 us                    FlushJob::WriteLevel0Table              BytesMemtables 4112835 | BytesWritten 577104 | JobID 8 |

          ThreadID ThreadType       cfName            Operation   ElapsedTime                                         Stage        State OperationProperties
   140213879898240   High Pri      default                Flush     30.634 ms                    FlushJob::WriteLevel0Table              BytesMemtables 4112835 | BytesWritten 1734865 | JobID 8 |

Reviewers: rven, igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38505
2015-05-15 23:22:22 -07:00