Commit Graph

1631 Commits

Author SHA1 Message Date
Yi Wu
602c13a964 Remove fadvise with direct IO read
Summary:
Remove the logic since we don't use buffer cache with direct IO. Resolve
read regression we currently have.
Closes https://github.com/facebook/rocksdb/pull/1782

Differential Revision: D4430408

Pulled By: yiwu-arbug

fbshipit-source-id: 5557bba
2017-01-18 12:09:10 -08:00
Kefu Chai
e8a096000b util/thread_local.h: silence a clang-build warning
Summary:
otherwise clang complains with

/home/jenkins/workspace/ceph-master/src/rocksdb/util/thread_local.h:205:5:
error: macro expansion producing 'defined' has undefined behavior
[-Werror,-Wexpansion-to-defined]
^
/home/jenkins/workspace/ceph-master/src/rocksdb/util/thread_local.h:22:4:
note: expanded from macro 'ROCKSDB_SUPPORT_THREAD_LOCAL'
!defined(OS_WIN) && !defined(OS_MACOSX) && !defined(IOS_CROSS_COMPILE)
^`

Signed-off-by: Kefu Chai <tchaikov@gmail.com>
Closes https://github.com/facebook/rocksdb/pull/1757

Differential Revision: D4394140

Pulled By: siying

fbshipit-source-id: f0beda0
2017-01-15 13:24:16 -08:00
Aaron Gao
3e6899d116 change UseDirectIO() to use_direct_io()
Summary:
also change variable name `direct_io_` to `use_direct_io_` in WritableFile to make it consistent with read path.
Closes https://github.com/facebook/rocksdb/pull/1770

Differential Revision: D4416435

Pulled By: lightmark

fbshipit-source-id: 4143c53
2017-01-13 12:09:15 -08:00
Aaron Gao
d4e07a8459 fix warning of unused direct io helper functions
Summary:
add build guard
Closes https://github.com/facebook/rocksdb/pull/1771

Differential Revision: D4410779

Pulled By: siying

fbshipit-source-id: 3796c30
2017-01-12 12:39:14 -08:00
Aaron Gao
dc2584eea0 direct reads refactor
Summary:
direct IO reads refactoring
remove unnecessary classes and unified interfaces
tested with db_bench

need more change for options and ON/OFF for different files.
Since disabled is default, it should be fine now
Closes https://github.com/facebook/rocksdb/pull/1636

Differential Revision: D4307189

Pulled By: lightmark

fbshipit-source-id: 6991e22
2017-01-11 16:54:12 -08:00
Anirban Rahut
62384ebe9c Guarding extra fallocate call with TRAVIS because its not working pro…
Summary:
…perly on travis

 There is some old code in PosixWritableFile::Close(), which
truncates the file to the measured size and then does an extra fallocate
with KEEP_SIZE. This is commented as a failsafe because in some
cases ftruncate doesn't do the right job (I don't know of an instance of
this btw). However doing an fallocate with KEEP_SIZE should not increase
the file size. However on Travis Worker which is Docker (likely AUFS )
its not working. There are comments on web that show that the AUFS
author had initially not implemented fallocate, and then did it later.
So not sure what is the quality of the implementation.
Closes https://github.com/facebook/rocksdb/pull/1765

Differential Revision: D4401340

Pulled By: anirbanr-fb

fbshipit-source-id: e2d8100
2017-01-11 14:24:13 -08:00
Andrew Kryczka
fe395fb63d Allow incrementing refcount on cache handles
Summary:
Previously the only way to increment a handle's refcount was to invoke Lookup(), which (1) did hash table lookup to get cache handle, (2) incremented that handle's refcount. For a future DeleteRange optimization, I added a function, Ref(), for when the caller already has a cache handle and only needs to do (2).
Closes https://github.com/facebook/rocksdb/pull/1761

Differential Revision: D4397114

Pulled By: ajkr

fbshipit-source-id: 9addbe5
2017-01-10 16:54:20 -08:00
Dmitri Smirnov
3c233ca4ea Fix Windows environment issues
Summary:
Enable directIO on WritableFileImpl::Append
     with offset being current length of the file.
     Enable UniqueID tests on Windows, disable others but
     leeting them to compile. Unique tests are valuable to
     detect failures on different filesystems and upcoming
     ReFS.
     Clear output in WinEnv Getchildren.This is different from
     previous strategy, do not touch output on failure.
     Make sure DBTest.OpenWhenOpen works with windows error message
Closes https://github.com/facebook/rocksdb/pull/1746

Differential Revision: D4385681

Pulled By: IslamAbdelRahman

fbshipit-source-id: c07b702
2017-01-09 15:54:12 -08:00
Maysam Yabandeh
d0ba8ec8f9 Revert "PinnableSlice"
Summary:
This reverts commit 54d94e9c2c.

The pull request was landed by mistake.
Closes https://github.com/facebook/rocksdb/pull/1755

Differential Revision: D4391678

Pulled By: maysamyabandeh

fbshipit-source-id: 36d5149
2017-01-08 14:24:12 -08:00
Maysam Yabandeh
54d94e9c2c PinnableSlice
Summary:
Currently the point lookup values are copied to a string provided by the user.
This incures an extra memcpy cost. This patch allows doing point lookup
via a PinnableSlice which pins the source memory location (instead of
copying their content) and releases them after the content is consumed
by the user. The old API of Get(string) is translated to the new API
underneath.

 Here is the summary for improvements:
 1. value 100 byte: 1.8%  regular, 1.2% merge values
 2. value 1k   byte: 11.5% regular, 7.5% merge values
 3. value 10k byte: 26% regular,    29.9% merge values

 The improvement for merge could be more if we extend this approach to
 pin the merge output and delay the full merge operation until the user
 actually needs it. We have put that for future work.

PS:
Sometimes we observe a small decrease in performance when switching from
t5452014 to this patch but with the old Get(string) API. The difference
is a little and could be noise. More importantly it is safely
cancelled
Closes https://github.com/facebook/rocksdb/pull/1732

Differential Revision: D4374613

Pulled By: maysamyabandeh

fbshipit-source-id: a077f1a
2017-01-08 13:54:13 -08:00
Islam AbdelRahman
ac73d7558b Add GetSupportedCompressions() convenience function
Summary:
This function will return a list of supported compression types in RocksDB
This is needed for MyRocks https://github.com/facebook/mysql-5.6/pull/446
Closes https://github.com/facebook/rocksdb/pull/1747

Differential Revision: D4385921

Pulled By: IslamAbdelRahman

fbshipit-source-id: 2f5b59f
2017-01-06 11:24:14 -08:00
Adam Retter
85ac1a320a Fix rocksdb::Status::getState
Summary:
This fixes the Java API for Status#getState use in Native code and also simplifies the implementation of rocksdb::Status::getState.
Closes https://github.com/facebook/rocksdb/issues/1688
Closes https://github.com/facebook/rocksdb/pull/1714

Differential Revision: D4364181

Pulled By: yiwu-arbug

fbshipit-source-id: 8e073b4
2017-01-03 18:39:14 -08:00
Siying Dong
17a4b75cc3 Always fsync the file after file copying
Summary:
File copying happens when creating checkpoints and bulkloading files from different FS partition. We should fsync the files when copying them to guarantee durability. A side effect will be that the dirty pages in file system buffers won't grow too large.
Closes https://github.com/facebook/rocksdb/pull/1728

Differential Revision: D4371083

Pulled By: siying

fbshipit-source-id: 579e14c
2016-12-28 19:09:16 -08:00
Yi Wu
ab48c165a9 Print cache options to info log
Summary:
Improve cache options logging to info log.
Also print the value of
cache_index_and_filter_blocks_with_high_priority.
Closes https://github.com/facebook/rocksdb/pull/1709

Differential Revision: D4358776

Pulled By: yiwu-arbug

fbshipit-source-id: 8f030a0
2016-12-22 14:54:19 -08:00
Aaron Gao
972f96b3fb direct io write support
Summary:
rocksdb direct io support

```
[gzh@dev11575.prn2 ~/rocksdb] ./db_bench -benchmarks=fillseq --num=1000000
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB:    version 5.0
Date:       Wed Nov 23 13:17:43 2016
CPU:        40 * Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
CPUCache:   25600 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Write rate: 0 bytes/second
Compression: Snappy
Memtablerep: skip_list
Perf Level: 1
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
DB path: [/tmp/rocksdbtest-112628/dbbench]
fillseq      :       4.393 micros/op 227639 ops/sec;   25.2 MB/s

[gzh@dev11575.prn2 ~/roc
Closes https://github.com/facebook/rocksdb/pull/1564

Differential Revision: D4241093

Pulled By: lightmark

fbshipit-source-id: 98c29e3
2016-12-22 13:09:19 -08:00
Islam AbdelRahman
989e644ed8 Remove sst_file_manager option from LITE
Summary:
Remove sst_file_manager option from LITE
Closes https://github.com/facebook/rocksdb/pull/1690

Differential Revision: D4341331

Pulled By: IslamAbdelRahman

fbshipit-source-id: 9f9328d
2016-12-21 17:54:21 -08:00
Jianpeng Ma
bd6cf7b51d WritableFileWriter: default buffer size equal min(64k,options.writabl?
Summary:
?e_file_max_buffer_size)

If we overwrite WritableFile and has a buffer which has the same
function of buf_. We hope remove the cache function of
WritableFileWriter. So using options.writable_file_max_buffer_size = 0
to disable cache function.

Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
Closes https://github.com/facebook/rocksdb/pull/1628

Differential Revision: D4307219

Pulled By: yiwu-arbug

fbshipit-source-id: 77a6e26
2016-12-16 13:09:14 -08:00
Daniel Black
816c1e30ca gcc-7 requires include <functional> for std::function
Summary:
Fixes compile error:

In file included from ./util/statistics.h:17:0,
                 from ./util/stop_watch.h:8,
                 from ./util/perf_step_timer.h:9,
                 from ./util/iostats_context_imp.h:8,
                 from ./util/posix_logger.h:27,
                 from ./port/util_logger.h:18,
                 from ./db/auto_roll_logger.h:15,
                 from db/auto_roll_logger.cc:6:
./util/thread_local.h:65:16: error: 'function' in namespace 'std' does not name a template type
   typedef std::function<void(void*, void*)> FoldFunc;
Closes https://github.com/facebook/rocksdb/pull/1656

Differential Revision: D4318702

Pulled By: yiwu-arbug

fbshipit-source-id: 8c5d17a
2016-12-16 11:24:18 -08:00
Daniel Black
0ab6fc167f Gcc-7 buffer size insufficient
Summary:
Bunch of commits related to insufficient buffer size. Errors in individual commits.
Closes https://github.com/facebook/rocksdb/pull/1673

Differential Revision: D4332127

Pulled By: IslamAbdelRahman

fbshipit-source-id: 878f73c
2016-12-14 19:24:26 -08:00
Daniel Black
b7239bf7e0 Gcc 7 fallthrough
Summary:
hopefully the last of the gcc-7 compile errors
Closes https://github.com/facebook/rocksdb/pull/1675

Differential Revision: D4332106

Pulled By: IslamAbdelRahman

fbshipit-source-id: 139448c
2016-12-14 19:24:25 -08:00
Daniel Black
477b6ea578 std::remove_if requires <algorithm>
Summary:
fixes error (that occurred on gcc-7):

error:

util/env_basic_test.cc: In member function 'virtual rocksdb::Status rocksdb::NormalizingEnvWrapper::GetChildren(const string&, std::vector<std::__cxx11::basic_string<char> >*)':
util/env_basic_test.cc:27:21: error: 'remove_if' is not a member of 'std'
       result->erase(std::remove_if(result->begin(), result->end(),
                     ^~~
Closes https://github.com/facebook/rocksdb/pull/1674

Differential Revision: D4331221

Pulled By: ajkr

fbshipit-source-id: 9bbdc78
2016-12-14 17:09:14 -08:00
Daniel Black
e097222e64 util/logging.cc: buffer of insufficient size (gcc-7 -Werror=format-length)
Summary:
util/logging.cc💯13: error: output may be truncated before the last format character [-Werror=format-length=]
 std::string NumberToHumanString(int64_t num) {
             ^~~~~~~~~~~~~~~~~~~
util/logging.cc:106:59: note: format output between 3 and 19 bytes into a destination of size 16
     snprintf(buf, sizeof(buf), "%" PRIi64 "K", num / 1000);
Closes https://github.com/facebook/rocksdb/pull/1653

Differential Revision: D4318687

Pulled By: yiwu-arbug

fbshipit-source-id: 3a5c931
2016-12-13 18:39:14 -08:00
Daniel Black
bfbcec2339 Gcc 7 error expansion to defined
Summary:
sorry if these gcc-7/clang-4 cleanups are getting tedious.
Closes https://github.com/facebook/rocksdb/pull/1658

Differential Revision: D4318792

Pulled By: yiwu-arbug

fbshipit-source-id: 8e85891
2016-12-13 18:39:14 -08:00
Daniel Black
c3e5ee7154 util/histogram.cc: HistogramStat::toString buffer insufficient
Summary:
Increased buffer size to 1650.

util/histogram.cc: In member function 'std::__cxx11::string rocksdb::HistogramStat::ToString() const':
util/histogram.cc:189:13: error: '%.2f' directive output truncated writing between 4 and 313 bytes into a region of size 0 [-Werror=format-length=]
 std::string HistogramStat::ToString() const {
             ^~~~~~~~~~~~~
util/histogram.cc:205:30: note: format output between 69 and 1614 bytes into a destination of size 200
            Percentile(99.99));
                              ^
cc1plus: all warnings being treated as errors
Makefile:1521: recipe for target 'util/histogram.o' failed
Closes https://github.com/facebook/rocksdb/pull/1660

Differential Revision: D4318820

Pulled By: yiwu-arbug

fbshipit-source-id: 45ae6ea
2016-12-13 14:09:12 -08:00
Andrew Kryczka
f0c509e2c8 Return finer-granularity status from Env::GetChildren*
Summary:
It'd be nice to use the error status type to distinguish
between user error and system error. For example, GetChildren can fail
listing a backup directory's contents either because a bad path was provided
(user error) or because an operation failed, e.g., a remote storage service
call failed (system error). In the former case, we want to continue and treat
the backup directory as empty; in the latter case, we want to immediately
propagate the error to the caller.

This diff uses NotFound to indicate user error and IOError to indicate
system error. Previously IOError indicated both.
Closes https://github.com/facebook/rocksdb/pull/1644

Differential Revision: D4312157

Pulled By: ajkr

fbshipit-source-id: 51b4f24
2016-12-12 12:54:13 -08:00
Manuel Ung
2005c88a75 Implement non-exclusive locks
Summary:
This is an implementation of non-exclusive locks for pessimistic transactions. It is relatively simple and does not prevent starvation (ie. it's possible that request for exclusive access will never be granted if there are always threads holding shared access). It is done by changing `KeyLockInfo` to hold an set a transaction ids, instead of just one, and adding a flag specifying whether this lock is currently held with exclusive access or not.

Some implementation notes:
- Some lock diagnostic functions had to be updated to return a set of transaction ids for a given lock, eg. `GetWaitingTxn` and `GetLockStatusData`.
- Deadlock detection is a bit more complicated since a transaction can now wait on multiple other transactions. A BFS is done in this case, and deadlock detection depth is now just a limit on the number of transactions we visit.
- Expirable transactions do not work efficiently with shared locks at the moment, but that's okay for now.
Closes https://github.com/facebook/rocksdb/pull/1573

Differential Revision: D4239097

Pulled By: lth

fbshipit-source-id: da7c074
2016-12-05 17:39:17 -08:00
Anton Safonov
9053fe2a5c Made delete_obsolete_files_period_micros option dynamic
Summary:
Made delete_obsolete_files_period_micros option dynamic. It can be updating using DB::SetDBOptions().
Closes https://github.com/facebook/rocksdb/pull/1595

Differential Revision: D4246569

Pulled By: tonek

fbshipit-source-id: d23f560
2016-12-05 14:24:16 -08:00
Islam AbdelRahman
4a21b1402c Cache heap::downheap() root comparison (optimize heap cmp call)
Summary:
Reduce number of comparisons in heap by caching which child node in the first level is smallest (left_child or right_child)
So next time we can compare directly against the smallest child

I see that the total number of calls to comparator drops significantly when using this optimization

Before caching (~2mil key comparison for iterating the DB)
```
$ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq" --db="/dev/shm/heap_opt" --use_existing_db --disable_auto_compactions --cache_size=1000000000  --perf_level=2
readseq      :       0.338 micros/op 2959201 ops/sec;  327.4 MB/s user_key_comparison_count = 2000008
```
After caching (~1mil key comparison for iterating the DB)
```
$ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq" --db="/dev/shm/heap_opt" --use_existing_db --disable_auto_compactions --cache_size=1000000000 --perf_level=2
readseq      :       0.309 micros/op 3236801 ops/sec;  358.1 MB/s user_key_comparison_count = 1000011
```

It also improves
Closes https://github.com/facebook/rocksdb/pull/1600

Differential Revision: D4256027

Pulled By: IslamAbdelRahman

fbshipit-source-id: 76fcc66
2016-12-01 13:39:14 -08:00
Islam AbdelRahman
e39d080871 Fix travis (compile for clang < 3.9)
Summary:
Travis fail because it uses clang 3.6 which don't recognize
`__attribute__((__no_sanitize__("undefined")))`
Closes https://github.com/facebook/rocksdb/pull/1601

Differential Revision: D4257175

Pulled By: IslamAbdelRahman

fbshipit-source-id: fb4d1ab
2016-12-01 10:09:22 -08:00
Islam AbdelRahman
52fd1ff2c2 disable UBSAN for functions with intentional -ve shift / overflow
Summary:
disable UBSAN for functions with intentional left shift on -ve number / overflow

These functions are
rocksdb:: Hash
FixedLengthColBufEncoder::Append
FaultInjectionTest:: Key
Closes https://github.com/facebook/rocksdb/pull/1577

Differential Revision: D4240801

Pulled By: IslamAbdelRahman

fbshipit-source-id: 3e1caf6
2016-11-28 17:54:12 -08:00
Islam AbdelRahman
63c30de80d fix options_test ubsan
Summary:
Having -ve value for max_write_buffer_number does not make sense and cause us to do a left shift on a -ve value number
Closes https://github.com/facebook/rocksdb/pull/1579

Differential Revision: D4240798

Pulled By: IslamAbdelRahman

fbshipit-source-id: bd6267e
2016-11-28 16:39:14 -08:00
Mike Kolupaev
236d4c67e9 Less linear search in DBIter::Seek() when keys are overwritten a lot
Summary:
In one deployment we saw high latencies (presumably from slow iterator operations) and a lot of CPU time reported by perf with this stack:

```
  rocksdb::MergingIterator::Next
  rocksdb::DBIter::FindNextUserEntryInternal
  rocksdb::DBIter::Seek
```

I think what's happening is:
1. we create a snapshot iterator,
2. we do lots of Put()s for the same key x; this creates lots of entries in memtable,
3. we seek the iterator to a key slightly smaller than x,
4. the seek walks over lots of entries in memtable for key x, skipping them because of high sequence numbers.

CC IslamAbdelRahman
Closes https://github.com/facebook/rocksdb/pull/1413

Differential Revision: D4083879

Pulled By: IslamAbdelRahman

fbshipit-source-id: a83ddae
2016-11-28 10:24:11 -08:00
Siying Dong
cd7c4143d7 Improve Write Stalling System
Summary:
Current write stalling system has the problem of lacking of positive feedback if the restricted rate is already too low. Users sometimes stack in very low slowdown value. With the diff, we add a positive feedback (increasing the slowdown value) if we recover from slowdown state back to normal. To avoid the positive feedback to keep the slowdown value to be to high, we add issue a negative feedback every time we are close to the stop condition. Experiments show it is easier to reach a relative balance than before.

Also increase level0_stop_writes_trigger default from 24 to 32. Since level0_slowdown_writes_trigger default is 20, stop trigger 24 only gives four files as the buffer time to slowdown writes. In order to avoid stop in four files while 20 files have been accumulated, the slowdown value must be very low, which is amost the same as stop. It also doesn't give enough time for the slowdown value to converge. Increase it to 32 will smooth out the system.
Closes https://github.com/facebook/rocksdb/pull/1562

Differential Revision: D4218519

Pulled By: siying

fbshipit-source-id: 95e4088
2016-11-23 09:24:15 -08:00
Nick Terrell
4444256ab7 Remove use of deprecated LZ4 function
Summary:
LZ4 1.7.3 emits warnings when calling the deprecated function `LZ4_compress_limitedOutput_continue()`.  Starting in r129, LZ4 introduces `LZ4_compress_fast_continue()` as a replacement, and the two functions calls are [exactly equivalent](https://github.com/lz4/lz4/blob/dev/lib/lz4.c#L1408).
Closes https://github.com/facebook/rocksdb/pull/1532

Differential Revision: D4199240

Pulled By: siying

fbshipit-source-id: 138c2bc
2016-11-21 12:24:14 -08:00
Changli Gao
548d7fb261 Fix fd leak when using direct IOs
Summary:
We should close the fd, before overriding it. This bug was
introduced by f89caa127b
Closes https://github.com/facebook/rocksdb/pull/1553

Differential Revision: D4214101

Pulled By: siying

fbshipit-source-id: 0d65de0
2016-11-21 12:24:13 -08:00
Andrew Kryczka
fd43ee09da Range deletion microoptimizations
Summary:
- Made RangeDelAggregator's InternalKeyComparator member a reference-to-const so we don't need to copy-construct it. Also added InternalKeyComparator to ImmutableCFOptions so we don't need to construct one for each DBIter.
- Made MemTable::NewRangeTombstoneIterator and the table readers' NewRangeTombstoneIterator() functions return nullptr instead of NewEmptyInternalIterator to avoid the allocation. Updated callers accordingly.
Closes https://github.com/facebook/rocksdb/pull/1548

Differential Revision: D4208169

Pulled By: ajkr

fbshipit-source-id: 2fd65cf
2016-11-21 12:24:13 -08:00
Changli Gao
a0deec960f Fix deadlock when calling getMergedHistogram
Summary:
When calling StatisticsImpl::HistogramInfo::getMergedHistogram(), if
there is a dying thread, which is calling
ThreadLocalPtr::StaticMeta::OnThreadExit() to merge its thread values to
HistogramInfo, deadlock will occur. Because the former try to hold
merge_lock then ThreadMeta::mutex_, but the later try to hold
ThreadMeta::mutex_ then merge_lock. In short, the locking order isn't
the same.

This patch addressed this issue by releasing merge_lock before folding
thread values.
Closes https://github.com/facebook/rocksdb/pull/1552

Differential Revision: D4211942

Pulled By: ajkr

fbshipit-source-id: ef89bcb
2016-11-20 18:24:12 -08:00
Manuel Ung
e63350e726 Use more efficient hash map for deadlock detection
Summary:
Currently, deadlock cycles are held in std::unordered_map. The problem with it is that it allocates/deallocates memory on every insertion/deletion. This limits throughput since we're doing this expensive operation while holding a global mutex. Fix this by using a vector which caches memory instead.

Running the deadlock stress test, this change increased throughput from 39k txns/s -> 49k txns/s. The effect is more noticeable in MyRocks.
Closes https://github.com/facebook/rocksdb/pull/1545

Differential Revision: D4205662

Pulled By: lth

fbshipit-source-id: ff990e4
2016-11-19 11:39:15 -08:00
Siying Dong
73843aa636 Direct I/O Reads Handle the last sector correctly.
Summary:
Currently, in the Direct I/O read mode, the last sector of the file, if not full, is not handled correctly. If the return value of pread is not multiplier of kSectorSize, we still go ahead and continue reading, even if the buffer is not aligned. With the commit, if the return value is not multiplier of kSectorSize, and all but the last sector has been read, we simply return.
Closes https://github.com/facebook/rocksdb/pull/1550

Differential Revision: D4209609

Pulled By: lightmark

fbshipit-source-id: cb0b439
2016-11-18 19:24:13 -08:00
Maysam Yabandeh
9d60151b04 Implement PositionedAppend for PosixWritableFile
Summary:
This patch clarifies the contract of PositionedAppend with some unit
tests and also implements it for PosixWritableFile. (Tasks: 14524071)
Closes https://github.com/facebook/rocksdb/pull/1514

Differential Revision: D4204907

Pulled By: maysamyabandeh

fbshipit-source-id: 06eabd2
2016-11-18 17:24:13 -08:00
Siying Dong
972e3ff295 Enable allow_concurrent_memtable_write and enable_write_thread_adaptive_yield by default
Summary: Closes https://github.com/facebook/rocksdb/pull/1496

Differential Revision: D4168080

Pulled By: siying

fbshipit-source-id: 056ae62
2016-11-16 09:39:09 -08:00
Yi Wu
1543d5d92e Report memory usage by memtable insert hints map.
Summary:
It is hard to measure acutal memory usage by std containers. Even
providing a custom allocator will miss count some of the usage. Here we
only do a wild guess on its memory usage.
Closes https://github.com/facebook/rocksdb/pull/1511

Differential Revision: D4179945

Pulled By: yiwu-arbug

fbshipit-source-id: 32ab929
2016-11-15 20:24:13 -08:00
Artemiy Kolesnikov
91300d01f6 Dynamic max_total_wal_size option
Summary: Closes https://github.com/facebook/rocksdb/pull/1509

Differential Revision: D4176426

Pulled By: yiwu-arbug

fbshipit-source-id: b57689d
2016-11-14 22:54:17 -08:00
Yi Wu
1ea79a78c9 Optimize sequential insert into memtable - Part 1: Interface
Summary:
Currently our skip-list have an optimization to speedup sequential
inserts from a single stream, by remembering the last insert position.
We extend the idea to support sequential inserts from multiple streams,
and even tolerate small reordering wihtin each stream.

This PR is the interface part adding the following:
- Add `memtable_insert_prefix_extractor` to allow specifying prefix for each key.
- Add `InsertWithHint()` interface to memtable, to allow underlying
  implementation to return a hint of insert position, which can be later
  pass back to optimize inserts.
- Memtable will maintain a map from prefix to hints and pass the hint
  via `InsertWithHint()` if `memtable_insert_prefix_extractor` is non-null.
Closes https://github.com/facebook/rocksdb/pull/1419

Differential Revision: D4079367

Pulled By: yiwu-arbug

fbshipit-source-id: 3555326
2016-11-13 19:09:18 -08:00
Lijun Tang
adb665e0bf Allowed delayed_write_rate option to be dynamically set.
Summary: Closes https://github.com/facebook/rocksdb/pull/1488

Differential Revision: D4157784

Pulled By: siying

fbshipit-source-id: f150081
2016-11-12 15:54:11 -08:00
Andrew Kryczka
4e20c5da20 Store internal keys in TombstoneMap
Summary:
This fixes a correctness issue where ranges with same begin key would overwrite each other.

This diff uses InternalKey as TombstoneMap's key such that all tombstones have unique keys even when their start keys overlap. We also update TombstoneMap to use an internal key comparator.

End-to-end tests pass and are here (https://gist.github.com/ajkr/851ffe4c1b8a15a68d33025be190a7d9) but cannot be included yet since the DeleteRange() API is yet to be checked in. Note both tests failed before this fix.
Closes https://github.com/facebook/rocksdb/pull/1484

Differential Revision: D4155248

Pulled By: ajkr

fbshipit-source-id: 304b4b9
2016-11-09 15:09:18 -08:00
Andrew Kryczka
f998c9790f DeleteRange Get support
Summary:
During Get()/MultiGet(), build up a RangeDelAggregator with range
tombstones as we search through live memtable, immutable memtables, and
SST files. This aggregator is then used by memtable.cc's SaveValue() and
GetContext::SaveValue() to check whether keys are covered.

added tests for Get on memtables/files; end-to-end tests mainly in https://reviews.facebook.net/D64761
Closes https://github.com/facebook/rocksdb/pull/1456

Differential Revision: D4111271

Pulled By: ajkr

fbshipit-source-id: 6e388d4
2016-11-03 18:54:20 -07:00
Yi Wu
437942e481 Add avoid_flush_during_shutdown DB option
Summary:
Add avoid_flush_during_shutdown DB option.
Closes https://github.com/facebook/rocksdb/pull/1451

Differential Revision: D4108643

Pulled By: yiwu-arbug

fbshipit-source-id: abdaf4d
2016-11-02 15:39:18 -07:00
Benoit Girard
2b16d664cb Change max_bytes_for_level_multiplier to double
Summary: Closes https://github.com/facebook/rocksdb/pull/1427

Differential Revision: D4094732

Pulled By: yiwu-arbug

fbshipit-source-id: b9b79e9
2016-11-01 21:09:23 -07:00
Siying Dong
c90c48d3c8 Show More DB Stats in info logs
Summary:
DB Stats now are truncated if there are too many CFs. Extend the buffer size to allow more to be printed out. Also, separate out malloc to another log line.
Closes https://github.com/facebook/rocksdb/pull/1439

Differential Revision: D4100943

Pulled By: yiwu-arbug

fbshipit-source-id: 79f7218
2016-10-29 16:09:18 -07:00
Nipunn Koorapati
25f5742f0b Update documentation to point at gcc 4.8
Summary:
Rocksdb currently has many references to std::map.emplace_back()
which is not implemented in gcc 4.7, but valid in gcc 4.8. Confirmed that
it did not build with gcc 4.7, but builds fine with gcc 4.8
Closes https://github.com/facebook/rocksdb/pull/1272

Differential Revision: D4101385

Pulled By: IslamAbdelRahman

fbshipit-source-id: f6af453
2016-10-29 12:09:17 -07:00
Willem Jan Withagen
0aab5e55f0 FreeBSD: malloc_usable_size is in <malloc_np.h> (#1428)
Signed-off-by: Willem Jan Withagen <wjw@digiware.nl>
2016-10-28 10:44:52 -07:00
Kien-hung Li
eeb27e1bbd Add handy option to turn on direct I/O in db_bench (#1424) 2016-10-28 10:36:05 -07:00
Kefu Chai
60a2bbba94 Makefile: generate util/build_version.cc from .in file (#1384)
* util/build_verion.cc.in: add this file, so cmake and make can share the
  template file for generating util/build_version.cc.
* CMakeLists.txt: also, cmake v2.8.11 does not support file(GENERATE ...),
  so we are using configure_file() for creating build_version.cc.
* Makefile: use util/build_verion.cc.in for creating build_version.cc.

Signed-off-by: Kefu Chai <tchaikov@gmail.com>
2016-10-25 11:31:39 -07:00
sdong
24495186da DBSSTTest.RateLimitedDelete: not to use real clock
Summary: Using real clock causes failures of DBSSTTest.RateLimitedDelete in some cases. Turn away from the real time. Use fake time instead.

Test Plan: Run the tests and all existing tests.

Reviewers: yiwu, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D65145
2016-10-24 10:35:00 -07:00
Aaron Gao
59a7c0337b Change ioptions to store user_comparator, fix bug
Summary:
change ioptions.comparator to user_comparator instread of internal_comparator.
Also change Comparator* to InternalKeyComparator* to make its type explicitly.

Test Plan: make all check -j64

Reviewers: andrewkr, sdong, yiwu

Reviewed By: yiwu

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D65121
2016-10-21 11:31:42 -07:00
Islam AbdelRahman
b88f8e87c5 Support SST files with Global sequence numbers [reland]
Summary:
reland https://reviews.facebook.net/D62523

- Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno`
- Update TableProperties to be aware of the offset of each property in the file
- Update BlockBasedTableReader and Block to be able to honor the sequence number in `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence number in the file

Something worth mentioning is that we don't update the seqno in the index block since and when doing a binary search, the reason for that is that it's guaranteed that SST files with global seqno will have only one user_key and each key will have seqno=0 encoded in it, This mean that this key is greater than any other key with seqno> 0. That mean that we can actually keep the current logic for these blocks

Test Plan: unit tests

Reviewers: sdong, yhchiang

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D65211
2016-10-18 16:59:37 -07:00
Dmitri Smirnov
b1031d6c12 Remove function local statics that interfere with memory pooling (#1392) 2016-10-14 13:09:18 -07:00
Yi Wu
e29d3b67c2 Make max_background_compactions and base_background_compactions dynamic changeable
Summary:
Add DB::SetDBOptions to dynamic change max_background_compactions and base_background_compactions.
I'll add more dynamic changeable options soon.

Test Plan: unit test.

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64749
2016-10-14 12:25:39 -07:00
Yueh-Hsuan Chiang
8cbe3e10ca Relax the acceptable bias RateLimiterTest::Rate test be 25%
Summary:
In the current implementation of RateLimiter, the difference
between the configured rate and the actual rate might be more
than 20%, while our test only allows 15% difference.  This diff
relaxes the acceptable bias RateLimiterTest::Rate test be 25%
to make the test less flaky.

Test Plan: rate_limiter_test

Reviewers: IslamAbdelRahman, andrewkr, yiwu, lightmark, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64941
2016-10-13 14:26:12 -07:00
Peter (Stig) Edwards
f8d8cf53fe Fix log_write_bench -bytes_per_sync option. (#1375)
Hello and thanks for RocksDB,
 
When log_write_bench is run with the -bytes_per_sync option, the option does not influence any *sync* behaviour.
 
> strace -e trace=write,sync_file_range ./log_write_bench -record_interval 0 -record_size 1048576 -num_records 11 -bytes_per_sync 2097152 2>&1 | egrep '^(sync|write.*XXXX)'
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
 
I suspect that this is because the bytes_per_sync option now needs to be using a `WritableFileWriter` and not a `WritableFile`.
 
With the diff below applied, it changes to:
 
> strace -e trace=write,sync_file_range ./log_write_bench -record_interval 0 -record_size 1048576 -num_records 11 -bytes_per_sync 2097152 2>&1 | egrep '^(sync|write.*XXXX)'
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
sync_file_range(0x3, 0, 0x200000, 0x2)  = 0
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
sync_file_range(0x3, 0x200000, 0x200000, 0x2) = 0
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
sync_file_range(0x3, 0x400000, 0x200000, 0x2) = 0
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
sync_file_range(0x3, 0x600000, 0x200000, 0x2) = 0
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
sync_file_range(0x3, 0x800000, 0x200000, 0x2) = 0
 
( Note that the first 1MB is not synced as mentioned in util/file_reader_writer.cc::WritableFileWriter::Flush() )
 
This diff also includes the fix from https://github.com/facebook/rocksdb/pull/1373
 
> diff -du util/log_write_bench.cc.orig util/log_write_bench.cc
--- util/log_write_bench.cc.orig        2016-10-04 12:06:29.115122580 -0400
+++ util/log_write_bench.cc     2016-10-05 07:24:09.677037576 -0400
@@ -14,6 +14,7 @@
 #include <gflags/gflags.h>

 #include "rocksdb/env.h"
+#include "util/file_reader_writer.h"
 #include "util/histogram.h"
 #include "util/testharness.h"
 #include "util/testutil.h"
@@ -38,19 +39,21 @@
   env_options.bytes_per_sync = FLAGS_bytes_per_sync;
   unique_ptr<WritableFile> file;
   env->NewWritableFile(file_name, &file, env_options);
+  unique_ptr<WritableFileWriter> writer;
+  writer.reset(new WritableFileWriter(std::move(file), env_options));

   std::string record;
-  record.assign('X', FLAGS_record_size);
+  record.assign(FLAGS_record_size, 'X');

   HistogramImpl hist;

   uint64_t start_time = env->NowMicros();
   for (int i = 0; i < FLAGS_num_records; i++) {
     uint64_t start_nanos = env->NowNanos();
-    file->Append(record);
-    file->Flush();
+    writer->Append(record);
+    writer->Flush();
     if (FLAGS_enable_sync) {
-      file->Sync();
+      writer->Sync(false);
     }
     hist.Add(env->NowNanos() - start_nanos);
2016-10-11 16:45:51 -07:00
Yi Wu
d6ae6dec69 Add Statistics::getAndResetTickerCount().
Summary: A convience method to atomically get and reset ticker count. I'm wanting to use it to have a thin wrapper to the statistics object to export ticker counts to ODS for LogDevice (since they don't even use fb303).

Test Plan:
test in LogDevice shadow cluster.
https://fburl.com/461868822

Reviewers: andrewkr, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64869
2016-10-11 10:54:11 -07:00
Islam AbdelRahman
2ad68b971a Support running consistency checks in release mode
Summary:
We always run consistency checks when compiling in debug mode
allow users to set Options::force_consistency_checks to true to be able to run such checks even when compiling in release mode

Test Plan:
make check -j64
make release

Reviewers: lightmark, sdong, yiwu

Reviewed By: yiwu

Subscribers: hermanlee4, andrewkr, yoshinorim, jkedgar, dhruba

Differential Revision: https://reviews.facebook.net/D64701
2016-10-07 17:21:45 -07:00
Islam AbdelRahman
67501cfc9a Fix -ve std::string::resize
Summary:
I saw this exception thrown because sometimes we may resize with -ve value
if we have empty max_bytes_for_level_multiplier_additional vector

Test Plan: run the tests

Reviewers: yiwu

Reviewed By: yiwu

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D64791
2016-10-07 17:16:13 -07:00
Islam AbdelRahman
d062328977 Revert "Support SST files with Global sequence numbers"
This reverts commit ab01da5437.
2016-10-07 14:05:12 -07:00
Peter (Stig) Edwards
043cb62d63 Fix record_size in log_write_bench, swap args to std::string::assign. (#1373)
Hello and thank you for RocksDB,
 
I noticed when using log_write_bench that writes were always 88 bytes:
 
> strace -e trace=write ./log_write_bench -num_records 2 2>&1 | head -n 2
write(3, "\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371"..., 88) = 88
write(3, "\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371"..., 88) = 88

> strace -e trace=write ./log_write_bench -record_size 4096 -num_records 2 2>&1 | head -n 2
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 88) = 88
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 88) = 88
 
I think this should be:

<<    record.assign('X', FLAGS_record_size);
>>    record.assign(FLAGS_record_size, 'X');

So fill and not buffer. Otherwise I always see writes of size 88 (the decimal value for chr "X").

string& assign (const char* s, size_t n);
buffer - Copies the first n characters from the array of characters pointed by s.

string& assign (size_t n, char c);
fill   - Replaces the current value by n consecutive copies of character c.

perl -le 'print ord "X"'
88
 
With the change:
 
> strace -e trace=write ./log_write_bench -record_size 4096 -num_records 2 2>&1 | head -n 2
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 4096) = 4096
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 4096) = 4096
 
> strace -e trace=write ./log_write_bench -num_records 2 2>&1 | head -n 2
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 249) = 249
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 249) = 249

Thanks.

01c27be5fb
https://reviews.facebook.net/D16239
2016-10-06 10:45:31 -07:00
Islam AbdelRahman
9d6c961383 Fix Mac build 2016-10-03 18:25:10 -07:00
Islam AbdelRahman
ab01da5437 Support SST files with Global sequence numbers
Summary:
- Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno`
- Update TableProperties to be aware of the offset of each property in the file
- Update BlockBasedTableReader and Block to be able to honor the sequence number in `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence number in the file

Something worth mentioning is that we don't update the seqno in the index block since and when doing a binary search, the reason for that is that it's guaranteed that SST files with global seqno will have only one user_key and each key will have seqno=0 encoded in it, This mean that this key is greater than any other key with seqno> 0. That mean that we can actually keep the current logic for these blocks

Test Plan: unit tests

Reviewers: andrewkr, yhchiang, yiwu, sdong

Reviewed By: sdong

Subscribers: hcz, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D62523
2016-10-03 16:12:39 -07:00
Aaron Gao
f517d9dd09 Add SeekForPrev() to Iterator
Summary:
Add new Iterator API, `SeekForPrev`: find the last key that <= target key
support prefix_extractor
support prefix_same_as_start
support upper_bound
not supported in iterators without Prev()

Also add tests in db_iter_test and db_iterator_test

Pass all tests
Cheers!

Test Plan: make all check -j64

Reviewers: andrewkr, yiwu, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64149
2016-09-27 18:20:57 -07:00
Yi Wu
9ed928e7a9 Split DBOptions into ImmutableDBOptions and MutableDBOptions
Summary: Use ImmutableDBOptions/MutableDBOptions internally and DBOptions only for user-facing APIs. MutableDBOptions is barely a placeholder for now. I'll start to move options to MutableDBOptions in following diffs.

Test Plan:
  make all check

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64065
2016-09-23 16:34:04 -07:00
Islam AbdelRahman
5735b3dc2a Fix compiling under -Werror=missing-field-initializers
Summary:
MyRocks build is broken because they are using "-Werror=missing-field-initializers"
We should fix that by explicitly passing these arguments

Test Plan: Build MyRocks

Reviewers: sdong, yiwu

Reviewed By: yiwu

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D64161
2016-09-20 13:02:41 -07:00
Aaron Gao
654ed9a280 loose the assertion condition of rate_limiter_test
Summary: 0.9 can make the test flaky since just found one test fail with 0.88

Test Plan: make all check

Reviewers: sdong, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D63939
2016-09-20 12:28:59 -07:00
sdong
3edb9461b7 Avoid hard-coded sleep in EnvPosixTestWithParam.TwoPools
Summary: EnvPosixTestWithParam.TwoPools relies on explicit sleeping, so it sometimes fail. Fix it.

Test Plan: Run tests with high parallelism many times and make sure the test passes.

Reviewers: yiwu, andrewkr

Reviewed By: andrewkr

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D63417
2016-09-16 17:45:12 -07:00
Yi Wu
17f76fc564 DB::GetOptions() reflect dynamic changed options
Summary: DB::GetOptions() reflect dynamic changed options.

Test Plan: See the new unit test.

Reviewers: yhchiang, sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D63903
2016-09-14 22:10:28 -07:00
Yi Wu
8e061f9740 Refactor GetMutableOptionsFromStrings
Summary: Add mutable options info into `OptionsTypeInfo` and use it to parse mutable options map. Also support `max_bytes_for_level_multiplier_additional` in option file.

Test Plan: unit test

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D63843
2016-09-13 21:12:43 -07:00
Yi Wu
81747f1be6 Refactor MutableCFOptions
Summary:
* Change constructor of MutableCFOptions to depends only on ColumnFamilyOptions.
* Move `max_subcompactions`, `compaction_options_fifo` and `compaction_pri` to ImmutableCFOptions to make it clear that they are immutable.

Test Plan: existing unit tests.

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D63945
2016-09-13 21:11:59 -07:00
Islam AbdelRahman
ba65c816bb Support POSIX RandomRWFile
Summary:
Add Env::RandomRWFile in env.h and implement it for POSIX
RandomRWFile is a file that allow us to read from / write to random offsets in the file

I will implement it for other Envs later after finishing the whole task for AddFile()

Test Plan: unit tests

Reviewers: andrewkr, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D62433
2016-09-13 12:08:22 -07:00
Islam AbdelRahman
52ee07b021 Move AddFile() tests to external_sst_file_test.cc
Summary: Simply move the tests

Test Plan: make check -j64

Reviewers: andrewkr, lightmark, yiwu, yhchiang, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D62529
2016-09-07 15:41:54 -07:00
Edouard A
66a91e2607 Add NoSpace subcode to IOError (#1320)
Add a sub code to distinguish "out of space" errors from regular I/O errors
2016-09-07 12:37:45 -07:00
sdong
607628d349 Support ZSTD with finalized format
Summary:
ZSTD 1.0.0 is coming. We can finally add a support of ZSTD without worrying about compatibility.
Still keep ZSTDNotFinal for compatibility reason.

Test Plan: Run all tests. Run db_bench with ZSTD version with RocksDB built with ZSTD 1.0 and older.

Reviewers: andrewkr, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: cyan, igor, IslamAbdelRahman, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D63141
2016-09-06 12:22:16 -07:00
sdong
f7669b40ba Fix Windows Build
Summary: Fix two Windows build problems.

Test Plan: Build on Windows and run all Linux tests.

Reviewers: IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D63189
2016-09-02 17:10:28 -07:00
Yi Wu
a88677d2cf Remove ImmutableCFOptions from public API
Summary: There's no reference to ImmutableCFOptions elsewhere in /include/rocksdb. ImmutableCFOptions was introduced in this commit (5665e5e285) but later its reference in /include/rocksdb/table.h is removed.

Test Plan:
  make all check

Reviewers: IslamAbdelRahman, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: yhchiang, andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D63177
2016-09-02 14:16:31 -07:00
John Alexander
4fd08f4b8b Ensure Correct Behavior of StatsLevel kExceptDetailedTimers and kExceptTimeForMutex (#1308)
* Fix StatsLevel so that kExceptTimeForMutex leaves compression stats enabled and kExceptDetailedTimers disables mutex lock stats. Also change default stats level to kExceptDetailedTimers (disabling both compression and mutex timing).

* Changed order of StatsLevel enum to simplify logic for determining what stats to record.
2016-09-01 19:57:55 -07:00
sdong
32149059f9 Merge options source_compaction_factor, max_grandparent_overlap_bytes and expanded_compaction_factor into max_compaction_bytes
Summary: To reduce number of options, merge source_compaction_factor, max_grandparent_overlap_bytes and expanded_compaction_factor into max_compaction_bytes.

Test Plan: Add two new unit tests. Run all existing tests, including jtest.

Reviewers: yhchiang, igor, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59829
2016-09-01 14:33:24 -07:00
Aaron Gao
4590b53a4b add stats to Cache::LookUp()
Summary: basically for SimCache stats. I find most times it is hard to pass Statistics* to SimCache constructor.

Test Plan: make all check

Reviewers: andrewkr, sdong, yiwu

Reviewed By: yiwu

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D62193
2016-09-01 13:50:39 -07:00
Andrew Kryczka
1613fa9490 Thread-specific histogram statistics
Summary:
To reduce contention for atomics when HistogramStats are shared across
threads, this diff makes them thread-specific so updates are faster. This comes
at the expense of slower reads (much less frequent), which now require merging
all histograms. In this diff,

- Thread-specific HistogramImpl is created upon the thread's first measureTime()
- Thread-specific HistogramImpl are merged and deleted upon thread termination or ThreadLocalPtr destruction, whichever comes first
- getHistogramString() and histogramData() merge all histograms, both thread-specific and previously merged ones

Test Plan:
unit tests, ran db_bench and verified histograms look similar

before:

  $ TEST_TMPDIR=/dev/shm/ perf record -g ./db_bench --benchmarks=readwhilewriting --statistics --num=1000000 --use_existing_db --threads=64 --cache_size=250000000 --compression_type=lz4
  ...
  +    7.63%  db_bench     db_bench             [.] rocksdb::HistogramStat::Add

after:

  $ TEST_TMPDIR=/dev/shm/ perf record -g ./db_bench --benchmarks=readwhilewriting --statistics --num=1000000 --use_existing_db --threads=64 --cache_size=250000000 --compression_type=lz4
  ...
  +    0.98%  db_bench     db_bench             [.] rocksdb::HistogramStat::Add

Reviewers: sdong, MarkCallaghan, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D62649
2016-08-31 14:02:09 -07:00
Yi Wu
de47e2bd4d Fix ClockCache memory leak
Summary:
Fix ClockCache memory leak found by valgrind:
# Add destructor to cleanup cached values.
# Delete key with cache handle immediately after handle is recycled, and erase table entry immediately if duplicated cache entry is inserted.

Test Plan:
    make DISABLE_JEMALLOC=1 valgrind_check

Reviewers: IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D62973
2016-08-31 08:56:34 -07:00
Yi Wu
7541c7a79e Fix cache_test valgrind_check failure
Summary: Refactor cache_test to get around gtest valgrind failure.

Test Plan:
    make valgrind_check

Reviewers: sdong, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D62817
2016-08-29 10:40:00 -07:00
Islam AbdelRahman
b49b92cf28 Introduce Read amplification bitmap (read amp statistics)
Summary:
Add ReadOptions::read_amp_bytes_per_bit option which allow us to create a bitmap for every data block we read
the bitmap will contain (block_size / read_amp_bytes_per_bit) bits.

We will use this bitmap to mark which bytes have been used of the block so we can calculate the read amplification

Test Plan: added new tests

Reviewers: andrewkr, yhchiang, sdong

Reviewed By: sdong

Subscribers: yiwu, leveldb, march, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58707
2016-08-26 18:55:58 -07:00
Islam AbdelRahman
e9b2af87f8 Expose ThreadPool under include/rocksdb/threadpool.h
Summary:
This diff split ThreadPool to
-ThreadPool (abstract interface exposed in include/rocksdb/threadpool.h)
-ThreadPoolImpl (actual implementation in util/threadpool_imp.h)

This allow us to expose ThreadPool to the user so we can use it as an option later

Test Plan: existing unit tests

Reviewers: andrewkr, yiwu, yhchiang, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D62085
2016-08-26 10:41:35 -07:00
Andrew Kryczka
a081f798b0 Relax consistency for thread-local ticker stats
Summary: see discussion in D62337

Test Plan: unit tests

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D62577
2016-08-25 10:42:26 -07:00
Andrew Kryczka
7c95868378 Thread-specific ticker statistics
Summary:
The global atomics we previously used for tickers had poor cache performance
since they were typically updated from different threads, causing frequent
invalidations. In this diff,

- recordTick() updates a local ticker value specific to the thread in which it was called
- When a thread exits, its local ticker value is added into merged_sum
- getTickerCount() returns the sum of all threads' local ticker values and the merged_sum
- setTickerCount() resets all threads' local ticker values and sets merged_sum to the value provided by the caller.

In a next diff I will make a similar change for histogram stats.

Test Plan:
before:

  $ TEST_TMPDIR=/dev/shm/ perf record -g ./db_bench --benchmarks=readwhilewriting --statistics --num=1000000 --use_existing_db --threads=64 --cache_size=250000000 --compression_type=lz4
  $ perf report -g --stdio | grep recordTick
  7.59%  db_bench     db_bench             [.] rocksdb::StatisticsImpl::recordTick
  ...

after:

  $ TEST_TMPDIR=/dev/shm/ perf record -g ./db_bench --benchmarks=readwhilewriting --statistics --num=1000000 --use_existing_db --threads=64 --cache_size=250000000 --compression_type=lz4
  $ perf report -g --stdio | grep recordTick
  1.46%  db_bench     db_bench             [.] rocksdb::StatisticsImpl::recordTick
  ...

Reviewers: kradhakrishnan, MarkCallaghan, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: yiwu, andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D62337
2016-08-24 15:42:31 -07:00
Yi Wu
badbff65b7 Not insert into block cache if cache is full and not holding handle
Summary:
We used to allow insert into full block cache as long as `strict_capacity_limit=false`. This diff further restrict insert to full cache if caller don't intent to hold handle to the cache entry after insert.

Hope this diff fix the assertion failure with db_stress: https://our.intern.facebook.com/intern/sandcastle/log/?instance_id=211853102&step_id=2475070014

  db_stress: util/lru_cache.cc:278: virtual void rocksdb::LRUCacheShard::Release(rocksdb::Cache::Handle*): Assertion `lru_.next == &lru_' failed.

The assertion at lru_cache.cc:278 can fail when an entry is inserted into full cache and stay in LRU list.

Test Plan:
  make all check

Reviewers: IslamAbdelRahman, lightmark, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D62325
2016-08-23 13:53:49 -07:00
Yi Wu
4a16c32ece Option to cache index/filter blocks with priority
Summary:
Add option to block based table to insert index/filter blocks to block cache with priority. Combined with LRUCache with high_pri_pool_ratio, we can reserved space for index/filter blocks, make them less likely to be evicted.

Depends on D61977.

Test Plan: See unit test.

Reviewers: lightmark, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, march, leveldb

Differential Revision: https://reviews.facebook.net/D62241
2016-08-23 13:44:13 -07:00
Andrew Kryczka
f57bc1d034 Fix lambda expression for clang/windows
Summary:
make the variables static so capturing is unnecessary since I couldn't
find a portable way to capture variables in a lambda that's converted to a
C-style pointer-to-function.

Test Plan: https://ci.appveyor.com/project/Facebook/rocksdb/build/1.0.1658

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D62403
2016-08-23 13:34:56 -07:00
Andrew Kryczka
5440675c3d Fix lambda capture expression for windows
Summary:
there was an error when accessing kItersPerThread in the lambda:
https://ci.appveyor.com/project/Facebook/rocksdb/build/1.0.1654

Test Plan: doitlive

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D62379
2016-08-23 10:20:00 -07:00
Andrew Kryczka
6584cec8f2 Fold function for thread-local data
Summary:
This function allows the user to provide a custom function to fold all
threads' local data. It will be used in my next diff for aggregating statistics
stored in thread-local data. Note the test case uses atomics as thread-local
values due to the synchronization requirement (documented in code).

Test Plan: unit test

Reviewers: yhchiang, sdong, kradhakrishnan

Reviewed By: kradhakrishnan

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D62049
2016-08-22 15:37:39 -07:00
Yi Wu
72f8cc703c LRU cache mid-point insertion
Summary:
Add mid-point insertion functionality to LRU cache. Caller of `Cache::Insert()` can set an additional parameter to make a cache entry have higher priority. The LRU cache will reserve at most `capacity * high_pri_pool_pct` bytes for high-pri cache entries. If `high_pri_pool_pct` is zero, the cache degenerates to normal LRU cache.

Context: If we are to put index and filter blocks into RocksDB block cache, index/filter block can be swap out too early. We want to add an option to RocksDB to reserve some capacity in block cache just for index/filter blocks, to mitigate the issue.

In later diffs I'll update block based table reader to use the interface to cache index/filter blocks at high priority, and expose the option to `DBOptions` and make it dynamic changeable.

Test Plan: unit test.

Reviewers: IslamAbdelRahman, sdong, lightmark

Reviewed By: lightmark

Subscribers: andrewkr, dhruba, march, leveldb

Differential Revision: https://reviews.facebook.net/D61977
2016-08-19 16:43:31 -07:00
Wanning Jiang
78837f5d61 TableBuilder / TableReader support for range deletion
Summary: 1. Range Deletion Tombstone structure 2. Modify Add() in table_builder to make it usable for adding range del tombstones 3. Expose NewTombstoneIterator() API in table_reader

Test Plan: table_test.cc (now BlockBasedTableBuilder::Add() only accepts InternalKey. I make table_test only pass InternalKey to BlockBasedTableBuidler. Also test writing/reading range deletion tombstones in table_test )

Reviewers: sdong, IslamAbdelRahman, lightmark, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61473
2016-08-19 15:10:31 -07:00
Yi Wu
4cc37f59e5 Introduce ClockCache
Summary:
Clock-based cache implemenetation aim to have better concurreny than
default LRU cache. See inline comments for implementation details.

Test Plan:
Update cache_test to run on both LRUCache and ClockCache. Adding some
new tests to catch some of the bugs that I fixed while implementing the
cache.

Reviewers: kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61647
2016-08-19 12:28:19 -07:00
Dmitri Smirnov
3981345be1 Small nits (#1280)
* Create rate limiter using factory function in the test.

* Convert function local statics in option helper to a C array
  that does not perform dynamic memory allocation. This is helpful
  when you try to memory isolate different DB instances.
2016-08-17 00:42:35 -07:00
Yi Wu
2a2ebb6f5e Move LRUCache structs to lru_cache.h header
Summary: ... so that I can include the header and create LRUCache specific tests for D61977

Test Plan:
   make check

Reviewers: lightmark, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D62145
2016-08-16 14:43:41 -07:00
Anirban Rahut
2fc2fd92a9 Single Delete Mismatch and Fallthrough statistics
Summary:
Added 2 statistics in compaction job statistics, to
identify if single deletes are not meeting a matching key
(fallthrough) or single deletes are meeting a merge, delete or
another single delete (i.e. not the expected case of put).

Test Plan: Tested the statistics using write_stress and compaction_job_stats_test

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D61749
2016-08-16 08:21:43 -07:00
Andrew Kryczka
236756f2cf Make SyncPoint return immediately when disabled
Summary:
We were frequently seeing a race between SyncPoint::Process() and
SyncPoint::~SyncPoint() (e.g.,
https://our.intern.facebook.com/intern/sandcastle/log/?instance_id=207289975&step_id=2412725431).
The issue was marked_thread_id_ gets deleted when the main thread is exiting and
simultaneously background threads may access it. We can prevent this race
condition by checking whether sync points are disabled (assuming the test terminates
with them disabled) before attempting to access that member. I do not understand
why accesses to other members (mutex_ and enabled_) are ok but anyways the
test no longer fails tsan.

Test Plan: ran tests

Reviewers: sdong, yhchiang, IslamAbdelRahman, yiwu, wanning

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D62133
2016-08-16 06:19:46 -07:00
Yueh-Hsuan Chiang
b248e98cf9 Fix a destruction order issue in ThreadStatusUpdater
Summary:
Env holds a pointer of ThreadStatusUpdater, which will be deleted when
Env is deleted.  However, in case a rocksdb database is deleted after
Env is deleted.  Then this will introduce a free-after-use of this
ThreadStatusUpdater.

This patch fix this by never deleting the ThreadStatusUpdater in Env,
which is in general safe as Env is a singleton in most cases.

Test Plan: thread_list_test

Reviewers: andrewkr, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D59187
2016-08-15 09:04:55 -07:00
Aaron Gao
e408e98c8c add Name() to Cache
Summary: preparation for detecting Cache type. If SimCache, we then may trigger some command like "setSimCapacity()" with setOptions()

Test Plan: make all check

Reviewers: yiwu, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61953
2016-08-12 14:16:57 -07:00
Yueh-Hsuan Chiang
6056d6317d Improve comment and bug fix for GetOptionsFromMap functions in convenience.h
Summary:
This diff improves the documentation for GetOptionsFromMap APIs and
fixes a bug in GetOptionsFromMap functions in convenience.h
where new_options will still be changed when the function
call is not successful.

Test Plan: options_test

Reviewers: IslamAbdelRahman, kradhakrishnan, andrewkr, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61731
2016-08-11 14:54:29 -07:00
Willem Jan Withagen
821bcb0b39 util/arena.cc: FreeBSD: More portable use of mmap(MAP_ANON) (#1254)
From the Linux manual:
  MAP_ANONYMOUS
     The  mapping  is  not  backed  by any file; its contents
     are initialized to zero.  The fd and offset arguments are
     ignored; however, some implementations require fd to be -1
     if MAP_ANONYMOUS (or MAP_ANON) is specified, and portable
     applications  should  ensure  this.

FreeBSD is such a case, it wil just return an error.
2016-08-10 13:52:23 -07:00
omegaga
44f5cc57a5 Add time series database (resubmitted)
Summary: Implement a time series database that supports DateTieredCompactionStrategy. It wraps a db object and separate SST files in different column families (time windows).

Test Plan: Add `date_tiered_test`.

Reviewers: dhruba, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61653
2016-08-05 15:56:22 -07:00
Dhruba Borthakur
34723b4c4a Cleanup unused variable pending_fsync_.
Summary: Cleanup unused variable pending_fsync_.

Test Plan: make check

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D61581
2016-08-05 10:31:41 -07:00
omegaga
2306167d30 Fix clang build failure and refactor unit test
Summary: <endian.h> is not platform independent. Switch to our own endianness transformation function instead.

Test Plan: Pass Travis CI. Refactor tests to make sure endianness transformation runs properly.

Reviewers: IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61389
2016-08-02 15:16:39 -07:00
Aaron Gao
343304e1d3 Use StopWatch to do statistic job in db_impl_add_file.cc
Summary:
patch for diff https://reviews.facebook.net/D58587
Also change StopWatch class to add a fifth param named overwrite which decides whether to overwrite *elapse or add on it.

Test Plan: make all check -j64

Reviewers: sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61239
2016-08-02 14:53:29 -07:00
krad
c116b47804 Persistent Read Cache (part 6) Block Cache Tier Implementation
Summary:
The patch is a continuation of part 5. It glues the abstraction for
file layout and metadata, and flush out the implementation of the API. It
adds unit tests for the implementation.

Test Plan: Run unit tests

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57549
2016-08-01 14:15:14 -07:00
Andrew Kryczka
afad5bd1c5 Simplify thread-local static initialization
Summary:
The call stack used to look like this during static initialization:

  #0  0x00000000008032d1 in rocksdb::ThreadLocalPtr::StaticMeta::StaticMeta() (this=0x7ffff683b300) at util/thread_local.cc:172
  #1  0x00000000008030a7 in rocksdb::ThreadLocalPtr::Instance() () at util/thread_local.cc:135
  #2  0x000000000080310f in rocksdb::ThreadLocalPtr::StaticMeta::Mutex() () at util/thread_local.cc:141
  #3  0x0000000000803103 in rocksdb::ThreadLocalPtr::StaticMeta::InitSingletons() () at util/thread_local.cc:139
  #4  0x000000000080305d in rocksdb::ThreadLocalPtr::InitSingletons() () at util/thread_local.cc:106

It involves outer/inner classes and the call stacks goes
outer->inner->outer->inner, which is too difficult to understand. We can avoid
a level of back-and-forth by skipping StaticMeta::InitSingletons(), which
doesn't initialize anything beyond what ThreadLocalPtr::Instance() already
initializes.

Now the call stack looks like this during static initialization:

  #0  0x00000000008032c5 in rocksdb::ThreadLocalPtr::StaticMeta::StaticMeta() (this=0x7ffff683b300) at util/thread_local.cc:170
  #1  0x00000000008030a7 in rocksdb::ThreadLocalPtr::Instance() () at util/thread_local.cc:135
  #2  0x000000000080305d in rocksdb::ThreadLocalPtr::InitSingletons() () at util/thread_local.cc:106

Test Plan:
unit tests

verify StaticMeta::mutex_ is still initialized in DefaultEnv() (StaticMeta::mutex_ is the only variable intended to be initialized via StaticMeta::InitSingletons() which I removed)

  #0  0x00000000005cee17 in rocksdb::port::Mutex::Mutex(bool) (this=0x7ffff69500b0, adaptive=false) at port/port_posix.cc:52
  #1  0x0000000000769cf8 in rocksdb::ThreadLocalPtr::StaticMeta::StaticMeta() (this=0x7ffff6950000) at util/thread_local.cc:168
  #2  0x0000000000769a53 in rocksdb::ThreadLocalPtr::Instance() () at util/thread_local.cc:133
  #3  0x0000000000769a09 in rocksdb::ThreadLocalPtr::InitSingletons() () at util/thread_local.cc:105
  #4  0x0000000000647d98 in rocksdb::Env::Default() () at util/env_posix.cc:845

Reviewers: lightmark, yhchiang, sdong

Reviewed By: sdong

Subscribers: arahut, IslamAbdelRahman, yiwu, andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60813
2016-07-28 14:55:22 -07:00
sdong
e5b5f12b81 Change options memtable_prefix_bloom_huge_page_tlb_size => memtable_huge_page_size and cover huge page to memtable too
Summary: Extend the option memtable_prefix_bloom_huge_page_tlb_size from just putting memtable bloom filter to huge page to memtable itself too.

Test Plan: Run all existing tests.

Reviewers: IslamAbdelRahman, yhchiang, andrewkr

Reviewed By: andrewkr

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60513
2016-07-26 18:15:11 -07:00
Islam AbdelRahman
f8061a237e Fix Statistics TickersNameMap miss match with Tickers enum
Summary:
TickersNameMap is not consistent with Tickers enum.
this cause us to report wrong statistics and sometimes to access TickersNameMap outside it's boundary causing crashes (in Fb303 statistics)

Test Plan: added new unit test

Reviewers: sdong, kradhakrishnan

Reviewed By: kradhakrishnan

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D61083
2016-07-25 16:05:50 -07:00
Islam AbdelRahman
68a8e6b8fa Introduce FullMergeV2 (eliminate memcpy from merge operators)
Summary:
This diff update the code to pin the merge operator operands while the merge operation is done, so that we can eliminate the memcpy cost, to do that we need a new public API for FullMerge that replace the std::deque<std::string> with std::vector<Slice>

This diff is stacked on top of D56493 and D56511

In this diff we
- Update FullMergeV2 arguments to be encapsulated in MergeOperationInput and MergeOperationOutput which will make it easier to add new arguments in the future
- Replace std::deque<std::string> with std::vector<Slice> to pass operands
- Replace MergeContext std::deque with std::vector (based on a simple benchmark I ran https://gist.github.com/IslamAbdelRahman/78fc86c9ab9f52b1df791e58943fb187)
- Allow FullMergeV2 output to be an existing operand

```
[Everything in Memtable | 10K operands | 10 KB each | 1 operand per key]

DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=10000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000

[FullMergeV2]
readseq      :       0.607 micros/op 1648235 ops/sec; 16121.2 MB/s
readseq      :       0.478 micros/op 2091546 ops/sec; 20457.2 MB/s
readseq      :       0.252 micros/op 3972081 ops/sec; 38850.5 MB/s
readseq      :       0.237 micros/op 4218328 ops/sec; 41259.0 MB/s
readseq      :       0.247 micros/op 4043927 ops/sec; 39553.2 MB/s

[master]
readseq      :       3.935 micros/op 254140 ops/sec; 2485.7 MB/s
readseq      :       3.722 micros/op 268657 ops/sec; 2627.7 MB/s
readseq      :       3.149 micros/op 317605 ops/sec; 3106.5 MB/s
readseq      :       3.125 micros/op 320024 ops/sec; 3130.1 MB/s
readseq      :       4.075 micros/op 245374 ops/sec; 2400.0 MB/s
```

```
[Everything in Memtable | 10K operands | 10 KB each | 10 operand per key]

DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=1000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000

[FullMergeV2]
readseq      :       3.472 micros/op 288018 ops/sec; 2817.1 MB/s
readseq      :       2.304 micros/op 434027 ops/sec; 4245.2 MB/s
readseq      :       1.163 micros/op 859845 ops/sec; 8410.0 MB/s
readseq      :       1.192 micros/op 838926 ops/sec; 8205.4 MB/s
readseq      :       1.250 micros/op 800000 ops/sec; 7824.7 MB/s

[master]
readseq      :      24.025 micros/op 41623 ops/sec;  407.1 MB/s
readseq      :      18.489 micros/op 54086 ops/sec;  529.0 MB/s
readseq      :      18.693 micros/op 53495 ops/sec;  523.2 MB/s
readseq      :      23.621 micros/op 42335 ops/sec;  414.1 MB/s
readseq      :      18.775 micros/op 53262 ops/sec;  521.0 MB/s

```

```
[Everything in Block cache | 10K operands | 10 KB each | 1 operand per key]

[FullMergeV2]
$ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions
readseq      :      14.741 micros/op 67837 ops/sec;  663.5 MB/s
readseq      :       1.029 micros/op 971446 ops/sec; 9501.6 MB/s
readseq      :       0.974 micros/op 1026229 ops/sec; 10037.4 MB/s
readseq      :       0.965 micros/op 1036080 ops/sec; 10133.8 MB/s
readseq      :       0.943 micros/op 1060657 ops/sec; 10374.2 MB/s

[master]
readseq      :      16.735 micros/op 59755 ops/sec;  584.5 MB/s
readseq      :       3.029 micros/op 330151 ops/sec; 3229.2 MB/s
readseq      :       3.136 micros/op 318883 ops/sec; 3119.0 MB/s
readseq      :       3.065 micros/op 326245 ops/sec; 3191.0 MB/s
readseq      :       3.014 micros/op 331813 ops/sec; 3245.4 MB/s
```

```
[Everything in Block cache | 10K operands | 10 KB each | 10 operand per key]

DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10-operands-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions

[FullMergeV2]
readseq      :      24.325 micros/op 41109 ops/sec;  402.1 MB/s
readseq      :       1.470 micros/op 680272 ops/sec; 6653.7 MB/s
readseq      :       1.231 micros/op 812347 ops/sec; 7945.5 MB/s
readseq      :       1.091 micros/op 916590 ops/sec; 8965.1 MB/s
readseq      :       1.109 micros/op 901713 ops/sec; 8819.6 MB/s

[master]
readseq      :      27.257 micros/op 36687 ops/sec;  358.8 MB/s
readseq      :       4.443 micros/op 225073 ops/sec; 2201.4 MB/s
readseq      :       5.830 micros/op 171526 ops/sec; 1677.7 MB/s
readseq      :       4.173 micros/op 239635 ops/sec; 2343.8 MB/s
readseq      :       4.150 micros/op 240963 ops/sec; 2356.8 MB/s
```

Test Plan: COMPILE_WITH_ASAN=1 make check -j64

Reviewers: yhchiang, andrewkr, sdong

Reviewed By: sdong

Subscribers: lovro, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57075
2016-07-20 09:49:03 -07:00
krad
d9cfaa2b16 Persistent Read Cache (6) Persistent cache tier implentation - File layout
Summary:
Persistent cache tier is the tier abstraction that can work for any block
device based device mounted on a file system. The design/implementation can
handle any generic block device.

Any generic block support is achieved by generalizing the access patten as
{io-size, q-depth, direct-io/buffered}.

We have specifically tested and adapted the IO path for NVM and SSD.

Persistent cache tier consists of there parts :

1) File layout

Provides the implementation for handling IO path for reading and writing data
(key/value pair).

2) Meta-data
Provides the implementation for handling the index for persistent read cache.

3) Implementation
It binds (1) and (2) and flushed out the PersistentCacheTier interface

This patch provides implementation for (1)(2). Follow up patch will provide (3)
and tests.

Test Plan: Compile and run check

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57117
2016-07-19 12:01:46 -07:00
John Alexander
9430333f84 New Statistics to track Compression/Decompression (#1197)
* Added new statistics and refactored to allow ioptions to be passed around as required to access environment and statistics pointers (and, as a convenient side effect, info_log pointer).

* Prevent incrementing compression counter when compression is turned off in options.

* Prevent incrementing compression counter when compression is turned off in options.

* Added two more supported compression types to test code in db_test.cc

* Prevent incrementing compression counter when compression is turned off in options.

* Added new StatsLevel that excludes compression timing.

* Fixed casting error in coding.h

* Fixed CompressionStatsTest for new StatsLevel.

* Removed unused variable that was breaking the Linux build
2016-07-19 09:44:03 -07:00
Yi Wu
4b95253587 Refactor cache.cc
Summary: Refactor cache.cc so that I can plugin clock cache (D55581). Mainly move `ShardedCache` to separate file, move `LRUHandle` back to cache.cc and rename it lru_cache.cc.

Test Plan:
    make check -j64

Reviewers: lightmark, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D59655
2016-07-15 10:41:36 -07:00
Jay Edgar
15b7a4ab88 Fixed output size and removed unneeded loop
Summary: In Zlib_Compress and BZip2_Compress the calculation for size was slightly off when using compression_foramt_version 2 (which includes the decompressed size in the output).  Also there were unnecessary loops around the deflate/BZ2_bzCompress calls.  In Zlib_Compress there was also a possible exit from the function after calling deflateInit2 that didn't call deflateEnd.

Test Plan: Standard tests

Reviewers: sdong, IslamAbdelRahman, igor

Reviewed By: igor

Subscribers: sdong, IslamAbdelRahman, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60537
2016-07-13 09:10:06 -07:00
Jay Edgar
efd013d6d8 Miscellaneous performance improvements
Summary:
I was investigating performance issues in the SstFileWriter and found all of the following:

- The SstFileWriter::Add() function created a local InternalKey every time it was called generating a allocation and free each time.  Changed to have an InternalKey member variable that can be reset with the new InternalKey::Set() function.
- In SstFileWriter::Add() the smallest_key and largest_key values were assigned the result of a ToString() call, but it is simpler to just assign them directly from the user's key.
- The Slice class had no move constructor so each time one was returned from a function a new one had to be allocated, the old data copied to the new, and the old one was freed.  I added the move constructor which also required a copy constructor and assignment operator.
- The BlockBuilder::CurrentSizeEstimate() function calculates the current estimate size, but was being called 2 or 3 times for each key added.  I changed the class to maintain a running estimate (equal to the original calculation) so that the function can return an already calculated value.
- The code in BlockBuilder::Add() that calculated the shared bytes between the last key and the new key duplicated what Slice::difference_offset does, so I replaced it with the standard function.
- BlockBuilder::Add() had code to copy just the changed portion into the last key value (and asserted that it now matched the new key).  It is more efficient just to copy the whole new key over.
- Moved this same code up into the 'if (use_delta_encoding_)' since the last key value is only needed when delta encoding is on.
- FlushBlockBySizePolicy::BlockAlmostFull calculated a standard deviation value each time it was called, but this information would only change if block_size of block_size_deviation changed, so I created a member variable to hold the value to avoid the calculation each time.
- Each PutVarint??() function has a buffer and calls std::string::append().  Two or three calls in a row could share a buffer and a single call to std::string::append().

Some of these will be helpful outside of the SstFileWriter.  I'm not 100% the addition of the move constructor is appropriate as I wonder why this wasn't done before - maybe because of compiler compatibility?  I tried it on gcc 4.8 and 4.9.

Test Plan: The changes should not affect the results so the existing tests should all still work and no new tests were added.  The value of the changes was seen by manually testing the SstFileWriter class through MyRocks and adding timing code to identify problem areas.

Reviewers: sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59607
2016-07-12 14:15:32 -07:00
Yi Wu
296545a2c7 Fix clang analyzer errors
Summary:
Fixing erros reported by clang static analyzer.
* Removing some unused variables.
* Adding assertions to fix false positives reported by clang analyzer.
* Adding `__clang_analyzer__` macro to suppress false positive warnings.

Test Plan:
    USE_CLANG=1 OPT=-g make analyze -j64

Reviewers: andrewkr, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60549
2016-07-08 17:50:51 -07:00
Wanning Jiang
0f691c4b5b CLI option & Rename() allow overwrite
Summary: this test support CLI option to select HdfsEnv/NativeHdfsEnv now. The latter one is default. add test about when Rename(src, target) should overwrite target

Test Plan: existing test

Reviewers: sdong, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60399
2016-07-07 17:51:36 -07:00
Andrew Kryczka
e1b3ee8a79 Cleanup auto-roll logger flush-while-rolling test
Summary:
Use @omegaga's awesome feature to avoid use of callbacks for ensuring
SyncPoints happen in a particular thread.

Depends on D60375.

Test Plan:
  $ ./auto_roll_logger_test

Reviewers: omegaga, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, omegaga, leveldb

Differential Revision: https://reviews.facebook.net/D60471
2016-07-07 11:35:40 -07:00
omegaga
cd4178a015 Add a new feature to enforce a sync point only active on a thread
Summary: Add markers to sync points. A marked sync point will only be active when it is on the same thread as the marker sync point.

Test Plan: Write a unit test to validate.

Reviewers: sdong, IslamAbdelRahman, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60375
2016-07-07 11:29:14 -07:00
sdong
32df9733d1 Add options.write_buffer_manager: control total memtable size across DB instances
Summary: Add option write_buffer_manager to help users control total memory spent on memtables across multiple DB instances.

Test Plan: Add a new unit test.

Reviewers: yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: adela, benj, sumeet, muthu, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59925
2016-07-05 18:11:25 -07:00
Wanning Jiang
95d96eeeb1 remove LockFile
Summary: LockFile is unnecessary in unit test

Test Plan: env_basic_test.cc

Reviewers: andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60285
2016-07-01 15:28:08 -07:00
Aaron Gao
cb2476a0ca fix rate limiter to avoid starvation
Summary:
The current implementation of rate limiter has the possibility to introduce resource starvation when change its limit.
This diff aims to fix this problem by consuming request bytes partially.

Test Plan:
```
./rate_limiter_test
[==========] Running 4 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 4 tests from RateLimiterTest
[ RUN      ] RateLimiterTest.OverflowRate
[       OK ] RateLimiterTest.OverflowRate (0 ms)
[ RUN      ] RateLimiterTest.StartStop
[       OK ] RateLimiterTest.StartStop (0 ms)
[ RUN      ] RateLimiterTest.Rate
request size [1 - 1023], limit 10 KB/sec, actual rate: 10.355712 KB/sec, elapsed 2.00 seconds
request size [1 - 1023], limit 20 KB/sec, actual rate: 19.136564 KB/sec, elapsed 2.00 seconds
request size [1 - 2047], limit 20 KB/sec, actual rate: 20.783976 KB/sec, elapsed 2.10 seconds
request size [1 - 2047], limit 40 KB/sec, actual rate: 39.308144 KB/sec, elapsed 2.10 seconds
request size [1 - 4095], limit 40 KB/sec, actual rate: 40.318349 KB/sec, elapsed 2.20 seconds
request size [1 - 4095], limit 80 KB/sec, actual rate: 79.667396 KB/sec, elapsed 2.20 seconds
request size [1 - 8191], limit 80 KB/sec, actual rate: 81.807158 KB/sec, elapsed 2.30 seconds
request size [1 - 8191], limit 160 KB/sec, actual rate: 160.659761 KB/sec, elapsed 2.20 seconds
request size [1 - 16383], limit 160 KB/sec, actual rate: 160.700990 KB/sec, elapsed 3.00 seconds
request size [1 - 16383], limit 320 KB/sec, actual rate: 317.639481 KB/sec, elapsed 2.50 seconds
[       OK ] RateLimiterTest.Rate (22618 ms)
[ RUN      ] RateLimiterTest.LimitChangeTest
[COMPLETE] request size 10 KB, new limit 20KB/sec, refill period 1000 ms
[COMPLETE] request size 10 KB, new limit 5KB/sec, refill period 1000 ms
[COMPLETE] request size 20 KB, new limit 40KB/sec, refill period 1000 ms
[COMPLETE] request size 20 KB, new limit 10KB/sec, refill period 1000 ms
[COMPLETE] request size 40 KB, new limit 80KB/sec, refill period 1000 ms
[COMPLETE] request size 40 KB, new limit 20KB/sec, refill period 1000 ms
[COMPLETE] request size 80 KB, new limit 160KB/sec, refill period 1000 ms
[COMPLETE] request size 80 KB, new limit 40KB/sec, refill period 1000 ms
[COMPLETE] request size 160 KB, new limit 320KB/sec, refill period 1000 ms
[COMPLETE] request size 160 KB, new limit 80KB/sec, refill period 1000 ms
[       OK ] RateLimiterTest.LimitChangeTest (5002 ms)
[----------] 4 tests from RateLimiterTest (27620 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 1 test case ran. (27621 ms total)
[  PASSED  ] 4 tests.
```

Reviewers: sdong, IslamAbdelRahman, yiwu, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60207
2016-07-01 00:16:29 -07:00
Andrew Kryczka
6b7167651f Run env_basic_test on Env::Default
Summary:
Previously we couldn't run env_basic_test on Env::Default (PosixEnv on
our platforms) since GetChildren*() behavior was inconsistent with our other
Envs. We can normalize the output of GetChildren*() such that these test cases
work on PosixEnv too.

Test Plan: ran env_basic_test

Reviewers: wanning

Reviewed By: wanning

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D59943
2016-06-30 18:34:29 -07:00
Andrew Kryczka
9eb0b53954 Move env_basic_test cleanup to TearDown
Summary:
move cleanup to TearDown and handle directories, so cleanup will happen
even if a test fails in the middle.

Test Plan: ./env_basic_test

Reviewers: wanning

Reviewed By: wanning

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60243
2016-06-30 18:33:49 -07:00
Wanning Jiang
95c192475d writable file close before reset
Summary: during backup, writable file should call close() before reset()

Test Plan: backupable_db_test.cc

Reviewers: andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60195
2016-06-29 15:33:27 -07:00
Willem Jan Withagen
b726bf5961 FreeBSD does not have std::to_string (#1190)
Submitted-by: Willem Jan Withagen <wjw@digiware.nl>
2016-06-29 07:35:17 -07:00
Wanning Jiang
3fc713ed92 delete 2nd level children for default env
Summary: ensure no 2nd level children under test_dir_

Test Plan: env_basic_test on 4 envs

Reviewers: andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59979
2016-06-24 12:31:58 -07:00
Wanning Jiang
0d7b261236 add tests to env_basic_test.cc
Summary: test NativeHdfsEnv

Test Plan: env_basic_test.cc

Reviewers: andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D59889
2016-06-22 16:16:21 -07:00
omegaga
c4e19b77e8 Add a read option to enable background purge when cleaning up iterators
Summary:
Add a read option `background_purge_on_iterator_cleanup` to avoid deleting files in foreground when destroying iterators.
Instead, a job is scheduled in high priority queue and would be executed in a separate background thread.

Test Plan: Add a variant of PurgeObsoleteFileTest. Turn on background purge option in the new test, and use sleeping task to ensure files are deleted in background.

Reviewers: IslamAbdelRahman, sdong

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59499
2016-06-21 18:41:23 -07:00
sdong
7b79238b65 Deprectate filter_deletes
Summary: filter_deltes is not a frequently used feature. Remove it.

Test Plan: Run all test suites.

Reviewers: igor, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59427
2016-06-17 10:30:47 -07:00
sdong
4939fc3892 Bulk load mode shouldn't stop ingest
Summary: We introduced default slow down and stop condition, but didn't reset it in bulk load mode. Fix it.

Test Plan: N/A

Reviewers: igor, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59757
2016-06-17 09:58:50 -07:00
Yueh-Hsuan Chiang
3a2bccc845 Fixed a crash bug that incorrectly parse deprecated options in options_helper
Summary: Fixed a crash bug that incorrectly parse deprecated options in options_helper

Test Plan:
run db_bench with an old options file with memtable_prefix_bloom_probes
./db_bench --options_file=AN_OLD_OPTIONS_FILE --num=100 --benchmarks=fillseq

Reviewers: sdong, IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D59787
2016-06-17 04:53:30 -07:00
Yi Wu
bc8af90e8c add option to not flush memtable on open()
Summary:
Add option to not flush memtable on open()
In case the option is enabled, don't delete existing log files by not updating log numbers to MANIFEST.
Will still flush if we need to (e.g. memtable full in the middle). In that case we also flush final memtable.
If wal_recovery_mode = kPointInTimeRecovery, do not halt immediately after encounter corruption. Instead, check if seq id of next log file is last_log_sequence + 1. In that case we continue recovery.

Test Plan: See unit test.

Reviewers: dhruba, horuff, sdong

Reviewed By: sdong

Subscribers: benj, yhchiang, andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57813
2016-06-13 11:34:16 -07:00
Nadav Rotem
7360db39e6 Add a check mode to verify compressed block can be decompressed back
Summary:
Try to decompress compressed blocks when a special flag is set.
assert and crash in debug builds if we can't decompress the just-compressed input.

Test Plan: Run unit-tests.

Reviewers: dhruba, andrewkr, sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59145
2016-06-10 18:20:54 -07:00
sdong
6faddd7c55 Merge db/slice.cc into util/slice.cc
Summary: It confuses some compilers to have slice.cc under multiple directories. Merge them.

Test Plan: Run existing tests

Reviewers: andrewkr, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59409
2016-06-10 16:37:36 -07:00
omegaga
2d05eaeb28 Fix name conflict in delete_shceduler_test and db_sst_test
Summary: delete_scheduler_test and db_sst_test share a same directory name, causing possible fails on both tests when running in parallel. Fixed by changing directory name.

Test Plan: Run the two tests in parallel: `parallel -u ./{} ::: delete_scheduler_test db_sst_test`

Reviewers: sdong, andrewkr

Reviewed By: sdong, andrewkr

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59529
2016-06-10 15:39:17 -07:00
sdong
20699df843 memtable_prefix_bloom_bits -> memtable_prefix_bloom_bits_ratio and deprecate memtable_prefix_bloom_probes
Summary:
memtable_prefix_bloom_probes is not a critical option. Remove it to reduce number of options.
It's easier for users to make mistakes with memtable_prefix_bloom_bits, turn it to memtable_prefix_bloom_bits_ratio

Test Plan: Run all existing tests

Reviewers: yhchiang, igor, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: gunnarku, yoshinorim, MarkCallaghan, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59199
2016-06-10 12:12:10 -07:00
sdong
b2973eaaeb Remove options builder
Summary: AFIK, options builder is not used by anyone. Remove it.

Test Plan: Run all existing tests.

Reviewers: IslamAbdelRahman, andrewkr, igor

Reviewed By: igor

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59319
2016-06-09 16:56:17 -07:00
Frank Celler
1ba452226f Fix for GCC 5.4 (#1157)
GCC 5.4 will complain (see also options_parser.cc):

    /home/abuild/rpmbuild/BUILD/arangodb-3.0.0r1/3rdParty/rocksdb/rocksdb/util/options_builder.cc: In function 'rocksdb::CompactionStyle rocksdb::{anonymous}::PickCompactionStyle(size_t, int, int, uint64_t)':
    /home/abuild/rpmbuild/BUILD/arangodb-3.0.0r1/3rdParty/rocksdb/rocksdb/util/options_builder.cc:29:7: error: 'log' is not a member of 'std'
           std::log(target_db_size / write_buffer_size) / std::log(kBytesForLevelMultiplier)));
           ^
    /home/abuild/rpmbuild/BUILD/arangodb-3.0.0r1/3rdParty/rocksdb/rocksdb/util/options_builder.cc:29:7: note: suggested alternative:
    In file included from /usr/include/features.h:365:0,
                     from /usr/include/math.h:26,
                     from /home/abuild/rpmbuild/BUILD/arangodb-3.0.0r1/3rdParty/rocksdb/rocksdb/util/options_builder.cc:6:
    /usr/include/bits/mathcalls.h:109:1: note:   'log'
     __MATHCALL_VEC (log,, (_Mdouble_ __x));
2016-06-08 08:47:20 -07:00
Willem Jan Withagen
0d65acec0c threadpool.cc: abort() lives in stdlib.h on FreeBSD (#1155) 2016-06-05 17:23:38 -07:00
Willem Jan Withagen
19dd5a61cd env_chroot.cc: FreeBSD likes stdlib.h for realpaht() and friends (#1154) 2016-06-05 17:22:55 -07:00
Andrew Kryczka
5aca977be8 env_basic_test library for testing new Envs [pluggable Env part 3]
Summary:
- Provide env_test as a static library. We will build it for future releases so internal Envs can use env_test by linking against this library.
- Add tests for CustomEnv, which is configurable via ENV_TEST_URI environment variable. It uses the URI-based Env lookup (depends on D58449).
- Refactor env_basic_test cases to use a unique/configurable directory for test files.

Test Plan:
built a test binary against librocksdb_env_test.a. It registered the
default Env with URI prefix "a://".

- verify runs all CustomEnv tests when URI with correct prefix is provided

```
$ ENV_TEST_URI="a://ok" ./tmp --gtest_filter="CustomEnv/*"
...
[  PASSED  ] 12 tests.
```

- verify runs no CustomEnv tests when URI with non-matching prefix is provided

```
$ ENV_TEST_URI="b://ok" ./tmp --gtest_filter="CustomEnv/*"
...
[  PASSED  ] 0 tests.
```

Reviewers: ldemailly, IslamAbdelRahman, lightmark, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D58485
2016-06-03 18:44:22 -07:00
Andrew Kryczka
6e6622abb9 Create env_basic_test [pluggable Env part 2]
Summary:
Extracted basic Env-related tests from mock_env_test and memenv_test into a
parameterized test for Envs: env_basic_test.

Depends on D58449. (The dependency is here only so I can keep this series of
diffs in a chain -- there is no dependency on that diff's code.)

Test Plan: ran tests

Reviewers: IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D58635
2016-06-03 15:13:03 -07:00
krad
27ad170712 Fix Windows build break
Summary:
Direct IO checkin breaks Windows build. Fixing the code to work for
Windows.

Test Plan: Run env_test in Windows 10 and make check in Linux

Reviewers: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D59073
2016-06-01 18:07:59 -07:00
Vasile Paraschiv
71c7eed91c Assert boundary checks for SetPerfLevel()
Summary: Add asserts around PerfLevel enum

Test Plan: make all check -j32

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D59007
2016-06-01 09:07:09 -07:00
sdong
f62fbd2c85 Handle overflow case of rate limiter's paramters
Summary: When rate_bytes_per_sec * refill_period_us_ overflows, the actual limited rate is very low. Handle this case so the rate will be large.

Test Plan: Add a unit test for it.

Reviewers: IslamAbdelRahman, andrewkr

Reviewed By: andrewkr

Subscribers: yiwu, lightmark, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58929
2016-05-27 16:15:28 -07:00
Andrew Kryczka
57461fba8c In-memory environment read beyond EOF
Summary:
Made it consistent with posix Env, which uses pread() that returns 0
(success) when an offset is given beyond EOF. The purpose of making these Envs
behave consistently is I am repurposing the in-memory Envs' tests for the basic
Env tests in D58635.

Test Plan: ran mock_env_test and memenv_test

Reviewers: IslamAbdelRahman, lightmark, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D58845
2016-05-27 12:10:26 -07:00
sdong
157e0633e7 MutexLock -> ThreadPoolMutexLock in util/threadpool.cc
Summary: util/threadpool.cc's function name is the same as a well-known class name. It breaks unity build. Rename it.

Test Plan: Run all existing test.

Reviewers: yiwu, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58881
2016-05-27 09:41:35 -07:00
Aaron Gao
5d660258e7 add simulator Cache as class SimCache/SimLRUCache(with test)
Summary: add class SimCache(base class with instrumentation api) and SimLRUCache(derived class with detailed implementation) which is used as an instrumented block cache that can predict hit rate for different cache size

Test Plan:
Add a test case in `db_block_cache_test.cc` called `SimCacheTest` to test basic logic of SimCache.
Also add option `-simcache_size` in db_bench. if set with a value other than -1, then the benchmark will use this value as the size of the simulator cache and finally output the simulation result.
```
[gzh@dev9927.prn1 ~/local/rocksdb] ./db_bench -benchmarks "fillseq,readrandom" -cache_size 1000000 -simcache_size 1000000
RocksDB:    version 4.8
Date:       Tue May 17 16:56:16 2016
CPU:        32 * Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
CPUCache:   20480 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Write rate: 0 bytes/second
Compression: Snappy
Memtablerep: skip_list
Perf Level: 0
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
DB path: [/tmp/rocksdbtest-112628/dbbench]
fillseq      :       6.809 micros/op 146874 ops/sec;   16.2 MB/s
DB path: [/tmp/rocksdbtest-112628/dbbench]
readrandom   :       6.343 micros/op 157665 ops/sec;   17.4 MB/s (1000000 of 1000000 found)

SIMULATOR CACHE STATISTICS:
SimCache LOOKUPs: 986559
SimCache HITs:    264760
SimCache HITRATE: 26.84%

[gzh@dev9927.prn1 ~/local/rocksdb] ./db_bench -benchmarks "fillseq,readrandom" -cache_size 1000000 -simcache_size 10000000
RocksDB:    version 4.8
Date:       Tue May 17 16:57:10 2016
CPU:        32 * Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
CPUCache:   20480 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Write rate: 0 bytes/second
Compression: Snappy
Memtablerep: skip_list
Perf Level: 0
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
DB path: [/tmp/rocksdbtest-112628/dbbench]
fillseq      :       5.066 micros/op 197394 ops/sec;   21.8 MB/s
DB path: [/tmp/rocksdbtest-112628/dbbench]
readrandom   :       6.457 micros/op 154870 ops/sec;   17.1 MB/s (1000000 of 1000000 found)

SIMULATOR CACHE STATISTICS:
SimCache LOOKUPs: 1059764
SimCache HITs:    374501
SimCache HITRATE: 35.34%

[gzh@dev9927.prn1 ~/local/rocksdb] ./db_bench -benchmarks "fillseq,readrandom" -cache_size 1000000 -simcache_size 100000000
RocksDB:    version 4.8
Date:       Tue May 17 16:57:32 2016
CPU:        32 * Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
CPUCache:   20480 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Write rate: 0 bytes/second
Compression: Snappy
Memtablerep: skip_list
Perf Level: 0
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
DB path: [/tmp/rocksdbtest-112628/dbbench]
fillseq      :       5.632 micros/op 177572 ops/sec;   19.6 MB/s
DB path: [/tmp/rocksdbtest-112628/dbbench]
readrandom   :       6.892 micros/op 145094 ops/sec;   16.1 MB/s (1000000 of 1000000 found)

SIMULATOR CACHE STATISTICS:
SimCache LOOKUPs: 1150767
SimCache HITs:    1034535
SimCache HITRATE: 89.90%
```

Reviewers: IslamAbdelRahman, andrewkr, sdong

Reviewed By: sdong

Subscribers: MarkCallaghan, andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57999
2016-05-23 23:35:23 -07:00
krad
21f847eda5 Direct IO fix for Mac
Summary:
O_DIRECT is not available in Mac as a flag for open. The fix is to make
use of fctl after the file is opened

Test Plan: Run the tests on mac and Linux

Reviewers: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D58665
2016-05-23 16:38:25 -07:00
Ashish Shenoy
99765ed855 Clean up the ComputeCompactionScore() API
Summary: Make CompactionOptionsFIFO a part of mutable_cf_options

Test Plan: UT

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, lgalanis, dhruba

Differential Revision: https://reviews.facebook.net/D58653
2016-05-23 15:55:29 -07:00
krad
f89caa127b Direct IO capability for RocksDB
Summary:
This patch adds direct IO capability to RocksDB Env.

The direct IO capability is required for persistent cache since NVM is best
accessed as 4K direct IO. SSDs can leverage direct IO for reading.

Direct IO requires the offset and size be sector size aligned, and memory to
be kernel page aligned. Since neither RocksDB/Persistent read cache data
layout is aligned to sector size, the code can accommodate reading unaligned IO size
(or unaligned memory) at the cost of an alloc/copy.

The write code path expects the size and memory to be aligned.

Test Plan: Run RocksDB unit tests

Reviewers: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57393
2016-05-23 12:27:27 -07:00
Sage Weil
847e471db6 db/log_test: add recycle log test
This currently fails because we do not properly map a
corrupt header to the logical end of the log.

Signed-off-by: Sage Weil <sage@redhat.com>
2016-05-22 22:00:15 -07:00
Aaron Orenstein
2073cf3775 Eliminate use of 'using namespace std'. Also remove a number of ADL references to std functions.
Summary: Reduce use of argument-dependent name lookup in RocksDB.

Test Plan: 'make check' passed.

Reviewers: andrewkr

Reviewed By: andrewkr

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58203
2016-05-20 07:42:18 -07:00
krad
bb98ca3c80 Implement GetUniqueId for Mac
Summary:
Persistent read cache relies on the accuracy of the GetUniqueIdFromFile
to generate a unique key for a given block of data. Currently we don't have an
implementation for Mac.

This patch adds an implementation.

Test Plan: Run unit tests

Reviewers: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D58413
2016-05-19 16:34:31 -07:00
Islam AbdelRahman
533cda90ce Add GetStringFromCompressionType to include/rocksdb/convenience.h
Summary:
Expose a simple function to convert CompressionType to it's corresponding option string

This is for a diff @yoshinorim is working on for MyRocks

Test Plan: unittest

Reviewers: yhchiang, andrewkr, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D58215
2016-05-18 15:03:21 -07:00
Aaron Gao
43afd72bee [rocksdb] make more options dynamic
Summary:
make more ColumnFamilyOptions dynamic:
- compression
- soft_pending_compaction_bytes_limit
- hard_pending_compaction_bytes_limit
- min_partial_merge_operands
- report_bg_io_stats
- paranoid_file_checks

Test Plan:
Add sanity check in `db_test.cc` for all above options except for soft_pending_compaction_bytes_limit and hard_pending_compaction_bytes_limit.
All passed.

Reviewers: andrewkr, sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57519
2016-05-17 13:11:56 -07:00
krad
a08c8c851a Added PersistentCache abstraction
Summary:
Added a new abstraction to cache page to RocksDB designed for the read
cache use.

RocksDB current block cache is more of an object cache. For the persistent read cache
project, what we need is a page cache equivalent. This changes adds a cache
abstraction to RocksDB to cache pages called PersistentCache. PersistentCache can cache
uncompressed pages or raw pages (content as in filesystem). The user can
choose to operate PersistentCache either in  COMPRESSED or UNCOMPRESSED mode.

Blame Rev:

Test Plan: Run unit tests

Reviewers: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D55707
2016-05-15 22:17:18 -07:00
Dmitri Smirnov
aab91b8d8f Use generic threadpool for Windows environment (#1120)
Conditionally retrofit thread_posix for use with std::thread
  and reuse the same logic. Posix users continue using Posix interfaces.
  Enable XPRESS compression in test runs.
  Fix master introduced signed/unsigned mismatch.
2016-05-12 18:34:04 -07:00
Reid Horuff
a657ee9a9c [rocksdb] Recovery path sequence miscount fix
Summary:
Consider the following WAL with 4 batch entries prefixed with their sequence at time of memtable insert.
[1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(a)]
[1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(b)]
[4: COMMIT(a)]
[7: COMMIT(b)]

The first two batches do not consume any sequence numbers so are both prefixed with seq=1.
For 2pc commit, memtable insertion takes place before COMMIT batch is written to WAL.
We can see that sequence number consumption takes place between WAL entries giving us the seemingly sparse sequence prefix for WAL entries.
This is a valid WAL.

Because with 2PC markers one WriteBatch points to another batch containing its inserts a writebatch can consume more or less sequence numbers than the number of sequence consuming entries that it contains.

We can see that, given the entries in the WAL, 6 sequence ids were consumed. Yet on recovery the maximum sequence consumed would be 7 + 3 (the number of sequence numbers consumed by COMMIT(b))

So, now upon recovery we must track the actual consumption of sequence numbers.
In the provided scenario there will be no sequence gaps, but it is possible to produce a sequence gap. This should not be a problem though. correct?

Test Plan: provided test.

Reviewers: sdong

Subscribers: andrewkr, leveldb, dhruba, hermanlee4

Differential Revision: https://reviews.facebook.net/D57645
2016-05-10 14:06:07 -07:00
Reid Horuff
1b8a2e8fdd [rocksdb] Memtable Log Referencing and Prepared Batch Recovery
Summary:
This diff is built on top of WriteBatch modification: https://reviews.facebook.net/D54093 and adds the required functionality to rocksdb core necessary for rocksdb to support 2PC.

modfication of DBImpl::WriteImpl()
- added two arguments *uint64_t log_used = nullptr, uint64_t log_ref = 0;
- *log_used is an output argument which will return the log number which the incoming batch was inserted into, 0 if no WAL insert took place.
-  log_ref is a supplied log_number which all memtables inserted into will reference after the batch insert takes place. This number will reside in 'FindMinPrepLogReferencedByMemTable()' until all Memtables insertinto have flushed.

- Recovery/writepath is now aware of prepared batches and commit and rollback markers.

Test Plan: There is currently no test on this diff. All testing of this functionality takes place in the Transaction layer/diff but I will add some testing.

Reviewers: IslamAbdelRahman, sdong

Subscribers: leveldb, santoshb, andrewkr, vasilep, dhruba, hermanlee4

Differential Revision: https://reviews.facebook.net/D56919
2016-05-10 14:06:07 -07:00
Andrew Kryczka
f548da33e8 Follow symlinks in chroot directory
Summary:
On Mac OS X, the chroot directory we typically use ("/tmp") is actually
a symlink for "/private/tmp". Since we dereference symlinks in user-defined
paths, we must also dereference symlinks in chroot_dir_ such that we can perform
string comparisons on those paths.

Test Plan: ran env_test on Mac OS X and devserver

Reviewers: sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57957
2016-05-10 09:53:52 -07:00
Islam AbdelRahman
4b31723433 Add bottommost_compression option
Summary:
Add a new option that can be used to set a specific compression algorithm for bottommost level.
This option will only affect levels larger than base level.

I have also updated CompactionJobInfo to include the compression algorithm used in compaction

Test Plan:
added new unittest
existing unittests

Reviewers: andrewkr, yhchiang, sdong

Reviewed By: sdong

Subscribers: lightmark, andrewkr, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D57669
2016-05-09 15:57:19 -07:00
Andrew Kryczka
258459ed54 Properly destroy ChrootEnv in env_test
Summary: see title

Test Plan:
  $ /mnt/gvfs/third-party2/valgrind/af85c56f424cd5edfc2c97588299b44ecdec96bb/3.10.0/gcc-4.9-glibc-2.20/e9936bf/bin/valgrind --error-exitcode=2 --leak-check=full ./env_test

Reviewers: IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57897
2016-05-09 14:38:50 -07:00
Andrew Kryczka
a9b3c47c8e Fix includes for clang on OS X
Summary:
Fix below error:

  use of undeclared identifier 'errno'

Test Plan: doitlive

Reviewers: IslamAbdelRahman, sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57849
2016-05-06 18:32:54 -07:00
Andrew Kryczka
3f16a836a4 Introduce chroot Env
Summary:
For testing backups, we needed an Env that is fully isolated from other
Envs on the same machine. Our in-memory Envs (MockEnv and InMemoryEnv) were
insufficient because they don't implement most directory operations.

This diff introduces a new Env, "ChrootEnv", that translates paths such that the
chroot directory appears to be the root directory. This way, multiple Envs can
be isolated in the filesystem by using different chroot directories. Since we
use the filesystem, all directory operations are trivially supported.

Test Plan:
I parameterized the existing EnvPosixTest so it runs tests on ChrootEnv
except the ioctl-related cases.

Reviewers: sdong, lightmark, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57543
2016-05-06 17:42:50 -07:00
Arun Sharma
04dec2a359 [ldb] Export ldb_cmd*.h
Summary:
This is needed so that rocksdb users can add more
commands to the included ldb tool by adding more custom
commands.

Test Plan: make -j ldb

Reviewers: sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57243
2016-05-06 16:09:09 -07:00
Andrew Kryczka
0d590d9991 Make max_dict_bytes optional in options string
Summary:
For backwards compatibility with older option strings, the parser needs
to treat this argument as optional.

Test Plan:
Updated unit test to cover case where compression_opts is present but
max_dict_bytes is omitted.

Reviewers: MarkCallaghan, sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57759
2016-05-06 11:27:28 -07:00
sdong
e3c6ba37dd OptimizeForSmallDb(): revert some options whose defaults were just changed
Summary: We changed default options of max_open_files and max_file_opening_threads but didn't revert it in OptimizeForSmallDb().

Test Plan: Add a unit test

Reviewers: igor, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57675
2016-05-05 16:50:53 -07:00
Yi Wu
24a24f013d Enable configurable readahead for iterators
Summary:
Add an option `iterator_readahead_size` to `ReadOptions` to enable
configurable readahead for iterators similar to the corresponding
option for compaction.

Test Plan:
```
make commit_prereq
```

Reviewers: kumar.rangarajan, ott, igor, sdong

Reviewed By: sdong

Subscribers: yiwu, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D55419
2016-05-04 15:25:58 -07:00
sdong
6a14f7a976 Change several option defaults
Summary:
Changing several option defaults:
 options.max_open_files changes from 5000 to -1
 options.base_background_compactions changes from max_background_compactions to 1
 options.wal_recovery_mode changes from kTolerateCorruptedTailRecords to kTolerateCorruptedTailRecords
 options.compaction_pri changes from kByCompensatedSize to kByCompensatedSize

Test Plan: Write unit tests to see OldDefaults() works as expected.

Reviewers: IslamAbdelRahman, yhchiang, igor

Reviewed By: igor

Subscribers: MarkCallaghan, yiwu, kradhakrishnan, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56427
2016-04-28 17:50:58 -07:00
Andrew Kryczka
843d2e3137 Shared dictionary compression using reference block
Summary:
This adds a new metablock containing a shared dictionary that is used
to compress all data blocks in the SST file. The size of the shared dictionary
is configurable in CompressionOptions and defaults to 0. It's currently only
used for zlib/lz4/lz4hc, but the block will be stored in the SST regardless of
the compression type if the user chooses a nonzero dictionary size.

During compaction, computes the dictionary by randomly sampling the first
output file in each subcompaction. It pre-computes the intervals to sample
by assuming the output file will have the maximum allowable length. In case
the file is smaller, some of the pre-computed sampling intervals can be beyond
end-of-file, in which case we skip over those samples and the dictionary will
be a bit smaller. After the dictionary is generated using the first file in a
subcompaction, it is loaded into the compression library before writing each
block in each subsequent file of that subcompaction.

On the read path, gets the dictionary from the metablock, if it exists. Then,
loads that dictionary into the compression library before reading each block.

Test Plan: new unit test

Reviewers: yhchiang, IslamAbdelRahman, cyan, sdong

Reviewed By: sdong

Subscribers: andrewkr, yoshinorim, kradhakrishnan, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D52287
2016-04-27 17:36:03 -07:00
Sergey Makarenko
1c80dfab24 Print memory allocation counters
Summary:
Introduced option to dump malloc statistics using new option flag.
    Added new command line option to db_bench tool to enable this
    funtionality.
    Also extended build to support environments with/without jemalloc.

Test Plan:
1) Build rocksdb using `make` command. Launch the following command
    `./db_bench --benchmarks=fillrandom --dump_malloc_stats=true
    --num=10000000` end verified that jemalloc dump is present in LOG file.
    2) Build rocksdb using `DISABLE_JEMALLOC=1  make db_bench -j32` and ran
    the same db_bench tool and found the following message in LOG file:
    "Please compile with jemalloc to enable malloc dump".
    3) Also built rocksdb using `make` command on MacOS to verify behavior
    in non-FB environment.
    Also to debug build configuration change temporary changed
    AM_DEFAULT_VERBOSITY = 1 in Makefile to see compiler and build
    tools output. For case 1) -DROCKSDB_JEMALLOC was present in compiler
    command line. For both 2) and 3) this flag was not present.

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57321
2016-04-27 16:23:33 -07:00
Islam AbdelRahman
7c14abf2c7 Improve BytewiseComparatorImpl::FindShortestSeparator
Summary:
The current implementation find the first different byte and try to increment it, if it cannot it return the original key
we can improve this by keep going after the first different byte to find the first non 0xFF byte and increment it

After trying this patch on some logdevice sst files I see decrease in there index block size by 8.5%

Test Plan: existing tests and updated test

Reviewers: yhchiang, andrewkr, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56241
2016-04-25 23:02:14 -07:00
dx9
b71c4e613f Alpine Linux Build (#990)
* Musl libc does not provide adaptive mutex. Added feature test for PTHREAD_MUTEX_ADAPTIVE_NP.

* Musl libc does not provide backtrace(3). Added a feature check for backtrace(3).

* Fixed compiler error.

* Musl libc does not implement backtrace(3). Added platform check for libexecinfo.

* Alpine does not appear to support gcc -pg option. By default (gcc has PIE option enabled) it fails with:

gcc: error: -pie and -pg|p|profile are incompatible when linking

When -fno-PIE and -nopie are used it fails with:

/usr/lib/gcc/x86_64-alpine-linux-musl/5.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find gcrt1.o: No such file or directory

Added gcc -pg platform test and output PROFILING_FLAGS accordingly. Replaced pg var in Makefile with PROFILING_FLAGS.

* fix segfault when TEST_IOCTL_FRIENDLY_TMPDIR is undefined and default candidates are not suitable

* use ASSERT_DOUBLE_EQ instead of ASSERT_EQ

* When compiled with ROCKSDB_MALLOC_USABLE_SIZE UniversalCompactionFourPaths and UniversalCompactionSecondPathRatio tests fail due to premature memtable flushes on systems with 16-byte alignment. Arena runs out of block space before GenerateNewFile() completes.

Increased options.write_buffer_size.
2016-04-22 16:49:12 -07:00
Dmitri Smirnov
ee221d2de0 Introduce XPRESS compresssion on Windows. (#1081)
Comparable with Snappy on comp ratio.
  Implemented using Windows API, does not require external package.
  Avaiable since Windows 8 and server 2012.
  Use -DXPRESS=1 with CMake to enable.
2016-04-19 22:54:24 -07:00
sdong
4b6833aec1 Rename options.compaction_measure_io_stats to options.report_bg_io_stats and include flush too.
Summary: It is useful to print out IO stats in flush jobs too. Extend options.compaction_measure_io_stats to flush jobs and raname it.

Test Plan: Try db_bench and see the stats are printed out.

Reviewers: yhchiang

Reviewed By: yhchiang

Subscribers: kradhakrishnan, yiwu, IslamAbdelRahman, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56769
2016-04-15 10:22:18 -07:00
Jay Edgar
b345b36620 Add a minimum value for the refill bytes per period value
Summary: If the user specified a small enough value for the rate limiter's bytes per second, the calculation for the number of refill bytes per period could become zero which would effectively cause the server to hang forever.

Test Plan: Existing tests

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56631
2016-04-13 09:01:42 -07:00
sdong
63cf15bb9f Fix option settable tests
Summary: In option settable tests, bytes for pointers are not all skipped, so that they may be the same as the special character and cause false positive.

Test Plan: Run the test. Manually verify the issue is not there any more.

Reviewers: IslamAbdelRahman, andrewkr

Reviewed By: IslamAbdelRahman

Subscribers: kradhakrishnan, yiwu, yhchiang, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56553
2016-04-12 14:03:35 -07:00
sdong
a23c6052c8 Don't run DBOptionsAllFieldsSettable under valgrind
Summary: Test DBOptionsAllFieldsSettable sometimes fails under valgrind. Move option settable tests to a separate test file and disable it in valgrind..

Test Plan: Run valgrind test and make sure the test doesn't run.

Reviewers: andrewkr, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: kradhakrishnan, yiwu, yhchiang, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56529
2016-04-11 14:24:55 -07:00
Andrew Kryczka
114a1b8792 Fix build errors for windows
Summary:
- Need to use unsigned long long for 64-bit literals on windows
- Need size_t for backup meta-file length since clang doesn't let us assign size_t to int

Test Plan: backupable_db_test and options_test

Reviewers: IslamAbdelRahman, yhchiang, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D56391
2016-04-08 13:09:19 -07:00
sdong
1518b733eb Change default number of cache shard bit to be 6 and max_file_opening_threads to be 16.
Summary: Cache shard bit 4 is sometimes too small and 6 is a more common value picked by users. Make that default. It shouldn't hurt much to change options.max_file_opening_threads default to be 16, which will reduce the worst case DB open time.

Test Plan: Run all existing tests.

Reviewers: IslamAbdelRahman, yhchiang, kradhakrishnan, igor, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D55047
2016-04-07 13:55:10 -07:00
Andrew Kryczka
f2c43a4a27 Stderr info logger
Summary:
Adapted a stderr logger from the option tests. Moved it to a separate
header so we can reuse it, e.g., from ldb subcommands for faster debugging. This
is especially useful to make errors/warnings more visible when running
"ldb repair", which involves potential data loss.

Test Plan:
ran options_test and "ldb repair"

  $ ./ldb repair --db=./tmp/
  [WARN] **** Repaired rocksdb ./tmp/; recovered 1 files; 588bytes. Some data may have been lost. ****
  OK

Reviewers: IslamAbdelRahman, yhchiang, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D56151
2016-04-01 11:06:06 -07:00
Marton Trencseni
9b51987521 Adding pin_l0_filter_and_index_blocks_in_cache feature and related fixes.
Summary:
When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache.
What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention.

Test Plan:
'export TEST_TMPDIR=/dev/shm/ && DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32' is OK.
I didn't run the Java tests, I don't have Java set up on my devserver.

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56133
2016-04-01 10:42:39 -07:00
sdong
2feafa3db9 Change some RocksDB default options
Summary: Change some RocksDB default options to make it more friendly to server workloads.

Test Plan: Run all existing tests

Reviewers: yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: sumeet, muthu, benj, MarkCallaghan, igor, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D55941
2016-03-31 17:12:18 -07:00
Yueh-Hsuan Chiang
a558830f8f Fixed compile warnings in posix_logger.h and coding.h
Summary:
Fixed the following compile warnings:

/Users/yhchiang/rocksdb/util/posix_logger.h:32:11: error: unused variable 'kDebugLogChunkSize' [-Werror,-Wunused-const-variable]
const int kDebugLogChunkSize = 128 * 1024;
          ^
/Users/yhchiang/rocksdb/util/coding.h:24:20: error: unused variable 'kMaxVarint32Length' [-Werror,-Wunused-const-variable]
const unsigned int kMaxVarint32Length = 5;
                   ^
2 errors generated.

Test Plan: make clean rocksdb

Reviewers: igor, sdong, anthony, IslamAbdelRahman, rven, kradhakrishnan, adamretter

Reviewed By: adamretter

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D56223
2016-03-31 16:01:47 -07:00
Laurent Demailly
21700a5106 to/from hex refactor
Summary:
Expose the inverse of ToString(hex=true) on Slice: Slice::DecodeHex
Refactor the other implementation of to/from hex in ldb_cmd.h to use the Slice
version
(Difference between the 2 is whether 0x is expected/produced in front of the hex
string or not)
Eliminated support for invalid odd length hex string - this is now invalid
instead of having 1/2 byte set
Added (inverse of HexToString) test for LDBCommand::StringToHex which also
indirectly tests Slice::ToString(true)

After moving the original implementation from ldb_cmd.h, updated it to much simpler/efficient version
(originally/inspired from https://github.com/facebook/wdt/blob/master/util/EncryptionUtils.cpp#L140-L169 )

Test Plan: run tests

Reviewers: uddipta, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56121
2016-03-30 14:36:48 -07:00
sdong
b9d4fa890b Options settable tests to use a different special charactor
Summary: Something changed and the special charactor seems to be conflict with an exisitng value. Change it to unblock the build.

Test Plan: Run the test and make sure it passes

Reviewers: kradhakrishnan

Reviewed By: kradhakrishnan

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D55845
2016-03-22 16:21:53 -07:00
sdong
b1fafcaca6 Revert "Adding pin_l0_filter_and_index_blocks_in_cache feature."
This reverts commit 522de4f59e.

It has bug of index block cleaning up.
2016-03-21 11:50:42 -07:00
Karthikeyan Radhakrishnan
fbea4dc660 Merge pull request #1042 from SherlockNoMad/HistFix
Fix in HistogramWindowingImpl
2016-03-18 14:27:55 -07:00
Marton Trencseni
522de4f59e Adding pin_l0_filter_and_index_blocks_in_cache feature.
Summary:
When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache.
What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention.
When the table reader is destroyed, it releases the pinned blocks (if there were any). This has to happen before the cache is destroyed, so I had to introduce a TableReader::Close(), to guarantee the order of destruction.

Test Plan:
Added two unit tests for this. Existing unit tests run fine (default is pin_l0_filter_and_index_blocks_in_cache=false).

DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32
  Mac: OK.
  Linux: with D55287 patched in it's OK.

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54801
2016-03-17 22:40:01 +00:00
SherlockNoMad
4ecc03c039 Fix in HistogramWindowingImpl 2016-03-17 14:28:41 -07:00
Andrew Kryczka
72224104d3 Forge current file for checkpoint
Summary:
This fixes a similar issue as D54711: "CURRENT" file can mutate between
GetLiveFiles() and copy to the tmp directory, in which case it would reference
the wrong manifest filename. To fix this, I forge the "CURRENT" file such that
it simply contains the filename for the manifest returned by GetLiveFiles().

- Changed CreateCheckpoint() to forge current file
- Added CreateFile() utility function
- Added test case that rolls manifest during checkpoint creation

Test Plan:
  $ ./checkpoint_test

Reviewers: sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D55065
2016-03-17 10:07:21 -07:00
dhruba borthakur
33d568611d Merge pull request #1040 from bureau14/master
Fixes warnings and ensure correct int behavior on 32-bit platforms.
2016-03-17 03:23:30 -07:00
Edouard A
02e62ebbc8 Fixes warnings and ensure correct int behavior on 32-bit platforms. 2016-03-16 22:57:57 +01:00
SherlockNoMad
f76b260eff Fix FB internal CI build failure 2016-03-15 11:38:15 -07:00
Siying Dong
774922c680 Merge pull request #1026 from SherlockNoMad/Hist
Histogram Concurrency Improvement and Time-Windowing Support
2016-03-15 11:27:54 -07:00
SherlockNoMad
f11b0df121 Fix AppVeyor build error 2016-03-15 10:57:33 -07:00
SherlockNoMad
58379bfb55 remove division from histogramwidowing impl 2016-03-14 21:41:13 -07:00
SherlockNoMad
fd664a27b5 Fix Build Error 2016-03-11 22:56:25 -08:00
SherlockNoMad
54f6b9e162 Histogram Concurrency Improvement and Time-Windowing Support 2016-03-11 16:54:25 -08:00
agiardullo
790252805d Add multithreaded transaction test
Summary: Refactored db_bench transaction stress tests so that they can be called from unit tests as well.

Test Plan: run new unit test as well as db_bench

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D55203
2016-03-11 15:16:52 -08:00
Yi Wu
f71fc77b7c Cache to have an option to fail Cache::Insert() when full
Summary:
Cache to have an option to fail Cache::Insert() when full. Update call sites to check status and handle error.

I totally have no idea what's correct behavior of all the call sites when they encounter error. Please let me know if you see something wrong or more unit test is needed.

Test Plan: make check -j32, see tests pass.

Reviewers: anthony, yhchiang, andrewkr, IslamAbdelRahman, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D54705
2016-03-10 17:35:19 -08:00
Yueh-Hsuan Chiang
bf1c4089db Use pure if-then check instead of assert in EraseColumnFamilyInfo
Summary:
Use pure if-then check instead of assert in EraseColumnFamilyInfo
when the specified column family does not found in the cf_info_map_.
So the second deletion will be no op instead of crash.

Test Plan: existing test.

Reviewers: sdong, anthony, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D55023
2016-03-04 16:03:31 -08:00
Omar Sandoval
deb08b8226 Add parsing of missing DB options
Summary: There are a few options in struct DBOptions that aren't handled by options_helper.cc. Add those missing options so they can be used by GetDBOptionsFromString() and friends.

Test Plan: Updated options_test.cc, reran all tests.

Reviewers: sdong, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D54603
2016-03-02 10:34:14 -08:00
Yi Wu
2568985ab3 IOStatsContext::ToString() add option to exclude zero counters
Summary: similar to D52809 add option to exclude zero counters.

Test Plan:
[yiwu@dev4504.prn1 ~/rocksdb] ./iostats_context_test
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from IOStatsContextTest
[ RUN      ] IOStatsContextTest.ToString
[       OK ] IOStatsContextTest.ToString (0 ms)
[----------] 1 test from IOStatsContextTest (0 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (0 ms total)
[  PASSED  ] 1 test.

Reviewers: anthony, yhchiang, andrewkr, IslamAbdelRahman, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54591
2016-02-23 10:26:24 -08:00
Andrew Kryczka
b046916656 Redo SyncPoints for flush while rolling test
Summary:
There was a race condition in the test where the rolling thread
acquired the mutex before the flush thread pinned the logger. Rather than add
more complicated synchronization to fix it, I followed Siying's suggestion to
use SyncPoint in the test code.

Comments in the LoadDependency() invocation explain the reason for each of the
sync points.

Test Plan:
Ran test 1000 times for tsan/asan. Will wait for all sandcastle tests
to finish before committing since this is a tricky test.

Reviewers: IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D54615
2016-02-22 21:32:19 -08:00
Yueh-Hsuan Chiang
79ca039eb4 Relax the check condition of prefix_extractor in CheckOptionsCompatibility
Summary:
Relax the check condition of prefix_extractor in CheckOptionsCompatibility
by allowing changing value from non-nullptr to nullptr or nullptr to
non-nullptr.

Test Plan:
options_test
options_util_test

Reviewers: sdong, anthony, IslamAbdelRahman, kradhakrishnan, gunnarku

Reviewed By: gunnarku

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D54477
2016-02-19 14:42:24 -08:00
sdong
c7f1a8a468 Fix LITE build thread_local_test
Summary: Recent change break thread_local_test by introducing exception, which is disabled in LITE build. Fix it by disabling exception handling in LITE build.

Test Plan: Build with both of LITE and non-LITE

Reviewers: anthony, IslamAbdelRahman, yhchiang, kradhakrishnan, andrewkr

Reviewed By: andrewkr

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54513
2016-02-19 13:57:18 -08:00
Dmitri Smirnov
d37d348da8 This addresses build issues on Windows
https://github.com/facebook/rocksdb/issues/1002
2016-02-19 12:29:54 -08:00
Islam AbdelRahman
6b2a047dfc Fix SstFileManager uninitialized data member
Summary: I have introduced max_allowed_space_ but did not initialize it

Test Plan: make check

Reviewers: sdong, yhchiang, anthony

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D54357
2016-02-18 11:25:19 -08:00
Yueh-Hsuan Chiang
a3db93c261 Remove the SyncPoint usage in the destructor of PosixEnv
Summary:
Remove the SyncPoint usage in the destructor of PosixEnv as none
of any active tests is using it.

SyncPoint is a test-only utility class, and it's a static varible.
As a result, using SyncPoint in the destructor of PosixEnv will
make default Env depends on SyncPoint.  Removing such dependency
could solve the problem crash issue only reproducable in Mac
environment.

Test Plan: OPT=-DTRAVIS V=1 make -j4 check on Mac environment

Reviewers: sdong, anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D54333
2016-02-17 23:32:14 -08:00
Islam AbdelRahman
df9ba6df62 Introduce SstFileManager::SetMaxAllowedSpaceUsage() to cap disk space usage
Summary:
Introude SstFileManager::SetMaxAllowedSpaceUsage() that can be used to limit the maximum space usage allowed for RocksDB.
When this limit is exceeded WriteImpl() will fail and return Status::Aborted()

Test Plan: unit testing

Reviewers: yhchiang, anthony, andrewkr, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D53763
2016-02-17 15:20:23 -08:00
Andrew Kryczka
3943d16780 Fix race conditions in auto-rolling logger
Summary:
For GetLogFileSize() and Flush(), they previously did not follow the
synchronization pattern for accessing logger_. This meant ResetLogger() could
cause logger_ destruction while the unsynchronized functions were accessing it,
causing a segfault.

Also made the mutex instance variable mutable so we can preserve
GetLogFileSize()'s const-ness.

Test Plan:
new test case, it's quite ugly because both threads need to access
one of the functions with SyncPoints (PosixLogger::Flush()), and also special
handling is needed to prevent the mutex and sync points from conflicting.

Reviewers: kradhakrishnan, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D54237
2016-02-17 12:06:45 -08:00
fengjian
1336052497 fix ios build error 2016-02-17 20:22:40 +08:00
John Dennison
bd5f842bba fixes typo in options logging 2016-02-15 23:55:29 +00:00
Andrew Kryczka
1c868d6848 Fix includes for env_test
Summary:
Previously compilation failed when ROCKSDB_NO_FBCODE=1 because fcntl.h
wasn't included for open().

Related issue: https://github.com/facebook/rocksdb/issues/977

Test Plan:
verified below command works now:

  $ make clean && ROCKSDB_NO_FBCODE=1 ROCKSDB_DISABLE_FALLOCATE=1 make -j32 env_test

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D54135
2016-02-12 14:17:14 -08:00
Yueh-Hsuan Chiang
908100399c Fixed a dependency issue of ThreadLocalPtr
Summary:
When a child thread that uses ThreadLocalPtr, ThreadLocalPtr::OnThreadExit
will be called when that child thread is destroyed.  However,
OnThreadExit will try to access a static singleton of ThreadLocalPtr,
which will be destroyed when the main thread exit.  As a result,
when a child thread that uses ThreadLocalPtr exits AFTER the main thread
exits, illegal memory access will occur.

This diff includes a test that reproduce this legacy bug.

    ==2095206==ERROR: AddressSanitizer: heap-use-after-free on address
    0x608000007fa0 at pc 0x959b79 bp 0x7f5fa7426b60 sp 0x7f5fa7426b58
    READ of size 8 at 0x608000007fa0 thread T1

This patch fix this issue by having the thread local mutex never be deleted
(but will leak small piece of memory at the end.)   The patch also describe
a better solution (thread_local) in the comment that requires gcc 4.8.1 and
in latest clang as a future work once we agree to move toward gcc 4.8.

Test Plan:
COMPILE_WITH_ASAN=1 make thread_local_test -j32
./thread_local_test --gtest_filter="*MainThreadDiesFirst"

Reviewers: anthony, hermanlee4, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D53013
2016-02-10 16:56:01 -08:00
Baraa Hamodi
21e95811d1 Updated all copyright headers to the new format. 2016-02-09 15:12:00 -08:00
Andrew Kryczka
59b3ee658f Env function for bulk metadata retrieval
Summary:
Added this new function, which returns filename, size, and modified
timestamp for each file in the provided directory. The default implementation
retrieves the metadata sequentially using existing functions. In the next diff
I'll make HdfsEnv override this function to use libhdfs's bulk get function.

This won't work on windows due to the path separator.

Test Plan:
new unit test

  $ ./env_test --gtest_filter=EnvPosixTest.ConsistentChildrenMetadata

Reviewers: yhchiang, sdong

Reviewed By: sdong

Subscribers: IslamAbdelRahman, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D53781
2016-02-09 14:54:32 -08:00
Islam AbdelRahman
8e6172bc57 Add BlockBasedTableOptions::index_block_restart_interval
Summary: Add a new option to BlockBasedTableOptions that will allow us to change the restart interval for the index block

Test Plan: unit tests

Reviewers: yhchiang, anthony, andrewkr, sdong

Reviewed By: sdong

Subscribers: march, dhruba

Differential Revision: https://reviews.facebook.net/D53721
2016-02-05 10:22:37 -08:00
sdong
f7c0f4e3ef perf_context.cc and iostats_context.cc use different output macro (fix unity build)
Summary: Unity build because perf_context.cc and iostats_context.cc define a different OUTPUT macro. Fix it.

Test Plan: Build unity

Reviewers: kradhakrishnan, anthony, yhchiang, andrewkr, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: divchenko, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D53769
2016-02-03 17:27:08 -08:00
Dmytro Ivchenko
aa5e3b7c04 PerfContext::ToString() add option to exclude zero counters
Test Plan: Added unit test to check w/ w/o zeros scenarios

Reviewers: yhchiang

Reviewed By: yhchiang

Subscribers: sdong, dhruba

Differential Revision: https://reviews.facebook.net/D52809
2016-02-01 13:41:13 -08:00
Venkatesh Radhakrishnan
3b2a1ddd2e Add options.base_background_compactions as a number of compaction threads for low compaction debt
Summary:
If options.base_background_compactions is given, we try to schedule number of compactions not existing this number, only when L0 files increase to certain number, or pending compaction bytes more than certain threshold, we schedule compactions based on options.max_background_compactions.

The watermarks are calculated based on slowdown thresholds.

Test Plan:
Add new test cases in column_family_test.
Adding more unit tests.

Reviewers: IslamAbdelRahman, yhchiang, kradhakrishnan, rven, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D53409
2016-01-29 16:15:53 -08:00
Islam AbdelRahman
d6c838f1e1 Add SstFileManager (component tracking all SST file in DBs and control the deletion rate)
Summary:
Add a new class SstFileTracker that will be notified whenever a DB add/delete/move and sst file, it will also replace DeleteScheduler
SstFileTracker can be used later to abort writes when we exceed a specific size

Test Plan: unit tests

Reviewers: rven, anthony, yhchiang, sdong

Reviewed By: sdong

Subscribers: igor, lovro, march, dhruba

Differential Revision: https://reviews.facebook.net/D50469
2016-01-28 18:35:01 -08:00
sdong
d20915d52a Disable stats about mutex duration by default
Summary: Measuring mutex duration will measure time inside DB mutex, which breaks our best practice. Add a stat level in Statistics class. By default, disable to measure the mutex operations.

Test Plan: Add a unit test to make sure it is off by default.

Reviewers: rven, anthony, IslamAbdelRahman, kradhakrishnan, andrewkr, yhchiang

Reviewed By: yhchiang

Subscribers: MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D53367
2016-01-26 14:56:55 -08:00
Andrew Kryczka
167bd8856d [directory includes cleanup] Finish removing util->db dependencies 2016-01-26 10:49:24 -08:00
Andrew Kryczka
acd7d58695 [directory includes cleanup] Remove util->db dependency for ThreadStatusUtil
Summary:
We can avoid the dependency by forward-declaring ColumnFamilyData and
then treating it as a black box. That means callers of ThreadStatusUtil need to
explicitly provide more options, even if they can be derived from the
ColumnFamilyData, since ThreadStatusUtil doesn't include the definition.

This is part of a series of diffs to eliminate circular dependencies between
directories (e.g., db/* files depending on util/* files and vice-versa).

Test Plan:
  $ ./db_test --gtest_filter=DBTest.GetThreadStatus
  $ make -j32 commit-prereq

Reviewers: sdong, yhchiang, IslamAbdelRahman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D53361
2016-01-26 10:49:16 -08:00
Andrew Kryczka
46f9cd46af [directory includes cleanup] Move cross-function test points
Summary:
I split the db-specific test points out into a separate file under db/
directory. There were also a few bugs to fix in xfunc.{h,cc} that prevented it
from compiling previously; see https://reviews.facebook.net/D36825.

Test Plan:
compilation works now, below command works, will also run "make xfunc".

  $ make check ROCKSDB_XFUNC_TEST='managed_new' tests-regexp='DBTest' -j32

Reviewers: sdong, yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D53343
2016-01-26 10:49:05 -08:00
sdong
fb9811ee9b Add a perf context level that doesn't measure time for mutex operations
Summary: Timing mutex operations can impact scalability of the system. Add a new perf context level that can measure time counters except for mutex.

Test Plan: Add a new unit test case to make sure it is not set.

Reviewers: IslamAbdelRahman, rven, kradhakrishnan, yhchiang, anthony

Reviewed By: anthony

Subscribers: MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D53199
2016-01-25 14:45:03 -08:00
sdong
f1ed170105 Add tests to make sure new DB or ColumnFamily options are settable through string
Summary: Add a test to fail if someone adds a DB options.

Test Plan: Run the test, run the test with valgrind. Add an option to DB option in the middle or in the end and make sure it fails.

Reviewers: yhchiang, anthony, IslamAbdelRahman, kradhakrishnan, rven, andrewkr

Reviewed By: andrewkr

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D53097
2016-01-22 16:18:44 -08:00
sdong
2c2b72218d Disable OptionsParserTest.BlockBasedTableOptionsAllFieldsSettable under CLANG
Summary: OptionsParserTest.BlockBasedTableOptionsAllFieldsSettable is failiong under CLANG. Disable the test to unblock the build.

Test Plan: Run it both of CLANG and GCC

Reviewers: kradhakrishnan, rven, andrewkr, anthony, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D53157
2016-01-21 10:56:34 -08:00
sdong
202be23e46 Add test that verifies all options in BlockBasedTableOptions is settable through GetBlockBasedTableOptionsFromString()
Summary: Add a test OptionsParserTest.BlockBasedTableOptionsAdded, which will fail if a new option is added to BlockBasedTableOptions but is not settable through GetBlockBasedTableOptionsFromString().

Test Plan: Run the test. Also manually remove and add options and make sure it fails.

Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, rven, yhchiang, andrewkr

Reviewed By: andrewkr

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52953
2016-01-20 21:59:28 -08:00
sdong
fdbff42391 Crash test to make kill decision for every kill point
Summary:
In crash test, when coming to each kill point, we start a random class using seed as current second. With this approach, for every second, the random number used is the same. However, in each second, there are multiple kill points with different frequency. It makes it hard to reason about chance of kill point to trigger. With this commit, we use thread local random seed to generate the random number, so that it will take different values per second, hoping it makes chances of killing much easier to reason about.

Also significantly reduce the kill odd to make sure time before kiling is similar as before.

Test Plan: Run white box crash test and see the killing happens as expected and the run time time before killing reasonable.

Reviewers: kradhakrishnan, IslamAbdelRahman, rven, yhchiang, andrewkr, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52971
2016-01-19 18:11:24 -08:00
David Bernard
df7c2f3b59 As per google coding standard replace "using" in option_builder.cc and
geodb_impl.cc
2016-01-20 00:27:29 +00:00
David Bernard
d78c6b28c4 Changes for build on solaris
Makefile adjust paths for solaris build
Makefile enable _GLIBCXX_USE_C99 so that std::to_string is available
db_compaction_test.cc Initialise a variable to avoid a compilation error
db_impl.cc Include <alloca.h>
db_test.cc Include <alloca.h>
Environment.java recognise solaris envrionment
options_bulder.cc Make log unambiguous
geodb_impl.cc Make log and floor unambiguous
2016-01-19 04:45:21 +00:00
Gunnar Kudrjavets
aec10f734b Guard falloc.h inclusion to avoid build breaks
Summary: Depending on the order of include paths and versions of various headers we may end up in a situation where we'll encounter a build break caused by redefinition of constants. gcc-4.9-glibc-2.20 header update to include/bits/fcntl-linux.h introduced the definitions of FALLOC_FL_* constants. However, linux/falloc.h from kernel-headers also has FALLOC_FL_* constants defined. Therefore during the compilation we'll get "previously defined" errors.

Test Plan:
Both in the environment where the build break manifests (to make sure that the change fixed the problem) and in the environment where everything builds fine (to make sure that there are no regressions):

make clean
make -j 32

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D52821
2016-01-14 22:47:15 -08:00
Dmitri Smirnov
ac50fd3a71 Align statistics
Use Yield macro to make it a little more portable between platforms.
2016-01-13 14:53:23 -08:00
Igor Canadi
48a8667c30 Merge pull request #929 from warrenfalk/fix32
fix a compile error on 32-bit (fixes #634)
2016-01-12 11:02:36 -08:00
sdong
9a8e3f73ed plain table reader: non-mmap mode to keep two recent buffers
Summary: In plain table reader's non-mmap mode, we only keep the most recent read buffer. However, for binary search, it is likely we come back to a location to read. To avoid one pread in such a case, we keep two read buffers. It should cover most of the cases.

Test Plan:
1. run tests
2. check the optimization works through strace when running
./table_reader_bench -mmap_read=false --num_keys2=1 -num_keys1=5000 -table_factory=plain_table --iterator --through_db

Reviewers: anthony, rven, kradhakrishnan, igor, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51171
2016-01-08 10:53:57 -08:00
Warren Falk
94d9df2482 fix an unused function compiler warning in crc32c in 32-bit mode 2016-01-07 13:27:20 -05:00
Warren Falk
2f01e10fa9 use static_cast in crc32c instead of c-style cast 2016-01-07 13:22:09 -05:00
Warren Falk
601f1306a1 fix shorten-64-to-32 warning in crc32c 2016-01-07 13:12:15 -05:00
Warren Falk
55b37efa15 fix a compile error on 32-bit 2016-01-07 11:51:52 -05:00
sdong
c9e2490bc6 Fix DynamicBloomTest.concurrent_with_perf to pass TSAN
Summary: TSAN fails on DynamicBloomTest.concurrent_with_perf. This change fixes it. Not sure why though.

Test Plan: Run the test with TSAN and make sure no warning shown.

Reviewers: yhchiang, IslamAbdelRahman, anthony, ngbronson, rven

Reviewed By: rven

Subscribers: rven, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52383
2015-12-29 16:28:45 -08:00
sdong
edf1cd497f Not generating "__attribute__((__unused__))" for padding fields if it is not CLANG
Summary: Adding "__attribute__((__unused__))" after padding fields will pass CLANG build but will fail gcc 4.8.1. Fix it by not generating it under GCC 4.8.1.

Test Plan: Build under four combinations of USE_CLANG=0,1 and ROCKSDB_FBCODE_BUILD_WITH_481=0.1.

Reviewers: yhchiang, rven, ngbronson, anthony, IslamAbdelRahman

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52371
2015-12-28 18:37:23 -08:00
sdong
11672df19a Fix CLANG errors introduced by 7d87f02799
Summary: Fix some CLANG errors introduced in 7d87f02799

Test Plan: Build with both of CLANG and gcc

Reviewers: rven, yhchiang, kradhakrishnan, anthony, IslamAbdelRahman, ngbronson

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52329
2015-12-28 10:00:58 -08:00
Siying Dong
7fafd52dce Merge pull request #900 from shuzhang1989/hdfs_env_fix
add a factory method for creating hdfs env
2015-12-28 09:28:04 -08:00
Shu Zhang
2b7c810db8 more foramt 2015-12-26 19:52:35 -08:00
Shu Zhang
b79ccbd573 indent 2015-12-26 19:50:28 -08:00
Nathan Bronson
7d87f02799 support for concurrent adds to memtable
Summary:
This diff adds support for concurrent adds to the skiplist memtable
implementations.  Memory allocation is made thread-safe by the addition of
a spinlock, with small per-core buffers to avoid contention.  Concurrent
memtable writes are made via an additional method and don't impose a
performance overhead on the non-concurrent case, so parallelism can be
selected on a per-batch basis.

Write thread synchronization is an increasing bottleneck for higher levels
of concurrency, so this diff adds --enable_write_thread_adaptive_yield
(default off).  This feature causes threads joining a write batch
group to spin for a short time (default 100 usec) using sched_yield,
rather than going to sleep on a mutex.  If the timing of the yield calls
indicates that another thread has actually run during the yield then
spinning is avoided.  This option improves performance for concurrent
situations even without parallel adds, although it has the potential to
increase CPU usage (and the heuristic adaptation is not yet mature).

Parallel writes are not currently compatible with
inplace updates, update callbacks, or delete filtering.
Enable it with --allow_concurrent_memtable_write (and
--enable_write_thread_adaptive_yield).  Parallel memtable writes
are performance neutral when there is no actual parallelism, and in
my experiments (SSD server-class Linux and varying contention and key
sizes for fillrandom) they are always a performance win when there is
more than one thread.

Statistics are updated earlier in the write path, dropping the number
of DB mutex acquisitions from 2 to 1 for almost all cases.

This diff was motivated and inspired by Yahoo's cLSM work.  It is more
conservative than cLSM: RocksDB's write batch group leader role is
preserved (along with all of the existing flush and write throttling
logic) and concurrent writers are blocked until all memtable insertions
have completed and the sequence number has been advanced, to preserve
linearizability.

My test config is "db_bench -benchmarks=fillrandom -threads=$T
-batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T
-level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999
-disable_auto_compactions --max_write_buffer_number=8
-max_background_flushes=8 --disable_wal --write_buffer_size=160000000
--block_size=16384 --allow_concurrent_memtable_write" on a two-socket
Xeon E5-2660 @ 2.2Ghz with lots of memory and an SSD hard drive.  With 1
thread I get ~440Kops/sec.  Peak performance for 1 socket (numactl
-N1) is slightly more than 1Mops/sec, at 16 threads.  Peak performance
across both sockets happens at 30 threads, and is ~900Kops/sec, although
with fewer threads there is less performance loss when the system has
background work.

Test Plan:
1. concurrent stress tests for InlineSkipList and DynamicBloom
2. make clean; make check
3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench
4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench
5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench
6. make clean; OPT=-DROCKSDB_LITE make check
7. verify no perf regressions when disabled

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba

Differential Revision: https://reviews.facebook.net/D50589
2015-12-25 11:03:40 -08:00
Shu Zhang
b4aa823661 format 2015-12-24 20:38:35 -08:00
Shu Zhang
4dfdd1d928 format 2015-12-24 20:32:29 -08:00
Siying Dong
298ba27ae2 Merge pull request #846 from yuslepukhin/enble_c4244_lossofdata
Enable MS compiler warning c4244.
2015-12-23 22:59:42 -08:00
Siying Dong
7810aa802a Merge pull request #899 from zhipeng-jia/fix_clang_warning
Fix clang warnings
2015-12-23 22:58:52 -08:00
Zhipeng Jia
ec2664fefd Fix clang compile error under Linux 2015-12-24 12:41:40 +08:00
Shu Zhang
4fd23fb130 add a factory method for creating hdfs env 2015-12-23 17:26:50 -08:00
sdong
15b8902264 Change default options.delayed_write_rate
Summary: We now have a mechanism to further slowdown writes. Double default options.delayed_write_rate to try to keep the default behavior closer to it used to be.

Test Plan: Run all tests.

Reviewers: IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: yhchiang, kradhakrishnan, rven, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52281
2015-12-23 14:51:55 -08:00
sdong
b9f77ba12b When slowdown is triggered, reduce the write rate
Summary: It's usually hard for users to set a value of options.delayed_write_rate. With this diff, after slowdown condition triggers, we greedily reduce write rate if estimated pending compaction bytes increase. If estimated compaction pending bytes drop, we increase the write rate.

Test Plan:
Add a unit test
Test with db_bench setting:
TEST_TMPDIR=/dev/shm/ ./db_bench --benchmarks=fillrandom -num=10000000 --soft_pending_compaction_bytes_limit=1000000000 --hard_pending_compaction_bytes_limit=3000000000 --delayed_write_rate=100000000

and make sure without the commit, write stop will happen, but with the commit, it will not happen.

Reviewers: igor, anthony, rven, yhchiang, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52131
2015-12-23 11:33:15 -08:00
Igor Canadi
8ac7fb8377 Merge pull request #863 from zhangyybuaa/fix_hdfs_error
Fix build error with hdfs
2015-12-22 09:27:51 +01:00
sdong
167fb919a5 ZSTD to use CompressionOptions.level
Summary: Now ZSTD hard code level 1. Change it to use the compression level setting.

Test Plan: Run it with hacked codes of sst_dump and show ZSTD compression sizes with different levels.

Reviewers: rven, anthony, yhchiang, kradhakrishnan, igor, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: yoshinorim, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52041
2015-12-16 16:58:04 -08:00
Islam AbdelRahman
aececc209e Introduce ReadOptions::pin_data (support zero copy for keys)
Summary:
This patch update the Iterator API to introduce new functions that allow users to keep the Slices returned by key() valid as long as the Iterator is not deleted

ReadOptions::pin_data : If true keep loaded blocks in memory as long as the iterator is not deleted
Iterator::IsKeyPinned() : If true, this mean that the Slice returned by key() is valid as long as the iterator is not deleted

Also add a new option BlockBasedTableOptions::use_delta_encoding to allow users to disable delta_encoding if needed.

Benchmark results (using https://phabricator.fb.com/P20083553)

```
// $ du -h /home/tec/local/normal.4K.Snappy/db10077
// 6.1G    /home/tec/local/normal.4K.Snappy/db10077

// $ du -h /home/tec/local/zero.8K.LZ4/db10077
// 6.4G    /home/tec/local/zero.8K.LZ4/db10077

// Benchmarks for shard db10077
// _build/opt/rocks/benchmark/rocks_copy_benchmark \
//      --normal_db_path="/home/tec/local/normal.4K.Snappy/db10077" \
//      --zero_db_path="/home/tec/local/zero.8K.LZ4/db10077"

// First run
// ============================================================================
// rocks/benchmark/RocksCopyBenchmark.cpp          relative  time/iter  iters/s
// ============================================================================
// BM_StringCopy                                                 1.73s  576.97m
// BM_StringPiece                                   103.74%      1.67s  598.55m
// ============================================================================
// Match rate : 1000000 / 1000000

// Second run
// ============================================================================
// rocks/benchmark/RocksCopyBenchmark.cpp          relative  time/iter  iters/s
// ============================================================================
// BM_StringCopy                                              611.99ms     1.63
// BM_StringPiece                                   203.76%   300.35ms     3.33
// ============================================================================
// Match rate : 1000000 / 1000000
```

Test Plan: Unit tests

Reviewers: sdong, igor, anthony, yhchiang, rven

Reviewed By: rven

Subscribers: dhruba, lovro, adsharma

Differential Revision: https://reviews.facebook.net/D48999
2015-12-16 12:08:30 -08:00
Venkatesh Radhakrishnan
030215bf01 Running manual compactions in parallel with other automatic or manual compactions in restricted cases
Summary:
This diff provides a framework for doing manual
compactions in parallel with other compactions. We now have a deque of manual compactions. We also pass manual compactions as an argument from RunManualCompactions down to
BackgroundCompactions, so that RunManualCompactions can be reentrant.
Parallelism is controlled by the two routines
ConflictingManualCompaction to allow/disallow new parallel/manual
compactions based on already existing ManualCompactions. In this diff, by default manual compactions still have to run exclusive of other compactions. However, by setting the compaction option, exclusive_manual_compaction to false, it is possible to run other compactions in parallel with a manual compaction. However, we are still restricted to one manual compaction per column family at a time. All of these restrictions will be relaxed in future diffs.
I will be adding more tests later.

Test Plan: Rocksdb regression + new tests + valgrind

Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong

Reviewed By: sdong

Subscribers: yoshinorim, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47973
2015-12-14 11:20:34 -08:00
Dmitri Smirnov
aca403d2b5 Fix another rebase problems. 2015-12-11 17:33:40 -08:00
Dmitri Smirnov
236fe21c92 Enable MS compiler warning c4244.
Mostly due to the fact that there are differences in sizes of int,long
  on 64 bit systems vs GNU.
2015-12-11 16:47:34 -08:00
Yueh-Hsuan Chiang
00d6edf6a0 Ensure the destruction order of PosixEnv and ThreadLocalPtr
Summary:
By default, RocksDB initializes the singletons of ThreadLocalPtr first, then initializes PosixEnv
via static initializer.  Destructor terminates objects in reverse order, so terminating PosixEnv
(calling pthread_mutex_lock), then ThreadLocal (calling pthread_mutex_destroy).

However, in certain case, application might initialize PosixEnv first, then ThreadLocalPtr.
This will cause core dump at the end of the program (eg. https://github.com/facebook/mysql-5.6/issues/122)

This patch fix this issue by ensuring the destruction order by moving the global static singletons
to function static singletons.  Since function static singletons are initialized when the function is first
called, this property allows us invoke to enforce the construction of the static PosixEnv and the
singletons of ThreadLocalPtr by calling the function where the ThreadLocalPtr singletons belongs
right before we initialize the static PosixEnv.

Test Plan: Verified in the MyRocks.

Reviewers: yoshinorim, IslamAbdelRahman, rven, kradhakrishnan, anthony, sdong, MarkCallaghan

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51789
2015-12-11 00:21:58 -08:00
charsyam
c30b499541 fix typos in comments 2015-12-11 01:54:48 +09:00
sdong
56e77f0967 Deprecate options.soft_rate_limit and add options.soft_pending_compaction_bytes_limit
Summary: Deprecate options.soft_rate_limit, which is hard to tune, with options.soft_pending_compaction_bytes_limit, which would trigger the slowdown if estimated pending compaction bytes exceeds the threshold. The hope is to make it more striaght-forward to tune.

Test Plan: Modify DBTest.SoftLimit to cover options.soft_pending_compaction_bytes_limit instead; run all unit tests.

Reviewers: IslamAbdelRahman, yhchiang, rven, kradhakrishnan, igor, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51117
2015-12-09 18:22:45 -08:00
sdong
d6e1035a1f A new compaction picking priority that optimizes for write amplification for random updates.
Summary: Introduce a compaction picking priority that picks files who contains the oldest rows to compact. This is a mode that slightly improves write amplification for random update cases.

Test Plan: Add a unit test and run it in valgrind too.

Reviewers: yhchiang, anthony, IslamAbdelRahman, rven, kradhakrishnan, MarkCallaghan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51459
2015-12-09 18:13:03 -08:00
yuslepukhin
49957f9a98 Prefer integer arithmetics
The code had conversion to double then casting to size_t
  and then casting uint32_t which caused compiler warning (VS15).
2015-12-09 14:06:23 -08:00
Siying Dong
9c227923c6 Merge pull request #788 from OpenChannelSSD/to_fb_master2
Move posix threads into a library
2015-12-08 18:06:38 -08:00
Siying Dong
fa3dbf203f Merge pull request #853 from Vaisman/enable_C4267_warning
Enable C4267 warning
2015-12-08 17:59:24 -08:00
Siying Dong
56bbecc316 Merge pull request #867 from SherlockNoMad/CacheFix
Replace malloc with new for LRU Cache Handle
2015-12-08 17:58:29 -08:00
Yueh-Hsuan Chiang
774b80e99e Resubmit the fix for a race condition in persisting options
Summary:
This patch fix a race condition in persisting options which will cause a crash when:

* Thread A obtain cf options and start to persist options based on that cf options.
* Thread B kicks in and finish DropColumnFamily and delete cf_handle.
* Thread A wakes up and tries to finish the persisting options and crashes.

Test Plan: Add a test in column_family_test that can reproduce the crash

Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51717
2015-12-08 17:01:02 -08:00
sdong
ea11923550 Upgrade to ZSTD 0.4.2
Summary: Change to call the new compression function.

Test Plan: build and run db_bench with the compression to make sure it compresses.

Reviewers: anthony, rven, kradhakrishnan, IslamAbdelRahman, igor, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51603
2015-12-08 16:33:26 -08:00
sdong
770dea9325 Fix occasional failure of DBTest.DynamicCompactionOptions
Summary: DBTest.DynamicCompactionOptions ocasionally fails during valgrind run. We sent a sleeping task to block compaction thread pool but we don't wait it to run.

Test Plan: Run the test multiple times in an environment which can cause failure.

Reviewers: rven, kradhakrishnan, igor, IslamAbdelRahman, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51687
2015-12-07 18:38:39 -08:00
sdong
f307036bde Revert "Fix a race condition in persisting options"
This reverts commit 2fa3ed5180. It breaks RocksDB lite build
2015-12-07 17:09:12 -08:00
Yueh-Hsuan Chiang
2fa3ed5180 Fix a race condition in persisting options
Summary:
This patch fix a race condition in persisting options which will cause a crash when:

* Thread A obtain cf options and start to persist options based on that cf options.
* Thread B kicks in and finish DropColumnFamily and delete cf_handle.
* Thread A wakes up and tries to finish the persisting options and crashes.

Test Plan: Add a test in column_family_test that can reproduce the crash

Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51609
2015-12-07 15:25:12 -08:00
Javier González
b2863017b1 Move posix threads into a library
Summary: This patch moves all posix thread logic to a separate library.
The motivation is to allow another environments to easily reuse posix
threads. HDFS wraps already posix threads; this split would simplify
this code.

Test Plan: No new functionality is added to posix Env or the threading
library, thus the current tests should suffice.
2015-12-07 12:03:38 +01:00
SherlockNoMad
3a98a7ae7f Replace malloc with new for LRU Cache Handle 2015-12-04 15:12:07 -08:00
Zhang Yangyang
4687ced5db fix ToString() not declared error 2015-12-02 21:45:28 +08:00
sdong
d27ea4c9e5 Initialize options.row_cache
Summary: options.row_cache should already been initialized as null by default. Still try to set it following current convention, because one valgrind failure reports a failure related to it.

Test Plan: Run all unit tests

Reviewers: yhchiang, kradhakrishnan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51303
2015-11-30 10:30:35 -08:00
Nathan Bronson
9a9d4759b2 InlineSkipList part 3/3 - new skiplist type that colocates key and node
Summary:
This diff completes the creation of InlineSkipList<Cmp>, which is like
SkipList<const char*, Cmp> but it always allocates the key contiguously
with the node.  This allows us to remove the pointer from the node
to the key.  As a result the memory usage of the skip list is reduced
(by 1 to sizeof(void*) bytes depending on the padding required to align
the key storage), cache locality is improved, and we halve the number
of calls to the allocator.

For skip lists whose keys are freshly-allocated const char*,
InlineSkipList is stricly preferrable to SkipList.  This diff doesn't
replace SkipList, however, because some of the use cases of SkipList in
RocksDB are either character sequences that are not allocated at the
same time as the skip list node allocation (for example
hash_linklist_rep) or have different key types (for example
write_batch_with_index).  Taking advantage of inline allocation for
those cases is left to future work.

The perf win is biggest for small values.  For single-threaded CPU-bound
(32M fillrandom operations with no WAL log) with 16 byte keys and 0 byte
values, the db_bench perf goes from ~310k ops/sec to ~410k ops/sec.  For
large values the improvement is less pronounced, but seems to be between
5% and 10% on the same configuration.

Test Plan: make check

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D51123
2015-11-24 15:16:02 -08:00
Vasili Svirski
41b32c6059 Enable C4267 warning
* conversion from 'size_t' to 'type', by add static_cast

Tested:
* by build solution on Windows, Linux locally,
* run tests
* build CI system successful
2015-11-24 16:33:09 +03:00
yuslepukhin
047bd22aae Build on Visual Studio 2015 Update 1 2015-11-20 15:31:47 -08:00
Dmitri Smirnov
89bacb7e7d Enable MS Warning C4804 : unsafe use of type 'bool' in operation 2015-11-18 16:23:19 -08:00
Islam AbdelRahman
4159ab8169 Merge pull request #839 from SherlockNoMad/memtableOption
Support Memtable Factory Parse in option_helper.cc
2015-11-17 17:09:49 -08:00
sdong
6170fec251 Fix build broken by previous commit of "option helper refactor"
Summary:
The commit of option helper refactor broken the build:
(1) a git merge problem
(2) some uncaught compiler warning
Fix it.

Test Plan: Make sure "make all" passes

Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D50943
2015-11-17 16:52:54 -08:00
Siying Dong
3a6643c2fd Merge pull request #805 from SherlockNoMad/OptionHelperFix
Option Helper Refactoring
2015-11-17 16:24:52 -08:00
SherlockNoMad
bd7be035e0 Support Memtable Factory Parse in option_helper.cc 2015-11-17 14:29:01 -08:00
Islam AbdelRahman
a163cc2d5a Lint everything
Summary:
```
arc2 lint --everything
```

run the linter on the whole code repo to fix exisitng lint issues

Test Plan: make check -j64

Reviewers: sdong, rven, anthony, kradhakrishnan, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D50769
2015-11-16 12:56:21 -08:00
Yueh-Hsuan Chiang
e11f676e34 Add OptionsUtil::LoadOptionsFromFile() API
Summary:
This patch adds OptionsUtil::LoadOptionsFromFile() and
OptionsUtil::LoadLatestOptionsFromDB(), which allow developers
to construct DBOptions and ColumnFamilyOptions from a RocksDB
options file.  Note that most pointer-typed options such as
merge_operator will not be constructed.

With this API, developers no longer need to remember all the
options in order to reopen an existing rocksdb instance like
the following:

  DBOptions db_options;
  std::vector<std::string> cf_names;
  std::vector<ColumnFamilyOptions> cf_opts;

  // Load primitive-typed options from an existing DB
  OptionsUtil::LoadLatestOptionsFromDB(
      dbname, &db_options, &cf_names, &cf_opts);

  // Initialize necessary pointer-typed options
  cf_opts[0].merge_operator.reset(new MyMergeOperator());
  ...

  // Construct the vector of ColumnFamilyDescriptor
  std::vector<ColumnFamilyDescriptor> cf_descs;
  for (size_t i = 0; i < cf_opts.size(); ++i) {
    cf_descs.emplace_back(cf_names[i], cf_opts[i]);
  }

  // Open the DB
  DB* db = nullptr;
  std::vector<ColumnFamilyHandle*> cf_handles;
  auto s = DB::Open(db_options, dbname, cf_descs,
                    &handles, &db);

Test Plan:
Augment existing tests in column_family_test
options_test
db_test

Reviewers: igor, IslamAbdelRahman, sdong, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D49095
2015-11-12 06:52:43 -08:00
Yueh-Hsuan Chiang
e114f0abb8 Enable RocksDB to persist Options file.
Summary:
This patch allows rocksdb to persist options into a file on
DB::Open, SetOptions, and Create / Drop ColumnFamily.
Options files are created under the same directory as the rocksdb
instance.

In addition, this patch also adds a fail_if_missing_options_file in DBOptions
that makes any function call return non-ok status when it is not able to
persist options properly.

  // If true, then DB::Open / CreateColumnFamily / DropColumnFamily
  // / SetOptions will fail if options file is not detected or properly
  // persisted.
  //
  // DEFAULT: false
  bool fail_if_missing_options_file;

Options file names are formatted as OPTIONS-<number>, and RocksDB
will always keep the latest two options files.

Test Plan:
Add options_file_test.

options_test
column_family_test

Reviewers: igor, IslamAbdelRahman, sdong, anthony

Reviewed By: anthony

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48285
2015-11-10 22:58:01 -08:00