Commit Graph

1542 Commits

Author SHA1 Message Date
Siying Dong
4a3e7d320c Change the default of delayed slowdown value to 16MB/s
Summary:
Change the default of delayed slowdown value to 16MB/s and further increase the L0 stop condition to 36 files.
Closes https://github.com/facebook/rocksdb/pull/1821

Differential Revision: D4489229

Pulled By: siying

fbshipit-source-id: 1003981
2017-02-01 20:39:17 -08:00
Siying Dong
0513e21f9b RangeSync() should work with ROCKSDB_FALLOCATE_PRESENT not set
Summary: Closes https://github.com/facebook/rocksdb/pull/1824

Differential Revision: D4493862

Pulled By: siying

fbshipit-source-id: c168446
2017-02-01 10:24:20 -08:00
Islam AbdelRahman
8b369ae5bd Cleaner default options using C++11 in-class init
Summary:
C++11 in-class initialization is cleaner and makes it the default more explicit to our users and more visible.
Use it for ColumnFamilyOptions and DBOptions
Closes https://github.com/facebook/rocksdb/pull/1822

Differential Revision: D4490473

Pulled By: IslamAbdelRahman

fbshipit-source-id: c493a87
2017-01-31 18:09:15 -08:00
Islam AbdelRahman
ec79a7b53c Dedup code in option.cc and db_options.cc
Summary:
The code in DBOptions::Dump is simply a duplicate of the code in ImmutableDBOptions::Dump and MutableDBOptions.Dump

consolidate duplicate code.

tested visually
Closes https://github.com/facebook/rocksdb/pull/1818

Differential Revision: D4486710

Pulled By: IslamAbdelRahman

fbshipit-source-id: 7085189
2017-01-31 17:39:12 -08:00
Siying Dong
2d75cd40d3 NewLRUCache() to pick number of shard bits based on capacity if not given
Summary:
If the users use the NewLRUCache() without passing in the number of shard bits, instead of using hard-coded 6, we'll determine it based on capacity.
Closes https://github.com/facebook/rocksdb/pull/1584

Differential Revision: D4242517

Pulled By: siying

fbshipit-source-id: 86b0f18
2017-01-27 06:39:12 -08:00
Andrew Kryczka
94a0c32e73 Fix LRU Ref() for handles with external references only
Summary:
For case !handle->InCache() && handle->refs >= 1 (the third case mentioned in lru_cache.h), the key was overwritten by Insert(). In this case, the refcount can still be incremented, and the cache handle will never enter LRU list. Fix Ref() logic for this case.
Closes https://github.com/facebook/rocksdb/pull/1808

Differential Revision: D4467656

Pulled By: ajkr

fbshipit-source-id: c0784d8
2017-01-26 10:54:15 -08:00
Andrew Kryczka
17c1180603 Generalize Env registration framework
Summary:
The Env registration framework supports registering client Envs and selecting which one to instantiate according to a text field. This enabled things like adding the -env_uri argument to db_bench, so the same binary could be reused with different Envs just by changing CLI config.

Now this problem has come up again in a non-Env context, as I want to instantiate a client Statistics implementation from db_bench, which is configured entirely via text parameters. Also, in the future we may wish to use it for deserializing client objects when loading OPTIONS file.

This diff generalizes the Env registration logic to work with arbitrary types.

- Generalized registration and instantiation code by templating them
- The entire implementation is in a header file as that's Google style guide's recommendation for template definitions
- Pattern match with std::regex_match rather than checking prefix, which was the previous behavior
- Rename functions/files to be non-Env-specific
Closes https://github.com/facebook/rocksdb/pull/1776

Differential Revision: D4421933

Pulled By: ajkr

fbshipit-source-id: 34647d1
2017-01-25 16:09:14 -08:00
sdong
07dddd5f7e EnvPosixTestWithParam should wait for all threads to finish
Summary:
If we don't wait for the threads to finish after each run, the thread queue may not be empty while the next test starts to run, which can cause unexpected behaviors.

Also make some of the relaxed read/write more restrict.
Closes https://github.com/facebook/rocksdb/pull/1590

Reviewed By: AsyncDBConnMarkedDownDBException

Differential Revision: D4245922

Pulled By: AsyncDBConnMarkedDownDBException

fbshipit-source-id: f83b74b
2017-01-25 15:54:13 -08:00
Hyeonseok Oh
f2b4939da4 fixed typo
Summary:
I fixed exisit -> exist
Closes https://github.com/facebook/rocksdb/pull/1799

Differential Revision: D4451466

Pulled By: yiwu-arbug

fbshipit-source-id: b447c3a
2017-01-23 12:54:13 -08:00
Siying Dong
0e8dfd6062 Fix OptimizeForPointLookup()
Summary:
If users directly call OptimizeForPointLookup(), it is broken as the option isn't compatible with parallel memtable insert. Fix it by using memtable bloomo filter instead.
Closes https://github.com/facebook/rocksdb/pull/1791

Differential Revision: D4442836

Pulled By: siying

fbshipit-source-id: bf6c9cd
2017-01-20 10:54:12 -08:00
Yi Wu
9239103cd4 Flush job should release reference current version if sync log failed
Summary:
Fix the bug when sync log fail, FlushJob::Run() will not be execute and
reference to cfd->current() will not be release.
Closes https://github.com/facebook/rocksdb/pull/1792

Differential Revision: D4441316

Pulled By: yiwu-arbug

fbshipit-source-id: 5523e28
2017-01-19 23:09:15 -08:00
Yi Wu
602c13a964 Remove fadvise with direct IO read
Summary:
Remove the logic since we don't use buffer cache with direct IO. Resolve
read regression we currently have.
Closes https://github.com/facebook/rocksdb/pull/1782

Differential Revision: D4430408

Pulled By: yiwu-arbug

fbshipit-source-id: 5557bba
2017-01-18 12:09:10 -08:00
Kefu Chai
e8a096000b util/thread_local.h: silence a clang-build warning
Summary:
otherwise clang complains with

/home/jenkins/workspace/ceph-master/src/rocksdb/util/thread_local.h:205:5:
error: macro expansion producing 'defined' has undefined behavior
[-Werror,-Wexpansion-to-defined]
^
/home/jenkins/workspace/ceph-master/src/rocksdb/util/thread_local.h:22:4:
note: expanded from macro 'ROCKSDB_SUPPORT_THREAD_LOCAL'
!defined(OS_WIN) && !defined(OS_MACOSX) && !defined(IOS_CROSS_COMPILE)
^`

Signed-off-by: Kefu Chai <tchaikov@gmail.com>
Closes https://github.com/facebook/rocksdb/pull/1757

Differential Revision: D4394140

Pulled By: siying

fbshipit-source-id: f0beda0
2017-01-15 13:24:16 -08:00
Aaron Gao
3e6899d116 change UseDirectIO() to use_direct_io()
Summary:
also change variable name `direct_io_` to `use_direct_io_` in WritableFile to make it consistent with read path.
Closes https://github.com/facebook/rocksdb/pull/1770

Differential Revision: D4416435

Pulled By: lightmark

fbshipit-source-id: 4143c53
2017-01-13 12:09:15 -08:00
Aaron Gao
d4e07a8459 fix warning of unused direct io helper functions
Summary:
add build guard
Closes https://github.com/facebook/rocksdb/pull/1771

Differential Revision: D4410779

Pulled By: siying

fbshipit-source-id: 3796c30
2017-01-12 12:39:14 -08:00
Aaron Gao
dc2584eea0 direct reads refactor
Summary:
direct IO reads refactoring
remove unnecessary classes and unified interfaces
tested with db_bench

need more change for options and ON/OFF for different files.
Since disabled is default, it should be fine now
Closes https://github.com/facebook/rocksdb/pull/1636

Differential Revision: D4307189

Pulled By: lightmark

fbshipit-source-id: 6991e22
2017-01-11 16:54:12 -08:00
Anirban Rahut
62384ebe9c Guarding extra fallocate call with TRAVIS because its not working pro…
Summary:
…perly on travis

 There is some old code in PosixWritableFile::Close(), which
truncates the file to the measured size and then does an extra fallocate
with KEEP_SIZE. This is commented as a failsafe because in some
cases ftruncate doesn't do the right job (I don't know of an instance of
this btw). However doing an fallocate with KEEP_SIZE should not increase
the file size. However on Travis Worker which is Docker (likely AUFS )
its not working. There are comments on web that show that the AUFS
author had initially not implemented fallocate, and then did it later.
So not sure what is the quality of the implementation.
Closes https://github.com/facebook/rocksdb/pull/1765

Differential Revision: D4401340

Pulled By: anirbanr-fb

fbshipit-source-id: e2d8100
2017-01-11 14:24:13 -08:00
Andrew Kryczka
fe395fb63d Allow incrementing refcount on cache handles
Summary:
Previously the only way to increment a handle's refcount was to invoke Lookup(), which (1) did hash table lookup to get cache handle, (2) incremented that handle's refcount. For a future DeleteRange optimization, I added a function, Ref(), for when the caller already has a cache handle and only needs to do (2).
Closes https://github.com/facebook/rocksdb/pull/1761

Differential Revision: D4397114

Pulled By: ajkr

fbshipit-source-id: 9addbe5
2017-01-10 16:54:20 -08:00
Dmitri Smirnov
3c233ca4ea Fix Windows environment issues
Summary:
Enable directIO on WritableFileImpl::Append
     with offset being current length of the file.
     Enable UniqueID tests on Windows, disable others but
     leeting them to compile. Unique tests are valuable to
     detect failures on different filesystems and upcoming
     ReFS.
     Clear output in WinEnv Getchildren.This is different from
     previous strategy, do not touch output on failure.
     Make sure DBTest.OpenWhenOpen works with windows error message
Closes https://github.com/facebook/rocksdb/pull/1746

Differential Revision: D4385681

Pulled By: IslamAbdelRahman

fbshipit-source-id: c07b702
2017-01-09 15:54:12 -08:00
Maysam Yabandeh
d0ba8ec8f9 Revert "PinnableSlice"
Summary:
This reverts commit 54d94e9c2c.

The pull request was landed by mistake.
Closes https://github.com/facebook/rocksdb/pull/1755

Differential Revision: D4391678

Pulled By: maysamyabandeh

fbshipit-source-id: 36d5149
2017-01-08 14:24:12 -08:00
Maysam Yabandeh
54d94e9c2c PinnableSlice
Summary:
Currently the point lookup values are copied to a string provided by the user.
This incures an extra memcpy cost. This patch allows doing point lookup
via a PinnableSlice which pins the source memory location (instead of
copying their content) and releases them after the content is consumed
by the user. The old API of Get(string) is translated to the new API
underneath.

 Here is the summary for improvements:
 1. value 100 byte: 1.8%  regular, 1.2% merge values
 2. value 1k   byte: 11.5% regular, 7.5% merge values
 3. value 10k byte: 26% regular,    29.9% merge values

 The improvement for merge could be more if we extend this approach to
 pin the merge output and delay the full merge operation until the user
 actually needs it. We have put that for future work.

PS:
Sometimes we observe a small decrease in performance when switching from
t5452014 to this patch but with the old Get(string) API. The difference
is a little and could be noise. More importantly it is safely
cancelled
Closes https://github.com/facebook/rocksdb/pull/1732

Differential Revision: D4374613

Pulled By: maysamyabandeh

fbshipit-source-id: a077f1a
2017-01-08 13:54:13 -08:00
Islam AbdelRahman
ac73d7558b Add GetSupportedCompressions() convenience function
Summary:
This function will return a list of supported compression types in RocksDB
This is needed for MyRocks https://github.com/facebook/mysql-5.6/pull/446
Closes https://github.com/facebook/rocksdb/pull/1747

Differential Revision: D4385921

Pulled By: IslamAbdelRahman

fbshipit-source-id: 2f5b59f
2017-01-06 11:24:14 -08:00
Adam Retter
85ac1a320a Fix rocksdb::Status::getState
Summary:
This fixes the Java API for Status#getState use in Native code and also simplifies the implementation of rocksdb::Status::getState.
Closes https://github.com/facebook/rocksdb/issues/1688
Closes https://github.com/facebook/rocksdb/pull/1714

Differential Revision: D4364181

Pulled By: yiwu-arbug

fbshipit-source-id: 8e073b4
2017-01-03 18:39:14 -08:00
Siying Dong
17a4b75cc3 Always fsync the file after file copying
Summary:
File copying happens when creating checkpoints and bulkloading files from different FS partition. We should fsync the files when copying them to guarantee durability. A side effect will be that the dirty pages in file system buffers won't grow too large.
Closes https://github.com/facebook/rocksdb/pull/1728

Differential Revision: D4371083

Pulled By: siying

fbshipit-source-id: 579e14c
2016-12-28 19:09:16 -08:00
Yi Wu
ab48c165a9 Print cache options to info log
Summary:
Improve cache options logging to info log.
Also print the value of
cache_index_and_filter_blocks_with_high_priority.
Closes https://github.com/facebook/rocksdb/pull/1709

Differential Revision: D4358776

Pulled By: yiwu-arbug

fbshipit-source-id: 8f030a0
2016-12-22 14:54:19 -08:00
Aaron Gao
972f96b3fb direct io write support
Summary:
rocksdb direct io support

```
[gzh@dev11575.prn2 ~/rocksdb] ./db_bench -benchmarks=fillseq --num=1000000
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB:    version 5.0
Date:       Wed Nov 23 13:17:43 2016
CPU:        40 * Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
CPUCache:   25600 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Write rate: 0 bytes/second
Compression: Snappy
Memtablerep: skip_list
Perf Level: 1
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
DB path: [/tmp/rocksdbtest-112628/dbbench]
fillseq      :       4.393 micros/op 227639 ops/sec;   25.2 MB/s

[gzh@dev11575.prn2 ~/roc
Closes https://github.com/facebook/rocksdb/pull/1564

Differential Revision: D4241093

Pulled By: lightmark

fbshipit-source-id: 98c29e3
2016-12-22 13:09:19 -08:00
Islam AbdelRahman
989e644ed8 Remove sst_file_manager option from LITE
Summary:
Remove sst_file_manager option from LITE
Closes https://github.com/facebook/rocksdb/pull/1690

Differential Revision: D4341331

Pulled By: IslamAbdelRahman

fbshipit-source-id: 9f9328d
2016-12-21 17:54:21 -08:00
Jianpeng Ma
bd6cf7b51d WritableFileWriter: default buffer size equal min(64k,options.writabl?
Summary:
?e_file_max_buffer_size)

If we overwrite WritableFile and has a buffer which has the same
function of buf_. We hope remove the cache function of
WritableFileWriter. So using options.writable_file_max_buffer_size = 0
to disable cache function.

Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
Closes https://github.com/facebook/rocksdb/pull/1628

Differential Revision: D4307219

Pulled By: yiwu-arbug

fbshipit-source-id: 77a6e26
2016-12-16 13:09:14 -08:00
Daniel Black
816c1e30ca gcc-7 requires include <functional> for std::function
Summary:
Fixes compile error:

In file included from ./util/statistics.h:17:0,
                 from ./util/stop_watch.h:8,
                 from ./util/perf_step_timer.h:9,
                 from ./util/iostats_context_imp.h:8,
                 from ./util/posix_logger.h:27,
                 from ./port/util_logger.h:18,
                 from ./db/auto_roll_logger.h:15,
                 from db/auto_roll_logger.cc:6:
./util/thread_local.h:65:16: error: 'function' in namespace 'std' does not name a template type
   typedef std::function<void(void*, void*)> FoldFunc;
Closes https://github.com/facebook/rocksdb/pull/1656

Differential Revision: D4318702

Pulled By: yiwu-arbug

fbshipit-source-id: 8c5d17a
2016-12-16 11:24:18 -08:00
Daniel Black
0ab6fc167f Gcc-7 buffer size insufficient
Summary:
Bunch of commits related to insufficient buffer size. Errors in individual commits.
Closes https://github.com/facebook/rocksdb/pull/1673

Differential Revision: D4332127

Pulled By: IslamAbdelRahman

fbshipit-source-id: 878f73c
2016-12-14 19:24:26 -08:00
Daniel Black
b7239bf7e0 Gcc 7 fallthrough
Summary:
hopefully the last of the gcc-7 compile errors
Closes https://github.com/facebook/rocksdb/pull/1675

Differential Revision: D4332106

Pulled By: IslamAbdelRahman

fbshipit-source-id: 139448c
2016-12-14 19:24:25 -08:00
Daniel Black
477b6ea578 std::remove_if requires <algorithm>
Summary:
fixes error (that occurred on gcc-7):

error:

util/env_basic_test.cc: In member function 'virtual rocksdb::Status rocksdb::NormalizingEnvWrapper::GetChildren(const string&, std::vector<std::__cxx11::basic_string<char> >*)':
util/env_basic_test.cc:27:21: error: 'remove_if' is not a member of 'std'
       result->erase(std::remove_if(result->begin(), result->end(),
                     ^~~
Closes https://github.com/facebook/rocksdb/pull/1674

Differential Revision: D4331221

Pulled By: ajkr

fbshipit-source-id: 9bbdc78
2016-12-14 17:09:14 -08:00
Daniel Black
e097222e64 util/logging.cc: buffer of insufficient size (gcc-7 -Werror=format-length)
Summary:
util/logging.cc💯13: error: output may be truncated before the last format character [-Werror=format-length=]
 std::string NumberToHumanString(int64_t num) {
             ^~~~~~~~~~~~~~~~~~~
util/logging.cc:106:59: note: format output between 3 and 19 bytes into a destination of size 16
     snprintf(buf, sizeof(buf), "%" PRIi64 "K", num / 1000);
Closes https://github.com/facebook/rocksdb/pull/1653

Differential Revision: D4318687

Pulled By: yiwu-arbug

fbshipit-source-id: 3a5c931
2016-12-13 18:39:14 -08:00
Daniel Black
bfbcec2339 Gcc 7 error expansion to defined
Summary:
sorry if these gcc-7/clang-4 cleanups are getting tedious.
Closes https://github.com/facebook/rocksdb/pull/1658

Differential Revision: D4318792

Pulled By: yiwu-arbug

fbshipit-source-id: 8e85891
2016-12-13 18:39:14 -08:00
Daniel Black
c3e5ee7154 util/histogram.cc: HistogramStat::toString buffer insufficient
Summary:
Increased buffer size to 1650.

util/histogram.cc: In member function 'std::__cxx11::string rocksdb::HistogramStat::ToString() const':
util/histogram.cc:189:13: error: '%.2f' directive output truncated writing between 4 and 313 bytes into a region of size 0 [-Werror=format-length=]
 std::string HistogramStat::ToString() const {
             ^~~~~~~~~~~~~
util/histogram.cc:205:30: note: format output between 69 and 1614 bytes into a destination of size 200
            Percentile(99.99));
                              ^
cc1plus: all warnings being treated as errors
Makefile:1521: recipe for target 'util/histogram.o' failed
Closes https://github.com/facebook/rocksdb/pull/1660

Differential Revision: D4318820

Pulled By: yiwu-arbug

fbshipit-source-id: 45ae6ea
2016-12-13 14:09:12 -08:00
Andrew Kryczka
f0c509e2c8 Return finer-granularity status from Env::GetChildren*
Summary:
It'd be nice to use the error status type to distinguish
between user error and system error. For example, GetChildren can fail
listing a backup directory's contents either because a bad path was provided
(user error) or because an operation failed, e.g., a remote storage service
call failed (system error). In the former case, we want to continue and treat
the backup directory as empty; in the latter case, we want to immediately
propagate the error to the caller.

This diff uses NotFound to indicate user error and IOError to indicate
system error. Previously IOError indicated both.
Closes https://github.com/facebook/rocksdb/pull/1644

Differential Revision: D4312157

Pulled By: ajkr

fbshipit-source-id: 51b4f24
2016-12-12 12:54:13 -08:00
Manuel Ung
2005c88a75 Implement non-exclusive locks
Summary:
This is an implementation of non-exclusive locks for pessimistic transactions. It is relatively simple and does not prevent starvation (ie. it's possible that request for exclusive access will never be granted if there are always threads holding shared access). It is done by changing `KeyLockInfo` to hold an set a transaction ids, instead of just one, and adding a flag specifying whether this lock is currently held with exclusive access or not.

Some implementation notes:
- Some lock diagnostic functions had to be updated to return a set of transaction ids for a given lock, eg. `GetWaitingTxn` and `GetLockStatusData`.
- Deadlock detection is a bit more complicated since a transaction can now wait on multiple other transactions. A BFS is done in this case, and deadlock detection depth is now just a limit on the number of transactions we visit.
- Expirable transactions do not work efficiently with shared locks at the moment, but that's okay for now.
Closes https://github.com/facebook/rocksdb/pull/1573

Differential Revision: D4239097

Pulled By: lth

fbshipit-source-id: da7c074
2016-12-05 17:39:17 -08:00
Anton Safonov
9053fe2a5c Made delete_obsolete_files_period_micros option dynamic
Summary:
Made delete_obsolete_files_period_micros option dynamic. It can be updating using DB::SetDBOptions().
Closes https://github.com/facebook/rocksdb/pull/1595

Differential Revision: D4246569

Pulled By: tonek

fbshipit-source-id: d23f560
2016-12-05 14:24:16 -08:00
Islam AbdelRahman
4a21b1402c Cache heap::downheap() root comparison (optimize heap cmp call)
Summary:
Reduce number of comparisons in heap by caching which child node in the first level is smallest (left_child or right_child)
So next time we can compare directly against the smallest child

I see that the total number of calls to comparator drops significantly when using this optimization

Before caching (~2mil key comparison for iterating the DB)
```
$ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq" --db="/dev/shm/heap_opt" --use_existing_db --disable_auto_compactions --cache_size=1000000000  --perf_level=2
readseq      :       0.338 micros/op 2959201 ops/sec;  327.4 MB/s user_key_comparison_count = 2000008
```
After caching (~1mil key comparison for iterating the DB)
```
$ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq" --db="/dev/shm/heap_opt" --use_existing_db --disable_auto_compactions --cache_size=1000000000 --perf_level=2
readseq      :       0.309 micros/op 3236801 ops/sec;  358.1 MB/s user_key_comparison_count = 1000011
```

It also improves
Closes https://github.com/facebook/rocksdb/pull/1600

Differential Revision: D4256027

Pulled By: IslamAbdelRahman

fbshipit-source-id: 76fcc66
2016-12-01 13:39:14 -08:00
Islam AbdelRahman
e39d080871 Fix travis (compile for clang < 3.9)
Summary:
Travis fail because it uses clang 3.6 which don't recognize
`__attribute__((__no_sanitize__("undefined")))`
Closes https://github.com/facebook/rocksdb/pull/1601

Differential Revision: D4257175

Pulled By: IslamAbdelRahman

fbshipit-source-id: fb4d1ab
2016-12-01 10:09:22 -08:00
Islam AbdelRahman
52fd1ff2c2 disable UBSAN for functions with intentional -ve shift / overflow
Summary:
disable UBSAN for functions with intentional left shift on -ve number / overflow

These functions are
rocksdb:: Hash
FixedLengthColBufEncoder::Append
FaultInjectionTest:: Key
Closes https://github.com/facebook/rocksdb/pull/1577

Differential Revision: D4240801

Pulled By: IslamAbdelRahman

fbshipit-source-id: 3e1caf6
2016-11-28 17:54:12 -08:00
Islam AbdelRahman
63c30de80d fix options_test ubsan
Summary:
Having -ve value for max_write_buffer_number does not make sense and cause us to do a left shift on a -ve value number
Closes https://github.com/facebook/rocksdb/pull/1579

Differential Revision: D4240798

Pulled By: IslamAbdelRahman

fbshipit-source-id: bd6267e
2016-11-28 16:39:14 -08:00
Mike Kolupaev
236d4c67e9 Less linear search in DBIter::Seek() when keys are overwritten a lot
Summary:
In one deployment we saw high latencies (presumably from slow iterator operations) and a lot of CPU time reported by perf with this stack:

```
  rocksdb::MergingIterator::Next
  rocksdb::DBIter::FindNextUserEntryInternal
  rocksdb::DBIter::Seek
```

I think what's happening is:
1. we create a snapshot iterator,
2. we do lots of Put()s for the same key x; this creates lots of entries in memtable,
3. we seek the iterator to a key slightly smaller than x,
4. the seek walks over lots of entries in memtable for key x, skipping them because of high sequence numbers.

CC IslamAbdelRahman
Closes https://github.com/facebook/rocksdb/pull/1413

Differential Revision: D4083879

Pulled By: IslamAbdelRahman

fbshipit-source-id: a83ddae
2016-11-28 10:24:11 -08:00
Siying Dong
cd7c4143d7 Improve Write Stalling System
Summary:
Current write stalling system has the problem of lacking of positive feedback if the restricted rate is already too low. Users sometimes stack in very low slowdown value. With the diff, we add a positive feedback (increasing the slowdown value) if we recover from slowdown state back to normal. To avoid the positive feedback to keep the slowdown value to be to high, we add issue a negative feedback every time we are close to the stop condition. Experiments show it is easier to reach a relative balance than before.

Also increase level0_stop_writes_trigger default from 24 to 32. Since level0_slowdown_writes_trigger default is 20, stop trigger 24 only gives four files as the buffer time to slowdown writes. In order to avoid stop in four files while 20 files have been accumulated, the slowdown value must be very low, which is amost the same as stop. It also doesn't give enough time for the slowdown value to converge. Increase it to 32 will smooth out the system.
Closes https://github.com/facebook/rocksdb/pull/1562

Differential Revision: D4218519

Pulled By: siying

fbshipit-source-id: 95e4088
2016-11-23 09:24:15 -08:00
Nick Terrell
4444256ab7 Remove use of deprecated LZ4 function
Summary:
LZ4 1.7.3 emits warnings when calling the deprecated function `LZ4_compress_limitedOutput_continue()`.  Starting in r129, LZ4 introduces `LZ4_compress_fast_continue()` as a replacement, and the two functions calls are [exactly equivalent](https://github.com/lz4/lz4/blob/dev/lib/lz4.c#L1408).
Closes https://github.com/facebook/rocksdb/pull/1532

Differential Revision: D4199240

Pulled By: siying

fbshipit-source-id: 138c2bc
2016-11-21 12:24:14 -08:00
Changli Gao
548d7fb261 Fix fd leak when using direct IOs
Summary:
We should close the fd, before overriding it. This bug was
introduced by f89caa127b
Closes https://github.com/facebook/rocksdb/pull/1553

Differential Revision: D4214101

Pulled By: siying

fbshipit-source-id: 0d65de0
2016-11-21 12:24:13 -08:00
Andrew Kryczka
fd43ee09da Range deletion microoptimizations
Summary:
- Made RangeDelAggregator's InternalKeyComparator member a reference-to-const so we don't need to copy-construct it. Also added InternalKeyComparator to ImmutableCFOptions so we don't need to construct one for each DBIter.
- Made MemTable::NewRangeTombstoneIterator and the table readers' NewRangeTombstoneIterator() functions return nullptr instead of NewEmptyInternalIterator to avoid the allocation. Updated callers accordingly.
Closes https://github.com/facebook/rocksdb/pull/1548

Differential Revision: D4208169

Pulled By: ajkr

fbshipit-source-id: 2fd65cf
2016-11-21 12:24:13 -08:00
Changli Gao
a0deec960f Fix deadlock when calling getMergedHistogram
Summary:
When calling StatisticsImpl::HistogramInfo::getMergedHistogram(), if
there is a dying thread, which is calling
ThreadLocalPtr::StaticMeta::OnThreadExit() to merge its thread values to
HistogramInfo, deadlock will occur. Because the former try to hold
merge_lock then ThreadMeta::mutex_, but the later try to hold
ThreadMeta::mutex_ then merge_lock. In short, the locking order isn't
the same.

This patch addressed this issue by releasing merge_lock before folding
thread values.
Closes https://github.com/facebook/rocksdb/pull/1552

Differential Revision: D4211942

Pulled By: ajkr

fbshipit-source-id: ef89bcb
2016-11-20 18:24:12 -08:00
Manuel Ung
e63350e726 Use more efficient hash map for deadlock detection
Summary:
Currently, deadlock cycles are held in std::unordered_map. The problem with it is that it allocates/deallocates memory on every insertion/deletion. This limits throughput since we're doing this expensive operation while holding a global mutex. Fix this by using a vector which caches memory instead.

Running the deadlock stress test, this change increased throughput from 39k txns/s -> 49k txns/s. The effect is more noticeable in MyRocks.
Closes https://github.com/facebook/rocksdb/pull/1545

Differential Revision: D4205662

Pulled By: lth

fbshipit-source-id: ff990e4
2016-11-19 11:39:15 -08:00
Siying Dong
73843aa636 Direct I/O Reads Handle the last sector correctly.
Summary:
Currently, in the Direct I/O read mode, the last sector of the file, if not full, is not handled correctly. If the return value of pread is not multiplier of kSectorSize, we still go ahead and continue reading, even if the buffer is not aligned. With the commit, if the return value is not multiplier of kSectorSize, and all but the last sector has been read, we simply return.
Closes https://github.com/facebook/rocksdb/pull/1550

Differential Revision: D4209609

Pulled By: lightmark

fbshipit-source-id: cb0b439
2016-11-18 19:24:13 -08:00
Maysam Yabandeh
9d60151b04 Implement PositionedAppend for PosixWritableFile
Summary:
This patch clarifies the contract of PositionedAppend with some unit
tests and also implements it for PosixWritableFile. (Tasks: 14524071)
Closes https://github.com/facebook/rocksdb/pull/1514

Differential Revision: D4204907

Pulled By: maysamyabandeh

fbshipit-source-id: 06eabd2
2016-11-18 17:24:13 -08:00
Siying Dong
972e3ff295 Enable allow_concurrent_memtable_write and enable_write_thread_adaptive_yield by default
Summary: Closes https://github.com/facebook/rocksdb/pull/1496

Differential Revision: D4168080

Pulled By: siying

fbshipit-source-id: 056ae62
2016-11-16 09:39:09 -08:00
Yi Wu
1543d5d92e Report memory usage by memtable insert hints map.
Summary:
It is hard to measure acutal memory usage by std containers. Even
providing a custom allocator will miss count some of the usage. Here we
only do a wild guess on its memory usage.
Closes https://github.com/facebook/rocksdb/pull/1511

Differential Revision: D4179945

Pulled By: yiwu-arbug

fbshipit-source-id: 32ab929
2016-11-15 20:24:13 -08:00
Artemiy Kolesnikov
91300d01f6 Dynamic max_total_wal_size option
Summary: Closes https://github.com/facebook/rocksdb/pull/1509

Differential Revision: D4176426

Pulled By: yiwu-arbug

fbshipit-source-id: b57689d
2016-11-14 22:54:17 -08:00
Yi Wu
1ea79a78c9 Optimize sequential insert into memtable - Part 1: Interface
Summary:
Currently our skip-list have an optimization to speedup sequential
inserts from a single stream, by remembering the last insert position.
We extend the idea to support sequential inserts from multiple streams,
and even tolerate small reordering wihtin each stream.

This PR is the interface part adding the following:
- Add `memtable_insert_prefix_extractor` to allow specifying prefix for each key.
- Add `InsertWithHint()` interface to memtable, to allow underlying
  implementation to return a hint of insert position, which can be later
  pass back to optimize inserts.
- Memtable will maintain a map from prefix to hints and pass the hint
  via `InsertWithHint()` if `memtable_insert_prefix_extractor` is non-null.
Closes https://github.com/facebook/rocksdb/pull/1419

Differential Revision: D4079367

Pulled By: yiwu-arbug

fbshipit-source-id: 3555326
2016-11-13 19:09:18 -08:00
Lijun Tang
adb665e0bf Allowed delayed_write_rate option to be dynamically set.
Summary: Closes https://github.com/facebook/rocksdb/pull/1488

Differential Revision: D4157784

Pulled By: siying

fbshipit-source-id: f150081
2016-11-12 15:54:11 -08:00
Andrew Kryczka
4e20c5da20 Store internal keys in TombstoneMap
Summary:
This fixes a correctness issue where ranges with same begin key would overwrite each other.

This diff uses InternalKey as TombstoneMap's key such that all tombstones have unique keys even when their start keys overlap. We also update TombstoneMap to use an internal key comparator.

End-to-end tests pass and are here (https://gist.github.com/ajkr/851ffe4c1b8a15a68d33025be190a7d9) but cannot be included yet since the DeleteRange() API is yet to be checked in. Note both tests failed before this fix.
Closes https://github.com/facebook/rocksdb/pull/1484

Differential Revision: D4155248

Pulled By: ajkr

fbshipit-source-id: 304b4b9
2016-11-09 15:09:18 -08:00
Andrew Kryczka
f998c9790f DeleteRange Get support
Summary:
During Get()/MultiGet(), build up a RangeDelAggregator with range
tombstones as we search through live memtable, immutable memtables, and
SST files. This aggregator is then used by memtable.cc's SaveValue() and
GetContext::SaveValue() to check whether keys are covered.

added tests for Get on memtables/files; end-to-end tests mainly in https://reviews.facebook.net/D64761
Closes https://github.com/facebook/rocksdb/pull/1456

Differential Revision: D4111271

Pulled By: ajkr

fbshipit-source-id: 6e388d4
2016-11-03 18:54:20 -07:00
Yi Wu
437942e481 Add avoid_flush_during_shutdown DB option
Summary:
Add avoid_flush_during_shutdown DB option.
Closes https://github.com/facebook/rocksdb/pull/1451

Differential Revision: D4108643

Pulled By: yiwu-arbug

fbshipit-source-id: abdaf4d
2016-11-02 15:39:18 -07:00
Benoit Girard
2b16d664cb Change max_bytes_for_level_multiplier to double
Summary: Closes https://github.com/facebook/rocksdb/pull/1427

Differential Revision: D4094732

Pulled By: yiwu-arbug

fbshipit-source-id: b9b79e9
2016-11-01 21:09:23 -07:00
Siying Dong
c90c48d3c8 Show More DB Stats in info logs
Summary:
DB Stats now are truncated if there are too many CFs. Extend the buffer size to allow more to be printed out. Also, separate out malloc to another log line.
Closes https://github.com/facebook/rocksdb/pull/1439

Differential Revision: D4100943

Pulled By: yiwu-arbug

fbshipit-source-id: 79f7218
2016-10-29 16:09:18 -07:00
Nipunn Koorapati
25f5742f0b Update documentation to point at gcc 4.8
Summary:
Rocksdb currently has many references to std::map.emplace_back()
which is not implemented in gcc 4.7, but valid in gcc 4.8. Confirmed that
it did not build with gcc 4.7, but builds fine with gcc 4.8
Closes https://github.com/facebook/rocksdb/pull/1272

Differential Revision: D4101385

Pulled By: IslamAbdelRahman

fbshipit-source-id: f6af453
2016-10-29 12:09:17 -07:00
Willem Jan Withagen
0aab5e55f0 FreeBSD: malloc_usable_size is in <malloc_np.h> (#1428)
Signed-off-by: Willem Jan Withagen <wjw@digiware.nl>
2016-10-28 10:44:52 -07:00
Kien-hung Li
eeb27e1bbd Add handy option to turn on direct I/O in db_bench (#1424) 2016-10-28 10:36:05 -07:00
Kefu Chai
60a2bbba94 Makefile: generate util/build_version.cc from .in file (#1384)
* util/build_verion.cc.in: add this file, so cmake and make can share the
  template file for generating util/build_version.cc.
* CMakeLists.txt: also, cmake v2.8.11 does not support file(GENERATE ...),
  so we are using configure_file() for creating build_version.cc.
* Makefile: use util/build_verion.cc.in for creating build_version.cc.

Signed-off-by: Kefu Chai <tchaikov@gmail.com>
2016-10-25 11:31:39 -07:00
sdong
24495186da DBSSTTest.RateLimitedDelete: not to use real clock
Summary: Using real clock causes failures of DBSSTTest.RateLimitedDelete in some cases. Turn away from the real time. Use fake time instead.

Test Plan: Run the tests and all existing tests.

Reviewers: yiwu, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D65145
2016-10-24 10:35:00 -07:00
Aaron Gao
59a7c0337b Change ioptions to store user_comparator, fix bug
Summary:
change ioptions.comparator to user_comparator instread of internal_comparator.
Also change Comparator* to InternalKeyComparator* to make its type explicitly.

Test Plan: make all check -j64

Reviewers: andrewkr, sdong, yiwu

Reviewed By: yiwu

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D65121
2016-10-21 11:31:42 -07:00
Islam AbdelRahman
b88f8e87c5 Support SST files with Global sequence numbers [reland]
Summary:
reland https://reviews.facebook.net/D62523

- Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno`
- Update TableProperties to be aware of the offset of each property in the file
- Update BlockBasedTableReader and Block to be able to honor the sequence number in `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence number in the file

Something worth mentioning is that we don't update the seqno in the index block since and when doing a binary search, the reason for that is that it's guaranteed that SST files with global seqno will have only one user_key and each key will have seqno=0 encoded in it, This mean that this key is greater than any other key with seqno> 0. That mean that we can actually keep the current logic for these blocks

Test Plan: unit tests

Reviewers: sdong, yhchiang

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D65211
2016-10-18 16:59:37 -07:00
Dmitri Smirnov
b1031d6c12 Remove function local statics that interfere with memory pooling (#1392) 2016-10-14 13:09:18 -07:00
Yi Wu
e29d3b67c2 Make max_background_compactions and base_background_compactions dynamic changeable
Summary:
Add DB::SetDBOptions to dynamic change max_background_compactions and base_background_compactions.
I'll add more dynamic changeable options soon.

Test Plan: unit test.

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64749
2016-10-14 12:25:39 -07:00
Yueh-Hsuan Chiang
8cbe3e10ca Relax the acceptable bias RateLimiterTest::Rate test be 25%
Summary:
In the current implementation of RateLimiter, the difference
between the configured rate and the actual rate might be more
than 20%, while our test only allows 15% difference.  This diff
relaxes the acceptable bias RateLimiterTest::Rate test be 25%
to make the test less flaky.

Test Plan: rate_limiter_test

Reviewers: IslamAbdelRahman, andrewkr, yiwu, lightmark, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64941
2016-10-13 14:26:12 -07:00
Peter (Stig) Edwards
f8d8cf53fe Fix log_write_bench -bytes_per_sync option. (#1375)
Hello and thanks for RocksDB,
 
When log_write_bench is run with the -bytes_per_sync option, the option does not influence any *sync* behaviour.
 
> strace -e trace=write,sync_file_range ./log_write_bench -record_interval 0 -record_size 1048576 -num_records 11 -bytes_per_sync 2097152 2>&1 | egrep '^(sync|write.*XXXX)'
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
 
I suspect that this is because the bytes_per_sync option now needs to be using a `WritableFileWriter` and not a `WritableFile`.
 
With the diff below applied, it changes to:
 
> strace -e trace=write,sync_file_range ./log_write_bench -record_interval 0 -record_size 1048576 -num_records 11 -bytes_per_sync 2097152 2>&1 | egrep '^(sync|write.*XXXX)'
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
sync_file_range(0x3, 0, 0x200000, 0x2)  = 0
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
sync_file_range(0x3, 0x200000, 0x200000, 0x2) = 0
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
sync_file_range(0x3, 0x400000, 0x200000, 0x2) = 0
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
sync_file_range(0x3, 0x600000, 0x200000, 0x2) = 0
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 1048576) = 1048576
sync_file_range(0x3, 0x800000, 0x200000, 0x2) = 0
 
( Note that the first 1MB is not synced as mentioned in util/file_reader_writer.cc::WritableFileWriter::Flush() )
 
This diff also includes the fix from https://github.com/facebook/rocksdb/pull/1373
 
> diff -du util/log_write_bench.cc.orig util/log_write_bench.cc
--- util/log_write_bench.cc.orig        2016-10-04 12:06:29.115122580 -0400
+++ util/log_write_bench.cc     2016-10-05 07:24:09.677037576 -0400
@@ -14,6 +14,7 @@
 #include <gflags/gflags.h>

 #include "rocksdb/env.h"
+#include "util/file_reader_writer.h"
 #include "util/histogram.h"
 #include "util/testharness.h"
 #include "util/testutil.h"
@@ -38,19 +39,21 @@
   env_options.bytes_per_sync = FLAGS_bytes_per_sync;
   unique_ptr<WritableFile> file;
   env->NewWritableFile(file_name, &file, env_options);
+  unique_ptr<WritableFileWriter> writer;
+  writer.reset(new WritableFileWriter(std::move(file), env_options));

   std::string record;
-  record.assign('X', FLAGS_record_size);
+  record.assign(FLAGS_record_size, 'X');

   HistogramImpl hist;

   uint64_t start_time = env->NowMicros();
   for (int i = 0; i < FLAGS_num_records; i++) {
     uint64_t start_nanos = env->NowNanos();
-    file->Append(record);
-    file->Flush();
+    writer->Append(record);
+    writer->Flush();
     if (FLAGS_enable_sync) {
-      file->Sync();
+      writer->Sync(false);
     }
     hist.Add(env->NowNanos() - start_nanos);
2016-10-11 16:45:51 -07:00
Yi Wu
d6ae6dec69 Add Statistics::getAndResetTickerCount().
Summary: A convience method to atomically get and reset ticker count. I'm wanting to use it to have a thin wrapper to the statistics object to export ticker counts to ODS for LogDevice (since they don't even use fb303).

Test Plan:
test in LogDevice shadow cluster.
https://fburl.com/461868822

Reviewers: andrewkr, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64869
2016-10-11 10:54:11 -07:00
Islam AbdelRahman
2ad68b971a Support running consistency checks in release mode
Summary:
We always run consistency checks when compiling in debug mode
allow users to set Options::force_consistency_checks to true to be able to run such checks even when compiling in release mode

Test Plan:
make check -j64
make release

Reviewers: lightmark, sdong, yiwu

Reviewed By: yiwu

Subscribers: hermanlee4, andrewkr, yoshinorim, jkedgar, dhruba

Differential Revision: https://reviews.facebook.net/D64701
2016-10-07 17:21:45 -07:00
Islam AbdelRahman
67501cfc9a Fix -ve std::string::resize
Summary:
I saw this exception thrown because sometimes we may resize with -ve value
if we have empty max_bytes_for_level_multiplier_additional vector

Test Plan: run the tests

Reviewers: yiwu

Reviewed By: yiwu

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D64791
2016-10-07 17:16:13 -07:00
Islam AbdelRahman
d062328977 Revert "Support SST files with Global sequence numbers"
This reverts commit ab01da5437.
2016-10-07 14:05:12 -07:00
Peter (Stig) Edwards
043cb62d63 Fix record_size in log_write_bench, swap args to std::string::assign. (#1373)
Hello and thank you for RocksDB,
 
I noticed when using log_write_bench that writes were always 88 bytes:
 
> strace -e trace=write ./log_write_bench -num_records 2 2>&1 | head -n 2
write(3, "\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371"..., 88) = 88
write(3, "\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371\371"..., 88) = 88

> strace -e trace=write ./log_write_bench -record_size 4096 -num_records 2 2>&1 | head -n 2
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 88) = 88
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 88) = 88
 
I think this should be:

<<    record.assign('X', FLAGS_record_size);
>>    record.assign(FLAGS_record_size, 'X');

So fill and not buffer. Otherwise I always see writes of size 88 (the decimal value for chr "X").

string& assign (const char* s, size_t n);
buffer - Copies the first n characters from the array of characters pointed by s.

string& assign (size_t n, char c);
fill   - Replaces the current value by n consecutive copies of character c.

perl -le 'print ord "X"'
88
 
With the change:
 
> strace -e trace=write ./log_write_bench -record_size 4096 -num_records 2 2>&1 | head -n 2
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 4096) = 4096
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 4096) = 4096
 
> strace -e trace=write ./log_write_bench -num_records 2 2>&1 | head -n 2
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 249) = 249
write(3, "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 249) = 249

Thanks.

01c27be5fb
https://reviews.facebook.net/D16239
2016-10-06 10:45:31 -07:00
Islam AbdelRahman
9d6c961383 Fix Mac build 2016-10-03 18:25:10 -07:00
Islam AbdelRahman
ab01da5437 Support SST files with Global sequence numbers
Summary:
- Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno`
- Update TableProperties to be aware of the offset of each property in the file
- Update BlockBasedTableReader and Block to be able to honor the sequence number in `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence number in the file

Something worth mentioning is that we don't update the seqno in the index block since and when doing a binary search, the reason for that is that it's guaranteed that SST files with global seqno will have only one user_key and each key will have seqno=0 encoded in it, This mean that this key is greater than any other key with seqno> 0. That mean that we can actually keep the current logic for these blocks

Test Plan: unit tests

Reviewers: andrewkr, yhchiang, yiwu, sdong

Reviewed By: sdong

Subscribers: hcz, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D62523
2016-10-03 16:12:39 -07:00
Aaron Gao
f517d9dd09 Add SeekForPrev() to Iterator
Summary:
Add new Iterator API, `SeekForPrev`: find the last key that <= target key
support prefix_extractor
support prefix_same_as_start
support upper_bound
not supported in iterators without Prev()

Also add tests in db_iter_test and db_iterator_test

Pass all tests
Cheers!

Test Plan: make all check -j64

Reviewers: andrewkr, yiwu, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64149
2016-09-27 18:20:57 -07:00
Yi Wu
9ed928e7a9 Split DBOptions into ImmutableDBOptions and MutableDBOptions
Summary: Use ImmutableDBOptions/MutableDBOptions internally and DBOptions only for user-facing APIs. MutableDBOptions is barely a placeholder for now. I'll start to move options to MutableDBOptions in following diffs.

Test Plan:
  make all check

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64065
2016-09-23 16:34:04 -07:00
Islam AbdelRahman
5735b3dc2a Fix compiling under -Werror=missing-field-initializers
Summary:
MyRocks build is broken because they are using "-Werror=missing-field-initializers"
We should fix that by explicitly passing these arguments

Test Plan: Build MyRocks

Reviewers: sdong, yiwu

Reviewed By: yiwu

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D64161
2016-09-20 13:02:41 -07:00
Aaron Gao
654ed9a280 loose the assertion condition of rate_limiter_test
Summary: 0.9 can make the test flaky since just found one test fail with 0.88

Test Plan: make all check

Reviewers: sdong, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D63939
2016-09-20 12:28:59 -07:00
sdong
3edb9461b7 Avoid hard-coded sleep in EnvPosixTestWithParam.TwoPools
Summary: EnvPosixTestWithParam.TwoPools relies on explicit sleeping, so it sometimes fail. Fix it.

Test Plan: Run tests with high parallelism many times and make sure the test passes.

Reviewers: yiwu, andrewkr

Reviewed By: andrewkr

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D63417
2016-09-16 17:45:12 -07:00
Yi Wu
17f76fc564 DB::GetOptions() reflect dynamic changed options
Summary: DB::GetOptions() reflect dynamic changed options.

Test Plan: See the new unit test.

Reviewers: yhchiang, sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D63903
2016-09-14 22:10:28 -07:00
Yi Wu
8e061f9740 Refactor GetMutableOptionsFromStrings
Summary: Add mutable options info into `OptionsTypeInfo` and use it to parse mutable options map. Also support `max_bytes_for_level_multiplier_additional` in option file.

Test Plan: unit test

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D63843
2016-09-13 21:12:43 -07:00
Yi Wu
81747f1be6 Refactor MutableCFOptions
Summary:
* Change constructor of MutableCFOptions to depends only on ColumnFamilyOptions.
* Move `max_subcompactions`, `compaction_options_fifo` and `compaction_pri` to ImmutableCFOptions to make it clear that they are immutable.

Test Plan: existing unit tests.

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D63945
2016-09-13 21:11:59 -07:00
Islam AbdelRahman
ba65c816bb Support POSIX RandomRWFile
Summary:
Add Env::RandomRWFile in env.h and implement it for POSIX
RandomRWFile is a file that allow us to read from / write to random offsets in the file

I will implement it for other Envs later after finishing the whole task for AddFile()

Test Plan: unit tests

Reviewers: andrewkr, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D62433
2016-09-13 12:08:22 -07:00
Islam AbdelRahman
52ee07b021 Move AddFile() tests to external_sst_file_test.cc
Summary: Simply move the tests

Test Plan: make check -j64

Reviewers: andrewkr, lightmark, yiwu, yhchiang, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D62529
2016-09-07 15:41:54 -07:00
Edouard A
66a91e2607 Add NoSpace subcode to IOError (#1320)
Add a sub code to distinguish "out of space" errors from regular I/O errors
2016-09-07 12:37:45 -07:00
sdong
607628d349 Support ZSTD with finalized format
Summary:
ZSTD 1.0.0 is coming. We can finally add a support of ZSTD without worrying about compatibility.
Still keep ZSTDNotFinal for compatibility reason.

Test Plan: Run all tests. Run db_bench with ZSTD version with RocksDB built with ZSTD 1.0 and older.

Reviewers: andrewkr, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: cyan, igor, IslamAbdelRahman, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D63141
2016-09-06 12:22:16 -07:00
sdong
f7669b40ba Fix Windows Build
Summary: Fix two Windows build problems.

Test Plan: Build on Windows and run all Linux tests.

Reviewers: IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D63189
2016-09-02 17:10:28 -07:00
Yi Wu
a88677d2cf Remove ImmutableCFOptions from public API
Summary: There's no reference to ImmutableCFOptions elsewhere in /include/rocksdb. ImmutableCFOptions was introduced in this commit (5665e5e285) but later its reference in /include/rocksdb/table.h is removed.

Test Plan:
  make all check

Reviewers: IslamAbdelRahman, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: yhchiang, andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D63177
2016-09-02 14:16:31 -07:00
John Alexander
4fd08f4b8b Ensure Correct Behavior of StatsLevel kExceptDetailedTimers and kExceptTimeForMutex (#1308)
* Fix StatsLevel so that kExceptTimeForMutex leaves compression stats enabled and kExceptDetailedTimers disables mutex lock stats. Also change default stats level to kExceptDetailedTimers (disabling both compression and mutex timing).

* Changed order of StatsLevel enum to simplify logic for determining what stats to record.
2016-09-01 19:57:55 -07:00
sdong
32149059f9 Merge options source_compaction_factor, max_grandparent_overlap_bytes and expanded_compaction_factor into max_compaction_bytes
Summary: To reduce number of options, merge source_compaction_factor, max_grandparent_overlap_bytes and expanded_compaction_factor into max_compaction_bytes.

Test Plan: Add two new unit tests. Run all existing tests, including jtest.

Reviewers: yhchiang, igor, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59829
2016-09-01 14:33:24 -07:00
Aaron Gao
4590b53a4b add stats to Cache::LookUp()
Summary: basically for SimCache stats. I find most times it is hard to pass Statistics* to SimCache constructor.

Test Plan: make all check

Reviewers: andrewkr, sdong, yiwu

Reviewed By: yiwu

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D62193
2016-09-01 13:50:39 -07:00
Andrew Kryczka
1613fa9490 Thread-specific histogram statistics
Summary:
To reduce contention for atomics when HistogramStats are shared across
threads, this diff makes them thread-specific so updates are faster. This comes
at the expense of slower reads (much less frequent), which now require merging
all histograms. In this diff,

- Thread-specific HistogramImpl is created upon the thread's first measureTime()
- Thread-specific HistogramImpl are merged and deleted upon thread termination or ThreadLocalPtr destruction, whichever comes first
- getHistogramString() and histogramData() merge all histograms, both thread-specific and previously merged ones

Test Plan:
unit tests, ran db_bench and verified histograms look similar

before:

  $ TEST_TMPDIR=/dev/shm/ perf record -g ./db_bench --benchmarks=readwhilewriting --statistics --num=1000000 --use_existing_db --threads=64 --cache_size=250000000 --compression_type=lz4
  ...
  +    7.63%  db_bench     db_bench             [.] rocksdb::HistogramStat::Add

after:

  $ TEST_TMPDIR=/dev/shm/ perf record -g ./db_bench --benchmarks=readwhilewriting --statistics --num=1000000 --use_existing_db --threads=64 --cache_size=250000000 --compression_type=lz4
  ...
  +    0.98%  db_bench     db_bench             [.] rocksdb::HistogramStat::Add

Reviewers: sdong, MarkCallaghan, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D62649
2016-08-31 14:02:09 -07:00
Yi Wu
de47e2bd4d Fix ClockCache memory leak
Summary:
Fix ClockCache memory leak found by valgrind:
# Add destructor to cleanup cached values.
# Delete key with cache handle immediately after handle is recycled, and erase table entry immediately if duplicated cache entry is inserted.

Test Plan:
    make DISABLE_JEMALLOC=1 valgrind_check

Reviewers: IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D62973
2016-08-31 08:56:34 -07:00
Yi Wu
7541c7a79e Fix cache_test valgrind_check failure
Summary: Refactor cache_test to get around gtest valgrind failure.

Test Plan:
    make valgrind_check

Reviewers: sdong, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D62817
2016-08-29 10:40:00 -07:00
Islam AbdelRahman
b49b92cf28 Introduce Read amplification bitmap (read amp statistics)
Summary:
Add ReadOptions::read_amp_bytes_per_bit option which allow us to create a bitmap for every data block we read
the bitmap will contain (block_size / read_amp_bytes_per_bit) bits.

We will use this bitmap to mark which bytes have been used of the block so we can calculate the read amplification

Test Plan: added new tests

Reviewers: andrewkr, yhchiang, sdong

Reviewed By: sdong

Subscribers: yiwu, leveldb, march, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58707
2016-08-26 18:55:58 -07:00
Islam AbdelRahman
e9b2af87f8 Expose ThreadPool under include/rocksdb/threadpool.h
Summary:
This diff split ThreadPool to
-ThreadPool (abstract interface exposed in include/rocksdb/threadpool.h)
-ThreadPoolImpl (actual implementation in util/threadpool_imp.h)

This allow us to expose ThreadPool to the user so we can use it as an option later

Test Plan: existing unit tests

Reviewers: andrewkr, yiwu, yhchiang, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D62085
2016-08-26 10:41:35 -07:00
Andrew Kryczka
a081f798b0 Relax consistency for thread-local ticker stats
Summary: see discussion in D62337

Test Plan: unit tests

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D62577
2016-08-25 10:42:26 -07:00
Andrew Kryczka
7c95868378 Thread-specific ticker statistics
Summary:
The global atomics we previously used for tickers had poor cache performance
since they were typically updated from different threads, causing frequent
invalidations. In this diff,

- recordTick() updates a local ticker value specific to the thread in which it was called
- When a thread exits, its local ticker value is added into merged_sum
- getTickerCount() returns the sum of all threads' local ticker values and the merged_sum
- setTickerCount() resets all threads' local ticker values and sets merged_sum to the value provided by the caller.

In a next diff I will make a similar change for histogram stats.

Test Plan:
before:

  $ TEST_TMPDIR=/dev/shm/ perf record -g ./db_bench --benchmarks=readwhilewriting --statistics --num=1000000 --use_existing_db --threads=64 --cache_size=250000000 --compression_type=lz4
  $ perf report -g --stdio | grep recordTick
  7.59%  db_bench     db_bench             [.] rocksdb::StatisticsImpl::recordTick
  ...

after:

  $ TEST_TMPDIR=/dev/shm/ perf record -g ./db_bench --benchmarks=readwhilewriting --statistics --num=1000000 --use_existing_db --threads=64 --cache_size=250000000 --compression_type=lz4
  $ perf report -g --stdio | grep recordTick
  1.46%  db_bench     db_bench             [.] rocksdb::StatisticsImpl::recordTick
  ...

Reviewers: kradhakrishnan, MarkCallaghan, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: yiwu, andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D62337
2016-08-24 15:42:31 -07:00
Yi Wu
badbff65b7 Not insert into block cache if cache is full and not holding handle
Summary:
We used to allow insert into full block cache as long as `strict_capacity_limit=false`. This diff further restrict insert to full cache if caller don't intent to hold handle to the cache entry after insert.

Hope this diff fix the assertion failure with db_stress: https://our.intern.facebook.com/intern/sandcastle/log/?instance_id=211853102&step_id=2475070014

  db_stress: util/lru_cache.cc:278: virtual void rocksdb::LRUCacheShard::Release(rocksdb::Cache::Handle*): Assertion `lru_.next == &lru_' failed.

The assertion at lru_cache.cc:278 can fail when an entry is inserted into full cache and stay in LRU list.

Test Plan:
  make all check

Reviewers: IslamAbdelRahman, lightmark, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D62325
2016-08-23 13:53:49 -07:00
Yi Wu
4a16c32ece Option to cache index/filter blocks with priority
Summary:
Add option to block based table to insert index/filter blocks to block cache with priority. Combined with LRUCache with high_pri_pool_ratio, we can reserved space for index/filter blocks, make them less likely to be evicted.

Depends on D61977.

Test Plan: See unit test.

Reviewers: lightmark, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, march, leveldb

Differential Revision: https://reviews.facebook.net/D62241
2016-08-23 13:44:13 -07:00
Andrew Kryczka
f57bc1d034 Fix lambda expression for clang/windows
Summary:
make the variables static so capturing is unnecessary since I couldn't
find a portable way to capture variables in a lambda that's converted to a
C-style pointer-to-function.

Test Plan: https://ci.appveyor.com/project/Facebook/rocksdb/build/1.0.1658

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D62403
2016-08-23 13:34:56 -07:00
Andrew Kryczka
5440675c3d Fix lambda capture expression for windows
Summary:
there was an error when accessing kItersPerThread in the lambda:
https://ci.appveyor.com/project/Facebook/rocksdb/build/1.0.1654

Test Plan: doitlive

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D62379
2016-08-23 10:20:00 -07:00
Andrew Kryczka
6584cec8f2 Fold function for thread-local data
Summary:
This function allows the user to provide a custom function to fold all
threads' local data. It will be used in my next diff for aggregating statistics
stored in thread-local data. Note the test case uses atomics as thread-local
values due to the synchronization requirement (documented in code).

Test Plan: unit test

Reviewers: yhchiang, sdong, kradhakrishnan

Reviewed By: kradhakrishnan

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D62049
2016-08-22 15:37:39 -07:00
Yi Wu
72f8cc703c LRU cache mid-point insertion
Summary:
Add mid-point insertion functionality to LRU cache. Caller of `Cache::Insert()` can set an additional parameter to make a cache entry have higher priority. The LRU cache will reserve at most `capacity * high_pri_pool_pct` bytes for high-pri cache entries. If `high_pri_pool_pct` is zero, the cache degenerates to normal LRU cache.

Context: If we are to put index and filter blocks into RocksDB block cache, index/filter block can be swap out too early. We want to add an option to RocksDB to reserve some capacity in block cache just for index/filter blocks, to mitigate the issue.

In later diffs I'll update block based table reader to use the interface to cache index/filter blocks at high priority, and expose the option to `DBOptions` and make it dynamic changeable.

Test Plan: unit test.

Reviewers: IslamAbdelRahman, sdong, lightmark

Reviewed By: lightmark

Subscribers: andrewkr, dhruba, march, leveldb

Differential Revision: https://reviews.facebook.net/D61977
2016-08-19 16:43:31 -07:00
Wanning Jiang
78837f5d61 TableBuilder / TableReader support for range deletion
Summary: 1. Range Deletion Tombstone structure 2. Modify Add() in table_builder to make it usable for adding range del tombstones 3. Expose NewTombstoneIterator() API in table_reader

Test Plan: table_test.cc (now BlockBasedTableBuilder::Add() only accepts InternalKey. I make table_test only pass InternalKey to BlockBasedTableBuidler. Also test writing/reading range deletion tombstones in table_test )

Reviewers: sdong, IslamAbdelRahman, lightmark, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61473
2016-08-19 15:10:31 -07:00
Yi Wu
4cc37f59e5 Introduce ClockCache
Summary:
Clock-based cache implemenetation aim to have better concurreny than
default LRU cache. See inline comments for implementation details.

Test Plan:
Update cache_test to run on both LRUCache and ClockCache. Adding some
new tests to catch some of the bugs that I fixed while implementing the
cache.

Reviewers: kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61647
2016-08-19 12:28:19 -07:00
Dmitri Smirnov
3981345be1 Small nits (#1280)
* Create rate limiter using factory function in the test.

* Convert function local statics in option helper to a C array
  that does not perform dynamic memory allocation. This is helpful
  when you try to memory isolate different DB instances.
2016-08-17 00:42:35 -07:00
Yi Wu
2a2ebb6f5e Move LRUCache structs to lru_cache.h header
Summary: ... so that I can include the header and create LRUCache specific tests for D61977

Test Plan:
   make check

Reviewers: lightmark, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D62145
2016-08-16 14:43:41 -07:00
Anirban Rahut
2fc2fd92a9 Single Delete Mismatch and Fallthrough statistics
Summary:
Added 2 statistics in compaction job statistics, to
identify if single deletes are not meeting a matching key
(fallthrough) or single deletes are meeting a merge, delete or
another single delete (i.e. not the expected case of put).

Test Plan: Tested the statistics using write_stress and compaction_job_stats_test

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D61749
2016-08-16 08:21:43 -07:00
Andrew Kryczka
236756f2cf Make SyncPoint return immediately when disabled
Summary:
We were frequently seeing a race between SyncPoint::Process() and
SyncPoint::~SyncPoint() (e.g.,
https://our.intern.facebook.com/intern/sandcastle/log/?instance_id=207289975&step_id=2412725431).
The issue was marked_thread_id_ gets deleted when the main thread is exiting and
simultaneously background threads may access it. We can prevent this race
condition by checking whether sync points are disabled (assuming the test terminates
with them disabled) before attempting to access that member. I do not understand
why accesses to other members (mutex_ and enabled_) are ok but anyways the
test no longer fails tsan.

Test Plan: ran tests

Reviewers: sdong, yhchiang, IslamAbdelRahman, yiwu, wanning

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D62133
2016-08-16 06:19:46 -07:00
Yueh-Hsuan Chiang
b248e98cf9 Fix a destruction order issue in ThreadStatusUpdater
Summary:
Env holds a pointer of ThreadStatusUpdater, which will be deleted when
Env is deleted.  However, in case a rocksdb database is deleted after
Env is deleted.  Then this will introduce a free-after-use of this
ThreadStatusUpdater.

This patch fix this by never deleting the ThreadStatusUpdater in Env,
which is in general safe as Env is a singleton in most cases.

Test Plan: thread_list_test

Reviewers: andrewkr, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D59187
2016-08-15 09:04:55 -07:00
Aaron Gao
e408e98c8c add Name() to Cache
Summary: preparation for detecting Cache type. If SimCache, we then may trigger some command like "setSimCapacity()" with setOptions()

Test Plan: make all check

Reviewers: yiwu, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61953
2016-08-12 14:16:57 -07:00
Yueh-Hsuan Chiang
6056d6317d Improve comment and bug fix for GetOptionsFromMap functions in convenience.h
Summary:
This diff improves the documentation for GetOptionsFromMap APIs and
fixes a bug in GetOptionsFromMap functions in convenience.h
where new_options will still be changed when the function
call is not successful.

Test Plan: options_test

Reviewers: IslamAbdelRahman, kradhakrishnan, andrewkr, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61731
2016-08-11 14:54:29 -07:00
Willem Jan Withagen
821bcb0b39 util/arena.cc: FreeBSD: More portable use of mmap(MAP_ANON) (#1254)
From the Linux manual:
  MAP_ANONYMOUS
     The  mapping  is  not  backed  by any file; its contents
     are initialized to zero.  The fd and offset arguments are
     ignored; however, some implementations require fd to be -1
     if MAP_ANONYMOUS (or MAP_ANON) is specified, and portable
     applications  should  ensure  this.

FreeBSD is such a case, it wil just return an error.
2016-08-10 13:52:23 -07:00
omegaga
44f5cc57a5 Add time series database (resubmitted)
Summary: Implement a time series database that supports DateTieredCompactionStrategy. It wraps a db object and separate SST files in different column families (time windows).

Test Plan: Add `date_tiered_test`.

Reviewers: dhruba, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61653
2016-08-05 15:56:22 -07:00
Dhruba Borthakur
34723b4c4a Cleanup unused variable pending_fsync_.
Summary: Cleanup unused variable pending_fsync_.

Test Plan: make check

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D61581
2016-08-05 10:31:41 -07:00
omegaga
2306167d30 Fix clang build failure and refactor unit test
Summary: <endian.h> is not platform independent. Switch to our own endianness transformation function instead.

Test Plan: Pass Travis CI. Refactor tests to make sure endianness transformation runs properly.

Reviewers: IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61389
2016-08-02 15:16:39 -07:00
Aaron Gao
343304e1d3 Use StopWatch to do statistic job in db_impl_add_file.cc
Summary:
patch for diff https://reviews.facebook.net/D58587
Also change StopWatch class to add a fifth param named overwrite which decides whether to overwrite *elapse or add on it.

Test Plan: make all check -j64

Reviewers: sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61239
2016-08-02 14:53:29 -07:00
krad
c116b47804 Persistent Read Cache (part 6) Block Cache Tier Implementation
Summary:
The patch is a continuation of part 5. It glues the abstraction for
file layout and metadata, and flush out the implementation of the API. It
adds unit tests for the implementation.

Test Plan: Run unit tests

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57549
2016-08-01 14:15:14 -07:00
Andrew Kryczka
afad5bd1c5 Simplify thread-local static initialization
Summary:
The call stack used to look like this during static initialization:

  #0  0x00000000008032d1 in rocksdb::ThreadLocalPtr::StaticMeta::StaticMeta() (this=0x7ffff683b300) at util/thread_local.cc:172
  #1  0x00000000008030a7 in rocksdb::ThreadLocalPtr::Instance() () at util/thread_local.cc:135
  #2  0x000000000080310f in rocksdb::ThreadLocalPtr::StaticMeta::Mutex() () at util/thread_local.cc:141
  #3  0x0000000000803103 in rocksdb::ThreadLocalPtr::StaticMeta::InitSingletons() () at util/thread_local.cc:139
  #4  0x000000000080305d in rocksdb::ThreadLocalPtr::InitSingletons() () at util/thread_local.cc:106

It involves outer/inner classes and the call stacks goes
outer->inner->outer->inner, which is too difficult to understand. We can avoid
a level of back-and-forth by skipping StaticMeta::InitSingletons(), which
doesn't initialize anything beyond what ThreadLocalPtr::Instance() already
initializes.

Now the call stack looks like this during static initialization:

  #0  0x00000000008032c5 in rocksdb::ThreadLocalPtr::StaticMeta::StaticMeta() (this=0x7ffff683b300) at util/thread_local.cc:170
  #1  0x00000000008030a7 in rocksdb::ThreadLocalPtr::Instance() () at util/thread_local.cc:135
  #2  0x000000000080305d in rocksdb::ThreadLocalPtr::InitSingletons() () at util/thread_local.cc:106

Test Plan:
unit tests

verify StaticMeta::mutex_ is still initialized in DefaultEnv() (StaticMeta::mutex_ is the only variable intended to be initialized via StaticMeta::InitSingletons() which I removed)

  #0  0x00000000005cee17 in rocksdb::port::Mutex::Mutex(bool) (this=0x7ffff69500b0, adaptive=false) at port/port_posix.cc:52
  #1  0x0000000000769cf8 in rocksdb::ThreadLocalPtr::StaticMeta::StaticMeta() (this=0x7ffff6950000) at util/thread_local.cc:168
  #2  0x0000000000769a53 in rocksdb::ThreadLocalPtr::Instance() () at util/thread_local.cc:133
  #3  0x0000000000769a09 in rocksdb::ThreadLocalPtr::InitSingletons() () at util/thread_local.cc:105
  #4  0x0000000000647d98 in rocksdb::Env::Default() () at util/env_posix.cc:845

Reviewers: lightmark, yhchiang, sdong

Reviewed By: sdong

Subscribers: arahut, IslamAbdelRahman, yiwu, andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60813
2016-07-28 14:55:22 -07:00
sdong
e5b5f12b81 Change options memtable_prefix_bloom_huge_page_tlb_size => memtable_huge_page_size and cover huge page to memtable too
Summary: Extend the option memtable_prefix_bloom_huge_page_tlb_size from just putting memtable bloom filter to huge page to memtable itself too.

Test Plan: Run all existing tests.

Reviewers: IslamAbdelRahman, yhchiang, andrewkr

Reviewed By: andrewkr

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60513
2016-07-26 18:15:11 -07:00
Islam AbdelRahman
f8061a237e Fix Statistics TickersNameMap miss match with Tickers enum
Summary:
TickersNameMap is not consistent with Tickers enum.
this cause us to report wrong statistics and sometimes to access TickersNameMap outside it's boundary causing crashes (in Fb303 statistics)

Test Plan: added new unit test

Reviewers: sdong, kradhakrishnan

Reviewed By: kradhakrishnan

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D61083
2016-07-25 16:05:50 -07:00
Islam AbdelRahman
68a8e6b8fa Introduce FullMergeV2 (eliminate memcpy from merge operators)
Summary:
This diff update the code to pin the merge operator operands while the merge operation is done, so that we can eliminate the memcpy cost, to do that we need a new public API for FullMerge that replace the std::deque<std::string> with std::vector<Slice>

This diff is stacked on top of D56493 and D56511

In this diff we
- Update FullMergeV2 arguments to be encapsulated in MergeOperationInput and MergeOperationOutput which will make it easier to add new arguments in the future
- Replace std::deque<std::string> with std::vector<Slice> to pass operands
- Replace MergeContext std::deque with std::vector (based on a simple benchmark I ran https://gist.github.com/IslamAbdelRahman/78fc86c9ab9f52b1df791e58943fb187)
- Allow FullMergeV2 output to be an existing operand

```
[Everything in Memtable | 10K operands | 10 KB each | 1 operand per key]

DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=10000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000

[FullMergeV2]
readseq      :       0.607 micros/op 1648235 ops/sec; 16121.2 MB/s
readseq      :       0.478 micros/op 2091546 ops/sec; 20457.2 MB/s
readseq      :       0.252 micros/op 3972081 ops/sec; 38850.5 MB/s
readseq      :       0.237 micros/op 4218328 ops/sec; 41259.0 MB/s
readseq      :       0.247 micros/op 4043927 ops/sec; 39553.2 MB/s

[master]
readseq      :       3.935 micros/op 254140 ops/sec; 2485.7 MB/s
readseq      :       3.722 micros/op 268657 ops/sec; 2627.7 MB/s
readseq      :       3.149 micros/op 317605 ops/sec; 3106.5 MB/s
readseq      :       3.125 micros/op 320024 ops/sec; 3130.1 MB/s
readseq      :       4.075 micros/op 245374 ops/sec; 2400.0 MB/s
```

```
[Everything in Memtable | 10K operands | 10 KB each | 10 operand per key]

DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=1000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000

[FullMergeV2]
readseq      :       3.472 micros/op 288018 ops/sec; 2817.1 MB/s
readseq      :       2.304 micros/op 434027 ops/sec; 4245.2 MB/s
readseq      :       1.163 micros/op 859845 ops/sec; 8410.0 MB/s
readseq      :       1.192 micros/op 838926 ops/sec; 8205.4 MB/s
readseq      :       1.250 micros/op 800000 ops/sec; 7824.7 MB/s

[master]
readseq      :      24.025 micros/op 41623 ops/sec;  407.1 MB/s
readseq      :      18.489 micros/op 54086 ops/sec;  529.0 MB/s
readseq      :      18.693 micros/op 53495 ops/sec;  523.2 MB/s
readseq      :      23.621 micros/op 42335 ops/sec;  414.1 MB/s
readseq      :      18.775 micros/op 53262 ops/sec;  521.0 MB/s

```

```
[Everything in Block cache | 10K operands | 10 KB each | 1 operand per key]

[FullMergeV2]
$ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions
readseq      :      14.741 micros/op 67837 ops/sec;  663.5 MB/s
readseq      :       1.029 micros/op 971446 ops/sec; 9501.6 MB/s
readseq      :       0.974 micros/op 1026229 ops/sec; 10037.4 MB/s
readseq      :       0.965 micros/op 1036080 ops/sec; 10133.8 MB/s
readseq      :       0.943 micros/op 1060657 ops/sec; 10374.2 MB/s

[master]
readseq      :      16.735 micros/op 59755 ops/sec;  584.5 MB/s
readseq      :       3.029 micros/op 330151 ops/sec; 3229.2 MB/s
readseq      :       3.136 micros/op 318883 ops/sec; 3119.0 MB/s
readseq      :       3.065 micros/op 326245 ops/sec; 3191.0 MB/s
readseq      :       3.014 micros/op 331813 ops/sec; 3245.4 MB/s
```

```
[Everything in Block cache | 10K operands | 10 KB each | 10 operand per key]

DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10-operands-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions

[FullMergeV2]
readseq      :      24.325 micros/op 41109 ops/sec;  402.1 MB/s
readseq      :       1.470 micros/op 680272 ops/sec; 6653.7 MB/s
readseq      :       1.231 micros/op 812347 ops/sec; 7945.5 MB/s
readseq      :       1.091 micros/op 916590 ops/sec; 8965.1 MB/s
readseq      :       1.109 micros/op 901713 ops/sec; 8819.6 MB/s

[master]
readseq      :      27.257 micros/op 36687 ops/sec;  358.8 MB/s
readseq      :       4.443 micros/op 225073 ops/sec; 2201.4 MB/s
readseq      :       5.830 micros/op 171526 ops/sec; 1677.7 MB/s
readseq      :       4.173 micros/op 239635 ops/sec; 2343.8 MB/s
readseq      :       4.150 micros/op 240963 ops/sec; 2356.8 MB/s
```

Test Plan: COMPILE_WITH_ASAN=1 make check -j64

Reviewers: yhchiang, andrewkr, sdong

Reviewed By: sdong

Subscribers: lovro, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57075
2016-07-20 09:49:03 -07:00
krad
d9cfaa2b16 Persistent Read Cache (6) Persistent cache tier implentation - File layout
Summary:
Persistent cache tier is the tier abstraction that can work for any block
device based device mounted on a file system. The design/implementation can
handle any generic block device.

Any generic block support is achieved by generalizing the access patten as
{io-size, q-depth, direct-io/buffered}.

We have specifically tested and adapted the IO path for NVM and SSD.

Persistent cache tier consists of there parts :

1) File layout

Provides the implementation for handling IO path for reading and writing data
(key/value pair).

2) Meta-data
Provides the implementation for handling the index for persistent read cache.

3) Implementation
It binds (1) and (2) and flushed out the PersistentCacheTier interface

This patch provides implementation for (1)(2). Follow up patch will provide (3)
and tests.

Test Plan: Compile and run check

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57117
2016-07-19 12:01:46 -07:00
John Alexander
9430333f84 New Statistics to track Compression/Decompression (#1197)
* Added new statistics and refactored to allow ioptions to be passed around as required to access environment and statistics pointers (and, as a convenient side effect, info_log pointer).

* Prevent incrementing compression counter when compression is turned off in options.

* Prevent incrementing compression counter when compression is turned off in options.

* Added two more supported compression types to test code in db_test.cc

* Prevent incrementing compression counter when compression is turned off in options.

* Added new StatsLevel that excludes compression timing.

* Fixed casting error in coding.h

* Fixed CompressionStatsTest for new StatsLevel.

* Removed unused variable that was breaking the Linux build
2016-07-19 09:44:03 -07:00
Yi Wu
4b95253587 Refactor cache.cc
Summary: Refactor cache.cc so that I can plugin clock cache (D55581). Mainly move `ShardedCache` to separate file, move `LRUHandle` back to cache.cc and rename it lru_cache.cc.

Test Plan:
    make check -j64

Reviewers: lightmark, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D59655
2016-07-15 10:41:36 -07:00
Jay Edgar
15b7a4ab88 Fixed output size and removed unneeded loop
Summary: In Zlib_Compress and BZip2_Compress the calculation for size was slightly off when using compression_foramt_version 2 (which includes the decompressed size in the output).  Also there were unnecessary loops around the deflate/BZ2_bzCompress calls.  In Zlib_Compress there was also a possible exit from the function after calling deflateInit2 that didn't call deflateEnd.

Test Plan: Standard tests

Reviewers: sdong, IslamAbdelRahman, igor

Reviewed By: igor

Subscribers: sdong, IslamAbdelRahman, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60537
2016-07-13 09:10:06 -07:00
Jay Edgar
efd013d6d8 Miscellaneous performance improvements
Summary:
I was investigating performance issues in the SstFileWriter and found all of the following:

- The SstFileWriter::Add() function created a local InternalKey every time it was called generating a allocation and free each time.  Changed to have an InternalKey member variable that can be reset with the new InternalKey::Set() function.
- In SstFileWriter::Add() the smallest_key and largest_key values were assigned the result of a ToString() call, but it is simpler to just assign them directly from the user's key.
- The Slice class had no move constructor so each time one was returned from a function a new one had to be allocated, the old data copied to the new, and the old one was freed.  I added the move constructor which also required a copy constructor and assignment operator.
- The BlockBuilder::CurrentSizeEstimate() function calculates the current estimate size, but was being called 2 or 3 times for each key added.  I changed the class to maintain a running estimate (equal to the original calculation) so that the function can return an already calculated value.
- The code in BlockBuilder::Add() that calculated the shared bytes between the last key and the new key duplicated what Slice::difference_offset does, so I replaced it with the standard function.
- BlockBuilder::Add() had code to copy just the changed portion into the last key value (and asserted that it now matched the new key).  It is more efficient just to copy the whole new key over.
- Moved this same code up into the 'if (use_delta_encoding_)' since the last key value is only needed when delta encoding is on.
- FlushBlockBySizePolicy::BlockAlmostFull calculated a standard deviation value each time it was called, but this information would only change if block_size of block_size_deviation changed, so I created a member variable to hold the value to avoid the calculation each time.
- Each PutVarint??() function has a buffer and calls std::string::append().  Two or three calls in a row could share a buffer and a single call to std::string::append().

Some of these will be helpful outside of the SstFileWriter.  I'm not 100% the addition of the move constructor is appropriate as I wonder why this wasn't done before - maybe because of compiler compatibility?  I tried it on gcc 4.8 and 4.9.

Test Plan: The changes should not affect the results so the existing tests should all still work and no new tests were added.  The value of the changes was seen by manually testing the SstFileWriter class through MyRocks and adding timing code to identify problem areas.

Reviewers: sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59607
2016-07-12 14:15:32 -07:00
Yi Wu
296545a2c7 Fix clang analyzer errors
Summary:
Fixing erros reported by clang static analyzer.
* Removing some unused variables.
* Adding assertions to fix false positives reported by clang analyzer.
* Adding `__clang_analyzer__` macro to suppress false positive warnings.

Test Plan:
    USE_CLANG=1 OPT=-g make analyze -j64

Reviewers: andrewkr, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60549
2016-07-08 17:50:51 -07:00
Wanning Jiang
0f691c4b5b CLI option & Rename() allow overwrite
Summary: this test support CLI option to select HdfsEnv/NativeHdfsEnv now. The latter one is default. add test about when Rename(src, target) should overwrite target

Test Plan: existing test

Reviewers: sdong, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60399
2016-07-07 17:51:36 -07:00
Andrew Kryczka
e1b3ee8a79 Cleanup auto-roll logger flush-while-rolling test
Summary:
Use @omegaga's awesome feature to avoid use of callbacks for ensuring
SyncPoints happen in a particular thread.

Depends on D60375.

Test Plan:
  $ ./auto_roll_logger_test

Reviewers: omegaga, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, omegaga, leveldb

Differential Revision: https://reviews.facebook.net/D60471
2016-07-07 11:35:40 -07:00
omegaga
cd4178a015 Add a new feature to enforce a sync point only active on a thread
Summary: Add markers to sync points. A marked sync point will only be active when it is on the same thread as the marker sync point.

Test Plan: Write a unit test to validate.

Reviewers: sdong, IslamAbdelRahman, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60375
2016-07-07 11:29:14 -07:00
sdong
32df9733d1 Add options.write_buffer_manager: control total memtable size across DB instances
Summary: Add option write_buffer_manager to help users control total memory spent on memtables across multiple DB instances.

Test Plan: Add a new unit test.

Reviewers: yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: adela, benj, sumeet, muthu, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59925
2016-07-05 18:11:25 -07:00
Wanning Jiang
95d96eeeb1 remove LockFile
Summary: LockFile is unnecessary in unit test

Test Plan: env_basic_test.cc

Reviewers: andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60285
2016-07-01 15:28:08 -07:00
Aaron Gao
cb2476a0ca fix rate limiter to avoid starvation
Summary:
The current implementation of rate limiter has the possibility to introduce resource starvation when change its limit.
This diff aims to fix this problem by consuming request bytes partially.

Test Plan:
```
./rate_limiter_test
[==========] Running 4 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 4 tests from RateLimiterTest
[ RUN      ] RateLimiterTest.OverflowRate
[       OK ] RateLimiterTest.OverflowRate (0 ms)
[ RUN      ] RateLimiterTest.StartStop
[       OK ] RateLimiterTest.StartStop (0 ms)
[ RUN      ] RateLimiterTest.Rate
request size [1 - 1023], limit 10 KB/sec, actual rate: 10.355712 KB/sec, elapsed 2.00 seconds
request size [1 - 1023], limit 20 KB/sec, actual rate: 19.136564 KB/sec, elapsed 2.00 seconds
request size [1 - 2047], limit 20 KB/sec, actual rate: 20.783976 KB/sec, elapsed 2.10 seconds
request size [1 - 2047], limit 40 KB/sec, actual rate: 39.308144 KB/sec, elapsed 2.10 seconds
request size [1 - 4095], limit 40 KB/sec, actual rate: 40.318349 KB/sec, elapsed 2.20 seconds
request size [1 - 4095], limit 80 KB/sec, actual rate: 79.667396 KB/sec, elapsed 2.20 seconds
request size [1 - 8191], limit 80 KB/sec, actual rate: 81.807158 KB/sec, elapsed 2.30 seconds
request size [1 - 8191], limit 160 KB/sec, actual rate: 160.659761 KB/sec, elapsed 2.20 seconds
request size [1 - 16383], limit 160 KB/sec, actual rate: 160.700990 KB/sec, elapsed 3.00 seconds
request size [1 - 16383], limit 320 KB/sec, actual rate: 317.639481 KB/sec, elapsed 2.50 seconds
[       OK ] RateLimiterTest.Rate (22618 ms)
[ RUN      ] RateLimiterTest.LimitChangeTest
[COMPLETE] request size 10 KB, new limit 20KB/sec, refill period 1000 ms
[COMPLETE] request size 10 KB, new limit 5KB/sec, refill period 1000 ms
[COMPLETE] request size 20 KB, new limit 40KB/sec, refill period 1000 ms
[COMPLETE] request size 20 KB, new limit 10KB/sec, refill period 1000 ms
[COMPLETE] request size 40 KB, new limit 80KB/sec, refill period 1000 ms
[COMPLETE] request size 40 KB, new limit 20KB/sec, refill period 1000 ms
[COMPLETE] request size 80 KB, new limit 160KB/sec, refill period 1000 ms
[COMPLETE] request size 80 KB, new limit 40KB/sec, refill period 1000 ms
[COMPLETE] request size 160 KB, new limit 320KB/sec, refill period 1000 ms
[COMPLETE] request size 160 KB, new limit 80KB/sec, refill period 1000 ms
[       OK ] RateLimiterTest.LimitChangeTest (5002 ms)
[----------] 4 tests from RateLimiterTest (27620 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 1 test case ran. (27621 ms total)
[  PASSED  ] 4 tests.
```

Reviewers: sdong, IslamAbdelRahman, yiwu, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60207
2016-07-01 00:16:29 -07:00
Andrew Kryczka
6b7167651f Run env_basic_test on Env::Default
Summary:
Previously we couldn't run env_basic_test on Env::Default (PosixEnv on
our platforms) since GetChildren*() behavior was inconsistent with our other
Envs. We can normalize the output of GetChildren*() such that these test cases
work on PosixEnv too.

Test Plan: ran env_basic_test

Reviewers: wanning

Reviewed By: wanning

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D59943
2016-06-30 18:34:29 -07:00
Andrew Kryczka
9eb0b53954 Move env_basic_test cleanup to TearDown
Summary:
move cleanup to TearDown and handle directories, so cleanup will happen
even if a test fails in the middle.

Test Plan: ./env_basic_test

Reviewers: wanning

Reviewed By: wanning

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60243
2016-06-30 18:33:49 -07:00
Wanning Jiang
95c192475d writable file close before reset
Summary: during backup, writable file should call close() before reset()

Test Plan: backupable_db_test.cc

Reviewers: andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60195
2016-06-29 15:33:27 -07:00
Willem Jan Withagen
b726bf5961 FreeBSD does not have std::to_string (#1190)
Submitted-by: Willem Jan Withagen <wjw@digiware.nl>
2016-06-29 07:35:17 -07:00
Wanning Jiang
3fc713ed92 delete 2nd level children for default env
Summary: ensure no 2nd level children under test_dir_

Test Plan: env_basic_test on 4 envs

Reviewers: andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59979
2016-06-24 12:31:58 -07:00
Wanning Jiang
0d7b261236 add tests to env_basic_test.cc
Summary: test NativeHdfsEnv

Test Plan: env_basic_test.cc

Reviewers: andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D59889
2016-06-22 16:16:21 -07:00
omegaga
c4e19b77e8 Add a read option to enable background purge when cleaning up iterators
Summary:
Add a read option `background_purge_on_iterator_cleanup` to avoid deleting files in foreground when destroying iterators.
Instead, a job is scheduled in high priority queue and would be executed in a separate background thread.

Test Plan: Add a variant of PurgeObsoleteFileTest. Turn on background purge option in the new test, and use sleeping task to ensure files are deleted in background.

Reviewers: IslamAbdelRahman, sdong

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59499
2016-06-21 18:41:23 -07:00
sdong
7b79238b65 Deprectate filter_deletes
Summary: filter_deltes is not a frequently used feature. Remove it.

Test Plan: Run all test suites.

Reviewers: igor, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59427
2016-06-17 10:30:47 -07:00
sdong
4939fc3892 Bulk load mode shouldn't stop ingest
Summary: We introduced default slow down and stop condition, but didn't reset it in bulk load mode. Fix it.

Test Plan: N/A

Reviewers: igor, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59757
2016-06-17 09:58:50 -07:00
Yueh-Hsuan Chiang
3a2bccc845 Fixed a crash bug that incorrectly parse deprecated options in options_helper
Summary: Fixed a crash bug that incorrectly parse deprecated options in options_helper

Test Plan:
run db_bench with an old options file with memtable_prefix_bloom_probes
./db_bench --options_file=AN_OLD_OPTIONS_FILE --num=100 --benchmarks=fillseq

Reviewers: sdong, IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D59787
2016-06-17 04:53:30 -07:00
Yi Wu
bc8af90e8c add option to not flush memtable on open()
Summary:
Add option to not flush memtable on open()
In case the option is enabled, don't delete existing log files by not updating log numbers to MANIFEST.
Will still flush if we need to (e.g. memtable full in the middle). In that case we also flush final memtable.
If wal_recovery_mode = kPointInTimeRecovery, do not halt immediately after encounter corruption. Instead, check if seq id of next log file is last_log_sequence + 1. In that case we continue recovery.

Test Plan: See unit test.

Reviewers: dhruba, horuff, sdong

Reviewed By: sdong

Subscribers: benj, yhchiang, andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57813
2016-06-13 11:34:16 -07:00
Nadav Rotem
7360db39e6 Add a check mode to verify compressed block can be decompressed back
Summary:
Try to decompress compressed blocks when a special flag is set.
assert and crash in debug builds if we can't decompress the just-compressed input.

Test Plan: Run unit-tests.

Reviewers: dhruba, andrewkr, sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59145
2016-06-10 18:20:54 -07:00
sdong
6faddd7c55 Merge db/slice.cc into util/slice.cc
Summary: It confuses some compilers to have slice.cc under multiple directories. Merge them.

Test Plan: Run existing tests

Reviewers: andrewkr, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59409
2016-06-10 16:37:36 -07:00
omegaga
2d05eaeb28 Fix name conflict in delete_shceduler_test and db_sst_test
Summary: delete_scheduler_test and db_sst_test share a same directory name, causing possible fails on both tests when running in parallel. Fixed by changing directory name.

Test Plan: Run the two tests in parallel: `parallel -u ./{} ::: delete_scheduler_test db_sst_test`

Reviewers: sdong, andrewkr

Reviewed By: sdong, andrewkr

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59529
2016-06-10 15:39:17 -07:00
sdong
20699df843 memtable_prefix_bloom_bits -> memtable_prefix_bloom_bits_ratio and deprecate memtable_prefix_bloom_probes
Summary:
memtable_prefix_bloom_probes is not a critical option. Remove it to reduce number of options.
It's easier for users to make mistakes with memtable_prefix_bloom_bits, turn it to memtable_prefix_bloom_bits_ratio

Test Plan: Run all existing tests

Reviewers: yhchiang, igor, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: gunnarku, yoshinorim, MarkCallaghan, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59199
2016-06-10 12:12:10 -07:00
sdong
b2973eaaeb Remove options builder
Summary: AFIK, options builder is not used by anyone. Remove it.

Test Plan: Run all existing tests.

Reviewers: IslamAbdelRahman, andrewkr, igor

Reviewed By: igor

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59319
2016-06-09 16:56:17 -07:00
Frank Celler
1ba452226f Fix for GCC 5.4 (#1157)
GCC 5.4 will complain (see also options_parser.cc):

    /home/abuild/rpmbuild/BUILD/arangodb-3.0.0r1/3rdParty/rocksdb/rocksdb/util/options_builder.cc: In function 'rocksdb::CompactionStyle rocksdb::{anonymous}::PickCompactionStyle(size_t, int, int, uint64_t)':
    /home/abuild/rpmbuild/BUILD/arangodb-3.0.0r1/3rdParty/rocksdb/rocksdb/util/options_builder.cc:29:7: error: 'log' is not a member of 'std'
           std::log(target_db_size / write_buffer_size) / std::log(kBytesForLevelMultiplier)));
           ^
    /home/abuild/rpmbuild/BUILD/arangodb-3.0.0r1/3rdParty/rocksdb/rocksdb/util/options_builder.cc:29:7: note: suggested alternative:
    In file included from /usr/include/features.h:365:0,
                     from /usr/include/math.h:26,
                     from /home/abuild/rpmbuild/BUILD/arangodb-3.0.0r1/3rdParty/rocksdb/rocksdb/util/options_builder.cc:6:
    /usr/include/bits/mathcalls.h:109:1: note:   'log'
     __MATHCALL_VEC (log,, (_Mdouble_ __x));
2016-06-08 08:47:20 -07:00
Willem Jan Withagen
0d65acec0c threadpool.cc: abort() lives in stdlib.h on FreeBSD (#1155) 2016-06-05 17:23:38 -07:00
Willem Jan Withagen
19dd5a61cd env_chroot.cc: FreeBSD likes stdlib.h for realpaht() and friends (#1154) 2016-06-05 17:22:55 -07:00
Andrew Kryczka
5aca977be8 env_basic_test library for testing new Envs [pluggable Env part 3]
Summary:
- Provide env_test as a static library. We will build it for future releases so internal Envs can use env_test by linking against this library.
- Add tests for CustomEnv, which is configurable via ENV_TEST_URI environment variable. It uses the URI-based Env lookup (depends on D58449).
- Refactor env_basic_test cases to use a unique/configurable directory for test files.

Test Plan:
built a test binary against librocksdb_env_test.a. It registered the
default Env with URI prefix "a://".

- verify runs all CustomEnv tests when URI with correct prefix is provided

```
$ ENV_TEST_URI="a://ok" ./tmp --gtest_filter="CustomEnv/*"
...
[  PASSED  ] 12 tests.
```

- verify runs no CustomEnv tests when URI with non-matching prefix is provided

```
$ ENV_TEST_URI="b://ok" ./tmp --gtest_filter="CustomEnv/*"
...
[  PASSED  ] 0 tests.
```

Reviewers: ldemailly, IslamAbdelRahman, lightmark, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D58485
2016-06-03 18:44:22 -07:00
Andrew Kryczka
6e6622abb9 Create env_basic_test [pluggable Env part 2]
Summary:
Extracted basic Env-related tests from mock_env_test and memenv_test into a
parameterized test for Envs: env_basic_test.

Depends on D58449. (The dependency is here only so I can keep this series of
diffs in a chain -- there is no dependency on that diff's code.)

Test Plan: ran tests

Reviewers: IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D58635
2016-06-03 15:13:03 -07:00
krad
27ad170712 Fix Windows build break
Summary:
Direct IO checkin breaks Windows build. Fixing the code to work for
Windows.

Test Plan: Run env_test in Windows 10 and make check in Linux

Reviewers: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D59073
2016-06-01 18:07:59 -07:00
Vasile Paraschiv
71c7eed91c Assert boundary checks for SetPerfLevel()
Summary: Add asserts around PerfLevel enum

Test Plan: make all check -j32

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D59007
2016-06-01 09:07:09 -07:00
sdong
f62fbd2c85 Handle overflow case of rate limiter's paramters
Summary: When rate_bytes_per_sec * refill_period_us_ overflows, the actual limited rate is very low. Handle this case so the rate will be large.

Test Plan: Add a unit test for it.

Reviewers: IslamAbdelRahman, andrewkr

Reviewed By: andrewkr

Subscribers: yiwu, lightmark, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58929
2016-05-27 16:15:28 -07:00
Andrew Kryczka
57461fba8c In-memory environment read beyond EOF
Summary:
Made it consistent with posix Env, which uses pread() that returns 0
(success) when an offset is given beyond EOF. The purpose of making these Envs
behave consistently is I am repurposing the in-memory Envs' tests for the basic
Env tests in D58635.

Test Plan: ran mock_env_test and memenv_test

Reviewers: IslamAbdelRahman, lightmark, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D58845
2016-05-27 12:10:26 -07:00
sdong
157e0633e7 MutexLock -> ThreadPoolMutexLock in util/threadpool.cc
Summary: util/threadpool.cc's function name is the same as a well-known class name. It breaks unity build. Rename it.

Test Plan: Run all existing test.

Reviewers: yiwu, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58881
2016-05-27 09:41:35 -07:00
Aaron Gao
5d660258e7 add simulator Cache as class SimCache/SimLRUCache(with test)
Summary: add class SimCache(base class with instrumentation api) and SimLRUCache(derived class with detailed implementation) which is used as an instrumented block cache that can predict hit rate for different cache size

Test Plan:
Add a test case in `db_block_cache_test.cc` called `SimCacheTest` to test basic logic of SimCache.
Also add option `-simcache_size` in db_bench. if set with a value other than -1, then the benchmark will use this value as the size of the simulator cache and finally output the simulation result.
```
[gzh@dev9927.prn1 ~/local/rocksdb] ./db_bench -benchmarks "fillseq,readrandom" -cache_size 1000000 -simcache_size 1000000
RocksDB:    version 4.8
Date:       Tue May 17 16:56:16 2016
CPU:        32 * Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
CPUCache:   20480 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Write rate: 0 bytes/second
Compression: Snappy
Memtablerep: skip_list
Perf Level: 0
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
DB path: [/tmp/rocksdbtest-112628/dbbench]
fillseq      :       6.809 micros/op 146874 ops/sec;   16.2 MB/s
DB path: [/tmp/rocksdbtest-112628/dbbench]
readrandom   :       6.343 micros/op 157665 ops/sec;   17.4 MB/s (1000000 of 1000000 found)

SIMULATOR CACHE STATISTICS:
SimCache LOOKUPs: 986559
SimCache HITs:    264760
SimCache HITRATE: 26.84%

[gzh@dev9927.prn1 ~/local/rocksdb] ./db_bench -benchmarks "fillseq,readrandom" -cache_size 1000000 -simcache_size 10000000
RocksDB:    version 4.8
Date:       Tue May 17 16:57:10 2016
CPU:        32 * Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
CPUCache:   20480 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Write rate: 0 bytes/second
Compression: Snappy
Memtablerep: skip_list
Perf Level: 0
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
DB path: [/tmp/rocksdbtest-112628/dbbench]
fillseq      :       5.066 micros/op 197394 ops/sec;   21.8 MB/s
DB path: [/tmp/rocksdbtest-112628/dbbench]
readrandom   :       6.457 micros/op 154870 ops/sec;   17.1 MB/s (1000000 of 1000000 found)

SIMULATOR CACHE STATISTICS:
SimCache LOOKUPs: 1059764
SimCache HITs:    374501
SimCache HITRATE: 35.34%

[gzh@dev9927.prn1 ~/local/rocksdb] ./db_bench -benchmarks "fillseq,readrandom" -cache_size 1000000 -simcache_size 100000000
RocksDB:    version 4.8
Date:       Tue May 17 16:57:32 2016
CPU:        32 * Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
CPUCache:   20480 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Write rate: 0 bytes/second
Compression: Snappy
Memtablerep: skip_list
Perf Level: 0
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
DB path: [/tmp/rocksdbtest-112628/dbbench]
fillseq      :       5.632 micros/op 177572 ops/sec;   19.6 MB/s
DB path: [/tmp/rocksdbtest-112628/dbbench]
readrandom   :       6.892 micros/op 145094 ops/sec;   16.1 MB/s (1000000 of 1000000 found)

SIMULATOR CACHE STATISTICS:
SimCache LOOKUPs: 1150767
SimCache HITs:    1034535
SimCache HITRATE: 89.90%
```

Reviewers: IslamAbdelRahman, andrewkr, sdong

Reviewed By: sdong

Subscribers: MarkCallaghan, andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57999
2016-05-23 23:35:23 -07:00
krad
21f847eda5 Direct IO fix for Mac
Summary:
O_DIRECT is not available in Mac as a flag for open. The fix is to make
use of fctl after the file is opened

Test Plan: Run the tests on mac and Linux

Reviewers: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D58665
2016-05-23 16:38:25 -07:00
Ashish Shenoy
99765ed855 Clean up the ComputeCompactionScore() API
Summary: Make CompactionOptionsFIFO a part of mutable_cf_options

Test Plan: UT

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, lgalanis, dhruba

Differential Revision: https://reviews.facebook.net/D58653
2016-05-23 15:55:29 -07:00
krad
f89caa127b Direct IO capability for RocksDB
Summary:
This patch adds direct IO capability to RocksDB Env.

The direct IO capability is required for persistent cache since NVM is best
accessed as 4K direct IO. SSDs can leverage direct IO for reading.

Direct IO requires the offset and size be sector size aligned, and memory to
be kernel page aligned. Since neither RocksDB/Persistent read cache data
layout is aligned to sector size, the code can accommodate reading unaligned IO size
(or unaligned memory) at the cost of an alloc/copy.

The write code path expects the size and memory to be aligned.

Test Plan: Run RocksDB unit tests

Reviewers: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57393
2016-05-23 12:27:27 -07:00
Sage Weil
847e471db6 db/log_test: add recycle log test
This currently fails because we do not properly map a
corrupt header to the logical end of the log.

Signed-off-by: Sage Weil <sage@redhat.com>
2016-05-22 22:00:15 -07:00
Aaron Orenstein
2073cf3775 Eliminate use of 'using namespace std'. Also remove a number of ADL references to std functions.
Summary: Reduce use of argument-dependent name lookup in RocksDB.

Test Plan: 'make check' passed.

Reviewers: andrewkr

Reviewed By: andrewkr

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58203
2016-05-20 07:42:18 -07:00
krad
bb98ca3c80 Implement GetUniqueId for Mac
Summary:
Persistent read cache relies on the accuracy of the GetUniqueIdFromFile
to generate a unique key for a given block of data. Currently we don't have an
implementation for Mac.

This patch adds an implementation.

Test Plan: Run unit tests

Reviewers: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D58413
2016-05-19 16:34:31 -07:00
Islam AbdelRahman
533cda90ce Add GetStringFromCompressionType to include/rocksdb/convenience.h
Summary:
Expose a simple function to convert CompressionType to it's corresponding option string

This is for a diff @yoshinorim is working on for MyRocks

Test Plan: unittest

Reviewers: yhchiang, andrewkr, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D58215
2016-05-18 15:03:21 -07:00
Aaron Gao
43afd72bee [rocksdb] make more options dynamic
Summary:
make more ColumnFamilyOptions dynamic:
- compression
- soft_pending_compaction_bytes_limit
- hard_pending_compaction_bytes_limit
- min_partial_merge_operands
- report_bg_io_stats
- paranoid_file_checks

Test Plan:
Add sanity check in `db_test.cc` for all above options except for soft_pending_compaction_bytes_limit and hard_pending_compaction_bytes_limit.
All passed.

Reviewers: andrewkr, sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57519
2016-05-17 13:11:56 -07:00
krad
a08c8c851a Added PersistentCache abstraction
Summary:
Added a new abstraction to cache page to RocksDB designed for the read
cache use.

RocksDB current block cache is more of an object cache. For the persistent read cache
project, what we need is a page cache equivalent. This changes adds a cache
abstraction to RocksDB to cache pages called PersistentCache. PersistentCache can cache
uncompressed pages or raw pages (content as in filesystem). The user can
choose to operate PersistentCache either in  COMPRESSED or UNCOMPRESSED mode.

Blame Rev:

Test Plan: Run unit tests

Reviewers: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D55707
2016-05-15 22:17:18 -07:00
Dmitri Smirnov
aab91b8d8f Use generic threadpool for Windows environment (#1120)
Conditionally retrofit thread_posix for use with std::thread
  and reuse the same logic. Posix users continue using Posix interfaces.
  Enable XPRESS compression in test runs.
  Fix master introduced signed/unsigned mismatch.
2016-05-12 18:34:04 -07:00
Reid Horuff
a657ee9a9c [rocksdb] Recovery path sequence miscount fix
Summary:
Consider the following WAL with 4 batch entries prefixed with their sequence at time of memtable insert.
[1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(a)]
[1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(b)]
[4: COMMIT(a)]
[7: COMMIT(b)]

The first two batches do not consume any sequence numbers so are both prefixed with seq=1.
For 2pc commit, memtable insertion takes place before COMMIT batch is written to WAL.
We can see that sequence number consumption takes place between WAL entries giving us the seemingly sparse sequence prefix for WAL entries.
This is a valid WAL.

Because with 2PC markers one WriteBatch points to another batch containing its inserts a writebatch can consume more or less sequence numbers than the number of sequence consuming entries that it contains.

We can see that, given the entries in the WAL, 6 sequence ids were consumed. Yet on recovery the maximum sequence consumed would be 7 + 3 (the number of sequence numbers consumed by COMMIT(b))

So, now upon recovery we must track the actual consumption of sequence numbers.
In the provided scenario there will be no sequence gaps, but it is possible to produce a sequence gap. This should not be a problem though. correct?

Test Plan: provided test.

Reviewers: sdong

Subscribers: andrewkr, leveldb, dhruba, hermanlee4

Differential Revision: https://reviews.facebook.net/D57645
2016-05-10 14:06:07 -07:00
Reid Horuff
1b8a2e8fdd [rocksdb] Memtable Log Referencing and Prepared Batch Recovery
Summary:
This diff is built on top of WriteBatch modification: https://reviews.facebook.net/D54093 and adds the required functionality to rocksdb core necessary for rocksdb to support 2PC.

modfication of DBImpl::WriteImpl()
- added two arguments *uint64_t log_used = nullptr, uint64_t log_ref = 0;
- *log_used is an output argument which will return the log number which the incoming batch was inserted into, 0 if no WAL insert took place.
-  log_ref is a supplied log_number which all memtables inserted into will reference after the batch insert takes place. This number will reside in 'FindMinPrepLogReferencedByMemTable()' until all Memtables insertinto have flushed.

- Recovery/writepath is now aware of prepared batches and commit and rollback markers.

Test Plan: There is currently no test on this diff. All testing of this functionality takes place in the Transaction layer/diff but I will add some testing.

Reviewers: IslamAbdelRahman, sdong

Subscribers: leveldb, santoshb, andrewkr, vasilep, dhruba, hermanlee4

Differential Revision: https://reviews.facebook.net/D56919
2016-05-10 14:06:07 -07:00
Andrew Kryczka
f548da33e8 Follow symlinks in chroot directory
Summary:
On Mac OS X, the chroot directory we typically use ("/tmp") is actually
a symlink for "/private/tmp". Since we dereference symlinks in user-defined
paths, we must also dereference symlinks in chroot_dir_ such that we can perform
string comparisons on those paths.

Test Plan: ran env_test on Mac OS X and devserver

Reviewers: sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57957
2016-05-10 09:53:52 -07:00
Islam AbdelRahman
4b31723433 Add bottommost_compression option
Summary:
Add a new option that can be used to set a specific compression algorithm for bottommost level.
This option will only affect levels larger than base level.

I have also updated CompactionJobInfo to include the compression algorithm used in compaction

Test Plan:
added new unittest
existing unittests

Reviewers: andrewkr, yhchiang, sdong

Reviewed By: sdong

Subscribers: lightmark, andrewkr, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D57669
2016-05-09 15:57:19 -07:00
Andrew Kryczka
258459ed54 Properly destroy ChrootEnv in env_test
Summary: see title

Test Plan:
  $ /mnt/gvfs/third-party2/valgrind/af85c56f424cd5edfc2c97588299b44ecdec96bb/3.10.0/gcc-4.9-glibc-2.20/e9936bf/bin/valgrind --error-exitcode=2 --leak-check=full ./env_test

Reviewers: IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57897
2016-05-09 14:38:50 -07:00
Andrew Kryczka
a9b3c47c8e Fix includes for clang on OS X
Summary:
Fix below error:

  use of undeclared identifier 'errno'

Test Plan: doitlive

Reviewers: IslamAbdelRahman, sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57849
2016-05-06 18:32:54 -07:00
Andrew Kryczka
3f16a836a4 Introduce chroot Env
Summary:
For testing backups, we needed an Env that is fully isolated from other
Envs on the same machine. Our in-memory Envs (MockEnv and InMemoryEnv) were
insufficient because they don't implement most directory operations.

This diff introduces a new Env, "ChrootEnv", that translates paths such that the
chroot directory appears to be the root directory. This way, multiple Envs can
be isolated in the filesystem by using different chroot directories. Since we
use the filesystem, all directory operations are trivially supported.

Test Plan:
I parameterized the existing EnvPosixTest so it runs tests on ChrootEnv
except the ioctl-related cases.

Reviewers: sdong, lightmark, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57543
2016-05-06 17:42:50 -07:00
Arun Sharma
04dec2a359 [ldb] Export ldb_cmd*.h
Summary:
This is needed so that rocksdb users can add more
commands to the included ldb tool by adding more custom
commands.

Test Plan: make -j ldb

Reviewers: sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57243
2016-05-06 16:09:09 -07:00
Andrew Kryczka
0d590d9991 Make max_dict_bytes optional in options string
Summary:
For backwards compatibility with older option strings, the parser needs
to treat this argument as optional.

Test Plan:
Updated unit test to cover case where compression_opts is present but
max_dict_bytes is omitted.

Reviewers: MarkCallaghan, sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57759
2016-05-06 11:27:28 -07:00
sdong
e3c6ba37dd OptimizeForSmallDb(): revert some options whose defaults were just changed
Summary: We changed default options of max_open_files and max_file_opening_threads but didn't revert it in OptimizeForSmallDb().

Test Plan: Add a unit test

Reviewers: igor, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57675
2016-05-05 16:50:53 -07:00
Yi Wu
24a24f013d Enable configurable readahead for iterators
Summary:
Add an option `iterator_readahead_size` to `ReadOptions` to enable
configurable readahead for iterators similar to the corresponding
option for compaction.

Test Plan:
```
make commit_prereq
```

Reviewers: kumar.rangarajan, ott, igor, sdong

Reviewed By: sdong

Subscribers: yiwu, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D55419
2016-05-04 15:25:58 -07:00
sdong
6a14f7a976 Change several option defaults
Summary:
Changing several option defaults:
 options.max_open_files changes from 5000 to -1
 options.base_background_compactions changes from max_background_compactions to 1
 options.wal_recovery_mode changes from kTolerateCorruptedTailRecords to kTolerateCorruptedTailRecords
 options.compaction_pri changes from kByCompensatedSize to kByCompensatedSize

Test Plan: Write unit tests to see OldDefaults() works as expected.

Reviewers: IslamAbdelRahman, yhchiang, igor

Reviewed By: igor

Subscribers: MarkCallaghan, yiwu, kradhakrishnan, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56427
2016-04-28 17:50:58 -07:00
Andrew Kryczka
843d2e3137 Shared dictionary compression using reference block
Summary:
This adds a new metablock containing a shared dictionary that is used
to compress all data blocks in the SST file. The size of the shared dictionary
is configurable in CompressionOptions and defaults to 0. It's currently only
used for zlib/lz4/lz4hc, but the block will be stored in the SST regardless of
the compression type if the user chooses a nonzero dictionary size.

During compaction, computes the dictionary by randomly sampling the first
output file in each subcompaction. It pre-computes the intervals to sample
by assuming the output file will have the maximum allowable length. In case
the file is smaller, some of the pre-computed sampling intervals can be beyond
end-of-file, in which case we skip over those samples and the dictionary will
be a bit smaller. After the dictionary is generated using the first file in a
subcompaction, it is loaded into the compression library before writing each
block in each subsequent file of that subcompaction.

On the read path, gets the dictionary from the metablock, if it exists. Then,
loads that dictionary into the compression library before reading each block.

Test Plan: new unit test

Reviewers: yhchiang, IslamAbdelRahman, cyan, sdong

Reviewed By: sdong

Subscribers: andrewkr, yoshinorim, kradhakrishnan, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D52287
2016-04-27 17:36:03 -07:00
Sergey Makarenko
1c80dfab24 Print memory allocation counters
Summary:
Introduced option to dump malloc statistics using new option flag.
    Added new command line option to db_bench tool to enable this
    funtionality.
    Also extended build to support environments with/without jemalloc.

Test Plan:
1) Build rocksdb using `make` command. Launch the following command
    `./db_bench --benchmarks=fillrandom --dump_malloc_stats=true
    --num=10000000` end verified that jemalloc dump is present in LOG file.
    2) Build rocksdb using `DISABLE_JEMALLOC=1  make db_bench -j32` and ran
    the same db_bench tool and found the following message in LOG file:
    "Please compile with jemalloc to enable malloc dump".
    3) Also built rocksdb using `make` command on MacOS to verify behavior
    in non-FB environment.
    Also to debug build configuration change temporary changed
    AM_DEFAULT_VERBOSITY = 1 in Makefile to see compiler and build
    tools output. For case 1) -DROCKSDB_JEMALLOC was present in compiler
    command line. For both 2) and 3) this flag was not present.

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57321
2016-04-27 16:23:33 -07:00
Islam AbdelRahman
7c14abf2c7 Improve BytewiseComparatorImpl::FindShortestSeparator
Summary:
The current implementation find the first different byte and try to increment it, if it cannot it return the original key
we can improve this by keep going after the first different byte to find the first non 0xFF byte and increment it

After trying this patch on some logdevice sst files I see decrease in there index block size by 8.5%

Test Plan: existing tests and updated test

Reviewers: yhchiang, andrewkr, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56241
2016-04-25 23:02:14 -07:00
dx9
b71c4e613f Alpine Linux Build (#990)
* Musl libc does not provide adaptive mutex. Added feature test for PTHREAD_MUTEX_ADAPTIVE_NP.

* Musl libc does not provide backtrace(3). Added a feature check for backtrace(3).

* Fixed compiler error.

* Musl libc does not implement backtrace(3). Added platform check for libexecinfo.

* Alpine does not appear to support gcc -pg option. By default (gcc has PIE option enabled) it fails with:

gcc: error: -pie and -pg|p|profile are incompatible when linking

When -fno-PIE and -nopie are used it fails with:

/usr/lib/gcc/x86_64-alpine-linux-musl/5.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find gcrt1.o: No such file or directory

Added gcc -pg platform test and output PROFILING_FLAGS accordingly. Replaced pg var in Makefile with PROFILING_FLAGS.

* fix segfault when TEST_IOCTL_FRIENDLY_TMPDIR is undefined and default candidates are not suitable

* use ASSERT_DOUBLE_EQ instead of ASSERT_EQ

* When compiled with ROCKSDB_MALLOC_USABLE_SIZE UniversalCompactionFourPaths and UniversalCompactionSecondPathRatio tests fail due to premature memtable flushes on systems with 16-byte alignment. Arena runs out of block space before GenerateNewFile() completes.

Increased options.write_buffer_size.
2016-04-22 16:49:12 -07:00
Dmitri Smirnov
ee221d2de0 Introduce XPRESS compresssion on Windows. (#1081)
Comparable with Snappy on comp ratio.
  Implemented using Windows API, does not require external package.
  Avaiable since Windows 8 and server 2012.
  Use -DXPRESS=1 with CMake to enable.
2016-04-19 22:54:24 -07:00
sdong
4b6833aec1 Rename options.compaction_measure_io_stats to options.report_bg_io_stats and include flush too.
Summary: It is useful to print out IO stats in flush jobs too. Extend options.compaction_measure_io_stats to flush jobs and raname it.

Test Plan: Try db_bench and see the stats are printed out.

Reviewers: yhchiang

Reviewed By: yhchiang

Subscribers: kradhakrishnan, yiwu, IslamAbdelRahman, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56769
2016-04-15 10:22:18 -07:00
Jay Edgar
b345b36620 Add a minimum value for the refill bytes per period value
Summary: If the user specified a small enough value for the rate limiter's bytes per second, the calculation for the number of refill bytes per period could become zero which would effectively cause the server to hang forever.

Test Plan: Existing tests

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56631
2016-04-13 09:01:42 -07:00
sdong
63cf15bb9f Fix option settable tests
Summary: In option settable tests, bytes for pointers are not all skipped, so that they may be the same as the special character and cause false positive.

Test Plan: Run the test. Manually verify the issue is not there any more.

Reviewers: IslamAbdelRahman, andrewkr

Reviewed By: IslamAbdelRahman

Subscribers: kradhakrishnan, yiwu, yhchiang, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56553
2016-04-12 14:03:35 -07:00
sdong
a23c6052c8 Don't run DBOptionsAllFieldsSettable under valgrind
Summary: Test DBOptionsAllFieldsSettable sometimes fails under valgrind. Move option settable tests to a separate test file and disable it in valgrind..

Test Plan: Run valgrind test and make sure the test doesn't run.

Reviewers: andrewkr, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: kradhakrishnan, yiwu, yhchiang, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56529
2016-04-11 14:24:55 -07:00
Andrew Kryczka
114a1b8792 Fix build errors for windows
Summary:
- Need to use unsigned long long for 64-bit literals on windows
- Need size_t for backup meta-file length since clang doesn't let us assign size_t to int

Test Plan: backupable_db_test and options_test

Reviewers: IslamAbdelRahman, yhchiang, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D56391
2016-04-08 13:09:19 -07:00
sdong
1518b733eb Change default number of cache shard bit to be 6 and max_file_opening_threads to be 16.
Summary: Cache shard bit 4 is sometimes too small and 6 is a more common value picked by users. Make that default. It shouldn't hurt much to change options.max_file_opening_threads default to be 16, which will reduce the worst case DB open time.

Test Plan: Run all existing tests.

Reviewers: IslamAbdelRahman, yhchiang, kradhakrishnan, igor, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D55047
2016-04-07 13:55:10 -07:00