Commit Graph

2720 Commits

Author SHA1 Message Date
Andrew Kryczka
b0029bc7fa Test merge op covered by range deletion in memtable
Summary:
It's a test case for #1797. Also got rid of kTypeDeletion in the conditional since we treat it the same as kTypeRangeDeletion.
Closes https://github.com/facebook/rocksdb/pull/1800

Differential Revision: D4451300

Pulled By: ajkr

fbshipit-source-id: b39dda1
2017-01-24 13:39:11 -08:00
Andrew Kryczka
d438e1ec17 Test range deletion block outlives table reader
Summary:
This test ensures RangeDelAggregator can still access blocks even if it outlives the table readers that created them (detailed description in comments).

I plan to optimize away the extra cache lookup we currently do in BlockBasedTable::NewRangeTombstoneIterator(), as it is ~5% CPU in my random read benchmark in a database with 1k tombstones. This test will help make sure nothing breaks in the process.
Closes https://github.com/facebook/rocksdb/pull/1739

Differential Revision: D4375954

Pulled By: ajkr

fbshipit-source-id: aef9357
2017-01-24 13:24:14 -08:00
Andrew Kryczka
9da4d542fe Range deletions unsupported in tailing iterator
Summary:
change the iterator status to NotSupported as soon as a range tombstone
is encountered by a ForwardIterator.
Closes https://github.com/facebook/rocksdb/pull/1593

Differential Revision: D4246294

Pulled By: ajkr

fbshipit-source-id: aef9f49
2017-01-23 13:39:12 -08:00
Hyeonseok Oh
f2b4939da4 fixed typo
Summary:
I fixed exisit -> exist
Closes https://github.com/facebook/rocksdb/pull/1799

Differential Revision: D4451466

Pulled By: yiwu-arbug

fbshipit-source-id: b447c3a
2017-01-23 12:54:13 -08:00
yinqiwen
973f1b78fd memtable: delete merge value for range deletion
Summary: Closes https://github.com/facebook/rocksdb/pull/1797

Differential Revision: D4448004

Pulled By: ajkr

fbshipit-source-id: 3ffc27c
2017-01-23 12:24:14 -08:00
Vitaliy Liptchinsky
753ff84a3d Fix get approx size
Summary:
Fixing GetApproximateSize bug for the case of computing stats for mem tables only.
Closes https://github.com/facebook/rocksdb/pull/1795

Differential Revision: D4445507

Pulled By: IslamAbdelRahman

fbshipit-source-id: 3905846
2017-01-20 15:54:12 -08:00
Jay Lee
537da370da c: allow set savepoint to writebatch
Summary:
Allow set SavePoint to WriteBatch in C ABI.
Closes https://github.com/facebook/rocksdb/pull/1698

Differential Revision: D4378556

Pulled By: yiwu-arbug

fbshipit-source-id: afca746
2017-01-20 13:24:13 -08:00
Changli Gao
5ac97314e7 Fix std::out_of_range when DBOptions::keep_log_file_num is zero
Summary:
We should validate this option, otherwise we may see
std::out_of_range thrown at: db/db_impl.cc:1124

1123     for (unsigned int i = 0; i <= end; i++) {
1124       std::string& to_delete = old_info_log_files.at(i);
1125       std::string full_path_to_delete =
1126           (immutable_db_options_.db_log_dir.empty()
Closes https://github.com/facebook/rocksdb/pull/1722

Differential Revision: D4379495

Pulled By: yiwu-arbug

fbshipit-source-id: e136552
2017-01-20 13:24:12 -08:00
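A minimal sketch of the validation idea in 5ac97314e7 above, using only the standard library. The helper names `IsValidKeepLogFileNum` and `LogsToDelete` are hypothetical, and the way `end` is computed is an assumption, since the quoted snippet does not show it. The point is that rejecting a zero value up front keeps the inclusive loop from ever calling `at()` one past the last element.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical validation helper: zero is the value that triggered the crash.
bool IsValidKeepLogFileNum(std::size_t keep_log_file_num) {
  return keep_log_file_num > 0;
}

// Hypothetical stand-in for the deletion loop quoted in the commit message.
// Assumption: `end` is computed as size() - keep_log_file_num.
std::vector<std::string> LogsToDelete(const std::vector<std::string>& old_logs,
                                      std::size_t keep_log_file_num) {
  if (!IsValidKeepLogFileNum(keep_log_file_num)) {
    throw std::invalid_argument("keep_log_file_num must be greater than 0");
  }
  std::vector<std::string> to_delete;
  if (old_logs.size() < keep_log_file_num) {
    return to_delete;  // Fewer logs than the limit: nothing to delete.
  }
  // With keep_log_file_num >= 1, `end` is at most size() - 1, so the
  // inclusive loop below never reaches at(size()).
  const std::size_t end = old_logs.size() - keep_log_file_num;
  for (std::size_t i = 0; i <= end; i++) {
    to_delete.push_back(old_logs.at(i));
  }
  return to_delete;
}
```

Without the check, a value of zero would make `end` equal to `size()`, and the last iteration of the inclusive loop would throw `std::out_of_range`.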
Shu Zhang
3c0852d1da Make ingest external file backward compatible
Summary: Closes https://github.com/facebook/rocksdb/pull/1783

Differential Revision: D4443463

Pulled By: IslamAbdelRahman

fbshipit-source-id: 39d21d6
2017-01-20 12:09:19 -08:00
Siying Dong
0e8dfd6062 Fix OptimizeForPointLookup()
Summary:
If users directly call OptimizeForPointLookup(), it is broken as the option isn't compatible with parallel memtable insert. Fix it by using memtable bloom filter instead.
Closes https://github.com/facebook/rocksdb/pull/1791

Differential Revision: D4442836

Pulled By: siying

fbshipit-source-id: bf6c9cd
2017-01-20 10:54:12 -08:00
Vitaliy Liptchinsky
e840213d6e Change DB::GetApproximateSizes for more flexibility needed for MyRocks
Summary:
Added an option to GetApproximateSizes to exclude file stats, as MyRocks has those counted exactly and we need only stats from memtables.
Closes https://github.com/facebook/rocksdb/pull/1787

Differential Revision: D4441111

Pulled By: IslamAbdelRahman

fbshipit-source-id: c11f4c3
2017-01-20 09:39:11 -08:00
Yi Wu
9239103cd4 Flush job should release reference current version if sync log failed
Summary:
Fix the bug where, when syncing the log fails, FlushJob::Run() is not executed and
the reference to cfd->current() is not released.
Closes https://github.com/facebook/rocksdb/pull/1792

Differential Revision: D4441316

Pulled By: yiwu-arbug

fbshipit-source-id: 5523e28
2017-01-19 23:09:15 -08:00
Islam AbdelRahman
da54d36a96 Disable IngestExternalFile in ReadOnly mode
Summary:
Disable IngestExternalFile() in read only mode
Closes https://github.com/facebook/rocksdb/pull/1781

Differential Revision: D4439179

Pulled By: IslamAbdelRahman

fbshipit-source-id: b7e46e7
2017-01-19 15:54:19 -08:00
Reid Horuff
5cf176ca15 Fix for 2PC causing WAL to grow too large
Summary:
Consider the following single column family scenario:
prepare in log A
commit in log B
*WAL is too large, flush all CFs to release log A*
*CFA is on log B so we do not see CFA is depending on log A so no flush is requested*

To fix this we must also consider the log containing the prepare section when determining what log a CF is dependent on.
Closes https://github.com/facebook/rocksdb/pull/1768

Differential Revision: D4403265

Pulled By: reidHoruff

fbshipit-source-id: ce800ff
2017-01-19 15:39:12 -08:00
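The scenario in 5cf176ca15 above can be modelled with a small sketch. The struct and function names here are hypothetical and are not RocksDB's internal API. Logs are numbered so that log A is 1 and log B is 2, and a value of 0 is assumed to mean "no outstanding prepare section".

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical per-column-family bookkeeping.
struct ColumnFamilyLogState {
  std::uint64_t memtable_log;  // Oldest log holding this CF's unflushed writes.
  std::uint64_t min_prep_log;  // Oldest log holding a prepare section whose
                               // commit sits in the memtable; 0 if none.
};

// The log a CF really depends on: the older of the two logs above.
std::uint64_t OldestLogNeeded(const ColumnFamilyLogState& cf) {
  if (cf.min_prep_log == 0) {
    return cf.memtable_log;
  }
  return std::min(cf.memtable_log, cf.min_prep_log);
}

// True if the CF must be flushed before `log_to_release` can be deleted.
bool NeedsFlushToRelease(const ColumnFamilyLogState& cf,
                         std::uint64_t log_to_release) {
  return OldestLogNeeded(cf) <= log_to_release;
}
```

With the prepare in log 1 and the commit in log 2, looking only at `memtable_log` would conclude that no flush is needed to release log 1, which is the behavior the commit describes as the bug.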
Andrew Kryczka
f9d18e22d2 Fix DeleteRange file boundary correctness issue with max_compaction_bytes
Summary:
Cockroachdb exposed this bug in #1778. The bug happens when a compaction's output files are ended due to exceeding max_compaction_bytes. In that case we weren't taking into account the next file's start key when deciding how far to extend the current file's max_key. This caused the non-overlapping key-range invariant to be violated.

Note this was correctly handled for the usual case of cutting compaction output, which is file size exceeding max_output_file_size. I am not sure why these are two separate code paths, but we can consider refactoring it to prevent such errors in the future.
Closes https://github.com/facebook/rocksdb/pull/1784

Differential Revision: D4430235

Pulled By: ajkr

fbshipit-source-id: 80af748
2017-01-18 11:54:22 -08:00
Islam AbdelRahman
3ce091fd73 Add KEEP_DB env var option
Summary:
When debugging tests, it's useful to preserve the DB to investigate it and check the logs.
This will allow us to set KEEP_DB=1 to preserve the DB
Closes https://github.com/facebook/rocksdb/pull/1759

Differential Revision: D4393826

Pulled By: IslamAbdelRahman

fbshipit-source-id: 1bff689
2017-01-17 13:54:20 -08:00
Siying Dong
77b4806625 Fix 2PC with concurrent memtable insert
Summary:
If concurrent memtable insert is enabled, and one prepare command and a normal command are grouped into a commit group, the sequence ID will be calculated incorrectly.
Closes https://github.com/facebook/rocksdb/pull/1730

Differential Revision: D4371081

Pulled By: siying

fbshipit-source-id: cd40c6d
2017-01-17 11:24:28 -08:00
Mike Kolupaev
d18dd2c41f Abort compactions more reliably when closing DB
Summary:
DB shutdown aborts running compactions by setting an atomic shutting_down=true that CompactionJob periodically checks. Without this PR it checks it before processing every _output_ value. If compaction filter filters everything out, the compaction is uninterruptible. This PR adds checks for shutting_down on every _input_ value (in CompactionIterator and MergeHelper).

There's also some minor code cleanup along the way.
Closes https://github.com/facebook/rocksdb/pull/1639

Differential Revision: D4306571

Pulled By: yiwu-arbug

fbshipit-source-id: f050890
2017-01-11 15:09:21 -08:00
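A simplified sketch of the idea in d18dd2c41f above: check the shutdown flag on every input rather than on every output, so that a filter that drops everything cannot make the loop uninterruptible. `RunCompaction` and `CompactionResult` are hypothetical names, and the real change lives in CompactionIterator and MergeHelper rather than in a single loop like this.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

struct CompactionResult {
  std::vector<std::string> output;
  std::size_t inputs_processed = 0;
  bool aborted = false;
};

// `keep` plays the role of a compaction filter: it returns true to keep a value.
CompactionResult RunCompaction(
    const std::vector<std::string>& inputs,
    const std::function<bool(const std::string&)>& keep,
    const std::atomic<bool>& shutting_down) {
  CompactionResult result;
  for (const std::string& value : inputs) {
    // Checked once per input, so the loop stops promptly even when
    // no output is ever produced.
    if (shutting_down.load(std::memory_order_acquire)) {
      result.aborted = true;
      break;
    }
    ++result.inputs_processed;
    if (keep(value)) {
      result.output.push_back(value);
    }
  }
  return result;
}
```

If the flag were only consulted when a value is emitted, a filter returning false for every input would run to the end of the input regardless of shutdown.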
Changli Gao
9f246298e2 Performance: Iterate vector by reference
Summary: Closes https://github.com/facebook/rocksdb/pull/1763

Differential Revision: D4398796

Pulled By: yiwu-arbug

fbshipit-source-id: b82636d
2017-01-11 10:54:37 -08:00
Dmitri Smirnov
3c233ca4ea Fix Windows environment issues
Summary:
Enable directIO on WritableFileImpl::Append
     with offset being current length of the file.
     Enable UniqueID tests on Windows, disable others but
     let them compile. Unique tests are valuable to
     detect failures on different filesystems and upcoming
     ReFS.
     Clear output in WinEnv GetChildren. This is different from
     previous strategy, do not touch output on failure.
     Make sure DBTest.OpenWhenOpen works with windows error message
Closes https://github.com/facebook/rocksdb/pull/1746

Differential Revision: D4385681

Pulled By: IslamAbdelRahman

fbshipit-source-id: c07b702
2017-01-09 15:54:12 -08:00
Maysam Yabandeh
d0ba8ec8f9 Revert "PinnableSlice"
Summary:
This reverts commit 54d94e9c2c.

The pull request was landed by mistake.
Closes https://github.com/facebook/rocksdb/pull/1755

Differential Revision: D4391678

Pulled By: maysamyabandeh

fbshipit-source-id: 36d5149
2017-01-08 14:24:12 -08:00
Maysam Yabandeh
54d94e9c2c PinnableSlice
Summary:
Currently the point lookup values are copied to a string provided by the user.
This incurs an extra memcpy cost. This patch allows doing point lookup
via a PinnableSlice which pins the source memory location (instead of
copying their content) and releases them after the content is consumed
by the user. The old API of Get(string) is translated to the new API
underneath.

 Here is the summary for improvements:
 1. value 100 byte: 1.8%  regular, 1.2% merge values
 2. value 1k   byte: 11.5% regular, 7.5% merge values
 3. value 10k byte: 26% regular,    29.9% merge values

 The improvement for merge could be more if we extend this approach to
 pin the merge output and delay the full merge operation until the user
 actually needs it. We have put that for future work.

PS:
Sometimes we observe a small decrease in performance when switching from
t5452014 to this patch but with the old Get(string) API. The difference
is a little and could be noise. More importantly it is safely
cancelled
Closes https://github.com/facebook/rocksdb/pull/1732

Differential Revision: D4374613

Pulled By: maysamyabandeh

fbshipit-source-id: a077f1a
2017-01-08 13:54:13 -08:00
Andrew Kryczka
b104b87814 Maintain position in range deletions map
Summary:
When deletion-collapsing mode is enabled (i.e., for DBIter/CompactionIterator), we maintain position in the tombstone maps across calls to ShouldDelete(). Since iterators often access keys sequentially (or reverse-sequentially), scanning forward/backward from the last position can be faster than binary-searching the map for every key.

- When Next() is invoked on an iterator, we use kForwardTraversal to scan forwards, if needed, until arriving at the range deletion containing the next key.
- Similarly for Prev(), we use kBackwardTraversal to scan backwards in the range deletion map.
- When the iterator seeks, we use kBinarySearch for repositioning
- After tombstones are added or before the first ShouldDelete() invocation, the current position is set to invalid, which forces kBinarySearch to be used.
- Non-iterator users (i.e., Get()) use kFullScan, which has the same behavior as before---scan the whole map for every key passed to ShouldDelete().
Closes https://github.com/facebook/rocksdb/pull/1701

Differential Revision: D4350318

Pulled By: ajkr

fbshipit-source-id: 5129b76
2017-01-05 10:39:12 -08:00
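A rough sketch of the positioning idea in b104b87814 above, assuming a collapsed (non-overlapping) tombstone map keyed by range start. The class name, the `Positioning` enum, and the use of `std::map` are illustrative choices, not RocksDB's actual types, and the kFullScan mode is left out. The cached position is always the first entry whose start key is greater than the last key checked.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

enum class Positioning { kBinarySearch, kForwardTraversal, kBackwardTraversal };

// Each entry maps a start key to the sequence number of the tombstone
// covering [start, next start). Sequence number 0 means "not deleted".
class PositionedTombstoneMap {
 public:
  // For brevity, callers must add ranges that do not overlap each other.
  void AddNonOverlapping(const std::string& start, const std::string& end,
                         std::uint64_t seq) {
    assert(start < end);
    map_[start] = seq;
    map_.emplace(end, 0);  // Keeps an adjacent range that already starts here.
    pos_valid_ = false;    // Forces a binary search on the next lookup.
  }

  // Traversal modes assume keys arrive in the matching order; a caller that
  // jumps around (a seek) should pass kBinarySearch.
  bool ShouldDelete(const std::string& key, std::uint64_t key_seq,
                    Positioning mode) {
    if (!pos_valid_) {
      mode = Positioning::kBinarySearch;
    }
    switch (mode) {
      case Positioning::kBinarySearch:
        pos_ = map_.upper_bound(key);
        break;
      case Positioning::kForwardTraversal:
        while (pos_ != map_.end() && pos_->first <= key) ++pos_;
        break;
      case Positioning::kBackwardTraversal:
        while (pos_ != map_.begin() && std::prev(pos_)->first > key) --pos_;
        break;
    }
    pos_valid_ = true;
    if (pos_ == map_.begin()) {
      return false;  // Key sorts before every tombstone.
    }
    return key_seq < std::prev(pos_)->second;
  }

 private:
  std::map<std::string, std::uint64_t> map_;
  std::map<std::string, std::uint64_t>::iterator pos_;
  bool pos_valid_ = false;
};
```

For sequential access the traversal loops usually move zero or one step, which is where the saving over a per-key binary search would come from.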
siddontang
653ac1f9c6 C API: support total_order_mode
Summary: Closes https://github.com/facebook/rocksdb/pull/1687

Differential Revision: D4349210

Pulled By: IslamAbdelRahman

fbshipit-source-id: 32d0fbd
2017-01-03 18:39:14 -08:00
Adam Retter
85ac1a320a Fix rocksdb::Status::getState
Summary:
This fixes the Java API for Status#getState use in Native code and also simplifies the implementation of rocksdb::Status::getState.
Closes https://github.com/facebook/rocksdb/issues/1688
Closes https://github.com/facebook/rocksdb/pull/1714

Differential Revision: D4364181

Pulled By: yiwu-arbug

fbshipit-source-id: 8e073b4
2017-01-03 18:39:14 -08:00
Islam AbdelRahman
76711b6e77 Make ExternalSSTFileTest::CompactionDeadlock more deterministic
Summary:
It's not always true that `ASSERT_EQ(running_threads.load(), 2);`
Closes https://github.com/facebook/rocksdb/pull/1736

Differential Revision: D4374091

Pulled By: IslamAbdelRahman

fbshipit-source-id: 4f70bbd
2017-01-03 18:09:20 -08:00
Islam AbdelRahman
c963460dbc Fix tests under GCC_481
Summary:
This fixes the issue with tests failing under GCC 481; I am not sure what the exact reason is.
Closes https://github.com/facebook/rocksdb/pull/1735

Differential Revision: D4374094

Pulled By: IslamAbdelRahman

fbshipit-source-id: b3625bc
2017-01-03 17:54:12 -08:00
Vincent Lee
e425ec1162 utilities/backupable: backup should limit the copy size of wal.
Summary:
Since the backup works as a snapshot, we should only copy
 the bytes of the WAL that existed when we got the live files.
Closes https://github.com/facebook/rocksdb/pull/1733

Differential Revision: D4373457

Pulled By: ajkr

fbshipit-source-id: 389318f
2016-12-31 10:54:20 -08:00
Maysam Yabandeh
0712d541d1 Delegate Cleanables
Summary:
Cleanable objects will perform the registered cleanups when
they are destructed. However, we would sometimes rather delay this cleanup, for example when
we are gathering the merge operands. Current approach is to create the
Cleanable object on heap (instead of on stack) and delay deleting it.

By allowing Cleanables to delegate their cleanups to another cleanable
object we can delay the cleaning without needing to create the
cleanable object on heap and keep it around. This patch applies this
technique for the cleanups of BlockIter and shows improved performance
for some in-memory benchmarks:
+1.8% for merge workload, +6.4% for non-merge workload when the merge
operator is specified.
https://our.intern.facebook.com/intern/tasks?t=15168163

Non-merge benchmark:
TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench --benchmarks=fillrandom
--num=1000000 -value_size=100 -compression_type=none

Reading random with no merge operator specified:
TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench
--benchmarks="read
Closes https://github.com/facebook/rocksdb/pull/1711

Differential Revision: D4361163

Pulled By: maysamyabandeh

fbshipit-source-id: 9801e07
2016-12-29 15:54:19 -08:00
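The delegation technique in 0712d541d1 above can be illustrated with a simplified class. This is a sketch, not RocksDB's `Cleanable`: it stores `std::function` callbacks in a vector purely to keep the example short, and only the method names `RegisterCleanup` and `DelegateCleanupsTo` are meant to echo the commit.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

class Cleanable {
 public:
  Cleanable() = default;
  Cleanable(const Cleanable&) = delete;
  Cleanable& operator=(const Cleanable&) = delete;

  // Runs every cleanup still owned by this object.
  ~Cleanable() {
    for (auto& fn : cleanups_) {
      fn();
    }
  }

  void RegisterCleanup(std::function<void()> fn) {
    cleanups_.push_back(std::move(fn));
  }

  // Hands all pending cleanups to `other`. Afterwards this object's
  // destructor has nothing left to run.
  void DelegateCleanupsTo(Cleanable* other) {
    assert(other != nullptr && other != this);
    for (auto& fn : cleanups_) {
      other->cleanups_.push_back(std::move(fn));
    }
    cleanups_.clear();
  }

 private:
  std::vector<std::function<void()>> cleanups_;
};
```

A short-lived stack object (such as an iterator) can then pass its cleanups to a longer-lived one before going out of scope, so nothing needs to be allocated on the heap just to postpone the release.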
Islam AbdelRahman
d58ef52ba6 Allow SstFileWriter to Fadvise the file away from page cache
Summary:
Add `fadvise_trigger` option to `SstFileWriter`

If fadvise_trigger is passed with a non-zero value, SstFileWriter will invalidate the os page cache every `fadvise_trigger` bytes for the sst file
Closes https://github.com/facebook/rocksdb/pull/1731

Differential Revision: D4371246

Pulled By: IslamAbdelRahman

fbshipit-source-id: 91caff1
2016-12-29 15:09:19 -08:00
Siying Dong
17a4b75cc3 Always fsync the file after file copying
Summary:
File copying happens when creating checkpoints and bulkloading files from different FS partition. We should fsync the files when copying them to guarantee durability. A side effect will be that the dirty pages in file system buffers won't grow too large.
Closes https://github.com/facebook/rocksdb/pull/1728

Differential Revision: D4371083

Pulled By: siying

fbshipit-source-id: 579e14c
2016-12-28 19:09:16 -08:00
leipeng
a738af8f84 db/pinned_iterators_manager.h: bugfix
Summary:
std::unique(beg, end) returns an iterator of unique_end, data behind unique_end should not be accessed.
Closes https://github.com/facebook/rocksdb/pull/1726

Differential Revision: D4371076

Pulled By: IslamAbdelRahman

fbshipit-source-id: 5564450
2016-12-28 18:54:57 -08:00
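For reference, the usual idiom behind the a738af8f84 fix above is to treat the iterator returned by `std::unique` as the new logical end and erase everything after it. The helper name `SortAndDeduplicate` is hypothetical, and this is a generic sketch rather than the code in db/pinned_iterators_manager.h.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sorts `items` and removes duplicates. Elements in [unique_end, end()) are
// left in a valid but unspecified state by std::unique, so they are erased
// instead of being read.
template <typename T>
void SortAndDeduplicate(std::vector<T>& items) {
  std::sort(items.begin(), items.end());
  auto unique_end = std::unique(items.begin(), items.end());
  items.erase(unique_end, items.end());
}
```

Iterating only over `[begin(), unique_end)` without erasing would work equally well; what matters is that nothing past `unique_end` is accessed.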
Siying Dong
438f22bc56 Fix bug of Checkpoint loses recent transactions with 2PC
Summary:
If 2PC is enabled, checkpoint may not copy previous log files that contain uncommitted prepare records. In this diff we keep those files.
Closes https://github.com/facebook/rocksdb/pull/1724

Differential Revision: D4368319

Pulled By: siying

fbshipit-source-id: cc2c746
2016-12-28 12:24:16 -08:00
Aaron Gao
972f96b3fb direct io write support
Summary:
rocksdb direct io support

```
[gzh@dev11575.prn2 ~/rocksdb] ./db_bench -benchmarks=fillseq --num=1000000
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB:    version 5.0
Date:       Wed Nov 23 13:17:43 2016
CPU:        40 * Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
CPUCache:   25600 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Write rate: 0 bytes/second
Compression: Snappy
Memtablerep: skip_list
Perf Level: 1
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
DB path: [/tmp/rocksdbtest-112628/dbbench]
fillseq      :       4.393 micros/op 227639 ops/sec;   25.2 MB/s

[gzh@dev11575.prn2 ~/roc
```
Closes https://github.com/facebook/rocksdb/pull/1564

Differential Revision: D4241093

Pulled By: lightmark

fbshipit-source-id: 98c29e3
2016-12-22 13:09:19 -08:00
Islam AbdelRahman
989e644ed8 Remove sst_file_manager option from LITE
Summary:
Remove sst_file_manager option from LITE
Closes https://github.com/facebook/rocksdb/pull/1690

Differential Revision: D4341331

Pulled By: IslamAbdelRahman

fbshipit-source-id: 9f9328d
2016-12-21 17:54:21 -08:00
Islam AbdelRahman
1beef6569a Fix c_test
Summary:
addfile phase in c_test could fail because in previous steps we did a DeleteRange.
Fix the test by simply moving the addfile phase before DeleteRange
Closes https://github.com/facebook/rocksdb/pull/1672

Differential Revision: D4328896

Pulled By: IslamAbdelRahman

fbshipit-source-id: 1d946df
2016-12-21 17:39:14 -08:00
Andrew Kryczka
50e305de98 Collapse range deletions
Summary:
Added a tombstone-collapsing mode to RangeDelAggregator, which eliminates overlap in the TombstoneMap. In this mode, we can check whether a tombstone covers a user key using upper_bound() (i.e., binary search). However, the tradeoff is the overhead to add tombstones is now higher, so at first I've only enabled it for range scans (compaction/flush/user iterators), where we expect a high number of calls to ShouldDelete() for the same tombstones. Point queries like Get() will still use the linear scan approach.

Also in this diff I changed RangeDelAggregator's TombstoneMap to use multimap with user keys instead of map with internal keys. Callers sometimes provided ParsedInternalKey directly, from which it would've required string copying to derive an internal key Slice with which we could search the map.
Closes https://github.com/facebook/rocksdb/pull/1614

Differential Revision: D4270397

Pulled By: ajkr

fbshipit-source-id: 93092c7
2016-12-19 16:54:12 -08:00
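A compact sketch of the collapsing idea in 50e305de98 above. It is an assumption-laden simplification: it uses a `std::map` from start key to sequence number rather than the multimap mentioned in the commit, and the class and method names are illustrative. Overlapping tombstones are merged on insertion so that a lookup needs only one `upper_bound()` call.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

// Each entry maps a start key to the highest tombstone sequence number
// covering [start, next start). Sequence number 0 means "not deleted".
class CollapsedTombstoneMap {
 public:
  // Adds a tombstone over [start, end) and merges it with existing ones.
  void Add(const std::string& start, const std::string& end,
           std::uint64_t seq) {
    if (!(start < end)) {
      return;  // Empty range: nothing to record.
    }
    SplitAt(end);
    SplitAt(start);
    // Both boundaries now exist as keys, so the loop ends at `end`.
    for (auto it = map_.find(start); it->first != end; ++it) {
      it->second = std::max(it->second, seq);
    }
  }

  // True if a tombstone newer than `key_seq` covers `user_key`.
  bool ShouldDelete(const std::string& user_key,
                    std::uint64_t key_seq) const {
    auto it = map_.upper_bound(user_key);
    if (it == map_.begin()) {
      return false;
    }
    return key_seq < std::prev(it)->second;
  }

 private:
  // Ensures `key` is an interval boundary without changing coverage.
  void SplitAt(const std::string& key) {
    auto it = map_.upper_bound(key);
    std::uint64_t covering =
        (it == map_.begin()) ? 0 : std::prev(it)->second;
    map_.emplace(key, covering);  // No-op if the boundary already exists.
  }

  std::map<std::string, std::uint64_t> map_;
};
```

The tradeoff named in the commit is visible here: `Add` walks and rewrites every interval inside the new range, while `ShouldDelete` is a single logarithmic lookup.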
Yi Wu
5d1457dbbf Dump persistent cache options
Summary:
Dump persistent cache options
Closes https://github.com/facebook/rocksdb/pull/1679

Differential Revision: D4337019

Pulled By: yiwu-arbug

fbshipit-source-id: 3812f8a
2016-12-19 14:09:12 -08:00
Daniel Black
342370f1d3 Simplify MemTable::Update
Summary:
As suggested by testn in #1650

The Add is at the end of the function. Having a fallthrough
will result in it being added twice.
Closes https://github.com/facebook/rocksdb/pull/1676

Differential Revision: D4331906

Pulled By: yiwu-arbug

fbshipit-source-id: 895c4a0
2016-12-17 00:09:13 -08:00
Ding Ma
1a136c1f13 Expose file size
Summary:
Add a new function to SstFileWriter that tells the user how big their file is right now.
Closes https://github.com/facebook/rocksdb/pull/1686

Differential Revision: D4338868

Pulled By: mdyuki1016

fbshipit-source-id: c1ee16a
2016-12-16 18:39:12 -08:00
Andrew Kryczka
fbff4628a9 Reduce compaction iterator status checks
Summary:
seems it's expensive to check status since the underlying merge iterator checks status of all its children. so only do it when it's really necessary to get the status before invoking Next(), i.e., when we're advancing to get the first key in the next file.
Closes https://github.com/facebook/rocksdb/pull/1691

Differential Revision: D4343446

Pulled By: siying

fbshipit-source-id: 70ab315
2016-12-16 17:39:09 -08:00
Daniel Black
816c1e30ca gcc-7 requires include <functional> for std::function
Summary:
Fixes compile error:

In file included from ./util/statistics.h:17:0,
                 from ./util/stop_watch.h:8,
                 from ./util/perf_step_timer.h:9,
                 from ./util/iostats_context_imp.h:8,
                 from ./util/posix_logger.h:27,
                 from ./port/util_logger.h:18,
                 from ./db/auto_roll_logger.h:15,
                 from db/auto_roll_logger.cc:6:
./util/thread_local.h:65:16: error: 'function' in namespace 'std' does not name a template type
   typedef std::function<void(void*, void*)> FoldFunc;
Closes https://github.com/facebook/rocksdb/pull/1656

Differential Revision: D4318702

Pulled By: yiwu-arbug

fbshipit-source-id: 8c5d17a
2016-12-16 11:24:18 -08:00
Yi Wu
c270735861 Iterator should be in corrupted status if merge operator return false
Summary:
Iterator should be in corrupted status if merge operator returns false.
Also add test to make sure if max_successive_merges is hit during write,
data will not be lost.
Closes https://github.com/facebook/rocksdb/pull/1665

Differential Revision: D4322695

Pulled By: yiwu-arbug

fbshipit-source-id: b327b05
2016-12-16 11:09:16 -08:00
siddontang
8f5d24ae68 C API: support get usage and pinned_usage for cache
Summary: Closes https://github.com/facebook/rocksdb/pull/1671

Differential Revision: D4327453

Pulled By: yiwu-arbug

fbshipit-source-id: bcdbc65
2016-12-15 17:24:17 -08:00
Daniel Black
cfc34d7c4e Missing break in case in DBTestBase::CurrentOptions
Summary:
Found by gcc-7 compile error.

This appeared to be a fault as these options seems too different.
Closes https://github.com/facebook/rocksdb/pull/1667

Differential Revision: D4324174

Pulled By: yiwu-arbug

fbshipit-source-id: 0f65383
2016-12-13 18:39:14 -08:00
Daniel Black
bfbcec2339 Gcc 7 error expansion to defined
Summary:
sorry if these gcc-7/clang-4 cleanups are getting tedious.
Closes https://github.com/facebook/rocksdb/pull/1658

Differential Revision: D4318792

Pulled By: yiwu-arbug

fbshipit-source-id: 8e85891
2016-12-13 18:39:14 -08:00
Daniel Black
67adc937b6 intentional fallthrough (prevents gcc-7/clang-4 error)
Summary:
db/memtable.cc: In member function 'void rocksdb::MemTable::Update(rocksdb::SequenceNumber, const rocksdb::Slice&, const rocksdb::Slice&)':
db/memtable.cc:736:11: error: this statement may fall through [-Werror=implicit-fallthrough=]
           }
           ^
db/memtable.cc:738:9: note: here
         default:
         ^~~~~~~
cc1plus: all warnings being treated as errors

closes #1650
Closes https://github.com/facebook/rocksdb/pull/1655

Differential Revision: D4318696

Pulled By: yiwu-arbug

fbshipit-source-id: 1a8981c
2016-12-13 14:39:17 -08:00
Islam AbdelRahman
1a146f89c7 break Flush wait for dropped CF
Summary:
In FlushJob we don't do the Flush if the CF is dropped
https://github.com/facebook/rocksdb/blob/master/db/flush_job.cc#L184-L188

but inside WaitForFlushMemTable we keep waiting forever even if the CF is dropped.
Closes https://github.com/facebook/rocksdb/pull/1664

Differential Revision: D4321032

Pulled By: IslamAbdelRahman

fbshipit-source-id: 6e2b25d
2016-12-13 14:09:12 -08:00
Yi Wu
36d42e65d0 Disable test to unblock travis build
Summary:
The two tests keep failing in travis. Disable them and will fix later.
Closes https://github.com/facebook/rocksdb/pull/1648

Differential Revision: D4316389

Pulled By: yiwu-arbug

fbshipit-source-id: 0a370e7
2016-12-13 11:54:14 -08:00
siddontang
b57dd9262a C API: support writebatch delete range
Summary:
It seems that writebatch delete range works now, so I added a C API for later use.

Btw, can we use this feature in production now?
Closes https://github.com/facebook/rocksdb/pull/1647

Differential Revision: D4314534

Pulled By: ajkr

fbshipit-source-id: e835165
2016-12-13 11:24:18 -08:00
Islam AbdelRahman
2ba59b5a1e Disallow ingesting files into dropped CFs
Summary:
This PR updates IngestExternalFile to return an error if we try to ingest a file into a dropped CF.

Right now, if IngestExternalFile wants to flush a memtable while ingesting a file into a dropped CF, it will wait forever, since flushing is not possible for the dropped CF.
Closes https://github.com/facebook/rocksdb/pull/1657

Differential Revision: D4318657

Pulled By: IslamAbdelRahman

fbshipit-source-id: ed6ea2b
2016-12-13 00:54:14 -08:00
Jonathan Lee
2cabdb8f44 Increase buffer size
Summary:
When compiling with GCC>=7.0.0, "db/internal_stats.cc" fails to compile as the data being written to the buffer potentially exceeds its size.

This fix simply doubles the size of the buffer, thus accommodating the max possible data size.
Closes https://github.com/facebook/rocksdb/pull/1635

Differential Revision: D4302162

Pulled By: yiwu-arbug

fbshipit-source-id: c76ad59
2016-12-09 11:54:22 -08:00
Jonathan Lee
4a17b47bb5 Remove unnecessary header include
Summary:
Remove "util/testharness.h" from list of includes for "db/db_filesnapshot.cc", as it wasn't being used and thus caused an extraneous dependency on gtest.
Closes https://github.com/facebook/rocksdb/pull/1634

Differential Revision: D4302146

Pulled By: yiwu-arbug

fbshipit-source-id: e900c0b
2016-12-09 11:54:21 -08:00
Mike Kolupaev
8c2b921fdf Fixed a crash in debug build in flush_job.cc
Summary:
It was doing `&range_del_iters[0]` on an empty vector. Even though the resulting pointer is never dereferenced, it's still bad for two reasons:
* the practical reason: it crashes with `std::out_of_range` exception in our debug build,
* the "C++ standard lawyer" reason: it's undefined behavior because, in `std::vector` implementation, it probably "dereferences" a null pointer, which is invalid even though it doesn't actually read the pointed memory, just converts a pointer into a reference (and then flush_job.cc converts it back to pointer); nullptr references are undefined behavior.
Closes https://github.com/facebook/rocksdb/pull/1612

Differential Revision: D4265625

Pulled By: al13n321

fbshipit-source-id: db26fb9
2016-12-09 10:39:12 -08:00
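One way to avoid the pattern described in 8c2b921fdf above is to never index into an empty vector when only a pointer is needed. The helper name `DataOrNull` is hypothetical, and this is not necessarily how the commit fixed it.

```cpp
#include <cassert>
#include <vector>

// Returns a pointer to the first element, or nullptr for an empty vector,
// without ever evaluating v[0] on an empty container.
template <typename T>
T* DataOrNull(std::vector<T>& v) {
  return v.empty() ? nullptr : &v[0];
}
```

Since C++11, `v.data()` is also well defined on an empty vector, although the pointer it returns there is not guaranteed to be null and must not be dereferenced.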
Islam AbdelRahman
20ce081fae Fix issue where IngestExternalFile insert blocks in block cache with g_seqno=0
Summary:
When we ingest an external file, we open it to read some metadata and the first/last key.
While doing that, we insert blocks into the block cache with global_seqno = 0.

If we move the file (rather than copy it) into the DB, we will use these blocks with the wrong seqno in the read path.
Closes https://github.com/facebook/rocksdb/pull/1627

Differential Revision: D4293332

Pulled By: yiwu-arbug

fbshipit-source-id: 3ce5523
2016-12-08 13:39:18 -08:00
zhangjinpeng1987
45c7ce1377 CompactRangeOptions C API
Summary:
Add C API for CompactRangeOptions.
Closes https://github.com/facebook/rocksdb/pull/1596

Differential Revision: D4252339

Pulled By: yiwu-arbug

fbshipit-source-id: f768f93
2016-12-07 17:54:14 -08:00
Andrew Kryczka
b821984d31 DeleteRange read path end-to-end tests
Summary: Closes https://github.com/facebook/rocksdb/pull/1592

Differential Revision: D4246260

Pulled By: ajkr

fbshipit-source-id: ce03fa2
2016-12-07 12:54:17 -08:00
Artemiy Kolesnikov
2f4fc539c6 Compaction::IsTrivialMove relaxing
Summary:
IsTrivialMove returns true if no input file overlaps with output_level+1 with more than max_compaction_bytes_ bytes.
Closes https://github.com/facebook/rocksdb/pull/1619

Differential Revision: D4278338

Pulled By: yiwu-arbug

fbshipit-source-id: 994c001
2016-12-07 11:54:11 -08:00
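The relaxed condition in 2f4fc539c6 above can be sketched as follows. The `FileRange` struct and both function names are hypothetical, and the sketch models only the overlap check stated in the summary, not the other conditions the real Compaction::IsTrivialMove may apply.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct FileRange {
  std::string smallest;  // Smallest user key in the file.
  std::string largest;   // Largest user key in the file (inclusive).
  std::uint64_t size;    // File size in bytes.
};

// Total size of the files in `next_level` whose key range intersects `f`.
std::uint64_t OverlappingBytes(const FileRange& f,
                               const std::vector<FileRange>& next_level) {
  std::uint64_t total = 0;
  for (const FileRange& g : next_level) {
    const bool disjoint = g.largest < f.smallest || f.largest < g.smallest;
    if (!disjoint) {
      total += g.size;
    }
  }
  return total;
}

// True if no single input file overlaps more than `max_compaction_bytes`
// of data in the level after the output level.
bool IsTrivialMoveAllowed(const std::vector<FileRange>& inputs,
                          const std::vector<FileRange>& next_level,
                          std::uint64_t max_compaction_bytes) {
  for (const FileRange& f : inputs) {
    if (OverlappingBytes(f, next_level) > max_compaction_bytes) {
      return false;
    }
  }
  return true;
}
```

Checking each input file on its own, rather than the combined input, is what allows multi-file inputs to qualify as long as every file individually stays under the limit.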
Islam AbdelRahman
ed8fbdb560 Add EventListener::OnExternalFileIngested() event
Summary:
Add EventListener::OnExternalFileIngested() to allow user to subscribe to external file ingestion events
Closes https://github.com/facebook/rocksdb/pull/1623

Differential Revision: D4285844

Pulled By: IslamAbdelRahman

fbshipit-source-id: 0b95a88
2016-12-06 14:09:17 -08:00
Mike Kolupaev
beb36d9c1e Fixed CompactionFilter::Decision::kRemoveAndSkipUntil
Summary:
Embarrassingly enough, the first time I tried to use my new feature in logdevice it crashed with this assertion failure:

  db/pinned_iterators_manager.h:30: void rocksdb::PinnedIteratorsManager::StartPinning(): Assertion `pinning_enabled == false' failed

The issue was that `pinned_iters_mgr_.StartPinning()` was called but `pinned_iters_mgr_.ReleasePinnedData()` wasn't.
Closes https://github.com/facebook/rocksdb/pull/1611

Differential Revision: D4265622

Pulled By: al13n321

fbshipit-source-id: 747b10f
2016-12-05 15:24:11 -08:00
Islam AbdelRahman
67f37cf198 Allow user to specify a CF for SST files generated by SstFileWriter
Summary:
Allow user to explicitly specify that the generated file by SstFileWriter will be ingested in a specific CF.
This allows us to persist the CF id in the generated file
Closes https://github.com/facebook/rocksdb/pull/1615

Differential Revision: D4270422

Pulled By: IslamAbdelRahman

fbshipit-source-id: 7fb954e
2016-12-05 14:24:16 -08:00
Anton Safonov
9053fe2a5c Made delete_obsolete_files_period_micros option dynamic
Summary:
Made delete_obsolete_files_period_micros option dynamic. It can be updated using DB::SetDBOptions().
Closes https://github.com/facebook/rocksdb/pull/1595

Differential Revision: D4246569

Pulled By: tonek

fbshipit-source-id: d23f560
2016-12-05 14:24:16 -08:00
Islam AbdelRahman
edde954e7b fix clang build
Summary:
override is missing for FilterV2
Closes https://github.com/facebook/rocksdb/pull/1606

Differential Revision: D4263832

Pulled By: IslamAbdelRahman

fbshipit-source-id: d8b337a
2016-12-01 18:39:10 -08:00
Islam AbdelRahman
e39d080871 Fix travis (compile for clang < 3.9)
Summary:
Travis fails because it uses clang 3.6, which doesn't recognize
`__attribute__((__no_sanitize__("undefined")))`
Closes https://github.com/facebook/rocksdb/pull/1601

Differential Revision: D4257175

Pulled By: IslamAbdelRahman

fbshipit-source-id: fb4d1ab
2016-12-01 10:09:22 -08:00
fangchenliaohui
b77007df8b Bug: parallel_group status updated in WriteThread::CompleteParallelWorker
Summary:
Multiple write threads may update the status of the parallel_group in
WriteThread::CompleteParallelWorker if the status of a Writer is not ok.
When copying the write status to the parallel_group, the write thread holds only the
mutex of the writer it is processing itself, which is useless. The thread
should hold the mutex of the leader of the parallel_group instead.
Closes https://github.com/facebook/rocksdb/pull/1598

Differential Revision: D4252335

Pulled By: siying

fbshipit-source-id: 3864cf7
2016-12-01 09:54:11 -08:00
Mike Kolupaev
247d0979aa Support for range skips in compaction filter
Summary:
This adds the ability for compaction filter to say "drop this key-value, and also drop everything up to key x". This will cause the compaction to seek input iterator to x, without reading the data. This can make compaction much faster when large consecutive chunks of data are filtered out. See the changes in include/rocksdb/compaction_filter.h for the new API.

Along the way this diff also adds ability for compaction filter changing merge operands, similar to how it can change values; we're not going to use this feature, it just seemed easier and cleaner to implement it than to document that it's not implemented :)

The diff is not as big as it may seem, about half of the lines are a test.
Closes https://github.com/facebook/rocksdb/pull/1599

Differential Revision: D4252092

Pulled By: al13n321

fbshipit-source-id: 41e1e48
2016-12-01 07:09:15 -08:00
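To illustrate the "drop this key-value, and also drop everything up to key x" behavior from 247d0979aa above, here is a self-contained sketch over a sorted `std::map`. The `Decision` enum echoes the naming used in the related kRemoveAndSkipUntil fix, but the `Filter` signature, `FilterStats`, and `Compact` are hypothetical and do not reproduce the API in include/rocksdb/compaction_filter.h. Treating a skip target that is not past the current key as a plain removal is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>

enum class Decision { kKeep, kRemove, kRemoveAndSkipUntil };

// The filter may set *skip_until when returning kRemoveAndSkipUntil.
using Filter = std::function<Decision(const std::string& key,
                                      const std::string& value,
                                      std::string* skip_until)>;

struct FilterStats {
  std::map<std::string, std::string> output;
  std::size_t entries_read = 0;  // Entries actually passed to the filter.
};

FilterStats Compact(const std::map<std::string, std::string>& input,
                    const Filter& filter) {
  FilterStats stats;
  auto it = input.begin();
  while (it != input.end()) {
    ++stats.entries_read;
    std::string skip_until;
    const Decision d = filter(it->first, it->second, &skip_until);
    if (d == Decision::kRemoveAndSkipUntil && skip_until > it->first) {
      // Seek straight to the first key >= skip_until instead of visiting
      // every entry in between.
      it = input.lower_bound(skip_until);
      continue;
    }
    if (d == Decision::kKeep) {
      stats.output.insert(*it);
    }
    ++it;
  }
  return stats;
}
```

The `entries_read` counter makes the benefit observable: entries inside a skipped range are never handed to the filter.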
Panagiotis Ktistakis
96fcefbf1d c api: expose option for dynamic level size target
Summary: Closes https://github.com/facebook/rocksdb/pull/1587

Differential Revision: D4245923

Pulled By: yiwu-arbug

fbshipit-source-id: 6ee7291
2016-11-30 11:24:14 -08:00
zhangjinpeng1987
00197cff39 Add C API to set base_background_compactions
Summary:
Add C API to set base_background_compactions
Closes https://github.com/facebook/rocksdb/pull/1571

Differential Revision: D4245709

Pulled By: yiwu-arbug

fbshipit-source-id: 792c6b8
2016-11-30 11:09:13 -08:00
Andrew Kryczka
5b219eccb5 deleterange end-to-end test improvements for lite/robustness
Summary: Closes https://github.com/facebook/rocksdb/pull/1591

Differential Revision: D4246019

Pulled By: ajkr

fbshipit-source-id: 0c4aa37
2016-11-29 12:24:13 -08:00
Andrew Kryczka
e333528991 DeleteRange write path end-to-end tests
Summary: Closes https://github.com/facebook/rocksdb/pull/1578

Differential Revision: D4241171

Pulled By: ajkr

fbshipit-source-id: ce5fd83
2016-11-29 11:09:22 -08:00
Siying Dong
7784980fcd Fix mis-reporting of compaction read bytes to the base level
Summary:
In dynamic leveled compaction, when calculating read bytes, output level bytes may be wrongly calculated as input level inputs. Fix it.
Closes https://github.com/facebook/rocksdb/pull/1475

Differential Revision: D4148412

Pulled By: siying

fbshipit-source-id: f2f475a
2016-11-29 11:09:22 -08:00
Islam AbdelRahman
3c6b49ed66 Fix implicit conversion from int64_t to int
Summary:
Make conversion explicit, implicit conversion breaks the build
Closes https://github.com/facebook/rocksdb/pull/1589

Differential Revision: D4245158

Pulled By: IslamAbdelRahman

fbshipit-source-id: aaec00d
2016-11-29 10:54:15 -08:00
Siying Dong
b3b875657f Remove unused assignment in db/db_iter.cc
Summary:
"make analyze" complains the assignment is not useful. Remove it.
Closes https://github.com/facebook/rocksdb/pull/1581

Differential Revision: D4241697

Pulled By: siying

fbshipit-source-id: 178f67a
2016-11-29 09:09:14 -08:00
Andrew Kryczka
4f6e89b1d0 Fix range deletion covering key in same SST file
Summary:
AddTombstones() needs to be before t->Get(), oops :'(
Closes https://github.com/facebook/rocksdb/pull/1576

Differential Revision: D4241041

Pulled By: ajkr

fbshipit-source-id: 781ceea
2016-11-28 22:54:13 -08:00
Islam AbdelRahman
a2bf265a39 Avoid intentional overflow in GetL0ThresholdSpeedupCompaction
Summary:
99c052a34f fixes the integer overflow in GetL0ThresholdSpeedupCompaction() by checking whether the int becomes negative.
UBSAN will still complain because this is still an overflow. We can fix the issue by simply using int64_t.
Closes https://github.com/facebook/rocksdb/pull/1582

Differential Revision: D4241525

Pulled By: IslamAbdelRahman

fbshipit-source-id: b3ae21f
2016-11-28 18:39:13 -08:00
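The widening approach from a2bf265a39 can be illustrated with a minimal sketch. `DoubleClamped` is a hypothetical helper, not the body of GetL0ThresholdSpeedupCompaction(); it assumes `int` is narrower than 64 bits and only shows the pattern of doing the arithmetic in `int64_t` and clamping, rather than overflowing and testing for a negative result afterwards.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: doubles a non-negative trigger value without signed
// overflow by computing in int64_t and clamping back into the int range.
int DoubleClamped(int trigger) {
  assert(trigger >= 0);
  const int64_t twice = static_cast<int64_t>(trigger) * 2;
  const int64_t max_int = std::numeric_limits<int>::max();
  return static_cast<int>(twice > max_int ? max_int : twice);
}
```

Because the multiplication never overflows, there is nothing for UBSAN to flag, and the clamp replaces the "did it go negative" check.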
Islam AbdelRahman
52fd1ff2c2 disable UBSAN for functions with intentional -ve shift / overflow
Summary:
disable UBSAN for functions with intentional left shift on -ve number / overflow

These functions are
rocksdb::Hash
FixedLengthColBufEncoder::Append
FaultInjectionTest::Key
Closes https://github.com/facebook/rocksdb/pull/1577

Differential Revision: D4240801

Pulled By: IslamAbdelRahman

fbshipit-source-id: 3e1caf6
2016-11-28 17:54:12 -08:00
Islam AbdelRahman
1886c435b9 Fix CompactionJob::Install division by zero
Summary:
Fix CompactionJob::Install division by zero
Closes https://github.com/facebook/rocksdb/pull/1580

Differential Revision: D4240794

Pulled By: IslamAbdelRahman

fbshipit-source-id: 7286721
2016-11-28 16:54:16 -08:00
Islam AbdelRahman
13e66a8f51 Fix compaction_job.cc division by zero
Summary:
Fix division by zero in compaction_job.cc
Closes https://github.com/facebook/rocksdb/pull/1575

Differential Revision: D4240818

Pulled By: IslamAbdelRahman

fbshipit-source-id: a8bc757
2016-11-28 16:39:13 -08:00
Andrew Kryczka
01eabf7375 Fix double-counted deletion stat
Summary:
Both the single deletion and the value are included in compaction outputs, so no need to update the stat for the value's deletion yet, otherwise it'd be double-counted.
Closes https://github.com/facebook/rocksdb/pull/1574

Differential Revision: D4241181

Pulled By: ajkr

fbshipit-source-id: c9aaa15
2016-11-28 15:54:12 -08:00
Andrew Kryczka
7ffb10fc1a DeleteRange compaction statistics
Summary:
- "rocksdb.compaction.key.drop.range_del" - number of keys dropped during compaction due to a range tombstone covering them
- "rocksdb.compaction.range_del.drop.obsolete" - number of range tombstones dropped due to compaction to bottom level and no snapshot saving them
- s/CompactionIteratorStats/CompactionIterationStats/g since this class is no longer specific to CompactionIterator -- it's also updated for range tombstone iteration during compaction
- Move the above class into a separate .h file to avoid circular dependency.
Closes https://github.com/facebook/rocksdb/pull/1520

Differential Revision: D4187179

Pulled By: ajkr

fbshipit-source-id: 10c2103
2016-11-28 11:54:12 -08:00
Mike Kolupaev
236d4c67e9 Less linear search in DBIter::Seek() when keys are overwritten a lot
Summary:
In one deployment we saw high latencies (presumably from slow iterator operations) and a lot of CPU time reported by perf with this stack:

```
  rocksdb::MergingIterator::Next
  rocksdb::DBIter::FindNextUserEntryInternal
  rocksdb::DBIter::Seek
```

I think what's happening is:
1. we create a snapshot iterator,
2. we do lots of Put()s for the same key x; this creates lots of entries in memtable,
3. we seek the iterator to a key slightly smaller than x,
4. the seek walks over lots of entries in memtable for key x, skipping them because of high sequence numbers.

CC IslamAbdelRahman
Closes https://github.com/facebook/rocksdb/pull/1413

Differential Revision: D4083879

Pulled By: IslamAbdelRahman

fbshipit-source-id: a83ddae
2016-11-28 10:24:11 -08:00
Siying Dong
cd7c4143d7 Improve Write Stalling System
Summary:
The current write stalling system lacks positive feedback when the restricted rate is already too low, so users sometimes get stuck at a very low slowdown value. With this diff, we add positive feedback (increasing the slowdown value) when we recover from the slowdown state back to normal. To keep the positive feedback from pushing the slowdown value too high, we issue negative feedback every time we get close to the stop condition. Experiments show it is easier to reach a relative balance than before.

Also increase the level0_stop_writes_trigger default from 24 to 32. Since the level0_slowdown_writes_trigger default is 20, a stop trigger of 24 leaves only four files of buffer in which to slow down writes. To avoid hitting the stop condition within four files once 20 files have accumulated, the slowdown value must be very low, which is almost the same as a stop. It also does not give the slowdown value enough time to converge. Increasing it to 32 smooths out the system.
Closes https://github.com/facebook/rocksdb/pull/1562

Differential Revision: D4218519

Pulled By: siying

fbshipit-source-id: 95e4088
2016-11-23 09:24:15 -08:00
Yi Wu
dfb6fe6755 Unified InlineSkipList::Insert algorithm with hinting
Summary:
This PR is based on nbronson's diff with small
modifications to wire it up with the existing interface. Compared to the
previous version, this approach works better for inserting keys in
decreasing order or updating the same key, and imposes fewer restrictions
on the prefix extractor.

---- Summary from original diff ----

This diff introduces a single InlineSkipList::Insert that unifies
the existing sequential insert optimization (prev_), concurrent insertion,
and insertion using externally-managed insertion point hints.

There's a deep symmetry between insertion hints (cursors) and the
concurrent algorithm.  In both cases we have partial information from
the recent past that is likely but not certain to be accurate.  This diff
introduces the struct InlineSkipList::Splice, which encodes predecessor
and successor information in the same form that was previously only used
within a single call to InsertConcurrently.  Splice holds information
about an insertion point that can be used to levera
Closes https://github.com/facebook/rocksdb/pull/1561

Differential Revision: D4217283

Pulled By: yiwu-arbug

fbshipit-source-id: 33ee437
2016-11-22 14:09:13 -08:00
Andrew Kryczka
734e4acafb Eliminate redundant cache lookup with range deletion
Summary:
When we introduced range deletion block, TableCache::Get() and TableCache::NewIterator() each did two table cache lookups, one for range deletion block iterator and another for getting the table reader to which the Get()/NewIterator() is delegated. This extra cache lookup was very CPU-intensive (about 10% overhead in a read-heavy benchmark). We can avoid it by reusing the Cache::Handle created for range deletion block iterator to get the file reader.
Closes https://github.com/facebook/rocksdb/pull/1537

Differential Revision: D4201167

Pulled By: ajkr

fbshipit-source-id: d33ffd8
2016-11-21 21:24:11 -08:00
Maysam Yabandeh
182b940e70 Add WriteOptions.no_slowdown
Summary:
If the WriteOptions.no_slowdown flag is set AND we need to wait or sleep for
the write request, then fail immediately with Status::Incomplete().
Closes https://github.com/facebook/rocksdb/pull/1527

Differential Revision: D4191405

Pulled By: maysamyabandeh

fbshipit-source-id: 7f3ce3f
2016-11-21 18:09:13 -08:00
Karthikeyan Radhakrishnan
4118e13330 Persistent Cache: Expose stats to user via public API
Summary:
Exposing persistent cache stats (counters) to the user via public API.
Closes https://github.com/facebook/rocksdb/pull/1485

Differential Revision: D4155274

Pulled By: siying

fbshipit-source-id: 30a9f50
2016-11-21 17:39:13 -08:00
Andrew Kryczka
fd43ee09da Range deletion microoptimizations
Summary:
- Made RangeDelAggregator's InternalKeyComparator member a reference-to-const so we don't need to copy-construct it. Also added InternalKeyComparator to ImmutableCFOptions so we don't need to construct one for each DBIter.
- Made MemTable::NewRangeTombstoneIterator and the table readers' NewRangeTombstoneIterator() functions return nullptr instead of NewEmptyInternalIterator to avoid the allocation. Updated callers accordingly.
Closes https://github.com/facebook/rocksdb/pull/1548

Differential Revision: D4208169

Pulled By: ajkr

fbshipit-source-id: 2fd65cf
2016-11-21 12:24:13 -08:00
Andrew Kryczka
fe349db57b Remove Arena in RangeDelAggregator
Summary:
The Arena construction/destruction introduced significant overhead to read-heavy workload just by creating empty vectors for its blocks, so avoid it in RangeDelAggregator.
Closes https://github.com/facebook/rocksdb/pull/1547

Differential Revision: D4207781

Pulled By: ajkr

fbshipit-source-id: 9d1c130
2016-11-19 14:24:12 -08:00
Andrew Kryczka
3f62215210 Lazily initialize RangeDelAggregator's map and pinning manager
Summary:
Since a RangeDelAggregator is created for each read request, these heap-allocating member variables were consuming significant CPU (~3% total) which slowed down request throughput. The map and pinning manager are only necessary when range deletions exist, so we can defer their initialization until the first range deletion is encountered. Currently lazy initialization is done for reads only since reads pass us a single snapshot, which is easier to store on the stack for later insertion into the map than the vector passed to us by flush or compaction.

Note the Arena member variable is still expensive, I will figure out what to do with it in a subsequent diff. It cannot be lazily initialized because we currently use this arena even to allocate empty iterators, which is necessary even when no range deletions exist.
Closes https://github.com/facebook/rocksdb/pull/1539

Differential Revision: D4203488

Pulled By: ajkr

fbshipit-source-id: 3b36279
2016-11-18 17:09:11 -08:00
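The lazy-initialization pattern from 3f62215210 can be sketched as follows. `LazyTombstoneAggregator` is a hypothetical stand-in, not RangeDelAggregator itself; it uses plain string keys and a `std::multimap`, and only shows how the heap-allocated container is deferred until the first range deletion arrives.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for an aggregator that is created once per read.
class LazyTombstoneAggregator {
 public:
  // Allocates the map only when the first tombstone is added.
  void AddTombstone(const std::string& begin, const std::string& end) {
    assert(begin < end);
    if (!map_) {
      map_.reset(new std::multimap<std::string, std::string>());
    }
    map_->emplace(begin, end);
  }

  // Returns true if some [begin, end) tombstone covers key. When no
  // tombstone was ever added, this is a single null check.
  bool ShouldDelete(const std::string& key) const {
    if (!map_) return false;
    for (const auto& t : *map_) {
      if (t.first <= key && key < t.second) return true;
    }
    return false;
  }

  bool Initialized() const { return map_ != nullptr; }

 private:
  std::unique_ptr<std::multimap<std::string, std::string>> map_;
};
```

Reads that never encounter a range deletion pay for one pointer member and nothing else, which matches the motivation given in the summary.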
Andrew Kryczka
635a7bd1ad refactor TableCache Get/NewIterator for single exit points
Summary:
these functions were too complicated to change with exit points everywhere, so refactored them.

btw, please review urgently, this is a prereq to fix the 5.0 perf regression
Closes https://github.com/facebook/rocksdb/pull/1534

Differential Revision: D4198972

Pulled By: ajkr

fbshipit-source-id: 04ebfb7
2016-11-17 14:39:13 -08:00
Siying Dong
a4eb7387b2 Allow plain table to store index on file with bloom filter disabled
Summary:
Currently plain table bloom filter is required if storing metadata on file. Remove the constraint.
Closes https://github.com/facebook/rocksdb/pull/1525

Differential Revision: D4190977

Pulled By: siying

fbshipit-source-id: be60442
2016-11-17 11:09:13 -08:00
Yi Wu
36e4762ce0 Remove Ticker::SEQUENCE_NUMBER
Summary:
Remove the ticker count because:
* Having to reset the ticker count in WriteImpl is inefficient;
* It doesn't make sense to have it as a ticker count if multiple DB
  instances share a statistics object.
Closes https://github.com/facebook/rocksdb/pull/1531

Differential Revision: D4194442

Pulled By: yiwu-arbug

fbshipit-source-id: e2110a9
2016-11-16 22:39:09 -08:00
Andrew Kryczka
760ef68a69 fix deleterange asan issue
Summary:
pinned_iters_mgr_ pins iterators allocated with arena_, so we should order the
instance variable declarations such that the pinned iterators have their destructors
executed before the arena is destroyed.
Closes https://github.com/facebook/rocksdb/pull/1528

Differential Revision: D4191984

Pulled By: ajkr

fbshipit-source-id: 1386f20
2016-11-16 14:09:07 -08:00
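The ordering rule that 760ef68a69 relies on is standard C++: non-static data members are destroyed in reverse order of declaration. A minimal sketch, with hypothetical `Tracer` and `Holder` types standing in for the real arena and pinning manager:

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Records its name on destruction so the destruction order is observable.
struct Tracer {
  Tracer(std::string n, std::vector<std::string>* l)
      : name(std::move(n)), log(l) {
    assert(log != nullptr);
  }
  ~Tracer() { log->push_back(name); }
  std::string name;
  std::vector<std::string>* log;
};

// Hypothetical holder: the arena is declared first, so it is destroyed last.
// The pinning manager is declared after it, so it is destroyed first, while
// the memory it refers to is still alive.
struct Holder {
  explicit Holder(std::vector<std::string>* log)
      : arena("arena", log), pinned_iters_mgr("pinned_iters_mgr", log) {}
  Tracer arena;
  Tracer pinned_iters_mgr;
};

// Builds and destroys a Holder, returning the order of destructor calls.
std::vector<std::string> DestructionOrder() {
  std::vector<std::string> log;
  { Holder h(&log); }
  return log;
}
```

Swapping the two member declarations in `Holder` would reverse the recorded order, which is the kind of use-after-free ASAN reports.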
Siying Dong
972e3ff295 Enable allow_concurrent_memtable_write and enable_write_thread_adaptive_yield by default
Summary: Closes https://github.com/facebook/rocksdb/pull/1496

Differential Revision: D4168080

Pulled By: siying

fbshipit-source-id: 056ae62
2016-11-16 09:39:09 -08:00
Yi Wu
1543d5d92e Report memory usage by memtable insert hints map.
Summary:
It is hard to measure the actual memory usage of std containers. Even
providing a custom allocator will miscount some of the usage. Here we
only make a rough guess at its memory usage.
Closes https://github.com/facebook/rocksdb/pull/1511

Differential Revision: D4179945

Pulled By: yiwu-arbug

fbshipit-source-id: 32ab929
2016-11-15 20:24:13 -08:00
Andrew Kryczka
48e8baebc0 Decouple data iterator and range deletion iterator in TableCache
Summary:
Previously we used TableCache::NewIterator() for multiple purposes (data
block iterator and range deletion iterator), and returned non-ok status in
the data block iterator. In one case where the caller only used the range
deletion block iterator (9e7cf3469b/db/version_set.cc (L965-L973)),
we didn't check/free the data block iterator containing non-ok status, which
caused a valgrind error.

So, this diff decouples creation of data block and range deletion block iterators,
and updates the callers accordingly. Both functions can return non-ok status
in an InternalIterator. Since the non-ok status is returned in an iterator that the
callers will definitely use, it should be more usable/less error-prone.
Closes https://github.com/facebook/rocksdb/pull/1513

Differential Revision: D4181423

Pulled By: ajkr

fbshipit-source-id: 835b8f5
2016-11-15 17:24:28 -08:00
Andrew Kryczka
661e4c9267 DeleteRange unsupported in non-block-based tables
Summary:
Return an error from DeleteRange() (or Write() if the user is using the
low-level WriteBatch API) if an unsupported table type is configured.
Closes https://github.com/facebook/rocksdb/pull/1519

Differential Revision: D4185933

Pulled By: ajkr

fbshipit-source-id: abcdf84
2016-11-15 15:24:16 -08:00
Andrew Kryczka
489d142808 DeleteRange interface
Summary:
Expose DeleteRange() interface since we think the implementation is functionally correct now.
Closes https://github.com/facebook/rocksdb/pull/1503

Differential Revision: D4171921

Pulled By: ajkr

fbshipit-source-id: 5e21c98
2016-11-15 15:24:16 -08:00
Islam AbdelRahman
eba99c28e4 Fix min_write_buffer_number_to_merge = 0 bug
Summary:
It's possible that we set min_write_buffer_number_to_merge to 0.
This should never happen
Closes https://github.com/facebook/rocksdb/pull/1515

Differential Revision: D4183356

Pulled By: yiwu-arbug

fbshipit-source-id: c9d39d7
2016-11-15 13:54:08 -08:00
Artemiy Kolesnikov
91300d01f6 Dynamic max_total_wal_size option
Summary: Closes https://github.com/facebook/rocksdb/pull/1509

Differential Revision: D4176426

Pulled By: yiwu-arbug

fbshipit-source-id: b57689d
2016-11-14 22:54:17 -08:00
Andrew Kryczka
ec2f64794b Consider subcompaction boundaries when updating file boundaries for range deletion
Summary:
Adjusted AddToBuilder() to take lower_bound and upper_bound, which serve two purposes: (1) only range deletions overlapping with the interval [lower_bound, upper_bound) will be added to the output file, and (2) the output file's boundaries will not be extended before lower_bound or after upper_bound. Our computation of lower_bound/upper_bound considers both subcompaction boundaries and previous/next files within the subcompaction.

Test cases are here (level subcompactions: https://gist.github.com/ajkr/63c7eae3e9667c5ebdc0a7efb74ac332, and universal subcompactions: https://gist.github.com/ajkr/5a62af77c4ebe4052a1955c496d51fdb) but can't be included in this diff as they depend on committing the API first. They fail before this change and pass after.
Closes https://github.com/facebook/rocksdb/pull/1501

Reviewed By: yhchiang

Differential Revision: D4171685

Pulled By: ajkr

fbshipit-source-id: ee99db8
2016-11-14 20:24:21 -08:00
Andrew Kryczka
3b192f6186 Handle full final subcompaction output file with range deletions
Summary:
This conditional should only open a new file that's dedicated to range deletions when it's the sole output of the subcompaction. Previously, we created such a file whenever the table builder was nullptr, which would've also been the case whenever the CompactionIterator's final key coincided with the final output table becoming full.
Closes https://github.com/facebook/rocksdb/pull/1507

Differential Revision: D4174613

Pulled By: ajkr

fbshipit-source-id: 9ffacea
2016-11-14 17:54:20 -08:00
Andrew Kryczka
6c57952002 Make range deletion inclusive-exclusive
Summary:
This makes it easier to implement future optimizations like range collapsing.
Closes https://github.com/facebook/rocksdb/pull/1504

Differential Revision: D4172214

Pulled By: ajkr

fbshipit-source-id: ac4942f
2016-11-14 17:39:13 -08:00
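One way to see why inclusive-exclusive ranges help with collapsing, as 6c57952002 suggests, is a small sketch. `RangeTombstone`, `Adjacent`, and `Collapse` here are hypothetical and operate on plain strings; they are not the RocksDB types.

```cpp
#include <cassert>
#include <string>

// Hypothetical range tombstone over user keys: covers [begin, end).
struct RangeTombstone {
  std::string begin;
  std::string end;
  bool Covers(const std::string& key) const {
    return begin <= key && key < end;
  }
};

// With half-open ranges, two tombstones touch exactly when a.end == b.begin,
// and no key is covered by both or by neither at the boundary.
bool Adjacent(const RangeTombstone& a, const RangeTombstone& b) {
  return a.end == b.begin;
}

// Merges two adjacent tombstones into one without computing a "previous key".
RangeTombstone Collapse(const RangeTombstone& a, const RangeTombstone& b) {
  assert(Adjacent(a, b));
  return RangeTombstone{a.begin, b.end};
}
```

With inclusive end keys, the same merge would need a notion of "the key immediately before b.begin", which is not well defined for arbitrary byte strings.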
Yi Wu
1ea79a78c9 Optimize sequential insert into memtable - Part 1: Interface
Summary:
Currently our skip-list has an optimization to speed up sequential
inserts from a single stream, by remembering the last insert position.
We extend the idea to support sequential inserts from multiple streams,
and even tolerate small reordering within each stream.

This PR is the interface part adding the following:
- Add `memtable_insert_prefix_extractor` to allow specifying prefix for each key.
- Add `InsertWithHint()` interface to memtable, to allow the underlying
  implementation to return a hint of the insert position, which can later be
  passed back to optimize inserts.
- Memtable will maintain a map from prefix to hints and pass the hint
  via `InsertWithHint()` if `memtable_insert_prefix_extractor` is non-null.
Closes https://github.com/facebook/rocksdb/pull/1419

Differential Revision: D4079367

Pulled By: yiwu-arbug

fbshipit-source-id: 3555326
2016-11-13 19:09:18 -08:00
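The per-prefix hint map described in 1ea79a78c9 can be sketched with standard containers. This is an analogy under stated assumptions, not the memtable code: a `std::set` replaces the skip-list, hinted `insert` replaces `InsertWithHint()`, a fixed-length prefix replaces `memtable_insert_prefix_extractor`, and `HintedSet` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical model: one remembered insert position per key prefix.
class HintedSet {
 public:
  explicit HintedSet(std::size_t prefix_len) : prefix_len_(prefix_len) {
    assert(prefix_len > 0);
  }

  void Insert(const std::string& key) {
    const std::string prefix = key.substr(0, prefix_len_);
    auto hint = hints_.find(prefix);
    std::set<std::string>::iterator pos;
    if (hint == hints_.end()) {
      // First key of this stream: ordinary insert.
      pos = keys_.insert(key).first;
    } else {
      // Suggest the slot right after the stream's previous key. A wrong
      // hint only costs time; the set still places the key correctly.
      pos = keys_.insert(std::next(hint->second), key);
    }
    hints_[prefix] = pos;
  }

  std::vector<std::string> Keys() const {
    return std::vector<std::string>(keys_.begin(), keys_.end());
  }

 private:
  std::size_t prefix_len_;
  std::set<std::string> keys_;
  // Set iterators stay valid across inserts, so they can be stored as hints.
  std::map<std::string, std::set<std::string>::iterator> hints_;
};
```

Interleaving two streams such as a1, b1, a2, b2 keeps each stream's hint accurate, whereas a single "last position" hint would be invalidated on every switch between streams.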
Yi Wu
df5eeb85ca Optimize sequential insert into memtable - Part 2: Implementation
Summary:
Implement an insert hint in the skip-list to hint the insert position. This is
to optimize for write workloads with multiple streams of
sequential writes. For example, there is a stream of keys a1, a2,
a3... but also b1, b2, b3... Each stream is not necessarily strictly
sequential, and can be reordered a little. The user can specify a prefix
extractor, and the `SkipListRep` can thus maintain a hint for each
stream for fast inserts into the memtable.

This is the internal implementation part. See #1419 for the interface part.
See inline comments for details.
Closes https://github.com/facebook/rocksdb/pull/1449

Differential Revision: D4106781

Pulled By: yiwu-arbug

fbshipit-source-id: f4d48c4
2016-11-13 13:09:16 -08:00
Islam AbdelRahman
5ed650857d Fix SstFileWriter destructor
Summary:
If the user did not call SstFileWriter::Finish(), or called Finish() but it failed,
we need to abandon the builder to avoid destructing it while it is still open.
Closes https://github.com/facebook/rocksdb/pull/1502

Differential Revision: D4171660

Pulled By: IslamAbdelRahman

fbshipit-source-id: ab6f434
2016-11-12 20:11:19 -08:00
Lijun Tang
adb665e0bf Allowed delayed_write_rate option to be dynamically set.
Summary: Closes https://github.com/facebook/rocksdb/pull/1488

Differential Revision: D4157784

Pulled By: siying

fbshipit-source-id: f150081
2016-11-12 15:54:11 -08:00
Maysam Yabandeh
361010d447 Exporting compaction stats in the form of a map
Summary:
Currently the compaction stats are printed to stdout. We want to export the compaction stats in a map format so that the upper layer apps (e.g., MySQL) can present
the stats in any format they require.
Closes https://github.com/facebook/rocksdb/pull/1477

Differential Revision: D4149836

Pulled By: maysamyabandeh

fbshipit-source-id: b3df19f
2016-11-11 20:54:14 -08:00
Aaron Gao
b39b2ee12f do not call get() in recovery mode
Summary:
This is a previous fix that has a typo
Closes https://github.com/facebook/rocksdb/pull/1487

Differential Revision: D4157381

Pulled By: lightmark

fbshipit-source-id: f079be8
2016-11-10 11:24:20 -08:00
Reid Horuff
1ca5f6d132 Fix 2PC Recovery SeqId Miscount
Summary:
Originally, sequence ids were calculated in recovery based on the first seqid found in the first log recovered. The working seqid was then incremented from that value for every insertion that took place. This was faulty because of the potential for missing log files or inserts that skipped the WAL. The current recovery scheme grabs the sequence from the batch currently being recovered and increments it using memtableinserter to track how many actual inserts take place. This works for 2PC batches as well as for scenarios where some logs are missing or inserts skipped the WAL.
Closes https://github.com/facebook/rocksdb/pull/1486

Differential Revision: D4156064

Pulled By: reidHoruff

fbshipit-source-id: a6da8d9
2016-11-10 11:09:22 -08:00
Andrew Kryczka
c90fef88b1 fix open failure with empty wal
Summary: Closes https://github.com/facebook/rocksdb/pull/1490

Differential Revision: D4158821

Pulled By: IslamAbdelRahman

fbshipit-source-id: 59b73f4
2016-11-09 22:24:26 -08:00
Andrew Kryczka
4e20c5da20 Store internal keys in TombstoneMap
Summary:
This fixes a correctness issue where ranges with same begin key would overwrite each other.

This diff uses InternalKey as TombstoneMap's key such that all tombstones have unique keys even when their start keys overlap. We also update TombstoneMap to use an internal key comparator.

End-to-end tests pass and are here (https://gist.github.com/ajkr/851ffe4c1b8a15a68d33025be190a7d9) but cannot be included yet since the DeleteRange() API is yet to be checked in. Note both tests failed before this fix.
Closes https://github.com/facebook/rocksdb/pull/1484

Differential Revision: D4155248

Pulled By: ajkr

fbshipit-source-id: 304b4b9
2016-11-09 15:09:18 -08:00
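The overwrite problem fixed in 4e20c5da20 can be reproduced with ordinary maps. The types below are hypothetical models, not the RocksDB `InternalKey` or `TombstoneMap`; the comparator follows the usual internal-key ordering of user key ascending, then sequence number descending.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical range tombstone with a sequence number.
struct Tombstone {
  std::string begin;
  std::string end;
  uint64_t seq;
};

// Hypothetical model of an internal key: user key plus sequence number.
struct InternalKeyModel {
  std::string user_key;
  uint64_t seq;
};

// Orders by user key ascending, then by sequence number descending.
struct InternalKeyLess {
  bool operator()(const InternalKeyModel& a, const InternalKeyModel& b) const {
    if (a.user_key != b.user_key) return a.user_key < b.user_key;
    return a.seq > b.seq;
  }
};

// Keyed by begin user key only: tombstones sharing a begin key collide.
std::size_t CountKeyedByUserKey(const std::vector<Tombstone>& ts) {
  std::map<std::string, Tombstone> m;
  for (const auto& t : ts) {
    assert(t.begin < t.end);
    m[t.begin] = t;  // a later tombstone silently replaces an earlier one
  }
  return m.size();
}

// Keyed by (begin, seq): every tombstone keeps its own entry.
std::size_t CountKeyedByInternalKey(const std::vector<Tombstone>& ts) {
  std::map<InternalKeyModel, Tombstone, InternalKeyLess> m;
  for (const auto& t : ts) {
    assert(t.begin < t.end);
    m[InternalKeyModel{t.begin, t.seq}] = t;
  }
  return m.size();
}
```

For two tombstones that both begin at `a` but end at different keys, the first function keeps one entry and the second keeps both, which is the correctness difference the summary describes.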
Yueh-Hsuan Chiang
a9fb346e4a Fix RocksDB Lite build failure in c_test.cc
Summary:
Fix the following RocksDB Lite build failure in c_test.cc

db/c_test.c:1051:3: error: implicit declaration of function 'fprintf' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
  fprintf(stderr, "SKIPPED\n");
  ^
db/c_test.c:1051:3: error: declaration of built-in function 'fprintf' requires inclusion of the header <stdio.h> [-Werror,-Wbuiltin-requires-header]
db/c_test.c:1051:11: error: use of undeclared identifier 'stderr'
  fprintf(stderr, "SKIPPED\n");
          ^
3 errors generated.
Closes https://github.com/facebook/rocksdb/pull/1479

Differential Revision: D4151160

Pulled By: yhchiang

fbshipit-source-id: a471a30
2016-11-09 12:24:18 -08:00
Reid Horuff
d133b08f68 Use correct sequence number when creating memtable
Summary:
copied from: 5ebfd2623a

Opening existing RocksDB attempts recovery from log files, which uses
wrong sequence number to create the memtable. This is a regression
introduced in change a400336.

This change includes a test demonstrating the problem, without the fix
the test fails with "Operation failed. Try again.: Transaction could not
check for conflicts for operation at SequenceNumber 1 as the MemTable
only contains changes newer than SequenceNumber 2.  Increasing the value
of the max_write_buffer_number_to_maintain option could reduce the
frequency of this error"

This change is a joint effort by Peter 'Stig' Edwards thatsafunnyname
and me.
Closes https://github.com/facebook/rocksdb/pull/1458

Differential Revision: D4143791

Pulled By: reidHoruff

fbshipit-source-id: 5a25033
2016-11-09 12:24:17 -08:00
Islam AbdelRahman
9bd191d2f4 Fix deadlock between (WriterThread/Compaction/IngestExternalFile)
Summary:
A deadlock is possible if the following happens:

(1) Writer thread is stopped because it's waiting for compaction to finish
(2) Compaction is waiting for current IngestExternalFile() calls to finish
(3) IngestExternalFile() is waiting to be able to acquire the writer thread
(4) WriterThread is held by stopped writes that are waiting for compactions to finish

This patch fixes the issue by not incrementing num_running_ingest_file_ except when we acquire the writer thread.

This patch includes a unit test to reproduce the described scenario.
Closes https://github.com/facebook/rocksdb/pull/1480

Differential Revision: D4151646

Pulled By: IslamAbdelRahman

fbshipit-source-id: 09b39db
2016-11-09 10:54:10 -08:00
Islam AbdelRahman
193221e0a1 Fix Forward Iterator Seek()/SeekToFirst()
Summary:
In ForwardIterator::SeekInternal(), we may end up passing an empty Slice representing an internal key to InternalKeyComparator::Compare.
When we try to extract the user key from this empty Slice, we create a slice with size = 0 - 8, which overflows and causes us to read invalid memory as well.

Scenarios to reproduce these issues are in the unit tests
Closes https://github.com/facebook/rocksdb/pull/1467

Differential Revision: D4136660

Pulled By: lightmark

fbshipit-source-id: 151e128
2016-11-08 13:54:31 -08:00
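The "size = 0 - 8" problem in 193221e0a1 comes from unsigned arithmetic on an empty key. A minimal sketch, using `std::string` in place of Slice; `UncheckedUserKeySize` and `ExtractUserKeyChecked` are hypothetical helpers, and the guard shown is one possible defence rather than the fix the commit applied:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Internal keys end with an 8-byte trailer after the user key.
constexpr std::size_t kTrailerSize = 8;

// What the unguarded arithmetic computes: for an empty key, 0 - 8 wraps
// around to a huge unsigned value instead of going negative.
std::size_t UncheckedUserKeySize(const std::string& internal_key) {
  return internal_key.size() - kTrailerSize;
}

// Guarded extraction: refuses keys too short to hold the trailer.
bool ExtractUserKeyChecked(const std::string& internal_key,
                           std::string* user_key) {
  assert(user_key != nullptr);
  if (internal_key.size() < kTrailerSize) return false;
  user_key->assign(internal_key, 0, internal_key.size() - kTrailerSize);
  return true;
}
```

Using the wrapped size as a length is what leads to reading far past the end of the buffer.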
Aaron Gao
e48f3f8b9e remove tabs and duplicate #include in c api
Summary:
fix lint error about tabs and duplicate includes.
Closes https://github.com/facebook/rocksdb/pull/1476

Differential Revision: D4149646

Pulled By: lightmark

fbshipit-source-id: 2e0a632
2016-11-08 13:54:31 -08:00
Jay Lee
a7875272d7 c: support seek_for_prev
Summary:
support seek_for_prev in c abi.
Closes https://github.com/facebook/rocksdb/pull/1457

Differential Revision: D4135360

Pulled By: lightmark

fbshipit-source-id: 61256b0
2016-11-08 12:54:13 -08:00
Andrew Kryczka
9e7cf3469b DeleteRange user iterator support
Summary:
Note: reviewed in  https://reviews.facebook.net/D65115

- DBIter maintains a range tombstone accumulator. We don't cleanup obsolete tombstones yet, so if the user seeks back and forth, the same tombstones would be added to the accumulator multiple times.
- DBImpl::NewInternalIterator() (used to make DBIter's underlying iterator) adds memtable/L0 range tombstones, L1+ range tombstones are added on-demand during NewSecondaryIterator() (see D62205)
- DBIter uses ShouldDelete() when advancing to check whether keys are covered by range tombstones
Closes https://github.com/facebook/rocksdb/pull/1464

Differential Revision: D4131753

Pulled By: ajkr

fbshipit-source-id: be86559
2016-11-04 12:09:22 -07:00
Andrew Kryczka
f998c9790f DeleteRange Get support
Summary:
During Get()/MultiGet(), build up a RangeDelAggregator with range
tombstones as we search through live memtable, immutable memtables, and
SST files. This aggregator is then used by memtable.cc's SaveValue() and
GetContext::SaveValue() to check whether keys are covered.

added tests for Get on memtables/files; end-to-end tests mainly in https://reviews.facebook.net/D64761
Closes https://github.com/facebook/rocksdb/pull/1456

Differential Revision: D4111271

Pulled By: ajkr

fbshipit-source-id: 6e388d4
2016-11-03 18:54:20 -07:00
zhangjinpeng1987
879f366366 Add C api for RateLimiter
Summary:
Add C api for RateLimiter.
Closes https://github.com/facebook/rocksdb/pull/1455

Differential Revision: D4116362

Pulled By: yiwu-arbug

fbshipit-source-id: cb05a8d
2016-11-03 11:09:17 -07:00
Yi Wu
437942e481 Add avoid_flush_during_shutdown DB option
Summary:
Add avoid_flush_during_shutdown DB option.
Closes https://github.com/facebook/rocksdb/pull/1451

Differential Revision: D4108643

Pulled By: yiwu-arbug

fbshipit-source-id: abdaf4d
2016-11-02 15:39:18 -07:00
Benoit Girard
2b16d664cb Change max_bytes_for_level_multiplier to double
Summary: Closes https://github.com/facebook/rocksdb/pull/1427

Differential Revision: D4094732

Pulled By: yiwu-arbug

fbshipit-source-id: b9b79e9
2016-11-01 21:09:23 -07:00
Jay Lee
16fb04434f expose IngestExternalFile to c abi
Summary:
IngestExternalFile is very useful when doing bulk load. This PR exposes this API to C so that many bindings can benefit from it too.
Closes https://github.com/facebook/rocksdb/pull/1454

Differential Revision: D4113420

Pulled By: yiwu-arbug

fbshipit-source-id: 307c6ae
2016-11-01 17:09:39 -07:00
Andrew Kryczka
40a2e406f8 DeleteRange flush support
Summary:
Changed BuildTable() (used for flush) to (1) add range
tombstones to the aggregator, which is used by CompactionIterator to
determine which keys can be removed; and (2) add aggregator's range
tombstones to the table that is output for the flush.
Closes https://github.com/facebook/rocksdb/pull/1438

Differential Revision: D4100025

Pulled By: ajkr

fbshipit-source-id: cb01a70
2016-10-31 20:54:18 -07:00
Vladislav Vaintroub
d5555d95a3 Fix MSVC compile error in 32 bit compilation
Summary:
Passing std::atomic<uint64_t> variables to ASSERT_EQ()
results in compile error
C2718 'const T1': actual parameter with requested alignment of 8 won't be aligned.

VS2015 defines std::atomic as specially aligned type ( with 'alignas'),
however the compiler does not like declspec(align)ed  function
arguments.

Worked around by casting std::atomic<uint64_t> types to uint64_t
in ASSERT_EQ.
Closes https://github.com/facebook/rocksdb/pull/1450

Differential Revision: D4106788

Pulled By: yiwu-arbug

fbshipit-source-id: 5fb42c3
2016-10-31 17:24:18 -07:00
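The workaround in d5555d95a3 amounts to handing a plain integer, not the atomic object, to the comparison. A minimal sketch: `EqualByValue` is a hypothetical stand-in for whatever function the test macro forwards its arguments to, and `CounterEquals` is likewise illustrative.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a comparison helper behind a test macro.
template <typename T1, typename T2>
bool EqualByValue(const T1& a, const T2& b) {
  return a == b;
}

// Converts the atomic to a plain uint64_t before comparing, so the helper
// is only ever instantiated with ordinary integer types.
bool CounterEquals(const std::atomic<uint64_t>& counter, uint64_t expected) {
  return EqualByValue(expected, static_cast<uint64_t>(counter.load()));
}
```

Since the helper never sees a `std::atomic` argument, the specially aligned type does not appear in any function parameter.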
Siying Dong
da61f348d3 Print compression and Fast CRC support info as Header level
Summary:
Currently the compression support and fast CRC support information is printed at info level. It should be at the same level as options, which is header level.

Also add ZSTD to this printing.
Closes https://github.com/facebook/rocksdb/pull/1448

Differential Revision: D4106608

Pulled By: yiwu-arbug

fbshipit-source-id: cb9a076
2016-10-31 16:09:13 -07:00
Siying Dong
c90c48d3c8 Show More DB Stats in info logs
Summary:
DB Stats now are truncated if there are too many CFs. Extend the buffer size to allow more to be printed out. Also, separate out malloc to another log line.
Closes https://github.com/facebook/rocksdb/pull/1439

Differential Revision: D4100943

Pulled By: yiwu-arbug

fbshipit-source-id: 79f7218
2016-10-29 16:09:18 -07:00
Siying Dong
1b295ac8ae DBTest.GetThreadStatus: Wait for test results for longer
Summary:
The current 10 millisecond wait for test results may not be sufficient in some test environments. Increase it to 60 seconds and check the results every 1 millisecond.

Already reviewed: https://reviews.facebook.net/D65457
Closes https://github.com/facebook/rocksdb/pull/1437

Differential Revision: D4099443

Pulled By: siying

fbshipit-source-id: cf1f205
2016-10-29 16:09:18 -07:00
Aaron Gao
b50a81a2bb Add a test for tailing_iterator
Summary:
A bug that tailingIterator->Seek(target) skips records.

I think the bug is in the SeekInternal starting at lines 387:
search_left_bound > search_right_bound
There are only 2 cases this can happen:
(1) target key is smaller than left most file
(2) target key is larger than right most file

The comment is wrong: there is another possibility. At the higher level there may be a gap big enough that the file in the lower level fits completely inside it, and then
indexer->GetNextLevelIndex returns search_left_bound > search_right_bound, I think pointing at the files after and before the gap.
details: https://github.com/facebook/rocksdb/issues/1372

Fixed this bug and added a test case.
Closes https://github.com/facebook/rocksdb/pull/1436

Reviewed By: IslamAbdelRahman

Differential Revision: D4099313

Pulled By: lightmark

fbshipit-source-id: 6a675b3
2016-10-28 18:24:14 -07:00
Siying Dong
04751d5345 L0 compression should follow options.compression_per_level if not empty
Summary:
Currently, we don't use options.compression_per_level[0] as the compression type for L0 unless it is None. This behavior doesn't look intentional. This diff makes sure L0 is compressed using the style of options.compression_per_level[0].

Reviewed and accepted in: https://reviews.facebook.net/D65607
Closes https://github.com/facebook/rocksdb/pull/1435

Differential Revision: D4099368

Pulled By: siying

fbshipit-source-id: cfbbdcd
2016-10-28 17:39:20 -07:00
Andrew Kryczka
2946cadc46 Improve RangeDelAggregator documentation
Summary:
as requested in D62259
Closes https://github.com/facebook/rocksdb/pull/1434

Differential Revision: D4099047

Pulled By: ajkr

fbshipit-source-id: a258cfb
2016-10-28 15:54:21 -07:00
Aaron Gao
bc429de490 revert fractional cascading in forward iterator
Summary: As discussed offline with Siying, revert this since it has a bug with seek.

Test Plan: make check -j64

Reviewers: yiwu, andrewkr, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D65559
2016-10-28 10:25:39 -07:00
Andrew Kryczka
b9bc7a2aa4 Use skiplist rep for range tombstone memtable
Summary: somehow missed committing this update in D62217

Test Plan: make check

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D65361
2016-10-27 10:07:28 -07:00
Siying Dong
9ee84067f6 Disable DBTest.RepeatedWritesToSameKey (#1420)
Summary:
The verification condition of the test DBTest.RepeatedWritesToSameKey doesn't hold anymore after 3ce3bb3da2.
Disable the test for now until we find a way to replace it.

Test Plan: Run the test and make sure it is disabled.
2016-10-25 10:23:50 -07:00
Aaron Gao
9de2f75216 revert Prev() in MergingIterator to use previous code in non-prefix-seek mode
Summary: Siying suggested keeping the old code for normal-mode Prev() for safety.

Test Plan: make check -j64

Reviewers: yiwu, andrewkr, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D65439
2016-10-24 13:13:01 -07:00
sdong
24495186da DBSSTTest.RateLimitedDelete: not to use real clock
Summary: Using the real clock causes failures of DBSSTTest.RateLimitedDelete in some cases. Stop relying on real time and use fake time instead.

Test Plan: Run the tests and all existing tests.

Reviewers: yiwu, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D65145
2016-10-24 10:35:00 -07:00
sdong
1168cb810a Fix a bug that may cause a deleted row to appear again
Summary:
The previous fix for a deleted row reappearing (0ce258f9b3) missed a corner case, which can be reproduced using the test CompactionPickerTest.OverlappingUserKeys7. Consider this example:

input level file: 1[B E] 2[F H]
output level file: 3[A C] 4[D I] 5[I K]

First, file 2 is picked, which overlaps with file 4. File 4 expands to file 5. Now the whole range is [D K], with 2 output level files. When we try to expand that, [D K] overlaps with files 1 and 2 in the input level, and files 1 and 2 overlap with files 3 and 4 in the output level. So we end up picking 3 and 4 in the output level. Without expanding, that is also 2 files, so we conclude the output level doesn't change, although they are two different files.

The fix is to expand the output level files after we picked 3 and 4. In that case, there will be three output level files so we will abort the expanding.

I also added two unit tests related to marked_for_compaction and being_compacted. They have been passing though.
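
The example above can be replayed with a simplified model, sketched here under stated assumptions: keys are single characters, ranges are inclusive, and `FileRange` and `PickOverlapping` are hypothetical names for illustration rather than RocksDB's CompactionPicker code.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Simplified file metadata: an id plus an inclusive [smallest, largest] range.
struct FileRange {
  int id;
  char smallest;
  char largest;
};

// Returns the sorted ids of files in `level` that overlap [lo, hi].
// When `expand` is true, the range is widened to cover every picked file
// and the scan repeats until no further file is pulled in.
std::vector<int> PickOverlapping(const std::vector<FileRange>& level,
                                 char lo, char hi, bool expand) {
  std::vector<int> picked;
  bool changed = true;
  while (changed) {
    changed = false;
    for (const FileRange& f : level) {
      const bool already =
          std::find(picked.begin(), picked.end(), f.id) != picked.end();
      if (already || f.largest < lo || f.smallest > hi) {
        continue;  // Already picked, or no overlap with the current range.
      }
      picked.push_back(f.id);
      if (expand) {
        if (f.smallest < lo) { lo = f.smallest; changed = true; }
        if (f.largest > hi) { hi = f.largest; changed = true; }
      }
    }
  }
  std::sort(picked.begin(), picked.end());
  return picked;
}
```

In this model, starting from file 2 [F H] with expansion picks output files 4 and 5. Re-picking the output level for the widened input range [B H] yields files 3 and 4 without expansion (the same count of two, which hid the bug) and files 3, 4 and 5 with expansion, which is the count of three that makes the expansion abort.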

Test Plan: Run the new unit test, as well as all other tests.

Reviewers: andrewkr, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: yoshinorim, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D65373
2016-10-24 09:49:07 -07:00
Edouard A
99c052a34f Fix integer overflow in GetL0ThresholdSpeedupCompaction (#1378) 2016-10-23 18:43:29 -07:00
Aaron Gao
59a7c0337b Change ioptions to store user_comparator, fix bug
Summary:
Change ioptions.comparator to user_comparator instead of internal_comparator.
Also change Comparator* to InternalKeyComparator* to make its type explicit.

Test Plan: make all check -j64

Reviewers: andrewkr, sdong, yiwu

Reviewed By: yiwu

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D65121
2016-10-21 11:31:42 -07:00
Islam AbdelRahman
869ae5d786 Support IngestExternalFile (remove AddFile restrictions)
Summary:
Changes in the diff

API changes:
- Introduce IngestExternalFile to replace AddFile (I think this makes the API clearer)
- Introduce IngestExternalFileOptions (This struct will encapsulate the options for ingesting the external file)
- Deprecate AddFile() API

Logic changes:
- If our file overlaps with the memtable, we will flush the memtable
- We will find the first level in the LSM tree whose keys overlap with our file's key range
- We will find the lowest level in the LSM tree above the level we found in step 2 that our file can fit in, and ingest our file into it
- We will assign a global sequence number to our new file
- Remove AddFile restrictions by using global sequence numbers
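
The level-selection steps above can be sketched roughly as follows. This is a simplified model under loud assumptions: `KeyRange`, `Level`, `Overlaps` and `PickIngestLevel` are hypothetical names, "can fit in" is reduced to "does not overlap any file", and a file that overlaps L0 is simply placed in L0. The actual ExternalSstFileIngestionJob may apply further checks.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Inclusive user-key range covered by one SST file (hypothetical type).
struct KeyRange {
  std::string smallest;
  std::string largest;
};

// One LSM level, modeled as the key ranges of its files.
using Level = std::vector<KeyRange>;

bool Overlaps(const KeyRange& a, const KeyRange& b) {
  return !(a.largest < b.smallest || b.largest < a.smallest);
}

// Scans from L0 downward and returns the deepest level that lies above
// the first level whose files overlap `file`. If no level overlaps, the
// bottommost level is returned. Assumption: a file that overlaps L0
// goes to L0, since L0 files may overlap each other.
int PickIngestLevel(const std::vector<Level>& lsm, const KeyRange& file) {
  int target = 0;
  for (int level = 0; level < static_cast<int>(lsm.size()); ++level) {
    for (const KeyRange& existing : lsm[level]) {
      if (Overlaps(existing, file)) {
        return target;  // Cannot go at or below the overlapping level.
      }
    }
    target = level;  // No overlap here, so the file could live this deep.
  }
  return target;
}
```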

Other changes:
- Refactor all AddFile logic to be encapsulated in ExternalSstFileIngestionJob

Test Plan:
unit tests (still need to add more)
addfile_stress (https://reviews.facebook.net/D65037)

Reviewers: yiwu, andrewkr, lightmark, yhchiang, sdong

Reviewed By: sdong

Subscribers: jkedgar, hcz, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D65061
2016-10-20 17:05:32 -07:00
sdong
1d9dbef64e Restrict running condition of UniversalCompactionTrivialMoveTest2
Summary: DBTestUniversalCompaction.UniversalCompactionTrivialMoveTest2 verifies that a non-trivial move is not triggered if we load data in sequential order. However, if there are multiple compaction threads, this condition may not hold. Restrict the running condition to 1 compaction thread to make the test more robust.

Test Plan: Run the test and make sure at least it doesn't regress normally.

Reviewers: yhchiang, andrewkr, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D65277
2016-10-20 15:43:00 -07:00
sdong
fb2e412943 column_family_test: disable some tests in LITE
Summary: Some tests in column_family_test depend on functions that are not available in the LITE build, which sometimes causes flakiness. Disable them.

Test Plan: Run those tests in LITE build.

Reviewers: yiwu, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D65271
2016-10-19 15:55:56 -07:00
Aaron Gao
5af651db24 fix data race in compact_files_test
Summary: fix data race

Test Plan: compact_files_test

Reviewers: sdong, yiwu, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D65259
2016-10-19 13:37:51 -07:00
Andrew Kryczka
a0ba0aa877 Fix uninitialized variable gcc error for MyRocks
Summary: make sure seq_ is properly initialized even if ParseInternalKey() fails.

Test Plan: run myrocks release tests

Reviewers: lightmark, mung, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D65199
2016-10-19 10:59:46 -07:00
Islam AbdelRahman
b88f8e87c5 Support SST files with Global sequence numbers [reland]
Summary:
reland https://reviews.facebook.net/D62523

- Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno`
- Update TableProperties to be aware of the offset of each property in the file
- Update BlockBasedTableReader and Block to be able to honor the sequence number in the `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence numbers in the file

One thing worth mentioning is that we don't update the seqno in the index block or when doing a binary search. The reason is that SST files with a global seqno are guaranteed to have only one entry per user_key, and each key has seqno=0 encoded in it. This means that such a key is greater than any other key with seqno > 0, so we can keep the current logic for these blocks.
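
The ordering argument can be illustrated with a simplified comparator. This is a sketch that assumes only the usual internal-key rule (user key ascending, then sequence number descending) and ignores the value type. `SimpleInternalKey` and `InternalKeyLess` are hypothetical names, not RocksDB's InternalKeyComparator.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical, simplified internal key: user key plus sequence number.
struct SimpleInternalKey {
  std::string user_key;
  uint64_t seqno;
};

// Returns true if `a` sorts before `b`: user keys ascending, and for the
// same user key, larger sequence numbers first.
bool InternalKeyLess(const SimpleInternalKey& a, const SimpleInternalKey& b) {
  if (a.user_key != b.user_key) {
    return a.user_key < b.user_key;
  }
  return a.seqno > b.seqno;
}
```

Under this rule, an entry with seqno=0 sorts after every entry for the same user key that has seqno > 0, which is why a binary search over keys encoded with seqno=0 can be left unchanged.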

Test Plan: unit tests

Reviewers: sdong, yhchiang

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D65211
2016-10-18 16:59:37 -07:00
Aaron Gao
52c9808c3a not split file in compaction on level 0
Summary: we should not split files on level 0 in compaction because it will fail the subsequent verification of seqno order on level 0

Test Plan: check with filldeterministic in db_bench

Reviewers: yhchiang, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D65193
2016-10-18 16:30:34 -07:00
Aaron Gao
5e0d6b4cc9 fix db_stress assertion failure
Summary: in rocksdb::DBIter::FindValueForCurrentKey(), last_not_merge_type could also be SingleDelete() which is omitted

Test Plan: db_iter_test

Reviewers: yhchiang, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D65187
2016-10-18 16:07:10 -07:00
sdong
b4d07123c4 SamePrefixTest.InDomainTest to clear the test directory before testing
Summary: SamePrefixTest.InDomainTest may fail if the previous run of some test cases in prefix_test fails.

Test Plan: Run the test

Reviewers: lightmark, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D65163
2016-10-18 14:01:10 -07:00
Islam AbdelRahman
aa09d03381 Avoid calling GetDBOptions() inside GetFromBatchAndDB()
Summary:
MyRocks hit a regression. @mung generated perf reports showing that the cause is the cost of calling `GetDBOptions()` inside `GetFromBatchAndDB()`.
This diff avoids calling `GetDBOptions` and uses the `ImmutableDBOptions` instead.

Test Plan: make check -j64

Reviewers: sdong, yiwu

Reviewed By: yiwu

Subscribers: andrewkr, dhruba, mung

Differential Revision: https://reviews.facebook.net/D65151
2016-10-18 13:19:26 -07:00