Commit Graph

2750 Commits

Author SHA1 Message Date
Siying Dong
1ba2804b7f Remove XFunc tests
Summary:
Xfunc is hardly used. Remove it to keep the code simple.
Closes https://github.com/facebook/rocksdb/pull/1905

Differential Revision: D4603220

Pulled By: siying

fbshipit-source-id: 731f96d
2017-02-23 12:09:11 -08:00
Aaron Gao
1ef5f50e84 detect logical sector size
Summary:
querying logical sector size from the device instead of hardcoding it for linux platform.
Closes https://github.com/facebook/rocksdb/pull/1875

Differential Revision: D4591946

Pulled By: ajkr

fbshipit-source-id: 4e9805c
2017-02-23 11:25:36 -08:00
Aaron Gao
0824934423 truncate patch
Summary:
omit the override for the previous commit
Closes https://github.com/facebook/rocksdb/pull/1898

Differential Revision: D4598743

Pulled By: lightmark

fbshipit-source-id: f98a378
2017-02-22 10:39:11 -08:00
Aaron Gao
286a36db7f posix writablefile truncate
Summary:
we occasionally missing this call so the file size will be wrong
Closes https://github.com/facebook/rocksdb/pull/1894

Differential Revision: D4598446

Pulled By: lightmark

fbshipit-source-id: 42b6ef5
2017-02-22 10:09:14 -08:00
Mike Kolupaev
18eeb7b90e Fix interference between max_total_wal_size and db_write_buffer_size checks
Summary:
This is a trivial fix for OOMs we've seen a few days ago in logdevice.

RocksDB get into the following state:
(1) Write throughput is too high for flushes to keep up. Compactions are out of the picture - automatic compactions are disabled, and for manual compactions we don't care that much if they fall behind. We write to many CFs, with only a few L0 sst files in each, so compactions are not needed most of the time.
(2) total_log_size_ is consistently greater than GetMaxTotalWalSize(). It doesn't get smaller since flushes are falling ever further behind.
(3) Total size of memtables is way above db_write_buffer_size and keeps growing. But the write_buffer_manager_->ShouldFlush() is not checked because (2) prevents it (for no good reason, afaict; this is what this commit fixes).
(4) Every call to WriteImpl() hits the MaybeFlushColumnFamilies() path. This keeps flushing the memtables one by one in order of increasing log file number.
(5) No write stalling trigger is hit. We rely on max_write_buffer_number
Closes https://github.com/facebook/rocksdb/pull/1893

Differential Revision: D4593590

Pulled By: yiwu-arbug

fbshipit-source-id: af79c5f
2017-02-21 16:09:10 -08:00
Aaron Gao
2a0f3d0de1 level compaction expansion
Summary:
reimplement the compaction expansion on lower level.

Considering such a case:
input level file: 1[B E] 2[F G] 3[H I] 4 [J M]
output level file: 5[A C] 6[D K] 7[L O]

If we initially pick file 2, now we will compact file 2 and 6. But we can safely compact 2, 3 and 6 without expanding the output level.

The previous code is messy and wrong.

In this diff, I first determine the input range [a, b], and output range [c, d],
then we get the range [e,f] = [min(a, c), max(b, d] and put all eligible clean-cut files within [e, f] into this compaction.

**Note: clean-cut means the files don't have the same user key on the boundaries of some files that are not chosen in this compaction**.
Closes https://github.com/facebook/rocksdb/pull/1760

Differential Revision: D4395564

Pulled By: lightmark

fbshipit-source-id: 2dc2c5c
2017-02-21 10:24:17 -08:00
Yi Wu
381fd32247 Remove timeout_hint_us from WriteOptions
Summary:
The option has been deprecated for two years and has no effect. Removing.
Closes https://github.com/facebook/rocksdb/pull/1866

Differential Revision: D4555203

Pulled By: yiwu-arbug

fbshipit-source-id: c48f627
2017-02-17 15:24:17 -08:00
Islam AbdelRahman
fce7a6e196 Fail IngestExternalFile when bg_error_ exists
Summary:
Fail IngestExternalFile() when bg_error_ exists
Closes https://github.com/facebook/rocksdb/pull/1881

Differential Revision: D4580621

Pulled By: IslamAbdelRahman

fbshipit-source-id: 1194913
2017-02-17 13:39:17 -08:00
Shu Zhang
756c5924e6 Allow adding external v1 sst file with no global seqno support
Summary:
This is a follow up fix for https://github.com/facebook/rocksdb/pull/1783. After it, we should be able to ingest external v1 sst files with no global seqno field.
Closes https://github.com/facebook/rocksdb/pull/1874

Differential Revision: D4576194

Pulled By: IslamAbdelRahman

fbshipit-source-id: 5b34a3e
2017-02-16 17:09:12 -08:00
Aaron Gao
db2b4eb50e avoid direct io in rocksdb_lite
Summary:
fix lite bugs
disable direct io in lite mode
Closes https://github.com/facebook/rocksdb/pull/1870

Differential Revision: D4559866

Pulled By: yiwu-arbug

fbshipit-source-id: 3761c51
2017-02-16 10:39:13 -08:00
Andrew Kryczka
43e9f01c20 Fix repair_test on ROCKSDB_LITE
Summary:
RepairDB isn't included in rocksdb lite, so don't test it.
Closes https://github.com/facebook/rocksdb/pull/1873

Differential Revision: D4565094

Pulled By: ajkr

fbshipit-source-id: 8cc0898
2017-02-15 11:24:12 -08:00
Xiaofei Du
7106a994fe Use monotonic time points in write_controller.cc and rate_limiter.cc
Summary:
NowMicros() provides non-monotonic time. When wall clock is
synchronized or changed, the non-monotonicity time points will affect write rate
controllers. This patch changes write_controller.cc and rate_limiter.cc to use
monotonic time points.
Closes https://github.com/facebook/rocksdb/pull/1865

Differential Revision: D4561732

Pulled By: siying

fbshipit-source-id: 95ece62
2017-02-14 18:24:24 -08:00
Yi Wu
c2247dc1c7 Make DBImpl::has_unpersisted_data_ atomic
Summary:
Seems to me `has_unpersisted_data_` is read from read thread and write
from write thread concurrently without synchronization. Making it an
atomic.

I update the logic not because seeing any problem with it, but it just
feel confusing.
Closes https://github.com/facebook/rocksdb/pull/1869

Differential Revision: D4555837

Pulled By: yiwu-arbug

fbshipit-source-id: eff2ab8
2017-02-13 18:54:13 -08:00
Sagar Vemuri
eb912a927e Remove disableDataSync option
Summary:
Remove disableDataSync, and another similarly named disable_data_sync options.
This is being done to simplify options, and also because the performance gains of this feature can be achieved by other methods.
Closes https://github.com/facebook/rocksdb/pull/1859

Differential Revision: D4541292

Pulled By: sagar0

fbshipit-source-id: 5b3a6ca
2017-02-13 11:09:13 -08:00
Dmitri Smirnov
a5adda0642 Fix repair issues
Summary:
Record the first parsed sequence number as the minimum
  so we can find the true minimum otherwise everything is larger than zero.
  Fix the comparator name comparision.
Closes https://github.com/facebook/rocksdb/pull/1858

Differential Revision: D4544365

Pulled By: ajkr

fbshipit-source-id: 439cbc2
2017-02-10 10:54:12 -08:00
Andrew Kryczka
b48e4778be Consolidate file cutting logic in compaction loop
Summary:
It was really annoying to have two places (top and bottom of compaction loop) where we cut output files. I had bugs in both DeleteRange and dictionary compression due to updating only one of the two. This diff consolidates the file-cutting logic to the bottom of the compaction loop.

Keep in mind that my goal with input_status is to be consistent with the past behavior, even though I'm not sure it's ideal.
Closes https://github.com/facebook/rocksdb/pull/1832

Differential Revision: D4503038

Pulled By: ajkr

fbshipit-source-id: 7da5213
2017-02-08 16:24:17 -08:00
Maysam Yabandeh
c4a37dcb44 Print the missed last layer in cfstats
Summary:
Printing compaction stats used to operate on two variable:
number_levels_: for printing the layer
num_levels_to_check: for updating the compaction score

After this commit: 361010d447 these two are mixed up and as a result the last layer might not be printed out: https://fb.facebook.com/groups/rocksdb.internal/permalink/1315716625143616/

number_levels_ was used to decide which layers to print: 672300f47f/db/internal_stats.cc (L753) but after the patch it is based on the return value of DumpCFMapStats 361010d447/db/internal_stats.cc (L929) which returns num_levels_to_check: 361010d447/db/internal_stats.cc (L917)
Closes https://github.com/facebook/rocksdb/pull/1853

Differential Revision: D4529280

Pulled By: maysamyabandeh

fbshipit-source-id: 3fd9448
2017-02-08 10:39:15 -08:00
Maysam Yabandeh
69d5262c81 Two-level Indexes
Summary:
Partition Index blocks and use a Partition-index as a 2nd level index.

The two-level index can be used by setting
BlockBasedTableOptions::kTwoLevelIndexSearch as the index type and
configuring BlockBasedTableOptions::index_per_partition

t15539501
Closes https://github.com/facebook/rocksdb/pull/1814

Differential Revision: D4473535

Pulled By: maysamyabandeh

fbshipit-source-id: bffb87e
2017-02-06 16:39:12 -08:00
Dmitri Smirnov
0a4cdde50a Windows thread
Summary:
introduce new methods into a public threadpool interface,
- allow submission of std::functions as they allow greater flexibility.
- add Joining methods to the implementation to join scheduled and submitted jobs with
  an option to cancel jobs that did not start executing.
- Remove ugly `#ifdefs` between pthread and std implementation, make it uniform.
- introduce pimpl for a drop in replacement of the implementation
- Introduce rocksdb::port::Thread typedef which is a replacement for std::thread.  On Posix Thread defaults as before std::thread.
- Implement WindowsThread that allocates memory in a more controllable manner than windows std::thread with a replaceable implementation.
- should be no functionality changes.
Closes https://github.com/facebook/rocksdb/pull/1823

Differential Revision: D4492902

Pulled By: siying

fbshipit-source-id: c74cb11
2017-02-06 14:54:18 -08:00
Vitaliy Liptchinsky
1aaa898cf1 Adding GetApproximateMemTableStats method
Summary:
Added method that returns approx num of entries as well as size for memtables.
Closes https://github.com/facebook/rocksdb/pull/1841

Differential Revision: D4511990

Pulled By: VitaliyLi

fbshipit-source-id: 9a4576e
2017-02-06 14:54:16 -08:00
Siying Dong
036d668b19 Fix wrong result in data race case related to Get()
Summary:
In theory, Get() can get a wrong result, if it races in a special with with flush. The bug can be reproduced in DBTest2.GetRaceFlush. Fix this bug by getting snapshot after referencing the super version.
Closes https://github.com/facebook/rocksdb/pull/1816

Differential Revision: D4475958

Pulled By: siying

fbshipit-source-id: bd9e67a
2017-02-03 11:39:15 -08:00
Islam AbdelRahman
574b543f80 Rename merger.h -> merging_iterator.h
Summary:
merger.h was always a confusing name for me, simply give the file a better name
Closes https://github.com/facebook/rocksdb/pull/1836

Differential Revision: D4505357

Pulled By: IslamAbdelRahman

fbshipit-source-id: 07b28d8
2017-02-02 16:54:19 -08:00
oranagra
b96372dead improving the C wrapper
Summary:
- rocksdb_property_int (so that we don't have to parse strings)
- and rocksdb_set_options (to allow controlling options via strings)
- a few other missing options exposed
- a documentation comment fix
Closes https://github.com/facebook/rocksdb/pull/1793

Differential Revision: D4456569

Pulled By: yiwu-arbug

fbshipit-source-id: 9f1fac1
2017-01-27 17:39:16 -08:00
Siying Dong
04c4ec41d1 Change corruption_test to use 4 bits.
Summary:
In the patch which LRU cache was made use dynamic shard bits, I changed to 2 shard bits to make the test happy. Look like it is occasionally still unhappy. Change it to 4 shard bits.
Closes https://github.com/facebook/rocksdb/pull/1815

Differential Revision: D4475849

Pulled By: siying

fbshipit-source-id: 575ff00
2017-01-27 11:24:16 -08:00
Siying Dong
2d75cd40d3 NewLRUCache() to pick number of shard bits based on capacity if not given
Summary:
If the users use the NewLRUCache() without passing in the number of shard bits, instead of using hard-coded 6, we'll determine it based on capacity.
Closes https://github.com/facebook/rocksdb/pull/1584

Differential Revision: D4242517

Pulled By: siying

fbshipit-source-id: 86b0f18
2017-01-27 06:39:12 -08:00
Siying Dong
f25f1ec60b Add test DBTest2.GetRaceFlush which can expose a data race bug
Summary:
A current data race issue in Get() and Flush() can cause a Get() to return wrong results when a flush happened in the middle. Disable the test for now.
Closes https://github.com/facebook/rocksdb/pull/1813

Differential Revision: D4472310

Pulled By: siying

fbshipit-source-id: 5755ebd
2017-01-26 16:39:14 -08:00
sdong
5dad9d6d28 Avoid logs_ operation out of DB mutex
Summary:
logs_.back() is called out of DB mutex, which can cause data race. We move the access into the DB mutex protection area.
Closes https://github.com/facebook/rocksdb/pull/1774

Reviewed By: AsyncDBConnMarkedDownDBException

Differential Revision: D4417472

Pulled By: AsyncDBConnMarkedDownDBException

fbshipit-source-id: 2da1f1e
2017-01-25 15:54:13 -08:00
Islam AbdelRahman
a7b13919bf Fix CompactFiles() bug when used with CompactionFilter using SuperVersion
Summary:
GetAndRefSuperVersion() should not be called again in the same thread before ReturnAndCleanupSuperVersion() is called.

If we have a compaction filter that is using DB::Get, This will happen
```
CompactFiles() {
  GetAndRefSuperVersion() // -- first call
    ..
    CompactionFilter() {
      GetAndRefSuperVersion() // -- second call
      ReturnAndCleanupSuperVersion()
    }
    ..
  ReturnAndCleanupSuperVersion()
}
```

We solve this issue in the same way Iterator is solving it, but using GetReferencedSuperVersion()

This was discovered in https://github.com/facebook/mysql-5.6/issues/427 by alxyang
Closes https://github.com/facebook/rocksdb/pull/1803

Differential Revision: D4460155

Pulled By: IslamAbdelRahman

fbshipit-source-id: 5e54322
2017-01-25 14:09:13 -08:00
Andrew Kryczka
616a1464ea Fix DeleteRange including sentinels in output files
Summary:
when writing RangeDelAggregator::AddToBuilder, I forgot that there are sentinel tombstones in the middle of the interval map since gaps between real tombstones are represented with sentinels.

blame: #1614
Closes https://github.com/facebook/rocksdb/pull/1804

Differential Revision: D4460426

Pulled By: ajkr

fbshipit-source-id: 69444b5
2017-01-25 11:09:12 -08:00
Islam AbdelRahman
03ca2ac8a9 Remove function from DBImpl that are not used anywhere
Summary:
GetAndRefSuperVersionUnlocked
ReturnAndCleanupSuperVersionUnlocked
GetColumnFamilyHandleUnlocked

Are dead code that are not used any where
Closes https://github.com/facebook/rocksdb/pull/1802

Differential Revision: D4459948

Pulled By: IslamAbdelRahman

fbshipit-source-id: 30fa89d
2017-01-24 19:24:13 -08:00
Andrew Kryczka
b0029bc7fa Test merge op covered by range deletion in memtable
Summary:
It's a test case for #1797. Also got rid of kTypeDeletion in the conditional since we treat it the same as kTypeRangeDeletion.
Closes https://github.com/facebook/rocksdb/pull/1800

Differential Revision: D4451300

Pulled By: ajkr

fbshipit-source-id: b39dda1
2017-01-24 13:39:11 -08:00
Andrew Kryczka
d438e1ec17 Test range deletion block outlives table reader
Summary:
This test ensures RangeDelAggregator can still access blocks even if it outlives the table readers that created them (detailed description in comments).

I plan to optimize away the extra cache lookup we currently do in BlockBasedTable::NewRangeTombstoneIterator(), as it is ~5% CPU in my random read benchmark in a database with 1k tombstones. This test will help make sure nothing breaks in the process.
Closes https://github.com/facebook/rocksdb/pull/1739

Differential Revision: D4375954

Pulled By: ajkr

fbshipit-source-id: aef9357
2017-01-24 13:24:14 -08:00
Andrew Kryczka
9da4d542fe Range deletions unsupported in tailing iterator
Summary:
change the iterator status to NotSupported as soon as a range tombstone
is encountered by a ForwardIterator.
Closes https://github.com/facebook/rocksdb/pull/1593

Differential Revision: D4246294

Pulled By: ajkr

fbshipit-source-id: aef9f49
2017-01-23 13:39:12 -08:00
Hyeonseok Oh
f2b4939da4 fixed typo
Summary:
I fixed exisit -> exist
Closes https://github.com/facebook/rocksdb/pull/1799

Differential Revision: D4451466

Pulled By: yiwu-arbug

fbshipit-source-id: b447c3a
2017-01-23 12:54:13 -08:00
yinqiwen
973f1b78fd memtable: delete merge value for range deleteion
Summary: Closes https://github.com/facebook/rocksdb/pull/1797

Differential Revision: D4448004

Pulled By: ajkr

fbshipit-source-id: 3ffc27c
2017-01-23 12:24:14 -08:00
Vitaliy Liptchinsky
753ff84a3d Fix get approx size
Summary:
Fixing GetApproximateSize bug for the case of computing stats for mem tables only.
Closes https://github.com/facebook/rocksdb/pull/1795

Differential Revision: D4445507

Pulled By: IslamAbdelRahman

fbshipit-source-id: 3905846
2017-01-20 15:54:12 -08:00
Jay Lee
537da370da c: allow set savepoint to writebatch
Summary:
Allow set SavePoint to WriteBatch in C ABI.
Closes https://github.com/facebook/rocksdb/pull/1698

Differential Revision: D4378556

Pulled By: yiwu-arbug

fbshipit-source-id: afca746
2017-01-20 13:24:13 -08:00
Changli Gao
5ac97314e7 Fix std::out_of_range when DBOptions::keep_log_file_num is zero
Summary:
We should validate this option, otherwise we may see
std::out_of_range thrown at: db/db_impl.cc:1124

1123     for (unsigned int i = 0; i <= end; i++) {
1124       std::string& to_delete = old_info_log_files.at(i);
1125       std::string full_path_to_delete =
1126           (immutable_db_options_.db_log_dir.empty()
Closes https://github.com/facebook/rocksdb/pull/1722

Differential Revision: D4379495

Pulled By: yiwu-arbug

fbshipit-source-id: e136552
2017-01-20 13:24:12 -08:00
Shu Zhang
3c0852d1da Make ingest external file backward compatible
Summary: Closes https://github.com/facebook/rocksdb/pull/1783

Differential Revision: D4443463

Pulled By: IslamAbdelRahman

fbshipit-source-id: 39d21d6
2017-01-20 12:09:19 -08:00
Siying Dong
0e8dfd6062 Fix OptimizeForPointLookup()
Summary:
If users directly call OptimizeForPointLookup(), it is broken as the option isn't compatible with parallel memtable insert. Fix it by using memtable bloomo filter instead.
Closes https://github.com/facebook/rocksdb/pull/1791

Differential Revision: D4442836

Pulled By: siying

fbshipit-source-id: bf6c9cd
2017-01-20 10:54:12 -08:00
Vitaliy Liptchinsky
e840213d6e Change DB::GetApproximateSizes for more flexibility needed for MyRocks
Summary:
Added an option to GetApproximateSizes to exclude file stats, as MyRocks has those counted exactly and we need only stats from memtables.
Closes https://github.com/facebook/rocksdb/pull/1787

Differential Revision: D4441111

Pulled By: IslamAbdelRahman

fbshipit-source-id: c11f4c3
2017-01-20 09:39:11 -08:00
Yi Wu
9239103cd4 Flush job should release reference current version if sync log failed
Summary:
Fix the bug when sync log fail, FlushJob::Run() will not be execute and
reference to cfd->current() will not be release.
Closes https://github.com/facebook/rocksdb/pull/1792

Differential Revision: D4441316

Pulled By: yiwu-arbug

fbshipit-source-id: 5523e28
2017-01-19 23:09:15 -08:00
Islam AbdelRahman
da54d36a96 Disable IngestExternalFile in ReadOnly mode
Summary:
Disable IngestExternalFile() in read only mode
Closes https://github.com/facebook/rocksdb/pull/1781

Differential Revision: D4439179

Pulled By: IslamAbdelRahman

fbshipit-source-id: b7e46e7
2017-01-19 15:54:19 -08:00
Reid Horuff
5cf176ca15 Fix for 2PC causing WAL to grow too large
Summary:
Consider the following single column family scenario:
prepare in log A
commit in log B
*WAL is too large, flush all CFs to releast log A*
*CFA is on log B so we do not see CFA is depending on log A so no flush is requested*

To fix this we must also consider the log containing the prepare section when determining what log a CF is dependent on.
Closes https://github.com/facebook/rocksdb/pull/1768

Differential Revision: D4403265

Pulled By: reidHoruff

fbshipit-source-id: ce800ff
2017-01-19 15:39:12 -08:00
Andrew Kryczka
f9d18e22d2 Fix DeleteRange file boundary correctness issue with max_compaction_bytes
Summary:
Cockroachdb exposed this bug in #1778. The bug happens when a compaction's output files are ended due to exceeding max_compaction_bytes. In that case we weren't taking into account the next file's start key when deciding how far to extend the current file's max_key. This caused the non-overlapping key-range invariant to be violated.

Note this was correctly handled for the usual case of cutting compaction output, which is file size exceeding max_output_file_size. I am not sure why these are two separate code paths, but we can consider refactoring it to prevent such errors in the future.
Closes https://github.com/facebook/rocksdb/pull/1784

Differential Revision: D4430235

Pulled By: ajkr

fbshipit-source-id: 80af748
2017-01-18 11:54:22 -08:00
Islam AbdelRahman
3ce091fd73 Add KEEP_DB env var option
Summary:
When debugging tests, it's useful to preserve the DB to investigate it and check the logs
This will allow us to set KEEP_DB=1 to preserve the DB
Closes https://github.com/facebook/rocksdb/pull/1759

Differential Revision: D4393826

Pulled By: IslamAbdelRahman

fbshipit-source-id: 1bff689
2017-01-17 13:54:20 -08:00
Siying Dong
77b4806625 Fix 2PC with concurrent memtable insert
Summary:
If concurrent memtable insert is enabled, and one prepare command and a normal command are grouped into a commit group, the sequence ID will be calculated incorrectly.
Closes https://github.com/facebook/rocksdb/pull/1730

Differential Revision: D4371081

Pulled By: siying

fbshipit-source-id: cd40c6d
2017-01-17 11:24:28 -08:00
Mike Kolupaev
d18dd2c41f Abort compactions more reliably when closing DB
Summary:
DB shutdown aborts running compactions by setting an atomic shutting_down=true that CompactionJob periodically checks. Without this PR it checks it before processing every _output_ value. If compaction filter filters everything out, the compaction is uninterruptible. This PR adds checks for shutting_down on every _input_ value (in CompactionIterator and MergeHelper).

There's also some minor code cleanup along the way.
Closes https://github.com/facebook/rocksdb/pull/1639

Differential Revision: D4306571

Pulled By: yiwu-arbug

fbshipit-source-id: f050890
2017-01-11 15:09:21 -08:00
Changli Gao
9f246298e2 Performance: Iterate vector by reference
Summary: Closes https://github.com/facebook/rocksdb/pull/1763

Differential Revision: D4398796

Pulled By: yiwu-arbug

fbshipit-source-id: b82636d
2017-01-11 10:54:37 -08:00
Dmitri Smirnov
3c233ca4ea Fix Windows environment issues
Summary:
Enable directIO on WritableFileImpl::Append
     with offset being current length of the file.
     Enable UniqueID tests on Windows, disable others but
     leeting them to compile. Unique tests are valuable to
     detect failures on different filesystems and upcoming
     ReFS.
     Clear output in WinEnv Getchildren.This is different from
     previous strategy, do not touch output on failure.
     Make sure DBTest.OpenWhenOpen works with windows error message
Closes https://github.com/facebook/rocksdb/pull/1746

Differential Revision: D4385681

Pulled By: IslamAbdelRahman

fbshipit-source-id: c07b702
2017-01-09 15:54:12 -08:00
Maysam Yabandeh
d0ba8ec8f9 Revert "PinnableSlice"
Summary:
This reverts commit 54d94e9c2c.

The pull request was landed by mistake.
Closes https://github.com/facebook/rocksdb/pull/1755

Differential Revision: D4391678

Pulled By: maysamyabandeh

fbshipit-source-id: 36d5149
2017-01-08 14:24:12 -08:00
Maysam Yabandeh
54d94e9c2c PinnableSlice
Summary:
Currently the point lookup values are copied to a string provided by the user.
This incures an extra memcpy cost. This patch allows doing point lookup
via a PinnableSlice which pins the source memory location (instead of
copying their content) and releases them after the content is consumed
by the user. The old API of Get(string) is translated to the new API
underneath.

 Here is the summary for improvements:
 1. value 100 byte: 1.8%  regular, 1.2% merge values
 2. value 1k   byte: 11.5% regular, 7.5% merge values
 3. value 10k byte: 26% regular,    29.9% merge values

 The improvement for merge could be more if we extend this approach to
 pin the merge output and delay the full merge operation until the user
 actually needs it. We have put that for future work.

PS:
Sometimes we observe a small decrease in performance when switching from
t5452014 to this patch but with the old Get(string) API. The difference
is a little and could be noise. More importantly it is safely
cancelled
Closes https://github.com/facebook/rocksdb/pull/1732

Differential Revision: D4374613

Pulled By: maysamyabandeh

fbshipit-source-id: a077f1a
2017-01-08 13:54:13 -08:00
Andrew Kryczka
b104b87814 Maintain position in range deletions map
Summary:
When deletion-collapsing mode is enabled (i.e., for DBIter/CompactionIterator), we maintain position in the tombstone maps across calls to ShouldDelete(). Since iterators often access keys sequentially (or reverse-sequentially), scanning forward/backward from the last position can be faster than binary-searching the map for every key.

- When Next() is invoked on an iterator, we use kForwardTraversal to scan forwards, if needed, until arriving at the range deletion containing the next key.
- Similarly for Prev(), we use kBackwardTraversal to scan backwards in the range deletion map.
- When the iterator seeks, we use kBinarySearch for repositioning
- After tombstones are added or before the first ShouldDelete() invocation, the current position is set to invalid, which forces kBinarySearch to be used.
- Non-iterator users (i.e., Get()) use kFullScan, which has the same behavior as before---scan the whole map for every key passed to ShouldDelete().
Closes https://github.com/facebook/rocksdb/pull/1701

Differential Revision: D4350318

Pulled By: ajkr

fbshipit-source-id: 5129b76
2017-01-05 10:39:12 -08:00
siddontang
653ac1f9c6 C API: support total_order_mode
Summary: Closes https://github.com/facebook/rocksdb/pull/1687

Differential Revision: D4349210

Pulled By: IslamAbdelRahman

fbshipit-source-id: 32d0fbd
2017-01-03 18:39:14 -08:00
Adam Retter
85ac1a320a Fix rocksdb::Status::getState
Summary:
This fixes the Java API for Status#getState use in Native code and also simplifies the implementation of rocksdb::Status::getState.
Closes https://github.com/facebook/rocksdb/issues/1688
Closes https://github.com/facebook/rocksdb/pull/1714

Differential Revision: D4364181

Pulled By: yiwu-arbug

fbshipit-source-id: 8e073b4
2017-01-03 18:39:14 -08:00
Islam AbdelRahman
76711b6e77 Make ExternalSSTFileTest::CompactionDeadlock more deterministic
Summary:
It's not always true that `ASSERT_EQ(running_threads.load(), 2);`
Closes https://github.com/facebook/rocksdb/pull/1736

Differential Revision: D4374091

Pulled By: IslamAbdelRahman

fbshipit-source-id: 4f70bbd
2017-01-03 18:09:20 -08:00
Islam AbdelRahman
c963460dbc Fix tests under GCC_481
Summary:
This fix the issue with tests failing under GCC 481, I am not sure what is the exact reason
Closes https://github.com/facebook/rocksdb/pull/1735

Differential Revision: D4374094

Pulled By: IslamAbdelRahman

fbshipit-source-id: b3625bc
2017-01-03 17:54:12 -08:00
Vincent Lee
e425ec1162 utilities/backupable: backup should limit the copy size of wal.
Summary:
Since the backup work as snapshot, we should only copy
 the bytes of the wal while we get the alive files.
Closes https://github.com/facebook/rocksdb/pull/1733

Differential Revision: D4373457

Pulled By: ajkr

fbshipit-source-id: 389318f
2016-12-31 10:54:20 -08:00
Maysam Yabandeh
0712d541d1 Delegate Cleanables
Summary:
Cleanable objects will perform the registered cleanups when
they are destructed. We however rather to delay this cleaning like when
we are gathering the merge operands. Current approach is to create the
Cleanable object on heap (instead of on stack) and delay deleting it.

By allowing Cleanables to delegate their cleanups to another cleanable
object we can delay the cleaning without however the need to craete the
cleanable object on heap and keeping it around. This patch applies this
technique for the cleanups of BlockIter and shows improved performance
for some in-memory benchmarks:
+1.8% for merge worklaod, +6.4% for non-merge workload when the merge
operator is specified.
https://our.intern.facebook.com/intern/tasks?t=15168163

Non-merge benchmark:
TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench --benchmarks=fillrandom
--num=1000000 -value_size=100 -compression_type=none

Reading random with no merge operator specified:
TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench
--benchmarks="read
Closes https://github.com/facebook/rocksdb/pull/1711

Differential Revision: D4361163

Pulled By: maysamyabandeh

fbshipit-source-id: 9801e07
2016-12-29 15:54:19 -08:00
Islam AbdelRahman
d58ef52ba6 Allow SstFileWriter to Fadvise the file away from page cache
Summary:
Add `fadvise_trigger` option to `SstFileWriter`

If fadvise_trigger is passed with a non-zero value, SstFileWriter will invalidate the os page cache every `fadvise_trigger` bytes for the sst file
Closes https://github.com/facebook/rocksdb/pull/1731

Differential Revision: D4371246

Pulled By: IslamAbdelRahman

fbshipit-source-id: 91caff1
2016-12-29 15:09:19 -08:00
Siying Dong
17a4b75cc3 Always fsync the file after file copying
Summary:
File copying happens when creating checkpoints and bulkloading files from different FS partition. We should fsync the files when copying them to guarantee durability. A side effect will be that the dirty pages in file system buffers won't grow too large.
Closes https://github.com/facebook/rocksdb/pull/1728

Differential Revision: D4371083

Pulled By: siying

fbshipit-source-id: 579e14c
2016-12-28 19:09:16 -08:00
leipeng
a738af8f84 db/pinned_iterators_manager.h: bugfix
Summary:
std::unique(beg, end) returns an iterator of unique_end, data behind unique_end should not be accessed.
Closes https://github.com/facebook/rocksdb/pull/1726

Differential Revision: D4371076

Pulled By: IslamAbdelRahman

fbshipit-source-id: 5564450
2016-12-28 18:54:57 -08:00
Siying Dong
438f22bc56 Fix bug of Checkpoint loses recent transactions with 2PC
Summary:
If 2PC is enabled, checkpoint may not copy previous log files that contain uncommitted prepare records. In this diff we keep those files.
Closes https://github.com/facebook/rocksdb/pull/1724

Differential Revision: D4368319

Pulled By: siying

fbshipit-source-id: cc2c746
2016-12-28 12:24:16 -08:00
Aaron Gao
972f96b3fb direct io write support
Summary:
rocksdb direct io support

```
[gzh@dev11575.prn2 ~/rocksdb] ./db_bench -benchmarks=fillseq --num=1000000
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB:    version 5.0
Date:       Wed Nov 23 13:17:43 2016
CPU:        40 * Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
CPUCache:   25600 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Write rate: 0 bytes/second
Compression: Snappy
Memtablerep: skip_list
Perf Level: 1
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
DB path: [/tmp/rocksdbtest-112628/dbbench]
fillseq      :       4.393 micros/op 227639 ops/sec;   25.2 MB/s

[gzh@dev11575.prn2 ~/roc
Closes https://github.com/facebook/rocksdb/pull/1564

Differential Revision: D4241093

Pulled By: lightmark

fbshipit-source-id: 98c29e3
2016-12-22 13:09:19 -08:00
Islam AbdelRahman
989e644ed8 Remove sst_file_manager option from LITE
Summary:
Remove sst_file_manager option from LITE
Closes https://github.com/facebook/rocksdb/pull/1690

Differential Revision: D4341331

Pulled By: IslamAbdelRahman

fbshipit-source-id: 9f9328d
2016-12-21 17:54:21 -08:00
Islam AbdelRahman
1beef6569a Fix c_test
Summary:
addfile phase in c_test could fail because in previous steps we did a DeleteRange.
Fix the test by simply moving the addfile phase before DeleteRange
Closes https://github.com/facebook/rocksdb/pull/1672

Differential Revision: D4328896

Pulled By: IslamAbdelRahman

fbshipit-source-id: 1d946df
2016-12-21 17:39:14 -08:00
Andrew Kryczka
50e305de98 Collapse range deletions
Summary:
Added a tombstone-collapsing mode to RangeDelAggregator, which eliminates overlap in the TombstoneMap. In this mode, we can check whether a tombstone covers a user key using upper_bound() (i.e., binary search). However, the tradeoff is the overhead to add tombstones is now higher, so at first I've only enabled it for range scans (compaction/flush/user iterators), where we expect a high number of calls to ShouldDelete() for the same tombstones. Point queries like Get() will still use the linear scan approach.

Also in this diff I changed RangeDelAggregator's TombstoneMap to use multimap with user keys instead of map with internal keys. Callers sometimes provided ParsedInternalKey directly, from which it would've required string copying to derive an internal key Slice with which we could search the map.
Closes https://github.com/facebook/rocksdb/pull/1614

Differential Revision: D4270397

Pulled By: ajkr

fbshipit-source-id: 93092c7
2016-12-19 16:54:12 -08:00
Yi Wu
5d1457dbbf Dump persistent cache options
Summary:
Dump persistent cache options
Closes https://github.com/facebook/rocksdb/pull/1679

Differential Revision: D4337019

Pulled By: yiwu-arbug

fbshipit-source-id: 3812f8a
2016-12-19 14:09:12 -08:00
Daniel Black
342370f1d3 Simplify MemTable::Update
Summary:
As suggested by testn in #1650

The Add is at the end of the function. Having a fallthough
will result in it being added twice.
Closes https://github.com/facebook/rocksdb/pull/1676

Differential Revision: D4331906

Pulled By: yiwu-arbug

fbshipit-source-id: 895c4a0
2016-12-17 00:09:13 -08:00
Ding Ma
1a136c1f13 Expose file size
Summary:
add a new function to SstFileWriter that will tell the user how big is there file right now.
Closes https://github.com/facebook/rocksdb/pull/1686

Differential Revision: D4338868

Pulled By: mdyuki1016

fbshipit-source-id: c1ee16a
2016-12-16 18:39:12 -08:00
Andrew Kryczka
fbff4628a9 Reduce compaction iterator status checks
Summary:
seems it's expensive to check status since the underlying merge iterator checks status of all its children. so only do it when it's really necessary to get the status before invoking Next(), i.e., when we're advancing to get the first key in the next file.
Closes https://github.com/facebook/rocksdb/pull/1691

Differential Revision: D4343446

Pulled By: siying

fbshipit-source-id: 70ab315
2016-12-16 17:39:09 -08:00
Daniel Black
816c1e30ca gcc-7 requires include <functional> for std::function
Summary:
Fixes compile error:

In file included from ./util/statistics.h:17:0,
                 from ./util/stop_watch.h:8,
                 from ./util/perf_step_timer.h:9,
                 from ./util/iostats_context_imp.h:8,
                 from ./util/posix_logger.h:27,
                 from ./port/util_logger.h:18,
                 from ./db/auto_roll_logger.h:15,
                 from db/auto_roll_logger.cc:6:
./util/thread_local.h:65:16: error: 'function' in namespace 'std' does not name a template type
   typedef std::function<void(void*, void*)> FoldFunc;
Closes https://github.com/facebook/rocksdb/pull/1656

Differential Revision: D4318702

Pulled By: yiwu-arbug

fbshipit-source-id: 8c5d17a
2016-12-16 11:24:18 -08:00
Yi Wu
c270735861 Iterator should be in corrupted status if merge operator return false
Summary:
Iterator should be in corrupted status if merge operator return false.
Also add test to make sure if max_successive_merges is hit during write,
data will not be lost.
Closes https://github.com/facebook/rocksdb/pull/1665

Differential Revision: D4322695

Pulled By: yiwu-arbug

fbshipit-source-id: b327b05
2016-12-16 11:09:16 -08:00
siddontang
8f5d24ae68 C API: support get usage and pinned_usage for cache
Summary: Closes https://github.com/facebook/rocksdb/pull/1671

Differential Revision: D4327453

Pulled By: yiwu-arbug

fbshipit-source-id: bcdbc65
2016-12-15 17:24:17 -08:00
Daniel Black
cfc34d7c4e Missing break in case in DBTestBase::CurrentOptions
Summary:
Found by gcc-7 compile error.

This appeared to be a fault as these options seems too different.
Closes https://github.com/facebook/rocksdb/pull/1667

Differential Revision: D4324174

Pulled By: yiwu-arbug

fbshipit-source-id: 0f65383
2016-12-13 18:39:14 -08:00
Daniel Black
bfbcec2339 Gcc 7 error expansion to defined
Summary:
sorry if these gcc-7/clang-4 cleanups are getting tedious.
Closes https://github.com/facebook/rocksdb/pull/1658

Differential Revision: D4318792

Pulled By: yiwu-arbug

fbshipit-source-id: 8e85891
2016-12-13 18:39:14 -08:00
Daniel Black
67adc937b6 intentional fallthough (prevents gcc-7/clang-4 error)
Summary:
db/memtable.cc: In member function 'void rocksdb::MemTable::Update(rocksdb::SequenceNumber, const rocksdb::Slice&, const rocksdb::Slice&)':
db/memtable.cc:736:11: error: this statement may fall through [-Werror=implicit-fallthrough=]
           }
           ^
db/memtable.cc:738:9: note: here
         default:
         ^~~~~~~
cc1plus: all warnings being treated as errors

closes #1650
Closes https://github.com/facebook/rocksdb/pull/1655

Differential Revision: D4318696

Pulled By: yiwu-arbug

fbshipit-source-id: 1a8981c
2016-12-13 14:39:17 -08:00
Islam AbdelRahman
1a146f89c7 break Flush wait for dropped CF
Summary:
In FlushJob we dont do the Flush if the CF is dropped
https://github.com/facebook/rocksdb/blob/master/db/flush_job.cc#L184-L188

but inside WaitForFlushMemTable we keep waiting forever even if the CF is dropped.
Closes https://github.com/facebook/rocksdb/pull/1664

Differential Revision: D4321032

Pulled By: IslamAbdelRahman

fbshipit-source-id: 6e2b25d
2016-12-13 14:09:12 -08:00
Yi Wu
36d42e65d0 Disable test to unblock travis build
Summary:
The two tests keep failing in travis. Disable them and will fix later.
Closes https://github.com/facebook/rocksdb/pull/1648

Differential Revision: D4316389

Pulled By: yiwu-arbug

fbshipit-source-id: 0a370e7
2016-12-13 11:54:14 -08:00
siddontang
b57dd9262a C API: support writebatch delete range
Summary:
Seem that writebatch delete range can work now, so I add C API for later use.

Btw, can we use this feature in production now?
Closes https://github.com/facebook/rocksdb/pull/1647

Differential Revision: D4314534

Pulled By: ajkr

fbshipit-source-id: e835165
2016-12-13 11:24:18 -08:00
Islam AbdelRahman
2ba59b5a1e Disallow ingesting files into dropped CFs
Summary:
This PR update IngestExternalFile to return an error if we try to ingest a file into a dropped CF.

Right now if IngestExternalFile want to flush a memtable, and it's ingesting a file into a dropped CF, it will wait forever since flushing is not possible for the dropped CF
Closes https://github.com/facebook/rocksdb/pull/1657

Differential Revision: D4318657

Pulled By: IslamAbdelRahman

fbshipit-source-id: ed6ea2b
2016-12-13 00:54:14 -08:00
Jonathan Lee
2cabdb8f44 Increase buffer size
Summary:
When compiling with GCC>=7.0.0, "db/internal_stats.cc" fails to compile as the data being written to the buffer potentially exceeds its size.

This fix simply doubles the size of the buffer, thus accommodating the max possible data size.
Closes https://github.com/facebook/rocksdb/pull/1635

Differential Revision: D4302162

Pulled By: yiwu-arbug

fbshipit-source-id: c76ad59
2016-12-09 11:54:22 -08:00
Jonathan Lee
4a17b47bb5 Remove unnecessary header include
Summary:
Remove "util/testharness.h" from list of includes for "db/db_filesnapshot.cc", as it wasn't being used and thus caused an extraneous dependency on gtest.
Closes https://github.com/facebook/rocksdb/pull/1634

Differential Revision: D4302146

Pulled By: yiwu-arbug

fbshipit-source-id: e900c0b
2016-12-09 11:54:21 -08:00
Mike Kolupaev
8c2b921fdf Fixed a crash in debug build in flush_job.cc
Summary:
It was doing `&range_del_iters[0]` on an empty vector. Even though the resulting pointer is never dereferenced, it's still bad for two reasons:
* the practical reason: it crashes with `std::out_of_range` exception in our debug build,
* the "C++ standard lawyer" reason: it's undefined behavior because, in `std::vector` implementation, it probably "dereferences" a null pointer, which is invalid even though it doesn't actually read the pointed memory, just converts a pointer into a reference (and then flush_job.cc converts it back to pointer); nullptr references are undefined behavior.
Closes https://github.com/facebook/rocksdb/pull/1612

Differential Revision: D4265625

Pulled By: al13n321

fbshipit-source-id: db26fb9
2016-12-09 10:39:12 -08:00
Islam AbdelRahman
20ce081fae Fix issue where IngestExternalFile insert blocks in block cache with g_seqno=0
Summary:
When we Ingest an external file we open it to read some metadata and first/last key
during doing that we insert blocks into the block cache with global_seqno = 0

If we move the file (did not copy it) into the DB, we will use these blocks with the wrong seqno in the read path
Closes https://github.com/facebook/rocksdb/pull/1627

Differential Revision: D4293332

Pulled By: yiwu-arbug

fbshipit-source-id: 3ce5523
2016-12-08 13:39:18 -08:00
zhangjinpeng1987
45c7ce1377 CompactRangeOptions C API
Summary:
Add C API for CompactRangeOptions.
Closes https://github.com/facebook/rocksdb/pull/1596

Differential Revision: D4252339

Pulled By: yiwu-arbug

fbshipit-source-id: f768f93
2016-12-07 17:54:14 -08:00
Andrew Kryczka
b821984d31 DeleteRange read path end-to-end tests
Summary: Closes https://github.com/facebook/rocksdb/pull/1592

Differential Revision: D4246260

Pulled By: ajkr

fbshipit-source-id: ce03fa2
2016-12-07 12:54:17 -08:00
Artemiy Kolesnikov
2f4fc539c6 Compaction::IsTrivialMove relaxing
Summary:
IsTrivialMove returns true if no input file overlaps with output_level+1 with more than max_compaction_bytes_ bytes.
Closes https://github.com/facebook/rocksdb/pull/1619

Differential Revision: D4278338

Pulled By: yiwu-arbug

fbshipit-source-id: 994c001
2016-12-07 11:54:11 -08:00
Islam AbdelRahman
ed8fbdb560 Add EventListener::OnExternalFileIngested() event
Summary:
Add EventListener::OnExternalFileIngested() to allow user to subscribe to external file ingestion events
Closes https://github.com/facebook/rocksdb/pull/1623

Differential Revision: D4285844

Pulled By: IslamAbdelRahman

fbshipit-source-id: 0b95a88
2016-12-06 14:09:17 -08:00
Mike Kolupaev
beb36d9c1e Fixed CompactionFilter::Decision::kRemoveAndSkipUntil
Summary:
Embarassingly enough, the first time I tried to use my new feature in logdevice it crashed with this assertion failure:

  db/pinned_iterators_manager.h:30: void rocksdb::PinnedIteratorsManager::StartPinning(): Assertion `pinning_enabled == false' failed

The issue was that `pinned_iters_mgr_.StartPinning()` was called but `pinned_iters_mgr_.ReleasePinnedData()` wasn't.
Closes https://github.com/facebook/rocksdb/pull/1611

Differential Revision: D4265622

Pulled By: al13n321

fbshipit-source-id: 747b10f
2016-12-05 15:24:11 -08:00
Islam AbdelRahman
67f37cf198 Allow user to specify a CF for SST files generated by SstFileWriter
Summary:
Allow user to explicitly specify that the generated file by SstFileWriter will be ingested in a specific CF.
This allow us to persist the CF id in the generated file
Closes https://github.com/facebook/rocksdb/pull/1615

Differential Revision: D4270422

Pulled By: IslamAbdelRahman

fbshipit-source-id: 7fb954e
2016-12-05 14:24:16 -08:00
Anton Safonov
9053fe2a5c Made delete_obsolete_files_period_micros option dynamic
Summary:
Made delete_obsolete_files_period_micros option dynamic. It can be updating using DB::SetDBOptions().
Closes https://github.com/facebook/rocksdb/pull/1595

Differential Revision: D4246569

Pulled By: tonek

fbshipit-source-id: d23f560
2016-12-05 14:24:16 -08:00
Islam AbdelRahman
edde954e7b fix clang build
Summary:
override is missing for FilterV2
Closes https://github.com/facebook/rocksdb/pull/1606

Differential Revision: D4263832

Pulled By: IslamAbdelRahman

fbshipit-source-id: d8b337a
2016-12-01 18:39:10 -08:00
Islam AbdelRahman
e39d080871 Fix travis (compile for clang < 3.9)
Summary:
Travis fail because it uses clang 3.6 which don't recognize
`__attribute__((__no_sanitize__("undefined")))`
Closes https://github.com/facebook/rocksdb/pull/1601

Differential Revision: D4257175

Pulled By: IslamAbdelRahman

fbshipit-source-id: fb4d1ab
2016-12-01 10:09:22 -08:00
fangchenliaohui
b77007df8b Bug: paralle_group status updated in WriteThread::CompleteParallelWorker
Summary:
Multi-write thread may update the status of the parallel_group in
WriteThread::CompleteParallelWorker if the status of Writer is not ok!
When copy write status to the paralle_group, the write thread just hold the
mutex of the the writer processed by itself. it is useless. The thread
should held the the leader of the parallel_group instead.
Closes https://github.com/facebook/rocksdb/pull/1598

Differential Revision: D4252335

Pulled By: siying

fbshipit-source-id: 3864cf7
2016-12-01 09:54:11 -08:00
Mike Kolupaev
247d0979aa Support for range skips in compaction filter
Summary:
This adds the ability for compaction filter to say "drop this key-value, and also drop everything up to key x". This will cause the compaction to seek input iterator to x, without reading the data. This can make compaction much faster when large consecutive chunks of data are filtered out. See the changes in include/rocksdb/compaction_filter.h for the new API.

Along the way this diff also adds ability for compaction filter changing merge operands, similar to how it can change values; we're not going to use this feature, it just seemed easier and cleaner to implement it than to document that it's not implemented :)

The diff is not as big as it may seem, about half of the lines are a test.
Closes https://github.com/facebook/rocksdb/pull/1599

Differential Revision: D4252092

Pulled By: al13n321

fbshipit-source-id: 41e1e48
2016-12-01 07:09:15 -08:00
Panagiotis Ktistakis
96fcefbf1d c api: expose option for dynamic level size target
Summary: Closes https://github.com/facebook/rocksdb/pull/1587

Differential Revision: D4245923

Pulled By: yiwu-arbug

fbshipit-source-id: 6ee7291
2016-11-30 11:24:14 -08:00
zhangjinpeng1987
00197cff39 Add C API to set base_backgroud_compactions
Summary:
Add C API to set base_backgroud_compactions
Closes https://github.com/facebook/rocksdb/pull/1571

Differential Revision: D4245709

Pulled By: yiwu-arbug

fbshipit-source-id: 792c6b8
2016-11-30 11:09:13 -08:00
Andrew Kryczka
5b219eccb5 deleterange end-to-end test improvements for lite/robustness
Summary: Closes https://github.com/facebook/rocksdb/pull/1591

Differential Revision: D4246019

Pulled By: ajkr

fbshipit-source-id: 0c4aa37
2016-11-29 12:24:13 -08:00
Andrew Kryczka
e333528991 DeleteRange write path end-to-end tests
Summary: Closes https://github.com/facebook/rocksdb/pull/1578

Differential Revision: D4241171

Pulled By: ajkr

fbshipit-source-id: ce5fd83
2016-11-29 11:09:22 -08:00
Siying Dong
7784980fcd Fix mis-reporting of compaction read bytes to the base level
Summary:
In dynamic leveled compaction, when calculating read bytes, output level bytes may be wronglyl calculated as input level inputs. Fix it.
Closes https://github.com/facebook/rocksdb/pull/1475

Differential Revision: D4148412

Pulled By: siying

fbshipit-source-id: f2f475a
2016-11-29 11:09:22 -08:00
Islam AbdelRahman
3c6b49ed66 Fix implicit conversion between int64_t to int
Summary:
Make conversion explicit, implicit conversion breaks the build
Closes https://github.com/facebook/rocksdb/pull/1589

Differential Revision: D4245158

Pulled By: IslamAbdelRahman

fbshipit-source-id: aaec00d
2016-11-29 10:54:15 -08:00
Siying Dong
b3b875657f Remove unused assignment in db/db_iter.cc
Summary:
"make analyze" complains the assignment is not useful. Remove it.
Closes https://github.com/facebook/rocksdb/pull/1581

Differential Revision: D4241697

Pulled By: siying

fbshipit-source-id: 178f67a
2016-11-29 09:09:14 -08:00
Andrew Kryczka
4f6e89b1d0 Fix range deletion covering key in same SST file
Summary:
AddTombstones() needs to be before t->Get(), oops :'(
Closes https://github.com/facebook/rocksdb/pull/1576

Differential Revision: D4241041

Pulled By: ajkr

fbshipit-source-id: 781ceea
2016-11-28 22:54:13 -08:00
Islam AbdelRahman
a2bf265a39 Avoid intentional overflow in GetL0ThresholdSpeedupCompaction
Summary:
99c052a34f fixes integer overflow in GetL0ThresholdSpeedupCompaction() by checking if int become -ve.
UBSAN will complain about that since this is still an overflow, we can fix the issue by simply using int64_t
Closes https://github.com/facebook/rocksdb/pull/1582

Differential Revision: D4241525

Pulled By: IslamAbdelRahman

fbshipit-source-id: b3ae21f
2016-11-28 18:39:13 -08:00
Islam AbdelRahman
52fd1ff2c2 disable UBSAN for functions with intentional -ve shift / overflow
Summary:
disable UBSAN for functions with intentional left shift on -ve number / overflow

These functions are
rocksdb:: Hash
FixedLengthColBufEncoder::Append
FaultInjectionTest:: Key
Closes https://github.com/facebook/rocksdb/pull/1577

Differential Revision: D4240801

Pulled By: IslamAbdelRahman

fbshipit-source-id: 3e1caf6
2016-11-28 17:54:12 -08:00
Islam AbdelRahman
1886c435b9 Fix CompactionJob::Install division by zero
Summary:
Fix CompactionJob::Install division by zero
Closes https://github.com/facebook/rocksdb/pull/1580

Differential Revision: D4240794

Pulled By: IslamAbdelRahman

fbshipit-source-id: 7286721
2016-11-28 16:54:16 -08:00
Islam AbdelRahman
13e66a8f51 Fix compaction_job.cc division by zero
Summary:
Fix division by zero in compaction_job.cc
Closes https://github.com/facebook/rocksdb/pull/1575

Differential Revision: D4240818

Pulled By: IslamAbdelRahman

fbshipit-source-id: a8bc757
2016-11-28 16:39:13 -08:00
Andrew Kryczka
01eabf7375 Fix double-counted deletion stat
Summary:
Both the single deletion and the value are included in compaction outputs, so no need to update the stat for the value's deletion yet, otherwise it'd be double-counted.
Closes https://github.com/facebook/rocksdb/pull/1574

Differential Revision: D4241181

Pulled By: ajkr

fbshipit-source-id: c9aaa15
2016-11-28 15:54:12 -08:00
Andrew Kryczka
7ffb10fc1a DeleteRange compaction statistics
Summary:
- "rocksdb.compaction.key.drop.range_del" - number of keys dropped during compaction due to a range tombstone covering them
- "rocksdb.compaction.range_del.drop.obsolete" - number of range tombstones dropped due to compaction to bottom level and no snapshot saving them
- s/CompactionIteratorStats/CompactionIterationStats/g since this class is no longer specific to CompactionIterator -- it's also updated for range tombstone iteration during compaction
- Move the above class into a separate .h file to avoid circular dependency.
Closes https://github.com/facebook/rocksdb/pull/1520

Differential Revision: D4187179

Pulled By: ajkr

fbshipit-source-id: 10c2103
2016-11-28 11:54:12 -08:00
Mike Kolupaev
236d4c67e9 Less linear search in DBIter::Seek() when keys are overwritten a lot
Summary:
In one deployment we saw high latencies (presumably from slow iterator operations) and a lot of CPU time reported by perf with this stack:

```
  rocksdb::MergingIterator::Next
  rocksdb::DBIter::FindNextUserEntryInternal
  rocksdb::DBIter::Seek
```

I think what's happening is:
1. we create a snapshot iterator,
2. we do lots of Put()s for the same key x; this creates lots of entries in memtable,
3. we seek the iterator to a key slightly smaller than x,
4. the seek walks over lots of entries in memtable for key x, skipping them because of high sequence numbers.

CC IslamAbdelRahman
Closes https://github.com/facebook/rocksdb/pull/1413

Differential Revision: D4083879

Pulled By: IslamAbdelRahman

fbshipit-source-id: a83ddae
2016-11-28 10:24:11 -08:00
Siying Dong
cd7c4143d7 Improve Write Stalling System
Summary:
Current write stalling system has the problem of lacking of positive feedback if the restricted rate is already too low. Users sometimes stack in very low slowdown value. With the diff, we add a positive feedback (increasing the slowdown value) if we recover from slowdown state back to normal. To avoid the positive feedback to keep the slowdown value to be to high, we add issue a negative feedback every time we are close to the stop condition. Experiments show it is easier to reach a relative balance than before.

Also increase level0_stop_writes_trigger default from 24 to 32. Since level0_slowdown_writes_trigger default is 20, stop trigger 24 only gives four files as the buffer time to slowdown writes. In order to avoid stop in four files while 20 files have been accumulated, the slowdown value must be very low, which is amost the same as stop. It also doesn't give enough time for the slowdown value to converge. Increase it to 32 will smooth out the system.
Closes https://github.com/facebook/rocksdb/pull/1562

Differential Revision: D4218519

Pulled By: siying

fbshipit-source-id: 95e4088
2016-11-23 09:24:15 -08:00
Yi Wu
dfb6fe6755 Unified InlineSkipList::Insert algorithm with hinting
Summary:
This PR is based on nbronson's diff with small
modifications to wire it up with existing interface. Comparing to
previous version, this approach works better for inserting keys in
decreasing order or updating the same key, and impose less restriction
to the prefix extractor.

---- Summary from original diff ----

This diff introduces a single InlineSkipList::Insert that unifies
the existing sequential insert optimization (prev_), concurrent insertion,
and insertion using externally-managed insertion point hints.

There's a deep symmetry between insertion hints (cursors) and the
concurrent algorithm.  In both cases we have partial information from
the recent past that is likely but not certain to be accurate.  This diff
introduces the struct InlineSkipList::Splice, which encodes predecessor
and successor information in the same form that was previously only used
within a single call to InsertConcurrently.  Splice holds information
about an insertion point that can be used to levera
Closes https://github.com/facebook/rocksdb/pull/1561

Differential Revision: D4217283

Pulled By: yiwu-arbug

fbshipit-source-id: 33ee437
2016-11-22 14:09:13 -08:00
Andrew Kryczka
734e4acafb Eliminate redundant cache lookup with range deletion
Summary:
When we introduced range deletion block, TableCache::Get() and TableCache::NewIterator() each did two table cache lookups, one for range deletion block iterator and another for getting the table reader to which the Get()/NewIterator() is delegated. This extra cache lookup was very CPU-intensive (about 10% overhead in a read-heavy benchmark). We can avoid it by reusing the Cache::Handle created for range deletion block iterator to get the file reader.
Closes https://github.com/facebook/rocksdb/pull/1537

Differential Revision: D4201167

Pulled By: ajkr

fbshipit-source-id: d33ffd8
2016-11-21 21:24:11 -08:00
Maysam Yabandeh
182b940e70 Add WriteOptions.no_slowdown
Summary:
If the WriteOptions.no_slowdown flag is set AND we need to wait or sleep for
the write request, then fail immediately with Status::Incomplete().
Closes https://github.com/facebook/rocksdb/pull/1527

Differential Revision: D4191405

Pulled By: maysamyabandeh

fbshipit-source-id: 7f3ce3f
2016-11-21 18:09:13 -08:00
Karthikeyan Radhakrishnan
4118e13330 Persistent Cache: Expose stats to user via public API
Summary:
Exposing persistent cache stats (counters) to the user via public API.
Closes https://github.com/facebook/rocksdb/pull/1485

Differential Revision: D4155274

Pulled By: siying

fbshipit-source-id: 30a9f50
2016-11-21 17:39:13 -08:00
Andrew Kryczka
fd43ee09da Range deletion microoptimizations
Summary:
- Made RangeDelAggregator's InternalKeyComparator member a reference-to-const so we don't need to copy-construct it. Also added InternalKeyComparator to ImmutableCFOptions so we don't need to construct one for each DBIter.
- Made MemTable::NewRangeTombstoneIterator and the table readers' NewRangeTombstoneIterator() functions return nullptr instead of NewEmptyInternalIterator to avoid the allocation. Updated callers accordingly.
Closes https://github.com/facebook/rocksdb/pull/1548

Differential Revision: D4208169

Pulled By: ajkr

fbshipit-source-id: 2fd65cf
2016-11-21 12:24:13 -08:00
Andrew Kryczka
fe349db57b Remove Arena in RangeDelAggregator
Summary:
The Arena construction/destruction introduced significant overhead to read-heavy workload just by creating empty vectors for its blocks, so avoid it in RangeDelAggregator.
Closes https://github.com/facebook/rocksdb/pull/1547

Differential Revision: D4207781

Pulled By: ajkr

fbshipit-source-id: 9d1c130
2016-11-19 14:24:12 -08:00
Andrew Kryczka
3f62215210 Lazily initialize RangeDelAggregator's map and pinning manager
Summary:
Since a RangeDelAggregator is created for each read request, these heap-allocating member variables were consuming significant CPU (~3% total) which slowed down request throughput. The map and pinning manager are only necessary when range deletions exist, so we can defer their initialization until the first range deletion is encountered. Currently lazy initialization is done for reads only since reads pass us a single snapshot, which is easier to store on the stack for later insertion into the map than the vector passed to us by flush or compaction.

Note the Arena member variable is still expensive, I will figure out what to do with it in a subsequent diff. It cannot be lazily initialized because we currently use this arena even to allocate empty iterators, which is necessary even when no range deletions exist.
Closes https://github.com/facebook/rocksdb/pull/1539

Differential Revision: D4203488

Pulled By: ajkr

fbshipit-source-id: 3b36279
2016-11-18 17:09:11 -08:00
Andrew Kryczka
635a7bd1ad refactor TableCache Get/NewIterator for single exit points
Summary:
these functions were too complicated to change with exit points everywhere, so refactored them.

btw, please review urgently, this is a prereq to fix the 5.0 perf regression
Closes https://github.com/facebook/rocksdb/pull/1534

Differential Revision: D4198972

Pulled By: ajkr

fbshipit-source-id: 04ebfb7
2016-11-17 14:39:13 -08:00
Siying Dong
a4eb7387b2 Allow plain table to store index on file with bloom filter disabled
Summary:
Currently plain table bloom filter is required if storing metadata on file. Remove the constraint.
Closes https://github.com/facebook/rocksdb/pull/1525

Differential Revision: D4190977

Pulled By: siying

fbshipit-source-id: be60442
2016-11-17 11:09:13 -08:00
Yi Wu
36e4762ce0 Remove Ticker::SEQUENCE_NUMBER
Summary:
Remove the ticker count because:
* Having to reset the ticker count in WriteImpl is ineffiecent;
* It doesn't make sense to have it as a ticker count if multiple db
  instance share a statistics object.
Closes https://github.com/facebook/rocksdb/pull/1531

Differential Revision: D4194442

Pulled By: yiwu-arbug

fbshipit-source-id: e2110a9
2016-11-16 22:39:09 -08:00
Andrew Kryczka
760ef68a69 fix deleterange asan issue
Summary:
pinned_iters_mgr_ pins iterators allocated with arena_, so we should order the
instance variable declarations such that the pinned iterators have their destructors
executed before the arena is destroyed.
Closes https://github.com/facebook/rocksdb/pull/1528

Differential Revision: D4191984

Pulled By: ajkr

fbshipit-source-id: 1386f20
2016-11-16 14:09:07 -08:00
Siying Dong
972e3ff295 Enable allow_concurrent_memtable_write and enable_write_thread_adaptive_yield by default
Summary: Closes https://github.com/facebook/rocksdb/pull/1496

Differential Revision: D4168080

Pulled By: siying

fbshipit-source-id: 056ae62
2016-11-16 09:39:09 -08:00
Yi Wu
1543d5d92e Report memory usage by memtable insert hints map.
Summary:
It is hard to measure acutal memory usage by std containers. Even
providing a custom allocator will miss count some of the usage. Here we
only do a wild guess on its memory usage.
Closes https://github.com/facebook/rocksdb/pull/1511

Differential Revision: D4179945

Pulled By: yiwu-arbug

fbshipit-source-id: 32ab929
2016-11-15 20:24:13 -08:00
Andrew Kryczka
48e8baebc0 Decouple data iterator and range deletion iterator in TableCache
Summary:
Previously we used TableCache::NewIterator() for multiple purposes (data
block iterator and range deletion iterator), and returned non-ok status in
the data block iterator. In one case where the caller only used the range
deletion block iterator (9e7cf3469b/db/version_set.cc (L965-L973)),
we didn't check/free the data block iterator containing non-ok status, which
caused a valgrind error.

So, this diff decouples creation of data block and range deletion block iterators,
and updates the callers accordingly. Both functions can return non-ok status
in an InternalIterator. Since the non-ok status is returned in an iterator that the
callers will definitely use, it should be more usable/less error-prone.
Closes https://github.com/facebook/rocksdb/pull/1513

Differential Revision: D4181423

Pulled By: ajkr

fbshipit-source-id: 835b8f5
2016-11-15 17:24:28 -08:00
Andrew Kryczka
661e4c9267 DeleteRange unsupported in non-block-based tables
Summary:
Return an error from DeleteRange() (or Write() if the user is using the
low-level WriteBatch API) if an unsupported table type is configured.
Closes https://github.com/facebook/rocksdb/pull/1519

Differential Revision: D4185933

Pulled By: ajkr

fbshipit-source-id: abcdf84
2016-11-15 15:24:16 -08:00
Andrew Kryczka
489d142808 DeleteRange interface
Summary:
Expose DeleteRange() interface since we think the implementation is functionally correct now.
Closes https://github.com/facebook/rocksdb/pull/1503

Differential Revision: D4171921

Pulled By: ajkr

fbshipit-source-id: 5e21c98
2016-11-15 15:24:16 -08:00
Islam AbdelRahman
eba99c28e4 Fix min_write_buffer_number_to_merge = 0 bug
Summary:
It's possible that we set min_write_buffer_number_to_merge to 0.
This should never happen
Closes https://github.com/facebook/rocksdb/pull/1515

Differential Revision: D4183356

Pulled By: yiwu-arbug

fbshipit-source-id: c9d39d7
2016-11-15 13:54:08 -08:00
Artemiy Kolesnikov
91300d01f6 Dynamic max_total_wal_size option
Summary: Closes https://github.com/facebook/rocksdb/pull/1509

Differential Revision: D4176426

Pulled By: yiwu-arbug

fbshipit-source-id: b57689d
2016-11-14 22:54:17 -08:00
Andrew Kryczka
ec2f64794b Consider subcompaction boundaries when updating file boundaries for range deletion
Summary:
Adjusted AddToBuilder() to take lower_bound and upper_bound, which serve two purposes: (1) only range deletions overlapping with the interval [lower_bound, upper_bound) will be added to the output file, and (2) the output file's boundaries will not be extended before lower_bound or after upper_bound. Our computation of lower_bound/upper_bound consider both subcompaction boundaries and previous/next files within the subcompaction.

Test cases are here (level subcompactions: https://gist.github.com/ajkr/63c7eae3e9667c5ebdc0a7efb74ac332, and universal subcompactions: https://gist.github.com/ajkr/5a62af77c4ebe4052a1955c496d51fdb) but can't be included in this diff as they depend on committing the API first. They fail before this change and pass after.
Closes https://github.com/facebook/rocksdb/pull/1501

Reviewed By: yhchiang

Differential Revision: D4171685

Pulled By: ajkr

fbshipit-source-id: ee99db8
2016-11-14 20:24:21 -08:00
Andrew Kryczka
3b192f6186 Handle full final subcompaction output file with range deletions
Summary:
This conditional should only open a new file that's dedicated to range deletions when it's the sole output of the subcompaction. Previously, we created such a file whenever the table builder was nullptr, which would've also been the case whenever the CompactionIterator's final key coincided with the final output table becoming full.
Closes https://github.com/facebook/rocksdb/pull/1507

Differential Revision: D4174613

Pulled By: ajkr

fbshipit-source-id: 9ffacea
2016-11-14 17:54:20 -08:00
Andrew Kryczka
6c57952002 Make range deletion inclusive-exclusive
Summary:
This makes it easier to implement future optimizations like range collapsing.
Closes https://github.com/facebook/rocksdb/pull/1504

Differential Revision: D4172214

Pulled By: ajkr

fbshipit-source-id: ac4942f
2016-11-14 17:39:13 -08:00
Yi Wu
1ea79a78c9 Optimize sequential insert into memtable - Part 1: Interface
Summary:
Currently our skip-list have an optimization to speedup sequential
inserts from a single stream, by remembering the last insert position.
We extend the idea to support sequential inserts from multiple streams,
and even tolerate small reordering wihtin each stream.

This PR is the interface part adding the following:
- Add `memtable_insert_prefix_extractor` to allow specifying prefix for each key.
- Add `InsertWithHint()` interface to memtable, to allow underlying
  implementation to return a hint of insert position, which can be later
  pass back to optimize inserts.
- Memtable will maintain a map from prefix to hints and pass the hint
  via `InsertWithHint()` if `memtable_insert_prefix_extractor` is non-null.
Closes https://github.com/facebook/rocksdb/pull/1419

Differential Revision: D4079367

Pulled By: yiwu-arbug

fbshipit-source-id: 3555326
2016-11-13 19:09:18 -08:00
Yi Wu
df5eeb85ca Optimize sequential insert into memtable - Part 2: Implementation
Summary:
Implement a insert hint into skip-list to hint insert position. This is
to optimize for the write workload where there are multiple stream of
sequential writes. For example, there is a stream of keys of a1, a2,
a3... but also b1, b2, b2... Each stream are not neccessary strictly
sequential, but can get reorder a little bit. User can specify a prefix
extractor and the `SkipListRep` can thus maintan a hint for each of the
stream for fast insert into memtable.

This is the internal implementation part. See #1419 for the interface part.
See inline comments for details.
Closes https://github.com/facebook/rocksdb/pull/1449

Differential Revision: D4106781

Pulled By: yiwu-arbug

fbshipit-source-id: f4d48c4
2016-11-13 13:09:16 -08:00
Islam AbdelRahman
5ed650857d Fix SstFileWriter destructor
Summary:
If user did not call SstFileWriter::Finish() or called Finish() but it failed.
We need to abandon the builder, to avoid destructing it while it's open
Closes https://github.com/facebook/rocksdb/pull/1502

Differential Revision: D4171660

Pulled By: IslamAbdelRahman

fbshipit-source-id: ab6f434
2016-11-12 20:11:19 -08:00
Lijun Tang
adb665e0bf Allowed delayed_write_rate option to be dynamically set.
Summary: Closes https://github.com/facebook/rocksdb/pull/1488

Differential Revision: D4157784

Pulled By: siying

fbshipit-source-id: f150081
2016-11-12 15:54:11 -08:00
Maysam Yabandeh
361010d447 Exporting compaction stats in the form of a map
Summary:
Currently the compaction stats are printed to stdout. We want to export the compaction stats in a map format so that the upper layer apps (e.g., MySQL) could present
the stats in any format required by the them.
Closes https://github.com/facebook/rocksdb/pull/1477

Differential Revision: D4149836

Pulled By: maysamyabandeh

fbshipit-source-id: b3df19f
2016-11-11 20:54:14 -08:00
Aaron Gao
b39b2ee12f do not call get() in recovery mode
Summary:
This is a previous fix that has a typo
Closes https://github.com/facebook/rocksdb/pull/1487

Differential Revision: D4157381

Pulled By: lightmark

fbshipit-source-id: f079be8
2016-11-10 11:24:20 -08:00
Reid Horuff
1ca5f6d132 Fix 2PC Recovery SeqId Miscount
Summary:
Originally sequence ids were calculated, in recovery, based off of the first seqid found if the first log recovered. The working seqid was then incremented from that value based on every insertion that took place. This was faulty because of the potential for missing log files or inserts that skipped the WAL. The current recovery scheme grabs sequence from current recovering batch and increments using memtableinserter to track how many actual inserts take place. This works for 2PC batches as well scenarios where some logs are missing or inserts that skip the WAL.
Closes https://github.com/facebook/rocksdb/pull/1486

Differential Revision: D4156064

Pulled By: reidHoruff

fbshipit-source-id: a6da8d9
2016-11-10 11:09:22 -08:00
Andrew Kryczka
c90fef88b1 fix open failure with empty wal
Summary: Closes https://github.com/facebook/rocksdb/pull/1490

Differential Revision: D4158821

Pulled By: IslamAbdelRahman

fbshipit-source-id: 59b73f4
2016-11-09 22:24:26 -08:00
Andrew Kryczka
4e20c5da20 Store internal keys in TombstoneMap
Summary:
This fixes a correctness issue where ranges with same begin key would overwrite each other.

This diff uses InternalKey as TombstoneMap's key such that all tombstones have unique keys even when their start keys overlap. We also update TombstoneMap to use an internal key comparator.

End-to-end tests pass and are here (https://gist.github.com/ajkr/851ffe4c1b8a15a68d33025be190a7d9) but cannot be included yet since the DeleteRange() API is yet to be checked in. Note both tests failed before this fix.
Closes https://github.com/facebook/rocksdb/pull/1484

Differential Revision: D4155248

Pulled By: ajkr

fbshipit-source-id: 304b4b9
2016-11-09 15:09:18 -08:00
Yueh-Hsuan Chiang
a9fb346e4a Fix RocksDB Lite build failure in c_test.cc
Summary:
Fix the following RocksDB Lite build failure in c_test.cc

db/c_test.c:1051:3: error: implicit declaration of function 'fprintf' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
  fprintf(stderr, "SKIPPED\n");
  ^
db/c_test.c:1051:3: error: declaration of built-in function 'fprintf' requires inclusion of the header <stdio.h> [-Werror,-Wbuiltin-requires-header]
db/c_test.c:1051:11: error: use of undeclared identifier 'stderr'
  fprintf(stderr, "SKIPPED\n");
          ^
3 errors generated.
Closes https://github.com/facebook/rocksdb/pull/1479

Differential Revision: D4151160

Pulled By: yhchiang

fbshipit-source-id: a471a30
2016-11-09 12:24:18 -08:00
Reid Horuff
d133b08f68 Use correct sequence number when creating memtable
Summary:
copied from: 5ebfd2623a

Opening existing RocksDB attempts recovery from log files, which uses
wrong sequence number to create the memtable. This is a regression
introduced in change a400336.

This change includes a test demonstrating the problem, without the fix
the test fails with "Operation failed. Try again.: Transaction could not
check for conflicts for operation at SequenceNumber 1 as the MemTable
only contains changes newer than SequenceNumber 2.  Increasing the value
of the max_write_buffer_number_to_maintain option could reduce the
frequency of this error"

This change is a joint effort by Peter 'Stig' Edwards thatsafunnyname
and me.
Closes https://github.com/facebook/rocksdb/pull/1458

Differential Revision: D4143791

Pulled By: reidHoruff

fbshipit-source-id: 5a25033
2016-11-09 12:24:17 -08:00
Islam AbdelRahman
9bd191d2f4 Fix deadlock between (WriterThread/Compaction/IngestExternalFile)
Summary:
A deadlock is possible if this happen

(1) Writer thread is stopped because it's waiting for compaction to finish
(2) Compaction is waiting for current IngestExternalFile() calls to finish
(3) IngestExternalFile() is waiting to be able to acquire the writer thread
(4) WriterThread is held by stopped writes that are waiting for compactions to finish

This patch fix the issue by not incrementing num_running_ingest_file_ except when we acquire the writer thread.

This patch include a unittest to reproduce the described scenario
Closes https://github.com/facebook/rocksdb/pull/1480

Differential Revision: D4151646

Pulled By: IslamAbdelRahman

fbshipit-source-id: 09b39db
2016-11-09 10:54:10 -08:00
Islam AbdelRahman
193221e0a1 Fix Forward Iterator Seek()/SeekToFirst()
Summary:
In ForwardIterator::SeekInternal(), we may end up passing empty Slice representing an internal key to InternalKeyComparator::Compare.
and when we try to extract the user key from this empty Slice, we will create a slice with size = 0 - 8 ( which will overflow and cause us to read invalid memory as well )

Scenarios to reproduce these issues are in the unit tests
Closes https://github.com/facebook/rocksdb/pull/1467

Differential Revision: D4136660

Pulled By: lightmark

fbshipit-source-id: 151e128
2016-11-08 13:54:31 -08:00
Aaron Gao
e48f3f8b9e remove tabs and duplicate #include in c api
Summary:
fix lint error about tabs and duplicate includes.
Closes https://github.com/facebook/rocksdb/pull/1476

Differential Revision: D4149646

Pulled By: lightmark

fbshipit-source-id: 2e0a632
2016-11-08 13:54:31 -08:00
Jay Lee
a7875272d7 c: support seek_for_prev
Summary:
support seek_for_prev in c abi.
Closes https://github.com/facebook/rocksdb/pull/1457

Differential Revision: D4135360

Pulled By: lightmark

fbshipit-source-id: 61256b0
2016-11-08 12:54:13 -08:00
Andrew Kryczka
9e7cf3469b DeleteRange user iterator support
Summary:
Note: reviewed in  https://reviews.facebook.net/D65115

- DBIter maintains a range tombstone accumulator. We don't cleanup obsolete tombstones yet, so if the user seeks back and forth, the same tombstones would be added to the accumulator multiple times.
- DBImpl::NewInternalIterator() (used to make DBIter's underlying iterator) adds memtable/L0 range tombstones, L1+ range tombstones are added on-demand during NewSecondaryIterator() (see D62205)
- DBIter uses ShouldDelete() when advancing to check whether keys are covered by range tombstones
Closes https://github.com/facebook/rocksdb/pull/1464

Differential Revision: D4131753

Pulled By: ajkr

fbshipit-source-id: be86559
2016-11-04 12:09:22 -07:00
Andrew Kryczka
f998c9790f DeleteRange Get support
Summary:
During Get()/MultiGet(), build up a RangeDelAggregator with range
tombstones as we search through live memtable, immutable memtables, and
SST files. This aggregator is then used by memtable.cc's SaveValue() and
GetContext::SaveValue() to check whether keys are covered.

added tests for Get on memtables/files; end-to-end tests mainly in https://reviews.facebook.net/D64761
Closes https://github.com/facebook/rocksdb/pull/1456

Differential Revision: D4111271

Pulled By: ajkr

fbshipit-source-id: 6e388d4
2016-11-03 18:54:20 -07:00