Commit Graph

7691 Commits

Author SHA1 Message Date
Maysam Yabandeh
2462763b2e Fix mis-spoken assert on prefetch_filter and prefetch_index (#4077)
Summary:
We can have prefetch_index without prefetch_filter but not the other way around. The assert statement is fixed.
Closes https://github.com/facebook/rocksdb/pull/4077

Differential Revision: D8694472

Pulled By: maysamyabandeh

fbshipit-source-id: ccd2804d9d9cdafb1c3e65062c7bc38603e69004
2018-06-29 09:28:12 -07:00
Maysam Yabandeh
29ffbb8a50 Charging block cache more accurately (#4073)
Summary:
Currently the block cache is charged only by the size of the raw data block and excludes the overhead of the c++ objects that contain the raw data block. The patch improves the accuracy of the charge by including the c++ object overhead into it.
Closes https://github.com/facebook/rocksdb/pull/4073

Differential Revision: D8686552

Pulled By: maysamyabandeh

fbshipit-source-id: 8472f7fc163c0644533bc6942e20cdd5725f520f
2018-06-29 08:57:20 -07:00
Zhongyi Xie
b3efb1cbe0 fix clang analyzer warnings (#4072)
Summary:
clang analyze is giving the following warnings:
> db/compaction_job.cc:1178:16: warning: Called C++ object pointer is null
    } else if (meta->smallest.size() > 0) {
               ^~~~~~~~~~~~~~~~~~~~~
db/compaction_job.cc:1201:33: warning: Access to field 'marked_for_compaction' results in a dereference of a null pointer (loaded from variable 'meta')
    meta->marked_for_compaction = sub_compact->builder->NeedCompact();
    ~~~~
db/version_set.cc:2770:26: warning: Called C++ object pointer is null
        uint32_t cf_id = last_writer->cfd->GetID();
                         ^~~~~~~~~~~~~~~~~~~~~~~~~
Closes https://github.com/facebook/rocksdb/pull/4072

Differential Revision: D8685852

Pulled By: miasantreble

fbshipit-source-id: b0e2fd9dfc1cbba2317723e09886384b9b1c9085
2018-06-28 19:12:35 -07:00
Manuel Ung
8ad63a4b86 WriteUnPrepared: Add new WAL marker kTypeBeginUnprepareXID (#4069)
Summary:
This adds a new WAL marker of type kTypeBeginUnprepareXID.

Also, DBImpl now contains a field called batch_per_txn (meaning one WriteBatch per transaction, or possibly multiple WriteBatches). This would also indicate that this DB is using WriteUnprepared policy.

Recovery code would be able to make use of this extra field on DBImpl in a separate diff. For now, it is just used to determine whether the WAL is compatible or not.
Closes https://github.com/facebook/rocksdb/pull/4069

Differential Revision: D8675099

Pulled By: lth

fbshipit-source-id: ca27cae1738e46d65f2bb92860fc759deb874749
2018-06-28 18:58:29 -07:00
Andrew Kryczka
25403c2265 Prefetch cache lines for filter lookup (#4068)
Summary:
Since the filter data is unaligned, even though we ensure all probes are within a span of `cache_line_size` bytes, those bytes can span two cache lines. In that case I doubt hardware prefetching does a great job considering we don't necessarily access those two cache lines in order. This guess seems correct since adding explicit prefetch instructions reduced filter lookup overhead by 19.4%.
Closes https://github.com/facebook/rocksdb/pull/4068

Differential Revision: D8674189

Pulled By: ajkr

fbshipit-source-id: 747427d9a17900151c17820488e3f7efe06b1871
2018-06-28 13:20:29 -07:00
Anand Ananthabhotla
52d4c9b7f6 Allow DB resume after background errors (#3997)
Summary:
Currently, if RocksDB encounters errors during a write operation (user requested or BG operations), it sets DBImpl::bg_error_ and fails subsequent writes. This PR allows the DB to be resumed for certain classes of errors. It consists of 3 parts -
1. Introduce Status::Severity in rocksdb::Status to indicate whether a given error can be recovered from or not
2. Refactor the error handling code so that setting bg_error_ and deciding on severity is in one place
3. Provide an API for the user to clear the error and resume the DB instance

This whole change is broken up into multiple PRs. Initially, we only allow clearing the error for Status::NoSpace() errors during background flush/compaction. Subsequent PRs will expand this to include more errors and foreground operations such as Put(), and implement a polling mechanism for out-of-space errors.
Closes https://github.com/facebook/rocksdb/pull/3997

Differential Revision: D8653831

Pulled By: anand1976

fbshipit-source-id: 6dc835c76122443a7668497c0226b4f072bc6afd
2018-06-28 12:34:40 -07:00
Yanqin Jin
26d67e357e Support group commits of version edits (#3944)
Summary:
This PR supports the group commit of multiple version edit entries corresponding to different column families. Column family drop/creation still cannot be grouped. This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752).
Closes https://github.com/facebook/rocksdb/pull/3944

Differential Revision: D8432536

Pulled By: riversand963

fbshipit-source-id: 8f11bd05193b6c0d9272d82e44b676abfac113cb
2018-06-28 12:34:39 -07:00
Maysam Yabandeh
0a5b5d88b2 Remove ReadOnly part of PinnableSliceAndMmapReads from Lite (#4070)
Summary:
Lite does not support readonly DBs.
Closes https://github.com/facebook/rocksdb/pull/4070

Differential Revision: D8677858

Pulled By: maysamyabandeh

fbshipit-source-id: 536887d2363ee2f5d8e1ea9f1a511e643a1707fa
2018-06-28 08:42:17 -07:00
Taewook Oh
b557499eee Suppress leak warning for clang(LLVM) asan (#4066)
Summary:
Instead of __SANITIZE_ADDRESS__ macro, LLVM uses __has_feature(address_sanitzer) to check if ASAN is enabled for the build. I tested it with MySQL sanitizer build that uses RocksDB as a submodule.
Closes https://github.com/facebook/rocksdb/pull/4066

Reviewed By: riversand963

Differential Revision: D8668941

Pulled By: taewookoh

fbshipit-source-id: af4d1da180c1470d257a228f431eebc61490bc36
2018-06-27 22:13:48 -07:00
Yanqin Jin
7f850b889d Remove 'ALIGNAS' from StatisticsImpl. (#4061)
Summary:
Remove over-alignment on `StatisticsImpl` whose benefit is vague and causes UBSAN check to fail due to `std::make_shared` not respecting the over-alignment requirement.

Test plan
```
$ make clean && COMPILE_WITH_UBSAN=1 OPT=-g make -j16 ubsan_check
```
Closes https://github.com/facebook/rocksdb/pull/4061

Differential Revision: D8656506

Pulled By: riversand963

fbshipit-source-id: db355ae9c7bdd2c9e9c5e63cabba13d8d82cc5f9
2018-06-27 20:59:45 -07:00
Zhongyi Xie
14f409c0f1 PrefixMayMatch: remove unnecessary check for prefix_extractor_ (#4067)
Summary:
with https://github.com/facebook/rocksdb/pull/3601 and https://github.com/facebook/rocksdb/pull/3899, `prefix_extractor_` is not really being used in block based filter and full filter's version of `PrefixMayMatch` because now `prefix_extractor` is passed as an argument. Also it is now possible that prefix_extractor_ may be initialized to nullptr when a non-standard prefix_extractor is used and also for ROCKSDB_LITE. Removing these checks should not break any existing tests.
Closes https://github.com/facebook/rocksdb/pull/4067

Differential Revision: D8669002

Pulled By: miasantreble

fbshipit-source-id: 0e701ba912b8a26734fadb72d15bb1b266b6176a
2018-06-27 20:42:43 -07:00
Zhichao Cao
1f6efabe23 Add bottommost_compression_opts to for bottommost_compression (#3985)
Summary:
…ression

 For `CompressionType` we have options `compression` and `bottommost_compression`. Thus, to make the compression options consitent with the compression type when bottommost_compression is enabled, we add the bottommost_compression_opts
Closes https://github.com/facebook/rocksdb/pull/3985

Reviewed By: riversand963

Differential Revision: D8385911

Pulled By: zhichao-cao

fbshipit-source-id: 07bc533dd61bcf1cef5927d8d62901c13d38d5fc
2018-06-27 17:42:38 -07:00
Maysam Yabandeh
235ab9dd32 Pin mmap files in ReadOnlyDB (#4053)
Summary:
https://github.com/facebook/rocksdb/pull/3881 fixed a bug where PinnableSlice pin mmap files which could be deleted with background compaction. This is however a non-issue for ReadOnlyDB when there is no compaction running and max_open_files is -1. This patch reenables the pinning feature for that case.
Closes https://github.com/facebook/rocksdb/pull/4053

Differential Revision: D8662546

Pulled By: maysamyabandeh

fbshipit-source-id: 402962602eb0f644e17822748332999c3af029fd
2018-06-27 17:13:34 -07:00
Maximilian Alexander
e8f9d7f0d4 Added PingCaps Rust RocksDB and ObjectiveRocks (#4065)
Summary:
1. I added PingCap's more up-to-date Rust Binding of RocksDB
2. I also added ObjectiveRocks which is a very nice binding for _both_ Swift and Objective-C
Closes https://github.com/facebook/rocksdb/pull/4065

Differential Revision: D8670340

Pulled By: siying

fbshipit-source-id: 3db28bf3a464c3e050df52cc92b19248b7f43944
2018-06-27 15:43:21 -07:00
chouxi
818c84e116 Store timestamp in deadlock detection (#4060)
Summary:
- Summary
    Add timestamp into the DeadlockInfo to store the timestamp when deadlock detected on the rocksdb side.

- Testplan:
    `make check -j64`
Closes https://github.com/facebook/rocksdb/pull/4060

Differential Revision: D8655380

Pulled By: chouxi

fbshipit-source-id: f58e1aa5e09eb1d1eed0a181d4e2304aaf01efe8
2018-06-27 12:27:58 -07:00
Daniel Black
e5ae1bb465 Remove bogus gcc-8.1 warning (#3870)
Summary:
Various rearrangements of the cch maths failed or replacing = '\0' with
memset failed to convince the compiler it was nul terminated. So took
the perverse option of changing strncpy to strcpy.

Return null if memory couldn't be allocated.

util/status.cc: In static member function ‘static const char* rocksdb::Status::CopyState(const char*)’:
util/status.cc:28:15: error: ‘char* strncpy(char*, const char*, size_t)’ output truncated before terminating nul copying as many bytes from a string as its length [-Werror=stringop-truncation]
   std::strncpy(result, state, cch - 1);
   ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
util/status.cc:19:18: note: length computed here
       std::strlen(state) + 1; // +1 for the null terminator
       ~~~~~~~~~~~^~~~~~~
cc1plus: all warnings being treated as errors
make: *** [Makefile:645: shared-objects/util/status.o] Error 1

closes #2705
Closes https://github.com/facebook/rocksdb/pull/3870

Differential Revision: D8594114

Pulled By: anand1976

fbshipit-source-id: ab20f3a456a711e4d29144ebe630e4fe3c99ec25
2018-06-27 12:23:07 -07:00
Manuel Ung
a16e00b7b9 WriteUnPrepared Txn: Disable seek to snapshot optimization (#3955)
Summary:
This is implemented by extending ReadCallback with another function `MaxUnpreparedSequenceNumber` which returns the largest visible sequence number for the current transaction, if there is uncommitted data written to DB. Otherwise, it returns zero, indicating no uncommitted data.

There are the places where reads had to be modified.
- Get and Seek/Next was just updated to seek to max(snapshot_seq, MaxUnpreparedSequenceNumber()) instead, and iterate until a key was visible.
- Prev did not need need updates since it did not use the Seek to sequence number optimization. Assuming that locks were held when writing unprepared keys, and ValidateSnapshot runs, there should only be committed keys and unprepared keys of the current transaction, all of which are visible. Prev will simply iterate to get the last visible key.
- Reseeking to skip keys optimization was also disabled for write unprepared, since it's possible to hit the max_skip condition even while reseeking. There needs to be some way to resolve infinite looping in this case.
Closes https://github.com/facebook/rocksdb/pull/3955

Differential Revision: D8286688

Pulled By: lth

fbshipit-source-id: 25e42f47fdeb5f7accea0f4fd350ef35198caafe
2018-06-27 12:23:07 -07:00
Nikhil Benesch
17339dc2f3 Add table property tracking number of range deletions (#4016)
Summary:
Add a new table property, rocksdb.num.range-deletions, which tracks the
number of range deletions in a block-based table. Range deletions are no
longer counted in rocksdb.num.entries; as discovered in PR #3778, there
are various code paths that implicitly assume that rocksdb.num.entries
counts only true keys, not range deletions.

/cc ajkr nvanbenschoten
Closes https://github.com/facebook/rocksdb/pull/4016

Differential Revision: D8527575

Pulled By: ajkr

fbshipit-source-id: 92e7edbe78fda53756a558013c9fb496e7764fd7
2018-06-26 20:27:35 -07:00
Zhongyi Xie
408205a36b use user_key and iterate_upper_bound to determine compatibility of bloom filters (#3899)
Summary:
Previously in https://github.com/facebook/rocksdb/pull/3601 bloom filter will only be checked if `prefix_extractor` in the mutable_cf_options matches the one found in the SST file.
This PR relaxes the requirement by checking if all keys in the range [user_key, iterate_upper_bound) all share the same prefix after transforming using the BF in the SST file. If so, the bloom filter is considered compatible and will continue to be looked at.
Closes https://github.com/facebook/rocksdb/pull/3899

Differential Revision: D8157459

Pulled By: miasantreble

fbshipit-source-id: 18d17cba56a1005162f8d5db7a27aba277089c41
2018-06-26 15:57:26 -07:00
Bas van Schaik
967aa8157a Create lgtm.yml for LGTM.com C/C++ analysis (#4058)
Summary:
As discussed with thatsafunnyname [here](https://discuss.lgtm.com/t/c-c-lang-missing-for-facebook-rocksdb/1079): this configuration enables C/C++ analysis for RocksDB on LGTM.com.

The initial commit will contain a build command (simple `make`) that previously resulted in a build error. The build log will then be available on LGTM.com for you to investigate (if you like). I'll immediately add a second commit to this PR to correct the build command to `make static_lib`, which worked when I tested it earlier today.

If you like you can also enable automatic code review in pull requests. This will alert you to any new code issues before they actually get merged into `master`. Here's an example of how that works for the AMPHTML project: https://github.com/ampproject/amphtml/pull/13060. You can enable it yourself here: https://lgtm.com/projects/g/facebook/rocksdb/ci/.

I'll also add a badge to your README.md in a separate commit — feel free to remove that from this PR if you don't like it.

(Full disclosure: I'm part of the LGTM.com team 🙂. Ping samlanning)
Closes https://github.com/facebook/rocksdb/pull/4058

Differential Revision: D8648410

Pulled By: ajkr

fbshipit-source-id: 98d55fc19cff1b07268ac8425b63e764806065aa
2018-06-26 12:43:04 -07:00
Peter (Stig) Edwards
2694b6dc26 Remove unused imports, from python scripts. (#4057)
Summary:
Also remove redefined variable.
As reported on https://lgtm.com/projects/g/facebook/rocksdb/
Closes https://github.com/facebook/rocksdb/pull/4057

Differential Revision: D8648342

Pulled By: ajkr

fbshipit-source-id: afd2ba84d1364d316010179edd44777e64ca9183
2018-06-26 12:43:04 -07:00
Andrew Kryczka
a8e503e545 Fix universal compaction scheduling conflict with CompactFiles (#4055)
Summary:
Universal size-amp-triggered compaction was pulling the final sorted run into the compaction without checking whether any of its files are already being compacted. When all compactions are automatic, it is safe since it verifies the second-last sorted run is not already being compacted, which implies the last sorted run is also not being compacted (in automatic compaction multiple sorted runs are always compacted together). But with manual compaction, files in the last sorted run can be compacted independently, so the last sorted run also must be checked.

We were seeing the below assertion failure in `db_stress`. Also the test case included in this PR repros the failure.

```
db_universal_compaction_test: db/compaction.cc:312: void rocksdb::Compaction::MarkFilesBeingCompacted(bool): Assertion `mark_as_compacted ? !inputs_[i][j]->being_compacted : inputs_[i][j]->being_compacted' failed.
Aborted (core dumped)
```
Closes https://github.com/facebook/rocksdb/pull/4055

Differential Revision: D8630094

Pulled By: ajkr

fbshipit-source-id: ac3b30a874678b76e113d4f6c42c1260411b08f8
2018-06-26 10:44:56 -07:00
Daniel Black
346d1069c3 Align StatisticsImpl / StatisticsData (#4036)
Summary:
Pinned the alignment of StatisticsData to the cacheline size rather than just extending its size (which could go over two cache lines)if unaligned in allocation.

Avoid compile errors in the process as per individual commit messages.

strengthen static_assert to CACHELINE rather than the highest common multiple.
Closes https://github.com/facebook/rocksdb/pull/4036

Differential Revision: D8582844

Pulled By: yiwu-arbug

fbshipit-source-id: 363c37029f28e6093e06c60b987bca9aa204bc71
2018-06-25 22:58:19 -07:00
Yi Wu
6d454d7376 BlobDB: is_fifo=true also evict non-TTL blob files (#4049)
Summary:
Previously with is_fifo=true we only evict TTL file. Changing it to also evict non-TTL files from oldest to newest, after exhausted TTL files.
Closes https://github.com/facebook/rocksdb/pull/4049

Differential Revision: D8604597

Pulled By: yiwu-arbug

fbshipit-source-id: bc4209ee27c1528ce4b72833e6f1e1bff80082c1
2018-06-25 22:43:05 -07:00
Sagar Vemuri
189f0c27aa Make BlockBasedTableIterator compaction-aware (#4048)
Summary:
Pass in `for_compaction` to `BlockBasedTableIterator` via `BlockBasedTableReader::NewIterator`.

In 7103559f49, `for_compaction` was set in `BlockBasedTable::Rep` via `BlockBasedTable::SetupForCompaction`. In hindsight it was not the right decision; it also caused TSAN to complain.
Closes https://github.com/facebook/rocksdb/pull/4048

Differential Revision: D8601056

Pulled By: sagar0

fbshipit-source-id: 30127e898c15c38c1080d57710b8c5a6d64a0ab3
2018-06-25 13:19:27 -07:00
Yi Wu
a71e467381 Blob DB: enable readahead for garbage collection (#3648)
Summary:
Enable readahead for blob DB garbage collection, which should improve GC performance a little bit.
Closes https://github.com/facebook/rocksdb/pull/3648

Differential Revision: D7383791

Pulled By: yiwu-arbug

fbshipit-source-id: 642b3327f7105eca85986d3fb2d8f960a3d83cf1
2018-06-23 23:12:00 -07:00
Yanqin Jin
2729dd72ad Reclaim memory allocated to backup_engine.
Summary: Closes https://github.com/facebook/rocksdb/pull/4045

Differential Revision: D8595609

Pulled By: riversand963

fbshipit-source-id: 5ba5954d804b82b0e7264b2e18e1da4c94103b53
2018-06-23 17:12:14 -07:00
Maysam Yabandeh
80ade9ad83 Pin top-level index on partitioned index/filter blocks (#4037)
Summary:
Top-level index in partitioned index/filter blocks are small and could be pinned in memory. So far we use that by cache_index_and_filter_blocks to false. This however make it difficult to keep account of the total memory usage. This patch introduces pin_top_level_index_and_filter which in combination with cache_index_and_filter_blocks=true keeps the top-level index in cache and yet pinned them to avoid cache misses and also cache lookup overhead.
Closes https://github.com/facebook/rocksdb/pull/4037

Differential Revision: D8596218

Pulled By: maysamyabandeh

fbshipit-source-id: 3a5f7f9ca6b4b525b03ff6bd82354881ae974ad2
2018-06-22 15:27:46 -07:00
Yi Wu
c726f7fda8 Fix dangling checkpoint pointer in db_stress (#4042)
Summary:
Fix db_stress failed to delete checkpoint pointer. It's caught by asan_crash test.
Closes https://github.com/facebook/rocksdb/pull/4042

Differential Revision: D8592604

Pulled By: yiwu-arbug

fbshipit-source-id: 7b2d67d5e3dfb05f71c33fcf320482303e97d3ef
2018-06-22 11:43:50 -07:00
Adam Retter
64c85d0d97 Set DEBUG_LEVEL=0 for RocksJava Mac Release (#4040)
Summary:
Closes https://github.com/facebook/rocksdb/issues/2717
Closes https://github.com/facebook/rocksdb/pull/4040

Differential Revision: D8592058

Pulled By: sagar0

fbshipit-source-id: d01099a1067aa32659abb0b4bed641d919a3927e
2018-06-22 10:57:48 -07:00
Zhongyi Xie
795e663df0 option for timing measurement of non-blocking ops during compaction (#4029)
Summary:
For example calling CompactionFilter is always timed and gives the user no way to disable.
This PR will disable the timer if `Statistics::stats_level_` (which is part of DBOptions) is `kExceptDetailedTimers`
Closes https://github.com/facebook/rocksdb/pull/4029

Differential Revision: D8583670

Pulled By: miasantreble

fbshipit-source-id: 913be9fe433ae0c06e88193b59d41920a532307f
2018-06-21 21:28:05 -07:00
Andrew Kryczka
0a5b16c7c5 Cleanup staging directory at start of checkpoint (#4035)
Summary:
- Attempt to clean the checkpoint staging directory before starting a checkpoint. It was already cleaned up at the end of checkpoint. But it wasn't cleaned up in the edge case where the process crashed while staging checkpoint files.
- Attempt to clean the checkpoint directory before calling `Checkpoint::Create` in `db_stress`. This handles the case where checkpoint directory was created by a previous `db_stress` run but the process crashed before cleaning it up.
- Use `DestroyDB` for cleaning checkpoint directory since a checkpoint is a DB.
Closes https://github.com/facebook/rocksdb/pull/4035

Reviewed By: yiwu-arbug

Differential Revision: D8580223

Pulled By: ajkr

fbshipit-source-id: 28c667400e249fad0fdedc664b349031b7b61599
2018-06-21 16:27:12 -07:00
Sagar Vemuri
645e57c22d Assert for Direct IO at the beginning in PositionedRead (#3891)
Summary:
Moved the direct-IO assertion to the top in `PosixSequentialFile::PositionedRead`, as it doesn't make sense to check for sector alignments before checking for direct IO.
Closes https://github.com/facebook/rocksdb/pull/3891

Differential Revision: D8267972

Pulled By: sagar0

fbshipit-source-id: 0ecf77c0fb5c35747a4ddbc15e278918c0849af7
2018-06-21 14:58:01 -07:00
Yi Wu
58c221440c Update TARGETS file (#4028)
Summary:
-Wshorten-64-to-32 is invalid flag in fbcode. Changing it to -Warrowing.
Closes https://github.com/facebook/rocksdb/pull/4028

Differential Revision: D8553694

Pulled By: yiwu-arbug

fbshipit-source-id: 1523cbcb4c76cf1d2b10a4d28b5f58c78e6cb876
2018-06-21 14:42:39 -07:00
Yanqin Jin
397495964b Fix a warning (treated as error) caused by type mismatch.
Summary: Closes https://github.com/facebook/rocksdb/pull/4032

Differential Revision: D8573061

Pulled By: riversand963

fbshipit-source-id: 112324dcb35956d6b3ec891073f4f21493933c8b
2018-06-21 11:13:09 -07:00
Sagar Vemuri
7103559f49 Improve direct IO range scan performance with readahead (#3884)
Summary:
This PR extends the improvements in #3282 to also work when using Direct IO.
We see **4.5X performance improvement** in seekrandom benchmark doing long range scans, when using direct reads, on flash.

**Description:**
This change improves the performance of iterators doing long range scans (e.g. big/full index or table scans in MyRocks) by using readahead and prefetching additional data on each disk IO, and storing in a local buffer. This prefetching is automatically enabled on noticing more than 2 IOs for the same table file during iteration. The readahead size starts with 8KB and is exponentially increased on each additional sequential IO, up to a max of 256 KB. This helps in cutting down the number of IOs needed to complete the range scan.

**Implementation Details:**
- Used `FilePrefetchBuffer` as the underlying buffer to store the readahead data. `FilePrefetchBuffer` can now take file_reader, readahead_size and max_readahead_size as input to the constructor, and automatically do readahead.
- `FilePrefetchBuffer::TryReadFromCache` can now call `FilePrefetchBuffer::Prefetch` if readahead is enabled.
- `AlignedBuffer` (which is the underlying store for `FilePrefetchBuffer`) now takes a few additional args in `AlignedBuffer::AllocateNewBuffer` to allow copying data from the old buffer.
- Made sure not to re-read partial chunks of data that were already available in the buffer, from device again.
- Fixed a couple of cases where `AlignedBuffer::cursize_` was not being properly kept up-to-date.

**Constraints:**
- Similar to #3282, this gets currently enabled only when ReadOptions.readahead_size = 0 (which is the default value).
- Since the prefetched data is stored in a temporary buffer allocated on heap, this could increase the memory usage if you have many iterators doing long range scans simultaneously.
- Enabled only for user reads, and disabled for compactions. Compaction reads are controlled by the options `use_direct_io_for_flush_and_compaction` and `compaction_readahead_size`, and the current feature takes precautions not to mess with them.

**Benchmarks:**
I used the same benchmark as used in #3282.
Data fill:
```
TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=fillrandom -num=1000000000 -compression_type="none" -level_compaction_dynamic_level_bytes
```

Do a long range scan: Seekrandom with large number of nexts
```
TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=seekrandom -use_direct_reads -duration=60 -num=1000000000 -use_existing_db -seek_nexts=10000 -statistics -histogram
```

```
Before:
seekrandom   :   37939.906 micros/op 26 ops/sec;   29.2 MB/s (1636 of 1999 found)
With this change:
seekrandom   :   8527.720 micros/op 117 ops/sec;  129.7 MB/s (6530 of 7999 found)
```
~4.5X perf improvement. Taken on an average of 3 runs.
Closes https://github.com/facebook/rocksdb/pull/3884

Differential Revision: D8082143

Pulled By: sagar0

fbshipit-source-id: 4d7a8561cbac03478663713df4d31ad2620253bb
2018-06-21 11:13:08 -07:00
Yanqin Jin
524c6e6b72 Add file name info to SequentialFileReader. (#4026)
Summary:
We potentially need this information for tracing, profiling and diagnosis.
Closes https://github.com/facebook/rocksdb/pull/4026

Differential Revision: D8555214

Pulled By: riversand963

fbshipit-source-id: 4263e06c00b6d5410b46aa46eb4e358ff2161dd2
2018-06-21 08:42:24 -07:00
Andrew Kryczka
14cee194d6 Support file ingestion in stress test (#4018)
Summary:
Once per `ingest_external_file_one_in` operations, uses SstFileWriter to create a file containing `ingest_external_file_width` consecutive keys. The file is named containing the thread ID to avoid clashes. The file is then added to the DB using `IngestExternalFile`.

We can't enable it by default in crash test because `nooverwritepercent` and `test_batches_snapshot` both must be zero for the DB's whole lifetime. Perhaps we should setup a separate test with that config as range deletion also requires it.
Closes https://github.com/facebook/rocksdb/pull/4018

Differential Revision: D8507698

Pulled By: ajkr

fbshipit-source-id: 1437ea26fd989349a9ce8b94117241c65e40f10f
2018-06-20 22:27:45 -07:00
Dmitri Smirnov
61d69d450d Hide jemalloc aligned allocation functions into .cc (#4025)
Summary:
so they could be overriden
Closes https://github.com/facebook/rocksdb/pull/4025

Differential Revision: D8526287

Pulled By: siying

fbshipit-source-id: 9537b299dc907b4d1eeaf77a8784b13cb058280d
2018-06-19 17:12:06 -07:00
Maysam Yabandeh
28a9d8910b Fix the bug with duplicate prefix in partition filters (#4024)
Summary:
https://github.com/facebook/rocksdb/pull/3764 introduced an optimization feature to skip duplicate prefix entires in full bloom filters. Unfortunately it also introduces a bug in partitioned full filters, where the duplicate prefix should still be inserted if it is in a new partition. The patch fixes the bug by resetting the duplicate detection logic each time a partition is cut.
This bug could result into false negatives, which means that DB could skip an existing key.
Closes https://github.com/facebook/rocksdb/pull/4024

Differential Revision: D8518866

Pulled By: maysamyabandeh

fbshipit-source-id: 044f4d988e606a330ecafd8c79daceb68b8796bf
2018-06-19 14:12:46 -07:00
Siying Dong
92ee3350e0 BlockBasedTableIterator to keep BlockIter after out of upper bound (#4004)
Summary:
b555ed30a4 makes the BlockBasedTableIterator to be invalidated if the current position if over the upper bound. However, this can bring performance regression to the case of multiple Seek()s hitting the same data block but all out of upper bound.

For example, if an SST file has a data block containing following keys : {a, z}

The user sets the upper bound to be "x", and it executed following queries:
Seek("b")
Seek("c")
Seek("d")

Before the upper bound optimization, these queries always come to this same current data block of the iterator, but now inside each Seek() the data block is read from the block cache but is returned again.

To prevent this regression case, we keep the current data block iterator if it is upper bound.
Closes https://github.com/facebook/rocksdb/pull/4004

Differential Revision: D8463192

Pulled By: siying

fbshipit-source-id: 8710628b30acde7063a097c3184d6c4333a8ef81
2018-06-19 09:57:11 -07:00
Andrew Kryczka
7f3a634e06 Support pipelined write in stress/crash tests
Summary: Closes https://github.com/facebook/rocksdb/pull/4019

Differential Revision: D8508681

Pulled By: ajkr

fbshipit-source-id: 23a3c07d642386446e322b02e69cdf70d12ef009
2018-06-19 09:14:12 -07:00
Andrew Kryczka
8585059ae0 Support backup and checkpoint in db_stress (#4005)
Summary:
Add the `backup_one_in` and `checkpoint_one_in` options to periodically trigger backups and checkpoints. The directory names contain thread ID to avoid clashing with parallel backups/checkpoints. Enable checkpoint in crash test so our CI runs will use it. Didn't enable backup in crash test since it copies all the files which is too slow.
Closes https://github.com/facebook/rocksdb/pull/4005

Differential Revision: D8472275

Pulled By: ajkr

fbshipit-source-id: ff91bdc37caac4ffd97aea8df96b3983313ac1d5
2018-06-18 19:28:18 -07:00
Andrew Kryczka
de2c6fb158 Fix stderr processing in crash test (#4006)
Summary:
Fixed bug where `db_stress` output a line with a warning followed by a line with an error, and `db_crashtest.py` considered that a success. For example:

```
WARNING: prefix_size is non-zero but memtablerep != prefix_hash
open error: Corruption: SST file is ahead of WALs
```
Closes https://github.com/facebook/rocksdb/pull/4006

Differential Revision: D8473463

Pulled By: ajkr

fbshipit-source-id: 60461bdd7491d9d26c63f7d4ee522a0f88ba3de7
2018-06-18 17:58:13 -07:00
Tomas Kolda
c766887458 Fix ExternalSSTFileTest::OverlappingRanges test on Solaris Sparc (#4012)
Summary:
Fix of #4011
Closes https://github.com/facebook/rocksdb/pull/4012

Differential Revision: D8499173

Pulled By: sagar0

fbshipit-source-id: cbb2b90c544ed364a3640ea65835d577b2dbc5df
2018-06-18 14:57:37 -07:00
Tomas Kolda
7b4b43febb zLinux build error with gcc and IBM Java headers (#4013)
Summary:
`SetByteArrayRegion` does not have const source buffer thus compilation error. I have made that same as in other JNI files (const_cast). It was missing for new transaction functionality added recently.
Closes https://github.com/facebook/rocksdb/pull/4013

Differential Revision: D8493290

Pulled By: sagar0

fbshipit-source-id: 14afedf365b111121bd11e68a8d546a1cae68b26
2018-06-18 13:58:28 -07:00
Tomas Kolda
e5bee404ce zLinux s390x support in JNI (#4009)
Summary:
Adding support for zLinux on s390x architecture in JNI.
Closes https://github.com/facebook/rocksdb/pull/4009

Differential Revision: D8483750

Pulled By: siying

fbshipit-source-id: e681657c27e7a28f1731e08e8570382de5deff44
2018-06-18 09:57:02 -07:00
Tomas Kolda
e750dacffb Crash on Windows, because of shared_ptr reinterpret cast (#3999)
Summary:
For more details see #3998
Closes https://github.com/facebook/rocksdb/pull/3999

Differential Revision: D8458905

Pulled By: sagar0

fbshipit-source-id: d6e09182933253a08eaf81ac7cfe50ed3b6576c5
2018-06-17 20:56:33 -07:00
Zhongyi Xie
80bc35927c Should only decode restart points for uncompressed blocks (#3996)
Summary:
The Block object assumes contents are uncompressed. Block's constructor tries to read the number of restarts, but does not get an accurate number when its contents are compressed, which is causing issues like https://github.com/facebook/rocksdb/issues/3843.
This PR address this issue by skipping reconstruction of restart points when blocks are known to be compressed. Somehow the restart points can be read directly when Snappy is used and some tests (for example https://github.com/facebook/rocksdb/blob/master/db/db_block_cache_test.cc#L196) expects blocks to be fully constructed even when Snappy compression is used, so here we keep the restart point logic for Snappy.
Closes https://github.com/facebook/rocksdb/pull/3996

Differential Revision: D8416186

Pulled By: miasantreble

fbshipit-source-id: 002c0b62b9e5d89fb7736563d354ce0023c8cb28
2018-06-15 19:26:58 -07:00
Anand Ananthabhotla
c48764ba47 Don't generate a notification for a 0 size SST (#4003)
Summary:
Don't call the OnTableFileCreated listener callback when a 0 size SST
file gets created by Flush. Doing so causes an assertion failure in db_stress. It is also not correct behavior as we call env->DeleteFile() for such files right before the notification.
Closes https://github.com/facebook/rocksdb/pull/4003

Differential Revision: D8461385

Pulled By: anand1976

fbshipit-source-id: ae92d4f921c2e2cff981ad58f4929ed8b609f35d
2018-06-15 17:57:24 -07:00