Commit Graph

7312 Commits

Author SHA1 Message Date
Kefu Chai
13a0bd90ce cmake: add options for enabling TBB and NUMA support
Summary:
see also https://github.com/facebook/rocksdb/issues/3036

Signed-off-by: Kefu Chai <tchaikov@gmail.com>
Closes https://github.com/facebook/rocksdb/pull/3750

Differential Revision: D7765170

Pulled By: ajkr

fbshipit-source-id: 455788b3131bf62a4987a65684b757e68473eed9
2018-04-25 14:26:55 -07:00
Andrew Kryczka
dfc61e7c24 initialize local variable for UBSAN in PosixEnv function
Summary:
It seems clear to me that the variable is initialized before line 492, but it wasn't clear to UBSAN. The failure was:

```
In file included from ./env/io_posix.h:14:0,
                 from env/env_posix.cc:44:
./include/rocksdb/env.h: In member function ‘virtual rocksdb::Status rocksdb::{anonymous}::PosixEnv::NewMemoryMappedFileBuffer(const string&, std::unique_ptr<rocksdb::MemoryMappedFileBuffer>*)’:
./include/rocksdb/env.h:822:36: error: ‘base’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
       : base(_base), length(_length) {}
                                    ^
env/env_posix.cc:482:11: note: ‘base’ was declared here
     void* base;
```

We can just initialize to nullptr to keep UBSAN happy.
Closes https://github.com/facebook/rocksdb/pull/3770

Differential Revision: D7756287

Pulled By: ajkr

fbshipit-source-id: 0f2efb9594e2d3a30706a4ca7e1d4a6328031bf2
2018-04-25 13:42:02 -07:00
Andrew Kryczka
9b89479e64 Pass -latomic to linker when using clang
Summary:
clang compilation is failing due to a4fb1f8c04. In that commit I added a call to `std::atomic::is_lock_free` which was evidently relying on a compiler builtin only present in gcc.

Drawbacks to this fix are:

- users may need to install libatomic
- there might be cases where clang is used even though USE_CLANG is unset (e.g., when clang is the only available compiler). I didn't figure out how to add -latomic in those cases...

An alternative fix mentioned in http://lists.llvm.org/pipermail/llvm-bugs/2017-August/057263.html is using -stdlib=libc++ with clang.
Closes https://github.com/facebook/rocksdb/pull/3769

Differential Revision: D7756261

Pulled By: ajkr

fbshipit-source-id: 26888300683fa9970ab5950239d1aa217e8efd49
2018-04-25 12:13:41 -07:00
Andrew Kryczka
a4fb1f8c04 Add crash-recovery correctness check to db_stress
Summary:
Previously, our `db_stress` tool held the expected state of the DB in-memory, so after crash-recovery, there was no way to verify data correctness. This PR adds an option, `--expected_values_file`, which specifies a file holding the expected values.

In black-box testing, the `db_stress` process can be killed arbitrarily, so updates to the `--expected_values_file` must be atomic. We achieve this by `mmap`ing the file and relying on `std::atomic<uint32_t>` for atomicity. Actually this doesn't provide a total guarantee on what we want as `std::atomic<uint32_t>` could, in theory, be translated into multiple stores surrounded by a mutex. We can verify our assumption by looking at `std::atomic::is_always_lock_free`.

For the `mmap`'d file, we didn't have an existing way to expose its contents as a raw memory buffer. This PR adds it in the `Env::NewMemoryMappedFileBuffer` function, and `MemoryMappedFileBuffer` class.

`db_crashtest.py` is updated to use an expected values file for black-box testing. On the first iteration (when the DB is created), an empty file is provided as `db_stress` will populate it when it runs. On subsequent iterations, that same filename is provided so `db_stress` can check the data is as expected on startup.
Closes https://github.com/facebook/rocksdb/pull/3629

Differential Revision: D7463144

Pulled By: ajkr

fbshipit-source-id: c8f3e82c93e045a90055e2468316be155633bd8b
2018-04-24 15:58:22 -07:00
Maysam Yabandeh
bc0da4b512 Skip duplicate bloom keys when whole_key and prefix are mixed
Summary:
Currently we rely on FilterBitsBuilder to skip the duplicate keys. It does that by comparing that hash of the key to the hash of the last added entry. This logic breaks however when we have whole_key_filtering mixed with prefix blooms as their addition to FilterBitsBuilder will be interleaved. The patch fixes that by comparing the last whole key and last prefix with the whole key and prefix of the new key respectively and skip the call to FilterBitsBuilder if it is a duplicate.
Closes https://github.com/facebook/rocksdb/pull/3764

Differential Revision: D7744413

Pulled By: maysamyabandeh

fbshipit-source-id: 15df73bbbafdfd754d4e1f42ea07f47b03bc5eb8
2018-04-24 10:58:16 -07:00
Gabriel Wicke
090c78a0d7 Support lowering CPU priority of background threads
Summary:
Background activities like compaction can negatively affect
latency of higher-priority tasks like request processing. To avoid this,
rocksdb already lowers the IO priority of background threads on Linux
systems. While this takes care of typical IO-bound systems, it does not
help much when CPU (temporarily) becomes the bottleneck. This is
especially likely when using more expensive compression settings.

This patch adds an API to allow for lowering the CPU priority of
background threads, modeled on the IO priority API. Benchmarks (see
below) show significant latency and throughput improvements when CPU
bound. As a result, workloads with some CPU usage bursts should benefit
from lower latencies at a given utilization, or should be able to push
utilization higher at a given request latency target.

A useful side effect is that compaction CPU usage is now easily visible
in common tools, allowing for an easier estimation of the contribution
of compaction vs. request processing threads.

As with IO priority, the implementation is limited to Linux, degrading
to a no-op on other systems.
Closes https://github.com/facebook/rocksdb/pull/3763

Differential Revision: D7740096

Pulled By: gwicke

fbshipit-source-id: e5d32373e8dc403a7b0c2227023f9ce4f22b413c
2018-04-24 08:41:51 -07:00
Mike Kolupaev
affe01b0d5 Improve write time breakdown stats
Summary:
There's a group of stats in PerfContext for profiling the write path. They break down the write time into WAL write, memtable insert, throttling, and everything else. We use these stats a lot for figuring out the cause of slow writes.

These stats got a bit out of date and are now categorizing some interesting things as "everything else", and also do some double counting. This PR fixes it and adds two new stats: time spent waiting for other threads of the batch group, and time spent waiting for scheduling flushes/compactions. Probably these will be enough to explain all the occasional abnormally slow (multiple seconds) writes that we're seeing.
Closes https://github.com/facebook/rocksdb/pull/3602

Differential Revision: D7251562

Pulled By: al13n321

fbshipit-source-id: 0a2d0f5a4fa5677455e1f566da931cb46efe2a0d
2018-04-23 17:58:54 -07:00
Siying Dong
d5afa73789 Revert "Skip deleted WALs during recovery"
Summary:
This reverts commit 73f21a7b21.

It breaks compatibility. When created a DB using a build with this new change, opening the DB and reading the data will fail with this error:

"Corruption: Can't access /000000.sst: IO error: while stat a file for size: /tmp/xxxx/000000.sst: No such file or directory"

This is because the dummy AddFile4 entry generated by the new code will be treated as a real entry by an older build. The older build will think there is a real file with number 0, but there isn't such a file.
Closes https://github.com/facebook/rocksdb/pull/3762

Differential Revision: D7730035

Pulled By: siying

fbshipit-source-id: f2051859eff20ef1837575ecb1e1bb96b3751e77
2018-04-23 12:01:26 -07:00
Andrew Kryczka
a8a28da215 Avoid directory renames in BackupEngine
Summary:
We used to name private directories like "1.tmp" while BackupEngine populated them, and then rename without the ".tmp" suffix (i.e., rename "1.tmp" to "1") after all files were copied. On glusterfs, directory renames like this require operations across many hosts, and partial failures have caused operational problems.

Fortunately we don't need to rename private directories. We already have a meta-file that uses the tempfile-rename pattern to commit a backup atomically after all its files have been successfully copied. So we can copy private files directly to their final location, so now there's no directory rename.
Closes https://github.com/facebook/rocksdb/pull/3749

Differential Revision: D7705610

Pulled By: ajkr

fbshipit-source-id: fd724a28dd2bf993ce323a5f2cb7e7d6980cc346
2018-04-20 17:28:33 -07:00
Yi Wu
2e72a5899b Disable EnvPosixTest::FilePermission
Summary:
The test is flaky in our CI but could not be reproduce manually on the same CI host. Disabling it.
Closes https://github.com/facebook/rocksdb/pull/3753

Differential Revision: D7716320

Pulled By: yiwu-arbug

fbshipit-source-id: 6bed3b05880c1d24e8dc86bc970e5181bc98fb45
2018-04-20 15:42:42 -07:00
Maysam Yabandeh
bb2a2ec731 WritePrepared Txn: rollback via commit
Summary:
Currently WritePrepared rolls back a transaction with prepare sequence number prepare_seq by i) write a single rollback batch with rollback_seq, ii) add <rollback_seq, rollback_seq> to commit cache, iii) remove prepare_seq from PrepareHeap.
This is correct assuming that there is no snapshot taken when a transaction is rolled back. This is the case the way MySQL does rollback which is after recovery. Otherwise if max_evicted_seq advances the prepare_seq, the live snapshot might assume data as committed since it does not find them in CommitCache.
The change is to simply add <prepare_seq. rollback_seq> to commit cache before removing prepare_seq from PrepareHeap. In this way if max_evicted_seq advances prpeare_seq, the existing mechanism that we have to check evicted entries against live snapshots will make sure that the live snapshot will not see the data of rolled back transaction.
Closes https://github.com/facebook/rocksdb/pull/3745

Differential Revision: D7696193

Pulled By: maysamyabandeh

fbshipit-source-id: c9a2d46341ddc03554dded1303520a1cab74ef9c
2018-04-20 15:28:19 -07:00
Anand Ananthabhotla
dbdaa4662e Add a stat for MultiGet keys found, update memtable hit/miss stats
Summary:
1. Add a new ticker stat rocksdb.number.multiget.keys.found to track the
number of keys successfully read
2. Update rocksdb.memtable.hit/miss in DBImpl::MultiGet(). It was being done in
DBImpl::GetImpl(), but not MultiGet
Closes https://github.com/facebook/rocksdb/pull/3730

Differential Revision: D7677364

Pulled By: anand1976

fbshipit-source-id: af22bd0ef8ddc5cf2b4244b0a024e539fe48bca5
2018-04-20 15:28:19 -07:00
Maysam Yabandeh
c3d1e36cce WritePrepared Txn: enable TryAgain for duplicates at the end of the batch
Summary:
The WriteBatch::Iterate will try with a larger sequence number if the memtable reports a duplicate. This status is specified with TryAgain status. So far the assumption was that the last entry in the batch will never return TryAgain, which is correct when WAL is created via WritePrepared since it always appends a batch separator if a natural one does not exist. However when reading a WAL generated by WriteCommitted this batch separator might  not exist. Although WritePrepared is not supposed to be able to read the WAL generated by WriteCommitted we should avoid confusing scenarios in which the behavior becomes unpredictable. The path fixes that by allowing TryAgain even for the last entry of the write batch.
Closes https://github.com/facebook/rocksdb/pull/3747

Differential Revision: D7708391

Pulled By: maysamyabandeh

fbshipit-source-id: bfaddaa9b14a4cdaff6977f6f63c789a6ab1ee0d
2018-04-20 15:28:19 -07:00
Maysam Yabandeh
17e04039dd Propagate fill_cache config to partitioned index iterator
Summary:
Currently the partitioned index iterator creates a new ReadOptions which ignores the fill_cache config set to ReadOptions passed by the user. The patch propagates fill_cache from the user's ReadOptions to that of partition index iterator.
Also it clarifies the contract of fill_cache that i) it does not apply to filters, ii) it still charges block cache for the size of the data block, it still pin the block if it is already in the block cache.
Closes https://github.com/facebook/rocksdb/pull/3739

Differential Revision: D7678308

Pulled By: maysamyabandeh

fbshipit-source-id: 53ed96424ae922e499e2d4e3580ddc3f0db893da
2018-04-20 15:13:05 -07:00
przemyslaw.skibinski@percona.com
dee95a1afc Fix GitHub issue #3716: gcc-8 warnings
Summary:
Fix the following gcc-8 warnings:
- conflicting C language linkage declaration [-Werror]
- writing to an object with no trivial copy-assignment [-Werror=class-memaccess]
- array subscript -1 is below array bounds [-Werror=array-bounds]

Solves https://github.com/facebook/rocksdb/issues/3716
Closes https://github.com/facebook/rocksdb/pull/3736

Differential Revision: D7684161

Pulled By: yiwu-arbug

fbshipit-source-id: 47c0423d26b74add251f1d3595211eee1e41e54a
2018-04-20 13:42:47 -07:00
Zhongyi Xie
8a9c7f71c9 fix compilation error: implicit conversion loses integer precision
Summary:
Fix compilation error with clang:
> tools/db_stress.cc:2598:21: error: implicit conversion loses integer precision: 'gflags::uint64' (aka 'unsigned long') to 'uint32_t' (aka 'unsigned int') [-Werror,-Wshorten-64-to-32]
        Random rand(FLAGS_seed);
               ~~~~ ^~~~~~~~~~
Closes https://github.com/facebook/rocksdb/pull/3746

Differential Revision: D7703209

Pulled By: miasantreble

fbshipit-source-id: 18c56a5138a2f308e4213594bc82e8e64bc21570
2018-04-19 18:57:43 -07:00
Paweł Bylica
69faddb32e CMake: Read rocksdb version from version.h header file
Summary:
This replaces reading the rocksdb version by external shell script. This does not work reliably on Windows (I wander how it works on AppVeyor).
Closes https://github.com/facebook/rocksdb/pull/3737

Differential Revision: D7703106

Pulled By: ajkr

fbshipit-source-id: 4079c7c77431757e9ddc801363ed896b18fdbf23
2018-04-19 17:42:11 -07:00
Zhongyi Xie
e1e826b980 check return status for Sync() and Append() calls to avoid corruption
Summary:
Right now in `SyncClosedLogs`, `CopyFile`, and `AddRecord`, where `Sync` and `Append` are invoked in a loop, the error status are not checked. This could lead to potential corruption as later calls will overwrite the error status.
Closes https://github.com/facebook/rocksdb/pull/3740

Differential Revision: D7678848

Pulled By: miasantreble

fbshipit-source-id: 4b0b412975989dfe80348f73217b9c4122a4bd77
2018-04-19 14:13:46 -07:00
Yi Wu
ad511684b2 Add block cache related DB properties
Summary:
Add DB properties "rocksdb.block-cache-capacity", "rocksdb.block-cache-usage", "rocksdb.block-cache-pinned-usage" to show block cache usage.
Closes https://github.com/facebook/rocksdb/pull/3734

Differential Revision: D7657180

Pulled By: yiwu-arbug

fbshipit-source-id: dd34a019d5878dab539c51ee82669e97b2b745fd
2018-04-18 21:42:25 -07:00
Andrew Kryczka
3cea61392f include thread-pool priority in thread names
Summary:
Previously threads were named "rocksdb:bg\<index in thread pool\>", so the first thread in all thread pools would be named "rocksdb:bg0". Users want to be able to distinguish threads used for flush (high-pri) vs regular compaction (low-pri) vs compaction to bottom-level (bottom-pri). So I changed the thread naming convention to include the thread-pool priority.
Closes https://github.com/facebook/rocksdb/pull/3702

Differential Revision: D7581415

Pulled By: ajkr

fbshipit-source-id: ce04482b6acd956a401ef22dc168b84f76f7d7c1
2018-04-18 17:27:56 -07:00
Maysam Yabandeh
6d06be22c0 Improve db_stress with transactions
Summary:
db_stress was already capable running transactions by setting use_txn. Running it under stress showed a couple of problems fixed in this patch.
- The uncommitted transaction must be either rolled back or commit after recovery.
- Current implementation of WritePrepared transaction cannot handle cf drop before crash. Clarified that in the comments and added safety checks. When running with use_txn, clear_column_family_one_in must be set to 0.
Closes https://github.com/facebook/rocksdb/pull/3733

Differential Revision: D7654419

Pulled By: maysamyabandeh

fbshipit-source-id: a024bad80a9dc99677398c00d29ff17d4436b7f3
2018-04-18 16:32:35 -07:00
Yanqin Jin
2ee1496c43 Add missing whitespace.
Summary: Closes https://github.com/facebook/rocksdb/pull/3729

Differential Revision: D7645465

Pulled By: riversand963

fbshipit-source-id: a64da0960fe6c39847ef848b8888fe9a9c1df25d
2018-04-17 09:57:40 -07:00
Yi Wu
2c2f388897 db_bench fillXXXdeterministic should respect compression type
Summary:
db_bench fillXXXdeterministic should respect compression type when calling CompactFiles().
Closes https://github.com/facebook/rocksdb/pull/3731

Differential Revision: D7647761

Pulled By: yiwu-arbug

fbshipit-source-id: 15e12429e0dd93ece2231b015f2e26c2d94781e6
2018-04-16 18:01:47 -07:00
Harry Wong
b4f333922a Improve the comment on TableFactory::NewTableReader()
Summary:
`DBImpl::AddFile()` has been replaced by `DBImpl::IngestExternalFile()`.
Closes https://github.com/facebook/rocksdb/pull/3726

Differential Revision: D7646875

Pulled By: ajkr

fbshipit-source-id: 241eb7a8d88527fdc5c26b0c3f6faec3296451f8
2018-04-16 16:58:20 -07:00
Yanqin Jin
5e48811844 Initialize a boolean member variable of a struct.
Summary:
The reason for this initialization is that LLVM UBSAN check will fail due to
uninitialized bool. [StackOverflow post](https://stackoverflow.com/questions/31420154/runtime-error-load-of-value-127-which-is-not-a-valid-value-for-type-bool).

UBSAN log:
> ===== Running external_sst_file_basic_test
[==========] Running 7 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 7 tests from ExternalSSTFileBasicTest
[ RUN      ] ExternalSSTFileBasicTest.Basic
[       OK ] ExternalSSTFileBasicTest.Basic (6 ms)
[ RUN      ] ExternalSSTFileBasicTest.NoCopy
db/external_sst_file_ingestion_job.h:23:8: runtime error: load of value 253, which is not a valid value for type 'bool'

miasantreble  I've tested this locally using the following command.
```
TEST_TMPDIR=/dev/shm/rocksdb COMPILE_WITH_UBSAN=1 OPT=-g make J=1 -j8 ubsan_check
```

ajkr This PR is related to your review comment in [PR](https://github.com/facebook/rocksdb/pull/3713/). It turns out that, with UBSAN enabled, we must provide a default value for boolean member variables.
Closes https://github.com/facebook/rocksdb/pull/3728

Differential Revision: D7642476

Pulled By: riversand963

fbshipit-source-id: 4c09a4b8d271151cb99ae7393db9e4ad9f29762e
2018-04-16 14:28:01 -07:00
Zhongyi Xie
af95aecd01 use delete[] to dealloc an array
Summary:
fix a bug in `db_stress` where an int array was incorrectly deallocated using delete instead of delete[]
Closes https://github.com/facebook/rocksdb/pull/3725

Differential Revision: D7634749

Pulled By: miasantreble

fbshipit-source-id: 489b776f5f4c03de1824edac5495787ec19cc910
2018-04-15 23:56:39 -07:00
Zhongyi Xie
954b496b3f fix memory leak in two_level_iterator
Summary:
this PR fixes a few failed contbuild:
1. ASAN memory leak in Block::NewIterator (table/block.cc:429). the proper destruction of first_level_iter_ and second_level_iter_ of two_level_iterator.cc is missing from the code after the refactoring in https://github.com/facebook/rocksdb/pull/3406
2. various unused param errors introduced by https://github.com/facebook/rocksdb/pull/3662
3. updated comment for `ForceReleaseCachedEntry` to emphasize the use of `force_erase` flag.
Closes https://github.com/facebook/rocksdb/pull/3718

Reviewed By: maysamyabandeh

Differential Revision: D7621192

Pulled By: miasantreble

fbshipit-source-id: 476c94264083a0730ded957c29de7807e4f5b146
2018-04-15 17:26:26 -07:00
Kefu Chai
9fcd82e987 cmake: append rados to THIRDPARTY_LIBS before appending it to LIBS
Summary:
otherwise the env_librados_test executable will fail to link against
librados.

Signed-off-by: Kefu Chai <tchaikov@gmail.com>
Closes https://github.com/facebook/rocksdb/pull/3724

Differential Revision: D7631542

Pulled By: ajkr

fbshipit-source-id: 38afbf21f9aeb7dedfb840aba8b2f8b421f9edb0
2018-04-15 13:27:54 -07:00
Jingguo Yao
81d44f2bc5 fix-typo: add missing periods
Summary: Closes https://github.com/facebook/rocksdb/pull/3720

Differential Revision: D7631525

Pulled By: ajkr

fbshipit-source-id: 50cf4dc363b0d32b150d963011171a8a6f53a384
2018-04-15 13:12:23 -07:00
Amy Tai
28087acd79 Implemented Knuth shuffle to construct permutation for selecting no_o…
Summary:
…verwrite_keys. Also changed each no_overwrite_key set to an unordered set, otherwise Knuth shuffle only gets you 2x time improvement, because insertion (and subsequent internal sorting) into an ordered set is the bottleneck.

With this change, each iteration of permutation construction and prefix selection takes around 40 secs, as opposed to 360 secs previously. However, this still means that with the default 10 CF per blackbox test case, the test is going to time out given the default interval of 200 secs.

Also, there is currently an assertion error affecting all blackbox tests in db_crashtest.py; this assertion error will be fixed in a future PR.
Closes https://github.com/facebook/rocksdb/pull/3699

Differential Revision: D7624616

Pulled By: amytai

fbshipit-source-id: ea64fbe83407ff96c1c0ecabbc6c830576939393
2018-04-13 22:13:13 -07:00
Xiaofei Du
a0102aa6d7 Make database files' permissions configurable
Summary: Closes https://github.com/facebook/rocksdb/pull/3709

Differential Revision: D7610227

Pulled By: xiaofeidu008

fbshipit-source-id: 88a52f0f9f96e2195fccde995cf9760b785e9f07
2018-04-13 13:13:04 -07:00
zhangjinpeng1987
31ee4bf240 add kEntryRangeDeletion
Summary:
When there are many range deletions in a range, we want to trigger manual compaction on this range to reclaim disk space as soon as possible and speed up read.
After this change, we can collect informations of range deletions and store them into user properties which can guide our manual compaction.
Closes https://github.com/facebook/rocksdb/pull/3695

Differential Revision: D7570322

Pulled By: ajkr

fbshipit-source-id: c358fa43b0aac6cc954d2eadc7d3bd8015373369
2018-04-13 11:27:17 -07:00
Steven Fackler
1f5457ef21 Merge raw and shared pointer log method impls
Summary:
Calling rocksdb::Log, rocksdb::Info, etc with a `shared_ptr<Logger>` should behave the same as calling those functions with a `Logger *`. This PR achieves it by making the `shared_ptr<Logger>` versions delegate to the `Logger *` versions.

Closes #3689
Closes https://github.com/facebook/rocksdb/pull/3710

Differential Revision: D7595557

Pulled By: ajkr

fbshipit-source-id: 64dd7f20fd42dc821bac7b8032705c35b483e00d
2018-04-13 11:12:54 -07:00
Yanqin Jin
c81b0abedd Improve accuracy of I/O stats collection of external SST ingestion.
Summary:
RocksDB supports ingestion of external ssts. If ingestion_options.move_files is true, when performing ingestion, RocksDB first tries to link external ssts. If external SST file resides on a different FS, or the underlying FS does not support hard link, then RocksDB performs actual file copy. However, no matter which choice is made, current code increase bytes-written when updating compaction stats, which is inaccurate when RocksDB does NOT copy file.

Rename a sync point.
Closes https://github.com/facebook/rocksdb/pull/3713

Differential Revision: D7604151

Pulled By: riversand963

fbshipit-source-id: dd0c0d9b9a69c7d9ffceafc3d9c23371aa413586
2018-04-13 10:58:42 -07:00
David Lai
3be9b36453 comment unused parameters to turn on -Wunused-parameter flag
Summary:
This PR comments out the rest of the unused arguments which allow us to turn on the -Wunused-parameter flag. This is the second part of a codemod relating to https://github.com/facebook/rocksdb/pull/3557.
Closes https://github.com/facebook/rocksdb/pull/3662

Differential Revision: D7426121

Pulled By: Dayvedde

fbshipit-source-id: 223994923b42bd4953eb016a0129e47560f7e352
2018-04-12 17:59:16 -07:00
Maysam Yabandeh
d15397ba10 WritePrepared Txn: rollback_merge_operands hack
Summary:
This is a hack as temporary fix of MyRocks with rollbacking  the merge operands. The way MyRocks uses merge operands is without protection of locks, which violates the assumption behind the rollback algorithm. They are ok with not being rolled back as it would just create a gap in the autoincrement column. The hack add an option to disable the rollback of merge operands by default and only enables it to let the unit test pass.
Closes https://github.com/facebook/rocksdb/pull/3711

Differential Revision: D7597177

Pulled By: maysamyabandeh

fbshipit-source-id: 544be0f666c7e7abb7f651ec8b23124e05056728
2018-04-12 11:58:11 -07:00
Maysam Yabandeh
6f5e6445d9 WritePrepared Txn: fix smallest_prep atomicity issue
Summary:
We introduced smallest_prep optimization in this commit b225de7e10, which enables storing the smallest uncommitted sequence number along with the snapshot. This enables the readers that read from the snapshot to skip further checks and safely assumed the data is committed if its sequence number is less than smallest uncommitted when the snapshot was taken. The problem was that smallest uncommitted and the snapshot must be taken atomically, and the lack of atomicity had led to readers using a smallest uncommitted after the snapshot was taken and hence mistakenly skipping some data.
This patch fixes the problem by i) separating the process of removing of prepare entries from the AddCommitted function, ii) removing the prepare entires AFTER the committed sequence number is published, iii) getting smallest uncommitted (from the prepare list) BEFORE taking a snapshot. This guarantees that the smallest uncommitted that is accompanied with a snapshot is less than or equal of such number if it was obtained atomically.

Tested by running MySQLStyleTransactionTest/MySQLStyleTransactionTest.TransactionStressTest that was failing sporadically.
Closes https://github.com/facebook/rocksdb/pull/3703

Differential Revision: D7581934

Pulled By: maysamyabandeh

fbshipit-source-id: dc9d6f4fb477eba75d4d5927326905b548a96a32
2018-04-11 20:11:51 -07:00
Yanqin Jin
d42bd041c5 Improve visibility into the reasons for compaction.
Summary:
Add `compaction_reason` as part of event log for event `compaction started`.
Add counters for each `CompactionReason`.
Closes https://github.com/facebook/rocksdb/pull/3679

Differential Revision: D7550348

Pulled By: riversand963

fbshipit-source-id: a19cff3a678c785aa5ef41aac78b9a5968fcc34d
2018-04-11 10:58:44 -07:00
Andrew Kryczka
019d7894eb fix calling SetOptions on deprecated options
Summary:
In `cf_options_type_info`, the deprecated options are all considered to have offset zero in the `MutableCFOptions` struct. Previously we weren't checking in `GetMutableOptionsFromStrings` whether the provided option was deprecated or not and simply writing the provided value to the offset specified by `cf_options_type_info`. That meant setting any deprecated option would overwrite the first element in the struct, which is `write_buffer_size`. `db_stress` hit this often since it calls `SetOptions` with `soft_rate_limit=0` and `hard_rate_limit=0`, which are both deprecated so cause `write_buffer_size` to be set to zero, which causes it to crash on the following assertion:

```
db_stress: db/memtable.cc:106: rocksdb::MemTable::MemTable(const rocksdb::InternalKeyComparator&, const rocksdb::ImmutableCFOptions&, const rocksdb::MutableCFOptions&, rocksdb::WriteBufferManager*, rocksdb::SequenceNumber, uint32_t): Assertion `!ShouldScheduleFlush()' failed.
```

We fix it by skipping deprecated options (and logging a warning) when users provide them to `SetOptions`. I didn't want to fail the call for compatibility reasons.
Closes https://github.com/facebook/rocksdb/pull/3700

Differential Revision: D7572596

Pulled By: ajkr

fbshipit-source-id: bd5d84e14c0c39f30c5d4c6df7c1503d2c28ecf1
2018-04-10 19:02:09 -07:00
Yanqin Jin
d95014b9df fix some text in comments.
Summary:
1. Remove redundant text.
2. Make terminology consistent across all comments and doc of RocksDB. Also do
   our best to conform to conventions. Specifically, use 'callback' instead of
   'call-back' [wikipedia](https://en.wikipedia.org/wiki/Callback_(computer_programming)).
Closes https://github.com/facebook/rocksdb/pull/3693

Differential Revision: D7560396

Pulled By: riversand963

fbshipit-source-id: ba8c251c487f4e7d1872a1a8dc680f9e35a6ffb8
2018-04-10 15:59:24 -07:00
Zhongyi Xie
2770a94c42 make MockTimeEnv::current_time_ atomic to fix data race
Summary:
fix a new TSAN failure
https://gist.github.com/miasantreble/7599c33f4e17da1024c67d4540dbe397
Closes https://github.com/facebook/rocksdb/pull/3694

Differential Revision: D7565310

Pulled By: miasantreble

fbshipit-source-id: f672c96e925797b34dec6e20b59527e8eebaa825
2018-04-10 14:13:18 -07:00
Dmitri Smirnov
5ec382b918 Fix up backupable_db stack corruption.
Summary:
Fix up OACR(Lint) warnings.
Closes https://github.com/facebook/rocksdb/pull/3674

Differential Revision: D7563869

Pulled By: ajkr

fbshipit-source-id: 8c1e5045c8a6a2d85b2933fdbc60fde93bf0c9de
2018-04-09 19:27:24 -07:00
Maysam Yabandeh
d2bcd7611f Fix the memory leak with pinned partitioned filters
Summary:
The existing unit test did not set the level so the check for pinned partitioned filter/index being properly released from the block cache was not properly exercised as they only take effect in level 0. As a result a memory leak in pinned partitioned filters was hidden. The patch fix the test as well as the bug.
Closes https://github.com/facebook/rocksdb/pull/3692

Differential Revision: D7559763

Pulled By: maysamyabandeh

fbshipit-source-id: 55eff274945838af983c764a7d71e8daff092e4a
2018-04-09 16:28:19 -07:00
Gihwan Oh
65fe8d6cd6 Change a comment
Summary:
In this case, we add input files of compaction, not outputs.
Closes https://github.com/facebook/rocksdb/pull/3686

Differential Revision: D7556781

Pulled By: ajkr

fbshipit-source-id: ae135bb6eda60db8f275a9ba2d21c18aaadef5b7
2018-04-09 13:42:31 -07:00
Andrew Kryczka
1c27cbfbd1 fix intra-L0 FIFO for uncompressed use case
Summary:
- inflate the argument passed as `max_compact_bytes_per_del_file` by a bit (10%). The intent of this argument is prevent L0 files from being intra-L0 compacted multiple times. Without compression, some intra-L0 compactions exceed this limit (and thus aren't executed), even though none of their files have gone through intra-L0 before.
- fix `FindIntraL0Compaction` as it was rejecting some valid intra-L0 compactions. In particular, `compact_bytes_per_del_file` is the work-per-deleted-file for the span [0, span_len), whereas `new_compact_bytes_per_del_file` is the work-per-deleted-file for the span [0, span_len+1). The former is more correct for checking whether we've found an eligible span.
Closes https://github.com/facebook/rocksdb/pull/3684

Differential Revision: D7530396

Pulled By: ajkr

fbshipit-source-id: cad4f50902bdc428ac9ff6fffb13eb288648d85e
2018-04-09 13:42:31 -07:00
Zhongyi Xie
f3a1d9e049 fix data race
Summary:
Fix a TSAN failure in `DBRangeDelTest.ValidLevelSubcompactionBoundaries`:
https://gist.github.com/miasantreble/712e04b4de2ff7f193c98b1acf07e899
Closes https://github.com/facebook/rocksdb/pull/3691

Differential Revision: D7541400

Pulled By: miasantreble

fbshipit-source-id: b0b4538980bce7febd0385e61d6e046580bcaefb
2018-04-09 12:28:28 -07:00
Maysam Yabandeh
bde1c1a72a WritePrepared Txn: add stats
Summary:
Adding some stats that would be helpful to monitor if the DB has gone to unlikely stats that would hurt the performance. These are mostly when we end up needing to acquire a mutex.
Closes https://github.com/facebook/rocksdb/pull/3683

Differential Revision: D7529393

Pulled By: maysamyabandeh

fbshipit-source-id: f7d36279a8f39bd84d8ddbf64b5c97f670c5d6d9
2018-04-07 21:56:42 -07:00
Maysam Yabandeh
eb5a295440 WritePrepared Txn: add write_committed option to dump_wal
Summary:
Currently dump_wal cannot print the prepared records from the WAL that is generated by WRITE_PREPARED write policy since the default reaction of the handler is to return NotSupported if markers of WRITE_PREPARED are encountered. This patch enables the admin to pass --write_committed=false option, which will be accordingly passed to the handler. Note that DBFileDumperCommand and DBDumperCommand are still not updated by this patch but firstly they are not urgent and secondly we need to revise this approach later when we also add WRITE_UNPREPARED markers so I leave it for future work.

Tested by running it on a WAL generated by WRITE_PREPARED:
$ ./ldb dump_wal --walfile=/dev/shm/dbbench/000003.log  | grep BEGIN_PREARE | head -1
1,2,70,0,BEGIN_PREARE
$ ./ldb dump_wal --walfile=/dev/shm/dbbench/000003.log --write_committed=false | grep BEGIN_PREARE | head -1
1,2,70,0,BEGIN_PREARE PUT(0) : 0x30303031313330313938 PUT(0) : 0x30303032353732313935 END_PREPARE(0x74786E31313535383434323738303738363938313335312D30)
Closes https://github.com/facebook/rocksdb/pull/3682

Differential Revision: D7522090

Pulled By: maysamyabandeh

fbshipit-source-id: a0332207261c61e18b2f9dfbe9feecd9a1339aca
2018-04-07 21:56:42 -07:00
Adam Retter
ca87aef82d Added support for SstFileManager to RocksJava
Summary: Closes https://github.com/facebook/rocksdb/pull/3666

Differential Revision: D7457634

Pulled By: sagar0

fbshipit-source-id: 47741e2ee66e9255c580f4e38cfb86b284c27c2f
2018-04-06 21:26:32 -07:00
Gihwan Oh
74767deec3 Fix typo
Summary:
regrad -> regard
Closes https://github.com/facebook/rocksdb/pull/3685

Differential Revision: D7540952

Pulled By: miasantreble

fbshipit-source-id: e08c9389f7fccf401c962a4441b62cd5e73a33ad
2018-04-06 15:42:50 -07:00