Commit Graph

3004 Commits

Author SHA1 Message Date
Maysam Yabandeh
02693f64fc WritePrepared Txn: ValidateSnapshot
Summary:
Implements ValidateSnapshot for WritePrepared txns and also adds a unit test to clarify the contract of this function.
Closes https://github.com/facebook/rocksdb/pull/3101

Differential Revision: D6199405

Pulled By: maysamyabandeh

fbshipit-source-id: ace509934c307ea5d26f4bbac5f836d7c80fd240
2017-11-01 19:11:09 -07:00
Mikhail Antonov
7fe3b32896 Added support for differential snapshots
Summary:
The motivation for this PR is to add to RocksDB support for differential (incremental) snapshots, as snapshot of the DB changes between two points in time (one can think of it as diff between to sequence numbers, or the diff D which can be thought of as an SST file or just set of KVs that can be applied to sequence number S1 to get the database to the state at sequence number S2).

This feature would be useful for various distributed storages layers built on top of RocksDB, as it should help reduce resources (time and network bandwidth) needed to recover and rebuilt DB instances as replicas in the context of distributed storages.

From the API standpoint that would like client app requesting iterator between (start seqnum) and current DB state, and reading the "diff".

This is a very draft PR for initial review in the discussion on the approach, i'm going to rework some parts and keep updating the PR.

For now, what's done here according to initial discussions:

Preserving deletes:
 - We want to be able to optionally preserve recent deletes for some defined period of time, so that if a delete came in recently and might need to be included in the next incremental snapshot it would't get dropped by a compaction. This is done by adding new param to Options (preserve deletes flag) and new variable to DB Impl where we keep track of the sequence number after which we don't want to drop tombstones, even if they are otherwise eligible for deletion.
 - I also added a new API call for clients to be able to advance this cutoff seqnum after which we drop deletes; i assume it's more flexible to let clients control this, since otherwise we'd need to keep some kind of timestamp < -- > seqnum mapping inside the DB, which sounds messy and painful to support. Clients could make use of it by periodically calling GetLatestSequenceNumber(), noting the timestamp, doing some calculation and figuring out by how much we need to advance the cutoff seqnum.
 - Compaction codepath in compaction_iterator.cc has been modified to avoid dropping tombstones with seqnum > cutoff seqnum.

Iterator changes:
 - couple params added to ReadOptions, to optionally allow client to request internal keys instead of user keys (so that client can get the latest value of a key, be it delete marker or a put), as well as min timestamp and min seqnum.

TableCache changes:
 - I modified table_cache code to be able to quickly exclude SST files from iterators heep if creation_time on the file is less then iter_start_ts as passed in ReadOptions. That would help a lot in some DB settings (like reading very recent data only or using FIFO compactions), but not so much for universal compaction with more or less long iterator time span.

What's left:

 - Still looking at how to best plug that inside DBIter codepath. So far it seems that FindNextUserKeyInternal only parses values as UserKeys, and iter->key() call generally returns user key. Can we add new API to DBIter as internal_key(), and modify this internal method to optionally set saved_key_ to point to the full internal key? I don't need to store actual seqnum there, but I do need to store type.
Closes https://github.com/facebook/rocksdb/pull/2999

Differential Revision: D6175602

Pulled By: mikhail-antonov

fbshipit-source-id: c779a6696ee2d574d86c69cec866a3ae095aa900
2017-11-01 18:56:43 -07:00
Maysam Yabandeh
17731a43a6 WritePrepared Txn: Optimize for recoverable state
Summary:
GetCommitTimeWriteBatch is currently used to store some state as part of commit in 2PC. In MyRocks it is specifically used to store some data that would be needed only during recovery. So it is not need to be stored in memtable right after each commit.
This patch enables an optimization to write the GetCommitTimeWriteBatch only to the WAL. The batch will be written to memtable during recovery when the WAL is replayed. To cover the case when WAL is deleted after memtable flush, the batch is also buffered and written to memtable right before each memtable flush.
Closes https://github.com/facebook/rocksdb/pull/3071

Differential Revision: D6148023

Pulled By: maysamyabandeh

fbshipit-source-id: 2d09bae5565abe2017c0327421010d5c0d55eaa7
2017-11-01 17:26:46 -07:00
Shaohua Li
33c7d4ccd9 Make writable_file_max_buffer_size dynamic
Summary:
The DBOptions::writable_file_max_buffer_size can be changed dynamically.
Closes https://github.com/facebook/rocksdb/pull/3053

Differential Revision: D6152720

Pulled By: shligit

fbshipit-source-id: aa0c0cfcfae6a54eb17faadb148d904797c68681
2017-10-31 13:56:35 -07:00
Andrew Kryczka
b7bc9cc038 fix tracking oldest snapshot for bottom-level compaction
Summary:
The assertion was caught by `MySQLStyleTransactionTest/MySQLStyleTransactionTest.TransactionStressTest/5` when run in a loop. The caller doesn't track whether the released snapshot is oldest, so let this function handle that case.
Closes https://github.com/facebook/rocksdb/pull/3080

Differential Revision: D6185257

Pulled By: ajkr

fbshipit-source-id: 4b3015c11db5d31e46521a00af568546ef4558cd
2017-10-30 00:55:58 -07:00
Yi Wu
792ef10ca8 Return Status::InvalidArgument if user request sync write while disabling WAL
Summary:
write_options.sync = true and write_options.disableWAL is incompatible. When WAL is disabled, we are not able to persist the write immediately. Return an error in this case to avoid misuse of the options.
Closes https://github.com/facebook/rocksdb/pull/3086

Differential Revision: D6176822

Pulled By: yiwu-arbug

fbshipit-source-id: 1eb10028c14fe7d7c13c8bc12c0ef659f75aa071
2017-10-28 22:11:18 -07:00
Andrew Kryczka
6a9335dbbb always drop tombstones compacted to bottommost level
Summary:
Problem was in bottommost compaction, when an L0->L0 compaction happened and L0 was bottommost. Then we'd preserve tombstones according to `Compaction::KeyNotExistsBeyondOutputLevel`, while zeroing seqnum according to `CompactionIterator::PrepareOutput`, thus triggering the assertion in `PrepareOutput`. To fix, we can just drop tombstones in L0->L0 when the output is "bottommost", i.e., the compaction includes the oldest L0 file and there's nothing at lower levels.
Closes https://github.com/facebook/rocksdb/pull/3085

Differential Revision: D6175742

Pulled By: ajkr

fbshipit-source-id: 8ab19a2e001496f362e9eb0a71757e2f6ecfdb3b
2017-10-27 15:56:35 -07:00
Yi Wu
84a04af9a9 TableProperty::oldest_key_time defaults to 0
Summary:
We don't propagate TableProperty::oldest_key_time on compaction and just write the default value to SST files. It is more natural to default the value to 0.

Also revert db_sst_test back to before #2842.
Closes https://github.com/facebook/rocksdb/pull/3079

Differential Revision: D6165702

Pulled By: yiwu-arbug

fbshipit-source-id: ca3ce5928d96ae79a5beb12bb7d8c640a71478a0
2017-10-27 15:00:05 -07:00
Islam AbdelRahman
05993155ef Mark files as trash by using .trash extension
Summary:
SstFileManager move files that need to be deleted into a trash directory.
Deprecate this behaviour and instead add ".trash" extension to files that need to be deleted
Closes https://github.com/facebook/rocksdb/pull/2970

Differential Revision: D5976805

Pulled By: IslamAbdelRahman

fbshipit-source-id: 27374ece4315610b2792c30ffcd50232d4c9a343
2017-10-27 13:27:12 -07:00
Prashant D
50e95a63dd Fix coverity issues column_family, compaction_db/iterator
Summary:
db/column_family.h :
79  ColumnFamilyHandleInternal()

CID 1322806 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
2. uninit_member: Non-static class member internal_cfd_ is not initialized in this constructor nor in any functions that it calls.
 80      : ColumnFamilyHandleImpl(nullptr, nullptr, nullptr) {}

db/compacted_db_impl.cc:
 18CompactedDBImpl::CompactedDBImpl(
 19  const DBOptions& options, const std::string& dbname)
 20  : DBImpl(options, dbname) {
   	2. uninit_member: Non-static class member cfd_ is not initialized in this constructor nor in any functions that it calls.
   	4. uninit_member: Non-static class member version_ is not initialized in this constructor nor in any functions that it calls.

CID 1396120 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
6. uninit_member: Non-static class member user_comparator_ is not initialized in this constructor nor in any functions that it calls.
 21}

db/compaction_iterator.cc:
9. uninit_member: Non-static class member current_user_key_sequence_ is not initialized in this constructor nor in any functions that it calls.
11. uninit_member: Non-static class member current_user_key_snapshot_ is not initialized in this constructor nor in any functions that it calls.

CID 1419855 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
13. uninit_member: Non-static class member current_key_committed_ is not initialized in this constructor nor in any functions that it calls.
Closes https://github.com/facebook/rocksdb/pull/3084

Differential Revision: D6172999

Pulled By: sagar0

fbshipit-source-id: 084d73393faf8022c01359cfb445807b6a782460
2017-10-27 11:26:42 -07:00
Prashant D
47166baeac Fix coverity uninitialized fields warnings
Pulled By: ajkr

Differential Revision: D6170448

fbshipit-source-id: 5fd6d1608fc0df27c94d9f5059315ce7f79b8f5c
2017-10-26 21:11:50 -07:00
Andrew Kryczka
95667383db implement lower bound for iterators
Summary:
- for `SeekToFirst()`, just convert it to a regular `Seek()` if lower bound is specified
- for operations that iterate backwards over user keys (`SeekForPrev`, `SeekToLast`, `Prev`), change `PrevInternal` to check whether user key went below lower bound every time the user key changes -- same approach we use to ensure we stay within a prefix when `prefix_same_as_start=true`.
Closes https://github.com/facebook/rocksdb/pull/3074

Differential Revision: D6158654

Pulled By: ajkr

fbshipit-source-id: cb0e3a922e2650d2cd4d1c6e1c0f1e8b729ff518
2017-10-26 17:27:42 -07:00
Yi Wu
5a2a6483dc Blob DB: Inline small values in base DB
Summary:
Adding the `min_blob_size` option to allow storing small values in base db (in LSM tree) together with the key. The goal is to improve performance for small values, while taking advantage of blob db's low write amplification for large values.

Also adding expiration timestamp to blob index. It will be useful to evict stale blob indexes in base db by adding a compaction filter. I'll work on the compaction filter in future patches.

See blob_index.h for the new blob index format. There are 4 cases when writing a new key:
* small value w/o TTL: put in base db as normal value (i.e. ValueType::kTypeValue)
* small value w/ TTL: put (type, expiration, value) to base db.
* large value w/o TTL: write value to blob log and put (type, file, offset, size, compression) to base db.
* large value w/TTL: write value to blob log and put (type, expiration, file, offset, size, compression) to base db.
Closes https://github.com/facebook/rocksdb/pull/3066

Differential Revision: D6142115

Pulled By: yiwu-arbug

fbshipit-source-id: 9526e76e19f0839310a3f5f2a43772a4ad182cd0
2017-10-26 12:30:54 -07:00
Andrew Kryczka
9b18cc2363 single-file bottom-level compaction when snapshot released
Summary:
When snapshots are held for a long time, files may reach the bottom level containing overwritten/deleted keys. We previously had no mechanism to trigger compaction on such files. This particularly impacted DBs that write to different parts of the keyspace over time, as such files would never be naturally compacted due to second-last level files moving down. This PR introduces a mechanism for bottommost files to be recompacted upon releasing all snapshots that prevent them from dropping their deleted/overwritten keys.

- Changed `CompactionPicker` to compact files in `BottommostFilesMarkedForCompaction()`. These are the last choice when picking. Each file will be compacted alone and output to the same level in which it originated. The goal of this type of compaction is to rewrite the data excluding deleted/overwritten keys.
- Changed `ReleaseSnapshot()` to recompute the bottom files marked for compaction when the oldest existing snapshot changes, and schedule a compaction if needed. We cache the value that oldest existing snapshot needs to exceed in order for another file to be marked in `bottommost_files_mark_threshold_`, which allows us to avoid recomputing marked files for most snapshot releases.
- Changed `VersionStorageInfo` to track the list of bottommost files, which is recomputed every time the version changes by `UpdateBottommostFiles()`. The list of marked bottommost files is first computed in `ComputeBottommostFilesMarkedForCompaction()` when the version changes, but may also be recomputed when `ReleaseSnapshot()` is called.
- Extracted core logic of `Compaction::IsBottommostLevel()` into `VersionStorageInfo::RangeMightExistAfterSortedRun()` since logic to check whether a file is bottommost is now necessary outside of compaction.
Closes https://github.com/facebook/rocksdb/pull/3009

Differential Revision: D6062044

Pulled By: ajkr

fbshipit-source-id: 123d201cf140715a7d5928e8b3cb4f9cd9f7ad21
2017-10-25 16:30:37 -07:00
Islam AbdelRahman
addfe1ef4b Fix tombstone scans in SeekForPrev outside prefix
Summary:
When doing a Seek() or SeekForPrev() we should stop the moment we see a key with a different prefix as start if ReadOptions:: prefix_same_as_start was set to true

Right now we don't stop if we encounter a tombstone outside the prefix while executing SeekForPrev()
Closes https://github.com/facebook/rocksdb/pull/3067

Differential Revision: D6149638

Pulled By: IslamAbdelRahman

fbshipit-source-id: 7f659862d2bf552d3c9104a360c79439ceba2f18
2017-10-25 15:12:00 -07:00
zach shipko
386a57e6ef Fix build on OpenBSD
Summary:
A few simple changes to allow RocksDB to be built on OpenBSD. Let me know if any further changes are needed.
Closes https://github.com/facebook/rocksdb/pull/3061

Differential Revision: D6138800

Pulled By: ajkr

fbshipit-source-id: a13a17b5dc051e6518bd56a8c5efd1d24dd81b0c
2017-10-24 13:27:38 -07:00
Yi Wu
66a2c44ef4 Add DB::Properties::kEstimateOldestKeyTime
Summary:
With FIFO compaction we would like to get the oldest data time for monitoring. The problem is we don't have timestamp for each key in the DB. As an approximation, we expose the earliest of sst file "creation_time" property.

My plan is to override the property with a more accurate value with blob db, where we actually have timestamp.
Closes https://github.com/facebook/rocksdb/pull/2842

Differential Revision: D5770600

Pulled By: yiwu-arbug

fbshipit-source-id: 03833c8f10bbfbee62f8ea5c0d03c0cafb5d853a
2017-10-23 15:27:27 -07:00
Dmitri Smirnov
d2a65c59e1 Fix unused var warnings in Release mode
Summary:
MSVC does not support unused attribute at this time. A separate assignment line fixes the issue probably by being counted as usage for MSVC and it no longer complains about unused var.
Closes https://github.com/facebook/rocksdb/pull/3048

Differential Revision: D6126272

Pulled By: maysamyabandeh

fbshipit-source-id: 4907865db45fd75a39a15725c0695aaa17509c1f
2017-10-23 14:27:04 -07:00
Maysam Yabandeh
63822eb761 Enable two write queues for transactions
Summary:
Enable concurrent_prepare flag for WritePrepared transactions and extend the existing transaction tests with this config.
Closes https://github.com/facebook/rocksdb/pull/3046

Differential Revision: D6106534

Pulled By: maysamyabandeh

fbshipit-source-id: 88c8d21d45bc492beb0a131caea84a2ac5e7d38c
2017-10-23 14:27:04 -07:00
Sagar Vemuri
a02ed12638 Exclude DBTest.DynamicFIFOCompactionOptions test under RocksDB Lite
Summary:
This test shouldn't be enabled under the lite version; and this fixes the failing contrun test due to #3006.
Closes https://github.com/facebook/rocksdb/pull/3056

Differential Revision: D6114681

Pulled By: sagar0

fbshipit-source-id: dc5243549ae6b1353cec7edb820c771d95f66dda
2017-10-20 17:11:39 -07:00
Maysam Yabandeh
3ef55d2c7c Split CompactionFilterWithValueChange
Summary:
The test currently times out when it is run under tsan. This patch split it into 4 tests.
Closes https://github.com/facebook/rocksdb/pull/3047

Differential Revision: D6106515

Pulled By: maysamyabandeh

fbshipit-source-id: 03a28cdf8b1c097be2361b1b0cc3dc1acf2b5d63
2017-10-20 15:42:07 -07:00
Andrew Kryczka
f8b5bb2fd8 remove unused code
Summary:
fixup 6a541afcc4. This code didn't do anything because (1) `bytes_per_sync` is assigned in `EnvOptions`'s constructor; and (2) `OptimizeForCompactionTableWrite`'s return value was ignored, even though its only purpose is to return something.
Closes https://github.com/facebook/rocksdb/pull/3055

Differential Revision: D6114132

Pulled By: ajkr

fbshipit-source-id: ea4831770930e9cf83518e13eb2e1934d1f5487c
2017-10-20 14:11:52 -07:00
Sagar Vemuri
f0804db7f7 Make FIFO compaction options dynamically configurable
Summary:
ColumnFamilyOptions::compaction_options_fifo and all its sub-fields can be set dynamically now.

Some of the ways in which the fifo compaction options can be set are:
- `SetOptions({{"compaction_options_fifo", "{max_table_files_size=1024}"}})`
- `SetOptions({{"compaction_options_fifo", "{ttl=600;}"}})`
- `SetOptions({{"compaction_options_fifo", "{max_table_files_size=1024;ttl=600;}"}})`
- `SetOptions({{"compaction_options_fifo", "{max_table_files_size=51;ttl=49;allow_compaction=true;}"}})`

Most of the code has been made generic enough so that it could be reused later to make universal options (and other such nested defined-types) dynamic with very few lines of parsing/serializing code changes.
Introduced a few new functions like `ParseStruct`, `SerializeStruct` and `GetStringFromStruct`.
The duplicate code in `GetStringFromDBOptions` and `GetStringFromColumnFamilyOptions` has been moved into `GetStringFromStruct`. So they become just simple wrappers now.
Closes https://github.com/facebook/rocksdb/pull/3006

Differential Revision: D6058619

Pulled By: sagar0

fbshipit-source-id: 1e8f78b3374ca5249bb4f3be8a6d3bb4cbc52f92
2017-10-19 15:26:36 -07:00
Dmitri Smirnov
ebab2e2d42 Enable MSVC W4 with a few exceptions. Fix warnings and bugs
Summary: Closes https://github.com/facebook/rocksdb/pull/3018

Differential Revision: D6079011

Pulled By: yiwu-arbug

fbshipit-source-id: 988a721e7e7617967859dba71d660fc69f4dff57
2017-10-19 10:57:12 -07:00
Gihwan Oh
7deed2b43c Fix a typo in a comment
Summary:
instad of for specific level -> instead of a specific level
Closes https://github.com/facebook/rocksdb/pull/3040

Differential Revision: D6090811

Pulled By: sagar0

fbshipit-source-id: 499edef0a6f596c448f61791e6aca8f5cce08e9c
2017-10-18 12:32:28 -07:00
Maysam Yabandeh
7e38238981 WritePrepared Txn: Disable GC during recovery
Summary:
Disables GC during recovery of a WritePrepared txn db to avoid GCing uncommitted key values.
Closes https://github.com/facebook/rocksdb/pull/2980

Differential Revision: D6000191

Pulled By: maysamyabandeh

fbshipit-source-id: fc4d522c643d24ebf043f811fe4ecd0dd0294675
2017-10-18 09:11:50 -07:00
Nikhil Benesch
7891af8b53 expose a hook to skip tables during iteration
Summary:
As discussed on the mailing list (["Skipping entire SSTs while iterating"](https://groups.google.com/forum/#!topic/rocksdb/ujHCJVLrHlU)), this patch adds a `table_filter` to `ReadOptions` that allows specifying a callback to be executed during iteration before each table in the database is scanned. The callback is passed the table's properties; the table is scanned iff the callback returns true.

This can be used in conjunction with a `TablePropertiesCollector` to dramatically speed up scans by skipping tables that are known to contain irrelevant data for the scan at hand.

We're using this [downstream in CockroachDB](https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/engine/db.cc#L2009-L2022) already. With this feature, under ideal conditions, we can reduce the time of an incremental backup in  from hours to seconds.

FYI, the first commit in this PR fixes a segfault that I unfortunately have not figured out how to reproduce outside of CockroachDB. I'm hoping you accept it on the grounds that it is not correct to return 8-byte aligned memory from a call to `malloc` on some 64-bit platforms; one correct approach is to infer the necessary alignment from `std::max_align_t`, as done here. As noted in the first commit message, the bug is tickled by having a`std::function` in `struct ReadOptions`. That is, the following patch alone is enough to cause RocksDB to segfault when run from CockroachDB on Darwin.

```diff
 --- a/include/rocksdb/options.h
+++ b/include/rocksdb/options.h
@@ -1546,6 +1546,13 @@ struct ReadOptions {
   // Default: false
   bool ignore_range_deletions;

+  // A callback to determine whether relevant keys for this scan exist in a
+  // given table based on the table's properties. The callback is passed the
+  // properties of each table during iteration. If the callback returns false,
+  // the table will not be scanned.
+  // Default: empty (every table will be scanned)
+  std::function<bool(const TableProperties&)> table_filter;
+
   ReadOptions();
   ReadOptions(bool cksum, bool cache);
 };
```

/cc danhhz
Closes https://github.com/facebook/rocksdb/pull/2265

Differential Revision: D5054262

Pulled By: yiwu-arbug

fbshipit-source-id: dd6b28f2bba6cb8466250d8c5c542d3c92785476
2017-10-17 22:12:00 -07:00
Yi Wu
eaaef91178 Blob DB: Store blob index as kTypeBlobIndex in base db
Summary:
Blob db insert blob index to base db as kTypeBlobIndex type, to tell apart values written by plain rocksdb or blob db. This is to make it possible to migrate from existing rocksdb to blob db.

Also with the patch blob db garbage collection get away from OptimisticTransaction. Instead it use a custom write callback to achieve similar behavior as OptimisticTransaction. This is because we need to pass the is_blob_index flag to DBImpl::Get but OptimisticTransaction don't support it.
Closes https://github.com/facebook/rocksdb/pull/3000

Differential Revision: D6050044

Pulled By: yiwu-arbug

fbshipit-source-id: 61dc72ab9977625e75f78cd968e7d8a3976e3632
2017-10-17 17:28:11 -07:00
zhangjinpeng1987
966b32b57c fix delete range bug
Summary:
Fix this [issue](https://github.com/facebook/rocksdb/issues/2989).
ajkr PTAL

Close #2989
Closes https://github.com/facebook/rocksdb/pull/3017

Differential Revision: D6078541

Pulled By: yiwu-arbug

fbshipit-source-id: ef3db87b37b9156f83ca468aa39dea1f6dbde49d
2017-10-17 11:13:19 -07:00
Nikhil Benesch
c0208dffbe arena: derive alignment unit from std::max_align_t
Summary:
As raised in #2265, the arena allocator will return memory that is improperly aligned to store a `std::function` on macOS. Oddly, I'm unable to tickle this bug without adding a `std::function` field to `struct ReadOptions`—but my proposal in #2265 does exactly that.

In any case, here's a simple reproduction. Apply this bogus patch to get a `std::function` into `struct ReadOptions`

```
 --- a/include/rocksdb/options.h
+++ b/include/rocksdb/options.h
@@ -1035,6 +1035,8 @@ struct ReadOptions {
   // Default: 0
   uint64_t max_skippable_internal_keys;

+  std::function<void()> foo;
+
   ReadOptions();
   ReadOptions(bool cksum, bool cache);
 };
```

then compile `db_properties_test` *with ubsan* and run `ReadLatencyHistogramByLevel`:

```
$ make COMPILE_WITH_UBSAN=1 db_properties_test
$ ./db_properties_test --gtest_filter=DBPropertiesTest.ReadLatencyHistogramByLevel
```

ubsan will complain about several misaligned accesses:

```
Note: Google Test filter = DBPropertiesTest.ReadLatencyHistogramByLevel
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBPropertiesTest
[ RUN      ] DBPropertiesTest.ReadLatencyHistogramByLevel
util/coding.h:372:12: runtime error: load of misaligned address 0x00010d85516c for type 'const unsigned long', which requires 8 byte alignment
0x00010d85516c: note: pointer points here
  01 00 34 57 00 00 00 00  02 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  78 24 82 0a 01 00 00 00
              ^
util/coding.h:362:3: runtime error: store to misaligned address 0x7fff5733fac4 for type 'unsigned long', which requires 8 byte alignment
0x7fff5733fac4: note: pointer points here
  01 00 00 00 00 00 00 00  02 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  80 1d 96 0d 01 00 00 00
              ^
util/coding.h:372:12: runtime error: load of misaligned address 0x00010d85516c for type 'const unsigned long', which requires 8 byte alignment
0x00010d85516c: note: pointer points here
  01 00 34 57 00 00 00 00  02 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  78 24 82 0a 01 00 00 00
              ^
version_set.cc:854: runtime error: constructor call on misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment
0x00010dbfa5c8: note: pointer points here
 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
version_set.cc:512: runtime error: constructor call on misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment
0x00010dbfa5c8: note: pointer points here
 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
version_set.cc:505: runtime error: constructor call on misaligned address 0x00010dbfa5e8 for type 'rocksdb::ReadOptions', which requires 16 byte alignment
0x00010dbfa5e8: note: pointer points here
 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
options.h:931: runtime error: constructor call on misaligned address 0x00010dbfa5e8 for type 'rocksdb::ReadOptions', which requires 16 byte alignment
0x00010dbfa5e8: note: pointer points here
 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
options.h:931: runtime error: constructor call on misaligned address 0x00010dbfa628 for type 'std::__1::function<void ()>', which requires 16 byte alignment
0x00010dbfa628: note: pointer points here
 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
functional:1583: runtime error: constructor call on misaligned address 0x00010dbfa628 for type 'std::__1::function<void ()>', which requires 16 byte alignment
0x00010dbfa628: note: pointer points here
 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/functional:1585:9: runtime error: member access within misaligned address 0x00010dbfa628 for type 'std::__1::function<void ()>', which requires 16 byte alignment
0x00010dbfa628: note: pointer points here
 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/functional:1585:9: runtime error: store to misaligned address 0x00010dbfa648 for type '__base *' (aka '__base<void ()> *'), which requires 16 byte alignment
0x00010dbfa648: note: pointer points here
 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
db/version_set.cc:864:29: runtime error: upcast of misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment
0x00010dbfa5c8: note: pointer points here
 00 00 00 00  a0 db 70 0a 01 00 00 00  00 00 00 00 00 00 00 00  90 14 98 0d 01 00 00 00  00 00 00 00
              ^
db/version_set.cc:521:12: runtime error: member access within misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment
0x00010dbfa5c8: note: pointer points here
 00 00 00 00  a0 db 70 0a 01 00 00 00  00 00 00 00 00 00 00 00  90 14 98 0d 01 00 00 00  00 00 00 00
              ^
db/version_set.cc:521:12: runtime error: load of misaligned address 0x00010dbfa5d8 for type 'rocksdb::TableCache *', which requires 16 byte alignment
0x00010dbfa5d8: note: pointer points here
 00 00 00 00  90 14 98 0d 01 00 00 00  00 00 00 00 00 00 00 00  01 01 ff ff ff ff ff ff  00 00 00 00
              ^
db/version_set.cc:522:9: runtime error: member access within misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment
0x00010dbfa5c8: note: pointer points here
 00 00 00 00  a0 db 70 0a 01 00 00 00  00 00 00 00 00 00 00 00  90 14 98 0d 01 00 00 00  00 00 00 00
              ^
db/version_set.cc:522:9: runtime error: reference binding to misaligned address 0x00010dbfa5e8 for type 'const rocksdb::ReadOptions', which requires 16 byte alignment
0x00010dbfa5e8: note: pointer points here
 00 00 00 00  01 01 ff ff ff ff ff ff  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
db/version_set.cc:522:24: runtime error: member access within misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment
0x00010dbfa5c8: note: pointer points here
 00 00 00 00  a0 db 70 0a 01 00 00 00  00 00 00 00 00 00 00 00  90 14 98 0d 01 00 00 00  00 00 00 00
              ^
db/version_set.cc:522:38: runtime error: member access within misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment
0x00010dbfa5c8: note: pointer points here
 00 00 00 00  a0 db 70 0a 01 00 00 00  00 00 00 00 00 00 00 00  90 14 98 0d 01 00 00 00  00 00 00 00
              ^
db/version_set.cc:522:57: runtime error: member access within misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment
0x00010dbfa5c8: note: pointer points here
 00 00 00 00  a0 db 70 0a 01 00 00 00  00 00 00 00 00 00 00 00  90 14 98 0d 01 00 00 00  00 00 00 00
              ^
db/version_set.cc:522:57: runtime error: load of misaligned address 0x00010dbfa678 for type 'rocksdb::RangeDelAggregator *', which requires 16 byte alignment
0x00010dbfa678: note: pointer points here
 01 00 00 00  d0 a1 bf 0d 01 00 00 00  00 00 00 00 00 00 00 00  f8 db 70 0a 01 00 00 00  00 00 00 00
              ^
db/version_set.cc:523:54: runtime error: member access within misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment
0x00010dbfa5c8: note: pointer points here
 00 00 00 00  a0 db 70 0a 01 00 00 00  00 00 00 00 00 00 00 00  90 14 98 0d 01 00 00 00  00 00 00 00
              ^
db/version_set.cc:523:54: runtime error: load of misaligned address 0x00010dbfa668 for type 'rocksdb::HistogramImpl *', which requires 16 byte alignment
0x00010dbfa668: note: pointer points here
 01 00 00 00  c8 88 a5 0d 01 00 00 00  00 00 00 00 01 00 00 00  d0 a1 bf 0d 01 00 00 00  00 00 00 00
              ^
db/version_set.cc:524:9: runtime error: member access within misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment
0x00010dbfa5c8: note: pointer points here
 00 00 00 00  a0 db 70 0a 01 00 00 00  00 00 00 00 00 00 00 00  90 14 98 0d 01 00 00 00  00 00 00 00
              ^
db/version_set.cc:524:47: runtime error: member access within misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment
0x00010dbfa5c8: note: pointer points here
 00 00 00 00  a0 db 70 0a 01 00 00 00  00 00 00 00 00 00 00 00  90 14 98 0d 01 00 00 00  00 00 00 00
              ^
db/version_set.cc:524:62: runtime error: member access within misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment
0x00010dbfa5c8: note: pointer points here
 00 00 00 00  a0 db 70 0a 01 00 00 00  00 00 00 00 00 00 00 00  90 14 98 0d 01 00 00 00  00 00 00 00
              ^
db/table_cache.cc:228:33: runtime error: reference binding to misaligned address 0x00010dbfa5e8 for type 'const rocksdb::ReadOptions', which requires 16 byte alignment
0x00010dbfa5e8: note: pointer points here
 00 00 00 00  01 01 ff ff ff ff ff ff  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
table/block_based_table_reader.cc:1554:41: runtime error: reference binding to misaligned address 0x00010dbfa5e8 for type 'const rocksdb::ReadOptions', which requires 16 byte alignment
0x00010dbfa5e8: note: pointer points here
 00 00 00 00  01 01 ff ff ff ff ff ff  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
table/block_based_table_reader.cc:1396:21: runtime error: reference binding to misaligned address 0x00010dbfa5e8 for type 'const rocksdb::ReadOptions', which requires 16 byte alignment
0x00010dbfa5e8: note: pointer points here
 00 00 00 00  01 01 ff ff ff ff ff ff  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
include/rocksdb/options.h:931:8: runtime error: reference binding to misaligned address 0x00010dbfa628 for type 'const std::function<void ()>', which requires 16 byte alignment
0x00010dbfa628: note: pointer points here
 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/functional:1584:13: runtime error: load of misaligned address 0x00010dbfa648 for type '__base *const' (aka '__base<void ()> *const'), which requires 16 byte alignment
0x00010dbfa648: note: pointer points here
 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  c8 a5 97 0d 01 00 00 00  38 36 9b 0d
              ^
table/block_based_table_reader.cc:1555:24: runtime error: reference binding to misaligned address 0x00010dbfa5e8 for type 'const rocksdb::ReadOptions', which requires 16 byte alignment
0x00010dbfa5e8: note: pointer points here
 00 00 00 00  01 01 ff ff ff ff ff ff  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
db/table_cache.cc:244:54: runtime error: load of misaligned address 0x00010dbfa618 for type 'const bool', which requires 16 byte alignment
0x00010dbfa618: note: pointer points here
 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
db/table_cache.cc:246:49: runtime error: reference binding to misaligned address 0x00010dbfa5e8 for type 'const rocksdb::ReadOptions', which requires 16 byte alignment
0x00010dbfa5e8: note: pointer points here
 00 00 00 00  01 01 ff ff ff ff ff ff  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
db/version_set.cc:532:12: runtime error: member access within misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment
0x00010dbfa5c8: note: pointer points here
 00 00 00 00  a0 db 70 0a 01 00 00 00  00 00 00 00 00 00 00 00  90 14 98 0d 01 00 00 00  00 00 00 00
              ^
db/version_set.cc:532:12: runtime error: member access within misaligned address 0x00010dbfa5e8 for type 'const rocksdb::ReadOptions', which requires 16 byte alignment
0x00010dbfa5e8: note: pointer points here
 00 00 00 00  01 01 ff ff ff ff ff ff  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
db/version_set.cc:532:26: runtime error: load of misaligned address 0x00010dbfa5f8 for type 'const rocksdb::Slice *const', which requires 16 byte alignment
0x00010dbfa5f8: note: pointer points here
 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
version_set.cc:493: runtime error: member call on misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment
0x00010dbfa5c8: note: pointer points here
 00 00 00 00  a0 db 70 0a 01 00 00 00  00 00 00 00 00 00 00 00  90 14 98 0d 01 00 00 00  00 00 00 00
              ^
version_set.cc:493: runtime error: member call on misaligned address 0x00010dbfa5e8 for type 'rocksdb::ReadOptions', which requires 16 byte alignment
0x00010dbfa5e8: note: pointer points here
 00 00 00 00  01 01 ff ff ff ff ff ff  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
options.h:931: runtime error: member call on misaligned address 0x00010dbfa5e8 for type 'rocksdb::ReadOptions', which requires 16 byte alignment
0x00010dbfa5e8: note: pointer points here
 00 00 00 00  01 01 ff ff ff ff ff ff  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
options.h:931: runtime error: member call on misaligned address 0x00010dbfa628 for type 'std::__1::function<void ()>', which requires 16 byte alignment
0x00010dbfa628: note: pointer points here
 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
functional:1765: runtime error: member call on misaligned address 0x00010dbfa628 for type 'std::__1::function<void ()>', which requires 16 byte alignment
0x00010dbfa628: note: pointer points here
 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/functional:1766:9: runtime error: member access within misaligned address 0x00010dbfa628 for type 'std::__1::function<void ()>', which requires 16 byte alignment
0x00010dbfa628: note: pointer points here
 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/functional:1766:9: runtime error: load of misaligned address 0x00010dbfa648 for type '__base *' (aka '__base<void ()> *'), which requires 16 byte alignment
0x00010dbfa648: note: pointer points here
 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  c8 a5 97 0d 01 00 00 00  38 36 9b 0d
              ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/functional:1766:27: runtime error: member access within misaligned address 0x00010dbfa628 for type 'std::__1::function<void ()>', which requires 16 byte alignment
0x00010dbfa628: note: pointer points here
 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/functional:1768:14: runtime error: member access within misaligned address 0x00010dbfa628 for type 'std::__1::function<void ()>', which requires 16 byte alignment
0x00010dbfa628: note: pointer points here
 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/functional:1768:14: runtime error: load of misaligned address 0x00010dbfa648 for type '__base *' (aka '__base<void ()> *'), which requires 16 byte alignment
0x00010dbfa648: note: pointer points here
 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  c8 a5 97 0d 01 00 00 00  38 36 9b 0d
              ^
[       OK ] DBPropertiesTest.ReadLatencyHistogramByLevel (1599 ms)
[----------] 1 test from DBPropertiesTest (1599 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (1599 ms total)
[  PASSED  ] 1 test.
```

So it seems the root cause is that the internal implementation of `std::function` on macOS (and perhaps with libc++ generally?) requires 16-byte aligned memory, but the arena allocator only guarantees that the returned memory will be `sizeof(void*)` aligned, which is only 8-byte alignment on my machine. This patch solves the problem by adjusting the allocator to derive the necessary alignment from `alignof(std::max_align_t)`, which is properly 16 bytes on my machine.

As I mentioned in #2265, none of RocksDB's tests will cause this unaligned access to actually abort the process, but, on macOS, linking CockroachDB against a version of RocksDB with the above patch and letting it run for just a few seconds will cause a SIGABRT.

```
Process 19792 stopped
* thread #2, stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
    frame #0: 0x0000000004f5e78f cockroach`DBNewIter + 95
cockroach`DBNewIter:
->  0x4f5e78f <+95>:  callq  *0x28(%rax)
    0x4f5e792 <+98>:  jmp    0x4f5e79e                 ; <+110>
    0x4f5e794 <+100>: movq   -0x50(%rbp), %rcx
    0x4f5e798 <+104>: movq   %rax, %rdi
(lldb) bt
* thread #2, stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
  * frame #0: 0x0000000004f5e78f cockroach`DBNewIter + 95
```

I'd get you a backtrace, but [Go doesn't include cgo debug information on macOS](https://github.com/golang/go/issues/6942). I've also tried building against libc++ on Linux, where debug information would be available, but I can't seem to trigger the bug there.

In any case, this PR both fixes the segfault in CockroachDB and fixes the warnings reported by ubsan.
Closes https://github.com/facebook/rocksdb/pull/2347

Differential Revision: D5108596

Pulled By: yiwu-arbug

fbshipit-source-id: bd5e4323b2ce915ed4fe78e123cb8996aec75a00
2017-10-17 11:13:19 -07:00
Changli Gao
b8cea7cc27 VersionBuilder: Erase with iterators for better performance
Summary: Closes https://github.com/facebook/rocksdb/pull/3007

Differential Revision: D6077701

Pulled By: yiwu-arbug

fbshipit-source-id: a6fd5b8a23f4feb1660b9ce027f651a7e90352b3
2017-10-17 10:12:37 -07:00
Yi Wu
8e63cad078 fix lite build
Summary:
* make `checksum_type_string_map` available for lite
* comment out `FilesPerLevel` in lite mode.
* travis and legocastle lite build also build `all` target and run tests
Closes https://github.com/facebook/rocksdb/pull/3015

Differential Revision: D6069822

Pulled By: yiwu-arbug

fbshipit-source-id: 9fe92ac220e711e9e6ed4e921bd25ef4314796a0
2017-10-17 08:57:09 -07:00
Andrew Kryczka
8dd0a7e11a add comment in SuperVersion referencing logic
Summary:
The referencing logic is super confusing so added a comment at the part that took me longest to figure out.
Closes https://github.com/facebook/rocksdb/pull/2996

Differential Revision: D6034969

Pulled By: ajkr

fbshipit-source-id: 9cc2e744c1f79d6d57d378f86ed59238a5f583db
2017-10-11 15:12:31 -07:00
Yi Wu
fb4ae4d810 fix DBImpl::NewInternalIterator super-version leak on failure
Summary:
Close #2955
Closes https://github.com/facebook/rocksdb/pull/2960

Differential Revision: D5962872

Pulled By: yiwu-arbug

fbshipit-source-id: a6472d5c015bea3dc476c572ff5a5c90259e6059
2017-10-11 14:57:43 -07:00
Zhongyi Xie
0f3b36964e Fix counter for memtable updates
Summary:
Right now in `PutCFImpl` we always increment NUMBER_KEYS_UPDATED counter for both in-place update or insertion. This PR fixes this by using the correct counter for either case.
Closes https://github.com/facebook/rocksdb/pull/2986

Differential Revision: D6016300

Pulled By: miasantreble

fbshipit-source-id: 0aed327522e659450d533d1c47d3a9f568fac65d
2017-10-10 21:26:11 -07:00
Andrew Kryczka
70aa942153 fix file numbers after repair
Summary:
The file numbers assigned post-repair were sometimes smaller than older files' numbers due to `LogAndApply` saving the wrong next file number in the manifest.

- Mark the highest file seen during repair as used before `LogAndApply` so the correct next file number will be stored.
- Renamed `MarkFileNumberUsedDuringRecovery` to `MarkFileNumberUsed` since now it's used during repair in addition to during recovery
- Added `TEST_Current_Next_FileNo` to expose the next file number for the unit test.
Closes https://github.com/facebook/rocksdb/pull/2988

Differential Revision: D6018083

Pulled By: ajkr

fbshipit-source-id: 3f25cbf74439cb8f16dd12af90b67f9f9f75e718
2017-10-10 13:12:37 -07:00
Jay Patel
1a61ba179e compaction picker to use max_bytes_for_level_multiplier_additional
Summary:
Hi,
As part of some optimization, we're using multiple DB locations (tmpfs and spindle) to store data and configured max_bytes_for_level_multiplier_additional. But, max_bytes_for_level_multiplier_additional is not used to compute the actual size for the level while picking the DB location. So, even if DB location does not have space, RocksDB mistakenly puts the level at that location.

Can someone pls. verify the fix? Let me know any other changes required.

Thanks,
Jay
Closes https://github.com/facebook/rocksdb/pull/2704

Differential Revision: D5992515

Pulled By: ajkr

fbshipit-source-id: cbbc6c0e0a7dbdca91c72e0f37b218c4cec57e28
2017-10-09 22:59:02 -07:00
Yi Wu
8c392a31d7 WritePrepared Txn: Iterator
Summary:
On iterator create, take a snapshot, create a ReadCallback and pass the ReadCallback to the underlying DBIter to check if key is committed.
Closes https://github.com/facebook/rocksdb/pull/2981

Differential Revision: D6001471

Pulled By: yiwu-arbug

fbshipit-source-id: 3565c4cdaf25370ba47008b0e0cb65b31dfe79fe
2017-10-09 17:15:28 -07:00
Maysam Yabandeh
ec6c5383d0 WritePrepared Txn: end-to-end tests
Summary:
Enable WritePrepared policy for existing transaction tests.
Closes https://github.com/facebook/rocksdb/pull/2972

Differential Revision: D5993614

Pulled By: maysamyabandeh

fbshipit-source-id: d1eb53e2920c4e2a56434bb001231c98426f3509
2017-10-06 14:26:45 -07:00
Yi Wu
d1b74b0c82 WritePrepared Txn: Compaction/Flush
Summary:
Update Compaction/Flush to support WritePreparedTxnDB: Add SnapshotChecker which is a proxy to query WritePreparedTxnDB::IsInSnapshot. Pass SnapshotChecker to DBImpl on WritePreparedTxnDB open. CompactionIterator use it to check if a key has been committed and if it is visible to a snapshot. In CompactionIterator:
* check if key has been committed. If not, output uncommitted keys AS-IS.
* use SnapshotChecker to check if key is visible to a snapshot when in need.
* do not output key with seq = 0 if the key is not committed.
Closes https://github.com/facebook/rocksdb/pull/2926

Differential Revision: D5902907

Pulled By: yiwu-arbug

fbshipit-source-id: 945e037fdf0aa652dc5ba0ad879461040baa0320
2017-10-06 10:41:53 -07:00
Adrien Schildknecht
01542400a8 Inform caller when rocksdb is stalling writes
Summary:
Add a new function in Listener to let the caller know when rocksdb
is stalling writes.
Closes https://github.com/facebook/rocksdb/pull/2897

Differential Revision: D5860124

Pulled By: schischi

fbshipit-source-id: ee791606169aa64f772c86f817cebf02624e05e1
2017-10-05 18:11:43 -07:00
Andrew Kryczka
821887036e pin L0 filters/indexes for compaction outputs
Summary:
We need to tell the iterator the compaction output file's level so it can apply proper optimizations, like pinning filter and index blocks when user enables `pin_l0_filter_and_index_blocks_in_cache` and the output file's level is zero.
Closes https://github.com/facebook/rocksdb/pull/2949

Differential Revision: D5945597

Pulled By: ajkr

fbshipit-source-id: 2389decf9026ffaa32d45801a77d002529f64a62
2017-10-03 16:27:28 -07:00
Sagar Vemuri
377e004048 Fix DBOptionsTest.SetBytesPerSync test when run with no compression
Summary:
Also made the test more easier to understand:
- changed the value size to ~1MB.
- switched to NoCompression. We don't anyway need compression in this test for dynamic options.

The test failures started happening starting from: #2893 .
Closes https://github.com/facebook/rocksdb/pull/2957

Differential Revision: D5959392

Pulled By: sagar0

fbshipit-source-id: 2d55641e429246328bc6d10fcb9ef540d6ce07da
2017-10-03 13:42:11 -07:00
Yi Wu
d1cab2b64e Add ValueType::kTypeBlobIndex
Summary:
Add kTypeBlobIndex value type, which will be used by blob db only, to insert a (key, blob_offset) KV pair. The purpose is to
1. Make it possible to open existing rocksdb instance as blob db. Existing value will be of kTypeIndex type, while value inserted by blob db will be of kTypeBlobIndex.
2. Make rocksdb able to detect if the db contains value written by blob db, if so return error.
3. Make it possible to have blob db optionally store value in SST file (with kTypeValue type) or as a blob value (with kTypeBlobIndex type).

The root db (DBImpl) basically pretended kTypeBlobIndex are normal value on write. On Get if is_blob is provided, return whether the value read is of kTypeBlobIndex type, or return Status::NotSupported() status if is_blob is not provided. On scan allow_blob flag is pass and if the flag is true, return wether the value is of kTypeBlobIndex type via iter->IsBlob().

Changes on blob db side will be in a separate patch.
Closes https://github.com/facebook/rocksdb/pull/2886

Differential Revision: D5838431

Pulled By: yiwu-arbug

fbshipit-source-id: 3c5306c62bc13bb11abc03422ec5cbcea1203cca
2017-10-03 09:11:23 -07:00
Andrew Kryczka
880411f54c disable populating block cache for in-place updates
Summary:
There's no point populating the block cache during this read. The key we read is guaranteed to be overwritten with a new `kValueType` key immediately afterwards, so can't be accessed again. A user was seeing high turnover of data blocks, at least partially due to this.
Closes https://github.com/facebook/rocksdb/pull/2959

Differential Revision: D5961672

Pulled By: ajkr

fbshipit-source-id: e7cb27c156c5db3b32af355c780efb99dbdf087c
2017-10-02 20:41:24 -07:00
Aliaksei Sandryhaila
cf51d3eb73 Remove an "unused" variable
Summary:
PR 2893 introduced a variable that is only used in TEST_SYNC_POINT_CALLBACK. When RocksDB is not built in debug mode, this method is not compiled in, and the variable is unused, which triggers a compiler error.

This patch reverts the corresponding part of #2893.
Closes https://github.com/facebook/rocksdb/pull/2956

Reviewed By: yiwu-arbug

Differential Revision: D5955679

Pulled By: asandryh

fbshipit-source-id: ac4a8e85b22da7f02efb117cd2e4a6e07ba73390
2017-10-02 15:26:29 -07:00
Andrew Kryczka
5df172da2f fix deletion-triggered compaction in table builder
Summary:
It was broken when `NotifyCollectTableCollectorsOnFinish` was introduced. That function called `Finish` on each of the `TablePropertiesCollector`s, and `CompactOnDeletionCollector::Finish()` was resetting all its internal state. Then, when we checked whether compaction is necessary, the flag had already been cleared.

Fixed above issue by avoiding resetting internal state during `Finish()`. Multiple calls to `Finish()` are allowed, but callers cannot invoke `AddUserKey()` on the collector after any finishes.
Closes https://github.com/facebook/rocksdb/pull/2936

Differential Revision: D5918659

Pulled By: ajkr

fbshipit-source-id: 4f05e9d80e50ee762ba1e611d8d22620029dca6b
2017-09-28 18:17:30 -07:00
Maysam Yabandeh
385049baf2 WritePrepared Txn: Recovery
Summary:
Recover txns from the WAL. Also added some unit tests.
Closes https://github.com/facebook/rocksdb/pull/2901

Differential Revision: D5859596

Pulled By: maysamyabandeh

fbshipit-source-id: 6424967b231388093b4effffe0a3b1b7ec8caeb0
2017-09-28 16:56:45 -07:00
Sagar Vemuri
93c2b91740 Introduce conditional merge-operator invocation in point lookups
Summary:
For every merge operand encountered for a key in the read path we now have the ability to decide whether to look further (to retrieve more merge operands for the key) or stop and invoke the merge operator to return the value. The user needs to override `ShouldMerge()` method with a condition to terminate search when true to avail this facility.

This has a couple of advantages:
1. It helps in limiting the number of merge operands that are looked at to compute a value as part of a user Get operation.
2. It allows to peek at a merge key-value to see if further merge operands need to look at.

Example: Limiting the number of merge operands that are looked at: Lets say you have 10 merge operands for a key spread over various levels. If you only want RocksDB to look at the latest two merge operands instead of all 10 to compute the value, it is now possible with this PR. You can set the condition in `ShouldMerge()` to return true when the size of the operand list is 2. Look at the example implementation in the unit test. Without this PR, a Get might look at all the 10 merge operands in different levels before invoking the merge-operator.

Added a new unit test.
Made sure that there is no perf regression by running benchmarks.

Command line to Load data:
```
TEST_TMPDIR=/dev/shm ./db_bench --benchmarks="mergerandom" --merge_operator="uint64add" --num=10000000
...
mergerandom  :      12.861 micros/op 77757 ops/sec;    8.6 MB/s ( updates:10000000)
```

**ReadRandomMergeRandom bechmark results:**
Command line:
```
TEST_TMPDIR=/dev/shm ./db_bench --benchmarks="readrandommergerandom" --merge_operator="uint64add" --num=10000000
```

Base -- Without this code change (on commit fc7476b):
```
readrandommergerandom :      38.586 micros/op 25916 ops/sec; (reads:3001599 merges:6998401 total:10000000 hits:842235 maxlength:8)
```

With this code change:
```
readrandommergerandom :      38.653 micros/op 25870 ops/sec; (reads:3001599 merges:6998401 total:10000000 hits:842235 maxlength:8)
```
Closes https://github.com/facebook/rocksdb/pull/2923

Differential Revision: D5898239

Pulled By: sagar0

fbshipit-source-id: daefa325019f77968639a75c851d46352c2303ef
2017-09-28 15:58:49 -07:00
Quinn Jarrell
6a541afcc4 Make bytes_per_sync and wal_bytes_per_sync mutable
Summary:
SUMMARY
Moves the bytes_per_sync and wal_bytes_per_sync options from immutableoptions to mutable options. Also if wal_bytes_per_sync is changed, the wal file and memtables are flushed.
TEST PLAN
ran make check
all passed

Two new tests SetBytesPerSync, SetWalBytesPerSync check that after issuing setoptions with a new value for the var, the db options have the new value.
Closes https://github.com/facebook/rocksdb/pull/2893

Reviewed By: yiwu-arbug

Differential Revision: D5845814

Pulled By: TheRushingWookie

fbshipit-source-id: 93b52d779ce623691b546679dcd984a06d2ad1bd
2017-09-27 17:49:45 -07:00
Maysam Yabandeh
aa67bae6cf Break down PinnedDataIteratorRandomized
Summary:
Its timing out under tsan.
Closes https://github.com/facebook/rocksdb/pull/2928

Differential Revision: D5911766

Pulled By: maysamyabandeh

fbshipit-source-id: 2faacc07752ac8713a3a2abb5a4c4b7ae3bdf208
2017-09-26 14:27:30 -07:00
Zhongyi Xie
1d6700f9e6 Add test kPointInTimeRecoveryCFConsistency
Summary:
Context/problem:

- CFs may be flushed at different times
- A WAL can only be deleted after all CFs have flushed beyond end of that WAL.
- Point-in-time recovery might stop upon reaching the first corruption.
- Some CFs may have already flushed beyond that point, while others haven't. We should fail the Open() instead of proceeding with inconsistent CFs.
Closes https://github.com/facebook/rocksdb/pull/2900

Differential Revision: D5863281

Pulled By: miasantreble

fbshipit-source-id: 180dbaf83d96c804cff49b3c406312a4ae61313e
2017-09-22 17:26:36 -07:00
Andrew Kryczka
4708a6875c Repair DBs with trailing slash in name
Summary:
Problem:

- `DB::SanitizeOptions` strips trailing slash from `wal_dir` but not `dbname`
- We check whether `wal_dir` and `dbname` refer to the same directory using string equality: https://github.com/facebook/rocksdb/blob/master/db/repair.cc#L258
- Providing `dbname` with trailing slash causes default `wal_dir` to be misidentified as a separate directory.
- Then the repair tries to add all SST files to the `VersionEdit` twice (once for `dbname` dir, once for `wal_dir`) and fails with coredump.

Solution:

- Add a new `Env` function, `AreFilesSame`, which uses device and inode number to check whether files are the same. It's currently only implemented in `PosixEnv`.
- Migrate repair to use `AreFilesSame` to check whether `dbname` and `wal_dir` are same. If unsupported, falls back to string comparison.
Closes https://github.com/facebook/rocksdb/pull/2827

Differential Revision: D5761349

Pulled By: ajkr

fbshipit-source-id: c839d548678b742af1166d60b09abd94e5476238
2017-09-22 12:42:22 -07:00
Andrew Kryczka
fc7476bec1 fix populating range deletions in forward iterator
Summary:
fixes #2902
Closes https://github.com/facebook/rocksdb/pull/2917

Differential Revision: D5887175

Pulled By: ajkr

fbshipit-source-id: 364e292c636a3238bfc53b0fb9a01ff2f82dcbb9
2017-09-21 17:56:38 -07:00
PhaniShekhar
65a9cd6168 Use L1 size as estimate for L0 size in LevelCompactionBuilder::GetPathID
Summary:
Fix for [2461](https://github.com/facebook/rocksdb/issues/2461).

Problem: When using multiple db_paths setting with RocksDB, RocksDB incorrectly calculates the size of L1 in LevelCompactionBuilder::GetPathId.

max_bytes_for_level_base is used as L0 size and L1 size is calculated as (L0 size * max_bytes_for_level_multiplier). However, L1 size should be max_bytes_for_level_base.

Solution: Use max_bytes_for_level_base as L1 size. Also, use L1 size as the estimated size of L0.
Closes https://github.com/facebook/rocksdb/pull/2903

Differential Revision: D5885442

Pulled By: maysamyabandeh

fbshipit-source-id: 036da1c9298d173b9b80479cc6661ee4b7a951f6
2017-09-21 15:57:58 -07:00
Yi Wu
b4596c6174 Fix Get does not return super version on error
Summary:
This is caught when I was testing #2886.
Closes https://github.com/facebook/rocksdb/pull/2907

Differential Revision: D5863153

Pulled By: yiwu-arbug

fbshipit-source-id: 8c54759ba1a0dc101f24ab50423e35731300612d
2017-09-19 12:01:09 -07:00
Maysam Yabandeh
60beefd6e0 WritePrepared Txn: Advance seq one per batch
Summary:
By default the seq number in DB is increased once per written key. WritePrepared txns requires the seq to be increased once per the entire batch so that the seq would be used as the prepare timestamp by which the transaction is identified. Also we need to increase seq for the commit marker since it would give a unique id to the commit timestamp of transactions.

Two unit tests are added to verify our understanding of how the seq should be increased. The recovery path requires much more work and is left to another patch.
Closes https://github.com/facebook/rocksdb/pull/2885

Differential Revision: D5837843

Pulled By: maysamyabandeh

fbshipit-source-id: a08960b93d727e1cf438c254d0c2636fb133cc1c
2017-09-18 14:45:08 -07:00
Maysam Yabandeh
c57050b770 Use the default copy constructor in Options
Summary:
Our current implementation of (semi-)copy constructor of DBOptions and ColumnFamilyOptions seems to intend value by value copy, which is what the default copy constructor does anyway. Moreover not using the default constructor has the risk of forgetting to add newly added options.

As an example, allow_2pc seems to be forgotten in the copy constructor which was causing one of the unit tests not seeing its effect.
Closes https://github.com/facebook/rocksdb/pull/2888

Differential Revision: D5846368

Pulled By: maysamyabandeh

fbshipit-source-id: 1ee92a2aeae93886754b7bc039c3411ea2458683
2017-09-15 17:15:10 -07:00
Yi Wu
6b3c71f6ed Fix DBImpl::NotifyOnCompactionCompleted data race
Summary:
Access of `cfd->current()` needs to hold db mutex. The data race is caught by TSAN but hard to reproduce: https://gist.github.com/yiwu-arbug/0fc6dc0de915297a1740aa9610be9373
Closes https://github.com/facebook/rocksdb/pull/2894

Differential Revision: D5843884

Pulled By: yiwu-arbug

fbshipit-source-id: 0a30a421bc96f51840821538ad6453dc0815a942
2017-09-15 11:56:31 -07:00
Siying Dong
edcbb36944 Three code-level optimization to Iterator::Next()
Summary:
Three small optimizations:
(1) iter_->IsKeyPinned() shouldn't be called if read_options.pin_data is not true. This may trigger function call all the way down the iterator tree.
(2) reuse the iterator key object in DBIter::FindNextUserEntryInternal(). The constructor of the class has some overheads.
(3) Move the switching direction logic in MergingIterator::Next() to a separate function.

These three in total improves readseq performance by about 3% in my benchmark setting.
Closes https://github.com/facebook/rocksdb/pull/2880

Differential Revision: D5829252

Pulled By: siying

fbshipit-source-id: 991aea10c6d6c3b43769cb4db168db62954ad1e3
2017-09-14 17:57:31 -07:00
Siying Dong
885b1c682e Two small refactoring for better inlining
Summary:
Move uncommon code paths in RangeDelAggregator::ShouldDelete() and IterKey::EnlargeBufferIfNeeded() to a separate function, so that the inlined strcuture can be more optimized.

Optimize it because these places show up in CPU profiling, though minimum. The performance is really hard measure. I ran db_bench with readseq benchmark against in-memory DB many times. The variation is big, but it seems to show 1% improvements.
Closes https://github.com/facebook/rocksdb/pull/2877

Differential Revision: D5828123

Pulled By: siying

fbshipit-source-id: 41a49e229f91e9f8409f85cc6f0dc70e31334e4b
2017-09-14 15:41:49 -07:00
Oleksandr Anyshchenko
ffac68367f Added save points for transactions C API
Summary:
Added possibility to set save points in transactions and then rollback to them
Closes https://github.com/facebook/rocksdb/pull/2876

Differential Revision: D5825829

Pulled By: yiwu-arbug

fbshipit-source-id: 62168992340bbcddecdaea3baa2a678475d1429d
2017-09-14 14:18:59 -07:00
Yi Wu
a843df668b Fix use-after-free in c_tset
Summary:
Fix asan error introduce by #2823
Closes https://github.com/facebook/rocksdb/pull/2879

Differential Revision: D5828454

Pulled By: yiwu-arbug

fbshipit-source-id: 50777855667f4e7b634279a654c3bfa01a1ac729
2017-09-13 16:12:02 -07:00
Andrew Kryczka
464fb36de9 fix hanging after CompactFiles with L0 overlap
Summary:
Bug report: https://www.facebook.com/groups/rocksdb.dev/permalink/1389452781153232/

Non-empty `level0_compactions_in_progress_` was aborting `CompactFiles` after incrementing `bg_compaction_scheduled_`, and in that case we never decremented it. This blocked future compactions and prevented DB close as we wait for scheduled compactions to finish/abort during close.

I eliminated `CompactFiles`'s dependency on `level0_compactions_in_progress_`. Since it takes a contiguous span of L0 files -- through the last L0 file if any L1+ files are included -- it's fine to run in parallel with other compactions involving L0. We make the same assumption in intra-L0 compaction.
Closes https://github.com/facebook/rocksdb/pull/2849

Differential Revision: D5780440

Pulled By: ajkr

fbshipit-source-id: 15b15d3faf5a699aed4b82a58352d4a7bb23e027
2017-09-13 15:41:38 -07:00
Oleksandr Anyshchenko
72e4190918 Additions for OptimisticTransactionDB in C API
Summary:
Added some bindings for `OptimisticTransactionDB` in C API
Closes https://github.com/facebook/rocksdb/pull/2823

Differential Revision: D5820672

Pulled By: yiwu-arbug

fbshipit-source-id: 7efd17f619cc0741feddd2050b8fc856f9288350
2017-09-13 12:12:11 -07:00
Amy Xu
5785b1fcb8 Fix naming in InternalKey
Summary:
- Switched all instances of SetMinPossibleForUserKey and SetMaxPossibleForUserKey in accordance to InternalKeyComparator's comparison logic
Closes https://github.com/facebook/rocksdb/pull/2868

Differential Revision: D5804152

Pulled By: axxufb

fbshipit-source-id: 80be35e04f2e8abc35cc64abe1fecb03af24e183
2017-09-12 17:17:42 -07:00
Maysam Yabandeh
2d30aaae47 Exclude incompatible options in test
Summary:
options.enable_pipelined_write and options.concurrent_prepare are incompatible and should not be set together.
Closes https://github.com/facebook/rocksdb/pull/2875

Differential Revision: D5818358

Pulled By: maysamyabandeh

fbshipit-source-id: dad862508f00817ab302f8b61729accf38315fb8
2017-09-12 14:58:46 -07:00
Archit Mishra
3c42807794 do not call merge when checking to see if key exists
Summary:
Changes:
* added check for value before merge is called on code path that should check if key exists
Closes https://github.com/facebook/rocksdb/pull/2814

Reviewed By: IslamAbdelRahman

Differential Revision: D5743966

Pulled By: armishra

fbshipit-source-id: 6ac4283bc510c8ca50827d87ef0ba631f2b33b18
2017-09-12 12:02:53 -07:00
Andrew Kryczka
025b85b4ac speedup DBTest.EncodeDecompressedBlockSizeTest
Summary:
it sometimes takes more than 10 minutes (i.e., times out) on our internal CI. mainly because bzip is super slow. so I reduced the amount of  work it tries to do.
Closes https://github.com/facebook/rocksdb/pull/2856

Differential Revision: D5795883

Pulled By: ajkr

fbshipit-source-id: e69f986ae60b44ecc26b6b024abd0f13bdf3a3c5
2017-09-12 11:26:47 -07:00
Siying Dong
64b6452e0c Make InternalKeyComparator final and directly use it in merging iterator
Summary:
Merging iterator invokes InternalKeyComparator.Compare() frequently to heap merge. By making InternalKeyComparator final and merging iterator to directly use InternalKeyComparator rather than through Iterator interface, we can give compiler a choice to avoid one more virtual function call if possible. I ran readseq benchmark in memory-only use case to make sure the performance at least doesn't regress.

I have to disable the final key word in debug build, as a hack test class depends on overriding the class.
Closes https://github.com/facebook/rocksdb/pull/2860

Differential Revision: D5800461

Pulled By: siying

fbshipit-source-id: ab876f22a09bb5c560740911412336e0e25ccb53
2017-09-11 12:04:21 -07:00
Siying Dong
2dd22e5449 Make DBIter class final
Summary:
DBIter is referenced in ArenaWrappedDBIter, which is a simple wrapper. If DBIter is final, some virtual function call can be avoided. Some functions can even be inlined, like DBIter.value() to ArenaWrappedDBIter.value() and DBIter.key() to ArenaWrappedDBIter.key(). The performance gain is hard to measure. I just ran the memory-only benchmark for readseq and saw it didn't regress. There shouldn't be any harm doing it. Just give compiler more choices.
Closes https://github.com/facebook/rocksdb/pull/2859

Differential Revision: D5799888

Pulled By: siying

fbshipit-source-id: 829788f91310c40282dcfb7e412e6ef489931143
2017-09-11 12:04:21 -07:00
Huachao Huang
2a5915049e Fix missing BYTES_PER_WRITE for pipeline write
Summary: Closes https://github.com/facebook/rocksdb/pull/2862

Differential Revision: D5805638

Pulled By: yiwu-arbug

fbshipit-source-id: 72d38c74395690023a719f400daff01527645a17
2017-09-11 11:41:27 -07:00
Maysam Yabandeh
f46464d383 write-prepared txn: call IsInSnapshot
Summary:
This patch instruments the read path to verify each read value against an optional ReadCallback class. If the value is rejected, the reader moves on to the next value. The WritePreparedTxn makes use of this feature to skip sequence numbers that are not in the read snapshot.
Closes https://github.com/facebook/rocksdb/pull/2850

Differential Revision: D5787375

Pulled By: maysamyabandeh

fbshipit-source-id: 49d808b3062ab35e7ae98ad388f659757794184c
2017-09-11 09:14:48 -07:00
Andrew Kryczka
3cd7ea2e8a rename stall-related internal stats
Summary:
Some of these names, like `MEMTABLE_COMPACTION`, did not mean anything. Tried to give them descriptive names.
Closes https://github.com/facebook/rocksdb/pull/2852

Differential Revision: D5782822

Pulled By: ajkr

fbshipit-source-id: f2695c4124af4073da4492d7135bae2411220f3a
2017-09-07 18:26:18 -07:00
Siying Dong
0e99323ac2 Fix CLANG Analyze
Summary:
clang analyze shows warnings after we upgrade the CLANG version. Fix them.
Closes https://github.com/facebook/rocksdb/pull/2839

Differential Revision: D5769060

Pulled By: siying

fbshipit-source-id: 3f8e4df715590d8984f6564b608fa08cfdfa5f14
2017-09-07 14:28:06 -07:00
Andrew Kryczka
10ddd59ba7 fix CompactFiles inclusion of older L0 files
Summary:
if we're moving any L0 files down, we need to include older L0 files since they may contain older versions of the keys being moved down.
Closes https://github.com/facebook/rocksdb/pull/2845

Differential Revision: D5773800

Pulled By: ajkr

fbshipit-source-id: 9f0770a8eaaeea4c87df2e7a2a1d65bf9d7f4f7e
2017-09-06 11:42:25 -07:00
Kamalalochana Subbaiah
e612e31740 Updated CRC32 Power Optimization Changes
Summary:
Support for PowerPC Architecture
Detecting AltiVec Support
Closes https://github.com/facebook/rocksdb/pull/2716

Differential Revision: D5606836

Pulled By: siying

fbshipit-source-id: 720262453b1546e5fdbbc668eff56848164113f3
2017-08-31 14:16:30 -07:00
Andrew Kryczka
c10cf166fa Dump non-final ZSTD compression type support
Summary: Closes https://github.com/facebook/rocksdb/pull/2810

Differential Revision: D5739947

Pulled By: ajkr

fbshipit-source-id: 09f99718b6b083c2711dcf17f7b68c305f3fd261
2017-08-30 16:41:24 -07:00
Artem Danilov
8a6708f5f2 Extend property map with compaction stats
Summary:
This branch extends existing property map which keeps values in doubles to keep values in strings so that it can be used to provide wider range of properties. The immediate need for that is to provide IO stall stats in an easy parseable way to MyRocks which is also part of this branch.
Closes https://github.com/facebook/rocksdb/pull/2794

Differential Revision: D5717676

Pulled By: Tema

fbshipit-source-id: e34ba5b79ba774697f7b97ce1138d8fd55471b8a
2017-08-30 15:26:55 -07:00
Huachao Huang
0980dc6c9a Fix wrong smallest key of delete range tombstones
Summary:
Since tombstones are not stored in order, we may get a wrong smallest key if we only consider the first added tombstone.
Check https://github.com/facebook/rocksdb/issues/2752 for more details.
Closes https://github.com/facebook/rocksdb/pull/2799

Differential Revision: D5728217

Pulled By: ajkr

fbshipit-source-id: 4a53edb0ca80d2a9fcf10749e52d47d57d6417d3
2017-08-29 18:41:35 -07:00
Andrew Kryczka
b767972313 avoid use-after-move error
Summary:
* db/range_del_aggregator.cc (AddTombstone): Avoid a potential
use-after-move bug. The original code would both use and move
`tombstone` in a context where the order of those operations is
not specified.  The fix is to perform the use on a new, preceding
statement.

Author: meyering
Closes https://github.com/facebook/rocksdb/pull/2796

Differential Revision: D5721163

Pulled By: ajkr

fbshipit-source-id: a1d328d6a77a17c6425e8069860a202e615e2f48
2017-08-29 12:11:56 -07:00
Maysam Yabandeh
fbfa3e7a43 WriteAtPrepare: Efficient read from snapshot list
Summary:
Divide the old snapshots to two lists: a few that fit into a cached array and the rest in a vector, which is expected to be empty in normal cases. The former is to optimize concurrent reads from snapshots without requiring locks. It is done by an array of std::atomic, from which std::memory_order_acquire reads are compiled to simple read instructions in most of the x86_64 architectures.
Closes https://github.com/facebook/rocksdb/pull/2758

Differential Revision: D5660504

Pulled By: maysamyabandeh

fbshipit-source-id: 524fcf9a8e7f90a92324536456912a99aaa6740c
2017-08-26 01:00:38 -07:00
Yi Wu
3c840d1a6d Allow DB reopen with reduced options.num_levels
Summary:
Allow user to reduce number of levels in LSM by issue a full CompactRange() and put the result in a lower level, and then reopen DB with reduced options.num_levels. Previous this will fail on reopen on when recovery replaying the previous MANIFEST and found a historical file was on a higher level than the new options.num_levels. The workaround was after CompactRange(), reopen the DB with old num_levels, which will create a new MANIFEST, and then reopen the DB again with new num_levels.

This patch relax the check of levels during recovery. It allows DB to open if there was a historical file on level > options.num_levels, but was also deleted.
Closes https://github.com/facebook/rocksdb/pull/2740

Differential Revision: D5629354

Pulled By: yiwu-arbug

fbshipit-source-id: 545903f6b36b6083e8cbaf777176aef2f488021d
2017-08-24 16:10:54 -07:00
Yi Wu
92bfd6c507 Fix DropColumnFamily data race
Summary:
It should hold db mutex while accessing max_total_in_memory_state_.
Closes https://github.com/facebook/rocksdb/pull/2784

Differential Revision: D5696536

Pulled By: yiwu-arbug

fbshipit-source-id: 45430634d7fe11909b38e42e5f169f618681c4ee
2017-08-24 14:56:04 -07:00
Andrew Kryczka
7fbb9eccaf support disabling checksum in block-based table
Summary:
store a zero as the checksum when disabled since it's easier to keep block trailer a fixed length.
Closes https://github.com/facebook/rocksdb/pull/2781

Differential Revision: D5694702

Pulled By: ajkr

fbshipit-source-id: 69cea9da415778ba2b600dfd9d0dfc8cb5188ecd
2017-08-23 19:40:47 -07:00
Andrew Kryczka
7eba54eb9b test compaction input-level split range tombstone assumption
Summary:
One of the core assumptions of DeleteRange is that files containing portions of the same range tombstone are treated as a single unit from the perspective of compaction picker. Need better tests for this. This PR adds the tests for manual compaction.
Closes https://github.com/facebook/rocksdb/pull/2769

Differential Revision: D5676677

Pulled By: ajkr

fbshipit-source-id: 1b4b3382b300ff7048b872911405fdf900e4fbec
2017-08-23 14:11:32 -07:00
mkosieradzki
a12479819d Improved transactions support in C API
Summary:
Solves #2632

Added OptimisticTransactionDB to the C API.
Added missing merge operations to Transaction.
Added missing get_for_update operation to transaction

If required I will create tests for this another day.
Closes https://github.com/facebook/rocksdb/pull/2633

Differential Revision: D5600906

Pulled By: yiwu-arbug

fbshipit-source-id: da23e4484433d8f59d471f778ff2ae210e3fe4eb
2017-08-23 12:40:28 -07:00
Andrew Kryczka
8ace1f79b5 add counter for deletion dropping optimization
Summary:
add this counter stat to track usage of deletion-dropping optimization. if usage is low, we can delete it to prevent bugs like #2726.
Closes https://github.com/facebook/rocksdb/pull/2761

Differential Revision: D5665421

Pulled By: ajkr

fbshipit-source-id: 881befa2d199838dac88709e7b376a43d304e3d4
2017-08-19 14:10:08 -07:00
Andrew Kryczka
ed0a4c93ef perf_context measure user bytes read
Summary:
With this PR, we can measure read-amp for queries where perf_context is enabled as follows:

```
SetPerfLevel(kEnableCount);
Get(1, "foo");
double read_amp = static_cast<double>(get_perf_context()->block_read_byte / get_perf_context()->get_read_bytes);
SetPerfLevel(kDisable);
```

Our internal infra enables perf_context for a sampling of queries. So we'll be able to compute the read-amp for the sample set, which can give us a good estimate of read-amp.
Closes https://github.com/facebook/rocksdb/pull/2749

Differential Revision: D5647240

Pulled By: ajkr

fbshipit-source-id: ad73550b06990cf040cc4528fa885360f308ec12
2017-08-18 11:43:33 -07:00
Sagar Vemuri
9a44b4c32c Allow merge operator to be called even with a single operand
Summary:
Added a function `MergeOperator::DoesAllowSingleMergeOperand()` to allow invoking a merge operator even with a single merge operand, if overriden.

This is needed for Cassandra-on-RocksDB work. All Cassandra writes are through merges and this will allow a single merge-value to be updated in the merge-operator invoked via a compaction, if needed, due to an expired TTL.
Closes https://github.com/facebook/rocksdb/pull/2721

Differential Revision: D5608706

Pulled By: sagar0

fbshipit-source-id: f299f9f91c4d1ac26e48bd5906e122c1c5e5f3fc
2017-08-16 23:42:00 -07:00
follitude
ac8fb77afd fix some misspellings
Summary:
PTAL ajkr
Closes https://github.com/facebook/rocksdb/pull/2750

Differential Revision: D5648052

Pulled By: ajkr

fbshipit-source-id: 7cd1ddd61364d5a55a10fdd293fa74b2bf89dd98
2017-08-16 21:57:20 -07:00
Andrew Kryczka
af012c0f83 fix deleterange with memtable prefix bloom
Summary:
the range delete tombstones in memtable should be added to the aggregator even when the memtable's prefix bloom filter tells us the lookup key's not there. This bug could cause data to temporarily reappear until the memtable containing range deletions is flushed.

Reported in #2743.
Closes https://github.com/facebook/rocksdb/pull/2745

Differential Revision: D5639007

Pulled By: ajkr

fbshipit-source-id: 04fc6facb6f978340a3f639536f4ca7c0d73dfc9
2017-08-16 19:13:01 -07:00
Andrew Kryczka
1c8dbe2aa2 update scores after picking universal compaction
Summary:
We forgot to recompute compaction scores after picking a universal compaction like we do in level compaction (a34b2e388e/db/compaction_picker.cc (L691-L695)). This leads to a fairness issue where we waste compactions on CFs/DB instances that don't need it while others can starve.

Previously, ccecf3f4fb fixed the issue for the read-amp-based compaction case; this PR avoids the issue earlier and also for size-ratio-based compactions.
Closes https://github.com/facebook/rocksdb/pull/2688

Differential Revision: D5566191

Pulled By: ajkr

fbshipit-source-id: 010bccb2a107f6a76f3d3022b90aadce5cc48feb
2017-08-16 18:42:33 -07:00
Maysam Yabandeh
eb6425303e Update WritePrepared with the pseudo code
Summary:
Implement the main body of WritePrepared pseudo code. This includes PrepareInternal and CommitInternal, as well as AddCommitted which updates the commit map. It also provides a IsInSnapshot method that could be later called form the read path to decide if a version is in the read snapshot or it should other be skipped.

This patch lacks unit tests and does not attempt to offer an efficient implementation. The idea is that to have the API specified so that we can work on related tasks in parallel.
Closes https://github.com/facebook/rocksdb/pull/2713

Differential Revision: D5640021

Pulled By: maysamyabandeh

fbshipit-source-id: bfa7a05e8d8498811fab714ce4b9c21530514e1c
2017-08-16 16:57:47 -07:00
Siying Dong
71598cdc75 Fix false removal of tombstone issue in FIFO and kCompactionStyleNone
Summary:
Similar to the bug fixed by https://github.com/facebook/rocksdb/pull/2726, FIFO with compaction and kCompactionStyleNone during user customized CompactFiles() with output level to be 0 can suffer from the same problem. Fix it by leveraging the bottommost_level_ flag.
Closes https://github.com/facebook/rocksdb/pull/2735

Differential Revision: D5626906

Pulled By: siying

fbshipit-source-id: 2b148d0461c61dbd986d74655e384419ae442158
2017-08-15 13:02:19 -07:00
Andrew Kryczka
acf935e40f fix deletion dropping in intra-L0
Summary:
`KeyNotExistsBeyondOutputLevel` didn't consider L0 files' key-ranges. So if a key only was covered by older L0 files' key-ranges, we would incorrectly drop deletions of that key. This PR just skips the deletion-dropping optimization when output level is L0.
Closes https://github.com/facebook/rocksdb/pull/2726

Differential Revision: D5617286

Pulled By: ajkr

fbshipit-source-id: 4bff1396b06d49a828ba4542f249191052915bce
2017-08-11 18:12:38 -07:00
yiwu-arbug
3f5888430a Fix c_test ASAN failure
Summary:
Fix c_test missing deletion of write batch pointer.
Closes https://github.com/facebook/rocksdb/pull/2725

Differential Revision: D5613866

Pulled By: yiwu-arbug

fbshipit-source-id: bf3f59a6812178577c9c25bae558ef36414a1f51
2017-08-11 13:00:15 -07:00
Andrew Kryczka
6f051e0c71 fix corruption_test valgrind
Summary: Closes https://github.com/facebook/rocksdb/pull/2724

Differential Revision: D5613416

Pulled By: ajkr

fbshipit-source-id: ed55fb66ab1b41dfdfe765fe3264a1c87a8acb00
2017-08-11 12:29:14 -07:00
Kent767
ac098a4626 expose set_skip_stats_update_on_db_open to C bindings
Summary:
It would be super helpful to not have to recompile rocksdb to get this performance tweak for mechanical disks.

I have signed the CLA.
Closes https://github.com/facebook/rocksdb/pull/2718

Differential Revision: D5606994

Pulled By: yiwu-arbug

fbshipit-source-id: c05e92bad0d03bd38211af1e1ced0d0d1e02f634
2017-08-11 12:16:45 -07:00
Siying Dong
666a005f9b Support prefetch last 512KB with direct I/O in block based file reader
Summary:
Right now, if direct I/O is enabled, prefetching the last 512KB cannot be applied, except compaction inputs or readahead is enabled for iterators. This can create a lot of I/O for HDD cases. To solve the problem, the 512KB is prefetched in block based table if direct I/O is enabled. The prefetched buffer is passed in totegher with random access file reader, so that we try to read from the buffer before reading from the file. This can be extended in the future to support flexible user iterator readahead too.
Closes https://github.com/facebook/rocksdb/pull/2708

Differential Revision: D5593091

Pulled By: siying

fbshipit-source-id: ee36ff6d8af11c312a2622272b21957a7b5c81e7
2017-08-11 12:16:45 -07:00