Summary:
This is a simpler version of #3097 by removing all unrelated changes.
Fixing the bug where concurrent writes may get Status::OK while it actually gets IOError on WAL write. This happens when multiple writes form a write batch group, and the leader get an IOError while writing to WAL. The leader failed to pass the error to followers in the group, and the followers end up returning Status::OK() while actually writing nothing. The bug only affect writes in a batch group. Future writes after the batch group will correctly return immediately with the IOError.
Closes https://github.com/facebook/rocksdb/pull/3201
Differential Revision: D6421644
Pulled By: yiwu-arbug
fbshipit-source-id: 1c2a455c5b73f6842423785eb8a9dbfbb191dc0e
Summary:
Before we were checking every file in the level which was unnecessary. We can piggyback onto the code for checking point-key overlap, which already opens all the files that could possibly contain overlapping range deletions. This PR makes us check just the range deletions from those files, so no extra ones will be opened.
Closes https://github.com/facebook/rocksdb/pull/3179
Differential Revision: D6358125
Pulled By: ajkr
fbshipit-source-id: 00e200770fdb8f3cc6b1b2da232b755e4ba36279
Summary:
Expose read and write options via the C API
Closes https://github.com/facebook/rocksdb/pull/3185
Differential Revision: D6389658
Pulled By: sagar0
fbshipit-source-id: 1848912750329a476805b3cb2f315e7b71f61472
Summary:
Augment WriteWithCallbackTest to also test when seq_per_batch is true.
Closes https://github.com/facebook/rocksdb/pull/3195
Differential Revision: D6398143
Pulled By: maysamyabandeh
fbshipit-source-id: 7bc4218609355ec20fed25df426a8455ec2390d3
Summary:
This diff adds a new ticker stat, NUMBER_ITER_SKIP, to count the
number of internal keys skipped during iteration. Keys can be skipped
due to deletes, or lower sequence number, or higher sequence number
than the one requested.
Also, fix the issue when StatisticsData is naturally aligned on cacheline boundary,
padding becomes a zero size array, which the Windows compiler doesn't
like. So add a cacheline worth of padding in that case to keep it happy.
We cannot conditionally add padding as gcc doesn't allow using sizeof
in preprocessor directives.
Closes https://github.com/facebook/rocksdb/pull/3177
Differential Revision: D6353897
Pulled By: anand1976
fbshipit-source-id: 441d5a09af9c4e22e7355242dfc0c7b27aa0a6c2
Summary:
Allow users to configure the trash-to-DB size ratio limit, so
that ratelimits for deletes can be enforced even when larger portions of
the database are being deleted.
Closes https://github.com/facebook/rocksdb/pull/3158
Differential Revision: D6304897
Pulled By: gdavidsson
fbshipit-source-id: a28dd13059ebab7d4171b953ed91ce383a84d6b3
Summary:
Refactor the logic around WriteCallback in the write path to clarify when and how exactly we advance the sequence number and making sure it is consistent across the code.
Closes https://github.com/facebook/rocksdb/pull/3168
Differential Revision: D6324312
Pulled By: maysamyabandeh
fbshipit-source-id: 9a34f479561fdb2a5d01ef6d37a28908d03bbe33
Summary:
The sequence number was not properly advanced after a rollback marker. The patch extends the existing unit tests to detect the bug and also fixes it.
Closes https://github.com/facebook/rocksdb/pull/3157
Differential Revision: D6304291
Pulled By: maysamyabandeh
fbshipit-source-id: 1b519c44a5371b802da49c9e32bd00087a8da401
Summary:
When testing rebuilding_trx_ in MemTableInserter might still be set before the tests finishes which would cause ASAN alarms for leaks. This patch deletes the pointers in MemTableInserter destructor.
Closes https://github.com/facebook/rocksdb/pull/3162
Differential Revision: D6317113
Pulled By: maysamyabandeh
fbshipit-source-id: a68be70709a4fff7ac2b768660119311968f9c21
Summary:
This patch clarifies and refactors the logic around tracked keys in transactions.
Closes https://github.com/facebook/rocksdb/pull/3140
Differential Revision: D6290258
Pulled By: maysamyabandeh
fbshipit-source-id: 03b50646264cbcc550813c060b180fc7451a55c1
Summary:
Add tests to ensure that WritePrepared and WriteCommitted policies are cross compatible when the db WAL is empty. This is important when the admin want to switch between the policies. In such case, before the switch the admin needs to empty the WAL by i) committing/rollbacking all the pending transactions, ii) FlushMemTables
Closes https://github.com/facebook/rocksdb/pull/3118
Differential Revision: D6227247
Pulled By: maysamyabandeh
fbshipit-source-id: bcde3d92c1e89cda3b9cfa69f6a20af5d8993db7
Summary:
Add per-exe execution capability
Add fix parsing of groups/tests
Add timer test exclusion
Fix unit tests
Ifdef threadpool specific tests that do not pass on Vista threadpool.
Remove spurious outout from prefix_test so test case listing works
properly.
Fix not using standard test directories results in file creation errors
in sst_dump_test.
BlobDb fixes:
In C++ end() iterators can not be dereferenced. They are not valid.
When deleting blob_db_ set it to nullptr before any other code executes.
Not fixed:. On Windows you can not delete a file while it is open.
[ RUN ] BlobDBTest.ReadWhileGC
d:\dev\rocksdb\rocksdb\utilities\blob_db\blob_db_test.cc(75): error: DestroyBlobDB(dbname_, options, bdb_options)
IO error: Failed to delete: d:/mnt/db\testrocksdb-17444/blob_db_test/blob_dir/000001.blob: Permission denied
d:\dev\rocksdb\rocksdb\utilities\blob_db\blob_db_test.cc(75): error: DestroyBlobDB(dbname_, options, bdb_options)
IO error: Failed to delete: d:/mnt/db\testrocksdb-17444/blob_db_test/blob_dir/000001.blob: Permission denied
write_batch
Should not call front() if there is a chance the container is empty
Closes https://github.com/facebook/rocksdb/pull/3152
Differential Revision: D6293274
Pulled By: sagar0
fbshipit-source-id: 318c3717c22087fae13b18715dffb24565dbd956
Summary:
Add a simple policy for NVMe write time life hint
Closes https://github.com/facebook/rocksdb/pull/3095
Differential Revision: D6298030
Pulled By: shligit
fbshipit-source-id: 9a72a42e32e92193af11599eb71f0cf77448e24d
Summary:
The previous compression type selection caused unexpected behavior when the base level was also the bottommost level. The following sequence of events could happen:
- full compaction generates files with `bottommost_compression` type
- now base level is bottommost level since all files are in the same level
- any compaction causes files to be rewritten `compression_per_level` type since bottommost compression didn't apply to base level
I changed the code to make bottommost compression apply to base level.
Closes https://github.com/facebook/rocksdb/pull/3141
Differential Revision: D6264614
Pulled By: ajkr
fbshipit-source-id: d7aaa8675126896684154a1f2c9034d6214fde82
Summary:
While investigating the usage of `new_table_iterator_nanos` perf counter, I saw some code was wrapper around with unnecessary status check ... so removed it.
Closes https://github.com/facebook/rocksdb/pull/3120
Differential Revision: D6229181
Pulled By: sagar0
fbshipit-source-id: f8a44fe67f5a05df94553fdb233b21e54e88cc34
Summary:
Instead of using samples directly, we now support passing the samples through zstd's dictionary generator when `CompressionOptions::zstd_max_train_bytes` is set to nonzero. If set to zero, we will use the samples directly as the dictionary -- same as before.
Note this is the first step of #2987, extracted into a separate PR per reviewer request.
Closes https://github.com/facebook/rocksdb/pull/3057
Differential Revision: D6116891
Pulled By: ajkr
fbshipit-source-id: 70ab13cc4c734fa02e554180eed0618b75255497
Summary:
Previously setting `write_buffer_size` with `SetOptions` would only apply to new memtables. An internal user wanted it to take effect immediately, instead of at an arbitrary future point, to prevent OOM.
This PR makes the memtable's size mutable, and makes `SetOptions()` mutate it. There is one case when we preserve the old behavior, which is when memtable prefix bloom filter is enabled and the user is increasing the memtable's capacity. That's because the prefix bloom filter's size is fixed and wouldn't work as well on a larger memtable.
Closes https://github.com/facebook/rocksdb/pull/3119
Differential Revision: D6228304
Pulled By: ajkr
fbshipit-source-id: e44bd9d10a5f8c9d8c464bf7436070bb3eafdfc9
Summary:
After adding expiration to blob index in #3066, we are now able to add a compaction filter to cleanup expired blob index entries.
Closes https://github.com/facebook/rocksdb/pull/3090
Differential Revision: D6183812
Pulled By: yiwu-arbug
fbshipit-source-id: 9cb03267a9702975290e758c9c176a2c03530b83
Summary:
Blob db will keep blob file if data in the file is visible to an active snapshot. Before this patch it checks whether there is an active snapshot has sequence number greater than the earliest sequence in the file. This is problematic since we take snapshot on every read, if it keep having reads, old blob files will not be cleanup. Change to check if there is an active snapshot falls in the range of [earliest_sequence, obsolete_sequence) where obsolete sequence is
1. if data is relocated to another file by garbage collection, it is the latest sequence at the time garbage collection finish
2. otherwise, it is the latest sequence of the file
Closes https://github.com/facebook/rocksdb/pull/3087
Differential Revision: D6182519
Pulled By: yiwu-arbug
fbshipit-source-id: cdf4c35281f782eb2a9ad6a87b6727bbdff27a45
Summary:
It's also defined in db/dbformat.cc per 7fe3b32896
Closes https://github.com/facebook/rocksdb/pull/3111
Differential Revision: D6219140
Pulled By: ajkr
fbshipit-source-id: 0f2b14e41457334a4665c6b7e3f42f1a060a0f35
Summary:
Implements ValidateSnapshot for WritePrepared txns and also adds a unit test to clarify the contract of this function.
Closes https://github.com/facebook/rocksdb/pull/3101
Differential Revision: D6199405
Pulled By: maysamyabandeh
fbshipit-source-id: ace509934c307ea5d26f4bbac5f836d7c80fd240
Summary:
The motivation for this PR is to add to RocksDB support for differential (incremental) snapshots, as snapshot of the DB changes between two points in time (one can think of it as diff between to sequence numbers, or the diff D which can be thought of as an SST file or just set of KVs that can be applied to sequence number S1 to get the database to the state at sequence number S2).
This feature would be useful for various distributed storages layers built on top of RocksDB, as it should help reduce resources (time and network bandwidth) needed to recover and rebuilt DB instances as replicas in the context of distributed storages.
From the API standpoint that would like client app requesting iterator between (start seqnum) and current DB state, and reading the "diff".
This is a very draft PR for initial review in the discussion on the approach, i'm going to rework some parts and keep updating the PR.
For now, what's done here according to initial discussions:
Preserving deletes:
- We want to be able to optionally preserve recent deletes for some defined period of time, so that if a delete came in recently and might need to be included in the next incremental snapshot it would't get dropped by a compaction. This is done by adding new param to Options (preserve deletes flag) and new variable to DB Impl where we keep track of the sequence number after which we don't want to drop tombstones, even if they are otherwise eligible for deletion.
- I also added a new API call for clients to be able to advance this cutoff seqnum after which we drop deletes; i assume it's more flexible to let clients control this, since otherwise we'd need to keep some kind of timestamp < -- > seqnum mapping inside the DB, which sounds messy and painful to support. Clients could make use of it by periodically calling GetLatestSequenceNumber(), noting the timestamp, doing some calculation and figuring out by how much we need to advance the cutoff seqnum.
- Compaction codepath in compaction_iterator.cc has been modified to avoid dropping tombstones with seqnum > cutoff seqnum.
Iterator changes:
- couple params added to ReadOptions, to optionally allow client to request internal keys instead of user keys (so that client can get the latest value of a key, be it delete marker or a put), as well as min timestamp and min seqnum.
TableCache changes:
- I modified table_cache code to be able to quickly exclude SST files from iterators heep if creation_time on the file is less then iter_start_ts as passed in ReadOptions. That would help a lot in some DB settings (like reading very recent data only or using FIFO compactions), but not so much for universal compaction with more or less long iterator time span.
What's left:
- Still looking at how to best plug that inside DBIter codepath. So far it seems that FindNextUserKeyInternal only parses values as UserKeys, and iter->key() call generally returns user key. Can we add new API to DBIter as internal_key(), and modify this internal method to optionally set saved_key_ to point to the full internal key? I don't need to store actual seqnum there, but I do need to store type.
Closes https://github.com/facebook/rocksdb/pull/2999
Differential Revision: D6175602
Pulled By: mikhail-antonov
fbshipit-source-id: c779a6696ee2d574d86c69cec866a3ae095aa900
Summary:
GetCommitTimeWriteBatch is currently used to store some state as part of commit in 2PC. In MyRocks it is specifically used to store some data that would be needed only during recovery. So it is not need to be stored in memtable right after each commit.
This patch enables an optimization to write the GetCommitTimeWriteBatch only to the WAL. The batch will be written to memtable during recovery when the WAL is replayed. To cover the case when WAL is deleted after memtable flush, the batch is also buffered and written to memtable right before each memtable flush.
Closes https://github.com/facebook/rocksdb/pull/3071
Differential Revision: D6148023
Pulled By: maysamyabandeh
fbshipit-source-id: 2d09bae5565abe2017c0327421010d5c0d55eaa7
Summary:
The assertion was caught by `MySQLStyleTransactionTest/MySQLStyleTransactionTest.TransactionStressTest/5` when run in a loop. The caller doesn't track whether the released snapshot is oldest, so let this function handle that case.
Closes https://github.com/facebook/rocksdb/pull/3080
Differential Revision: D6185257
Pulled By: ajkr
fbshipit-source-id: 4b3015c11db5d31e46521a00af568546ef4558cd
Summary:
write_options.sync = true and write_options.disableWAL is incompatible. When WAL is disabled, we are not able to persist the write immediately. Return an error in this case to avoid misuse of the options.
Closes https://github.com/facebook/rocksdb/pull/3086
Differential Revision: D6176822
Pulled By: yiwu-arbug
fbshipit-source-id: 1eb10028c14fe7d7c13c8bc12c0ef659f75aa071
Summary:
Problem was in bottommost compaction, when an L0->L0 compaction happened and L0 was bottommost. Then we'd preserve tombstones according to `Compaction::KeyNotExistsBeyondOutputLevel`, while zeroing seqnum according to `CompactionIterator::PrepareOutput`, thus triggering the assertion in `PrepareOutput`. To fix, we can just drop tombstones in L0->L0 when the output is "bottommost", i.e., the compaction includes the oldest L0 file and there's nothing at lower levels.
Closes https://github.com/facebook/rocksdb/pull/3085
Differential Revision: D6175742
Pulled By: ajkr
fbshipit-source-id: 8ab19a2e001496f362e9eb0a71757e2f6ecfdb3b
Summary:
We don't propagate TableProperty::oldest_key_time on compaction and just write the default value to SST files. It is more natural to default the value to 0.
Also revert db_sst_test back to before #2842.
Closes https://github.com/facebook/rocksdb/pull/3079
Differential Revision: D6165702
Pulled By: yiwu-arbug
fbshipit-source-id: ca3ce5928d96ae79a5beb12bb7d8c640a71478a0
Summary:
SstFileManager move files that need to be deleted into a trash directory.
Deprecate this behaviour and instead add ".trash" extension to files that need to be deleted
Closes https://github.com/facebook/rocksdb/pull/2970
Differential Revision: D5976805
Pulled By: IslamAbdelRahman
fbshipit-source-id: 27374ece4315610b2792c30ffcd50232d4c9a343
Summary:
db/column_family.h :
79 ColumnFamilyHandleInternal()
CID 1322806 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
2. uninit_member: Non-static class member internal_cfd_ is not initialized in this constructor nor in any functions that it calls.
80 : ColumnFamilyHandleImpl(nullptr, nullptr, nullptr) {}
db/compacted_db_impl.cc:
18CompactedDBImpl::CompactedDBImpl(
19 const DBOptions& options, const std::string& dbname)
20 : DBImpl(options, dbname) {
2. uninit_member: Non-static class member cfd_ is not initialized in this constructor nor in any functions that it calls.
4. uninit_member: Non-static class member version_ is not initialized in this constructor nor in any functions that it calls.
CID 1396120 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
6. uninit_member: Non-static class member user_comparator_ is not initialized in this constructor nor in any functions that it calls.
21}
db/compaction_iterator.cc:
9. uninit_member: Non-static class member current_user_key_sequence_ is not initialized in this constructor nor in any functions that it calls.
11. uninit_member: Non-static class member current_user_key_snapshot_ is not initialized in this constructor nor in any functions that it calls.
CID 1419855 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
13. uninit_member: Non-static class member current_key_committed_ is not initialized in this constructor nor in any functions that it calls.
Closes https://github.com/facebook/rocksdb/pull/3084
Differential Revision: D6172999
Pulled By: sagar0
fbshipit-source-id: 084d73393faf8022c01359cfb445807b6a782460
Summary:
- for `SeekToFirst()`, just convert it to a regular `Seek()` if lower bound is specified
- for operations that iterate backwards over user keys (`SeekForPrev`, `SeekToLast`, `Prev`), change `PrevInternal` to check whether user key went below lower bound every time the user key changes -- same approach we use to ensure we stay within a prefix when `prefix_same_as_start=true`.
Closes https://github.com/facebook/rocksdb/pull/3074
Differential Revision: D6158654
Pulled By: ajkr
fbshipit-source-id: cb0e3a922e2650d2cd4d1c6e1c0f1e8b729ff518
Summary:
Adding the `min_blob_size` option to allow storing small values in base db (in LSM tree) together with the key. The goal is to improve performance for small values, while taking advantage of blob db's low write amplification for large values.
Also adding expiration timestamp to blob index. It will be useful to evict stale blob indexes in base db by adding a compaction filter. I'll work on the compaction filter in future patches.
See blob_index.h for the new blob index format. There are 4 cases when writing a new key:
* small value w/o TTL: put in base db as normal value (i.e. ValueType::kTypeValue)
* small value w/ TTL: put (type, expiration, value) to base db.
* large value w/o TTL: write value to blob log and put (type, file, offset, size, compression) to base db.
* large value w/TTL: write value to blob log and put (type, expiration, file, offset, size, compression) to base db.
Closes https://github.com/facebook/rocksdb/pull/3066
Differential Revision: D6142115
Pulled By: yiwu-arbug
fbshipit-source-id: 9526e76e19f0839310a3f5f2a43772a4ad182cd0
Summary:
When snapshots are held for a long time, files may reach the bottom level containing overwritten/deleted keys. We previously had no mechanism to trigger compaction on such files. This particularly impacted DBs that write to different parts of the keyspace over time, as such files would never be naturally compacted due to second-last level files moving down. This PR introduces a mechanism for bottommost files to be recompacted upon releasing all snapshots that prevent them from dropping their deleted/overwritten keys.
- Changed `CompactionPicker` to compact files in `BottommostFilesMarkedForCompaction()`. These are the last choice when picking. Each file will be compacted alone and output to the same level in which it originated. The goal of this type of compaction is to rewrite the data excluding deleted/overwritten keys.
- Changed `ReleaseSnapshot()` to recompute the bottom files marked for compaction when the oldest existing snapshot changes, and schedule a compaction if needed. We cache the value that oldest existing snapshot needs to exceed in order for another file to be marked in `bottommost_files_mark_threshold_`, which allows us to avoid recomputing marked files for most snapshot releases.
- Changed `VersionStorageInfo` to track the list of bottommost files, which is recomputed every time the version changes by `UpdateBottommostFiles()`. The list of marked bottommost files is first computed in `ComputeBottommostFilesMarkedForCompaction()` when the version changes, but may also be recomputed when `ReleaseSnapshot()` is called.
- Extracted core logic of `Compaction::IsBottommostLevel()` into `VersionStorageInfo::RangeMightExistAfterSortedRun()` since logic to check whether a file is bottommost is now necessary outside of compaction.
Closes https://github.com/facebook/rocksdb/pull/3009
Differential Revision: D6062044
Pulled By: ajkr
fbshipit-source-id: 123d201cf140715a7d5928e8b3cb4f9cd9f7ad21
Summary:
When doing a Seek() or SeekForPrev() we should stop the moment we see a key with a different prefix as start if ReadOptions:: prefix_same_as_start was set to true
Right now we don't stop if we encounter a tombstone outside the prefix while executing SeekForPrev()
Closes https://github.com/facebook/rocksdb/pull/3067
Differential Revision: D6149638
Pulled By: IslamAbdelRahman
fbshipit-source-id: 7f659862d2bf552d3c9104a360c79439ceba2f18
Summary:
A few simple changes to allow RocksDB to be built on OpenBSD. Let me know if any further changes are needed.
Closes https://github.com/facebook/rocksdb/pull/3061
Differential Revision: D6138800
Pulled By: ajkr
fbshipit-source-id: a13a17b5dc051e6518bd56a8c5efd1d24dd81b0c
Summary:
With FIFO compaction we would like to get the oldest data time for monitoring. The problem is we don't have timestamp for each key in the DB. As an approximation, we expose the earliest of sst file "creation_time" property.
My plan is to override the property with a more accurate value with blob db, where we actually have timestamp.
Closes https://github.com/facebook/rocksdb/pull/2842
Differential Revision: D5770600
Pulled By: yiwu-arbug
fbshipit-source-id: 03833c8f10bbfbee62f8ea5c0d03c0cafb5d853a
Summary:
MSVC does not support unused attribute at this time. A separate assignment line fixes the issue probably by being counted as usage for MSVC and it no longer complains about unused var.
Closes https://github.com/facebook/rocksdb/pull/3048
Differential Revision: D6126272
Pulled By: maysamyabandeh
fbshipit-source-id: 4907865db45fd75a39a15725c0695aaa17509c1f
Summary:
Enable concurrent_prepare flag for WritePrepared transactions and extend the existing transaction tests with this config.
Closes https://github.com/facebook/rocksdb/pull/3046
Differential Revision: D6106534
Pulled By: maysamyabandeh
fbshipit-source-id: 88c8d21d45bc492beb0a131caea84a2ac5e7d38c
Summary:
This test shouldn't be enabled under the lite version; and this fixes the failing contrun test due to #3006.
Closes https://github.com/facebook/rocksdb/pull/3056
Differential Revision: D6114681
Pulled By: sagar0
fbshipit-source-id: dc5243549ae6b1353cec7edb820c771d95f66dda
Summary:
The test currently times out when it is run under tsan. This patch split it into 4 tests.
Closes https://github.com/facebook/rocksdb/pull/3047
Differential Revision: D6106515
Pulled By: maysamyabandeh
fbshipit-source-id: 03a28cdf8b1c097be2361b1b0cc3dc1acf2b5d63
Summary:
fixup 6a541afcc4. This code didn't do anything because (1) `bytes_per_sync` is assigned in `EnvOptions`'s constructor; and (2) `OptimizeForCompactionTableWrite`'s return value was ignored, even though its only purpose is to return something.
Closes https://github.com/facebook/rocksdb/pull/3055
Differential Revision: D6114132
Pulled By: ajkr
fbshipit-source-id: ea4831770930e9cf83518e13eb2e1934d1f5487c
Summary:
ColumnFamilyOptions::compaction_options_fifo and all its sub-fields can be set dynamically now.
Some of the ways in which the fifo compaction options can be set are:
- `SetOptions({{"compaction_options_fifo", "{max_table_files_size=1024}"}})`
- `SetOptions({{"compaction_options_fifo", "{ttl=600;}"}})`
- `SetOptions({{"compaction_options_fifo", "{max_table_files_size=1024;ttl=600;}"}})`
- `SetOptions({{"compaction_options_fifo", "{max_table_files_size=51;ttl=49;allow_compaction=true;}"}})`
Most of the code has been made generic enough so that it could be reused later to make universal options (and other such nested defined-types) dynamic with very few lines of parsing/serializing code changes.
Introduced a few new functions like `ParseStruct`, `SerializeStruct` and `GetStringFromStruct`.
The duplicate code in `GetStringFromDBOptions` and `GetStringFromColumnFamilyOptions` has been moved into `GetStringFromStruct`. So they become just simple wrappers now.
Closes https://github.com/facebook/rocksdb/pull/3006
Differential Revision: D6058619
Pulled By: sagar0
fbshipit-source-id: 1e8f78b3374ca5249bb4f3be8a6d3bb4cbc52f92
Summary:
instad of for specific level -> instead of a specific level
Closes https://github.com/facebook/rocksdb/pull/3040
Differential Revision: D6090811
Pulled By: sagar0
fbshipit-source-id: 499edef0a6f596c448f61791e6aca8f5cce08e9c
Summary:
As discussed on the mailing list (["Skipping entire SSTs while iterating"](https://groups.google.com/forum/#!topic/rocksdb/ujHCJVLrHlU)), this patch adds a `table_filter` to `ReadOptions` that allows specifying a callback to be executed during iteration before each table in the database is scanned. The callback is passed the table's properties; the table is scanned iff the callback returns true.
This can be used in conjunction with a `TablePropertiesCollector` to dramatically speed up scans by skipping tables that are known to contain irrelevant data for the scan at hand.
We're using this [downstream in CockroachDB](https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/engine/db.cc#L2009-L2022) already. With this feature, under ideal conditions, we can reduce the time of an incremental backup in from hours to seconds.
FYI, the first commit in this PR fixes a segfault that I unfortunately have not figured out how to reproduce outside of CockroachDB. I'm hoping you accept it on the grounds that it is not correct to return 8-byte aligned memory from a call to `malloc` on some 64-bit platforms; one correct approach is to infer the necessary alignment from `std::max_align_t`, as done here. As noted in the first commit message, the bug is tickled by having a`std::function` in `struct ReadOptions`. That is, the following patch alone is enough to cause RocksDB to segfault when run from CockroachDB on Darwin.
```diff
--- a/include/rocksdb/options.h
+++ b/include/rocksdb/options.h
@@ -1546,6 +1546,13 @@ struct ReadOptions {
// Default: false
bool ignore_range_deletions;
+ // A callback to determine whether relevant keys for this scan exist in a
+ // given table based on the table's properties. The callback is passed the
+ // properties of each table during iteration. If the callback returns false,
+ // the table will not be scanned.
+ // Default: empty (every table will be scanned)
+ std::function<bool(const TableProperties&)> table_filter;
+
ReadOptions();
ReadOptions(bool cksum, bool cache);
};
```
/cc danhhz
Closes https://github.com/facebook/rocksdb/pull/2265
Differential Revision: D5054262
Pulled By: yiwu-arbug
fbshipit-source-id: dd6b28f2bba6cb8466250d8c5c542d3c92785476
Summary:
Blob db insert blob index to base db as kTypeBlobIndex type, to tell apart values written by plain rocksdb or blob db. This is to make it possible to migrate from existing rocksdb to blob db.
Also with the patch blob db garbage collection get away from OptimisticTransaction. Instead it use a custom write callback to achieve similar behavior as OptimisticTransaction. This is because we need to pass the is_blob_index flag to DBImpl::Get but OptimisticTransaction don't support it.
Closes https://github.com/facebook/rocksdb/pull/3000
Differential Revision: D6050044
Pulled By: yiwu-arbug
fbshipit-source-id: 61dc72ab9977625e75f78cd968e7d8a3976e3632
Summary:
* make `checksum_type_string_map` available for lite
* comment out `FilesPerLevel` in lite mode.
* travis and legocastle lite build also build `all` target and run tests
Closes https://github.com/facebook/rocksdb/pull/3015
Differential Revision: D6069822
Pulled By: yiwu-arbug
fbshipit-source-id: 9fe92ac220e711e9e6ed4e921bd25ef4314796a0
Summary:
The referencing logic is super confusing so added a comment at the part that took me longest to figure out.
Closes https://github.com/facebook/rocksdb/pull/2996
Differential Revision: D6034969
Pulled By: ajkr
fbshipit-source-id: 9cc2e744c1f79d6d57d378f86ed59238a5f583db
Summary:
Right now in `PutCFImpl` we always increment NUMBER_KEYS_UPDATED counter for both in-place update or insertion. This PR fixes this by using the correct counter for either case.
Closes https://github.com/facebook/rocksdb/pull/2986
Differential Revision: D6016300
Pulled By: miasantreble
fbshipit-source-id: 0aed327522e659450d533d1c47d3a9f568fac65d
Summary:
The file numbers assigned post-repair were sometimes smaller than older files' numbers due to `LogAndApply` saving the wrong next file number in the manifest.
- Mark the highest file seen during repair as used before `LogAndApply` so the correct next file number will be stored.
- Renamed `MarkFileNumberUsedDuringRecovery` to `MarkFileNumberUsed` since now it's used during repair in addition to during recovery
- Added `TEST_Current_Next_FileNo` to expose the next file number for the unit test.
Closes https://github.com/facebook/rocksdb/pull/2988
Differential Revision: D6018083
Pulled By: ajkr
fbshipit-source-id: 3f25cbf74439cb8f16dd12af90b67f9f9f75e718
Summary:
Hi,
As part of some optimization, we're using multiple DB locations (tmpfs and spindle) to store data and configured max_bytes_for_level_multiplier_additional. But, max_bytes_for_level_multiplier_additional is not used to compute the actual size for the level while picking the DB location. So, even if DB location does not have space, RocksDB mistakenly puts the level at that location.
Can someone pls. verify the fix? Let me know any other changes required.
Thanks,
Jay
Closes https://github.com/facebook/rocksdb/pull/2704
Differential Revision: D5992515
Pulled By: ajkr
fbshipit-source-id: cbbc6c0e0a7dbdca91c72e0f37b218c4cec57e28
Summary:
On iterator create, take a snapshot, create a ReadCallback and pass the ReadCallback to the underlying DBIter to check if key is committed.
Closes https://github.com/facebook/rocksdb/pull/2981
Differential Revision: D6001471
Pulled By: yiwu-arbug
fbshipit-source-id: 3565c4cdaf25370ba47008b0e0cb65b31dfe79fe
Summary:
Update Compaction/Flush to support WritePreparedTxnDB: Add SnapshotChecker which is a proxy to query WritePreparedTxnDB::IsInSnapshot. Pass SnapshotChecker to DBImpl on WritePreparedTxnDB open. CompactionIterator use it to check if a key has been committed and if it is visible to a snapshot. In CompactionIterator:
* check if key has been committed. If not, output uncommitted keys AS-IS.
* use SnapshotChecker to check if key is visible to a snapshot when in need.
* do not output key with seq = 0 if the key is not committed.
Closes https://github.com/facebook/rocksdb/pull/2926
Differential Revision: D5902907
Pulled By: yiwu-arbug
fbshipit-source-id: 945e037fdf0aa652dc5ba0ad879461040baa0320
Summary:
Add a new function in Listener to let the caller know when rocksdb
is stalling writes.
Closes https://github.com/facebook/rocksdb/pull/2897
Differential Revision: D5860124
Pulled By: schischi
fbshipit-source-id: ee791606169aa64f772c86f817cebf02624e05e1
Summary:
We need to tell the iterator the compaction output file's level so it can apply proper optimizations, like pinning filter and index blocks when user enables `pin_l0_filter_and_index_blocks_in_cache` and the output file's level is zero.
Closes https://github.com/facebook/rocksdb/pull/2949
Differential Revision: D5945597
Pulled By: ajkr
fbshipit-source-id: 2389decf9026ffaa32d45801a77d002529f64a62
Summary:
Also made the test more easier to understand:
- changed the value size to ~1MB.
- switched to NoCompression. We don't anyway need compression in this test for dynamic options.
The test failures started happening starting from: #2893 .
Closes https://github.com/facebook/rocksdb/pull/2957
Differential Revision: D5959392
Pulled By: sagar0
fbshipit-source-id: 2d55641e429246328bc6d10fcb9ef540d6ce07da
Summary:
Add kTypeBlobIndex value type, which will be used by blob db only, to insert a (key, blob_offset) KV pair. The purpose is to
1. Make it possible to open existing rocksdb instance as blob db. Existing value will be of kTypeIndex type, while value inserted by blob db will be of kTypeBlobIndex.
2. Make rocksdb able to detect if the db contains value written by blob db, if so return error.
3. Make it possible to have blob db optionally store value in SST file (with kTypeValue type) or as a blob value (with kTypeBlobIndex type).
The root db (DBImpl) basically pretended kTypeBlobIndex are normal value on write. On Get if is_blob is provided, return whether the value read is of kTypeBlobIndex type, or return Status::NotSupported() status if is_blob is not provided. On scan allow_blob flag is pass and if the flag is true, return wether the value is of kTypeBlobIndex type via iter->IsBlob().
Changes on blob db side will be in a separate patch.
Closes https://github.com/facebook/rocksdb/pull/2886
Differential Revision: D5838431
Pulled By: yiwu-arbug
fbshipit-source-id: 3c5306c62bc13bb11abc03422ec5cbcea1203cca
Summary:
There's no point populating the block cache during this read. The key we read is guaranteed to be overwritten with a new `kValueType` key immediately afterwards, so can't be accessed again. A user was seeing high turnover of data blocks, at least partially due to this.
Closes https://github.com/facebook/rocksdb/pull/2959
Differential Revision: D5961672
Pulled By: ajkr
fbshipit-source-id: e7cb27c156c5db3b32af355c780efb99dbdf087c
Summary:
PR 2893 introduced a variable that is only used in TEST_SYNC_POINT_CALLBACK. When RocksDB is not built in debug mode, this method is not compiled in, and the variable is unused, which triggers a compiler error.
This patch reverts the corresponding part of #2893.
Closes https://github.com/facebook/rocksdb/pull/2956
Reviewed By: yiwu-arbug
Differential Revision: D5955679
Pulled By: asandryh
fbshipit-source-id: ac4a8e85b22da7f02efb117cd2e4a6e07ba73390
Summary:
It was broken when `NotifyCollectTableCollectorsOnFinish` was introduced. That function called `Finish` on each of the `TablePropertiesCollector`s, and `CompactOnDeletionCollector::Finish()` was resetting all its internal state. Then, when we checked whether compaction is necessary, the flag had already been cleared.
Fixed above issue by avoiding resetting internal state during `Finish()`. Multiple calls to `Finish()` are allowed, but callers cannot invoke `AddUserKey()` on the collector after any finishes.
Closes https://github.com/facebook/rocksdb/pull/2936
Differential Revision: D5918659
Pulled By: ajkr
fbshipit-source-id: 4f05e9d80e50ee762ba1e611d8d22620029dca6b
Summary:
Recover txns from the WAL. Also added some unit tests.
Closes https://github.com/facebook/rocksdb/pull/2901
Differential Revision: D5859596
Pulled By: maysamyabandeh
fbshipit-source-id: 6424967b231388093b4effffe0a3b1b7ec8caeb0
Summary:
For every merge operand encountered for a key in the read path we now have the ability to decide whether to look further (to retrieve more merge operands for the key) or stop and invoke the merge operator to return the value. The user needs to override `ShouldMerge()` method with a condition to terminate search when true to avail this facility.
This has a couple of advantages:
1. It helps in limiting the number of merge operands that are looked at to compute a value as part of a user Get operation.
2. It allows to peek at a merge key-value to see if further merge operands need to look at.
Example: Limiting the number of merge operands that are looked at: Lets say you have 10 merge operands for a key spread over various levels. If you only want RocksDB to look at the latest two merge operands instead of all 10 to compute the value, it is now possible with this PR. You can set the condition in `ShouldMerge()` to return true when the size of the operand list is 2. Look at the example implementation in the unit test. Without this PR, a Get might look at all the 10 merge operands in different levels before invoking the merge-operator.
Added a new unit test.
Made sure that there is no perf regression by running benchmarks.
Command line to Load data:
```
TEST_TMPDIR=/dev/shm ./db_bench --benchmarks="mergerandom" --merge_operator="uint64add" --num=10000000
...
mergerandom : 12.861 micros/op 77757 ops/sec; 8.6 MB/s ( updates:10000000)
```
**ReadRandomMergeRandom bechmark results:**
Command line:
```
TEST_TMPDIR=/dev/shm ./db_bench --benchmarks="readrandommergerandom" --merge_operator="uint64add" --num=10000000
```
Base -- Without this code change (on commit fc7476b):
```
readrandommergerandom : 38.586 micros/op 25916 ops/sec; (reads:3001599 merges:6998401 total:10000000 hits:842235 maxlength:8)
```
With this code change:
```
readrandommergerandom : 38.653 micros/op 25870 ops/sec; (reads:3001599 merges:6998401 total:10000000 hits:842235 maxlength:8)
```
Closes https://github.com/facebook/rocksdb/pull/2923
Differential Revision: D5898239
Pulled By: sagar0
fbshipit-source-id: daefa325019f77968639a75c851d46352c2303ef
Summary:
SUMMARY
Moves the bytes_per_sync and wal_bytes_per_sync options from immutableoptions to mutable options. Also if wal_bytes_per_sync is changed, the wal file and memtables are flushed.
TEST PLAN
ran make check
all passed
Two new tests SetBytesPerSync, SetWalBytesPerSync check that after issuing setoptions with a new value for the var, the db options have the new value.
Closes https://github.com/facebook/rocksdb/pull/2893
Reviewed By: yiwu-arbug
Differential Revision: D5845814
Pulled By: TheRushingWookie
fbshipit-source-id: 93b52d779ce623691b546679dcd984a06d2ad1bd
Summary:
Context/problem:
- CFs may be flushed at different times
- A WAL can only be deleted after all CFs have flushed beyond end of that WAL.
- Point-in-time recovery might stop upon reaching the first corruption.
- Some CFs may have already flushed beyond that point, while others haven't. We should fail the Open() instead of proceeding with inconsistent CFs.
Closes https://github.com/facebook/rocksdb/pull/2900
Differential Revision: D5863281
Pulled By: miasantreble
fbshipit-source-id: 180dbaf83d96c804cff49b3c406312a4ae61313e
Summary:
Problem:
- `DB::SanitizeOptions` strips trailing slash from `wal_dir` but not `dbname`
- We check whether `wal_dir` and `dbname` refer to the same directory using string equality: https://github.com/facebook/rocksdb/blob/master/db/repair.cc#L258
- Providing `dbname` with trailing slash causes default `wal_dir` to be misidentified as a separate directory.
- Then the repair tries to add all SST files to the `VersionEdit` twice (once for `dbname` dir, once for `wal_dir`) and fails with coredump.
Solution:
- Add a new `Env` function, `AreFilesSame`, which uses device and inode number to check whether files are the same. It's currently only implemented in `PosixEnv`.
- Migrate repair to use `AreFilesSame` to check whether `dbname` and `wal_dir` are same. If unsupported, falls back to string comparison.
Closes https://github.com/facebook/rocksdb/pull/2827
Differential Revision: D5761349
Pulled By: ajkr
fbshipit-source-id: c839d548678b742af1166d60b09abd94e5476238
Summary:
Fix for [2461](https://github.com/facebook/rocksdb/issues/2461).
Problem: When using multiple db_paths setting with RocksDB, RocksDB incorrectly calculates the size of L1 in LevelCompactionBuilder::GetPathId.
max_bytes_for_level_base is used as L0 size and L1 size is calculated as (L0 size * max_bytes_for_level_multiplier). However, L1 size should be max_bytes_for_level_base.
Solution: Use max_bytes_for_level_base as L1 size. Also, use L1 size as the estimated size of L0.
Closes https://github.com/facebook/rocksdb/pull/2903
Differential Revision: D5885442
Pulled By: maysamyabandeh
fbshipit-source-id: 036da1c9298d173b9b80479cc6661ee4b7a951f6
Summary:
This is caught when I was testing #2886.
Closes https://github.com/facebook/rocksdb/pull/2907
Differential Revision: D5863153
Pulled By: yiwu-arbug
fbshipit-source-id: 8c54759ba1a0dc101f24ab50423e35731300612d
Summary:
By default the seq number in DB is increased once per written key. WritePrepared txns requires the seq to be increased once per the entire batch so that the seq would be used as the prepare timestamp by which the transaction is identified. Also we need to increase seq for the commit marker since it would give a unique id to the commit timestamp of transactions.
Two unit tests are added to verify our understanding of how the seq should be increased. The recovery path requires much more work and is left to another patch.
Closes https://github.com/facebook/rocksdb/pull/2885
Differential Revision: D5837843
Pulled By: maysamyabandeh
fbshipit-source-id: a08960b93d727e1cf438c254d0c2636fb133cc1c
Summary:
Our current implementation of (semi-)copy constructor of DBOptions and ColumnFamilyOptions seems to intend value by value copy, which is what the default copy constructor does anyway. Moreover not using the default constructor has the risk of forgetting to add newly added options.
As an example, allow_2pc seems to be forgotten in the copy constructor which was causing one of the unit tests not seeing its effect.
Closes https://github.com/facebook/rocksdb/pull/2888
Differential Revision: D5846368
Pulled By: maysamyabandeh
fbshipit-source-id: 1ee92a2aeae93886754b7bc039c3411ea2458683
Summary:
Three small optimizations:
(1) iter_->IsKeyPinned() shouldn't be called if read_options.pin_data is not true. This may trigger function call all the way down the iterator tree.
(2) reuse the iterator key object in DBIter::FindNextUserEntryInternal(). The constructor of the class has some overheads.
(3) Move the switching direction logic in MergingIterator::Next() to a separate function.
These three in total improves readseq performance by about 3% in my benchmark setting.
Closes https://github.com/facebook/rocksdb/pull/2880
Differential Revision: D5829252
Pulled By: siying
fbshipit-source-id: 991aea10c6d6c3b43769cb4db168db62954ad1e3
Summary:
Move uncommon code paths in RangeDelAggregator::ShouldDelete() and IterKey::EnlargeBufferIfNeeded() to a separate function, so that the inlined strcuture can be more optimized.
Optimize it because these places show up in CPU profiling, though minimum. The performance is really hard measure. I ran db_bench with readseq benchmark against in-memory DB many times. The variation is big, but it seems to show 1% improvements.
Closes https://github.com/facebook/rocksdb/pull/2877
Differential Revision: D5828123
Pulled By: siying
fbshipit-source-id: 41a49e229f91e9f8409f85cc6f0dc70e31334e4b
Summary:
Added possibility to set save points in transactions and then rollback to them
Closes https://github.com/facebook/rocksdb/pull/2876
Differential Revision: D5825829
Pulled By: yiwu-arbug
fbshipit-source-id: 62168992340bbcddecdaea3baa2a678475d1429d
Summary:
Bug report: https://www.facebook.com/groups/rocksdb.dev/permalink/1389452781153232/
Non-empty `level0_compactions_in_progress_` was aborting `CompactFiles` after incrementing `bg_compaction_scheduled_`, and in that case we never decremented it. This blocked future compactions and prevented DB close as we wait for scheduled compactions to finish/abort during close.
I eliminated `CompactFiles`'s dependency on `level0_compactions_in_progress_`. Since it takes a contiguous span of L0 files -- through the last L0 file if any L1+ files are included -- it's fine to run in parallel with other compactions involving L0. We make the same assumption in intra-L0 compaction.
Closes https://github.com/facebook/rocksdb/pull/2849
Differential Revision: D5780440
Pulled By: ajkr
fbshipit-source-id: 15b15d3faf5a699aed4b82a58352d4a7bb23e027
Summary:
Added some bindings for `OptimisticTransactionDB` in C API
Closes https://github.com/facebook/rocksdb/pull/2823
Differential Revision: D5820672
Pulled By: yiwu-arbug
fbshipit-source-id: 7efd17f619cc0741feddd2050b8fc856f9288350
Summary:
- Switched all instances of SetMinPossibleForUserKey and SetMaxPossibleForUserKey in accordance to InternalKeyComparator's comparison logic
Closes https://github.com/facebook/rocksdb/pull/2868
Differential Revision: D5804152
Pulled By: axxufb
fbshipit-source-id: 80be35e04f2e8abc35cc64abe1fecb03af24e183
Summary:
options.enable_pipelined_write and options.concurrent_prepare are incompatible and should not be set together.
Closes https://github.com/facebook/rocksdb/pull/2875
Differential Revision: D5818358
Pulled By: maysamyabandeh
fbshipit-source-id: dad862508f00817ab302f8b61729accf38315fb8
Summary:
Changes:
* added check for value before merge is called on code path that should check if key exists
Closes https://github.com/facebook/rocksdb/pull/2814
Reviewed By: IslamAbdelRahman
Differential Revision: D5743966
Pulled By: armishra
fbshipit-source-id: 6ac4283bc510c8ca50827d87ef0ba631f2b33b18
Summary:
it sometimes takes more than 10 minutes (i.e., times out) on our internal CI. mainly because bzip is super slow. so I reduced the amount of work it tries to do.
Closes https://github.com/facebook/rocksdb/pull/2856
Differential Revision: D5795883
Pulled By: ajkr
fbshipit-source-id: e69f986ae60b44ecc26b6b024abd0f13bdf3a3c5
Summary:
Merging iterator invokes InternalKeyComparator.Compare() frequently to heap merge. By making InternalKeyComparator final and merging iterator to directly use InternalKeyComparator rather than through Iterator interface, we can give compiler a choice to avoid one more virtual function call if possible. I ran readseq benchmark in memory-only use case to make sure the performance at least doesn't regress.
I have to disable the final key word in debug build, as a hack test class depends on overriding the class.
Closes https://github.com/facebook/rocksdb/pull/2860
Differential Revision: D5800461
Pulled By: siying
fbshipit-source-id: ab876f22a09bb5c560740911412336e0e25ccb53
Summary:
DBIter is referenced in ArenaWrappedDBIter, which is a simple wrapper. If DBIter is final, some virtual function call can be avoided. Some functions can even be inlined, like DBIter.value() to ArenaWrappedDBIter.value() and DBIter.key() to ArenaWrappedDBIter.key(). The performance gain is hard to measure. I just ran the memory-only benchmark for readseq and saw it didn't regress. There shouldn't be any harm doing it. Just give compiler more choices.
Closes https://github.com/facebook/rocksdb/pull/2859
Differential Revision: D5799888
Pulled By: siying
fbshipit-source-id: 829788f91310c40282dcfb7e412e6ef489931143
Summary:
This patch instruments the read path to verify each read value against an optional ReadCallback class. If the value is rejected, the reader moves on to the next value. The WritePreparedTxn makes use of this feature to skip sequence numbers that are not in the read snapshot.
Closes https://github.com/facebook/rocksdb/pull/2850
Differential Revision: D5787375
Pulled By: maysamyabandeh
fbshipit-source-id: 49d808b3062ab35e7ae98ad388f659757794184c
Summary:
Some of these names, like `MEMTABLE_COMPACTION`, did not mean anything. Tried to give them descriptive names.
Closes https://github.com/facebook/rocksdb/pull/2852
Differential Revision: D5782822
Pulled By: ajkr
fbshipit-source-id: f2695c4124af4073da4492d7135bae2411220f3a
Summary:
if we're moving any L0 files down, we need to include older L0 files since they may contain older versions of the keys being moved down.
Closes https://github.com/facebook/rocksdb/pull/2845
Differential Revision: D5773800
Pulled By: ajkr
fbshipit-source-id: 9f0770a8eaaeea4c87df2e7a2a1d65bf9d7f4f7e
Summary:
This branch extends existing property map which keeps values in doubles to keep values in strings so that it can be used to provide wider range of properties. The immediate need for that is to provide IO stall stats in an easy parseable way to MyRocks which is also part of this branch.
Closes https://github.com/facebook/rocksdb/pull/2794
Differential Revision: D5717676
Pulled By: Tema
fbshipit-source-id: e34ba5b79ba774697f7b97ce1138d8fd55471b8a
Summary:
* db/range_del_aggregator.cc (AddTombstone): Avoid a potential
use-after-move bug. The original code would both use and move
`tombstone` in a context where the order of those operations is
not specified. The fix is to perform the use on a new, preceding
statement.
Author: meyering
Closes https://github.com/facebook/rocksdb/pull/2796
Differential Revision: D5721163
Pulled By: ajkr
fbshipit-source-id: a1d328d6a77a17c6425e8069860a202e615e2f48
Summary:
Divide the old snapshots to two lists: a few that fit into a cached array and the rest in a vector, which is expected to be empty in normal cases. The former is to optimize concurrent reads from snapshots without requiring locks. It is done by an array of std::atomic, from which std::memory_order_acquire reads are compiled to simple read instructions in most of the x86_64 architectures.
Closes https://github.com/facebook/rocksdb/pull/2758
Differential Revision: D5660504
Pulled By: maysamyabandeh
fbshipit-source-id: 524fcf9a8e7f90a92324536456912a99aaa6740c
Summary:
Allow user to reduce number of levels in LSM by issue a full CompactRange() and put the result in a lower level, and then reopen DB with reduced options.num_levels. Previous this will fail on reopen on when recovery replaying the previous MANIFEST and found a historical file was on a higher level than the new options.num_levels. The workaround was after CompactRange(), reopen the DB with old num_levels, which will create a new MANIFEST, and then reopen the DB again with new num_levels.
This patch relax the check of levels during recovery. It allows DB to open if there was a historical file on level > options.num_levels, but was also deleted.
Closes https://github.com/facebook/rocksdb/pull/2740
Differential Revision: D5629354
Pulled By: yiwu-arbug
fbshipit-source-id: 545903f6b36b6083e8cbaf777176aef2f488021d
Summary:
It should hold db mutex while accessing max_total_in_memory_state_.
Closes https://github.com/facebook/rocksdb/pull/2784
Differential Revision: D5696536
Pulled By: yiwu-arbug
fbshipit-source-id: 45430634d7fe11909b38e42e5f169f618681c4ee
Summary:
store a zero as the checksum when disabled since it's easier to keep block trailer a fixed length.
Closes https://github.com/facebook/rocksdb/pull/2781
Differential Revision: D5694702
Pulled By: ajkr
fbshipit-source-id: 69cea9da415778ba2b600dfd9d0dfc8cb5188ecd
Summary:
One of the core assumptions of DeleteRange is that files containing portions of the same range tombstone are treated as a single unit from the perspective of compaction picker. Need better tests for this. This PR adds the tests for manual compaction.
Closes https://github.com/facebook/rocksdb/pull/2769
Differential Revision: D5676677
Pulled By: ajkr
fbshipit-source-id: 1b4b3382b300ff7048b872911405fdf900e4fbec
Summary:
Solves #2632
Added OptimisticTransactionDB to the C API.
Added missing merge operations to Transaction.
Added missing get_for_update operation to transaction
If required I will create tests for this another day.
Closes https://github.com/facebook/rocksdb/pull/2633
Differential Revision: D5600906
Pulled By: yiwu-arbug
fbshipit-source-id: da23e4484433d8f59d471f778ff2ae210e3fe4eb
Summary:
add this counter stat to track usage of deletion-dropping optimization. if usage is low, we can delete it to prevent bugs like #2726.
Closes https://github.com/facebook/rocksdb/pull/2761
Differential Revision: D5665421
Pulled By: ajkr
fbshipit-source-id: 881befa2d199838dac88709e7b376a43d304e3d4
Summary:
With this PR, we can measure read-amp for queries where perf_context is enabled as follows:
```
SetPerfLevel(kEnableCount);
Get(1, "foo");
double read_amp = static_cast<double>(get_perf_context()->block_read_byte / get_perf_context()->get_read_bytes);
SetPerfLevel(kDisable);
```
Our internal infra enables perf_context for a sampling of queries. So we'll be able to compute the read-amp for the sample set, which can give us a good estimate of read-amp.
Closes https://github.com/facebook/rocksdb/pull/2749
Differential Revision: D5647240
Pulled By: ajkr
fbshipit-source-id: ad73550b06990cf040cc4528fa885360f308ec12
Summary:
Added a function `MergeOperator::DoesAllowSingleMergeOperand()` to allow invoking a merge operator even with a single merge operand, if overriden.
This is needed for Cassandra-on-RocksDB work. All Cassandra writes are through merges and this will allow a single merge-value to be updated in the merge-operator invoked via a compaction, if needed, due to an expired TTL.
Closes https://github.com/facebook/rocksdb/pull/2721
Differential Revision: D5608706
Pulled By: sagar0
fbshipit-source-id: f299f9f91c4d1ac26e48bd5906e122c1c5e5f3fc
Summary:
the range delete tombstones in memtable should be added to the aggregator even when the memtable's prefix bloom filter tells us the lookup key's not there. This bug could cause data to temporarily reappear until the memtable containing range deletions is flushed.
Reported in #2743.
Closes https://github.com/facebook/rocksdb/pull/2745
Differential Revision: D5639007
Pulled By: ajkr
fbshipit-source-id: 04fc6facb6f978340a3f639536f4ca7c0d73dfc9
Summary:
We forgot to recompute compaction scores after picking a universal compaction like we do in level compaction (a34b2e388e/db/compaction_picker.cc (L691-L695)). This leads to a fairness issue where we waste compactions on CFs/DB instances that don't need it while others can starve.
Previously, ccecf3f4fb fixed the issue for the read-amp-based compaction case; this PR avoids the issue earlier and also for size-ratio-based compactions.
Closes https://github.com/facebook/rocksdb/pull/2688
Differential Revision: D5566191
Pulled By: ajkr
fbshipit-source-id: 010bccb2a107f6a76f3d3022b90aadce5cc48feb
Summary:
Implement the main body of WritePrepared pseudo code. This includes PrepareInternal and CommitInternal, as well as AddCommitted which updates the commit map. It also provides a IsInSnapshot method that could be later called form the read path to decide if a version is in the read snapshot or it should other be skipped.
This patch lacks unit tests and does not attempt to offer an efficient implementation. The idea is that to have the API specified so that we can work on related tasks in parallel.
Closes https://github.com/facebook/rocksdb/pull/2713
Differential Revision: D5640021
Pulled By: maysamyabandeh
fbshipit-source-id: bfa7a05e8d8498811fab714ce4b9c21530514e1c
Summary:
Similar to the bug fixed by https://github.com/facebook/rocksdb/pull/2726, FIFO with compaction and kCompactionStyleNone during user customized CompactFiles() with output level to be 0 can suffer from the same problem. Fix it by leveraging the bottommost_level_ flag.
Closes https://github.com/facebook/rocksdb/pull/2735
Differential Revision: D5626906
Pulled By: siying
fbshipit-source-id: 2b148d0461c61dbd986d74655e384419ae442158
Summary:
`KeyNotExistsBeyondOutputLevel` didn't consider L0 files' key-ranges. So if a key only was covered by older L0 files' key-ranges, we would incorrectly drop deletions of that key. This PR just skips the deletion-dropping optimization when output level is L0.
Closes https://github.com/facebook/rocksdb/pull/2726
Differential Revision: D5617286
Pulled By: ajkr
fbshipit-source-id: 4bff1396b06d49a828ba4542f249191052915bce
Summary:
It would be super helpful to not have to recompile rocksdb to get this performance tweak for mechanical disks.
I have signed the CLA.
Closes https://github.com/facebook/rocksdb/pull/2718
Differential Revision: D5606994
Pulled By: yiwu-arbug
fbshipit-source-id: c05e92bad0d03bd38211af1e1ced0d0d1e02f634
Summary:
Right now, if direct I/O is enabled, prefetching the last 512KB cannot be applied, except compaction inputs or readahead is enabled for iterators. This can create a lot of I/O for HDD cases. To solve the problem, the 512KB is prefetched in block based table if direct I/O is enabled. The prefetched buffer is passed in totegher with random access file reader, so that we try to read from the buffer before reading from the file. This can be extended in the future to support flexible user iterator readahead too.
Closes https://github.com/facebook/rocksdb/pull/2708
Differential Revision: D5593091
Pulled By: siying
fbshipit-source-id: ee36ff6d8af11c312a2622272b21957a7b5c81e7
Summary:
We need a tool to check any sst file corruption in the db.
It will check all the sst files in current version and read all the blocks (data, meta, index) with checksum verification. If any verification fails, the function will return non-OK status.
Closes https://github.com/facebook/rocksdb/pull/2498
Differential Revision: D5324269
Pulled By: lightmark
fbshipit-source-id: 6f8a272008b722402a772acfc804524c9d1a483b
Summary:
We should search wal_dir in Repairer::FindFiles function, and avoid use
LogFileNmae(dbname, number) to get WAL file's name, which will get a wrong
WAL filename. as following:
```
[WARN] [/home/liuchang/Workspace/rocksdb/db/repair.cc:310] Log #3: ignoring conversion error: IO error: While opening a file for sequentially reading: /tmp/rocksdbtest-1000/repair_test/000003.log: No such file or directory
```
I have added a new test case to repair_test.cc, which try to repair db with all WAL options.
Signed-off-by: Chang Liu <liuchang0812@gmail.com>
Closes https://github.com/facebook/rocksdb/pull/2692
Differential Revision: D5575888
Pulled By: ajkr
fbshipit-source-id: 5b93e9f85cddc01663ccecd87631fa723ac466a3
Summary:
This patch splits Commit and Prepare into lock-related logic and db-write-related logic. It moves lock-related logic to PessimisticTransaction to be reused by all children classes and movies the existing impl of db-write-related to PrepareInternal, CommitSingleInternal, and CommitInternal in WriteCommittedTxnImpl.
Closes https://github.com/facebook/rocksdb/pull/2691
Differential Revision: D5569464
Pulled By: maysamyabandeh
fbshipit-source-id: d1b8698e69801a4126c7bc211745d05c636f5325
Summary:
In the condition:
```
if (range_del_agg != nullptr &&
range_del_agg->ShouldDelete(
iter->key(),
RangeDelAggregator::RangePositioningMode::kForwardTraversal) &&
filter != CompactionFilter::Decision::kRemoveAndSkipUntil) {
...
}
```
it could be possible that all the work done in `range_del_agg->ShouldDelete` is wasted due to not having the right `filter` value later on.
Instead, check `filter` value before even calling `range_del_agg->ShouldDelete`, which is a much more involved function.
Closes https://github.com/facebook/rocksdb/pull/2690
Differential Revision: D5568931
Pulled By: sagar0
fbshipit-source-id: 17512d52360425c7ae9de7675383f5d7bc3dad58
Summary:
When we had a single thread pool for compactions, a thread could be busy for a long time (minutes) executing a compaction involving the bottom level. In multi-instance setups, the entire thread pool could be consumed by such bottom-level compactions. Then, top-level compactions (e.g., a few L0 files) would be blocked for a long time ("head-of-line blocking"). Such top-level compactions are critical to prevent compaction stalls as they can quickly reduce number of L0 files / sorted runs.
This diff introduces a bottom-priority queue for universal compactions including the bottom level. This alleviates the head-of-line blocking situation for fast, top-level compactions.
- Added `Env::Priority::BOTTOM` thread pool. This feature is only enabled if user explicitly configures it to have a positive number of threads.
- Changed `ThreadPoolImpl`'s default thread limit from one to zero. This change is invisible to users as we call `IncBackgroundThreadsIfNeeded` on the low-pri/high-pri pools during `DB::Open` with values of at least one. It is necessary, though, for bottom-pri to start with zero threads so the feature is disabled by default.
- Separated `ManualCompaction` into two parts in `PrepickedCompaction`. `PrepickedCompaction` is used for any compaction that's picked outside of its execution thread, either manual or automatic.
- Forward universal compactions involving last level to the bottom pool (worker thread's entry point is `BGWorkBottomCompaction`).
- Track `bg_bottom_compaction_scheduled_` so we can wait for bottom-level compactions to finish. We don't count them against the background jobs limits. So users of this feature will get an extra compaction for free.
Closes https://github.com/facebook/rocksdb/pull/2580
Differential Revision: D5422916
Pulled By: ajkr
fbshipit-source-id: a74bd11f1ea4933df3739b16808bb21fcd512333
Summary:
This patch refactors TransactionImpl by separating the logic for pessimistic concurrency control from the implementation of how to write the data to rocksdb. The existing implementation is named WriteCommittedTxnImpl as it writes committed data to the db. A template named WritePreparedTxnImpl is also added which will be later completed to provide a an alternative implementation.
Closes https://github.com/facebook/rocksdb/pull/2676
Differential Revision: D5549998
Pulled By: maysamyabandeh
fbshipit-source-id: 16298e86b43ca4849324c1f35c731913c6d17bec
Summary:
this value ``` Compaction::is_trivial_move_ ``` uninitialized .
under universal compaction , we enable ``` CompactionOptionsUniversal::allow_trivial_move ``` ,
9b11d4345a/db/compaction.cc (L245)
here is a disastrous bug , some sst trivial move to target level without overlap check ...
THEN , DATABASE DAMAGED , WE GOT A LEVEL WITH OVERLAP !
Closes https://github.com/facebook/rocksdb/pull/2634
Differential Revision: D5530722
Pulled By: siying
fbshipit-source-id: 425ab55bca5967110377d634258360bcf88c200e
Summary:
we were passing `record_read_stats` (a bool) as the `hist_type` argument, which meant we were updating either `rocksdb.db.get.micros` (`hist_type == 0`) or `rocksdb.db.write.micros` (`hist_type == 1`) with wrong data.
Closes https://github.com/facebook/rocksdb/pull/2666
Differential Revision: D5520384
Pulled By: ajkr
fbshipit-source-id: 2f7c956aec32f8b58c5c18845ac478e0230c9516
Summary:
Replace dynamic_cast<> so that users can choose to build with RTTI off, so that they can save several bytes per object, and get tiny more memory available.
Some nontrivial changes:
1. Add Comparator::GetRootComparator() to get around the internal comparator hack
2. Add the two experiemental functions to DB
3. Add TableFactory::GetOptionString() to avoid unnecessary casting to get the option string
4. Since 3 is done, move the parsing option functions for table factory to table factory files too, to be symmetric.
Closes https://github.com/facebook/rocksdb/pull/2645
Differential Revision: D5502723
Pulled By: siying
fbshipit-source-id: fd13cec5601cf68a554d87bfcf056f2ffa5fbf7c
Summary:
- FIFOCompactionWithTTLTest was flaky when run in parallel earlier, and hence it was disabled. Fixed it now.
- Also, faking sleep now instead of really sleeping to make tests more realistic by using TTLs like 1 hour and 1 day.
Closes https://github.com/facebook/rocksdb/pull/2650
Differential Revision: D5506038
Pulled By: sagar0
fbshipit-source-id: deb429a527f045e3e2c5138b547c3e8ac8586aa2
Summary:
Breaking commit: d12691b86f
In the above commit, I moved the `TableCache` cleanup logic from `Version` destructor into `PurgeObsoleteFiles`. I missed cleaning up `TableCache` entries for the current `Version` during DB destruction.
This PR adds that logic to `VersionSet` destructor. One unfortunate side effect is now we're potentially deleting `TableReader`s after `column_family_set_.reset()`, which means we can't call `BlockBasedTableReader::Close` a second time as the block cache might already be destroyed.
Closes https://github.com/facebook/rocksdb/pull/2662
Differential Revision: D5515108
Pulled By: ajkr
fbshipit-source-id: 2cb820e19aa813e0d258d17f76b2d7b6b7ee0b18
Summary:
platform_dependent tests in Travis now builds all tests, which is not needed. Only build those tests we need to run.
Closes https://github.com/facebook/rocksdb/pull/2647
Differential Revision: D5513954
Pulled By: siying
fbshipit-source-id: 4d540b146124e70dd25586c47939d19f93655b0a
Summary:
Post-compaction work holds onto db mutex for the longest time (found by tracing lock acquires/releases with LTTng and correlating timestamps with our info log). Further experimentation showed `TableCache::EraseHandle` is responsible for ~86% of time mutex is held. We can just release the handle outside the db mutex.
Closes https://github.com/facebook/rocksdb/pull/2654
Differential Revision: D5507126
Pulled By: ajkr
fbshipit-source-id: 703c01ddf2aea16bc0f9e33c08935d78aa6b781d
Summary:
A FIFO compaction picker test is accidentally testing against an instance of level compaction picker.
Closes https://github.com/facebook/rocksdb/pull/2641
Differential Revision: D5495390
Pulled By: sagar0
fbshipit-source-id: 301962736f629b1c499570fb504cdbe66bacb46f
Summary:
We initially had disabled support for write_options.sync when concurrent_prepare_ is set. We later added this support but the statement that asserts this combination is not used was left there. This patch cleans it up.
Closes https://github.com/facebook/rocksdb/pull/2642
Differential Revision: D5496101
Pulled By: maysamyabandeh
fbshipit-source-id: becbc503446f2a51bee24cc861958c090c724ec2
Summary:
The test is failing occasionally on the assert: `ASSERT_TRUE(writer->state == WriteThread::State::STATE_INIT)`. This is because the test don't make the leader wait for long enough before updating state for its followers. The patch move the update to `threads_waiting` to the end of `WriteThread::JoinBatchGroup:Wait` callback to avoid this happening.
Also adding `WriteThread::JoinBatchGroup:Start` and have each thread wait there while another thread is linking to the linked-list. This is to make the check of `is_leader` more deterministic.
Also changing two while-loops of `compare_exchange_strong` to plain `fetch_add`, to make it look cleaner.
Closes https://github.com/facebook/rocksdb/pull/2640
Differential Revision: D5491525
Pulled By: yiwu-arbug
fbshipit-source-id: 6e897f122082bd6f98e6d51b31a25e5fd0a3fb82
Summary:
We will divide by zero if `stats.micros` is zero, just add a simple check
This happens sometimes during running tests and UBSAN complains
Closes https://github.com/facebook/rocksdb/pull/2631
Differential Revision: D5481455
Pulled By: IslamAbdelRahman
fbshipit-source-id: 69aa24e64e21de15d9e2b8009adf01675fcc6598
Summary:
I haven't looked to see if a class variable inside a loop like this is always initialised.
Closes https://github.com/facebook/rocksdb/pull/2602
Differential Revision: D5475937
Pulled By: IslamAbdelRahman
fbshipit-source-id: 8570b308f9a4b49e2a56ccc9e9b84d7c46568c15
Summary:
Add and implement Iterator::Refresh(). When this function is called, if the super version doesn't change, update the sequence number of the iterator to the latest one and invalidate the iterator. If the super version changed, recreated the whole iterator. This can help users reuse the iterator more easily.
Closes https://github.com/facebook/rocksdb/pull/2621
Differential Revision: D5464500
Pulled By: siying
fbshipit-source-id: f548bd35e85c1efca2ea69273802f6704eba6ba9
Summary:
The previous implementation of caching `file_size` index made no sense. It only remembered the original span of locked files starting from beginning of `file_size`. We should remember the index after all compactions that have been considered but rejected. This will reduce the work we do while holding the db mutex.
Closes https://github.com/facebook/rocksdb/pull/2624
Differential Revision: D5468152
Pulled By: ajkr
fbshipit-source-id: ab92a4bffe76f9f174d861bb5812b974d1013400
Summary:
This reverts the previous commit 1d7048c598, which broke the build.
Did a `git revert 1d7048c`.
Closes https://github.com/facebook/rocksdb/pull/2627
Differential Revision: D5476473
Pulled By: sagar0
fbshipit-source-id: 4756ff5c0dfc88c17eceb00e02c36176de728d06
Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually.
Reviewed By: igorsugak
Differential Revision: D5454343
fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2
Summary:
This diff addresses two problems. Both problems cause us to miss scheduling desirable compactions. One side effect is compaction picking can spam logs, as there's no delay after failed attempts to pick compactions.
1. If a compaction pulled in a locked input-level file due to user-key overlap, we would not consider picking another file from the same input level.
2. If a compaction pulled in a locked output-level file due to user-key overlap, we would not consider picking any other compaction on any level.
The code changes are dependent, which is why I solved both problems in a single diff.
- Moved input-level `ExpandInputsToCleanCut` into the loop inside `PickFileToCompact`. This gives two benefits: (1) if it fails, we will try the next-largest file on the same input level; (2) we get the fully-expanded input-level key-range with which we can check for pending compactions in output level.
- Added another call to `ExpandInputsToCleanCut` inside `PickFileToCompact`'s to check for compaction conflicts in output level.
- Deleted call to `IsRangeInCompaction` in `PickFileToCompact`, as `ExpandInputsToCleanCut` also correctly handles the case where original output-level files (i.e., ones not pulled in due to user-key overlap) are pending compaction.
Closes https://github.com/facebook/rocksdb/pull/2615
Differential Revision: D5454643
Pulled By: ajkr
fbshipit-source-id: ea3fb5477d83e97148951af3fd4558d2039e9872
Summary:
I decided not even to keep it as an INFO-level log as it is too normal for compactions to be skipped due to locked input files. Removing logging here makes us consistent with how we treat locked files that weren't pulled in due to overlap.
We may want some error handling on line 422, which should never happen when called by `LevelCompactionBuilder::PickCompaction`, as `SetupInitialFiles` skips compactions where overlap causes the output level to pull in locked files.
Closes https://github.com/facebook/rocksdb/pull/2617
Differential Revision: D5458502
Pulled By: ajkr
fbshipit-source-id: c2e5f867c0a77c1812ce4242ab3e085b3eee0bae
Summary:
This patch enables using PinnableSlice for RowCache, changes include
not releasing the cache handle immediately after lookup in TableCache::Get, instead pass a Cleanble function which does Cache::RleaseHandle.
Closes https://github.com/facebook/rocksdb/pull/2492
Differential Revision: D5316216
Pulled By: maysamyabandeh
fbshipit-source-id: d2a684bd7e4ba73772f762e58a82b5f4fbd5d362
Summary:
Fix column_family_test with LITE build. I need this patch to fix 5.6 branch.
Closes https://github.com/facebook/rocksdb/pull/2597
Differential Revision: D5437171
Pulled By: yiwu-arbug
fbshipit-source-id: 88b9dc5925a6b47af10c1b41bc5b07c4251a84b5
Summary:
this modify allows third-party tables able to support delete range
Closes https://github.com/facebook/rocksdb/pull/2035
Differential Revision: D5407973
Pulled By: ajkr
fbshipit-source-id: 82e364b7dd5a198660788d59543f15b8f95cc418
Summary:
FIFOCompactionWithTTLTests are flaky when run in parallel, as there is a time element involved to it. Temporarily disabling them while I investigate a more robust testing solution like, say, mocking time.
Closes https://github.com/facebook/rocksdb/pull/2548
Differential Revision: D5386084
Pulled By: sagar0
fbshipit-source-id: 262886b25bdf091021d8553e780443a985e9bac4
Summary:
Valgrind had false positive complaints about the initialization pattern for `GetCurrentTime()`'s argument in #2480. We can instead have the client initialize the time variable before calling `GetCurrentTime()`, and have `GetCurrentTime()` promise to only overwrite it in success case.
Closes https://github.com/facebook/rocksdb/pull/2526
Differential Revision: D5358689
Pulled By: ajkr
fbshipit-source-id: 857b189f24c19196f6bb299216f3e23e7bc4be42
Summary:
Adding/Correcting inline comments and clarify the sync rules. To make it simple to reason, the rules are a big general which ended up to some extra synchronizations. However such synchronizations are not on the fast path, and they are worth the simplicity.
Closes https://github.com/facebook/rocksdb/pull/2517
Differential Revision: D5348239
Pulled By: maysamyabandeh
fbshipit-source-id: ff2e59fb1e568c122d2cdbf598310f3613b7d212
Summary:
Issue: #2478Fix: #2503
The bug happened when all of these conditions were satisfied:
- A subcompaction generates no keys
- `RangeDelAggregator::ShouldAddTombstones()` returns true because there's at least one non-obsoleted range deletion in its map
- None of the non-obsolete tombstones overlap with the subcompaction key-range
Under those conditions, we were creating a dedicated file for range deletions which was left empty, thus causing an error in VersionEdit.
I verified this test case fails before the #2503 fix and passes after.
Closes https://github.com/facebook/rocksdb/pull/2521
Differential Revision: D5352568
Pulled By: ajkr
fbshipit-source-id: f619cae39984ce9bb9b7a4e7a9ac0f2bb2ce43e9
Summary:
AddDBStats is in two steps of load and store, which is more efficient than fetch_add. This is however not thread-safe. Currently we have to protect concurrent access to AddDBStats with a mutex which is less efficient that fetch_add.
This patch adds the option to do fetch_add when AddDBStats. The results for my 2pc benchmark on sysbench is:
- vanilla: 68618 tps
- removing mutex on AddDBStats (unsafe): 69767 tps
- fetch_add for all AddDBStats: 69200 tps
- fetch_add only for concurrently access AddDBStats (this patch): 69579 tps
Closes https://github.com/facebook/rocksdb/pull/2505
Differential Revision: D5330656
Pulled By: maysamyabandeh
fbshipit-source-id: af64d7bee135b0e86b4fac323a4f9d9113eaa383
Summary:
We've got some DBs where iterators return Status with message "Corruption: block checksum mismatch" all the time. That's not very informative. It would be much easier to investigate if the error message contained the file name - then we would know e.g. how old the corrupted file is, which would be very useful for finding the root cause. This PR adds file name, offset and other stuff to some block corruption-related status messages.
It doesn't improve all the error messages, just a few that were easy to improve. I'm mostly interested in "block checksum mismatch" and "Bad table magic number" since they're the only corruption errors that I've ever seen in the wild.
Closes https://github.com/facebook/rocksdb/pull/2507
Differential Revision: D5345702
Pulled By: al13n321
fbshipit-source-id: fc8023d43f1935ad927cef1b9c55481ab3cb1339
Summary:
"make analyze" is reporting some errors. It's complicated to look but it seems to me that they are all false positive. Anyway, I think cleaning them up is a good idea. Some of the changes are hacky but I don't know a better way.
Closes https://github.com/facebook/rocksdb/pull/2508
Differential Revision: D5341710
Pulled By: siying
fbshipit-source-id: 6070e430e0e41a080ef441e05e8ec827d45efab6
Summary:
This is to resolve the asan complains. In the meanwhile I am working on clarifying/revisiting the sync rules.
Closes https://github.com/facebook/rocksdb/pull/2510
Differential Revision: D5338660
Pulled By: yiwu-arbug
fbshipit-source-id: ce6f6e0826d43a2c0bfa4328a00c78f73cd6498a
Summary:
Introducing FIFO compactions with TTL.
FIFO compaction is based on size only which makes it tricky to enable in production as use cases can have organic growth. A user requested an option to drop files based on the time of their creation instead of the total size.
To address that request:
- Added a new TTL option to FIFO compaction options.
- Updated FIFO compaction score to take TTL into consideration.
- Added a new table property, creation_time, to keep track of when the SST file is created.
- Creation_time is set as below:
- On Flush: Set to the time of flush.
- On Compaction: Set to the max creation_time of all the files involved in the compaction.
- On Repair and Recovery: Set to the time of repair/recovery.
- Old files created prior to this code change will have a creation_time of 0.
- FIFO compaction with TTL is enabled when ttl > 0. All files older than ttl will be deleted during compaction. i.e. `if (file.creation_time < (current_time - ttl)) then delete(file)`. This will enable cases where you might want to delete all files older than, say, 1 day.
- FIFO compaction will fall back to the prior way of deleting files based on size if:
- the creation_time of all files involved in compaction is 0.
- the total size (of all SST files combined) does not drop below `compaction_options_fifo.max_table_files_size` even if the files older than ttl are deleted.
This feature is not supported if max_open_files != -1 or with table formats other than Block-based.
**Test Plan:**
Added tests.
**Benchmark results:**
Base: FIFO with max size: 100MB ::
```
svemuri@dev15905 ~/rocksdb (fifo-compaction) $ TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=readwhilewriting --num=5000000 --threads=16 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=100
readwhilewriting : 1.924 micros/op 519858 ops/sec; 13.6 MB/s (1176277 of 5000000 found)
```
With TTL (a low one for testing) ::
```
svemuri@dev15905 ~/rocksdb (fifo-compaction) $ TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=readwhilewriting --num=5000000 --threads=16 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=100 --fifo_compaction_ttl=20
readwhilewriting : 1.902 micros/op 525817 ops/sec; 13.7 MB/s (1185057 of 5000000 found)
```
Example Log lines:
```
2017/06/26-15:17:24.609249 7fd5a45ff700 (Original Log Time 2017/06/26-15:17:24.609177) [db/compaction_picker.cc:1471] [default] FIFO compaction: picking file 40 with creation time 1498515423 for deletion
2017/06/26-15:17:24.609255 7fd5a45ff700 (Original Log Time 2017/06/26-15:17:24.609234) [db/db_impl_compaction_flush.cc:1541] [default] Deleted 1 files
...
2017/06/26-15:17:25.553185 7fd5a61a5800 [DEBUG] [db/db_impl_files.cc:309] [JOB 0] Delete /dev/shm/dbbench/000040.sst type=2 #40 -- OK
2017/06/26-15:17:25.553205 7fd5a61a5800 EVENT_LOG_v1 {"time_micros": 1498515445553199, "job": 0, "event": "table_file_deletion", "file_number": 40}
```
SST Files remaining in the dbbench dir, after db_bench execution completed:
```
svemuri@dev15905 ~/rocksdb (fifo-compaction) $ ls -l /dev/shm//dbbench/*.sst
-rw-r--r--. 1 svemuri users 30749887 Jun 26 15:17 /dev/shm//dbbench/000042.sst
-rw-r--r--. 1 svemuri users 30768779 Jun 26 15:17 /dev/shm//dbbench/000044.sst
-rw-r--r--. 1 svemuri users 30757481 Jun 26 15:17 /dev/shm//dbbench/000046.sst
```
Closes https://github.com/facebook/rocksdb/pull/2480
Differential Revision: D5305116
Pulled By: sagar0
fbshipit-source-id: 3e5cfcf5dd07ed2211b5b37492eb235b45139174
Summary:
This PR adds support for encrypting data stored by RocksDB when written to disk.
It adds an `EncryptedEnv` override of the `Env` class with matching overrides for sequential&random access files.
The encryption itself is done through a configurable `EncryptionProvider`. This class creates is asked to create `BlockAccessCipherStream` for a file. This is where the actual encryption/decryption is being done.
Currently there is a Counter mode implementation of `BlockAccessCipherStream` with a `ROT13` block cipher (NOTE the `ROT13` is for demo purposes only!!).
The Counter operation mode uses an initial counter & random initialization vector (IV).
Both are created randomly for each file and stored in a 4K (default size) block that is prefixed to that file. The `EncryptedEnv` implementation is such that clients of the `Env` class do not see this prefix (nor data, nor in filesize).
The largest part of the prefix block is also encrypted, and there is room left for implementation specific settings/values/keys in there.
To test the encryption, the `DBTestBase` class has been extended to consider a new environment variable called `ENCRYPTED_ENV`. If set, the test will setup a encrypted instance of the `Env` class to use for all tests.
Typically you would run it like this:
```
ENCRYPTED_ENV=1 make check_some
```
There is also an added test that checks that some data inserted into the database is or is not "visible" on disk. With `ENCRYPTED_ENV` active it must not find plain text strings, with `ENCRYPTED_ENV` unset, it must find the plain text strings.
Closes https://github.com/facebook/rocksdb/pull/2424
Differential Revision: D5322178
Pulled By: sdwilsh
fbshipit-source-id: 253b0a9c2c498cc98f580df7f2623cbf7678a27f
Summary:
With a regression bug was introduced two years ago, by 6e9fbeb27c , we fail to check return status of fsync call. This can cause we miss the information from the file system and can potentially cause corrupted data which we could have been detected.
Closes https://github.com/facebook/rocksdb/pull/2495
Reviewed By: ajkr
Differential Revision: D5321949
Pulled By: siying
fbshipit-source-id: c68117914bb40700198fc37d0e4c63163a8a1031
Summary:
Throughput: 46k tps in our sysbench settings (filling the details later)
The idea is to have the simplest change that gives us a reasonable boost
in 2PC throughput.
Major design changes:
1. The WAL file internal buffer is not flushed after each write. Instead
it is flushed before critical operations (WAL copy via fs) or when
FlushWAL is called by MySQL. Flushing the WAL buffer is also protected
via mutex_.
2. Use two sequence numbers: last seq, and last seq for write. Last seq
is the last visible sequence number for reads. Last seq for write is the
next sequence number that should be used to write to WAL/memtable. This
allows to have a memtable write be in parallel to WAL writes.
3. BatchGroup is not used for writes. This means that we can have
parallel writers which changes a major assumption in the code base. To
accommodate for that i) allow only 1 WriteImpl that intends to write to
memtable via mem_mutex_--which is fine since in 2PC almost all of the memtable writes
come via group commit phase which is serial anyway, ii) make all the
parts in the code base that assumed to be the only writer (via
EnterUnbatched) to also acquire mem_mutex_, iii) stat updates are
protected via a stat_mutex_.
Note: the first commit has the approach figured out but is not clean.
Submitting the PR anyway to get the early feedback on the approach. If
we are ok with the approach I will go ahead with this updates:
0) Rebase with Yi's pipelining changes
1) Currently batching is disabled by default to make sure that it will be
consistent with all unit tests. Will make this optional via a config.
2) A couple of unit tests are disabled. They need to be updated with the
serial commit of 2PC taken into account.
3) Replacing BatchGroup with mem_mutex_ got a bit ugly as it requires
releasing mutex_ beforehand (the same way EnterUnbatched does). This
needs to be cleaned up.
Closes https://github.com/facebook/rocksdb/pull/2345
Differential Revision: D5210732
Pulled By: maysamyabandeh
fbshipit-source-id: 78653bd95a35cd1e831e555e0e57bdfd695355a4
Summary:
Some users want to prevent rocksdb from entering read-only mode in certain error cases. This diff gives them a callback, `OnBackgroundError`, that they can use to achieve it.
- call `OnBackgroundError` every time we consider setting `bg_error_`. Use its result to assign `bg_error_` but not to change the function's return status.
- classified calls using `BackgroundErrorReason` to give the callback some info about where the error happened
- renamed `ParanoidCheck` to something more specific so we can provide a clear `BackgroundErrorReason`
- unit tests for the most common cases: flush or compaction errors
Closes https://github.com/facebook/rocksdb/pull/2477
Differential Revision: D5300190
Pulled By: ajkr
fbshipit-source-id: a0ea4564249719b83428e3f4c6ca2c49e366e9b3
Summary:
CreateColumnFamily() releases DB mutex after adding column family to the set and install super version (to write option file), so if users call GetAggregatedIntProperty() in the middle, then super version will be null and the process will crash. Fix it by skipping those column families without super version installed.
Maybe we should also fix the problem of releasing the lock when reading option file, but it is more risky. so I'm doing a quick and safer fix and we can investigate it later.
Closes https://github.com/facebook/rocksdb/pull/2475
Differential Revision: D5298053
Pulled By: siying
fbshipit-source-id: 4b3c8f91c60400b163fcc6cda8a0c77723be0ef6
Summary:
DBImpl's instance variables should only be accessed with mutex held. I moved an assert later to uphold this rule.
DBTest.LastWriteBufferDelay test was sporadically failing TSAN because it tried to flush around the same time the db was destroyed, so the variable was accessed simultaneously by two threads.
Closes https://github.com/facebook/rocksdb/pull/2471
Differential Revision: D5286857
Pulled By: ajkr
fbshipit-source-id: 435abd84efa601f667c254e320b0bb5a434b971f
Summary:
Make default impl return NoSupported so the db_blob
tests exist in a meaningful manner.
Replace std::thread to port::Thread
Closes https://github.com/facebook/rocksdb/pull/2465
Differential Revision: D5275563
Pulled By: yiwu-arbug
fbshipit-source-id: cedf1a18a2c05e20d768c1308b3f3224dbd70ab6
Summary:
Allow users to rate limit background work based on read bytes, written bytes, or sum of read and written bytes. Support these by changing the RateLimiter API, so no additional options were needed.
Closes https://github.com/facebook/rocksdb/pull/2433
Differential Revision: D5216946
Pulled By: ajkr
fbshipit-source-id: aec57a8357dbb4bfde2003261094d786d94f724e
Summary:
We estimate number of reads per SST files, by updating the counter per file in sampled read requests. This information can later be used to trigger compactions to improve read performacne.
Closes https://github.com/facebook/rocksdb/pull/2417
Differential Revision: D5193528
Pulled By: siying
fbshipit-source-id: b4241c5ad0eaf444b61afb53f8e6290d9f5da2df
Summary:
UBSAN crashes when it run the test. Disabling it for UBSAN.
Closes https://github.com/facebook/rocksdb/pull/2427
Differential Revision: D5210897
Pulled By: yiwu-arbug
fbshipit-source-id: 2f5a876807c98d8db79ab9581965f7e6b29d4163
Summary:
If ReadOptions.low_pri=true and compaction is behind, the write will either return immediate or be slowed down based on ReadOptions.no_slowdown.
Closes https://github.com/facebook/rocksdb/pull/2369
Differential Revision: D5127619
Pulled By: siying
fbshipit-source-id: d30e1cff515890af0eff32dfb869d2e4c9545eb0
Summary:
… headers
https://github.com/facebook/rocksdb/pull/2199 should not reference RocksDB-specific macros (like ROCKSDB_SUPPORT_THREAD_LOCAL in this case) to public headers, `iostats_context.h` and `perf_context.h`. We shouldn't do that because users have to provide these compiler flags when building their binary with RocksDB.
We should hide the thread local global variable inside our implementation and just expose a function api to retrieve these variables. It may break some users for now but good for long term.
make check -j64
Closes https://github.com/facebook/rocksdb/pull/2380
Differential Revision: D5177896
Pulled By: lightmark
fbshipit-source-id: 6fcdfac57f2e2dcfe60992b7385c5403f6dcb390
Summary:
Fixes the following scenario:
1. Set prefix extractor. Enable bloom filters, with `whole_key_filtering = false`. Use compaction filter that sometimes returns `kRemoveAndSkipUntil`.
2. Do a compaction.
3. Compaction creates an iterator with `total_order_seek = false`, calls `SeekToFirst()` on it, then repeatedly calls `Next()`.
4. At some point compaction filter returns `kRemoveAndSkipUntil`.
5. Compaction calls `Seek(skip_until)` on the iterator. The key that it seeks to happens to have prefix that doesn't match the bloom filter. Since `total_order_seek = false`, iterator becomes invalid, and compaction thinks that it has reached the end. The rest of the compaction input is silently discarded.
The fix is to make compaction iterator use `total_order_seek = true`.
The implementation for PlainTable is quite awkward. I've made `kRemoveAndSkipUntil` officially incompatible with PlainTable. If you try to use them together, compaction will fail, and DB will enter read-only mode (`bg_error_`). That's not a very graceful way to communicate a misconfiguration, but the alternatives don't seem worth the implementation time and complexity. To be able to check in advance that `kRemoveAndSkipUntil` is not going to be used with PlainTable, we'd need to extend the interface of either `CompactionFilter` or `InternalIterator`. It seems unlikely that anyone will ever want to use `kRemoveAndSkipUntil` with PlainTable: PlainTable probably has very few users, and `kRemoveAndSkipUntil` has only one user so far: us (logdevice).
Closes https://github.com/facebook/rocksdb/pull/2349
Differential Revision: D5110388
Pulled By: lightmark
fbshipit-source-id: ec29101a99d9dcd97db33923b87f72bce56cc17a
Summary:
Improve write buffer manager in several ways:
1. Size is tracked when arena block is allocated, rather than every allocation, so that it can better track actual memory usage and the tracking overhead is slightly lower.
2. We start to trigger memtable flush when 7/8 of the memory cap hits, instead of 100%, and make 100% much harder to hit.
3. Allow a cache object to be passed into buffer manager and the size allocated by memtable can be costed there. This can help users have one single memory cap across block cache and memtable.
Closes https://github.com/facebook/rocksdb/pull/2350
Differential Revision: D5110648
Pulled By: siying
fbshipit-source-id: b4238113094bf22574001e446b5d88523ba00017
Summary:
Some users want to monitor column family activity in their custom memtable implementations. Previously there was no way to figure out with which column family a memtable is associated. This diff:
- adds an overload to MemTableRepFactory::CreateMemTableRep() that provides the CF ID. For compatibility, its default implementation calls the old overload.
- updates MemTable to create MemTableRep's using the new overload.
Closes https://github.com/facebook/rocksdb/pull/2346
Differential Revision: D5108061
Pulled By: ajkr
fbshipit-source-id: 3a1921214a348dd8ea0f54e1cab3b71c3d46d616
Summary:
rocksdb::Random is not thread-safe. Have one Random for each thread instead.
Closes https://github.com/facebook/rocksdb/pull/2400
Differential Revision: D5173919
Pulled By: yiwu-arbug
fbshipit-source-id: 1a99c7b877f3893eb22355af49e321bcad4e53e6
Summary:
The range deletion meta-block iterators weren't getting cleaned up properly since they don't support arena allocation. I didn't implement arena support since, in the general case, each iterator is used only once and separately from all other iterators, so there should be no benefit to data locality.
Anyways, this diff fixes up #2370 by treating range deletion iterators as non-arena-allocated.
Closes https://github.com/facebook/rocksdb/pull/2399
Differential Revision: D5171119
Pulled By: ajkr
fbshipit-source-id: bef6f5c4c5905a124f4993945aed4bd86e2807d8
Summary:
also changed the `>` in the comparison against `level0_file_num_compaction_trigger` into a `>=` since exactly `level0_file_num_compaction_trigger` can trigger a compaction from L0.
Closes https://github.com/facebook/rocksdb/pull/2179
Differential Revision: D4915772
Pulled By: ajkr
fbshipit-source-id: e38fec6253de6f9a40e61734615c6670d84038aa
Summary:
This fixes a compilation failure on Linux when the system libc is not
glibc. jemalloc's configure script incorrectly assumes that glibc is
always used on Linux systems, producing glibc-style signatures; when
the system libc is e.g. musl, the following error is observed:
```
[ 0%] Building CXX object CMakeFiles/rocksdb.dir/db/db_impl.cc.o
In file included from /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb.src/table/block.h:19:0,
from /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb.src/db/db_impl.cc:77:
/x-tools/x86_64-unknown-linux-musl/x86_64-unknown-linux-musl/sysroot/usr/include/malloc.h:19:8: error: declaration of 'size_t malloc_usable_size(void*)' has a different exception specifier
size_t malloc_usable_size(void *);
^~~~~~~~~~~~~~~~~~
In file included from /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb.src/db/db_impl.cc:20:0:
/go/native/x86_64-unknown-linux-musl/jemalloc/include/jemalloc/jemalloc.h:78:33: note: from previous declaration 'size_t malloc_usable_size(void*) throw ()'
# define je_malloc_usable_size malloc_usable_size
^
/go/native/x86_64-unknown-linux-musl/jemalloc/include/jemalloc/jemalloc.h:239:41: note: in expansion of macro 'je_malloc_usable_size'
JEMALLOC_EXPORT size_t JEMALLOC_NOTHROW je_malloc_usable_size(
^~~~~~~~~~~~~~~~~~~~~
CMakeFiles/rocksdb.dir/build.make:350: recipe for target 'CMakeFiles/rocksdb.dir/db/db_impl.cc.o' failed
```
This works around the issue by rearranging the sources such that
jemalloc's headers are never in the same scope as the system's malloc
header. The jemalloc issue has been reported as well, see:
https://github.com/jemalloc/jemalloc/issues/778.
cc tschottdorf
Closes https://github.com/facebook/rocksdb/pull/2188
Differential Revision: D5163048
Pulled By: siying
fbshipit-source-id: c553125458892def175c1be5682b0330d80b2a0d
Summary:
Previously we returned NotSupported when ingesting files into a database containing any range deletions. This diff adds the support.
- Flush if any memtable contains range deletions overlapping the to-be-ingested file
- Place to-be-ingested file before any level that contains range deletions overlapping it.
- Added support for `Version` to return iterators over range deletions in a given level. Previously, we piggybacked getting range deletions onto `Version`'s `Get()` / `AddIterator()` functions by passing them a `RangeDelAggregator*`. But file ingestion needs to get iterators over range deletions, not populate an aggregator (since the aggregator does collapsing and doesn't expose the actual ranges).
Closes https://github.com/facebook/rocksdb/pull/2370
Differential Revision: D5127648
Pulled By: ajkr
fbshipit-source-id: 816faeb9708adfa5287962bafdde717db56e3f1a
Summary:
Blob db rely on base db returning sequence number through write batch after DB::Write(). However after recent changes to the write path, DB::Writ()e no longer return sequence number in some cases. Fixing it by have WriteBatchInternal::InsertInto() always encode sequence number into write batch.
Stacking on #2375.
Closes https://github.com/facebook/rocksdb/pull/2385
Differential Revision: D5148358
Pulled By: yiwu-arbug
fbshipit-source-id: 8bda0aa07b9334ed03ed381548b39d167dc20c33
Summary:
Add a histogram in statistics to help users understand how many merge operands they merge.
Closes https://github.com/facebook/rocksdb/pull/2373
Differential Revision: D5139983
Pulled By: siying
fbshipit-source-id: 61b9ba8ca83f358530a4833d68f0103b56a0e182
Summary:
Previously sst_file_writer only supports kTypeValue, we need kTypeMerge and kTypeDeletion also as user requested.
Closes https://github.com/facebook/rocksdb/pull/2361
Differential Revision: D5139402
Pulled By: lightmark
fbshipit-source-id: 092a60756d01692539d817a3765ebfd58a8d7f88
Summary:
This log message shouldn't be a warning; some services are seeing high warning count due to this.
The count for the below line is a few hundreds of millions, as per Logview:
```
[rocksdb/src/db/column_family.cc:729] [checkpoints] Increasing compaction threads because we have 2 level-0 files
```
Closes https://github.com/facebook/rocksdb/pull/2364
Differential Revision: D5123565
Pulled By: sagar0
fbshipit-source-id: a07ce499a4f82f0ebde9cda9f4948fb9df6a734c
Summary:
stop calling Close() at the end of tests holding a compaction pressure token since it causes the write controller to be deleted while it's still needed. these calls were pointless anyways since Close() is already called in the test's destructor.
Closes https://github.com/facebook/rocksdb/pull/2367
Differential Revision: D5125906
Pulled By: ajkr
fbshipit-source-id: 6cad8673e5546a82ff602ac0ba59cc3f68dbde46
Summary:
- `max_background_flushes` and `max_background_compactions` are still supported for backwards compatibility
- `base_background_compactions` is completely deprecated. Now we just throttle to one background compaction when there's no pressure.
- `max_background_jobs` is added to automatically partition the concurrent background jobs into flushes vs compactions. Currently it's very simple as we just allocate one-fourth of the jobs to flushes, and the remaining can be used for compactions.
- The test cases that set `base_background_compactions > 1` needed to be updated. I just grab the pressure token such that the desired number of compactions can be scheduled.
Closes https://github.com/facebook/rocksdb/pull/2205
Differential Revision: D4937461
Pulled By: ajkr
fbshipit-source-id: df52cbbd497e13bbc9a60560a5ac2a2526b3f1f9
Summary:
It's hard for RocksDB to come up with a good default of delayed write rate. Use rate given by rate limiter if it is availalbe. This provides the I/O order of magnitude.
Closes https://github.com/facebook/rocksdb/pull/2357
Differential Revision: D5115324
Pulled By: siying
fbshipit-source-id: 341065ad2211c981fc804011c0f0e59a50c7e754
Summary:
`cf_stats_snapshot_.seconds_up` appears to be never updated, unlike `db_stats_snapshot_.seconds_up`, which is updated here: https://github.com/facebook/rocksdb/blob/master/db/internal_stats.cc#L883
This leads to wrong information in the log, for example:
```
** Compaction Stats [default] **
....
Uptime(secs): 85591.2 total, 85591.2 interval
```
Even though DB's interval is correctly logged as 60 seconds:
```
** DB Stats **
Uptime(secs): 85591.2 total, 637.8 interval
```
Closes https://github.com/facebook/rocksdb/pull/2338
Differential Revision: D5114131
Pulled By: sagar0
fbshipit-source-id: 85243a38213236ccbb601a7f7aaa8865eaa8083c
Summary:
Fix build error in db_iter.cc when running clang-analyzer.
```
CC db/db_iter.o
db/db_iter.cc:938:21: error: no matching constructor for initialization of 'rocksdb::ParsedInternalKey'
ParsedInternalKey ikey(Slice(), 0, 0);
^ ~~~~~~~~~~~~~
./db/dbformat.h:84:3: note: candidate constructor not viable: no known conversion from 'int' to 'rocksdb::ValueType' for 3rd argument
ParsedInternalKey(const Slice& u, const SequenceNumber& seq, ValueType t)
^
./db/dbformat.h:78:8: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 3 were provided
struct ParsedInternalKey {
^
./db/dbformat.h:78:8: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 3 were provided
./db/dbformat.h:83:3: note: candidate constructor not viable: requires 0 arguments, but 3 were provided
ParsedInternalKey() { } // Intentionally left uninitialized (for speed)
^
1 error generated.
```
Closes https://github.com/facebook/rocksdb/pull/2354
Differential Revision: D5115751
Pulled By: sagar0
fbshipit-source-id: b0e386d4e935e4725b07761c3ca5f7a8cbde3692
Summary:
Previously users could set `max_background_flushes=0` to force rocksdb to use a single thread pool for both background flushes and compactions. That'll no longer be possible since I'm going to deprecate `max_background_flushes` and `max_background_compactions` in favor of a single option. This diff introduces a new way to force a single thread pool: when high-pri pool has zero threads, all background jobs will be submitted to low-pri pool.
Note the majority of the code change is adding `Env::GetBackgroundThreads()`, which is necessary to check whether the user has provided a zero-sized thread pool.
Closes https://github.com/facebook/rocksdb/pull/2204
Differential Revision: D4936256
Pulled By: ajkr
fbshipit-source-id: 929a07a0c0705f7766f5339cd013ff74e90d6e01
Summary:
Disable direct reads for log and manifest. Direct reads should not affect sequential_file
Also add kDirectIO for option_config_ in db_test_util
Closes https://github.com/facebook/rocksdb/pull/2337
Differential Revision: D5100261
Pulled By: lightmark
fbshipit-source-id: 0ebfd13b93fa1b8f9acae514ac44f8125a05868b
Summary:
PipelineWriteImpl is an alternative approach to WriteImpl. In WriteImpl, only one thread is allow to write at the same time. This thread will do both WAL and memtable writes for all write threads in the write group. Pending writers wait in queue until the current writer finishes. In the pipeline write approach, two queue is maintained: one WAL writer queue and one memtable writer queue. All writers (regardless of whether they need to write WAL) will still need to first join the WAL writer queue, and after the house keeping work and WAL writing, they will need to join memtable writer queue if needed. The benefit of this approach is that
1. Writers without memtable writes (e.g. the prepare phase of two phase commit) can exit write thread once WAL write is finish. They don't need to wait for memtable writes in case of group commit.
2. Pending writers only need to wait for previous WAL writer finish to be able to join the write thread, instead of wait also for previous memtable writes.
Merging #2056 and #2058 into this PR.
Closes https://github.com/facebook/rocksdb/pull/2286
Differential Revision: D5054606
Pulled By: yiwu-arbug
fbshipit-source-id: ee5b11efd19d3e39d6b7210937b11cefdd4d1c8d
Summary:
Fixing two types of clang-analyzer false positives:
* db is deleted and then reopen, and clang-analyzer thinks we are reusing the pointer after it has been deleted. Adding asserts to hint clang-analyzer the pointer is recreated.
* ParsedInternalKey is (intentionally) uninitialized. Initialize the struct only when clang-analyzer is running.
Closes https://github.com/facebook/rocksdb/pull/2334
Differential Revision: D5093801
Pulled By: yiwu-arbug
fbshipit-source-id: f51355382098eb3da5ab9f64e094c6d03e6bdf7d
Summary:
TSAN shows warning of data race of EnvCounter::num_new_writable_file_. Make it atomic.
Closes https://github.com/facebook/rocksdb/pull/2331
Differential Revision: D5089215
Pulled By: siying
fbshipit-source-id: 15f6dcfb770a3310cbb6337c22482c8b330daffc
Summary:
First cut for early review; there are few conceptual points to answer and some code structure issues.
For conceptual points -
- restriction-wise, we're going to disallow ingest_behind if (use_seqno_zero_out=true || disable_auto_compaction=false), the user is responsible to properly open and close DB with required params
- we wanted to ingest into reserved bottom most level. Should we fail fast if bottom level isn't empty, or should we attempt to ingest if file fits there key-ranges-wise?
- Modifying AssignLevelForIngestedFile seems the place we we'd handle that.
On code structure - going to refactor GenerateAndAddExternalFile call in the test class to allow passing instance of IngestionOptions, that's just going to incur lots of changes at callsites.
Closes https://github.com/facebook/rocksdb/pull/2144
Differential Revision: D4873732
Pulled By: lightmark
fbshipit-source-id: 81cb698106b68ef8797f564453651d50900e153a
Summary:
I've added functions to the C API to support Transactions as requested in #1637 and to support Checkpoint.
I have also added the corresponding tests to c_test.c
For now, the following is omitted:
1. Optimistic Transactions
2. The column family variation of functions
Closes https://github.com/facebook/rocksdb/pull/2236
Differential Revision: D4989510
Pulled By: yiwu-arbug
fbshipit-source-id: 518cb39f76d5e9ec9690d633fcdc014b98958071
Summary:
fix test failure of ReadAmpBitmap and ReadAmpBitmapLiveInCacheAfterDBClose.
test ReadAmpBitmapLiveInCacheAfterDBClose individually and make check
Closes https://github.com/facebook/rocksdb/pull/2271
Differential Revision: D5038133
Pulled By: lightmark
fbshipit-source-id: 803cd6f45ccfdd14a9d9473c8af311033e164be8
Summary:
- added a feature test in build_detect_platform to check whether sched_getcpu() is available. glibc offers it only on some platforms (e.g., linux but not mac); this way should be easier than maintaining a list of platforms on which it's available.
- refactored PhysicalCoreID() to be simpler / less repetitive. ordered the conditional compilation clauses from most-to-least preferred
Closes https://github.com/facebook/rocksdb/pull/2272
Differential Revision: D5038093
Pulled By: ajkr
fbshipit-source-id: 81d7db3cc620250de220bdeb3194b2b3d7673de7
Summary:
Consider BlockReadAmpBitmap with bytes_per_bit = 32. Suppose bytes [a, b) were used, while bytes [a-32, a)
and [b+1, b+33) weren't used; more formally, the union of ranges passed to BlockReadAmpBitmap::Mark() contains [a, b) and doesn't intersect with [a-32, a) and [b+1, b+33). Then bits [floor(a/32), ceil(b/32)] will be set, and so the number of useful bytes will be estimated as (ceil(b/32) - floor(a/32)) * 32, which is on average equal to b-a+31.
An extreme example: if we use 1 byte from each block, it'll be counted as 32 bytes from each block.
It's easy to remove this bias by slightly changing the semantics of the bitmap. Currently each bit represents a byte range [i*32, (i+1)*32).
This diff makes each bit represent a single byte: i*32 + X, where X is a random number in [0, 31] generated when bitmap is created. So, e.g., if you read a single byte at random, with probability 31/32 it won't be counted at all, and with probability 1/32 it will be counted as 32 bytes; so, on average it's counted as 1 byte.
*But there is one exception: the last bit will always set with the old way.*
(*) - assuming read_amp_bytes_per_bit = 32.
Closes https://github.com/facebook/rocksdb/pull/2259
Differential Revision: D5035652
Pulled By: lightmark
fbshipit-source-id: bd98b1b9b49fbe61f9e3781d07f624e3cbd92356
Summary:
Based on my experience with linkbench, We should not skip loading bloom filter blocks when they are not available in block cache when using Iterator::Seek
Actually I am not sure why this behavior existed in the first place
Closes https://github.com/facebook/rocksdb/pull/2255
Differential Revision: D5010721
Pulled By: maysamyabandeh
fbshipit-source-id: 0af545a06ac4baeecb248706ec34d009c2480ca4
Summary:
Adding DB::CreateColumnFamilie() and DB::DropColumnFamilies() to bulk create/drop column families. This is to address the problem creating/dropping 1k column families takes minutes. The bottleneck is we persist options files for every single column family create/drop, and it parses the persisted options file for verification, which take a lot CPU time.
The new APIs simply create/drop column families individually, and persist options file once at the end. This improves create 1k column families to within ~0.1s. Further improvement can be merge manifest write to one IO.
Closes https://github.com/facebook/rocksdb/pull/2248
Differential Revision: D5001578
Pulled By: yiwu-arbug
fbshipit-source-id: d4e00bda671451e0b314c13e12ad194b1704aa03
Summary:
- downcase includes for case-sensitive filesystems
- give targets the same name (librocksdb) on all platforms
With this patch it is possible to cross-compile RocksDB for Windows
from a Linux host using mingw.
cc yuslepukhin orgads
Closes https://github.com/facebook/rocksdb/pull/2107
Differential Revision: D4849784
Pulled By: siying
fbshipit-source-id: ad26ed6b4d393851aa6551e6aa4201faba82ef60
Summary:
Now if we have iterate_upper_bound set, we continue read until get a key >= upper_bound. For a lot of cases that neighboring data blocks have a user key gap between them, our index key will be a user key in the middle to get a shorter size. For example, if we have blocks:
[a b c d][f g h]
Then the index key for the first block will be 'e'.
then if upper bound is any key between 'd' and 'e', for example, d1, d2, ..., d99999999999, we don't have to read the second block and also know that we have done our iteration by reaching the last key that smaller the upper bound already.
This diff can reduce RA in most cases.
Closes https://github.com/facebook/rocksdb/pull/2239
Differential Revision: D4990693
Pulled By: lightmark
fbshipit-source-id: ab30ea2e3c6edf3fddd5efed3c34fcf7739827ff
Summary:
Allow an option for users to do some compaction in FIFO compaction, to pay some write amplification for fewer number of files.
Closes https://github.com/facebook/rocksdb/pull/2163
Differential Revision: D4895953
Pulled By: siying
fbshipit-source-id: a1ab608dd0627211f3e1f588a2e97159646e1231
Summary:
Changed dynamic leveling to stop setting the base level's size bound below `max_bytes_for_level_base`.
Behavior for config where `max_bytes_for_level_base == level0_file_num_compaction_trigger * write_buffer_size` and same amount of data in L0 and base-level:
- Before #2027, compaction scoring would favor base-level due to dividing by size smaller than `max_bytes_for_level_base`.
- After #2027, L0 and Lbase get equal scores. The disadvantage is L0 is often compacted before reaching the num files trigger since `write_buffer_size` can be bigger than the dynamically chosen base-level size. This increases write-amp.
- After this diff, L0 and Lbase still get equal scores. Now it takes `level0_file_num_compaction_trigger` files of size `write_buffer_size` to trigger L0 compaction by size, fixing the write-amp problem above.
Closes https://github.com/facebook/rocksdb/pull/2123
Differential Revision: D4861570
Pulled By: ajkr
fbshipit-source-id: 467ddef56ed1f647c14d86bb018bcb044c39b964
Summary:
When user doesn't set a limit on compaction output file size, let's use the sum of the input files' sizes. This will avoid passing UINT64_MAX as fallocate()'s length. Reported in #2249.
Test setup:
- command: `TEST_TMPDIR=/data/rocksdb-test/ strace -e fallocate ./db_compaction_test --gtest_filter=DBCompactionTest.ManualCompactionUnknownOutputSize`
- filesystem: xfs
before this diff:
`fallocate(10, 01, 0, 1844674407370955160) = -1 ENOSPC (No space left on device)`
after this diff:
`fallocate(10, 01, 0, 1977) = 0`
Closes https://github.com/facebook/rocksdb/pull/2252
Differential Revision: D5007275
Pulled By: ajkr
fbshipit-source-id: 4491404a6ae8a41328aede2e2d6f4d9ac3e38880
Summary:
Makes max_open_files db option dynamically set-able by SetDBOptions. During the call of SetDBOptions we call SetCapacity on the table cache, which is a LRUCache.
Closes https://github.com/facebook/rocksdb/pull/2185
Differential Revision: D4979189
Pulled By: yiwu-arbug
fbshipit-source-id: ca7e8dc5e3619c79434f579be4847c0f7e56afda
Summary:
A data race between a manual and an auto compaction can cause a scheduled automatic compaction to be cancelled and never rescheduled again. This may cause a condition of hanging forever. Fix this by always making sure the cancelled compaction is put back to the compaction queue.
Closes https://github.com/facebook/rocksdb/pull/2238
Differential Revision: D4984591
Pulled By: siying
fbshipit-source-id: 3ab153886403c7b991896dcb2158b96cac12f227
Summary:
ColumnFamilyData::ConstructNewMemtable is called out of DB mutex, and it asserts current_ is not empty, but current_ should only be accessed inside DB mutex. Remove this assert to make TSAN happy.
Closes https://github.com/facebook/rocksdb/pull/2235
Differential Revision: D4978531
Pulled By: siying
fbshipit-source-id: 423685a7dae88ed3faaa9e1b9ccb3427ac704a4b