Commit Graph

1089 Commits

Author SHA1 Message Date
Sebastiano Peluso
fcd7e03832 Ignore value of BackupableDBOptions::max_valid_backups_to_open when B… (#6072)
Summary:
This change ignores the value of BackupableDBOptions::max_valid_backups_to_open when a BackupEngine is not read-only.

Issue: https://github.com/facebook/rocksdb/issues/4997

Note on tests: I had to remove test case WriteOnlyEngine of BackupableDBTest because it was not consistent with the new semantic of BackupableDBOptions::max_valid_backups_to_open. Maybe, we should think about adding a new interface for append-only BackupEngines. On the other hand, I changed LimitBackupsOpened test case to use a read-only BackupEngine, and I added a new specific test case for the change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6072

Reviewed By: pdillinger

Differential Revision: D18687364

Pulled By: sebastianopeluso

fbshipit-source-id: 77bc1f927d623964d59137a93de123bbd719da4e
2019-11-26 10:00:31 -08:00
Levi Tamasi
279c488395 Mark blob files not needed by any memtables/SSTs obsolete (#6032)
Summary:
The patch adds logic to mark no longer needed blob files obsolete upon database open
and whenever a flush or compaction completes. Unneeded blob files are detected by
iterating through live immutable non-TTL blob files starting from the lowest-numbered one,
and stopping when a blob file used by any SSTs or potentially used by memtables is found.
(The latter is determined by comparing the sequence number at which the blob file
became immutable with the largest sequence number received in flush notifications.)

In addition, the patch cleans up the logic around closing and obsoleting blob files and
enforces invariants around this area (blob files are now guaranteed to go through the
stages mutable-non-obsolete, immutable-non-obsolete, and immutable-obsolete in this
order).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6032

Test Plan: Extended unit tests and tested using the BlobDB mode of `db_bench`.

Differential Revision: D18495610

Pulled By: ltamasi

fbshipit-source-id: 11825b84af74f3f4abfd9bcae04e80870ae58961
2019-11-18 16:30:06 -08:00
Maysam Yabandeh
0058daef7b Disable SmallestUnCommittedSeq in Valgrind run (#6035)
Summary:
SmallestUnCommittedSeq sometimes takes too long when run under Valgrind. The patch disables it when the tests are run under Valgrind.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6035

Differential Revision: D18509198

Pulled By: maysamyabandeh

fbshipit-source-id: 1191443b9fedb6b9c50d6b76f5c92371f5030230
2019-11-14 14:41:52 -08:00
Peter Dillinger
e8e7fb1dcf More fixes to auto-GarbageCollect in BackupEngine (#6023)
Summary:
Production:
* Fixes GarbageCollect (and auto-GC triggered by PurgeOldBackups, DeleteBackup, or CreateNewBackup) to clean up backup directory independent of current settings (except max_valid_backups_to_open; see issue https://github.com/facebook/rocksdb/issues/4997) and prior settings used with same backup directory.
* Fixes GarbageCollect (and auto-GC) not to attempt to remove "." and ".." entries from directories.
* Clarifies contract with users in modifying BackupEngine operations. In short, leftovers from any incomplete operation are cleaned up by any subsequent call to that same kind of operation (PurgeOldBackups and DeleteBackup considered the same kind of operation). GarbageCollect is available to clean up after all kinds. (NB: right now PurgeOldBackups and DeleteBackup will clean up after incomplete CreateNewBackup, but we aren't promising to continue that behavior.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6023

Test Plan:
* Refactors open parameters to use an option enum, for readability, etc. (Also fixes an unused parameter bug in the redundant OpenDBAndBackupEngineShareWithChecksum.)
* Fixes an apparent bug in ShareTableFilesWithChecksumsTransition in which old backup data was destroyed in the transition to be tested. That test is now augmented to ensure GarbageCollect (or auto-GC) does not remove shared files when BackupEngine is opened with share_table_files=false.
* Augments DeleteTmpFiles test to ensure that CreateNewBackup does auto-GC when an incompletely created backup is detected.

Differential Revision: D18453559

Pulled By: pdillinger

fbshipit-source-id: 5e54e7b08d711b161bc9c656181012b69a8feac4
2019-11-14 06:20:18 -08:00
anand76
6c7b1a0cc7 Batched MultiGet API for multiple column families (#5816)
Summary:
Add a new API that allows a user to call MultiGet specifying multiple keys belonging to different column families. This is mainly useful for users who want to do a consistent read of keys across column families, with the added performance benefits of batching and returning values using PinnableSlice.

As part of this change, the code in the original multi-column family MultiGet for acquiring the super versions has been refactored into a separate function that can be used by both, the batching and the non-batching versions of MultiGet.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5816

Test Plan:
make check
make asan_check
asan_crash_test

Differential Revision: D18408676

Pulled By: anand1976

fbshipit-source-id: 933e7bec91dd70e7b633be4ff623a1116cc28c8d
2019-11-12 13:52:55 -08:00
Levi Tamasi
8e7aa62813 BlobDB: Maintain mapping between blob files and SSTs (#6020)
Summary:
The patch adds logic to BlobDB to maintain the mapping between blob files
and SSTs for which the blob file in question is the oldest blob file referenced
by the SST file. The mapping is initialized during database open based on the
information retrieved using `GetLiveFilesMetaData`, and updated after
flushes/compactions based on the information received through the `EventListener`
interface (or, in the case of manual compactions issued through the `CompactFiles`
API, the `CompactionJobInfo` object).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6020

Test Plan: Added a unit test; also tested using the BlobDB mode of `db_bench`.

Differential Revision: D18410508

Pulled By: ltamasi

fbshipit-source-id: dd9e778af781cfdb0d7056298c54ba9cebdd54a5
2019-11-11 14:01:34 -08:00
Peter Dillinger
aa63abf698 Auto-GarbageCollect on PurgeOldBackups and DeleteBackup (#6015)
Summary:
Only if there is a crash, power failure, or I/O error in
DeleteBackup, shared or private files from the backup might be left
behind that are not cleaned up by PurgeOldBackups or DeleteBackup-- only
by GarbageCollect. This makes the BackupEngine API "leaky by default."
Even if it means a modest performance hit, I think we should make
Delete and Purge do as they say, with ongoing best effort: i.e. future
calls will attempt to finish any incomplete work from earlier calls.

This change does that by having DeleteBackup and PurgeOldBackups do a
GarbageCollect, unless (to minimize performance hit) this BackupEngine
has already done a GarbageCollect and there have been no
deletion-related I/O errors in that GarbageCollect or since then.

Rejected alternative 1: remove meta file last instead of first. This would in theory turn partially deleted backups into corrupted backups, but code changes would be needed to allow the missing files and consider it acceptably corrupt, rather than failing to open the BackupEngine. This might be a reasonable choice, but I mostly rejected it because it doesn't solve the legacy problem of cleaning up existing lingering files.

Rejected alternative 2: use a deletion marker file. If deletion started with creating a file that marks a backup as flagged for deletion, then we could reliably detect partially deleted backups and efficiently finish removing them. In addition to not solving the legacy problem, this could be precarious if there's a disk full situation, and we try to create a new file in order to delete some files. Ugh.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6015

Test Plan: Updated unit tests

Differential Revision: D18401333

Pulled By: pdillinger

fbshipit-source-id: 12944e372ce6809f3f5a4c416c3b321a8927d925
2019-11-08 19:15:35 -08:00
Levi Tamasi
f80050fa8f Add file number/oldest referenced blob file number to {Sst,Live}FileMetaData (#6011)
Summary:
The patch exposes the file numbers of the SSTs as well as the oldest blob
files they contain a reference to through the GetColumnFamilyMetaData/
GetLiveFilesMetaData interface.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6011

Test Plan:
Fixed and extended the existing unit tests. (The earlier ColumnFamilyMetaDataTest
wasn't really testing anything because the generated memtables were never
flushed, so the metadata structure was essentially empty.)

Differential Revision: D18361697

Pulled By: ltamasi

fbshipit-source-id: d5ed1d94ac70858b84393c48711441ddfe1251e9
2019-11-07 14:04:16 -08:00
Yanqin Jin
67e735dbf9 Rename BlockBasedTable::ReadMetaBlock (#6009)
Summary:
According to
https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format,
the block read by BlockBasedTable::ReadMetaBlock is actually the meta index
block. Therefore, it is better to rename the function to ReadMetaIndexBlock.

This PR also applies some format change to existing code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6009

Test Plan: make check

Differential Revision: D18333238

Pulled By: riversand963

fbshipit-source-id: 2c4340a29b3edba53d19c132cbfd04caf6242aed
2019-11-05 17:19:11 -08:00
Sergei Petrunia
230bcae7b6 Add a limited support for iteration bounds into BaseDeltaIterator (#5403)
Summary:
For MDEV-19670: MyRocks: key lookups into deleted data are very slow

BaseDeltaIterator remembers iterate_upper_bound and will not let delta_iterator_
walk above the iterate_upper_bound if base_iterator_ is not valid
anymore.

== Rationale ==
The most straightforward way would be to make the delta_iterator
(which is a rocksdb::WBWIIterator) to support iterator bounds. But
checking for bounds has an extra CPU overhead.

So we put the check into BaseDeltaIterator, and only make it when
base_iterator_ is not valid.

(note: We could take it even further, and move the check a few lines
down, and only check iterator bounds ourselves if base_iterator_ is
not valid AND delta_iterator_ hit a tombstone).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5403

Differential Revision: D15863092

Pulled By: maysamyabandeh

fbshipit-source-id: 8da458e7b9af95ff49356666f69664b4a6ccf49b
2019-11-05 11:39:36 -08:00
Maysam Yabandeh
52733b4498 WritePrepared: Fix flaky test MaxCatchupWithNewSnapshot (#5850)
Summary:
MaxCatchupWithNewSnapshot tests that the snapshot sequence number will be larger than the max sequence number when the snapshot was taken. However since the test does not have access to the max sequence number when the snapshot was taken, it uses max sequence number after that, which could have advanced the snapshot by then, thus making the test flaky.
The fix is to compare with max sequence number before the snapshot was taken, which is a lower bound for the value when the snapshot was taken.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5850

Test Plan: ~/gtest-parallel/gtest-parallel --repeat=12800 ./write_prepared_transaction_test --gtest_filter="*MaxCatchupWithNewSnapshot*"

Differential Revision: D17608926

Pulled By: maysamyabandeh

fbshipit-source-id: b122ae5a27f982b290bd60da852e28d3c5eb0136
2019-11-04 16:23:57 -08:00
Sagar Vemuri
4c9aa30a62 Auto enable Periodic Compactions if a Compaction Filter is used (#5865)
Summary:
- Periodic compactions are auto-enabled if a compaction filter or a compaction filter factory is set, in Level Compaction.
- The default value of `periodic_compaction_seconds` is changed to UINT64_MAX, which lets RocksDB auto-tune periodic compactions as needed. An explicit value of 0 will still work as before ie. to disable periodic compactions completely. For now, on seeing a compaction filter along with a UINT64_MAX value for `periodic_compaction_seconds`, RocksDB will make SST files older than 30 days to go through periodic copmactions.

Some RocksDB users make use of compaction filters to control when their data can be deleted, usually with a custom TTL logic. But it is occasionally possible that the compactions get delayed by considerable time due to factors like low writes to a key range, data reaching bottom level, etc before the TTL expiry. Periodic Compactions feature was originally built to help such cases. Now periodic compactions are auto enabled by default when compaction filters or compaction filter factories are used, as it is generally helpful to all cases to collect garbage.

`periodic_compaction_seconds` is set to a large value, 30 days, in `SanitizeOptions` when RocksDB sees that a `compaction_filter` or `compaction_filter_factory` is used.

This is done only for Level Compaction style.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5865

Test Plan:
- Added a new test `DBCompactionTest.LevelPeriodicCompactionWithCompactionFilters` to make sure that `periodic_compaction_seconds` is set if either `compaction_filter` or `compaction_filter_factory` options are set.
- `COMPILE_WITH_ASAN=1 make check`

Differential Revision: D17659180

Pulled By: sagar0

fbshipit-source-id: 4887b9cf2e53cf2dc93a7b658c6b15e1181217ee
2019-10-29 15:05:51 -07:00
Peter Dillinger
ca7ccbe2ea Misc hashing updates / upgrades (#5909)
Summary:
- Updated our included xxhash implementation to version 0.7.2 (== the latest dev version as of 2019-10-09).
- Using XXH_NAMESPACE (like other fb projects) to avoid potential name collisions.
- Added fastrange64, and unit tests for it and fastrange32. These are faster alternatives to hash % range.
- Use preview version of XXH3 instead of MurmurHash64A for NPHash64
-- Had to update cache_test to increase probability of passing for any given hash function.
- Use fastrange64 instead of % with uses of NPHash64
-- Had to fix WritePreparedTransactionTest.CommitOfDelayedPrepared to avoid deadlock apparently caused by new hash collision.
- Set default seed for NPHash64 because specifying a seed rarely makes sense for it.
- Removed unnecessary include xxhash.h in a popular .h file
- Rename preview version of XXH3 to XXH3p for clarity and to ease backward compatibility in case final version of XXH3 is integrated.

Relying on existing unit tests for NPHash64-related changes. Each new implementation of fastrange64 passed unit tests when manipulating my local build to select it. I haven't done any integration performance tests, but I consider the improved performance of the pieces being swapped in to be well established.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5909

Differential Revision: D18125196

Pulled By: pdillinger

fbshipit-source-id: f6bf83d49d20cbb2549926adf454fd035f0ecc0d
2019-10-24 17:16:46 -07:00
Levi Tamasi
a59dc843a4 Move blob_index.h to db/ (#5919)
Summary:
Extracted from PR https://github.com/facebook/rocksdb/issues/5903 for technical reasons.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5919

Test Plan: make check

Differential Revision: D17910132

Pulled By: ltamasi

fbshipit-source-id: 6ecbb8d6e84b2a1d1f28575ad48ac3cc65833eb5
2019-10-14 12:54:05 -07:00
Andrew Kryczka
b00761eea6 Fix block cache ID uniqueness for Windows builds (#5844)
Summary:
Since we do not evict a file's blocks from block cache before that file
is deleted, we require a file's cache ID prefix is both unique and
non-reusable. However, the Windows functionality we were relying on only
guaranteed uniqueness. That meant a newly created file could be assigned
the same cache ID prefix as a deleted file. If the newly created file
had block offsets matching the deleted file, full cache keys could be
exactly the same, resulting in obsolete data blocks returned from cache
when trying to read from the new file.

We noticed this when running on FAT32 where compaction was writing out
of order keys due to reading obsolete blocks from its input files. The
functionality is documented as behaving the same on NTFS, although I
wasn't able to repro it there.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5844

Test Plan:
we had a reliable repro of out-of-order keys on FAT32 that
was fixed by this change

Differential Revision: D17752442

fbshipit-source-id: 95d983f9196cf415f269e19293b97341edbf7e00
2019-10-11 18:19:31 -07:00
jsteemann
da3b2840cb save a few redundant container lookups (#5875)
Summary:
This PR eliminates repeated lookups in associative or ordered containers when a single lookup suffices.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5875

Differential Revision: D17753172

Pulled By: anand1976

fbshipit-source-id: 796b02b760082521d8c42a1cb65a76bf0e6c1b8e
2019-10-07 12:28:09 -07:00
sdong
e8263dbdaa Apply formatter to recent 200+ commits. (#5830)
Summary:
Further apply formatter to more recent commits.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5830

Test Plan: Run all existing tests.

Differential Revision: D17488031

fbshipit-source-id: 137458fd94d56dd271b8b40c522b03036943a2ab
2019-09-20 12:04:26 -07:00
sdong
c06b54d0c6 Apply formatter on recent 45 commits. (#5827)
Summary:
Some recent commits might not have passed through the formatter. I formatted recent 45 commits. The script hangs for more commits so I stopped there.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5827

Test Plan: Run all existing tests.

Differential Revision: D17483727

fbshipit-source-id: af23113ee63015d8a43d89a3bc2c1056189afe8f
2019-09-19 12:34:17 -07:00
Levi Tamasi
2cbb61eadb Make clang-analyzer happy (#5821)
Summary:
clang-analyzer has uncovered a bunch of places where the code is relying
on pointers being valid and one case (in VectorIterator) where a moved-from
object is being used:

In file included from db/range_tombstone_fragmenter.cc:17:
./util/vector_iterator.h:23:18: warning: Method called on moved-from object 'keys' of type 'std::vector'
        current_(keys.size()) {
                 ^~~~~~~~~~~
1 warning generated.
utilities/persistent_cache/block_cache_tier_file.cc:39:14: warning: Called C++ object pointer is null
  Status s = env->NewRandomAccessFile(filepath, file, opt);
             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
utilities/persistent_cache/block_cache_tier_file.cc:47:19: warning: Called C++ object pointer is null
  Status status = env_->GetFileSize(Path(), size);
                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
utilities/persistent_cache/block_cache_tier_file.cc:290:14: warning: Called C++ object pointer is null
  Status s = env_->FileExists(Path());
             ^~~~~~~~~~~~~~~~~~~~~~~~
utilities/persistent_cache/block_cache_tier_file.cc:363:35: warning: Called C++ object pointer is null
    CacheWriteBuffer* const buf = alloc_->Allocate();
                                  ^~~~~~~~~~~~~~~~~~
utilities/persistent_cache/block_cache_tier_file.cc:399:41: warning: Called C++ object pointer is null
  const uint64_t file_off = buf_doff_ * alloc_->BufferSize();
                                        ^~~~~~~~~~~~~~~~~~~~
utilities/persistent_cache/block_cache_tier_file.cc:463:33: warning: Called C++ object pointer is null
  size_t start_idx = lba.off_ / alloc_->BufferSize();
                                ^~~~~~~~~~~~~~~~~~~~
utilities/persistent_cache/block_cache_tier_file.cc:515:5: warning: Called C++ object pointer is null
    alloc_->Deallocate(bufs_[i]);
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
7 warnings generated.
ar: creating librocksdb_debug.a
utilities/memory/memory_test.cc:68:25: warning: Called C++ object pointer is null
      cache_set->insert(db->GetDBOptions().row_cache.get());
                        ^~~~~~~~~~~~~~~~~~
1 warning generated.

The patch fixes these by adding assertions and explicitly passing in zero
when initializing VectorIterator::current_ (which preserves the existing
behavior).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5821

Test Plan: Ran make check and make analyze to make sure the warnings have disappeared.

Differential Revision: D17455949

Pulled By: ltamasi

fbshipit-source-id: 363619618ea649a0674287f9f3b3393e390571ee
2019-09-18 15:25:48 -07:00
Maysam Yabandeh
638d239507 Charge block cache for cache internal usage (#5797)
Summary:
For our default block cache, each additional entry has extra memory overhead. It include LRUHandle (72 bytes currently) and the cache key (two varint64, file id and offset). The usage is not negligible. For example for block_size=4k, the overhead accounts for an extra 2% memory usage for the cache. The patch charging the cache for the extra usage, reducing untracked memory usage outside block cache. The feature is enabled by default and can be disabled by passing kDontChargeCacheMetadata to the cache constructor.
This PR builds up on https://github.com/facebook/rocksdb/issues/4258
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5797

Test Plan:
- Existing tests are updated to either disable the feature when the test has too much dependency on the old way of accounting the usage or increasing the cache capacity to account for the additional charge of metadata.
- The Usage tests in cache_test.cc are augmented to test the cache usage under kFullChargeCacheMetadata.

Differential Revision: D17396833

Pulled By: maysamyabandeh

fbshipit-source-id: 7684ccb9f8a40ca595e4f5efcdb03623afea0c6f
2019-09-16 15:26:21 -07:00
sdong
b931f84e56 Divide file_reader_writer.h and .cc (#5803)
Summary:
file_reader_writer.h and .cc contain several files and helper function, and it's hard to navigate. Separate it to multiple files and put them under file/
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5803

Test Plan: Build whole project using make and cmake.

Differential Revision: D17374550

fbshipit-source-id: 10efca907721e7a78ed25bbf74dc5410dea05987
2019-09-16 10:33:51 -07:00
anand76
83a6a614e9 Refactor ArenaWrappedDBIter into separate files (#5801)
Summary:
Move definition and implementation for ArenaWrappedDBIter into its own .h/.cc files. Also, change inlining of functions to better comply with the Google C++ style guide.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5801

Test Plan: make check

Differential Revision: D17371012

Pulled By: anand1976

fbshipit-source-id: c1361abc2851575111e357a63d88be3b3d6cb341
2019-09-13 13:50:43 -07:00
Shylock Hg
9eb3e1f77d Use delete to disable automatic generated methods. (#5009)
Summary:
Use delete to disable automatic generated methods instead of private, and put the constructor together for more clear.This modification cause the unused field warning, so add unused attribute to disable this warning.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5009

Differential Revision: D17288733

fbshipit-source-id: 8a767ce096f185f1db01bd28fc88fef1cdd921f3
2019-09-11 18:09:00 -07:00
Wilfried Goesgens
fbab9913e2 upgrade gtest 1.7.0 => 1.8.1 for json result writing
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5332

Differential Revision: D17242232

fbshipit-source-id: c0d4646556a1335e51ac7382b986ca7f6ced7b64
2019-09-09 11:24:11 -07:00
Maysam Yabandeh
78b8cfc7ec WriteUnPrepared: Split ReadYourOwnWriteStress to three (#5776)
Summary:
ReadYourOwnWriteStress occasionally times out on some platforms. The patch splits it to three.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5776

Differential Revision: D17231743

Pulled By: maysamyabandeh

fbshipit-source-id: d42eeaf22f61a48d50f9c404d98b1081ae8dac94
2019-09-06 15:25:26 -07:00
Manuel Ung
2208cc0196 Fix build break in TransactionBaseImpl::TrackKey (#5771)
Summary:
Fix build broken in https://github.com/facebook/rocksdb/pull/5696.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5771

Differential Revision: D17217665

Pulled By: lth

fbshipit-source-id: 7aa84a2a9b4feb7a3ab1cab174e09276430fe042
2019-09-06 10:18:04 -07:00
奏之章
533e47709c Fix WriteBatchWithIndex with MergeOperator bug (#5577)
Summary:
```
TEST_F(WriteBatchWithIndexTest, TestGetFromBatchAndDBMerge3) {
  DB* db;
  Options options;

  options.create_if_missing = true;
  std::string dbname = test::PerThreadDBPath("write_batch_with_index_test");

  options.merge_operator = MergeOperators::CreateFromStringId("stringappend");

  DestroyDB(dbname, options);
  Status s = DB::Open(options, dbname, &db);
  assert(s.ok());

  ReadOptions read_options;
  WriteOptions write_options;
  FlushOptions flush_options;
  std::string value;

  WriteBatchWithIndex batch;

  ASSERT_OK(db->Put(write_options, "A", "1"));
  ASSERT_OK(db->Flush(flush_options, db->DefaultColumnFamily()));
  ASSERT_OK(batch.Merge("A", "2"));

  ASSERT_OK(batch.GetFromBatchAndDB(db, read_options, "A", &value));
  ASSERT_EQ(value, "1,2");

  delete db;
  DestroyDB(dbname, options);
}
```
Fix ASSERT in batch.GetFromBatchAndDB()
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5577

Differential Revision: D16379847

fbshipit-source-id: b1320e24ec8e71350c525083cc0a16180a63f752
2019-09-05 17:52:14 -07:00
jsteemann
19e8c9b64f use c++17's try_emplace if available (#5696)
Summary:
This avoids rehashing the key in TrackKey() in case the key is not already
in the map of tracked keys, which will happen at least once per key used in a
transaction.

Additionally fix two typos.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5696

Differential Revision: D17210178

Pulled By: lth

fbshipit-source-id: 7e2c28e9e505c1d1c1535d435250cf2b191a6fdf
2019-09-05 13:59:40 -07:00
Maysam Yabandeh
f9fb9f1421 Add a unit test to detect infinite loops with reseek optimizations (#5727)
Summary:
Iterators reseek to the target key after iterating over max_sequential_skip_in_iterations invalid values. The logic is susceptible to an infinite loop bug, which has been present with WritePrepared Transactions up until 6.2 release. Although the bug is not present on master, the patch adds a unit test to prevent it from resurfacing again.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5727

Differential Revision: D16952759

Pulled By: maysamyabandeh

fbshipit-source-id: d0d973dddc8dfabd5a794931232aa4c862c74f51
2019-09-04 14:31:10 -07:00
Pratik Dhandharia
1b4c104a67 replace some reinterpret_cast with static_cast_with_check (#5740)
Summary:
This PR focuses on replacing some of the reinterpret_cast<DBImpl*> to static_cast_with_check<DBImpl, DB>.

Files impacted:

./db/db_impl/db_impl_compaction_flush.cc
./db/write_batch.cc
./utilities/blob_db/blob_db_impl.cc
./utilities/transactions/pessimistic_transaction_db.cc
./utilities/transactions/transaction_base.cc
./utilities/transactions/write_prepared_txn_db.cc
./utilities/transactions/write_unprepared_txn_db.cc
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5740

Differential Revision: D17055691

Pulled By: pdhandharia

fbshipit-source-id: 0f8034d1b32eade56e37d59c04b7bf236a81d8e8
2019-08-27 10:59:11 -07:00
Zhongyi Xie
2f41ecfe75 Refactor trimming logic for immutable memtables (#5022)
Summary:
MyRocks currently sets `max_write_buffer_number_to_maintain` in order to maintain enough history for transaction conflict checking. The effectiveness of this approach depends on the size of memtables. When memtables are small, it may not keep enough history; when memtables are large, this may consume too much memory.
We are proposing a new way to configure memtable list history: by limiting the memory usage of immutable memtables. The new option is `max_write_buffer_size_to_maintain` and it will take precedence over the old `max_write_buffer_number_to_maintain` if they are both set to non-zero values. The new option accounts for the total memory usage of flushed immutable memtables and mutable memtable. When the total usage exceeds the limit, RocksDB may start dropping immutable memtables (which is also called trimming history), starting from the oldest one.
The semantics of the old option actually works both as an upper bound and lower bound. History trimming will start if number of immutable memtables exceeds the limit, but it will never go below (limit-1) due to history trimming.
In order the mimic the behavior with the new option, history trimming will stop if dropping the next immutable memtable causes the total memory usage go below the size limit. For example, assuming the size limit is set to 64MB, and there are 3 immutable memtables with sizes of 20, 30, 30. Although the total memory usage is 80MB > 64MB, dropping the oldest memtable will reduce the memory usage to 60MB < 64MB, so in this case no memtable will be dropped.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5022

Differential Revision: D14394062

Pulled By: miasantreble

fbshipit-source-id: 60457a509c6af89d0993f988c9b5c2aa9e45f5c5
2019-08-23 13:55:34 -07:00
sdong
e0515607bc Blacklist TransactionTest.GetWithoutSnapshot from valgrind_test (#5715)
Summary:
In valgrind_test, TransactionTest.GetWithoutSnapshot ran 2 hours and still didn't finish. Black list from valgrind_test to prevent timeout.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5715

Test Plan: run "make valgrind_test" and see whether the test is still generated.

Differential Revision: D16866009

fbshipit-source-id: 92c78049b0bc1c2b9a0dfc1b7c8a9206b36f02f0
2019-08-16 15:36:49 -07:00
Manuel Ung
7785f61132 WriteUnPrepared: Fix bug in savepoints (#5703)
Summary:
Fix a bug in write unprepared savepoints. When flushing the write batch according to savepoint boundaries, we were forgetting to flush the last write batch after the last savepoint, meaning that some data was not written to DB.

Also, add a small optimization where we avoid flushing empty batches.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5703

Differential Revision: D16811996

Pulled By: lth

fbshipit-source-id: 600c7e0e520ad7a8fad32d77e11d932453e68e3f
2019-08-14 16:15:46 -07:00
Levi Tamasi
0a97125ec0 Fix data races in BlobDB (#5698)
Summary:
Some accesses to blob_files_ and open_ttl_files_ in BlobDBImpl, as well
as to expiration_range_ in BlobFile were not properly synchronized.
The patch fixes this and also makes sure the invariant that obsolete_files_
is a subset of blob_files_ holds even when an attempt to delete an obsolete
blob file fails.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5698

Test Plan:
COMPILE_WITH_TSAN=1 make blob_db_test
gtest-parallel --repeat=1000 ./blob_db_test --gtest_filter="*ShutdownWait*"

The test fails with TSAN errors ~20 times out of 1000 without the patch but
completes successfully 1000 out of 1000 times with the fix.

Differential Revision: D16793235

Pulled By: ltamasi

fbshipit-source-id: 8034b987598d4fdc9f15098d4589cc49cde484e9
2019-08-14 16:10:36 -07:00
Manuel Ung
4c70cb7306 WriteUnPrepared: support iterating while writing to transaction (#5699)
Summary:
In MyRocks, there are cases where we write while iterating through keys. This currently breaks WBWIIterator, because if a write batch flushes during iteration, the delta iterator would point to invalid memory.

For now, fix by disallowing flush if there are active iterators. In the future, we will loop through all the iterators on a transaction, and refresh the iterators when a write batch is flushed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5699

Differential Revision: D16794157

Pulled By: lth

fbshipit-source-id: 5d5bf70688bd68fe58e8a766475ae88fd1be3190
2019-08-14 14:28:53 -07:00
Zhongyi Xie
90cd6c2bb1 Fix double deletion in transaction_test (#5700)
Summary:
Fix the following clang analyze failures:
```
In file included from utilities/transactions/transaction_test.cc:8:
./utilities/transactions/transaction_test.h:174:14: warning: Attempt to delete released memory
      delete root_db;
             ^
```
The destructor of StackableDB already deletes the root db and there is no need to delete the db separately.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5700

Test Plan: USE_CLANG=1 TEST_TMPDIR=/dev/shm/rocksdb OPT=-g make -j24 analyze

Differential Revision: D16800579

Pulled By: maysamyabandeh

fbshipit-source-id: 64c2d70f23e07e6a15242add97c744902ea33be5
2019-08-13 21:54:55 -07:00
Manuel Ung
8a678a50ba WriteUnPrepared: Relax restriction on iterators and writes with no snapshot (#5697)
Summary:
Currently, if a write is done without a snapshot, then `largest_validated_seq_` is set to `kMaxSequenceNumber`. This is too aggressive, because an iterator with a snapshot created after this write should be valid.

Set `largest_validated_seq_` to `GetLastPublishedSequence` instead. The variable means that no keys in the current tracked key set has changed by other transactions since `largest_validated_seq_`.

Also, do some extra cleanup in Clear() for safety.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5697

Differential Revision: D16788613

Pulled By: lth

fbshipit-source-id: f2aa40b8b12e0c0cf9e38c940fecc8f1cc0d2385
2019-08-13 13:11:51 -07:00
Maysam Yabandeh
64855979ae WriteUnPrepared: Pass snap_released to the callback (#5691)
Summary:
With changes made in https://github.com/facebook/rocksdb/pull/5664 we meant to pass snap_released parameter of ::IsInSnapshot from the read callbacks. Although the variable was defined, passing it to the callback in WritePreparedTxnReadCallback was missing, which is fixed in this PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5691

Differential Revision: D16767310

Pulled By: maysamyabandeh

fbshipit-source-id: 3bf53f5964a2756a66ceef7c8f6b3ac75f102f48
2019-08-12 12:20:46 -07:00
Manuel Ung
6f0f82de87 WriteUnPrepared: increase test coverage in transaction_test (#5658)
Summary:
The changes transaction_test to set `txn_db_options.default_write_batch_flush_threshold = 1` in order to give better test coverage for WriteUnprepared.

As part of the change, some tests had to be updated.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5658

Differential Revision: D16740468

Pulled By: lth

fbshipit-source-id: 3821eec20baf13917c8c1fab444332f75a509de9
2019-08-12 12:16:04 -07:00
Maysam Yabandeh
12eaacb71d WritePrepared: Fix SmallestUnCommittedSeq bug (#5683)
Summary:
SmallestUnCommittedSeq reads two data structures, prepared_txns_ and delayed_prepared_. These two are updated in CheckPreparedAgainstMax when max_evicted_seq_ advances some prepared entires. To avoid the cost of acquiring a mutex, the read from them in SmallestUnCommittedSeq is not atomic. This creates a potential race condition.
The fix is to read the two data structures in the reverse order of their update. CheckPreparedAgainstMax copies the prepared entry to delayed_prepared_ before removing it from prepared_txns_ and SmallestUnCommittedSeq looks into prepared_txns_ before reading delayed_prepared_.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5683

Differential Revision: D16744699

Pulled By: maysamyabandeh

fbshipit-source-id: b1bdb134018beb0b9de58827f512662bea35cad0
2019-08-09 16:40:00 -07:00
haoyuhuang
6e78fe3c8d Pysim more algorithms (#5644)
Summary:
This PR adds four more eviction policies.
- OPT [1]
- Hyperbolic caching [2]
- ARC [3]
- GreedyDualSize [4]

[1] L. A. Belady. 1966. A Study of Replacement Algorithms for a Virtual-storage Computer. IBM Syst. J. 5, 2 (June 1966), 78-101. DOI=http://dx.doi.org/10.1147/sj.52.0078
[2] Aaron Blankstein, Siddhartha Sen, and Michael J. Freedman. 2017. Hyperbolic caching: flexible caching for web applications. In Proceedings of the 2017 USENIX Conference on Usenix Annual Technical Conference (USENIX ATC '17). USENIX Association, Berkeley, CA, USA, 499-511.
[3] Nimrod Megiddo and Dharmendra S. Modha. 2003. ARC: A Self-Tuning, Low Overhead Replacement Cache. In Proceedings of the 2nd USENIX Conference on File and Storage Technologies (FAST '03). USENIX Association, Berkeley, CA, USA, 115-130.
[4] N. Young. The k-server dual and loose competitiveness for paging. Algorithmica, June 1994, vol. 11,(no.6):525-41. Rewritten version of ''On-line caching as cache size varies'', in The 2nd Annual ACM-SIAM Symposium on Discrete Algorithms, 241-250, 1991.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5644

Differential Revision: D16548817

Pulled By: HaoyuHuang

fbshipit-source-id: 838f76db9179f07911abaab46c97e1c929cfcd63
2019-08-06 18:50:59 -07:00
Vijay Nadimpalli
d150e01474 New API to get all merge operands for a Key (#5604)
Summary:
This is a new API added to db.h to allow for fetching all merge operands associated with a Key. The main motivation for this API is to support use cases where doing a full online merge is not necessary as it is performance sensitive. Example use-cases:
1. Update subset of columns and read subset of columns -
Imagine a SQL Table, a row is encoded as a K/V pair (as it is done in MyRocks). If there are many columns and users only updated one of them, we can use merge operator to reduce write amplification. While users only read one or two columns in the read query, this feature can avoid a full merging of the whole row, and save some CPU.
2. Updating very few attributes in a value which is a JSON-like document -
Updating one attribute can be done efficiently using merge operator, while reading back one attribute can be done more efficiently if we don't need to do a full merge.
----------------------------------------------------------------------------------------------------
API :
Status GetMergeOperands(
      const ReadOptions& options, ColumnFamilyHandle* column_family,
      const Slice& key, PinnableSlice* merge_operands,
      GetMergeOperandsOptions* get_merge_operands_options,
      int* number_of_operands)

Example usage :
int size = 100;
int number_of_operands = 0;
std::vector<PinnableSlice> values(size);
GetMergeOperandsOptions merge_operands_info;
db_->GetMergeOperands(ReadOptions(), db_->DefaultColumnFamily(), "k1", values.data(), merge_operands_info, &number_of_operands);

Description :
Returns all the merge operands corresponding to the key. If the number of merge operands in DB is greater than merge_operands_options.expected_max_number_of_operands no merge operands are returned and status is Incomplete. Merge operands returned are in the order of insertion.
merge_operands-> Points to an array of at-least merge_operands_options.expected_max_number_of_operands and the caller is responsible for allocating it. If the status returned is Incomplete then number_of_operands will contain the total number of merge operands found in DB for key.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5604

Test Plan:
Added unit test and perf test in db_bench that can be run using the command:
./db_bench -benchmarks=getmergeoperands --merge_operator=sortlist

Differential Revision: D16657366

Pulled By: vjnadimpalli

fbshipit-source-id: 0faadd752351745224ee12d4ae9ef3cb529951bf
2019-08-06 14:26:44 -07:00
Maysam Yabandeh
208556ee13 WritePrepared: fix Get without snapshot (#5664)
Summary:
if read_options.snapshot is not set, ::Get will take the last sequence number after taking a super-version and uses that as the sequence number. Theoretically max_eviceted_seq_ could advance this sequence number. This could lead ::IsInSnapshot that will be invoked by the ReadCallback to notice the absence of the snapshot. In this case, the ReadCallback should have passed a non-value to snap_released so that it could be set by the ::IsInSnapshot. The patch does that, and adds a unit test to verify it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5664

Differential Revision: D16614033

Pulled By: maysamyabandeh

fbshipit-source-id: 06fb3fd4aacd75806ed1a1acec7961f5d02486f2
2019-08-05 13:41:21 -07:00
Maysam Yabandeh
e579e32eaa Disable ReadYourOwnWriteStress when run under Valgrind (#5671)
Summary:
It sometimes times out when run under valgrind taking around 20m. The patch skips the test under Valgrind.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5671

Differential Revision: D16652382

Pulled By: maysamyabandeh

fbshipit-source-id: 0f6f4f76d37337d56226b689e01b14523dd07aae
2019-08-05 13:35:39 -07:00
Manuel Ung
f622ca2c7c WriteUnPrepared: savepoint support (#5627)
Summary:
Add savepoint support when the current transaction has flushed unprepared batches.

Rolling back to savepoint is similar to rolling back a transaction. It requires the set of keys that have changed since the savepoint, re-reading the keys at the snapshot at that savepoint, and the restoring the old keys by writing out another unprepared batch.

For this strategy to work though, we must be capable of reading keys at a savepoint. This does not work if keys were written out using the same sequence number before and after a savepoint. Therefore, when we flush out unprepared batches, we must split the batch by savepoint if any savepoints exist.

eg. If we have the following:
```
Put(A)
Put(B)
Put(C)
SetSavePoint()
Put(D)
Put(E)
SetSavePoint()
Put(F)
```

Then we will write out 3 separate unprepared batches:
```
Put(A) 1
Put(B) 1
Put(C) 1
Put(D) 2
Put(E) 2
Put(F) 3
```

This is so that when we rollback to eg. the first savepoint, we can just read keys at snapshot_seq = 1.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5627

Differential Revision: D16584130

Pulled By: lth

fbshipit-source-id: 6d100dd548fb20c4b76661bd0f8a2647e64477fa
2019-07-31 13:39:39 -07:00
Manuel Ung
d599135a03 WriteUnPrepared: use WriteUnpreparedTxnReadCallback for ValidateSnapshot (#5657)
Summary:
In DeferSnapshotSavePointTest, writes were failing with snapshot validation error because the key with the latest sequence number was an unprepared key from the current transaction.

Fix this by passing down the correct read callback.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5657

Differential Revision: D16582466

Pulled By: lth

fbshipit-source-id: 11645dac0e7c1374d917ef5fdf757d13c1d1108d
2019-07-31 10:44:56 -07:00
Manuel Ung
399f477818 WriteUnPrepared: Use WriteUnpreparedTxnReadCallback for MultiGet (#5634)
Summary:
The `TransactionTest.MultiGetBatchedTest` were failing with unprepared batches because we were not using the correct callbacks. Override MultiGet to pass down the correct ReadCallback. A similar problem is also fixed in WritePrepared.

This PR also fixes an issue similar to (https://github.com/facebook/rocksdb/pull/5147), but for MultiGet instead of Get.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5634

Differential Revision: D16552674

Pulled By: lth

fbshipit-source-id: 736eaf8e919c6b13d5f5655b1c0d36b57ad04804
2019-07-29 17:56:13 -07:00
haoyuhuang
e648c1d9eb Cache simulator: Optimize hybrid row-block cache. (#5616)
Summary:
This PR optimizes the hybrid row-block cache simulator. If a Get request hits the cache, we treat all its future accesses as hits.

Consider a Get request (no snapshot) accesses multiple files, e.g, file1, file2, file3. We construct the row key as "fdnumber_key_0". Before this PR, if it hits the cache when searching the key in file1, we continue to process its accesses in file2 and file3 which is unnecessary.

With this PR, if "file1_key_0" is in the cache, we treat all future accesses of this Get request as hits.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5616

Differential Revision: D16453187

Pulled By: HaoyuHuang

fbshipit-source-id: 56f3169cc322322305baaf5543226a0824fae19f
2019-07-29 10:58:15 -07:00
Manuel Ung
80d7067cb2 Use int64_t instead of ssize_t (#5638)
Summary:
The ssize_t type was introduced in https://github.com/facebook/rocksdb/pull/5633, but it seems like it's a POSIX specific type.

I just need a signed type to represent number of bytes, so use int64_t instead. It seems like we have a typedef from SSIZE_T for Windows, but it doesn't seem like we ever include "port/port.h" in our public header files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5638

Differential Revision: D16526269

Pulled By: lth

fbshipit-source-id: 8d3a5c41003951b74b29bc5f1d949b2b22da0cee
2019-07-26 16:36:49 -07:00
Levi Tamasi
3f89af1c39 Reduce the number of random iterations in compact_on_deletion_collector_test (#5635)
Summary:
This test frequently times out under TSAN; reducing the number of random
iterations to make it complete faster.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5635

Test Plan: buck test mode/dev-tsan internal_repo_rocksdb/repo:compact_on_deletion_collector_test

Differential Revision: D16523505

Pulled By: ltamasi

fbshipit-source-id: 6a69909bce9d204c891150fcb3d536547b3253d0
2019-07-26 15:53:34 -07:00
Manuel Ung
41df734830 WriteUnPrepared: Add new variable write_batch_flush_threshold (#5633)
Summary:
Instead of reusing `TransactionOptions::max_write_batch_size` for determining when to flush a write batch for write unprepared, add a new variable called `write_batch_flush_threshold` for this use case instead.

Also add `TransactionDBOptions::default_write_batch_flush_threshold` which sets the default value if `TransactionOptions::write_batch_flush_threshold` is unspecified.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5633

Differential Revision: D16520364

Pulled By: lth

fbshipit-source-id: d75ae5a2141ce7708982d5069dc3f0b58d250e8c
2019-07-26 12:56:26 -07:00
Manuel Ung
230b909da8 Fix PopSavePoint to merge info into the previous savepoint (#5628)
Summary:
Transaction::RollbackToSavePoint undos the modification made since the SavePoint beginning, and also unlocks the corresponding keys, which are tracked in the last SavePoint. Currently ::PopSavePoint simply discard these tracked keys, leaving them locked in the lock manager. This breaks a subsequent ::RollbackToSavePoint behavior as it loses track of such keys, and thus cannot unlock them. The patch fixes ::PopSavePoint by passing on the track key information to the previous SavePoint.
Fixes https://github.com/facebook/rocksdb/issues/5618
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5628

Differential Revision: D16505325

Pulled By: lth

fbshipit-source-id: 2bc3b30963ab4d36d996d1f66543c93abf358980
2019-07-26 11:39:30 -07:00
Manuel Ung
66b524a911 Simplify WriteUnpreparedTxnReadCallback and fix some comments (#5621)
Summary:
Simplify WriteUnpreparedTxnReadCallback so we just have one function `CalcMaxVisibleSeq`. Also, there's no need for the read callback to hold onto the transaction any more, so just hold the set of unprep_seqs, reducing about of indirection in `IsVisibleFullCheck`.

Also, some comments about using transaction snapshot were out of date, so remove them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5621

Differential Revision: D16459883

Pulled By: lth

fbshipit-source-id: cd581323fd18982e817d99af57b6eaba59e599bb
2019-07-24 10:25:26 -07:00
Mark Rambacher
cfcf045acc The ObjectRegistry class replaces the Registrar and NewCustomObjects.… (#5293)
Summary:
The ObjectRegistry class replaces the Registrar and NewCustomObjects.  Objects are registered with the registry by Type (the class must implement the static const char *Type() method).

This change is necessary for a few reasons:
- By having a class (rather than static template instances), the class can be passed between compilation units, meaning that objects could be registered and shared from a dynamic library with an executable.
- By having a class with instances, different units could have different objects registered.  This could be useful if, for example, one Option allowed for a dynamic library and one did not.

When combined with some other PRs (being able to load shared libraries, a Configurable interface to configure objects to/from string), this code will allow objects in external shared libraries to be added to a RocksDB image at run-time, rather than requiring every new extension to be built into the main library and called explicitly by every program.

Test plan (on riversand963's  devserver)
```
$COMPILE_WITH_ASAN=1 make -j32 all && sleep 1 && make check
```
All tests pass.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5293

Differential Revision: D16363396

Pulled By: riversand963

fbshipit-source-id: fbe4acb615bfc11103eef40a0b288845791c0180
2019-07-23 17:13:05 -07:00
Manuel Ung
eae832740b WriteUnPrepared: improve read your own write functionality (#5573)
Summary:
There are a number of fixes in this PR (with most bugs found via the added stress tests):
1. Re-enable reseek optimization. This was initially disabled to avoid infinite loops in https://github.com/facebook/rocksdb/pull/3955 but this can be resolved by remembering not to reseek after a reseek has already been done. This problem only affects forward iteration in `DBIter::FindNextUserEntryInternal`, as we already disable reseeking in `DBIter::FindValueForCurrentKeyUsingSeek`.
2. Verify that ReadOption.snapshot can be safely used for iterator creation. Some snapshots would not give correct results because snaphsot validation would not be enforced, breaking some assumptions in Prev() iteration.
3. In the non-snapshot Get() case, reads done at `LastPublishedSequence` may not be enough, because unprepared sequence numbers are not published. Use `std::max(published_seq, max_visible_seq)` to do lookups instead.
4. Add stress test to test reading own writes.
5. Minor bug in the allow_concurrent_memtable_write case where we forgot to pass in batch_per_txn_.
6. Minor performance optimization in `CalcMaxUnpreparedSequenceNumber` by assigning by reference instead of value.
7. Add some more comments everywhere.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5573

Differential Revision: D16276089

Pulled By: lth

fbshipit-source-id: 18029c944eb427a90a87dee76ac1b23f37ec1ccb
2019-07-23 08:08:19 -07:00
haoyuhuang
3778470061 Block cache analyzer: Compute correlation of features and human readable trace file. (#5596)
Summary:
- Compute correlation between a few features and predictions, e.g., number of accesses since the last access vs number of accesses till the next access on a block.
- Output human readable trace file so python can consume it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5596

Test Plan: make clean && USE_CLANG=1 make check -j32

Differential Revision: D16373200

Pulled By: HaoyuHuang

fbshipit-source-id: c848d26bc2e9210461f317d7dbee42d55be5a0cc
2019-07-22 17:51:34 -07:00
haoyuhuang
8a008d4170 Block access tracing: Trace referenced key for Get on non-data blocks. (#5548)
Summary:
This PR traces the referenced key for Get for all types of blocks. This is useful when evaluating hybrid row-block caches.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5548

Test Plan: make clean && USE_CLANG=1 make check -j32

Differential Revision: D16157979

Pulled By: HaoyuHuang

fbshipit-source-id: f6327411c9deb74e35e22a35f66cdbae09ab9d87
2019-07-17 13:05:58 -07:00
Venki Pallipadi
22ce462450 Export Import sst files (#5495)
Summary:
Refresh of the earlier change here - https://github.com/facebook/rocksdb/issues/5135

This is a review request for code change needed for - https://github.com/facebook/rocksdb/issues/3469
"Add support for taking snapshot of a column family and creating column family from a given CF snapshot"

We have an implementation for this that we have been testing internally. We have two new APIs that together provide this functionality.

(1) ExportColumnFamily() - This API is modelled after CreateCheckpoint() as below.
// Exports all live SST files of a specified Column Family onto export_dir,
// returning SST files information in metadata.
// - SST files will be created as hard links when the directory specified
//   is in the same partition as the db directory, copied otherwise.
// - export_dir should not already exist and will be created by this API.
// - Always triggers a flush.
virtual Status ExportColumnFamily(ColumnFamilyHandle* handle,
                                  const std::string& export_dir,
                                  ExportImportFilesMetaData** metadata);

Internally, the API will DisableFileDeletions(), GetColumnFamilyMetaData(), Parse through
metadata, creating links/copies of all the sst files, EnableFileDeletions() and complete the call by
returning the list of file metadata.

(2) CreateColumnFamilyWithImport() - This API is modeled after IngestExternalFile(), but invoked only during a CF creation as below.
// CreateColumnFamilyWithImport() will create a new column family with
// column_family_name and import external SST files specified in metadata into
// this column family.
// (1) External SST files can be created using SstFileWriter.
// (2) External SST files can be exported from a particular column family in
//     an existing DB.
// Option in import_options specifies whether the external files are copied or
// moved (default is copy). When option specifies copy, managing files at
// external_file_path is caller's responsibility. When option specifies a
// move, the call ensures that the specified files at external_file_path are
// deleted on successful return and files are not modified on any error
// return.
// On error return, column family handle returned will be nullptr.
// ColumnFamily will be present on successful return and will not be present
// on error return. ColumnFamily may be present on any crash during this call.
virtual Status CreateColumnFamilyWithImport(
    const ColumnFamilyOptions& options, const std::string& column_family_name,
    const ImportColumnFamilyOptions& import_options,
    const ExportImportFilesMetaData& metadata,
    ColumnFamilyHandle** handle);

Internally, this API creates a new CF, parses all the sst files and adds it to the specified column family, at the same level and with same sequence number as in the metadata. Also performs safety checks with respect to overlaps between the sst files being imported.

If incoming sequence number is higher than current local sequence number, local sequence
number is updated to reflect this.

Note, as the sst files is are being moved across Column Families, Column Family name in sst file
will no longer match the actual column family on destination DB. The API does not modify Column
Family name or id in the sst files being imported.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5495

Differential Revision: D16018881

fbshipit-source-id: 9ae2251025d5916d35a9fc4ea4d6707f6be16ff9
2019-07-17 12:27:14 -07:00
Manuel Ung
0acaa1a846 WriteUnPrepared: use tracked_keys_ to track keys needed for rollback (#5562)
Summary:
Currently, we are tracking keys we need to rollback via a separate structure specific to WriteUnprepared in write_set_keys_.

We already have a data structure called tracked_keys_ used to track which keys to unlock on transaction termination. This is exactly what we want, since we should only rollback keys that we have locked anyway.

Save some memory by reusing that data structure instead of making our own.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5562

Differential Revision: D16206484

Pulled By: lth

fbshipit-source-id: 5894d2b824a4b19062d84adbd6e6e86f00047488
2019-07-16 15:24:56 -07:00
Sergei Petrunia
61876614dc Fix MyRocks compile warnings-treated-as-errors on Fedora 30, gcc 9.1.1 (#5553)
Summary:
- Provide assignment operator in CompactionStats
- Provide a copy constructor for FileDescriptor
- Remove std::move from "return std::move(t)" in BoundedQueue
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5553

Differential Revision: D16230170

fbshipit-source-id: fd7c6e52390b2db1be24141e25649cf62424d078
2019-07-12 17:30:51 -07:00
haoyuhuang
1a59b6e2a9 Cache simulator: Add a ghost cache for admission control and a hybrid row-block cache. (#5534)
Summary:
This PR adds a ghost cache for admission control. Specifically, it admits an entry on its second access.
It also adds a hybrid row-block cache that caches the referenced key-value pairs of a Get/MultiGet request instead of its blocks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5534

Test Plan: make clean && COMPILE_WITH_ASAN=1 make check -j32

Differential Revision: D16101124

Pulled By: HaoyuHuang

fbshipit-source-id: b99edda6418a888e94eb40f71ece45d375e234b1
2019-07-11 12:43:29 -07:00
haoyuhuang
6ca3feed5c Fix -Werror=shadow (#5546)
Summary:
This PR fixes shadow errors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5546

Test Plan: make clean && make check -j32 && make clean && USE_CLANG=1 make check -j32 && make clean && COMPILE_WITH_ASAN=1 make check -j32

Differential Revision: D16147841

Pulled By: HaoyuHuang

fbshipit-source-id: 1043500d70c134185f537ab4c3900452752f1534
2019-07-08 00:12:43 -07:00
Yanqin Jin
7c76a7fba2 Support GetAllKeyVersions() for non-default cf (#5544)
Summary:
Previously `GetAllKeyVersions()` supports default column family only. This PR add support for other column families.

Test plan (devserver):
```
$make clean && COMPILE_WITH_ASAN=1 make -j32 db_basic_test
$./db_basic_test --gtest_filter=DBBasicTest.GetAllKeyVersions
```
All other unit tests must pass.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5544

Differential Revision: D16147551

Pulled By: riversand963

fbshipit-source-id: 5a61aece2a32d789e150226a9b8d53f4a5760168
2019-07-07 22:43:52 -07:00
anand76
e0d9d57750 Fix bugs in WAL trash file handling (#5520)
Summary:
1. Cleanup WAL trash files on open
2. Don't apply deletion rate limit if WAL dir is different from db dir
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5520

Test Plan: Add new unit tests and make check

Differential Revision: D16096750

Pulled By: anand1976

fbshipit-source-id: 6f07858ad864b754b711db416f0389c45ede599b
2019-07-06 21:07:32 -07:00
haoyuhuang
66464d1fde Remove multiple declarations o kMicrosInSecond.
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5526

Test Plan:
OPT=-g V=1 make J=1 unity_test -j32
make clean && make -j32

Differential Revision: D16079315

Pulled By: HaoyuHuang

fbshipit-source-id: 294ab439cf0db8dd5da44e30eabf0cbb2bb8c4f6
2019-07-01 15:15:12 -07:00
haoyuhuang
9f0bd56889 Cache simulator: Refactor the cache simulator so that we can add alternative policies easily (#5517)
Summary:
This PR creates cache_simulator.h file. It contains a CacheSimulator that runs against a block cache trace record. We can add alternative cache simulators derived from CacheSimulator later. For example, this PR adds a PrioritizedCacheSimulator that inserts filter/index/uncompressed dictionary blocks with high priority.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5517

Test Plan: make clean && COMPILE_WITH_ASAN=1 make check -j32

Differential Revision: D16043689

Pulled By: HaoyuHuang

fbshipit-source-id: 65f28ed52b866ffb0e6eceffd7f9ca7c45bb680d
2019-07-01 12:46:32 -07:00
feilongliu
0b0cb6f1a2 Fix segfalut in ~DBWithTTLImpl() when called after Close() (#5485)
Summary:
~DBWithTTLImpl() fails after calling Close() function (will invoke the
Close() function of DBImpl), because the Close() function deletes
default_cf_handle_ which is used in the GetOptions() function called
in ~DBWithTTLImpl(), hence lead to segfault.

Fix by creating a Close() function for the DBWithTTLImpl class and do
the close and the work originally in ~DBWithTTLImpl(). If the Close()
function is not called, it will be called in the ~DBWithTTLImpl()
function.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5485

Test Plan: make clean;  USE_CLANG=1 make all check -j

Differential Revision: D15924498

fbshipit-source-id: 567397fb972961059083a1ae0f9f99ff74872b78
2019-06-20 13:08:17 -07:00
Vaibhav Gogte
f46a2a0375 Export Cache::GetCharge (#5476)
Summary:
Exporting GetCharge to cache.hh
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5476

Differential Revision: D15881882

Pulled By: riversand963

fbshipit-source-id: 3d99084d10059b4fcaaaba240606ed50bc23351c
2019-06-18 17:35:41 -07:00
haoyuhuang
2d1dd5bce7 Support computing miss ratio curves using sim_cache. (#5449)
Summary:
This PR adds a BlockCacheTraceSimulator that reports the miss ratios given different cache configurations. A cache configuration contains "cache_name,num_shard_bits,cache_capacities". For example, "lru, 1, 1K, 2K, 4M, 4G".

When we replay the trace, we also perform lookups and inserts on the simulated caches.
In the end, it reports the miss ratio for each tuple <cache_name, num_shard_bits, cache_capacity> in a output file.

This PR also adds a main source block_cache_trace_analyzer so that we can run the analyzer in command line.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5449

Test Plan:
Added tests for block_cache_trace_analyzer.
COMPILE_WITH_ASAN=1 make check -j32.

Differential Revision: D15797073

Pulled By: HaoyuHuang

fbshipit-source-id: aef0c5c2e7938f3e8b6a10d4a6a50e6928ecf408
2019-06-17 16:41:12 -07:00
Maysam Yabandeh
60f3ec2ca5 Fix appveyor compliant about passing const to thread (#5447)
Summary:
CLANG would complain if we pass const to lambda function and appveyor complains if we don't (https://github.com/facebook/rocksdb/pull/5443). The patch fixes that by using the default capture mode.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5447

Differential Revision: D15788722

Pulled By: maysamyabandeh

fbshipit-source-id: 47e7f49264afe31fdafe42cb8bf93da126abfca9
2019-06-12 15:06:22 -07:00
Maysam Yabandeh
4a285d0dd3 Remove passing const variable to thread (#5443)
Summary:
CLANG complains that passing const to thread is not necessary. The patch removes it form PreparedHeap::Concurrent test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5443

Differential Revision: D15781598

Pulled By: maysamyabandeh

fbshipit-source-id: 3aceb05d96182fa4726d6d37eed45fd3aac4c016
2019-06-12 09:45:57 -07:00
Maysam Yabandeh
773f914a40 WritePrepared: switch PreparedHeap from priority_queue to deque (#5436)
Summary:
Internally PreparedHeap is currently using a priority_queue. The rationale was the in the initial design PreparedHeap::AddPrepared could be called in arbitrary order. With the recent optimizations, we call ::AddPrepared only from the main write queue, which results into in-order insertion into PreparedHeap. The patch thus replaces the underlying priority_queue with a more efficient deque implementation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5436

Differential Revision: D15752147

Pulled By: maysamyabandeh

fbshipit-source-id: e6960f2b2097e13137dded1ceeff3b10b03b0aeb
2019-06-11 19:55:14 -07:00
Manuel Ung
ca1aee2a19 WriteUnprepared: commit only from the 2nd queue (#5439)
Summary:
This is a port of this PR into WriteUnprepared:
https://github.com/facebook/rocksdb/pull/5014

This also reverts this test change to restore some flaky write unprepared
tests: https://github.com/facebook/rocksdb/pull/5315

Tested with:
$ gtest-parallel ./transaction_test --gtest_filter=MySQLStyleTransactionTest/MySQLStyleTransactionTest.TransactionStressTest/9 --repeat=128
[128/128] MySQLStyleTransactionTest/MySQLStyleTransactionTest.TransactionStressTest/9 (18250 ms)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5439

Differential Revision: D15761405

Pulled By: lth

fbshipit-source-id: ae2581fd942d8a5b3f9278fd6bc3c1ac0b2c964c
2019-06-11 18:01:39 -07:00
sdong
58c4aee42e TransactionUtil::CheckKey() to skip unnecessary history (#4941)
Summary:
If a memtable definitely covers a key, there isn't a need to check older memtables.
We can skip them by checking the earliest sequence number.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4941

Differential Revision: D13932666

fbshipit-source-id: b9d52f234b8ad9dd3bf6547645cd457175a3ca9b
2019-06-11 11:46:42 -07:00
Maysam Yabandeh
c292dc8540 WritePrepared: reduce prepared_mutex_ overhead (#5420)
Summary:
The patch reduces the contention over prepared_mutex_ using these techniques:
1) Move ::RemovePrepared() to be called from the commit callback when we have two write queues.
2) Use two separate mutex for PreparedHeap, one prepared_mutex_ needed for ::RemovePrepared, and one ::push_pop_mutex() needed for ::AddPrepared(). Given that we call ::AddPrepared only from the first write queue and ::RemovePrepared mostly from the 2nd, this will result into each the two write queues not competing with each other over a single mutex. ::RemovePrepared might occasionally need to acquire ::push_pop_mutex() if ::erase() ends up with calling ::pop()
3) Acquire ::push_pop_mutex() on the first callback of the write queue and release it on the last.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5420

Differential Revision: D15741985

Pulled By: maysamyabandeh

fbshipit-source-id: 84ce8016007e88bb6e10da5760ba1f0d26347735
2019-06-10 11:53:31 -07:00
Zhongyi Xie
d68f9f4580 simplify include directive involving inttypes (#5402)
Summary:
When using `PRIu64` type of printf specifier, current code base does the following:
```
#ifndef __STDC_FORMAT_MACROS
#define __STDC_FORMAT_MACROS
#endif
#include <inttypes.h>
```
However, this can be simplified to
```
#include <cinttypes>
```
as long as flag `-std=c++11` is used.
This should solve issues like https://github.com/facebook/rocksdb/issues/5159
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5402

Differential Revision: D15701195

Pulled By: miasantreble

fbshipit-source-id: 6dac0a05f52aadb55e9728038599d3d2e4b59d03
2019-06-06 13:56:07 -07:00
Maysam Yabandeh
ae05a83e19 Call ValidateOptions from SetOptions (#5368)
Summary:
Currently we validate options in DB::Open. However the validation step is missing when options are dynamically updated in ::SetOptions. The patch fixes that.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5368

Differential Revision: D15540101

Pulled By: maysamyabandeh

fbshipit-source-id: d27bbffd8f0252d1b50bcf59e0a70a278ed937f4
2019-06-03 19:49:57 -07:00
Siying Dong
5851cb7fdb Move util/trace_replay.* to trace_replay/ (#5376)
Summary:
util/ means for lower level libraries. trace_replay is highly integrated to DB and sometimes call DB. Move it out to a separate directory.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5376

Differential Revision: D15550938

Pulled By: siying

fbshipit-source-id: f46dce5ceffdc05a73f26379c7bb1b79ebe6c207
2019-06-03 13:25:26 -07:00
Siying Dong
000b9ec217 Move some logging related files to logging/ (#5387)
Summary:
Many logging related source files are under util/. It will be more structured if they are together.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5387

Differential Revision: D15579036

Pulled By: siying

fbshipit-source-id: 3850134ed50b8c0bb40a0c8ae1f184fa4081303f
2019-05-31 17:23:59 -07:00
Vijay Nadimpalli
cae22c53fb Make format
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5395

Differential Revision: D15581698

Pulled By: vjnadimpalli

fbshipit-source-id: f415972f16e784b1361714c202b97defcab46767
2019-05-31 15:24:43 -07:00
Vijay Nadimpalli
49c5a12dbe Organizing rocksdb/db directory
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5390

Differential Revision: D15579388

Pulled By: vjnadimpalli

fbshipit-source-id: 5bfc95e31554b8ff05b97b76d6534113f527f366
2019-05-31 11:57:01 -07:00
Siying Dong
cb094e13bb Auto roll logger to enforce options.keep_log_file_num immediately after a new file is created (#5370)
Summary:
Right now, with auto roll logger, options.keep_log_file_num enforcement is triggered by events like DB reopen or full obsolete scan happens. In the mean time, the size and number of log files can grow without a limit. We put a stronger enforcement to the option, so that the number of log files can always under control.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5370

Differential Revision: D15570413

Pulled By: siying

fbshipit-source-id: 0916c3c4d42ab8fdd29389ee7fd7e1557b03176e
2019-05-31 10:50:19 -07:00
Siying Dong
8843129ece Move some memory related files from util/ to memory/ (#5382)
Summary:
Move arena, allocator, and memory tools under util to a separate memory/ directory.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5382

Differential Revision: D15564655

Pulled By: siying

fbshipit-source-id: 9cd6b5d0d3d52b39606e19221fa154596e5852a5
2019-05-30 17:44:09 -07:00
Vijay Nadimpalli
50e470791d Organizing rocksdb/table directory by format
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5373

Differential Revision: D15559425

Pulled By: vjnadimpalli

fbshipit-source-id: 5d6d6d615582bedd96a4b879bb25d429a6de8b55
2019-05-30 14:51:11 -07:00
Siying Dong
e9e0101ca4 Move test related files under util/ to test_util/ (#5377)
Summary:
There are too many types of files under util/. Some test related files don't belong to there or just are just loosely related. Mo
ve them to a new directory test_util/, so that util/ is cleaner.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5377

Differential Revision: D15551366

Pulled By: siying

fbshipit-source-id: 0f5c8653832354ef8caa31749c0143815d719e2c
2019-05-30 11:25:51 -07:00
Siying Dong
545d206040 Move some file related files outside util/ (#5375)
Summary:
util/ means for lower level libraries, so it's a good idea to move the files which requires knowledge to DB out. Create a file/ and move some files there.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5375

Differential Revision: D15550935

Pulled By: siying

fbshipit-source-id: 61a9715dcde5386eebfb43e93f847bba1ae0d3f2
2019-05-29 20:47:06 -07:00
Maysam Yabandeh
eab4f49a2c WritePrepared: skip_concurrency_control option (#5330)
Summary:
This enables the user to set TransactionDBOptions::skip_concurrency_control so the standard `DB::Write(const WriteOptions& opts, WriteBatch* updates)` would skip the concurrency control. This would give higher throughput to the users who know their use case doesn't need concurrency control.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5330

Differential Revision: D15525932

Pulled By: maysamyabandeh

fbshipit-source-id: 68421ac1ba34f549a4a8de9ce4c2dccf6fb4b06b
2019-05-28 16:29:45 -07:00
Maysam Yabandeh
f5576c3317 WritePrepared: disableWAL in commit without prepare (#5327)
Summary:
When committing a transaction without prepare, WritePrepared simply writes the batch to db and add the commit entry to CommitCache. When two_write_queues=true, following the rule of committing only from 2nd write queue, the first write, writes the batch and the only thing the 2nd write does is to write the commit entry to CommitCache. Currently the write batch in 2nd write is set to an empty LogData entry, while the write to the WAL could simply be entirely disabled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5327

Differential Revision: D15424546

Pulled By: maysamyabandeh

fbshipit-source-id: 3d9ea3922d5196984c584d62a3ed57e1f7ca7b9f
2019-05-28 14:21:52 -07:00
Maysam Yabandeh
5c0e304170 WritePrepared: Clarify the need for two_write_queues in unordered_write (#5313)
Summary:
WritePrepared transactions when configured with two_write_queues=true offers higher throughput with unordered_write feature without however compromising the rocksdb guarantees. This is because it performs ordering among writes in a 2nd step that is not tied to memtable write speed. The 2nd step is naturally provided by 2PC when the commit phase does the ordering as well. Without 2PC, the 2nd step would only be provided when we use two_write_queues=true, where WritePrepared after performing the writes, in a 2nd step uses the 2nd queue to assign order to the writes.
The patch clarifies the need for two_write_queues=true in the HISTORY and inline comments of unordered_writes. Moreover it extends the stress tests of WritePrepared to unordred_write.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5313

Differential Revision: D15379977

Pulled By: maysamyabandeh

fbshipit-source-id: 5b6f05b9b59285dcbf3b0532215ba9fe7d926e00
2019-05-20 07:49:20 -07:00
Maysam Yabandeh
c71f5bb9aa Disable WriteUnPrepared stress tests (#5315)
Summary:
They are kind of flaky at the moment. Will re-enable it when flakiness is fixed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5315

Differential Revision: D15382744

Pulled By: maysamyabandeh

fbshipit-source-id: 8b2f9d81a4bb34bfd51481727a682d5cd063c5e3
2019-05-16 15:39:33 -07:00
Maysam Yabandeh
f0e8216197 WritePrepared: Fix deadlock in WriteRecoverableState (#5306)
Summary:
The recent improvement in https://github.com/facebook/rocksdb/pull/3661 could cause a deadlock: When writing recoverable state, we also commit its sequence number to commit table, which could result into evicting existing commit entry, which could result into advancing max_evicted_seq_, which would need to get snapshots from database, which requires obtaining db mutex. The patch releases db_mutex before calling the callback in WriteRecoverableState to avoid the potential deadlock. It also improves the stress tests to let the issue be manifested in the tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5306

Differential Revision: D15341458

Pulled By: maysamyabandeh

fbshipit-source-id: 05dcbed7e21b789fd1e5fd5ee8eea08077162323
2019-05-15 13:53:54 -07:00
Thomas Fersch
a42757607d Use pre-increment instead of post-increment for iterators (#5296)
Summary:
Google C++ style guide indicates pre-increment should be used for iterators: https://google.github.io/styleguide/cppguide.html#Preincrement_and_Predecrement. Replaced all instances of ' it++' by ' ++it' (where type is iterator). So this covers the cases where iterators are named 'it'.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5296

Differential Revision: D15301256

Pulled By: tfersch

fbshipit-source-id: 2803483c1392504ad3b281d21db615429c71114b
2019-05-15 13:19:15 -07:00
Maysam Yabandeh
f383641a1d Unordered Writes (#5218)
Summary:
Performing unordered writes in rocksdb when unordered_write option is set to true. When enabled the writes to memtable are done without joining any write thread. This offers much higher write throughput since the upcoming writes would not have to wait for the slowest memtable write to finish. The tradeoff is that the writes visible to a snapshot might change over time. If the application cannot tolerate that, it should implement its own mechanisms to work around that. Using TransactionDB with WRITE_PREPARED write policy is one way to achieve that. Doing so increases the max throughput by 2.2x without however compromising the snapshot guarantees.
The patch is prepared based on an original by siying
Existing unit tests are extended to include unordered_write option.

Benchmark Results:
```
TEST_TMPDIR=/dev/shm/ ./db_bench_unordered --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions  --unordered_write=1
```
With WAL
- Vanilla RocksDB: 78.6 MB/s
- WRITER_PREPARED with unordered_write: 177.8 MB/s (2.2x)
- unordered_write: 368.9 MB/s (4.7x with relaxed snapshot guarantees)

Without WAL
- Vanilla RocksDB: 111.3 MB/s
- WRITER_PREPARED with unordered_write: 259.3 MB/s MB/s (2.3x)
- unordered_write: 645.6 MB/s (5.8x with relaxed snapshot guarantees)

- WRITER_PREPARED with unordered_write disable concurrency control: 185.3 MB/s MB/s (2.35x)

Limitations:
- The feature is not yet extended to `max_successive_merges` > 0. The feature is also incompatible with `enable_pipelined_write` = true as well as with `allow_concurrent_memtable_write` = false.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5218

Differential Revision: D15219029

Pulled By: maysamyabandeh

fbshipit-source-id: 38f2abc4af8780148c6128acdba2b3227bc81759
2019-05-13 17:47:21 -07:00
anand76
1c8cbf315f Extend MultiGet batching to Transactions (#5210)
Summary:
MultiGet batching was implemented in #5011 in order to reduce CPU utilization when looking up multiple keys at once. This PR implements corresponding ```MultiGet``` and ```MultiGetSingleCFForUpdate``` in ```rocksdb::Transaction``` that call the underlying batching implementation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5210

Differential Revision: D15048164

Pulled By: anand1976

fbshipit-source-id: c52f6043102ab0cbc723f4cba2a7b7d1767f6f52
2019-04-23 14:11:26 -07:00
Andrew Kryczka
8272a6de57 Optionally wait on bytes_per_sync to smooth I/O (#5183)
Summary:
The existing implementation does not guarantee bytes reach disk every `bytes_per_sync` when writing SST files, or every `wal_bytes_per_sync` when writing WALs. This can cause confusing behavior for users who enable this feature to avoid large syncs during flush and compaction, but then end up hitting them anyways.

My understanding of the existing behavior is we used `sync_file_range` with `SYNC_FILE_RANGE_WRITE` to submit ranges for async writeback, such that we could continue processing the next range of bytes while that I/O is happening. I believe we can preserve that benefit while also limiting how far the processing can get ahead of the I/O, which prevents huge syncs from happening when the file finishes.

Consider this `sync_file_range` usage: `sync_file_range(fd_, 0, static_cast<off_t>(offset + nbytes), SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE)`. Expanding the range to start at 0 and adding the `SYNC_FILE_RANGE_WAIT_BEFORE` flag causes any pending writeback (like from a previous call to `sync_file_range`) to finish before it proceeds to submit the latest `nbytes` for writeback. The latest `nbytes` are still written back asynchronously, unless processing exceeds I/O speed, in which case the following `sync_file_range` will need to wait on it.

There is a second change in this PR to use `fdatasync` when `sync_file_range` is unavailable (determined statically) or has some known problem with the underlying filesystem (determined dynamically).

The above two changes only apply when the user enables a new option, `strict_bytes_per_sync`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5183

Differential Revision: D14953553

Pulled By: siying

fbshipit-source-id: 445c3862e019fb7b470f9c7f314fc231b62706e9
2019-04-22 11:51:39 -07:00
jsteemann
de76909464 refactor SavePoints (#5192)
Summary:
Savepoints are assumed to be used in a stack-wise fashion (only
the top element should be used), so they were stored by `WriteBatch`
in a member variable `save_points` using an std::stack.

Conceptually this is fine, but the implementation had a few issues:
- the `save_points_` instance variable was a plain pointer to a heap-
  allocated `SavePoints` struct. The destructor of `WriteBatch` simply
  deletes this pointer. However, the copy constructor of WriteBatch
  just copied that pointer, meaning that copying a WriteBatch with
  active savepoints will very likely have crashed before. Now a proper
  copy of the savepoints is made in the copy constructor, and not just
  a copy of the pointer
- `save_points_` was an std::stack, which defaults to `std::deque` for
  the underlying container. A deque is a bit over the top here, as we
  only need access to the most recent savepoint (i.e. stack.top()) but
  never any elements at the front. std::deque is rather expensive to
  initialize in common environments. For example, the STL implementation
  shipped with GNU g++ will perform a heap allocation of more than 500
  bytes to create an empty deque object. Although the `save_points_`
  container is created lazily by RocksDB, moving from a deque to a plain
  `std::vector` is much more memory-efficient. So `save_points_` is now
  a vector.
- `save_points_` was changed from a plain pointer to an `std::unique_ptr`,
  making ownership more explicit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5192

Differential Revision: D15024074

Pulled By: maysamyabandeh

fbshipit-source-id: 5b128786d3789cde94e46465c9e91badd07a25d7
2019-04-19 20:33:04 -07:00
Fosco Marotto
6c2bf9e916 Add copyright headers per FB open-source checkup tool. (#5199)
Summary:
internal task: T35568575
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5199

Differential Revision: D14962794

Pulled By: gfosco

fbshipit-source-id: 93838ede6d0235eaecff90d200faed9a8515bbbe
2019-04-18 10:55:01 -07:00
Zhongyi Xie
baa5302447 Avoid double-compacting data in bottom level in manual compactions (#5138)
Summary:
Depending on the config, manual compaction (leveled compaction style) does following compactions:
L0->L1
L1->L2
...
Ln-1 -> Ln
Ln -> Ln
The final Ln -> Ln compaction is partly unnecessary as it recompacts all the files that were just generated by the Ln-1 -> Ln. We should avoid recompacting such files. This rule should be applied to Lmax only.
Resolves issue https://github.com/facebook/rocksdb/issues/4995
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5138

Differential Revision: D14940106

Pulled By: miasantreble

fbshipit-source-id: 8d3cf5507a17e76f3333cfd4bac5256d005636e5
2019-04-16 23:32:20 -07:00
jsteemann
8295d364e2 Improve transaction lock details (#5193)
Summary:
This branch contains two small improvements:
* Create `LockMap` entries using `std::make_shared`. This saves one heap allocation per LockMap entry but also locates the control block and the LockMap object closely together in memory, which can help with caching
* Reorder the members of `TrackedTrxInfo`, so that the resulting struct uses less memory (at least on 64bit systems)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5193

Differential Revision: D14934536

Pulled By: maysamyabandeh

fbshipit-source-id: f7b49812bb4b6029eef9d131e7cd56260df5b28e
2019-04-15 10:44:03 -07:00
Manuel Ung
d655a3aab7 Remove extraneous call to TrackKey (#5173)
Summary:
In `PessimisticTransaction::TryLock`, we were calling `TrackKey` even when assume_tracked=true, which defeats the purpose of assume_tracked. Remove this.

For keys that are already tracked, TrackKey will actually bump some counters (num_reads/num_writes) which are consumed in `TransactionBaseImpl::GetTrackedKeysSinceSavePoint`, and this is used to determine which keys were tracked since the last savepoint. I believe this functionality should still work, since I think the user should not call GetForUpdate/Put(assume_tracked=true) across savepoints, and if they do, they should not expect the Put(assume_tracked=true) to show up as a tracked key in the second savepoint.

This is another 2-3% cpu improvement.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5173

Differential Revision: D14883809

Pulled By: lth

fbshipit-source-id: 7d09f0772da422384af0519773e310c22b0cbca3
2019-04-12 16:37:12 -07:00