Commit Graph

888 Commits

Author SHA1 Message Date
Yanqin Jin
e66199d848 First step towards handling MANIFEST write error (#6949)
Summary:
This PR provides preliminary support for handling IO error during MANIFEST write.
File write/sync is not guaranteed to be atomic. If we encounter an IOError while writing/syncing to the MANIFEST file, we cannot be sure about the state of the MANIFEST file. The version edits may or may not have reached the file. During cleanup, if we delete the newly-generated SST files referenced by the pending version edit(s), but the version edit(s) actually are persistent in the MANIFEST, then next recovery attempt will process the version edits(s) and then fail since the SST files have already been deleted.
One approach is to truncate the MANIFEST after write/sync error, so that it is safe to delete the SST files. However, file truncation may not be supported on certain file systems. Therefore, we take the following approach.
If an IOError is detected during MANIFEST write/sync, we disable file deletions for the faulty database. Depending on whether the IOError is retryable (set by underlying file system), either RocksDB or application can call `DB::Resume()`, or simply shutdown and restart. During `Resume()`, RocksDB will try to switch to a new MANIFEST and write all existing in-memory version storage in the new file. If this succeeds, then RocksDB may proceed. If all recovery is completed, then file deletions will be re-enabled.
Note that multiple threads can call `LogAndApply()` at the same time, though only one of them will be going through the process MANIFEST write, possibly batching the version edits of other threads. When the leading MANIFEST writer finishes, all of the MANIFEST writing threads in this batch will have the same IOError. They will all call `ErrorHandler::SetBGError()` in which file deletion will be disabled.

Possible future directions:
- Add an `ErrorContext` structure so that it is easier to pass more info to `ErrorHandler`. Currently, as in this example, a new `BackgroundErrorReason` has to be added.

Test plan (dev server):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6949

Reviewed By: anand1976

Differential Revision: D22026020

Pulled By: riversand963

fbshipit-source-id: f3c68a2ef45d9b505d0d625c7c5e0c88495b91c8
2020-06-24 19:07:08 -07:00
Zitan Chen
88db97b06d Add a DB Session ID (#6959)
Summary:
Added DB::GetDbSessionId by using the same format and machinery as DB::GetDbIdentity.
The DB Session ID is generated (and therefore, updated) each time a DB object is opened. It is written to the LOG file right after the line of “DB SUMMARY”.
A test for the uniqueness, for different openings and during the same opening, is also added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6959

Test Plan: Passed make check

Reviewed By: zhichao-cao

Differential Revision: D21951721

Pulled By: gg814

fbshipit-source-id: 958a48a612db49a39998ea703cded45987d3fa8b
2020-06-15 10:47:02 -07:00
sdong
afa3518839 Revert "Update googletest from 1.8.1 to 1.10.0 (#6808)" (#6923)
Summary:
This reverts commit 8d87e9cea1.

Based on offline discussions, it's too early to upgrade to gtest 1.10, as it prevents some developers from using an older version of gtest to integrate to some other systems. Revert it for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6923

Reviewed By: pdillinger

Differential Revision: D21864799

fbshipit-source-id: d0726b1ff649fc911b9378f1763316200bd363fc
2020-06-03 15:55:03 -07:00
Peter Dillinger
14eca6bf04 For ApproximateSizes, pro-rate table metadata size over data blocks (#6784)
Summary:
The implementation of GetApproximateSizes was inconsistent in
its treatment of the size of non-data blocks of SST files, sometimes
including and sometimes now. This was at its worst with large portion
of table file used by filters and querying a small range that crossed
a table boundary: the size estimate would include large filter size.

It's conceivable that someone might want only to know the size in terms
of data blocks, but I believe that's unlikely enough to ignore for now.
Similarly, there's no evidence the internal function AppoximateOffsetOf
is used for anything other than a one-sided ApproximateSize, so I intend
to refactor to remove redundancy in a follow-up commit.

So to fix this, GetApproximateSizes (and implementation details
ApproximateSize and ApproximateOffsetOf) now consistently include in
their returned sizes a portion of table file metadata (incl filters
and indexes) based on the size portion of the data blocks in range. In
other words, if a key range covers data blocks that are X% by size of all
the table's data blocks, returned approximate size is X% of the total
file size. It would technically be more accurate to attribute metadata
based on number of keys, but that's not computationally efficient with
data available and rarely a meaningful difference.

Also includes miscellaneous comment improvements / clarifications.

Also included is a new approximatesizerandom benchmark for db_bench.
No significant performance difference seen with this change, whether ~700 ops/sec with cache_index_and_filter_blocks and small cache or ~150k ops/sec without cache_index_and_filter_blocks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6784

Test Plan:
Test added to DBTest.ApproximateSizesFilesWithErrorMargin.
Old code running new test...

    [ RUN      ] DBTest.ApproximateSizesFilesWithErrorMargin
    db/db_test.cc:1562: Failure
    Expected: (size) <= (11 * 100), actual: 9478 vs 1100

Other tests updated to reflect consistent accounting of metadata.

Reviewed By: siying

Differential Revision: D21334706

Pulled By: pdillinger

fbshipit-source-id: 6f86870e45213334fedbe9c73b4ebb1d8d611185
2020-06-02 12:30:23 -07:00
Adam Retter
8d87e9cea1 Update googletest from 1.8.1 to 1.10.0 (#6808)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6808

Reviewed By: anand1976

Differential Revision: D21483984

Pulled By: pdillinger

fbshipit-source-id: 70c5eff2bd54ddba469761d95e4cd4611fb8e598
2020-06-01 20:33:42 -07:00
Levi Tamasi
a00ddf1574 Expose the set of live blob files from Version/VersionSet (#6785)
Summary:
The patch adds logic that returns the set of live blob files from
`Version::AddLiveFiles` and `VersionSet::AddLiveFiles` (in addition to
live table files), and also cleans up the code a bit, for example, by
exposing only the numbers of table files as opposed to the earlier
`FileDescriptor`s that no clients used. Moreover, the patch extends
the `GetLiveFiles` API so that it also exposes blob files in the current version.
Similarly to https://github.com/facebook/rocksdb/pull/6755,
this is a building block for identifying and purging obsolete blob files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6785

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D21336210

Pulled By: ltamasi

fbshipit-source-id: fc1aede8a49eacd03caafbc5f6f9ce43b6270821
2020-05-04 15:08:13 -07:00
Peter Dillinger
31da5e34c1 C++20 compatibility (#6697)
Summary:
Based on https://github.com/facebook/rocksdb/issues/6648 (CLA Signed), but heavily modified / extended:

* Implicit capture of this via [=] deprecated in C++20, and [=,this] not standard before C++20 -> now using explicit capture lists
* Implicit copy operator deprecated in gcc 9 -> add explicit '= default' definition
* std::random_shuffle deprecated in C++17 and removed in C++20 -> migrated to a replacement in RocksDB random.h API
* Add the ability to build with different std version though -DCMAKE_CXX_STANDARD=11/14/17/20 on the cmake command line
* Minimal rebuild flag of MSVC is deprecated and is forbidden with /std:c++latest (C++20)
* Added MSVC 2019 C++11 & MSVC 2019 C++20 in AppVeyor
* Added GCC 9 C++11 & GCC9 C++20 in Travis
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6697

Test Plan: make check and CI

Reviewed By: cheng-chang

Differential Revision: D21020318

Pulled By: pdillinger

fbshipit-source-id: 12311be5dbd8675a0e2c817f7ec50fa11c18ab91
2020-04-20 13:24:25 -07:00
Akanksha Mahajan
a0faff126d Report kFilesMarkedForCompaction for delete triggered compactions (#6680)
Summary:
Summary : Set manual_compaction false in case of DeleteTriggeredCompaction object so that kFilesMarkedForComapaction can be reported.
          Added a DeletionTriggeredUniversalCompactionMarking test case for Deletion Triggered compaction in case of Universal Compaction.

Test Plan : make check -j64
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6680

Test Plan: make check -j64

Reviewed By: anand1976

Differential Revision: D20945946

Pulled By: akankshamahajan15

fbshipit-source-id: af84e417bd7127652aaae9143c560d1ab3815d25
2020-04-10 15:30:38 -07:00
Cheng Chang
402da454cb Migrate AppVeyor to CircleCI (#6518)
Summary:
CircleCI is the new recommended CI system internally.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6518

Test Plan: Watch https://app.circleci.com/pipelines/github/facebook/rocksdb

Differential Revision: D20454743

Pulled By: cheng-chang

fbshipit-source-id: 39031568d6c1d3d25b7fbd78fa9a0e6067ddc47c
2020-03-13 21:58:51 -07:00
Levi Tamasi
c15e85bdcb Move BlobDB related files under db/ to db/blob/ (#6519)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6519

Test Plan:
```
make all
make check
```

Differential Revision: D20400691

Pulled By: ltamasi

fbshipit-source-id: 20ef911cf1c2c92c7f71ef0b493f9be64f2eef94
2020-03-12 11:00:56 -07:00
Cheng Chang
afb97094ae Skip high levels with no key falling in the range in CompactRange (#6482)
Summary:
In CompactRange, if there is no key in memtable falling in the specified range, then flush is skipped.
This PR extends this skipping logic to SST file levels: it starts compaction from the highest level (starting from L0) that has files with key falling in the specified range, instead of always starts from L0.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6482

Test Plan:
A new test ManualCompactionTest::SkipLevel is added.

Also updated a test related to statistics of index block cache hit in db_test2, the index cache hit is increased by 1 in this PR because when checking overlap for the key range in L0, OverlapWithLevelIterator will do a seek in the table cache iterator, which will read from the cached index.

Also updated db_compaction_test and db_test to use correct range for full compaction.

Differential Revision: D20251149

Pulled By: cheng-chang

fbshipit-source-id: f822157cf4796972bd5035d9d7178d8dfb7af08b
2020-03-04 20:15:25 -08:00
sdong
fdf882ded2 Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433)
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433

Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.

Differential Revision: D19977691

fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
2020-02-20 12:09:57 -08:00
sdong
69c8614815 Avoid to get manifest file size when recovering from it. (#6369)
Summary:
Right now RocksDB gets manifest file size before recovering from it. The information is available in LogReader. Use it instead to prevent one file system call.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6369

Test Plan: Run all existing tests

Differential Revision: D19714872

fbshipit-source-id: 0144be324d403c99e3da875ea2feccc8f64e883d
2020-02-04 11:39:23 -08:00
sdong
f10f135938 Fix regression bug of hash index with iterator total order seek (#6328)
Summary:
https://github.com/facebook/rocksdb/pull/6028 introduces a bug for hash index in SST files. If a table reader is created when total order seek is used, prefix_extractor might be passed into table reader as null. While later when prefix seek is used, the same table reader used, hash index is checked but prefix extractor is null and the program would crash.
Fix the issue by fixing http://github.com/facebook/rocksdb/pull/6028 in the way that prefix_extractor is preserved but ReadOptions.total_order_seek is checked

Also, a null pointer check is added so that a bug like this won't cause segfault in the future.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6328

Test Plan: Add a unit test that would fail without the fix. Stress test that reproduces the crash would pass.

Differential Revision: D19586751

fbshipit-source-id: 8de77690167ddf5a77a01e167cf89430b1bfba42
2020-01-27 15:44:54 -08:00
Connor1996
3e26a94ba1 Add oldest snapshot sequence property (#6228)
Summary:
Add oldest snapshot sequence property, so we can use `db.GetProperty("rocksdb.oldest-snapshot-sequence")` to get the sequence number of the oldest snapshot.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6228

Differential Revision: D19264145

Pulled By: maysamyabandeh

fbshipit-source-id: 67fbe5304d89cbc475bd404e30d1299f7b11c010
2020-01-07 08:36:44 -08:00
sdong
aa1857e2df Support options.max_open_files = -1 with periodic_compaction_seconds (#6090)
Summary:
options.periodic_compaction_seconds isn't supported when options.max_open_files != -1. It's because that the information of file creation time is stored in table properties and are not guaranteed to be loaded unless options.max_open_files = -1. Relax this constraint by storing the information in manifest.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6090

Test Plan: Pass all existing tests; Modify an existing test to force the manifest value to take 0 to simulate backward compatibility case; manually open the DB generated with the change by release 4.2.

Differential Revision: D18702268

fbshipit-source-id: 13e0bd94f546498a04f3dc5fc0d9dff5125ec9eb
2019-11-26 21:39:56 -08:00
sdong
77eab5c85a Make default value of options.ttl to be 30 days when it is supported. (#6073)
Summary:
By default options.ttl is disabled. We believe a better default will be 30 days, which means deleted data the database will be removed from SST files slightly after 30 days, for most of the cases.

Make the default UINT64_MAX - 1 to indicate that it is not overridden by users.

Change periodic_compaction_seconds to be UINT64_MAX - 1 to UINT64_MAX  too to be consistent. Also fix a small bug in the previous periodic_compaction_seconds default code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6073

Test Plan: Add unit tests for it.

Differential Revision: D18669626

fbshipit-source-id: 957cd4374cafc1557d45a0ba002010552a378cc8
2019-11-26 10:00:32 -08:00
sdong
d8c28e692a Support options.ttl with options.max_open_files = -1 (#6060)
Summary:
Previously, options.ttl cannot be set with options.max_open_files = -1, because it makes use of creation_time field in table properties, which is not available unless max_open_files = -1. With this commit, the information will be stored in manifest and when it is available, will be used instead.

Note that, this change will break forward compatibility for release 5.1 and older.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6060

Test Plan: Extend existing test case to options.max_open_files != -1, and simulate backward compatility in one test case by forcing the value to be 0.

Differential Revision: D18631623

fbshipit-source-id: 30c232a8672de5432ce9608bb2488ecc19138830
2019-11-22 21:23:00 -08:00
Levi Tamasi
f80050fa8f Add file number/oldest referenced blob file number to {Sst,Live}FileMetaData (#6011)
Summary:
The patch exposes the file numbers of the SSTs as well as the oldest blob
files they contain a reference to through the GetColumnFamilyMetaData/
GetLiveFilesMetaData interface.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6011

Test Plan:
Fixed and extended the existing unit tests. (The earlier ColumnFamilyMetaDataTest
wasn't really testing anything because the generated memtables were never
flushed, so the metadata structure was essentially empty.)

Differential Revision: D18361697

Pulled By: ltamasi

fbshipit-source-id: d5ed1d94ac70858b84393c48711441ddfe1251e9
2019-11-07 14:04:16 -08:00
Vijay Nadimpalli
1075c376ef Fix for lite build (#5971)
Summary:
Fix for lite build
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5971

Test Plan: make J=1 -j64  LITE=1 all check

Differential Revision: D18148306

Pulled By: vjnadimpalli

fbshipit-source-id: 5b9a3edc3e73e054fee6b96e6f6e583cecc898f3
2019-10-25 18:22:24 -07:00
Vijay Nadimpalli
ec880436c1 API to get file_creation_time of the oldest file in the DB (#5948)
Summary:
Adding a new API to db.h that allows users to get file_creation_time of the oldest file in the DB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5948

Test Plan: Added unit test.

Differential Revision: D18056151

Pulled By: vjnadimpalli

fbshipit-source-id: 448ec9d34cb6772e1e5a62db399ace00dcbfbb5d
2019-10-25 11:53:57 -07:00
sdong
e8263dbdaa Apply formatter to recent 200+ commits. (#5830)
Summary:
Further apply formatter to more recent commits.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5830

Test Plan: Run all existing tests.

Differential Revision: D17488031

fbshipit-source-id: 137458fd94d56dd271b8b40c522b03036943a2ab
2019-09-20 12:04:26 -07:00
andrew
622683000c Allow users to stop manual compactions (#3971)
Summary:
Manual compaction may bring in very high load because sometime the amount of data involved in a compaction could be large, which may affect online service. So it would be good if the running compaction making the server busy can be stopped immediately. In this implementation, stopping manual compaction condition is only checked in slow process. We let deletion compaction and trivial move go through.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3971

Test Plan: add tests at more spots.

Differential Revision: D17369043

fbshipit-source-id: 575a624fb992ce0bb07d9443eb209e547740043c
2019-09-16 21:01:47 -07:00
sdong
b931f84e56 Divide file_reader_writer.h and .cc (#5803)
Summary:
file_reader_writer.h and .cc contain several files and helper function, and it's hard to navigate. Separate it to multiple files and put them under file/
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5803

Test Plan: Build whole project using make and cmake.

Differential Revision: D17374550

fbshipit-source-id: 10efca907721e7a78ed25bbf74dc5410dea05987
2019-09-16 10:33:51 -07:00
Wilfried Goesgens
fbab9913e2 upgrade gtest 1.7.0 => 1.8.1 for json result writing
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5332

Differential Revision: D17242232

fbshipit-source-id: c0d4646556a1335e51ac7382b986ca7f6ced7b64
2019-09-09 11:24:11 -07:00
Affan Dar
229e6fbe0e Adding DB::GetCurrentWalFile() API as a repliction/backup helper (#5765)
Summary:
Adding a light weight API to get last live WAL file name and size. Meant to be used as a helper for backup/restore tooling in a larger ecosystem such as MySQL with a MyRocks storage engine.

Specifically within MySQL's backup/restore mechanism, this call can be made with a write lock on the mysql db to get a transactionally consistent snapshot of the current WAL file position along with other non-rocksdb log/data files.

Without this, the alternative would be to take the aforementioned lock, scan the WAL dir for all files, find the last file and note its exact size as the rocksdb 'checkpoint'.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5765

Differential Revision: D17172717

Pulled By: affandar

fbshipit-source-id: f2fabafd4c0e6fc45f126670c8c88a9f84cb8a37
2019-09-04 12:10:17 -07:00
Zhongyi Xie
2f41ecfe75 Refactor trimming logic for immutable memtables (#5022)
Summary:
MyRocks currently sets `max_write_buffer_number_to_maintain` in order to maintain enough history for transaction conflict checking. The effectiveness of this approach depends on the size of memtables. When memtables are small, it may not keep enough history; when memtables are large, this may consume too much memory.
We are proposing a new way to configure memtable list history: by limiting the memory usage of immutable memtables. The new option is `max_write_buffer_size_to_maintain` and it will take precedence over the old `max_write_buffer_number_to_maintain` if they are both set to non-zero values. The new option accounts for the total memory usage of flushed immutable memtables and mutable memtable. When the total usage exceeds the limit, RocksDB may start dropping immutable memtables (which is also called trimming history), starting from the oldest one.
The semantics of the old option actually works both as an upper bound and lower bound. History trimming will start if number of immutable memtables exceeds the limit, but it will never go below (limit-1) due to history trimming.
In order the mimic the behavior with the new option, history trimming will stop if dropping the next immutable memtable causes the total memory usage go below the size limit. For example, assuming the size limit is set to 64MB, and there are 3 immutable memtables with sizes of 20, 30, 30. Although the total memory usage is 80MB > 64MB, dropping the oldest memtable will reduce the memory usage to 60MB < 64MB, so in this case no memtable will be dropped.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5022

Differential Revision: D14394062

Pulled By: miasantreble

fbshipit-source-id: 60457a509c6af89d0993f988c9b5c2aa9e45f5c5
2019-08-23 13:55:34 -07:00
sdong
e1c468d16f Do readahead in VerifyChecksum() (#5713)
Summary:
Right now VerifyChecksum() doesn't do read-ahead. In some use cases, users won't be able to achieve good performance. With this change, by default, RocksDB will do a default readahead, and users will be able to overwrite the readahead size by passing in a ReadOptions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5713

Test Plan: Add a new unit test.

Differential Revision: D16860874

fbshipit-source-id: 0cff0fe79ac855d3d068e6ccd770770854a68413
2019-08-16 16:42:56 -07:00
Yanqin Jin
5d9a67e718 Support loading custom objects in unit tests (#5676)
Summary:
Most existing RocksDB unit tests run on `Env::Default()`. It will be useful to port the unit tests to non-default environments, e.g. `HdfsEnv`, etc.
This pull request is one step towards this goal. If RocksDB unit tests are built with a static library exposing a function `RegisterCustomObjects()`, then it is possible to implement custom object registrar logic in the library. RocksDB unit test can call `RegisterCustomObjects()` at the beginning.
By default, `ROCKSDB_UNITTESTS_WITH_CUSTOM_OBJECTS_FROM_STATIC_LIBS` is not defined, thus this PR has no impact on existing RocksDB because `RegisterCustomObjects()` is a noop.
Test plan (on devserver):
```
$make clean && COMPILE_WITH_ASAN=1 make -j32 all
$make check
```
All unit tests must pass.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5676

Differential Revision: D16679157

Pulled By: riversand963

fbshipit-source-id: aca571af3fd0525277cdc674248d0fe06e060f9d
2019-08-09 15:12:08 -07:00
Vijay Nadimpalli
d150e01474 New API to get all merge operands for a Key (#5604)
Summary:
This is a new API added to db.h to allow for fetching all merge operands associated with a Key. The main motivation for this API is to support use cases where doing a full online merge is not necessary as it is performance sensitive. Example use-cases:
1. Update subset of columns and read subset of columns -
Imagine a SQL Table, a row is encoded as a K/V pair (as it is done in MyRocks). If there are many columns and users only updated one of them, we can use merge operator to reduce write amplification. While users only read one or two columns in the read query, this feature can avoid a full merging of the whole row, and save some CPU.
2. Updating very few attributes in a value which is a JSON-like document -
Updating one attribute can be done efficiently using merge operator, while reading back one attribute can be done more efficiently if we don't need to do a full merge.
----------------------------------------------------------------------------------------------------
API :
Status GetMergeOperands(
      const ReadOptions& options, ColumnFamilyHandle* column_family,
      const Slice& key, PinnableSlice* merge_operands,
      GetMergeOperandsOptions* get_merge_operands_options,
      int* number_of_operands)

Example usage :
int size = 100;
int number_of_operands = 0;
std::vector<PinnableSlice> values(size);
GetMergeOperandsOptions merge_operands_info;
db_->GetMergeOperands(ReadOptions(), db_->DefaultColumnFamily(), "k1", values.data(), merge_operands_info, &number_of_operands);

Description :
Returns all the merge operands corresponding to the key. If the number of merge operands in DB is greater than merge_operands_options.expected_max_number_of_operands no merge operands are returned and status is Incomplete. Merge operands returned are in the order of insertion.
merge_operands-> Points to an array of at-least merge_operands_options.expected_max_number_of_operands and the caller is responsible for allocating it. If the status returned is Incomplete then number_of_operands will contain the total number of merge operands found in DB for key.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5604

Test Plan:
Added unit test and perf test in db_bench that can be run using the command:
./db_bench -benchmarks=getmergeoperands --merge_operator=sortlist

Differential Revision: D16657366

Pulled By: vjnadimpalli

fbshipit-source-id: 0faadd752351745224ee12d4ae9ef3cb529951bf
2019-08-06 14:26:44 -07:00
Eli Pozniansky
4834dab578 Improve CPU Efficiency of ApproximateSize (part 2) (#5609)
Summary:
In some cases, we don't have to get really accurate number. Something like 10% off is fine, we can create a new option for that use case. In this case, we can calculate size for full files first, and avoid estimation inside SST files if full files got us a huge number. For example, if we already covered 100GB of data, we should be able to skip partial dives into 10 SST files of 30MB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5609

Differential Revision: D16433481

Pulled By: elipoz

fbshipit-source-id: 5830b31e1c656d0fd3a00d7fd2678ddc8f6e601b
2019-07-31 08:50:00 -07:00
Eli Pozniansky
9625a2bc2b Added SizeApproximationOptions to DB::GetApproximateSizes (#5626)
Summary:
The new DB::GetApproximateSizes with SizeApproximationOptions argument, which allows to add more options/knobs to the DB::GetApproximateSizes call (beyond only the include_flags)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5626

Differential Revision: D16496913

Pulled By: elipoz

fbshipit-source-id: ee8c6c182330a285fa056ecfc3905a592b451720
2019-07-25 22:42:30 -07:00
Venki Pallipadi
22ce462450 Export Import sst files (#5495)
Summary:
Refresh of the earlier change here - https://github.com/facebook/rocksdb/issues/5135

This is a review request for code change needed for - https://github.com/facebook/rocksdb/issues/3469
"Add support for taking snapshot of a column family and creating column family from a given CF snapshot"

We have an implementation for this that we have been testing internally. We have two new APIs that together provide this functionality.

(1) ExportColumnFamily() - This API is modelled after CreateCheckpoint() as below.
// Exports all live SST files of a specified Column Family onto export_dir,
// returning SST files information in metadata.
// - SST files will be created as hard links when the directory specified
//   is in the same partition as the db directory, copied otherwise.
// - export_dir should not already exist and will be created by this API.
// - Always triggers a flush.
virtual Status ExportColumnFamily(ColumnFamilyHandle* handle,
                                  const std::string& export_dir,
                                  ExportImportFilesMetaData** metadata);

Internally, the API will DisableFileDeletions(), GetColumnFamilyMetaData(), Parse through
metadata, creating links/copies of all the sst files, EnableFileDeletions() and complete the call by
returning the list of file metadata.

(2) CreateColumnFamilyWithImport() - This API is modeled after IngestExternalFile(), but invoked only during a CF creation as below.
// CreateColumnFamilyWithImport() will create a new column family with
// column_family_name and import external SST files specified in metadata into
// this column family.
// (1) External SST files can be created using SstFileWriter.
// (2) External SST files can be exported from a particular column family in
//     an existing DB.
// Option in import_options specifies whether the external files are copied or
// moved (default is copy). When option specifies copy, managing files at
// external_file_path is caller's responsibility. When option specifies a
// move, the call ensures that the specified files at external_file_path are
// deleted on successful return and files are not modified on any error
// return.
// On error return, column family handle returned will be nullptr.
// ColumnFamily will be present on successful return and will not be present
// on error return. ColumnFamily may be present on any crash during this call.
virtual Status CreateColumnFamilyWithImport(
    const ColumnFamilyOptions& options, const std::string& column_family_name,
    const ImportColumnFamilyOptions& import_options,
    const ExportImportFilesMetaData& metadata,
    ColumnFamilyHandle** handle);

Internally, this API creates a new CF, parses all the sst files and adds it to the specified column family, at the same level and with same sequence number as in the metadata. Also performs safety checks with respect to overlaps between the sst files being imported.

If incoming sequence number is higher than current local sequence number, local sequence
number is updated to reflect this.

Note, as the sst files is are being moved across Column Families, Column Family name in sst file
will no longer match the actual column family on destination DB. The API does not modify Column
Family name or id in the sst files being imported.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5495

Differential Revision: D16018881

fbshipit-source-id: 9ae2251025d5916d35a9fc4ea4d6707f6be16ff9
2019-07-17 12:27:14 -07:00
Zhongyi Xie
24f73436fb sanitize and limit block_size under 4GB (#5492)
Summary:
`Block::restart_index_`, `Block::restarts_`, and `Block::current_` are defined as uint32_t but  `BlockBasedTableOptions::block_size` is defined as a size_t so user might see corruption as in https://github.com/facebook/rocksdb/issues/5486.
This PR adds a check in `BlockBasedTableFactory::SanitizeOptions` to disallow such configurations.
yiwu-arbug
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5492

Differential Revision: D15914047

Pulled By: miasantreble

fbshipit-source-id: c943f153d967e15aee7f2795730ab8259e2be201
2019-06-20 11:45:08 -07:00
Sagar Vemuri
f1219644ec Validate CF Options when creating a new column family (#5453)
Summary:
It seems like CF Options are not properly validated  when creating a new column family with `CreateColumnFamily` API; only a selected few checks are done. Calling `ColumnFamilyData::ValidateOptions`, which is the single source for all CFOptions validations,  will help fix this. (`ColumnFamilyData::ValidateOptions` is already called at the time of `DB::Open`).

**Test Plan:**
Added a new test: `DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions`
```
TEST_TMPDIR=/dev/shm ./db_test --gtest_filter=DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions
```
Also ran gtest-parallel to make sure the new test is not flaky.
```
TEST_TMPDIR=/dev/shm ~/gtest-parallel/gtest-parallel ./db_test --gtest_filter=DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions --repeat=10000
[10000/10000] DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions (15 ms)
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5453

Differential Revision: D15816851

Pulled By: sagar0

fbshipit-source-id: 9e702b9850f5c4a7e0ef8d39e1e6f9b81e7fe1e5
2019-06-14 14:11:10 -07:00
Maysam Yabandeh
f43edff9ac Disable kPipelinedWrite in MultiThreaded (#5442)
Summary:
TSAN tests report a race condition. We temporarily exclude kPipelinedWrite from MultiThreaded until the race condition is fixed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5442

Differential Revision: D15782349

Pulled By: maysamyabandeh

fbshipit-source-id: 42b4f9b3fa9137f0675e13ad132c0a06800c1bdd
2019-06-12 10:37:40 -07:00
Levi Tamasi
a94aef6596 Fix DBTest.DynamicMiscOptions so it passes even with Snappy disabled (#5438)
Summary:
This affects our "no compression" automated tests. Since PR #5368, DBTest.DynamicMiscOptions has been failing with:

db/db_test.cc:4889: Failure
dbfull()->SetOptions({{"compression", "kSnappyCompression"}})
Invalid argument: Compression type Snappy is not linked with the binary.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5438

Differential Revision: D15752100

Pulled By: ltamasi

fbshipit-source-id: 3f19eff7cafc03b333965be0203c5853d2a9cb71
2019-06-10 18:47:58 -07:00
Maysam Yabandeh
ae05a83e19 Call ValidateOptions from SetOptions (#5368)
Summary:
Currently we validate options in DB::Open. However the validation step is missing when options are dynamically updated in ::SetOptions. The patch fixes that.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5368

Differential Revision: D15540101

Pulled By: maysamyabandeh

fbshipit-source-id: d27bbffd8f0252d1b50bcf59e0a70a278ed937f4
2019-06-03 19:49:57 -07:00
Vijay Nadimpalli
49c5a12dbe Organizing rocksdb/db directory
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5390

Differential Revision: D15579388

Pulled By: vjnadimpalli

fbshipit-source-id: 5bfc95e31554b8ff05b97b76d6534113f527f366
2019-05-31 11:57:01 -07:00
Siying Dong
8843129ece Move some memory related files from util/ to memory/ (#5382)
Summary:
Move arena, allocator, and memory tools under util to a separate memory/ directory.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5382

Differential Revision: D15564655

Pulled By: siying

fbshipit-source-id: 9cd6b5d0d3d52b39606e19221fa154596e5852a5
2019-05-30 17:44:09 -07:00
Vijay Nadimpalli
50e470791d Organizing rocksdb/table directory by format
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5373

Differential Revision: D15559425

Pulled By: vjnadimpalli

fbshipit-source-id: 5d6d6d615582bedd96a4b879bb25d429a6de8b55
2019-05-30 14:51:11 -07:00
Siying Dong
e9e0101ca4 Move test related files under util/ to test_util/ (#5377)
Summary:
There are too many types of files under util/. Some test related files don't belong to there or just are just loosely related. Mo
ve them to a new directory test_util/, so that util/ is cleaner.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5377

Differential Revision: D15551366

Pulled By: siying

fbshipit-source-id: 0f5c8653832354ef8caa31749c0143815d719e2c
2019-05-30 11:25:51 -07:00
Siying Dong
545d206040 Move some file related files outside util/ (#5375)
Summary:
util/ means for lower level libraries, so it's a good idea to move the files which requires knowledge to DB out. Create a file/ and move some files there.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5375

Differential Revision: D15550935

Pulled By: siying

fbshipit-source-id: 61a9715dcde5386eebfb43e93f847bba1ae0d3f2
2019-05-29 20:47:06 -07:00
anand76
fefd4b98c5 Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.

Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency

The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.

Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).

Batch   Sizes

1        | 2        | 4         | 8      | 16  | 32

Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074        - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14        - MultiGet (w/ batching)

Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135

Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62

Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891

dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10  ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011

Differential Revision: D14348703

Pulled By: anand1976

fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
2019-04-11 14:28:26 -07:00
Siying Dong
aef763b6d6 Make statistics's stats_level change thread-safe (#5030)
Summary:
Right now, users can change statistics.stats_level while DB is running, but TSAN may report
data race. We make stats_level_ to be atomic, and access them using accessors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5030

Differential Revision: D14267519

Pulled By: siying

fbshipit-source-id: 37d7ebeff7a43a406230143422a16af899163f73
2019-03-01 10:42:09 -08:00
Aubin Sanyal
3231a2e581 Deprecate ttl option from CompactionOptionsFIFO (#4965)
Summary:
We introduced ttl option in CompactionOptionsFIFO when ttl-based file
deletion (compaction) was supported only as part of FIFO Compaction. But
with the extension of ttl semantics even to Level compaction,
CompactionOptionsFIFO.ttl can now be deprecated. Instead we will start
using ColumnFamilyOptions.ttl for FIFO compaction as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4965

Differential Revision: D14072960

Pulled By: sagar0

fbshipit-source-id: c98cc2ae695a28136295787cd88d36a220fc219e
2019-02-15 09:51:41 -08:00
Michael Liu
ca89ac2ba9 Apply modernize-use-override (2nd iteration)
Summary:
Use C++11’s override and remove virtual where applicable.
Change are automatically generated.

Reviewed By: Orvid

Differential Revision: D14090024

fbshipit-source-id: 1e9432e87d2657e1ff0028e15370a85d1739ba2a
2019-02-14 14:41:36 -08:00
Yanqin Jin
a69d4deefb Atomic ingest (#4895)
Summary:
Make file ingestion atomic.

 as title.
Ingesting external SST files into multiple column families should be atomic. If
a crash occurs and db reopens, either all column families have successfully
ingested the files before the crash, or non of the ingestions have any effect
on the state of the db.

Also add unit tests for atomic ingestion.

Note that the unit test here does not cover the case of incomplete atomic group
in the MANIFEST, which is covered in VersionSetTest already.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4895

Differential Revision: D13718245

Pulled By: riversand963

fbshipit-source-id: 7df97cc483af73ad44dd6993008f99b083852198
2019-02-12 19:16:17 -08:00
Siying Dong
cf3a671733 Remove cuckoo hash memtable (#4953)
Summary:
Cuckoo Hash is less useful than we initially expected. Remove it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4953

Differential Revision: D13979264

Pulled By: siying

fbshipit-source-id: 2a60afdaa989f045357398b43a1cc5d46f4492ed
2019-02-07 16:15:27 -08:00
Zhongyi Xie
71cae59a99 exclude test CompactFilesShouldTriggerAutoCompaction from ROCKSDB_LITE (#4950)
Summary:
This will fix the following build error:

> db/db_test.cc: In member function ‘virtual void rocksdb::DBTest_CompactFilesShouldTriggerAutoCompaction_Test::TestBody()’:
> db/db_test.cc:5462:8: error: ‘class rocksdb::DB’ has no member named ‘GetColumnFamilyMetaData’
>    db_->GetColumnFamilyMetaData(db_->DefaultColumnFamily(), &cf_meta_data);
> db/db_test.cc:5490:8: error: ‘class rocksdb::DB’ has no member named ‘GetColumnFamilyMetaData’
>    db_->GetColumnFamilyMetaData(db_->DefaultColumnFamily(), &cf_meta_data);
> db/db_test.cc:5499:8: error: ‘class rocksdb::DB’ has no member named ‘GetColumnFamilyMetaData’
>    db_->GetColumnFamilyMetaData(db_->DefaultColumnFamily(), &cf_meta_data);
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4950

Differential Revision: D13965378

Pulled By: miasantreble

fbshipit-source-id: a975435476fe555b1cd9d5da263ee3da3acdea56
2019-02-05 17:01:11 -08:00