2016-02-10 00:12:00 +01:00
|
|
|
// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
|
2017-07-16 01:03:42 +02:00
|
|
|
// This source code is licensed under both the GPLv2 (found in the
|
|
|
|
// COPYING file in the root directory) and Apache 2.0 License
|
|
|
|
// (found in the LICENSE.Apache file in the root directory).
|
2013-10-16 23:59:46 +02:00
|
|
|
//
|
2011-03-18 23:37:00 +01:00
|
|
|
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
|
|
|
|
// Use of this source code is governed by a BSD-style license that can be
|
|
|
|
// found in the LICENSE file. See the AUTHORS file for names of contributors.
|
|
|
|
|
|
|
|
#include "db/builder.h"
|
|
|
|
|
2015-08-24 20:11:12 +02:00
|
|
|
#include <algorithm>
|
Simplify querying of merge results
Summary:
While working on supporting mixing merge operators with
single deletes ( https://reviews.facebook.net/D43179 ),
I realized that returning and dealing with merge results
can be made simpler. Submitting this as a separate diff
because it is not directly related to single deletes.
Before, callers of merge helper had to retrieve the merge
result in one of two ways depending on whether the merge
was successful or not (success = result of merge was single
kTypeValue). For successful merges, the caller could query
the resulting key/value pair and for unsuccessful merges,
the result could be retrieved in the form of two deques of
keys and values. However, with single deletes, a successful merge
does not return a single key/value pair (if merge
operands are merged with a single delete, we have to generate
a value and keep the original single delete around to make
sure that we are not accidentially producing a key overwrite).
In addition, the two existing call sites of the merge
helper were taking the same actions independently from whether
the merge was successful or not, so this patch simplifies that.
Test Plan: make clean all check
Reviewers: rven, sdong, yhchiang, anthony, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43353
2015-08-18 02:34:38 +02:00
|
|
|
#include <deque>
|
A new call back to TablePropertiesCollector to allow users know the entry is add, delete or merge
Summary:
Currently users have no idea a key is add, delete or merge from TablePropertiesCollector call back. Add a new function to add it.
Also refactor the codes so that
(1) make table property collector and internal table property collector two separate data structures with the later one now exposed
(2) table builders only receive internal table properties
Test Plan: Add cases in table_properties_collector_test to cover both of old and new ways of using TablePropertiesCollector.
Reviewers: yhchiang, igor.sugak, rven, igor
Reviewed By: rven, igor
Subscribers: meyering, yoshinorim, maykov, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D35373
2015-04-06 19:04:30 +02:00
|
|
|
#include <vector>
|
Simplify querying of merge results
Summary:
While working on supporting mixing merge operators with
single deletes ( https://reviews.facebook.net/D43179 ),
I realized that returning and dealing with merge results
can be made simpler. Submitting this as a separate diff
because it is not directly related to single deletes.
Before, callers of merge helper had to retrieve the merge
result in one of two ways depending on whether the merge
was successful or not (success = result of merge was single
kTypeValue). For successful merges, the caller could query
the resulting key/value pair and for unsuccessful merges,
the result could be retrieved in the form of two deques of
keys and values. However, with single deletes, a successful merge
does not return a single key/value pair (if merge
operands are merged with a single delete, we have to generate
a value and keep the original single delete around to make
sure that we are not accidentially producing a key overwrite).
In addition, the two existing call sites of the merge
helper were taking the same actions independently from whether
the merge was successful or not, so this patch simplifies that.
Test Plan: make clean all check
Reviewers: rven, sdong, yhchiang, anthony, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43353
2015-08-18 02:34:38 +02:00
|
|
|
|
2020-09-15 06:10:09 +02:00
|
|
|
#include "db/blob/blob_file_builder.h"
|
2019-05-31 20:52:59 +02:00
|
|
|
#include "db/compaction/compaction_iterator.h"
|
Added EventListener::OnTableFileCreationStarted() callback
Summary: Added EventListener::OnTableFileCreationStarted. EventListener::OnTableFileCreated will be called on failure case. User can check creation status via TableFileCreationInfo::status.
Test Plan: unit test.
Reviewers: dhruba, yhchiang, ott, sdong
Reviewed By: sdong
Subscribers: sdong, kradhakrishnan, IslamAbdelRahman, andrewkr, yhchiang, leveldb, ott, dhruba
Differential Revision: https://reviews.facebook.net/D56337
2016-04-29 20:35:00 +02:00
|
|
|
#include "db/event_helpers.h"
|
Measure file read latency histogram per level
Summary: In internal stats, remember read latency histogram, if statistics is enabled. It can be retrieved from DB::GetProperty() with "rocksdb.dbstats" property, if it is enabled.
Test Plan: Manually run db_bench and prints out "rocksdb.dbstats" by hand and make sure it prints out as expected
Reviewers: igor, IslamAbdelRahman, rven, kradhakrishnan, anthony, yhchiang
Reviewed By: yhchiang
Subscribers: MarkCallaghan, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D44193
2015-08-13 23:35:54 +02:00
|
|
|
#include "db/internal_stats.h"
|
2013-03-21 23:59:47 +01:00
|
|
|
#include "db/merge_helper.h"
|
2020-10-01 19:08:52 +02:00
|
|
|
#include "db/output_validator.h"
|
2018-12-18 02:26:56 +01:00
|
|
|
#include "db/range_del_aggregator.h"
|
2011-03-18 23:37:00 +01:00
|
|
|
#include "db/table_cache.h"
|
|
|
|
#include "db/version_edit.h"
|
2021-03-18 04:43:22 +01:00
|
|
|
#include "file/file_util.h"
|
2019-05-30 05:44:08 +02:00
|
|
|
#include "file/filename.h"
|
2019-09-16 19:31:27 +02:00
|
|
|
#include "file/read_write_util.h"
|
|
|
|
#include "file/writable_file_writer.h"
|
2017-04-06 04:02:00 +02:00
|
|
|
#include "monitoring/iostats_context_imp.h"
|
|
|
|
#include "monitoring/thread_status_util.h"
|
2021-02-11 07:18:33 +01:00
|
|
|
#include "options/options_helper.h"
|
2013-08-23 17:38:13 +02:00
|
|
|
#include "rocksdb/db.h"
|
|
|
|
#include "rocksdb/env.h"
|
|
|
|
#include "rocksdb/iterator.h"
|
2013-10-30 18:52:33 +01:00
|
|
|
#include "rocksdb/options.h"
|
2014-02-04 04:48:45 +01:00
|
|
|
#include "rocksdb/table.h"
|
2019-05-30 23:47:29 +02:00
|
|
|
#include "table/block_based/block_based_table_builder.h"
|
2018-06-22 06:16:49 +02:00
|
|
|
#include "table/format.h"
|
2015-10-13 00:06:38 +02:00
|
|
|
#include "table/internal_iterator.h"
|
2019-05-31 02:39:43 +02:00
|
|
|
#include "test_util/sync_point.h"
|
2013-06-05 20:06:21 +02:00
|
|
|
#include "util/stop_watch.h"
|
2011-03-18 23:37:00 +01:00
|
|
|
|
2020-02-20 21:07:53 +01:00
|
|
|
namespace ROCKSDB_NAMESPACE {
|
2011-03-18 23:37:00 +01:00
|
|
|
|
2013-10-29 01:54:09 +01:00
|
|
|
class TableFactory;
|
|
|
|
|
2021-04-29 15:59:53 +02:00
|
|
|
TableBuilder* NewTableBuilder(const TableBuilderOptions& tboptions,
|
|
|
|
WritableFileWriter* file) {
|
|
|
|
assert((tboptions.column_family_id ==
|
2016-04-07 08:10:32 +02:00
|
|
|
TablePropertiesCollectorFactory::Context::kUnknownColumnFamily) ==
|
2021-04-29 15:59:53 +02:00
|
|
|
tboptions.column_family_name.empty());
|
|
|
|
return tboptions.ioptions.table_factory->NewTableBuilder(tboptions, file);
|
2013-10-29 01:54:09 +01:00
|
|
|
}
|
|
|
|
|
A new call back to TablePropertiesCollector to allow users know the entry is add, delete or merge
Summary:
Currently users have no idea a key is add, delete or merge from TablePropertiesCollector call back. Add a new function to add it.
Also refactor the codes so that
(1) make table property collector and internal table property collector two separate data structures with the later one now exposed
(2) table builders only receive internal table properties
Test Plan: Add cases in table_properties_collector_test to cover both of old and new ways of using TablePropertiesCollector.
Reviewers: yhchiang, igor.sugak, rven, igor
Reviewed By: rven, igor
Subscribers: meyering, yoshinorim, maykov, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D35373
2015-04-06 19:04:30 +02:00
|
|
|
Status BuildTable(
|
2020-11-13 03:43:30 +01:00
|
|
|
const std::string& dbname, VersionSet* versions,
|
2021-04-29 15:59:53 +02:00
|
|
|
const ImmutableDBOptions& db_options, const TableBuilderOptions& tboptions,
|
|
|
|
const FileOptions& file_options, TableCache* table_cache,
|
|
|
|
InternalIterator* iter,
|
2018-12-17 22:12:22 +01:00
|
|
|
std::vector<std::unique_ptr<FragmentedRangeTombstoneIterator>>
|
|
|
|
range_del_iters,
|
2020-09-15 06:10:09 +02:00
|
|
|
FileMetaData* meta, std::vector<BlobFileAddition>* blob_file_additions,
|
2016-04-07 08:10:32 +02:00
|
|
|
std::vector<SequenceNumber> snapshots,
|
Use SST files for Transaction conflict detection
Summary:
Currently, transactions can fail even if there is no actual write conflict. This is due to relying on only the memtables to check for write-conflicts. Users have to tune memtable settings to try to avoid this, but it's hard to figure out exactly how to tune these settings.
With this diff, TransactionDB will use both memtables and SST files to determine if there are any write conflicts. This relies on the fact that BlockBasedTable stores sequence numbers for all writes that happen after any open snapshot. Also, D50295 is needed to prevent SingleDelete from disappearing writes (the TODOs in this test code will be fixed once the other diff is approved and merged).
Note that Optimistic transactions will still rely on tuning memtable settings as we do not want to read from SST while on the write thread. Also, memtable settings can still be used to reduce how often TransactionDB needs to read SST files.
Test Plan: unit tests, db bench
Reviewers: rven, yhchiang, kradhakrishnan, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb, yoshinorim
Differential Revision: https://reviews.facebook.net/D50475
2015-10-16 01:37:15 +02:00
|
|
|
SequenceNumber earliest_write_conflict_snapshot,
|
CompactionIterator sees consistent view of which keys are committed (#9830)
Summary:
**This PR does not affect the functionality of `DB` and write-committed transactions.**
`CompactionIterator` uses `KeyCommitted(seq)` to determine if a key in the database is committed.
As the name 'write-committed' implies, if write-committed policy is used, a key exists in the database only if
it is committed. In fact, the implementation of `KeyCommitted()` is as follows:
```
inline bool KeyCommitted(SequenceNumber seq) {
// For non-txn-db and write-committed, snapshot_checker_ is always nullptr.
return snapshot_checker_ == nullptr ||
snapshot_checker_->CheckInSnapshot(seq, kMaxSequence) == SnapshotCheckerResult::kInSnapshot;
}
```
With that being said, we focus on write-prepared/write-unprepared transactions.
A few notes:
- A key can exist in the db even if it's uncommitted. Therefore, we rely on `snapshot_checker_` to determine data visibility. We also require that all writes go through transaction API instead of the raw `WriteBatch` + `Write`, thus at most one uncommitted version of one user key can exist in the database.
- `CompactionIterator` outputs a key as long as the key is uncommitted.
Due to the above reasons, it is possible that `CompactionIterator` decides to output an uncommitted key without
doing further checks on the key (`NextFromInput()`). By the time the key is being prepared for output, the key becomes
committed because the `snapshot_checker_(seq, kMaxSequence)` becomes true in the implementation of `KeyCommitted()`.
Then `CompactionIterator` will try to zero its sequence number and hit assertion error if the key is a tombstone.
To fix this issue, we should make the `CompactionIterator` see a consistent view of the input keys. Note that
for write-prepared/write-unprepared, the background flush/compaction jobs already take a "job snapshot" before starting
processing keys. The job snapshot is released only after the entire flush/compaction finishes. We can use this snapshot
to determine whether a key is committed or not with minor change to `KeyCommitted()`.
```
inline bool KeyCommitted(SequenceNumber sequence) {
// For non-txn-db and write-committed, snapshot_checker_ is always nullptr.
return snapshot_checker_ == nullptr ||
snapshot_checker_->CheckInSnapshot(sequence, job_snapshot_) ==
SnapshotCheckerResult::kInSnapshot;
}
```
As a result, whether a key is committed or not will remain a constant throughout compaction, causing no trouble
for `CompactionIterator`s assertions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9830
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D35561162
Pulled By: riversand963
fbshipit-source-id: 0e00d200c195240341cfe6d34cbc86798b315b9f
2022-04-14 20:11:04 +02:00
|
|
|
SequenceNumber job_snapshot, SnapshotChecker* snapshot_checker,
|
|
|
|
bool paranoid_file_checks, InternalStats* internal_stats,
|
|
|
|
IOStatus* io_status, const std::shared_ptr<IOTracer>& io_tracer,
|
2021-09-17 02:17:40 +02:00
|
|
|
BlobFileCreationReason blob_creation_reason, EventLogger* event_logger,
|
Add more LSM info to FilterBuildingContext (#8246)
Summary:
Add `num_levels`, `is_bottommost`, and table file creation
`reason` to `FilterBuildingContext`, in anticipation of more powerful
Bloom-like filter support.
To support this, added `is_bottommost` and `reason` to
`TableBuilderOptions`, which allowed removing `reason` parameter from
`rocksdb::BuildTable`.
I attempted to remove `skip_filters` from `TableBuilderOptions`, because
filter construction decisions should arise from options, not one-off
parameters. I could not completely remove it because the public API for
SstFileWriter takes a `skip_filters` parameter, and translating this
into an option change would mean awkwardly replacing the table_factory
if it is BlockBasedTableFactory with new filter_policy=nullptr option.
I marked this public skip_filters option as deprecated because of this
oddity. (skip_filters on the read side probably makes sense.)
At least `skip_filters` is now largely hidden for users of
`TableBuilderOptions` and is no longer used for implementing the
optimize_filters_for_hits option. Bringing the logic for that option
closer to handling of FilterBuildingContext makes it more obvious that
hese two are using the same notion of "bottommost." (Planned:
configuration options for Bloom-like filters that generalize
`optimize_filters_for_hits`)
Recommended follow-up: Try to get away from "bottommost level" naming of
things, which is inaccurate (see
VersionStorageInfo::RangeMightExistAfterSortedRun), and move to
"bottommost run" or just "bottommost."
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8246
Test Plan:
extended an existing unit test to exercise and check various
filter building contexts. Also, existing tests for
optimize_filters_for_hits validate some of the "bottommost" handling,
which is now closely connected to FilterBuildingContext::is_bottommost
through TableBuilderOptions::is_bottommost
Reviewed By: mrambacher
Differential Revision: D28099346
Pulled By: pdillinger
fbshipit-source-id: 2c1072e29c24d4ac404c761a7b7663292372600a
2021-04-30 22:49:24 +02:00
|
|
|
int job_id, const Env::IOPriority io_priority,
|
2021-04-29 15:59:53 +02:00
|
|
|
TableProperties* table_properties, Env::WriteLifeTimeHint write_hint,
|
|
|
|
const std::string* full_history_ts_low,
|
2021-06-18 13:56:43 +02:00
|
|
|
BlobFileCompletionCallback* blob_callback, uint64_t* num_input_entries,
|
|
|
|
uint64_t* memtable_payload_bytes, uint64_t* memtable_garbage_bytes) {
|
2021-04-29 15:59:53 +02:00
|
|
|
assert((tboptions.column_family_id ==
|
2016-04-07 08:10:32 +02:00
|
|
|
TablePropertiesCollectorFactory::Context::kUnknownColumnFamily) ==
|
2021-04-29 15:59:53 +02:00
|
|
|
tboptions.column_family_name.empty());
|
|
|
|
auto& mutable_cf_options = tboptions.moptions;
|
|
|
|
auto& ioptions = tboptions.ioptions;
|
2015-05-16 08:22:22 +02:00
|
|
|
// Reports the IOStats for flush for every following bytes.
|
|
|
|
const size_t kReportFlushIOStatsEvery = 1048576;
|
2020-10-01 19:08:52 +02:00
|
|
|
OutputValidator output_validator(
|
2021-04-29 15:59:53 +02:00
|
|
|
tboptions.internal_comparator,
|
2020-10-01 19:08:52 +02:00
|
|
|
/*enable_order_check=*/
|
|
|
|
mutable_cf_options.check_flush_compaction_key_order,
|
|
|
|
/*enable_hash=*/paranoid_file_checks);
|
2011-03-18 23:37:00 +01:00
|
|
|
Status s;
|
2014-06-14 00:54:19 +02:00
|
|
|
meta->fd.file_size = 0;
|
2011-03-18 23:37:00 +01:00
|
|
|
iter->SeekToFirst();
|
2018-12-18 02:26:56 +01:00
|
|
|
std::unique_ptr<CompactionRangeDelAggregator> range_del_agg(
|
2021-04-29 15:59:53 +02:00
|
|
|
new CompactionRangeDelAggregator(&tboptions.internal_comparator,
|
|
|
|
snapshots));
|
2021-05-21 01:06:12 +02:00
|
|
|
uint64_t num_unfragmented_tombstones = 0;
|
2021-06-18 13:56:43 +02:00
|
|
|
uint64_t total_tombstone_payload_bytes = 0;
|
2018-12-17 22:12:22 +01:00
|
|
|
for (auto& range_del_iter : range_del_iters) {
|
2021-05-21 01:06:12 +02:00
|
|
|
num_unfragmented_tombstones +=
|
|
|
|
range_del_iter->num_unfragmented_tombstones();
|
2021-06-18 13:56:43 +02:00
|
|
|
total_tombstone_payload_bytes +=
|
|
|
|
range_del_iter->total_tombstone_payload_bytes();
|
2018-12-17 22:12:22 +01:00
|
|
|
range_del_agg->AddTombstones(std::move(range_del_iter));
|
2016-11-01 04:35:54 +01:00
|
|
|
}
|
2011-03-18 23:37:00 +01:00
|
|
|
|
2018-04-06 04:49:06 +02:00
|
|
|
std::string fname = TableFileName(ioptions.cf_paths, meta->fd.GetNumber(),
|
2014-07-02 18:54:20 +02:00
|
|
|
meta->fd.GetPathId());
|
2020-09-15 06:10:09 +02:00
|
|
|
std::vector<std::string> blob_file_paths;
|
2020-08-25 19:44:39 +02:00
|
|
|
std::string file_checksum = kUnknownFileChecksum;
|
|
|
|
std::string file_checksum_func_name = kUnknownFileChecksumFuncName;
|
Added EventListener::OnTableFileCreationStarted() callback
Summary: Added EventListener::OnTableFileCreationStarted. EventListener::OnTableFileCreated will be called on failure case. User can check creation status via TableFileCreationInfo::status.
Test Plan: unit test.
Reviewers: dhruba, yhchiang, ott, sdong
Reviewed By: sdong
Subscribers: sdong, kradhakrishnan, IslamAbdelRahman, andrewkr, yhchiang, leveldb, ott, dhruba
Differential Revision: https://reviews.facebook.net/D56337
2016-04-29 20:35:00 +02:00
|
|
|
#ifndef ROCKSDB_LITE
|
2021-04-29 15:59:53 +02:00
|
|
|
EventHelpers::NotifyTableFileCreationStarted(ioptions.listeners, dbname,
|
|
|
|
tboptions.column_family_name,
|
Add more LSM info to FilterBuildingContext (#8246)
Summary:
Add `num_levels`, `is_bottommost`, and table file creation
`reason` to `FilterBuildingContext`, in anticipation of more powerful
Bloom-like filter support.
To support this, added `is_bottommost` and `reason` to
`TableBuilderOptions`, which allowed removing `reason` parameter from
`rocksdb::BuildTable`.
I attempted to remove `skip_filters` from `TableBuilderOptions`, because
filter construction decisions should arise from options, not one-off
parameters. I could not completely remove it because the public API for
SstFileWriter takes a `skip_filters` parameter, and translating this
into an option change would mean awkwardly replacing the table_factory
if it is BlockBasedTableFactory with new filter_policy=nullptr option.
I marked this public skip_filters option as deprecated because of this
oddity. (skip_filters on the read side probably makes sense.)
At least `skip_filters` is now largely hidden for users of
`TableBuilderOptions` and is no longer used for implementing the
optimize_filters_for_hits option. Bringing the logic for that option
closer to handling of FilterBuildingContext makes it more obvious that
hese two are using the same notion of "bottommost." (Planned:
configuration options for Bloom-like filters that generalize
`optimize_filters_for_hits`)
Recommended follow-up: Try to get away from "bottommost level" naming of
things, which is inaccurate (see
VersionStorageInfo::RangeMightExistAfterSortedRun), and move to
"bottommost run" or just "bottommost."
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8246
Test Plan:
extended an existing unit test to exercise and check various
filter building contexts. Also, existing tests for
optimize_filters_for_hits validate some of the "bottommost" handling,
which is now closely connected to FilterBuildingContext::is_bottommost
through TableBuilderOptions::is_bottommost
Reviewed By: mrambacher
Differential Revision: D28099346
Pulled By: pdillinger
fbshipit-source-id: 2c1072e29c24d4ac404c761a7b7663292372600a
2021-04-30 22:49:24 +02:00
|
|
|
fname, job_id, tboptions.reason);
|
Added EventListener::OnTableFileCreationStarted() callback
Summary: Added EventListener::OnTableFileCreationStarted. EventListener::OnTableFileCreated will be called on failure case. User can check creation status via TableFileCreationInfo::status.
Test Plan: unit test.
Reviewers: dhruba, yhchiang, ott, sdong
Reviewed By: sdong
Subscribers: sdong, kradhakrishnan, IslamAbdelRahman, andrewkr, yhchiang, leveldb, ott, dhruba
Differential Revision: https://reviews.facebook.net/D56337
2016-04-29 20:35:00 +02:00
|
|
|
#endif // !ROCKSDB_LITE
|
2020-11-13 03:43:30 +01:00
|
|
|
Env* env = db_options.env;
|
|
|
|
assert(env);
|
|
|
|
FileSystem* fs = db_options.fs.get();
|
|
|
|
assert(fs);
|
2021-01-26 07:07:26 +01:00
|
|
|
|
Added EventListener::OnTableFileCreationStarted() callback
Summary: Added EventListener::OnTableFileCreationStarted. EventListener::OnTableFileCreated will be called on failure case. User can check creation status via TableFileCreationInfo::status.
Test Plan: unit test.
Reviewers: dhruba, yhchiang, ott, sdong
Reviewed By: sdong
Subscribers: sdong, kradhakrishnan, IslamAbdelRahman, andrewkr, yhchiang, leveldb, ott, dhruba
Differential Revision: https://reviews.facebook.net/D56337
2016-04-29 20:35:00 +02:00
|
|
|
TableProperties tp;
|
2022-05-04 19:15:30 +02:00
|
|
|
bool table_file_created = false;
|
Range deletion performance improvements + cleanup (#4014)
Summary:
This fixes the same performance issue that #3992 fixes but with much more invasive cleanup.
I'm more excited about this PR because it paves the way for fixing another problem we uncovered at Cockroach where range deletion tombstones can cause massive compactions. For example, suppose L4 contains deletions from [a, c) and [x, z) and no other keys, and L5 is entirely empty. L6, however, is full of data. When compacting L4 -> L5, we'll end up with one file that spans, massively, from [a, z). When we go to compact L5 -> L6, we'll have to rewrite all of L6! If, instead of range deletions in L4, we had keys a, b, x, y, and z, RocksDB would have been smart enough to create two files in L5: one for a and b and another for x, y, and z.
With the changes in this PR, it will be possible to adjust the compaction logic to split tombstones/start new output files when they would span too many files in the grandparent level.
ajkr please take a look when you have a minute!
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4014
Differential Revision: D8773253
Pulled By: ajkr
fbshipit-source-id: ec62fa85f648fdebe1380b83ed997f9baec35677
2018-07-12 23:28:10 +02:00
|
|
|
if (iter->Valid() || !range_del_agg->IsEmpty()) {
|
2021-05-08 01:00:37 +02:00
|
|
|
std::unique_ptr<CompactionFilter> compaction_filter;
|
|
|
|
if (ioptions.compaction_filter_factory != nullptr &&
|
|
|
|
ioptions.compaction_filter_factory->ShouldFilterTableFileCreation(
|
|
|
|
tboptions.reason)) {
|
|
|
|
CompactionFilter::Context context;
|
|
|
|
context.is_full_compaction = false;
|
|
|
|
context.is_manual_compaction = false;
|
|
|
|
context.column_family_id = tboptions.column_family_id;
|
|
|
|
context.reason = tboptions.reason;
|
|
|
|
compaction_filter =
|
|
|
|
ioptions.compaction_filter_factory->CreateCompactionFilter(context);
|
|
|
|
if (compaction_filter != nullptr &&
|
|
|
|
!compaction_filter->IgnoreSnapshots()) {
|
|
|
|
s.PermitUncheckedError();
|
|
|
|
return Status::NotSupported(
|
|
|
|
"CompactionFilter::IgnoreSnapshots() = false is not supported "
|
|
|
|
"anymore.");
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
Move rate_limiter, write buffering, most perf context instrumentation and most random kill out of Env
Summary: We want to keep Env a think layer for better portability. Less platform dependent codes should be moved out of Env. In this patch, I create a wrapper of file readers and writers, and put rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. It will make it easier to maintain multiple Env in the future.
Test Plan: Run all existing unit tests.
Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D42321
2015-07-18 01:16:11 +02:00
|
|
|
TableBuilder* builder;
|
2018-11-09 20:17:34 +01:00
|
|
|
std::unique_ptr<WritableFileWriter> file_writer;
|
Move rate_limiter, write buffering, most perf context instrumentation and most random kill out of Env
Summary: We want to keep Env a think layer for better portability. Less platform dependent codes should be moved out of Env. In this patch, I create a wrapper of file readers and writers, and put rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. It will make it easier to maintain multiple Env in the future.
Test Plan: Run all existing unit tests.
Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D42321
2015-07-18 01:16:11 +02:00
|
|
|
{
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
2019-12-13 23:47:08 +01:00
|
|
|
std::unique_ptr<FSWritableFile> file;
|
2017-04-13 22:07:33 +02:00
|
|
|
#ifndef NDEBUG
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
2019-12-13 23:47:08 +01:00
|
|
|
bool use_direct_writes = file_options.use_direct_writes;
|
2017-04-13 22:07:33 +02:00
|
|
|
TEST_SYNC_POINT_CALLBACK("BuildTable:create_file", &use_direct_writes);
|
|
|
|
#endif // !NDEBUG
|
2020-09-17 00:45:30 +02:00
|
|
|
IOStatus io_s = NewWritableFile(fs, fname, &file, file_options);
|
2020-09-29 18:47:33 +02:00
|
|
|
assert(s.ok());
|
2020-07-15 20:02:44 +02:00
|
|
|
s = io_s;
|
|
|
|
if (io_status->ok()) {
|
|
|
|
*io_status = io_s;
|
|
|
|
}
|
Move rate_limiter, write buffering, most perf context instrumentation and most random kill out of Env
Summary: We want to keep Env a think layer for better portability. Less platform dependent codes should be moved out of Env. In this patch, I create a wrapper of file readers and writers, and put rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. It will make it easier to maintain multiple Env in the future.
Test Plan: Run all existing unit tests.
Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D42321
2015-07-18 01:16:11 +02:00
|
|
|
if (!s.ok()) {
|
Added EventListener::OnTableFileCreationStarted() callback
Summary: Added EventListener::OnTableFileCreationStarted. EventListener::OnTableFileCreated will be called on failure case. User can check creation status via TableFileCreationInfo::status.
Test Plan: unit test.
Reviewers: dhruba, yhchiang, ott, sdong
Reviewed By: sdong
Subscribers: sdong, kradhakrishnan, IslamAbdelRahman, andrewkr, yhchiang, leveldb, ott, dhruba
Differential Revision: https://reviews.facebook.net/D56337
2016-04-29 20:35:00 +02:00
|
|
|
EventHelpers::LogAndNotifyTableFileCreationFinished(
|
2021-04-29 15:59:53 +02:00
|
|
|
event_logger, ioptions.listeners, dbname,
|
|
|
|
tboptions.column_family_name, fname, job_id, meta->fd,
|
Add more LSM info to FilterBuildingContext (#8246)
Summary:
Add `num_levels`, `is_bottommost`, and table file creation
`reason` to `FilterBuildingContext`, in anticipation of more powerful
Bloom-like filter support.
To support this, added `is_bottommost` and `reason` to
`TableBuilderOptions`, which allowed removing `reason` parameter from
`rocksdb::BuildTable`.
I attempted to remove `skip_filters` from `TableBuilderOptions`, because
filter construction decisions should arise from options, not one-off
parameters. I could not completely remove it because the public API for
SstFileWriter takes a `skip_filters` parameter, and translating this
into an option change would mean awkwardly replacing the table_factory
if it is BlockBasedTableFactory with new filter_policy=nullptr option.
I marked this public skip_filters option as deprecated because of this
oddity. (skip_filters on the read side probably makes sense.)
At least `skip_filters` is now largely hidden for users of
`TableBuilderOptions` and is no longer used for implementing the
optimize_filters_for_hits option. Bringing the logic for that option
closer to handling of FilterBuildingContext makes it more obvious that
hese two are using the same notion of "bottommost." (Planned:
configuration options for Bloom-like filters that generalize
`optimize_filters_for_hits`)
Recommended follow-up: Try to get away from "bottommost level" naming of
things, which is inaccurate (see
VersionStorageInfo::RangeMightExistAfterSortedRun), and move to
"bottommost run" or just "bottommost."
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8246
Test Plan:
extended an existing unit test to exercise and check various
filter building contexts. Also, existing tests for
optimize_filters_for_hits validate some of the "bottommost" handling,
which is now closely connected to FilterBuildingContext::is_bottommost
through TableBuilderOptions::is_bottommost
Reviewed By: mrambacher
Differential Revision: D28099346
Pulled By: pdillinger
fbshipit-source-id: 2c1072e29c24d4ac404c761a7b7663292372600a
2021-04-30 22:49:24 +02:00
|
|
|
kInvalidBlobFileNumber, tp, tboptions.reason, s, file_checksum,
|
2021-04-29 15:59:53 +02:00
|
|
|
file_checksum_func_name);
|
Move rate_limiter, write buffering, most perf context instrumentation and most random kill out of Env
Summary: We want to keep Env a think layer for better portability. Less platform dependent codes should be moved out of Env. In this patch, I create a wrapper of file readers and writers, and put rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. It will make it easier to maintain multiple Env in the future.
Test Plan: Run all existing unit tests.
Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D42321
2015-07-18 01:16:11 +02:00
|
|
|
return s;
|
|
|
|
}
|
2022-05-04 19:15:30 +02:00
|
|
|
|
|
|
|
table_file_created = true;
|
2021-02-11 07:18:33 +01:00
|
|
|
FileTypeSet tmp_set = ioptions.checksum_handoff_file_types;
|
Move rate_limiter, write buffering, most perf context instrumentation and most random kill out of Env
Summary: We want to keep Env a think layer for better portability. Less platform dependent codes should be moved out of Env. In this patch, I create a wrapper of file readers and writers, and put rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. It will make it easier to maintain multiple Env in the future.
Test Plan: Run all existing unit tests.
Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D42321
2015-07-18 01:16:11 +02:00
|
|
|
file->SetIOPriority(io_priority);
|
2017-11-10 18:25:26 +01:00
|
|
|
file->SetWriteLifeTimeHint(write_hint);
|
2020-02-11 00:42:46 +01:00
|
|
|
file_writer.reset(new WritableFileWriter(
|
2021-03-15 12:32:24 +01:00
|
|
|
std::move(file), fname, file_options, ioptions.clock, io_tracer,
|
2021-04-26 21:43:02 +02:00
|
|
|
ioptions.stats, ioptions.listeners,
|
2021-04-23 05:42:50 +02:00
|
|
|
ioptions.file_checksum_gen_factory.get(),
|
Using existing crc32c checksum in checksum handoff for Manifest and WAL (#8412)
Summary:
In PR https://github.com/facebook/rocksdb/issues/7523 , checksum handoff is introduced in RocksDB for WAL, Manifest, and SST files. When user enable checksum handoff for a certain type of file, before the data is written to the lower layer storage system, we calculate the checksum (crc32c) of each piece of data and pass the checksum down with the data, such that data verification can be down by the lower layer storage system if it has the capability. However, it cannot cover the whole lifetime of the data in the memory and also it potentially introduces extra checksum calculation overhead.
In this PR, we introduce a new interface in WritableFileWriter::Append, which allows the caller be able to pass the data and the checksum (crc32c) together. In this way, WritableFileWriter can directly use the pass-in checksum (crc32c) to generate the checksum of data being passed down to the storage system. It saves the calculation overhead and achieves higher protection coverage. When a new checksum is added with the data, we use Crc32cCombine https://github.com/facebook/rocksdb/issues/8305 to combine the existing checksum and the new checksum. To avoid the segmenting of data by rate-limiter before it is stored, rate-limiter is called enough times to accumulate enough credits for a certain write. This design only support Manifest and WAL which use log_writer in the current stage.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8412
Test Plan: make check, add new testing cases.
Reviewed By: anand1976
Differential Revision: D29151545
Pulled By: zhichao-cao
fbshipit-source-id: 75e2278c5126cfd58393c67b1efd18dcc7a30772
2021-06-25 09:46:33 +02:00
|
|
|
tmp_set.Contains(FileType::kTableFile), false));
|
2020-02-11 00:42:46 +01:00
|
|
|
|
2021-04-29 15:59:53 +02:00
|
|
|
builder = NewTableBuilder(tboptions, file_writer.get());
|
Move rate_limiter, write buffering, most perf context instrumentation and most random kill out of Env
Summary: We want to keep Env a think layer for better portability. Less platform dependent codes should be moved out of Env. In this patch, I create a wrapper of file readers and writers, and put rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. It will make it easier to maintain multiple Env in the future.
Test Plan: Run all existing unit tests.
Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D42321
2015-07-18 01:16:11 +02:00
|
|
|
}
|
2013-02-28 23:09:30 +01:00
|
|
|
|
2021-05-08 01:00:37 +02:00
|
|
|
MergeHelper merge(
|
|
|
|
env, tboptions.internal_comparator.user_comparator(),
|
|
|
|
ioptions.merge_operator.get(), compaction_filter.get(), ioptions.logger,
|
|
|
|
true /* internal key corruption is not ok */,
|
|
|
|
snapshots.empty() ? 0 : snapshots.back(), snapshot_checker);
|
2013-03-21 23:59:47 +01:00
|
|
|
|
2020-09-15 06:10:09 +02:00
|
|
|
std::unique_ptr<BlobFileBuilder> blob_file_builder(
|
|
|
|
(mutable_cf_options.enable_blob_files && blob_file_additions)
|
2021-09-17 02:17:40 +02:00
|
|
|
? new BlobFileBuilder(
|
|
|
|
versions, fs, &ioptions, &mutable_cf_options, &file_options,
|
|
|
|
job_id, tboptions.column_family_id,
|
|
|
|
tboptions.column_family_name, io_priority, write_hint,
|
|
|
|
io_tracer, blob_callback, blob_creation_reason,
|
|
|
|
&blob_file_paths, blob_file_additions)
|
2020-09-15 06:10:09 +02:00
|
|
|
: nullptr);
|
|
|
|
|
Compaction Support for Range Deletion
Summary:
This diff introduces RangeDelAggregator, which takes ownership of iterators
provided to it via AddTombstones(). The tombstones are organized in a two-level
map (snapshot stripe -> begin key -> tombstone). Tombstone creation avoids data
copy by holding Slices returned by the iterator, which remain valid thanks to pinning.
For compaction, we create a hierarchical range tombstone iterator with structure
matching the iterator over compaction input data. An aggregator based on that
iterator is used by CompactionIterator to determine which keys are covered by
range tombstones. In case of merge operand, the same aggregator is used by
MergeHelper. Upon finishing each file in the compaction, relevant range tombstones
are added to the output file's range tombstone metablock and file boundaries are
updated accordingly.
To check whether a key is covered by range tombstone, RangeDelAggregator::ShouldDelete()
considers tombstones in the key's snapshot stripe. When this function is used outside of
compaction, it also checks newer stripes, which can contain covering tombstones. Currently
the intra-stripe check involves a linear scan; however, in the future we plan to collapse ranges
within a stripe such that binary search can be used.
RangeDelAggregator::AddToBuilder() adds all range tombstones in the table's key-range
to a new table's range tombstone meta-block. Since range tombstones may fall in the gap
between files, we may need to extend some files' key-ranges. The strategy is (1) first file
extends as far left as possible and other files do not extend left, (2) all files extend right
until either the start of the next file or the end of the last range tombstone in the gap,
whichever comes first.
One other notable change is adding release/move semantics to ScopedArenaIterator
such that it can be used to transfer ownership of an arena-allocated iterator, similar to
how unique_ptr is used for malloc'd data.
Depends on D61473
Test Plan: compaction_iterator_test, mock_table, end-to-end tests in D63927
Reviewers: sdong, IslamAbdelRahman, wanning, yhchiang, lightmark
Reviewed By: lightmark
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62205
2016-10-18 21:04:56 +02:00
|
|
|
CompactionIterator c_iter(
|
2021-04-29 15:59:53 +02:00
|
|
|
iter, tboptions.internal_comparator.user_comparator(), &merge,
|
|
|
|
kMaxSequenceNumber, &snapshots, earliest_write_conflict_snapshot,
|
CompactionIterator sees consistent view of which keys are committed (#9830)
Summary:
**This PR does not affect the functionality of `DB` and write-committed transactions.**
`CompactionIterator` uses `KeyCommitted(seq)` to determine if a key in the database is committed.
As the name 'write-committed' implies, if write-committed policy is used, a key exists in the database only if
it is committed. In fact, the implementation of `KeyCommitted()` is as follows:
```
inline bool KeyCommitted(SequenceNumber seq) {
// For non-txn-db and write-committed, snapshot_checker_ is always nullptr.
return snapshot_checker_ == nullptr ||
snapshot_checker_->CheckInSnapshot(seq, kMaxSequence) == SnapshotCheckerResult::kInSnapshot;
}
```
With that being said, we focus on write-prepared/write-unprepared transactions.
A few notes:
- A key can exist in the db even if it's uncommitted. Therefore, we rely on `snapshot_checker_` to determine data visibility. We also require that all writes go through transaction API instead of the raw `WriteBatch` + `Write`, thus at most one uncommitted version of one user key can exist in the database.
- `CompactionIterator` outputs a key as long as the key is uncommitted.
Due to the above reasons, it is possible that `CompactionIterator` decides to output an uncommitted key without
doing further checks on the key (`NextFromInput()`). By the time the key is being prepared for output, the key becomes
committed because the `snapshot_checker_(seq, kMaxSequence)` becomes true in the implementation of `KeyCommitted()`.
Then `CompactionIterator` will try to zero its sequence number and hit assertion error if the key is a tombstone.
To fix this issue, we should make the `CompactionIterator` see a consistent view of the input keys. Note that
for write-prepared/write-unprepared, the background flush/compaction jobs already take a "job snapshot" before starting
processing keys. The job snapshot is released only after the entire flush/compaction finishes. We can use this snapshot
to determine whether a key is committed or not with minor change to `KeyCommitted()`.
```
inline bool KeyCommitted(SequenceNumber sequence) {
// For non-txn-db and write-committed, snapshot_checker_ is always nullptr.
return snapshot_checker_ == nullptr ||
snapshot_checker_->CheckInSnapshot(sequence, job_snapshot_) ==
SnapshotCheckerResult::kInSnapshot;
}
```
As a result, whether a key is committed or not will remain a constant throughout compaction, causing no trouble
for `CompactionIterator`s assertions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9830
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D35561162
Pulled By: riversand963
fbshipit-source-id: 0e00d200c195240341cfe6d34cbc86798b315b9f
2022-04-14 20:11:04 +02:00
|
|
|
job_snapshot, snapshot_checker, env,
|
|
|
|
ShouldReportDetailedTime(env, ioptions.stats),
|
2020-09-15 06:10:09 +02:00
|
|
|
true /* internal key corruption is not ok */, range_del_agg.get(),
|
2020-11-13 03:43:30 +01:00
|
|
|
blob_file_builder.get(), ioptions.allow_data_in_errors,
|
2021-05-08 01:00:37 +02:00
|
|
|
/*compaction=*/nullptr, compaction_filter.get(),
|
|
|
|
/*shutting_down=*/nullptr,
|
2022-04-11 19:26:55 +02:00
|
|
|
/*manual_compaction_paused=*/nullptr,
|
2021-06-07 20:40:31 +02:00
|
|
|
/*manual_compaction_canceled=*/nullptr, db_options.info_log,
|
|
|
|
full_history_ts_low);
|
2020-09-15 06:10:09 +02:00
|
|
|
|
2015-09-10 23:35:25 +02:00
|
|
|
c_iter.SeekToFirst();
|
|
|
|
for (; c_iter.Valid(); c_iter.Next()) {
|
|
|
|
const Slice& key = c_iter.key();
|
|
|
|
const Slice& value = c_iter.value();
|
2019-10-15 00:19:31 +02:00
|
|
|
const ParsedInternalKey& ikey = c_iter.ikey();
|
2020-10-01 19:08:52 +02:00
|
|
|
// Generate a rolling 64-bit hash of the key and values
|
Memtable "MemPurge" prototype (#8454)
Summary:
Implement an experimental feature called "MemPurge", which consists in purging "garbage" bytes out of a memtable and reuse the memtable struct instead of making it immutable and eventually flushing its content to storage.
The prototype is by default deactivated and is not intended for use. It is intended for correctness and validation testing. At the moment, the "MemPurge" feature can be switched on by using the `options.experimental_allow_mempurge` flag. For this early stage, when the allow_mempurge flag is set to `true`, all the flush operations will be rerouted to perform a MemPurge. This is a temporary design decision that will give us the time to explore meaningful heuristics to use MemPurge at the right time for relevant workloads . Moreover, the current MemPurge operation only supports `Puts`, `Deletes`, `DeleteRange` operations, and handles `Iterators` as well as `CompactionFilter`s that are invoked at flush time .
Three unit tests are added to `db_flush_test.cc` to test if MemPurge works correctly (and checks that the previously mentioned operations are fully supported thoroughly tested).
One noticeable design decision is the timing of the MemPurge operation in the memtable workflow: for this prototype, the mempurge happens when the memtable is switched (and usually made immutable). This is an inefficient process because it implies that the entirety of the MemPurge operation happens while holding the db_mutex. Future commits will make the MemPurge operation a background task (akin to the regular flush operation) and aim at drastically enhancing the performance of this operation. The MemPurge is also not fully "WAL-compatible" yet, but when the WAL is full, or when the regular MemPurge operation fails (or when the purged memtable still needs to be flushed), a regular flush operation takes place. Later commits will also correct these behaviors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8454
Reviewed By: anand1976
Differential Revision: D29433971
Pulled By: bjlemaire
fbshipit-source-id: 6af48213554e35048a7e03816955100a80a26dc5
2021-07-02 14:22:03 +02:00
|
|
|
// Note :
|
|
|
|
// Here "key" integrates 'sequence_number'+'kType'+'user key'.
|
2020-10-01 19:08:52 +02:00
|
|
|
s = output_validator.Add(key, value);
|
|
|
|
if (!s.ok()) {
|
|
|
|
break;
|
2020-07-22 20:03:29 +02:00
|
|
|
}
|
2015-09-10 23:35:25 +02:00
|
|
|
builder->Add(key, value);
|
2022-04-16 05:25:48 +02:00
|
|
|
|
|
|
|
s = meta->UpdateBoundaries(key, value, ikey.sequence, ikey.type);
|
|
|
|
if (!s.ok()) {
|
|
|
|
break;
|
|
|
|
}
|
2015-09-10 23:35:25 +02:00
|
|
|
|
|
|
|
// TODO(noetzli): Update stats after flush, too.
|
2015-08-24 20:11:12 +02:00
|
|
|
if (io_priority == Env::IO_HIGH &&
|
|
|
|
IOSTATS(bytes_written) >= kReportFlushIOStatsEvery) {
|
2015-09-25 22:34:49 +02:00
|
|
|
ThreadStatusUtil::SetThreadOperationProperty(
|
2015-08-24 20:11:12 +02:00
|
|
|
ThreadStatus::FLUSH_BYTES_WRITTEN, IOSTATS(bytes_written));
|
2013-02-28 23:09:30 +01:00
|
|
|
}
|
2011-03-18 23:37:00 +01:00
|
|
|
}
|
2020-10-03 07:09:28 +02:00
|
|
|
if (!s.ok()) {
|
|
|
|
c_iter.status().PermitUncheckedError();
|
|
|
|
} else if (!c_iter.status().ok()) {
|
|
|
|
s = c_iter.status();
|
|
|
|
}
|
2020-10-26 21:50:03 +01:00
|
|
|
|
2020-10-01 19:08:52 +02:00
|
|
|
if (s.ok()) {
|
|
|
|
auto range_del_it = range_del_agg->NewIterator();
|
|
|
|
for (range_del_it->SeekToFirst(); range_del_it->Valid();
|
|
|
|
range_del_it->Next()) {
|
|
|
|
auto tombstone = range_del_it->Tombstone();
|
|
|
|
auto kv = tombstone.Serialize();
|
|
|
|
builder->Add(kv.first.Encode(), kv.second);
|
|
|
|
meta->UpdateBoundariesForRange(kv.first, tombstone.SerializeEndKey(),
|
2021-04-29 15:59:53 +02:00
|
|
|
tombstone.seq_,
|
|
|
|
tboptions.internal_comparator);
|
2020-10-01 19:08:52 +02:00
|
|
|
}
|
2020-09-15 06:10:09 +02:00
|
|
|
}
|
|
|
|
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
2020-03-28 00:03:05 +01:00
|
|
|
TEST_SYNC_POINT("BuildTable:BeforeFinishBuildTable");
|
2020-09-15 06:10:09 +02:00
|
|
|
const bool empty = builder->IsEmpty();
|
2021-05-21 01:06:12 +02:00
|
|
|
if (num_input_entries != nullptr) {
|
|
|
|
*num_input_entries =
|
|
|
|
c_iter.num_input_entry_scanned() + num_unfragmented_tombstones;
|
|
|
|
}
|
Support for SingleDelete()
Summary:
This patch fixes #7460559. It introduces SingleDelete as a new database
operation. This operation can be used to delete keys that were never
overwritten (no put following another put of the same key). If an overwritten
key is single deleted the behavior is undefined. Single deletion of a
non-existent key has no effect but multiple consecutive single deletions are
not allowed (see limitations).
In contrast to the conventional Delete() operation, the deletion entry is
removed along with the value when the two are lined up in a compaction. Note:
The semantics are similar to @igor's prototype that allowed to have this
behavior on the granularity of a column family (
https://reviews.facebook.net/D42093 ). This new patch, however, is more
aggressive when it comes to removing tombstones: It removes the SingleDelete
together with the value whenever there is no snapshot between them while the
older patch only did this when the sequence number of the deletion was older
than the earliest snapshot.
Most of the complex additions are in the Compaction Iterator, all other changes
should be relatively straightforward. The patch also includes basic support for
single deletions in db_stress and db_bench.
Limitations:
- Not compatible with cuckoo hash tables
- Single deletions cannot be used in combination with merges and normal
deletions on the same key (other keys are not affected by this)
- Consecutive single deletions are currently not allowed (and older version of
this patch supported this so it could be resurrected if needed)
Test Plan: make all check
Reviewers: yhchiang, sdong, rven, anthony, yoshinorim, igor
Reviewed By: igor
Subscribers: maykov, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43179
2015-09-17 20:42:56 +02:00
|
|
|
if (!s.ok() || empty) {
|
2011-03-18 23:37:00 +01:00
|
|
|
builder->Abandon();
|
Support for SingleDelete()
Summary:
This patch fixes #7460559. It introduces SingleDelete as a new database
operation. This operation can be used to delete keys that were never
overwritten (no put following another put of the same key). If an overwritten
key is single deleted the behavior is undefined. Single deletion of a
non-existent key has no effect but multiple consecutive single deletions are
not allowed (see limitations).
In contrast to the conventional Delete() operation, the deletion entry is
removed along with the value when the two are lined up in a compaction. Note:
The semantics are similar to @igor's prototype that allowed to have this
behavior on the granularity of a column family (
https://reviews.facebook.net/D42093 ). This new patch, however, is more
aggressive when it comes to removing tombstones: It removes the SingleDelete
together with the value whenever there is no snapshot between them while the
older patch only did this when the sequence number of the deletion was older
than the earliest snapshot.
Most of the complex additions are in the Compaction Iterator, all other changes
should be relatively straightforward. The patch also includes basic support for
single deletions in db_stress and db_bench.
Limitations:
- Not compatible with cuckoo hash tables
- Single deletions cannot be used in combination with merges and normal
deletions on the same key (other keys are not affected by this)
- Consecutive single deletions are currently not allowed (and older version of
this patch supported this so it could be resurrected if needed)
Test Plan: make all check
Reviewers: yhchiang, sdong, rven, anthony, yoshinorim, igor
Reviewed By: igor
Subscribers: maykov, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43179
2015-09-17 20:42:56 +02:00
|
|
|
} else {
|
|
|
|
s = builder->Finish();
|
2011-03-18 23:37:00 +01:00
|
|
|
}
|
2020-07-15 20:02:44 +02:00
|
|
|
if (io_status->ok()) {
|
2020-09-17 00:45:30 +02:00
|
|
|
*io_status = builder->io_status();
|
2020-07-15 20:02:44 +02:00
|
|
|
}
|
Support for SingleDelete()
Summary:
This patch fixes #7460559. It introduces SingleDelete as a new database
operation. This operation can be used to delete keys that were never
overwritten (no put following another put of the same key). If an overwritten
key is single deleted the behavior is undefined. Single deletion of a
non-existent key has no effect but multiple consecutive single deletions are
not allowed (see limitations).
In contrast to the conventional Delete() operation, the deletion entry is
removed along with the value when the two are lined up in a compaction. Note:
The semantics are similar to @igor's prototype that allowed to have this
behavior on the granularity of a column family (
https://reviews.facebook.net/D42093 ). This new patch, however, is more
aggressive when it comes to removing tombstones: It removes the SingleDelete
together with the value whenever there is no snapshot between them while the
older patch only did this when the sequence number of the deletion was older
than the earliest snapshot.
Most of the complex additions are in the Compaction Iterator, all other changes
should be relatively straightforward. The patch also includes basic support for
single deletions in db_stress and db_bench.
Limitations:
- Not compatible with cuckoo hash tables
- Single deletions cannot be used in combination with merges and normal
deletions on the same key (other keys are not affected by this)
- Consecutive single deletions are currently not allowed (and older version of
this patch supported this so it could be resurrected if needed)
Test Plan: make all check
Reviewers: yhchiang, sdong, rven, anthony, yoshinorim, igor
Reviewed By: igor
Subscribers: maykov, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43179
2015-09-17 20:42:56 +02:00
|
|
|
|
|
|
|
if (s.ok() && !empty) {
|
Added EventListener::OnTableFileCreationStarted() callback
Summary: Added EventListener::OnTableFileCreationStarted. EventListener::OnTableFileCreated will be called on failure case. User can check creation status via TableFileCreationInfo::status.
Test Plan: unit test.
Reviewers: dhruba, yhchiang, ott, sdong
Reviewed By: sdong
Subscribers: sdong, kradhakrishnan, IslamAbdelRahman, andrewkr, yhchiang, leveldb, ott, dhruba
Differential Revision: https://reviews.facebook.net/D56337
2016-04-29 20:35:00 +02:00
|
|
|
uint64_t file_size = builder->FileSize();
|
|
|
|
meta->fd.file_size = file_size;
|
2015-06-04 21:03:40 +02:00
|
|
|
meta->marked_for_compaction = builder->NeedCompact();
|
Add more table properties to EventLogger
Summary:
Example output:
{"time_micros": 1431463794310521, "job": 353, "event": "table_file_creation", "file_number": 387, "file_size": 86937, "table_info": {"data_size": "81801", "index_size": "9751", "filter_size": "0", "raw_key_size": "23448", "raw_average_key_size": "24.000000", "raw_value_size": "990571", "raw_average_value_size": "1013.890481", "num_data_blocks": "245", "num_entries": "977", "filter_policy_name": "", "kDeletedKeys": "0"}}
Also fixed a bug where BuildTable() in recovery was passing Env::IOHigh argument into paranoid_checks_file parameter.
Test Plan: make check + check out the output in the log
Reviewers: sdong, rven, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38343
2015-05-13 00:53:55 +02:00
|
|
|
assert(meta->fd.GetFileSize() > 0);
|
2018-06-27 05:18:43 +02:00
|
|
|
tp = builder->GetTableProperties(); // refresh now that builder is finished
|
2021-06-18 13:56:43 +02:00
|
|
|
if (memtable_payload_bytes != nullptr &&
|
|
|
|
memtable_garbage_bytes != nullptr) {
|
|
|
|
const CompactionIterationStats& ci_stats = c_iter.iter_stats();
|
|
|
|
uint64_t total_payload_bytes = ci_stats.total_input_raw_key_bytes +
|
|
|
|
ci_stats.total_input_raw_value_bytes +
|
|
|
|
total_tombstone_payload_bytes;
|
|
|
|
uint64_t total_payload_bytes_written =
|
|
|
|
(tp.raw_key_size + tp.raw_value_size);
|
|
|
|
// Prevent underflow, which may still happen at this point
|
|
|
|
// since we only support inserts, deletes, and deleteRanges.
|
|
|
|
if (total_payload_bytes_written <= total_payload_bytes) {
|
|
|
|
*memtable_payload_bytes = total_payload_bytes;
|
|
|
|
*memtable_garbage_bytes =
|
|
|
|
total_payload_bytes - total_payload_bytes_written;
|
|
|
|
} else {
|
|
|
|
*memtable_payload_bytes = 0;
|
|
|
|
*memtable_garbage_bytes = 0;
|
|
|
|
}
|
|
|
|
}
|
Add more table properties to EventLogger
Summary:
Example output:
{"time_micros": 1431463794310521, "job": 353, "event": "table_file_creation", "file_number": 387, "file_size": 86937, "table_info": {"data_size": "81801", "index_size": "9751", "filter_size": "0", "raw_key_size": "23448", "raw_average_key_size": "24.000000", "raw_value_size": "990571", "raw_average_value_size": "1013.890481", "num_data_blocks": "245", "num_entries": "977", "filter_policy_name": "", "kDeletedKeys": "0"}}
Also fixed a bug where BuildTable() in recovery was passing Env::IOHigh argument into paranoid_checks_file parameter.
Test Plan: make check + check out the output in the log
Reviewers: sdong, rven, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38343
2015-05-13 00:53:55 +02:00
|
|
|
if (table_properties) {
|
Added EventListener::OnTableFileCreationStarted() callback
Summary: Added EventListener::OnTableFileCreationStarted. EventListener::OnTableFileCreated will be called on failure case. User can check creation status via TableFileCreationInfo::status.
Test Plan: unit test.
Reviewers: dhruba, yhchiang, ott, sdong
Reviewed By: sdong
Subscribers: sdong, kradhakrishnan, IslamAbdelRahman, andrewkr, yhchiang, leveldb, ott, dhruba
Differential Revision: https://reviews.facebook.net/D56337
2016-04-29 20:35:00 +02:00
|
|
|
*table_properties = tp;
|
Add more table properties to EventLogger
Summary:
Example output:
{"time_micros": 1431463794310521, "job": 353, "event": "table_file_creation", "file_number": 387, "file_size": 86937, "table_info": {"data_size": "81801", "index_size": "9751", "filter_size": "0", "raw_key_size": "23448", "raw_average_key_size": "24.000000", "raw_value_size": "990571", "raw_average_value_size": "1013.890481", "num_data_blocks": "245", "num_entries": "977", "filter_policy_name": "", "kDeletedKeys": "0"}}
Also fixed a bug where BuildTable() in recovery was passing Env::IOHigh argument into paranoid_checks_file parameter.
Test Plan: make check + check out the output in the log
Reviewers: sdong, rven, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38343
2015-05-13 00:53:55 +02:00
|
|
|
}
|
|
|
|
}
|
2011-03-18 23:37:00 +01:00
|
|
|
delete builder;
|
|
|
|
|
|
|
|
// Finish and check for file errors
|
2020-09-20 02:56:39 +02:00
|
|
|
TEST_SYNC_POINT("BuildTable:BeforeSyncTable");
|
2017-02-13 19:54:38 +01:00
|
|
|
if (s.ok() && !empty) {
|
2021-04-26 21:43:02 +02:00
|
|
|
StopWatch sw(ioptions.clock, ioptions.stats, TABLE_SYNC_MICROS);
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
2020-03-28 00:03:05 +01:00
|
|
|
*io_status = file_writer->Sync(ioptions.use_fsync);
|
2011-03-18 23:37:00 +01:00
|
|
|
}
|
2020-09-20 02:56:39 +02:00
|
|
|
TEST_SYNC_POINT("BuildTable:BeforeCloseTableFile");
|
2020-07-15 20:02:44 +02:00
|
|
|
if (s.ok() && io_status->ok() && !empty) {
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
2020-03-28 00:03:05 +01:00
|
|
|
*io_status = file_writer->Close();
|
2011-03-18 23:37:00 +01:00
|
|
|
}
|
2020-07-15 20:02:44 +02:00
|
|
|
if (s.ok() && io_status->ok() && !empty) {
|
2020-03-30 00:57:02 +02:00
|
|
|
// Add the checksum information to file metadata.
|
|
|
|
meta->file_checksum = file_writer->GetFileChecksum();
|
|
|
|
meta->file_checksum_func_name = file_writer->GetFileChecksumFuncName();
|
2020-08-25 19:44:39 +02:00
|
|
|
file_checksum = meta->file_checksum;
|
|
|
|
file_checksum_func_name = meta->file_checksum_func_name;
|
2020-03-30 00:57:02 +02:00
|
|
|
}
|
|
|
|
|
2020-07-15 20:02:44 +02:00
|
|
|
if (s.ok()) {
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
2020-03-28 00:03:05 +01:00
|
|
|
s = *io_status;
|
|
|
|
}
|
|
|
|
|
2020-10-26 21:50:03 +01:00
|
|
|
if (blob_file_builder) {
|
|
|
|
if (s.ok()) {
|
|
|
|
s = blob_file_builder->Finish();
|
2021-03-18 04:43:22 +01:00
|
|
|
} else {
|
2021-09-17 02:17:40 +02:00
|
|
|
blob_file_builder->Abandon(s);
|
2020-10-26 21:50:03 +01:00
|
|
|
}
|
|
|
|
blob_file_builder.reset();
|
|
|
|
}
|
|
|
|
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
2020-03-28 00:03:05 +01:00
|
|
|
// TODO Also check the IO status when create the Iterator.
|
2011-03-18 23:37:00 +01:00
|
|
|
|
2022-03-15 22:45:34 +01:00
|
|
|
TEST_SYNC_POINT("BuildTable:BeforeOutputValidation");
|
Support for SingleDelete()
Summary:
This patch fixes #7460559. It introduces SingleDelete as a new database
operation. This operation can be used to delete keys that were never
overwritten (no put following another put of the same key). If an overwritten
key is single deleted the behavior is undefined. Single deletion of a
non-existent key has no effect but multiple consecutive single deletions are
not allowed (see limitations).
In contrast to the conventional Delete() operation, the deletion entry is
removed along with the value when the two are lined up in a compaction. Note:
The semantics are similar to @igor's prototype that allowed to have this
behavior on the granularity of a column family (
https://reviews.facebook.net/D42093 ). This new patch, however, is more
aggressive when it comes to removing tombstones: It removes the SingleDelete
together with the value whenever there is no snapshot between them while the
older patch only did this when the sequence number of the deletion was older
than the earliest snapshot.
Most of the complex additions are in the Compaction Iterator, all other changes
should be relatively straightforward. The patch also includes basic support for
single deletions in db_stress and db_bench.
Limitations:
- Not compatible with cuckoo hash tables
- Single deletions cannot be used in combination with merges and normal
deletions on the same key (other keys are not affected by this)
- Consecutive single deletions are currently not allowed (and older version of
this patch supported this so it could be resurrected if needed)
Test Plan: make all check
Reviewers: yhchiang, sdong, rven, anthony, yoshinorim, igor
Reviewed By: igor
Subscribers: maykov, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43179
2015-09-17 20:42:56 +02:00
|
|
|
if (s.ok() && !empty) {
|
2011-03-18 23:37:00 +01:00
|
|
|
// Verify that the table is usable
|
2017-04-13 22:07:33 +02:00
|
|
|
// We set for_compaction to false and don't OptimizeForCompactionTableRead
|
|
|
|
// here because this is a special case after we finish the table building
|
|
|
|
// No matter whether use_direct_io_for_flush_and_compaction is true,
|
|
|
|
// we will regrad this verification as user reads since the goal is
|
|
|
|
// to cache it here for further user reads
|
2020-08-04 00:21:56 +02:00
|
|
|
ReadOptions read_options;
|
2015-10-13 00:06:38 +02:00
|
|
|
std::unique_ptr<InternalIterator> it(table_cache->NewIterator(
|
2021-04-29 15:59:53 +02:00
|
|
|
read_options, file_options, tboptions.internal_comparator, *meta,
|
2022-01-21 20:36:36 +01:00
|
|
|
nullptr /* range_del_agg */, mutable_cf_options.prefix_extractor,
|
|
|
|
nullptr,
|
Measure file read latency histogram per level
Summary: In internal stats, remember read latency histogram, if statistics is enabled. It can be retrieved from DB::GetProperty() with "rocksdb.dbstats" property, if it is enabled.
Test Plan: Manually run db_bench and prints out "rocksdb.dbstats" by hand and make sure it prints out as expected
Reviewers: igor, IslamAbdelRahman, rven, kradhakrishnan, anthony, yhchiang
Reviewed By: yhchiang
Subscribers: MarkCallaghan, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D44193
2015-08-13 23:35:54 +02:00
|
|
|
(internal_stats == nullptr) ? nullptr
|
|
|
|
: internal_stats->GetFileReadHist(0),
|
2019-06-20 23:28:22 +02:00
|
|
|
TableReaderCaller::kFlush, /*arena=*/nullptr,
|
Add more LSM info to FilterBuildingContext (#8246)
Summary:
Add `num_levels`, `is_bottommost`, and table file creation
`reason` to `FilterBuildingContext`, in anticipation of more powerful
Bloom-like filter support.
To support this, added `is_bottommost` and `reason` to
`TableBuilderOptions`, which allowed removing `reason` parameter from
`rocksdb::BuildTable`.
I attempted to remove `skip_filters` from `TableBuilderOptions`, because
filter construction decisions should arise from options, not one-off
parameters. I could not completely remove it because the public API for
SstFileWriter takes a `skip_filters` parameter, and translating this
into an option change would mean awkwardly replacing the table_factory
if it is BlockBasedTableFactory with new filter_policy=nullptr option.
I marked this public skip_filters option as deprecated because of this
oddity. (skip_filters on the read side probably makes sense.)
At least `skip_filters` is now largely hidden for users of
`TableBuilderOptions` and is no longer used for implementing the
optimize_filters_for_hits option. Bringing the logic for that option
closer to handling of FilterBuildingContext makes it more obvious that
hese two are using the same notion of "bottommost." (Planned:
configuration options for Bloom-like filters that generalize
`optimize_filters_for_hits`)
Recommended follow-up: Try to get away from "bottommost level" naming of
things, which is inaccurate (see
VersionStorageInfo::RangeMightExistAfterSortedRun), and move to
"bottommost run" or just "bottommost."
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8246
Test Plan:
extended an existing unit test to exercise and check various
filter building contexts. Also, existing tests for
optimize_filters_for_hits validate some of the "bottommost" handling,
which is now closely connected to FilterBuildingContext::is_bottommost
through TableBuilderOptions::is_bottommost
Reviewed By: mrambacher
Differential Revision: D28099346
Pulled By: pdillinger
fbshipit-source-id: 2c1072e29c24d4ac404c761a7b7663292372600a
2021-04-30 22:49:24 +02:00
|
|
|
/*skip_filter=*/false, tboptions.level_at_creation,
|
2020-06-10 01:49:07 +02:00
|
|
|
MaxFileSizeForL0MetaPin(mutable_cf_options),
|
|
|
|
/*smallest_compaction_key=*/nullptr,
|
Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621)
Summary:
Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.
Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.
It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.
Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621
Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.
Reviewed By: siying
Differential Revision: D20786930
Pulled By: al13n321
fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
2020-04-16 02:37:23 +02:00
|
|
|
/*largest_compaction_key*/ nullptr,
|
|
|
|
/*allow_unprepared_value*/ false));
|
2011-03-18 23:37:00 +01:00
|
|
|
s = it->status();
|
2015-04-18 00:26:50 +02:00
|
|
|
if (s.ok() && paranoid_file_checks) {
|
2021-04-29 15:59:53 +02:00
|
|
|
OutputValidator file_validator(tboptions.internal_comparator,
|
2020-10-01 19:08:52 +02:00
|
|
|
/*enable_order_check=*/true,
|
|
|
|
/*enable_hash=*/true);
|
Support for SingleDelete()
Summary:
This patch fixes #7460559. It introduces SingleDelete as a new database
operation. This operation can be used to delete keys that were never
overwritten (no put following another put of the same key). If an overwritten
key is single deleted the behavior is undefined. Single deletion of a
non-existent key has no effect but multiple consecutive single deletions are
not allowed (see limitations).
In contrast to the conventional Delete() operation, the deletion entry is
removed along with the value when the two are lined up in a compaction. Note:
The semantics are similar to @igor's prototype that allowed to have this
behavior on the granularity of a column family (
https://reviews.facebook.net/D42093 ). This new patch, however, is more
aggressive when it comes to removing tombstones: It removes the SingleDelete
together with the value whenever there is no snapshot between them while the
older patch only did this when the sequence number of the deletion was older
than the earliest snapshot.
Most of the complex additions are in the Compaction Iterator, all other changes
should be relatively straightforward. The patch also includes basic support for
single deletions in db_stress and db_bench.
Limitations:
- Not compatible with cuckoo hash tables
- Single deletions cannot be used in combination with merges and normal
deletions on the same key (other keys are not affected by this)
- Consecutive single deletions are currently not allowed (and older version of
this patch supported this so it could be resurrected if needed)
Test Plan: make all check
Reviewers: yhchiang, sdong, rven, anthony, yoshinorim, igor
Reviewed By: igor
Subscribers: maykov, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43179
2015-09-17 20:42:56 +02:00
|
|
|
for (it->SeekToFirst(); it->Valid(); it->Next()) {
|
2020-07-22 20:03:29 +02:00
|
|
|
// Generate a rolling 64-bit hash of the key and values
|
2020-10-01 19:08:52 +02:00
|
|
|
file_validator.Add(it->key(), it->value()).PermitUncheckedError();
|
Support for SingleDelete()
Summary:
This patch fixes #7460559. It introduces SingleDelete as a new database
operation. This operation can be used to delete keys that were never
overwritten (no put following another put of the same key). If an overwritten
key is single deleted the behavior is undefined. Single deletion of a
non-existent key has no effect but multiple consecutive single deletions are
not allowed (see limitations).
In contrast to the conventional Delete() operation, the deletion entry is
removed along with the value when the two are lined up in a compaction. Note:
The semantics are similar to @igor's prototype that allowed to have this
behavior on the granularity of a column family (
https://reviews.facebook.net/D42093 ). This new patch, however, is more
aggressive when it comes to removing tombstones: It removes the SingleDelete
together with the value whenever there is no snapshot between them while the
older patch only did this when the sequence number of the deletion was older
than the earliest snapshot.
Most of the complex additions are in the Compaction Iterator, all other changes
should be relatively straightforward. The patch also includes basic support for
single deletions in db_stress and db_bench.
Limitations:
- Not compatible with cuckoo hash tables
- Single deletions cannot be used in combination with merges and normal
deletions on the same key (other keys are not affected by this)
- Consecutive single deletions are currently not allowed (and older version of
this patch supported this so it could be resurrected if needed)
Test Plan: make all check
Reviewers: yhchiang, sdong, rven, anthony, yoshinorim, igor
Reviewed By: igor
Subscribers: maykov, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43179
2015-09-17 20:42:56 +02:00
|
|
|
}
|
2015-04-18 00:26:50 +02:00
|
|
|
s = it->status();
|
2020-10-01 19:08:52 +02:00
|
|
|
if (s.ok() && !output_validator.CompareValidator(file_validator)) {
|
2020-08-06 23:24:24 +02:00
|
|
|
s = Status::Corruption("Paranoid checksums do not match");
|
2020-07-22 20:03:29 +02:00
|
|
|
}
|
2015-04-18 00:26:50 +02:00
|
|
|
}
|
2011-03-18 23:37:00 +01:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
// Check for input iterator errors
|
|
|
|
if (!iter->status().ok()) {
|
|
|
|
s = iter->status();
|
|
|
|
}
|
|
|
|
|
Support for SingleDelete()
Summary:
This patch fixes #7460559. It introduces SingleDelete as a new database
operation. This operation can be used to delete keys that were never
overwritten (no put following another put of the same key). If an overwritten
key is single deleted the behavior is undefined. Single deletion of a
non-existent key has no effect but multiple consecutive single deletions are
not allowed (see limitations).
In contrast to the conventional Delete() operation, the deletion entry is
removed along with the value when the two are lined up in a compaction. Note:
The semantics are similar to @igor's prototype that allowed to have this
behavior on the granularity of a column family (
https://reviews.facebook.net/D42093 ). This new patch, however, is more
aggressive when it comes to removing tombstones: It removes the SingleDelete
together with the value whenever there is no snapshot between them while the
older patch only did this when the sequence number of the deletion was older
than the earliest snapshot.
Most of the complex additions are in the Compaction Iterator, all other changes
should be relatively straightforward. The patch also includes basic support for
single deletions in db_stress and db_bench.
Limitations:
- Not compatible with cuckoo hash tables
- Single deletions cannot be used in combination with merges and normal
deletions on the same key (other keys are not affected by this)
- Consecutive single deletions are currently not allowed (and older version of
this patch supported this so it could be resurrected if needed)
Test Plan: make all check
Reviewers: yhchiang, sdong, rven, anthony, yoshinorim, igor
Reviewed By: igor
Subscribers: maykov, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43179
2015-09-17 20:42:56 +02:00
|
|
|
if (!s.ok() || meta->fd.GetFileSize() == 0) {
|
2020-10-26 21:50:03 +01:00
|
|
|
TEST_SYNC_POINT("BuildTable:BeforeDeleteFile");
|
|
|
|
|
2020-09-15 06:10:09 +02:00
|
|
|
constexpr IODebugContext* dbg = nullptr;
|
|
|
|
|
2022-05-04 19:15:30 +02:00
|
|
|
if (table_file_created) {
|
|
|
|
Status ignored = fs->DeleteFile(fname, IOOptions(), dbg);
|
|
|
|
ignored.PermitUncheckedError();
|
|
|
|
}
|
2020-09-15 06:10:09 +02:00
|
|
|
|
|
|
|
assert(blob_file_additions || blob_file_paths.empty());
|
|
|
|
|
|
|
|
if (blob_file_additions) {
|
|
|
|
for (const std::string& blob_file_path : blob_file_paths) {
|
2022-05-04 19:15:30 +02:00
|
|
|
Status ignored = DeleteDBFile(&db_options, blob_file_path, dbname,
|
|
|
|
/*force_bg=*/false, /*force_fg=*/false);
|
2020-09-29 18:47:33 +02:00
|
|
|
ignored.PermitUncheckedError();
|
2021-03-18 04:43:22 +01:00
|
|
|
TEST_SYNC_POINT("BuildTable::AfterDeleteFile");
|
2020-09-15 06:10:09 +02:00
|
|
|
}
|
|
|
|
}
|
2011-03-18 23:37:00 +01:00
|
|
|
}
|
Added EventListener::OnTableFileCreationStarted() callback
Summary: Added EventListener::OnTableFileCreationStarted. EventListener::OnTableFileCreated will be called on failure case. User can check creation status via TableFileCreationInfo::status.
Test Plan: unit test.
Reviewers: dhruba, yhchiang, ott, sdong
Reviewed By: sdong
Subscribers: sdong, kradhakrishnan, IslamAbdelRahman, andrewkr, yhchiang, leveldb, ott, dhruba
Differential Revision: https://reviews.facebook.net/D56337
2016-04-29 20:35:00 +02:00
|
|
|
|
2021-11-03 16:42:08 +01:00
|
|
|
Status status_for_listener = s;
|
2019-10-14 18:52:29 +02:00
|
|
|
if (meta->fd.GetFileSize() == 0) {
|
|
|
|
fname = "(nil)";
|
2021-11-03 16:42:08 +01:00
|
|
|
if (s.ok()) {
|
|
|
|
status_for_listener = Status::Aborted("Empty SST file not kept");
|
|
|
|
}
|
2019-10-14 18:52:29 +02:00
|
|
|
}
|
Added EventListener::OnTableFileCreationStarted() callback
Summary: Added EventListener::OnTableFileCreationStarted. EventListener::OnTableFileCreated will be called on failure case. User can check creation status via TableFileCreationInfo::status.
Test Plan: unit test.
Reviewers: dhruba, yhchiang, ott, sdong
Reviewed By: sdong
Subscribers: sdong, kradhakrishnan, IslamAbdelRahman, andrewkr, yhchiang, leveldb, ott, dhruba
Differential Revision: https://reviews.facebook.net/D56337
2016-04-29 20:35:00 +02:00
|
|
|
// Output to event logger and fire events.
|
2018-08-12 01:44:11 +02:00
|
|
|
EventHelpers::LogAndNotifyTableFileCreationFinished(
|
2021-04-29 15:59:53 +02:00
|
|
|
event_logger, ioptions.listeners, dbname, tboptions.column_family_name,
|
Add more LSM info to FilterBuildingContext (#8246)
Summary:
Add `num_levels`, `is_bottommost`, and table file creation
`reason` to `FilterBuildingContext`, in anticipation of more powerful
Bloom-like filter support.
To support this, added `is_bottommost` and `reason` to
`TableBuilderOptions`, which allowed removing `reason` parameter from
`rocksdb::BuildTable`.
I attempted to remove `skip_filters` from `TableBuilderOptions`, because
filter construction decisions should arise from options, not one-off
parameters. I could not completely remove it because the public API for
SstFileWriter takes a `skip_filters` parameter, and translating this
into an option change would mean awkwardly replacing the table_factory
if it is BlockBasedTableFactory with new filter_policy=nullptr option.
I marked this public skip_filters option as deprecated because of this
oddity. (skip_filters on the read side probably makes sense.)
At least `skip_filters` is now largely hidden for users of
`TableBuilderOptions` and is no longer used for implementing the
optimize_filters_for_hits option. Bringing the logic for that option
closer to handling of FilterBuildingContext makes it more obvious that
hese two are using the same notion of "bottommost." (Planned:
configuration options for Bloom-like filters that generalize
`optimize_filters_for_hits`)
Recommended follow-up: Try to get away from "bottommost level" naming of
things, which is inaccurate (see
VersionStorageInfo::RangeMightExistAfterSortedRun), and move to
"bottommost run" or just "bottommost."
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8246
Test Plan:
extended an existing unit test to exercise and check various
filter building contexts. Also, existing tests for
optimize_filters_for_hits validate some of the "bottommost" handling,
which is now closely connected to FilterBuildingContext::is_bottommost
through TableBuilderOptions::is_bottommost
Reviewed By: mrambacher
Differential Revision: D28099346
Pulled By: pdillinger
fbshipit-source-id: 2c1072e29c24d4ac404c761a7b7663292372600a
2021-04-30 22:49:24 +02:00
|
|
|
fname, job_id, meta->fd, meta->oldest_blob_file_number, tp,
|
2021-11-03 16:42:08 +01:00
|
|
|
tboptions.reason, status_for_listener, file_checksum,
|
|
|
|
file_checksum_func_name);
|
Added EventListener::OnTableFileCreationStarted() callback
Summary: Added EventListener::OnTableFileCreationStarted. EventListener::OnTableFileCreated will be called on failure case. User can check creation status via TableFileCreationInfo::status.
Test Plan: unit test.
Reviewers: dhruba, yhchiang, ott, sdong
Reviewed By: sdong
Subscribers: sdong, kradhakrishnan, IslamAbdelRahman, andrewkr, yhchiang, leveldb, ott, dhruba
Differential Revision: https://reviews.facebook.net/D56337
2016-04-29 20:35:00 +02:00
|
|
|
|
2011-03-18 23:37:00 +01:00
|
|
|
return s;
|
|
|
|
}
|
|
|
|
|
2020-02-20 21:07:53 +01:00
|
|
|
} // namespace ROCKSDB_NAMESPACE
|