Blob DB: Inline small values in base DB
Summary:
Adding the `min_blob_size` option to allow storing small values in base db (in LSM tree) together with the key. The goal is to improve performance for small values, while taking advantage of blob db's low write amplification for large values.
Also adding expiration timestamp to blob index. It will be useful to evict stale blob indexes in base db by adding a compaction filter. I'll work on the compaction filter in future patches.
See blob_index.h for the new blob index format. There are 4 cases when writing a new key:
* small value w/o TTL: put in base db as normal value (i.e. ValueType::kTypeValue)
* small value w/ TTL: put (type, expiration, value) to base db.
* large value w/o TTL: write value to blob log and put (type, file, offset, size, compression) to base db.
* large value w/ TTL: write value to blob log and put (type, expiration, file, offset, size, compression) to base db.
Closes https://github.com/facebook/rocksdb/pull/3066
Differential Revision: D6142115
Pulled By: yiwu-arbug
fbshipit-source-id: 9526e76e19f0839310a3f5f2a43772a4ad182cd0
2017-10-26 12:19:43 -07:00
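The four write cases listed in the commit message above can be sketched as a small decision function. This is only an illustration: `WriteCase` and `ClassifyWrite` are hypothetical names that do not exist in RocksDB, and treating a value as "small" when it is strictly shorter than `min_blob_size` is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical enum for illustration; not part of RocksDB.
enum class WriteCase {
  kPlainValue,  // small value, no TTL: stored as ValueType::kTypeValue
  kInlinedTTL,  // small value with TTL: (type, expiration, value)
  kBlob,        // large value, no TTL: (type, file, offset, size, compression)
  kBlobTTL      // large value with TTL: (type, expiration, file, offset, size,
                // compression)
};

// Picks the representation for a new key based on the value size and on
// whether the write carries a TTL. The "strictly shorter than min_blob_size"
// boundary is an assumption of this sketch.
inline WriteCase ClassifyWrite(std::size_t value_size,
                               std::size_t min_blob_size, bool has_ttl) {
  const bool is_small = value_size < min_blob_size;
  if (is_small) {
    return has_ttl ? WriteCase::kInlinedTTL : WriteCase::kPlainValue;
  }
  return has_ttl ? WriteCase::kBlobTTL : WriteCase::kBlob;
}
```

Only the last three cases produce a blob index; the first is an ordinary value, which is why the format has no inlined type without TTL.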
// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).
#pragma once

#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

Introduce a blob file reader class (#7461)
Summary:
The patch adds a class called `BlobFileReader` that can be used to retrieve blobs
using the information available in blob references (e.g. blob file number, offset, and
size). This will come in handy when implementing blob support for `Get`, `MultiGet`,
and iterators, and also for compaction/garbage collection.
When a `BlobFileReader` object is created (using the factory method `Create`),
it first checks whether the specified file is potentially valid by comparing the file
size against the combined size of the blob file header and footer (files smaller than
the threshold are considered malformed). Then, it opens the file, and reads and verifies
the header and footer. The verification involves magic number/CRC checks
as well as checking for unexpected header/footer fields, e.g. incorrect column family ID
or TTL blob files.
Blobs can be retrieved using `GetBlob`. `GetBlob` validates the offset and compression
type passed by the caller (because of the presence of the header and footer, the
specified offset cannot be too close to the start/end of the file; also, the compression type
has to match the one in the blob file header), and retrieves and potentially verifies and
uncompresses the blob. In particular, when `ReadOptions::verify_checksums` is set,
`BlobFileReader` reads the blob record header as well (as opposed to just the blob itself)
and verifies the key/value size, the key itself, as well as the CRC of the blob record header
and the key/value pair.
In addition, the patch exposes the compression type from `BlobIndex` (both using an
accessor and via `DebugString`), and adds a blob file read latency histogram to
`InternalStats` that can be used with `BlobFileReader`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7461
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D23999219
Pulled By: ltamasi
fbshipit-source-id: deb6b1160d251258b308d5156e2ec063c3e12e5e
2020-10-07 15:43:23 -07:00
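The file-size and offset checks described in the commit message above might look roughly like the following. This is a simplified sketch, not the `BlobFileReader` code: the function names are hypothetical, the header and footer sizes are placeholder constants chosen for illustration, and the real checks may also account for the blob record header and key size.

```cpp
#include <cassert>
#include <cstdint>

// Placeholder sizes for illustration only; the real values are defined by
// the blob file format and are not taken from the source above.
constexpr uint64_t kAssumedHeaderSize = 30;
constexpr uint64_t kAssumedFooterSize = 32;

// A file shorter than header + footer cannot be a well-formed blob file.
inline bool IsPotentiallyValidFileSize(uint64_t file_size) {
  return file_size >= kAssumedHeaderSize + kAssumedFooterSize;
}

// A blob of blob_size bytes starting at offset must lie entirely between
// the header and the footer.
inline bool IsPlausibleBlobRange(uint64_t offset, uint64_t blob_size,
                                 uint64_t file_size) {
  if (!IsPotentiallyValidFileSize(file_size)) {
    return false;
  }
  if (offset < kAssumedHeaderSize) {
    return false;  // too close to the start of the file
  }
  const uint64_t end_limit = file_size - kAssumedFooterSize;
  // Phrased as a subtraction so that offset + blob_size cannot overflow.
  return offset <= end_limit && blob_size <= end_limit - offset;
}
```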
#include "rocksdb/compression_type.h"
#include "util/coding.h"
#include "util/compression.h"
#include "util/string_util.h"

namespace ROCKSDB_NAMESPACE {

// BlobIndex is a pointer to the blob and metadata of the blob. The index is
// stored in base DB as ValueType::kTypeBlobIndex.
// There are three types of blob index:
//
//    kInlinedTTL:
//      +------+------------+---------------+
//      | type | expiration | value         |
//      +------+------------+---------------+
//      | char | varint64   | variable size |
//      +------+------------+---------------+
//
//    kBlob:
//      +------+-------------+----------+----------+-------------+
//      | type | file number | offset   | size     | compression |
//      +------+-------------+----------+----------+-------------+
//      | char | varint64    | varint64 | varint64 | char        |
//      +------+-------------+----------+----------+-------------+
//
//    kBlobTTL:
//      +------+------------+-------------+----------+----------+-------------+
//      | type | expiration | file number | offset   | size     | compression |
//      +------+------------+-------------+----------+----------+-------------+
//      | char | varint64   | varint64    | varint64 | varint64 | char        |
//      +------+------------+-------------+----------+----------+-------------+
//
// There isn't a kInlined (without TTL) type since we can store it as a plain
// value (i.e. ValueType::kTypeValue).
class BlobIndex {
 public:
  enum class Type : unsigned char {
    kInlinedTTL = 0,
    kBlob = 1,
    kBlobTTL = 2,
    kUnknown = 3,
  };

  BlobIndex() : type_(Type::kUnknown) {}

  BlobIndex(const BlobIndex&) = default;
  BlobIndex& operator=(const BlobIndex&) = default;

  bool IsInlined() const { return type_ == Type::kInlinedTTL; }

  bool HasTTL() const {
    return type_ == Type::kInlinedTTL || type_ == Type::kBlobTTL;
  }

  uint64_t expiration() const {
    assert(HasTTL());
    return expiration_;
  }

  const Slice& value() const {
    assert(IsInlined());
    return value_;
  }

  uint64_t file_number() const {
    assert(!IsInlined());
    return file_number_;
  }

  uint64_t offset() const {
    assert(!IsInlined());
    return offset_;
  }

  uint64_t size() const {
    assert(!IsInlined());
    return size_;
  }

  CompressionType compression() const {
    assert(!IsInlined());
    return compression_;
  }

  // Parses an encoded blob index. The caller must pass a non-empty slice;
  // for kInlinedTTL, value_ points into the caller's buffer.
  Status DecodeFrom(Slice slice) {
    static const std::string kErrorMessage = "Error while decoding blob index";
    assert(slice.size() > 0);
    // The first byte is the type tag.
    type_ = static_cast<Type>(*slice.data());
    if (type_ >= Type::kUnknown) {
      return Status::Corruption(kErrorMessage,
                                "Unknown blob index type: " +
                                    std::to_string(static_cast<char>(type_)));
    }
    slice = Slice(slice.data() + 1, slice.size() - 1);
    // TTL variants carry the expiration right after the type tag.
    if (HasTTL()) {
      if (!GetVarint64(&slice, &expiration_)) {
        return Status::Corruption(kErrorMessage, "Corrupted expiration");
      }
    }
    if (IsInlined()) {
      // The rest of the slice is the inlined value.
      value_ = slice;
    } else {
      // Three varints, then exactly one trailing byte for the compression type.
      if (GetVarint64(&slice, &file_number_) && GetVarint64(&slice, &offset_) &&
          GetVarint64(&slice, &size_) && slice.size() == 1) {
        compression_ = static_cast<CompressionType>(*slice.data());
      } else {
        return Status::Corruption(kErrorMessage, "Corrupted blob offset");
      }
    }
    return Status::OK();
  }

  std::string DebugString(bool output_hex) const {
    std::ostringstream oss;

    if (IsInlined()) {
      oss << "[inlined blob] value:" << value_.ToString(output_hex);
    } else {
      oss << "[blob ref] file:" << file_number_ << " offset:" << offset_
          << " size:" << size_
          << " compression: " << CompressionTypeToString(compression_);
    }

    if (HasTTL()) {
      oss << " exp:" << expiration_;
    }

    return oss.str();
  }

  static void EncodeInlinedTTL(std::string* dst, uint64_t expiration,
                               const Slice& value) {
    assert(dst != nullptr);
    dst->clear();
    dst->reserve(1 + kMaxVarint64Length + value.size());
    dst->push_back(static_cast<char>(Type::kInlinedTTL));
    PutVarint64(dst, expiration);
    dst->append(value.data(), value.size());
  }

  static void EncodeBlob(std::string* dst, uint64_t file_number,
                         uint64_t offset, uint64_t size,
                         CompressionType compression) {
    assert(dst != nullptr);
    dst->clear();
    dst->reserve(kMaxVarint64Length * 3 + 2);
    dst->push_back(static_cast<char>(Type::kBlob));
    PutVarint64(dst, file_number);
    PutVarint64(dst, offset);
    PutVarint64(dst, size);
    dst->push_back(static_cast<char>(compression));
  }

  static void EncodeBlobTTL(std::string* dst, uint64_t expiration,
                            uint64_t file_number, uint64_t offset,
                            uint64_t size, CompressionType compression) {
    assert(dst != nullptr);
    dst->clear();
    dst->reserve(kMaxVarint64Length * 4 + 2);
    dst->push_back(static_cast<char>(Type::kBlobTTL));
    PutVarint64(dst, expiration);
    PutVarint64(dst, file_number);
    PutVarint64(dst, offset);
    PutVarint64(dst, size);
    dst->push_back(static_cast<char>(compression));
  }

 private:
  Type type_ = Type::kUnknown;
  uint64_t expiration_ = 0;
  Slice value_;
  uint64_t file_number_ = 0;
  uint64_t offset_ = 0;
  uint64_t size_ = 0;
  CompressionType compression_ = kNoCompression;
};

} // namespace ROCKSDB_NAMESPACE
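
The kBlobTTL layout documented in the header comment can be exercised without RocksDB. The sketch below is a standalone round trip under stated assumptions: every name ending in `Sketch` is hypothetical, the varint helpers are stand-ins for `PutVarint64`/`GetVarint64` (assumed to use the usual 7-bits-per-byte little-endian encoding), and the compression type is kept as a raw byte instead of `CompressionType`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Stand-in for PutVarint64: 7 payload bits per byte, high bit = "more".
inline void PutVarint64Sketch(std::string* dst, uint64_t v) {
  while (v >= 0x80) {
    dst->push_back(static_cast<char>((v & 0x7f) | 0x80));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

// Stand-in for GetVarint64: reads one varint at *pos, advancing *pos.
inline bool GetVarint64Sketch(const std::string& data, std::size_t* pos,
                              uint64_t* out) {
  uint64_t result = 0;
  for (unsigned shift = 0; shift <= 63 && *pos < data.size(); shift += 7) {
    const uint64_t byte = static_cast<unsigned char>(data[(*pos)++]);
    result |= (byte & 0x7f) << shift;
    if ((byte & 0x80) == 0) {
      *out = result;
      return true;
    }
  }
  return false;  // truncated or over-long input
}

struct BlobRefSketch {
  uint64_t expiration = 0;
  uint64_t file_number = 0;
  uint64_t offset = 0;
  uint64_t size = 0;
  unsigned char compression = 0;
};

constexpr unsigned char kBlobTTLTagSketch = 2;  // mirrors Type::kBlobTTL

// Layout: type | expiration | file number | offset | size | compression.
inline std::string EncodeBlobTTLSketch(const BlobRefSketch& ref) {
  std::string dst;
  dst.push_back(static_cast<char>(kBlobTTLTagSketch));
  PutVarint64Sketch(&dst, ref.expiration);
  PutVarint64Sketch(&dst, ref.file_number);
  PutVarint64Sketch(&dst, ref.offset);
  PutVarint64Sketch(&dst, ref.size);
  dst.push_back(static_cast<char>(ref.compression));
  return dst;
}

// Returns false on any malformed input; exactly one byte must remain for
// the compression type after the four varints.
inline bool DecodeBlobTTLSketch(const std::string& data, BlobRefSketch* ref) {
  if (data.empty() ||
      static_cast<unsigned char>(data[0]) != kBlobTTLTagSketch) {
    return false;
  }
  std::size_t pos = 1;
  if (!GetVarint64Sketch(data, &pos, &ref->expiration) ||
      !GetVarint64Sketch(data, &pos, &ref->file_number) ||
      !GetVarint64Sketch(data, &pos, &ref->offset) ||
      !GetVarint64Sketch(data, &pos, &ref->size) ||
      data.size() - pos != 1) {
    return false;
  }
  ref->compression = static_cast<unsigned char>(data[pos]);
  return true;
}
```

Requiring exactly one trailing byte mirrors the `slice.size() == 1` check in `DecodeFrom`, so an encoding with a missing or extra byte is rejected instead of being silently accepted.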