rocksdb/util/hash.cc
Andrew Kryczka 78ee8564ad Integrity protection for live updates to WriteBatch (#7748)
Summary:
This PR adds the foundation classes for key-value integrity protection and the first use case: protecting live updates from the source buffers added to `WriteBatch` through the destination buffer in `MemTable`. The width of the protection info is not yet configurable: only eight bytes per key is supported. This PR allows users to enable protection by constructing `WriteBatch` with `protection_bytes_per_key == 8`. It does not yet expose a way for users to get integrity protection via other write APIs (e.g., `Put()`, `Merge()`, `Delete()`).

The foundation classes (`ProtectionInfo.*`) embed the coverage info in their type, and provide `Protect.*()` and `Strip.*()` functions to navigate between types with different coverage. For making bytes per key configurable (for powers of two up to eight) in the future, these classes are templated on the unsigned integer type used to store the protection info. That integer contains the XOR'd result of hashes with independent seeds for all covered fields. For integer fields, the hash is computed on the raw unadjusted bytes, so the result is endian-dependent. The most significant bytes are truncated when the hash value (8 bytes) is wider than the protection integer.
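
The XOR combination and the truncation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the RocksDB implementation: `SeededHash` is a stand-in (FNV-1a with the seed folded into the initial state) for the real 64-bit hash, and the seed constants, `ProtectKV`, and `ProtectKVExample` are hypothetical names introduced only for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Stand-in for a seeded 64-bit hash (assumption: the real code uses a
// different, stronger hash). FNV-1a with the seed mixed into the basis.
inline uint64_t SeededHash(const std::string& data, uint64_t seed) {
  uint64_t h = 14695981039346656037ULL ^ seed;
  for (unsigned char c : data) {
    h ^= c;
    h *= 1099511628211ULL;
  }
  return h;
}

// Hypothetical per-field seeds; the only requirement shown here is that
// each covered field uses a different one.
constexpr uint64_t kSeedK = 0x1111111111111111ULL;
constexpr uint64_t kSeedV = 0x2222222222222222ULL;

// Protection value covering key and value: XOR of independently seeded
// hashes. T is the unsigned integer that stores the protection info;
// converting to a narrower T keeps the low-order bytes and drops the most
// significant ones.
template <typename T>
T ProtectKV(const std::string& key, const std::string& value) {
  uint64_t full = SeededHash(key, kSeedK) ^ SeededHash(value, kSeedV);
  return static_cast<T>(full);
}

inline void ProtectKVExample() {
  // Swapping key and value changes the result because the seeds differ.
  assert(ProtectKV<uint64_t>("k", "v") != ProtectKV<uint64_t>("v", "k"));
  // A 4-byte protection value is the low half of the 8-byte one.
  assert(ProtectKV<uint32_t>("k", "v") ==
         static_cast<uint32_t>(ProtectKV<uint64_t>("k", "v")));
}
```

Without distinct seeds, XOR alone would be symmetric in its inputs, so a swapped key and value would produce the same protection value.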

When `WriteBatch` is constructed with `protection_bytes_per_key == 8`, we hold a `ProtectionInfoKVOTC` (i.e., one that covers key, value, optype aka `ValueType`, timestamp, and CF ID) for each entry added to the batch. The protection info is generated from the original buffers passed by the user, as well as the original metadata generated internally. When writing to memtable, each entry is transformed to a `ProtectionInfoKVOTS` (i.e., dropping coverage of CF ID and adding coverage of sequence number), since at that point we know the sequence number, and have already selected a memtable corresponding to a particular CF. This protection info is verified once the entry is encoded in the `MemTable` buffer.
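
The change of coverage from KVOTC to KVOTS can be sketched with plain XOR arithmetic. Everything below is an assumption made for illustration: the FNV-1a stand-in hash, the seed values, and the function names (`ProtectKVOTC`, `ToKVOTS`, `VerifyKVOTS`, `CoverageExample`) are hypothetical and do not reproduce the `ProtectionInfo.*` class interfaces.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Stand-in seeded hash (FNV-1a); an assumption, not RocksDB's hash.
inline uint64_t SeededHash(const void* p, size_t n, uint64_t seed) {
  const unsigned char* bytes = static_cast<const unsigned char*>(p);
  uint64_t h = 14695981039346656037ULL ^ seed;
  for (size_t i = 0; i < n; ++i) {
    h ^= bytes[i];
    h *= 1099511628211ULL;
  }
  return h;
}

// Hypothetical seeds, one per covered field.
constexpr uint64_t kSeedK = 1, kSeedV = 2, kSeedO = 3, kSeedT = 4,
                   kSeedC = 5, kSeedS = 6;

inline uint64_t HashStr(const std::string& s, uint64_t seed) {
  return SeededHash(s.data(), s.size(), seed);
}

// Integers are hashed over their raw in-memory bytes (endian-dependent).
template <typename Int>
uint64_t HashInt(Int v, uint64_t seed) {
  return SeededHash(&v, sizeof(v), seed);
}

// Coverage when an entry is added to the batch: key, value, op type,
// timestamp, and CF ID.
inline uint64_t ProtectKVOTC(const std::string& k, const std::string& v,
                             uint8_t op, const std::string& ts,
                             uint32_t cf_id) {
  return HashStr(k, kSeedK) ^ HashStr(v, kSeedV) ^ HashInt(op, kSeedO) ^
         HashStr(ts, kSeedT) ^ HashInt(cf_id, kSeedC);
}

// XOR is its own inverse: stripping CF ID coverage XORs its hash back out,
// and adding sequence number coverage XORs a new hash in.
inline uint64_t ToKVOTS(uint64_t kvotc, uint32_t cf_id, uint64_t seqno) {
  return kvotc ^ HashInt(cf_id, kSeedC) ^ HashInt(seqno, kSeedS);
}

// Recompute KVOTS coverage from fields (for example, as decoded from an
// encoded buffer) and compare against the carried protection value.
inline bool VerifyKVOTS(uint64_t expected, const std::string& k,
                        const std::string& v, uint8_t op,
                        const std::string& ts, uint64_t seqno) {
  uint64_t actual = HashStr(k, kSeedK) ^ HashStr(v, kSeedV) ^
                    HashInt(op, kSeedO) ^ HashStr(ts, kSeedT) ^
                    HashInt(seqno, kSeedS);
  return actual == expected;
}

inline void CoverageExample() {
  uint64_t kvotc = ProtectKVOTC("key", "val", 1, "", 7);
  uint64_t kvots = ToKVOTS(kvotc, 7, 100);
  assert(VerifyKVOTS(kvots, "key", "val", 1, "", 100));   // intact entry
  assert(!VerifyKVOTS(kvots, "key", "vbl", 1, "", 100));  // one byte changed
}
```

Because the protection value is transformed rather than recomputed from the destination buffer, a corruption introduced anywhere between the user's buffers and the encoded entry shows up as a mismatch at verification time.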

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7748

Test Plan:
- an integration test to verify a wide variety of single-byte changes to the encoded `MemTable` buffer are caught
- add to stress/crash test to verify it works in a variety of configs/operations without intentional corruption
- [deferred] unit tests for `ProtectionInfo.*` classes covering edge cases such as KV swap and the interchangeability of the `SliceParts` and `Slice` APIs

Reviewed By: pdillinger

Differential Revision: D25754492

Pulled By: ajkr

fbshipit-source-id: e481bac6c03c2ab268be41359730f1ceb9964866
2021-01-29 12:18:58 -08:00

// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).
//
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file. See the AUTHORS file for names of contributors.
#include "util/hash.h"

#include <string.h>

#include <cassert>
#include <string>

#include "port/lang.h"
#include "util/coding.h"
#include "util/xxhash.h"
namespace ROCKSDB_NAMESPACE {
uint64_t (*kGetSliceNPHash64UnseededFnPtr)(const Slice&) = &GetSliceHash64;

uint32_t Hash(const char* data, size_t n, uint32_t seed) {
  // MurmurHash1 - fast but mediocre quality
  // https://github.com/aappleby/smhasher/wiki/MurmurHash1
  //
  const uint32_t m = 0xc6a4a793;
  const uint32_t r = 24;
  const char* limit = data + n;
  uint32_t h = static_cast<uint32_t>(seed ^ (n * m));

  // Pick up four bytes at a time
  while (data + 4 <= limit) {
    uint32_t w = DecodeFixed32(data);
    data += 4;
    h += w;
    h *= m;
    h ^= (h >> 16);
  }

  // Pick up remaining bytes
  switch (limit - data) {
    // Note: The original hash implementation used data[i] << shift, which
    // promotes the char to int and then performs the shift. If the char is
    // negative, the shift is undefined behavior in C++. The hash algorithm is
    // part of the format definition, so we cannot change it; to obtain the same
    // behavior in a legal way we just cast to uint32_t, which will do
    // sign-extension. To guarantee compatibility with architectures where chars
    // are unsigned we first cast the char to int8_t.
    case 3:
      h += static_cast<uint32_t>(static_cast<int8_t>(data[2])) << 16;
      FALLTHROUGH_INTENDED;
    case 2:
      h += static_cast<uint32_t>(static_cast<int8_t>(data[1])) << 8;
      FALLTHROUGH_INTENDED;
    case 1:
      h += static_cast<uint32_t>(static_cast<int8_t>(data[0]));
      h *= m;
      h ^= (h >> r);
      break;
  }
  return h;
}

// We are standardizing on a preview release of XXH3, because that's
// the best available at time of standardizing.
//
// In testing (mostly Intel Skylake), this hash function is much more
// thorough than Hash32 and is almost universally faster. Hash() only
// seems faster when passing runtime-sized keys of the same small size
// (less than about 24 bytes) thousands of times in a row; this seems
// to allow the branch predictor to work some magic. XXH3's speed is
// much less dependent on branch prediction.
//
// Hashing with a prefix extractor is potentially a common case of
// hashing objects of small, predictable size. We could consider
// bundling hash functions specialized for particular lengths with
// the prefix extractors.
uint64_t Hash64(const char* data, size_t n, uint64_t seed) {
  return XXH3p_64bits_withSeed(data, n, seed);
}

uint64_t Hash64(const char* data, size_t n) {
  // Same as seed = 0
  return XXH3p_64bits(data, n);
}

uint64_t GetSlicePartsNPHash64(const SliceParts& data, uint64_t seed) {
  // TODO(ajkr): use XXH3 streaming APIs to avoid the copy/allocation.
  size_t concat_len = 0;
  for (int i = 0; i < data.num_parts; ++i) {
    concat_len += data.parts[i].size();
  }
  std::string concat_data;
  concat_data.reserve(concat_len);
  for (int i = 0; i < data.num_parts; ++i) {
    concat_data.append(data.parts[i].data(), data.parts[i].size());
  }
  assert(concat_data.size() == concat_len);
  return NPHash64(concat_data.data(), concat_len, seed);
}

} // namespace ROCKSDB_NAMESPACE
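
The note in `Hash()` about the remaining-bytes switch can be checked in isolation. The following is a small sketch, not part of `hash.cc`: `WidenTailByte` and `TailByteExamples` are hypothetical names, and the snippet only reproduces the double cast used for the tail bytes so its effect on bytes at or above 0x80 is visible.

```cpp
#include <cassert>
#include <cstdint>

// Widen one tail byte the way the note describes: reinterpret it as a signed
// 8-bit value first, then convert to uint32_t, which sign-extends. Going
// through int8_t makes the result the same whether plain char is signed or
// unsigned on the target platform.
inline uint32_t WidenTailByte(char c) {
  return static_cast<uint32_t>(static_cast<int8_t>(c));
}

inline void TailByteExamples() {
  // Bytes below 0x80 are unchanged.
  assert(WidenTailByte('\x7f') == 0x0000007fu);
  // Bytes at or above 0x80 get their upper 24 bits set.
  assert(WidenTailByte('\x80') == 0xffffff80u);
  // Shifting the unsigned result is well defined, unlike shifting a
  // negative int.
  assert((WidenTailByte('\xff') << 16) == 0xffff0000u);
}
```

Since the sign-extended value is part of the hash's on-disk behavior, a "cleaner" cast through `unsigned char` would silently change hashes for inputs whose tail contains bytes at or above 0x80.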