Compare commits

...

20 Commits

Author SHA1 Message Date
sdong
fcfb666d17 Make Huge Page TLB disabled by default in hash linked list memtable 2014-05-04 14:27:50 -07:00
sdong
808928fc99 Allow allocating dynamic bloom, plain table indexes and hash linked list from huge page TLB
Summary: Add an option to allocate a piece of memory from huge page TLB. Add options to trigger it in dynamic bloom, plain table indexes andhash linked list hash table.

Test Plan: make all check

Reviewers: haobo, ljin

Reviewed By: haobo

CC: nkg-, dhruba, leveldb, igor, yhchiang

Differential Revision: https://reviews.facebook.net/D18357

Conflicts:
	db/plain_table_db_test.cc
	util/options.cc
2014-04-30 14:57:41 -07:00
sdong
eb96dc003a Fix a bug in IterKey
Summary: IterKey set buffer_size_ to a wrong initial value, causing it to always allocate values from heap instead of stack if the key size is smaller. Fix it.

Test Plan: make all check

Reviewers: haobo, ljin

Reviewed By: haobo

CC: igor, dhruba, yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D18279
2014-04-23 19:45:58 -07:00
sdong
8db376d494 Fix a sign and unsign comparating in plain_table_db_test
Summary:

Test Plan:

Reviewers:

CC:

Task ID: #

Blame Rev:
2014-04-23 16:09:45 -07:00
sdong
73895c9478 Expose number of entries in mem tables to users
Summary: In this patch, two new DB properties are defined: rocksdb.num-immutable-mem-table and rocksdb.num-entries-imm-mem-tables, from where number of entries in mem tables can be exposed to users

Test Plan:
Cover the codes in db_test
make all check

Reviewers: haobo, ljin, igor

Reviewed By: igor

CC: nkg-, igor, yhchiang, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D18207

Conflicts:
	db/db_test.cc
2014-04-22 22:38:44 -07:00
sdong
7b37f0d5af PlainTableReader to expose index size to users
Summary:
This is a temp solution to expose index sizes to users from PlainTableReader before we persistent them to files.
In this patch, the memory consumption of indexes used by PlainTableReader will be reported as two user defined properties, so that users can monitor them.

Test Plan:
Add a unit test.
make all check`

Reviewers: haobo, ljin

Reviewed By: haobo

CC: nkg-, yhchiang, igor, ljin, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D18195
2014-04-22 19:31:12 -07:00
Igor Canadi
8cd08fdca4 Fix allocate test in 2.8.fb.trunk.
Merging:
c7076a7a05
and
05c168658e

into 2.8.fb.trunk
2014-04-21 18:26:53 -07:00
Igor Canadi
ce068c09dd Flush before Sync() 2014-04-21 17:47:48 -07:00
Igor Canadi
4b7b1949d4 REmove occurrences of kBlockBasedTableWithWholeKeyHashIndex 2014-04-21 12:00:31 -07:00
Igor Canadi
9e04ce7645 Don't execute WholeKeyPrefix test 2014-04-21 11:23:47 -07:00
sdong
13dc9c7f56 Use a different approach to make sure BlockBasedTableReader can use hash index on older files
Summary:
A recent commit e37dd216f9 makes sure hash index can be used when reading existing files. This patch uses another way to achieve the approach:
(1) Currently, always writing kBinarySearch to files, despite of BlockBasedTableOptions.IndexType setting.
(2) When reading a file, read out the field, and make sure it is kBinarySearch, while always use index type by users.

The reason for doing it is, to reserve kHashSearch property on disk to future. If now we write out binary index for both of kHashSearch and kBinarySearch. We have to use a new flag in the future for hash index on disk, otherwise compatibility would break. Also, we want the real index type and type shown in properties block to be consistent.

Test Plan: make all check

Reviewers: haobo, kailiu

Reviewed By: kailiu

CC: igor, ljin, yhchiang, xjin, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D18009

Conflicts:
	table/block_based_table_reader.cc
2014-04-18 14:46:44 -07:00
sdong
034b494774 Fix bugs introduced by D17961
Summary:
D17961 has two bugs:
(1) two level iterator fails to populate FileMetaData.table_reader, causing performance regression.
(2) table cache handle the !status.ok() case in the wrong place, causing seg fault which shouldn't happen.

Test Plan: make all check

Reviewers: ljin, igor, haobo

Reviewed By: ljin

CC: yhchiang, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D17991

Conflicts:
	db/version_set.cc
2014-04-18 14:40:58 -07:00
sdong
d705755e52 Minimize accessing multiple objects in Version::Get()
Summary:
One of our profilings shows that Version::Get() sometimes is slow when getting pointer of user comparators or other global objects. In this patch:
(1) we keep pointers of immutable objects in Version to avoid accesses them though option objects or cfd objects
(2) table_reader is directly cached in FileMetaData so that table cache don't have to go through handle first to fetch it
(3) If level 0 has less than 3 files, skip the filtering logic based on SST tables' key range. Smallest and largest key are stored in separated memory locations, which has potential cache misses

Test Plan: make all check

Reviewers: haobo, ljin

Reviewed By: haobo

CC: igor, yhchiang, nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D17739

Conflicts:
	db/db_impl.cc
	db/db_impl_readonly.cc
	db/table_cache.cc
	db/version_edit.h
	db/version_set.cc
	db/version_set.h
2014-04-18 14:27:06 -07:00
sdong
26f36347d0 RocksDB 2.8 to be able to read files generated by 2.6
Summary:
From 2.6 to 2.7, property block name is renamed from rocksdb.stats to rocksdb.properties. Older properties were not able to be loaded. In 2.8, we seem to have added some logic that uses property block without checking null pointers, which create segment faults.

In this patch, we fix it by:
(1) try rocksdb.stats if rocksdb.properties is not found
(2) add some null checking before consuming rep->table_properties

Test Plan: make sure a file generated in 2.7 couldn't be opened now can be opened.

Reviewers: haobo, igor, yhchiang

Reviewed By: igor

CC: ljin, xjin, dhruba, kailiu, leveldb

Differential Revision: https://reviews.facebook.net/D17961
2014-04-17 10:17:12 -07:00
sdong
0f7daf5fb4 Miss one file from previous cherry-pick
Summary:

Test Plan:

Reviewers:

CC:

Task ID: #

Blame Rev:
2014-04-15 20:20:01 -07:00
Kai Liu
22f396798e Enable hash index for block-based table
Summary: Based on previous patches, this diff eventually provides the end-to-end mechanism for users to specify the hash-index.

Test Plan: Wrote several new unit tests.

Reviewers: sdong, haobo, dhruba

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16539

Use shorten index key for hash-index
Summary:
I was wrong about the "index builder", right now since we create index
by scanning both whole table and index, there is not need to preserve
the whole key as the index key.

I switch back to original way index which is both space efficient and
able to supprot in-fly construction of hash index.

IN this patch, I made minimal change since I'm not sure if we still need
the "pluggable index builder", under current circumstance it is of no use
and kind of over-engineered. But I'm not sure if we can still exploit its
usefulness in the future; otherwise I think I can just burn them with great
vengeance.

Test Plan: unit tests

Reviewers: sdong, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17745

Conflicts:
	table/block_based_table_reader.cc
	table/block_based_table_reader.h
2014-04-15 20:13:34 -07:00
sdong
258eac1772 Polish IterKey and use it in DBImpl::ProcessKeyValueCompaction()
Summary:
1. Polish IterKey a little bit.
2. Turn to use it in local parameter of current_user_key in DBImpl::ProcessKeyValueCompaction(). Our profile showing that DBImpl::ProcessKeyValueCompaction() has about 14% costs in std::string (the base including reading and writing data but excluding compaction filtering), which is higher than it should be. There are two std::string used in DBImpl::ProcessKeyValueCompaction(), compaction_filter_value and current_user_key and it's hard to distinguish the two.

Test Plan: make all check

Reviewers: haobo, ljin

Reviewed By: haobo

CC: igor, yhchiang, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D17613

Conflicts:
	db/db_impl.cc
2014-04-09 20:54:16 -07:00
kailiu
c80c3f3b05 Make the block-based table's index pluggable
Summary:
This patch introduced a new table options that allows us to make
block-based table's index pluggable.

To support that new features:

* Code has been refacotred to be more flexible and supports this option well.
* More documentation is added for the existing obsecure functionalities.
* Big surgeon on DataBlockReader(), where the logic was really convoluted.
* Other small code cleanups.

The pluggablility will mostly affect development of internal modules
and won't change frequently, as a result I intentionally avoid
heavy-weight patterns (like factory) and try to make it simple.

Test Plan: make all check

Reviewers: haobo, sdong

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16395

Conflicts:
	table/block_based_table_reader.cc
	table/block_based_table_reader.h
2014-04-09 11:34:13 -07:00
sdong
14534f0d7b PlainTableBuilder::Add() to use local char array instead of reused std::string as tmp buffer
Summary: Our profile shows that in one of the applications, 5% of the CPU costs of PlainTableBuilder::Add() are spent on std::string stacks. By this simple change, we avoid this global reusable string. Also, we avoid another call of file appending, which probably gives another 2%.

Test Plan: make all check

Reviewers: haobo, ljin

Reviewed By: haobo

CC: igor, yhchiang, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D17601
2014-04-09 10:33:15 -07:00
sdong
b405cb886b PlainTableIterator not to store copied key in std::string
Summary:
Move PlainTableIterator's copied key from std::string local buffer to avoid paying the extra costs in std::string related to sharing. Reuse the same buffer class in DbIter. Move the class to dbformat.h.

This patch improves iterator performance significantly. Running this benchmark:

./table_reader_bench --num_keys2=17 --iterator --plain_table --time_unit=nanosecond

The average latency is improved to about 750 nanoseconds from 1100 nanoseconds.

Test Plan:
Add a unit test.
make all check

Reviewers: haobo, ljin

Reviewed By: haobo

CC: igor, yhchiang, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D17547
2014-04-07 22:18:09 -07:00
45 changed files with 1382 additions and 605 deletions

View File

@ -2540,9 +2540,9 @@ Status DBImpl::ProcessKeyValueCompaction(
Status status;
std::string compaction_filter_value;
ParsedInternalKey ikey;
std::string current_user_key;
IterKey current_user_key;
bool has_current_user_key = false;
std::vector<char> delete_key; // for compaction filter
IterKey delete_key;
SequenceNumber last_sequence_for_key __attribute__((unused)) =
kMaxSequenceNumber;
SequenceNumber visible_in_snapshot = kMaxSequenceNumber;
@ -2612,16 +2612,16 @@ Status DBImpl::ProcessKeyValueCompaction(
// Do not hide error keys
// TODO: error key stays in db forever? Figure out the intention/rationale
// v10 error v8 : we cannot hide v8 even though it's pretty obvious.
current_user_key.clear();
current_user_key.Clear();
has_current_user_key = false;
last_sequence_for_key = kMaxSequenceNumber;
visible_in_snapshot = kMaxSequenceNumber;
} else {
if (!has_current_user_key ||
user_comparator()->Compare(ikey.user_key,
Slice(current_user_key)) != 0) {
current_user_key.GetKey()) != 0) {
// First occurrence of this user key
current_user_key.assign(ikey.user_key.data(), ikey.user_key.size());
current_user_key.SetUserKey(ikey.user_key);
has_current_user_key = true;
last_sequence_for_key = kMaxSequenceNumber;
visible_in_snapshot = kMaxSequenceNumber;
@ -2642,13 +2642,11 @@ Status DBImpl::ProcessKeyValueCompaction(
&compaction_filter_value,
&value_changed);
if (to_delete) {
// make a copy of the original key
delete_key.assign(key.data(), key.data() + key.size());
// convert it to a delete
UpdateInternalKey(&delete_key[0], delete_key.size(),
ikey.sequence, kTypeDeletion);
// make a copy of the original key and convert it to a delete
delete_key.SetInternalKey(ExtractUserKey(key), ikey.sequence,
kTypeDeletion);
// anchor the key again
key = Slice(&delete_key[0], delete_key.size());
key = delete_key.GetKey();
// needed because ikey is backed by key
ParseInternalKey(key, &ikey);
// no value associated with delete
@ -3455,7 +3453,7 @@ Status DBImpl::GetImpl(const ReadOptions& options,
StartPerfTimer(&from_files_timer);
sv->current->Get(options, lkey, value, &s, &merge_context, &stats,
options_, value_found);
value_found);
have_stat_update = true;
BumpPerfTime(&perf_context.get_from_output_files_time, &from_files_timer);
RecordTick(options_.statistics.get(), MEMTABLE_MISS);
@ -3559,7 +3557,7 @@ std::vector<Status> DBImpl::MultiGet(const ReadOptions& options,
// Done
} else {
get_version->current->Get(options, lkey, value, &s, &merge_context,
&stats, options_);
&stats);
have_stat_update = true;
}

View File

@ -64,7 +64,7 @@ Status DBImplReadOnly::Get(const ReadOptions& options,
} else {
Version::GetStats stats;
super_version->current->Get(options, lkey, value, &s, &merge_context,
&stats, options_);
&stats);
}
return s;
}

View File

@ -39,71 +39,6 @@ static void DumpInternalIter(Iterator* iter) {
namespace {
class IterLookupKey {
public:
IterLookupKey() : key_(space_), buf_size_(sizeof(space_)), key_size_(0) {}
~IterLookupKey() { Clear(); }
Slice GetKey() const {
if (key_ != nullptr) {
return Slice(key_, key_size_);
} else {
return Slice();
}
}
bool Valid() const { return key_ != nullptr; }
void Clear() {
if (key_ != nullptr && key_ != space_) {
delete[] key_;
}
key_ = space_;
buf_size_ = sizeof(buf_size_);
}
// Enlarge the buffer size if needed based on key_size.
// By default, static allocated buffer is used. Once there is a key
// larger than the static allocated buffer, another buffer is dynamically
// allocated, until a larger key buffer is requested. In that case, we
// reallocate buffer and delete the old one.
void EnlargeBufferIfNeeded(size_t key_size) {
// If size is smaller than buffer size, continue using current buffer,
// or the static allocated one, as default
if (key_size > buf_size_) {
// Need to enlarge the buffer.
Clear();
key_ = new char[key_size];
buf_size_ = key_size;
}
key_size_ = key_size;
}
void SetUserKey(const Slice& user_key) {
size_t size = user_key.size();
EnlargeBufferIfNeeded(size);
memcpy(key_, user_key.data(), size);
}
void SetInternalKey(const Slice& user_key, SequenceNumber s) {
size_t usize = user_key.size();
EnlargeBufferIfNeeded(usize + sizeof(uint64_t));
memcpy(key_, user_key.data(), usize);
EncodeFixed64(key_ + usize, PackSequenceAndType(s, kValueTypeForSeek));
}
private:
char* key_;
size_t buf_size_;
size_t key_size_;
char space_[32]; // Avoid allocation for short keys
// No copying allowed
IterLookupKey(const IterLookupKey&) = delete;
void operator=(const LookupKey&) = delete;
};
// Memtables and sstables that make the DB representation contain
// (userkey,seq,type) => uservalue entries. DBIter
// combines multiple entries for the same userkey found in the DB
@ -191,7 +126,7 @@ class DBIter: public Iterator {
SequenceNumber const sequence_;
Status status_;
IterLookupKey saved_key_; // == current key when direction_==kReverse
IterKey saved_key_; // == current key when direction_==kReverse
std::string saved_value_; // == current raw value when direction_==kReverse
std::string skip_key_;
Direction direction_;

View File

@ -266,6 +266,9 @@ class DBTest {
// Sequence of option configurations to try
enum OptionConfig {
kDefault,
kBlockBasedTableWithPrefixHashIndex,
// TODO(kailiu) figure this out
// kBlockBasedTableWithWholeKeyHashIndex,
kPlainTableFirstBytePrefix,
kPlainTableAllBytesPrefix,
kVectorRep,
@ -302,7 +305,8 @@ class DBTest {
kSkipDeletesFilterFirst = 1,
kSkipUniversalCompaction = 2,
kSkipMergePut = 4,
kSkipPlainTable = 8
kSkipPlainTable = 8,
kSkipHashIndex = 16
};
DBTest() : option_config_(kDefault),
@ -343,6 +347,11 @@ class DBTest {
|| option_config_ == kPlainTableFirstBytePrefix)) {
continue;
}
if ((skip_mask & kSkipPlainTable) &&
option_config_ == kBlockBasedTableWithPrefixHashIndex) {
continue;
}
break;
}
@ -429,7 +438,7 @@ class DBTest {
break;
case kHashLinkList:
options.prefix_extractor.reset(NewFixedPrefixTransform(1));
options.memtable_factory.reset(NewHashLinkListRepFactory(4));
options.memtable_factory.reset(NewHashLinkListRepFactory(4, 0));
break;
case kUniversalCompaction:
options.compaction_style = kCompactionStyleUniversal;
@ -441,6 +450,21 @@ class DBTest {
case kInfiniteMaxOpenFiles:
options.max_open_files = -1;
break;
case kBlockBasedTableWithPrefixHashIndex: {
BlockBasedTableOptions table_options;
table_options.index_type = BlockBasedTableOptions::kHashSearch;
options.table_factory.reset(NewBlockBasedTableFactory(table_options));
options.prefix_extractor.reset(NewFixedPrefixTransform(1));
break;
}
// TODO(kailiu) figure out why it's failing and fix
// case kBlockBasedTableWithWholeKeyHashIndex: {
// BlockBasedTableOptions table_options;
// table_options.index_type = BlockBasedTableOptions::kHashSearch;
// options.table_factory.reset(NewBlockBasedTableFactory(table_options));
// options.prefix_extractor.reset(NewNoopTransform());
// break;
// }
default:
break;
}
@ -885,13 +909,29 @@ TEST(DBTest, Empty) {
options.write_buffer_size = 100000; // Small write buffer
Reopen(&options);
std::string num;
ASSERT_TRUE(dbfull()->GetProperty(
"rocksdb.num-entries-active-mem-table", &num));
ASSERT_EQ("0", num);
ASSERT_OK(Put("foo", "v1"));
ASSERT_EQ("v1", Get("foo"));
ASSERT_TRUE(dbfull()->GetProperty(
"rocksdb.num-entries-active-mem-table", &num));
ASSERT_EQ("1", num);
env_->delay_sstable_sync_.Release_Store(env_); // Block sync calls
Put("k1", std::string(100000, 'x')); // Fill memtable
ASSERT_TRUE(dbfull()->GetProperty(
"rocksdb.num-entries-active-mem-table", &num));
ASSERT_EQ("2", num);
Put("k2", std::string(100000, 'y')); // Trigger compaction
ASSERT_EQ("v1", Get("foo"));
ASSERT_TRUE(dbfull()->GetProperty(
"rocksdb.num-entries-active-mem-table", &num));
ASSERT_EQ("1", num);
env_->delay_sstable_sync_.Release_Store(nullptr); // Release sync calls
} while (ChangeOptions());
}
@ -1251,7 +1291,7 @@ TEST(DBTest, KeyMayExist) {
// KeyMayExist function only checks data in block caches, which is not used
// by plain table format.
} while (ChangeOptions(kSkipPlainTable));
} while (ChangeOptions(kSkipPlainTable | kSkipHashIndex));
}
TEST(DBTest, NonBlockingIteration) {
@ -2064,6 +2104,9 @@ TEST(DBTest, NumImmutableMemTable) {
ASSERT_OK(dbfull()->Put(writeOpt, "k1", big_value));
ASSERT_TRUE(dbfull()->GetProperty("rocksdb.num-immutable-mem-table", &num));
ASSERT_EQ(num, "0");
ASSERT_TRUE(dbfull()->GetProperty(
"rocksdb.num-entries-active-mem-table", &num));
ASSERT_EQ(num, "1");
perf_context.Reset();
Get("k1");
ASSERT_EQ(1, (int) perf_context.get_from_memtable_count);
@ -2071,6 +2114,13 @@ TEST(DBTest, NumImmutableMemTable) {
ASSERT_OK(dbfull()->Put(writeOpt, "k2", big_value));
ASSERT_TRUE(dbfull()->GetProperty("rocksdb.num-immutable-mem-table", &num));
ASSERT_EQ(num, "1");
ASSERT_TRUE(dbfull()->GetProperty(
"rocksdb.num-entries-active-mem-table", &num));
ASSERT_EQ(num, "1");
ASSERT_TRUE(dbfull()->GetProperty(
"rocksdb.num-entries-imm-mem-tables", &num));
ASSERT_EQ(num, "1");
perf_context.Reset();
Get("k1");
ASSERT_EQ(2, (int) perf_context.get_from_memtable_count);
@ -2083,6 +2133,12 @@ TEST(DBTest, NumImmutableMemTable) {
&num));
ASSERT_TRUE(dbfull()->GetProperty("rocksdb.num-immutable-mem-table", &num));
ASSERT_EQ(num, "2");
ASSERT_TRUE(dbfull()->GetProperty(
"rocksdb.num-entries-active-mem-table", &num));
ASSERT_EQ(num, "1");
ASSERT_TRUE(dbfull()->GetProperty(
"rocksdb.num-entries-imm-mem-tables", &num));
ASSERT_EQ(num, "2");
perf_context.Reset();
Get("k2");
ASSERT_EQ(2, (int) perf_context.get_from_memtable_count);
@ -5882,7 +5938,8 @@ TEST(DBTest, Randomized) {
int minimum = 0;
if (option_config_ == kHashSkipList ||
option_config_ == kHashLinkList ||
option_config_ == kPlainTableFirstBytePrefix) {
option_config_ == kPlainTableFirstBytePrefix ||
option_config_ == kBlockBasedTableWithPrefixHashIndex) {
minimum = 1;
}
if (p < 45) { // Put
@ -5924,6 +5981,7 @@ TEST(DBTest, Randomized) {
if ((step % 100) == 0) {
ASSERT_TRUE(CompareIterators(step, &model, db_, nullptr, nullptr));
ASSERT_TRUE(CompareIterators(step, &model, db_, model_snap, db_snap));
// Save a snapshot from each DB this time that we'll use next
// time we compare things, to make sure the current state is
// preserved with the snapshot

View File

@ -13,6 +13,7 @@
#include "rocksdb/db.h"
#include "rocksdb/filter_policy.h"
#include "rocksdb/slice.h"
#include "rocksdb/slice_transform.h"
#include "rocksdb/table.h"
#include "rocksdb/types.h"
#include "util/coding.h"
@ -235,4 +236,100 @@ inline LookupKey::~LookupKey() {
if (start_ != space_) delete[] start_;
}
class IterKey {
public:
IterKey() : key_(space_), buf_size_(sizeof(space_)), key_size_(0) {}
~IterKey() { ResetBuffer(); }
Slice GetKey() const { return Slice(key_, key_size_); }
void Clear() { key_size_ = 0; }
void SetUserKey(const Slice& user_key) {
size_t size = user_key.size();
EnlargeBufferIfNeeded(size);
memcpy(key_, user_key.data(), size);
key_size_ = size;
}
void SetInternalKey(const Slice& user_key, SequenceNumber s,
ValueType value_type = kValueTypeForSeek) {
size_t usize = user_key.size();
EnlargeBufferIfNeeded(usize + sizeof(uint64_t));
memcpy(key_, user_key.data(), usize);
EncodeFixed64(key_ + usize, PackSequenceAndType(s, value_type));
key_size_ = usize + sizeof(uint64_t);
}
void SetInternalKey(const ParsedInternalKey& parsed_key) {
SetInternalKey(parsed_key.user_key, parsed_key.sequence, parsed_key.type);
}
private:
char* key_;
size_t buf_size_;
size_t key_size_;
char space_[32]; // Avoid allocation for short keys
void ResetBuffer() {
if (key_ != nullptr && key_ != space_) {
delete[] key_;
}
key_ = space_;
buf_size_ = sizeof(space_);
key_size_ = 0;
}
// Enlarge the buffer size if needed based on key_size.
// By default, static allocated buffer is used. Once there is a key
// larger than the static allocated buffer, another buffer is dynamically
// allocated, until a larger key buffer is requested. In that case, we
// reallocate buffer and delete the old one.
void EnlargeBufferIfNeeded(size_t key_size) {
// If size is smaller than buffer size, continue using current buffer,
// or the static allocated one, as default
if (key_size > buf_size_) {
// Need to enlarge the buffer.
ResetBuffer();
key_ = new char[key_size];
buf_size_ = key_size;
}
}
// No copying allowed
IterKey(const IterKey&) = delete;
void operator=(const IterKey&) = delete;
};
class InternalKeySliceTransform : public SliceTransform {
public:
explicit InternalKeySliceTransform(const SliceTransform* transform)
: transform_(transform) {}
virtual const char* Name() const { return transform_->Name(); }
virtual Slice Transform(const Slice& src) const {
auto user_key = ExtractUserKey(src);
return transform_->Transform(user_key);
}
virtual bool InDomain(const Slice& src) const {
auto user_key = ExtractUserKey(src);
return transform_->InDomain(user_key);
}
virtual bool InRange(const Slice& dst) const {
auto user_key = ExtractUserKey(dst);
return transform_->InRange(user_key);
}
const SliceTransform* user_prefix_extractor() const { return transform_; }
private:
// Like comparator, InternalKeySliceTransform will not take care of the
// deletion of transform_
const SliceTransform* const transform_;
};
} // namespace rocksdb

View File

@ -38,6 +38,10 @@ DBPropertyType GetPropertyType(const Slice& property) {
return kBackgroundErrors;
} else if (in == "cur-size-active-mem-table") {
return kCurSizeActiveMemTable;
} else if (in == "num-entries-active-mem-table") {
return kNumEntriesInMutableMemtable;
} else if (in == "num-entries-imm-mem-tables") {
return kNumEntriesInImmutableMemtable;
}
return kUnknown;
}
@ -47,7 +51,7 @@ bool InternalStats::GetProperty(DBPropertyType property_type,
DBImpl* db) {
VersionSet* version_set = db->versions_.get();
Version* current = version_set->current();
const MemTableList& imm = db->imm_;
MemTableList& imm = db->imm_;
Slice in = property;
switch (property_type) {
@ -353,6 +357,14 @@ bool InternalStats::GetProperty(DBPropertyType property_type,
// Current size of the active memtable
*value = std::to_string(db->mem_->ApproximateMemoryUsage());
return true;
case kNumEntriesInMutableMemtable:
// Current size of the active memtable
*value = std::to_string(db->mem_->GetNumEntries());
return true;
case kNumEntriesInImmutableMemtable:
// Current size of the active memtable
*value = std::to_string(imm.current()->GetTotalNumEntries());
return true;
default:
return false;
}

View File

@ -28,11 +28,14 @@ enum DBPropertyType {
kSsTables, // Return a human readable string of current SST files
kNumImmutableMemTable, // Return number of immutable mem tables
kMemtableFlushPending, // Return 1 if mem table flushing is pending,
// otherwise
// 0.
// otherwise 0.
kCompactionPending, // Return 1 if a compaction is pending. Otherwise 0.
kBackgroundErrors, // Return accumulated background errors encountered.
kCurSizeActiveMemTable, // Return current size of the active memtable
kNumEntriesInMutableMemtable, // Return number of entries in the mutable
// memtable.
kNumEntriesInImmutableMemtable, // Return sum of number of entries in all
// the immutable mem tables.
kUnknown,
};

View File

@ -37,6 +37,7 @@ MemTable::MemTable(const InternalKeyComparator& cmp, const Options& options)
arena_(options.arena_block_size),
table_(options.memtable_factory->CreateMemTableRep(
comparator_, &arena_, options.prefix_extractor.get())),
num_entries_(0),
flush_in_progress_(false),
flush_completed_(false),
file_number_(0),
@ -51,9 +52,10 @@ MemTable::MemTable(const InternalKeyComparator& cmp, const Options& options)
// gone wrong already.
assert(!should_flush_);
if (prefix_extractor_ && options.memtable_prefix_bloom_bits > 0) {
prefix_bloom_.reset(new DynamicBloom(options.memtable_prefix_bloom_bits,
options.bloom_locality,
options.memtable_prefix_bloom_probes));
prefix_bloom_.reset(new DynamicBloom(
options.memtable_prefix_bloom_bits, options.bloom_locality,
options.memtable_prefix_bloom_probes, nullptr,
options.memtable_prefix_bloom_huge_page_tlb_size));
}
}
@ -260,6 +262,7 @@ void MemTable::Add(SequenceNumber s, ValueType type,
memcpy(p, value.data(), val_size);
assert((unsigned)(p + val_size - buf) == (unsigned)encoded_len);
table_->Insert(handle);
num_entries_++;
if (prefix_bloom_) {
assert(prefix_extractor_);

View File

@ -132,6 +132,9 @@ class MemTable {
// key in the memtable.
size_t CountSuccessiveMergeEntries(const LookupKey& key);
// Get total number of entries in the mem table.
uint64_t GetNumEntries() const { return num_entries_; }
// Returns the edits area that is needed for flushing the memtable
VersionEdit* GetEdits() { return &edit_; }
@ -182,6 +185,8 @@ class MemTable {
Arena arena_;
unique_ptr<MemTableRep> table_;
uint64_t num_entries_;
// These are used to manage memtable flushes to storage
bool flush_in_progress_; // started the flush
bool flush_completed_; // finished the flush

View File

@ -77,6 +77,14 @@ void MemTableListVersion::AddIterators(const ReadOptions& options,
}
}
uint64_t MemTableListVersion::GetTotalNumEntries() const {
uint64_t total_num = 0;
for (auto& m : memlist_) {
total_num += m->GetNumEntries();
}
return total_num;
}
// caller is responsible for referencing m
void MemTableListVersion::Add(MemTable* m) {
assert(refs_ == 1); // only when refs_ == 1 is MemTableListVersion mutable

View File

@ -43,6 +43,8 @@ class MemTableListVersion {
void AddIterators(const ReadOptions& options,
std::vector<Iterator*>* iterator_list);
uint64_t GetTotalNumEntries() const;
private:
// REQUIRE: m is mutable memtable
void Add(MemTable* m);

View File

@ -188,9 +188,9 @@ class TestPlainTableReader : public PlainTableReader {
const Options& options, bool* expect_bloom_not_match)
: PlainTableReader(options, std::move(file), storage_options, icomparator,
file_size, bloom_bits_per_key, hash_table_ratio,
index_sparseness, table_properties),
index_sparseness, table_properties, 2 * 1024 * 1024),
expect_bloom_not_match_(expect_bloom_not_match) {
Status s = PopulateIndex();
Status s = PopulateIndex(const_cast<TableProperties*>(table_properties));
ASSERT_TRUE(s.ok());
}
@ -209,13 +209,12 @@ extern const uint64_t kPlainTableMagicNumber;
class TestPlainTableFactory : public PlainTableFactory {
public:
explicit TestPlainTableFactory(bool* expect_bloom_not_match,
uint32_t user_key_len =
kPlainTableVariableLength,
int bloom_bits_per_key = 0,
double hash_table_ratio = 0.75,
size_t index_sparseness = 16)
uint32_t user_key_len, int bloom_bits_per_key,
double hash_table_ratio,
size_t index_sparseness,
size_t huge_page_tlb_size)
: PlainTableFactory(user_key_len, user_key_len, hash_table_ratio,
hash_table_ratio),
index_sparseness, huge_page_tlb_size),
bloom_bits_per_key_(bloom_bits_per_key),
hash_table_ratio_(hash_table_ratio),
index_sparseness_(index_sparseness),
@ -247,6 +246,8 @@ class TestPlainTableFactory : public PlainTableFactory {
};
TEST(PlainTableDBTest, Flush) {
for (size_t huge_page_tlb_size = 0; huge_page_tlb_size <= 2 * 1024 * 1024;
huge_page_tlb_size += 2 * 1024 * 1024) {
for (int bloom_bits = 0; bloom_bits <= 117; bloom_bits += 117) {
for (int total_order = 0; total_order <= 1; total_order++) {
Options options = CurrentOptions();
@ -254,10 +255,11 @@ TEST(PlainTableDBTest, Flush) {
// Set only one bucket to force bucket conflict.
// Test index interval for the same prefix to be 1, 2 and 4
if (total_order) {
options.table_factory.reset(
NewTotalOrderPlainTableFactory(16, bloom_bits, 2));
options.table_factory.reset(NewTotalOrderPlainTableFactory(
16, bloom_bits, 2, huge_page_tlb_size));
} else {
options.table_factory.reset(NewPlainTableFactory(16, bloom_bits));
options.table_factory.reset(NewPlainTableFactory(
16, bloom_bits, 0.75, 16, huge_page_tlb_size));
}
DestroyAndReopen(&options);
@ -265,13 +267,27 @@ TEST(PlainTableDBTest, Flush) {
ASSERT_OK(Put("0000000000000bar", "v2"));
ASSERT_OK(Put("1000000000000foo", "v3"));
dbfull()->TEST_FlushMemTable();
TablePropertiesCollection ptc;
reinterpret_cast<DB*>(dbfull())->GetPropertiesOfAllTables(&ptc);
ASSERT_EQ(1U, ptc.size());
auto row = ptc.begin();
auto tp = row->second;
ASSERT_EQ(total_order ? "4" : "12", (tp->user_collected_properties).at(
"plain_table_hash_table_size"));
ASSERT_EQ(total_order ? "9" : "0", (tp->user_collected_properties).at(
"plain_table_sub_index_size"));
ASSERT_EQ("v3", Get("1000000000000foo"));
ASSERT_EQ("v2", Get("0000000000000bar"));
}
}
}
}
TEST(PlainTableDBTest, Flush2) {
for (size_t huge_page_tlb_size = 0; huge_page_tlb_size <= 2 * 1024 * 1024;
huge_page_tlb_size += 2 * 1024 * 1024) {
for (int bloom_bits = 0; bloom_bits <= 117; bloom_bits += 117) {
for (int total_order = 0; total_order <= 1; total_order++) {
bool expect_bloom_not_match = false;
@ -281,11 +297,13 @@ TEST(PlainTableDBTest, Flush2) {
// Test index interval for the same prefix to be 1, 2 and 4
if (total_order) {
options.prefix_extractor = nullptr;
options.table_factory.reset(new TestPlainTableFactory(
&expect_bloom_not_match, 16, bloom_bits, 0, 2));
options.table_factory.reset(
new TestPlainTableFactory(&expect_bloom_not_match, 16, bloom_bits,
0, 2, huge_page_tlb_size));
} else {
options.table_factory.reset(
new TestPlainTableFactory(&expect_bloom_not_match, 16, bloom_bits));
new TestPlainTableFactory(&expect_bloom_not_match, 16, bloom_bits,
0.75, 16, huge_page_tlb_size));
}
DestroyAndReopen(&options);
ASSERT_OK(Put("0000000000000bar", "b"));
@ -325,8 +343,11 @@ TEST(PlainTableDBTest, Flush2) {
}
}
}
}
TEST(PlainTableDBTest, Iterator) {
for (size_t huge_page_tlb_size = 0; huge_page_tlb_size <= 2 * 1024 * 1024;
huge_page_tlb_size += 2 * 1024 * 1024) {
for (int bloom_bits = 0; bloom_bits <= 117; bloom_bits += 117) {
for (int total_order = 0; total_order <= 1; total_order++) {
bool expect_bloom_not_match = false;
@ -337,10 +358,11 @@ TEST(PlainTableDBTest, Iterator) {
if (total_order) {
options.prefix_extractor = nullptr;
options.table_factory.reset(new TestPlainTableFactory(
&expect_bloom_not_match, 16, bloom_bits, 0, 2));
&expect_bloom_not_match, 16, bloom_bits, 0, 2, huge_page_tlb_size));
} else {
options.table_factory.reset(
new TestPlainTableFactory(&expect_bloom_not_match, 16, bloom_bits));
new TestPlainTableFactory(&expect_bloom_not_match, 16, bloom_bits,
0.75, 16, huge_page_tlb_size));
}
DestroyAndReopen(&options);
@ -428,6 +450,49 @@ TEST(PlainTableDBTest, Iterator) {
}
}
}
}
std::string MakeLongKey(size_t length, char c) {
return std::string(length, c);
}
TEST(PlainTableDBTest, IteratorLargeKeys) {
Options options = CurrentOptions();
options.table_factory.reset(NewTotalOrderPlainTableFactory(0, 0, 16));
options.create_if_missing = true;
options.prefix_extractor.reset();
DestroyAndReopen(&options);
std::string key_list[] = {
MakeLongKey(30, '0'),
MakeLongKey(16, '1'),
MakeLongKey(32, '2'),
MakeLongKey(60, '3'),
MakeLongKey(90, '4'),
MakeLongKey(50, '5'),
MakeLongKey(26, '6')
};
for (size_t i = 0; i < 7; i++) {
ASSERT_OK(Put(key_list[i], std::to_string(i)));
}
dbfull()->TEST_FlushMemTable();
Iterator* iter = dbfull()->NewIterator(ro_);
iter->Seek(key_list[0]);
for (size_t i = 0; i < 7; i++) {
ASSERT_TRUE(iter->Valid());
ASSERT_EQ(key_list[i], iter->key().ToString());
ASSERT_EQ(std::to_string(i), iter->value().ToString());
iter->Next();
}
ASSERT_TRUE(!iter->Valid());
delete iter;
}
// A test comparator which compare two strings in this way:
// (1) first compare prefix of 8 bytes in alphabet order,
@ -527,12 +592,15 @@ TEST(PlainTableDBTest, IteratorReverseSuffixComparator) {
}
TEST(PlainTableDBTest, HashBucketConflict) {
for (size_t huge_page_tlb_size = 0; huge_page_tlb_size <= 2 * 1024 * 1024;
huge_page_tlb_size += 2 * 1024 * 1024) {
for (unsigned char i = 1; i <= 3; i++) {
Options options = CurrentOptions();
options.create_if_missing = true;
// Set only one bucket to force bucket conflict.
// Test index interval for the same prefix to be 1, 2 and 4
options.table_factory.reset(NewTotalOrderPlainTableFactory(16, 0, 2 ^ i));
options.table_factory.reset(
NewTotalOrderPlainTableFactory(16, 0, 2 ^ i, huge_page_tlb_size));
DestroyAndReopen(&options);
ASSERT_OK(Put("5000000000000fo0", "v1"));
ASSERT_OK(Put("5000000000000fo1", "v2"));
@ -606,8 +674,11 @@ TEST(PlainTableDBTest, HashBucketConflict) {
delete iter;
}
}
}
TEST(PlainTableDBTest, HashBucketConflictReverseSuffixComparator) {
for (size_t huge_page_tlb_size = 0; huge_page_tlb_size <= 2 * 1024 * 1024;
huge_page_tlb_size += 2 * 1024 * 1024) {
for (unsigned char i = 1; i <= 3; i++) {
Options options = CurrentOptions();
options.create_if_missing = true;
@ -615,7 +686,8 @@ TEST(PlainTableDBTest, HashBucketConflictReverseSuffixComparator) {
options.comparator = &comp;
// Set only one bucket to force bucket conflict.
// Test index interval for the same prefix to be 1, 2 and 4
options.table_factory.reset(NewTotalOrderPlainTableFactory(16, 0, 2 ^ i));
options.table_factory.reset(
NewTotalOrderPlainTableFactory(16, 0, 2 ^ i, huge_page_tlb_size));
DestroyAndReopen(&options);
ASSERT_OK(Put("5000000000000fo0", "v1"));
ASSERT_OK(Put("5000000000000fo1", "v2"));
@ -688,6 +760,7 @@ TEST(PlainTableDBTest, HashBucketConflictReverseSuffixComparator) {
delete iter;
}
}
}
TEST(PlainTableDBTest, NonExistingKeyToNonEmptyBucket) {
Options options = CurrentOptions();

View File

@ -31,6 +31,7 @@ DEFINE_int64(min_write_buffer_number_to_merge, 1, "");
DEFINE_int32(skiplist_height, 4, "");
DEFINE_int32(memtable_prefix_bloom_bits, 10000000, "");
DEFINE_int32(memtable_prefix_bloom_probes, 10, "");
DEFINE_int32(memtable_prefix_bloom_huge_page_tlb_size, 2 * 1024 * 1024, "");
DEFINE_int32(value_size, 40, "");
// Path to the database on file system
@ -147,6 +148,8 @@ class PrefixTest {
options.memtable_prefix_bloom_bits = FLAGS_memtable_prefix_bloom_bits;
options.memtable_prefix_bloom_probes = FLAGS_memtable_prefix_bloom_probes;
options.memtable_prefix_bloom_huge_page_tlb_size =
FLAGS_memtable_prefix_bloom_huge_page_tlb_size;
Status s = DB::Open(options, kDbName, &db);
ASSERT_OK(s);
@ -171,6 +174,10 @@ class PrefixTest {
options.memtable_factory.reset(
NewHashLinkListRepFactory(bucket_count));
return true;
case kHashLinkListHugePageTlb:
options.memtable_factory.reset(
NewHashLinkListRepFactory(bucket_count, 2 * 1024 * 1024));
return true;
default:
return false;
}
@ -189,6 +196,7 @@ class PrefixTest {
kBegin,
kHashSkipList,
kHashLinkList,
kHashLinkListHugePageTlb,
kEnd
};
int option_config_;

View File

@ -111,19 +111,20 @@ Iterator* TableCache::NewIterator(const ReadOptions& options,
if (table_reader_ptr != nullptr) {
*table_reader_ptr = nullptr;
}
Cache::Handle* handle = file_meta.table_reader_handle;
TableReader* table_reader = file_meta.table_reader;
Cache::Handle* handle = nullptr;
Status s;
if (!handle) {
if (table_reader == nullptr) {
s = FindTable(toptions, icomparator, file_meta.number, file_meta.file_size,
&handle, nullptr, options.read_tier == kBlockCacheTier);
}
if (!s.ok()) {
return NewErrorIterator(s);
}
table_reader = GetTableReaderFromHandle(handle);
}
TableReader* table_reader = GetTableReaderFromHandle(handle);
Iterator* result = table_reader->NewIterator(options);
if (!file_meta.table_reader_handle) {
if (handle != nullptr) {
result->RegisterCleanup(&UnrefEntry, cache_.get(), handle);
}
if (table_reader_ptr != nullptr) {
@ -143,17 +144,20 @@ Status TableCache::Get(const ReadOptions& options,
bool (*saver)(void*, const ParsedInternalKey&,
const Slice&, bool),
bool* table_io, void (*mark_key_may_exist)(void*)) {
Cache::Handle* handle = file_meta.table_reader_handle;
TableReader* t = file_meta.table_reader;
Status s;
if (!handle) {
Cache::Handle* handle = nullptr;
if (!t) {
s = FindTable(storage_options_, internal_comparator, file_meta.number,
file_meta.file_size, &handle, table_io,
options.read_tier == kBlockCacheTier);
if (s.ok()) {
t = GetTableReaderFromHandle(handle);
}
}
if (s.ok()) {
TableReader* t = GetTableReaderFromHandle(handle);
s = t->Get(options, k, arg, saver, mark_key_may_exist);
if (!file_meta.table_reader_handle) {
if (handle != nullptr) {
ReleaseHandle(handle);
}
} else if (options.read_tier && s.IsIncomplete()) {
@ -169,15 +173,16 @@ Status TableCache::GetTableProperties(
const FileMetaData& file_meta,
std::shared_ptr<const TableProperties>* properties, bool no_io) {
Status s;
auto table_handle = file_meta.table_reader_handle;
auto table_reader = file_meta.table_reader;
// table already been pre-loaded?
if (table_handle) {
auto table = GetTableReaderFromHandle(table_handle);
*properties = table->GetTableProperties();
if (table_reader) {
*properties = table_reader->GetTableProperties();
return s;
}
bool table_io;
Cache::Handle* table_handle = nullptr;
s = FindTable(toptions, internal_comparator, file_meta.number,
file_meta.file_size, &table_handle, &table_io, no_io);
if (!s.ok()) {
@ -195,20 +200,21 @@ bool TableCache::PrefixMayMatch(const ReadOptions& options,
const FileMetaData& file_meta,
const Slice& internal_prefix, bool* table_io) {
bool may_match = true;
auto table_handle = file_meta.table_reader_handle;
if (table_handle == nullptr) {
auto table_reader = file_meta.table_reader;
Cache::Handle* table_handle = nullptr;
if (table_reader == nullptr) {
// Need to get table handle from file number
Status s = FindTable(storage_options_, icomparator, file_meta.number,
file_meta.file_size, &table_handle, table_io);
if (!s.ok()) {
return may_match;
}
table_reader = GetTableReaderFromHandle(table_handle);
}
auto table = GetTableReaderFromHandle(table_handle);
may_match = table->PrefixMayMatch(internal_prefix);
may_match = table_reader->PrefixMayMatch(internal_prefix);
if (file_meta.table_reader_handle == nullptr) {
if (table_handle != nullptr) {
// Need to release handle if it is generated from here.
ReleaseHandle(table_handle);
}

View File

@ -31,10 +31,13 @@ struct FileMetaData {
// Needs to be disposed when refs becomes 0.
Cache::Handle* table_reader_handle;
// Table reader in table_reader_handle
TableReader* table_reader;
FileMetaData(uint64_t number, uint64_t file_size) :
refs(0), allowed_seeks(1 << 30), number(number), file_size(file_size),
being_compacted(false), table_reader_handle(nullptr) {
being_compacted(false), table_reader_handle(nullptr),
table_reader(nullptr) {
}
FileMetaData() : FileMetaData(0, 0) { }
};

View File

@ -148,7 +148,7 @@ namespace {
struct EncodedFileMetaData {
uint64_t number; // file number
uint64_t file_size; // file size
Cache::Handle* table_reader_handle; // cached table reader's handler
TableReader* table_reader; // cached table reader
};
} // namespace
@ -196,7 +196,7 @@ class Version::LevelFileNumIterator : public Iterator {
auto* file_meta = (*flist_)[index_];
current_value_.number = file_meta->number;
current_value_.file_size = file_meta->file_size;
current_value_.table_reader_handle = file_meta->table_reader_handle;
current_value_.table_reader = file_meta->table_reader;
return Slice(reinterpret_cast<const char*>(&current_value_),
sizeof(EncodedFileMetaData));
}
@ -228,7 +228,7 @@ static Iterator* GetFileIterator(void* arg, const ReadOptions& options,
const EncodedFileMetaData* encoded_meta =
reinterpret_cast<const EncodedFileMetaData*>(file_value.data());
FileMetaData meta(encoded_meta->number, encoded_meta->file_size);
meta.table_reader_handle = encoded_meta->table_reader_handle;
meta.table_reader = encoded_meta->table_reader;
return cache->NewIterator(
options.prefix ? options_copy : options, soptions, icomparator, meta,
nullptr /* don't need reference to table*/, for_compaction);
@ -254,7 +254,7 @@ bool Version::PrefixMayMatch(const ReadOptions& options,
reinterpret_cast<const EncodedFileMetaData*>(
level_iter->value().data());
FileMetaData meta(encoded_meta->number, encoded_meta->file_size);
meta.table_reader_handle = encoded_meta->table_reader_handle;
meta.table_reader = encoded_meta->table_reader;
may_match = vset_->table_cache_->PrefixMayMatch(options, vset_->icmp_, meta,
internal_prefix, nullptr);
}
@ -478,6 +478,12 @@ bool BySmallestKey(FileMetaData* a, FileMetaData* b,
Version::Version(VersionSet* vset, uint64_t version_number)
: vset_(vset),
internal_comparator_(&(vset->icmp_)),
user_comparator_(internal_comparator_->user_comparator()),
table_cache_(vset->table_cache_),
merge_operator_(vset->options_->merge_operator.get()),
info_log_(vset->options_->info_log.get()),
db_statistics_(vset->options_->statistics.get()),
next_(this),
prev_(this),
refs_(0),
@ -497,27 +503,22 @@ void Version::Get(const ReadOptions& options,
Status* status,
MergeContext* merge_context,
GetStats* stats,
const Options& db_options,
bool* value_found) {
Slice ikey = k.internal_key();
Slice user_key = k.user_key();
const Comparator* ucmp = vset_->icmp_.user_comparator();
auto merge_operator = db_options.merge_operator.get();
auto logger = db_options.info_log.get();
assert(status->ok() || status->IsMergeInProgress());
Saver saver;
saver.state = status->ok()? kNotFound : kMerge;
saver.ucmp = ucmp;
saver.ucmp = user_comparator_;
saver.user_key = user_key;
saver.value_found = value_found;
saver.value = value;
saver.merge_operator = merge_operator;
saver.merge_operator = merge_operator_;
saver.merge_context = merge_context;
saver.logger = logger;
saver.logger = info_log_;
saver.didIO = false;
saver.statistics = db_options.statistics.get();
saver.statistics = db_statistics_;
stats->seek_file = nullptr;
stats->seek_file_level = -1;
@ -548,7 +549,7 @@ void Version::Get(const ReadOptions& options,
// On Level-n (n>=1), files are sorted.
// Binary search to find earliest index whose largest key >= ikey.
// We will also stop when the file no longer overlaps ikey
start_index = FindFile(vset_->icmp_, files_[level], ikey);
start_index = FindFile(*internal_comparator_, files_[level], ikey);
}
// Traverse each relevant file to find the desired key
@ -557,8 +558,10 @@ void Version::Get(const ReadOptions& options,
#endif
for (uint32_t i = start_index; i < num_files; ++i) {
FileMetaData* f = files[i];
if (ucmp->Compare(user_key, f->smallest.user_key()) < 0 ||
ucmp->Compare(user_key, f->largest.user_key()) > 0) {
// Skip key range filtering for levle 0 if there are few level 0 files.
if ((level > 0 || num_files > 2) &&
(user_comparator_->Compare(user_key, f->smallest.user_key()) < 0 ||
user_comparator_->Compare(user_key, f->largest.user_key()) > 0)) {
// Only process overlapping files.
if (level > 0) {
// If on Level-n (n>=1) then the files are sorted.
@ -574,7 +577,8 @@ void Version::Get(const ReadOptions& options,
// Sanity check to make sure that the files are correctly sorted
if (prev_file) {
if (level != 0) {
int comp_sign = vset_->icmp_.Compare(prev_file->largest, f->smallest);
int comp_sign =
internal_comparator_->Compare(prev_file->largest, f->smallest);
assert(comp_sign < 0);
} else {
// level == 0, the current file cannot be newer than the previous one.
@ -588,9 +592,8 @@ void Version::Get(const ReadOptions& options,
prev_file = f;
#endif
bool tableIO = false;
*status =
vset_->table_cache_->Get(options, vset_->icmp_, *f, ikey, &saver,
SaveValue, &tableIO, MarkKeyMayExist);
*status = table_cache_->Get(options, *internal_comparator_, *f, ikey,
&saver, SaveValue, &tableIO, MarkKeyMayExist);
// TODO: examine the behavior for corrupted key
if (!status->ok()) {
return;
@ -635,12 +638,12 @@ void Version::Get(const ReadOptions& options,
if (kMerge == saver.state) {
// merge_operands are in saver and we hit the beginning of the key history
// do a final merge of nullptr and operands;
if (merge_operator->FullMerge(user_key, nullptr,
saver.merge_context->GetOperands(),
value, logger)) {
if (merge_operator_->FullMerge(user_key, nullptr,
saver.merge_context->GetOperands(), value,
info_log_)) {
*status = Status::OK();
} else {
RecordTick(db_options.statistics.get(), NUMBER_MERGE_FAILURES);
RecordTick(db_statistics_, NUMBER_MERGE_FAILURES);
*status = Status::Corruption("could not perform end-of-key merge for ",
user_key);
}
@ -1447,6 +1450,12 @@ class VersionSet::Builder {
file_meta->number, file_meta->file_size,
&file_meta->table_reader_handle,
&table_io, false);
if (file_meta->table_reader_handle != nullptr) {
// Load table_reader
file_meta->table_reader =
vset_->table_cache_->GetTableReaderFromHandle(
file_meta->table_reader_handle);
}
}
}
}

View File

@ -83,8 +83,7 @@ class Version {
int seek_file_level;
};
void Get(const ReadOptions&, const LookupKey& key, std::string* val,
Status* status, MergeContext* merge_context,
GetStats* stats, const Options& db_option,
Status* status, MergeContext* merge_context, GetStats* stats,
bool* value_found = nullptr);
// Adds "stats" into the current state. Returns true if a new
@ -224,6 +223,12 @@ class Version {
void UpdateFilesBySize();
VersionSet* vset_; // VersionSet to which this Version belongs
const InternalKeyComparator* internal_comparator_;
const Comparator* user_comparator_;
TableCache* table_cache_;
const MergeOperator* merge_operator_;
Logger* info_log_;
Statistics* db_statistics_;
Version* next_; // Next version in linked list
Version* prev_; // Previous version in linked list
int refs_; // Number of live refs to this version

View File

@ -223,8 +223,13 @@ extern MemTableRepFactory* NewHashSkipListRepFactory(
// The factory is to create memtables with a hashed linked list:
// it contains a fixed array of buckets, each pointing to a sorted single
// linked list (null if the bucket is empty).
// bucket_count: number of fixed array buckets
// @bucket_count: number of fixed array buckets
// @huge_page_tlb_size: if <=0, allocate the hash table bytes from malloc.
// Otherwise from huge page TLB. The user needs to reserve
// huge pages for it to be allocated, like:
// sysctl -w vm.nr_hugepages=20
// See linux doc Documentation/vm/hugetlbpage.txt
extern MemTableRepFactory* NewHashLinkListRepFactory(
size_t bucket_count = 50000);
size_t bucket_count = 50000, size_t huge_page_tlb_size = 0);
} // namespace rocksdb

View File

@ -719,6 +719,14 @@ struct Options {
// number of hash probes per key
uint32_t memtable_prefix_bloom_probes;
// Page size for huge page TLB for bloom in memtable. If <=0, not allocate
// from huge page TLB but from malloc.
// Need to reserve huge pages for it to be allocated. For example:
// sysctl -w vm.nr_hugepages=20
// See linux doc Documentation/vm/hugetlbpage.txt
size_t memtable_prefix_bloom_huge_page_tlb_size;
// Control locality of bloom filter probes to improve cache miss rate.
// This option only applies to memtable prefix bloom and plaintable
// prefix bloom. It essentially limits the max number of cache lines each

View File

@ -60,6 +60,12 @@ struct BlockBasedTableOptions {
// A space efficient index block that is optimized for
// binary-search-based index.
kBinarySearch,
// The hash index, if enabled, will do the hash lookup when
// `ReadOption.prefix_seek == true`. User should also specify
// `Options.prefix_extractor` to allow the index block to correctly
// extract the prefix of the given key and perform hash table lookup.
kHashSearch,
};
IndexType index_type = kBinarySearch;
@ -91,12 +97,19 @@ extern TableFactory* NewBlockBasedTableFactory(
// in the hash table
// @index_sparseness: inside each prefix, need to build one index record for how
// many keys for binary search inside each hash bucket.
// @huge_page_tlb_size: if <=0, allocate hash indexes and blooms from malloc.
// Otherwise from huge page TLB. The user needs to reserve
// huge pages for it to be allocated, like:
// sysctl -w vm.nr_hugepages=20
// See linux doc Documentation/vm/hugetlbpage.txt
const uint32_t kPlainTableVariableLength = 0;
extern TableFactory* NewPlainTableFactory(uint32_t user_key_len =
kPlainTableVariableLength,
int bloom_bits_per_prefix = 10,
double hash_table_ratio = 0.75,
size_t index_sparseness = 16);
size_t index_sparseness = 16,
size_t huge_page_tlb_size = 0);
// -- Plain Table
// This factory of plain table ignores Options.prefix_extractor and assumes no
@ -110,9 +123,15 @@ extern TableFactory* NewPlainTableFactory(uint32_t user_key_len =
// disable it by passing a zero.
// @index_sparseness: need to build one index record for how many keys for
// binary search.
// @huge_page_tlb_size: if <=0, allocate hash indexes and blooms from malloc.
// Otherwise from huge page TLB. The user needs to reserve
// huge pages for it to be allocated, like:
// sysctl -w vm.nr_hugepages=20
// See linux doc Documentation/vm/hugetlbpage.txt
extern TableFactory* NewTotalOrderPlainTableFactory(
uint32_t user_key_len = kPlainTableVariableLength,
int bloom_bits_per_key = 0, size_t index_sparseness = 16);
int bloom_bits_per_key = 0, size_t index_sparseness = 16,
size_t huge_page_tlb_size = 0);
// A base class for table factories.
class TableFactory {

View File

@ -23,7 +23,7 @@ namespace rocksdb {
// ++pos) {
// ...
// }
typedef std::map<std::string, std::string> UserCollectedProperties;
typedef std::map<const std::string, std::string> UserCollectedProperties;
// TableProperties contains a bunch of read-only properties of its associated
// table.

View File

@ -11,16 +11,20 @@
#include "table/block.h"
#include <vector>
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>
#include "rocksdb/comparator.h"
#include "table/block_hash_index.h"
#include "table/format.h"
#include "util/coding.h"
#include "util/logging.h"
namespace rocksdb {
inline uint32_t Block::NumRestarts() const {
uint32_t Block::NumRestarts() const {
assert(size_ >= 2*sizeof(uint32_t));
return DecodeFixed32(data_ + size_ - sizeof(uint32_t));
}
@ -92,6 +96,7 @@ class Block::Iter : public Iterator {
std::string key_;
Slice value_;
Status status_;
BlockHashIndex* hash_index_;
inline int Compare(const Slice& a, const Slice& b) const {
return comparator_->Compare(a, b);
@ -118,16 +123,15 @@ class Block::Iter : public Iterator {
}
public:
Iter(const Comparator* comparator,
const char* data,
uint32_t restarts,
uint32_t num_restarts)
Iter(const Comparator* comparator, const char* data, uint32_t restarts,
uint32_t num_restarts, BlockHashIndex* hash_index)
: comparator_(comparator),
data_(data),
restarts_(restarts),
num_restarts_(num_restarts),
current_(restarts_),
restart_index_(num_restarts_) {
restart_index_(num_restarts_),
hash_index_(hash_index) {
assert(num_restarts_ > 0);
}
@ -169,45 +173,22 @@ class Block::Iter : public Iterator {
}
virtual void Seek(const Slice& target) {
// Binary search in restart array to find the first restart point
// with a key >= target
uint32_t left = 0;
uint32_t right = num_restarts_ - 1;
while (left < right) {
uint32_t mid = (left + right + 1) / 2;
uint32_t region_offset = GetRestartPoint(mid);
uint32_t shared, non_shared, value_length;
const char* key_ptr = DecodeEntry(data_ + region_offset,
data_ + restarts_,
&shared, &non_shared, &value_length);
if (key_ptr == nullptr || (shared != 0)) {
CorruptionError();
uint32_t index = 0;
bool ok = hash_index_ ? HashSeek(target, &index)
: BinarySeek(target, 0, num_restarts_ - 1, &index);
if (!ok) {
return;
}
Slice mid_key(key_ptr, non_shared);
if (Compare(mid_key, target) < 0) {
// Key at "mid" is smaller than "target". Therefore all
// blocks before "mid" are uninteresting.
left = mid;
} else {
// Key at "mid" is >= "target". Therefore all blocks at or
// after "mid" are uninteresting.
right = mid - 1;
}
}
SeekToRestartPoint(index);
// Linear search (within restart block) for first key >= target
SeekToRestartPoint(left);
while (true) {
if (!ParseNextKey()) {
return;
}
if (Compare(key_, target) >= 0) {
return;
}
}
}
while (true) {
if (!ParseNextKey() || Compare(key_, target) >= 0) {
return;
}
}
}
virtual void SeekToFirst() {
SeekToRestartPoint(0);
ParseNextKey();
@ -257,6 +238,53 @@ class Block::Iter : public Iterator {
return true;
}
}
// Binary search in restart array to find the first restart point
// with a key >= target
bool BinarySeek(const Slice& target, uint32_t left, uint32_t right,
uint32_t* index) {
assert(left <= right);
while (left < right) {
uint32_t mid = (left + right + 1) / 2;
uint32_t region_offset = GetRestartPoint(mid);
uint32_t shared, non_shared, value_length;
const char* key_ptr =
DecodeEntry(data_ + region_offset, data_ + restarts_, &shared,
&non_shared, &value_length);
if (key_ptr == nullptr || (shared != 0)) {
CorruptionError();
return false;
}
Slice mid_key(key_ptr, non_shared);
if (Compare(mid_key, target) < 0) {
// Key at "mid" is smaller than "target". Therefore all
// blocks before "mid" are uninteresting.
left = mid;
} else {
// Key at "mid" is >= "target". Therefore all blocks at or
// after "mid" are uninteresting.
right = mid - 1;
}
}
*index = left;
return true;
}
bool HashSeek(const Slice& target, uint32_t* index) {
assert(hash_index_);
auto restart_index = hash_index_->GetRestartIndex(target);
if (restart_index == nullptr) {
current_ = restarts_;
return 0;
}
// the elements in restart_array[index : index + num_blocks]
// are all with same prefix. We'll do binary search in that small range.
auto left = restart_index->first_index;
auto right = restart_index->first_index + restart_index->num_blocks - 1;
return BinarySeek(target, left, right, index);
}
};
Iterator* Block::NewIterator(const Comparator* cmp) {
@ -267,8 +295,13 @@ Iterator* Block::NewIterator(const Comparator* cmp) {
if (num_restarts == 0) {
return NewEmptyIterator();
} else {
return new Iter(cmp, data_, restart_offset_, num_restarts);
return new Iter(cmp, data_, restart_offset_, num_restarts,
hash_index_.get());
}
}
void Block::SetBlockHashIndex(BlockHashIndex* hash_index) {
hash_index_.reset(hash_index);
}
} // namespace rocksdb

View File

@ -10,6 +10,7 @@
#pragma once
#include <stddef.h>
#include <stdint.h>
#include "rocksdb/iterator.h"
#include "rocksdb/options.h"
@ -17,6 +18,7 @@ namespace rocksdb {
struct BlockContents;
class Comparator;
class BlockHashIndex;
class Block {
public:
@ -26,20 +28,28 @@ class Block {
~Block();
size_t size() const { return size_; }
const char* data() const { return data_; }
bool cachable() const { return cachable_; }
uint32_t NumRestarts() const;
CompressionType compression_type() const { return compression_type_; }
// If hash index lookup is enabled and `use_hash_index` is true. This block
// will do hash lookup for the key prefix.
//
// NOTE: for the hash based lookup, if a key prefix doesn't match any key,
// the iterator will simply be set as "invalid", rather than returning
// the key that is just pass the target key.
Iterator* NewIterator(const Comparator* comparator);
const char* data() { return data_; }
void SetBlockHashIndex(BlockHashIndex* hash_index);
private:
uint32_t NumRestarts() const;
const char* data_;
size_t size_;
uint32_t restart_offset_; // Offset in data_ of restart array
bool owned_; // Block owns data_[]
bool cachable_;
CompressionType compression_type_;
std::unique_ptr<BlockHashIndex> hash_index_;
// No copying allowed
Block(const Block&);

View File

@ -88,8 +88,7 @@ class IndexBuilder {
const Comparator* comparator_;
};
// This index builder builds space-efficient index block for binary-search-based
// index.
// This index builder builds space-efficient index block.
//
// Optimizations:
// 1. Made block's `block_restart_interval` to be 1, which will avoid linear
@ -97,9 +96,9 @@ class IndexBuilder {
// 2. Shorten the key length for index block. Other than honestly using the
// last key in the data block as the index key, we instead find a shortest
// substitute key that serves the same function.
class BinarySearchIndexBuilder : public IndexBuilder {
class ShortenedIndexBuilder : public IndexBuilder {
public:
explicit BinarySearchIndexBuilder(const Comparator* comparator)
explicit ShortenedIndexBuilder(const Comparator* comparator)
: IndexBuilder(comparator),
index_block_builder_(1 /* block_restart_interval == 1 */, comparator) {}
@ -128,11 +127,37 @@ class BinarySearchIndexBuilder : public IndexBuilder {
BlockBuilder index_block_builder_;
};
// FullKeyIndexBuilder is also based on BlockBuilder. It works pretty much like
// ShortenedIndexBuilder, but preserves the full key instead the substitude key.
class FullKeyIndexBuilder : public IndexBuilder {
public:
explicit FullKeyIndexBuilder(const Comparator* comparator)
: IndexBuilder(comparator),
index_block_builder_(1 /* block_restart_interval == 1 */, comparator) {}
virtual void AddEntry(std::string* last_key_in_current_block,
const Slice* first_key_in_next_block,
const BlockHandle& block_handle) override {
std::string handle_encoding;
block_handle.EncodeTo(&handle_encoding);
index_block_builder_.Add(*last_key_in_current_block, handle_encoding);
}
virtual Slice Finish() override { return index_block_builder_.Finish(); }
virtual size_t EstimatedSize() const {
return index_block_builder_.CurrentSizeEstimate();
}
private:
BlockBuilder index_block_builder_;
};
// Create a index builder based on its type.
IndexBuilder* CreateIndexBuilder(IndexType type, const Comparator* comparator) {
switch (type) {
case BlockBasedTableOptions::kBinarySearch: {
return new BinarySearchIndexBuilder(comparator);
return new ShortenedIndexBuilder(comparator);
}
default: {
assert(!"Do not recognize the index type ");
@ -296,13 +321,16 @@ struct BlockBasedTableBuilder::Rep {
}
};
// TODO(sdong): Currently only write out binary search index. In
// BlockBasedTableReader, Hash index will be built using binary search index.
BlockBasedTableBuilder::BlockBasedTableBuilder(
const Options& options, const BlockBasedTableOptions& table_options,
const InternalKeyComparator& internal_comparator, WritableFile* file,
CompressionType compression_type)
: rep_(new Rep(options, internal_comparator, file,
table_options.flush_block_policy_factory.get(),
compression_type, table_options.index_type)) {
compression_type,
BlockBasedTableOptions::IndexType::kBinarySearch)) {
if (rep_->filter_block != nullptr) {
rep_->filter_block->StartBlock(0);
}

View File

@ -25,6 +25,7 @@
#include "table/block.h"
#include "table/filter_block.h"
#include "table/block_hash_index.h"
#include "table/format.h"
#include "table/meta_blocks.h"
#include "table/two_level_iterator.h"
@ -180,19 +181,51 @@ class BinarySearchIndexReader : public IndexReader {
std::unique_ptr<Block> index_block_;
};
// TODO(kailiu) This class is only a stub for now. And the comment below is also
// not completed.
// Index that leverages an internal hash table to quicken the lookup for a given
// key.
// @param data_iter_gen, equavalent to BlockBasedTable::NewIterator(). But that
// functions requires index to be initalized. To avoid this problem external
// caller will pass a function that can create the iterator over the entries
// without the table to be fully initialized.
class HashIndexReader : public IndexReader {
public:
static Status Create(RandomAccessFile* file, const BlockHandle& index_handle,
Env* env, const Comparator* comparator,
BlockBasedTable* table,
std::function<Iterator*(Iterator*)> data_iter_gen,
const SliceTransform* prefix_extractor,
IndexReader** index_reader) {
return Status::NotSupported("not implemented yet!");
assert(prefix_extractor);
Block* index_block = nullptr;
auto s =
ReadBlockFromFile(file, ReadOptions(), index_handle, &index_block, env);
if (!s.ok()) {
return s;
}
*index_reader = new HashIndexReader(comparator, index_block);
std::unique_ptr<Iterator> index_iter(index_block->NewIterator(nullptr));
std::unique_ptr<Iterator> data_iter(
data_iter_gen(index_block->NewIterator(nullptr)));
auto hash_index = CreateBlockHashIndex(index_iter.get(), data_iter.get(),
index_block->NumRestarts(),
comparator, prefix_extractor);
index_block->SetBlockHashIndex(hash_index);
return s;
}
virtual Iterator* NewIterator() override {
return index_block_->NewIterator(comparator_);
}
virtual size_t size() const override { return index_block_->size(); }
private:
HashIndexReader(const Comparator* comparator, Block* index_block)
: IndexReader(comparator), index_block_(index_block) {
assert(index_block_ != nullptr);
}
std::unique_ptr<Block> index_block_;
};
@ -223,6 +256,11 @@ struct BlockBasedTable::Rep {
std::shared_ptr<const TableProperties> table_properties;
BlockBasedTableOptions::IndexType index_type;
// TODO(kailiu) It is very ugly to use internal key in table, since table
// module should not be relying on db module. However to make things easier
// and compatible with existing code, we introduce a wrapper that allows
// block to extract prefix without knowing if a key is internal or not.
unique_ptr<SliceTransform> internal_prefix_transform;
};
BlockBasedTable::~BlockBasedTable() {
@ -327,8 +365,20 @@ Status BlockBasedTable::Open(const Options& options, const EnvOptions& soptions,
s = ReadMetaBlock(rep, &meta, &meta_iter);
// Read the properties
bool found_properties_block = true;
meta_iter->Seek(kPropertiesBlock);
if (meta_iter->Valid() && meta_iter->key() == kPropertiesBlock) {
if (meta_iter->status().ok() &&
(!meta_iter->Valid() || meta_iter->key() != kPropertiesBlock)) {
meta_iter->Seek(kPropertiesBlockOldName);
if (meta_iter->status().ok() &&
(!meta_iter->Valid() || meta_iter->key() != kPropertiesBlockOldName)) {
found_properties_block = false;
Log(WARN, rep->options.info_log,
"Cannot find Properties block from file.");
}
}
if (found_properties_block) {
s = meta_iter->status();
TableProperties* table_properties = nullptr;
if (s.ok()) {
@ -747,8 +797,7 @@ BlockBasedTable::CachableEntry<FilterBlockReader> BlockBasedTable::GetFilter(
return { filter, cache_handle };
}
Iterator* BlockBasedTable::NewIndexIterator(const ReadOptions& read_options)
const {
Iterator* BlockBasedTable::NewIndexIterator(const ReadOptions& read_options) {
// index reader has already been pre-populated.
if (rep_->index_reader) {
return rep_->index_reader->NewIterator();
@ -954,11 +1003,14 @@ Status BlockBasedTable::Get(
return s;
}
namespace {
bool SaveDidIO(void* arg, const ParsedInternalKey& key, const Slice& value,
bool didIO) {
*reinterpret_cast<bool*>(arg) = didIO;
return false;
}
} // namespace
bool BlockBasedTable::TEST_KeyInCache(const ReadOptions& options,
const Slice& key) {
// We use Get() as it has logic that checks whether we read the
@ -975,22 +1027,51 @@ bool BlockBasedTable::TEST_KeyInCache(const ReadOptions& options,
// 3. options
// 4. internal_comparator
// 5. index_type
Status BlockBasedTable::CreateIndexReader(IndexReader** index_reader) const {
Status BlockBasedTable::CreateIndexReader(IndexReader** index_reader) {
// Some old version of block-based tables don't have index type present in
// table properties. If that's the case we can safely use the kBinarySearch.
auto index_type = BlockBasedTableOptions::kBinarySearch;
auto index_type_on_file = BlockBasedTableOptions::kBinarySearch;
if (rep_->table_properties) {
auto& props = rep_->table_properties->user_collected_properties;
auto pos = props.find(BlockBasedTablePropertyNames::kIndexType);
if (pos != props.end()) {
index_type = static_cast<BlockBasedTableOptions::IndexType>(
index_type_on_file = static_cast<BlockBasedTableOptions::IndexType>(
DecodeFixed32(pos->second.c_str()));
}
}
switch (index_type) {
// TODO(sdong): Currently binary index is the only index type we support in
// files. Hash index is built on top of binary index too.
if (index_type_on_file != BlockBasedTableOptions::kBinarySearch) {
return Status::NotSupported("File Contains not supported index type: ",
std::to_string(index_type_on_file));
}
auto file = rep_->file.get();
const auto& index_handle = rep_->index_handle;
auto env = rep_->options.env;
auto comparator = &rep_->internal_comparator;
switch (rep_->index_type) {
case BlockBasedTableOptions::kBinarySearch: {
return BinarySearchIndexReader::Create(
rep_->file.get(), rep_->index_handle, rep_->options.env,
&rep_->internal_comparator, index_reader);
return BinarySearchIndexReader::Create(file, index_handle, env,
comparator, index_reader);
}
case BlockBasedTableOptions::kHashSearch: {
// We need to wrap data with internal_prefix_transform to make sure it can
// handle prefix correctly.
rep_->internal_prefix_transform.reset(
new InternalKeySliceTransform(rep_->options.prefix_extractor.get()));
return HashIndexReader::Create(
file, index_handle, env, comparator,
[&](Iterator* index_iter) {
return NewTwoLevelIterator(
index_iter, &BlockBasedTable::DataBlockReader,
const_cast<BlockBasedTable*>(this), ReadOptions(),
rep_->soptions, rep_->internal_comparator);
},
rep_->internal_prefix_transform.get(), index_reader);
}
default: {
std::string error_message =
@ -1023,7 +1104,10 @@ uint64_t BlockBasedTable::ApproximateOffsetOf(const Slice& key) {
// key is past the last key in the file. If table_properties is not
// available, approximate the offset by returning the offset of the
// metaindex block (which is right near the end of the file).
result = 0;
if (rep_->table_properties) {
result = rep_->table_properties->data_size;
}
// table_properties is not present in the table.
if (result == 0) {
result = rep_->metaindex_handle.offset();

View File

@ -12,6 +12,7 @@
#include <stdint.h>
#include <memory>
#include <utility>
#include <string>
#include "rocksdb/statistics.h"
#include "rocksdb/status.h"
@ -131,7 +132,7 @@ class BlockBasedTable : public TableReader {
// 2. index is not present in block cache.
// 3. We disallowed any io to be performed, that is, read_options ==
// kBlockCacheTier
Iterator* NewIndexIterator(const ReadOptions& read_options) const;
Iterator* NewIndexIterator(const ReadOptions& read_options);
// Read block cache from block caches (if set): block_cache and
// block_cache_compressed.
@ -164,7 +165,7 @@ class BlockBasedTable : public TableReader {
void ReadMeta(const Footer& footer);
void ReadFilter(const Slice& filter_handle_value);
Status CreateIndexReader(IndexReader** index_reader) const;
Status CreateIndexReader(IndexReader** index_reader);
// Read the meta block from sst.
static Status ReadMetaBlock(
@ -198,4 +199,8 @@ class BlockBasedTable : public TableReader {
void operator=(const TableReader&) = delete;
};
// Backward compatible properties block name. Limited in block based
// table.
extern const std::string kPropertiesBlockOldName;
} // namespace rocksdb

View File

@ -3,7 +3,10 @@
// LICENSE file in the root directory of this source tree. An additional grant
// of patent rights can be found in the PATENTS file in the same directory.
//
#include <stdio.h>
#include <string>
#include <vector>
#include "db/dbformat.h"
#include "db/memtable.h"
#include "db/write_batch_internal.h"
@ -11,9 +14,11 @@
#include "rocksdb/env.h"
#include "rocksdb/iterator.h"
#include "rocksdb/table.h"
#include "rocksdb/slice_transform.h"
#include "table/block.h"
#include "table/block_builder.h"
#include "table/format.h"
#include "table/block_hash_index.h"
#include "util/random.h"
#include "util/testharness.h"
#include "util/testutil.h"
@ -25,6 +30,40 @@ static std::string RandomString(Random* rnd, int len) {
test::RandomString(rnd, len, &r);
return r;
}
std::string GenerateKey(int primary_key, int secondary_key, int padding_size,
Random *rnd) {
char buf[50];
char *p = &buf[0];
snprintf(buf, sizeof(buf), "%6d%4d", primary_key, secondary_key);
std::string k(p);
if (padding_size) {
k += RandomString(rnd, padding_size);
}
return k;
}
// Generate random key value pairs.
// The generated key will be sorted. You can tune the parameters to generated
// different kinds of test key/value pairs for different scenario.
void GenerateRandomKVs(std::vector<std::string> *keys,
std::vector<std::string> *values, const int from,
const int len, const int step = 1,
const int padding_size = 0,
const int keys_share_prefix = 1) {
Random rnd(302);
// generate different prefix
for (int i = from; i < from + len; i += step) {
// generating keys that shares the prefix
for (int j = 0; j < keys_share_prefix; ++j) {
keys->emplace_back(GenerateKey(i, j, padding_size, &rnd));
// 100 bytes values
values->emplace_back(RandomString(&rnd, 100));
}
}
}
class BlockTest {};
@ -39,24 +78,11 @@ TEST(BlockTest, SimpleTest) {
std::vector<std::string> values;
BlockBuilder builder(options, ic.get());
int num_records = 100000;
char buf[10];
char* p = &buf[0];
GenerateRandomKVs(&keys, &values, 0, num_records);
// add a bunch of records to a block
for (int i = 0; i < num_records; i++) {
// generate random kvs
sprintf(p, "%6d", i);
std::string k(p);
std::string v = RandomString(&rnd, 100); // 100 byte values
// write kvs to the block
Slice key(k);
Slice value(v);
builder.Add(key, value);
// remember kvs in a lookaside array
keys.push_back(k);
values.push_back(v);
builder.Add(keys[i], values[i]);
}
// read serialized contents of the block
@ -101,6 +127,114 @@ TEST(BlockTest, SimpleTest) {
delete iter;
}
// return the block contents
BlockContents GetBlockContents(std::unique_ptr<BlockBuilder> *builder,
const std::vector<std::string> &keys,
const std::vector<std::string> &values,
const int prefix_group_size = 1) {
builder->reset(
new BlockBuilder(1 /* restart interval */, BytewiseComparator()));
// Add only half of the keys
for (size_t i = 0; i < keys.size(); ++i) {
(*builder)->Add(keys[i], values[i]);
}
Slice rawblock = (*builder)->Finish();
BlockContents contents;
contents.data = rawblock;
contents.cachable = false;
contents.heap_allocated = false;
return contents;
}
void CheckBlockContents(BlockContents contents, const int max_key,
const std::vector<std::string> &keys,
const std::vector<std::string> &values) {
const size_t prefix_size = 6;
// create block reader
Block reader1(contents);
Block reader2(contents);
std::unique_ptr<const SliceTransform> prefix_extractor(
NewFixedPrefixTransform(prefix_size));
{
auto iter1 = reader1.NewIterator(nullptr);
auto iter2 = reader1.NewIterator(nullptr);
reader1.SetBlockHashIndex(CreateBlockHashIndex(iter1, iter2, keys.size(),
BytewiseComparator(),
prefix_extractor.get()));
delete iter1;
delete iter2;
}
std::unique_ptr<Iterator> hash_iter(
reader1.NewIterator(BytewiseComparator()));
std::unique_ptr<Iterator> regular_iter(
reader2.NewIterator(BytewiseComparator()));
// Seek existent keys
for (size_t i = 0; i < keys.size(); i++) {
hash_iter->Seek(keys[i]);
ASSERT_OK(hash_iter->status());
ASSERT_TRUE(hash_iter->Valid());
Slice v = hash_iter->value();
ASSERT_EQ(v.ToString().compare(values[i]), 0);
}
// Seek non-existent keys.
// For hash index, if no key with a given prefix is not found, iterator will
// simply be set as invalid; whereas the binary search based iterator will
// return the one that is closest.
for (int i = 1; i < max_key - 1; i += 2) {
auto key = GenerateKey(i, 0, 0, nullptr);
hash_iter->Seek(key);
ASSERT_TRUE(!hash_iter->Valid());
regular_iter->Seek(key);
ASSERT_TRUE(regular_iter->Valid());
}
}
// In this test case, no two key share same prefix.
TEST(BlockTest, SimpleIndexHash) {
const int kMaxKey = 100000;
std::vector<std::string> keys;
std::vector<std::string> values;
GenerateRandomKVs(&keys, &values, 0 /* first key id */,
kMaxKey /* last key id */, 2 /* step */,
8 /* padding size (8 bytes randomly generated suffix) */);
std::unique_ptr<BlockBuilder> builder;
auto contents = GetBlockContents(&builder, keys, values);
CheckBlockContents(contents, kMaxKey, keys, values);
}
TEST(BlockTest, IndexHashWithSharedPrefix) {
const int kMaxKey = 100000;
// for each prefix, there will be 5 keys starts with it.
const int kPrefixGroup = 5;
std::vector<std::string> keys;
std::vector<std::string> values;
// Generate keys with same prefix.
GenerateRandomKVs(&keys, &values, 0, // first key id
kMaxKey, // last key id
2, // step
10, // padding size,
kPrefixGroup);
std::unique_ptr<BlockBuilder> builder;
auto contents = GetBlockContents(&builder, keys, values, kPrefixGroup);
CheckBlockContents(contents, kMaxKey, keys, values);
}
} // namespace rocksdb
int main(int argc, char** argv) {

View File

@ -244,6 +244,8 @@ Status ReadTableProperties(RandomAccessFile* file, uint64_t file_size,
metaindex_block.NewIterator(BytewiseComparator()));
// -- Read property block
// This function is not used by BlockBasedTable, so we don't have to
// worry about old properties block name.
meta_iter->Seek(kPropertiesBlock);
TableProperties table_properties;
if (meta_iter->Valid() &&

View File

@ -74,10 +74,12 @@ void PlainTableBuilder::Add(const Slice& key, const Slice& value) {
if (!IsFixedLength()) {
// Write key length
key_size_str_.clear();
PutVarint32(&key_size_str_, user_key_size);
file_->Append(key_size_str_);
offset_ += key_size_str_.length();
char key_size_buf[5]; // tmp buffer for key size as varint32
char* ptr = EncodeVarint32(key_size_buf, user_key_size);
assert(ptr <= key_size_buf + sizeof(key_size_buf));
auto len = ptr - key_size_buf;
file_->Append(Slice(key_size_buf, len));
offset_ += len;
}
// Write key
@ -86,25 +88,32 @@ void PlainTableBuilder::Add(const Slice& key, const Slice& value) {
status_ = Status::Corruption(Slice());
return;
}
// For value size as varint32 (up to 5 bytes).
// If the row is of value type with seqId 0, flush the special flag together
// in this buffer to safe one file append call, which takes 1 byte.
char value_size_buf[6];
size_t value_size_buf_size = 0;
if (parsed_key.sequence == 0 && parsed_key.type == kTypeValue) {
file_->Append(Slice(key.data(), user_key_size));
char tmp_char = PlainTableFactory::kValueTypeSeqId0;
file_->Append(Slice(&tmp_char, 1));
offset_ += key.size() - 7;
offset_ += user_key_size;
value_size_buf[0] = PlainTableFactory::kValueTypeSeqId0;
value_size_buf_size = 1;
} else {
file_->Append(key);
offset_ += key.size();
}
// Write value length
value_size_str_.clear();
int value_size = value.size();
PutVarint32(&value_size_str_, value_size);
file_->Append(value_size_str_);
char* end_ptr =
EncodeVarint32(value_size_buf + value_size_buf_size, value_size);
assert(end_ptr <= value_size_buf + sizeof(value_size_buf));
value_size_buf_size = end_ptr - value_size_buf;
file_->Append(Slice(value_size_buf, value_size_buf_size));
// Write value
file_->Append(value);
offset_ += value_size + value_size_str_.length();
offset_ += value_size + value_size_buf_size;
properties_.num_entries++;
properties_.raw_key_size += key.size();

View File

@ -69,9 +69,6 @@ private:
const size_t user_key_len_;
bool closed_ = false; // Either Finish() or Abandon() has been called.
std::string key_size_str_;
std::string value_size_str_;
bool IsFixedLength() const {
return user_key_len_ > 0;
}

View File

@ -21,7 +21,8 @@ Status PlainTableFactory::NewTableReader(const Options& options,
unique_ptr<TableReader>* table) const {
return PlainTableReader::Open(options, soptions, icomp, std::move(file),
file_size, table, bloom_bits_per_key_,
hash_table_ratio_, index_sparseness_);
hash_table_ratio_, index_sparseness_,
huge_page_tlb_size_);
}
TableBuilder* PlainTableFactory::NewTableBuilder(
@ -33,16 +34,19 @@ TableBuilder* PlainTableFactory::NewTableBuilder(
extern TableFactory* NewPlainTableFactory(uint32_t user_key_len,
int bloom_bits_per_key,
double hash_table_ratio,
size_t index_sparseness) {
size_t index_sparseness,
size_t huge_page_tlb_size) {
return new PlainTableFactory(user_key_len, bloom_bits_per_key,
hash_table_ratio, index_sparseness);
hash_table_ratio, index_sparseness,
huge_page_tlb_size);
}
extern TableFactory* NewTotalOrderPlainTableFactory(uint32_t user_key_len,
int bloom_bits_per_key,
size_t index_sparseness) {
size_t index_sparseness,
size_t huge_page_tlb_size) {
return new PlainTableFactory(user_key_len, bloom_bits_per_key, 0,
index_sparseness);
index_sparseness, huge_page_tlb_size);
}
} // namespace rocksdb

View File

@ -54,14 +54,19 @@ class PlainTableFactory : public TableFactory {
// inside the same prefix. It will be the maximum number of linear search
// required after hash and binary search.
// index_sparseness = 0 means index for every key.
// huge_page_tlb_size determines whether to allocate hash indexes from huge
// page TLB and the page size if allocating from there. See comments of
// Arena::AllocateAligned() for details.
explicit PlainTableFactory(uint32_t user_key_len = kPlainTableVariableLength,
int bloom_bits_per_key = 0,
double hash_table_ratio = 0.75,
size_t index_sparseness = 16)
size_t index_sparseness = 16,
size_t huge_page_tlb_size = 2 * 1024 * 1024)
: user_key_len_(user_key_len),
bloom_bits_per_key_(bloom_bits_per_key),
hash_table_ratio_(hash_table_ratio),
index_sparseness_(index_sparseness) {}
index_sparseness_(index_sparseness),
huge_page_tlb_size_(huge_page_tlb_size) {}
const char* Name() const override { return "PlainTable"; }
Status NewTableReader(const Options& options, const EnvOptions& soptions,
const InternalKeyComparator& internal_comparator,
@ -80,6 +85,7 @@ class PlainTableFactory : public TableFactory {
int bloom_bits_per_key_;
double hash_table_ratio_;
size_t index_sparseness_;
size_t huge_page_tlb_size_;
};
} // namespace rocksdb

View File

@ -23,6 +23,7 @@
#include "table/two_level_iterator.h"
#include "table/plain_table_factory.h"
#include "util/arena.h"
#include "util/coding.h"
#include "util/dynamic_bloom.h"
#include "util/hash.h"
@ -81,10 +82,9 @@ class PlainTableIterator : public Iterator {
bool use_prefix_seek_;
uint32_t offset_;
uint32_t next_offset_;
Slice key_;
IterKey key_;
Slice value_;
Status status_;
std::string tmp_str_;
// No copying allowed
PlainTableIterator(const PlainTableIterator&) = delete;
void operator=(const Iterator&) = delete;
@ -95,7 +95,8 @@ PlainTableReader::PlainTableReader(
const Options& options, unique_ptr<RandomAccessFile>&& file,
const EnvOptions& storage_options, const InternalKeyComparator& icomparator,
uint64_t file_size, int bloom_bits_per_key, double hash_table_ratio,
size_t index_sparseness, const TableProperties* table_properties)
size_t index_sparseness, const TableProperties* table_properties,
size_t huge_page_tlb_size)
: options_(options),
soptions_(storage_options),
file_(std::move(file)),
@ -104,21 +105,25 @@ PlainTableReader::PlainTableReader(
kHashTableRatio(hash_table_ratio),
kBloomBitsPerKey(bloom_bits_per_key),
kIndexIntervalForSamePrefixKeys(index_sparseness),
table_properties_(table_properties),
data_end_offset_(table_properties_->data_size),
user_key_len_(table_properties->fixed_key_len) {
table_properties_(nullptr),
data_end_offset_(table_properties->data_size),
user_key_len_(table_properties->fixed_key_len),
huge_page_tlb_size_(huge_page_tlb_size) {
assert(kHashTableRatio >= 0.0);
}
PlainTableReader::~PlainTableReader() {
}
Status PlainTableReader::Open(
const Options& options, const EnvOptions& soptions,
Status PlainTableReader::Open(const Options& options,
const EnvOptions& soptions,
const InternalKeyComparator& internal_comparator,
unique_ptr<RandomAccessFile>&& file, uint64_t file_size,
unique_ptr<TableReader>* table_reader, const int bloom_bits_per_key,
double hash_table_ratio, size_t index_sparseness) {
unique_ptr<RandomAccessFile>&& file,
uint64_t file_size,
unique_ptr<TableReader>* table_reader,
const int bloom_bits_per_key,
double hash_table_ratio, size_t index_sparseness,
size_t huge_page_tlb_size) {
assert(options.allow_mmap_reads);
if (file_size > kMaxFileSize) {
@ -134,10 +139,11 @@ Status PlainTableReader::Open(
std::unique_ptr<PlainTableReader> new_reader(new PlainTableReader(
options, std::move(file), soptions, internal_comparator, file_size,
bloom_bits_per_key, hash_table_ratio, index_sparseness, props));
bloom_bits_per_key, hash_table_ratio, index_sparseness, props,
huge_page_tlb_size));
// -- Populate Index
s = new_reader->PopulateIndex();
s = new_reader->PopulateIndex(props);
if (!s.ok()) {
return s;
}
@ -265,12 +271,11 @@ Status PlainTableReader::PopulateIndexRecordList(IndexRecordList* record_list,
}
void PlainTableReader::AllocateIndexAndBloom(int num_prefixes) {
index_.reset();
if (options_.prefix_extractor.get() != nullptr) {
uint32_t bloom_total_bits = num_prefixes * kBloomBitsPerKey;
if (bloom_total_bits > 0) {
bloom_.reset(new DynamicBloom(bloom_total_bits, options_.bloom_locality));
bloom_.reset(new DynamicBloom(bloom_total_bits, options_.bloom_locality,
6, nullptr, huge_page_tlb_size_));
}
}
@ -282,7 +287,6 @@ void PlainTableReader::AllocateIndexAndBloom(int num_prefixes) {
double hash_table_size_multipier = 1.0 / kHashTableRatio;
index_size_ = num_prefixes * hash_table_size_multipier + 1;
}
index_.reset(new uint32_t[index_size_]);
}
size_t PlainTableReader::BucketizeIndexesAndFillBloom(
@ -326,7 +330,12 @@ void PlainTableReader::FillIndexes(
const std::vector<uint32_t>& entries_per_bucket) {
Log(options_.info_log, "Reserving %zu bytes for plain table's sub_index",
kSubIndexSize);
sub_index_.reset(new char[kSubIndexSize]);
auto total_allocate_size = sizeof(uint32_t) * index_size_ + kSubIndexSize;
char* allocated =
arena_.AllocateAligned(total_allocate_size, huge_page_tlb_size_);
index_ = reinterpret_cast<uint32_t*>(allocated);
sub_index_ = allocated + sizeof(uint32_t) * index_size_;
size_t sub_index_offset = 0;
for (int i = 0; i < index_size_; i++) {
uint32_t num_keys_for_bucket = entries_per_bucket[i];
@ -364,7 +373,10 @@ void PlainTableReader::FillIndexes(
index_size_, kSubIndexSize);
}
Status PlainTableReader::PopulateIndex() {
Status PlainTableReader::PopulateIndex(TableProperties* props) {
assert(props != nullptr);
table_properties_.reset(props);
// options.prefix_extractor is requried for a hash-based look-up.
if (options_.prefix_extractor.get() == nullptr && kHashTableRatio != 0) {
return Status::NotSupported(
@ -388,7 +400,8 @@ Status PlainTableReader::PopulateIndex() {
if (IsTotalOrderMode()) {
uint32_t num_bloom_bits = table_properties_->num_entries * kBloomBitsPerKey;
if (num_bloom_bits > 0) {
bloom_.reset(new DynamicBloom(num_bloom_bits, options_.bloom_locality));
bloom_.reset(new DynamicBloom(num_bloom_bits, options_.bloom_locality, 6,
nullptr, huge_page_tlb_size_));
}
}
@ -409,6 +422,14 @@ Status PlainTableReader::PopulateIndex() {
// From the temp data structure, populate indexes.
FillIndexes(sub_index_size_needed, hash_to_offsets, entries_per_bucket);
// Fill two table properties.
// TODO(sdong): after we have the feature of storing index in file, this
// properties need to be populated to index_size instead.
props->user_collected_properties["plain_table_hash_table_size"] =
std::to_string(index_size_ * 4U);
props->user_collected_properties["plain_table_sub_index_size"] =
std::to_string(sub_index_size_needed);
return Status::OK();
}
@ -720,9 +741,7 @@ void PlainTableIterator::Next() {
status_ = table_->Next(&next_offset_, &parsed_key, &value_);
if (status_.ok()) {
// Make a copy in this case. TODO optimize.
tmp_str_.clear();
AppendInternalKey(&tmp_str_, parsed_key);
key_ = Slice(tmp_str_);
key_.SetInternalKey(parsed_key);
} else {
offset_ = next_offset_ = table_->data_end_offset_;
}
@ -735,7 +754,7 @@ void PlainTableIterator::Prev() {
Slice PlainTableIterator::key() const {
assert(Valid());
return key_;
return key_.GetKey();
}
Slice PlainTableIterator::value() const {

View File

@ -17,6 +17,7 @@
#include "rocksdb/table_properties.h"
#include "table/table_reader.h"
#include "table/plain_table_factory.h"
#include "util/arena.h"
namespace rocksdb {
@ -50,7 +51,7 @@ class PlainTableReader: public TableReader {
unique_ptr<RandomAccessFile>&& file, uint64_t file_size,
unique_ptr<TableReader>* table,
const int bloom_bits_per_key, double hash_table_ratio,
size_t index_sparseness);
size_t index_sparseness, size_t huge_page_tlb_size);
bool PrefixMayMatch(const Slice& internal_prefix);
@ -74,7 +75,8 @@ class PlainTableReader: public TableReader {
const InternalKeyComparator& internal_comparator,
uint64_t file_size, int bloom_num_bits,
double hash_table_ratio, size_t index_sparseness,
const TableProperties* table_properties);
const TableProperties* table_properties,
size_t huge_page_tlb_size);
virtual ~PlainTableReader();
protected:
@ -86,6 +88,9 @@ class PlainTableReader: public TableReader {
// PopulateIndex() builds index of keys. It must be called before any query
// to the table.
//
// props: the table properties object that need to be stored. Ownership of
// the object will be passed.
//
// index_ contains buckets size of index_size_, each is a
// 32-bit integer. The lower 31 bits contain an offset value (explained below)
// and the first bit of the integer indicates type of the offset.
@ -121,7 +126,7 @@ class PlainTableReader: public TableReader {
// ....
// record N file offset: fixedint32
// <end>
Status PopulateIndex();
Status PopulateIndex(TableProperties* props);
private:
struct IndexRecord;
@ -133,9 +138,9 @@ class PlainTableReader: public TableReader {
// For more details about the in-memory index, please refer to:
// https://github.com/facebook/rocksdb/wiki/PlainTable-Format
// #wiki-in-memory-index-format
std::unique_ptr<uint32_t[]> index_;
uint32_t* index_;
int index_size_ = 0;
std::unique_ptr<char[]> sub_index_;
char* sub_index_;
Options options_;
const EnvOptions& soptions_;
@ -156,6 +161,7 @@ class PlainTableReader: public TableReader {
const size_t kIndexIntervalForSamePrefixKeys = 16;
// Bloom filter is used to rule out non-existent key
unique_ptr<DynamicBloom> bloom_;
Arena arena_;
std::shared_ptr<const TableProperties> table_properties_;
// data_start_offset_ and data_end_offset_ defines the range of the
@ -163,6 +169,7 @@ class PlainTableReader: public TableReader {
const uint32_t data_start_offset_ = 0;
const uint32_t data_end_offset_;
const size_t user_key_len_;
const size_t huge_page_tlb_size_;
static const size_t kNumInternalBytes = 8;
static const uint32_t kSubIndexMask = 0x80000000;

View File

@ -91,5 +91,7 @@ const std::string TablePropertiesNames::kFixedKeyLen =
"rocksdb.fixed.key.length";
extern const std::string kPropertiesBlock = "rocksdb.properties";
// Old property block name for backward compatibility
extern const std::string kPropertiesBlockOldName = "rocksdb.stats";
} // namespace rocksdb

View File

@ -1055,6 +1055,116 @@ static std::string RandomString(Random* rnd, int len) {
return r;
}
void AddInternalKey(TableConstructor* c, const std::string prefix,
int suffix_len = 800) {
static Random rnd(1023);
InternalKey k(prefix + RandomString(&rnd, 800), 0, kTypeValue);
c->Add(k.Encode().ToString(), "v");
}
TEST(TableTest, HashIndexTest) {
TableConstructor c(BytewiseComparator());
// keys with prefix length 3, make sure the key/value is big enough to fill
// one block
AddInternalKey(&c, "0015");
AddInternalKey(&c, "0035");
AddInternalKey(&c, "0054");
AddInternalKey(&c, "0055");
AddInternalKey(&c, "0056");
AddInternalKey(&c, "0057");
AddInternalKey(&c, "0058");
AddInternalKey(&c, "0075");
AddInternalKey(&c, "0076");
AddInternalKey(&c, "0095");
std::vector<std::string> keys;
KVMap kvmap;
Options options;
BlockBasedTableOptions table_options;
table_options.index_type = BlockBasedTableOptions::kHashSearch;
options.table_factory.reset(new BlockBasedTableFactory(table_options));
options.prefix_extractor.reset(NewFixedPrefixTransform(3));
options.block_cache = NewLRUCache(1024);
options.block_size = 1700;
std::unique_ptr<InternalKeyComparator> comparator(
new InternalKeyComparator(BytewiseComparator()));
c.Finish(options, *comparator, &keys, &kvmap);
auto reader = c.table_reader();
auto props = c.table_reader()->GetTableProperties();
ASSERT_EQ(5u, props->num_data_blocks);
std::unique_ptr<Iterator> hash_iter(reader->NewIterator(ReadOptions()));
// -- Find keys do not exist, but have common prefix.
std::vector<std::string> prefixes = {"001", "003", "005", "007", "009"};
std::vector<std::string> lower_bound = {keys[0], keys[1], keys[2],
keys[7], keys[9], };
// find the lower bound of the prefix
for (size_t i = 0; i < prefixes.size(); ++i) {
hash_iter->Seek(InternalKey(prefixes[i], 0, kTypeValue).Encode());
ASSERT_OK(hash_iter->status());
ASSERT_TRUE(hash_iter->Valid());
// seek the first element in the block
ASSERT_EQ(lower_bound[i], hash_iter->key().ToString());
ASSERT_EQ("v", hash_iter->value().ToString());
}
// find the upper bound of prefixes
std::vector<std::string> upper_bound = {keys[1], keys[2], keys[7], keys[9], };
// find existing keys
for (const auto& item : kvmap) {
auto ukey = ExtractUserKey(item.first).ToString();
hash_iter->Seek(ukey);
// ASSERT_OK(regular_iter->status());
ASSERT_OK(hash_iter->status());
// ASSERT_TRUE(regular_iter->Valid());
ASSERT_TRUE(hash_iter->Valid());
ASSERT_EQ(item.first, hash_iter->key().ToString());
ASSERT_EQ(item.second, hash_iter->value().ToString());
}
for (size_t i = 0; i < prefixes.size(); ++i) {
// the key is greater than any existing keys.
auto key = prefixes[i] + "9";
hash_iter->Seek(InternalKey(key, 0, kTypeValue).Encode());
ASSERT_OK(hash_iter->status());
if (i == prefixes.size() - 1) {
// last key
ASSERT_TRUE(!hash_iter->Valid());
} else {
ASSERT_TRUE(hash_iter->Valid());
// seek the first element in the block
ASSERT_EQ(upper_bound[i], hash_iter->key().ToString());
ASSERT_EQ("v", hash_iter->value().ToString());
}
}
// find keys with prefix that don't match any of the existing prefixes.
std::vector<std::string> non_exist_prefixes = {"002", "004", "006", "008"};
for (const auto& prefix : non_exist_prefixes) {
hash_iter->Seek(InternalKey(prefix, 0, kTypeValue).Encode());
// regular_iter->Seek(prefix);
ASSERT_OK(hash_iter->status());
ASSERT_TRUE(!hash_iter->Valid());
}
}
// It's very hard to figure out the index block size of a block accurately.
// To make sure we get the index size, we just make sure as key number
// grows, the filter block size also grows.

View File

@ -8,6 +8,7 @@
// found in the LICENSE file. See the AUTHORS file for names of contributors.
#include "util/arena.h"
#include <sys/mman.h>
#include <algorithm>
namespace rocksdb {
@ -38,6 +39,13 @@ Arena::~Arena() {
for (const auto& block : blocks_) {
delete[] block;
}
for (const auto& mmap_info : huge_blocks_) {
auto ret = munmap(mmap_info.addr_, mmap_info.length_);
if (ret != 0) {
// TODO(sdong): Better handling
perror("munmap");
}
}
}
char* Arena::AllocateFallback(size_t bytes, bool aligned) {
@ -63,9 +71,29 @@ char* Arena::AllocateFallback(size_t bytes, bool aligned) {
}
}
char* Arena::AllocateAligned(size_t bytes) {
char* Arena::AllocateAligned(size_t bytes, size_t huge_page_tlb_size) {
assert((kAlignUnit & (kAlignUnit - 1)) ==
0); // Pointer size should be a power of 2
#ifdef OS_LINUX
if (huge_page_tlb_size > 0 && bytes > 0) {
// Allocate from a huge page TBL table.
size_t reserved_size =
((bytes - 1U) / huge_page_tlb_size + 1U) * huge_page_tlb_size;
assert(reserved_size >= bytes);
void* addr = mmap(nullptr, reserved_size, (PROT_READ | PROT_WRITE),
(MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB), 0, 0);
if (addr == MAP_FAILED) {
perror("mmap");
// fail back to malloc
} else {
blocks_memory_ += reserved_size;
huge_blocks_.push_back(MmapInfo(addr, reserved_size));
return reinterpret_cast<char*>(addr);
}
}
#endif
size_t current_mod =
reinterpret_cast<uintptr_t>(aligned_alloc_ptr_) & (kAlignUnit - 1);
size_t slop = (current_mod == 0 ? 0 : kAlignUnit - current_mod);

View File

@ -34,7 +34,14 @@ class Arena {
char* Allocate(size_t bytes);
char* AllocateAligned(size_t bytes);
// huge_page_tlb_size: if >0, allocate bytes from huge page TLB and the size
// of the huge page TLB. Bytes will be rounded up to multiple and 2MB and
// allocate huge pages through mmap anonymous option with huge page on.
// The extra space allocated will be wasted. To enable it, need to reserve
// huge pages for it to be allocated, like:
// sysctl -w vm.nr_hugepages=20
// See linux doc Documentation/vm/hugetlbpage.txt for details.
char* AllocateAligned(size_t bytes, size_t huge_page_tlb_size = 0);
// Returns an estimate of the total memory usage of data allocated
// by the arena (exclude the space allocated but not yet used for future
@ -60,6 +67,14 @@ class Arena {
// Array of new[] allocated memory blocks
typedef std::vector<char*> Blocks;
Blocks blocks_;
struct MmapInfo {
void* addr_;
size_t length_;
MmapInfo(void* addr, size_t length) : addr_(addr), length_(length) {}
};
std::vector<MmapInfo> huge_blocks_;
size_t irregular_block_num = 0;
// Stats for current active block.

View File

@ -19,15 +19,16 @@ static uint32_t BloomHash(const Slice& key) {
}
}
DynamicBloom::DynamicBloom(uint32_t total_bits,
uint32_t cl_per_block,
DynamicBloom::DynamicBloom(uint32_t total_bits, uint32_t cl_per_block,
uint32_t num_probes,
uint32_t (*hash_func)(const Slice& key))
uint32_t (*hash_func)(const Slice& key),
size_t huge_page_tlb_size)
: kBlocked(cl_per_block > 0),
kBitsPerBlock(std::min(cl_per_block, num_probes) * CACHE_LINE_SIZE * 8),
kTotalBits((kBlocked ? (total_bits + kBitsPerBlock - 1) / kBitsPerBlock
* kBitsPerBlock :
total_bits + 7) / 8 * 8),
kTotalBits((kBlocked ? (total_bits + kBitsPerBlock - 1) / kBitsPerBlock *
kBitsPerBlock
: total_bits + 7) /
8 * 8),
kNumBlocks(kBlocked ? kTotalBits / kBitsPerBlock : 1),
kNumProbes(num_probes),
hash_func_(hash_func == nullptr ? &BloomHash : hash_func) {
@ -38,7 +39,9 @@ DynamicBloom::DynamicBloom(uint32_t total_bits,
if (kBlocked) {
sz += CACHE_LINE_SIZE - 1;
}
raw_ = new unsigned char[sz]();
raw_ = reinterpret_cast<unsigned char*>(
arena_.AllocateAligned(sz, huge_page_tlb_size));
memset(raw_, 0, sz);
if (kBlocked && (reinterpret_cast<uint64_t>(raw_) % CACHE_LINE_SIZE)) {
data_ = raw_ + CACHE_LINE_SIZE -
reinterpret_cast<uint64_t>(raw_) % CACHE_LINE_SIZE;

View File

@ -8,6 +8,8 @@
#include <atomic>
#include <memory>
#include <util/arena.h>
namespace rocksdb {
class Slice;
@ -19,13 +21,17 @@ class DynamicBloom {
// cl_per_block: block size in cache lines. When this is non-zero, a
// query/set is done within a block to improve cache locality.
// hash_func: customized hash function
// huge_page_tlb_size: if >0, try to allocate bloom bytes from huge page TLB
// withi this page size. Need to reserve huge pages for
// it to be allocated, like:
// sysctl -w vm.nr_hugepages=20
// See linux doc Documentation/vm/hugetlbpage.txt
explicit DynamicBloom(uint32_t total_bits, uint32_t cl_per_block = 0,
uint32_t num_probes = 6,
uint32_t (*hash_func)(const Slice& key) = nullptr);
uint32_t (*hash_func)(const Slice& key) = nullptr,
size_t huge_page_tlb_size = 0);
~DynamicBloom() {
delete[] raw_;
}
~DynamicBloom() {}
// Assuming single threaded access to this function.
void Add(const Slice& key);
@ -49,6 +55,8 @@ class DynamicBloom {
uint32_t (*hash_func_)(const Slice& key);
unsigned char* data_;
unsigned char* raw_;
Arena arena_;
};
inline void DynamicBloom::Add(const Slice& key) { AddHash(hash_func_(key)); }

View File

@ -761,6 +761,10 @@ class PosixWritableFile : public WritableFile {
}
virtual Status Sync() {
Status s = Flush();
if (!s.ok()) {
return s;
}
TEST_KILL_RANDOM(rocksdb_kill_odds);
if (pending_sync_ && fdatasync(fd_) < 0) {
return IOError(filename_, errno);
@ -771,6 +775,10 @@ class PosixWritableFile : public WritableFile {
}
virtual Status Fsync() {
Status s = Flush();
if (!s.ok()) {
return s;
}
TEST_KILL_RANDOM(rocksdb_kill_odds);
if (pending_fsync_ && fsync(fd_) < 0) {
return IOError(filename_, errno);

View File

@ -290,7 +290,6 @@ TEST(EnvPosixTest, AllocateTest) {
// allocate 100 MB
size_t kPreallocateSize = 100 * 1024 * 1024;
size_t kBlockSize = 512;
size_t kPageSize = 4096;
std::string data = "test";
wfile->SetPreallocationBlockSize(kPreallocateSize);
ASSERT_OK(wfile->Append(Slice(data)));
@ -299,8 +298,9 @@ TEST(EnvPosixTest, AllocateTest) {
struct stat f_stat;
stat(fname.c_str(), &f_stat);
ASSERT_EQ((unsigned int)data.size(), f_stat.st_size);
auto st_blocks = f_stat.st_blocks;
// verify that blocks are preallocated
ASSERT_EQ((unsigned int)(kPreallocateSize / kBlockSize), f_stat.st_blocks);
ASSERT_LE((unsigned int)(kPreallocateSize / kBlockSize), st_blocks);
// close the file, should deallocate the blocks
wfile.reset();
@ -308,8 +308,7 @@ TEST(EnvPosixTest, AllocateTest) {
stat(fname.c_str(), &f_stat);
ASSERT_EQ((unsigned int)data.size(), f_stat.st_size);
// verify that preallocated blocks were deallocated on file close
size_t data_blocks_pages = ((data.size() + kPageSize - 1) / kPageSize);
ASSERT_EQ((unsigned int)(data_blocks_pages * kPageSize / kBlockSize), f_stat.st_blocks);
ASSERT_LT(f_stat.st_blocks, st_blocks);
}
#endif

View File

@ -52,7 +52,8 @@ struct Node {
class HashLinkListRep : public MemTableRep {
public:
HashLinkListRep(const MemTableRep::KeyComparator& compare, Arena* arena,
const SliceTransform* transform, size_t bucket_size);
const SliceTransform* transform, size_t bucket_size,
size_t huge_page_tlb_size);
virtual KeyHandle Allocate(const size_t len, char** buf) override;
@ -308,13 +309,13 @@ class HashLinkListRep : public MemTableRep {
HashLinkListRep::HashLinkListRep(const MemTableRep::KeyComparator& compare,
Arena* arena, const SliceTransform* transform,
size_t bucket_size)
size_t bucket_size, size_t huge_page_tlb_size)
: MemTableRep(arena),
bucket_size_(bucket_size),
transform_(transform),
compare_(compare) {
char* mem = arena_->AllocateAligned(
sizeof(port::AtomicPointer) * bucket_size);
char* mem = arena_->AllocateAligned(sizeof(port::AtomicPointer) * bucket_size,
huge_page_tlb_size);
buckets_ = new (mem) port::AtomicPointer[bucket_size];
@ -476,11 +477,13 @@ Node* HashLinkListRep::FindGreaterOrEqualInBucket(Node* head,
MemTableRep* HashLinkListRepFactory::CreateMemTableRep(
const MemTableRep::KeyComparator& compare, Arena* arena,
const SliceTransform* transform) {
return new HashLinkListRep(compare, arena, transform, bucket_count_);
return new HashLinkListRep(compare, arena, transform, bucket_count_,
huge_page_tlb_size_);
}
MemTableRepFactory* NewHashLinkListRepFactory(size_t bucket_count) {
return new HashLinkListRepFactory(bucket_count);
MemTableRepFactory* NewHashLinkListRepFactory(size_t bucket_count,
size_t huge_page_tlb_size) {
return new HashLinkListRepFactory(bucket_count, huge_page_tlb_size);
}
} // namespace rocksdb

View File

@ -14,8 +14,9 @@ namespace rocksdb {
class HashLinkListRepFactory : public MemTableRepFactory {
public:
explicit HashLinkListRepFactory(size_t bucket_count)
: bucket_count_(bucket_count) { }
explicit HashLinkListRepFactory(size_t bucket_count,
size_t huge_page_tlb_size)
: bucket_count_(bucket_count), huge_page_tlb_size_(huge_page_tlb_size) {}
virtual ~HashLinkListRepFactory() {}
@ -29,6 +30,7 @@ class HashLinkListRepFactory : public MemTableRepFactory {
private:
const size_t bucket_count_;
const size_t huge_page_tlb_size_;
};
}

View File

@ -317,6 +317,10 @@ Options::Dump(Logger* log) const
memtable_prefix_bloom_bits);
Log(log, " Options.memtable_prefix_bloom_probes: %d",
memtable_prefix_bloom_probes);
Log(log, " Options.memtable_prefix_bloom_huge_page_tlb_size: %zu",
memtable_prefix_bloom_huge_page_tlb_size);
Log(log, " Options.bloom_locality: %d",
bloom_locality);
Log(log, " Options.max_successive_merges: %zd",
max_successive_merges);
} // Options::Dump