2016-02-09 15:12:00 -08:00
|
|
|
// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
|
2017-07-15 16:03:42 -07:00
|
|
|
// This source code is licensed under both the GPLv2 (found in the
|
|
|
|
// COPYING file in the root directory) and Apache 2.0 License
|
|
|
|
// (found in the LICENSE.Apache file in the root directory).
|
2013-10-31 13:38:54 -07:00
|
|
|
|
2014-05-09 08:34:18 -07:00
|
|
|
#ifndef GFLAGS
|
|
|
|
#include <cstdio>
|
|
|
|
int main() {
|
|
|
|
fprintf(stderr, "Please install gflags to run rocksdb tools\n");
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
#else
|
|
|
|
|
2019-05-31 11:52:59 -07:00
|
|
|
#include "db/db_impl/db_impl.h"
|
2017-04-05 19:02:00 -07:00
|
|
|
#include "db/dbformat.h"
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
2019-12-13 14:47:08 -08:00
|
|
|
#include "env/composite_env_wrapper.h"
|
2019-09-16 10:31:27 -07:00
|
|
|
#include "file/random_access_file_reader.h"
|
2017-04-05 19:02:00 -07:00
|
|
|
#include "monitoring/histogram.h"
|
2013-10-31 13:38:54 -07:00
|
|
|
#include "rocksdb/db.h"
|
2013-10-28 20:34:02 -07:00
|
|
|
#include "rocksdb/slice_transform.h"
|
2013-10-31 13:38:54 -07:00
|
|
|
#include "rocksdb/table.h"
|
2019-05-30 14:47:29 -07:00
|
|
|
#include "table/block_based/block_based_table_factory.h"
|
2017-04-05 19:02:00 -07:00
|
|
|
#include "table/get_context.h"
|
2015-10-12 15:06:38 -07:00
|
|
|
#include "table/internal_iterator.h"
|
2019-05-30 14:47:29 -07:00
|
|
|
#include "table/plain/plain_table_factory.h"
|
2014-02-12 18:09:24 -08:00
|
|
|
#include "table/table_builder.h"
|
2019-05-30 11:21:38 -07:00
|
|
|
#include "test_util/testharness.h"
|
|
|
|
#include "test_util/testutil.h"
|
2019-05-30 17:39:43 -07:00
|
|
|
#include "util/gflags_compat.h"
|
2013-10-31 13:38:54 -07:00
|
|
|
|
2017-12-01 10:40:45 -08:00
|
|
|
using GFLAGS_NAMESPACE::ParseCommandLineFlags;
|
|
|
|
using GFLAGS_NAMESPACE::SetUsageMessage;
|
2014-05-09 08:34:18 -07:00
|
|
|
|
2020-02-20 12:07:53 -08:00
|
|
|
namespace ROCKSDB_NAMESPACE {
|
2014-04-09 21:17:14 -07:00
|
|
|
|
|
|
|
namespace {
|
2013-10-31 13:38:54 -07:00
|
|
|
// Make a key that i determines the first 4 characters and j determines the
|
|
|
|
// last 4 characters.
|
2013-11-15 22:23:12 -08:00
|
|
|
static std::string MakeKey(int i, int j, bool through_db) {
|
2013-10-31 13:38:54 -07:00
|
|
|
char buf[100];
|
2013-11-15 22:23:12 -08:00
|
|
|
snprintf(buf, sizeof(buf), "%04d__key___%04d", i, j);
|
|
|
|
if (through_db) {
|
|
|
|
return std::string(buf);
|
|
|
|
}
|
|
|
|
// If we directly query table, which operates on internal keys
|
|
|
|
// instead of user keys, we need to add 8 bytes of internal
|
|
|
|
// information (row type etc) to user key to make an internal
|
|
|
|
// key.
|
|
|
|
InternalKey key(std::string(buf), 0, ValueType::kTypeValue);
|
|
|
|
return key.Encode().ToString();
|
2013-10-31 13:38:54 -07:00
|
|
|
}
|
|
|
|
|
Benchmark table reader wiht nanoseconds
Summary: nanosecnods gave us better view of the performance, especially when some operations are fast so that micro seconds may only reveal less informative results.
Test Plan:
sample output:
./table_reader_bench --plain_table --time_unit=nanosecond
=======================================================================================================
InMemoryTableSimpleBenchmark: PlainTable num_key1: 4096 num_key2: 512 non_empty
=======================================================================================================
Histogram (unit: nanosecond):
Count: 6291456 Average: 475.3867 StdDev: 556.05
Min: 135.0000 Median: 400.1817 Max: 33370.0000
Percentiles: P50: 400.18 P75: 530.02 P99: 887.73 P99.9: 8843.26 P99.99: 9941.21
------------------------------------------------------
[ 120, 140 ) 2 0.000% 0.000%
[ 140, 160 ) 452 0.007% 0.007%
[ 160, 180 ) 13683 0.217% 0.225%
[ 180, 200 ) 54353 0.864% 1.089%
[ 200, 250 ) 101004 1.605% 2.694%
[ 250, 300 ) 729791 11.600% 14.294% ##
[ 300, 350 ) 616070 9.792% 24.086% ##
[ 350, 400 ) 1628021 25.877% 49.963% #####
[ 400, 450 ) 647220 10.287% 60.250% ##
[ 450, 500 ) 577206 9.174% 69.424% ##
[ 500, 600 ) 1168585 18.574% 87.999% ####
[ 600, 700 ) 506875 8.057% 96.055% ##
[ 700, 800 ) 147878 2.350% 98.406%
[ 800, 900 ) 42633 0.678% 99.083%
[ 900, 1000 ) 16304 0.259% 99.342%
[ 1000, 1200 ) 7811 0.124% 99.466%
[ 1200, 1400 ) 1453 0.023% 99.490%
[ 1400, 1600 ) 307 0.005% 99.494%
[ 1600, 1800 ) 81 0.001% 99.496%
[ 1800, 2000 ) 18 0.000% 99.496%
[ 2000, 2500 ) 8 0.000% 99.496%
[ 2500, 3000 ) 6 0.000% 99.496%
[ 3500, 4000 ) 3 0.000% 99.496%
[ 4000, 4500 ) 116 0.002% 99.498%
[ 4500, 5000 ) 1144 0.018% 99.516%
[ 5000, 6000 ) 1087 0.017% 99.534%
[ 6000, 7000 ) 2403 0.038% 99.572%
[ 7000, 8000 ) 9840 0.156% 99.728%
[ 8000, 9000 ) 12820 0.204% 99.932%
[ 9000, 10000 ) 3881 0.062% 99.994%
[ 10000, 12000 ) 135 0.002% 99.996%
[ 12000, 14000 ) 159 0.003% 99.998%
[ 14000, 16000 ) 58 0.001% 99.999%
[ 16000, 18000 ) 30 0.000% 100.000%
[ 18000, 20000 ) 14 0.000% 100.000%
[ 20000, 25000 ) 2 0.000% 100.000%
[ 25000, 30000 ) 2 0.000% 100.000%
[ 30000, 35000 ) 1 0.000% 100.000%
Reviewers: haobo, dhruba, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16113
2014-02-13 13:55:04 -08:00
|
|
|
uint64_t Now(Env* env, bool measured_by_nanosecond) {
|
|
|
|
return measured_by_nanosecond ? env->NowNanos() : env->NowMicros();
|
|
|
|
}
|
2014-04-09 21:17:14 -07:00
|
|
|
} // namespace
|
Benchmark table reader wiht nanoseconds
Summary: nanosecnods gave us better view of the performance, especially when some operations are fast so that micro seconds may only reveal less informative results.
Test Plan:
sample output:
./table_reader_bench --plain_table --time_unit=nanosecond
=======================================================================================================
InMemoryTableSimpleBenchmark: PlainTable num_key1: 4096 num_key2: 512 non_empty
=======================================================================================================
Histogram (unit: nanosecond):
Count: 6291456 Average: 475.3867 StdDev: 556.05
Min: 135.0000 Median: 400.1817 Max: 33370.0000
Percentiles: P50: 400.18 P75: 530.02 P99: 887.73 P99.9: 8843.26 P99.99: 9941.21
------------------------------------------------------
[ 120, 140 ) 2 0.000% 0.000%
[ 140, 160 ) 452 0.007% 0.007%
[ 160, 180 ) 13683 0.217% 0.225%
[ 180, 200 ) 54353 0.864% 1.089%
[ 200, 250 ) 101004 1.605% 2.694%
[ 250, 300 ) 729791 11.600% 14.294% ##
[ 300, 350 ) 616070 9.792% 24.086% ##
[ 350, 400 ) 1628021 25.877% 49.963% #####
[ 400, 450 ) 647220 10.287% 60.250% ##
[ 450, 500 ) 577206 9.174% 69.424% ##
[ 500, 600 ) 1168585 18.574% 87.999% ####
[ 600, 700 ) 506875 8.057% 96.055% ##
[ 700, 800 ) 147878 2.350% 98.406%
[ 800, 900 ) 42633 0.678% 99.083%
[ 900, 1000 ) 16304 0.259% 99.342%
[ 1000, 1200 ) 7811 0.124% 99.466%
[ 1200, 1400 ) 1453 0.023% 99.490%
[ 1400, 1600 ) 307 0.005% 99.494%
[ 1600, 1800 ) 81 0.001% 99.496%
[ 1800, 2000 ) 18 0.000% 99.496%
[ 2000, 2500 ) 8 0.000% 99.496%
[ 2500, 3000 ) 6 0.000% 99.496%
[ 3500, 4000 ) 3 0.000% 99.496%
[ 4000, 4500 ) 116 0.002% 99.498%
[ 4500, 5000 ) 1144 0.018% 99.516%
[ 5000, 6000 ) 1087 0.017% 99.534%
[ 6000, 7000 ) 2403 0.038% 99.572%
[ 7000, 8000 ) 9840 0.156% 99.728%
[ 8000, 9000 ) 12820 0.204% 99.932%
[ 9000, 10000 ) 3881 0.062% 99.994%
[ 10000, 12000 ) 135 0.002% 99.996%
[ 12000, 14000 ) 159 0.003% 99.998%
[ 14000, 16000 ) 58 0.001% 99.999%
[ 16000, 18000 ) 30 0.000% 100.000%
[ 18000, 20000 ) 14 0.000% 100.000%
[ 20000, 25000 ) 2 0.000% 100.000%
[ 25000, 30000 ) 2 0.000% 100.000%
[ 30000, 35000 ) 1 0.000% 100.000%
Reviewers: haobo, dhruba, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16113
2014-02-13 13:55:04 -08:00
|
|
|
|
2013-10-31 13:38:54 -07:00
|
|
|
// A very simple benchmark that.
|
|
|
|
// Create a table with roughly numKey1 * numKey2 keys,
|
|
|
|
// where there are numKey1 prefixes of the key, each has numKey2 number of
|
|
|
|
// distinguished key, differing in the suffix part.
|
|
|
|
// If if_query_empty_keys = false, query the existing keys numKey1 * numKey2
|
|
|
|
// times randomly.
|
|
|
|
// If if_query_empty_keys = true, query numKey1 * numKey2 random empty keys.
|
|
|
|
// Print out the total time.
|
2013-11-15 22:23:12 -08:00
|
|
|
// If through_db=true, a full DB will be created and queries will be against
|
|
|
|
// it. Otherwise, operations will be directly through table level.
|
2013-10-31 13:38:54 -07:00
|
|
|
//
|
|
|
|
// If for_terator=true, instead of just query one key each time, it queries
|
|
|
|
// a range sharing the same prefix.
|
2014-04-09 21:17:14 -07:00
|
|
|
namespace {
|
2013-10-31 13:38:54 -07:00
|
|
|
void TableReaderBenchmark(Options& opts, EnvOptions& env_options,
|
2013-11-15 22:23:12 -08:00
|
|
|
ReadOptions& read_options, int num_keys1,
|
2018-04-12 17:55:14 -07:00
|
|
|
int num_keys2, int num_iter, int /*prefix_len*/,
|
2013-11-15 22:23:12 -08:00
|
|
|
bool if_query_empty_keys, bool for_iterator,
|
Benchmark table reader wiht nanoseconds
Summary: nanosecnods gave us better view of the performance, especially when some operations are fast so that micro seconds may only reveal less informative results.
Test Plan:
sample output:
./table_reader_bench --plain_table --time_unit=nanosecond
=======================================================================================================
InMemoryTableSimpleBenchmark: PlainTable num_key1: 4096 num_key2: 512 non_empty
=======================================================================================================
Histogram (unit: nanosecond):
Count: 6291456 Average: 475.3867 StdDev: 556.05
Min: 135.0000 Median: 400.1817 Max: 33370.0000
Percentiles: P50: 400.18 P75: 530.02 P99: 887.73 P99.9: 8843.26 P99.99: 9941.21
------------------------------------------------------
[ 120, 140 ) 2 0.000% 0.000%
[ 140, 160 ) 452 0.007% 0.007%
[ 160, 180 ) 13683 0.217% 0.225%
[ 180, 200 ) 54353 0.864% 1.089%
[ 200, 250 ) 101004 1.605% 2.694%
[ 250, 300 ) 729791 11.600% 14.294% ##
[ 300, 350 ) 616070 9.792% 24.086% ##
[ 350, 400 ) 1628021 25.877% 49.963% #####
[ 400, 450 ) 647220 10.287% 60.250% ##
[ 450, 500 ) 577206 9.174% 69.424% ##
[ 500, 600 ) 1168585 18.574% 87.999% ####
[ 600, 700 ) 506875 8.057% 96.055% ##
[ 700, 800 ) 147878 2.350% 98.406%
[ 800, 900 ) 42633 0.678% 99.083%
[ 900, 1000 ) 16304 0.259% 99.342%
[ 1000, 1200 ) 7811 0.124% 99.466%
[ 1200, 1400 ) 1453 0.023% 99.490%
[ 1400, 1600 ) 307 0.005% 99.494%
[ 1600, 1800 ) 81 0.001% 99.496%
[ 1800, 2000 ) 18 0.000% 99.496%
[ 2000, 2500 ) 8 0.000% 99.496%
[ 2500, 3000 ) 6 0.000% 99.496%
[ 3500, 4000 ) 3 0.000% 99.496%
[ 4000, 4500 ) 116 0.002% 99.498%
[ 4500, 5000 ) 1144 0.018% 99.516%
[ 5000, 6000 ) 1087 0.017% 99.534%
[ 6000, 7000 ) 2403 0.038% 99.572%
[ 7000, 8000 ) 9840 0.156% 99.728%
[ 8000, 9000 ) 12820 0.204% 99.932%
[ 9000, 10000 ) 3881 0.062% 99.994%
[ 10000, 12000 ) 135 0.002% 99.996%
[ 12000, 14000 ) 159 0.003% 99.998%
[ 14000, 16000 ) 58 0.001% 99.999%
[ 16000, 18000 ) 30 0.000% 100.000%
[ 18000, 20000 ) 14 0.000% 100.000%
[ 20000, 25000 ) 2 0.000% 100.000%
[ 25000, 30000 ) 2 0.000% 100.000%
[ 30000, 35000 ) 1 0.000% 100.000%
Reviewers: haobo, dhruba, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16113
2014-02-13 13:55:04 -08:00
|
|
|
bool through_db, bool measured_by_nanosecond) {
|
2020-02-20 12:07:53 -08:00
|
|
|
ROCKSDB_NAMESPACE::InternalKeyComparator ikc(opts.comparator);
|
2014-02-12 18:09:24 -08:00
|
|
|
|
2018-07-13 17:18:39 -07:00
|
|
|
std::string file_name =
|
|
|
|
test::PerThreadDBPath("rocksdb_table_reader_benchmark");
|
|
|
|
std::string dbname = test::PerThreadDBPath("rocksdb_table_reader_bench_db");
|
2013-11-15 22:23:12 -08:00
|
|
|
WriteOptions wo;
|
2013-10-31 13:38:54 -07:00
|
|
|
Env* env = Env::Default();
|
2013-11-15 22:23:12 -08:00
|
|
|
TableBuilder* tb = nullptr;
|
|
|
|
DB* db = nullptr;
|
|
|
|
Status s;
|
2014-09-04 16:18:36 -07:00
|
|
|
const ImmutableCFOptions ioptions(opts);
|
2018-05-21 14:33:55 -07:00
|
|
|
const ColumnFamilyOptions cfo(opts);
|
|
|
|
const MutableCFOptions moptions(cfo);
|
2018-11-09 11:17:34 -08:00
|
|
|
std::unique_ptr<WritableFileWriter> file_writer;
|
2013-11-15 22:23:12 -08:00
|
|
|
if (!through_db) {
|
2018-11-09 11:17:34 -08:00
|
|
|
std::unique_ptr<WritableFile> file;
|
2013-11-15 22:23:12 -08:00
|
|
|
env->NewWritableFile(file_name, &file, env_options);
|
A new call back to TablePropertiesCollector to allow users know the entry is add, delete or merge
Summary:
Currently users have no idea a key is add, delete or merge from TablePropertiesCollector call back. Add a new function to add it.
Also refactor the codes so that
(1) make table property collector and internal table property collector two separate data structures with the later one now exposed
(2) table builders only receive internal table properties
Test Plan: Add cases in table_properties_collector_test to cover both of old and new ways of using TablePropertiesCollector.
Reviewers: yhchiang, igor.sugak, rven, igor
Reviewed By: rven, igor
Subscribers: meyering, yoshinorim, maykov, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D35373
2015-04-06 10:04:30 -07:00
|
|
|
|
|
|
|
std::vector<std::unique_ptr<IntTblPropCollectorFactory> >
|
|
|
|
int_tbl_prop_collector_factories;
|
|
|
|
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
2019-12-13 14:47:08 -08:00
|
|
|
file_writer.reset(new WritableFileWriter(
|
|
|
|
NewLegacyWritableFileWrapper(std::move(file)), file_name, env_options));
|
2016-09-18 13:30:43 +08:00
|
|
|
int unknown_level = -1;
|
A new call back to TablePropertiesCollector to allow users know the entry is add, delete or merge
Summary:
Currently users have no idea a key is add, delete or merge from TablePropertiesCollector call back. Add a new function to add it.
Also refactor the codes so that
(1) make table property collector and internal table property collector two separate data structures with the later one now exposed
(2) table builders only receive internal table properties
Test Plan: Add cases in table_properties_collector_test to cover both of old and new ways of using TablePropertiesCollector.
Reviewers: yhchiang, igor.sugak, rven, igor
Reviewed By: rven, igor
Subscribers: meyering, yoshinorim, maykov, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D35373
2015-04-06 10:04:30 -07:00
|
|
|
tb = opts.table_factory->NewTableBuilder(
|
2018-05-21 14:33:55 -07:00
|
|
|
TableBuilderOptions(
|
|
|
|
ioptions, moptions, ikc, &int_tbl_prop_collector_factories,
|
2019-03-18 12:07:35 -07:00
|
|
|
CompressionType::kNoCompression, 0 /* sample_for_compression */,
|
|
|
|
CompressionOptions(), false /* skip_filters */,
|
|
|
|
kDefaultColumnFamilyName, unknown_level),
|
2016-04-06 23:10:32 -07:00
|
|
|
0 /* column_family_id */, file_writer.get());
|
2013-11-15 22:23:12 -08:00
|
|
|
} else {
|
|
|
|
s = DB::Open(opts, dbname, &db);
|
|
|
|
ASSERT_OK(s);
|
|
|
|
ASSERT_TRUE(db != nullptr);
|
|
|
|
}
|
2013-10-31 13:38:54 -07:00
|
|
|
// Populate slightly more than 1M keys
|
|
|
|
for (int i = 0; i < num_keys1; i++) {
|
|
|
|
for (int j = 0; j < num_keys2; j++) {
|
2013-11-15 22:23:12 -08:00
|
|
|
std::string key = MakeKey(i * 2, j, through_db);
|
|
|
|
if (!through_db) {
|
|
|
|
tb->Add(key, key);
|
|
|
|
} else {
|
|
|
|
db->Put(wo, key, key);
|
|
|
|
}
|
2013-10-31 13:38:54 -07:00
|
|
|
}
|
|
|
|
}
|
2013-11-15 22:23:12 -08:00
|
|
|
if (!through_db) {
|
|
|
|
tb->Finish();
|
2015-09-16 16:57:43 -07:00
|
|
|
file_writer->Close();
|
2013-11-15 22:23:12 -08:00
|
|
|
} else {
|
|
|
|
db->Flush(FlushOptions());
|
|
|
|
}
|
2013-10-31 13:38:54 -07:00
|
|
|
|
2018-11-09 11:17:34 -08:00
|
|
|
std::unique_ptr<TableReader> table_reader;
|
2013-11-15 22:23:12 -08:00
|
|
|
if (!through_db) {
|
2018-11-09 11:17:34 -08:00
|
|
|
std::unique_ptr<RandomAccessFile> raf;
|
2014-11-07 15:04:30 -08:00
|
|
|
s = env->NewRandomAccessFile(file_name, &raf, env_options);
|
2015-09-16 16:57:43 -07:00
|
|
|
if (!s.ok()) {
|
|
|
|
fprintf(stderr, "Create File Error: %s\n", s.ToString().c_str());
|
|
|
|
exit(1);
|
|
|
|
}
|
2013-11-15 22:23:12 -08:00
|
|
|
uint64_t file_size;
|
|
|
|
env->GetFileSize(file_name, &file_size);
|
2018-11-09 11:17:34 -08:00
|
|
|
std::unique_ptr<RandomAccessFileReader> file_reader(
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
2019-12-13 14:47:08 -08:00
|
|
|
new RandomAccessFileReader(NewLegacyRandomAccessFileWrapper(raf),
|
|
|
|
file_name));
|
2015-09-11 11:36:33 -07:00
|
|
|
s = opts.table_factory->NewTableReader(
|
2018-05-21 14:33:55 -07:00
|
|
|
TableReaderOptions(ioptions, moptions.prefix_extractor.get(),
|
|
|
|
env_options, ikc),
|
|
|
|
std::move(file_reader), file_size, &table_reader);
|
2015-09-16 16:57:43 -07:00
|
|
|
if (!s.ok()) {
|
|
|
|
fprintf(stderr, "Open Table Error: %s\n", s.ToString().c_str());
|
|
|
|
exit(1);
|
|
|
|
}
|
2013-11-15 22:23:12 -08:00
|
|
|
}
|
2013-10-31 13:38:54 -07:00
|
|
|
|
|
|
|
Random rnd(301);
|
2013-11-15 22:23:12 -08:00
|
|
|
std::string result;
|
2013-10-31 13:38:54 -07:00
|
|
|
HistogramImpl hist;
|
|
|
|
|
|
|
|
for (int it = 0; it < num_iter; it++) {
|
|
|
|
for (int i = 0; i < num_keys1; i++) {
|
|
|
|
for (int j = 0; j < num_keys2; j++) {
|
|
|
|
int r1 = rnd.Uniform(num_keys1) * 2;
|
|
|
|
int r2 = rnd.Uniform(num_keys2);
|
2013-11-15 22:23:12 -08:00
|
|
|
if (if_query_empty_keys) {
|
|
|
|
r1++;
|
|
|
|
r2 = num_keys2 * 2 - r2;
|
|
|
|
}
|
|
|
|
|
2013-10-31 13:38:54 -07:00
|
|
|
if (!for_iterator) {
|
|
|
|
// Query one existing key;
|
2013-11-15 22:23:12 -08:00
|
|
|
std::string key = MakeKey(r1, r2, through_db);
|
2014-02-13 15:27:59 -08:00
|
|
|
uint64_t start_time = Now(env, measured_by_nanosecond);
|
2013-11-15 22:23:12 -08:00
|
|
|
if (!through_db) {
|
2017-03-13 11:44:50 -07:00
|
|
|
PinnableSlice value;
|
2014-09-29 11:09:09 -07:00
|
|
|
MergeContext merge_context;
|
Use only "local" range tombstones during Get (#4449)
Summary:
Previously, range tombstones were accumulated from every level, which
was necessary if a range tombstone in a higher level covered a key in a lower
level. However, RangeDelAggregator::AddTombstones's complexity is based on
the number of tombstones that are currently stored in it, which is wasteful in
the Get case, where we only need to know the highest sequence number of range
tombstones that cover the key from higher levels, and compute the highest covering
sequence number at the current level. This change introduces this optimization, and
removes the use of RangeDelAggregator from the Get path.
In the benchmark results, the following command was used to initialize the database:
```
./db_bench -db=/dev/shm/5k-rts -use_existing_db=false -benchmarks=filluniquerandom -write_buffer_size=1048576 -compression_type=lz4 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -value_size=112 -key_size=16 -block_size=4096 -level_compaction_dynamic_level_bytes=true -num=5000000 -max_background_jobs=12 -benchmark_write_rate_limit=20971520 -range_tombstone_width=100 -writes_per_range_tombstone=100 -max_num_range_tombstones=50000 -bloom_bits=8
```
...and the following command was used to measure read throughput:
```
./db_bench -db=/dev/shm/5k-rts/ -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=5000000 -reads=100000 -threads=32
```
The filluniquerandom command was only run once, and the resulting database was used
to measure read performance before and after the PR. Both binaries were compiled with
`DEBUG_LEVEL=0`.
Readrandom results before PR:
```
readrandom : 4.544 micros/op 220090 ops/sec; 16.9 MB/s (63103 of 100000 found)
```
Readrandom results after PR:
```
readrandom : 11.147 micros/op 89707 ops/sec; 6.9 MB/s (63103 of 100000 found)
```
So it's actually slower right now, but this PR paves the way for future optimizations (see #4493).
----
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4449
Differential Revision: D10370575
Pulled By: abhimadan
fbshipit-source-id: 9a2e152be1ef36969055c0e9eb4beb0d96c11f4d
2018-10-24 12:29:29 -07:00
|
|
|
SequenceNumber max_covering_tombstone_seq = 0;
|
2016-11-03 18:40:23 -07:00
|
|
|
GetContext get_context(ioptions.user_comparator,
|
|
|
|
ioptions.merge_operator, ioptions.info_log,
|
|
|
|
ioptions.statistics, GetContext::kNotFound,
|
|
|
|
Slice(key), &value, nullptr, &merge_context,
|
New API to get all merge operands for a Key (#5604)
Summary:
This is a new API added to db.h to allow for fetching all merge operands associated with a Key. The main motivation for this API is to support use cases where doing a full online merge is not necessary as it is performance sensitive. Example use-cases:
1. Update subset of columns and read subset of columns -
Imagine a SQL Table, a row is encoded as a K/V pair (as it is done in MyRocks). If there are many columns and users only updated one of them, we can use merge operator to reduce write amplification. While users only read one or two columns in the read query, this feature can avoid a full merging of the whole row, and save some CPU.
2. Updating very few attributes in a value which is a JSON-like document -
Updating one attribute can be done efficiently using merge operator, while reading back one attribute can be done more efficiently if we don't need to do a full merge.
----------------------------------------------------------------------------------------------------
API :
Status GetMergeOperands(
const ReadOptions& options, ColumnFamilyHandle* column_family,
const Slice& key, PinnableSlice* merge_operands,
GetMergeOperandsOptions* get_merge_operands_options,
int* number_of_operands)
Example usage :
int size = 100;
int number_of_operands = 0;
std::vector<PinnableSlice> values(size);
GetMergeOperandsOptions merge_operands_info;
db_->GetMergeOperands(ReadOptions(), db_->DefaultColumnFamily(), "k1", values.data(), merge_operands_info, &number_of_operands);
Description :
Returns all the merge operands corresponding to the key. If the number of merge operands in DB is greater than merge_operands_options.expected_max_number_of_operands no merge operands are returned and status is Incomplete. Merge operands returned are in the order of insertion.
merge_operands-> Points to an array of at-least merge_operands_options.expected_max_number_of_operands and the caller is responsible for allocating it. If the status returned is Incomplete then number_of_operands will contain the total number of merge operands found in DB for key.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5604
Test Plan:
Added unit test and perf test in db_bench that can be run using the command:
./db_bench -benchmarks=getmergeoperands --merge_operator=sortlist
Differential Revision: D16657366
Pulled By: vjnadimpalli
fbshipit-source-id: 0faadd752351745224ee12d4ae9ef3cb529951bf
2019-08-06 14:22:34 -07:00
|
|
|
true, &max_covering_tombstone_seq, env);
|
2018-05-21 14:33:55 -07:00
|
|
|
s = table_reader->Get(read_options, key, &get_context, nullptr);
|
2013-11-15 22:23:12 -08:00
|
|
|
} else {
|
2014-02-12 18:09:24 -08:00
|
|
|
s = db->Get(read_options, key, &result);
|
2013-11-15 22:23:12 -08:00
|
|
|
}
|
2014-02-13 15:27:59 -08:00
|
|
|
hist.Add(Now(env, measured_by_nanosecond) - start_time);
|
2013-10-31 13:38:54 -07:00
|
|
|
} else {
|
2013-11-15 22:23:12 -08:00
|
|
|
int r2_len;
|
|
|
|
if (if_query_empty_keys) {
|
|
|
|
r2_len = 0;
|
|
|
|
} else {
|
|
|
|
r2_len = rnd.Uniform(num_keys2) + 1;
|
|
|
|
if (r2_len + r2 > num_keys2) {
|
|
|
|
r2_len = num_keys2 - r2;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
std::string start_key = MakeKey(r1, r2, through_db);
|
|
|
|
std::string end_key = MakeKey(r1, r2 + r2_len, through_db);
|
2013-10-31 15:26:06 -07:00
|
|
|
uint64_t total_time = 0;
|
2014-02-13 15:27:59 -08:00
|
|
|
uint64_t start_time = Now(env, measured_by_nanosecond);
|
2015-10-12 15:06:38 -07:00
|
|
|
Iterator* iter = nullptr;
|
|
|
|
InternalIterator* iiter = nullptr;
|
2013-11-15 22:23:12 -08:00
|
|
|
if (!through_db) {
|
2019-06-20 14:28:22 -07:00
|
|
|
iiter = table_reader->NewIterator(
|
|
|
|
read_options, /*prefix_extractor=*/nullptr, /*arena=*/nullptr,
|
|
|
|
/*skip_filters=*/false, TableReaderCaller::kUncategorized);
|
2013-11-15 22:23:12 -08:00
|
|
|
} else {
|
|
|
|
iter = db->NewIterator(read_options);
|
|
|
|
}
|
2013-10-31 13:38:54 -07:00
|
|
|
int count = 0;
|
2015-10-12 15:06:38 -07:00
|
|
|
for (through_db ? iter->Seek(start_key) : iiter->Seek(start_key);
|
|
|
|
through_db ? iter->Valid() : iiter->Valid();
|
|
|
|
through_db ? iter->Next() : iiter->Next()) {
|
2013-11-15 22:23:12 -08:00
|
|
|
if (if_query_empty_keys) {
|
|
|
|
break;
|
|
|
|
}
|
2013-10-31 13:38:54 -07:00
|
|
|
// verify key;
|
2014-02-13 15:27:59 -08:00
|
|
|
total_time += Now(env, measured_by_nanosecond) - start_time;
|
plain table reader: non-mmap mode to keep two recent buffers
Summary: In plain table reader's non-mmap mode, we only keep the most recent read buffer. However, for binary search, it is likely we come back to a location to read. To avoid one pread in such a case, we keep two read buffers. It should cover most of the cases.
Test Plan:
1. run tests
2. check the optimization works through strace when running
./table_reader_bench -mmap_read=false --num_keys2=1 -num_keys1=5000 -table_factory=plain_table --iterator --through_db
Reviewers: anthony, rven, kradhakrishnan, igor, yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D51171
2015-12-23 17:30:10 -08:00
|
|
|
assert(Slice(MakeKey(r1, r2 + count, through_db)) ==
|
|
|
|
(through_db ? iter->key() : iiter->key()));
|
2014-02-13 15:27:59 -08:00
|
|
|
start_time = Now(env, measured_by_nanosecond);
|
2013-10-31 13:38:54 -07:00
|
|
|
if (++count >= r2_len) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (count != r2_len) {
|
|
|
|
fprintf(
|
|
|
|
stderr, "Iterator cannot iterate expected number of entries. "
|
|
|
|
"Expected %d but got %d\n", r2_len, count);
|
|
|
|
assert(false);
|
|
|
|
}
|
|
|
|
delete iter;
|
2014-02-13 15:27:59 -08:00
|
|
|
total_time += Now(env, measured_by_nanosecond) - start_time;
|
2013-10-31 15:26:06 -07:00
|
|
|
hist.Add(total_time);
|
2013-10-31 13:38:54 -07:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
fprintf(
|
|
|
|
stderr,
|
|
|
|
"==================================================="
|
|
|
|
"====================================================\n"
|
|
|
|
"InMemoryTableSimpleBenchmark: %20s num_key1: %5d "
|
|
|
|
"num_key2: %5d %10s\n"
|
|
|
|
"==================================================="
|
|
|
|
"===================================================="
|
Benchmark table reader wiht nanoseconds
Summary: nanosecnods gave us better view of the performance, especially when some operations are fast so that micro seconds may only reveal less informative results.
Test Plan:
sample output:
./table_reader_bench --plain_table --time_unit=nanosecond
=======================================================================================================
InMemoryTableSimpleBenchmark: PlainTable num_key1: 4096 num_key2: 512 non_empty
=======================================================================================================
Histogram (unit: nanosecond):
Count: 6291456 Average: 475.3867 StdDev: 556.05
Min: 135.0000 Median: 400.1817 Max: 33370.0000
Percentiles: P50: 400.18 P75: 530.02 P99: 887.73 P99.9: 8843.26 P99.99: 9941.21
------------------------------------------------------
[ 120, 140 ) 2 0.000% 0.000%
[ 140, 160 ) 452 0.007% 0.007%
[ 160, 180 ) 13683 0.217% 0.225%
[ 180, 200 ) 54353 0.864% 1.089%
[ 200, 250 ) 101004 1.605% 2.694%
[ 250, 300 ) 729791 11.600% 14.294% ##
[ 300, 350 ) 616070 9.792% 24.086% ##
[ 350, 400 ) 1628021 25.877% 49.963% #####
[ 400, 450 ) 647220 10.287% 60.250% ##
[ 450, 500 ) 577206 9.174% 69.424% ##
[ 500, 600 ) 1168585 18.574% 87.999% ####
[ 600, 700 ) 506875 8.057% 96.055% ##
[ 700, 800 ) 147878 2.350% 98.406%
[ 800, 900 ) 42633 0.678% 99.083%
[ 900, 1000 ) 16304 0.259% 99.342%
[ 1000, 1200 ) 7811 0.124% 99.466%
[ 1200, 1400 ) 1453 0.023% 99.490%
[ 1400, 1600 ) 307 0.005% 99.494%
[ 1600, 1800 ) 81 0.001% 99.496%
[ 1800, 2000 ) 18 0.000% 99.496%
[ 2000, 2500 ) 8 0.000% 99.496%
[ 2500, 3000 ) 6 0.000% 99.496%
[ 3500, 4000 ) 3 0.000% 99.496%
[ 4000, 4500 ) 116 0.002% 99.498%
[ 4500, 5000 ) 1144 0.018% 99.516%
[ 5000, 6000 ) 1087 0.017% 99.534%
[ 6000, 7000 ) 2403 0.038% 99.572%
[ 7000, 8000 ) 9840 0.156% 99.728%
[ 8000, 9000 ) 12820 0.204% 99.932%
[ 9000, 10000 ) 3881 0.062% 99.994%
[ 10000, 12000 ) 135 0.002% 99.996%
[ 12000, 14000 ) 159 0.003% 99.998%
[ 14000, 16000 ) 58 0.001% 99.999%
[ 16000, 18000 ) 30 0.000% 100.000%
[ 18000, 20000 ) 14 0.000% 100.000%
[ 20000, 25000 ) 2 0.000% 100.000%
[ 25000, 30000 ) 2 0.000% 100.000%
[ 30000, 35000 ) 1 0.000% 100.000%
Reviewers: haobo, dhruba, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16113
2014-02-13 13:55:04 -08:00
|
|
|
"\nHistogram (unit: %s): \n%s",
|
2013-11-15 22:23:12 -08:00
|
|
|
opts.table_factory->Name(), num_keys1, num_keys2,
|
Benchmark table reader wiht nanoseconds
Summary: nanosecnods gave us better view of the performance, especially when some operations are fast so that micro seconds may only reveal less informative results.
Test Plan:
sample output:
./table_reader_bench --plain_table --time_unit=nanosecond
=======================================================================================================
InMemoryTableSimpleBenchmark: PlainTable num_key1: 4096 num_key2: 512 non_empty
=======================================================================================================
Histogram (unit: nanosecond):
Count: 6291456 Average: 475.3867 StdDev: 556.05
Min: 135.0000 Median: 400.1817 Max: 33370.0000
Percentiles: P50: 400.18 P75: 530.02 P99: 887.73 P99.9: 8843.26 P99.99: 9941.21
------------------------------------------------------
[ 120, 140 ) 2 0.000% 0.000%
[ 140, 160 ) 452 0.007% 0.007%
[ 160, 180 ) 13683 0.217% 0.225%
[ 180, 200 ) 54353 0.864% 1.089%
[ 200, 250 ) 101004 1.605% 2.694%
[ 250, 300 ) 729791 11.600% 14.294% ##
[ 300, 350 ) 616070 9.792% 24.086% ##
[ 350, 400 ) 1628021 25.877% 49.963% #####
[ 400, 450 ) 647220 10.287% 60.250% ##
[ 450, 500 ) 577206 9.174% 69.424% ##
[ 500, 600 ) 1168585 18.574% 87.999% ####
[ 600, 700 ) 506875 8.057% 96.055% ##
[ 700, 800 ) 147878 2.350% 98.406%
[ 800, 900 ) 42633 0.678% 99.083%
[ 900, 1000 ) 16304 0.259% 99.342%
[ 1000, 1200 ) 7811 0.124% 99.466%
[ 1200, 1400 ) 1453 0.023% 99.490%
[ 1400, 1600 ) 307 0.005% 99.494%
[ 1600, 1800 ) 81 0.001% 99.496%
[ 1800, 2000 ) 18 0.000% 99.496%
[ 2000, 2500 ) 8 0.000% 99.496%
[ 2500, 3000 ) 6 0.000% 99.496%
[ 3500, 4000 ) 3 0.000% 99.496%
[ 4000, 4500 ) 116 0.002% 99.498%
[ 4500, 5000 ) 1144 0.018% 99.516%
[ 5000, 6000 ) 1087 0.017% 99.534%
[ 6000, 7000 ) 2403 0.038% 99.572%
[ 7000, 8000 ) 9840 0.156% 99.728%
[ 8000, 9000 ) 12820 0.204% 99.932%
[ 9000, 10000 ) 3881 0.062% 99.994%
[ 10000, 12000 ) 135 0.002% 99.996%
[ 12000, 14000 ) 159 0.003% 99.998%
[ 14000, 16000 ) 58 0.001% 99.999%
[ 16000, 18000 ) 30 0.000% 100.000%
[ 18000, 20000 ) 14 0.000% 100.000%
[ 20000, 25000 ) 2 0.000% 100.000%
[ 25000, 30000 ) 2 0.000% 100.000%
[ 30000, 35000 ) 1 0.000% 100.000%
Reviewers: haobo, dhruba, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16113
2014-02-13 13:55:04 -08:00
|
|
|
for_iterator ? "iterator" : (if_query_empty_keys ? "empty" : "non_empty"),
|
|
|
|
measured_by_nanosecond ? "nanosecond" : "microsecond",
|
2013-10-31 13:38:54 -07:00
|
|
|
hist.ToString().c_str());
|
2013-11-15 22:23:12 -08:00
|
|
|
if (!through_db) {
|
|
|
|
env->DeleteFile(file_name);
|
|
|
|
} else {
|
|
|
|
delete db;
|
|
|
|
db = nullptr;
|
|
|
|
DestroyDB(dbname, opts);
|
|
|
|
}
|
2013-10-31 13:38:54 -07:00
|
|
|
}
|
2014-04-09 21:17:14 -07:00
|
|
|
} // namespace
|
2020-02-20 12:07:53 -08:00
|
|
|
} // namespace ROCKSDB_NAMESPACE
|
2013-10-31 13:38:54 -07:00
|
|
|
|
|
|
|
DEFINE_bool(query_empty, false, "query non-existing keys instead of existing "
|
|
|
|
"ones.");
|
|
|
|
DEFINE_int32(num_keys1, 4096, "number of distinguish prefix of keys");
|
|
|
|
DEFINE_int32(num_keys2, 512, "number of distinguish keys for each prefix");
|
|
|
|
DEFINE_int32(iter, 3, "query non-existing keys instead of existing ones");
|
2013-11-15 22:23:12 -08:00
|
|
|
DEFINE_int32(prefix_len, 16, "Prefix length used for iterators and indexes");
|
2013-10-31 13:38:54 -07:00
|
|
|
DEFINE_bool(iterator, false, "For test iterator");
|
2013-11-15 22:23:12 -08:00
|
|
|
DEFINE_bool(through_db, false, "If enable, a DB instance will be created and "
|
|
|
|
"the query will be against DB. Otherwise, will be directly against "
|
|
|
|
"a table reader.");
|
2015-11-17 18:29:40 -08:00
|
|
|
DEFINE_bool(mmap_read, true, "Whether use mmap read");
|
2014-08-19 12:50:13 -07:00
|
|
|
DEFINE_string(table_factory, "block_based",
|
|
|
|
"Table factory to use: `block_based` (default), `plain_table` or "
|
|
|
|
"`cuckoo_hash`.");
|
Benchmark table reader wiht nanoseconds
Summary: nanosecnods gave us better view of the performance, especially when some operations are fast so that micro seconds may only reveal less informative results.
Test Plan:
sample output:
./table_reader_bench --plain_table --time_unit=nanosecond
=======================================================================================================
InMemoryTableSimpleBenchmark: PlainTable num_key1: 4096 num_key2: 512 non_empty
=======================================================================================================
Histogram (unit: nanosecond):
Count: 6291456 Average: 475.3867 StdDev: 556.05
Min: 135.0000 Median: 400.1817 Max: 33370.0000
Percentiles: P50: 400.18 P75: 530.02 P99: 887.73 P99.9: 8843.26 P99.99: 9941.21
------------------------------------------------------
[ 120, 140 ) 2 0.000% 0.000%
[ 140, 160 ) 452 0.007% 0.007%
[ 160, 180 ) 13683 0.217% 0.225%
[ 180, 200 ) 54353 0.864% 1.089%
[ 200, 250 ) 101004 1.605% 2.694%
[ 250, 300 ) 729791 11.600% 14.294% ##
[ 300, 350 ) 616070 9.792% 24.086% ##
[ 350, 400 ) 1628021 25.877% 49.963% #####
[ 400, 450 ) 647220 10.287% 60.250% ##
[ 450, 500 ) 577206 9.174% 69.424% ##
[ 500, 600 ) 1168585 18.574% 87.999% ####
[ 600, 700 ) 506875 8.057% 96.055% ##
[ 700, 800 ) 147878 2.350% 98.406%
[ 800, 900 ) 42633 0.678% 99.083%
[ 900, 1000 ) 16304 0.259% 99.342%
[ 1000, 1200 ) 7811 0.124% 99.466%
[ 1200, 1400 ) 1453 0.023% 99.490%
[ 1400, 1600 ) 307 0.005% 99.494%
[ 1600, 1800 ) 81 0.001% 99.496%
[ 1800, 2000 ) 18 0.000% 99.496%
[ 2000, 2500 ) 8 0.000% 99.496%
[ 2500, 3000 ) 6 0.000% 99.496%
[ 3500, 4000 ) 3 0.000% 99.496%
[ 4000, 4500 ) 116 0.002% 99.498%
[ 4500, 5000 ) 1144 0.018% 99.516%
[ 5000, 6000 ) 1087 0.017% 99.534%
[ 6000, 7000 ) 2403 0.038% 99.572%
[ 7000, 8000 ) 9840 0.156% 99.728%
[ 8000, 9000 ) 12820 0.204% 99.932%
[ 9000, 10000 ) 3881 0.062% 99.994%
[ 10000, 12000 ) 135 0.002% 99.996%
[ 12000, 14000 ) 159 0.003% 99.998%
[ 14000, 16000 ) 58 0.001% 99.999%
[ 16000, 18000 ) 30 0.000% 100.000%
[ 18000, 20000 ) 14 0.000% 100.000%
[ 20000, 25000 ) 2 0.000% 100.000%
[ 25000, 30000 ) 2 0.000% 100.000%
[ 30000, 35000 ) 1 0.000% 100.000%
Reviewers: haobo, dhruba, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16113
2014-02-13 13:55:04 -08:00
|
|
|
DEFINE_string(time_unit, "microsecond",
|
|
|
|
"The time unit used for measuring performance. User can specify "
|
|
|
|
"`microsecond` (default) or `nanosecond`");
|
2013-10-31 13:38:54 -07:00
|
|
|
|
|
|
|
int main(int argc, char** argv) {
|
2014-05-09 08:34:18 -07:00
|
|
|
SetUsageMessage(std::string("\nUSAGE:\n") + std::string(argv[0]) +
|
|
|
|
" [OPTIONS]...");
|
|
|
|
ParseCommandLineFlags(&argc, &argv, true);
|
2013-10-31 13:38:54 -07:00
|
|
|
|
2020-02-20 12:07:53 -08:00
|
|
|
std::shared_ptr<ROCKSDB_NAMESPACE::TableFactory> tf;
|
|
|
|
ROCKSDB_NAMESPACE::Options options;
|
2013-11-15 22:23:12 -08:00
|
|
|
if (FLAGS_prefix_len < 16) {
|
2020-02-20 12:07:53 -08:00
|
|
|
options.prefix_extractor.reset(
|
|
|
|
ROCKSDB_NAMESPACE::NewFixedPrefixTransform(FLAGS_prefix_len));
|
2013-11-15 22:23:12 -08:00
|
|
|
}
|
2020-02-20 12:07:53 -08:00
|
|
|
ROCKSDB_NAMESPACE::ReadOptions ro;
|
|
|
|
ROCKSDB_NAMESPACE::EnvOptions env_options;
|
2013-11-15 22:23:12 -08:00
|
|
|
options.create_if_missing = true;
|
2020-02-20 12:07:53 -08:00
|
|
|
options.compression = ROCKSDB_NAMESPACE::CompressionType::kNoCompression;
|
2013-10-28 20:34:02 -07:00
|
|
|
|
2014-08-19 12:50:13 -07:00
|
|
|
if (FLAGS_table_factory == "cuckoo_hash") {
|
2014-11-12 13:05:12 -08:00
|
|
|
#ifndef ROCKSDB_LITE
|
2015-11-17 18:29:40 -08:00
|
|
|
options.allow_mmap_reads = FLAGS_mmap_read;
|
|
|
|
env_options.use_mmap_reads = FLAGS_mmap_read;
|
2020-02-20 12:07:53 -08:00
|
|
|
ROCKSDB_NAMESPACE::CuckooTableOptions table_options;
|
CuckooTable: add one option to allow identity function for the first hash function
Summary:
MurmurHash becomes expensive when we do millions Get() a second in one
thread. Add this option to allow the first hash function to use identity
function as hash function. It results in QPS increase from 3.7M/s to
~4.3M/s. I did not observe improvement for end to end RocksDB
performance. This may be caused by other bottlenecks that I will address
in a separate diff.
Test Plan:
```
[ljin@dev1964 rocksdb] ./cuckoo_table_reader_test --enable_perf --file_dir=/dev/shm --write --identity_as_first_hash=0
==== Test CuckooReaderTest.WhenKeyExists
==== Test CuckooReaderTest.WhenKeyExistsWithUint64Comparator
==== Test CuckooReaderTest.CheckIterator
==== Test CuckooReaderTest.CheckIteratorUint64
==== Test CuckooReaderTest.WhenKeyNotFound
==== Test CuckooReaderTest.TestReadPerformance
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.272us (3.7 Mqps) with batch size of 0, # of found keys 125829120
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.138us (7.2 Mqps) with batch size of 10, # of found keys 125829120
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.142us (7.1 Mqps) with batch size of 25, # of found keys 125829120
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.142us (7.0 Mqps) with batch size of 50, # of found keys 125829120
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.144us (6.9 Mqps) with batch size of 100, # of found keys 125829120
With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.201us (5.0 Mqps) with batch size of 0, # of found keys 104857600
With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.121us (8.3 Mqps) with batch size of 10, # of found keys 104857600
With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.123us (8.1 Mqps) with batch size of 25, # of found keys 104857600
With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.121us (8.3 Mqps) with batch size of 50, # of found keys 104857600
With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.112us (8.9 Mqps) with batch size of 100, # of found keys 104857600
With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.251us (4.0 Mqps) with batch size of 0, # of found keys 83886080
With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.107us (9.4 Mqps) with batch size of 10, # of found keys 83886080
With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.099us (10.1 Mqps) with batch size of 25, # of found keys 83886080
With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.100us (10.0 Mqps) with batch size of 50, # of found keys 83886080
With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.116us (8.6 Mqps) with batch size of 100, # of found keys 83886080
With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.189us (5.3 Mqps) with batch size of 0, # of found keys 73400320
With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.095us (10.5 Mqps) with batch size of 10, # of found keys 73400320
With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.096us (10.4 Mqps) with batch size of 25, # of found keys 73400320
With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.098us (10.2 Mqps) with batch size of 50, # of found keys 73400320
With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.105us (9.5 Mqps) with batch size of 100, # of found keys 73400320
[ljin@dev1964 rocksdb] ./cuckoo_table_reader_test --enable_perf --file_dir=/dev/shm --write --identity_as_first_hash=1
==== Test CuckooReaderTest.WhenKeyExists
==== Test CuckooReaderTest.WhenKeyExistsWithUint64Comparator
==== Test CuckooReaderTest.CheckIterator
==== Test CuckooReaderTest.CheckIteratorUint64
==== Test CuckooReaderTest.WhenKeyNotFound
==== Test CuckooReaderTest.TestReadPerformance
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.230us (4.3 Mqps) with batch size of 0, # of found keys 125829120
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.086us (11.7 Mqps) with batch size of 10, # of found keys 125829120
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.088us (11.3 Mqps) with batch size of 25, # of found keys 125829120
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.083us (12.1 Mqps) with batch size of 50, # of found keys 125829120
With 125829120 items, utilization is 93.75%, number of hash functions: 2.
Time taken per op is 0.083us (12.1 Mqps) with batch size of 100, # of found keys 125829120
With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.159us (6.3 Mqps) with batch size of 0, # of found keys 104857600
With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.078us (12.8 Mqps) with batch size of 10, # of found keys 104857600
With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.080us (12.6 Mqps) with batch size of 25, # of found keys 104857600
With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.080us (12.5 Mqps) with batch size of 50, # of found keys 104857600
With 104857600 items, utilization is 78.12%, number of hash functions: 2.
Time taken per op is 0.082us (12.2 Mqps) with batch size of 100, # of found keys 104857600
With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.154us (6.5 Mqps) with batch size of 0, # of found keys 83886080
With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.077us (13.0 Mqps) with batch size of 10, # of found keys 83886080
With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.077us (12.9 Mqps) with batch size of 25, # of found keys 83886080
With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.078us (12.8 Mqps) with batch size of 50, # of found keys 83886080
With 83886080 items, utilization is 62.50%, number of hash functions: 2.
Time taken per op is 0.079us (12.6 Mqps) with batch size of 100, # of found keys 83886080
With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.218us (4.6 Mqps) with batch size of 0, # of found keys 73400320
With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.083us (12.0 Mqps) with batch size of 10, # of found keys 73400320
With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.085us (11.7 Mqps) with batch size of 25, # of found keys 73400320
With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.086us (11.6 Mqps) with batch size of 50, # of found keys 73400320
With 73400320 items, utilization is 54.69%, number of hash functions: 2.
Time taken per op is 0.078us (12.8 Mqps) with batch size of 100, # of found keys 73400320
```
Reviewers: sdong, igor, yhchiang
Reviewed By: igor
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D23451
2014-09-18 11:00:48 -07:00
|
|
|
table_options.hash_table_ratio = 0.75;
|
2020-02-20 12:07:53 -08:00
|
|
|
tf.reset(ROCKSDB_NAMESPACE::NewCuckooTableFactory(table_options));
|
2014-11-12 13:05:12 -08:00
|
|
|
#else
|
|
|
|
fprintf(stderr, "Plain table is not supported in lite mode\n");
|
|
|
|
exit(1);
|
|
|
|
#endif // ROCKSDB_LITE
|
2014-08-19 12:50:13 -07:00
|
|
|
} else if (FLAGS_table_factory == "plain_table") {
|
2014-11-12 13:05:12 -08:00
|
|
|
#ifndef ROCKSDB_LITE
|
2015-11-17 18:29:40 -08:00
|
|
|
options.allow_mmap_reads = FLAGS_mmap_read;
|
|
|
|
env_options.use_mmap_reads = FLAGS_mmap_read;
|
2014-07-18 00:08:38 -07:00
|
|
|
|
2020-02-20 12:07:53 -08:00
|
|
|
ROCKSDB_NAMESPACE::PlainTableOptions plain_table_options;
|
2014-07-18 00:08:38 -07:00
|
|
|
plain_table_options.user_key_len = 16;
|
|
|
|
plain_table_options.bloom_bits_per_key = (FLAGS_prefix_len == 16) ? 0 : 8;
|
|
|
|
plain_table_options.hash_table_ratio = 0.75;
|
|
|
|
|
2020-02-20 12:07:53 -08:00
|
|
|
tf.reset(new ROCKSDB_NAMESPACE::PlainTableFactory(plain_table_options));
|
|
|
|
options.prefix_extractor.reset(
|
|
|
|
ROCKSDB_NAMESPACE::NewFixedPrefixTransform(FLAGS_prefix_len));
|
2014-11-12 13:05:12 -08:00
|
|
|
#else
|
|
|
|
fprintf(stderr, "Cuckoo table is not supported in lite mode\n");
|
|
|
|
exit(1);
|
|
|
|
#endif // ROCKSDB_LITE
|
2014-08-19 12:50:13 -07:00
|
|
|
} else if (FLAGS_table_factory == "block_based") {
|
2020-02-20 12:07:53 -08:00
|
|
|
tf.reset(new ROCKSDB_NAMESPACE::BlockBasedTableFactory());
|
2014-08-19 12:50:13 -07:00
|
|
|
} else {
|
|
|
|
fprintf(stderr, "Invalid table type %s\n", FLAGS_table_factory.c_str());
|
|
|
|
}
|
|
|
|
|
|
|
|
if (tf) {
|
|
|
|
// if user provides invalid options, just fall back to microsecond.
|
|
|
|
bool measured_by_nanosecond = FLAGS_time_unit == "nanosecond";
|
|
|
|
|
|
|
|
options.table_factory = tf;
|
2020-02-20 12:07:53 -08:00
|
|
|
ROCKSDB_NAMESPACE::TableReaderBenchmark(
|
|
|
|
options, env_options, ro, FLAGS_num_keys1, FLAGS_num_keys2, FLAGS_iter,
|
|
|
|
FLAGS_prefix_len, FLAGS_query_empty, FLAGS_iterator, FLAGS_through_db,
|
|
|
|
measured_by_nanosecond);
|
2013-10-28 20:34:02 -07:00
|
|
|
} else {
|
2014-08-19 12:50:13 -07:00
|
|
|
return 1;
|
2013-10-28 20:34:02 -07:00
|
|
|
}
|
Benchmark table reader wiht nanoseconds
Summary: nanosecnods gave us better view of the performance, especially when some operations are fast so that micro seconds may only reveal less informative results.
Test Plan:
sample output:
./table_reader_bench --plain_table --time_unit=nanosecond
=======================================================================================================
InMemoryTableSimpleBenchmark: PlainTable num_key1: 4096 num_key2: 512 non_empty
=======================================================================================================
Histogram (unit: nanosecond):
Count: 6291456 Average: 475.3867 StdDev: 556.05
Min: 135.0000 Median: 400.1817 Max: 33370.0000
Percentiles: P50: 400.18 P75: 530.02 P99: 887.73 P99.9: 8843.26 P99.99: 9941.21
------------------------------------------------------
[ 120, 140 ) 2 0.000% 0.000%
[ 140, 160 ) 452 0.007% 0.007%
[ 160, 180 ) 13683 0.217% 0.225%
[ 180, 200 ) 54353 0.864% 1.089%
[ 200, 250 ) 101004 1.605% 2.694%
[ 250, 300 ) 729791 11.600% 14.294% ##
[ 300, 350 ) 616070 9.792% 24.086% ##
[ 350, 400 ) 1628021 25.877% 49.963% #####
[ 400, 450 ) 647220 10.287% 60.250% ##
[ 450, 500 ) 577206 9.174% 69.424% ##
[ 500, 600 ) 1168585 18.574% 87.999% ####
[ 600, 700 ) 506875 8.057% 96.055% ##
[ 700, 800 ) 147878 2.350% 98.406%
[ 800, 900 ) 42633 0.678% 99.083%
[ 900, 1000 ) 16304 0.259% 99.342%
[ 1000, 1200 ) 7811 0.124% 99.466%
[ 1200, 1400 ) 1453 0.023% 99.490%
[ 1400, 1600 ) 307 0.005% 99.494%
[ 1600, 1800 ) 81 0.001% 99.496%
[ 1800, 2000 ) 18 0.000% 99.496%
[ 2000, 2500 ) 8 0.000% 99.496%
[ 2500, 3000 ) 6 0.000% 99.496%
[ 3500, 4000 ) 3 0.000% 99.496%
[ 4000, 4500 ) 116 0.002% 99.498%
[ 4500, 5000 ) 1144 0.018% 99.516%
[ 5000, 6000 ) 1087 0.017% 99.534%
[ 6000, 7000 ) 2403 0.038% 99.572%
[ 7000, 8000 ) 9840 0.156% 99.728%
[ 8000, 9000 ) 12820 0.204% 99.932%
[ 9000, 10000 ) 3881 0.062% 99.994%
[ 10000, 12000 ) 135 0.002% 99.996%
[ 12000, 14000 ) 159 0.003% 99.998%
[ 14000, 16000 ) 58 0.001% 99.999%
[ 16000, 18000 ) 30 0.000% 100.000%
[ 18000, 20000 ) 14 0.000% 100.000%
[ 20000, 25000 ) 2 0.000% 100.000%
[ 25000, 30000 ) 2 0.000% 100.000%
[ 30000, 35000 ) 1 0.000% 100.000%
Reviewers: haobo, dhruba, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16113
2014-02-13 13:55:04 -08:00
|
|
|
|
2013-10-31 13:38:54 -07:00
|
|
|
return 0;
|
|
|
|
}
|
2014-05-09 08:34:18 -07:00
|
|
|
|
|
|
|
#endif // GFLAGS
|