2016-02-09 15:12:00 -08:00
|
|
|
// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
|
2013-10-16 14:59:46 -07:00
|
|
|
// This source code is licensed under the BSD-style license found in the
|
|
|
|
// LICENSE file in the root directory of this source tree. An additional grant
|
|
|
|
// of patent rights can be found in the PATENTS file in the same directory.
|
|
|
|
//
|
2011-03-18 22:37:00 +00:00
|
|
|
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
|
|
|
|
// Use of this source code is governed by a BSD-style license that can be
|
|
|
|
// found in the LICENSE file. See the AUTHORS file for names of contributors.
|
2013-10-04 22:32:05 -07:00
|
|
|
#pragma once
|
2014-01-02 11:26:57 -08:00
|
|
|
|
2013-05-24 12:52:45 -07:00
|
|
|
#include <atomic>
|
2012-03-08 16:23:21 -08:00
|
|
|
#include <deque>
|
2014-07-03 15:47:02 -07:00
|
|
|
#include <limits>
|
2014-11-07 11:50:34 -08:00
|
|
|
#include <list>
|
2015-09-02 13:58:22 -07:00
|
|
|
#include <set>
|
2014-02-26 10:03:34 -08:00
|
|
|
#include <string>
|
2015-09-02 13:58:22 -07:00
|
|
|
#include <utility>
|
|
|
|
#include <vector>
|
2013-12-30 18:33:57 -08:00
|
|
|
|
2014-01-24 14:30:28 -08:00
|
|
|
#include "db/column_family.h"
|
2015-06-02 14:12:23 -07:00
|
|
|
#include "db/compaction_job.h"
|
2015-09-02 13:58:22 -07:00
|
|
|
#include "db/dbformat.h"
|
2015-06-02 14:12:23 -07:00
|
|
|
#include "db/flush_job.h"
|
2015-09-02 13:58:22 -07:00
|
|
|
#include "db/flush_scheduler.h"
|
|
|
|
#include "db/internal_stats.h"
|
2015-08-06 17:59:05 -07:00
|
|
|
#include "db/log_writer.h"
|
|
|
|
#include "db/snapshot_impl.h"
|
2013-11-12 11:53:26 -08:00
|
|
|
#include "db/version_edit.h"
|
2014-10-29 17:43:37 -07:00
|
|
|
#include "db/wal_manager.h"
|
2015-09-02 13:58:22 -07:00
|
|
|
#include "db/write_controller.h"
|
|
|
|
#include "db/write_thread.h"
|
2014-12-02 12:09:20 -08:00
|
|
|
#include "db/writebuffer.h"
|
2014-01-02 11:26:57 -08:00
|
|
|
#include "memtable_list.h"
|
|
|
|
#include "port/port.h"
|
2013-08-23 08:38:13 -07:00
|
|
|
#include "rocksdb/db.h"
|
|
|
|
#include "rocksdb/env.h"
|
|
|
|
#include "rocksdb/memtablerep.h"
|
|
|
|
#include "rocksdb/transaction_log.h"
|
2015-10-12 15:06:38 -07:00
|
|
|
#include "table/scoped_arena_iterator.h"
|
2014-01-14 14:49:31 -08:00
|
|
|
#include "util/autovector.h"
|
EventLogger
Summary:
Here's my proposal for making our LOGs easier to read by machines.
The idea is to dump all events as JSON objects. JSON is easy to read by humans, but more importantly, it's easy to read by machines. That way, we can parse this, load into SQLite/mongo and then query or visualize.
I started with table_create and table_delete events, but if everybody agrees, I'll continue by adding more events (flush/compaction/etc etc)
Test Plan:
Ran db_bench. Observed:
2015/01/15-14:13:25.788019 1105ef000 EVENT_LOG_v1 {"time_micros": 1421360005788015, "event": "table_file_creation", "file_number": 12, "file_size": 1909699}
2015/01/15-14:13:25.956500 110740000 EVENT_LOG_v1 {"time_micros": 1421360005956498, "event": "table_file_deletion", "file_number": 12}
Reviewers: yhchiang, rven, dhruba, MarkCallaghan, lgalanis, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31647
2015-03-13 10:15:54 -07:00
|
|
|
#include "util/event_logger.h"
|
|
|
|
#include "util/hash.h"
|
2015-02-04 21:39:45 -08:00
|
|
|
#include "util/instrumented_mutex.h"
|
2015-09-02 13:58:22 -07:00
|
|
|
#include "util/stop_watch.h"
|
|
|
|
#include "util/thread_local.h"
|
2012-08-14 15:20:36 -07:00
|
|
|
|
2013-10-03 21:49:15 -07:00
|
|
|
namespace rocksdb {
|
2011-03-18 22:37:00 +00:00
|
|
|
|
|
|
|
class MemTable;
|
|
|
|
class TableCache;
|
|
|
|
class Version;
|
|
|
|
class VersionEdit;
|
|
|
|
class VersionSet;
|
In DB::NewIterator(), try to allocate the whole iterator tree in an arena
Summary:
In this patch, try to allocate the whole iterator tree starting from DBIter from an arena
1. ArenaWrappedDBIter is created when serves as the entry point of an iterator tree, with an arena in it.
2. Add an option to create iterator from arena for following iterators: DBIter, MergingIterator, MemtableIterator, all mem table's iterators, all table reader's iterators and two level iterator.
3. MergeIteratorBuilder is created to incrementally build the tree of internal iterators. It is passed to mem table list and version set and add iterators to it.
Limitations:
(1) Only DB::NewIterator() without tailing uses the arena. Other cases, including readonly DB and compactions are still from malloc
(2) Two level iterator itself is allocated in arena, but not iterators inside it.
Test Plan: make all check
Reviewers: ljin, haobo
Reviewed By: haobo
Subscribers: leveldb, dhruba, yhchiang, igor
Differential Revision: https://reviews.facebook.net/D18513
2014-06-02 16:38:00 -07:00
|
|
|
class Arena;
|
2015-05-29 14:36:35 -07:00
|
|
|
class WriteCallback;
|
2014-11-14 15:43:10 -08:00
|
|
|
struct JobContext;
|
2015-09-23 12:42:43 -07:00
|
|
|
struct ExternalSstFileInfo;
|
2011-03-18 22:37:00 +00:00
|
|
|
|
|
|
|
class DBImpl : public DB {
|
|
|
|
public:
|
2014-02-05 13:12:23 -08:00
|
|
|
DBImpl(const DBOptions& options, const std::string& dbname);
|
2011-03-18 22:37:00 +00:00
|
|
|
virtual ~DBImpl();
|
|
|
|
|
|
|
|
// Implementations of the DB interface
|
[RocksDB] [Column Family] Interface proposal
Summary:
<This diff is for Column Family branch>
Sharing some of the work I've done so far. This diff compiles and passes the tests.
The biggest change is in options.h - I broke down Options into two parts - DBOptions and ColumnFamilyOptions. DBOptions is DB-specific (env, create_if_missing, block_cache, etc.) and ColumnFamilyOptions is column family-specific (all compaction options, compresion options, etc.). Note that this does not break backwards compatibility at all.
Further, I created DBWithColumnFamily which inherits DB interface and adds new functions with column family support. Clients can transparently switch to DBWithColumnFamily and it will not break their backwards compatibility.
There are few methods worth checking out: ListColumnFamilies(), MultiNewIterator(), MultiGet() and GetSnapshot(). [GetSnapshot() returns the snapshot across all column families for now - I think that's what we agreed on]
Finally, I made small changes to WriteBatch so we are able to atomically insert data across column families.
Please provide feedback.
Test Plan: make check works, the code is backward compatible
Reviewers: dhruba, haobo, sdong, kailiu, emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14445
2013-12-03 11:14:09 -08:00
|
|
|
using DB::Put;
|
|
|
|
virtual Status Put(const WriteOptions& options,
|
2014-02-10 17:04:44 -08:00
|
|
|
ColumnFamilyHandle* column_family, const Slice& key,
|
2015-02-26 11:28:41 -08:00
|
|
|
const Slice& value) override;
|
[RocksDB] [Column Family] Interface proposal
Summary:
<This diff is for Column Family branch>
Sharing some of the work I've done so far. This diff compiles and passes the tests.
The biggest change is in options.h - I broke down Options into two parts - DBOptions and ColumnFamilyOptions. DBOptions is DB-specific (env, create_if_missing, block_cache, etc.) and ColumnFamilyOptions is column family-specific (all compaction options, compresion options, etc.). Note that this does not break backwards compatibility at all.
Further, I created DBWithColumnFamily which inherits DB interface and adds new functions with column family support. Clients can transparently switch to DBWithColumnFamily and it will not break their backwards compatibility.
There are few methods worth checking out: ListColumnFamilies(), MultiNewIterator(), MultiGet() and GetSnapshot(). [GetSnapshot() returns the snapshot across all column families for now - I think that's what we agreed on]
Finally, I made small changes to WriteBatch so we are able to atomically insert data across column families.
Please provide feedback.
Test Plan: make check works, the code is backward compatible
Reviewers: dhruba, haobo, sdong, kailiu, emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14445
2013-12-03 11:14:09 -08:00
|
|
|
using DB::Merge;
|
|
|
|
virtual Status Merge(const WriteOptions& options,
|
2014-02-10 17:04:44 -08:00
|
|
|
ColumnFamilyHandle* column_family, const Slice& key,
|
2015-02-26 11:28:41 -08:00
|
|
|
const Slice& value) override;
|
[RocksDB] [Column Family] Interface proposal
Summary:
<This diff is for Column Family branch>
Sharing some of the work I've done so far. This diff compiles and passes the tests.
The biggest change is in options.h - I broke down Options into two parts - DBOptions and ColumnFamilyOptions. DBOptions is DB-specific (env, create_if_missing, block_cache, etc.) and ColumnFamilyOptions is column family-specific (all compaction options, compresion options, etc.). Note that this does not break backwards compatibility at all.
Further, I created DBWithColumnFamily which inherits DB interface and adds new functions with column family support. Clients can transparently switch to DBWithColumnFamily and it will not break their backwards compatibility.
There are few methods worth checking out: ListColumnFamilies(), MultiNewIterator(), MultiGet() and GetSnapshot(). [GetSnapshot() returns the snapshot across all column families for now - I think that's what we agreed on]
Finally, I made small changes to WriteBatch so we are able to atomically insert data across column families.
Please provide feedback.
Test Plan: make check works, the code is backward compatible
Reviewers: dhruba, haobo, sdong, kailiu, emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14445
2013-12-03 11:14:09 -08:00
|
|
|
using DB::Delete;
|
|
|
|
virtual Status Delete(const WriteOptions& options,
|
2015-02-26 11:28:41 -08:00
|
|
|
ColumnFamilyHandle* column_family,
|
|
|
|
const Slice& key) override;
|
Support for SingleDelete()
Summary:
This patch fixes #7460559. It introduces SingleDelete as a new database
operation. This operation can be used to delete keys that were never
overwritten (no put following another put of the same key). If an overwritten
key is single deleted the behavior is undefined. Single deletion of a
non-existent key has no effect but multiple consecutive single deletions are
not allowed (see limitations).
In contrast to the conventional Delete() operation, the deletion entry is
removed along with the value when the two are lined up in a compaction. Note:
The semantics are similar to @igor's prototype that allowed to have this
behavior on the granularity of a column family (
https://reviews.facebook.net/D42093 ). This new patch, however, is more
aggressive when it comes to removing tombstones: It removes the SingleDelete
together with the value whenever there is no snapshot between them while the
older patch only did this when the sequence number of the deletion was older
than the earliest snapshot.
Most of the complex additions are in the Compaction Iterator, all other changes
should be relatively straightforward. The patch also includes basic support for
single deletions in db_stress and db_bench.
Limitations:
- Not compatible with cuckoo hash tables
- Single deletions cannot be used in combination with merges and normal
deletions on the same key (other keys are not affected by this)
- Consecutive single deletions are currently not allowed (and older version of
this patch supported this so it could be resurrected if needed)
Test Plan: make all check
Reviewers: yhchiang, sdong, rven, anthony, yoshinorim, igor
Reviewed By: igor
Subscribers: maykov, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43179
2015-09-17 11:42:56 -07:00
|
|
|
using DB::SingleDelete;
|
|
|
|
virtual Status SingleDelete(const WriteOptions& options,
|
|
|
|
ColumnFamilyHandle* column_family,
|
|
|
|
const Slice& key) override;
|
[RocksDB] [Column Family] Interface proposal
Summary:
<This diff is for Column Family branch>
Sharing some of the work I've done so far. This diff compiles and passes the tests.
The biggest change is in options.h - I broke down Options into two parts - DBOptions and ColumnFamilyOptions. DBOptions is DB-specific (env, create_if_missing, block_cache, etc.) and ColumnFamilyOptions is column family-specific (all compaction options, compresion options, etc.). Note that this does not break backwards compatibility at all.
Further, I created DBWithColumnFamily which inherits DB interface and adds new functions with column family support. Clients can transparently switch to DBWithColumnFamily and it will not break their backwards compatibility.
There are few methods worth checking out: ListColumnFamilies(), MultiNewIterator(), MultiGet() and GetSnapshot(). [GetSnapshot() returns the snapshot across all column families for now - I think that's what we agreed on]
Finally, I made small changes to WriteBatch so we are able to atomically insert data across column families.
Please provide feedback.
Test Plan: make check works, the code is backward compatible
Reviewers: dhruba, haobo, sdong, kailiu, emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14445
2013-12-03 11:14:09 -08:00
|
|
|
using DB::Write;
|
2015-02-26 11:28:41 -08:00
|
|
|
virtual Status Write(const WriteOptions& options,
|
|
|
|
WriteBatch* updates) override;
|
2015-05-29 14:36:35 -07:00
|
|
|
|
[RocksDB] [Column Family] Interface proposal
Summary:
<This diff is for Column Family branch>
Sharing some of the work I've done so far. This diff compiles and passes the tests.
The biggest change is in options.h - I broke down Options into two parts - DBOptions and ColumnFamilyOptions. DBOptions is DB-specific (env, create_if_missing, block_cache, etc.) and ColumnFamilyOptions is column family-specific (all compaction options, compresion options, etc.). Note that this does not break backwards compatibility at all.
Further, I created DBWithColumnFamily which inherits DB interface and adds new functions with column family support. Clients can transparently switch to DBWithColumnFamily and it will not break their backwards compatibility.
There are few methods worth checking out: ListColumnFamilies(), MultiNewIterator(), MultiGet() and GetSnapshot(). [GetSnapshot() returns the snapshot across all column families for now - I think that's what we agreed on]
Finally, I made small changes to WriteBatch so we are able to atomically insert data across column families.
Please provide feedback.
Test Plan: make check works, the code is backward compatible
Reviewers: dhruba, haobo, sdong, kailiu, emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14445
2013-12-03 11:14:09 -08:00
|
|
|
using DB::Get;
|
2011-03-18 22:37:00 +00:00
|
|
|
virtual Status Get(const ReadOptions& options,
|
2014-02-10 17:04:44 -08:00
|
|
|
ColumnFamilyHandle* column_family, const Slice& key,
|
2015-02-26 11:28:41 -08:00
|
|
|
std::string* value) override;
|
[RocksDB] [Column Family] Interface proposal
Summary:
<This diff is for Column Family branch>
Sharing some of the work I've done so far. This diff compiles and passes the tests.
The biggest change is in options.h - I broke down Options into two parts - DBOptions and ColumnFamilyOptions. DBOptions is DB-specific (env, create_if_missing, block_cache, etc.) and ColumnFamilyOptions is column family-specific (all compaction options, compresion options, etc.). Note that this does not break backwards compatibility at all.
Further, I created DBWithColumnFamily which inherits DB interface and adds new functions with column family support. Clients can transparently switch to DBWithColumnFamily and it will not break their backwards compatibility.
There are few methods worth checking out: ListColumnFamilies(), MultiNewIterator(), MultiGet() and GetSnapshot(). [GetSnapshot() returns the snapshot across all column families for now - I think that's what we agreed on]
Finally, I made small changes to WriteBatch so we are able to atomically insert data across column families.
Please provide feedback.
Test Plan: make check works, the code is backward compatible
Reviewers: dhruba, haobo, sdong, kailiu, emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14445
2013-12-03 11:14:09 -08:00
|
|
|
using DB::MultiGet;
|
|
|
|
virtual std::vector<Status> MultiGet(
|
|
|
|
const ReadOptions& options,
|
2014-02-10 17:04:44 -08:00
|
|
|
const std::vector<ColumnFamilyHandle*>& column_family,
|
2015-02-26 11:28:41 -08:00
|
|
|
const std::vector<Slice>& keys,
|
|
|
|
std::vector<std::string>* values) override;
|
2013-07-05 18:49:18 -07:00
|
|
|
|
2014-01-02 09:08:12 -08:00
|
|
|
virtual Status CreateColumnFamily(const ColumnFamilyOptions& options,
|
2014-01-06 13:31:06 -08:00
|
|
|
const std::string& column_family,
|
2015-02-26 11:28:41 -08:00
|
|
|
ColumnFamilyHandle** handle) override;
|
|
|
|
virtual Status DropColumnFamily(ColumnFamilyHandle* column_family) override;
|
2014-01-02 09:08:12 -08:00
|
|
|
|
2013-07-26 12:57:01 -07:00
|
|
|
// Returns false if key doesn't exist in the database and true if it may.
|
|
|
|
// If value_found is not passed in as null, then return the value if found in
|
|
|
|
// memory. On return, if value was found, then value_found will be set to true
|
|
|
|
// , otherwise false.
|
[RocksDB] [Column Family] Interface proposal
Summary:
<This diff is for Column Family branch>
Sharing some of the work I've done so far. This diff compiles and passes the tests.
The biggest change is in options.h - I broke down Options into two parts - DBOptions and ColumnFamilyOptions. DBOptions is DB-specific (env, create_if_missing, block_cache, etc.) and ColumnFamilyOptions is column family-specific (all compaction options, compresion options, etc.). Note that this does not break backwards compatibility at all.
Further, I created DBWithColumnFamily which inherits DB interface and adds new functions with column family support. Clients can transparently switch to DBWithColumnFamily and it will not break their backwards compatibility.
There are few methods worth checking out: ListColumnFamilies(), MultiNewIterator(), MultiGet() and GetSnapshot(). [GetSnapshot() returns the snapshot across all column families for now - I think that's what we agreed on]
Finally, I made small changes to WriteBatch so we are able to atomically insert data across column families.
Please provide feedback.
Test Plan: make check works, the code is backward compatible
Reviewers: dhruba, haobo, sdong, kailiu, emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14445
2013-12-03 11:14:09 -08:00
|
|
|
using DB::KeyMayExist;
|
2013-07-26 12:57:01 -07:00
|
|
|
virtual bool KeyMayExist(const ReadOptions& options,
|
2014-02-10 17:04:44 -08:00
|
|
|
ColumnFamilyHandle* column_family, const Slice& key,
|
2015-02-26 11:28:41 -08:00
|
|
|
std::string* value,
|
|
|
|
bool* value_found = nullptr) override;
|
[RocksDB] [Column Family] Interface proposal
Summary:
<This diff is for Column Family branch>
Sharing some of the work I've done so far. This diff compiles and passes the tests.
The biggest change is in options.h - I broke down Options into two parts - DBOptions and ColumnFamilyOptions. DBOptions is DB-specific (env, create_if_missing, block_cache, etc.) and ColumnFamilyOptions is column family-specific (all compaction options, compresion options, etc.). Note that this does not break backwards compatibility at all.
Further, I created DBWithColumnFamily which inherits DB interface and adds new functions with column family support. Clients can transparently switch to DBWithColumnFamily and it will not break their backwards compatibility.
There are few methods worth checking out: ListColumnFamilies(), MultiNewIterator(), MultiGet() and GetSnapshot(). [GetSnapshot() returns the snapshot across all column families for now - I think that's what we agreed on]
Finally, I made small changes to WriteBatch so we are able to atomically insert data across column families.
Please provide feedback.
Test Plan: make check works, the code is backward compatible
Reviewers: dhruba, haobo, sdong, kailiu, emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14445
2013-12-03 11:14:09 -08:00
|
|
|
using DB::NewIterator;
|
|
|
|
virtual Iterator* NewIterator(const ReadOptions& options,
|
2015-02-26 11:28:41 -08:00
|
|
|
ColumnFamilyHandle* column_family) override;
|
[RocksDB] [Column Family] Interface proposal
Summary:
<This diff is for Column Family branch>
Sharing some of the work I've done so far. This diff compiles and passes the tests.
The biggest change is in options.h - I broke down Options into two parts - DBOptions and ColumnFamilyOptions. DBOptions is DB-specific (env, create_if_missing, block_cache, etc.) and ColumnFamilyOptions is column family-specific (all compaction options, compresion options, etc.). Note that this does not break backwards compatibility at all.
Further, I created DBWithColumnFamily which inherits DB interface and adds new functions with column family support. Clients can transparently switch to DBWithColumnFamily and it will not break their backwards compatibility.
There are few methods worth checking out: ListColumnFamilies(), MultiNewIterator(), MultiGet() and GetSnapshot(). [GetSnapshot() returns the snapshot across all column families for now - I think that's what we agreed on]
Finally, I made small changes to WriteBatch so we are able to atomically insert data across column families.
Please provide feedback.
Test Plan: make check works, the code is backward compatible
Reviewers: dhruba, haobo, sdong, kailiu, emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14445
2013-12-03 11:14:09 -08:00
|
|
|
virtual Status NewIterators(
|
|
|
|
const ReadOptions& options,
|
2014-03-07 16:12:34 -08:00
|
|
|
const std::vector<ColumnFamilyHandle*>& column_families,
|
2015-02-26 11:28:41 -08:00
|
|
|
std::vector<Iterator*>* iterators) override;
|
|
|
|
virtual const Snapshot* GetSnapshot() override;
|
|
|
|
virtual void ReleaseSnapshot(const Snapshot* snapshot) override;
|
[RocksDB] [Column Family] Interface proposal
Summary:
<This diff is for Column Family branch>
Sharing some of the work I've done so far. This diff compiles and passes the tests.
The biggest change is in options.h - I broke down Options into two parts - DBOptions and ColumnFamilyOptions. DBOptions is DB-specific (env, create_if_missing, block_cache, etc.) and ColumnFamilyOptions is column family-specific (all compaction options, compresion options, etc.). Note that this does not break backwards compatibility at all.
Further, I created DBWithColumnFamily which inherits DB interface and adds new functions with column family support. Clients can transparently switch to DBWithColumnFamily and it will not break their backwards compatibility.
There are few methods worth checking out: ListColumnFamilies(), MultiNewIterator(), MultiGet() and GetSnapshot(). [GetSnapshot() returns the snapshot across all column families for now - I think that's what we agreed on]
Finally, I made small changes to WriteBatch so we are able to atomically insert data across column families.
Please provide feedback.
Test Plan: make check works, the code is backward compatible
Reviewers: dhruba, haobo, sdong, kailiu, emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14445
2013-12-03 11:14:09 -08:00
|
|
|
using DB::GetProperty;
|
2014-02-10 17:04:44 -08:00
|
|
|
virtual bool GetProperty(ColumnFamilyHandle* column_family,
|
2015-02-26 11:28:41 -08:00
|
|
|
const Slice& property, std::string* value) override;
|
2014-07-28 15:28:53 -07:00
|
|
|
using DB::GetIntProperty;
|
|
|
|
virtual bool GetIntProperty(ColumnFamilyHandle* column_family,
|
|
|
|
const Slice& property, uint64_t* value) override;
|
2015-11-03 15:54:18 -08:00
|
|
|
using DB::GetAggregatedIntProperty;
|
|
|
|
virtual bool GetAggregatedIntProperty(const Slice& property,
|
|
|
|
uint64_t* aggregated_value) override;
|
[RocksDB] [Column Family] Interface proposal
Summary:
<This diff is for Column Family branch>
Sharing some of the work I've done so far. This diff compiles and passes the tests.
The biggest change is in options.h - I broke down Options into two parts - DBOptions and ColumnFamilyOptions. DBOptions is DB-specific (env, create_if_missing, block_cache, etc.) and ColumnFamilyOptions is column family-specific (all compaction options, compresion options, etc.). Note that this does not break backwards compatibility at all.
Further, I created DBWithColumnFamily which inherits DB interface and adds new functions with column family support. Clients can transparently switch to DBWithColumnFamily and it will not break their backwards compatibility.
There are few methods worth checking out: ListColumnFamilies(), MultiNewIterator(), MultiGet() and GetSnapshot(). [GetSnapshot() returns the snapshot across all column families for now - I think that's what we agreed on]
Finally, I made small changes to WriteBatch so we are able to atomically insert data across column families.
Please provide feedback.
Test Plan: make check works, the code is backward compatible
Reviewers: dhruba, haobo, sdong, kailiu, emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14445
2013-12-03 11:14:09 -08:00
|
|
|
using DB::GetApproximateSizes;
|
2014-02-10 17:04:44 -08:00
|
|
|
virtual void GetApproximateSizes(ColumnFamilyHandle* column_family,
|
2015-06-12 18:04:30 -07:00
|
|
|
const Range* range, int n, uint64_t* sizes,
|
|
|
|
bool include_memtable = false) override;
|
[RocksDB] [Column Family] Interface proposal
Summary:
<This diff is for Column Family branch>
Sharing some of the work I've done so far. This diff compiles and passes the tests.
The biggest change is in options.h - I broke down Options into two parts - DBOptions and ColumnFamilyOptions. DBOptions is DB-specific (env, create_if_missing, block_cache, etc.) and ColumnFamilyOptions is column family-specific (all compaction options, compresion options, etc.). Note that this does not break backwards compatibility at all.
Further, I created DBWithColumnFamily which inherits DB interface and adds new functions with column family support. Clients can transparently switch to DBWithColumnFamily and it will not break their backwards compatibility.
There are few methods worth checking out: ListColumnFamilies(), MultiNewIterator(), MultiGet() and GetSnapshot(). [GetSnapshot() returns the snapshot across all column families for now - I think that's what we agreed on]
Finally, I made small changes to WriteBatch so we are able to atomically insert data across column families.
Please provide feedback.
Test Plan: make check works, the code is backward compatible
Reviewers: dhruba, haobo, sdong, kailiu, emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14445
2013-12-03 11:14:09 -08:00
|
|
|
using DB::CompactRange;
|
2015-06-17 14:36:14 -07:00
|
|
|
virtual Status CompactRange(const CompactRangeOptions& options,
|
|
|
|
ColumnFamilyHandle* column_family,
|
|
|
|
const Slice* begin, const Slice* end) override;
|
[RocksDB] [Column Family] Interface proposal
Summary:
<This diff is for Column Family branch>
Sharing some of the work I've done so far. This diff compiles and passes the tests.
The biggest change is in options.h - I broke down Options into two parts - DBOptions and ColumnFamilyOptions. DBOptions is DB-specific (env, create_if_missing, block_cache, etc.) and ColumnFamilyOptions is column family-specific (all compaction options, compresion options, etc.). Note that this does not break backwards compatibility at all.
Further, I created DBWithColumnFamily which inherits DB interface and adds new functions with column family support. Clients can transparently switch to DBWithColumnFamily and it will not break their backwards compatibility.
There are few methods worth checking out: ListColumnFamilies(), MultiNewIterator(), MultiGet() and GetSnapshot(). [GetSnapshot() returns the snapshot across all column families for now - I think that's what we agreed on]
Finally, I made small changes to WriteBatch so we are able to atomically insert data across column families.
Please provide feedback.
Test Plan: make check works, the code is backward compatible
Reviewers: dhruba, haobo, sdong, kailiu, emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14445
2013-12-03 11:14:09 -08:00
|
|
|
|
CompactFiles, EventListener and GetDatabaseMetaData
Summary:
This diff adds three sets of APIs to RocksDB.
= GetColumnFamilyMetaData =
* This APIs allow users to obtain the current state of a RocksDB instance on one column family.
* See GetColumnFamilyMetaData in include/rocksdb/db.h
= EventListener =
* A virtual class that allows users to implement a set of
call-back functions which will be called when specific
events of a RocksDB instance happens.
* To register EventListener, simply insert an EventListener to ColumnFamilyOptions::listeners
= CompactFiles =
* CompactFiles API inputs a set of file numbers and an output level, and RocksDB
will try to compact those files into the specified level.
= Example =
* Example code can be found in example/compact_files_example.cc, which implements
a simple external compactor using EventListener, GetColumnFamilyMetaData, and
CompactFiles API.
Test Plan:
listener_test
compactor_test
example/compact_files_example
export ROCKSDB_TESTS=CompactFiles
db_test
export ROCKSDB_TESTS=MetaData
db_test
Reviewers: ljin, igor, rven, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D24705
2014-11-07 14:45:18 -08:00
|
|
|
using DB::CompactFiles;
|
2015-02-26 11:28:41 -08:00
|
|
|
virtual Status CompactFiles(const CompactionOptions& compact_options,
|
|
|
|
ColumnFamilyHandle* column_family,
|
|
|
|
const std::vector<std::string>& input_file_names,
|
|
|
|
const int output_level,
|
|
|
|
const int output_path_id = -1) override;
|
CompactFiles, EventListener and GetDatabaseMetaData
Summary:
This diff adds three sets of APIs to RocksDB.
= GetColumnFamilyMetaData =
* This APIs allow users to obtain the current state of a RocksDB instance on one column family.
* See GetColumnFamilyMetaData in include/rocksdb/db.h
= EventListener =
* A virtual class that allows users to implement a set of
call-back functions which will be called when specific
events of a RocksDB instance happens.
* To register EventListener, simply insert an EventListener to ColumnFamilyOptions::listeners
= CompactFiles =
* CompactFiles API inputs a set of file numbers and an output level, and RocksDB
will try to compact those files into the specified level.
= Example =
* Example code can be found in example/compact_files_example.cc, which implements
a simple external compactor using EventListener, GetColumnFamilyMetaData, and
CompactFiles API.
Test Plan:
listener_test
compactor_test
example/compact_files_example
export ROCKSDB_TESTS=CompactFiles
db_test
export ROCKSDB_TESTS=MetaData
db_test
Reviewers: ljin, igor, rven, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D24705
2014-11-07 14:45:18 -08:00
|
|
|
|
2015-10-02 13:17:34 -07:00
|
|
|
virtual Status PauseBackgroundWork() override;
|
|
|
|
virtual Status ContinueBackgroundWork() override;
|
|
|
|
|
2015-11-20 01:07:24 -08:00
|
|
|
virtual Status EnableAutoCompaction(
|
|
|
|
const std::vector<ColumnFamilyHandle*>& column_family_handles) override;
|
|
|
|
|
2014-09-17 12:49:13 -07:00
|
|
|
using DB::SetOptions;
|
2015-02-26 11:28:41 -08:00
|
|
|
Status SetOptions(
|
|
|
|
ColumnFamilyHandle* column_family,
|
|
|
|
const std::unordered_map<std::string, std::string>& options_map) override;
|
2014-09-17 12:49:13 -07:00
|
|
|
|
[RocksDB] [Column Family] Interface proposal
Summary:
<This diff is for Column Family branch>
Sharing some of the work I've done so far. This diff compiles and passes the tests.
The biggest change is in options.h - I broke down Options into two parts - DBOptions and ColumnFamilyOptions. DBOptions is DB-specific (env, create_if_missing, block_cache, etc.) and ColumnFamilyOptions is column family-specific (all compaction options, compresion options, etc.). Note that this does not break backwards compatibility at all.
Further, I created DBWithColumnFamily which inherits DB interface and adds new functions with column family support. Clients can transparently switch to DBWithColumnFamily and it will not break their backwards compatibility.
There are few methods worth checking out: ListColumnFamilies(), MultiNewIterator(), MultiGet() and GetSnapshot(). [GetSnapshot() returns the snapshot across all column families for now - I think that's what we agreed on]
Finally, I made small changes to WriteBatch so we are able to atomically insert data across column families.
Please provide feedback.
Test Plan: make check works, the code is backward compatible
Reviewers: dhruba, haobo, sdong, kailiu, emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14445
2013-12-03 11:14:09 -08:00
|
|
|
using DB::NumberLevels;
|
2015-02-26 11:28:41 -08:00
|
|
|
virtual int NumberLevels(ColumnFamilyHandle* column_family) override;
|
[RocksDB] [Column Family] Interface proposal
Summary:
<This diff is for Column Family branch>
Sharing some of the work I've done so far. This diff compiles and passes the tests.
The biggest change is in options.h - I broke down Options into two parts - DBOptions and ColumnFamilyOptions. DBOptions is DB-specific (env, create_if_missing, block_cache, etc.) and ColumnFamilyOptions is column family-specific (all compaction options, compresion options, etc.). Note that this does not break backwards compatibility at all.
Further, I created DBWithColumnFamily which inherits DB interface and adds new functions with column family support. Clients can transparently switch to DBWithColumnFamily and it will not break their backwards compatibility.
There are few methods worth checking out: ListColumnFamilies(), MultiNewIterator(), MultiGet() and GetSnapshot(). [GetSnapshot() returns the snapshot across all column families for now - I think that's what we agreed on]
Finally, I made small changes to WriteBatch so we are able to atomically insert data across column families.
Please provide feedback.
Test Plan: make check works, the code is backward compatible
Reviewers: dhruba, haobo, sdong, kailiu, emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14445
2013-12-03 11:14:09 -08:00
|
|
|
using DB::MaxMemCompactionLevel;
|
2015-02-26 11:28:41 -08:00
|
|
|
virtual int MaxMemCompactionLevel(ColumnFamilyHandle* column_family) override;
|
[RocksDB] [Column Family] Interface proposal
Summary:
<This diff is for Column Family branch>
Sharing some of the work I've done so far. This diff compiles and passes the tests.
The biggest change is in options.h - I broke down Options into two parts - DBOptions and ColumnFamilyOptions. DBOptions is DB-specific (env, create_if_missing, block_cache, etc.) and ColumnFamilyOptions is column family-specific (all compaction options, compresion options, etc.). Note that this does not break backwards compatibility at all.
Further, I created DBWithColumnFamily which inherits DB interface and adds new functions with column family support. Clients can transparently switch to DBWithColumnFamily and it will not break their backwards compatibility.
There are few methods worth checking out: ListColumnFamilies(), MultiNewIterator(), MultiGet() and GetSnapshot(). [GetSnapshot() returns the snapshot across all column families for now - I think that's what we agreed on]
Finally, I made small changes to WriteBatch so we are able to atomically insert data across column families.
Please provide feedback.
Test Plan: make check works, the code is backward compatible
Reviewers: dhruba, haobo, sdong, kailiu, emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14445
2013-12-03 11:14:09 -08:00
|
|
|
using DB::Level0StopWriteTrigger;
|
2015-02-26 11:28:41 -08:00
|
|
|
virtual int Level0StopWriteTrigger(
|
|
|
|
ColumnFamilyHandle* column_family) override;
|
|
|
|
virtual const std::string& GetName() const override;
|
|
|
|
virtual Env* GetEnv() const override;
|
[RocksDB] [Column Family] Interface proposal
Summary:
<This diff is for Column Family branch>
Sharing some of the work I've done so far. This diff compiles and passes the tests.
The biggest change is in options.h - I broke down Options into two parts - DBOptions and ColumnFamilyOptions. DBOptions is DB-specific (env, create_if_missing, block_cache, etc.) and ColumnFamilyOptions is column family-specific (all compaction options, compresion options, etc.). Note that this does not break backwards compatibility at all.
Further, I created DBWithColumnFamily which inherits DB interface and adds new functions with column family support. Clients can transparently switch to DBWithColumnFamily and it will not break their backwards compatibility.
There are few methods worth checking out: ListColumnFamilies(), MultiNewIterator(), MultiGet() and GetSnapshot(). [GetSnapshot() returns the snapshot across all column families for now - I think that's what we agreed on]
Finally, I made small changes to WriteBatch so we are able to atomically insert data across column families.
Please provide feedback.
Test Plan: make check works, the code is backward compatible
Reviewers: dhruba, haobo, sdong, kailiu, emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14445
2013-12-03 11:14:09 -08:00
|
|
|
using DB::GetOptions;
|
2015-02-26 11:28:41 -08:00
|
|
|
virtual const Options& GetOptions(
|
|
|
|
ColumnFamilyHandle* column_family) const override;
|
2015-05-11 14:51:51 -07:00
|
|
|
using DB::GetDBOptions;
|
|
|
|
virtual const DBOptions& GetDBOptions() const override;
|
[RocksDB] [Column Family] Interface proposal
Summary:
<This diff is for Column Family branch>
Sharing some of the work I've done so far. This diff compiles and passes the tests.
The biggest change is in options.h - I broke down Options into two parts - DBOptions and ColumnFamilyOptions. DBOptions is DB-specific (env, create_if_missing, block_cache, etc.) and ColumnFamilyOptions is column family-specific (all compaction options, compresion options, etc.). Note that this does not break backwards compatibility at all.
Further, I created DBWithColumnFamily which inherits DB interface and adds new functions with column family support. Clients can transparently switch to DBWithColumnFamily and it will not break their backwards compatibility.
There are few methods worth checking out: ListColumnFamilies(), MultiNewIterator(), MultiGet() and GetSnapshot(). [GetSnapshot() returns the snapshot across all column families for now - I think that's what we agreed on]
Finally, I made small changes to WriteBatch so we are able to atomically insert data across column families.
Please provide feedback.
Test Plan: make check works, the code is backward compatible
Reviewers: dhruba, haobo, sdong, kailiu, emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14445
2013-12-03 11:14:09 -08:00
|
|
|
using DB::Flush;
|
|
|
|
virtual Status Flush(const FlushOptions& options,
|
2015-02-26 11:28:41 -08:00
|
|
|
ColumnFamilyHandle* column_family) override;
|
[wal changes 3/3] method in DB to sync WAL without blocking writers
Summary:
Subj. We really need this feature.
Previous diff D40899 has most of the changes to make this possible, this diff just adds the method.
Test Plan: `make check`, the new test fails without this diff; ran with ASAN, TSAN and valgrind.
Reviewers: igor, rven, IslamAbdelRahman, anthony, kradhakrishnan, tnovak, yhchiang, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, maykov, hermanlee4, yoshinorim, tnovak, dhruba
Differential Revision: https://reviews.facebook.net/D40905
2015-08-05 06:06:39 -07:00
|
|
|
virtual Status SyncWAL() override;
|
2014-04-15 13:39:26 -07:00
|
|
|
|
2015-02-26 11:28:41 -08:00
|
|
|
virtual SequenceNumber GetLatestSequenceNumber() const override;
|
2014-04-15 13:39:26 -07:00
|
|
|
|
|
|
|
#ifndef ROCKSDB_LITE
|
2015-02-26 11:28:41 -08:00
|
|
|
virtual Status DisableFileDeletions() override;
|
|
|
|
virtual Status EnableFileDeletions(bool force) override;
|
2014-08-26 16:26:29 -07:00
|
|
|
virtual int IsFileDeletionsEnabled() const;
|
2013-11-08 15:23:46 -08:00
|
|
|
// All the returned filenames start with "/"
|
2012-11-06 11:21:57 -08:00
|
|
|
virtual Status GetLiveFiles(std::vector<std::string>&,
|
2013-10-03 14:38:32 -07:00
|
|
|
uint64_t* manifest_file_size,
|
2015-02-26 11:28:41 -08:00
|
|
|
bool flush_memtable = true) override;
|
|
|
|
virtual Status GetSortedWalFiles(VectorLogPtr& files) override;
|
2014-04-15 13:39:26 -07:00
|
|
|
|
2014-02-28 11:50:36 -08:00
|
|
|
virtual Status GetUpdatesSince(
|
|
|
|
SequenceNumber seq_number, unique_ptr<TransactionLogIterator>* iter,
|
|
|
|
const TransactionLogIterator::ReadOptions&
|
2015-02-26 11:28:41 -08:00
|
|
|
read_options = TransactionLogIterator::ReadOptions()) override;
|
|
|
|
virtual Status DeleteFile(std::string name) override;
|
2015-12-29 13:22:13 -08:00
|
|
|
Status DeleteFilesInRange(ColumnFamilyHandle* column_family,
|
|
|
|
const Slice* begin, const Slice* end);
|
2013-08-22 14:32:53 -07:00
|
|
|
|
2015-02-26 11:28:41 -08:00
|
|
|
virtual void GetLiveFilesMetaData(
|
|
|
|
std::vector<LiveFileMetaData>* metadata) override;
|
CompactFiles, EventListener and GetDatabaseMetaData
Summary:
This diff adds three sets of APIs to RocksDB.
= GetColumnFamilyMetaData =
* This APIs allow users to obtain the current state of a RocksDB instance on one column family.
* See GetColumnFamilyMetaData in include/rocksdb/db.h
= EventListener =
* A virtual class that allows users to implement a set of
call-back functions which will be called when specific
events of a RocksDB instance happens.
* To register EventListener, simply insert an EventListener to ColumnFamilyOptions::listeners
= CompactFiles =
* CompactFiles API inputs a set of file numbers and an output level, and RocksDB
will try to compact those files into the specified level.
= Example =
* Example code can be found in example/compact_files_example.cc, which implements
a simple external compactor using EventListener, GetColumnFamilyMetaData, and
CompactFiles API.
Test Plan:
listener_test
compactor_test
example/compact_files_example
export ROCKSDB_TESTS=CompactFiles
db_test
export ROCKSDB_TESTS=MetaData
db_test
Reviewers: ljin, igor, rven, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D24705
2014-11-07 14:45:18 -08:00
|
|
|
|
|
|
|
// Obtains the meta data of the specified column family of the DB.
|
|
|
|
// Status::NotFound() will be returned if the current DB does not have
|
|
|
|
// any column family match the specified name.
|
|
|
|
// TODO(yhchiang): output parameter is placed in the end in this codebase.
|
|
|
|
virtual void GetColumnFamilyMetaData(
|
|
|
|
ColumnFamilyHandle* column_family,
|
|
|
|
ColumnFamilyMetaData* metadata) override;
|
|
|
|
|
Add experimental API MarkForCompaction()
Summary:
Some Mongo+Rocks datasets in Parse's environment are not doing compactions very frequently. During the quiet period (with no IO), we'd like to schedule compactions so that our reads become faster. Also, aggressively compacting during quiet periods helps when write bursts happen. In addition, we also want to compact files that are containing deleted key ranges (like old oplog keys).
All of this is currently not possible with CompactRange() because it's single-threaded and blocks all other compactions from happening. Running CompactRange() risks an issue of blocking writes because we generate too much Level 0 files before the compaction is over. Stopping writes is very dangerous because they hold transaction locks. We tried running manual compaction once on Mongo+Rocks and everything fell apart.
MarkForCompaction() solves all of those problems. This is very light-weight manual compaction. It is lower priority than automatic compactions, which means it shouldn't interfere with background process keeping the LSM tree clean. However, if no automatic compactions need to be run (or we have extra background threads available), we will start compacting files that are marked for compaction.
Test Plan: added a new unit test
Reviewers: yhchiang, rven, MarkCallaghan, sdong
Reviewed By: sdong
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D37083
2015-04-17 16:44:45 -07:00
|
|
|
// experimental API
|
|
|
|
Status SuggestCompactRange(ColumnFamilyHandle* column_family,
|
|
|
|
const Slice* begin, const Slice* end);
|
|
|
|
|
Implement DB::PromoteL0 method
Summary:
This diff implements a new `DB` method `PromoteL0` which moves all files in L0
to a given level skipping compaction, provided that the files have disjoint
ranges and all levels up to the target level are empty.
This method provides finer-grain control for trivial compactions, and it is
useful for bulk-loading pre-sorted keys. Compared to D34797, it does not change
the semantics of an existing operation, which can impact existing code.
PromoteL0 is designed to work well in combination with the proposed
`GetSstFileWriter`/`AddFile` interface, enabling to "design" the level structure
by populating one level at a time. Such fine-grained control can be very useful
for static or mostly-static databases.
Test Plan: `make check`
Reviewers: IslamAbdelRahman, philipp, MarkCallaghan, yhchiang, igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D37107
2015-04-23 12:10:36 -07:00
|
|
|
Status PromoteL0(ColumnFamilyHandle* column_family, int target_level);
|
|
|
|
|
2015-05-29 14:36:35 -07:00
|
|
|
// Similar to Write() but will call the callback once on the single write
|
|
|
|
// thread to determine whether it is safe to perform the write.
|
|
|
|
virtual Status WriteWithCallback(const WriteOptions& write_options,
|
|
|
|
WriteBatch* my_batch,
|
|
|
|
WriteCallback* callback);
|
|
|
|
|
|
|
|
// Returns the sequence number that is guaranteed to be smaller than or equal
|
|
|
|
// to the sequence number of any key that could be inserted into the current
|
|
|
|
// memtables. It can then be assumed that any write with a larger(or equal)
|
|
|
|
// sequence number will be present in this memtable or a later memtable.
|
|
|
|
//
|
|
|
|
// If the earliest sequence number could not be determined,
|
|
|
|
// kMaxSequenceNumber will be returned.
|
|
|
|
//
|
|
|
|
// If include_history=true, will also search Memtables in MemTableList
|
|
|
|
// History.
|
|
|
|
SequenceNumber GetEarliestMemTableSequenceNumber(SuperVersion* sv,
|
|
|
|
bool include_history);
|
|
|
|
|
|
|
|
// For a given key, check to see if there are any records for this key
|
Use SST files for Transaction conflict detection
Summary:
Currently, transactions can fail even if there is no actual write conflict. This is due to relying on only the memtables to check for write-conflicts. Users have to tune memtable settings to try to avoid this, but it's hard to figure out exactly how to tune these settings.
With this diff, TransactionDB will use both memtables and SST files to determine if there are any write conflicts. This relies on the fact that BlockBasedTable stores sequence numbers for all writes that happen after any open snapshot. Also, D50295 is needed to prevent SingleDelete from disappearing writes (the TODOs in this test code will be fixed once the other diff is approved and merged).
Note that Optimistic transactions will still rely on tuning memtable settings as we do not want to read from SST while on the write thread. Also, memtable settings can still be used to reduce how often TransactionDB needs to read SST files.
Test Plan: unit tests, db bench
Reviewers: rven, yhchiang, kradhakrishnan, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb, yoshinorim
Differential Revision: https://reviews.facebook.net/D50475
2015-10-15 16:37:15 -07:00
|
|
|
// in the memtables, including memtable history. If cache_only is false,
|
|
|
|
// SST files will also be checked.
|
|
|
|
//
|
|
|
|
// If a key is found, *found_record_for_key will be set to true and
|
|
|
|
// *seq will will be set to the stored sequence number for the latest
|
|
|
|
// operation on this key or kMaxSequenceNumber if unknown.
|
|
|
|
// If no key is found, *found_record_for_key will be set to false.
|
|
|
|
//
|
|
|
|
// Note: If cache_only=false, it is possible for *seq to be set to 0 if
|
|
|
|
// the sequence number has been cleared from the record. If the caller is
|
|
|
|
// holding an active db snapshot, we know the missing sequence must be less
|
|
|
|
// than the snapshot's sequence number (sequence numbers are only cleared
|
|
|
|
// when there are no earlier active snapshots).
|
|
|
|
//
|
|
|
|
// If NotFound is returned and found_record_for_key is set to false, then no
|
|
|
|
// record for this key was found. If the caller is holding an active db
|
|
|
|
// snapshot, we know that no key could have existing after this snapshot
|
|
|
|
// (since we do not compact keys that have an earlier snapshot).
|
|
|
|
//
|
|
|
|
// Returns OK or NotFound on success,
|
|
|
|
// other status on unexpected error.
|
|
|
|
Status GetLatestSequenceForKey(SuperVersion* sv, const Slice& key,
|
|
|
|
bool cache_only, SequenceNumber* seq,
|
|
|
|
bool* found_record_for_key);
|
2015-05-29 14:36:35 -07:00
|
|
|
|
2015-09-23 12:42:43 -07:00
|
|
|
using DB::AddFile;
|
|
|
|
virtual Status AddFile(ColumnFamilyHandle* column_family,
|
|
|
|
const ExternalSstFileInfo* file_info,
|
|
|
|
bool move_file) override;
|
|
|
|
virtual Status AddFile(ColumnFamilyHandle* column_family,
|
|
|
|
const std::string& file_path, bool move_file) override;
|
|
|
|
|
2014-04-15 13:39:26 -07:00
|
|
|
#endif // ROCKSDB_LITE
|
2012-11-26 13:56:45 -08:00
|
|
|
|
2015-12-08 12:25:48 -08:00
|
|
|
// Similar to GetSnapshot(), but also lets the db know that this snapshot
|
|
|
|
// will be used for transaction write-conflict checking. The DB can then
|
|
|
|
// make sure not to compact any keys that would prevent a write-conflict from
|
|
|
|
// being detected.
|
|
|
|
const Snapshot* GetSnapshotForWriteConflictBoundary();
|
|
|
|
|
2014-03-20 14:18:29 -07:00
|
|
|
// checks if all live files exist on file system and that their file sizes
|
|
|
|
// match to our in-memory records
|
|
|
|
virtual Status CheckConsistency();
|
|
|
|
|
2015-05-21 11:01:48 -07:00
|
|
|
virtual Status GetDbIdentity(std::string& identity) const override;
|
2013-12-03 06:39:07 -08:00
|
|
|
|
2014-01-31 16:45:20 -08:00
|
|
|
Status RunManualCompaction(ColumnFamilyData* cfd, int input_level,
|
2014-07-16 17:39:18 -07:00
|
|
|
int output_level, uint32_t output_path_id,
|
Allowing L0 -> L1 trivial move on sorted data
Summary:
This diff updates the logic of how we do trivial move, now trivial move can run on any number of files in input level as long as they are not overlapping
The conditions for trivial move have been updated
Introduced conditions:
- Trivial move cannot happen if we have a compaction filter (except if the compaction is not manual)
- Input level files cannot be overlapping
Removed conditions:
- Trivial move only run when the compaction is not manual
- Input level should can contain only 1 file
More context on what tests failed because of Trivial move
```
DBTest.CompactionsGenerateMultipleFiles
This test is expecting compaction on a file in L0 to generate multiple files in L1, this test will fail with trivial move because we end up with one file in L1
```
```
DBTest.NoSpaceCompactRange
This test expect compaction to fail when we force environment to report running out of space, of course this is not valid in trivial move situation
because trivial move does not need any extra space, and did not check for that
```
```
DBTest.DropWrites
Similar to DBTest.NoSpaceCompactRange
```
```
DBTest.DeleteObsoleteFilesPendingOutputs
This test expect that a file in L2 is deleted after it's moved to L3, this is not valid with trivial move because although the file was moved it is now used by L3
```
```
CuckooTableDBTest.CompactionIntoMultipleFiles
Same as DBTest.CompactionsGenerateMultipleFiles
```
This diff is based on a work by @sdong https://reviews.facebook.net/D34149
Test Plan: make -j64 check
Reviewers: rven, sdong, igor
Reviewed By: igor
Subscribers: yhchiang, ott, march, dhruba, sdong
Differential Revision: https://reviews.facebook.net/D34797
2015-06-04 16:51:25 -07:00
|
|
|
const Slice* begin, const Slice* end,
|
Running manual compactions in parallel with other automatic or manual compactions in restricted cases
Summary:
This diff provides a framework for doing manual
compactions in parallel with other compactions. We now have a deque of manual compactions. We also pass manual compactions as an argument from RunManualCompactions down to
BackgroundCompactions, so that RunManualCompactions can be reentrant.
Parallelism is controlled by the two routines
ConflictingManualCompaction to allow/disallow new parallel/manual
compactions based on already existing ManualCompactions. In this diff, by default manual compactions still have to run exclusive of other compactions. However, by setting the compaction option, exclusive_manual_compaction to false, it is possible to run other compactions in parallel with a manual compaction. However, we are still restricted to one manual compaction per column family at a time. All of these restrictions will be relaxed in future diffs.
I will be adding more tests later.
Test Plan: Rocksdb regression + new tests + valgrind
Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong
Reviewed By: sdong
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47973
2015-12-14 11:20:34 -08:00
|
|
|
bool exclusive,
|
Allowing L0 -> L1 trivial move on sorted data
Summary:
This diff updates the logic of how we do trivial move, now trivial move can run on any number of files in input level as long as they are not overlapping
The conditions for trivial move have been updated
Introduced conditions:
- Trivial move cannot happen if we have a compaction filter (except if the compaction is not manual)
- Input level files cannot be overlapping
Removed conditions:
- Trivial move only run when the compaction is not manual
- Input level should can contain only 1 file
More context on what tests failed because of Trivial move
```
DBTest.CompactionsGenerateMultipleFiles
This test is expecting compaction on a file in L0 to generate multiple files in L1, this test will fail with trivial move because we end up with one file in L1
```
```
DBTest.NoSpaceCompactRange
This test expect compaction to fail when we force environment to report running out of space, of course this is not valid in trivial move situation
because trivial move does not need any extra space, and did not check for that
```
```
DBTest.DropWrites
Similar to DBTest.NoSpaceCompactRange
```
```
DBTest.DeleteObsoleteFilesPendingOutputs
This test expect that a file in L2 is deleted after it's moved to L3, this is not valid with trivial move because although the file was moved it is now used by L3
```
```
CuckooTableDBTest.CompactionIntoMultipleFiles
Same as DBTest.CompactionsGenerateMultipleFiles
```
This diff is based on a work by @sdong https://reviews.facebook.net/D34149
Test Plan: make -j64 check
Reviewers: rven, sdong, igor
Reviewed By: igor
Subscribers: yhchiang, ott, march, dhruba, sdong
Differential Revision: https://reviews.facebook.net/D34797
2015-06-04 16:51:25 -07:00
|
|
|
bool disallow_trivial_move = false);
|
2014-01-14 16:19:09 -08:00
|
|
|
|
2015-10-12 17:22:37 -07:00
|
|
|
// Return an internal iterator over the current state of the database.
|
|
|
|
// The keys of this iterator are internal keys (see format.h).
|
|
|
|
// The returned iterator should be deleted when no longer needed.
|
2015-10-12 15:06:38 -07:00
|
|
|
InternalIterator* NewInternalIterator(
|
|
|
|
Arena* arena, ColumnFamilyHandle* column_family = nullptr);
|
2015-10-12 17:22:37 -07:00
|
|
|
|
2015-10-12 17:35:43 -07:00
|
|
|
#ifndef NDEBUG
|
2011-03-18 22:37:00 +00:00
|
|
|
// Extra methods (for testing) that are not in the public DB interface
|
2014-04-15 13:39:26 -07:00
|
|
|
// Implemented in db_impl_debug.cc
|
2011-03-18 22:37:00 +00:00
|
|
|
|
2013-06-05 11:22:38 -07:00
|
|
|
// Compact any files in the named level that overlap [*begin, *end]
|
2014-02-07 14:47:16 -08:00
|
|
|
Status TEST_CompactRange(int level, const Slice* begin, const Slice* end,
|
Allowing L0 -> L1 trivial move on sorted data
Summary:
This diff updates the logic of how we do trivial move, now trivial move can run on any number of files in input level as long as they are not overlapping
The conditions for trivial move have been updated
Introduced conditions:
- Trivial move cannot happen if we have a compaction filter (except if the compaction is not manual)
- Input level files cannot be overlapping
Removed conditions:
- Trivial move only run when the compaction is not manual
- Input level should can contain only 1 file
More context on what tests failed because of Trivial move
```
DBTest.CompactionsGenerateMultipleFiles
This test is expecting compaction on a file in L0 to generate multiple files in L1, this test will fail with trivial move because we end up with one file in L1
```
```
DBTest.NoSpaceCompactRange
This test expect compaction to fail when we force environment to report running out of space, of course this is not valid in trivial move situation
because trivial move does not need any extra space, and did not check for that
```
```
DBTest.DropWrites
Similar to DBTest.NoSpaceCompactRange
```
```
DBTest.DeleteObsoleteFilesPendingOutputs
This test expect that a file in L2 is deleted after it's moved to L3, this is not valid with trivial move because although the file was moved it is now used by L3
```
```
CuckooTableDBTest.CompactionIntoMultipleFiles
Same as DBTest.CompactionsGenerateMultipleFiles
```
This diff is based on a work by @sdong https://reviews.facebook.net/D34149
Test Plan: make -j64 check
Reviewers: rven, sdong, igor
Reviewed By: igor
Subscribers: yhchiang, ott, march, dhruba, sdong
Differential Revision: https://reviews.facebook.net/D34797
2015-06-04 16:51:25 -07:00
|
|
|
ColumnFamilyHandle* column_family = nullptr,
|
|
|
|
bool disallow_trivial_move = false);
|
2011-03-18 22:37:00 +00:00
|
|
|
|
2013-10-14 15:12:15 -07:00
|
|
|
// Force current memtable contents to be flushed.
|
2014-03-18 12:25:08 -07:00
|
|
|
Status TEST_FlushMemTable(bool wait = true);
|
2011-03-18 22:37:00 +00:00
|
|
|
|
2012-06-22 19:30:03 -07:00
|
|
|
// Wait for memtable compaction
|
2014-02-07 14:47:16 -08:00
|
|
|
Status TEST_WaitForFlushMemTable(ColumnFamilyHandle* column_family = nullptr);
|
2012-06-22 19:30:03 -07:00
|
|
|
|
|
|
|
// Wait for any compaction
|
|
|
|
Status TEST_WaitForCompact();
|
|
|
|
|
2011-03-22 18:32:49 +00:00
|
|
|
// Return the maximum overlapping data (in bytes) at next level for any
|
|
|
|
// file at a level >= 1.
|
2014-02-07 14:47:16 -08:00
|
|
|
int64_t TEST_MaxNextLevelOverlappingBytes(ColumnFamilyHandle* column_family =
|
|
|
|
nullptr);
|
2011-03-22 18:32:49 +00:00
|
|
|
|
2013-01-10 17:18:50 -08:00
|
|
|
// Return the current manifest file no.
|
|
|
|
uint64_t TEST_Current_Manifest_FileNo();
|
2013-05-06 11:41:01 -07:00
|
|
|
|
2013-10-17 13:33:39 -07:00
|
|
|
// get total level0 file size. Only for testing.
|
2014-01-15 16:18:04 -08:00
|
|
|
uint64_t TEST_GetLevel0TotalSize();
|
2013-10-17 13:33:39 -07:00
|
|
|
|
2014-02-07 14:47:16 -08:00
|
|
|
void TEST_GetFilesMetaData(ColumnFamilyHandle* column_family,
|
|
|
|
std::vector<std::vector<FileMetaData>>* metadata);
|
2014-02-12 10:43:27 -08:00
|
|
|
|
2014-09-05 15:20:05 -07:00
|
|
|
void TEST_LockMutex();
|
|
|
|
|
|
|
|
void TEST_UnlockMutex();
|
|
|
|
|
|
|
|
// REQUIRES: mutex locked
|
|
|
|
void* TEST_BeginWrite();
|
|
|
|
|
|
|
|
// REQUIRES: mutex locked
|
|
|
|
// pass the pointer that you got from TEST_BeginWrite()
|
|
|
|
void TEST_EndWrite(void* w);
|
2014-12-08 12:52:18 -08:00
|
|
|
|
2015-03-30 15:04:10 -04:00
|
|
|
uint64_t TEST_MaxTotalInMemoryState() const {
|
2014-12-08 12:52:18 -08:00
|
|
|
return max_total_in_memory_state_;
|
|
|
|
}
|
2015-01-13 00:04:08 -08:00
|
|
|
|
2015-03-30 15:04:10 -04:00
|
|
|
size_t TEST_LogsToFreeSize();
|
|
|
|
|
2015-07-02 14:27:00 -07:00
|
|
|
uint64_t TEST_LogfileNumber();
|
|
|
|
|
2015-11-03 17:52:17 -08:00
|
|
|
// Returns column family name to ImmutableCFOptions map.
|
|
|
|
Status TEST_GetAllImmutableCFOptions(
|
|
|
|
std::unordered_map<std::string, const ImmutableCFOptions*>* iopts_map);
|
|
|
|
|
|
|
|
Cache* TEST_table_cache() { return table_cache_.get(); }
|
|
|
|
|
When slowdown is triggered, reduce the write rate
Summary: It's usually hard for users to set a value of options.delayed_write_rate. With this diff, after slowdown condition triggers, we greedily reduce write rate if estimated pending compaction bytes increase. If estimated compaction pending bytes drop, we increase the write rate.
Test Plan:
Add a unit test
Test with db_bench setting:
TEST_TMPDIR=/dev/shm/ ./db_bench --benchmarks=fillrandom -num=10000000 --soft_pending_compaction_bytes_limit=1000000000 --hard_pending_compaction_bytes_limit=3000000000 --delayed_write_rate=100000000
and make sure without the commit, write stop will happen, but with the commit, it will not happen.
Reviewers: igor, anthony, rven, yhchiang, kradhakrishnan, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D52131
2015-12-17 17:07:44 -08:00
|
|
|
WriteController& TEST_write_controler() { return write_controller_; }
|
2015-11-25 15:43:11 -08:00
|
|
|
|
2015-10-12 17:35:43 -07:00
|
|
|
#endif // NDEBUG
|
2014-04-15 13:39:26 -07:00
|
|
|
|
Add options.base_background_compactions as a number of compaction threads for low compaction debt
Summary:
If options.base_background_compactions is given, we try to schedule number of compactions not existing this number, only when L0 files increase to certain number, or pending compaction bytes more than certain threshold, we schedule compactions based on options.max_background_compactions.
The watermarks are calculated based on slowdown thresholds.
Test Plan:
Add new test cases in column_family_test.
Adding more unit tests.
Reviewers: IslamAbdelRahman, yhchiang, kradhakrishnan, rven, anthony
Reviewed By: anthony
Subscribers: leveldb, dhruba, yoshinorim
Differential Revision: https://reviews.facebook.net/D53409
2016-01-28 11:56:16 -08:00
|
|
|
// Return maximum background compaction alowed to be scheduled based on
|
|
|
|
// compaction status.
|
|
|
|
int BGCompactionsAllowed() const;
|
|
|
|
|
2013-11-14 18:03:57 -08:00
|
|
|
// Returns the list of live files in 'live' and the list
|
2013-12-30 18:33:57 -08:00
|
|
|
// of all files in the filesystem in 'candidate_files'.
|
2013-11-14 18:03:57 -08:00
|
|
|
// If force == false and the last call was less than
|
2014-09-05 11:48:17 -07:00
|
|
|
// db_options_.delete_obsolete_files_period_micros microseconds ago,
|
2014-10-28 11:54:33 -07:00
|
|
|
// it will not fill up the job_context
|
|
|
|
void FindObsoleteFiles(JobContext* job_context, bool force,
|
2013-11-14 18:03:57 -08:00
|
|
|
bool no_full_scan = false);
|
|
|
|
|
|
|
|
// Diffs the files listed in filenames and those that do not
|
|
|
|
// belong to live files are posibly removed. Also, removes all the
|
|
|
|
// files in sst_delete_files and log_delete_files.
|
|
|
|
// It is not necessary to hold the mutex when invoking this method.
|
2014-10-28 11:54:33 -07:00
|
|
|
void PurgeObsoleteFiles(const JobContext& background_contet);
|
2013-11-14 18:03:57 -08:00
|
|
|
|
2015-02-26 11:28:41 -08:00
|
|
|
ColumnFamilyHandle* DefaultColumnFamily() const override;
|
2014-02-10 17:04:44 -08:00
|
|
|
|
2014-12-05 16:12:10 -08:00
|
|
|
const SnapshotList& snapshots() const { return snapshots_; }
|
|
|
|
|
2015-05-05 16:54:47 -07:00
|
|
|
void CancelAllBackgroundWork(bool wait);
|
2015-03-11 10:31:02 -07:00
|
|
|
|
2015-05-29 14:36:35 -07:00
|
|
|
// Find Super version and reference it. Based on options, it might return
|
|
|
|
// the thread local cached one.
|
|
|
|
// Call ReturnAndCleanupSuperVersion() when it is no longer needed.
|
|
|
|
SuperVersion* GetAndRefSuperVersion(ColumnFamilyData* cfd);
|
|
|
|
|
|
|
|
// Similar to the previous function but looks up based on a column family id.
|
|
|
|
// nullptr will be returned if this column family no longer exists.
|
|
|
|
// REQUIRED: this function should only be called on the write thread or if the
|
|
|
|
// mutex is held.
|
|
|
|
SuperVersion* GetAndRefSuperVersion(uint32_t column_family_id);
|
|
|
|
|
Pessimistic Transactions
Summary:
Initial implementation of Pessimistic Transactions. This diff contains the api changes discussed in D38913. This diff is pretty large, so let me know if people would prefer to meet up to discuss it.
MyRocks folks: please take a look at the API in include/rocksdb/utilities/transaction[_db].h and let me know if you have any issues.
Also, you'll notice a couple of TODOs in the implementation of RollbackToSavePoint(). After chatting with Siying, I'm going to send out a separate diff for an alternate implementation of this feature that implements the rollback inside of WriteBatch/WriteBatchWithIndex. We can then decide which route is preferable.
Next, I'm planning on doing some perf testing and then integrating this diff into MongoRocks for further testing.
Test Plan: Unit tests, db_bench parallel testing.
Reviewers: igor, rven, sdong, yhchiang, yoshinorim
Reviewed By: sdong
Subscribers: hermanlee4, maykov, spetrunia, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D40869
2015-05-25 17:37:33 -07:00
|
|
|
// Same as above, should called without mutex held and not on write thread.
|
|
|
|
SuperVersion* GetAndRefSuperVersionUnlocked(uint32_t column_family_id);
|
|
|
|
|
2015-05-29 14:36:35 -07:00
|
|
|
// Un-reference the super version and return it to thread local cache if
|
|
|
|
// needed. If it is the last reference of the super version. Clean it up
|
|
|
|
// after un-referencing it.
|
|
|
|
void ReturnAndCleanupSuperVersion(ColumnFamilyData* cfd, SuperVersion* sv);
|
|
|
|
|
|
|
|
// Similar to the previous function but looks up based on a column family id.
|
|
|
|
// nullptr will be returned if this column family no longer exists.
|
|
|
|
// REQUIRED: this function should only be called on the write thread.
|
|
|
|
void ReturnAndCleanupSuperVersion(uint32_t colun_family_id, SuperVersion* sv);
|
|
|
|
|
Pessimistic Transactions
Summary:
Initial implementation of Pessimistic Transactions. This diff contains the api changes discussed in D38913. This diff is pretty large, so let me know if people would prefer to meet up to discuss it.
MyRocks folks: please take a look at the API in include/rocksdb/utilities/transaction[_db].h and let me know if you have any issues.
Also, you'll notice a couple of TODOs in the implementation of RollbackToSavePoint(). After chatting with Siying, I'm going to send out a separate diff for an alternate implementation of this feature that implements the rollback inside of WriteBatch/WriteBatchWithIndex. We can then decide which route is preferable.
Next, I'm planning on doing some perf testing and then integrating this diff into MongoRocks for further testing.
Test Plan: Unit tests, db_bench parallel testing.
Reviewers: igor, rven, sdong, yhchiang, yoshinorim
Reviewed By: sdong
Subscribers: hermanlee4, maykov, spetrunia, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D40869
2015-05-25 17:37:33 -07:00
|
|
|
// Same as above, should called without mutex held and not on write thread.
|
|
|
|
void ReturnAndCleanupSuperVersionUnlocked(uint32_t colun_family_id,
|
|
|
|
SuperVersion* sv);
|
|
|
|
|
2015-05-29 14:36:35 -07:00
|
|
|
// REQUIRED: this function should only be called on the write thread or if the
|
|
|
|
// mutex is held. Return value only valid until next call to this function or
|
|
|
|
// mutex is released.
|
|
|
|
ColumnFamilyHandle* GetColumnFamilyHandle(uint32_t column_family_id);
|
|
|
|
|
Pessimistic Transactions
Summary:
Initial implementation of Pessimistic Transactions. This diff contains the api changes discussed in D38913. This diff is pretty large, so let me know if people would prefer to meet up to discuss it.
MyRocks folks: please take a look at the API in include/rocksdb/utilities/transaction[_db].h and let me know if you have any issues.
Also, you'll notice a couple of TODOs in the implementation of RollbackToSavePoint(). After chatting with Siying, I'm going to send out a separate diff for an alternate implementation of this feature that implements the rollback inside of WriteBatch/WriteBatchWithIndex. We can then decide which route is preferable.
Next, I'm planning on doing some perf testing and then integrating this diff into MongoRocks for further testing.
Test Plan: Unit tests, db_bench parallel testing.
Reviewers: igor, rven, sdong, yhchiang, yoshinorim
Reviewed By: sdong
Subscribers: hermanlee4, maykov, spetrunia, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D40869
2015-05-25 17:37:33 -07:00
|
|
|
// Same as above, should called without mutex held and not on write thread.
|
|
|
|
ColumnFamilyHandle* GetColumnFamilyHandleUnlocked(uint32_t column_family_id);
|
|
|
|
|
2015-10-17 00:16:36 -07:00
|
|
|
// Returns the number of currently running flushes.
|
|
|
|
// REQUIREMENT: mutex_ must be held when calling this function.
|
|
|
|
int num_running_flushes() {
|
|
|
|
mutex_.AssertHeld();
|
|
|
|
return num_running_flushes_;
|
|
|
|
}
|
|
|
|
|
|
|
|
// Returns the number of currently running compactions.
|
|
|
|
// REQUIREMENT: mutex_ must be held when calling this function.
|
|
|
|
int num_running_compactions() {
|
|
|
|
mutex_.AssertHeld();
|
|
|
|
return num_running_compactions_;
|
|
|
|
}
|
|
|
|
|
2012-12-18 13:05:39 -08:00
|
|
|
protected:
|
2012-11-05 19:18:49 -08:00
|
|
|
Env* const env_;
|
|
|
|
const std::string dbname_;
|
2013-01-20 02:07:13 -08:00
|
|
|
unique_ptr<VersionSet> versions_;
|
2014-09-05 11:48:17 -07:00
|
|
|
const DBOptions db_options_;
|
2014-07-28 12:05:36 -07:00
|
|
|
Statistics* stats_;
|
2012-11-05 19:18:49 -08:00
|
|
|
|
2015-10-12 15:06:38 -07:00
|
|
|
InternalIterator* NewInternalIterator(const ReadOptions&,
|
|
|
|
ColumnFamilyData* cfd,
|
|
|
|
SuperVersion* super_version,
|
|
|
|
Arena* arena);
|
2013-02-15 15:28:24 -08:00
|
|
|
|
2015-12-08 17:01:02 -08:00
|
|
|
// Except in DB::Open(), WriteOptionsFile can only be called when:
|
|
|
|
// 1. WriteThread::Writer::EnterUnbatched() is used.
|
|
|
|
// 2. db_mutex is held
|
2015-11-10 22:58:01 -08:00
|
|
|
Status WriteOptionsFile();
|
2015-12-08 17:01:02 -08:00
|
|
|
|
|
|
|
// The following two functions can only be called when:
|
|
|
|
// 1. WriteThread::Writer::EnterUnbatched() is used.
|
|
|
|
// 2. db_mutex is NOT held
|
2015-11-10 22:58:01 -08:00
|
|
|
Status RenameTempFileToOptionsFile(const std::string& file_name);
|
|
|
|
Status DeleteObsoleteOptionsFiles();
|
|
|
|
|
2015-06-11 15:22:22 -07:00
|
|
|
void NotifyOnFlushCompleted(ColumnFamilyData* cfd, FileMetaData* file_meta,
|
2015-06-05 12:28:51 -07:00
|
|
|
const MutableCFOptions& mutable_cf_options,
|
2015-09-15 09:03:08 -07:00
|
|
|
int job_id, TableProperties prop);
|
CompactFiles, EventListener and GetDatabaseMetaData
Summary:
This diff adds three sets of APIs to RocksDB.
= GetColumnFamilyMetaData =
* This APIs allow users to obtain the current state of a RocksDB instance on one column family.
* See GetColumnFamilyMetaData in include/rocksdb/db.h
= EventListener =
* A virtual class that allows users to implement a set of
call-back functions which will be called when specific
events of a RocksDB instance happens.
* To register EventListener, simply insert an EventListener to ColumnFamilyOptions::listeners
= CompactFiles =
* CompactFiles API inputs a set of file numbers and an output level, and RocksDB
will try to compact those files into the specified level.
= Example =
* Example code can be found in example/compact_files_example.cc, which implements
a simple external compactor using EventListener, GetColumnFamilyMetaData, and
CompactFiles API.
Test Plan:
listener_test
compactor_test
example/compact_files_example
export ROCKSDB_TESTS=CompactFiles
db_test
export ROCKSDB_TESTS=MetaData
db_test
Reviewers: ljin, igor, rven, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D24705
2014-11-07 14:45:18 -08:00
|
|
|
|
2015-01-27 14:44:02 -08:00
|
|
|
void NotifyOnCompactionCompleted(ColumnFamilyData* cfd,
|
2015-06-02 17:07:16 -07:00
|
|
|
Compaction *c, const Status &st,
|
|
|
|
const CompactionJobStats& job_stats,
|
2015-06-02 17:35:39 -07:00
|
|
|
int job_id);
|
2015-01-27 14:44:02 -08:00
|
|
|
|
2014-11-20 10:49:32 -08:00
|
|
|
void NewThreadStatusCfInfo(ColumnFamilyData* cfd) const;
|
|
|
|
|
|
|
|
void EraseThreadStatusCfInfo(ColumnFamilyData* cfd) const;
|
|
|
|
|
|
|
|
void EraseThreadStatusDbInfo() const;
|
|
|
|
|
2015-05-29 14:36:35 -07:00
|
|
|
Status WriteImpl(const WriteOptions& options, WriteBatch* updates,
|
|
|
|
WriteCallback* callback);
|
|
|
|
|
2011-03-18 22:37:00 +00:00
|
|
|
private:
|
|
|
|
friend class DB;
|
2014-03-27 11:59:37 -07:00
|
|
|
friend class InternalStats;
|
2014-04-15 13:39:26 -07:00
|
|
|
#ifndef ROCKSDB_LITE
|
2014-05-30 14:31:55 -07:00
|
|
|
friend class ForwardIterator;
|
2014-04-15 13:39:26 -07:00
|
|
|
#endif
|
2014-03-08 21:12:13 -08:00
|
|
|
friend struct SuperVersion;
|
2014-09-25 11:14:01 -07:00
|
|
|
friend class CompactedDBImpl;
|
2015-05-29 14:36:35 -07:00
|
|
|
#ifndef NDEBUG
|
|
|
|
friend class XFTransactionWriteHandler;
|
|
|
|
#endif
|
2012-03-08 16:23:21 -08:00
|
|
|
struct CompactionState;
|
2014-09-05 15:20:05 -07:00
|
|
|
|
2014-08-11 22:10:32 -07:00
|
|
|
struct WriteContext;
|
2011-03-18 22:37:00 +00:00
|
|
|
|
|
|
|
Status NewDB();
|
|
|
|
|
|
|
|
// Recover the descriptor from persistent storage. May do a significant
|
|
|
|
// amount of work to recover recently logged updates. Any changes to
|
|
|
|
// be made to the descriptor are added to *edit.
|
2014-01-22 10:59:07 -08:00
|
|
|
Status Recover(const std::vector<ColumnFamilyDescriptor>& column_families,
|
|
|
|
bool read_only = false, bool error_if_log_file_exist = false);
|
2011-03-18 22:37:00 +00:00
|
|
|
|
|
|
|
void MaybeIgnoreError(Status* s) const;
|
|
|
|
|
2012-11-26 13:56:45 -08:00
|
|
|
const Status CreateArchivalDirectory();
|
|
|
|
|
2011-03-18 22:37:00 +00:00
|
|
|
// Delete any unneeded files and stale in-memory entries.
|
|
|
|
void DeleteObsoleteFiles();
|
|
|
|
|
2014-11-07 11:50:34 -08:00
|
|
|
// Background process needs to call
|
|
|
|
// auto x = CaptureCurrentFileNumberInPendingOutputs()
|
2016-03-03 18:25:07 -08:00
|
|
|
// auto file_num = versions_->NewFileNumber();
|
2014-11-07 11:50:34 -08:00
|
|
|
// <do something>
|
|
|
|
// ReleaseFileNumberFromPendingOutputs(x)
|
2016-03-03 18:25:07 -08:00
|
|
|
// This will protect any file with number `file_num` or greater from being
|
|
|
|
// deleted while <do something> is running.
|
2014-11-07 11:50:34 -08:00
|
|
|
// -----------
|
|
|
|
// This function will capture current file number and append it to
|
|
|
|
// pending_outputs_. This will prevent any background process to delete any
|
|
|
|
// file created after this point.
|
|
|
|
std::list<uint64_t>::iterator CaptureCurrentFileNumberInPendingOutputs();
|
|
|
|
// This function should be called with the result of
|
|
|
|
// CaptureCurrentFileNumberInPendingOutputs(). It then marks that any file
|
|
|
|
// created between the calls CaptureCurrentFileNumberInPendingOutputs() and
|
|
|
|
// ReleaseFileNumberFromPendingOutputs() can now be deleted (if it's not live
|
|
|
|
// and blocked by any other pending_outputs_ calls)
|
|
|
|
void ReleaseFileNumberFromPendingOutputs(std::list<uint64_t>::iterator v);
|
|
|
|
|
2013-10-14 15:12:15 -07:00
|
|
|
// Flush the in-memory write buffer to storage. Switches to a new
|
2011-03-18 22:37:00 +00:00
|
|
|
// log-file/memtable and writes a new descriptor iff successful.
|
2014-10-28 11:54:33 -07:00
|
|
|
Status FlushMemTableToOutputFile(ColumnFamilyData* cfd,
|
|
|
|
const MutableCFOptions& mutable_cf_options,
|
|
|
|
bool* madeProgress, JobContext* job_context,
|
|
|
|
LogBuffer* log_buffer);
|
2011-03-18 22:37:00 +00:00
|
|
|
|
2014-09-09 11:18:50 -07:00
|
|
|
// REQUIRES: log_numbers are sorted in ascending order
|
|
|
|
Status RecoverLogFiles(const std::vector<uint64_t>& log_numbers,
|
|
|
|
SequenceNumber* max_sequence, bool read_only);
|
2011-03-18 22:37:00 +00:00
|
|
|
|
2012-10-19 14:00:53 -07:00
|
|
|
// The following two methods are used to flush a memtable to
|
2016-03-24 19:39:13 -07:00
|
|
|
// storage. The first one is used at database RecoveryTime (when the
|
2012-10-19 14:00:53 -07:00
|
|
|
// database is opened) and is heavyweight because it holds the mutex
|
|
|
|
// for the entire period. The second method WriteLevel0Table supports
|
|
|
|
// concurrent flush memtables to storage.
|
Include bunch of more events into EventLogger
Summary:
Added these events:
* Recovery start, finish and also when recovery creates a file
* Trivial move
* Compaction start, finish and when compaction creates a file
* Flush start, finish
Also includes small fix to EventLogger
Also added option ROCKSDB_PRINT_EVENTS_TO_STDOUT which is useful when we debug things. I've spent far too much time chasing LOG files.
Still didn't get sst table properties in JSON. They are written very deeply into the stack. I'll address in separate diff.
TODO:
* Write specification. Let's first use this for a while and figure out what's good data to put here, too. After that we'll write spec
* Write tools that parse and analyze LOGs. This can be in python or go. Good intern task.
Test Plan: Ran db_bench with ROCKSDB_PRINT_EVENTS_TO_STDOUT. Here's the output: https://phabricator.fb.com/P19811976
Reviewers: sdong, yhchiang, rven, MarkCallaghan, kradhakrishnan, anthony
Reviewed By: anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D37521
2015-04-27 15:20:02 -07:00
|
|
|
Status WriteLevel0TableForRecovery(int job_id, ColumnFamilyData* cfd,
|
|
|
|
MemTable* mem, VersionEdit* edit);
|
2015-05-15 15:52:51 -07:00
|
|
|
|
|
|
|
// num_bytes: for slowdown case, delay time is calculated based on
|
|
|
|
// `num_bytes` going through.
|
Deprecate WriteOptions::timeout_hint_us
Summary:
In one of our recent meetings, we discussed deprecating features that are not being actively used. One of those features, at least within Facebook, is timeout_hint. The feature is really nicely implemented, but if nobody needs it, we should remove it from our code-base (until we get a valid use-case). Some arguments:
* Less code == better icache hit rate, smaller builds, simpler code
* The motivation for adding timeout_hint_us was to work-around RocksDB's stall issue. However, we're currently addressing the stall issue itself (see @sdong's recent work on stall write_rate), so we should never see sharp lock-ups in the future.
* Nobody is using the feature within Facebook's code-base. Googling for `timeout_hint_us` also doesn't yield any users.
Test Plan: make check
Reviewers: anthony, kradhakrishnan, sdong, yhchiang
Reviewed By: yhchiang
Subscribers: sdong, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D41937
2015-07-14 09:35:48 +02:00
|
|
|
Status DelayWrite(uint64_t num_bytes);
|
Push- instead of pull-model for managing Write stalls
Summary:
Introducing WriteController, which is a source of truth about per-DB write delays. Let's define an DB epoch as a period where there are no flushes and compactions (i.e. new epoch is started when flush or compaction finishes). Each epoch can either:
* proceed with all writes without delay
* delay all writes by fixed time
* stop all writes
The three modes are recomputed at each epoch change (flush, compaction), rather than on every write (which is currently the case).
When we have a lot of column families, our current pull behavior adds a big overhead, since we need to loop over every column family for every write. With new push model, overhead on Write code-path is minimal.
This is just the start. Next step is to also take care of stalls introduced by slow memtable flushes. The final goal is to eliminate function MakeRoomForWrite(), which currently needs to be called for every column family by every write.
Test Plan: make check for now. I'll add some unit tests later. Also, perf test.
Reviewers: dhruba, yhchiang, MarkCallaghan, sdong, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22791
2014-09-08 11:20:25 -07:00
|
|
|
|
2014-09-10 18:46:09 -07:00
|
|
|
Status ScheduleFlushes(WriteContext* context);
|
2014-04-07 11:29:48 -07:00
|
|
|
|
2015-06-17 12:37:59 -07:00
|
|
|
Status SwitchMemtable(ColumnFamilyData* cfd, WriteContext* context);
|
2014-08-11 22:10:32 -07:00
|
|
|
|
2012-07-06 11:42:09 -07:00
|
|
|
// Force current memtable contents to be flushed.
|
2014-01-30 17:48:42 -08:00
|
|
|
Status FlushMemTable(ColumnFamilyData* cfd, const FlushOptions& options);
|
2012-07-06 11:42:09 -07:00
|
|
|
|
2013-10-14 15:12:15 -07:00
|
|
|
// Wait for memtable flushed
|
2014-01-30 17:48:42 -08:00
|
|
|
Status WaitForFlushMemTable(ColumnFamilyData* cfd);
|
2012-07-06 11:42:09 -07:00
|
|
|
|
2014-07-03 16:28:03 -07:00
|
|
|
void RecordFlushIOStats();
|
|
|
|
void RecordCompactionIOStats();
|
|
|
|
|
2014-11-13 16:45:33 -05:00
|
|
|
#ifndef ROCKSDB_LITE
|
CompactFiles, EventListener and GetDatabaseMetaData
Summary:
This diff adds three sets of APIs to RocksDB.
= GetColumnFamilyMetaData =
* This APIs allow users to obtain the current state of a RocksDB instance on one column family.
* See GetColumnFamilyMetaData in include/rocksdb/db.h
= EventListener =
* A virtual class that allows users to implement a set of
call-back functions which will be called when specific
events of a RocksDB instance happens.
* To register EventListener, simply insert an EventListener to ColumnFamilyOptions::listeners
= CompactFiles =
* CompactFiles API inputs a set of file numbers and an output level, and RocksDB
will try to compact those files into the specified level.
= Example =
* Example code can be found in example/compact_files_example.cc, which implements
a simple external compactor using EventListener, GetColumnFamilyMetaData, and
CompactFiles API.
Test Plan:
listener_test
compactor_test
example/compact_files_example
export ROCKSDB_TESTS=CompactFiles
db_test
export ROCKSDB_TESTS=MetaData
db_test
Reviewers: ljin, igor, rven, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D24705
2014-11-07 14:45:18 -08:00
|
|
|
Status CompactFilesImpl(
|
|
|
|
const CompactionOptions& compact_options, ColumnFamilyData* cfd,
|
|
|
|
Version* version, const std::vector<std::string>& input_file_names,
|
2015-03-11 13:06:59 -07:00
|
|
|
const int output_level, int output_path_id, JobContext* job_context,
|
|
|
|
LogBuffer* log_buffer);
|
2014-11-13 16:45:33 -05:00
|
|
|
#endif // ROCKSDB_LITE
|
CompactFiles, EventListener and GetDatabaseMetaData
Summary:
This diff adds three sets of APIs to RocksDB.
= GetColumnFamilyMetaData =
* This APIs allow users to obtain the current state of a RocksDB instance on one column family.
* See GetColumnFamilyMetaData in include/rocksdb/db.h
= EventListener =
* A virtual class that allows users to implement a set of
call-back functions which will be called when specific
events of a RocksDB instance happens.
* To register EventListener, simply insert an EventListener to ColumnFamilyOptions::listeners
= CompactFiles =
* CompactFiles API inputs a set of file numbers and an output level, and RocksDB
will try to compact those files into the specified level.
= Example =
* Example code can be found in example/compact_files_example.cc, which implements
a simple external compactor using EventListener, GetColumnFamilyMetaData, and
CompactFiles API.
Test Plan:
listener_test
compactor_test
example/compact_files_example
export ROCKSDB_TESTS=CompactFiles
db_test
export ROCKSDB_TESTS=MetaData
db_test
Reviewers: ljin, igor, rven, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D24705
2014-11-07 14:45:18 -08:00
|
|
|
|
|
|
|
ColumnFamilyData* GetColumnFamilyDataByName(const std::string& cf_name);
|
|
|
|
|
2013-10-14 15:12:15 -07:00
|
|
|
void MaybeScheduleFlushOrCompaction();
|
Rewritten system for scheduling background work
Summary:
When scaling to higher number of column families, the worst bottleneck was MaybeScheduleFlushOrCompaction(), which did a for loop over all column families while holding a mutex. This patch addresses the issue.
The approach is similar to our earlier efforts: instead of a pull-model, where we do something for every column family, we can do a push-based model -- when we detect that column family is ready to be flushed/compacted, we add it to the flush_queue_/compaction_queue_. That way we don't need to loop over every column family in MaybeScheduleFlushOrCompaction.
Here are the performance results:
Command:
./db_bench --write_buffer_size=268435456 --db_write_buffer_size=268435456 --db=/fast-rocksdb-tmp/rocks_lots_of_cf --use_existing_db=0 --open_files=55000 --statistics=1 --histogram=1 --disable_data_sync=1 --max_write_buffer_number=2 --sync=0 --benchmarks=fillrandom --threads=16 --num_column_families=5000 --disable_wal=1 --max_background_flushes=16 --max_background_compactions=16 --level0_file_num_compaction_trigger=2 --level0_slowdown_writes_trigger=2 --level0_stop_writes_trigger=3 --hard_rate_limit=1 --num=33333333 --writes=33333333
Before the patch:
fillrandom : 26.950 micros/op 37105 ops/sec; 4.1 MB/s
After the patch:
fillrandom : 17.404 micros/op 57456 ops/sec; 6.4 MB/s
Next bottleneck is VersionSet::AddLiveFiles, which is painfully slow when we have a lot of files. This is coming in the next patch, but when I removed that code, here's what I got:
fillrandom : 7.590 micros/op 131758 ops/sec; 14.6 MB/s
Test Plan:
make check
two stress tests:
Big number of compactions and flushes:
./db_stress --threads=30 --ops_per_thread=20000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000
max_background_flushes=0, to verify that this case also works correctly
./db_stress --threads=30 --ops_per_thread=2000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=3 --max_background_compactions=3 --max_background_flushes=0 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000
Reviewers: ljin, rven, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D30123
2014-12-19 20:38:12 +01:00
|
|
|
void SchedulePendingFlush(ColumnFamilyData* cfd);
|
|
|
|
void SchedulePendingCompaction(ColumnFamilyData* cfd);
|
Running manual compactions in parallel with other automatic or manual compactions in restricted cases
Summary:
This diff provides a framework for doing manual
compactions in parallel with other compactions. We now have a deque of manual compactions. We also pass manual compactions as an argument from RunManualCompactions down to
BackgroundCompactions, so that RunManualCompactions can be reentrant.
Parallelism is controlled by the two routines
ConflictingManualCompaction to allow/disallow new parallel/manual
compactions based on already existing ManualCompactions. In this diff, by default manual compactions still have to run exclusive of other compactions. However, by setting the compaction option, exclusive_manual_compaction to false, it is possible to run other compactions in parallel with a manual compaction. However, we are still restricted to one manual compaction per column family at a time. All of these restrictions will be relaxed in future diffs.
I will be adding more tests later.
Test Plan: Rocksdb regression + new tests + valgrind
Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong
Reviewed By: sdong
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47973
2015-12-14 11:20:34 -08:00
|
|
|
static void BGWorkCompaction(void* arg);
|
2013-09-13 14:38:37 -07:00
|
|
|
static void BGWorkFlush(void* db);
|
Running manual compactions in parallel with other automatic or manual compactions in restricted cases
Summary:
This diff provides a framework for doing manual
compactions in parallel with other compactions. We now have a deque of manual compactions. We also pass manual compactions as an argument from RunManualCompactions down to
BackgroundCompactions, so that RunManualCompactions can be reentrant.
Parallelism is controlled by the two routines
ConflictingManualCompaction to allow/disallow new parallel/manual
compactions based on already existing ManualCompactions. In this diff, by default manual compactions still have to run exclusive of other compactions. However, by setting the compaction option, exclusive_manual_compaction to false, it is possible to run other compactions in parallel with a manual compaction. However, we are still restricted to one manual compaction per column family at a time. All of these restrictions will be relaxed in future diffs.
I will be adding more tests later.
Test Plan: Rocksdb regression + new tests + valgrind
Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong
Reviewed By: sdong
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47973
2015-12-14 11:20:34 -08:00
|
|
|
static void UnscheduleCallback(void* arg);
|
|
|
|
void BackgroundCallCompaction(void* arg);
|
2013-09-13 14:38:37 -07:00
|
|
|
void BackgroundCallFlush();
|
2014-10-28 11:54:33 -07:00
|
|
|
Status BackgroundCompaction(bool* madeProgress, JobContext* job_context,
|
Running manual compactions in parallel with other automatic or manual compactions in restricted cases
Summary:
This diff provides a framework for doing manual
compactions in parallel with other compactions. We now have a deque of manual compactions. We also pass manual compactions as an argument from RunManualCompactions down to
BackgroundCompactions, so that RunManualCompactions can be reentrant.
Parallelism is controlled by the two routines
ConflictingManualCompaction to allow/disallow new parallel/manual
compactions based on already existing ManualCompactions. In this diff, by default manual compactions still have to run exclusive of other compactions. However, by setting the compaction option, exclusive_manual_compaction to false, it is possible to run other compactions in parallel with a manual compaction. However, we are still restricted to one manual compaction per column family at a time. All of these restrictions will be relaxed in future diffs.
I will be adding more tests later.
Test Plan: Rocksdb regression + new tests + valgrind
Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong
Reviewed By: sdong
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47973
2015-12-14 11:20:34 -08:00
|
|
|
LogBuffer* log_buffer, void* m = 0);
|
2014-10-28 11:54:33 -07:00
|
|
|
Status BackgroundFlush(bool* madeProgress, JobContext* job_context,
|
2014-03-09 22:01:13 -07:00
|
|
|
LogBuffer* log_buffer);
|
2011-03-18 22:37:00 +00:00
|
|
|
|
2013-05-28 12:35:43 -07:00
|
|
|
void PrintStatistics();
|
|
|
|
|
2013-10-04 22:32:05 -07:00
|
|
|
// dump rocksdb.stats to LOG
|
2013-05-10 15:21:04 -07:00
|
|
|
void MaybeDumpStats();
|
|
|
|
|
2013-06-29 23:21:36 -07:00
|
|
|
// Return the minimum empty level that could hold the total data in the
|
|
|
|
// input level. Return the input level, if such level could not be found.
|
2014-10-01 16:19:16 -07:00
|
|
|
int FindMinimumEmptyLevelFitting(ColumnFamilyData* cfd,
|
|
|
|
const MutableCFOptions& mutable_cf_options, int level);
|
2013-06-29 23:21:36 -07:00
|
|
|
|
2013-09-04 13:13:08 -07:00
|
|
|
// Move the files in the input level to the target level.
|
|
|
|
// If target_level < 0, automatically calculate the minimum level that could
|
|
|
|
// hold the data set.
|
2014-01-31 16:45:20 -08:00
|
|
|
Status ReFitLevel(ColumnFamilyData* cfd, int level, int target_level = -1);
|
2013-06-29 23:21:36 -07:00
|
|
|
|
Rewritten system for scheduling background work
Summary:
When scaling to higher number of column families, the worst bottleneck was MaybeScheduleFlushOrCompaction(), which did a for loop over all column families while holding a mutex. This patch addresses the issue.
The approach is similar to our earlier efforts: instead of a pull-model, where we do something for every column family, we can do a push-based model -- when we detect that column family is ready to be flushed/compacted, we add it to the flush_queue_/compaction_queue_. That way we don't need to loop over every column family in MaybeScheduleFlushOrCompaction.
Here are the performance results:
Command:
./db_bench --write_buffer_size=268435456 --db_write_buffer_size=268435456 --db=/fast-rocksdb-tmp/rocks_lots_of_cf --use_existing_db=0 --open_files=55000 --statistics=1 --histogram=1 --disable_data_sync=1 --max_write_buffer_number=2 --sync=0 --benchmarks=fillrandom --threads=16 --num_column_families=5000 --disable_wal=1 --max_background_flushes=16 --max_background_compactions=16 --level0_file_num_compaction_trigger=2 --level0_slowdown_writes_trigger=2 --level0_stop_writes_trigger=3 --hard_rate_limit=1 --num=33333333 --writes=33333333
Before the patch:
fillrandom : 26.950 micros/op 37105 ops/sec; 4.1 MB/s
After the patch:
fillrandom : 17.404 micros/op 57456 ops/sec; 6.4 MB/s
Next bottleneck is VersionSet::AddLiveFiles, which is painfully slow when we have a lot of files. This is coming in the next patch, but when I removed that code, here's what I got:
fillrandom : 7.590 micros/op 131758 ops/sec; 14.6 MB/s
Test Plan:
make check
two stress tests:
Big number of compactions and flushes:
./db_stress --threads=30 --ops_per_thread=20000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000
max_background_flushes=0, to verify that this case also works correctly
./db_stress --threads=30 --ops_per_thread=2000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=3 --max_background_compactions=3 --max_background_flushes=0 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000
Reviewers: ljin, rven, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D30123
2014-12-19 20:38:12 +01:00
|
|
|
// helper functions for adding and removing from flush & compaction queues
|
|
|
|
void AddToCompactionQueue(ColumnFamilyData* cfd);
|
|
|
|
ColumnFamilyData* PopFirstFromCompactionQueue();
|
|
|
|
void AddToFlushQueue(ColumnFamilyData* cfd);
|
|
|
|
ColumnFamilyData* PopFirstFromFlushQueue();
|
|
|
|
|
[wal changes 3/3] method in DB to sync WAL without blocking writers
Summary:
Subj. We really need this feature.
Previous diff D40899 has most of the changes to make this possible, this diff just adds the method.
Test Plan: `make check`, the new test fails without this diff; ran with ASAN, TSAN and valgrind.
Reviewers: igor, rven, IslamAbdelRahman, anthony, kradhakrishnan, tnovak, yhchiang, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, maykov, hermanlee4, yoshinorim, tnovak, dhruba
Differential Revision: https://reviews.facebook.net/D40905
2015-08-05 06:06:39 -07:00
|
|
|
// helper function to call after some of the logs_ were synced
|
|
|
|
void MarkLogsSynced(uint64_t up_to, bool synced_dir, const Status& status);
|
|
|
|
|
2015-12-08 12:25:48 -08:00
|
|
|
const Snapshot* GetSnapshotImpl(bool is_write_conflict_boundary);
|
|
|
|
|
2011-03-18 22:37:00 +00:00
|
|
|
// table_cache_ provides its own synchronization
|
[CF] Rethink table cache
Summary:
Adapting table cache to column families is interesting. We want table cache to be global LRU, so if some column families are use not as often as others, we want them to be evicted from cache. However, current TableCache object also constructs tables on its own. If table is not found in the cache, TableCache automatically creates new table. We want each column family to be able to specify different table factory.
To solve the problem, we still have a single LRU, but we provide the LRUCache object to TableCache on construction. We have one TableCache per column family, but the underyling cache is shared by all TableCache objects.
This allows us to have a global LRU, but still be able to support different table factories for different column families. Also, in the future it will also be able to support different directories for different column families.
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15915
2014-02-05 09:07:55 -08:00
|
|
|
std::shared_ptr<Cache> table_cache_;
|
2011-03-18 22:37:00 +00:00
|
|
|
|
2013-02-15 15:28:24 -08:00
|
|
|
// Lock over the persistent DB state. Non-nullptr iff successfully acquired.
|
2011-03-18 22:37:00 +00:00
|
|
|
FileLock* db_lock_;
|
|
|
|
|
2015-11-10 22:58:01 -08:00
|
|
|
// The mutex for options file related operations.
|
|
|
|
// NOTE: should never acquire options_file_mutex_ and mutex_ at the
|
|
|
|
// same time.
|
|
|
|
InstrumentedMutex options_files_mutex_;
|
2011-03-18 22:37:00 +00:00
|
|
|
// State below is protected by mutex_
|
2015-02-04 21:39:45 -08:00
|
|
|
InstrumentedMutex mutex_;
|
2015-11-10 22:58:01 -08:00
|
|
|
|
2014-10-27 14:50:21 -07:00
|
|
|
std::atomic<bool> shutting_down_;
|
2014-06-02 17:23:55 -07:00
|
|
|
// This condition variable is signaled on these conditions:
|
|
|
|
// * whenever bg_compaction_scheduled_ goes down to 0
|
Running manual compactions in parallel with other automatic or manual compactions in restricted cases
Summary:
This diff provides a framework for doing manual
compactions in parallel with other compactions. We now have a deque of manual compactions. We also pass manual compactions as an argument from RunManualCompactions down to
BackgroundCompactions, so that RunManualCompactions can be reentrant.
Parallelism is controlled by the two routines
ConflictingManualCompaction to allow/disallow new parallel/manual
compactions based on already existing ManualCompactions. In this diff, by default manual compactions still have to run exclusive of other compactions. However, by setting the compaction option, exclusive_manual_compaction to false, it is possible to run other compactions in parallel with a manual compaction. However, we are still restricted to one manual compaction per column family at a time. All of these restrictions will be relaxed in future diffs.
I will be adding more tests later.
Test Plan: Rocksdb regression + new tests + valgrind
Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong
Reviewed By: sdong
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47973
2015-12-14 11:20:34 -08:00
|
|
|
// * if AnyManualCompaction, whenever a compaction finishes, even if it hasn't
|
2014-06-02 17:23:55 -07:00
|
|
|
// made any progress
|
|
|
|
// * whenever a compaction made any progress
|
|
|
|
// * whenever bg_flush_scheduled_ value decreases (i.e. whenever a flush is
|
|
|
|
// done, even if it didn't make any progress)
|
|
|
|
// * whenever there is an error in background flush or compaction
|
2015-02-04 21:39:45 -08:00
|
|
|
InstrumentedCondVar bg_cv_;
|
2011-06-22 02:36:45 +00:00
|
|
|
uint64_t logfile_number_;
|
2015-10-01 09:12:04 -04:00
|
|
|
std::deque<uint64_t>
|
|
|
|
log_recycle_files; // a list of log files that we can recycle
|
2015-01-26 15:48:59 -08:00
|
|
|
bool log_dir_synced_;
|
2014-04-15 09:57:25 -07:00
|
|
|
bool log_empty_;
|
2014-02-10 17:04:44 -08:00
|
|
|
ColumnFamilyHandleImpl* default_cf_handle_;
|
make internal stats independent of statistics
Summary:
also make it aware of column family
output from db_bench
```
** Compaction Stats [default] **
Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) RW-Amp W-Amp Rd(MB/s) Wr(MB/s) Rn(cnt) Rnp1(cnt) Wnp1(cnt) Wnew(cnt) Comp(sec) Comp(cnt) Avg(sec) Stall(sec) Stall(cnt) Avg(ms)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
L0 14 956 0.9 0.0 0.0 0.0 2.7 2.7 0.0 0.0 0.0 111.6 0 0 0 0 24 40 0.612 75.20 492387 0.15
L1 21 2001 2.0 5.7 2.0 3.7 5.3 1.6 5.4 2.6 71.2 65.7 31 43 55 12 82 2 41.242 43.72 41183 1.06
L2 217 18974 1.9 16.5 2.0 14.4 15.1 0.7 15.6 7.4 70.1 64.3 17 182 185 3 241 16 15.052 0.00 0 0.00
L3 1641 188245 1.8 9.1 1.1 8.0 8.5 0.5 15.4 7.4 61.3 57.2 9 75 76 1 152 9 16.887 0.00 0 0.00
L4 4447 449025 0.4 13.4 4.8 8.6 9.1 0.5 4.7 1.9 77.8 52.7 38 79 100 21 176 38 4.639 0.00 0 0.00
Sum 6340 659201 0.0 44.7 10.0 34.7 40.6 6.0 32.0 15.2 67.7 61.6 95 379 416 37 676 105 6.439 118.91 533570 0.22
Int 0 0 0.0 1.2 0.4 0.8 1.3 0.5 5.2 2.7 59.1 65.6 3 7 9 2 20 10 2.003 0.00 0 0.00
Stalls(secs): 75.197 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 43.717 leveln_slowdown
Stalls(count): 492387 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 41183 leveln_slowdown
** DB Stats **
Uptime(secs): 202.1 total, 13.5 interval
Cumulative writes: 6291456 writes, 6291456 batches, 1.0 writes per batch, 4.90 ingest GB
Cumulative WAL: 6291456 writes, 6291456 syncs, 1.00 writes per sync, 4.90 GB written
Interval writes: 1048576 writes, 1048576 batches, 1.0 writes per batch, 836.0 ingest MB
Interval WAL: 1048576 writes, 1048576 syncs, 1.00 writes per sync, 0.82 MB written
Test Plan: ran it
Reviewers: sdong, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19917
2014-07-21 12:57:29 -07:00
|
|
|
InternalStats* default_cf_internal_stats_;
|
2014-01-28 11:05:04 -08:00
|
|
|
unique_ptr<ColumnFamilyMemTablesImpl> column_family_memtables_;
|
2014-04-30 14:33:40 -04:00
|
|
|
struct LogFileNumberSize {
|
|
|
|
explicit LogFileNumberSize(uint64_t _number)
|
[wal changes 2/3] write with sync=true syncs previous unsynced wals to prevent illegal data loss
Summary:
I'll just copy internal task summary here:
"
This sequence will cause data loss in the middle after an sync write:
non-sync write key 1
flush triggered, not yet scheduled
sync write key 2
system crash
After rebooting, users might see key 2 but not key 1, which violates the API of sync write.
This can be reproduced using unit test FaultInjectionTest::DISABLED_WriteOptionSyncTest.
One way to fix it is for a sync write, if there is outstanding unsynced log files, we need to syc them too.
"
This diff should be considered together with the next diff D40905; in isolation this fix probably could be a little simpler.
Test Plan: `make check`; added a test for that (DBTest.SyncingPreviousLogs) before noticing FaultInjectionTest.WriteOptionSyncTest (keeping both since mine asserts a bit more); both tests fail without this diff; for D40905 stacked on top of this diff, ran tests with ASAN, TSAN and valgrind
Reviewers: rven, yhchiang, IslamAbdelRahman, anthony, kradhakrishnan, igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D40899
2015-07-22 03:28:08 -07:00
|
|
|
: number(_number) {}
|
2014-04-30 14:33:40 -04:00
|
|
|
void AddSize(uint64_t new_size) { size += new_size; }
|
|
|
|
uint64_t number;
|
[wal changes 2/3] write with sync=true syncs previous unsynced wals to prevent illegal data loss
Summary:
I'll just copy internal task summary here:
"
This sequence will cause data loss in the middle after an sync write:
non-sync write key 1
flush triggered, not yet scheduled
sync write key 2
system crash
After rebooting, users might see key 2 but not key 1, which violates the API of sync write.
This can be reproduced using unit test FaultInjectionTest::DISABLED_WriteOptionSyncTest.
One way to fix it is for a sync write, if there is outstanding unsynced log files, we need to syc them too.
"
This diff should be considered together with the next diff D40905; in isolation this fix probably could be a little simpler.
Test Plan: `make check`; added a test for that (DBTest.SyncingPreviousLogs) before noticing FaultInjectionTest.WriteOptionSyncTest (keeping both since mine asserts a bit more); both tests fail without this diff; for D40905 stacked on top of this diff, ran tests with ASAN, TSAN and valgrind
Reviewers: rven, yhchiang, IslamAbdelRahman, anthony, kradhakrishnan, igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D40899
2015-07-22 03:28:08 -07:00
|
|
|
uint64_t size = 0;
|
|
|
|
bool getting_flushed = false;
|
|
|
|
};
|
|
|
|
struct LogWriterNumber {
|
2015-08-05 20:51:27 -07:00
|
|
|
// pass ownership of _writer
|
|
|
|
LogWriterNumber(uint64_t _number, log::Writer* _writer)
|
|
|
|
: number(_number), writer(_writer) {}
|
|
|
|
|
|
|
|
log::Writer* ReleaseWriter() {
|
|
|
|
auto* w = writer;
|
|
|
|
writer = nullptr;
|
|
|
|
return w;
|
|
|
|
}
|
|
|
|
void ClearWriter() {
|
|
|
|
delete writer;
|
|
|
|
writer = nullptr;
|
|
|
|
}
|
|
|
|
|
[wal changes 2/3] write with sync=true syncs previous unsynced wals to prevent illegal data loss
Summary:
I'll just copy internal task summary here:
"
This sequence will cause data loss in the middle after an sync write:
non-sync write key 1
flush triggered, not yet scheduled
sync write key 2
system crash
After rebooting, users might see key 2 but not key 1, which violates the API of sync write.
This can be reproduced using unit test FaultInjectionTest::DISABLED_WriteOptionSyncTest.
One way to fix it is for a sync write, if there is outstanding unsynced log files, we need to syc them too.
"
This diff should be considered together with the next diff D40905; in isolation this fix probably could be a little simpler.
Test Plan: `make check`; added a test for that (DBTest.SyncingPreviousLogs) before noticing FaultInjectionTest.WriteOptionSyncTest (keeping both since mine asserts a bit more); both tests fail without this diff; for D40905 stacked on top of this diff, ran tests with ASAN, TSAN and valgrind
Reviewers: rven, yhchiang, IslamAbdelRahman, anthony, kradhakrishnan, igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D40899
2015-07-22 03:28:08 -07:00
|
|
|
uint64_t number;
|
2015-08-05 20:51:27 -07:00
|
|
|
// Visual Studio doesn't support deque's member to be noncopyable because
|
|
|
|
// of a unique_ptr as a member.
|
|
|
|
log::Writer* writer; // own
|
[wal changes 2/3] write with sync=true syncs previous unsynced wals to prevent illegal data loss
Summary:
I'll just copy internal task summary here:
"
This sequence will cause data loss in the middle after an sync write:
non-sync write key 1
flush triggered, not yet scheduled
sync write key 2
system crash
After rebooting, users might see key 2 but not key 1, which violates the API of sync write.
This can be reproduced using unit test FaultInjectionTest::DISABLED_WriteOptionSyncTest.
One way to fix it is for a sync write, if there is outstanding unsynced log files, we need to syc them too.
"
This diff should be considered together with the next diff D40905; in isolation this fix probably could be a little simpler.
Test Plan: `make check`; added a test for that (DBTest.SyncingPreviousLogs) before noticing FaultInjectionTest.WriteOptionSyncTest (keeping both since mine asserts a bit more); both tests fail without this diff; for D40905 stacked on top of this diff, ran tests with ASAN, TSAN and valgrind
Reviewers: rven, yhchiang, IslamAbdelRahman, anthony, kradhakrishnan, igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D40899
2015-07-22 03:28:08 -07:00
|
|
|
// true for some prefix of logs_
|
|
|
|
bool getting_synced = false;
|
2014-04-30 14:33:40 -04:00
|
|
|
};
|
|
|
|
std::deque<LogFileNumberSize> alive_log_files_;
|
[wal changes 2/3] write with sync=true syncs previous unsynced wals to prevent illegal data loss
Summary:
I'll just copy internal task summary here:
"
This sequence will cause data loss in the middle after an sync write:
non-sync write key 1
flush triggered, not yet scheduled
sync write key 2
system crash
After rebooting, users might see key 2 but not key 1, which violates the API of sync write.
This can be reproduced using unit test FaultInjectionTest::DISABLED_WriteOptionSyncTest.
One way to fix it is for a sync write, if there is outstanding unsynced log files, we need to syc them too.
"
This diff should be considered together with the next diff D40905; in isolation this fix probably could be a little simpler.
Test Plan: `make check`; added a test for that (DBTest.SyncingPreviousLogs) before noticing FaultInjectionTest.WriteOptionSyncTest (keeping both since mine asserts a bit more); both tests fail without this diff; for D40905 stacked on top of this diff, ran tests with ASAN, TSAN and valgrind
Reviewers: rven, yhchiang, IslamAbdelRahman, anthony, kradhakrishnan, igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D40899
2015-07-22 03:28:08 -07:00
|
|
|
// Log files that aren't fully synced, and the current log file.
|
|
|
|
// Synchronization:
|
|
|
|
// - push_back() is done from write thread with locked mutex_,
|
|
|
|
// - pop_front() is done from any thread with locked mutex_,
|
|
|
|
// - back() and items with getting_synced=true are not popped,
|
|
|
|
// - it follows that write thread with unlocked mutex_ can safely access
|
|
|
|
// back() and items with getting_synced=true.
|
|
|
|
std::deque<LogWriterNumber> logs_;
|
|
|
|
// Signaled when getting_synced becomes false for some of the logs_.
|
|
|
|
InstrumentedCondVar log_sync_cv_;
|
2014-04-30 14:33:40 -04:00
|
|
|
uint64_t total_log_size_;
|
|
|
|
// only used for dynamically adjusting max_total_wal_size. it is a sum of
|
|
|
|
// [write_buffer_size * max_write_buffer_number] over all column families
|
|
|
|
uint64_t max_total_in_memory_state_;
|
2014-06-06 17:26:23 -07:00
|
|
|
// If true, we have only one (default) column family. We use this to optimize
|
|
|
|
// some code-paths
|
|
|
|
bool single_column_family_mode_;
|
2015-03-30 15:04:10 -04:00
|
|
|
// If this is non-empty, we need to delete these log files in background
|
|
|
|
// threads. Protected by db mutex.
|
|
|
|
autovector<log::Writer*> logs_to_free_;
|
2013-12-20 09:57:58 -08:00
|
|
|
|
2014-12-10 18:39:09 -08:00
|
|
|
bool is_snapshot_supported_;
|
|
|
|
|
2015-01-26 13:59:38 -08:00
|
|
|
// Class to maintain directories for all database paths other than main one.
|
|
|
|
class Directories {
|
|
|
|
public:
|
|
|
|
Status SetDirectories(Env* env, const std::string& dbname,
|
|
|
|
const std::string& wal_dir,
|
|
|
|
const std::vector<DbPath>& data_paths);
|
|
|
|
|
|
|
|
Directory* GetDataDir(size_t path_id);
|
|
|
|
|
|
|
|
Directory* GetWalDir() {
|
|
|
|
if (wal_dir_) {
|
|
|
|
return wal_dir_.get();
|
|
|
|
}
|
|
|
|
return db_dir_.get();
|
|
|
|
}
|
|
|
|
|
|
|
|
Directory* GetDbDir() { return db_dir_.get(); }
|
|
|
|
|
|
|
|
private:
|
|
|
|
std::unique_ptr<Directory> db_dir_;
|
|
|
|
std::vector<std::unique_ptr<Directory>> data_dirs_;
|
|
|
|
std::unique_ptr<Directory> wal_dir_;
|
|
|
|
|
|
|
|
Status CreateAndNewDirectory(Env* env, const std::string& dirname,
|
|
|
|
std::unique_ptr<Directory>* directory) const;
|
|
|
|
};
|
|
|
|
|
|
|
|
Directories directories_;
|
2014-01-27 11:02:21 -08:00
|
|
|
|
2014-12-02 12:09:20 -08:00
|
|
|
WriteBuffer write_buffer_;
|
|
|
|
|
2014-09-12 16:23:58 -07:00
|
|
|
WriteThread write_thread_;
|
|
|
|
|
2013-03-28 15:19:28 -07:00
|
|
|
WriteBatch tmp_batch_;
|
2012-03-08 16:23:21 -08:00
|
|
|
|
Push- instead of pull-model for managing Write stalls
Summary:
Introducing WriteController, which is a source of truth about per-DB write delays. Let's define an DB epoch as a period where there are no flushes and compactions (i.e. new epoch is started when flush or compaction finishes). Each epoch can either:
* proceed with all writes without delay
* delay all writes by fixed time
* stop all writes
The three modes are recomputed at each epoch change (flush, compaction), rather than on every write (which is currently the case).
When we have a lot of column families, our current pull behavior adds a big overhead, since we need to loop over every column family for every write. With new push model, overhead on Write code-path is minimal.
This is just the start. Next step is to also take care of stalls introduced by slow memtable flushes. The final goal is to eliminate function MakeRoomForWrite(), which currently needs to be called for every column family by every write.
Test Plan: make check for now. I'll add some unit tests later. Also, perf test.
Reviewers: dhruba, yhchiang, MarkCallaghan, sdong, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22791
2014-09-08 11:20:25 -07:00
|
|
|
WriteController write_controller_;
|
2015-05-15 15:52:51 -07:00
|
|
|
|
|
|
|
// Size of the last batch group. In slowdown mode, next write needs to
|
|
|
|
// sleep if it uses up the quota.
|
|
|
|
uint64_t last_batch_group_size_;
|
|
|
|
|
2014-09-10 18:46:09 -07:00
|
|
|
FlushScheduler flush_scheduler_;
|
Push- instead of pull-model for managing Write stalls
Summary:
Introducing WriteController, which is a source of truth about per-DB write delays. Let's define an DB epoch as a period where there are no flushes and compactions (i.e. new epoch is started when flush or compaction finishes). Each epoch can either:
* proceed with all writes without delay
* delay all writes by fixed time
* stop all writes
The three modes are recomputed at each epoch change (flush, compaction), rather than on every write (which is currently the case).
When we have a lot of column families, our current pull behavior adds a big overhead, since we need to loop over every column family for every write. With new push model, overhead on Write code-path is minimal.
This is just the start. Next step is to also take care of stalls introduced by slow memtable flushes. The final goal is to eliminate function MakeRoomForWrite(), which currently needs to be called for every column family by every write.
Test Plan: make check for now. I'll add some unit tests later. Also, perf test.
Reviewers: dhruba, yhchiang, MarkCallaghan, sdong, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D22791
2014-09-08 11:20:25 -07:00
|
|
|
|
2011-03-18 22:37:00 +00:00
|
|
|
SnapshotList snapshots_;
|
|
|
|
|
2014-11-07 11:50:34 -08:00
|
|
|
// For each background job, pending_outputs_ keeps the current file number at
|
|
|
|
// the time that background job started.
|
|
|
|
// FindObsoleteFiles()/PurgeObsoleteFiles() never deletes any file that has
|
|
|
|
// number bigger than any of the file number in pending_outputs_. Since file
|
|
|
|
// numbers grow monotonically, this also means that pending_outputs_ is always
|
|
|
|
// sorted. After a background job is done executing, its file number is
|
|
|
|
// deleted from pending_outputs_, which allows PurgeObsoleteFiles() to clean
|
|
|
|
// it up.
|
|
|
|
// State is protected with db mutex.
|
|
|
|
std::list<uint64_t> pending_outputs_;
|
2011-03-18 22:37:00 +00:00
|
|
|
|
Rewritten system for scheduling background work
Summary:
When scaling to higher number of column families, the worst bottleneck was MaybeScheduleFlushOrCompaction(), which did a for loop over all column families while holding a mutex. This patch addresses the issue.
The approach is similar to our earlier efforts: instead of a pull-model, where we do something for every column family, we can do a push-based model -- when we detect that column family is ready to be flushed/compacted, we add it to the flush_queue_/compaction_queue_. That way we don't need to loop over every column family in MaybeScheduleFlushOrCompaction.
Here are the performance results:
Command:
./db_bench --write_buffer_size=268435456 --db_write_buffer_size=268435456 --db=/fast-rocksdb-tmp/rocks_lots_of_cf --use_existing_db=0 --open_files=55000 --statistics=1 --histogram=1 --disable_data_sync=1 --max_write_buffer_number=2 --sync=0 --benchmarks=fillrandom --threads=16 --num_column_families=5000 --disable_wal=1 --max_background_flushes=16 --max_background_compactions=16 --level0_file_num_compaction_trigger=2 --level0_slowdown_writes_trigger=2 --level0_stop_writes_trigger=3 --hard_rate_limit=1 --num=33333333 --writes=33333333
Before the patch:
fillrandom : 26.950 micros/op 37105 ops/sec; 4.1 MB/s
After the patch:
fillrandom : 17.404 micros/op 57456 ops/sec; 6.4 MB/s
Next bottleneck is VersionSet::AddLiveFiles, which is painfully slow when we have a lot of files. This is coming in the next patch, but when I removed that code, here's what I got:
fillrandom : 7.590 micros/op 131758 ops/sec; 14.6 MB/s
Test Plan:
make check
two stress tests:
Big number of compactions and flushes:
./db_stress --threads=30 --ops_per_thread=20000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000
max_background_flushes=0, to verify that this case also works correctly
./db_stress --threads=30 --ops_per_thread=2000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=3 --max_background_compactions=3 --max_background_flushes=0 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000
Reviewers: ljin, rven, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D30123
2014-12-19 20:38:12 +01:00
|
|
|
// flush_queue_ and compaction_queue_ hold column families that we need to
|
|
|
|
// flush and compact, respectively.
|
|
|
|
// A column family is inserted into flush_queue_ when it satisfies condition
|
|
|
|
// cfd->imm()->IsFlushPending()
|
|
|
|
// A column family is inserted into compaction_queue_ when it satisfied
|
|
|
|
// condition cfd->NeedsCompaction()
|
|
|
|
// Column families in this list are all Ref()-erenced
|
|
|
|
// TODO(icanadi) Provide some kind of ReferencedColumnFamily class that will
|
|
|
|
// do RAII on ColumnFamilyData
|
|
|
|
// Column families are in this queue when they need to be flushed or
|
|
|
|
// compacted. Consumers of these queues are flush and compaction threads. When
|
|
|
|
// column family is put on this queue, we increase unscheduled_flushes_ and
|
|
|
|
// unscheduled_compactions_. When these variables are bigger than zero, that
|
|
|
|
// means we need to schedule background threads for compaction and thread.
|
|
|
|
// Once the background threads are scheduled, we decrease unscheduled_flushes_
|
|
|
|
// and unscheduled_compactions_. That way we keep track of number of
|
|
|
|
// compaction and flush threads we need to schedule. This scheduling is done
|
|
|
|
// in MaybeScheduleFlushOrCompaction()
|
|
|
|
// invariant(column family present in flush_queue_ <==>
|
|
|
|
// ColumnFamilyData::pending_flush_ == true)
|
|
|
|
std::deque<ColumnFamilyData*> flush_queue_;
|
|
|
|
// invariant(column family present in compaction_queue_ <==>
|
|
|
|
// ColumnFamilyData::pending_compaction_ == true)
|
|
|
|
std::deque<ColumnFamilyData*> compaction_queue_;
|
|
|
|
int unscheduled_flushes_;
|
|
|
|
int unscheduled_compactions_;
|
Fix data race against logging data structure because of LogBuffer
Summary:
@igor pointed out that there is a potential data race because of the way we use the newly introduced LogBuffer. After "bg_compaction_scheduled_--" or "bg_flush_scheduled_--", they can both become 0. As soon as the lock is released after that, DBImpl's deconstructor can go ahead and deconstruct all the states inside DB, including the info_log object hold in a shared pointer of the options object it keeps. At that point it is not safe anymore to continue using the info logger to write the delayed logs.
With the patch, lock is released temporarily for log buffer to be flushed before "bg_compaction_scheduled_--" or "bg_flush_scheduled_--". In order to make sure we don't miss any pending flush or compaction, a new flag bg_schedule_needed_ is added, which is set to be true if there is a pending flush or compaction but not scheduled because of the max thread limit. If the flag is set to be true, the scheduling function will be called before compaction or flush thread finishes.
Thanks @igor for this finding!
Test Plan: make all check
Reviewers: haobo, igor
Reviewed By: haobo
CC: dhruba, ljin, yhchiang, igor, leveldb
Differential Revision: https://reviews.facebook.net/D16767
2014-03-11 10:36:30 -07:00
|
|
|
|
Fix a deadlock in CompactRange()
Summary:
The way DBImpl::TEST_CompactRange() throttles down the number of bg compactions
can cause it to deadlock when CompactRange() is called concurrently from
multiple threads. Imagine a following scenario with only two threads
(max_background_compactions is 10 and bg_compaction_scheduled_ is initially 0):
1. Thread #1 increments bg_compaction_scheduled_ (to LargeNumber), sets
bg_compaction_scheduled_ to 9 (newvalue), schedules the compaction
(bg_compaction_scheduled_ is now 10) and waits for it to complete.
2. Thread #2 calls TEST_CompactRange(), increments bg_compaction_scheduled_
(now LargeNumber + 10) and waits on a cv for bg_compaction_scheduled_ to
drop to LargeNumber.
3. BG thread completes the first manual compaction, decrements
bg_compaction_scheduled_ and wakes up all threads waiting on bg_cv_.
Thread #1 runs, increments bg_compaction_scheduled_ by LargeNumber again
(now 2*LargeNumber + 9). Since that's more than LargeNumber + newvalue,
thread #2 also goes to sleep (waiting on bg_cv_), without resetting
bg_compaction_scheduled_.
This diff attempts to address the problem by introducing a new counter
bg_manual_only_ (when positive, MaybeScheduleFlushOrCompaction() will only
schedule manual compactions).
Test Plan:
I could pretty much consistently reproduce the deadlock with a program that
calls CompactRange(nullptr, nullptr) immediately after Write() from multiple
threads. This no longer happens with this patch.
Tests (make check) pass.
Reviewers: dhruba, igor, sdong, haobo
Reviewed By: igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14799
2013-12-21 15:10:39 -08:00
|
|
|
// count how many background compactions are running or have been scheduled
|
2012-10-19 14:00:53 -07:00
|
|
|
int bg_compaction_scheduled_;
|
2011-03-18 22:37:00 +00:00
|
|
|
|
2015-10-17 00:16:36 -07:00
|
|
|
// stores the number of compactions are currently running
|
|
|
|
int num_running_compactions_;
|
|
|
|
|
2013-09-13 14:38:37 -07:00
|
|
|
// number of background memtable flush jobs, submitted to the HIGH pool
|
|
|
|
int bg_flush_scheduled_;
|
|
|
|
|
2015-10-17 00:16:36 -07:00
|
|
|
// stores the number of flushes are currently running
|
|
|
|
int num_running_flushes_;
|
|
|
|
|
2011-06-07 14:40:26 +00:00
|
|
|
// Information for a manual compaction
|
|
|
|
struct ManualCompaction {
|
2014-01-31 16:45:20 -08:00
|
|
|
ColumnFamilyData* cfd;
|
2014-01-14 16:19:09 -08:00
|
|
|
int input_level;
|
|
|
|
int output_level;
|
2014-07-16 17:39:18 -07:00
|
|
|
uint32_t output_path_id;
|
2014-01-22 12:46:24 -08:00
|
|
|
Status status;
|
Running manual compactions in parallel with other automatic or manual compactions in restricted cases
Summary:
This diff provides a framework for doing manual
compactions in parallel with other compactions. We now have a deque of manual compactions. We also pass manual compactions as an argument from RunManualCompactions down to
BackgroundCompactions, so that RunManualCompactions can be reentrant.
Parallelism is controlled by the two routines
ConflictingManualCompaction to allow/disallow new parallel/manual
compactions based on already existing ManualCompactions. In this diff, by default manual compactions still have to run exclusive of other compactions. However, by setting the compaction option, exclusive_manual_compaction to false, it is possible to run other compactions in parallel with a manual compaction. However, we are still restricted to one manual compaction per column family at a time. All of these restrictions will be relaxed in future diffs.
I will be adding more tests later.
Test Plan: Rocksdb regression + new tests + valgrind
Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong
Reviewed By: sdong
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47973
2015-12-14 11:20:34 -08:00
|
|
|
bool done;
|
Allowing L0 -> L1 trivial move on sorted data
Summary:
This diff updates the logic of how we do trivial move, now trivial move can run on any number of files in input level as long as they are not overlapping
The conditions for trivial move have been updated
Introduced conditions:
- Trivial move cannot happen if we have a compaction filter (except if the compaction is not manual)
- Input level files cannot be overlapping
Removed conditions:
- Trivial move only run when the compaction is not manual
- Input level should can contain only 1 file
More context on what tests failed because of Trivial move
```
DBTest.CompactionsGenerateMultipleFiles
This test is expecting compaction on a file in L0 to generate multiple files in L1, this test will fail with trivial move because we end up with one file in L1
```
```
DBTest.NoSpaceCompactRange
This test expect compaction to fail when we force environment to report running out of space, of course this is not valid in trivial move situation
because trivial move does not need any extra space, and did not check for that
```
```
DBTest.DropWrites
Similar to DBTest.NoSpaceCompactRange
```
```
DBTest.DeleteObsoleteFilesPendingOutputs
This test expect that a file in L2 is deleted after it's moved to L3, this is not valid with trivial move because although the file was moved it is now used by L3
```
```
CuckooTableDBTest.CompactionIntoMultipleFiles
Same as DBTest.CompactionsGenerateMultipleFiles
```
This diff is based on a work by @sdong https://reviews.facebook.net/D34149
Test Plan: make -j64 check
Reviewers: rven, sdong, igor
Reviewed By: igor
Subscribers: yhchiang, ott, march, dhruba, sdong
Differential Revision: https://reviews.facebook.net/D34797
2015-06-04 16:51:25 -07:00
|
|
|
bool in_progress; // compaction request being processed?
|
Running manual compactions in parallel with other automatic or manual compactions in restricted cases
Summary:
This diff provides a framework for doing manual
compactions in parallel with other compactions. We now have a deque of manual compactions. We also pass manual compactions as an argument from RunManualCompactions down to
BackgroundCompactions, so that RunManualCompactions can be reentrant.
Parallelism is controlled by the two routines
ConflictingManualCompaction to allow/disallow new parallel/manual
compactions based on already existing ManualCompactions. In this diff, by default manual compactions still have to run exclusive of other compactions. However, by setting the compaction option, exclusive_manual_compaction to false, it is possible to run other compactions in parallel with a manual compaction. However, we are still restricted to one manual compaction per column family at a time. All of these restrictions will be relaxed in future diffs.
I will be adding more tests later.
Test Plan: Rocksdb regression + new tests + valgrind
Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong
Reviewed By: sdong
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47973
2015-12-14 11:20:34 -08:00
|
|
|
bool incomplete; // only part of requested range compacted
|
|
|
|
bool exclusive; // current behavior of only one manual
|
|
|
|
bool disallow_trivial_move; // Force actual compaction to run
|
Allowing L0 -> L1 trivial move on sorted data
Summary:
This diff updates the logic of how we do trivial move, now trivial move can run on any number of files in input level as long as they are not overlapping
The conditions for trivial move have been updated
Introduced conditions:
- Trivial move cannot happen if we have a compaction filter (except if the compaction is not manual)
- Input level files cannot be overlapping
Removed conditions:
- Trivial move only run when the compaction is not manual
- Input level should can contain only 1 file
More context on what tests failed because of Trivial move
```
DBTest.CompactionsGenerateMultipleFiles
This test is expecting compaction on a file in L0 to generate multiple files in L1, this test will fail with trivial move because we end up with one file in L1
```
```
DBTest.NoSpaceCompactRange
This test expect compaction to fail when we force environment to report running out of space, of course this is not valid in trivial move situation
because trivial move does not need any extra space, and did not check for that
```
```
DBTest.DropWrites
Similar to DBTest.NoSpaceCompactRange
```
```
DBTest.DeleteObsoleteFilesPendingOutputs
This test expect that a file in L2 is deleted after it's moved to L3, this is not valid with trivial move because although the file was moved it is now used by L3
```
```
CuckooTableDBTest.CompactionIntoMultipleFiles
Same as DBTest.CompactionsGenerateMultipleFiles
```
This diff is based on a work by @sdong https://reviews.facebook.net/D34149
Test Plan: make -j64 check
Reviewers: rven, sdong, igor
Reviewed By: igor
Subscribers: yhchiang, ott, march, dhruba, sdong
Differential Revision: https://reviews.facebook.net/D34797
2015-06-04 16:51:25 -07:00
|
|
|
const InternalKey* begin; // nullptr means beginning of key range
|
|
|
|
const InternalKey* end; // nullptr means end of key range
|
Running manual compactions in parallel with other automatic or manual compactions in restricted cases
Summary:
This diff provides a framework for doing manual
compactions in parallel with other compactions. We now have a deque of manual compactions. We also pass manual compactions as an argument from RunManualCompactions down to
BackgroundCompactions, so that RunManualCompactions can be reentrant.
Parallelism is controlled by the two routines
ConflictingManualCompaction to allow/disallow new parallel/manual
compactions based on already existing ManualCompactions. In this diff, by default manual compactions still have to run exclusive of other compactions. However, by setting the compaction option, exclusive_manual_compaction to false, it is possible to run other compactions in parallel with a manual compaction. However, we are still restricted to one manual compaction per column family at a time. All of these restrictions will be relaxed in future diffs.
I will be adding more tests later.
Test Plan: Rocksdb regression + new tests + valgrind
Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong
Reviewed By: sdong
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47973
2015-12-14 11:20:34 -08:00
|
|
|
InternalKey* manual_end; // how far we are compacting
|
Allowing L0 -> L1 trivial move on sorted data
Summary:
This diff updates the logic of how we do trivial move, now trivial move can run on any number of files in input level as long as they are not overlapping
The conditions for trivial move have been updated
Introduced conditions:
- Trivial move cannot happen if we have a compaction filter (except if the compaction is not manual)
- Input level files cannot be overlapping
Removed conditions:
- Trivial move only run when the compaction is not manual
- Input level should can contain only 1 file
More context on what tests failed because of Trivial move
```
DBTest.CompactionsGenerateMultipleFiles
This test is expecting compaction on a file in L0 to generate multiple files in L1, this test will fail with trivial move because we end up with one file in L1
```
```
DBTest.NoSpaceCompactRange
This test expect compaction to fail when we force environment to report running out of space, of course this is not valid in trivial move situation
because trivial move does not need any extra space, and did not check for that
```
```
DBTest.DropWrites
Similar to DBTest.NoSpaceCompactRange
```
```
DBTest.DeleteObsoleteFilesPendingOutputs
This test expect that a file in L2 is deleted after it's moved to L3, this is not valid with trivial move because although the file was moved it is now used by L3
```
```
CuckooTableDBTest.CompactionIntoMultipleFiles
Same as DBTest.CompactionsGenerateMultipleFiles
```
This diff is based on a work by @sdong https://reviews.facebook.net/D34149
Test Plan: make -j64 check
Reviewers: rven, sdong, igor
Reviewed By: igor
Subscribers: yhchiang, ott, march, dhruba, sdong
Differential Revision: https://reviews.facebook.net/D34797
2015-06-04 16:51:25 -07:00
|
|
|
InternalKey tmp_storage; // Used to keep track of compaction progress
|
Running manual compactions in parallel with other automatic or manual compactions in restricted cases
Summary:
This diff provides a framework for doing manual
compactions in parallel with other compactions. We now have a deque of manual compactions. We also pass manual compactions as an argument from RunManualCompactions down to
BackgroundCompactions, so that RunManualCompactions can be reentrant.
Parallelism is controlled by the two routines
ConflictingManualCompaction to allow/disallow new parallel/manual
compactions based on already existing ManualCompactions. In this diff, by default manual compactions still have to run exclusive of other compactions. However, by setting the compaction option, exclusive_manual_compaction to false, it is possible to run other compactions in parallel with a manual compaction. However, we are still restricted to one manual compaction per column family at a time. All of these restrictions will be relaxed in future diffs.
I will be adding more tests later.
Test Plan: Rocksdb regression + new tests + valgrind
Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong
Reviewed By: sdong
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47973
2015-12-14 11:20:34 -08:00
|
|
|
InternalKey tmp_storage1; // Used to keep track of compaction progress
|
|
|
|
Compaction* compaction;
|
|
|
|
};
|
|
|
|
std::deque<ManualCompaction*> manual_compaction_dequeue_;
|
|
|
|
|
|
|
|
struct CompactionArg {
|
|
|
|
DBImpl* db;
|
|
|
|
ManualCompaction* m;
|
2011-06-07 14:40:26 +00:00
|
|
|
};
|
2011-03-18 22:37:00 +00:00
|
|
|
|
|
|
|
// Have we encountered a background error in paranoid mode?
|
|
|
|
Status bg_error_;
|
|
|
|
|
2012-09-14 17:11:35 -07:00
|
|
|
// shall we disable deletion of obsolete files
|
2014-01-02 03:33:42 -08:00
|
|
|
// if 0 the deletion is enabled.
|
|
|
|
// if non-zero, files will not be getting deleted
|
|
|
|
// This enables two different threads to call
|
|
|
|
// EnableFileDeletions() and DisableFileDeletions()
|
|
|
|
// without any synchronization
|
|
|
|
int disable_delete_obsolete_files_;
|
2012-09-14 17:11:35 -07:00
|
|
|
|
2014-12-22 12:04:45 +01:00
|
|
|
// next time when we should run DeleteObsoleteFiles with full scan
|
|
|
|
uint64_t delete_obsolete_files_next_run_;
|
2012-10-16 08:53:46 -07:00
|
|
|
|
2013-05-10 15:21:04 -07:00
|
|
|
// last time stats were dumped to LOG
|
2013-05-24 12:52:45 -07:00
|
|
|
std::atomic<uint64_t> last_stats_dump_time_microsec_;
|
2013-05-10 15:21:04 -07:00
|
|
|
|
2015-02-12 09:54:48 -08:00
|
|
|
// Each flush or compaction gets its own job id. this counter makes sure
|
|
|
|
// they're unique
|
|
|
|
std::atomic<int> next_job_id_;
|
|
|
|
|
2016-02-09 11:20:22 -08:00
|
|
|
// A flag indicating whether the current rocksdb database has any
|
|
|
|
// data that is not yet persisted into either WAL or SST file.
|
|
|
|
// Used when disableWAL is true.
|
|
|
|
bool has_unpersisted_data_;
|
2012-11-06 11:21:57 -08:00
|
|
|
|
2012-08-17 16:06:05 -07:00
|
|
|
static const int KEEP_LOG_FILE_NUM = 1000;
|
2015-07-01 16:13:49 -07:00
|
|
|
// MSVC version 1800 still does not have constexpr for ::max()
|
2015-07-11 09:54:33 -07:00
|
|
|
static const uint64_t kNoTimeOut = port::kMaxUint64;
|
2015-07-01 16:13:49 -07:00
|
|
|
|
2012-09-05 17:44:13 -07:00
|
|
|
std::string db_absolute_path_;
|
2012-08-17 16:06:05 -07:00
|
|
|
|
2013-03-14 17:00:04 -07:00
|
|
|
// The options to access storage files
|
2014-09-04 16:18:36 -07:00
|
|
|
const EnvOptions env_options_;
|
2013-03-14 17:00:04 -07:00
|
|
|
|
2014-10-29 17:43:37 -07:00
|
|
|
#ifndef ROCKSDB_LITE
|
|
|
|
WalManager wal_manager_;
|
|
|
|
#endif // ROCKSDB_LITE
|
|
|
|
|
EventLogger
Summary:
Here's my proposal for making our LOGs easier to read by machines.
The idea is to dump all events as JSON objects. JSON is easy to read by humans, but more importantly, it's easy to read by machines. That way, we can parse this, load into SQLite/mongo and then query or visualize.
I started with table_create and table_delete events, but if everybody agrees, I'll continue by adding more events (flush/compaction/etc etc)
Test Plan:
Ran db_bench. Observed:
2015/01/15-14:13:25.788019 1105ef000 EVENT_LOG_v1 {"time_micros": 1421360005788015, "event": "table_file_creation", "file_number": 12, "file_size": 1909699}
2015/01/15-14:13:25.956500 110740000 EVENT_LOG_v1 {"time_micros": 1421360005956498, "event": "table_file_deletion", "file_number": 12}
Reviewers: yhchiang, rven, dhruba, MarkCallaghan, lgalanis, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31647
2015-03-13 10:15:54 -07:00
|
|
|
// Unified interface for logging events
|
|
|
|
EventLogger event_logger_;
|
|
|
|
|
Fix a bug where flush does not happen when a manual compaction is running
Summary:
Currently, when rocksdb tries to run manual compaction to refit data into a level,
there's a ReFitLevel() process that requires no bg work is currently running.
When RocksDB plans to ReFitLevel(), it will do the following:
1. pause scheduling new bg work.
2. wait until all bg work finished
3. do the ReFitLevel()
4. unpause scheduling new bg work.
However, as it pause scheduling new bg work at step one and waiting for all bg work
finished in step 2, RocksDB will stop flushing until all bg work is done (which
could take a long time.)
This patch fix this issue by changing the way ReFitLevel() pause the background work:
1. pause scheduling compaction.
2. wait until all bg work finished.
3. pause scheduling flush
4. do ReFitLevel()
5. unpause both flush and compaction.
The major difference is that. We only pause scheduling compaction in step 1 and wait
for all bg work finished in step 2. This prevent flush being blocked for a long time.
Although there's a very rare case that ReFitLevel() might be in starvation in step 2,
but it's less likely the case as flush typically finish very fast.
Test Plan: existing test.
Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D55029
2016-03-04 14:24:52 -08:00
|
|
|
// A value of > 0 temporarily disables scheduling of background work
|
2015-10-02 13:17:34 -07:00
|
|
|
int bg_work_paused_;
|
2013-06-29 23:21:36 -07:00
|
|
|
|
Fix a bug where flush does not happen when a manual compaction is running
Summary:
Currently, when rocksdb tries to run manual compaction to refit data into a level,
there's a ReFitLevel() process that requires no bg work is currently running.
When RocksDB plans to ReFitLevel(), it will do the following:
1. pause scheduling new bg work.
2. wait until all bg work finished
3. do the ReFitLevel()
4. unpause scheduling new bg work.
However, as it pause scheduling new bg work at step one and waiting for all bg work
finished in step 2, RocksDB will stop flushing until all bg work is done (which
could take a long time.)
This patch fix this issue by changing the way ReFitLevel() pause the background work:
1. pause scheduling compaction.
2. wait until all bg work finished.
3. pause scheduling flush
4. do ReFitLevel()
5. unpause both flush and compaction.
The major difference is that. We only pause scheduling compaction in step 1 and wait
for all bg work finished in step 2. This prevent flush being blocked for a long time.
Although there's a very rare case that ReFitLevel() might be in starvation in step 2,
but it's less likely the case as flush typically finish very fast.
Test Plan: existing test.
Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D55029
2016-03-04 14:24:52 -08:00
|
|
|
// A value of > 0 temporarily disables scheduling of background compaction
|
|
|
|
int bg_compaction_paused_;
|
|
|
|
|
2013-06-29 23:21:36 -07:00
|
|
|
// Guard against multiple concurrent refitting
|
|
|
|
bool refitting_level_;
|
|
|
|
|
2014-02-27 11:38:55 -08:00
|
|
|
// Indicate DB was opened successfully
|
|
|
|
bool opened_successfully_;
|
|
|
|
|
2011-03-18 22:37:00 +00:00
|
|
|
// No copying allowed
|
|
|
|
DBImpl(const DBImpl&);
|
|
|
|
void operator=(const DBImpl&);
|
|
|
|
|
2013-03-21 15:59:47 -07:00
|
|
|
// Return the earliest snapshot where seqno is visible.
|
|
|
|
// Store the snapshot right before that, if any, in prev_snapshot
|
|
|
|
inline SequenceNumber findEarliestVisibleSnapshot(
|
|
|
|
SequenceNumber in,
|
|
|
|
std::vector<SequenceNumber>& snapshots,
|
|
|
|
SequenceNumber* prev_snapshot);
|
2013-07-05 18:49:18 -07:00
|
|
|
|
2013-12-20 09:57:58 -08:00
|
|
|
// Background threads call this function, which is just a wrapper around
|
2014-10-28 11:54:33 -07:00
|
|
|
// the InstallSuperVersion() function. Background threads carry
|
|
|
|
// job_context which can have new_superversion already
|
|
|
|
// allocated.
|
2015-06-17 12:37:59 -07:00
|
|
|
void InstallSuperVersionAndScheduleWorkWrapper(
|
2014-10-28 11:54:33 -07:00
|
|
|
ColumnFamilyData* cfd, JobContext* job_context,
|
|
|
|
const MutableCFOptions& mutable_cf_options);
|
2014-10-16 16:57:59 -07:00
|
|
|
|
Rewritten system for scheduling background work
Summary:
When scaling to higher number of column families, the worst bottleneck was MaybeScheduleFlushOrCompaction(), which did a for loop over all column families while holding a mutex. This patch addresses the issue.
The approach is similar to our earlier efforts: instead of a pull-model, where we do something for every column family, we can do a push-based model -- when we detect that column family is ready to be flushed/compacted, we add it to the flush_queue_/compaction_queue_. That way we don't need to loop over every column family in MaybeScheduleFlushOrCompaction.
Here are the performance results:
Command:
./db_bench --write_buffer_size=268435456 --db_write_buffer_size=268435456 --db=/fast-rocksdb-tmp/rocks_lots_of_cf --use_existing_db=0 --open_files=55000 --statistics=1 --histogram=1 --disable_data_sync=1 --max_write_buffer_number=2 --sync=0 --benchmarks=fillrandom --threads=16 --num_column_families=5000 --disable_wal=1 --max_background_flushes=16 --max_background_compactions=16 --level0_file_num_compaction_trigger=2 --level0_slowdown_writes_trigger=2 --level0_stop_writes_trigger=3 --hard_rate_limit=1 --num=33333333 --writes=33333333
Before the patch:
fillrandom : 26.950 micros/op 37105 ops/sec; 4.1 MB/s
After the patch:
fillrandom : 17.404 micros/op 57456 ops/sec; 6.4 MB/s
Next bottleneck is VersionSet::AddLiveFiles, which is painfully slow when we have a lot of files. This is coming in the next patch, but when I removed that code, here's what I got:
fillrandom : 7.590 micros/op 131758 ops/sec; 14.6 MB/s
Test Plan:
make check
two stress tests:
Big number of compactions and flushes:
./db_stress --threads=30 --ops_per_thread=20000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000
max_background_flushes=0, to verify that this case also works correctly
./db_stress --threads=30 --ops_per_thread=2000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=3 --max_background_compactions=3 --max_background_flushes=0 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000
Reviewers: ljin, rven, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D30123
2014-12-19 20:38:12 +01:00
|
|
|
// All ColumnFamily state changes go through this function. Here we analyze
|
|
|
|
// the new state and we schedule background work if we detect that the new
|
|
|
|
// state needs flush or compaction.
|
2015-06-17 12:37:59 -07:00
|
|
|
SuperVersion* InstallSuperVersionAndScheduleWork(
|
|
|
|
ColumnFamilyData* cfd, SuperVersion* new_sv,
|
|
|
|
const MutableCFOptions& mutable_cf_options);
|
2013-12-20 09:57:58 -08:00
|
|
|
|
2014-04-15 13:39:26 -07:00
|
|
|
#ifndef ROCKSDB_LITE
|
2014-02-14 17:02:10 -08:00
|
|
|
using DB::GetPropertiesOfAllTables;
|
|
|
|
virtual Status GetPropertiesOfAllTables(ColumnFamilyHandle* column_family,
|
|
|
|
TablePropertiesCollection* props)
|
2014-02-13 16:28:21 -08:00
|
|
|
override;
|
2015-10-13 14:24:45 -07:00
|
|
|
virtual Status GetPropertiesOfTablesInRange(
|
2015-10-19 10:34:55 -07:00
|
|
|
ColumnFamilyHandle* column_family, const Range* range, std::size_t n,
|
2015-10-13 14:24:45 -07:00
|
|
|
TablePropertiesCollection* props) override;
|
|
|
|
|
2014-04-15 13:39:26 -07:00
|
|
|
#endif // ROCKSDB_LITE
|
2014-02-13 16:28:21 -08:00
|
|
|
|
2013-07-26 12:57:01 -07:00
|
|
|
// Function that Get and KeyMayExist call with no_io true or false
|
|
|
|
// Note: 'value_found' from KeyMayExist propagates here
|
2014-02-10 17:04:44 -08:00
|
|
|
Status GetImpl(const ReadOptions& options, ColumnFamilyHandle* column_family,
|
|
|
|
const Slice& key, std::string* value,
|
|
|
|
bool* value_found = nullptr);
|
2014-08-05 11:27:34 -07:00
|
|
|
|
2015-11-03 15:54:18 -08:00
|
|
|
bool GetIntPropertyInternal(ColumnFamilyData* cfd,
|
Eliminate duplicated property constants
Summary:
Before this diff, there were duplicated constants to refer to properties (user-
facing API had strings and InternalStats had an enum). I noticed these were
inconsistent in terms of which constants are provided, names of constants, and
documentation of constants. Overall it seemed annoying/error-prone to maintain
these duplicated constants.
So, this diff gets rid of InternalStats's constants and replaces them with a map
keyed on the user-facing constant. The value in that map contains a function
pointer to get the property value, so we don't need to do string matching while
holding db->mutex_. This approach has a side benefit of making many small
handler functions rather than a giant switch-statement.
Test Plan: db_properties_test passes, running "make commit-prereq -j32"
Reviewers: sdong, yhchiang, kradhakrishnan, IslamAbdelRahman, rven, anthony
Reviewed By: anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D53253
2016-02-02 19:14:56 -08:00
|
|
|
const DBPropertyInfo& property_info,
|
|
|
|
bool is_locked, uint64_t* value);
|
Running manual compactions in parallel with other automatic or manual compactions in restricted cases
Summary:
This diff provides a framework for doing manual
compactions in parallel with other compactions. We now have a deque of manual compactions. We also pass manual compactions as an argument from RunManualCompactions down to
BackgroundCompactions, so that RunManualCompactions can be reentrant.
Parallelism is controlled by the two routines
ConflictingManualCompaction to allow/disallow new parallel/manual
compactions based on already existing ManualCompactions. In this diff, by default manual compactions still have to run exclusive of other compactions. However, by setting the compaction option, exclusive_manual_compaction to false, it is possible to run other compactions in parallel with a manual compaction. However, we are still restricted to one manual compaction per column family at a time. All of these restrictions will be relaxed in future diffs.
I will be adding more tests later.
Test Plan: Rocksdb regression + new tests + valgrind
Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong
Reviewed By: sdong
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47973
2015-12-14 11:20:34 -08:00
|
|
|
|
|
|
|
bool HasPendingManualCompaction();
|
|
|
|
bool HasExclusiveManualCompaction();
|
|
|
|
void AddManualCompaction(ManualCompaction* m);
|
|
|
|
void RemoveManualCompaction(ManualCompaction* m);
|
2015-12-17 16:59:00 -08:00
|
|
|
bool ShouldntRunManualCompaction(ManualCompaction* m);
|
Running manual compactions in parallel with other automatic or manual compactions in restricted cases
Summary:
This diff provides a framework for doing manual
compactions in parallel with other compactions. We now have a deque of manual compactions. We also pass manual compactions as an argument from RunManualCompactions down to
BackgroundCompactions, so that RunManualCompactions can be reentrant.
Parallelism is controlled by the two routines
ConflictingManualCompaction to allow/disallow new parallel/manual
compactions based on already existing ManualCompactions. In this diff, by default manual compactions still have to run exclusive of other compactions. However, by setting the compaction option, exclusive_manual_compaction to false, it is possible to run other compactions in parallel with a manual compaction. However, we are still restricted to one manual compaction per column family at a time. All of these restrictions will be relaxed in future diffs.
I will be adding more tests later.
Test Plan: Rocksdb regression + new tests + valgrind
Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong
Reviewed By: sdong
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47973
2015-12-14 11:20:34 -08:00
|
|
|
bool HaveManualCompaction(ColumnFamilyData* cfd);
|
|
|
|
bool MCOverlap(ManualCompaction* m, ManualCompaction* m1);
|
2011-03-18 22:37:00 +00:00
|
|
|
};
|
|
|
|
|
|
|
|
// Sanitize db options. The caller should delete result.info_log if
|
|
|
|
// it is not equal to src.info_log.
|
|
|
|
extern Options SanitizeOptions(const std::string& db,
|
|
|
|
const InternalKeyComparator* icmp,
|
|
|
|
const Options& src);
|
2014-02-04 16:31:18 -08:00
|
|
|
extern DBOptions SanitizeOptions(const std::string& db, const DBOptions& src);
|
2013-10-30 10:52:33 -07:00
|
|
|
|
2014-08-07 13:05:04 -04:00
|
|
|
// Fix user-supplied options to be reasonable
|
|
|
|
template <class T, class V>
|
|
|
|
static void ClipToRange(T* ptr, V minvalue, V maxvalue) {
|
|
|
|
if (static_cast<V>(*ptr) > maxvalue) *ptr = maxvalue;
|
|
|
|
if (static_cast<V>(*ptr) < minvalue) *ptr = minvalue;
|
|
|
|
}
|
|
|
|
|
2013-10-03 21:49:15 -07:00
|
|
|
} // namespace rocksdb
|