Commit Graph

302 Commits

Author SHA1 Message Date
Sagar Vemuri
70645355ad Move FIFOCompactionPicker to a separate file (#4724)
Summary:
**Summary:**
Simplified the code layout by moving FIFOCompactionPicker to a separate file.
**Why?:**
While trying to add ttl functionality to universal compaction, I found that `FIFOCompactionPicker` class and its impl methods to be interspersed between `LevelCompactionPicker` methods which kind-of made the code a little hard to traverse. So I moved `FIFOCompactionPicker` to a separate compaction_picker_fifo.h/cc file, similar to `UniversalCompactionPicker`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4724

Differential Revision: D13227914

Pulled By: sagar0

fbshipit-source-id: 89471766ea67fa4d87664a41c057dd7df4b3d4e3
2018-11-29 16:04:52 -08:00
Huachao Huang
5e72bc113a Add SstFileReader to read sst files (#4717)
Summary:
A user friendly sst file reader is useful when we want to access sst
files outside of RocksDB. For example, we can generate an sst file
with SstFileWriter and send it to other places, then use SstFileReader
to read the file and process the entries in other ways.

Also rename the original SstFileReader to SstFileDumper because of
name conflict, and seems SstFileDumper is more appropriate for tools.

TODO: there is only a very simple test now, because I want to get some feedback first.
If the changes look good, I will add more tests soon.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4717

Differential Revision: D13212686

Pulled By: ajkr

fbshipit-source-id: 737593383264c954b79e63edaf44aaae0d947e56
2018-11-27 13:02:23 -08:00
Abhishek Madan
457f77b9ff Introduce RangeDelAggregatorV2 (#4649)
Summary:
The old RangeDelAggregator did expensive pre-processing work
to create a collapsed, binary-searchable representation of range
tombstones. With FragmentedRangeTombstoneIterator, much of this work is
now unnecessary. RangeDelAggregatorV2 takes advantage of this by seeking
in each iterator to find a covering tombstone in ShouldDelete, while
doing minimal work in AddTombstones. The old RangeDelAggregator is still
used during flush/compaction for now, though RangeDelAggregatorV2 will
support those uses in a future PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4649

Differential Revision: D13146964

Pulled By: abhimadan

fbshipit-source-id: be29a4c020fc440500c137216fcc1cf529571eb3
2018-11-21 10:56:45 -08:00
Yi Wu
5f5fddabc7 port folly::JemallocNodumpAllocator (#4534)
Summary:
Introduce `JemallocNodumpAllocator`, which allow exclusion of block cache usage from core dump. It utilize custom hook of jemalloc arena, and when jemalloc arena request memory from system, the allocator use the hook to set `MADV_DONTDUMP ` to the memory. The implementation is basically the same as `folly::JemallocNodumpAllocator`, except for some minor difference:
1. It only support jemalloc >= 5.0
2. When the allocator destruct, it explicitly destruct the corresponding arena via `arena.<i>.destroy` via `mallctl`.

Depending on #4502.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4534

Differential Revision: D10435474

Pulled By: yiwu-arbug

fbshipit-source-id: e80edea755d3853182485d2be710376384ce0bb4
2018-10-26 17:29:18 -07:00
Abhishek Madan
8c78348c77 Use only "local" range tombstones during Get (#4449)
Summary:
Previously, range tombstones were accumulated from every level, which
was necessary if a range tombstone in a higher level covered a key in a lower
level. However, RangeDelAggregator::AddTombstones's complexity is based on
the number of tombstones that are currently stored in it, which is wasteful in
the Get case, where we only need to know the highest sequence number of range
tombstones that cover the key from higher levels, and compute the highest covering
sequence number at the current level. This change introduces this optimization, and
removes the use of RangeDelAggregator from the Get path.

In the benchmark results, the following command was used to initialize the database:
```
./db_bench -db=/dev/shm/5k-rts -use_existing_db=false -benchmarks=filluniquerandom -write_buffer_size=1048576 -compression_type=lz4 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -value_size=112 -key_size=16 -block_size=4096 -level_compaction_dynamic_level_bytes=true -num=5000000 -max_background_jobs=12 -benchmark_write_rate_limit=20971520 -range_tombstone_width=100 -writes_per_range_tombstone=100 -max_num_range_tombstones=50000 -bloom_bits=8
```

...and the following command was used to measure read throughput:
```
./db_bench -db=/dev/shm/5k-rts/ -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=5000000 -reads=100000 -threads=32
```

The filluniquerandom command was only run once, and the resulting database was used
to measure read performance before and after the PR. Both binaries were compiled with
`DEBUG_LEVEL=0`.

Readrandom results before PR:
```
readrandom   :       4.544 micros/op 220090 ops/sec;   16.9 MB/s (63103 of 100000 found)
```

Readrandom results after PR:
```
readrandom   :      11.147 micros/op 89707 ops/sec;    6.9 MB/s (63103 of 100000 found)
```

So it's actually slower right now, but this PR paves the way for future optimizations (see #4493).

----
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4449

Differential Revision: D10370575

Pulled By: abhimadan

fbshipit-source-id: 9a2e152be1ef36969055c0e9eb4beb0d96c11f4d
2018-10-24 12:31:12 -07:00
Jigar Bhati
a4d9aa6b18 Plumb WriteBufferManager through JNI (#4492)
Summary:
Allow rocks java to explicitly create WriteBufferManager by plumbing it to the native code through JNI.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4492

Differential Revision: D10428506

Pulled By: sagar0

fbshipit-source-id: cd9dd8c2ef745a0303416b44e2080547bdcca1fd
2018-10-17 11:49:57 -07:00
Ben Clay
c9048021ad RocksJava: memory_util support (#4446)
Summary:
JNI passthrough for utilities/memory/memory_util.cc

sagar0 adamretter
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4446

Differential Revision: D10174578

Pulled By: sagar0

fbshipit-source-id: d1d196d771dff22afb7ef7500f308233675696f8
2018-10-08 11:05:27 -07:00
Yi Wu
d6f2ecf49c Utility to run task periodically in a thread (#4423)
Summary:
Introduce `RepeatableThread` utility to run task periodically in a separate thread. It is basically the same as the the same class in fbcode, and in addition provide a helper method to let tests mock time and trigger execution one at a time.

We can use this class to replace `TimerQueue` in #4382 and `BlobDB`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4423

Differential Revision: D10020932

Pulled By: yiwu-arbug

fbshipit-source-id: 3616bef108c39a33c92eedb1256de424b7c04087
2018-09-27 15:28:00 -07:00
Abhishek Madan
1626f6ab6b Add RangeDelAggregator microbenchmarks (#4363)
Summary:
To measure the results of upcoming DeleteRange v2 work, this commit adds
simple benchmarks for RangeDelAggregator. It measures the average time
for AddTombstones and ShouldDelete calls.

Using this to compare the results before #4014 and on the latest master (using the default arguments) produces the following results:

Before #4014:
```
=======================
Results:
=======================
AddTombstones:          1356.28 us
ShouldDelete:           0.401732 us
```

Latest master:
```
=======================
Results:
=======================
AddTombstones:          740.82 us
ShouldDelete:           0.383271 us
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4363

Differential Revision: D9881676

Pulled By: abhimadan

fbshipit-source-id: 793e7d61aa4b9d47eb917bbcc03f08695b5e5442
2018-09-17 14:58:31 -07:00
Sagar Vemuri
f46dd5cbeb Remove trace_analyzer_tool from LIB_SOURCES (#4331)
Summary:
trace_analyzer_tool should only be in ANALYZER_LIB_SOURCES and not in LIB_SOURCES.
This fixes java_test travis build failures seen in jtest.
Blame: a6d3de4e7a
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4331

Differential Revision: D9560377

Pulled By: sagar0

fbshipit-source-id: 6b9636201a920b56ee0f61e367fee5d3dca692b0
2018-08-29 21:28:40 -07:00
Yi Wu
a6d3de4e7a BlobDB: Implement DisableFileDeletions (#4314)
Summary:
`DB::DiableFileDeletions` and `DB::EnableFileDeletions` are used for applications to stop RocksDB background jobs to delete files while they are doing replication. Implement these methods for BlobDB. `DeleteObsolteFiles` now needs to check `disable_file_deletions_` before starting, and will hold `delete_file_mutex_` the whole time while it is running. `DisableFileDeletions` needs to wait on `delete_file_mutex_` for running `DeleteObsolteFiles` job and set `disable_file_deletions_` flag.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4314

Differential Revision: D9501373

Pulled By: yiwu-arbug

fbshipit-source-id: 81064c1228f1724eff46da22b50ff765b16292cd
2018-08-27 10:58:29 -07:00
Zhichao Cao
9e2d5ab6bf Adjusted the Makefile of trace_analyzer to isolate the Gflags from other (#4290)
Summary:
Previously, the trace_analyzer_tool will be complied with other libobjects, which let the GFLAGS of trace_analyzer appear in other tools (e.g., db_bench, rocksdb_dump, and etc.). When using '--help', the help information of trace_analyzer will appear in other tool help information, which will cause confusion issues.

Currently, trace_analyzer_tool is built and used only by trace_analyzer and trace_analyzer_test to avoid the issues.

Tested with make asan_check.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4290

Differential Revision: D9413163

Pulled By: zhichao-cao

fbshipit-source-id: ed5d20c4575a53ca15ff62a2ffe601d5cf278cc4
2018-08-21 10:47:24 -07:00
Christian Esken
c7cf981a85 Add CompactRangeOptions for Java (#4220)
Summary:
Closes https://github.com/facebook/rocksdb/issues/4195

CompactRangeOptions are available the CPP API, but not in the Java API. This PR adds CompactRangeOptions to the Java API and adds an overloaded compactRange() method. See https://github.com/facebook/rocksdb/issues/4195 for the original discussion.

This change supports all fields of CompactRangeOptions, including the required enum converters in the JNI portal.

Significant changes:
- Make CompactRangeOptions available in the compactRange() for Java.
- Deprecate other compactRange() methods that have individual option params, like in the CPP code.
- Migrate rocksdb_compactrange_helper() to  CompactRangeOptions.
- Add Java unit tests for CompactRangeOptions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4220

Differential Revision: D9380007

Pulled By: sagar0

fbshipit-source-id: 6af6c334f221427f1997b33fb24c3986b092fed6
2018-08-17 10:57:25 -07:00
Fenggang Wu
19ec44fd39 Improve point-lookup performance using a data block hash index (#4174)
Summary:
Add hash index support to data blocks, which helps to reduce the CPU utilization of point-lookup operations. This feature is backward compatible with the data block created without the hash index. It is disabled by default unless `BlockBasedTableOptions::data_block_index_type` is set to `data_block_index_type = kDataBlockBinaryAndHash.`

The DB size would be bigger with the hash index option as a hash table is added at the end of each data block. If the hash utilization ratio is 1:1, the space overhead is one byte per key. The hash table utilization ratio is adjustable using `BlockBasedTableOptions::data_block_hash_table_util_ratio`. A lower utilization ratio will improve more on the point-lookup efficiency, but take more space too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4174

Differential Revision: D8965914

Pulled By: fgwu

fbshipit-source-id: 1c6bae5d1fc39c80282d8890a72e9e67bc247198
2018-08-15 14:30:03 -07:00
Zhichao Cao
999d955e4f RocksDB Trace Analyzer (#4091)
Summary:
A framework of trace analyzing for RocksDB

After collecting the trace by using the tool of [PR #3837](https://github.com/facebook/rocksdb/pull/3837). User can use the Trace Analyzer to interpret, analyze, and characterize the collected workload.
**Input:**
1. trace file
2. Whole keys space file

**Statistics:**
1. Access count of each operation (Get, Put, Delete, SingleDelete, DeleteRange, Merge) in each column family.
2. Key hotness (access count) of each one
3. Key space separation based on given prefix
4. Key size distribution
5. Value size distribution if appliable
6. Top K accessed keys
7. QPS statistics including the average QPS and peak QPS
8. Top K accessed prefix
9. The query correlation analyzing, output the number of X after Y and the corresponding average time
    intervals

**Output:**
1. key access heat map (either in the accessed key space or whole key space)
2. trace sequence file (interpret the raw trace file to line base text file for future use)
3. Time serial (The key space ID and its access time)
4. Key access count distritbution
5. Key size distribution
6. Value size distribution (in each intervals)
7. whole key space separation by the prefix
8. Accessed key space separation by the prefix
9. QPS of each operation and each column family
10. Top K QPS and their accessed prefix range

**Test:**
1. Added the unit test of analyzing Get, Put, Delete, SingleDelete, DeleteRange, Merge
2. Generated the trace and analyze the trace

**Implemented but not tested (due to the limitation of trace_replay):**
1. Analyzing Iterator, supporting Seek() and SeekForPrev() analyzing
2. Analyzing the number of Key found by Get

**Future Work:**
1.  Support execution time analyzing of each requests
2.  Support cache hit situation and block read situation of Get
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4091

Differential Revision: D9256157

Pulled By: zhichao-cao

fbshipit-source-id: f0ceacb7eedbc43a3eee6e85b76087d7832a8fe6
2018-08-13 11:44:02 -07:00
Yi Wu
140f256da2 BlobDB: Cleanup TTLExtractor interface (#4229)
Summary:
Cleanup TTLExtractor interface. The original purpose of it is to allow our users keep using existing `Write()` interface but allow it to accept TTL via `TTLExtractor`. However the interface is confusing. Will replace it with something like `WriteWithTTL(batch, ttl)` in the future.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4229

Differential Revision: D9174390

Pulled By: yiwu-arbug

fbshipit-source-id: 68201703d784408b851336ab4dd9b84188245b2d
2018-08-06 11:58:05 -07:00
Sagar Vemuri
12b6cdeed3 Trace and Replay for RocksDB (#3837)
Summary:
A framework for tracing and replaying RocksDB operations.

A binary trace file is created by capturing the DB operations, and it can be replayed back at the same rate using db_bench.

- Column-families are supported
- Multi-threaded tracing is supported.
- TraceReader and TraceWriter are exposed to the user, so that tracing to various destinations can be enabled (say, to other messaging/logging services). By default, a FileTraceReader and FileTraceWriter are implemented to capture to a file and replay from it.
- This is not yet ideal to be enabled in production due to large performance overhead, but it can be safely tried out in a shadow setup, say, for analyzing RocksDB operations.

Currently supported DB operations:
- Writes:
-- Put
-- Merge
-- Delete
-- SingleDelete
-- DeleteRange
-- Write
- Reads:
-- Get (point lookups)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3837

Differential Revision: D7974837

Pulled By: sagar0

fbshipit-source-id: 8ec65aaf336504bc1f6ed0feae67f6ed5ef97a72
2018-08-01 00:27:08 -07:00
Fenggang Wu
8805ec2f49 DataBlockHashIndex: Standalone Implementation with Unit Test (#4139)
Summary:
The first step of the `DataBlockHashIndex` implementation. A string based hash table is implemented and unit-tested.

`DataBlockHashIndexBuilder`: `Add()` takes pairs of `<key, restart_index>`, and formats it into a string when `Finish()` is called.
`DataBlockHashIndex`: initialized by the formatted string, and can interpret it as a hash table. Lookup for a key is supported by iterator operation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4139

Reviewed By: sagar0

Differential Revision: D8866764

Pulled By: fgwu

fbshipit-source-id: 7f015f0098632c65979a22898a50424384730b10
2018-07-24 11:43:37 -07:00
Chang Su
374c37da5b move static msgs out of Status class (#4144)
Summary:
The member msgs of class Status contains all types of status messages.
When users dump a Status object, msgs will confuse users. So move it out
of class Status by making it as file-local static variable.

Closes #3831 .
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4144

Differential Revision: D8941419

Pulled By: sagar0

fbshipit-source-id: 56b0510258465ff26db15aa6b04e01532e053e3d
2018-07-23 15:44:16 -07:00
Siying Dong
ddc07b40fc Remove managed iterator
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4124

Differential Revision: D8829910

Pulled By: siying

fbshipit-source-id: f3e952ccf3a631071a5d77c48e327046f8abb560
2018-07-17 14:43:18 -07:00
Anand Ananthabhotla
52d4c9b7f6 Allow DB resume after background errors (#3997)
Summary:
Currently, if RocksDB encounters errors during a write operation (user requested or BG operations), it sets DBImpl::bg_error_ and fails subsequent writes. This PR allows the DB to be resumed for certain classes of errors. It consists of 3 parts -
1. Introduce Status::Severity in rocksdb::Status to indicate whether a given error can be recovered from or not
2. Refactor the error handling code so that setting bg_error_ and deciding on severity is in one place
3. Provide an API for the user to clear the error and resume the DB instance

This whole change is broken up into multiple PRs. Initially, we only allow clearing the error for Status::NoSpace() errors during background flush/compaction. Subsequent PRs will expand this to include more errors and foreground operations such as Put(), and implement a polling mechanism for out-of-space errors.
Closes https://github.com/facebook/rocksdb/pull/3997

Differential Revision: D8653831

Pulled By: anand1976

fbshipit-source-id: 6dc835c76122443a7668497c0226b4f072bc6afd
2018-06-28 12:34:40 -07:00
Dmitri Smirnov
f4b72d7056 Provide a way to override windows memory allocator with jemalloc for ZSTD
Summary:
Windows does not have LD_PRELOAD mechanism to override all memory allocation functions and ZSTD makes use of C-tuntime calloc. During flushes and compactions default system allocator fragments and the system slows down considerably.

For builds with jemalloc we employ an advanced ZSTD context creation API that re-directs memory allocation to jemalloc. To reduce the cost of context creation on each block we cache ZSTD context within the block based table builder while a new SST file is being built, this will help all platform builds including those w/o jemalloc. This avoids system allocator fragmentation and improves the performance.

The change does not address random reads and currently on Windows reads with ZSTD regress as compared with SNAPPY compression.
Closes https://github.com/facebook/rocksdb/pull/3838

Differential Revision: D8229794

Pulled By: miasantreble

fbshipit-source-id: 719b622ab7bf4109819bc44f45ec66f0dd3ee80d
2018-06-04 12:12:48 -07:00
Manuel Ung
01e3c30def Extend existing unit tests to run with WriteUnprepared as well
Summary:
As titled.

I have not extended the Compatibility tests because the new WAL markers are still unimplemented.
Closes https://github.com/facebook/rocksdb/pull/3941

Differential Revision: D8238394

Pulled By: lth

fbshipit-source-id: 980e3d44837bbf2cfa64047f9738f559dfac4b1d
2018-06-01 14:58:41 -07:00
Manuel Ung
aaac6cd16f Add write unprepared classes by inheriting from write prepared
Summary: Closes https://github.com/facebook/rocksdb/pull/3907

Differential Revision: D8218325

Pulled By: lth

fbshipit-source-id: ff32d8dab4a159cd2762876cba4b15e3dc51ff3b
2018-05-31 10:47:42 -07:00
Andrew Kryczka
e410501eeb Add missing test files to src.mk
Summary:
We only generate the header dependency (".cc.d") files for files mentioned in "src.mk". When we don't generate them, changes to header dependencies do not cause `make` to recompile the dependent ".o". Then it takes a while for developers (or maybe just me) to realize `make clean` is necessary.
Closes https://github.com/facebook/rocksdb/pull/3876

Differential Revision: D8065389

Pulled By: ajkr

fbshipit-source-id: 0f62eee7bcab15b0215791564e6ab3775d46996b
2018-05-21 09:43:29 -07:00
Mike Kolupaev
8bf555f487 Change and clarify the relationship between Valid(), status() and Seek*() for all iterators. Also fix some bugs
Summary:
Before this PR, Iterator/InternalIterator may simultaneously have non-ok status() and Valid() = true. That state means that the last operation failed, but the iterator is nevertheless positioned on some unspecified record. Likely intended uses of that are:
 * If some sst files are corrupted, a normal iterator can be used to read the data from files that are not corrupted.
 * When using read_tier = kBlockCacheTier, read the data that's in block cache, skipping over the data that is not.

However, this behavior wasn't documented well (and until recently the wiki on github had misleading incorrect information). In the code there's a lot of confusion about the relationship between status() and Valid(), and about whether Seek()/SeekToLast()/etc reset the status or not. There were a number of bugs caused by this confusion, both inside rocksdb and in the code that uses rocksdb (including ours).

This PR changes the convention to:
 * If status() is not ok, Valid() always returns false.
 * Any seek operation resets status. (Before the PR, it depended on iterator type and on particular error.)

This does sacrifice the two use cases listed above, but siying said it's ok.

Overview of the changes:
 * A commit that adds missing status checks in MergingIterator. This fixes a bug that actually affects us, and we need it fixed. `DBIteratorTest.NonBlockingIterationBugRepro` explains the scenario.
 * Changes to lots of iterator types to make all of them conform to the new convention. Some bug fixes along the way. By far the biggest changes are in DBIter, which is a big messy piece of code; I tried to make it less big and messy but mostly failed.
 * A stress-test for DBIter, to gain some confidence that I didn't break it. It does a few million random operations on the iterator, while occasionally modifying the underlying data (like ForwardIterator does) and occasionally returning non-ok status from internal iterator.

To find the iterator types that needed changes I searched for "public .*Iterator" in the code. Here's an overview of all 27 iterator types:

Iterators that didn't need changes:
 * status() is always ok(), or Valid() is always false: MemTableIterator, ModelIter, TestIterator, KVIter (2 classes with this name anonymous namespaces), LoggingForwardVectorIterator, VectorIterator, MockTableIterator, EmptyIterator, EmptyInternalIterator.
 * Thin wrappers that always pass through Valid() and status(): ArenaWrappedDBIter, TtlIterator, InternalIteratorFromIterator.

Iterators with changes (see inline comments for details):
 * DBIter - an overhaul:
    - It used to silently skip corrupted keys (`FindParseableKey()`), which seems dangerous. This PR makes it just stop immediately after encountering a corrupted key, just like it would for other kinds of corruption. Let me know if there was actually some deeper meaning in this behavior and I should put it back.
    - It had a few code paths silently discarding subiterator's status. The stress test caught a few.
    - The backwards iteration code path was expecting the internal iterator's set of keys to be immutable. It's probably always true in practice at the moment, since ForwardIterator doesn't support backwards iteration, but this PR fixes it anyway. See added DBIteratorTest.ReverseToForwardBug for an example.
    - Some parts of backwards iteration code path even did things like `assert(iter_->Valid())` after a seek, which is never a safe assumption.
    - It used to not reset status on seek for some types of errors.
    - Some simplifications and better comments.
    - Some things got more complicated from the added error handling. I'm open to ideas for how to make it nicer.
 * MergingIterator - check status after every operation on every subiterator, and in some places assert that valid subiterators have ok status.
 * ForwardIterator - changed to the new convention, also slightly simplified.
 * ForwardLevelIterator - fixed some bugs and simplified.
 * LevelIterator - simplified.
 * TwoLevelIterator - changed to the new convention. Also fixed a bug that would make SeekForPrev() sometimes silently ignore errors from first_level_iter_.
 * BlockBasedTableIterator - minor changes.
 * BlockIter - replaced `SetStatus()` with `Invalidate()` to make sure non-ok BlockIter is always invalid.
 * PlainTableIterator - some seeks used to not reset status.
 * CuckooTableIterator - tiny code cleanup.
 * ManagedIterator - fixed some bugs.
 * BaseDeltaIterator - changed to the new convention and fixed a bug.
 * BlobDBIterator - seeks used to not reset status.
 * KeyConvertingIterator - some small change.
Closes https://github.com/facebook/rocksdb/pull/3810

Differential Revision: D7888019

Pulled By: al13n321

fbshipit-source-id: 4aaf6d3421c545d16722a815b2fa2e7912bc851d
2018-05-17 02:56:56 -07:00
Siying Dong
d59549298f Skip deleted WALs during recovery
Summary:
This patch record min log number to keep to the manifest while flushing SST files to ignore them and any WAL older than them during recovery. This is to avoid scenarios when we have a gap between the WAL files are fed to the recovery procedure. The gap could happen by for example out-of-order WAL deletion. Such gap could cause problems in 2PC recovery where the prepared and commit entry are placed into two separate WAL and gap in the WALs could result into not processing the WAL with the commit entry and hence breaking the 2PC recovery logic.

Before the commit, for 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest logs with outstanding prepare entries, or prepare entries whose respective commit or abort are in memtable. With the commit, the same calculation is done while we apply the SST flush. Just before installing the flush file, we precompute the earliest log file to keep after the flush finishes using the same logic (but skipping the memtables just flushed), record this information to the manifest entry for this new flushed SST file. This pre-computed value is also remembered in memory, and will later be used to determine whether a log file can be deleted. This value is unlikely to change until next flush because the commit entry will stay in memtable. (In WritePrepared, we could have removed the older log files as soon as all prepared entries are committed. It's not yet done anyway. Even if we do it, the only thing we loss with this new approach is earlier log deletion between two flushes, which does not guarantee to happen anyway because the obsolete file clean-up function is only executed after flush or compaction)

This min log number to keep is stored in the manifest using the safely-ignore customized field of AddFile entry, in order to guarantee that the DB generated using newer release can be opened by previous releases no older than 4.2.
Closes https://github.com/facebook/rocksdb/pull/3765

Differential Revision: D7747618

Pulled By: siying

fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729
2018-05-03 15:43:09 -07:00
Adam Retter
ca87aef82d Added support for SstFileManager to RocksJava
Summary: Closes https://github.com/facebook/rocksdb/pull/3666

Differential Revision: D7457634

Pulled By: sagar0

fbshipit-source-id: 47741e2ee66e9255c580f4e38cfb86b284c27c2f
2018-04-06 21:26:32 -07:00
Yanqin Jin
1f5def1653 Fix race condition causing double deletion of ssts
Summary:
Possible interleaved execution of background compaction thread calling `FindObsoleteFiles (no full scan) / PurgeObsoleteFiles` and user thread calling `FindObsoleteFiles (full scan) / PurgeObsoleteFiles` can lead to race condition on which RocksDB attempts to delete a file twice. The second attempt will fail and return `IO error`. This may occur to other files,  but this PR targets sst.
Also add a unit test to verify that this PR fixes the issue.

The newly added unit test `obsolete_files_test` has a test case for this scenario, implemented in `ObsoleteFilesTest#RaceForObsoleteFileDeletion`. `TestSyncPoint`s are used to coordinate the interleaving the `user_thread` and background compaction thread. They execute as follows
```
timeline              user_thread                background_compaction thread
t1   |                                          FindObsoleteFiles(full_scan=false)
t2   |     FindObsoleteFiles(full_scan=true)
t3   |                                          PurgeObsoleteFiles
t4   |     PurgeObsoleteFiles
     V
```
When `user_thread` invokes `FindObsoleteFiles` with full scan, it collects ALL files in RocksDB directory, including the ones that background compaction thread have collected in its job context. Then `user_thread` will see an IO error when trying to delete these files in `PurgeObsoleteFiles` because background compaction thread has already deleted the file in `PurgeObsoleteFiles`.
To fix this, we make RocksDB remember which (SST) files have been found by threads after calling `FindObsoleteFiles` (see `DBImpl#files_grabbed_for_purge_`). Therefore, when another thread calls `FindObsoleteFiles` with full scan, it will not collect such files.

ajkr could you take a look and comment? Thanks!
Closes https://github.com/facebook/rocksdb/pull/3638

Differential Revision: D7384372

Pulled By: riversand963

fbshipit-source-id: 01489516d60012e722ee65a80e1449e589ce26d3
2018-03-28 10:29:59 -07:00
Dmitri Smirnov
53d66df0c4 Refactor sync_point to make implementation either customizable or replaceable
Summary: Closes https://github.com/facebook/rocksdb/pull/3637

Differential Revision: D7354373

Pulled By: ajkr

fbshipit-source-id: 6816c7bbc192ed0fb944942b11c7074bf24eddf1
2018-03-23 12:56:52 -07:00
Adam Retter
c5302a8a58 Java wrapper for Native Comparators
Summary:
This is an abstraction for working with custom Comparators implemented in native C++ code from Java. Native code must directly extend `rocksdb::Comparator`. When the native code comparator is compiled into the RocksDB codebase, you can then create a Java Class, and JNI stub to wrap it.

Useful if the C++/JNI barrier overhead is too much for your applications comparator performance.

An example is provided in `java/rocksjni/native_comparator_wrapper_test.cc` and `java/src/main/java/org/rocksdb/NativeComparatorWrapperTest.java`.
Closes https://github.com/facebook/rocksdb/pull/3334

Differential Revision: D7172605

Pulled By: miasantreble

fbshipit-source-id: e24b7eb267a3bcb6afa214e0379a1d5e8a2ceabe
2018-03-08 11:27:42 -08:00
Yi Wu
b864bc9b5b Blob DB: Improve FIFO eviction
Summary:
Improving blob db FIFO eviction with the following changes,
* Change blob_dir_size to max_db_size. Take into account SST file size when computing DB size.
* FIFO now only take into account live sst files and live blob files. It is normal for disk usage to go over max_db_size because there are obsolete sst files and blob files pending deletion.
* FIFO eviction now also evict TTL blob files that's still open. It doesn't evict non-TTL blob files.
* If FIFO is triggered, it will pass an expiration and the current sequence number to compaction filter. Compaction filter will then filter inlined keys to evict those with an earlier expiration and smaller sequence number. So call LSM FIFO.
* Compaction filter also filter those blob indexes where corresponding blob file is gone.
* Add an event listener to listen compaction/flush event and update sst file size.
* Implement DB::Close() to make sure base db, as well as event listener and compaction filter, destruct before blob db.
* More blob db statistics around FIFO.
* Fix some locking issue when accessing a blob file.
Closes https://github.com/facebook/rocksdb/pull/3556

Differential Revision: D7139328

Pulled By: yiwu-arbug

fbshipit-source-id: ea5edb07b33dfceacb2682f4789bea61de28bbfa
2018-03-06 11:57:42 -08:00
Pooya Shareghi
0a2354ca8f Added bytes XOR merge operator
Summary:
Closes https://github.com/facebook/rocksdb/pull/575

I fixed the merge conflicts etc.
Closes https://github.com/facebook/rocksdb/pull/3065

Differential Revision: D7128233

Pulled By: sagar0

fbshipit-source-id: 2c23a48c9f0432c290b0cd16a12fb691bb37820c
2018-03-06 10:27:36 -08:00
Adam Retter
2ac988c67e Add TransactionDB and OptimisticTransactionDB to the Java API
Summary:
Closes https://github.com/facebook/rocksdb/issues/697
Closes https://github.com/facebook/rocksdb/issues/1151
Closes https://github.com/facebook/rocksdb/pull/1298

Differential Revision: D7131402

Pulled By: sagar0

fbshipit-source-id: bcd34ce95ed88cc641786089ff4232df7b2f089f
2018-03-02 10:34:13 -08:00
Siying Dong
2f1a3a4d74 Refactor ReadBlockContents()
Summary:
Divide ReadBlockContents() to multiple sub-functions. Maintaining the input and intermediate data in a new class BlockFetcher.
I hope in general it makes the code easier to maintain.
Another motivation to do it is to clearly divide the logic before file reading and after file reading. The refactor will help us evaluate how can we make I/O async in the future.
Closes https://github.com/facebook/rocksdb/pull/3244

Differential Revision: D6520983

Pulled By: siying

fbshipit-source-id: 338d90bc0338472d46be7a7682028dc9114b12e9
2017-12-11 15:27:32 -08:00
Yi Wu
dd49f89466 Fix TARGETS lint warnings.
Summary:
Fix buckifier script and regenerate TARGETS file with no lint warnings.
Closes https://github.com/facebook/rocksdb/pull/3170

Differential Revision: D6328993

Pulled By: yiwu-arbug

fbshipit-source-id: 17d0e4ed92f676f35fed76659386611cc72b00b2
2017-11-15 14:28:34 -08:00
Yi Wu
42564ada53 Blob DB: not using PinnableSlice move assignment
Summary:
The current implementation of PinnableSlice move assignment have an issue #3163. We are moving away from it instead of try to get the move assignment right, since it is too tricky.
Closes https://github.com/facebook/rocksdb/pull/3164

Differential Revision: D6319201

Pulled By: yiwu-arbug

fbshipit-source-id: 8f3279021f3710da4a4caa14fd238ed2df902c48
2017-11-13 18:12:20 -08:00
Maysam Yabandeh
60d83df23d WritePrepared Txn: Move DB class to its own file
Summary:
Move  WritePreparedTxnDB from pessimistic_transaction_db.h to its own header, write_prepared_txn_db.h
Closes https://github.com/facebook/rocksdb/pull/3114

Differential Revision: D6220987

Pulled By: maysamyabandeh

fbshipit-source-id: 18893fb4fdc6b809fe117dabb544080f9b4a301b
2017-11-02 11:14:30 -07:00
Yi Wu
31d3e41810 PinnableSlice move assignment
Summary:
Allow `std::move(pinnable_slice)`.
Closes https://github.com/facebook/rocksdb/pull/2997

Differential Revision: D6036782

Pulled By: yiwu-arbug

fbshipit-source-id: 583fb0419a97e437ff530f4305822341cd3381fa
2017-10-12 18:28:24 -07:00
Adam Retter
560e984995 Added CompactionFilterFactory support to RocksJava
Summary:
This PR also includes some cleanup, bugfixes and refactoring of the Java API. However these are really pre-cursors on the road to CompactionFilterFactory support.
Closes https://github.com/facebook/rocksdb/pull/1241

Differential Revision: D6012778

Pulled By: sagar0

fbshipit-source-id: 0774465940ee99001a78906e4fed4ef57068ad5c
2017-10-12 11:12:16 -07:00
Yi Wu
d1b74b0c82 WritePrepared Txn: Compaction/Flush
Summary:
Update Compaction/Flush to support WritePreparedTxnDB: Add SnapshotChecker which is a proxy to query WritePreparedTxnDB::IsInSnapshot. Pass SnapshotChecker to DBImpl on WritePreparedTxnDB open. CompactionIterator use it to check if a key has been committed and if it is visible to a snapshot. In CompactionIterator:
* check if key has been committed. If not, output uncommitted keys AS-IS.
* use SnapshotChecker to check if key is visible to a snapshot when in need.
* do not output key with seq = 0 if the key is not committed.
Closes https://github.com/facebook/rocksdb/pull/2926

Differential Revision: D5902907

Pulled By: yiwu-arbug

fbshipit-source-id: 945e037fdf0aa652dc5ba0ad879461040baa0320
2017-10-06 10:41:53 -07:00
Sagar Vemuri
c8f3606731 Expose LoadLatestOptions, LoadOptionsFromFile and GetLatestOptionsFileName APIs in RocksJava
Summary:
JNI wrappers for LoadLatestOptions, LoadOptionsFromFile and GetLatestOptionsFileName APIs.
Closes https://github.com/facebook/rocksdb/pull/2898

Differential Revision: D5857934

Pulled By: sagar0

fbshipit-source-id: 68b79e83eab8de9416e3f1fef73e11cf7947e90a
2017-09-21 17:29:13 -07:00
Kamalalochana Subbaiah
e612e31740 Updated CRC32 Power Optimization Changes
Summary:
Support for PowerPC Architecture
Detecting AltiVec Support
Closes https://github.com/facebook/rocksdb/pull/2716

Differential Revision: D5606836

Pulled By: siying

fbshipit-source-id: 720262453b1546e5fdbbc668eff56848164113f3
2017-08-31 14:16:30 -07:00
Pengchao Wang
825a22c00c garbage collect tombstones in merge operator
Summary:
Remove cassandra tombstone when reaching the max compaction level (full merge). if all columns collected key will be removed in next compaction via compaction filter
Closes https://github.com/facebook/rocksdb/pull/2791

Reviewed By: sagar0

Differential Revision: D5722465

Pulled By: wpc

fbshipit-source-id: 61e9898a5686551653a16383255aeaab3197e65e
2017-08-31 10:11:54 -07:00
Maysam Yabandeh
26ac24f199 Add more unit test to write_prepared txns
Summary: Closes https://github.com/facebook/rocksdb/pull/2798

Differential Revision: D5724173

Pulled By: maysamyabandeh

fbshipit-source-id: fb6b782d933fb4be315b1a231a6a67a66fdc9c96
2017-08-31 09:41:27 -07:00
Maysam Yabandeh
bdc056f8aa Refactor PessimisticTransaction
Summary:
This patch splits Commit and Prepare into lock-related logic and db-write-related logic. It moves lock-related logic to PessimisticTransaction to be reused by all children classes and movies the existing impl of db-write-related to PrepareInternal, CommitSingleInternal, and CommitInternal in WriteCommittedTxnImpl.
Closes https://github.com/facebook/rocksdb/pull/2691

Differential Revision: D5569464

Pulled By: maysamyabandeh

fbshipit-source-id: d1b8698e69801a4126c7bc211745d05c636f5325
2017-08-07 16:12:29 -07:00
Maysam Yabandeh
c9804e007a Refactor TransactionDBImpl
Summary:
This opens space for the new implementations of TransactionDBImpl such as WritePreparedTxnDBImpl that has a different policy of how to write to DB.
Closes https://github.com/facebook/rocksdb/pull/2689

Differential Revision: D5568918

Pulled By: maysamyabandeh

fbshipit-source-id: f7eac866e175daf3793ae79da108f65cc7dc7b25
2017-08-05 17:26:15 -07:00
Maysam Yabandeh
c3d5c4d38a Refactor TransactionImpl
Summary:
This patch refactors TransactionImpl by separating the logic for pessimistic concurrency control from the implementation of how to write the data to rocksdb. The existing implementation is named WriteCommittedTxnImpl as it writes committed data to the db. A template named WritePreparedTxnImpl is also added which will be later completed to provide a an alternative implementation.
Closes https://github.com/facebook/rocksdb/pull/2676

Differential Revision: D5549998

Pulled By: maysamyabandeh

fbshipit-source-id: 16298e86b43ca4849324c1f35c731913c6d17bec
2017-08-03 08:57:22 -07:00
Yi Wu
1900771bd2 Dump Blob DB options to info log
Summary:
* Dump blob db options to info log
* Remove BlobDBOptionsImpl to disallow dynamic cast *BlobDBOptions into *BlobDBOptionsImpl. Move options there to be constants or into BlobDBOptions. The dynamic cast is broken after #2645
* Change some of the default options
* Remove blob_db_options.min_blob_size, which is unimplemented. Will implement it soon.
Closes https://github.com/facebook/rocksdb/pull/2671

Differential Revision: D5529912

Pulled By: yiwu-arbug

fbshipit-source-id: dcd58ca981db5bcc7f123b65a0d6f6ae0dc703c7
2017-08-01 13:01:47 -07:00
Yi Wu
6083bc79f8 Blob DB TTL extractor
Summary:
Introducing blob_db::TTLExtractor to replace extract_ttl_fn. The TTL
extractor can be use to extract TTL from keys insert with Put or
WriteBatch. Change over existing extract_ttl_fn are:
* If value is changed, it will be return via std::string* (rather than Slice*). With Slice* the new value has to be part of the existing value. With std::string* the limitation is removed.
* It can optionally return TTL or expiration.

Other changes in this PR:
* replace `std::chrono::system_clock` with `Env::NowMicros` so that I can mock time in tests.
* add several TTL tests.
* other minor naming change.
Closes https://github.com/facebook/rocksdb/pull/2659

Differential Revision: D5512627

Pulled By: yiwu-arbug

fbshipit-source-id: 0dfcb00d74d060b8534c6130c808e4d5d0a54440
2017-07-27 23:26:04 -07:00
Siying Dong
c281b44829 Revert "CRC32 Power Optimization Changes"
Summary:
This reverts commit 2289d38115.
Closes https://github.com/facebook/rocksdb/pull/2652

Differential Revision: D5506163

Pulled By: siying

fbshipit-source-id: 105e31dd9d99090453a6b9f32c165206cd3affa3
2017-07-26 19:31:36 -07:00
Kamalalochana Subbaiah
2289d38115 CRC32 Power Optimization Changes
Summary:
Support for PowerPC Architecture
Detecting AltiVec Support
Closes https://github.com/facebook/rocksdb/pull/2353

Differential Revision: D5210948

Pulled By: siying

fbshipit-source-id: 859a8c063d37697addd89ba2b8a14e5efd5d24bf
2017-07-26 09:42:29 -07:00
Pengchao Wang
534c255c7a Cassandra compaction filter for purge expired columns and rows
Summary:
Major changes in this PR:
* Implement CassandraCompactionFilter to remove expired columns and rows (if all column expired)
* Move cassandra related code from utilities/merge_operators/cassandra to utilities/cassandra/*
* Switch to use shared_ptr<> from uniqu_ptr for Column membership management in RowValue. Since columns do have multiple owners in Merge and GC process, use shared_ptr helps make RowValue immutable.
* Rename cassandra_merge_test to cassandra_functional_test and add two TTL compaction related tests there.
Closes https://github.com/facebook/rocksdb/pull/2588

Differential Revision: D5430010

Pulled By: wpc

fbshipit-source-id: 9566c21e06de17491d486a68c70f52d501f27687
2017-07-21 14:57:44 -07:00
Adam Retter
000bf0af38 Improve the design and native object management of Stats in RocksJava
Summary: Closes https://github.com/facebook/rocksdb/pull/2551

Differential Revision: D5399288

Pulled By: sagar0

fbshipit-source-id: dd3df2ed6cc5ae612db0998ea746cc29fccf568e
2017-07-11 23:31:00 -07:00
Aaron Gao
a43c053ad8 remove duplicated utilities/merge_operators/cassandra/test_utils.cc in src.mk
Summary:
target `utilities/merge_operators/cassandra/test_utils.d' given more than once in the same rule
remove the duplicate one
Closes https://github.com/facebook/rocksdb/pull/2542

Differential Revision: D5379570

Pulled By: lightmark

fbshipit-source-id: 353f38d085e9f627c1b3c53acef7a2c4bc71014d
2017-07-06 17:28:57 -07:00
Yi Wu
982cec22af Fix TARGETS file tests list
Summary:
1. The buckifier script assume each test "foo" comes with a .cc file of the same name (i.e. foo.cc). Update cassandra tests to follow this pattern so that the buckifier script can recognize them.
2. add blob_db_test
Closes https://github.com/facebook/rocksdb/pull/2506

Differential Revision: D5331517

Pulled By: yiwu-arbug

fbshipit-source-id: 86f3eba471fc621186ab44cbd073b6162cde8e57
2017-06-27 14:12:02 -07:00
Ewout Prangsma
51778612c9 Encryption at rest support
Summary:
This PR adds support for encrypting data stored by RocksDB when written to disk.

It adds an `EncryptedEnv` override of the `Env` class with matching overrides for sequential&random access files.
The encryption itself is done through a configurable `EncryptionProvider`. This class creates is asked to create `BlockAccessCipherStream` for a file. This is where the actual encryption/decryption is being done.
Currently there is a Counter mode implementation of `BlockAccessCipherStream` with a `ROT13` block cipher (NOTE the `ROT13` is for demo purposes only!!).

The Counter operation mode uses an initial counter & random initialization vector (IV).
Both are created randomly for each file and stored in a 4K (default size) block that is prefixed to that file. The `EncryptedEnv` implementation is such that clients of the `Env` class do not see this prefix (nor data, nor in filesize).
The largest part of the prefix block is also encrypted, and there is room left for implementation specific settings/values/keys in there.

To test the encryption, the `DBTestBase` class has been extended to consider a new environment variable called `ENCRYPTED_ENV`. If set, the test will setup a encrypted instance of the `Env` class to use for all tests.
Typically you would run it like this:

```
ENCRYPTED_ENV=1 make check_some
```

There is also an added test that checks that some data inserted into the database is or is not "visible" on disk. With `ENCRYPTED_ENV` active it must not find plain text strings, with `ENCRYPTED_ENV` unset, it must find the plain text strings.
Closes https://github.com/facebook/rocksdb/pull/2424

Differential Revision: D5322178

Pulled By: sdwilsh

fbshipit-source-id: 253b0a9c2c498cc98f580df7f2623cbf7678a27f
2017-06-26 16:56:24 -07:00
Chen Shen
cbd825deea Create a MergeOperator for Cassandra Row Value
Summary:
This PR implements the MergeOperator for Cassandra Row Values.
Closes https://github.com/facebook/rocksdb/pull/2289

Differential Revision: D5055464

Pulled By: scv119

fbshipit-source-id: 45f276ef8cbc4704279202f6a20c64889bc1adef
2017-06-16 14:27:00 -07:00
Siying Dong
95b0e89b5d Improve write buffer manager (and allow the size to be tracked in block cache)
Summary:
Improve write buffer manager in several ways:
1. Size is tracked when arena block is allocated, rather than every allocation, so that it can better track actual memory usage and the tracking overhead is slightly lower.
2. We start to trigger memtable flush when 7/8 of the memory cap hits, instead of 100%, and make 100% much harder to hit.
3. Allow a cache object to be passed into buffer manager and the size allocated by memtable can be costed there. This can help users have one single memory cap across block cache and memtable.
Closes https://github.com/facebook/rocksdb/pull/2350

Differential Revision: D5110648

Pulled By: siying

fbshipit-source-id: b4238113094bf22574001e446b5d88523ba00017
2017-06-02 14:26:56 -07:00
Maysam Yabandeh
5a9b4d7435 Retire memenv https://github.com/facebook/rocksdb/pull/2082
Summary:
This is a manual commit of this PR:
Retire InMemoryEnv in favor of MockEnv #2082
With MockEnv doing the same yet being more mature, InMemoryEnv is redundant.

Reviewed By: IslamAbdelRahman

Differential Revision: D5162323

fbshipit-source-id: 59fd0082a891dc99cc531e4da9d68bf891eae3f5
2017-06-01 15:41:20 -07:00
Tamir Duberstein
0dc3040d54 db: avoid #includeing malloc and jemalloc simultaneously
Summary:
This fixes a compilation failure on Linux when the system libc is not
glibc. jemalloc's configure script incorrectly assumes that glibc is
always used on Linux systems, producing glibc-style signatures; when
the system libc is e.g. musl, the following error is observed:

```
  [  0%] Building CXX object CMakeFiles/rocksdb.dir/db/db_impl.cc.o
  In file included from /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb.src/table/block.h:19:0,
                   from /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb.src/db/db_impl.cc:77:
  /x-tools/x86_64-unknown-linux-musl/x86_64-unknown-linux-musl/sysroot/usr/include/malloc.h:19:8: error: declaration of 'size_t malloc_usable_size(void*)' has a different exception specifier
   size_t malloc_usable_size(void *);
          ^~~~~~~~~~~~~~~~~~
  In file included from /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb.src/db/db_impl.cc:20:0:
  /go/native/x86_64-unknown-linux-musl/jemalloc/include/jemalloc/jemalloc.h:78:33: note: from previous declaration 'size_t malloc_usable_size(void*) throw ()'
   #  define je_malloc_usable_size malloc_usable_size
                                   ^
  /go/native/x86_64-unknown-linux-musl/jemalloc/include/jemalloc/jemalloc.h:239:41: note: in expansion of macro 'je_malloc_usable_size'
   JEMALLOC_EXPORT size_t JEMALLOC_NOTHROW je_malloc_usable_size(
                                           ^~~~~~~~~~~~~~~~~~~~~
  CMakeFiles/rocksdb.dir/build.make:350: recipe for target 'CMakeFiles/rocksdb.dir/db/db_impl.cc.o' failed
```

This works around the issue by rearranging the sources such that
jemalloc's headers are never in the same scope as the system's malloc
header. The jemalloc issue has been reported as well, see:
https://github.com/jemalloc/jemalloc/issues/778.

cc tschottdorf
Closes https://github.com/facebook/rocksdb/pull/2188

Differential Revision: D5163048

Pulled By: siying

fbshipit-source-id: c553125458892def175c1be5682b0330d80b2a0d
2017-05-31 22:43:02 -07:00
Yi Wu
ad19eb8686 Fixing blob db sequence number handling
Summary:
Blob db rely on base db returning sequence number through write batch after DB::Write(). However after recent changes to the write path, DB::Writ()e no longer return sequence number in some cases. Fixing it by have WriteBatchInternal::InsertInto() always encode sequence number into write batch.

Stacking on #2375.
Closes https://github.com/facebook/rocksdb/pull/2385

Differential Revision: D5148358

Pulled By: yiwu-arbug

fbshipit-source-id: 8bda0aa07b9334ed03ed381548b39d167dc20c33
2017-05-31 10:56:45 -07:00
Yi Wu
345878a7fb update blob_db_test
Summary:
Re-enable blob_db_test with some update:
* Commented out delay at the end of GC tests. Will update the logic later with sync point to properly trigger GC.
* Added some helper functions.

Also update make files to include blob_dump tool.
Closes https://github.com/facebook/rocksdb/pull/2375

Differential Revision: D5133793

Pulled By: yiwu-arbug

fbshipit-source-id: 95470b26d0c1f9592ba4b7637e027fdd263f425c
2017-05-30 22:26:13 -07:00
Yi Wu
578fb0b1dc Simple blob file dumper
Summary:
A simple blob file dumper.
Closes https://github.com/facebook/rocksdb/pull/2242

Differential Revision: D5097553

Pulled By: yiwu-arbug

fbshipit-source-id: c6e00d949fcd3658f9f68da9352f06339fac418d
2017-05-23 10:42:59 -07:00
Adam Retter
88c818e437 Replace deprecated RocksDB#addFile with RocksDB#ingestExternalFile
Summary:
Previously the Java implementation of `RocksDB#addFile` was both incomplete and not inline with the C++ API.

Rather than fix it, as I see that `rocksdb::DB::AddFile` is now deprecated in favour of `rocksdb::DB::IngestExternalFile`, I have removed the old broken implementation and implemented `RocksDB#ingestExternalFile`.

Closes https://github.com/facebook/rocksdb/issues/2261
Closes https://github.com/facebook/rocksdb/pull/2291

Differential Revision: D5061264

Pulled By: sagar0

fbshipit-source-id: 85df0899fa1b1fc3535175cac4f52353511d4104
2017-05-22 10:27:23 -07:00
Yi Wu
86d5492530 Fix build error with blob DB.
Summary:
snprintf is in <stdio.h> and not in namespace std.
Closes https://github.com/facebook/rocksdb/pull/2287

Reviewed By: anirbanr-fb

Differential Revision: D5054752

Pulled By: yiwu-arbug

fbshipit-source-id: 356807ec38f3c7d95951cdb41f31a3d3ae0714d4
2017-05-15 14:05:46 -07:00
Andrew Kryczka
3fa9a39c68 Add GetAllKeyVersions API
Summary:
- Introduced an include/ file dedicated to db-related debug functions to avoid making db.h more complex
- Added debugging function, `GetAllKeyVersions()`, to return a listing of internal data for a range of user keys. The new `struct KeyVersion` exposes data similar to internal key without exposing any internal type.
- Migrated the "ldb idump" subcommand to use this function
- The API takes an inclusive-exclusive range to match behavior of "ldb idump". This will be quite annoying for users who want to query a single user key's versions :(.
Closes https://github.com/facebook/rocksdb/pull/2232

Differential Revision: D4976007

Pulled By: ajkr

fbshipit-source-id: cab375da53a7595d6575af2b7e3b776aa3ad793e
2017-05-12 15:54:06 -07:00
Anirban Rahut
d85ff4953c Blob storage pr
Summary:
The final pull request for Blob Storage.
Closes https://github.com/facebook/rocksdb/pull/2269

Differential Revision: D5033189

Pulled By: yiwu-arbug

fbshipit-source-id: 6356b683ccd58cbf38a1dc55e2ea400feecd5d06
2017-05-10 15:14:44 -07:00
Andrew Kryczka
f6a27d0bce Extract statistics tests into separate file
Summary:
I'm going to add more DB tests for statistics as currently we have very few. I started a file dedicated to this purpose and moved the existing stats-specific tests there.
Closes https://github.com/facebook/rocksdb/pull/2211

Differential Revision: D4951558

Pulled By: ajkr

fbshipit-source-id: 05d11c35079c40ecabdfd2cf5556ccb761f694a4
2017-04-26 14:47:23 -07:00
Andrew Kryczka
e5e545a021 Reunite checkpoint and backup core logic
Summary:
These code paths forked when checkpoint was introduced by copy/pasting the core backup logic. Over time they diverged and bug fixes were sometimes applied to one but not the other (like fix to include all relevant WALs for 2PC), or it required extra effort to fix both (like fix to forge CURRENT file). This diff reunites the code paths by extracting the core logic into a function, CreateCustomCheckpoint(), that is customizable via callbacks to implement both checkpoint and backup.

Related changes:

- flush_before_backup is now forcibly enabled when 2PC is enabled
- Extracted CheckpointImpl class definition into a header file. This is so the function, CreateCustomCheckpoint(), can be called by internal rocksdb code but not exposed to users.
- Implemented more functions in DummyDB/DummyLogFile (in backupable_db_test.cc) that are used by CreateCustomCheckpoint().
Closes https://github.com/facebook/rocksdb/pull/1932

Differential Revision: D4622986

Pulled By: ajkr

fbshipit-source-id: 157723884236ee3999a682673b64f7457a7a0d87
2017-04-24 15:06:46 -07:00
Siying Dong
ff97287016 Refactor compaction picker code
Summary:
1. Move universal compaction picker to separate files compaction_picker_universal.cc and compaction_picker_universal.h.
2. Rename some functions to make the code easier to understand.
3. Move leveled compaction picking code to a dedicated class, so that we we don't need to pass some common variable around when calling functions. It also allowed us to break down LevelCompactionPicker::PickCompaction() to smaller functions.
Closes https://github.com/facebook/rocksdb/pull/2100

Differential Revision: D4845948

Pulled By: siying

fbshipit-source-id: efa0ab4
2017-04-06 20:09:34 -07:00
Sagar Vemuri
343b59d6ee Move various string utility functions into string_util
Summary:
This is an effort to club all string related utility functions into one common place, in string_util, so that it is easier for everyone to know what string processing functions are available. Right now they seem to be spread out across multiple modules, like logging and options_helper.

Check the sub-commits for easier reviewing.
Closes https://github.com/facebook/rocksdb/pull/2094

Differential Revision: D4837730

Pulled By: sagar0

fbshipit-source-id: 344278a
2017-04-06 14:54:12 -07:00
Yi Wu
df6f5a3772 Move memtable related files into memtable directory
Summary:
Move memtable related files into memtable directory.
Closes https://github.com/facebook/rocksdb/pull/2087

Differential Revision: D4829242

Pulled By: yiwu-arbug

fbshipit-source-id: ca70ab6
2017-04-06 14:09:13 -07:00
Siying Dong
d2dce5611a Move some files under util/ to separate dirs
Summary:
Move some files under util/ to new directories env/, monitoring/ options/ and cache/
Closes https://github.com/facebook/rocksdb/pull/2090

Differential Revision: D4833681

Pulled By: siying

fbshipit-source-id: 2fd8bef
2017-04-05 19:09:16 -07:00
Siying Dong
ce64b8b719 Divide db/db_impl.cc
Summary:
db_impl.cc is too large to manage. Divide db_impl.cc into db/db_impl.cc, db/db_impl_compaction_flush.cc, db/db_impl_files.cc, db/db_impl_open.cc and db/db_impl_write.cc.
Closes https://github.com/facebook/rocksdb/pull/2095

Differential Revision: D4838188

Pulled By: siying

fbshipit-source-id: c5f3059
2017-04-05 17:24:19 -07:00
Andrew Kryczka
e2c6c06366 add TimedEnv
Summary:
I've needed Env timing measurements a few times now, so finally built something for it.
Closes https://github.com/facebook/rocksdb/pull/2073

Differential Revision: D4811231

Pulled By: ajkr

fbshipit-source-id: 218a249
2017-04-04 11:24:12 -07:00
Siying Dong
6ef8c620d3 Move auto_roll_logger and filename out of db/
Summary:
It is confusing to have auto_roll_logger to stay under db/, which has nothing to do with database. Move filename together as it is a dependency.
Closes https://github.com/facebook/rocksdb/pull/2080

Differential Revision: D4821141

Pulled By: siying

fbshipit-source-id: ca7d768
2017-04-03 18:39:14 -07:00
Adam Retter
0ee7f04039 Added missing options to RocksJava
Summary:
This adds almost all missing options to RocksJava
Closes https://github.com/facebook/rocksdb/pull/2039

Differential Revision: D4779991

Pulled By: siying

fbshipit-source-id: 4a1bf28
2017-03-30 12:09:21 -07:00
Maysam Yabandeh
a2f7a514d1 Refactoring
Summary:
This is the first split of https://github.com/facebook/rocksdb/pull/1891 and will be needed for the upcoming partitioned filter patch.
Closes https://github.com/facebook/rocksdb/pull/1949

Differential Revision: D4652152

Pulled By: maysamyabandeh

fbshipit-source-id: 9801778
2017-03-03 18:24:12 -08:00
Siying Dong
ba4c77bd6b Divide external_sst_file_test
Summary:
Separate the platform dependent tests from external_sst_file_test. Only those tests need to run on platforms like OSX
Closes https://github.com/facebook/rocksdb/pull/1923

Differential Revision: D4622461

Pulled By: siying

fbshipit-source-id: d2d6f04
2017-02-28 14:24:11 -08:00
Siying Dong
8ad0fcdf99 Separate small subset tests in DBTest
Summary:
Separate a smal subset of tests in DBTest to DBBasicTest. Tests in DBTest don't have to run in CI tests on platforms like OSX, as long as they are covered by Linux.
Closes https://github.com/facebook/rocksdb/pull/1924

Differential Revision: D4616702

Pulled By: siying

fbshipit-source-id: 13e6549
2017-02-27 12:24:11 -08:00
Siying Dong
1ba2804b7f Remove XFunc tests
Summary:
Xfunc is hardly used. Remove it to keep the code simple.
Closes https://github.com/facebook/rocksdb/pull/1905

Differential Revision: D4603220

Pulled By: siying

fbshipit-source-id: 731f96d
2017-02-23 12:09:11 -08:00
Islam AbdelRahman
574b543f80 Rename merger.h -> merging_iterator.h
Summary:
merger.h was always a confusing name for me, simply give the file a better name
Closes https://github.com/facebook/rocksdb/pull/1836

Differential Revision: D4505357

Pulled By: IslamAbdelRahman

fbshipit-source-id: 07b28d8
2017-02-02 16:54:19 -08:00
Andrew Kryczka
17c1180603 Generalize Env registration framework
Summary:
The Env registration framework supports registering client Envs and selecting which one to instantiate according to a text field. This enabled things like adding the -env_uri argument to db_bench, so the same binary could be reused with different Envs just by changing CLI config.

Now this problem has come up again in a non-Env context, as I want to instantiate a client Statistics implementation from db_bench, which is configured entirely via text parameters. Also, in the future we may wish to use it for deserializing client objects when loading OPTIONS file.

This diff generalizes the Env registration logic to work with arbitrary types.

- Generalized registration and instantiation code by templating them
- The entire implementation is in a header file as that's Google style guide's recommendation for template definitions
- Pattern match with std::regex_match rather than checking prefix, which was the previous behavior
- Rename functions/files to be non-Env-specific
Closes https://github.com/facebook/rocksdb/pull/1776

Differential Revision: D4421933

Pulled By: ajkr

fbshipit-source-id: 34647d1
2017-01-25 16:09:14 -08:00
Yi Wu
c270735861 Iterator should be in corrupted status if merge operator return false
Summary:
Iterator should be in corrupted status if merge operator return false.
Also add test to make sure if max_successive_merges is hit during write,
data will not be lost.
Closes https://github.com/facebook/rocksdb/pull/1665

Differential Revision: D4322695

Pulled By: yiwu-arbug

fbshipit-source-id: b327b05
2016-12-16 11:09:16 -08:00
Igor Canadi
3f407b065c Kill flashcache code in RocksDB
Summary:
Now that we have userspace persisted cache, we don't need flashcache anymore.
Closes https://github.com/facebook/rocksdb/pull/1588

Differential Revision: D4245114

Pulled By: igorcanadi

fbshipit-source-id: e2c1c72
2016-12-01 10:09:22 -08:00
Andrew Kryczka
e333528991 DeleteRange write path end-to-end tests
Summary: Closes https://github.com/facebook/rocksdb/pull/1578

Differential Revision: D4241171

Pulled By: ajkr

fbshipit-source-id: ce5fd83
2016-11-29 11:09:22 -08:00
Islam AbdelRahman
86eb2b9ad9 Fix src.mk 2016-11-16 18:05:19 -08:00
Islam AbdelRahman
869ae5d786 Support IngestExternalFile (remove AddFile restrictions)
Summary:
Changes in the diff

API changes:
- Introduce IngestExternalFile to replace AddFile (I think this make the API more clear)
- Introduce IngestExternalFileOptions (This struct will encapsulate the options for ingesting the external file)
- Deprecate AddFile() API

Logic changes:
- If our file overlap with the memtable we will flush the memtable
- We will find the first level in the LSM tree that our file key range overlap with the keys in it
- We will find the lowest level in the LSM tree above the the level we found in step 2 that our file can fit in and ingest our file in it
- We will assign a global sequence number to our new file
- Remove AddFile restrictions by using global sequence numbers

Other changes:
- Refactor all AddFile logic to be encapsulated in ExternalSstFileIngestionJob

Test Plan:
unit tests (still need to add more)
addfile_stress (https://reviews.facebook.net/D65037)

Reviewers: yiwu, andrewkr, lightmark, yhchiang, sdong

Reviewed By: sdong

Subscribers: jkedgar, hcz, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D65061
2016-10-20 17:05:32 -07:00
Andrew Kryczka
6fbe96baf8 Compaction Support for Range Deletion
Summary:
This diff introduces RangeDelAggregator, which takes ownership of iterators
provided to it via AddTombstones(). The tombstones are organized in a two-level
map (snapshot stripe -> begin key -> tombstone). Tombstone creation avoids data
copy by holding Slices returned by the iterator, which remain valid thanks to pinning.

For compaction, we create a hierarchical range tombstone iterator with structure
matching the iterator over compaction input data. An aggregator based on that
iterator is used by CompactionIterator to determine which keys are covered by
range tombstones. In case of merge operand, the same aggregator is used by
MergeHelper. Upon finishing each file in the compaction, relevant range tombstones
are added to the output file's range tombstone metablock and file boundaries are
updated accordingly.

To check whether a key is covered by range tombstone, RangeDelAggregator::ShouldDelete()
considers tombstones in the key's snapshot stripe. When this function is used outside of
compaction, it also checks newer stripes, which can contain covering tombstones. Currently
the intra-stripe check involves a linear scan; however, in the future we plan to collapse ranges
within a stripe such that binary search can be used.

RangeDelAggregator::AddToBuilder() adds all range tombstones in the table's key-range
to a new table's range tombstone meta-block. Since range tombstones may fall in the gap
between files, we may need to extend some files' key-ranges. The strategy is (1) first file
extends as far left as possible and other files do not extend left, (2) all files extend right
until either the start of the next file or the end of the last range tombstone in the gap,
whichever comes first.

One other notable change is adding release/move semantics to ScopedArenaIterator
such that it can be used to transfer ownership of an arena-allocated iterator, similar to
how unique_ptr is used for malloc'd data.

Depends on D61473

Test Plan: compaction_iterator_test, mock_table, end-to-end tests in D63927

Reviewers: sdong, IslamAbdelRahman, wanning, yhchiang, lightmark

Reviewed By: lightmark

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D62205
2016-10-18 12:04:56 -07:00
Injun Song
7260662b39 Add Java API for SstFileWriter
Add Java API for SstFileWriter. Closes https://github.com/facebook/rocksdb/issues/1248
2016-09-29 17:04:41 -04:00
Yi Wu
9ed928e7a9 Split DBOptions into ImmutableDBOptions and MutableDBOptions
Summary: Use ImmutableDBOptions/MutableDBOptions internally and DBOptions only for user-facing APIs. MutableDBOptions is barely a placeholder for now. I'll start to move options to MutableDBOptions in following diffs.

Test Plan:
  make all check

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64065
2016-09-23 16:34:04 -07:00
Islam AbdelRahman
52ee07b021 Move AddFile() tests to external_sst_file_test.cc
Summary: Simply move the tests

Test Plan: make check -j64

Reviewers: andrewkr, lightmark, yiwu, yhchiang, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D62529
2016-09-07 15:41:54 -07:00
Injun Song
ce1be2ce37 Fix build error on Windows (AppVeyor) (#1315)
Add 'cf_options' to source list and db_imple.cc

fix casting
2016-09-06 08:41:43 -07:00
Yi Wu
a88677d2cf Remove ImmutableCFOptions from public API
Summary: There's no reference to ImmutableCFOptions elsewhere in /include/rocksdb. ImmutableCFOptions was introduced in this commit (5665e5e285) but later its reference in /include/rocksdb/table.h is removed.

Test Plan:
  make all check

Reviewers: IslamAbdelRahman, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: yhchiang, andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D63177
2016-09-02 14:16:31 -07:00
Islam AbdelRahman
e9b2af87f8 Expose ThreadPool under include/rocksdb/threadpool.h
Summary:
This diff split ThreadPool to
-ThreadPool (abstract interface exposed in include/rocksdb/threadpool.h)
-ThreadPoolImpl (actual implementation in util/threadpool_imp.h)

This allow us to expose ThreadPool to the user so we can use it as an option later

Test Plan: existing unit tests

Reviewers: andrewkr, yiwu, yhchiang, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D62085
2016-08-26 10:41:35 -07:00
Adam Retter
ffdf6eee19 Add Status to RocksDBException so that meaningful function result Status from the C++ API isn't lost (#1273) 2016-08-22 11:01:42 -07:00
Yi Wu
4cc37f59e5 Introduce ClockCache
Summary:
Clock-based cache implemenetation aim to have better concurreny than
default LRU cache. See inline comments for implementation details.

Test Plan:
Update cache_test to run on both LRUCache and ClockCache. Adding some
new tests to catch some of the bugs that I fixed while implementing the
cache.

Reviewers: kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61647
2016-08-19 12:28:19 -07:00
sdong
8b79422b52 [Proof-Of-Concept] RocksDB Blob Storage with a blob log file.
Summary:
This is a proof of concept of a RocksDB blob log file. The actual value of the Put() is appended to a blob log using normal data block format, and the handle of the block is written as the value of the key in RocksDB.

The prototype only supports Put() and Get(). It doesn't support DB restart, garbage collection, Write() call, iterator, snapshots, etc.

Test Plan: Add unit tests.

Reviewers: arahut

Reviewed By: arahut

Subscribers: kradhakrishnan, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D61485
2016-08-10 17:05:17 -07:00
omegaga
44f5cc57a5 Add time series database (resubmitted)
Summary: Implement a time series database that supports DateTieredCompactionStrategy. It wraps a db object and separate SST files in different column families (time windows).

Test Plan: Add `date_tiered_test`.

Reviewers: dhruba, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61653
2016-08-05 15:56:22 -07:00
sdong
7c4615cf1f A utility function to help users migrate DB after options change
Summary: Add a utility function that trigger necessary full compaction and put output to the correct level by looking at new options and old options.

Test Plan: Add unit tests for it.

Reviewers: andrewkr, igor, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: muthu, sumeet, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60783
2016-08-05 15:39:55 -07:00
omegaga
d51dc96a79 Experiments on column-aware encodings
Summary:
Experiments on column-aware encodings. Supported features: 1) extract data blocks from SST file and encode with specified encodings; 2) Decode encoded data back into row format; 3) Directly extract data blocks and write in row format (without prefix encoding); 4) Get column distribution statistics for column format; 5) Dump data blocks separated by columns in human-readable format.

There is still on-going work on this diff. More refactoring is necessary.

Test Plan: Wrote tests in `column_aware_encoding_test.cc`. More tests should be added.

Reviewers: sdong

Reviewed By: sdong

Subscribers: arahut, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60027
2016-08-01 14:50:19 -07:00
krad
c116b47804 Persistent Read Cache (part 6) Block Cache Tier Implementation
Summary:
The patch is a continuation of part 5. It glues the abstraction for
file layout and metadata, and flush out the implementation of the API. It
adds unit tests for the implementation.

Test Plan: Run unit tests

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57549
2016-08-01 14:15:14 -07:00
Islam AbdelRahman
f8061a237e Fix Statistics TickersNameMap miss match with Tickers enum
Summary:
TickersNameMap is not consistent with Tickers enum.
this cause us to report wrong statistics and sometimes to access TickersNameMap outside it's boundary causing crashes (in Fb303 statistics)

Test Plan: added new unit test

Reviewers: sdong, kradhakrishnan

Reviewed By: kradhakrishnan

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D61083
2016-07-25 16:05:50 -07:00
Yi Wu
32604e6601 Fix flush not being commit while writing manifest
Summary:
Fix flush not being commit while writing manifest, which is a recent bug introduced by D60075.

The issue:
# Options.max_background_flushes > 1
# Background thread A pick up a flush job, flush, then commit to manifest. (Note that mutex is released before writing manifest.)
# Background thread B pick up another flush job, flush. When it gets to `MemTableList::InstallMemtableFlushResults`, it notices another thread is commiting, so it quit.
# After the first commit, thread A doesn't double check if there are more flush result need to commit, leaving the second flush uncommitted.

Test Plan: run the test. Also verify the new test hit deadlock without the fix.

Reviewers: sdong, igor, lightmark

Reviewed By: lightmark

Subscribers: andrewkr, omegaga, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60969
2016-07-21 10:10:41 -07:00
krad
d9cfaa2b16 Persistent Read Cache (6) Persistent cache tier implentation - File layout
Summary:
Persistent cache tier is the tier abstraction that can work for any block
device based device mounted on a file system. The design/implementation can
handle any generic block device.

Any generic block support is achieved by generalizing the access patten as
{io-size, q-depth, direct-io/buffered}.

We have specifically tested and adapted the IO path for NVM and SSD.

Persistent cache tier consists of there parts :

1) File layout

Provides the implementation for handling IO path for reading and writing data
(key/value pair).

2) Meta-data
Provides the implementation for handling the index for persistent read cache.

3) Implementation
It binds (1) and (2) and flushed out the PersistentCacheTier interface

This patch provides implementation for (1)(2). Follow up patch will provide (3)
and tests.

Test Plan: Compile and run check

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57117
2016-07-19 12:01:46 -07:00
Yi Wu
4b95253587 Refactor cache.cc
Summary: Refactor cache.cc so that I can plugin clock cache (D55581). Mainly move `ShardedCache` to separate file, move `LRUHandle` back to cache.cc and rename it lru_cache.cc.

Test Plan:
    make check -j64

Reviewers: lightmark, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D59655
2016-07-15 10:41:36 -07:00
Yi Wu
6ea41f8527 Fix deadlock when trying update options when write stalls
Summary:
When write stalls because of auto compaction is disabled, or stop write trigger is reached,
user may change these two options to unblock writes. Unfortunately we had issue where the write
thread will block the attempt to persist the options, thus creating a deadlock. This diff
fix the issue and add two test cases to detect such deadlock.

Test Plan:
Run unit tests.

Also, revert db_impl.cc to master (but don't revert `DBImpl::BackgroundCompaction:Finish` sync point) and run db_options_test. Both tests should hit deadlock.

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60627
2016-07-12 15:30:38 -07:00
omegaga
e6f68faf99 Update Makefile to fix dependency
Summary: In D33849 we updated Makefile to generate .d files for all .cc sources. Since we have more types of source files now, this needs to be updated so that this mechanism can work for new files.

Test Plan: change a dependent .h file, re-make and see if .o file is recompiled.

Reviewers: sdong, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60591
2016-07-11 13:55:46 -07:00
Islam AbdelRahman
fa813f7478 Update DB::AddFile() to ingest the file to the lowest possible level
Summary:
DB::AddFile() right now always add the ingested file to L0
update the logic to add the file to the lowest possible level

Test Plan: unit tests

Reviewers: jkedgar, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D59637
2016-06-21 17:57:59 -07:00
sdong
6faddd7c55 Merge db/slice.cc into util/slice.cc
Summary: It confuses some compilers to have slice.cc under multiple directories. Merge them.

Test Plan: Run existing tests

Reviewers: andrewkr, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59409
2016-06-10 16:37:36 -07:00
sdong
b2973eaaeb Remove options builder
Summary: AFIK, options builder is not used by anyone. Remove it.

Test Plan: Run all existing tests.

Reviewers: IslamAbdelRahman, andrewkr, igor

Reviewed By: igor

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59319
2016-06-09 16:56:17 -07:00
krad
d755c62f92 Persistent Read Cache (5) Volatile cache tier implementation
Summary:
This provides provides an implementation of PersistentCacheTier that is
specialized for RAM. This tier does not persist data though.

Why do we need this tier ?

This is ideal as tier 0. This tier can host data that is too hot.

Why can't we use Cache variants ?

Yes you can use them instead. This tier can potentially outperform BlockCache
in RAW mode by virtue of compression and compressed cache in block cache doesn't
seem very popular. Potentially this tier can be modified to under stand the
disadvantage of the tier below and retain data that the tier below is bad at
handling (for example index and bloom data that is huge in size)

Test Plan: Run unit tests added

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57069
2016-06-07 11:10:44 -07:00
krad
3070ed9021 Persistent Read Cache (4) Interface definitions
Summary:
This diff provides the basic interface definitions of persistent read
cache system

PersistentCacheOptions captures the persistent read cache options used to
configure and control the system
PersistentCacheTier provides the basic building block for constructing tiered
cache
PersistentTieredCache provides a logical abstraction of tiers of cache layered
over one another

Test Plan: Compile

Reviewers: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57051
2016-06-06 13:17:09 -07:00
Andrew Kryczka
6e6622abb9 Create env_basic_test [pluggable Env part 2]
Summary:
Extracted basic Env-related tests from mock_env_test and memenv_test into a
parameterized test for Envs: env_basic_test.

Depends on D58449. (The dependency is here only so I can keep this series of
diffs in a chain -- there is no dependency on that diff's code.)

Test Plan: ran tests

Reviewers: IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D58635
2016-06-03 15:13:03 -07:00
Andrew Kryczka
af0c9ac01d Env registry for URI-based Env selection [pluggable Env part 1]
Summary:
This enables configurable Envs without recompiling. For example, my
next diff will make env_test test an Env created by NewEnvFromUri(). Then,
users can determine which Env is tested simply by providing the URI for
NewEnvFromUri() (e.g., through a CLI argument or environment variable).

The registration process allows us to register any Env that is linked with the
RocksDB library, so we can register our internal Envs as well.

The registration code is inspired by our internal InitRegistry.

Test Plan: new unit test

Reviewers: IslamAbdelRahman, lightmark, ldemailly, sdong

Reviewed By: sdong

Subscribers: leveldb, dhruba, andrewkr

Differential Revision: https://reviews.facebook.net/D58449
2016-06-03 08:15:16 -07:00
Yueh-Hsuan Chiang
88acd932f6 Allows db_bench to take an options file
Summary:
This patch allows db_bench to initialize it's RocksDB Options via a
options file, specified by the --options_file flag.  Note that if
--options_file flag is set, then it has higher priority than the
command-line argument.

Test Plan: db_bench_tool_test

Reviewers: sdong, IslamAbdelRahman, kradhakrishnan, yiwu, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D58533
2016-06-02 16:24:14 -07:00
Aaron Gao
5d660258e7 add simulator Cache as class SimCache/SimLRUCache(with test)
Summary: add class SimCache(base class with instrumentation api) and SimLRUCache(derived class with detailed implementation) which is used as an instrumented block cache that can predict hit rate for different cache size

Test Plan:
Add a test case in `db_block_cache_test.cc` called `SimCacheTest` to test basic logic of SimCache.
Also add option `-simcache_size` in db_bench. if set with a value other than -1, then the benchmark will use this value as the size of the simulator cache and finally output the simulation result.
```
[gzh@dev9927.prn1 ~/local/rocksdb] ./db_bench -benchmarks "fillseq,readrandom" -cache_size 1000000 -simcache_size 1000000
RocksDB:    version 4.8
Date:       Tue May 17 16:56:16 2016
CPU:        32 * Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
CPUCache:   20480 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Write rate: 0 bytes/second
Compression: Snappy
Memtablerep: skip_list
Perf Level: 0
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
DB path: [/tmp/rocksdbtest-112628/dbbench]
fillseq      :       6.809 micros/op 146874 ops/sec;   16.2 MB/s
DB path: [/tmp/rocksdbtest-112628/dbbench]
readrandom   :       6.343 micros/op 157665 ops/sec;   17.4 MB/s (1000000 of 1000000 found)

SIMULATOR CACHE STATISTICS:
SimCache LOOKUPs: 986559
SimCache HITs:    264760
SimCache HITRATE: 26.84%

[gzh@dev9927.prn1 ~/local/rocksdb] ./db_bench -benchmarks "fillseq,readrandom" -cache_size 1000000 -simcache_size 10000000
RocksDB:    version 4.8
Date:       Tue May 17 16:57:10 2016
CPU:        32 * Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
CPUCache:   20480 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Write rate: 0 bytes/second
Compression: Snappy
Memtablerep: skip_list
Perf Level: 0
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
DB path: [/tmp/rocksdbtest-112628/dbbench]
fillseq      :       5.066 micros/op 197394 ops/sec;   21.8 MB/s
DB path: [/tmp/rocksdbtest-112628/dbbench]
readrandom   :       6.457 micros/op 154870 ops/sec;   17.1 MB/s (1000000 of 1000000 found)

SIMULATOR CACHE STATISTICS:
SimCache LOOKUPs: 1059764
SimCache HITs:    374501
SimCache HITRATE: 35.34%

[gzh@dev9927.prn1 ~/local/rocksdb] ./db_bench -benchmarks "fillseq,readrandom" -cache_size 1000000 -simcache_size 100000000
RocksDB:    version 4.8
Date:       Tue May 17 16:57:32 2016
CPU:        32 * Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
CPUCache:   20480 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Write rate: 0 bytes/second
Compression: Snappy
Memtablerep: skip_list
Perf Level: 0
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
DB path: [/tmp/rocksdbtest-112628/dbbench]
fillseq      :       5.632 micros/op 177572 ops/sec;   19.6 MB/s
DB path: [/tmp/rocksdbtest-112628/dbbench]
readrandom   :       6.892 micros/op 145094 ops/sec;   16.1 MB/s (1000000 of 1000000 found)

SIMULATOR CACHE STATISTICS:
SimCache LOOKUPs: 1150767
SimCache HITs:    1034535
SimCache HITRATE: 89.90%
```

Reviewers: IslamAbdelRahman, andrewkr, sdong

Reviewed By: sdong

Subscribers: MarkCallaghan, andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57999
2016-05-23 23:35:23 -07:00
sdong
1d725ca51d Deprecate BlockBasedTableOptions.hash_index_allow_collision=false.
Summary: Deprecate this one option and delete code and tests that are now superfluous.

Test Plan: all tests pass

Reviewers: igor, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: msalib, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D55317
2016-05-20 17:52:27 -07:00
Islam AbdelRahman
1f2dca0eaa Add MaxOperator to utilities/merge_operators/
Summary:
Introduce MaxOperator a simple merge operator that return the max of all operands.
This merge operand help me in benchmarking

Test Plan: Add new unitttests

Reviewers: sdong, andrewkr, yhchiang

Reviewed By: yhchiang

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57873
2016-05-19 15:51:29 -07:00
omegaga
3c69f77c67 Move IO failure test to separate file
Summary:
This is a part of effort to reduce the size of db_test.cc. We move the following tests to a separate file `db_io_failure_test.cc`:

* DropWrites
* DropWritesFlush
* NoSpaceCompactRange
* NonWritableFileSystem
* ManifestWriteError
* PutFailsParanoid

Test Plan: Run `make check` to see if the tests are working properly.

Reviewers: sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58341
2016-05-18 17:09:20 -07:00
krad
a08c8c851a Added PersistentCache abstraction
Summary:
Added a new abstraction to cache page to RocksDB designed for the read
cache use.

RocksDB current block cache is more of an object cache. For the persistent read cache
project, what we need is a page cache equivalent. This changes adds a cache
abstraction to RocksDB to cache pages called PersistentCache. PersistentCache can cache
uncompressed pages or raw pages (content as in filesystem). The user can
choose to operate PersistentCache either in  COMPRESSED or UNCOMPRESSED mode.

Blame Rev:

Test Plan: Run unit tests

Reviewers: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D55707
2016-05-15 22:17:18 -07:00
Dmitri Smirnov
aab91b8d8f Use generic threadpool for Windows environment (#1120)
Conditionally retrofit thread_posix for use with std::thread
  and reuse the same logic. Posix users continue using Posix interfaces.
  Enable XPRESS compression in test runs.
  Fix master introduced signed/unsigned mismatch.
2016-05-12 18:34:04 -07:00
Reid Horuff
a657ee9a9c [rocksdb] Recovery path sequence miscount fix
Summary:
Consider the following WAL with 4 batch entries prefixed with their sequence at time of memtable insert.
[1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(a)]
[1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(b)]
[4: COMMIT(a)]
[7: COMMIT(b)]

The first two batches do not consume any sequence numbers so are both prefixed with seq=1.
For 2pc commit, memtable insertion takes place before COMMIT batch is written to WAL.
We can see that sequence number consumption takes place between WAL entries giving us the seemingly sparse sequence prefix for WAL entries.
This is a valid WAL.

Because with 2PC markers one WriteBatch points to another batch containing its inserts a writebatch can consume more or less sequence numbers than the number of sequence consuming entries that it contains.

We can see that, given the entries in the WAL, 6 sequence ids were consumed. Yet on recovery the maximum sequence consumed would be 7 + 3 (the number of sequence numbers consumed by COMMIT(b))

So, now upon recovery we must track the actual consumption of sequence numbers.
In the provided scenario there will be no sequence gaps, but it is possible to produce a sequence gap. This should not be a problem though. correct?

Test Plan: provided test.

Reviewers: sdong

Subscribers: andrewkr, leveldb, dhruba, hermanlee4

Differential Revision: https://reviews.facebook.net/D57645
2016-05-10 14:06:07 -07:00
Andrew Kryczka
3f16a836a4 Introduce chroot Env
Summary:
For testing backups, we needed an Env that is fully isolated from other
Envs on the same machine. Our in-memory Envs (MockEnv and InMemoryEnv) were
insufficient because they don't implement most directory operations.

This diff introduces a new Env, "ChrootEnv", that translates paths such that the
chroot directory appears to be the root directory. This way, multiple Envs can
be isolated in the filesystem by using different chroot directories. Since we
use the filesystem, all directory operations are trivially supported.

Test Plan:
I parameterized the existing EnvPosixTest so it runs tests on ChrootEnv
except the ioctl-related cases.

Reviewers: sdong, lightmark, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57543
2016-05-06 17:42:50 -07:00
Yi Wu
792762c42c Split db_test.cc
Summary: Split db_test.cc into several files. Moving several helper functions into DBTestBase.

Test Plan: make check

Reviewers: sdong, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: dhruba, andrewkr, kradhakrishnan, yhchiang, leveldb, sdong

Differential Revision: https://reviews.facebook.net/D56715
2016-04-18 09:42:50 -07:00
Siying Dong
774922c680 Merge pull request #1026 from SherlockNoMad/Hist
Histogram Concurrency Improvement and Time-Windowing Support
2016-03-15 11:27:54 -07:00
SherlockNoMad
54f6b9e162 Histogram Concurrency Improvement and Time-Windowing Support 2016-03-11 16:54:25 -08:00
agiardullo
790252805d Add multithreaded transaction test
Summary: Refactored db_bench transaction stress tests so that they can be called from unit tests as well.

Test Plan: run new unit test as well as db_bench

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D55203
2016-03-11 15:16:52 -08:00
Yi Wu
2568985ab3 IOStatsContext::ToString() add option to exclude zero counters
Summary: similar to D52809 add option to exclude zero counters.

Test Plan:
[yiwu@dev4504.prn1 ~/rocksdb] ./iostats_context_test
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from IOStatsContextTest
[ RUN      ] IOStatsContextTest.ToString
[       OK ] IOStatsContextTest.ToString (0 ms)
[----------] 1 test from IOStatsContextTest (0 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (0 ms total)
[  PASSED  ] 1 test.

Reviewers: anthony, yhchiang, andrewkr, IslamAbdelRahman, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54591
2016-02-23 10:26:24 -08:00
Jonathan Wiepert
7bd284c374 Separeate main from bench functionality to allow cusomizations
Summary: Isolate db_bench functionality from main so custom benchmark code can be written and managed

Test Plan:
Tested commands
./build_tools/regression_build_test.sh
./db_bench --db=/tmp/rocksdbtest-12321/dbbench --stats_interval_seconds=1 --num=1000
./db_bench --db=/tmp/rocksdbtest-12321/dbbench --stats_interval_seconds=1 --num=1000 --reads=500 --writes=500
./db_bench --db=/tmp/rocksdbtest-12321/dbbench --stats_interval_seconds=1 --num=1000 --merge_keys=100 --numdistinct=100 --num_column_families=3 --num_hot_column_families=1
./db_bench --stats_interval_seconds=1 --num=1000 --bloom_locality=1 --seed=5 --threads=5
./db_bench --duration=60 --value_size=50 --seek_nexts=10 --reverse_iterator=true --usee_uint64_comparator=true --batch-size=5
./db_bench --duration=60 --value_size=50 --seek_nexts=10 --reverse_iterator=true --use_uint64_comparator=true --batch_size=5
./db_bench --duration=60 --value_size=50 --seek_nexts=10 --reverse_iterator=true --usee_uint64_comparator=true --batch-size=5

Test Results - https://phabricator.fb.com/P56130387

Additional tests for:
./db_bench --duration=60 --value_size=50 --seek_nexts=10 --reverse_iterator=true --use_uint64_comparator=true --batch_size=5 --key_size=8 --merge_operator=put
./db_bench --stats_interval_seconds=1 --num=1000 --bloom_locality=1 --seed=5 --threads=5 --merge_operator=uint64add

Results: https://phabricator.fb.com/P56130607

Reviewers: yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D53991
2016-02-16 06:17:31 -08:00
Jonathan Wiepert
14a322033f Remove references to files deleted in commit abb4052278
Summary:
Remove obolete references to files in src.mk
Fix incorrect path for reference in source.mk

Test Plan: Ran build to ensure changes do not break anything.

Reviewers: leveldb, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D53733
2016-02-03 16:47:45 -08:00
Islam AbdelRahman
d6c838f1e1 Add SstFileManager (component tracking all SST file in DBs and control the deletion rate)
Summary:
Add a new class SstFileTracker that will be notified whenever a DB add/delete/move and sst file, it will also replace DeleteScheduler
SstFileTracker can be used later to abort writes when we exceed a specific size

Test Plan: unit tests

Reviewers: rven, anthony, yhchiang, sdong

Reviewed By: sdong

Subscribers: igor, lovro, march, dhruba

Differential Revision: https://reviews.facebook.net/D50469
2016-01-28 18:35:01 -08:00
Andrew Kryczka
167bd8856d [directory includes cleanup] Finish removing util->db dependencies 2016-01-26 10:49:24 -08:00
Andrew Kryczka
46f9cd46af [directory includes cleanup] Move cross-function test points
Summary:
I split the db-specific test points out into a separate file under db/
directory. There were also a few bugs to fix in xfunc.{h,cc} that prevented it
from compiling previously; see https://reviews.facebook.net/D36825.

Test Plan:
compilation works now, below command works, will also run "make xfunc".

  $ make check ROCKSDB_XFUNC_TEST='managed_new' tests-regexp='DBTest' -j32

Reviewers: sdong, yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D53343
2016-01-26 10:49:05 -08:00
Nathan Bronson
7d87f02799 support for concurrent adds to memtable
Summary:
This diff adds support for concurrent adds to the skiplist memtable
implementations.  Memory allocation is made thread-safe by the addition of
a spinlock, with small per-core buffers to avoid contention.  Concurrent
memtable writes are made via an additional method and don't impose a
performance overhead on the non-concurrent case, so parallelism can be
selected on a per-batch basis.

Write thread synchronization is an increasing bottleneck for higher levels
of concurrency, so this diff adds --enable_write_thread_adaptive_yield
(default off).  This feature causes threads joining a write batch
group to spin for a short time (default 100 usec) using sched_yield,
rather than going to sleep on a mutex.  If the timing of the yield calls
indicates that another thread has actually run during the yield then
spinning is avoided.  This option improves performance for concurrent
situations even without parallel adds, although it has the potential to
increase CPU usage (and the heuristic adaptation is not yet mature).

Parallel writes are not currently compatible with
inplace updates, update callbacks, or delete filtering.
Enable it with --allow_concurrent_memtable_write (and
--enable_write_thread_adaptive_yield).  Parallel memtable writes
are performance neutral when there is no actual parallelism, and in
my experiments (SSD server-class Linux and varying contention and key
sizes for fillrandom) they are always a performance win when there is
more than one thread.

Statistics are updated earlier in the write path, dropping the number
of DB mutex acquisitions from 2 to 1 for almost all cases.

This diff was motivated and inspired by Yahoo's cLSM work.  It is more
conservative than cLSM: RocksDB's write batch group leader role is
preserved (along with all of the existing flush and write throttling
logic) and concurrent writers are blocked until all memtable insertions
have completed and the sequence number has been advanced, to preserve
linearizability.

My test config is "db_bench -benchmarks=fillrandom -threads=$T
-batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T
-level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999
-disable_auto_compactions --max_write_buffer_number=8
-max_background_flushes=8 --disable_wal --write_buffer_size=160000000
--block_size=16384 --allow_concurrent_memtable_write" on a two-socket
Xeon E5-2660 @ 2.2Ghz with lots of memory and an SSD hard drive.  With 1
thread I get ~440Kops/sec.  Peak performance for 1 socket (numactl
-N1) is slightly more than 1Mops/sec, at 16 threads.  Peak performance
across both sockets happens at 30 threads, and is ~900Kops/sec, although
with fewer threads there is less performance loss when the system has
background work.

Test Plan:
1. concurrent stress tests for InlineSkipList and DynamicBloom
2. make clean; make check
3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench
4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench
5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench
6. make clean; OPT=-DROCKSDB_LITE make check
7. verify no perf regressions when disabled

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba

Differential Revision: https://reviews.facebook.net/D50589
2015-12-25 11:03:40 -08:00
Igor Canadi
64fa43843b Merge pull request #862 from ceph/wip-env
implement EnvMirror
2015-12-10 18:45:07 -08:00
Sage Weil
2074ddd625 env: add EnvMirror
This is an Env implementation that mirrors all storage-related methods on
two different backend Env's and verifies that they return the same
results (return status and read results).  This is useful for implementing
a new Env and verifying its correctness.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-12-10 21:32:45 -05:00
Javier González
b2863017b1 Move posix threads into a library
Summary: This patch moves all posix thread logic to a separate library.
The motivation is to allow another environments to easily reuse posix
threads. HDFS wraps already posix threads; this split would simplify
this code.

Test Plan: No new functionality is added to posix Env or the threading
library, thus the current tests should suffice.
2015-12-07 12:03:38 +01:00
Nathan Bronson
78812ec6bf InlineSkipList - part 1/3
Summary:
This diff is 1/3 in a sequence that introduces a skip list optimized for
a key that is a freshly-allocated const char*.  The diff is broken into
pieces to make it easier to review.  This piece only introduces the new
type by copying the existing SkipList, with mechanical naming changes
and reformatting.

Test Plan: new unit test

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D51279
2015-11-24 14:30:22 -08:00
sdong
c4ebb66d61 Not to build forward_iterator_bench now
Summary: forward_iterator_bench is not stable enough for build. Remove it for now.

Test Plan: Build it with both of CLANG and non-CLANG and make sure it builds.

Reviewers: rven, kradhakrishnan, anthony, yhchiang, igor, IslamAbdelRahman

Reviewed By: igor, IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D50991
2015-11-18 15:42:06 -08:00
Venkatesh Radhakrishnan
7824444bfc Reuse file iterators in tailing iterator when memtable is flushed
Summary:
Under a tailing workload, there were increased block cache
misses when a memtable was flushed because we were rebuilding iterators
in that case since the version set changed. This was exacerbated in the
case of iterate_upper_bound, since file iterators which were over the
iterate_upper_bound would have been deleted and are now brought back as
part of the Rebuild, only to be deleted again. We now renew the iterators
and only build iterators for files which are added and delete file
iterators for files which are deleted.
Refer to https://reviews.facebook.net/D50463 for previous version

Test Plan: DBTestTailingIterator.TailingIteratorTrimSeekToNext

Reviewers: anthony, IslamAbdelRahman, igor, tnovak, yhchiang, sdong

Reviewed By: sdong

Subscribers: yhchiang, march, dhruba, leveldb, lovro

Differential Revision: https://reviews.facebook.net/D50679
2015-11-13 15:50:59 -08:00
Yueh-Hsuan Chiang
e11f676e34 Add OptionsUtil::LoadOptionsFromFile() API
Summary:
This patch adds OptionsUtil::LoadOptionsFromFile() and
OptionsUtil::LoadLatestOptionsFromDB(), which allow developers
to construct DBOptions and ColumnFamilyOptions from a RocksDB
options file.  Note that most pointer-typed options such as
merge_operator will not be constructed.

With this API, developers no longer need to remember all the
options in order to reopen an existing rocksdb instance like
the following:

  DBOptions db_options;
  std::vector<std::string> cf_names;
  std::vector<ColumnFamilyOptions> cf_opts;

  // Load primitive-typed options from an existing DB
  OptionsUtil::LoadLatestOptionsFromDB(
      dbname, &db_options, &cf_names, &cf_opts);

  // Initialize necessary pointer-typed options
  cf_opts[0].merge_operator.reset(new MyMergeOperator());
  ...

  // Construct the vector of ColumnFamilyDescriptor
  std::vector<ColumnFamilyDescriptor> cf_descs;
  for (size_t i = 0; i < cf_opts.size(); ++i) {
    cf_descs.emplace_back(cf_names[i], cf_opts[i]);
  }

  // Open the DB
  DB* db = nullptr;
  std::vector<ColumnFamilyHandle*> cf_handles;
  auto s = DB::Open(db_options, dbname, cf_descs,
                    &handles, &db);

Test Plan:
Augment existing tests in column_family_test
options_test
db_test

Reviewers: igor, IslamAbdelRahman, sdong, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D49095
2015-11-12 06:52:43 -08:00
Yueh-Hsuan Chiang
e114f0abb8 Enable RocksDB to persist Options file.
Summary:
This patch allows rocksdb to persist options into a file on
DB::Open, SetOptions, and Create / Drop ColumnFamily.
Options files are created under the same directory as the rocksdb
instance.

In addition, this patch also adds a fail_if_missing_options_file in DBOptions
that makes any function call return non-ok status when it is not able to
persist options properly.

  // If true, then DB::Open / CreateColumnFamily / DropColumnFamily
  // / SetOptions will fail if options file is not detected or properly
  // persisted.
  //
  // DEFAULT: false
  bool fail_if_missing_options_file;

Options file names are formatted as OPTIONS-<number>, and RocksDB
will always keep the latest two options files.

Test Plan:
Add options_file_test.

options_test
column_family_test

Reviewers: igor, IslamAbdelRahman, sdong, anthony

Reviewed By: anthony

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48285
2015-11-10 22:58:01 -08:00
Nathan Bronson
b81b430987 Switch to thread-local random for skiplist
Summary:
Using a TLS random instance for skiplist makes it smaller
(useful for hash_skiplist_rep) and prepares skiplist for concurrent
adds.  This diff also modifies the branching factor math to avoid an
unnecessary division.

This diff has the effect of changing the sequence of skip list node
height choices made by tests, so it has the potential to cause unit
test failures for tests that implicitly rely on the exact structure
of the skip list.  Tests that try to exactly trigger a compaction are
likely suspects for this problem (these tests have always been brittle to
changes in the skiplist details).  I've minimizes this risk by reseeding
the main thread's Random at the beginning of each test, increasing the
universal compaction size_ratio limit from 101% to 105% for some tests,
and verifying that the tests pass many times.

Test Plan: for i in `seq 0 9`; do make check; done

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D50439
2015-11-09 19:25:22 -08:00
Yueh-Hsuan Chiang
183cadfc87 Add OptionsSanityCheckLevel
Summary:
This patch introduces OptionsSanityCheckLevel internally to enable
sanity check rocksdb options.

Utilities API will be added in the follow-up diffs.

Test Plan: Added more tests in options_test

Reviewers: igor, IslamAbdelRahman, sdong, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D49515
2015-11-04 18:53:30 -08:00
Yueh-Hsuan Chiang
7d7ee2b654 Add Memory Insight support to utilities
Summary:
This patch introduces utilities/memory, which currently includes
GetApproximateMemoryUsageByType that reports different types of
rocksdb memory usage given a list of input DBs.

The API also take care of the case where Cache could be shared
across multiple column families / multiple db instances.

Currently, it reports memory usage of memtable, table-readers
and cache.

Test Plan: utilities/memory/memory_test.cc

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D49257
2015-11-03 17:52:17 -08:00
Javier González
6e6dd5f6f9 Split posix storage backend into Env and library
Summary: This patch splits the posix storage backend into Env and
the actual *File implementations. The motivation is to allow other Envs
to use posix as a library. This enables a storage backend different from
posix to split its secondary storage between a normal file system
partition managed by posix, and it own media.

Test Plan: No new functionality is added to posix Env or the library,
thus the current tests should suffice.
2015-10-22 17:31:31 +02:00
Alexey Maykov
e1a09a7703 Implementation for GetPropertiesOfTablesInRange
Summary: In MyRocks, it is sometimes important to get propeties only for the subset of the database. This diff implements the API in RocksDB.

Test Plan: ran the GetPropertiesOfTablesInRange

Reviewers: rven, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48651
2015-10-17 13:34:43 -07:00
Venkatesh Radhakrishnan
a98fbacfa0 Moving memtable related files from util to a new directory memtable
Summary:
We are cleaning up dependencies.
This diff takes a first step at moving memtable files to their own
directory called memtable. In future diffs, we will move other memtable
files from db to memtable.

Test Plan: make check

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D48915
2015-10-16 14:10:33 -07:00