Summary:
A user friendly sst file reader is useful when we want to access sst
files outside of RocksDB. For example, we can generate an sst file
with SstFileWriter and send it to other places, then use SstFileReader
to read the file and process the entries in other ways.
Also rename the original SstFileReader to SstFileDumper because of
name conflict, and seems SstFileDumper is more appropriate for tools.
TODO: there is only a very simple test now, because I want to get some feedback first.
If the changes look good, I will add more tests soon.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4717
Differential Revision: D13212686
Pulled By: ajkr
fbshipit-source-id: 737593383264c954b79e63edaf44aaae0d947e56
Summary:
The old RangeDelAggregator did expensive pre-processing work
to create a collapsed, binary-searchable representation of range
tombstones. With FragmentedRangeTombstoneIterator, much of this work is
now unnecessary. RangeDelAggregatorV2 takes advantage of this by seeking
in each iterator to find a covering tombstone in ShouldDelete, while
doing minimal work in AddTombstones. The old RangeDelAggregator is still
used during flush/compaction for now, though RangeDelAggregatorV2 will
support those uses in a future PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4649
Differential Revision: D13146964
Pulled By: abhimadan
fbshipit-source-id: be29a4c020fc440500c137216fcc1cf529571eb3
Summary:
In #4652 we are setting -Os for lite builds only when LITE=1 is specified. But currently almost all the users invoke lite build via OPT="-DROCKSDB_LITE=1". So this diff tries to set LITE=1 when users already pass in -DROCKSDB_LITE=1 via the command line.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4671
Differential Revision: D13033801
Pulled By: sagar0
fbshipit-source-id: e7b506cee574f9e3f42221ee6647915011c78d78
Summary:
When running `make shared_lib` under fbcode, there's liblua link error: https://gist.github.com/yiwu-arbug/b796bff6b3d46d90c1ed878d983de50d
This is because we link liblua.a when building shared lib. If we want to link with liblua, we need to link with liblua_pic.a instead. Fixing by simply not link with lua.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4651
Differential Revision: D12964798
Pulled By: yiwu-arbug
fbshipit-source-id: 18d6cee94afe20748068822b76e29ef255cdb04d
Summary:
Previously, range tombstones were accumulated from every level, which
was necessary if a range tombstone in a higher level covered a key in a lower
level. However, RangeDelAggregator::AddTombstones's complexity is based on
the number of tombstones that are currently stored in it, which is wasteful in
the Get case, where we only need to know the highest sequence number of range
tombstones that cover the key from higher levels, and compute the highest covering
sequence number at the current level. This change introduces this optimization, and
removes the use of RangeDelAggregator from the Get path.
In the benchmark results, the following command was used to initialize the database:
```
./db_bench -db=/dev/shm/5k-rts -use_existing_db=false -benchmarks=filluniquerandom -write_buffer_size=1048576 -compression_type=lz4 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -value_size=112 -key_size=16 -block_size=4096 -level_compaction_dynamic_level_bytes=true -num=5000000 -max_background_jobs=12 -benchmark_write_rate_limit=20971520 -range_tombstone_width=100 -writes_per_range_tombstone=100 -max_num_range_tombstones=50000 -bloom_bits=8
```
...and the following command was used to measure read throughput:
```
./db_bench -db=/dev/shm/5k-rts/ -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=5000000 -reads=100000 -threads=32
```
The filluniquerandom command was only run once, and the resulting database was used
to measure read performance before and after the PR. Both binaries were compiled with
`DEBUG_LEVEL=0`.
Readrandom results before PR:
```
readrandom : 4.544 micros/op 220090 ops/sec; 16.9 MB/s (63103 of 100000 found)
```
Readrandom results after PR:
```
readrandom : 11.147 micros/op 89707 ops/sec; 6.9 MB/s (63103 of 100000 found)
```
So it's actually slower right now, but this PR paves the way for future optimizations (see #4493).
----
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4449
Differential Revision: D10370575
Pulled By: abhimadan
fbshipit-source-id: 9a2e152be1ef36969055c0e9eb4beb0d96c11f4d
Summary:
In fbcode when we build with clang7++, although -faligned-new is available in compile phase, we link with an older version of libstdc++.a and it doesn't come with aligned-new support (e.g. `nm libstdc++.a | grep align_val_t` return empty). In this case the previous -faligned-new detection can pass but will end up with link error. Fixing it by only have the detection for non-fbcode build.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4576
Differential Revision: D10500008
Pulled By: yiwu-arbug
fbshipit-source-id: b375de4fbb61d2a08e54ab709441aa8e7b4b08cf
Summary:
Introduce `RepeatableThread` utility to run task periodically in a separate thread. It is basically the same as the the same class in fbcode, and in addition provide a helper method to let tests mock time and trigger execution one at a time.
We can use this class to replace `TimerQueue` in #4382 and `BlobDB`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4423
Differential Revision: D10020932
Pulled By: yiwu-arbug
fbshipit-source-id: 3616bef108c39a33c92eedb1256de424b7c04087
Summary:
To measure the results of upcoming DeleteRange v2 work, this commit adds
simple benchmarks for RangeDelAggregator. It measures the average time
for AddTombstones and ShouldDelete calls.
Using this to compare the results before #4014 and on the latest master (using the default arguments) produces the following results:
Before #4014:
```
=======================
Results:
=======================
AddTombstones: 1356.28 us
ShouldDelete: 0.401732 us
```
Latest master:
```
=======================
Results:
=======================
AddTombstones: 740.82 us
ShouldDelete: 0.383271 us
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4363
Differential Revision: D9881676
Pulled By: abhimadan
fbshipit-source-id: 793e7d61aa4b9d47eb917bbcc03f08695b5e5442
Summary:
Before the fix:
On a PowerPC machine, run the following
```
$ make jtest
```
The command will fail due to "undefined symbol: crc32c_ppc". It was caused by
'rocksdbjava' Makefile target not including crc32c_ppc object files when
generating the shared lib. The fix is simple.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4357
Differential Revision: D9779474
Pulled By: riversand963
fbshipit-source-id: 3c5ec9068c2b9c796e6500f71cd900267064fd51
Summary:
Currently unity-test is failing because both trace_replay.cc and trace_analyzer_tool.cc defined `DecodeCFAndKey` under anonymous namespace. It is supposed to be fine except unity test will dump all source files together and now we have a conflict.
Another issue with trace_analyzer_tool.cc is that it is using some utility functions from ldb_cmd which is not included in Makefile for unity_test, I chose to update TESTHARNESS to include LIBOBJECTS. Feel free to comment if there is a less intrusive way to solve this.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4323
Differential Revision: D9599170
Pulled By: miasantreble
fbshipit-source-id: 38765b11f8e7de92b43c63bdcf43ea914abdc029
Summary:
Since bzip.org is no longer maintained, download the bzip2 packages from a snapshot taken by the internet archive until we figure out a more credible source.
Fixes issue: #4305
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4306
Differential Revision: D9514868
Pulled By: sagar0
fbshipit-source-id: 57c6a141a62e652f94377efc7ca9916b458e68d5
Summary:
Previously, the trace_analyzer_tool will be complied with other libobjects, which let the GFLAGS of trace_analyzer appear in other tools (e.g., db_bench, rocksdb_dump, and etc.). When using '--help', the help information of trace_analyzer will appear in other tool help information, which will cause confusion issues.
Currently, trace_analyzer_tool is built and used only by trace_analyzer and trace_analyzer_test to avoid the issues.
Tested with make asan_check.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4290
Differential Revision: D9413163
Pulled By: zhichao-cao
fbshipit-source-id: ed5d20c4575a53ca15ff62a2ffe601d5cf278cc4
Summary:
A framework of trace analyzing for RocksDB
After collecting the trace by using the tool of [PR #3837](https://github.com/facebook/rocksdb/pull/3837). User can use the Trace Analyzer to interpret, analyze, and characterize the collected workload.
**Input:**
1. trace file
2. Whole keys space file
**Statistics:**
1. Access count of each operation (Get, Put, Delete, SingleDelete, DeleteRange, Merge) in each column family.
2. Key hotness (access count) of each one
3. Key space separation based on given prefix
4. Key size distribution
5. Value size distribution if appliable
6. Top K accessed keys
7. QPS statistics including the average QPS and peak QPS
8. Top K accessed prefix
9. The query correlation analyzing, output the number of X after Y and the corresponding average time
intervals
**Output:**
1. key access heat map (either in the accessed key space or whole key space)
2. trace sequence file (interpret the raw trace file to line base text file for future use)
3. Time serial (The key space ID and its access time)
4. Key access count distritbution
5. Key size distribution
6. Value size distribution (in each intervals)
7. whole key space separation by the prefix
8. Accessed key space separation by the prefix
9. QPS of each operation and each column family
10. Top K QPS and their accessed prefix range
**Test:**
1. Added the unit test of analyzing Get, Put, Delete, SingleDelete, DeleteRange, Merge
2. Generated the trace and analyze the trace
**Implemented but not tested (due to the limitation of trace_replay):**
1. Analyzing Iterator, supporting Seek() and SeekForPrev() analyzing
2. Analyzing the number of Key found by Get
**Future Work:**
1. Support execution time analyzing of each requests
2. Support cache hit situation and block read situation of Get
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4091
Differential Revision: D9256157
Pulled By: zhichao-cao
fbshipit-source-id: f0ceacb7eedbc43a3eee6e85b76087d7832a8fe6
Summary:
Usually when using cscope, the query results contain a lot of function calls in test, making it hard to browse. So this PR aims to provide an option to exclude test source files.
Add a new PHONY target, tags0, to exclude test source files while using cscope.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4190
Differential Revision: D9015901
Pulled By: riversand963
fbshipit-source-id: ea9a45756ccff5b26344d37e9ff1c02c5d9736d6
Summary:
The first step of the `DataBlockHashIndex` implementation. A string based hash table is implemented and unit-tested.
`DataBlockHashIndexBuilder`: `Add()` takes pairs of `<key, restart_index>`, and formats it into a string when `Finish()` is called.
`DataBlockHashIndex`: initialized by the formatted string, and can interpret it as a hash table. Lookup for a key is supported by iterator operation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4139
Reviewed By: sagar0
Differential Revision: D8866764
Pulled By: fgwu
fbshipit-source-id: 7f015f0098632c65979a22898a50424384730b10
Summary:
The transactions are currently tested with and without using StackableDB. This is mostly to check that the code path is consistent with stackable db as well. Slow, stress tests however do not benefit from being run again with StackableDB. The patch excludes StackableDB from such tests.
On a single core it reduced the runtime of transaction_test from 199s to 135s.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4132
Differential Revision: D8841655
Pulled By: maysamyabandeh
fbshipit-source-id: 7b9aaba2673b542b195439dfb306cef26bd63b19
Summary:
Currently, if RocksDB encounters errors during a write operation (user requested or BG operations), it sets DBImpl::bg_error_ and fails subsequent writes. This PR allows the DB to be resumed for certain classes of errors. It consists of 3 parts -
1. Introduce Status::Severity in rocksdb::Status to indicate whether a given error can be recovered from or not
2. Refactor the error handling code so that setting bg_error_ and deciding on severity is in one place
3. Provide an API for the user to clear the error and resume the DB instance
This whole change is broken up into multiple PRs. Initially, we only allow clearing the error for Status::NoSpace() errors during background flush/compaction. Subsequent PRs will expand this to include more errors and foreground operations such as Put(), and implement a polling mechanism for out-of-space errors.
Closes https://github.com/facebook/rocksdb/pull/3997
Differential Revision: D8653831
Pulled By: anand1976
fbshipit-source-id: 6dc835c76122443a7668497c0226b4f072bc6afd
Summary:
Pinned the alignment of StatisticsData to the cacheline size rather than just extending its size (which could go over two cache lines)if unaligned in allocation.
Avoid compile errors in the process as per individual commit messages.
strengthen static_assert to CACHELINE rather than the highest common multiple.
Closes https://github.com/facebook/rocksdb/pull/4036
Differential Revision: D8582844
Pulled By: yiwu-arbug
fbshipit-source-id: 363c37029f28e6093e06c60b987bca9aa204bc71
Summary:
As titled.
I have not extended the Compatibility tests because the new WAL markers are still unimplemented.
Closes https://github.com/facebook/rocksdb/pull/3941
Differential Revision: D8238394
Pulled By: lth
fbshipit-source-id: 980e3d44837bbf2cfa64047f9738f559dfac4b1d
Summary:
Before this PR, Iterator/InternalIterator may simultaneously have non-ok status() and Valid() = true. That state means that the last operation failed, but the iterator is nevertheless positioned on some unspecified record. Likely intended uses of that are:
* If some sst files are corrupted, a normal iterator can be used to read the data from files that are not corrupted.
* When using read_tier = kBlockCacheTier, read the data that's in block cache, skipping over the data that is not.
However, this behavior wasn't documented well (and until recently the wiki on github had misleading incorrect information). In the code there's a lot of confusion about the relationship between status() and Valid(), and about whether Seek()/SeekToLast()/etc reset the status or not. There were a number of bugs caused by this confusion, both inside rocksdb and in the code that uses rocksdb (including ours).
This PR changes the convention to:
* If status() is not ok, Valid() always returns false.
* Any seek operation resets status. (Before the PR, it depended on iterator type and on particular error.)
This does sacrifice the two use cases listed above, but siying said it's ok.
Overview of the changes:
* A commit that adds missing status checks in MergingIterator. This fixes a bug that actually affects us, and we need it fixed. `DBIteratorTest.NonBlockingIterationBugRepro` explains the scenario.
* Changes to lots of iterator types to make all of them conform to the new convention. Some bug fixes along the way. By far the biggest changes are in DBIter, which is a big messy piece of code; I tried to make it less big and messy but mostly failed.
* A stress-test for DBIter, to gain some confidence that I didn't break it. It does a few million random operations on the iterator, while occasionally modifying the underlying data (like ForwardIterator does) and occasionally returning non-ok status from internal iterator.
To find the iterator types that needed changes I searched for "public .*Iterator" in the code. Here's an overview of all 27 iterator types:
Iterators that didn't need changes:
* status() is always ok(), or Valid() is always false: MemTableIterator, ModelIter, TestIterator, KVIter (2 classes with this name anonymous namespaces), LoggingForwardVectorIterator, VectorIterator, MockTableIterator, EmptyIterator, EmptyInternalIterator.
* Thin wrappers that always pass through Valid() and status(): ArenaWrappedDBIter, TtlIterator, InternalIteratorFromIterator.
Iterators with changes (see inline comments for details):
* DBIter - an overhaul:
- It used to silently skip corrupted keys (`FindParseableKey()`), which seems dangerous. This PR makes it just stop immediately after encountering a corrupted key, just like it would for other kinds of corruption. Let me know if there was actually some deeper meaning in this behavior and I should put it back.
- It had a few code paths silently discarding subiterator's status. The stress test caught a few.
- The backwards iteration code path was expecting the internal iterator's set of keys to be immutable. It's probably always true in practice at the moment, since ForwardIterator doesn't support backwards iteration, but this PR fixes it anyway. See added DBIteratorTest.ReverseToForwardBug for an example.
- Some parts of backwards iteration code path even did things like `assert(iter_->Valid())` after a seek, which is never a safe assumption.
- It used to not reset status on seek for some types of errors.
- Some simplifications and better comments.
- Some things got more complicated from the added error handling. I'm open to ideas for how to make it nicer.
* MergingIterator - check status after every operation on every subiterator, and in some places assert that valid subiterators have ok status.
* ForwardIterator - changed to the new convention, also slightly simplified.
* ForwardLevelIterator - fixed some bugs and simplified.
* LevelIterator - simplified.
* TwoLevelIterator - changed to the new convention. Also fixed a bug that would make SeekForPrev() sometimes silently ignore errors from first_level_iter_.
* BlockBasedTableIterator - minor changes.
* BlockIter - replaced `SetStatus()` with `Invalidate()` to make sure non-ok BlockIter is always invalid.
* PlainTableIterator - some seeks used to not reset status.
* CuckooTableIterator - tiny code cleanup.
* ManagedIterator - fixed some bugs.
* BaseDeltaIterator - changed to the new convention and fixed a bug.
* BlobDBIterator - seeks used to not reset status.
* KeyConvertingIterator - some small change.
Closes https://github.com/facebook/rocksdb/pull/3810
Differential Revision: D7888019
Pulled By: al13n321
fbshipit-source-id: 4aaf6d3421c545d16722a815b2fa2e7912bc851d
Summary:
We were seeing the following error: "ThreadSanitizer: DenseSlabAllocator overflow. Dying."
It is fixable by mmap'ing a smaller region for keys' expected values, which this PR achieves by reducing the number of keys.
Closes https://github.com/facebook/rocksdb/pull/3803
Differential Revision: D7874478
Pulled By: ajkr
fbshipit-source-id: 433939f5cb92410ab4777d540cb0cc2ee0fe6c2e
Summary:
This PR comments out the rest of the unused arguments which allow us to turn on the -Wunused-parameter flag. This is the second part of a codemod relating to https://github.com/facebook/rocksdb/pull/3557.
Closes https://github.com/facebook/rocksdb/pull/3662
Differential Revision: D7426121
Pulled By: Dayvedde
fbshipit-source-id: 223994923b42bd4953eb016a0129e47560f7e352
Summary:
Possible interleaved execution of background compaction thread calling `FindObsoleteFiles (no full scan) / PurgeObsoleteFiles` and user thread calling `FindObsoleteFiles (full scan) / PurgeObsoleteFiles` can lead to race condition on which RocksDB attempts to delete a file twice. The second attempt will fail and return `IO error`. This may occur to other files, but this PR targets sst.
Also add a unit test to verify that this PR fixes the issue.
The newly added unit test `obsolete_files_test` has a test case for this scenario, implemented in `ObsoleteFilesTest#RaceForObsoleteFileDeletion`. `TestSyncPoint`s are used to coordinate the interleaving the `user_thread` and background compaction thread. They execute as follows
```
timeline user_thread background_compaction thread
t1 | FindObsoleteFiles(full_scan=false)
t2 | FindObsoleteFiles(full_scan=true)
t3 | PurgeObsoleteFiles
t4 | PurgeObsoleteFiles
V
```
When `user_thread` invokes `FindObsoleteFiles` with full scan, it collects ALL files in RocksDB directory, including the ones that background compaction thread have collected in its job context. Then `user_thread` will see an IO error when trying to delete these files in `PurgeObsoleteFiles` because background compaction thread has already deleted the file in `PurgeObsoleteFiles`.
To fix this, we make RocksDB remember which (SST) files have been found by threads after calling `FindObsoleteFiles` (see `DBImpl#files_grabbed_for_purge_`). Therefore, when another thread calls `FindObsoleteFiles` with full scan, it will not collect such files.
ajkr could you take a look and comment? Thanks!
Closes https://github.com/facebook/rocksdb/pull/3638
Differential Revision: D7384372
Pulled By: riversand963
fbshipit-source-id: 01489516d60012e722ee65a80e1449e589ce26d3
Summary:
I found that each instance of MySQLStyleTransactionTest.TransactionStressTest/x is taking more than 10 hours to complete on our continuous testing environment, causing the whole valgrind run to timeout after a day. So excluding these tests.
Closes https://github.com/facebook/rocksdb/pull/3652
Differential Revision: D7400332
Pulled By: sagar0
fbshipit-source-id: 987810574506d01487adf7c2de84d4817ec3d22d
Summary:
I modified the Makefile so that we can compile rocksdb on OpenBSD.
The instructions for building have been added to INSTALL.md.
The whole compilation process works fine like this on OpenBSD-current
Closes https://github.com/facebook/rocksdb/pull/3617
Differential Revision: D7323754
Pulled By: siying
fbshipit-source-id: 990037d1cc69138d22f85bd77ef4dc8c1ba9edea
Summary:
In original $ROCKSDB_HOME/Makefile, the command used to generate ctags is
```
ctags * -R
```
However, this failed to generate tags for me.
I did some search on the usage of ctags command and found that it should be
```
ctags -R .
```
or
```
ctags -R *
```
After the change, I can find the tags in vim using `:ts <identifier>`.
Closes https://github.com/facebook/rocksdb/pull/3626
Reviewed By: ajkr
Differential Revision: D7320217
Pulled By: riversand963
fbshipit-source-id: e4cd8f8a67842370a2343f0213df3cbd07754111
Summary:
https://blog.github.com/2018-02-23-weak-cryptographic-standards-removed/
Github dropped supporting some weak cryptographic protocols from their website couple of weeks ago which cause our vagrant build process to fail on curl downloading step. This diff force curl use tls v1.2 protocol if it is supported so that it does not rely on the default protocol on different systems.
Closes https://github.com/facebook/rocksdb/pull/3561
Differential Revision: D7148575
Pulled By: wpc
fbshipit-source-id: b8cecfdfeb2bc8236de2d0d14f044532befec98c
Summary:
Now we suppress alignment UBSAN error as a whole. Suppressing 3-way CRC and murmurhash feels a better idea than turning off alignment check as a whole.
Closes https://github.com/facebook/rocksdb/pull/3495
Differential Revision: D6971273
Pulled By: siying
fbshipit-source-id: 080b59fed6df494b9f622ef7cb5d42d39e6a8cdf
Summary:
…db_test
options_settable_test won't pass UBSAN so disable it.
blob_db_test fails in UBSAN as SnapshotList doesn't initialize all the fields in dummy snapshot. Fix it. I don't understand why only blob_db_test fails though.
Closes https://github.com/facebook/rocksdb/pull/3477
Differential Revision: D6928681
Pulled By: siying
fbshipit-source-id: e31dd300fcdecdfd4f6af279a0987fd0cdec5122
Summary:
Disable alignment check in UBSAN for now. Now we can't get signals to meaningful failures. We can reenable it after we figure out how we can suppress failures in finer grain manner.
Closes https://github.com/facebook/rocksdb/pull/3473
Differential Revision: D6925971
Pulled By: siying
fbshipit-source-id: a0f1a242cde866abbc5c1eeee9ff8d1d7d582ac4
Summary:
By default if ubsan detects any problem, it outputs a “runtime error:” message, and in most cases continues executing the program.
In order to make test abort on errors, option `-fno-sanitize-recover` is needed. [link](https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html)
Closes https://github.com/facebook/rocksdb/pull/3447
Differential Revision: D6854654
Pulled By: miasantreble
fbshipit-source-id: c48e892b0b38307029df38a67adda0e24257e481
Summary:
Java static builds are again broken, this time due Snappy version upgrade introduced in 90c1d81975 (#3331).
This is due to two reasons:
1. The new Snappy packages should now be downloaded from https://github.com/google/snappy/archive/<pkg.tar.gz> instead of https://github.com/google/snappy/releases/download/<pkg.tar.gz> which we are using now.
1. In addition to the the above URL change, Snappy changed its build from using autotools to CMake based : e69d9f8806/README.md (L65-L72)
So more changes are needed if we are going to upgrade to 1.1.7. Hence reverting the version upgrade until we figure them out.
Closes https://github.com/facebook/rocksdb/pull/3363
Differential Revision: D6716983
Pulled By: sagar0
fbshipit-source-id: f451a1bc5eb0bb090f4da07bc3e5ba72cf89aefa
Summary:
Java build on PPC64le has been broken since a few months, due to #2716. Fixing it with the least amount of changes.
(We should cleanup a little around this code when time permits).
This should fix the build failures seen in http://140.211.168.68:8080/job/Rocksdb/ .
Closes https://github.com/facebook/rocksdb/pull/3359
Differential Revision: D6712938
Pulled By: sagar0
fbshipit-source-id: 3046e8f072180693de2af4762934ec1ace309ca4
Summary:
With the ZSTD dictionary generator support added in #3057
`PORTABLE=1 ROCKSDB_NO_FBCODE=1 make rocksdbjavastatic` fails as it can't find zdict.h. Specifically due to:
e3a06f12d2/util/compression.h (L39)
In java static builds zstd code gets directly downloaded from https://github.com/facebook/zstd , and in there zdict.h is under dictBuilder directory. So, I modified libzstd.a target to use `make install` to collect all the header files into a single location and used that as the zstd's include path.
Closes https://github.com/facebook/rocksdb/pull/3260
Differential Revision: D6669850
Pulled By: sagar0
fbshipit-source-id: f8a7562a670e5aed4c4fb6034a921697590d7285
Summary:
added support for C and asm files as required for e612e31740.
Closes https://github.com/facebook/rocksdb/pull/3299
Differential Revision: D6612479
Pulled By: ajkr
fbshipit-source-id: 6263ed7c1602f249460421825c76b5721f396163
Summary:
**# Summary**
RocksDB uses SSE crc32 intrinsics to calculate the crc32 values but it does it in single way fashion (not pipelined on single CPU core). Intel's whitepaper () published an algorithm that uses 3-way pipelining for the crc32 intrinsics, then use pclmulqdq intrinsic to combine the values. Because pclmulqdq has overhead on its own, this algorithm will show perf gains on buffers larger than 216 bytes, which makes RocksDB a perfect user, since most of the buffers RocksDB call crc32c on is over 4KB. Initial db_bench show tremendous CPU gain.
This change uses the 3-way SSE algorithm by default. The old SSE algorithm is now behind a compiler tag NO_THREEWAY_CRC32C. If user compiles the code with NO_THREEWAY_CRC32C=1 then the old SSE Crc32c algorithm would be used. If the server does not have SSE4.2 at the run time the slow way (Non SSE) will be used.
**# Performance Test Results**
We ran the FillRandom and ReadRandom benchmarks in db_bench. ReadRandom is the point of interest here since it calculates the CRC32 for the in-mem buffers. We did 3 runs for each algorithm.
Before this change the CRC32 value computation takes about 11.5% of total CPU cost, and with the new 3-way algorithm it reduced to around 4.5%. The overall throughput also improved from 25.53MB/s to 27.63MB/s.
1) ReadRandom in db_bench overall metrics
PER RUN
Algorithm | run | micros/op | ops/sec |Throughput (MB/s)
3-way | 1 | 4.143 | 241387 | 26.7
3-way | 2 | 3.775 | 264872 | 29.3
3-way | 3 | 4.116 | 242929 | 26.9
FastCrc32c|1 | 4.037 | 247727 | 27.4
FastCrc32c|2 | 4.648 | 215166 | 23.8
FastCrc32c|3 | 4.352 | 229799 | 25.4
AVG
Algorithm | Average of micros/op | Average of ops/sec | Average of Throughput (MB/s)
3-way | 4.01 | 249,729 | 27.63
FastCrc32c | 4.35 | 230,897 | 25.53
2) Crc32c computation CPU cost (inclusive samples percentage)
PER RUN
Implementation | run | TotalSamples | Crc32c percentage
3-way | 1 | 4,572,250,000 | 4.37%
3-way | 2 | 3,779,250,000 | 4.62%
3-way | 3 | 4,129,500,000 | 4.48%
FastCrc32c | 1 | 4,663,500,000 | 11.24%
FastCrc32c | 2 | 4,047,500,000 | 12.34%
FastCrc32c | 3 | 4,366,750,000 | 11.68%
**# Test Plan**
make -j64 corruption_test && ./corruption_test
By default it uses 3-way SSE algorithm
NO_THREEWAY_CRC32C=1 make -j64 corruption_test && ./corruption_test
make clean && DEBUG_LEVEL=0 make -j64 db_bench
make clean && DEBUG_LEVEL=0 NO_THREEWAY_CRC32C=1 make -j64 db_bench
Closes https://github.com/facebook/rocksdb/pull/3173
Differential Revision: D6330882
Pulled By: yingsu00
fbshipit-source-id: 8ec3d89719533b63b536a736663ca6f0dd4482e9
Summary:
The TSAN version of tests could take quite long. Make the buck tests parallel to avoid timeouts.
Closes https://github.com/facebook/rocksdb/pull/3280
Differential Revision: D6581594
Pulled By: maysamyabandeh
fbshipit-source-id: 3f8476d8c69f0183e394fa8a2089dd8d4e90c90c
Summary:
Add ROCKSDB_VALGRIND_RUN macro and suppress false-positive "unimplemented functionality" throw by valgrind for steam hints.
Another approach would be add a valgrind suppress file. Valgrind is suppose to print the suppression when given "--gen-suppressions=all" param, which is suppose to be the content for the suppression file. But it doesn't print.
Closes https://github.com/facebook/rocksdb/pull/3174
Differential Revision: D6338786
Pulled By: yiwu-arbug
fbshipit-source-id: 3559efa5f3b92d40d09ad6ac82bc7b59f86c75aa
Summary:
The current implementation of PinnableSlice move assignment have an issue #3163. We are moving away from it instead of try to get the move assignment right, since it is too tricky.
Closes https://github.com/facebook/rocksdb/pull/3164
Differential Revision: D6319201
Pulled By: yiwu-arbug
fbshipit-source-id: 8f3279021f3710da4a4caa14fd238ed2df902c48
Summary:
Dynamic adjustment of rate limit according to demand for background I/O. It increases by a factor when limiter is drained too frequently, and decreases by the same factor when limiter is not drained frequently enough. The parameters for this behavior are fixed in `GenericRateLimiter::Tune`. Other changes:
- make rate limiter's `Env*` configurable for testing
- track num drain intervals in RateLimiter so we don't have to rely on stats, which may be shared across different DB instances from the ones that share the RateLimiter.
Closes https://github.com/facebook/rocksdb/pull/2899
Differential Revision: D5858704
Pulled By: ajkr
fbshipit-source-id: cc2bac30f85e7f6fd63655d0a6732ef9ed7403b1
Summary:
Make SnapshotConcurrentAccessTest run in the beginning of the queue.
Test Plan
`make all check -j64` on devserver
Closes https://github.com/facebook/rocksdb/pull/2962
Differential Revision: D5965871
Pulled By: yiwu-arbug
fbshipit-source-id: 8cb5a47c2468be0fbbb929226a143ec5848bfaa9
Summary:
Add kTypeBlobIndex value type, which will be used by blob db only, to insert a (key, blob_offset) KV pair. The purpose is to
1. Make it possible to open existing rocksdb instance as blob db. Existing value will be of kTypeIndex type, while value inserted by blob db will be of kTypeBlobIndex.
2. Make rocksdb able to detect if the db contains value written by blob db, if so return error.
3. Make it possible to have blob db optionally store value in SST file (with kTypeValue type) or as a blob value (with kTypeBlobIndex type).
The root db (DBImpl) basically pretended kTypeBlobIndex are normal value on write. On Get if is_blob is provided, return whether the value read is of kTypeBlobIndex type, or return Status::NotSupported() status if is_blob is not provided. On scan allow_blob flag is pass and if the flag is true, return wether the value is of kTypeBlobIndex type via iter->IsBlob().
Changes on blob db side will be in a separate patch.
Closes https://github.com/facebook/rocksdb/pull/2886
Differential Revision: D5838431
Pulled By: yiwu-arbug
fbshipit-source-id: 3c5306c62bc13bb11abc03422ec5cbcea1203cca
Summary:
This enables us to crossbuild pcc64le RocksJava binaries with a suitably old version of glibc (2.17) on CentOS 7.
Closes https://github.com/facebook/rocksdb/pull/2491
Differential Revision: D5955301
Pulled By: sagar0
fbshipit-source-id: 69ef9746f1dc30ffde4063dc764583d8c7ae937e
Summary:
TSAN shows error when we grab too many locks at the same time. In TSAN crash test, make one shard key cover 2^22 keys so that no many keys will be hold at the same time.
Closes https://github.com/facebook/rocksdb/pull/2719
Differential Revision: D5609035
Pulled By: siying
fbshipit-source-id: 930e5d63fff92dbc193dc154c4c615efbdf06c6a
Summary:
Commit 4f81ab38bf has the test wrong.
clang doesn't support a -dumpversion option. By lucky coincidence
clang/gcc --version both place a version number at the same output location
when --verison is passed.
Example output (1st line only).
$ clang --version
clang version 3.9.1 (tags/RELEASE_391/final)
$ gcc --version
gcc (GCC) 6.4.1 20170727 (Red Hat 6.4.1-1)
During the test of the compiler we ensure that a minimum version is met
as Makefile doesn't support patterns.
Also xcode9 doesn't seem affected by https://github.com/facebook/rocksdb/issues/2672
and also doesn't have "clang" as the first part of its output so the
fix implemented here also is Apple clang friendly.
$ clang --version
Apple LLVM version 9.0.0 (clang-900.0.31)
Signed-off-by: Daniel Black <daniel.black@au.ibm.com>
Closes https://github.com/facebook/rocksdb/pull/2699
Differential Revision: D5600818
Pulled By: yiwu-arbug
fbshipit-source-id: 3b0f2751becb53c1c35468bf29f3f828e7cf2c2a
Summary:
Replace dynamic_cast<> so that users can choose to build with RTTI off, so that they can save several bytes per object, and get tiny more memory available.
Some nontrivial changes:
1. Add Comparator::GetRootComparator() to get around the internal comparator hack
2. Add the two experiemental functions to DB
3. Add TableFactory::GetOptionString() to avoid unnecessary casting to get the option string
4. Since 3 is done, move the parsing option functions for table factory to table factory files too, to be symmetric.
Closes https://github.com/facebook/rocksdb/pull/2645
Differential Revision: D5502723
Pulled By: siying
fbshipit-source-id: fd13cec5601cf68a554d87bfcf056f2ffa5fbf7c
Summary:
platform_dependent tests in Travis now builds all tests, which is not needed. Only build those tests we need to run.
Closes https://github.com/facebook/rocksdb/pull/2647
Differential Revision: D5513954
Pulled By: siying
fbshipit-source-id: 4d540b146124e70dd25586c47939d19f93655b0a
Summary:
Major changes in this PR:
* Implement CassandraCompactionFilter to remove expired columns and rows (if all column expired)
* Move cassandra related code from utilities/merge_operators/cassandra to utilities/cassandra/*
* Switch to use shared_ptr<> from uniqu_ptr for Column membership management in RowValue. Since columns do have multiple owners in Merge and GC process, use shared_ptr helps make RowValue immutable.
* Rename cassandra_merge_test to cassandra_functional_test and add two TTL compaction related tests there.
Closes https://github.com/facebook/rocksdb/pull/2588
Differential Revision: D5430010
Pulled By: wpc
fbshipit-source-id: 9566c21e06de17491d486a68c70f52d501f27687
Summary:
Instead of ignoring UBSan checks, fix the negative shifts in
Hash(). Also add test to make sure the hash values are stable over
time. The values were computed before this change, so the test also
verifies the correctness of the change.
Closes https://github.com/facebook/rocksdb/pull/2546
Differential Revision: D5386369
Pulled By: yiwu-arbug
fbshipit-source-id: 6de4b44461a544d6222cc5d72d8cda2c0373d17e
Summary:
1. The buckifier script assume each test "foo" comes with a .cc file of the same name (i.e. foo.cc). Update cassandra tests to follow this pattern so that the buckifier script can recognize them.
2. add blob_db_test
Closes https://github.com/facebook/rocksdb/pull/2506
Differential Revision: D5331517
Pulled By: yiwu-arbug
fbshipit-source-id: 86f3eba471fc621186ab44cbd073b6162cde8e57
Summary:
This PR adds support for encrypting data stored by RocksDB when written to disk.
It adds an `EncryptedEnv` override of the `Env` class with matching overrides for sequential&random access files.
The encryption itself is done through a configurable `EncryptionProvider`. This class creates is asked to create `BlockAccessCipherStream` for a file. This is where the actual encryption/decryption is being done.
Currently there is a Counter mode implementation of `BlockAccessCipherStream` with a `ROT13` block cipher (NOTE the `ROT13` is for demo purposes only!!).
The Counter operation mode uses an initial counter & random initialization vector (IV).
Both are created randomly for each file and stored in a 4K (default size) block that is prefixed to that file. The `EncryptedEnv` implementation is such that clients of the `Env` class do not see this prefix (nor data, nor in filesize).
The largest part of the prefix block is also encrypted, and there is room left for implementation specific settings/values/keys in there.
To test the encryption, the `DBTestBase` class has been extended to consider a new environment variable called `ENCRYPTED_ENV`. If set, the test will setup a encrypted instance of the `Env` class to use for all tests.
Typically you would run it like this:
```
ENCRYPTED_ENV=1 make check_some
```
There is also an added test that checks that some data inserted into the database is or is not "visible" on disk. With `ENCRYPTED_ENV` active it must not find plain text strings, with `ENCRYPTED_ENV` unset, it must find the plain text strings.
Closes https://github.com/facebook/rocksdb/pull/2424
Differential Revision: D5322178
Pulled By: sdwilsh
fbshipit-source-id: 253b0a9c2c498cc98f580df7f2623cbf7678a27f
Summary:
Improve write buffer manager in several ways:
1. Size is tracked when arena block is allocated, rather than every allocation, so that it can better track actual memory usage and the tracking overhead is slightly lower.
2. We start to trigger memtable flush when 7/8 of the memory cap hits, instead of 100%, and make 100% much harder to hit.
3. Allow a cache object to be passed into buffer manager and the size allocated by memtable can be costed there. This can help users have one single memory cap across block cache and memtable.
Closes https://github.com/facebook/rocksdb/pull/2350
Differential Revision: D5110648
Pulled By: siying
fbshipit-source-id: b4238113094bf22574001e446b5d88523ba00017
Summary:
Blob db rely on base db returning sequence number through write batch after DB::Write(). However after recent changes to the write path, DB::Writ()e no longer return sequence number in some cases. Fixing it by have WriteBatchInternal::InsertInto() always encode sequence number into write batch.
Stacking on #2375.
Closes https://github.com/facebook/rocksdb/pull/2385
Differential Revision: D5148358
Pulled By: yiwu-arbug
fbshipit-source-id: 8bda0aa07b9334ed03ed381548b39d167dc20c33
Summary:
Re-enable blob_db_test with some update:
* Commented out delay at the end of GC tests. Will update the logic later with sync point to properly trigger GC.
* Added some helper functions.
Also update make files to include blob_dump tool.
Closes https://github.com/facebook/rocksdb/pull/2375
Differential Revision: D5133793
Pulled By: yiwu-arbug
fbshipit-source-id: 95470b26d0c1f9592ba4b7637e027fdd263f425c
Summary:
zstd files are downloaded and used as part of JNI build, but are left behind even after doing a `make clean`. This PR updates the `clean` target to remove these zstd files as well.
Closes https://github.com/facebook/rocksdb/pull/2365
Differential Revision: D5123537
Pulled By: sagar0
fbshipit-source-id: a8f355da5ba961aa89d5852e35751ffc35de03ea
Summary:
added ctags -e to the tags target in the makefile. It creates an etags file suitable for emacs.
Closes https://github.com/facebook/rocksdb/pull/2193
Differential Revision: D4983535
Pulled By: siying
fbshipit-source-id: 1077ef0676025b8109df37433572533c9e8fe86e
Summary:
-pic seems to be not working in gcc-5 and it is curently broken. Remove it to fix the build.
Closes https://github.com/facebook/rocksdb/pull/2320
Differential Revision: D5082775
Pulled By: siying
fbshipit-source-id: 5055f987353f1417643a394e7ce05905670410a4
Summary:
snprintf is in <stdio.h> and not in namespace std.
Closes https://github.com/facebook/rocksdb/pull/2287
Reviewed By: anirbanr-fb
Differential Revision: D5054752
Pulled By: yiwu-arbug
fbshipit-source-id: 356807ec38f3c7d95951cdb41f31a3d3ae0714d4
Summary:
As an alternative to Vagrant, we can now also use Docker to cross-build RocksDB. The advantages are:
1. The Docker images are fixed; they include all the latest updates and build tools.
2. The Vagrant image, required scripts that ran for every build that would update CentOS and install the buildtools. This lead to slow repeatable builds, we don't need to do this with Docker as they are already in the provided images.
The Docker images I have used have their Docker build files here: https://github.com/evolvedbinary/docker-rocksjava and the images themselves are available from Docker hub: https://hub.docker.com/r/evolvedbinary/rocksjava/
I have added the following targets to the `Makefile`:
1. `rocksdbjavastaticreleasedocker` this uses Docker to perform the cross-builds. It is basically the Docker version of the existing Vagrant `rocksdbjavastaticrelease` target.
2. `rocksdbjavastaticpublishdocker` delegates to `rocksdbjavastaticreleasedocker` and then `rocksdbjavastaticpublishcentral` to upload the artiacts to Maven Central. Equivalent to the existing Vagrant target: `rocksdbjavastaticpublish`
Closes https://github.com/facebook/rocksdb/pull/2278
Differential Revision: D5048206
Pulled By: yiwu-arbug
fbshipit-source-id: 78fa96ef9d966fe09638ed01de282cd4e31961a9
Summary:
When building rocksdb in fbcode using `make`, util/build_version.cc is always updated (gitignore/hgignore doesn't apply because the file is already checked into fbcode). To use the rocksdb makefile from our own makefile, I would like an option to prevent the metadata update, which is of no value for us.
Closes https://github.com/facebook/rocksdb/pull/2264
Differential Revision: D5037846
Pulled By: siying
fbshipit-source-id: 9fa005725c5ecb31d9cbe2e738cbee209591f08a
Summary:
I'm going to add more DB tests for statistics as currently we have very few. I started a file dedicated to this purpose and moved the existing stats-specific tests there.
Closes https://github.com/facebook/rocksdb/pull/2211
Differential Revision: D4951558
Pulled By: ajkr
fbshipit-source-id: 05d11c35079c40ecabdfd2cf5556ccb761f694a4
Summary:
Replacement of #2147
The change was squashed due to a lot of conflicts.
Closes https://github.com/facebook/rocksdb/pull/2194
Differential Revision: D4929799
Pulled By: siying
fbshipit-source-id: 5cd49c254737a1d5ac13f3c035f128e86524c581
Summary:
Previously, the shared library (make shared_lib) was built with only one
compile line, compiling all .cc files and linking the shared library in
one step. That step would often take 10+ minutes on one machine, and
could not take advantage of multiple CPUs (it's only one invocation of
the compiler).
This commit changes the shared_lib build to compile .o files
individually (placing the resulting .o files in the directory
shared-objects) and then link them into the shared library at the end,
similarly to how the java static build (jls) does it.
Tested by making sure that both static and shared libraries work, and by
making sure that "make clean" cleans up the shared-objects directory.
Closes https://github.com/facebook/rocksdb/pull/2165
Differential Revision: D4897121
Pulled By: yiwu-arbug
fbshipit-source-id: 9811e043d1c01e10503593f3489d186c786ee7d7
Summary:
In some CI test environment, compression libraries can't be successfully built. It still helps to build RocksDB there. Provide such an option to skip to download and build compression libraries.
Closes https://github.com/facebook/rocksdb/pull/2135
Differential Revision: D4872617
Pulled By: siying
fbshipit-source-id: bb21ac373bc62a2528cdf1ca4547e05fcae86214
Summary:
siying this is a resubmission of #2081 with the 4th commit fixed. From that commit message:
> Note that the previous use of quotes in PLATFORM_{CC,CXX}FLAGS was
incorrect and caused GCC to produce the incorrect define:
>
> #define ROCKSDB_JEMALLOC -DJEMALLOC_NO_DEMANGLE 1
>
> This was the cause of the Linux build failure on the previous version
of this change.
I've tested this locally, and the Linux build succeeds now.
Closes https://github.com/facebook/rocksdb/pull/2097
Differential Revision: D4839964
Pulled By: siying
fbshipit-source-id: cc51322
Summary:
Move some files under util/ to new directories env/, monitoring/ options/ and cache/
Closes https://github.com/facebook/rocksdb/pull/2090
Differential Revision: D4833681
Pulled By: siying
fbshipit-source-id: 2fd8bef
Summary:
I've needed Env timing measurements a few times now, so finally built something for it.
Closes https://github.com/facebook/rocksdb/pull/2073
Differential Revision: D4811231
Pulled By: ajkr
fbshipit-source-id: 218a249
Summary:
It is confusing to have auto_roll_logger to stay under db/, which has nothing to do with database. Move filename together as it is a dependency.
Closes https://github.com/facebook/rocksdb/pull/2080
Differential Revision: D4821141
Pulled By: siying
fbshipit-source-id: ca7d768
Summary:
Right now, building rocksdbjava in PowerPC is broken due to JNI library name. I figured it out that "uname -m" and java's os.arch matches in PowerPC architecture. I made use of this advantage to fix the issue. More info can found from this issue --> https://github.com/facebook/rocksdb/issues/1317
Closes https://github.com/facebook/rocksdb/pull/2040
Differential Revision: D4779967
Pulled By: siying
fbshipit-source-id: 259f939
Summary:
After we have db_basic_test and external_sst_file_basic_test, we don't need to run db_test and external_sst_file_test in Travis's MAC OS run anymore. Move it out.
Closes https://github.com/facebook/rocksdb/pull/1940
Differential Revision: D4659361
Pulled By: siying
fbshipit-source-id: e64e291
Summary:
Separate the platform dependent tests from external_sst_file_test. Only those tests need to run on platforms like OSX
Closes https://github.com/facebook/rocksdb/pull/1923
Differential Revision: D4622461
Pulled By: siying
fbshipit-source-id: d2d6f04
Summary:
Separate a smal subset of tests in DBTest to DBBasicTest. Tests in DBTest don't have to run in CI tests on platforms like OSX, as long as they are covered by Linux.
Closes https://github.com/facebook/rocksdb/pull/1924
Differential Revision: D4616702
Pulled By: siying
fbshipit-source-id: 13e6549
Summary:
Travis is short of OSX resource. Try to move platform independent test suites out of OSX
Closes https://github.com/facebook/rocksdb/pull/1922
Differential Revision: D4616070
Pulled By: siying
fbshipit-source-id: 786342c
Summary:
valgrind tests always timeout with parallel run. Black list some slowest ones. It is better to run fewer tests than always have the tests timeout.
Closes https://github.com/facebook/rocksdb/pull/1908
Differential Revision: D4607875
Pulled By: siying
fbshipit-source-id: 7062664
Summary:
The previous version of zlib is no longer available. I have also updated the versions of the other static libraries and added checkum checks for the downloads; This is related to https://github.com/facebook/rocksdb/issues/1769
Closes https://github.com/facebook/rocksdb/pull/1863
Differential Revision: D4550742
Pulled By: yiwu-arbug
fbshipit-source-id: 4414150
Summary:
The Env registration framework supports registering client Envs and selecting which one to instantiate according to a text field. This enabled things like adding the -env_uri argument to db_bench, so the same binary could be reused with different Envs just by changing CLI config.
Now this problem has come up again in a non-Env context, as I want to instantiate a client Statistics implementation from db_bench, which is configured entirely via text parameters. Also, in the future we may wish to use it for deserializing client objects when loading OPTIONS file.
This diff generalizes the Env registration logic to work with arbitrary types.
- Generalized registration and instantiation code by templating them
- The entire implementation is in a header file as that's Google style guide's recommendation for template definitions
- Pattern match with std::regex_match rather than checking prefix, which was the previous behavior
- Rename functions/files to be non-Env-specific
Closes https://github.com/facebook/rocksdb/pull/1776
Differential Revision: D4421933
Pulled By: ajkr
fbshipit-source-id: 34647d1
Summary:
Cleanable objects will perform the registered cleanups when
they are destructed. We however rather to delay this cleaning like when
we are gathering the merge operands. Current approach is to create the
Cleanable object on heap (instead of on stack) and delay deleting it.
By allowing Cleanables to delegate their cleanups to another cleanable
object we can delay the cleaning without however the need to craete the
cleanable object on heap and keeping it around. This patch applies this
technique for the cleanups of BlockIter and shows improved performance
for some in-memory benchmarks:
+1.8% for merge worklaod, +6.4% for non-merge workload when the merge
operator is specified.
https://our.intern.facebook.com/intern/tasks?t=15168163
Non-merge benchmark:
TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench --benchmarks=fillrandom
--num=1000000 -value_size=100 -compression_type=none
Reading random with no merge operator specified:
TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench
--benchmarks="read
Closes https://github.com/facebook/rocksdb/pull/1711
Differential Revision: D4361163
Pulled By: maysamyabandeh
fbshipit-source-id: 9801e07
Summary:
Added a tombstone-collapsing mode to RangeDelAggregator, which eliminates overlap in the TombstoneMap. In this mode, we can check whether a tombstone covers a user key using upper_bound() (i.e., binary search). However, the tradeoff is the overhead to add tombstones is now higher, so at first I've only enabled it for range scans (compaction/flush/user iterators), where we expect a high number of calls to ShouldDelete() for the same tombstones. Point queries like Get() will still use the linear scan approach.
Also in this diff I changed RangeDelAggregator's TombstoneMap to use multimap with user keys instead of map with internal keys. Callers sometimes provided ParsedInternalKey directly, from which it would've required string copying to derive an internal key Slice with which we could search the map.
Closes https://github.com/facebook/rocksdb/pull/1614
Differential Revision: D4270397
Pulled By: ajkr
fbshipit-source-id: 93092c7
Summary:
Iterator should be in corrupted status if merge operator return false.
Also add test to make sure if max_successive_merges is hit during write,
data will not be lost.
Closes https://github.com/facebook/rocksdb/pull/1665
Differential Revision: D4322695
Pulled By: yiwu-arbug
fbshipit-source-id: b327b05
Summary:
currently when running a portable build we have to do the following
PORTABLE=1 make ...
this commit adds support for the following
make PORTABLE=1 ...
this might be seem subtle but it makes PORTABLE like all other
makefile args and simplifies invocation from numerous build systems
including things like ExternalProject_Add in cmake.
Signed-off-by: Bassam Tabbara <bassam.tabbara@quantum.com>
Closes https://github.com/facebook/rocksdb/pull/1643
Differential Revision: D4315870
Pulled By: yiwu-arbug
fbshipit-source-id: ee43755
Summary:
The second variable "SHELL" simply tells make explicitly which shell to use, instead of allowing it to default to "/bin/sh", which may or may not be Bash.
However, simply defining the second variable by itself causes make to throw an error concerning a circular definition, as it would be attempting to use the "shell" command while simultaneously trying to set which shell to use. Thus, the first variable "BASH_EXISTS" is defined such that make already knows about "/path/to/bash" before trying to use it to set "SHELL".
A more technically correct solution would be to edit the makefile itself to make it compatible with non-bash shells (see the original Issue discussion for details). However, as it seems very few of the people working on this project were building with non-bash shells, I figured this solution would be good enough.
Closes https://github.com/facebook/rocksdb/pull/1631
Differential Revision: D4295689
Pulled By: yiwu-arbug
fbshipit-source-id: e4f9532
Summary:
Travis now is building for ldb tests. Disable for now to unblock other tests while we are investigating.
Closes https://github.com/facebook/rocksdb/pull/1546
Differential Revision: D4209404
Pulled By: siying
fbshipit-source-id: 47edd97
Summary:
This diff includes an implementation of CompactionFilter that allows
users to write CompactionFilter in Lua. With this ability, users can
dynamically change compaction filter logic without requiring building
the rocksdb binary and restarting the database.
To compile, WITH_LUA_PATH must be specified to the base directory
of lua.
Closes https://github.com/facebook/rocksdb/pull/1478
Differential Revision: D4150138
Pulled By: yhchiang
fbshipit-source-id: ed84222
Summary:
Currently our skip-list have an optimization to speedup sequential
inserts from a single stream, by remembering the last insert position.
We extend the idea to support sequential inserts from multiple streams,
and even tolerate small reordering wihtin each stream.
This PR is the interface part adding the following:
- Add `memtable_insert_prefix_extractor` to allow specifying prefix for each key.
- Add `InsertWithHint()` interface to memtable, to allow underlying
implementation to return a hint of insert position, which can be later
pass back to optimize inserts.
- Memtable will maintain a map from prefix to hints and pass the hint
via `InsertWithHint()` if `memtable_insert_prefix_extractor` is non-null.
Closes https://github.com/facebook/rocksdb/pull/1419
Differential Revision: D4079367
Pulled By: yiwu-arbug
fbshipit-source-id: 3555326
* util/build_verion.cc.in: add this file, so cmake and make can share the
template file for generating util/build_version.cc.
* CMakeLists.txt: also, cmake v2.8.11 does not support file(GENERATE ...),
so we are using configure_file() for creating build_version.cc.
* Makefile: use util/build_verion.cc.in for creating build_version.cc.
Signed-off-by: Kefu Chai <tchaikov@gmail.com>
Summary:
Valgrind does not work well with JEMALLOC. If you run
a simple make valgrind_check, you will see lots of issues and
crashes. When precommit runs, this is taken care of. Here we
make sure valgrind_check is passed in DISABLE_JEMALLOC=1
Test Plan: Ran local valgrind_test and noticed the difference
Reviewers: IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D65379
Summary:
Revert the behavior where we don't read sequence id from WAL, but increase it as we replay the log. We still keep the behave for 2PC for now but will fix later.
This change fixes github issue 1339, where some writes come with WAL disabled and we may recover records with wrong sequence id.
Test Plan: Added unit test.
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D64275
Summary:
I just realized that when we run parallel valgrind we actually don't run the parallel tests under valgrind (we run the normally)
This patch make sure that we run both parallel and non-parallel tests with valgrind
Test Plan: DISABLE_JEMALLOC=1 make valgrind_check -j64
Reviewers: andrewkr, yiwu, lightmark, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D62469
Summary:
Add mid-point insertion functionality to LRU cache. Caller of `Cache::Insert()` can set an additional parameter to make a cache entry have higher priority. The LRU cache will reserve at most `capacity * high_pri_pool_pct` bytes for high-pri cache entries. If `high_pri_pool_pct` is zero, the cache degenerates to normal LRU cache.
Context: If we are to put index and filter blocks into RocksDB block cache, index/filter block can be swap out too early. We want to add an option to RocksDB to reserve some capacity in block cache just for index/filter blocks, to mitigate the issue.
In later diffs I'll update block based table reader to use the interface to cache index/filter blocks at high priority, and expose the option to `DBOptions` and make it dynamic changeable.
Test Plan: unit test.
Reviewers: IslamAbdelRahman, sdong, lightmark
Reviewed By: lightmark
Subscribers: andrewkr, dhruba, march, leveldb
Differential Revision: https://reviews.facebook.net/D61977
Summary:
This is a proof of concept of a RocksDB blob log file. The actual value of the Put() is appended to a blob log using normal data block format, and the handle of the block is written as the value of the key in RocksDB.
The prototype only supports Put() and Get(). It doesn't support DB restart, garbage collection, Write() call, iterator, snapshots, etc.
Test Plan: Add unit tests.
Reviewers: arahut
Reviewed By: arahut
Subscribers: kradhakrishnan, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61485
Summary: With read_options.background_purge_on_iterator_cleanup=true, File deletion and closing can still happen in forward iterator, or WAL file closing. Cover those cases too.
Test Plan: I am adding unit tests.
Reviewers: andrewkr, IslamAbdelRahman, yiwu
Reviewed By: yiwu
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61503
Summary:
compact_on_deletion_collector_test does not support --gtest_list_tests
since it isn't gtest, so the full program would run for the target
gen_parallel_tests. This caused gen_parallel_tests to take 8+ minutes for tsan
and prevented compact_on_deletion_collector_test from running during check_0
since no t/run-* script could be generated.
Test Plan:
run make check, verify generating t/run-* scripts is fast and
./compact_on_deletion_collector_test is now run
Reviewers: IslamAbdelRahman, wanning, lightmark, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61695
Summary: Implement a time series database that supports DateTieredCompactionStrategy. It wraps a db object and separate SST files in different column families (time windows).
Test Plan: Add `date_tiered_test`.
Reviewers: dhruba, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61653
Summary: Add a utility function that trigger necessary full compaction and put output to the correct level by looking at new options and old options.
Test Plan: Add unit tests for it.
Reviewers: andrewkr, igor, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: muthu, sumeet, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60783
Summary:
parallel tests are broken because gnu_parallel is reading deprecated options from `/etc/parallel/config`
Fix this by passing `--plain` to ignore `/etc/parallel/config`
Test Plan: make check -j64
Reviewers: kradhakrishnan, sdong, andrewkr, yiwu, arahut
Reviewed By: arahut
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61359
Summary:
Experiments on column-aware encodings. Supported features: 1) extract data blocks from SST file and encode with specified encodings; 2) Decode encoded data back into row format; 3) Directly extract data blocks and write in row format (without prefix encoding); 4) Get column distribution statistics for column format; 5) Dump data blocks separated by columns in human-readable format.
There is still on-going work on this diff. More refactoring is necessary.
Test Plan: Wrote tests in `column_aware_encoding_test.cc`. More tests should be added.
Reviewers: sdong
Reviewed By: sdong
Subscribers: arahut, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60027
Summary:
Removing moreutils from sandcastle and adding gnu parallel.
Then passing in J= nproc command
Test Plan: Testing on sandcastle
Reviewers: sdong, kradhakrishnan
Reviewed By: kradhakrishnan
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61017
Summary:
TickersNameMap is not consistent with Tickers enum.
this cause us to report wrong statistics and sometimes to access TickersNameMap outside it's boundary causing crashes (in Fb303 statistics)
Test Plan: added new unit test
Reviewers: sdong, kradhakrishnan
Reviewed By: kradhakrishnan
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61083
Summary:
Tsan crash white-box test was disabled because it never ends. Re-enable it with reduced killing odds.
Add a parameter in crash test script to allow we pass it through an environment variable.
Test Plan: Run it manually.
Reviewers: IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61053
Summary: Multiput atomiciy is broken across multiple column families if we don't sync WAL before flushing one column family. The WAL file may contain a write batch containing writes to a key to the CF to be flushed and a key to other CF. If we don't sync WAL before flushing, if machine crashes after flushing, the write batch will only be partial recovered. Data to other CFs are lost.
Test Plan: Add a new unit test which will fail without the diff.
Reviewers: yhchiang, IslamAbdelRahman, igor, yiwu
Reviewed By: yiwu
Subscribers: yiwu, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60915
Summary: RocksDB behavior is slightly different between data on tmpfs and normal file systems. Add a test case to run RocksDB on normal file system.
Test Plan: See the tests launched by Phabricator
Reviewers: kradhakrishnan, IslamAbdelRahman, gunnarku
Reviewed By: gunnarku
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60963
EnvLibrados is a customized RocksDB Env to use RADOS as the backend file system of RocksDB. It overrides all file system related API of default Env. The easiest way to use it is just like following:
std::string db_name = "test_db";
std::string config_path = "path/to/ceph/config";
DB* db;
Options options;
options.env = EnvLibrados(db_name, config_path);
Status s = DB::Open(options, kDBPath, &db);
Then EnvLibrados will forward all file read/write operation to the RADOS cluster assigned by config_path. Default pool is db_name+"_pool".
There are some options that users could set for EnvLibrados.
- write_buffer_size. This variable is the max buffer size for WritableFile. After reaching the buffer_max_size, EnvLibrados will sync buffer content to RADOS, then clear buffer.
- db_pool. Rather than using default pool, users could set their own db pool name
- wal_dir. The dir for WAL files. Because RocksDB only has 2-level structure (dir_name/file_name), the format of wal_dir is "/dir_name"(CAN'T be "/dir1/dir2"). Default wal_dir is "/wal".
- wal_pool. Corresponding pool name for WAL files. Default value is db_name+"_wal_pool"
The example of setting options looks like following:
db_name = "test_db";
db_pool = db_name+"_pool";
wal_dir = "/wal";
wal_pool = db_name+"_wal_pool";
write_buffer_size = 1 << 20;
env_ = new EnvLibrados(db_name, config, db_pool, wal_dir, wal_pool, write_buffer_size);
DB* db;
Options options;
options.env = env_;
// The last level dir name should match the dir name in prefix_pool_map
options.wal_dir = "/tmp/wal";
// open DB
Status s = DB::Open(options, kDBPath, &db);
Librados is required to compile EnvLibrados. Then use "$make LIBRADOS=1" to compile RocksDB. If you want to only compile EnvLibrados test, just run "$ make env_librados_test LIBRADOS=1". To run env_librados_test, you need to have a running RADOS cluster with the configure file located in "../ceph/src/ceph.conf" related to "rocksdb/".
Summary:
Fix flush not being commit while writing manifest, which is a recent bug introduced by D60075.
The issue:
# Options.max_background_flushes > 1
# Background thread A pick up a flush job, flush, then commit to manifest. (Note that mutex is released before writing manifest.)
# Background thread B pick up another flush job, flush. When it gets to `MemTableList::InstallMemtableFlushResults`, it notices another thread is commiting, so it quit.
# After the first commit, thread A doesn't double check if there are more flush result need to commit, leaving the second flush uncommitted.
Test Plan: run the test. Also verify the new test hit deadlock without the fix.
Reviewers: sdong, igor, lightmark
Reviewed By: lightmark
Subscribers: andrewkr, omegaga, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60969
The tests run by `make check` require Bash. On Debian you'd need to run
the test as `make SHELL=/bin/bash check`. This commit makes it work on
all POSIX compatible shells (tested on Debian with `dash`).
Summary: As Title.
Test Plan: See how the diff works.
Reviewers: kradhakrishnan, andrewkr, gunnarku
Reviewed By: gunnarku
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60933