Summary:
Cleanup TTLExtractor interface. The original purpose of it is to allow our users keep using existing `Write()` interface but allow it to accept TTL via `TTLExtractor`. However the interface is confusing. Will replace it with something like `WriteWithTTL(batch, ttl)` in the future.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4229
Differential Revision: D9174390
Pulled By: yiwu-arbug
fbshipit-source-id: 68201703d784408b851336ab4dd9b84188245b2d
Summary: Windows requires new/delete for memory allocations to be overriden. Refactor to be less intrusive.
Differential Revision: D8878047
Pulled By: siying
fbshipit-source-id: 35f2b5fec2f88ea48c9be926539c6469060aab36
Summary:
- Avoid `strdup` to use jemalloc on Windows
- Use `size_t` for consistency
- Add GCC 8 to Travis
- Add CMAKE_BUILD_TYPE=Release to Travis
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3433
Differential Revision: D6837948
Pulled By: sagar0
fbshipit-source-id: b8543c3a4da9cd07ee9a33f9f4623188e233261f
Summary:
Currently, if RocksDB encounters errors during a write operation (user requested or BG operations), it sets DBImpl::bg_error_ and fails subsequent writes. This PR allows the DB to be resumed for certain classes of errors. It consists of 3 parts -
1. Introduce Status::Severity in rocksdb::Status to indicate whether a given error can be recovered from or not
2. Refactor the error handling code so that setting bg_error_ and deciding on severity is in one place
3. Provide an API for the user to clear the error and resume the DB instance
This whole change is broken up into multiple PRs. Initially, we only allow clearing the error for Status::NoSpace() errors during background flush/compaction. Subsequent PRs will expand this to include more errors and foreground operations such as Put(), and implement a polling mechanism for out-of-space errors.
Closes https://github.com/facebook/rocksdb/pull/3997
Differential Revision: D8653831
Pulled By: anand1976
fbshipit-source-id: 6dc835c76122443a7668497c0226b4f072bc6afd
Summary:
Windows does not have LD_PRELOAD mechanism to override all memory allocation functions and ZSTD makes use of C-tuntime calloc. During flushes and compactions default system allocator fragments and the system slows down considerably.
For builds with jemalloc we employ an advanced ZSTD context creation API that re-directs memory allocation to jemalloc. To reduce the cost of context creation on each block we cache ZSTD context within the block based table builder while a new SST file is being built, this will help all platform builds including those w/o jemalloc. This avoids system allocator fragmentation and improves the performance.
The change does not address random reads and currently on Windows reads with ZSTD regress as compared with SNAPPY compression.
Closes https://github.com/facebook/rocksdb/pull/3838
Differential Revision: D8229794
Pulled By: miasantreble
fbshipit-source-id: 719b622ab7bf4109819bc44f45ec66f0dd3ee80d
Summary:
As titled.
I have not extended the Compatibility tests because the new WAL markers are still unimplemented.
Closes https://github.com/facebook/rocksdb/pull/3941
Differential Revision: D8238394
Pulled By: lth
fbshipit-source-id: 980e3d44837bbf2cfa64047f9738f559dfac4b1d
Summary:
Before this PR, Iterator/InternalIterator may simultaneously have non-ok status() and Valid() = true. That state means that the last operation failed, but the iterator is nevertheless positioned on some unspecified record. Likely intended uses of that are:
* If some sst files are corrupted, a normal iterator can be used to read the data from files that are not corrupted.
* When using read_tier = kBlockCacheTier, read the data that's in block cache, skipping over the data that is not.
However, this behavior wasn't documented well (and until recently the wiki on github had misleading incorrect information). In the code there's a lot of confusion about the relationship between status() and Valid(), and about whether Seek()/SeekToLast()/etc reset the status or not. There were a number of bugs caused by this confusion, both inside rocksdb and in the code that uses rocksdb (including ours).
This PR changes the convention to:
* If status() is not ok, Valid() always returns false.
* Any seek operation resets status. (Before the PR, it depended on iterator type and on particular error.)
This does sacrifice the two use cases listed above, but siying said it's ok.
Overview of the changes:
* A commit that adds missing status checks in MergingIterator. This fixes a bug that actually affects us, and we need it fixed. `DBIteratorTest.NonBlockingIterationBugRepro` explains the scenario.
* Changes to lots of iterator types to make all of them conform to the new convention. Some bug fixes along the way. By far the biggest changes are in DBIter, which is a big messy piece of code; I tried to make it less big and messy but mostly failed.
* A stress-test for DBIter, to gain some confidence that I didn't break it. It does a few million random operations on the iterator, while occasionally modifying the underlying data (like ForwardIterator does) and occasionally returning non-ok status from internal iterator.
To find the iterator types that needed changes I searched for "public .*Iterator" in the code. Here's an overview of all 27 iterator types:
Iterators that didn't need changes:
* status() is always ok(), or Valid() is always false: MemTableIterator, ModelIter, TestIterator, KVIter (2 classes with this name anonymous namespaces), LoggingForwardVectorIterator, VectorIterator, MockTableIterator, EmptyIterator, EmptyInternalIterator.
* Thin wrappers that always pass through Valid() and status(): ArenaWrappedDBIter, TtlIterator, InternalIteratorFromIterator.
Iterators with changes (see inline comments for details):
* DBIter - an overhaul:
- It used to silently skip corrupted keys (`FindParseableKey()`), which seems dangerous. This PR makes it just stop immediately after encountering a corrupted key, just like it would for other kinds of corruption. Let me know if there was actually some deeper meaning in this behavior and I should put it back.
- It had a few code paths silently discarding subiterator's status. The stress test caught a few.
- The backwards iteration code path was expecting the internal iterator's set of keys to be immutable. It's probably always true in practice at the moment, since ForwardIterator doesn't support backwards iteration, but this PR fixes it anyway. See added DBIteratorTest.ReverseToForwardBug for an example.
- Some parts of backwards iteration code path even did things like `assert(iter_->Valid())` after a seek, which is never a safe assumption.
- It used to not reset status on seek for some types of errors.
- Some simplifications and better comments.
- Some things got more complicated from the added error handling. I'm open to ideas for how to make it nicer.
* MergingIterator - check status after every operation on every subiterator, and in some places assert that valid subiterators have ok status.
* ForwardIterator - changed to the new convention, also slightly simplified.
* ForwardLevelIterator - fixed some bugs and simplified.
* LevelIterator - simplified.
* TwoLevelIterator - changed to the new convention. Also fixed a bug that would make SeekForPrev() sometimes silently ignore errors from first_level_iter_.
* BlockBasedTableIterator - minor changes.
* BlockIter - replaced `SetStatus()` with `Invalidate()` to make sure non-ok BlockIter is always invalid.
* PlainTableIterator - some seeks used to not reset status.
* CuckooTableIterator - tiny code cleanup.
* ManagedIterator - fixed some bugs.
* BaseDeltaIterator - changed to the new convention and fixed a bug.
* BlobDBIterator - seeks used to not reset status.
* KeyConvertingIterator - some small change.
Closes https://github.com/facebook/rocksdb/pull/3810
Differential Revision: D7888019
Pulled By: al13n321
fbshipit-source-id: 4aaf6d3421c545d16722a815b2fa2e7912bc851d
Summary:
People also use ON/OFF, TRUE/FALSE and other switch options that is allowed by cmake.
Closes https://github.com/facebook/rocksdb/pull/3814
Differential Revision: D7899032
Pulled By: ajkr
fbshipit-source-id: b71511af59e0a78eedafb639b5002c47050bf3c2
Summary:
This patch record min log number to keep to the manifest while flushing SST files to ignore them and any WAL older than them during recovery. This is to avoid scenarios when we have a gap between the WAL files are fed to the recovery procedure. The gap could happen by for example out-of-order WAL deletion. Such gap could cause problems in 2PC recovery where the prepared and commit entry are placed into two separate WAL and gap in the WALs could result into not processing the WAL with the commit entry and hence breaking the 2PC recovery logic.
Before the commit, for 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest logs with outstanding prepare entries, or prepare entries whose respective commit or abort are in memtable. With the commit, the same calculation is done while we apply the SST flush. Just before installing the flush file, we precompute the earliest log file to keep after the flush finishes using the same logic (but skipping the memtables just flushed), record this information to the manifest entry for this new flushed SST file. This pre-computed value is also remembered in memory, and will later be used to determine whether a log file can be deleted. This value is unlikely to change until next flush because the commit entry will stay in memtable. (In WritePrepared, we could have removed the older log files as soon as all prepared entries are committed. It's not yet done anyway. Even if we do it, the only thing we loss with this new approach is earlier log deletion between two flushes, which does not guarantee to happen anyway because the obsolete file clean-up function is only executed after flush or compaction)
This min log number to keep is stored in the manifest using the safely-ignore customized field of AddFile entry, in order to guarantee that the DB generated using newer release can be opened by previous releases no older than 4.2.
Closes https://github.com/facebook/rocksdb/pull/3765
Differential Revision: D7747618
Pulled By: siying
fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729
Summary:
Returning bytes_read causes the caller to call GetLastError()
to report failure but the lasterror may be overwritten by then
so we lose the error code.
Fix up CMake file to include xpress source code only when needed.
Fix warning for the uninitialized var.
Closes https://github.com/facebook/rocksdb/pull/3795
Differential Revision: D7832935
Pulled By: anand1976
fbshipit-source-id: 4be21affb9b85d361b96244f4ef459f492b7cb2b
Summary:
This replaces reading the rocksdb version by external shell script. This does not work reliably on Windows (I wander how it works on AppVeyor).
Closes https://github.com/facebook/rocksdb/pull/3737
Differential Revision: D7703106
Pulled By: ajkr
fbshipit-source-id: 4079c7c77431757e9ddc801363ed896b18fdbf23
Summary:
otherwise the env_librados_test executable will fail to link against
librados.
Signed-off-by: Kefu Chai <tchaikov@gmail.com>
Closes https://github.com/facebook/rocksdb/pull/3724
Differential Revision: D7631542
Pulled By: ajkr
fbshipit-source-id: 38afbf21f9aeb7dedfb840aba8b2f8b421f9edb0
Summary:
The introduction of the 3-way SSE4.2 optimized crc32c implementation in commit f54d7f5fea added the `HAVE_PCLMUL` definition when the compiler supports intrinsics for that instruction, but did not modify CMakeLists.txt to set that definition on MSVC when appropriate. As a result, 3-way SSE4.2 is not used in MSVC builds with CMake although it could be.
Since the existing test program in CMakeLists.txt for `HAVE_SSE42` already uses `_mm_clmulepi64_si128` which is a PCLMUL instruction, this PR sets `HAVE_PCLMUL` as well if that program builds successfully, fixing the problem.
Closes https://github.com/facebook/rocksdb/pull/3673
Differential Revision: D7473975
Pulled By: miasantreble
fbshipit-source-id: bc346b9eb38920e427aa1a253e6dd9811efa269e
Summary:
Possible interleaved execution of background compaction thread calling `FindObsoleteFiles (no full scan) / PurgeObsoleteFiles` and user thread calling `FindObsoleteFiles (full scan) / PurgeObsoleteFiles` can lead to race condition on which RocksDB attempts to delete a file twice. The second attempt will fail and return `IO error`. This may occur to other files, but this PR targets sst.
Also add a unit test to verify that this PR fixes the issue.
The newly added unit test `obsolete_files_test` has a test case for this scenario, implemented in `ObsoleteFilesTest#RaceForObsoleteFileDeletion`. `TestSyncPoint`s are used to coordinate the interleaving the `user_thread` and background compaction thread. They execute as follows
```
timeline user_thread background_compaction thread
t1 | FindObsoleteFiles(full_scan=false)
t2 | FindObsoleteFiles(full_scan=true)
t3 | PurgeObsoleteFiles
t4 | PurgeObsoleteFiles
V
```
When `user_thread` invokes `FindObsoleteFiles` with full scan, it collects ALL files in RocksDB directory, including the ones that background compaction thread have collected in its job context. Then `user_thread` will see an IO error when trying to delete these files in `PurgeObsoleteFiles` because background compaction thread has already deleted the file in `PurgeObsoleteFiles`.
To fix this, we make RocksDB remember which (SST) files have been found by threads after calling `FindObsoleteFiles` (see `DBImpl#files_grabbed_for_purge_`). Therefore, when another thread calls `FindObsoleteFiles` with full scan, it will not collect such files.
ajkr could you take a look and comment? Thanks!
Closes https://github.com/facebook/rocksdb/pull/3638
Differential Revision: D7384372
Pulled By: riversand963
fbshipit-source-id: 01489516d60012e722ee65a80e1449e589ce26d3
Summary:
Improving blob db FIFO eviction with the following changes,
* Change blob_dir_size to max_db_size. Take into account SST file size when computing DB size.
* FIFO now only take into account live sst files and live blob files. It is normal for disk usage to go over max_db_size because there are obsolete sst files and blob files pending deletion.
* FIFO eviction now also evict TTL blob files that's still open. It doesn't evict non-TTL blob files.
* If FIFO is triggered, it will pass an expiration and the current sequence number to compaction filter. Compaction filter will then filter inlined keys to evict those with an earlier expiration and smaller sequence number. So call LSM FIFO.
* Compaction filter also filter those blob indexes where corresponding blob file is gone.
* Add an event listener to listen compaction/flush event and update sst file size.
* Implement DB::Close() to make sure base db, as well as event listener and compaction filter, destruct before blob db.
* More blob db statistics around FIFO.
* Fix some locking issue when accessing a blob file.
Closes https://github.com/facebook/rocksdb/pull/3556
Differential Revision: D7139328
Pulled By: yiwu-arbug
fbshipit-source-id: ea5edb07b33dfceacb2682f4789bea61de28bbfa
Summary:
Issue #3370 Simple fixes to make RocksDB project working also as a submodule of other bigger one.
Closes https://github.com/facebook/rocksdb/pull/3372
Differential Revision: D6729595
Pulled By: ajkr
fbshipit-source-id: eee2589e7a7c4322873dff8510eebd050301c54c
Summary:
Make dependacies switches compatible with other OS builds
TODO: Make find_package work for Windows.
Closes https://github.com/facebook/rocksdb/pull/3322
Differential Revision: D6667637
Pulled By: sagar0
fbshipit-source-id: 5afcd7bbfe69465310a4fbc8e589f01e506b95f5
Summary:
**# Summary**
RocksDB uses SSE crc32 intrinsics to calculate the crc32 values but it does it in single way fashion (not pipelined on single CPU core). Intel's whitepaper () published an algorithm that uses 3-way pipelining for the crc32 intrinsics, then use pclmulqdq intrinsic to combine the values. Because pclmulqdq has overhead on its own, this algorithm will show perf gains on buffers larger than 216 bytes, which makes RocksDB a perfect user, since most of the buffers RocksDB call crc32c on is over 4KB. Initial db_bench show tremendous CPU gain.
This change uses the 3-way SSE algorithm by default. The old SSE algorithm is now behind a compiler tag NO_THREEWAY_CRC32C. If user compiles the code with NO_THREEWAY_CRC32C=1 then the old SSE Crc32c algorithm would be used. If the server does not have SSE4.2 at the run time the slow way (Non SSE) will be used.
**# Performance Test Results**
We ran the FillRandom and ReadRandom benchmarks in db_bench. ReadRandom is the point of interest here since it calculates the CRC32 for the in-mem buffers. We did 3 runs for each algorithm.
Before this change the CRC32 value computation takes about 11.5% of total CPU cost, and with the new 3-way algorithm it reduced to around 4.5%. The overall throughput also improved from 25.53MB/s to 27.63MB/s.
1) ReadRandom in db_bench overall metrics
PER RUN
Algorithm | run | micros/op | ops/sec |Throughput (MB/s)
3-way | 1 | 4.143 | 241387 | 26.7
3-way | 2 | 3.775 | 264872 | 29.3
3-way | 3 | 4.116 | 242929 | 26.9
FastCrc32c|1 | 4.037 | 247727 | 27.4
FastCrc32c|2 | 4.648 | 215166 | 23.8
FastCrc32c|3 | 4.352 | 229799 | 25.4
AVG
Algorithm | Average of micros/op | Average of ops/sec | Average of Throughput (MB/s)
3-way | 4.01 | 249,729 | 27.63
FastCrc32c | 4.35 | 230,897 | 25.53
2) Crc32c computation CPU cost (inclusive samples percentage)
PER RUN
Implementation | run | TotalSamples | Crc32c percentage
3-way | 1 | 4,572,250,000 | 4.37%
3-way | 2 | 3,779,250,000 | 4.62%
3-way | 3 | 4,129,500,000 | 4.48%
FastCrc32c | 1 | 4,663,500,000 | 11.24%
FastCrc32c | 2 | 4,047,500,000 | 12.34%
FastCrc32c | 3 | 4,366,750,000 | 11.68%
**# Test Plan**
make -j64 corruption_test && ./corruption_test
By default it uses 3-way SSE algorithm
NO_THREEWAY_CRC32C=1 make -j64 corruption_test && ./corruption_test
make clean && DEBUG_LEVEL=0 make -j64 db_bench
make clean && DEBUG_LEVEL=0 NO_THREEWAY_CRC32C=1 make -j64 db_bench
Closes https://github.com/facebook/rocksdb/pull/3173
Differential Revision: D6330882
Pulled By: yingsu00
fbshipit-source-id: 8ec3d89719533b63b536a736663ca6f0dd4482e9
Summary:
Divide ReadBlockContents() to multiple sub-functions. Maintaining the input and intermediate data in a new class BlockFetcher.
I hope in general it makes the code easier to maintain.
Another motivation to do it is to clearly divide the logic before file reading and after file reading. The refactor will help us evaluate how can we make I/O async in the future.
Closes https://github.com/facebook/rocksdb/pull/3244
Differential Revision: D6520983
Pulled By: siying
fbshipit-source-id: 338d90bc0338472d46be7a7682028dc9114b12e9
Summary:
We should use it if available otherwise the tools builds never work. Thanks to #3212, we can set -DGFLAGS=1 and it'll be independent of the namespace with which gflags was compiled.
Closes https://github.com/facebook/rocksdb/pull/3214
Differential Revision: D6462214
Pulled By: ajkr
fbshipit-source-id: db4e5f1b905322e3119554a9d01b57532c499384
Summary:
Bring CMakeLists.txt back up to parity with build_detect_platform.
Closes https://github.com/facebook/rocksdb/pull/3211
Differential Revision: D6452908
Pulled By: ajkr
fbshipit-source-id: 93f5f336ad7eff6ecf65dec47bfaf114dd24cfb2
Summary:
The current implementation of PinnableSlice move assignment have an issue #3163. We are moving away from it instead of try to get the move assignment right, since it is too tricky.
Closes https://github.com/facebook/rocksdb/pull/3164
Differential Revision: D6319201
Pulled By: yiwu-arbug
fbshipit-source-id: 8f3279021f3710da4a4caa14fd238ed2df902c48
Summary:
Move WritePreparedTxnDB from pessimistic_transaction_db.h to its own header, write_prepared_txn_db.h
Closes https://github.com/facebook/rocksdb/pull/3114
Differential Revision: D6220987
Pulled By: maysamyabandeh
fbshipit-source-id: 18893fb4fdc6b809fe117dabb544080f9b4a301b
Summary:
it turns out that, with older GCC shipped from centos7, the SSE42
intrinsics are not available even with "target" specified. so we
need to pass "-msse42" for checking compiler's sse4.2 support and
for building crc32c.cc which uses sse4.2 intrinsics for crc32.
Signed-off-by: Kefu Chai <tchaikov@gmail.com>
Closes https://github.com/facebook/rocksdb/pull/2950
Differential Revision: D6032298
Pulled By: siying
fbshipit-source-id: 124c946321043661b3fb0a70b6cdf4c9c5126ab4
Summary:
Update Compaction/Flush to support WritePreparedTxnDB: Add SnapshotChecker which is a proxy to query WritePreparedTxnDB::IsInSnapshot. Pass SnapshotChecker to DBImpl on WritePreparedTxnDB open. CompactionIterator use it to check if a key has been committed and if it is visible to a snapshot. In CompactionIterator:
* check if key has been committed. If not, output uncommitted keys AS-IS.
* use SnapshotChecker to check if key is visible to a snapshot when in need.
* do not output key with seq = 0 if the key is not committed.
Closes https://github.com/facebook/rocksdb/pull/2926
Differential Revision: D5902907
Pulled By: yiwu-arbug
fbshipit-source-id: 945e037fdf0aa652dc5ba0ad879461040baa0320
Summary:
Add kTypeBlobIndex value type, which will be used by blob db only, to insert a (key, blob_offset) KV pair. The purpose is to
1. Make it possible to open existing rocksdb instance as blob db. Existing value will be of kTypeIndex type, while value inserted by blob db will be of kTypeBlobIndex.
2. Make rocksdb able to detect if the db contains value written by blob db, if so return error.
3. Make it possible to have blob db optionally store value in SST file (with kTypeValue type) or as a blob value (with kTypeBlobIndex type).
The root db (DBImpl) basically pretended kTypeBlobIndex are normal value on write. On Get if is_blob is provided, return whether the value read is of kTypeBlobIndex type, or return Status::NotSupported() status if is_blob is not provided. On scan allow_blob flag is pass and if the flag is true, return wether the value is of kTypeBlobIndex type via iter->IsBlob().
Changes on blob db side will be in a separate patch.
Closes https://github.com/facebook/rocksdb/pull/2886
Differential Revision: D5838431
Pulled By: yiwu-arbug
fbshipit-source-id: 3c5306c62bc13bb11abc03422ec5cbcea1203cca
Summary:
Hello,
current master branch declares cmake_minimum_required (VERSION 2.8.11)
but cmake gives the following error:
[ 6%] CMake Error at CMakeLists.txt:658 (install):
install TARGETS given unknown argument "INCLUDES".
CMake Error at src/CMakeLists.txt:658 (install): install TARGETS given unknown argument "INCLUDES".
because this argument not supported on CMake versions prior 2.8.12
Closes https://github.com/facebook/rocksdb/pull/2904
Differential Revision: D5863430
Pulled By: yiwu-arbug
fbshipit-source-id: 0f7230e080add472ad4b87836b3104ea0b971a38
Summary:
snprintf is defined as _snprintf, which doesn't exist in the std
namespace.
Closes https://github.com/facebook/rocksdb/pull/2298
Differential Revision: D5070457
Pulled By: yiwu-arbug
fbshipit-source-id: 6e1659ac3e86170653b174578da5a8ed16812cbb
Summary:
because it is not only platform independent
but also allows to override the build date
This helps to make ceph builds reproducible (that includes a fork of rockdb in a submodule)
Also adds UTC flag, to be independent of timezone.
Requires cmake-2.8.11+ from 2013
Closes https://github.com/facebook/rocksdb/pull/2848
Differential Revision: D5820189
Pulled By: yiwu-arbug
fbshipit-source-id: e3e8c1550e10e238c173f6c5d9ba15f71ad3ce28
Summary:
Some compilers require `-std=c++11` for the `cstdint` header to be available. We already have logic to add `-std=c++11` to `CXXFLAGS` when the compiler is not MSVC; simply reorder CMakeLists.txt so that logic happens before the calls to `CHECK_CXX_SOURCE_COMPILES`.
Additionally add a missing `set(CMAKE_REQUIRED_FLAGS, ...)` before a call to `CHECK_C_SOURCE_COMPILES`.
Closes https://github.com/facebook/rocksdb/pull/2535
Differential Revision: D5384244
Pulled By: yiwu-arbug
fbshipit-source-id: 2dbae4297c5d8ab4636e08b1457ffb2d3e37aef4
Summary:
This patch splits Commit and Prepare into lock-related logic and db-write-related logic. It moves lock-related logic to PessimisticTransaction to be reused by all children classes and movies the existing impl of db-write-related to PrepareInternal, CommitSingleInternal, and CommitInternal in WriteCommittedTxnImpl.
Closes https://github.com/facebook/rocksdb/pull/2691
Differential Revision: D5569464
Pulled By: maysamyabandeh
fbshipit-source-id: d1b8698e69801a4126c7bc211745d05c636f5325