Commit Graph

3948 Commits

Author SHA1 Message Date
Akanksha Mahajan
a0faff126d Report kFilesMarkedForCompaction for delete triggered compactions (#6680)
Summary:
Summary : Set manual_compaction false in case of DeleteTriggeredCompaction object so that kFilesMarkedForComapaction can be reported.
          Added a DeletionTriggeredUniversalCompactionMarking test case for Deletion Triggered compaction in case of Universal Compaction.

Test Plan : make check -j64
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6680

Test Plan: make check -j64

Reviewed By: anand1976

Differential Revision: D20945946

Pulled By: akankshamahajan15

fbshipit-source-id: af84e417bd7127652aaae9143c560d1ab3815d25
2020-04-10 15:30:38 -07:00
Huisheng Liu
9e89ffb776 make iterator return versions between timestamp bounds (#6544)
Summary:
(Based on Yanqin's idea) Add a new field in readoptions as lower timestamp bound for iterator. When the parameter is not supplied (nullptr), the iterator returns the latest visible version of a record. When it is supplied, the existing timestamp field is the upper bound. Together the two serves as a bounded time window. The iterator returns all versions of a record falling in the window.

SeekRandom perf test (10 minutes) on the same development machine ram drive with the same DB data shows no regression (within marge of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks.
base line (commit e860f8840):
seekrandom   : 7.836 micros/op 4082449 ops/sec; (0 of 73481999 found)
This PR:
seekrandom   : 7.764 micros/op 4120935 ops/sec; (0 of 71303999 found)

db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=seekrandom --use_existing_db=1 --num=25000000 --threads=32 --allow_concurrent_memtable_write=0
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6544

Reviewed By: ltamasi

Differential Revision: D20844069

Pulled By: riversand963

fbshipit-source-id: d97f2bf38a323c8c6a68db213b2d3c694b1c1f74
2020-04-10 09:51:58 -07:00
Yi Wu
eb287c72d7 Fix wrong key being read on ingested file with global seqno and delta encoding (#6669)
Summary:
On reading an ingested SST file, `DataBlockIter` will replace seqno encoded in a key with global seqno. However, if the original seqno was part of the prefix used for the next key, the global seqno is by mistake used as part of the prefix to construct the next key, causing wrong result being returned. Although at this point it is only software error while data in the file is not corrupted, the issue can further cause compaction output out of order and corrupted result when the ingested SST participated in compaction. Fixing the issue by save the actual seqno and restore it before the key being used as prefix to construct next key.

The unit test is by Little-Wallace from https://github.com/facebook/rocksdb/issues/6666. Fixing https://github.com/facebook/rocksdb/issues/6666.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6669

Test Plan:
New unit test

Signed-off-by: Yi Wu <yiwu@pingcap.com>

Reviewed By: cheng-chang

Differential Revision: D20931808

Pulled By: ajkr

fbshipit-source-id: f01959c35d6a493954dca981663766c7a5a9e8ab
2020-04-08 21:22:15 -07:00
Peter Dillinger
e5f1bfc263 Fix initializer syntax for old Xcode compiler (#6662)
Summary:
Example compiler output, from OSX TEST_GROUP=3:

db/flush_job_test.cc:185:7: error: suggest braces around initialization
of subobject [-Werror,-Wmissing-braces]
      kInvalidBlobFileNumber, 5, 103, 17, 102, 101};

Apparently permitted in newer version, but worth working around.
https://stackoverflow.com/questions/31555584/why-is-clang-warning-suggest-braces-around-initialization-of-subobject-wmis
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6662

Test Plan: CI (temporarily including OSX TEST_GROUP=3 in Travis)

Reviewed By: ltamasi

Differential Revision: D20901009

Pulled By: pdillinger

fbshipit-source-id: 5338878613b5725e5d632c8858904de467dc4692
2020-04-07 16:00:26 -07:00
Kirill Abrosimov
3ff603171d added new functions to c-api (#5630)
Summary:
Few functions from options added to C-api
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5630

Reviewed By: anand1976

Differential Revision: D20896731

Pulled By: ltamasi

fbshipit-source-id: e4215a58b3c2429ec44e3f0d0381cbf86700fb14
2020-04-07 14:45:39 -07:00
Steven Fackler
f53cdab3d7 Hex encode keys in compaction flush logs (#6616)
Summary:
The raw key bytes are currently dumped directly into the log messages,
which is not ideal if the keys aren't ASCII strings. Null bytes in
particular can cut off bits of the message early.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6616

Reviewed By: ajkr

Differential Revision: D20879218

Pulled By: anand1976

fbshipit-source-id: 825a20715fe6d8012c0163c6e7b8159f7926a1a7
2020-04-06 17:41:45 -07:00
mrambacher
259b6ec8da Move the OptionTypeMap code closer to home (#6198)
Summary:
This is a predecessor to the Configurable PR.  This change moves the OptionTypeInfo maps closer to where they will be used.

When the Configurable changes are adopted, these values will become static and not associated with the OptionsHelper.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6198

Reviewed By: siying

Differential Revision: D20778108

Pulled By: zhichao-cao

fbshipit-source-id: a9f85fc73bc53503656e1958ecc1e764052fd1aa
2020-04-03 10:52:38 -07:00
Peter Dillinger
079e77ff9e Revamp cache_bench to resemble a real workload (#6629)
Summary:
I suspect LRUCache could use some optimization, and to support
such an effort, a good benchmarking tool is needed. The existing
cache_bench was heavily skewed toward insertion and lookup misses, and
did not saturate memory with other work. This change should improve
those things to better resemble a real workload.

(All below using clang compiler, for some consistency, but not
necessarily same version and settings.)

The real workload is from production MySQL on RocksDB, filtering stacks
containing "LRU", "ShardedCache" or "CacheShard."
Lookup inclusive: 66%
Insert inclusive: 17%
Release inclusive: 15%

An alternate simulated workload is MySQL running a LinkBench read test:
Lookup inclusive: 54%
Insert inclusive: 24%
Release inclusive: 21%

cache_bench default settings, prior to this change:
Lookup inclusive: 35.8%
Insert inclusive: 63.6%
Release inclusive: 0%

cache_bench after this change (intended as somewhat "tighter" workload
than average production, more like LinkBench):
Lookup inclusive: 52%
Insert inclusive: 20%
Release inclusive: 26%

And top exclusive stacks (portion of stack samples as filtered above):
Production MySQL:
LRUHandleTable::FindPointer: 25.3%
rocksdb::operator==: 15.1%  <-- Slice ==
LRUCacheShard::LRU_Remove: 13.8%
ShardedCache::Lookup: 8.9%
__pthread_mutex_lock: 7.1%
LRUCacheShard::LRU_Insert: 6.3%
MurmurHash64A: 4.8%  <-- Since upgraded to XXH3p
...

Old cache_bench:
LRUHandleTable::FindPointer: 23.6%
__pthread_mutex_lock: 15.0%
__pthread_mutex_unlock_usercnt: 11.7%
__lll_lock_wait: 8.6%
__lll_unlock_wake: 6.8%
LRUCacheShard::LRU_Insert: 6.0%
ShardedCache::Lookup: 4.4%
LRUCacheShard::LRU_Remove: 2.8%
...
rocksdb::operator==: 0.2%  <-- Slice ==
...

New cache_bench:
LRUHandleTable::FindPointer: 22.8%
__pthread_mutex_unlock_usercnt: 14.3%
rocksdb::operator==: 10.5%  <-- Slice ==
LRUCacheShard::LRU_Insert: 9.0%
__pthread_mutex_lock: 5.9%
LRUCacheShard::LRU_Remove: 5.0%
...
ShardedCache::Lookup: 2.9%
...

So there's a bit more lock contention in the benchmark than in
production, but otherwise looks similar enough to me. At least it's a
big improvement over the existing code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6629

Test Plan: No production code changes, ran cache_bench with ASAN

Reviewed By: ltamasi

Differential Revision: D20824318

Pulled By: pdillinger

fbshipit-source-id: 6f8dc5891ead0f87edbed3a615ecd5289d9abe12
2020-04-03 10:26:49 -07:00
Zhichao Cao
ef088f0e93 Fix the multi-thread Manifest write dependency in error_handler_fs_test (#6637)
Summary:
In CompactionManifestWriteRetryableError in error_handler_fs_test, the manifest write of flush should pass with no fs error. After flush, fs is set to error status and the manifest write of compaction should fail due to the IO Error. Currently, the manifest write of flush is not synced with the compaction in order, which might cause manifest write fails, which will cause test failure. Fixed by adding the LoadDependency of sync-point after flush and before compaction.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6637

Test Plan: pass error_hanlder_fs_tes. Pass make asan_check

Reviewed By: anand1976

Differential Revision: D20826969

Pulled By: zhichao-cao

fbshipit-source-id: fb2e702caa19bd63c82570320536b7acda870ff1
2020-04-02 18:08:46 -07:00
anand76
0709cd04ca Fix LITE mode test failure in DBOptionsTest.ChangeCompression (#6635)
Summary:
This failure was introduced in https://github.com/facebook/rocksdb/issues/6262
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6635

Reviewed By: siying

Differential Revision: D20822602

Pulled By: anand1976

fbshipit-source-id: 96b316816cce6b95b092a7fc46ea968ed6ba8809
2020-04-02 16:41:09 -07:00
sdong
d0f3894cf1 In block based table builder, make variables for estimating file size atomic (#6636)
Summary:
With https://github.com/facebook/rocksdb/issues/6262, TSAN complains about data race of some variables. Those variables are used to estimate file size and are accessed in writer and background threads. Since file size estimation doesn't have to be 100% accurate, we make some variables atomic and use relaxed memory order.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6636

Test Plan: Run all tests with TSAN.

Reviewed By: anand1976

Differential Revision: D20820635

fbshipit-source-id: 1ea45ff38be15e33674ffe06b7d42fc9fe161ea5
2020-04-02 16:16:24 -07:00
Levi Tamasi
2165c3bacc Re-persist blob file metadata when a new manifest file is created (#6630)
Summary:
Does what it says on the can. Similarly to table files, we need to re-persist
the metadata of live blob files whenever a new manifest file is opened.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6630

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D20802126

Pulled By: ltamasi

fbshipit-source-id: 5738692d898790293bf09d66e9997369bbf89566
2020-04-02 11:53:05 -07:00
Ziyue Yang
03a781a90c Add pipelined & parallel compression optimization (#6262)
Summary:
This PR adds support for pipelined & parallel compression optimization for `BlockBasedTableBuilder`. This optimization makes block building, block compression and block appending a pipeline, and uses multiple threads to accelerate block compression. Users can set `CompressionOptions::parallel_threads` greater than 1 to enable compression parallelism.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6262

Reviewed By: ajkr

Differential Revision: D20651306

fbshipit-source-id: 62125590a9c15b6d9071def9dc72589c1696a4cb
2020-04-01 16:40:18 -07:00
Yanqin Jin
b5818f87f0 Fix clang analyze error (#6622)
Summary:
As title. https://github.com/facebook/rocksdb/issues/6612 caused clang analyze to fail with the error:
```
db/compaction/compaction_picker_fifo.cc:105:39: warning: Called C++ object pointer is null
                     cf_name.c_str(), f->fd.GetNumber(), creation_time);
                                      ^~~~~~~~~~~~~~~~~
./logging/logging.h:59:36: note: expanded from macro 'ROCKS_LOG_BUFFER'
                                 ##__VA_ARGS__)
                                   ^~~~~~~~~~~
1 warning generated.
```

Test Plan (devserver):
USE_CLANG=1 make analyze
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6622

Reviewed By: ltamasi

Differential Revision: D20787407

Pulled By: riversand963

fbshipit-source-id: a5de4910cc1aa0d3481a73ec114578925bfd63f7
2020-04-01 10:01:38 -07:00
Levi Tamasi
e6f86cfb36 Revert the recent cache deleter change (#6620)
Summary:
Revert "Use function objects as deleters in the block cache (https://github.com/facebook/rocksdb/issues/6545)"

    This reverts commit 6301dbe7a7.

    Revert "Call out the cache deleter related interface change in HISTORY.md (https://github.com/facebook/rocksdb/issues/6606)"

    This reverts commit 3a35542f86.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6620

Test Plan: `make check`

Reviewed By: zhichao-cao

Differential Revision: D20773311

Pulled By: ltamasi

fbshipit-source-id: 7637a761f718f323ef0e7da959462e8fb06e7a2b
2020-03-31 16:11:06 -07:00
sdong
80979f81c7 Make options.bottommost_compression, compression_opts and bottommost_compression_opts dynamically changeable. (#6615)
Summary:
These three options should be made dynamically changeable. Simply add them to MutableCFOptions and made the change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6615

Test Plan: Add a unit test to make sure that SetOptions() can change the options.

Reviewed By: riversand963

Differential Revision: D20755951

fbshipit-source-id: 8165f4fd7a7a665cc7fb049698935022a5d2e7ff
2020-03-31 12:11:42 -07:00
Yanqin Jin
18cf0de640 Use flush time for the props.creation_time for FIFO compaction (#6612)
Summary:
For FIFO compaction, we use flush time instead of oldest key time as the
creation time. This is to prevent FIFO compaction dropping files whose oldest
key time is older than TTL but which has newer keys than TTL.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6612

Test Plan: make check

Reviewed By: siying

Differential Revision: D20748217

Pulled By: riversand963

fbshipit-source-id: 3f7b00a847020760537cdddd12f6fe039e5bc663
2020-03-30 18:59:17 -07:00
Zhichao Cao
e8d332d97e Use FileChecksumGenFactory for SST file checksum (#6600)
Summary:
In the current implementation, sst file checksum is calculated by a shared checksum function object, which may make some checksum function hard to be applied here such as SHA1. In this implementation, each sst file will have its own checksum generator obejct, created by FileChecksumGenFactory. User needs to implement its own FilechecksumGenerator and Factory to plugin the in checksum calculation method.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6600

Test Plan: tested with make asan_check

Reviewed By: riversand963

Differential Revision: D20717670

Pulled By: zhichao-cao

fbshipit-source-id: 2a74c1c280ac11a07a1980185b43b671acaa71c6
2020-03-29 15:58:46 -07:00
Cheng Chang
ee50b8d499 Be able to decrease background thread's CPU priority when creating database backup (#6602)
Summary:
When creating a database backup, the background threads will not only consume IO resources by copying files, but also consuming CPU such as by computing checksums. During peak times, the CPU consumption by the background threads might affect online queries.

This PR makes it possible to decrease CPU priority of these threads when creating a new backup.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6602

Test Plan: make check

Reviewed By: siying, zhichao-cao

Differential Revision: D20683216

Pulled By: cheng-chang

fbshipit-source-id: 9978b9ed9488e8ce135e90ca083e5b4b7221fd84
2020-03-28 19:07:25 -07:00
Zhichao Cao
4246888101 Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.

The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487

Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check

Reviewed By: anand1976

Differential Revision: D20685017

Pulled By: zhichao-cao

fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
2020-03-27 16:04:43 -07:00
Levi Tamasi
6f62322fe4 Add blob files to VersionStorageInfo/VersionBuilder (#6597)
Summary:
The patch adds a couple of classes to represent metadata about
blob files: `SharedBlobFileMetaData` contains the information elements
that are immutable (once the blob file is closed), e.g. blob file number,
total number and size of blob files, checksum method/value, while
`BlobFileMetaData` contains attributes that can vary across versions like
the amount of garbage in the file. There is a single `SharedBlobFileMetaData`
for each blob file, which is jointly owned by the `BlobFileMetaData` objects
that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s
and can also be shared if the (immutable _and_ mutable) state of the blob file
is the same in two versions.

In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends
`VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those
containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata
to a new `VersionStorageInfo`. Consistency checks are also extended to ensure
that table files point to blob files that are part of the `Version`, and that all blob files
that are part of any given `Version` have at least some _non_-garbage data in them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D20656803

Pulled By: ltamasi

fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
2020-03-26 18:51:53 -07:00
Levi Tamasi
6301dbe7a7 Use function objects as deleters in the block cache (#6545)
Summary:
As the first step of reintroducing eviction statistics for the block
cache, the patch switches from using simple function pointers as deleters
to function objects implementing an interface. This will enable using
deleters that have state, like a smart pointer to the statistics object
that is to be updated when an entry is removed from the cache. For now,
the patch adds a deleter template class `SimpleDeleter`, which simply
casts the `value` pointer to its original type and calls `delete` or
`delete[]` on it as appropriate. Note: to prevent object lifecycle
issues, deleters must outlive the cache entries referring to them;
`SimpleDeleter` ensures this by using the ("leaky") Meyers singleton
pattern.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6545

Test Plan: `make asan_check`

Reviewed By: siying

Differential Revision: D20475823

Pulled By: ltamasi

fbshipit-source-id: fe354c33dd96d9bafc094605462352305449a22a
2020-03-26 16:19:58 -07:00
Mike Kolupaev
963af52f15 Fix iterator reading filter block despite read_tier == kBlockCacheTier (#6562)
Summary:
We're seeing iterators with `ReadOptions::read_tier == kBlockCacheTier` sometimes doing file reads. Stack trace:

```
rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned long, rocksdb::Slice*, char*, bool) const
rocksdb::BlockFetcher::ReadBlockContents()
rocksdb::Status rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::BlockContents*) const
rocksdb::Status rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, bool, bool) const
rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::ReadFilterBlock(rocksdb::BlockBasedTable const*, rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*)
rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::GetOrReadFilterBlock(bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*) const
rocksdb::FullFilterBlockReader::MayMatch(rocksdb::Slice const&, bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*) const
rocksdb::FullFilterBlockReader::RangeMayExist(rocksdb::Slice const*, rocksdb::Slice const&, rocksdb::SliceTransform const*, rocksdb::Comparator const*, rocksdb::Slice const*, bool*, bool, rocksdb::BlockCacheLookupContext*)
rocksdb::BlockBasedTable::PrefixMayMatch(rocksdb::Slice const&, rocksdb::ReadOptions const&, rocksdb::SliceTransform const*, bool, rocksdb::BlockCacheLookupContext*) const
rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::SeekImpl(rocksdb::Slice const*)
rocksdb::ForwardIterator::SeekInternal(rocksdb::Slice const&, bool)
rocksdb::DBIter::Seek(rocksdb::Slice const&)
```

`BlockBasedTableIterator::CheckPrefixMayMatch` was missing a check for `kBlockCacheTier`. This PR adds it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6562

Test Plan: deployed it to a logdevice test cluster and looked at logdevice's IO tracing.

Reviewed By: siying

Differential Revision: D20529368

Pulled By: al13n321

fbshipit-source-id: 65bf33964b1951464415c900336635fb20919611
2020-03-26 15:21:26 -07:00
sdong
6fd0ed4993 CompactRange() to use bottom pool when goes to bottommost level (#6593)
Summary:
In automatic compaction, if a compaction is bottommost, it goes to bottom thread pool. We should do the same for manual compaction too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6593

Test Plan: Add a unit test. See all existing tests pass.

Reviewed By: ajkr

Differential Revision: D20637408

fbshipit-source-id: cb03031e8f895085f7acf6d2d65e69e84c9ddef3
2020-03-24 20:24:32 -07:00
Huisheng Liu
a6ce5c823b multiget support for timestamps (#6483)
Summary:
Add timestamp support for MultiGet().
timestamp from readoptions is honored, and timestamps can be returned along with values.

MultiReadRandom perf test (10 minutes) on the same development machine ram drive with the same DB data shows no regression (within marge of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks.
base line (commit 17bef7d3a):
  multireadrandom :     104.173 micros/op 307167 ops/sec; (5462999 of 5462999 found)
This PR:
  multireadrandom :     104.199 micros/op 307095 ops/sec; (5307999 of 5307999 found)

.\db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=multireadrandom --use_existing_db=1 --num=25000000 --threads=32 --allow_concurrent_memtable_write=0
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6483

Reviewed By: anand1976

Differential Revision: D20498373

Pulled By: riversand963

fbshipit-source-id: 8505f22bc40fd791bc7dd05e48d7e67c91edb627
2020-03-24 11:24:09 -07:00
sdong
921cdd37e2 Fix bug that number of table loading threads is set as a boolean (#6576)
Summary:
When applying a new version in non DB open case, optimize_filters_for_hits is used for max_threads, which is clearly a bug. It is not clear what the indented value in the first place, but it value 1 makes sense here, which would create no extra threads. This bug is not expected to cause user visible problems, assuming C++ implicitly cast bool to 0 or 1.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6576

Test Plan: Run all exsiting test.

Reviewed By: ajkr

Differential Revision: D20602467

fbshipit-source-id: 40b2cd8619aba09ae9242b36c415464db3c9b737
2020-03-24 10:17:40 -07:00
anand76
a9d168cfd7 Simplify migration to FileSystem API (#6552)
Summary:
The current Env/FileSystem API separation has a couple of issues -
1. It requires the user to specify 2 options - ```Options::env``` and ```Options::file_system``` - which means they have to make code changes to benefit from the new APIs. Furthermore, there is a risk of accessing the same APIs in two different ways, through Env in the old way and through FileSystem in the new way. The two may not always match, for example, if env is ```PosixEnv``` and FileSystem is a custom implementation. Any stray RocksDB calls to env will use the ```PosixEnv``` implementation rather than the file_system implementation.
2. There needs to be a simple way for the FileSystem developer to instantiate an Env for backward compatibility purposes.

This PR solves the above issues and simplifies the migration in the following ways -
1. Embed a shared_ptr to the ```FileSystem``` in the ```Env```, and remove ```Options::file_system``` as a configurable option. This way, no code changes will be required in application code to benefit from the new API. The default Env constructor uses a ```LegacyFileSystemWrapper``` as the embedded ```FileSystem```.
1a. - This also makes it more robust by ensuring that even if RocksDB
  has some stray calls to Env APIs rather than FileSystem, they will go
  through the same object and thus there is no risk of getting out of
  sync.
2. Provide a ```NewCompositeEnv()``` API that can be used to construct a
PosixEnv with a custom FileSystem implementation. This eliminates an
indirection to call Env APIs, and relieves the FileSystem developer of
the burden of having to implement wrappers for the Env APIs.
3. Add a couple of missing FileSystem APIs - ```SanitizeEnvOptions()``` and
```NewLogger()```

Tests:
1. New unit tests
2. make check and make asan_check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6552

Reviewed By: riversand963

Differential Revision: D20592038

Pulled By: anand1976

fbshipit-source-id: c3801ad4153f96d21d5a3ae26c92ba454d1bf1f7
2020-03-23 21:54:21 -07:00
Zhichao Cao
d300d10962 Fix the MultiGet testing failure in Circleci (#6578)
Summary:
The MultiGet test in db_basic_test fails in CircleCI vs2019. The reason is that even Snappy compression is enabled, the first compression type is still kNoCompression. This PR checks the list and ensure that only when compression is enable and the compression type is valid, compression will be enabled. Such that, it will not fail the combined read test in MultiGet.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6578

Test Plan: make check, db_basic_test.

Reviewed By: anand1976

Differential Revision: D20607529

Pulled By: zhichao-cao

fbshipit-source-id: dcead264d5c2da105912c18caad34b8510bb04b0
2020-03-23 18:51:09 -07:00
Yanqin Jin
617f479266 Fix LITE build (#6575)
Summary:
Fix LITE build by excluding some unit tests that use features not supported in LITE.
```
db/db_basic_test.cc:1778:8: error: ‘void rocksdb::{anonymous}::TableFileListener::OnTableFileCreated(const rocksdb::TableFileCreationInfo&)’ marked ‘override’, but does not override
   void OnTableFileCreated(const TableFileCreationInfo& info) override {
        ^~~~~~~~~~~~~~~~~~
make: *** [db/db_basic_test.o] Error 1
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6575

Reviewed By: ltamasi

Differential Revision: D20598598

Pulled By: riversand963

fbshipit-source-id: 367f7cb2500360ad57030b138a94c0f731a04339
2020-03-23 13:05:36 -07:00
Zhichao Cao
5c6346c420 Revert "Added the safe-to-ignore tag to version_edit (#6530)" (#6569)
Summary:
This reverts commit e10553f2a6.

Pass make asan_check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6569

Reviewed By: riversand963

Differential Revision: D20574319

Pulled By: zhichao-cao

fbshipit-source-id: ce36981a21596f5f2e14da6a59a2bb3619509a8b
2020-03-23 10:27:47 -07:00
Yanqin Jin
fb09ef05dc Attempt to recover from db with missing table files (#6334)
Summary:
There are situations when RocksDB tries to recover, but the db is in an inconsistent state due to SST files referenced in the MANIFEST being missing. In this case, previous RocksDB will just fail the recovery and return a non-ok status.
This PR enables another possibility. During recovery, RocksDB checks possible MANIFEST files, and try to recover to the most recent state without missing table file. `VersionSet::Recover()` applies version edits incrementally and "materializes" a version only when this version does not reference any missing table file. After processing the entire MANIFEST, the version created last will be the latest version.
`DBImpl::Recover()` calls `VersionSet::Recover()`. Afterwards, WAL replay will *not* be performed.
To use this capability, set `options.best_efforts_recovery = true` when opening the db. Best-efforts recovery is currently incompatible with atomic flush.

Test plan (on devserver):
```
$make check
$COMPILE_WITH_ASAN=1 make all && make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6334

Reviewed By: anand1976

Differential Revision: D19778960

Pulled By: riversand963

fbshipit-source-id: c27ea80f29bc952e7d3311ecf5ee9c54393b40a8
2020-03-20 19:30:48 -07:00
Cheng Chang
5fd152b7ad Get block size only in direct IO mode (#6522)
Summary:
When `use_direct_reads` and `use_direct_writes` are `false`, `logical_sector_size_` inside various `*File` implementations are not actually used, so `GetLogicalBlockSize` does not necessarily need to be called for `logical_sector_size_`, just set a default page size.

This is a follow up PR for https://github.com/facebook/rocksdb/pull/6457.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6522

Test Plan: make check

Reviewed By: siying

Differential Revision: D20408885

Pulled By: cheng-chang

fbshipit-source-id: f2d3808f41265237e7fa2c0be9f084f8fa97fe3d
2020-03-20 15:26:10 -07:00
Zhichao Cao
e10553f2a6 Added the safe-to-ignore tag to version_edit (#6530)
Summary:
Each time RocksDB switches to a new MANIFEST file from old one, it calls WriteCurrentStateToManifest() which writes a 'snapshot' of the current in-memory state of versions to the beginning of the new manifest as a bunch of version edits. We can distinguish these version edits from other version edits written during normal operations with a custom, safe-to-ignore tag.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6530

Test Plan: added test to version_edit_test, pass make asan_check

Reviewed By: riversand963

Differential Revision: D20524516

Pulled By: zhichao-cao

fbshipit-source-id: f1de102f5499bfa88dae3caa2f32c7f42cf904db
2020-03-19 11:30:26 -07:00
Levi Tamasi
442404558a Clean up VersionBuilder a bit (#6556)
Summary:
The whole point of the pimpl idiom is to hide implementation details.
Internal helper methods like `CheckConsistency`, `CheckConsistencyForDeletes`,
and `MaybeAddFile` do not belong in the public interface of the class.
In addition, the patch switches to `unique_ptr` for the implementation
object instead of using a raw `delete`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6556

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D20523568

Pulled By: ltamasi

fbshipit-source-id: 5bbb0ccebd0c47a33b815398c7f9cfe13bd775ac
2020-03-19 10:44:16 -07:00
Yanqin Jin
66ed58083a Reduce runtime of db_with_timestamp_basic_test (#6546)
Summary:
Reduce runtime by reducing test scale to avoid test time-outs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6546

Test Plan:
time ./db_with_timestamp_basic_test
and watch internal tests.

Reviewed By: zhichao-cao

Differential Revision: D20479292

Pulled By: riversand963

fbshipit-source-id: c9e4a155be7699dd4de60fa531de86d442a3ba0a
2020-03-17 10:50:48 -07:00
Cheng Chang
402da454cb Migrate AppVeyor to CircleCI (#6518)
Summary:
CircleCI is the new recommended CI system internally.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6518

Test Plan: Watch https://app.circleci.com/pipelines/github/facebook/rocksdb

Differential Revision: D20454743

Pulled By: cheng-chang

fbshipit-source-id: 39031568d6c1d3d25b7fbd78fa9a0e6067ddc47c
2020-03-13 21:58:51 -07:00
Cheng Chang
23eae14d24 Destroy DB at the end of each test in db_logical_block_size_cache_test (#6532)
Summary:
If DB is not deleted, in concurrent test, the tests might fail because of the previously existing DB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6532

Test Plan:
make clean && make -j24 LITE=1  db_logical_block_size_cache_test && ./db_logical_block_size_cache_test
make clean && make -j24 db_logical_block_size_cache_test && ./db_logical_block_size_cache_test

Differential Revision: D20454734

Pulled By: cheng-chang

fbshipit-source-id: 8abede2ec1d79c1a4fe1bc95fbda489f8f7ee052
2020-03-13 21:53:38 -07:00
Zhichao Cao
5c30e6c088 Separate timestamp related test from db_basic_test (#6516)
Summary:
In some of the test, db_basic_test may cause time out due to its long running time. Separate the timestamp related test from db_basic_test to avoid the potential issue.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6516

Test Plan: pass make asan_check

Differential Revision: D20423922

Pulled By: zhichao-cao

fbshipit-source-id: d6306f89a8de55b07bf57233e4554c09ef1fe23a
2020-03-13 11:37:15 -07:00
Cheng Chang
0d2c8e47e8 OpenForReadOnly is not supported in LITE mode (#6523)
Summary:
In DBLogicalBlockSizeCacheTest, do not test OpenForReadOnly in LITE mode.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6523

Test Plan: watch test for LITE mode

Differential Revision: D20420321

Pulled By: cheng-chang

fbshipit-source-id: e45bf6f2800206d6f8ce9af7308e76a08de80643
2020-03-12 14:13:59 -07:00
Levi Tamasi
c15e85bdcb Move BlobDB related files under db/ to db/blob/ (#6519)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6519

Test Plan:
```
make all
make check
```

Differential Revision: D20400691

Pulled By: ltamasi

fbshipit-source-id: 20ef911cf1c2c92c7f71ef0b493f9be64f2eef94
2020-03-12 11:00:56 -07:00
Cheng Chang
2d9efc9ab2 Cache result of GetLogicalBufferSize in Linux (#6457)
Summary:
In Linux, when reopening DB with many SST files, profiling shows that 100% system cpu time spent for a couple of seconds for `GetLogicalBufferSize`. This slows down MyRocks' recovery time when site is down.

This PR introduces two new APIs:
1. `Env::RegisterDbPaths` and `Env::UnregisterDbPaths` lets `DB` tell the env when it starts or stops using its database directories . The `PosixFileSystem` takes this opportunity to set up a cache from database directories to the corresponding logical block sizes.
2. `LogicalBlockSizeCache` is defined only for OS_LINUX to cache the logical block sizes.

Other modifications:
1. rename `logical buffer size` to `logical block size` to be consistent with Linux terms.
2. declare `GetLogicalBlockSize` in `PosixHelper` to expose it to `PosixFileSystem`.
3. change the functions `IOError` and `IOStatus` in `env/io_posix.h` to have external linkage since they are used in other translation units too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6457

Test Plan:
1. A new unit test is added for `LogicalBlockSizeCache` in `env/io_posix_test.cc`.
2. A new integration test is added for `DB` operations related to the cache in `db/db_logical_block_size_cache_test.cc`.

`make check`

Differential Revision: D20131243

Pulled By: cheng-chang

fbshipit-source-id: 3077c50f8065c0bffb544d8f49fb10bba9408d04
2020-03-11 18:40:05 -07:00
sdong
331e6199df Include more information in file lock failure (#6507)
Summary:
When users fail to open a DB with file lock failure, it is sometimes hard for users to debug. We now include the time the lock is acquired and the thread ID that acquired the lock, to help users debug problems like this. Default Env's thread ID is used.

Since type of lockedFiles is changed, rename it to follow naming convention too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6507

Test Plan: Add a unit test and improve an existing test to validate the case.

Differential Revision: D20378333

fbshipit-source-id: 312fe0e9733fd1d1e9969c321b90ce523cf4708a
2020-03-11 16:23:08 -07:00
Levi Tamasi
37a635cfe6 Disambiguate CustomFieldTags for the unity build (#6513)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6513

Test Plan: `make unity_test`

Differential Revision: D20388919

Pulled By: ltamasi

fbshipit-source-id: 88dbceab0723a54ee3939e1644e13dc9a4c70420
2020-03-11 14:45:12 -07:00
Adam Retter
8fc20ac468 Add ppc64le builds to Travis (#6144)
Summary:
Let's see how this goes...
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6144

Differential Revision: D20387515

Pulled By: pdillinger

fbshipit-source-id: ba2669348c267141dfddff910b4c2224a22cbb38
2020-03-11 12:33:45 -07:00
Levi Tamasi
f5bc3b99d5 Split BlobFileState into an immutable and a mutable part (#6502)
Summary:
It's never too soon to refactor something. The patch splits the recently
introduced (`VersionEdit` related) `BlobFileState` into two classes
`BlobFileAddition` and `BlobFileGarbage`. The idea is that once blob files
are closed, they are immutable, and the only thing that changes is the
amount of garbage in them. In the new design, `BlobFileAddition` contains
the immutable attributes (currently, the count and total size of all blobs, checksum
method, and checksum value), while `BlobFileGarbage` contains the mutable
GC-related information elements (count and total size of garbage blobs). This is a
better fit for the GC logic and is more consistent with how SST files are handled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6502

Test Plan: `make check`

Differential Revision: D20348352

Pulled By: ltamasi

fbshipit-source-id: ff93f0121e80ab15e0e0a6525ba0d6af16a0e008
2020-03-10 17:27:26 -07:00
Yanqin Jin
fd1da22111 Support options.max_open_files != -1 with FIFO compaction (#6503)
Summary:
Allow user to specify options.max_open_files != -1 with FIFO compaction.
If max_open_files != -1, not all table files are kept open.
In the past, FIFO style compaction requires all table files to be open in order
to read file creation time from table properties. Later, we added file creation
time to MANIFEST, making it possible to read file creation time without opening
file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6503

Test Plan: make check

Differential Revision: D20353758

Pulled By: riversand963

fbshipit-source-id: ba5c61a648419e47e9ef6d74e0e280e3ee24f296
2020-03-09 18:45:06 -07:00
Yanqin Jin
d93812c9ae Iterator with timestamp (#6255)
Summary:
Preliminary support for iterator with user timestamp. Current implementation does not consider merge operator and reverse iterator. Auto compaction is also disabled in unit tests.

Create an iterator with timestamp.
```
...
read_opts.timestamp = &ts;
auto* iter = db->NewIterator(read_opts);
// target is key without timestamp.
for (iter->Seek(target); iter->Valid(); iter->Next()) {}
for (iter->SeekToFirst(); iter->Valid(); iter->Next()) {}
delete iter;
read_opts.timestamp = &ts1;
// lower_bound and upper_bound are without timestamp.
read_opts.iterate_lower_bound = &lower_bound;
read_opts.iterate_upper_bound = &upper_bound;
auto* iter1 = db->NewIterator(read_opts);
// Do Seek or SeekToFirst()
delete iter1;
```

Test plan (dev server)
```
$make check
```

Simple benchmarking (dev server)
1. The overhead introduced by this PR even when timestamp is disabled.
key size: 16 bytes
value size: 100 bytes
Entries: 1000000
Data reside in main memory, and try to stress iterator.
Repeated three times on master and this PR.
- Seek without next
```
./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3
```
master: 159047.0 ops/sec
this PR: 158922.3 ops/sec (2% drop in throughput)
- Seek and next 10 times
```
./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3 -seek_nexts=10
```
master: 109539.3 ops/sec
this PR: 107519.7 ops/sec (2% drop in throughput)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6255

Differential Revision: D19438227

Pulled By: riversand963

fbshipit-source-id: b66b4979486f8474619f4aa6bdd88598870b0746
2020-03-06 16:24:27 -08:00
Yuqi Gu
e171a219d5 Fix db_wal_test::TruncateLastLogAfterRecoverWithoutFlush failure (#6437)
Summary:
`TruncateLastLogAfterRecoverWithoutFlush` case depends on fallocate support
of underlying file system.

On a file system which lacks of this feature, like zfs, it will fail to allocate predefined file size as this test case intends to do;

So a check block is added to detect fallocate support and skip test if not.
The related work is done by JunHe77. Thanks!

Signed-off-by: Yuqi Gu <yuqi.gu@arm.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6437

Differential Revision: D20145032

Pulled By: pdillinger

fbshipit-source-id: c8b691dc508e95acfa2a004ddbc07e2faa76680d
2020-03-05 17:18:16 -08:00
Cheng Chang
afb97094ae Skip high levels with no key falling in the range in CompactRange (#6482)
Summary:
In CompactRange, if there is no key in memtable falling in the specified range, then flush is skipped.
This PR extends this skipping logic to SST file levels: it starts compaction from the highest level (starting from L0) that has files with key falling in the specified range, instead of always starts from L0.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6482

Test Plan:
A new test ManualCompactionTest::SkipLevel is added.

Also updated a test related to statistics of index block cache hit in db_test2, the index cache hit is increased by 1 in this PR because when checking overlap for the key range in L0, OverlapWithLevelIterator will do a seek in the table cache iterator, which will read from the cached index.

Also updated db_compaction_test and db_test to use correct range for full compaction.

Differential Revision: D20251149

Pulled By: cheng-chang

fbshipit-source-id: f822157cf4796972bd5035d9d7178d8dfb7af08b
2020-03-04 20:15:25 -08:00
Zhichao Cao
e62fe50634 Introduce FaultInjectionTestFS to test fault File system instead of Env (#6414)
Summary:
In the current code base, we can use FaultInjectionTestEnv to simulate the env issue such as file write/read errors, which are used in most of the test. The PR https://github.com/facebook/rocksdb/issues/5761 introduce the File System as a new Env API. This PR implement the FaultInjectionTestFS, which can be used to simulate when File System has issues such as IO error. user can specify any IOStatus error as input, such that FS corresponding actions will return certain error to the caller.

A set of ErrorHandlerFSTests are introduced for testing
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6414

Test Plan: pass make asan_check, pass error_handler_fs_test.

Differential Revision: D20252421

Pulled By: zhichao-cao

fbshipit-source-id: e922038f8ce7e6d1da329fd0bba7283c4b779a21
2020-03-04 12:35:05 -08:00