Commit Graph

3717 Commits

Author SHA1 Message Date
anand76
4f7ba3aaed Fix tsan and valgrind failures in import_column_family_test
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5598

Test Plan:
tsan_check
valgrind_test

Differential Revision: D16380167

Pulled By: anand1976

fbshipit-source-id: 2d0caea7d2d02a9606457f62811175d762b89d5c
2019-07-19 13:25:36 -07:00
Eli Pozniansky
c129c75fb7 Added log_readahead_size option to control prefetching for Log::Reader (#5592)
Summary:
Added log_readahead_size option to control prefetching for Log::Reader.
This is mostly useful for reading a remotely located log, as it can save the number of round-trips when reading it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5592

Differential Revision: D16362989

Pulled By: elipoz

fbshipit-source-id: c5d4d5245a44008cd59879640efff70c091ad3e8
2019-07-19 12:00:19 -07:00
anand76
abd1fdddef Fix asan_check failures
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5589

Test Plan: TEST_TMPDIR=/dev/shm/rocksdb COMPILE_WITH_ASAN=1 OPT=-g make J=64 -j64 asan_check

Differential Revision: D16361081

Pulled By: anand1976

fbshipit-source-id: 09474832b9cfb318a840d4b633e22dfad105d58c
2019-07-18 14:51:25 -07:00
anand76
ec2b996b29 Fix LITE mode build failure
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5588

Test Plan: make LITE=1 all check

Differential Revision: D16354543

Pulled By: anand1976

fbshipit-source-id: 327a171439e183ac3a5e5057c511d6bca445e97d
2019-07-17 22:06:12 -07:00
Venki Pallipadi
22ce462450 Export Import sst files (#5495)
Summary:
Refresh of the earlier change here - https://github.com/facebook/rocksdb/issues/5135

This is a review request for code change needed for - https://github.com/facebook/rocksdb/issues/3469
"Add support for taking snapshot of a column family and creating column family from a given CF snapshot"

We have an implementation for this that we have been testing internally. We have two new APIs that together provide this functionality.

(1) ExportColumnFamily() - This API is modelled after CreateCheckpoint() as below.
// Exports all live SST files of a specified Column Family onto export_dir,
// returning SST files information in metadata.
// - SST files will be created as hard links when the directory specified
//   is in the same partition as the db directory, copied otherwise.
// - export_dir should not already exist and will be created by this API.
// - Always triggers a flush.
virtual Status ExportColumnFamily(ColumnFamilyHandle* handle,
                                  const std::string& export_dir,
                                  ExportImportFilesMetaData** metadata);

Internally, the API will DisableFileDeletions(), GetColumnFamilyMetaData(), Parse through
metadata, creating links/copies of all the sst files, EnableFileDeletions() and complete the call by
returning the list of file metadata.

(2) CreateColumnFamilyWithImport() - This API is modeled after IngestExternalFile(), but invoked only during a CF creation as below.
// CreateColumnFamilyWithImport() will create a new column family with
// column_family_name and import external SST files specified in metadata into
// this column family.
// (1) External SST files can be created using SstFileWriter.
// (2) External SST files can be exported from a particular column family in
//     an existing DB.
// Option in import_options specifies whether the external files are copied or
// moved (default is copy). When option specifies copy, managing files at
// external_file_path is caller's responsibility. When option specifies a
// move, the call ensures that the specified files at external_file_path are
// deleted on successful return and files are not modified on any error
// return.
// On error return, column family handle returned will be nullptr.
// ColumnFamily will be present on successful return and will not be present
// on error return. ColumnFamily may be present on any crash during this call.
virtual Status CreateColumnFamilyWithImport(
    const ColumnFamilyOptions& options, const std::string& column_family_name,
    const ImportColumnFamilyOptions& import_options,
    const ExportImportFilesMetaData& metadata,
    ColumnFamilyHandle** handle);

Internally, this API creates a new CF, parses all the sst files and adds it to the specified column family, at the same level and with same sequence number as in the metadata. Also performs safety checks with respect to overlaps between the sst files being imported.

If incoming sequence number is higher than current local sequence number, local sequence
number is updated to reflect this.

Note, as the sst files is are being moved across Column Families, Column Family name in sst file
will no longer match the actual column family on destination DB. The API does not modify Column
Family name or id in the sst files being imported.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5495

Differential Revision: D16018881

fbshipit-source-id: 9ae2251025d5916d35a9fc4ea4d6707f6be16ff9
2019-07-17 12:27:14 -07:00
sdong
699a569c52 Remove RandomAccessFileReader.for_compaction_ (#5572)
Summary:
RandomAccessFileReader.for_compaction_ doesn't seem to be used anymore. Remove it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5572

Test Plan: USE_CLANG=1 make all check -j

Differential Revision: D16286178

fbshipit-source-id: aa338049761033dfbe5e8b1707bbb0be2df5be7e
2019-07-16 16:32:18 -07:00
Levi Tamasi
3bde41b5a3 Move the filter readers out of the block cache (#5504)
Summary:
Currently, when the block cache is used for the filter block, it is not
really the block itself that is stored in the cache but a FilterBlockReader
object. Since this object is not pure data (it has, for instance, pointers that
might dangle, including in one case a back pointer to the TableReader), it's not
really sharable. To avoid the issues around this, the current code erases the
cache entries when the TableReader is closed (which, BTW, is not sufficient
since a concurrent TableReader might have picked up the object in the meantime).
Instead of doing this, the patch moves the FilterBlockReader out of the cache
altogether, and decouples the filter reader object from the filter block.
In particular, instead of the TableReader owning, or caching/pinning the
FilterBlockReader (based on the customer's settings), with the change the
TableReader unconditionally owns the FilterBlockReader, which in turn
owns/caches/pins the filter block. This change also enables us to reuse the code
paths historically used for data blocks for filters as well.

Note:
Eviction statistics for filter blocks are temporarily broken. We plan to fix this in a
separate phase.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5504

Test Plan: make asan_check

Differential Revision: D16036974

Pulled By: ltamasi

fbshipit-source-id: 770f543c5fb4ed126fd1e04bfd3809cf4ff9c091
2019-07-16 13:14:58 -07:00
Jim Lin
cd2520361d Fix memorty leak in rocksdb_wal_iter_get_batch function (#5515)
Summary:
`wal_batch.writeBatchPtr.release()` gives up the ownership of the original `WriteBatch`, but there is no new owner, which causes memory leak.

The patch is simple. Removing `release()` prevent ownership change. `std::move` is for speed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5515

Differential Revision: D16264281

Pulled By: riversand963

fbshipit-source-id: 51c556b7a1c977325c3aa24acb636303847151fa
2019-07-15 12:59:39 -07:00
Zhongyi Xie
b0259e45e0 add more tracing for stats history (#5566)
Summary:
Sample info log output from db_bench:
In-memory:
```
2019/07/12-21:39:19.478490 7fa01b3f5700 [_impl/db_impl.cc:702] ------- PERSISTING STATS -------
2019/07/12-21:39:19.478633 7fa01b3f5700 [_impl/db_impl.cc:753] Storing 145 stats with timestamp 1562992759 to in-memory stats history
2019/07/12-21:39:19.478670 7fa01b3f5700 [_impl/db_impl.cc:766] [Pre-GC] In-memory stats history size: 1051218 bytes, slice count: 103
2019/07/12-21:39:19.478704 7fa01b3f5700 [_impl/db_impl.cc:775] [Post-GC] In-memory stats history size: 1051218 bytes, slice count: 102
```
On-disk:
```
2019/07/12-21:48:53.862548 7f24943f5700 [_impl/db_impl.cc:702] ------- PERSISTING STATS -------
2019/07/12-21:48:53.862553 7f24943f5700 [_impl/db_impl.cc:709] Reading 145 stats from statistics
2019/07/12-21:48:53.862852 7f24943f5700 [_impl/db_impl.cc:737] Writing 145 stats with timestamp 1562993333 to persistent stats CF succeeded
```
```
2019/07/12-21:48:51.861711 7f24943f5700 [_impl/db_impl.cc:702] ------- PERSISTING STATS -------
2019/07/12-21:48:51.861729 7f24943f5700 [_impl/db_impl.cc:709] Reading 145 stats from statistics
2019/07/12-21:48:51.861921 7f24943f5700 [_impl/db_impl.cc:732] Writing to persistent stats CF failed -- Result incomplete: Write stall
...
2019/07/12-21:48:51.873032 7f2494bf6700 [WARN] [lumn_family.cc:749] [default] Stopping writes because we have 2 immutable memtables (waiting for flush), max_write_buffer_number is set to 2
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5566

Differential Revision: D16258187

Pulled By: miasantreble

fbshipit-source-id: 292497099b941418590ed4312411bee36e244dc5
2019-07-15 11:49:17 -07:00
Sergei Petrunia
61876614dc Fix MyRocks compile warnings-treated-as-errors on Fedora 30, gcc 9.1.1 (#5553)
Summary:
- Provide assignment operator in CompactionStats
- Provide a copy constructor for FileDescriptor
- Remove std::move from "return std::move(t)" in BoundedQueue
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5553

Differential Revision: D16230170

fbshipit-source-id: fd7c6e52390b2db1be24141e25649cf62424d078
2019-07-12 17:30:51 -07:00
sdong
cb19e7411f Fix bugs in DBWALTest.kTolerateCorruptedTailRecords triggered by #5520 (#5550)
Summary:
https://github.com/facebook/rocksdb/pull/5520 caused a buffer overflow bug in DBWALTest.kTolerateCorruptedTailRecords. Fix it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5550

Test Plan: Run the test in UBSAN. It used to fail. Not it succeeds.

Differential Revision: D16165516

fbshipit-source-id: 42c56a6bc64eb091f054b87757fcbef60da825f7
2019-07-09 11:18:32 -07:00
Yanqin Jin
7c76a7fba2 Support GetAllKeyVersions() for non-default cf (#5544)
Summary:
Previously `GetAllKeyVersions()` supports default column family only. This PR add support for other column families.

Test plan (devserver):
```
$make clean && COMPILE_WITH_ASAN=1 make -j32 db_basic_test
$./db_basic_test --gtest_filter=DBBasicTest.GetAllKeyVersions
```
All other unit tests must pass.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5544

Differential Revision: D16147551

Pulled By: riversand963

fbshipit-source-id: 5a61aece2a32d789e150226a9b8d53f4a5760168
2019-07-07 22:43:52 -07:00
Zhongyi Xie
8d34806972 setup wal_in_db_path_ for secondary instance (#5545)
Summary:
PR https://github.com/facebook/rocksdb/pull/5520 adds DBImpl:: wal_in_db_path_ and initializes it in DBImpl::Open, this PR fixes the valgrind error for secondary instance:
```
==236417== Conditional jump or move depends on uninitialised value(s)
==236417==    at 0x62242A: rocksdb::DeleteDBFile(rocksdb::ImmutableDBOptions const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, bool) (file_util.cc:96)
==236417==    by 0x512432: rocksdb::DBImpl::DeleteObsoleteFileImpl(int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::FileType, unsigned long) (db_impl_files.cc:261)
==236417==    by 0x515A7A: rocksdb::DBImpl::PurgeObsoleteFiles(rocksdb::JobContext&, bool) (db_impl_files.cc:492)
==236417==    by 0x499153: rocksdb::ColumnFamilyHandleImpl::~ColumnFamilyHandleImpl() (column_family.cc:75)
==236417==    by 0x499880: rocksdb::ColumnFamilyHandleImpl::~ColumnFamilyHandleImpl() (column_family.cc:84)
==236417==    by 0x4C9AF9: rocksdb::DB::DestroyColumnFamilyHandle(rocksdb::ColumnFamilyHandle*) (db_impl.cc:3105)
==236417==    by 0x44E853: CloseSecondary (db_secondary_test.cc:53)
==236417==    by 0x44E853: rocksdb::DBSecondaryTest::~DBSecondaryTest() (db_secondary_test.cc:31)
==236417==    by 0x44EC77: ~DBSecondaryTest_PrimaryDropColumnFamily_Test (db_secondary_test.cc:443)
==236417==    by 0x44EC77: rocksdb::DBSecondaryTest_PrimaryDropColumnFamily_Test::~DBSecondaryTest_PrimaryDropColumnFamily_Test() (db_secondary_test.cc:443)
==236417==    by 0x83D1D7: HandleSehExceptionsInMethodIfSupported<testing::Test, void> (gtest-all.cc:3824)
==236417==    by 0x83D1D7: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest-all.cc:3860)
==236417==    by 0x8346DB: testing::TestInfo::Run() [clone .part.486] (gtest-all.cc:4078)
==236417==    by 0x8348D4: Run (gtest-all.cc:4047)
==236417==    by 0x8348D4: testing::TestCase::Run() [clone .part.487] (gtest-all.cc:4190)
==236417==    by 0x834D14: Run (gtest-all.cc:6100)
==236417==    by 0x834D14: testing::internal::UnitTestImpl::RunAllTests() (gtest-all.cc:6062)
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5545

Differential Revision: D16146224

Pulled By: miasantreble

fbshipit-source-id: 184c90e451352951da4e955f054d4b1a1f29ea29
2019-07-07 21:32:50 -07:00
anand76
e0d9d57750 Fix bugs in WAL trash file handling (#5520)
Summary:
1. Cleanup WAL trash files on open
2. Don't apply deletion rate limit if WAL dir is different from db dir
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5520

Test Plan: Add new unit tests and make check

Differential Revision: D16096750

Pulled By: anand1976

fbshipit-source-id: 6f07858ad864b754b711db416f0389c45ede599b
2019-07-06 21:07:32 -07:00
Yi Wu
4f66ec977d Fix lower bound check error when iterate across file boundary (#5540)
Summary:
Since https://github.com/facebook/rocksdb/issues/5468 `LevelIterator` compare lower bound and file smallest key on `NewFileIterator` and cache the result to reduce per key lower bound check. However when iterate across file boundary, it doesn't update the cached result since `Valid()=false` because `Valid()` still reflect the status of the previous file iterator. Fixing it by remove the `Valid()` check from `CheckMayBeOutOfLowerBound()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5540

Test Plan:
See the new test.

Signed-off-by: Yi Wu <yiwu@pingcap.com>

Differential Revision: D16127653

fbshipit-source-id: a0691e1164658d485c17971aaa97028812f74678
2019-07-04 17:28:30 -07:00
haoyuhuang
6edc5d0719 Block cache tracing: Associate a unique id with Get and MultiGet (#5514)
Summary:
This PR associates a unique id with Get and MultiGet. This enables us to track how many blocks a Get/MultiGet request accesses. We can also measure the impact of row cache vs block cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5514

Test Plan: make clean && COMPILE_WITH_ASAN=1 make check -j32

Differential Revision: D16032681

Pulled By: HaoyuHuang

fbshipit-source-id: 775b05f4440badd58de6667e3ec9f4fc87a0af4c
2019-07-03 19:35:41 -07:00
Andrew Kryczka
0d57d93a06 Support jemalloc compiled with --with-jemalloc-prefix (#5521)
Summary:
Previously, if the jemalloc was built with nonempty string for
`--with-jemalloc-prefix`, then `HasJemalloc()` would return false on
Linux, so jemalloc would not be used at runtime. On Mac, it would cause
a linker failure due to no definitions found for the weak functions
declared in "port/jemalloc_helper.h". This should be a rare problem
because (1) on Linux the default `--with-jemalloc-prefix` value is the
empty string, and (2) Homebrew's build explicitly sets
`--with-jemalloc-prefix` to the empty string.

However, there are cases where `--with-jemalloc-prefix` is nonempty.
For example, when building jemalloc from source on Mac, the default
setting is `--with-jemalloc-prefix=je_`. Such jemalloc builds should be
usable by RocksDB.

The fix is simple. Defining `JEMALLOC_MANGLE` before including
"jemalloc.h" causes it to define unprefixed symbols that are aliases for
each of the prefixed symbols. Thanks to benesch for figuring this out
and explaining it to me.

Fixes https://github.com/facebook/rocksdb/issues/1462.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5521

Test Plan:
build jemalloc with prefixed symbols:

```
$ ./configure --with-jemalloc-prefix=lol
$ make
```

compile rocksdb against it:

```
$ WITH_JEMALLOC_FLAG=1 JEMALLOC=1 EXTRA_LDFLAGS="-L/home/andrew/jemalloc/lib/" EXTRA_CXXFLAGS="-I/home/andrew/jemalloc/include/" make -j12 ./db_bench
```

run db_bench and verify jemalloc actually used:

```
$ ./db_bench -benchmarks=fillrandom -statistics=true -dump_malloc_stats=true -stats_dump_period_sec=1
$ grep jemalloc /tmp/rocksdbtest-1000/dbbench/LOG
2019/06/29-12:20:52.088658 7fc5fb7f6700 [_impl/db_impl.cc:837] ___ Begin jemalloc statistics ___
...
```

Differential Revision: D16092758

fbshipit-source-id: c2c358346190ed62ceb2a3547a6c4c180b12f7c4
2019-07-02 12:07:01 -07:00
Yi Wu
662ce62044 Reduce iterator key comparison for upper/lower bound check (2nd attempt) (#5468)
Summary:
This is a second attempt for https://github.com/facebook/rocksdb/issues/5111, with the fix to redo iterate bounds check after `SeekXXX()`. This is because MyRocks may change iterate bounds between seek.

See https://github.com/facebook/rocksdb/issues/5111 for original benchmark result and discussion.

Closes https://github.com/facebook/rocksdb/issues/5463.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5468

Test Plan: Existing rocksdb tests, plus myrocks test `rocksdb.optimizer_loose_index_scans` and `rocksdb.group_min_max`.

Differential Revision: D15863332

fbshipit-source-id: ab4aba5899838591806b8673899bd465f3f53e18
2019-07-02 11:48:46 -07:00
haoyuhuang
66464d1fde Remove multiple declarations o kMicrosInSecond.
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5526

Test Plan:
OPT=-g V=1 make J=1 unity_test -j32
make clean && make -j32

Differential Revision: D16079315

Pulled By: HaoyuHuang

fbshipit-source-id: 294ab439cf0db8dd5da44e30eabf0cbb2bb8c4f6
2019-07-01 15:15:12 -07:00
Yanqin Jin
1e87f2b68b Ref and unref cfd before and after calling WaitForFlushMemTables (#5513)
Summary:
This is to prevent bg flush thread from unrefing and deleting the cfd that has been dropped by a concurrent thread.
Before RocksDB calls `DBImpl::WaitForFlushMemTables`, we should increase the refcount of each `ColumnFamilyData` so that its ref count will not drop to 0 even if the column family is dropped by another thread. Otherwise the bg flush thread can deref the cfd and deletes it, causing a segfault in `WaitForFlushMemtables` upon accessing `cfd`.

Test plan (on devserver):
```
$make clean && COMPILE_WITH_ASAN=1 make -j32
$make check
```
All unit tests must pass.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5513

Differential Revision: D16062898

Pulled By: riversand963

fbshipit-source-id: 37dc511f1dc99f036d0201bbd7f0a8f5677c763d
2019-07-01 14:12:02 -07:00
Zhongyi Xie
3886dddc3b force flushing stats CF to avoid holding old logs (#5509)
Summary:
WAL records RocksDB writes to all column families. When user flushes a a column family, the old WAL will not accept new writes but cannot be deleted yet because it may still contain live data for other column families. (See https://github.com/facebook/rocksdb/wiki/Write-Ahead-Log#life-cycle-of-a-wal for detailed explanation)
Because of this, if there is a column family that receive very infrequent writes and no manual flush is called for it, it could prevent a lot of WALs from being deleted. PR https://github.com/facebook/rocksdb/pull/5046 introduced persistent stats column family which is a good example of such column families. Depending on the config, it may have long intervals between writes, and user is unaware of it which makes it difficult to call manual flush for it.
This PR addresses the problem for persistent stats column family by forcing a flush for persistent stats column family when 1) another column family is flushed 2) persistent stats column family's log number is the smallest among all column families, this way persistent stats column family will  keep advancing its log number when necessary, allowing RocksDB to delete old WAL files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5509

Differential Revision: D16045896

Pulled By: miasantreble

fbshipit-source-id: 286837b633e988417f0096ff38384742d3b40ef4
2019-07-01 11:56:43 -07:00
anand76
7259e28d91 MultiGet parallel IO (#5464)
Summary:
Enhancement to MultiGet batching to read data blocks required for keys in a batch in parallel from disk. It uses Env::MultiRead() API to read multiple blocks and reduce latency.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5464

Test Plan:
1. make check
2. make asan_check
3. make asan_crash

Differential Revision: D15911771

Pulled By: anand1976

fbshipit-source-id: 605036b9af0f90ca0020dc87c3a86b4da6e83394
2019-06-30 20:56:04 -07:00
Yanqin Jin
c08c0ae731 Add C binding for secondary instance (#5505)
Summary:
Add C binding for secondary instance as well as unit test.

Test plan (on devserver)
```
$make clean && COMPILE_WITH_ASAN=1 make -j20 all
$./c_test
$make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5505

Differential Revision: D16000043

Pulled By: riversand963

fbshipit-source-id: 3361ef6bfdf4ce12438cee7290a0ac203b5250bd
2019-06-27 08:58:54 -07:00
Mike Kolupaev
b4d7209428 Add an option to put first key of each sst block in the index (#5289)
Summary:
The first key is used to defer reading the data block until this file gets to the top of merging iterator's heap. For short range scans, most files never make it to the top of the heap, so this change can reduce read amplification by a lot sometimes.

Consider the following workload. There are a few data streams (we'll be calling them "logs"), each stream consisting of a sequence of blobs (we'll be calling them "records"). Each record is identified by log ID and a sequence number within the log. RocksDB key is concatenation of log ID and sequence number (big endian). Reads are mostly relatively short range scans, each within a single log. Writes are mostly sequential for each log, but writes to different logs are randomly interleaved. Compactions are disabled; instead, when we accumulate a few tens of sst files, we create a new column family and start writing to it.

So, a typical sst file consists of a few ranges of blocks, each range corresponding to one log ID (we use FlushBlockPolicy to cut blocks at log boundaries). A typical read would go like this. First, iterator Seek() reads one block from each sst file. Then a series of Next()s move through one sst file (since writes to each log are mostly sequential) until the subiterator reaches the end of this log in this sst file; then Next() switches to the next sst file and reads sequentially from that, and so on. Often a range scan will only return records from a small number of blocks in small number of sst files; in this case, the cost of initial Seek() reading one block from each file may be bigger than the cost of reading the actually useful blocks.

Neither iterate_upper_bound nor bloom filters can prevent reading one block from each file in Seek(). But this PR can: if the index contains first key from each block, we don't have to read the block until this block actually makes it to the top of merging iterator's heap, so for short range scans we won't read any blocks from most of the sst files.

This PR does the deferred block loading inside value() call. This is not ideal: there's no good way to report an IO error from inside value(). As discussed with siying offline, it would probably be better to change InternalIterator's interface to explicitly fetch deferred value and get status. I'll do it in a separate PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5289

Differential Revision: D15256423

Pulled By: al13n321

fbshipit-source-id: 750e4c39ce88e8d41662f701cf6275d9388ba46a
2019-06-24 20:54:04 -07:00
Yi Wu
2730fe693e Fix ingested file and direcotry not being sync (#5435)
Summary:
It it not safe to assume application had sync the SST file before ingest it into DB. Also the directory to put the ingested file needs to be fsync, otherwise the file can be lost. For integrity of RocksDB we need to sync the ingested file and directory before apply the change to manifest.

Also syncing after writing global sequence when write_global_seqno=true was removed in https://github.com/facebook/rocksdb/issues/4172. Adding it back.

Fixes https://github.com/facebook/rocksdb/issues/5287.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5435

Test Plan:
Test ingest file with ldb command and observe fsync/fdatasync in strace output. Tried both move_files=true and move_files=false.
https://gist.github.com/yiwu-arbug/650a4023f57979056d83485fa863bef9

More test suggestions are welcome.

Differential Revision: D15941675

Pulled By: riversand963

fbshipit-source-id: 389533f3923065a96df2cdde23ff4724a1810d78
2019-06-21 10:15:38 -07:00
haoyuhuang
705b8eecb4 Add more callers for table reader. (#5454)
Summary:
This PR adds more callers for table readers. These information are only used for block cache analysis so that we can know which caller accesses a block.
1. It renames the BlockCacheLookupCaller to TableReaderCaller as passing the caller from upstream requires changes to table_reader.h and TableReaderCaller is a more appropriate name.
2. It adds more table reader callers in table/table_reader_caller.h, e.g., kCompactionRefill, kExternalSSTIngestion, and kBuildTable.

This PR is long as it requires modification of interfaces in table_reader.h, e.g., NewIterator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5454

Test Plan: make clean && COMPILE_WITH_ASAN=1 make check -j32.

Differential Revision: D15819451

Pulled By: HaoyuHuang

fbshipit-source-id: b6caa704c8fb96ddd15b9a934b7e7ea87f88092d
2019-06-20 14:31:48 -07:00
Zhongyi Xie
24f73436fb sanitize and limit block_size under 4GB (#5492)
Summary:
`Block::restart_index_`, `Block::restarts_`, and `Block::current_` are defined as uint32_t but  `BlockBasedTableOptions::block_size` is defined as a size_t so user might see corruption as in https://github.com/facebook/rocksdb/issues/5486.
This PR adds a check in `BlockBasedTableFactory::SanitizeOptions` to disallow such configurations.
yiwu-arbug
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5492

Differential Revision: D15914047

Pulled By: miasantreble

fbshipit-source-id: c943f153d967e15aee7f2795730ab8259e2be201
2019-06-20 11:45:08 -07:00
Vijay Nadimpalli
24b118ad98 Combine the read-ahead logic for user reads and compaction reads (#5431)
Summary:
Currently the read-ahead logic for user reads and compaction reads go through different code paths where compaction reads create new table readers and use `ReadaheadRandomAccessFile`. This change is to unify read-ahead logic to use read-ahead in BlockBasedTableReader::InitDataBlock(). As a result of the change  `ReadAheadRandomAccessFile` class and `new_table_reader_for_compaction_inputs` option will no longer be used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5431

Test Plan:
make check

Here is the benchmarking - https://gist.github.com/vjnadimpalli/083cf423f7b6aa12dcdb14c858bc18a5

Differential Revision: D15772533

Pulled By: vjnadimpalli

fbshipit-source-id: b71dca710590471ede6fb37553388654e2e479b9
2019-06-19 14:10:46 -07:00
Simon Grätzer
fe90ed7a70 Replace Corruption with TryAgain status when new tail is not visible to TransactionLogIterator (#5474)
Summary:
When tailing the WAL with TransactionLogIterator, it used to return Corruption status to indicate that the WAL has new tail that is not visible to the iterator, which is a misleading status. The patch replaces it with TryAgain which is more descriptive of a status, indicating that the user needs to create a new iterator to fetch the recent tail.
Fixes https://github.com/facebook/rocksdb/issues/5455
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5474

Differential Revision: D15898953

Pulled By: maysamyabandeh

fbshipit-source-id: 40966f6457cb539e1aeb104daeada6b0e46059fc
2019-06-19 08:10:08 -07:00
Yanqin Jin
f287f8dc93 Fix a bug caused by secondary not skipping the beginning of new MANIFEST (#5472)
Summary:
While the secondary is replaying after the primary, the primary may switch to a new MANIFEST. The secondary is already able to detect and follow the primary to the new MANIFEST. However, the current implementation has a bug, described as follows.
The new MANIFEST's first records have been generated by VersionSet::WriteSnapshot to describe the current state of the column families and the db as of the MANIFEST creation. Since the secondary instance has already finished recovering upon start, there is no need for the secondary to process these records. Actually, if the secondary were to replay these records, the secondary may end up adding the same SST files **again** to each column family, causing consistency checks done by VersionBuilder to fail. Therefore, we record the number of records to skip at the beginning of the new MANIFEST and ignore them.

Test plan (on dev server)
```
$make clean && make -j32 all
$./db_secondary_test
```
All existing unit tests must pass as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5472

Differential Revision: D15866771

Pulled By: riversand963

fbshipit-source-id: a1eec4837fb2ad13059398efb0f437e74fd53bed
2019-06-18 11:21:37 -07:00
Zhongyi Xie
ddd088c8b9 fix rocksdb lite and clang contrun test failures (#5477)
Summary:
recent commit 671d15cbdd introduced some test failures:
```
===== Running stats_history_test
[==========] Running 9 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 9 tests from StatsHistoryTest
[ RUN      ] StatsHistoryTest.RunStatsDumpPeriodSec
monitoring/stats_history_test.cc:63: Failure
dbfull()->SetDBOptions({{"stats_dump_period_sec", "0"}})
Not implemented: Not supported in ROCKSDB LITE

db/db_options_test.cc:28:11: error: unused variable 'kMicrosInSec' [-Werror,-Wunused-const-variable]
const int kMicrosInSec = 1000000;
```
This PR fixes these failures
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5477

Differential Revision: D15871814

Pulled By: miasantreble

fbshipit-source-id: 0a7023914d2c1784d9d2d3f5bfb47310d4855394
2019-06-17 21:16:29 -07:00
Yanqin Jin
7d8d56413d Override check consistency for DBImplSecondary (#5469)
Summary:
`DBImplSecondary` calls `CheckConsistency()` during open. In the past, `DBImplSecondary` did not override this function thus `DBImpl::CheckConsistency()` is called.
The following can happen. The secondary instance is performing consistency check which calls `GetFileSize(file_path)` but the file at `file_path` is deleted by the primary instance. `DBImpl::CheckConsistency` does not account for this and fails the consistency check. This is undesirable. The solution is that, we call `DBImpl::CheckConsistency()` first. If it passes, then we are good. If not, we give it a second chance and handles the case of file(s) being deleted.

Test plan (on dev server):
```
$make clean && make -j20 all
$./db_secondary_test
```
All other existing unit tests must pass as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5469

Differential Revision: D15861845

Pulled By: riversand963

fbshipit-source-id: 507d72392508caed3cd003bb2e2aa43f993dd597
2019-06-17 15:39:55 -07:00
Zhongyi Xie
671d15cbdd Persistent Stats: persist stats history to disk (#5046)
Summary:
This PR continues the work in https://github.com/facebook/rocksdb/pull/4748 and https://github.com/facebook/rocksdb/pull/4535 by adding a new DBOption `persist_stats_to_disk` which instructs RocksDB to persist stats history to RocksDB itself. When statistics is enabled, and  both options `stats_persist_period_sec` and `persist_stats_to_disk` are set, RocksDB will periodically write stats to a built-in column family in the following form: key -> (timestamp in microseconds)#(stats name), value -> stats value. The existing API `GetStatsHistory` will detect the current value of `persist_stats_to_disk` and either read from in-memory data structure or from the hidden column family on disk.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5046

Differential Revision: D15863138

Pulled By: miasantreble

fbshipit-source-id: bb82abdb3f2ca581aa42531734ac799f113e931b
2019-06-17 15:21:50 -07:00
Sagar Vemuri
f1219644ec Validate CF Options when creating a new column family (#5453)
Summary:
It seems like CF Options are not properly validated  when creating a new column family with `CreateColumnFamily` API; only a selected few checks are done. Calling `ColumnFamilyData::ValidateOptions`, which is the single source for all CFOptions validations,  will help fix this. (`ColumnFamilyData::ValidateOptions` is already called at the time of `DB::Open`).

**Test Plan:**
Added a new test: `DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions`
```
TEST_TMPDIR=/dev/shm ./db_test --gtest_filter=DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions
```
Also ran gtest-parallel to make sure the new test is not flaky.
```
TEST_TMPDIR=/dev/shm ~/gtest-parallel/gtest-parallel ./db_test --gtest_filter=DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions --repeat=10000
[10000/10000] DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions (15 ms)
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5453

Differential Revision: D15816851

Pulled By: sagar0

fbshipit-source-id: 9e702b9850f5c4a7e0ef8d39e1e6f9b81e7fe1e5
2019-06-14 14:11:10 -07:00
haoyuhuang
bb4178066d Integrate block cache tracer into db_impl (#5433)
Summary:
This PR integrates the block cache tracer class into db_impl.cc.
db_impl.cc contains a member variable of AtomicBlockCacheTraceWriter class and passes its reference to the block_based_table_reader.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5433

Differential Revision: D15728016

Pulled By: HaoyuHuang

fbshipit-source-id: 23d5659e8c82d556833dcc1a5558aac8c1f7db71
2019-06-13 15:43:10 -07:00
Levi Tamasi
a3b8c76d8e Add missing check before calling PurgeObsoleteFiles in EnableFileDeletions (#5448)
Summary:
Calling PurgeObsoleteFiles with a JobContext for which HaveSomethingToDelete
is false is a precondition violation. This would trigger an assertion in debug builds;
however, in release builds with assertions disabled, this can result in the
pending_purge_obsolete_files_ counter in DBImpl underflowing, which in turn can lead
to the process hanging during database close.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5448

Differential Revision: D15792569

Pulled By: ltamasi

fbshipit-source-id: 82d92c9b4f6a9efcdc69dbb3d5a52a1ae2dd2472
2019-06-13 14:43:13 -07:00
Maysam Yabandeh
f43edff9ac Disable kPipelinedWrite in MultiThreaded (#5442)
Summary:
TSAN tests report a race condition. We temporarily exclude kPipelinedWrite from MultiThreaded until the race condition is fixed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5442

Differential Revision: D15782349

Pulled By: maysamyabandeh

fbshipit-source-id: 42b4f9b3fa9137f0675e13ad132c0a06800c1bdd
2019-06-12 10:37:40 -07:00
Levi Tamasi
ba64a4cf52 Revert "Reduce iterator key comparison for upper/lower bound check (#5111)" (#5440)
Summary:
This reverts commit f3a7847598.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5440

Differential Revision: D15765967

Pulled By: ltamasi

fbshipit-source-id: d027fe24132e3729289cd7c01857a7eb449d9dd0
2019-06-11 16:23:41 -07:00
Yanqin Jin
7177dc46a1 Handle missing WAL in secondary mode (#5323)
Summary:
In secondary mode, it is possible that the secondary lists the primary's WAL
directory, finds a WAL and tries to open it. It is possible that the primary
deletes the WAL after secondary listing dir but before the secondary opening
it. Then the secondary will fail to open the WAL file with a PathNotFound
status. In this case, we can return OK without replaying WAL and optionally
replay more MANIFEST.

Test Plan (on my dev machine):
Without this PR, the following will fail several times out of 100 runs.
```
~/gtest-parallel/gtest-parallel -r 100 -w 16 ./db_secondary_test --gtest_filter=DBSecondaryTest.SwitchToNewManifestDuringOpen
```
With this PR, the above should always succeed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5323

Differential Revision: D15763878

Pulled By: riversand963

fbshipit-source-id: c7164fa7cb8d9001abc258b6a2dc93613e4f38ff
2019-06-11 13:08:28 -07:00
sdong
58c4aee42e TransactionUtil::CheckKey() to skip unnecessary history (#4941)
Summary:
If a memtable definitely covers a key, there isn't a need to check older memtables.
We can skip them by checking the earliest sequence number.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4941

Differential Revision: D13932666

fbshipit-source-id: b9d52f234b8ad9dd3bf6547645cd457175a3ca9b
2019-06-11 11:46:42 -07:00
Levi Tamasi
a94aef6596 Fix DBTest.DynamicMiscOptions so it passes even with Snappy disabled (#5438)
Summary:
This affects our "no compression" automated tests. Since PR #5368, DBTest.DynamicMiscOptions has been failing with:

db/db_test.cc:4889: Failure
dbfull()->SetOptions({{"compression", "kSnappyCompression"}})
Invalid argument: Compression type Snappy is not linked with the binary.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5438

Differential Revision: D15752100

Pulled By: ltamasi

fbshipit-source-id: 3f19eff7cafc03b333965be0203c5853d2a9cb71
2019-06-10 18:47:58 -07:00
Maysam Yabandeh
c8c1a549f0 Avoid deadlock between mutex_ and log_write_mutex_ (#5437)
Summary:
To avoid deadlock mutex_ should never be acquired before log_write_mutex_. The patch documents that and also fixes one case in ::FlushWAL that acquires mutex_ through ::WriteStatusCheck when it already holds lock on log_write_mutex_.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5437

Differential Revision: D15749722

Pulled By: maysamyabandeh

fbshipit-source-id: f57b69c44b4b80cc6d7ddf3d3fdf4a9eb5a5a45a
2019-06-10 17:06:50 -07:00
Maysam Yabandeh
b2584577fa Remove global locks from FlushScheduler (#5372)
Summary:
FlushScheduler's methods are instrumented with debug-time locks to check the scheduler state against a simple container definition. Since https://github.com/facebook/rocksdb/pull/2286 the scope of such locks are widened to the entire methods' body. The result is that the concurrency tested during testing (in debug mode) is stricter than the concurrency level manifested at runtime (in release mode).
The patch reverts this change to reduce the scope of such locks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5372

Differential Revision: D15545831

Pulled By: maysamyabandeh

fbshipit-source-id: 01d69191afb1dd807d4bdc990fc74813ae7b5426
2019-06-10 16:50:26 -07:00
Yanqin Jin
641cc8d541 Use CreateLoggerFromOptions function (#5427)
Summary:
Use `CreateLoggerFromOptions` function to reduce code duplication.

Test plan (on my machine)
```
$make clean && make -j32 db_secondary_test
$KEEP_DB=1 ./db_secondary_test
```
Verify all info logs of the secondary instance are properly logged.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5427

Differential Revision: D15748922

Pulled By: riversand963

fbshipit-source-id: bad7261df1b8373efc504f141efc7871e375a311
2019-06-10 16:00:30 -07:00
haoyuhuang
5efa0d6b0d Create a BlockCacheLookupContext to enable fine-grained block cache tracing. (#5421)
Summary:
BlockCacheLookupContext only contains the caller for now.
We will trace block accesses at five places:
1. BlockBasedTable::GetFilter.
2. BlockBasedTable::GetUncompressedDict.
3. BlockBasedTable::MaybeReadAndLoadToCache. (To trace access on data, index, and range deletion block.)
4. BlockBasedTable::Get. (To trace the referenced key and whether the referenced key exists in a fetched data block.)
5. BlockBasedTable::MultiGet. (To trace the referenced key and whether the referenced key exists in a fetched data block.)

We create the context at:
1. BlockBasedTable::Get. (kUserGet)
2. BlockBasedTable::MultiGet. (kUserMGet)
3. BlockBasedTable::NewIterator. (either kUserIterator, kCompaction, or external SST ingestion calls this function.)
4. BlockBasedTable::Open. (kPrefetch)
5. Index/Filter::CacheDependencies. (kPrefetch)
6. BlockBasedTable::ApproximateOffsetOf. (kCompaction or kUserApproximateSize).

I loaded 1 million key-value pairs into the database and ran the readrandom benchmark with a single thread. I gave the block cache 10 GB to make sure all reads hit the block cache after warmup. The throughput is comparable.
Throughput of this PR: 231334 ops/s.
Throughput of the master branch: 238428 ops/s.

Experiment setup:
RocksDB:    version 6.2
Date:       Mon Jun 10 10:42:51 2019
CPU:        24 * Intel Core Processor (Skylake)
CPUCache:   16384 KB
Keys:       20 bytes each
Values:     100 bytes each (100 bytes after compression)
Entries:    1000000
Prefix:    20 bytes
Keys per prefix:    0
RawSize:    114.4 MB (estimated)
FileSize:   114.4 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: NoCompression
Compression sampling rate: 0
Memtablerep: skip_list
Perf Level: 1

Load command: ./db_bench --benchmarks="fillseq" --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --statistics --cache_index_and_filter_blocks --cache_size=10737418240 --disable_auto_compactions=1 --disable_wal=1 --compression_type=none --min_level_to_compress=-1 --compression_ratio=1 --num=1000000

Run command: ./db_bench --benchmarks="readrandom,stats" --use_existing_db --threads=1 --duration=120 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --statistics --cache_index_and_filter_blocks --cache_size=10737418240 --disable_auto_compactions=1 --disable_wal=1 --compression_type=none --min_level_to_compress=-1 --compression_ratio=1 --num=1000000 --duration=120

TODOs:
1. Create a caller for external SST file ingestion and differentiate the callers for iterator.
2. Integrate tracer to trace block cache accesses.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5421

Differential Revision: D15704258

Pulled By: HaoyuHuang

fbshipit-source-id: 4aa8a55f8cb1576ffb367bfa3186a91d8f06d93a
2019-06-10 15:33:27 -07:00
Yanqin Jin
6ce5580882 Improve memtable earliest seqno assignment for secondary instance (#5413)
Summary:
In regular RocksDB instance, `MemTable::earliest_seqno_` is "db sequence number at the time of creation". However, we cannot use the db sequence number to set the value of `MemTable::earliest_seqno_` for secondary instance, i.e. `DBImplSecondary` due to the logic of MANIFEST and WAL replay.
When replaying the log files of the primary, the secondary instance first replays MANIFEST and updates the db sequence number if necessary. Next, the secondary replays WAL files, creates new memtables if necessary and inserts key-value pairs into memtables. The following can occur when the db has two or more column families.
Assume the db has column family "default" and "cf1". At a certain in time, both "default" and "cf1" have data in memtables.
1. Primary triggers a flush and flushes "cf1". "default" is **not** flushed.
2. Secondary replays the MANIFEST updates its db sequence number to the latest value learned from the MANIFEST.
3. Secondary starts to replay WAL that contains the writes to "default". It is possible that the write batches' sequence numbers are smaller than the db sequence number. In this case, these write batches will be skipped, and these updates will not be visible to reader until "default" is later flushed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5413

Differential Revision: D15637407

Pulled By: riversand963

fbshipit-source-id: 3de3fe35cfc6f1b9f844f3f926f0df29717b6580
2019-06-10 12:58:14 -07:00
Maysam Yabandeh
c292dc8540 WritePrepared: reduce prepared_mutex_ overhead (#5420)
Summary:
The patch reduces the contention over prepared_mutex_ using these techniques:
1) Move ::RemovePrepared() to be called from the commit callback when we have two write queues.
2) Use two separate mutex for PreparedHeap, one prepared_mutex_ needed for ::RemovePrepared, and one ::push_pop_mutex() needed for ::AddPrepared(). Given that we call ::AddPrepared only from the first write queue and ::RemovePrepared mostly from the 2nd, this will result into each the two write queues not competing with each other over a single mutex. ::RemovePrepared might occasionally need to acquire ::push_pop_mutex() if ::erase() ends up with calling ::pop()
3) Acquire ::push_pop_mutex() on the first callback of the write queue and release it on the last.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5420

Differential Revision: D15741985

Pulled By: maysamyabandeh

fbshipit-source-id: 84ce8016007e88bb6e10da5760ba1f0d26347735
2019-06-10 11:53:31 -07:00
anand76
b703a56e5c Potential fix for stress test failure due to "SST file ahead of WAL" error (#5412)
Summary:
I'm not able to prove it, but the stress test failure may be caused by the following sequence of events -

1. Crash db_stress while writing the log file. This should result in a corrupted WAL.
2. Run db_stress with recycle_log_file_num=1. Crash during recovery immediately after writing manifest and updating the current file. The old log from the previous run is left behind, but the memtable would have been flushed during recovery and the CF log number will point to the newer log
3. Run db_stress with recycle_log_file_num=0. During recovery, the old log file will be processed and the corruption will be detected. Since the CF has moved ahead, we get the "SST file is ahead of WAL" error

Test -
1. stress_crash
2. make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5412

Differential Revision: D15699120

Pulled By: anand1976

fbshipit-source-id: 9092ce81e7c4a0b4b4e66560c23ea4812a4d9cbe
2019-06-07 15:35:47 -07:00
Levi Tamasi
0f48e56f96 Revert to checking the upper bound on a per-key basis in BlockBasedTableIterator (#5428)
Summary:
PR #5111 reduced the number of key comparisons when iterating with
upper/lower bounds; however, this caused a regression for MyRocks.
Reverting to the previous behavior in BlockBasedTableIterator as a hotfix.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5428

Differential Revision: D15721038

Pulled By: ltamasi

fbshipit-source-id: 5450106442f1763bccd17f6cfd648697f2ae8b6c
2019-06-07 15:17:05 -07:00
Zhongyi Xie
d68f9f4580 simplify include directive involving inttypes (#5402)
Summary:
When using `PRIu64` type of printf specifier, current code base does the following:
```
#ifndef __STDC_FORMAT_MACROS
#define __STDC_FORMAT_MACROS
#endif
#include <inttypes.h>
```
However, this can be simplified to
```
#include <cinttypes>
```
as long as flag `-std=c++11` is used.
This should solve issues like https://github.com/facebook/rocksdb/issues/5159
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5402

Differential Revision: D15701195

Pulled By: miasantreble

fbshipit-source-id: 6dac0a05f52aadb55e9728038599d3d2e4b59d03
2019-06-06 13:56:07 -07:00
Yanqin Jin
340ed4fac7 Add support for timestamp in Get/Put (#5079)
Summary:
It's useful to be able to (optionally) associate key-value pairs with user-provided timestamps. This PR is an early effort towards this goal and continues the work of facebook#4942. A suite of new unit tests exist in DBBasicTestWithTimestampWithParam. Support for timestamp requires the user to provide timestamp as a slice in `ReadOptions` and `WriteOptions`. All timestamps of the same database must share the same length, format, etc. The format of the timestamp is the same throughout the same database, and the user is responsible for providing a comparator function (Comparator) to order the <key, timestamp> tuples. Once created, the format and length of the timestamp cannot change (at least for now).

Test plan (on devserver):
```
$COMPILE_WITH_ASAN=1 make -j32 all
$./db_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/*
$make check
```
All tests must pass.

We also run the following db_bench tests to verify whether there is regression on Get/Put while timestamp is not enabled.
```
$TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillseq,readrandom -num=1000000
$TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=1000000
```
Repeat for 6 times for both versions.

Results are as follows:
```
|        | readrandom | fillrandom |
| master | 16.77 MB/s | 47.05 MB/s |
| PR5079 | 16.44 MB/s | 47.03 MB/s |
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5079

Differential Revision: D15132946

Pulled By: riversand963

fbshipit-source-id: 833a0d657eac21182f0f206c910a6438154c742c
2019-06-05 23:10:47 -07:00
haoyuhuang
227b5d52df Make RocksDB secondary instance respect atomic groups in version edits. (#5411)
Summary:
With this commit, RocksDB secondary instance respects atomic groups in version edits.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5411

Differential Revision: D15617512

Pulled By: HaoyuHuang

fbshipit-source-id: 913f4ede391d772dcaf5649e3cd2099fa292d120
2019-06-04 10:56:19 -07:00
Andrew Kryczka
ebe89ef9d8 Fix merging range tombstone covering put during flush/compaction (#5406)
Summary:
Flush/compaction use `MergeUntil` which has a special code path to
handle a merge ending with a non-`Merge` point key. In particular if
that key is a `Put` we forgot to check whether it is covered by a range
tombstone. If it is covered then we must not include it in the following call
to `TimedFullMerge`.

Fixes #5392.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5406

Differential Revision: D15611144

Pulled By: sagar0

fbshipit-source-id: ba6a7863ca2d043f591de78fd0c4f4561f0c500e
2019-06-04 10:24:14 -07:00
anand76
5d6e8df1cf Ignore shutdown error during compaction (#5400)
Summary:
The PR #5275 separated the column dropped and shutdown status codes. However, there were a couple of places in compaction where this change ended up treating a ShutdownInProgress() error as a real error and set bg_error. This caused MyRocks unit test to fail due to WAL writes during shutdown returning this error. Fix it by ignoring the shutdown status during compaction.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5400

Differential Revision: D15611680

Pulled By: anand1976

fbshipit-source-id: c602e97840e3ae24eb420d61e0ce95d3e6258632
2019-06-03 22:40:43 -07:00
Maysam Yabandeh
ae05a83e19 Call ValidateOptions from SetOptions (#5368)
Summary:
Currently we validate options in DB::Open. However the validation step is missing when options are dynamically updated in ::SetOptions. The patch fixes that.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5368

Differential Revision: D15540101

Pulled By: maysamyabandeh

fbshipit-source-id: d27bbffd8f0252d1b50bcf59e0a70a278ed937f4
2019-06-03 19:49:57 -07:00
Siying Dong
5851cb7fdb Move util/trace_replay.* to trace_replay/ (#5376)
Summary:
util/ means for lower level libraries. trace_replay is highly integrated to DB and sometimes call DB. Move it out to a separate directory.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5376

Differential Revision: D15550938

Pulled By: siying

fbshipit-source-id: f46dce5ceffdc05a73f26379c7bb1b79ebe6c207
2019-06-03 13:25:26 -07:00
Siying Dong
000b9ec217 Move some logging related files to logging/ (#5387)
Summary:
Many logging related source files are under util/. It will be more structured if they are together.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5387

Differential Revision: D15579036

Pulled By: siying

fbshipit-source-id: 3850134ed50b8c0bb40a0c8ae1f184fa4081303f
2019-05-31 17:23:59 -07:00
Vijay Nadimpalli
cae22c53fb Make format
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5395

Differential Revision: D15581698

Pulled By: vjnadimpalli

fbshipit-source-id: f415972f16e784b1361714c202b97defcab46767
2019-05-31 15:24:43 -07:00
Vijay Nadimpalli
49c5a12dbe Organizing rocksdb/db directory
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5390

Differential Revision: D15579388

Pulled By: vjnadimpalli

fbshipit-source-id: 5bfc95e31554b8ff05b97b76d6534113f527f366
2019-05-31 11:57:01 -07:00
Zhongyi Xie
ab8f6c01a6 move LevelCompactionPicker to a separate file (#5369)
Summary:
In order to improve code readability, this PR moves LevelCompactionBuilder and LevelCompactionPicker to compaction_picker_level.h and .cc
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5369

Differential Revision: D15540172

Pulled By: miasantreble

fbshipit-source-id: c1a578b93f127cd63661b53f32b356e6edd349af
2019-05-30 21:38:24 -07:00
Sagar Vemuri
ff9d286877 Reorder DBImpl's private section (#5385)
Summary:
The methods and fields in the private section of DBImpl were all intermingled, making it hard to figure out where the fields/methods start and where they end. I cleaned up the code a little so that all the type declaration are at the beginning, followed by methods, and all the data fields are at the end. This follows
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5385

Differential Revision: D15566978

Pulled By: sagar0

fbshipit-source-id: 4618a7d819ad4e2d7cc9ae1af2c59f400140bb1b
2019-05-30 21:32:46 -07:00
Yanqin Jin
b9f5900658 Fix WAL replay by skipping old write batches (#5170)
Summary:
1. Fix a bug in WAL replay in which write batches with old sequence numbers are mistakenly inserted into memtables.
2. Add support for benchmarking secondary instance to db_bench_tool.
With changes made in this PR, we can start benchmarking secondary instance
using two processes. It is also possible to vary the frequency at which the
secondary instance tries to catch up with the primary. The info log of the
secondary can be found in a directory whose path can be specified with
'-secondary_path'.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5170

Differential Revision: D15564608

Pulled By: riversand963

fbshipit-source-id: ce97688ed3d33f69d3a0b9266ebbbbf887aa0ec8
2019-05-30 19:33:33 -07:00
Siying Dong
8843129ece Move some memory related files from util/ to memory/ (#5382)
Summary:
Move arena, allocator, and memory tools under util to a separate memory/ directory.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5382

Differential Revision: D15564655

Pulled By: siying

fbshipit-source-id: 9cd6b5d0d3d52b39606e19221fa154596e5852a5
2019-05-30 17:44:09 -07:00
Yanqin Jin
f1302ebab8 Add class-level comments to version-related classes (#5348)
Summary:
As title.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5348

Differential Revision: D15564595

Pulled By: riversand963

fbshipit-source-id: dd45aa86a70e0343c2e9ef702fad165163f548e6
2019-05-30 16:18:33 -07:00
Sagar Vemuri
1b59a490ef Fix flaky DBTest2.PresetCompressionDict test (#5378)
Summary:
Fix flaky DBTest2.PresetCompressionDict test.

This PR fixes two issues with the test:
1. Replaces `GetSstFiles` with `TotalSize`, which is based on `DB::GetColumnFamilyMetaData` so that only the size of the live SST files is taken into consideration when computing the total size of all sst files. Earlier, with `GetSstFiles`, even obsolete files were getting picked up.
1. In ZSTD compression, it is sometimes possible that using a trained dictionary is not better than using an untrained one. Using a trained dictionary performs well in 99% of the cases, but still in the remaining ~1% of the cases (out of 10000 runs) using an untrained dictionary gets better compression results.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5378

Differential Revision: D15559100

Pulled By: sagar0

fbshipit-source-id: c35adbf13871f520a2cec48f8bad9ff27ff7a0b4
2019-05-30 16:11:27 -07:00
Vijay Nadimpalli
50e470791d Organizing rocksdb/table directory by format
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5373

Differential Revision: D15559425

Pulled By: vjnadimpalli

fbshipit-source-id: 5d6d6d615582bedd96a4b879bb25d429a6de8b55
2019-05-30 14:51:11 -07:00
Sagar Vemuri
e62986260f Fix env_options_for_read spelling in CompactionJob
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5380

Differential Revision: D15563386

Pulled By: sagar0

fbshipit-source-id: 8b26aef47cfc40ff8016daf815582f21cdd40df2
2019-05-30 14:04:53 -07:00
Levi Tamasi
1e35584251 Move the index readers out of the block cache (#5298)
Summary:
Currently, when the block cache is used for index blocks as well, it is
not really the index block that is stored in the cache but an
IndexReader object. Since this object is not pure data (it has, for
instance, pointers that might dangle), it's not really sharable. To
avoid the issues around this, the current code uses a dummy unique cache
key for each TableReader to store the IndexReader, and erases the
IndexReader entry when the TableReader is closed. Instead of doing this,
the new code moves the IndexReader out of the cache altogether. In
particular, instead of the TableReader owning, or caching/pinning the
IndexReader based on the customer's settings, the TableReader
unconditionally owns the IndexReader, which in turn owns/caches/pins
the index block (which is itself sharable and thus can be safely put in
the cache without any hacks).

Note: the change has two side effects:
1) Partitions of partitioned indexes no longer affect the read
amplification statistics.
2) Eviction statistics for index blocks are temporarily broken. We plan to fix
this in a separate phase.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5298

Differential Revision: D15303203

Pulled By: ltamasi

fbshipit-source-id: 935a69ba59d87d5e44f42e2310619b790c366e47
2019-05-30 11:53:27 -07:00
Siying Dong
e9e0101ca4 Move test related files under util/ to test_util/ (#5377)
Summary:
There are too many types of files under util/. Some test related files don't belong to there or just are just loosely related. Mo
ve them to a new directory test_util/, so that util/ is cleaner.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5377

Differential Revision: D15551366

Pulled By: siying

fbshipit-source-id: 0f5c8653832354ef8caa31749c0143815d719e2c
2019-05-30 11:25:51 -07:00
anand76
a984040f0b Increase Trash/DB size ratio in DBSSTTest.RateLimitedWALDelete (#5366)
Summary:
By increasing the ratio, we ensure that all files go through background deletion and eliminate flakiness due to timing of deletions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5366

Differential Revision: D15549992

Pulled By: anand1976

fbshipit-source-id: d137375cd791fc1a802841412755d6e2b8fd7688
2019-05-30 11:12:59 -07:00
Zhongyi Xie
87fe4bcab8 Fix FIFO dynamic options sanitization (#5367)
Summary:
When dynamically setting options, we check the option type info and skip options that are marked deprecated. However this check is only done at top level, which results in bugs where SetOptions will corrupt option values and cause unexpected system behavior iff a deprecated second level option is set dynamically.
For exmaple, the following call:
```
dbfull()->SetOptions(
    {{"compaction_options_fifo",
        "{allow_compaction=true;max_table_files_size=1024;ttl=731;}"}});
```
was from pre 6.0 release when `ttl` was part of `compaction_options_fifo`. Now that it got moved out of `compaction_options_fifo`, this call will incorrectly set `compaction_options_fifo.max_table_files_size` to 731 (as `max_table_files_size` is the first one in `OptionsHelper::fifo_compaction_options_type_info` struct) and cause files to gett evicted much faster than expected.

This PR adds verification to second level options like `compaction_options_fifo.ttl` or `compaction_options_fifo.max_table_files_size` when set dynamically, and filter out those marked as deprecated.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5367

Differential Revision: D15530998

Pulled By: miasantreble

fbshipit-source-id: 818258be5c3abe09cd82d62f3c083572d70fecdd
2019-05-30 10:46:28 -07:00
Siying Dong
545d206040 Move some file related files outside util/ (#5375)
Summary:
util/ means for lower level libraries, so it's a good idea to move the files which requires knowledge to DB out. Create a file/ and move some files there.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5375

Differential Revision: D15550935

Pulled By: siying

fbshipit-source-id: 61a9715dcde5386eebfb43e93f847bba1ae0d3f2
2019-05-29 20:47:06 -07:00
Siying Dong
4d0c3b1f96 Add comments in compaction_picker.h
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5357

Differential Revision: D15522825

Pulled By: siying

fbshipit-source-id: d775386b9d10c7179f5d3af2c821ed213abfacdf
2019-05-28 12:24:38 -07:00
Zhongyi Xie
a466120cd5 improve comments in db_impl_secondary
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5360

Differential Revision: D15502973

Pulled By: miasantreble

fbshipit-source-id: 15b7f9d7928e771a6fac0643861173be8ba6b37a
2019-05-24 15:32:03 -07:00
anand76
029b98984e Add some comments in table_cache.h
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5343

Differential Revision: D15485831

Pulled By: anand1976

fbshipit-source-id: 8735ccfba90d7ecb3559e63f792e34527f04ed29
2019-05-24 14:26:43 -07:00
Siying Dong
6267ed251a Improve comment in db_impl.h (#5338)
Summary:
Add some comments in db_impl.h. Also reordered function order a little bit so that I can add a comment to flag the area of functions implementing DB interface.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5338

Differential Revision: D15498284

Pulled By: siying

fbshipit-source-id: 3d7c59c8303577fe44d13c74ae84c7ce05164f77
2019-05-24 13:09:55 -07:00
Siying Dong
f69e63dc5f Improve comments in compaction.h
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5356

Differential Revision: D15499033

Pulled By: siying

fbshipit-source-id: 069ae48669484beaf668dd90389b8743b3309dc3
2019-05-24 12:24:28 -07:00
Siying Dong
596cc1547a Update comments in column_family.h (#5347)
Summary:
Document relationships of data structures declared in column_family.h
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5347

Differential Revision: D15496941

Pulled By: siying

fbshipit-source-id: 47b37835abba26aa31a94fabea6b2775483e0ccb
2019-05-24 12:07:15 -07:00
Zhongyi Xie
767d1f3ff1 Improve comments for StatsHistoryIterator and InMemoryStatsHistoryIterator
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5346

Differential Revision: D15497679

Pulled By: miasantreble

fbshipit-source-id: c10caf10293c3d9663bfb398a0d331326d1e9e67
2019-05-24 11:40:05 -07:00
Zhongyi Xie
88ff80780b improve comment for WalManager (#5350)
Summary:
att
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5350

Differential Revision: D15496467

Pulled By: miasantreble

fbshipit-source-id: c29c0b143bf4df2040695a82be0feb9814ddb641
2019-05-24 10:40:30 -07:00
haoyuhuang
74a334a2eb Provide an option so that SST ingestion won't fall back to copy after hard linking fails (#5333)
Summary:
RocksDB always tries to perform a hard link operation on the external SST file to ingest. This operation can fail if the external SST resides on a different device/FS, or the underlying FS does not support hard link. Currently RocksDB assumes that if the link fails, the user is willing to perform file copy, which is not true according to the post. This commit provides an option named  'failed_move_fall_back_to_copy' for users to choose which behavior they want.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5333

Differential Revision: D15457597

Pulled By: HaoyuHuang

fbshipit-source-id: f3626e13f845db4f7ed970a53ec8a2b1f0d62214
2019-05-23 21:58:52 -07:00
Zhongyi Xie
09b534cc2f improve comments for CompactionJob (#5341)
Summary:
add class/function level comments to the header file
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5341

Differential Revision: D15485442

Pulled By: miasantreble

fbshipit-source-id: 9f11e2a1cd3ce0f4990f01353d0a6f4b050615cf
2019-05-23 16:57:46 -07:00
Siying Dong
02830a20f8 Add comments in db/dbformat.h
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5342

Differential Revision: D15485238

Pulled By: siying

fbshipit-source-id: a56b374584cb1d815c1173907a807d90b37d4dd6
2019-05-23 16:44:20 -07:00
Siying Dong
dc30a9b69b Add comments to db/db_iter.h (#5340)
Summary:
Add file comment in db/db_iter.h and minor changes in other parts.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5340

Differential Revision: D15484605

Pulled By: siying

fbshipit-source-id: 173771f9d5bd51303de5410ee5afd0a4af9d6572
2019-05-23 16:11:38 -07:00
Thomas Fersch
3d9d77d900 Restrict L0->L0 compaction according to max_compaction_bytes option (#5329)
Summary:
Modified FindIntraL0Compaction to stop picking more files if total
amount of compensated bytes would be larger than max_compaction_bytes
option.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5329

Differential Revision: D15435728

Pulled By: ThomasFersch

fbshipit-source-id: d118a6da88d5df8ee20944422ade37cf6b15d60c
2019-05-22 23:40:57 -07:00
haoyuhuang
518cd1a62a Use GetCurrentManifestPath to locate current MANIFEST file (#5331)
Summary:
In version_set.cc, there is a function GetCurrentManifestPath. The goal of this task is to refactor ListColumnFamilies function so that ListColumnFamilies calls GetCurrentManifestPath to search for MANIFEST.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5331

Differential Revision: D15444524

Pulled By: HaoyuHuang

fbshipit-source-id: 1dcbd030bc0f2e835695741f450bba150f2f2903
2019-05-22 09:21:56 -07:00
Siying Dong
b2274da0e5 LogWriter to only flush after finish generating whole record (#5328)
Summary:
Right now, in log writer, we call flush after writing each physical record. I don't see the necessarity of it. Right now, the underlying writer has a buffer, so there isn't a concern that the write request is too large either. On the other hand, in an Env where every flush is expensive, the current approach is significantly slower than only flushing after a whole record finishes, when the record is very large.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5328

Differential Revision: D15425032

Pulled By: siying

fbshipit-source-id: 440ebef002dfbb60c59d8388c9ddfc83d79700aa
2019-05-21 12:33:17 -07:00
Siying Dong
cd43446d01 Improve DBTablePropertiesTest.GetPropertiesOfTablesInRange (#5302)
Summary:
DBTablePropertiesTest.GetPropertiesOfTablesInRange sometimes hits the assert that generated LSM-tree doesn't have L1 file. Tighten the compaction triggering condition even further, hoping it goes away.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5302

Differential Revision: D15325971

Pulled By: siying

fbshipit-source-id: 3e032bdb16fe8d98d5fcfcd65dd8be9781f3d6ae
2019-05-20 13:50:53 -07:00
Vijay Nadimpalli
931c9df886 Use separate status code for column family drop and db shutdown in progress (#5275)
Summary:
Currently RocksDB uses Status::ShutdownInProgress to inform about column family drop. I would like to have a separate Status code for this event.
https://github.com/facebook/rocksdb/blob/master/include/rocksdb/status.h#L55
Comment on this:
abc4202e47/db/version_set.cc (L2742):L2743
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5275

Differential Revision: D15204583

Pulled By: vjnadimpalli

fbshipit-source-id: 95e99e34b27bc165b554ecb8a48a7f8e60f21e2a
2019-05-20 10:47:32 -07:00
Maysam Yabandeh
5c0e304170 WritePrepared: Clarify the need for two_write_queues in unordered_write (#5313)
Summary:
WritePrepared transactions when configured with two_write_queues=true offers higher throughput with unordered_write feature without however compromising the rocksdb guarantees. This is because it performs ordering among writes in a 2nd step that is not tied to memtable write speed. The 2nd step is naturally provided by 2PC when the commit phase does the ordering as well. Without 2PC, the 2nd step would only be provided when we use two_write_queues=true, where WritePrepared after performing the writes, in a 2nd step uses the 2nd queue to assign order to the writes.
The patch clarifies the need for two_write_queues=true in the HISTORY and inline comments of unordered_writes. Moreover it extends the stress tests of WritePrepared to unordred_write.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5313

Differential Revision: D15379977

Pulled By: maysamyabandeh

fbshipit-source-id: 5b6f05b9b59285dcbf3b0532215ba9fe7d926e00
2019-05-20 07:49:20 -07:00
Yanqin Jin
fb4c6a31ce Log replay integration for secondary instance (#5305)
Summary:
RocksDB secondary can replay both MANIFEST and WAL now.
On the one hand, the memory usage by memtables will grow after replaying WAL for sometime. On the other hand, replaying the MANIFEST can bring the database persistent data to a more recent point in time, giving us the opportunity to discard some memtables containing out-dated data.
This PR coordinates the MANIFEST and WAL replay, using the updates from MANIFEST replay to update the active memtable and immutable memtable list of each column family.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5305

Differential Revision: D15386512

Pulled By: riversand963

fbshipit-source-id: a3ea6fc415f8382d8cf624f52a71ebdcffa3e355
2019-05-17 19:19:51 -07:00
yiwu-arbug
f3a7847598 Reduce iterator key comparison for upper/lower bound check (#5111)
Summary:
Previously if iterator upper/lower bound presents, `DBIter` will check the bound for every key. This patch turns the check into per-file or per-data block check when applicable, by checking against either file largest/smallest key or block index key.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5111

Differential Revision: D15330061

Pulled By: siying

fbshipit-source-id: 8a653fe3cd50d94d81eb2d13b087326c58ee2024
2019-05-17 10:28:31 -07:00
Siying Dong
f82e693a31 RangeDelAggregator::StripeRep::Invalidate() to be skipped if empty (#5312)
Summary:
RangeDelAggregator::StripeRep::Invalidate() clears up several vectors. If we know there isn't anything to there, we can safe these small CPUs. Profiling shows that it sometimes take non-negligible amount of CPU. Worth a small optimization.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5312

Differential Revision: D15380511

Pulled By: siying

fbshipit-source-id: 53c5f34c33b4cb1e743643c6086ac56d0b84ec2e
2019-05-16 15:24:28 -07:00
Yanqin Jin
1583cb402e Fix a flaky test with test sync point (#5310)
Summary:
If DB is opened with `avoid_unnecessary_blocking_io` being true, then `~ColumnFamilyHandleImpl` enqueues a purge request and schedules a background thread to perform the deletion. Without test sync point, whether the SST file is purged or not at a later point in time is not deterministic. If the SST does not exist, it will cause an assertion failure.

How to reproduce:
```
$git checkout 6492430eaf
$make -j20 deletefile_test
$gtest-parallel --repeat 1000 --worker 16 ./deletefile_test --gtest_filter=DeleteFileTest.BackgroundPurgeCFDropTest
```
The test may fail a few times.
With changes made in this PR, repeat the above commands, and the test should not fail.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5310

Differential Revision: D15361136

Pulled By: riversand963

fbshipit-source-id: c4308d5f8da83472c893bf7f8ceed347fbfa850f
2019-05-15 15:17:55 -07:00
Maysam Yabandeh
f0e8216197 WritePrepared: Fix deadlock in WriteRecoverableState (#5306)
Summary:
The recent improvement in https://github.com/facebook/rocksdb/pull/3661 could cause a deadlock: When writing recoverable state, we also commit its sequence number to commit table, which could result into evicting existing commit entry, which could result into advancing max_evicted_seq_, which would need to get snapshots from database, which requires obtaining db mutex. The patch releases db_mutex before calling the callback in WriteRecoverableState to avoid the potential deadlock. It also improves the stress tests to let the issue be manifested in the tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5306

Differential Revision: D15341458

Pulled By: maysamyabandeh

fbshipit-source-id: 05dcbed7e21b789fd1e5fd5ee8eea08077162323
2019-05-15 13:53:54 -07:00
Thomas Fersch
a42757607d Use pre-increment instead of post-increment for iterators (#5296)
Summary:
Google C++ style guide indicates pre-increment should be used for iterators: https://google.github.io/styleguide/cppguide.html#Preincrement_and_Predecrement. Replaced all instances of ' it++' by ' ++it' (where type is iterator). So this covers the cases where iterators are named 'it'.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5296

Differential Revision: D15301256

Pulled By: tfersch

fbshipit-source-id: 2803483c1392504ad3b281d21db615429c71114b
2019-05-15 13:19:15 -07:00
Maysam Yabandeh
3c3252a06a Fix tsan complaint in ConcurrentMergeWrite test (#5308)
Summary:
The test was not using separate MemTablePostProcessInfo per memetable insert thread and thus tsan was complaining about data race.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5308

Differential Revision: D15356420

Pulled By: maysamyabandeh

fbshipit-source-id: 46c2f2d19fb02c3c775b587aa09ca9c0dae6ed04
2019-05-15 11:21:48 -07:00
anand76
6492430eaf Fix a bug in db_stress and an incorrect assertion in FilePickerMultiGet (#5301)
Summary:
This PR has two fixes for crash test failures -
1. Fix a bug in TestMultiGet() in db_stress that was passing list of key to MultiGet() in the wrong order, thus ensuring that actual values don't match expected values
2. Remove an incorrect assertion in FilePickerMultiGet::GetNextFileInLevelWithKeys() that checks that files in a level are in sorted order. This is not true with MultiGet(), especially if there are duplicate keys and we may have to go back one file for the next key. Furthermore, this assertion makes more sense when a new version is created, rather than at lookup time

Test -
asan_crash and ubsan_crash tests
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5301

Differential Revision: D15337383

Pulled By: anand1976

fbshipit-source-id: 35092cb15bbc1700e5e823cbe07bfa62f1e9e6c6
2019-05-14 11:58:04 -07:00
Maysam Yabandeh
f383641a1d Unordered Writes (#5218)
Summary:
Performing unordered writes in rocksdb when unordered_write option is set to true. When enabled the writes to memtable are done without joining any write thread. This offers much higher write throughput since the upcoming writes would not have to wait for the slowest memtable write to finish. The tradeoff is that the writes visible to a snapshot might change over time. If the application cannot tolerate that, it should implement its own mechanisms to work around that. Using TransactionDB with WRITE_PREPARED write policy is one way to achieve that. Doing so increases the max throughput by 2.2x without however compromising the snapshot guarantees.
The patch is prepared based on an original by siying
Existing unit tests are extended to include unordered_write option.

Benchmark Results:
```
TEST_TMPDIR=/dev/shm/ ./db_bench_unordered --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions  --unordered_write=1
```
With WAL
- Vanilla RocksDB: 78.6 MB/s
- WRITER_PREPARED with unordered_write: 177.8 MB/s (2.2x)
- unordered_write: 368.9 MB/s (4.7x with relaxed snapshot guarantees)

Without WAL
- Vanilla RocksDB: 111.3 MB/s
- WRITER_PREPARED with unordered_write: 259.3 MB/s MB/s (2.3x)
- unordered_write: 645.6 MB/s (5.8x with relaxed snapshot guarantees)

- WRITER_PREPARED with unordered_write disable concurrency control: 185.3 MB/s MB/s (2.35x)

Limitations:
- The feature is not yet extended to `max_successive_merges` > 0. The feature is also incompatible with `enable_pipelined_write` = true as well as with `allow_concurrent_memtable_write` = false.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5218

Differential Revision: D15219029

Pulled By: maysamyabandeh

fbshipit-source-id: 38f2abc4af8780148c6128acdba2b3227bc81759
2019-05-13 17:47:21 -07:00
Yanqin Jin
e626016545 Fix a race condition caused by unlocking db mutex (#5294)
Summary:
Previous code may call `~ColumnFamilyData` in `DBImpl::AtomicFlushMemTablesToOutputFiles` if the column family is dropped or `cfd->IsFlushPending() == false`. In `~ColumnFamilyData`, the db mutex is released briefly and re-acquired. This can cause correctness issue. The reason is as follows.

Assume there are more bg flush threads. After bg_flush_thr1 releases the db mutex, bg_flush_thr2 can grab it and pop an element from the flush queue. This will cause bg_flush_thr2 to accidentally pick some memtables which should have been picked by bg_flush_thr1. To make the matter worse, bg_flush_thr2 can clear `flush_requested_` flag for the memtable list, causing a subsequent call to `MemTableList::IsFlushPending()` by bg_flush_thr1 to return false, which is wrong.

The fix is to delay `ColumnFamilyData::Unref` and `~ColumnFamilyData` for column families not selected for flush until `AtomicFlushMemTablesToOutputFiles` returns. Furthermore, a bg flush thread should not clear `MemTableList::flush_requested_` in `MemTableList::PickMemtablesToFlush` unless atomic flush is not used **or** the memtable list does not have unpicked memtables.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5294

Differential Revision: D15295297

Pulled By: riversand963

fbshipit-source-id: 03b101205ca22c242647cbf488bcf0ed80b2ecbd
2019-05-10 17:56:48 -07:00