Commit Graph

8991 Commits

Author SHA1 Message Date
Levi Tamasi
f5bc3b99d5 Split BlobFileState into an immutable and a mutable part (#6502)
Summary:
It's never too soon to refactor something. The patch splits the recently
introduced (`VersionEdit` related) `BlobFileState` into two classes
`BlobFileAddition` and `BlobFileGarbage`. The idea is that once blob files
are closed, they are immutable, and the only thing that changes is the
amount of garbage in them. In the new design, `BlobFileAddition` contains
the immutable attributes (currently, the count and total size of all blobs, checksum
method, and checksum value), while `BlobFileGarbage` contains the mutable
GC-related information elements (count and total size of garbage blobs). This is a
better fit for the GC logic and is more consistent with how SST files are handled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6502

Test Plan: `make check`

Differential Revision: D20348352

Pulled By: ltamasi

fbshipit-source-id: ff93f0121e80ab15e0e0a6525ba0d6af16a0e008
2020-03-10 17:27:26 -07:00
Chao Zhao
4028eba67b Optional sequence number exporting during checkpoint creation (#5528)
Summary:
Add sequence_number_ptr to the checkpoint interface to expose the sequence number during taking the checkpoint. The number will be consistent with the seq # in rocksdb log.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5528

Test Plan: make check -j64

Reviewed By: Winger1994

Differential Revision: D16080209

fbshipit-source-id: 6dc3c7680287ee97d673c5e61f89aae1f43e33df
2020-03-10 13:40:18 -07:00
Yanqin Jin
fd1da22111 Support options.max_open_files != -1 with FIFO compaction (#6503)
Summary:
Allow user to specify options.max_open_files != -1 with FIFO compaction.
If max_open_files != -1, not all table files are kept open.
In the past, FIFO style compaction requires all table files to be open in order
to read file creation time from table properties. Later, we added file creation
time to MANIFEST, making it possible to read file creation time without opening
file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6503

Test Plan: make check

Differential Revision: D20353758

Pulled By: riversand963

fbshipit-source-id: ba5c61a648419e47e9ef6d74e0e280e3ee24f296
2020-03-09 18:45:06 -07:00
Yanqin Jin
d93812c9ae Iterator with timestamp (#6255)
Summary:
Preliminary support for iterator with user timestamp. Current implementation does not consider merge operator and reverse iterator. Auto compaction is also disabled in unit tests.

Create an iterator with timestamp.
```
...
read_opts.timestamp = &ts;
auto* iter = db->NewIterator(read_opts);
// target is key without timestamp.
for (iter->Seek(target); iter->Valid(); iter->Next()) {}
for (iter->SeekToFirst(); iter->Valid(); iter->Next()) {}
delete iter;
read_opts.timestamp = &ts1;
// lower_bound and upper_bound are without timestamp.
read_opts.iterate_lower_bound = &lower_bound;
read_opts.iterate_upper_bound = &upper_bound;
auto* iter1 = db->NewIterator(read_opts);
// Do Seek or SeekToFirst()
delete iter1;
```

Test plan (dev server)
```
$make check
```

Simple benchmarking (dev server)
1. The overhead introduced by this PR even when timestamp is disabled.
key size: 16 bytes
value size: 100 bytes
Entries: 1000000
Data reside in main memory, and try to stress iterator.
Repeated three times on master and this PR.
- Seek without next
```
./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3
```
master: 159047.0 ops/sec
this PR: 158922.3 ops/sec (2% drop in throughput)
- Seek and next 10 times
```
./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3 -seek_nexts=10
```
master: 109539.3 ops/sec
this PR: 107519.7 ops/sec (2% drop in throughput)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6255

Differential Revision: D19438227

Pulled By: riversand963

fbshipit-source-id: b66b4979486f8474619f4aa6bdd88598870b0746
2020-03-06 16:24:27 -08:00
Cheng Chang
0a0151fb99 Remove memcpy from RandomAccessFileReader::Read in direct IO mode (#6455)
Summary:
In direct IO mode, RandomAccessFileReader::Read allocates an internal aligned buffer, and then copies the result into the scratch buffer. If the result is only temporarily used inside a function, there is no need to do the memcpy and just let the result Slice refer to the internally allocated buffer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6455

Test Plan: make check

Differential Revision: D20106753

Pulled By: cheng-chang

fbshipit-source-id: 44f505843837bba47a56e3fa2c4dd3bd76486b58
2020-03-06 14:05:12 -08:00
Otto Kekäläinen
f6c2777d95 Fix spelling: commited -> committed (#6481)
Summary:
In most places in the code the variable names are spelled correctly as
COMMITTED but in a couple places not. This fixes them and ensures the
variable is always called COMMITTED everywhere.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6481

Differential Revision: D20306776

Pulled By: pdillinger

fbshipit-source-id: b6c1bfe41db559b4bc6955c530934460c07f7022
2020-03-06 12:45:20 -08:00
Yuqi Gu
e171a219d5 Fix db_wal_test::TruncateLastLogAfterRecoverWithoutFlush failure (#6437)
Summary:
`TruncateLastLogAfterRecoverWithoutFlush` case depends on fallocate support
of underlying file system.

On a file system which lacks of this feature, like zfs, it will fail to allocate predefined file size as this test case intends to do;

So a check block is added to detect fallocate support and skip test if not.
The related work is done by JunHe77. Thanks!

Signed-off-by: Yuqi Gu <yuqi.gu@arm.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6437

Differential Revision: D20145032

Pulled By: pdillinger

fbshipit-source-id: c8b691dc508e95acfa2a004ddbc07e2faa76680d
2020-03-05 17:18:16 -08:00
Cheng Chang
afb97094ae Skip high levels with no key falling in the range in CompactRange (#6482)
Summary:
In CompactRange, if there is no key in memtable falling in the specified range, then flush is skipped.
This PR extends this skipping logic to SST file levels: it starts compaction from the highest level (starting from L0) that has files with key falling in the specified range, instead of always starts from L0.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6482

Test Plan:
A new test ManualCompactionTest::SkipLevel is added.

Also updated a test related to statistics of index block cache hit in db_test2, the index cache hit is increased by 1 in this PR because when checking overlap for the key range in L0, OverlapWithLevelIterator will do a seek in the table cache iterator, which will read from the cached index.

Also updated db_compaction_test and db_test to use correct range for full compaction.

Differential Revision: D20251149

Pulled By: cheng-chang

fbshipit-source-id: f822157cf4796972bd5035d9d7178d8dfb7af08b
2020-03-04 20:15:25 -08:00
Zhichao Cao
e62fe50634 Introduce FaultInjectionTestFS to test fault File system instead of Env (#6414)
Summary:
In the current code base, we can use FaultInjectionTestEnv to simulate the env issue such as file write/read errors, which are used in most of the test. The PR https://github.com/facebook/rocksdb/issues/5761 introduce the File System as a new Env API. This PR implement the FaultInjectionTestFS, which can be used to simulate when File System has issues such as IO error. user can specify any IOStatus error as input, such that FS corresponding actions will return certain error to the caller.

A set of ErrorHandlerFSTests are introduced for testing
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6414

Test Plan: pass make asan_check, pass error_handler_fs_test.

Differential Revision: D20252421

Pulled By: zhichao-cao

fbshipit-source-id: e922038f8ce7e6d1da329fd0bba7283c4b779a21
2020-03-04 12:35:05 -08:00
Fabrice Fontaine
8bbd76edbf Check for sys/auxv.h (#6359)
Summary:
Check for sys/auxv.h and getauxval before using them as they are not
always available (for example on uclibc)

Signed-off-by: Fabrice Fontaine <fontaine.fabrice@gmail.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6359

Differential Revision: D20239797

fbshipit-source-id: 175a098094d81545628c2372e7c388e70a32fd48
2020-03-03 18:09:59 -08:00
Kefu Chai
03dbd11ead s/const auto/const auto&/ when doing loop (#6477)
Summary:
this silences following warning from clang-11
```
rocksdb/db/db_impl/db_impl_compaction_flush.cc:1040:21: warning: loop variable 'newf' of type 'const std::pair<int, rocksdb::FileMetaData>' creates a copy from type 'const
std::pair<int\
, rocksdb::FileMetaData>' [-Wrange-loop-analysis]
    for (const auto newf : c->edit()->GetNewFiles()) {
                    ^
rocksdb/db/db_impl/db_impl_compaction_flush.cc:1040:10: note: use reference type 'const std::pair<int, rocksdb::FileMetaData> &' to prevent copying
    for (const auto newf : c->edit()->GetNewFiles()) {
         ^~~~~~~~~~~~~~~~~
                    &
```
Signed-off-by: Kefu Chai <tchaikov@gmail.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6477

Differential Revision: D20211850

Pulled By: ltamasi

fbshipit-source-id: 3e89e13a12bba79f1b934d46b7c4c0576cdafb01
2020-03-03 08:41:57 -08:00
sumeerbhola
48d8d076a3 Add missing MutexLock to MockEnv::CreateDir (#6474)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6474

Differential Revision: D20205109

Pulled By: ltamasi

fbshipit-source-id: ec136005c63740f5b713ff537b5671ea9b8e217a
2020-03-02 20:52:19 -08:00
sdong
17bef7d3a8 Fix data race of GetCreationTimeOfOldestFile() (#6473)
Summary:
When DBImpl::GetCreationTimeOfOldestFile() calls Version::GetCreationTimeOfOldestFile(), the version is not directly or indirectly referenced, so an event like compaction can race with the operation and cause DBImpl::GetCreationTimeOfOldestFile() to access delocated data. This was caught by an ASAN run:

==268==ERROR: AddressSanitizer: heap-use-after-free on address 0x612000b7d198 at pc 0x000018332913 bp 0x7f391510d310 sp 0x7f391510d308
READ of size 8 at 0x612000b7d198 thread T845 (store_load-33)
SCARINESS: 51 (8-byte-read-heap-use-after-free)
    #0 0x18332912 in rocksdb::Version::GetCreationTimeOfOldestFile(unsigned long*) rocksdb/src/db/version_set.cc:1488
    https://github.com/facebook/rocksdb/issues/1 0x1803ddaa in rocksdb::DBImpl::GetCreationTimeOfOldestFile(unsigned long*) rocksdb/src/db/db_impl/db_impl.cc:4499
    https://github.com/facebook/rocksdb/issues/2 0xe24ca09 in rocksdb::StackableDB::GetCreationTimeOfOldestFile(unsigned long*) rocksdb/utilities/stackable_db.h:392
    ......
0x612000b7d198 is located 216 bytes inside of 296-byte region [0x612000b7d0c0,0x612000b7d1e8)
freed by thread T28 here:
    ......
    https://github.com/facebook/rocksdb/issues/5 0x1832c73f in std::vector<rocksdb::FileMetaData*, std::allocator<rocksdb::FileMetaData*> >::~vector() third-party-buck/platform007/build/libgcc/include/c++/trunk/bits/stl_vector.h:435
    https://github.com/facebook/rocksdb/issues/6 0x1832c73f in rocksdb::VersionStorageInfo::~VersionStorageInfo() rocksdb/src/db/version_set.cc:734
    https://github.com/facebook/rocksdb/issues/7 0x1832cf42 in rocksdb::Version::~Version() rocksdb/src/db/version_set.cc:758
    https://github.com/facebook/rocksdb/issues/8 0x9d1bb5 in rocksdb::Version::Unref() rocksdb/src/db/version_set.cc:2869
    https://github.com/facebook/rocksdb/issues/9 0x183e7631 in rocksdb::Compaction::~Compaction() rocksdb/src/db/compaction/compaction.cc:275
    https://github.com/facebook/rocksdb/issues/10 0x9e6de6 in std::default_delete<rocksdb::Compaction>::operator()(rocksdb::Compaction*) const third-party-buck/platform007/build/libgcc/include/c++/trunk/bits/unique_ptr.h:78
    https://github.com/facebook/rocksdb/issues/11 0x9e6de6 in std::unique_ptr<rocksdb::Compaction, std::default_delete<rocksdb::Compaction> >::reset(rocksdb::Compaction*) third-party-buck/platform007/build/libgcc/include/c++/trunk/bits/unique_ptr.h:376
    https://github.com/facebook/rocksdb/issues/12 0x9e6de6 in rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority) rocksdb/src/db/db_impl/db_impl_compaction_flush.cc:2826
    https://github.com/facebook/rocksdb/issues/13 0x9ac3b8 in rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority) rocksdb/src/db/db_impl/db_impl_compaction_flush.cc:2320
    https://github.com/facebook/rocksdb/issues/14 0x9abff7 in rocksdb::DBImpl::BGWorkCompaction(void*) rocksdb/src/db/db_impl/db_impl_compaction_flush.cc:2096
    ......

Fix the issue by reference the super version and use the referenced version from it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6473

Test Plan: Run ASAN for all existing tests.

Differential Revision: D20196416

fbshipit-source-id: 5f4a7918110fc7b8dd7841932d376bc9d1e59d6f
2020-03-02 16:37:01 -08:00
Zhichao Cao
8d73137ae8 Replace Directory with FSDirectory in DB (#6468)
Summary:
In the current code base, we can use Directory from Env to manage directory (e.g, Fsync()). The PR https://github.com/facebook/rocksdb/issues/5761  introduce the File System as a new Env API. So we further replace the Directory class in DB with FSDirectory such that we can have more IO information from IOStatus returned by FSDirectory.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6468

Test Plan: pass make asan_check

Differential Revision: D20195261

Pulled By: zhichao-cao

fbshipit-source-id: 93962cb9436852bfcfb76e086d9e7babd461cbe1
2020-03-02 16:16:26 -08:00
Huisheng Liu
904a60ff63 return timestamp from get (#6409)
Summary:
Added new Get() methods that return timestamp. Dummy implementation is given so that classes derived from DB don't need to be touched to provide their implementation. MultiGet is not included.

ReadRandom perf test (10 minutes) on the same development machine ram drive with the same DB data shows no regression (within marge of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks.
    base line (commit 72ee067b9):
        101.712 micros/op 314602 ops/sec;   36.0 MB/s (5658999 of 5658999 found)
    This PR:
        100.288 micros/op 319071 ops/sec;   36.5 MB/s (5674999 of 5674999 found)

./db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=readrandom --use_existing_db=1 --num=25000000 --threads=32
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6409

Differential Revision: D20200086

Pulled By: riversand963

fbshipit-source-id: 490edd74d924f62bd8ae9c29c2a6bbbb8410ca50
2020-03-02 16:01:00 -08:00
Levi Tamasi
8637bc1eea Fix the description of unordered_write in db_bench (#6476)
Summary:
As reported in https://github.com/facebook/rocksdb/issues/6467, the
description of the `unordered_write` switch of `db_bench` was incorrect.
(Note: the new description is based on
https://rocksdb.org/blog/2019/08/15/unordered-write.html).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6476

Test Plan: `db_bench --help`

Differential Revision: D20200653

Pulled By: ltamasi

fbshipit-source-id: 4c3683fcfa6a069164167af5aaff9974a810c16a
2020-03-02 15:34:19 -08:00
Yanqin Jin
5f2f8cd97c Ignore compile_commands.json file (#6472)
Summary:
Both clangd and cquery-language-server requires a compile_commands.json file to
index the project. This file can be ignored by git.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6472

Differential Revision: D20194899

Pulled By: riversand963

fbshipit-source-id: ea1587f2e5d10b7591147073b61efe262a1cf747
2020-03-02 12:54:13 -08:00
sdong
9b3c9ef0e8 Add --index_with_first_key and --index_shortening_mode to DB bench (#5859)
Summary:
Some combinatino of --index_with_first_key and --index_shortening_mode can signifcantly improve performance for large values. Expose them in db_bench.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5859

Test Plan: Run them with the new options and observe the behavior.

Differential Revision: D20104434

fbshipit-source-id: 21d48a732a9caf20b82312c7d7557d747ea3c304
2020-03-02 11:55:28 -08:00
sdong
86f1ad7046 Add more unit test coverage to MultiRead (#6452)
Summary:
MultiRead tests in env_test cannot simulate the io_uring case when queries need to be submitted in multiple rounds. Add a new unit test to cover up more requests per MultiRead
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6452

Test Plan: Run it and see it pass when liburing is enabled or not enabled.

Differential Revision: D20078924

fbshipit-source-id: 6cff7fe345a4c5aa47135186e6181bf00df02b68
2020-02-28 16:42:44 -08:00
Michael R. Crusoe
051696bf98 fix some spelling typos (#6464)
Summary:
Found from Debian's "Lintian" program
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6464

Differential Revision: D20162862

Pulled By: zhichao-cao

fbshipit-source-id: 06941ee2437b038b2b8045becbe9d2c6fbff3e12
2020-02-28 14:14:03 -08:00
Manuel Ung
41535d0218 WriteUnPrepared: Pass in correct subbatch count during rollback (#6463)
Summary:
Today `WriteUnpreparedTxn::RollbackInternal` will write the rollback batch assuming that there is only a single subbatch. However, because untracked_keys_ are currently not deduplicated, it's possible for duplicate keys to exist, and thus split the batch. Also, tracked_keys_ also does not support compators outside of the bytewise comparators, so it's possible for duplicates to occur there as well.

To solve this, just pass in the correct subbatch count.

Also, removed `WriteUnpreparedRollbackPreReleaseCallback` to unify the Commit/Rollback codepaths some more.

Also, fixed a bug in `CommitInternal` where if 1. two_write_queue is true and 2. include_data is true, then `WriteUnpreparedCommitEntryPreReleaseCallback` ends up calling `AddCommitted` on the commit time write batch a second time on the second write. To fix, `WriteUnpreparedCommitEntryPreReleaseCallback` is re-initialized.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6463

Differential Revision: D20150153

Pulled By: lth

fbshipit-source-id: df0b42d39406c75af73df995aa1138f0db539cd1
2020-02-28 11:19:32 -08:00
Jermy Li
72ee067b90 fix assert error while db.getDefaultColumnFamily().getDescriptor() (#6006)
Summary:
Threw assert error at assert(isOwningHandle()) in ColumnFamilyHandle.getDescriptor(),
because default CF don't own a handle, due to [RocksDB.getDefaultColumnFamily()](3a408eeae9/java/src/main/java/org/rocksdb/RocksDB.java (L3702)) called cfHandle.disOwnNativeHandle().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6006

Differential Revision: D19031448

fbshipit-source-id: 2420c45e835bda0e552e919b1b63708472b91538
2020-02-27 12:37:46 -08:00
Cheng Chang
741decfe37 Return early on failure when constructing CuckooTableReader (#6453)
Summary:
If file is not mmaped, CuckooTableReader should not try to read table properties from the file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6453

Test Plan: Added a new unit test

Differential Revision: D20103334

Pulled By: cheng-chang

fbshipit-source-id: 48539f14d93f6c1ebe12c3df5a14719e9d7b8726
2020-02-25 16:48:28 -08:00
Andrew Kryczka
f52db84650 support SstFileManager in db_stress (#6454)
Summary:
Add some flags for configuring an SstFileManager. An
SstFileManager is only created when one or more of these flags are
set.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6454

Test Plan:
- ran it a while:

```
$ python ./tools/db_crashtest.py blackbox --simple -max_key=100000 -write_buffer_size=131072 -target_file_size_base=131072 -max_bytes_for_level_base=524288 -value_size_mult=33 --interval=10 -max_background_compactions=4 -max_background_flushes=2 -sst_file_manager_bytes_per_sec=1048576
```

- verified with strace the SstFileManager is behaving as configured:

```
$ strace -fp `pidof db_stress` -e ftruncate,unlink
...
[pid 3074805]
ftruncate(9</tmp/rocksdb_crashtest_blackbox6OJywh/000070.sst.trash>,
67423) = 0
[pid 3074805]
ftruncate(9</tmp/rocksdb_crashtest_blackbox6OJywh/000070.sst.trash>,
51039) = 0
[pid 3074805]
ftruncate(9</tmp/rocksdb_crashtest_blackbox6OJywh/000070.sst.trash>,
34655) = 0
[pid 3074805]
ftruncate(9</tmp/rocksdb_crashtest_blackbox6OJywh/000070.sst.trash>,
18271) = 0
[pid 3074805]
ftruncate(9</tmp/rocksdb_crashtest_blackbox6OJywh/000070.sst.trash>,
1887) = 0
[pid 3074805]
unlink("/tmp/rocksdb_crashtest_blackbox6OJywh/000070.sst.trash") = 0
...
```

Differential Revision: D20103315

Pulled By: ajkr

fbshipit-source-id: b3e1092747157459d244b047947a979b85c98f48
2020-02-25 16:45:30 -08:00
Andrew Kryczka
69679e7375 Fix range deletion tombstone ingestion with global seqno (#6429)
Summary:
Original author: jeffrey-xiao

If we are writing a global seqno for an ingested file, the range
tombstone metablock gets accessed and put into the cache during
ingestion preparation. At the time, the global seqno of the ingested
file has not yet been determined, so the cached block will not have a
global seqno. When the file is ingested and we read its range tombstone
metablock, it will be returned from the cache with no global seqno. In
that case, we use the actual seqnos stored in the range tombstones,
which are all zero, so the tombstones cover nothing.

This commit removes global_seqno_ variable from Block. When iterating
over a block, the global seqno for the block is determined by the
iterator instead of storing this mutable attribute in Block.
Additionally, this commit adds a regression test to check that keys are
deleted when ingesting a file with a global seqno and range deletion
tombstones.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6429

Differential Revision: D19961563

Pulled By: ajkr

fbshipit-source-id: 5cf777397fa3e452401f0bf0364b0750492487b7
2020-02-25 15:31:48 -08:00
Levi Tamasi
d87c10c6ab Add blob file state to VersionEdit (#6416)
Summary:
BlobDB currently does not keep track of blob files: no records are written to
the manifest when a blob file is added or removed, and upon opening a database,
the list of blob files is populated simply based on the contents of the blob directory.
This means that lost blob files cannot be detected at the moment. We plan to solve
this issue by making blob files a part of `Version`; as a first step, this patch makes
it possible to store information about blob files in `VersionEdit`. Currently, this information
includes blob file number, total number and size of all blobs, and total number and size
of garbage blobs. However, the format is extensible: new fields can be added in
both a forward compatible and a forward incompatible manner if needed (similarly
to `kNewFile4`).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6416

Test Plan: `make check`

Differential Revision: D19894234

Pulled By: ltamasi

fbshipit-source-id: f9753e1f2aedf6dadb70c09b345207cb9c58c329
2020-02-24 18:39:53 -08:00
sdong
eb367d45c0 Buck config: Re-enable liburing under Linux (#6451)
Summary:
The known bug of liburing has been fixed. Now we can re-enable liburing under Linux
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6451

Test Plan: Watch internal CI

Differential Revision: D20079009

fbshipit-source-id: 04a6f53a900ff721f9a62a188cf906771b5d68d2
2020-02-24 15:47:34 -08:00
Cheng Chang
b47a714051 Update release version to 6.8 (#6450)
Summary:
Update release version to 6.8
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6450

Test Plan: no code change

Differential Revision: D20071889

Pulled By: cheng-chang

fbshipit-source-id: 91450aae09b201926469ff32f59ed436366f3b74
2020-02-24 11:44:47 -08:00
Peter Dillinger
43dde332cb Share kPageSize (and other small tweaks) (#6443)
Summary:
Make kPageSize extern const size_t (used in draft https://github.com/facebook/rocksdb/issues/6427)
Make kLitteEndian constexpr bool
Clarify a couple of comments
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6443

Test Plan: make check, CI

Differential Revision: D20044558

Pulled By: pdillinger

fbshipit-source-id: e0c5cc13229c82726280dc0ddcba4078346b8418
2020-02-22 08:01:36 -08:00
sdong
942eaba091 Handle io_uring partial results (#6441)
Summary:
The logic that handles io_uring partial results was wrong. Fix the logic by putting it into a queue and continue reading.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6441

Test Plan: Make sure this patch fixes the application test case where the bug was discovered; in env_test, add a unit test that simulates partial results and make sure the results are still correct.

Differential Revision: D20018616

fbshipit-source-id: 5398a7e34d74c26d52aa69dfd604e93e95d99c62
2020-02-21 16:57:37 -08:00
Yanqin Jin
890d87fadc Some minor fix-ups (#6440)
Summary:
Cleanup some code without any real change in functionality.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6440

Differential Revision: D20015891

Pulled By: riversand963

fbshipit-source-id: 33e18754b0f002006a6d4805e9aaf84c0c8ad25a
2020-02-21 15:09:56 -08:00
Peter Dillinger
ab65278b1f Misc filter_bench improvements (#6444)
Summary:
Useful in validating/testing internal fragmentation changes (https://github.com/facebook/rocksdb/issues/6427)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6444

Test Plan: manual (no changes to production code)

Differential Revision: D20040076

Pulled By: pdillinger

fbshipit-source-id: 32d26f363d2a9ab9f5bebd281dcebd9915ae340e
2020-02-21 13:31:57 -08:00
Zaiyang Li
fcec56e86c Add function to set row cache on rocksdb_options_t (#6442)
Summary:
Adding a C API function to set `row_cache` on `rocksdb_options_t` as this functionality is missing.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6442

Differential Revision: D20036813

Pulled By: riversand963

fbshipit-source-id: c1fa95ea343345fbc1e57961d0d048e0e79be373
2020-02-21 11:12:42 -08:00
sdong
d75ce0a8ae Mention rocksdb_namespace.h in HISTORY.md (#6439)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6439

Differential Revision: D20012442

fbshipit-source-id: a7c9569826eac39cd7ea69c90f08a21dd4caa335
2020-02-20 14:55:19 -08:00
Yanqin Jin
362b8d4393 Fix MANIFEST name assignment (#6426)
Summary:
Currently, a new MANIFEST file is assigned a new file number when 1) no
MANIFEST is open, or 2) current MANIFEST file size exceeds a threshold. This is
not sufficient. There are cases when the caller explicitly specifies that a new
MANIFEST be created. For example, if user sets options.write_dbid_to_manifest = true,
and there are WAL files, then RocksDB will run into an issue during recovery.
`DBImpl::Recover()` will call `LogAndApply()` to write dbid. At this point, the db being
recovered creates a new MANIFEST, say, MANIFEST-000003. Since there are WALs,
`DBImpl::RecoverLogFiles` will be called. Towards the end of this function, we call
`LogAndApply(new_descriptor_log=true)`, which explicitly creates a new MANIFEST.
However, the manifest_file_number is wrong before this fix. Consequently, RocksDB
opens an existing, non-empty file for append, effectively truncating the file to zero.
If a crash occurs, then there will be data loss.

Test Plan (devserver):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6426

Test Plan: make check

Differential Revision: D19951866

Pulled By: riversand963

fbshipit-source-id: 4b1b9fc28d4fe2ac12764b388ef9e61f05e766da
2020-02-20 14:30:58 -08:00
sdong
fdf882ded2 Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433)
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433

Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.

Differential Revision: D19977691

fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
2020-02-20 12:09:57 -08:00
Gaurav Singh
4e33f1e1dc simplify user_access_only expression (#6360)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6360

Differential Revision: D19698918

Pulled By: riversand963

fbshipit-source-id: d20ecca541376cccd32fc7afb504ea90021860ee
2020-02-20 10:27:56 -08:00
Remington Brasga
a993cc3a62 Fixed typo in benchmark.sh (#6434)
Summary:
TB =  1024 * GB
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6434

Differential Revision: D19978339

Pulled By: zhichao-cao

fbshipit-source-id: 5a89890110b23f0ebda4a95223f66da6736321ac
2020-02-19 17:08:02 -08:00
Yanqin Jin
5a297516e1 Ignore .vs/.vscode dir used by visual studio/ vscode (#6428)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6428

Differential Revision: D19959176

Pulled By: riversand963

fbshipit-source-id: 5b863944663462b246bb82883fa683f75ab33fc1
2020-02-18 16:10:15 -08:00
Andrew Kryczka
c6abe30ee3 Fix concurrent full purge and WAL recycling (#5900)
Summary:
We were removing the file from `log_recycle_files_` before renaming it
with `ReuseWritableFile()`. Since `ReuseWritableFile()` occurs outside
the DB mutex, it was possible for a concurrent full purge to sneak in
and delete the file before it could be renamed. Consequently, `SwitchMemtable()`
would fail and the DB would enter read-only mode.

The fix is to hold the old file number in `log_recycle_files_` until
after the file has been renamed. Full purge uses that list to decide
which files to keep, so it can no longer delete a file pending recycling.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5900

Test Plan: new unit test

Differential Revision: D19771719

Pulled By: ajkr

fbshipit-source-id: 094346349ca3fb499712e62de03905acc30b5ce8
2020-02-18 13:54:13 -08:00
Andrew Kryczka
0f9dcb88b2 Return NotSupported from WriteBatchWithIndex::DeleteRange (#5393)
Summary:
As discovered in https://github.com/facebook/rocksdb/issues/5260 and https://github.com/facebook/rocksdb/issues/5392, reads on the indexed batch do not account for range tombstones. So, return `Status::NotSupported` from `WriteBatchWithIndex::DeleteRange` until we properly support it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5393

Test Plan: added unit test

Differential Revision: D19912360

Pulled By: ajkr

fbshipit-source-id: 0bbfc978ea015d64516ca708fce2429abba524cb
2020-02-18 11:18:25 -08:00
acelyc111
3a3457575d Fix compile error when LZ4 is up to r123 (#6412)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6412

Differential Revision: D19914063

Pulled By: ajkr

fbshipit-source-id: 4e401e665d4b449d24c4cdec35a4585eeda95996
2020-02-14 15:55:23 -08:00
Manuel Ung
dc23c125c3 WriteUnPrepared: Untracked keys (#6404)
Summary:
For write unprepared, some applications may bypass the transaction api, and write keys directly into the write batch. However, since they are not tracked, rollbacks (both for savepoint and transaction) are not aware that these keys have to be rolled back.

The fix is to track them in `WriteUnpreparedTxn::untracked_keys_`. This is populated whenever we flush unprepared batches into the DB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6404

Differential Revision: D19842023

Pulled By: lth

fbshipit-source-id: a9edfc643d5c905fc89da9a9a9094d30c9b70108
2020-02-14 11:31:39 -08:00
Cheng Chang
152f8a8ffe Remove unnecessary computation of index (#6406)
Summary:
`index` can be replaced by  `iter`, saving the computation of `index++`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6406

Test Plan: make check

Differential Revision: D19905056

Pulled By: cheng-chang

fbshipit-source-id: add4638959c0d2e4e77a11f3fa04ffabaf0de790
2020-02-14 08:26:23 -08:00
Cheng Chang
4034e289ad Fail fast in paranoid mode when LoadTableHandlers fail during recovering (#6368)
Summary:
Previously, when recovering version set, LoadTableHandlers failures are ignored.
If paranoid_checks is true, this failure should not be ignored, otherwise, the opened db might be in an inconsistent state.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6368

Test Plan: make check

Differential Revision: D19713459

Pulled By: cheng-chang

fbshipit-source-id: 68cb94f4f2cc43f8b024b14755193cd45cfcad55
2020-02-14 08:17:10 -08:00
wolfkdy
29e24434fe refine code (#6420)
Summary:
I create a new branch from the branch new upsteram/master and "git merge --squash".
Maybe it will fix everything.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6420

Differential Revision: D19897152

Pulled By: zhichao-cao

fbshipit-source-id: 6575d9e3b23e360f42ee1480b43028b5fcc20136
2020-02-13 18:55:02 -08:00
Manuel Ung
908b1ee64e WriteUnPrepared: Fix assertion during recovery (#6419)
Summary:
During recovery, multiple (un)prepared batches could exist in the same WAL record due to group commit. This breaks an assertion in `MemTableInserter::MarkBeginPrepare`.

To fix, reset unprepared_batch_ to false after `MarkEndPrepare`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6419

Differential Revision: D19896148

Pulled By: lth

fbshipit-source-id: b1a32ef88f775a0881264a18bd1a4a5b8c85eee3
2020-02-13 18:52:05 -08:00
Manuel Ung
fb571509a7 WriteUnPrepared: Enable WAL during crash recovery (#6418)
Summary:
Unfortunately, it seems like mysqld reuses xids across machine restarts. When that happens, we could have something like the following happening:

```
BEGIN_PREPARE(unprepared) Put(a) END_PREPARE(xid = 1)
-- crash and recover with Put(a) rolled back as it was not prepared
BEGIN_PREPARE(prepared) Put(b) END_PREPARE(xid = 1)
COMMIT(xid = 1)
-- crash and recover with both a, b
```

To solve this, we will have to log the rollback batch into the WAL during recovery.

WritePrepared already logs the rollback batch into the WAL, if a rollback happens after prepare, so there is no problem there.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6418

Differential Revision: D19896151

Pulled By: lth

fbshipit-source-id: 2ff65ddc5fe75efd57736fed4b7cd7a109d26609
2020-02-13 18:44:39 -08:00
sdong
ac8e89a443 Should flush and sync WAL when writing it in DB::Open() (#6417)
Summary:
A recent fix related to 2pc https://github.com/facebook/rocksdb/pull/6313/ writes something to WAL, but does not flush or sync. This causes assertion failure "impl->TEST_WALBufferIsEmpty()" if manual_wal_flush = true. We should fsync the entry to make sure a second power reset can recover.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6417

Test Plan: Add manual_wal_flush=true case in TransactionTest.DoubleCrashInRecovery and fix a bug in the test so that the bug can be reproduced. It passes with the fix.

Differential Revision: D19894537

fbshipit-source-id: f1e84e49e2269f583c6019743118292cd8b6598e
2020-02-13 18:41:04 -08:00
Cheng Chang
46516778dd Fix flaky test DecreaseNumBgThreads (#6393)
Summary:
The DecreaseNumBgThreads test keeps failing on Windows in AppVeyor.
It fails because it depends on a timed wait for the tasks to be dequeued from the threadpool's internal queue, but within the specified time, the task might have not been scheduled onto the newly created threads.
https://github.com/facebook/rocksdb/pull/6232 tries to fix this by waiting for longer time to let the threads scheduled.
This PR tries to fix this by replacing the timed wait with a synchronization on the task's internal conditional variable.
When the number of threads increases, instead of guessing the time needed for the task to be scheduled, it directly blocks on the conditional variable until the task starts running.
But when thread number is reduced, it still does a timed wait, but this does not lead to the flakiness now, will try to remove these timed waits in a future PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6393

Test Plan: Wait to see whether AppVeyor tests pass.

Differential Revision: D19890928

Pulled By: cheng-chang

fbshipit-source-id: 4e56e4addf625c98c0876e62d9d57a6f0a156f76
2020-02-13 17:27:18 -08:00