rocksdb

Author	SHA1	Message	Date
Eli Pozniansky	c129c75fb7	Added log_readahead_size option to control prefetching for Log::Reader (#5592 ) Summary: Added log_readahead_size option to control prefetching for Log::Reader. This is mostly useful for reading a remotely located log, as it can save the number of round-trips when reading it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5592 Differential Revision: D16362989 Pulled By: elipoz fbshipit-source-id: c5d4d5245a44008cd59879640efff70c091ad3e8	2019-07-19 12:00:19 -07:00
anand76	abd1fdddef	Fix asan_check failures Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5589 Test Plan: TEST_TMPDIR=/dev/shm/rocksdb COMPILE_WITH_ASAN=1 OPT=-g make J=64 -j64 asan_check Differential Revision: D16361081 Pulled By: anand1976 fbshipit-source-id: 09474832b9cfb318a840d4b633e22dfad105d58c	2019-07-18 14:51:25 -07:00
anand76	ec2b996b29	Fix LITE mode build failure Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5588 Test Plan: make LITE=1 all check Differential Revision: D16354543 Pulled By: anand1976 fbshipit-source-id: 327a171439e183ac3a5e5057c511d6bca445e97d	2019-07-17 22:06:12 -07:00
Venki Pallipadi	22ce462450	Export Import sst files (#5495 ) Summary: Refresh of the earlier change here - https://github.com/facebook/rocksdb/issues/5135 This is a review request for code change needed for - https://github.com/facebook/rocksdb/issues/3469 "Add support for taking snapshot of a column family and creating column family from a given CF snapshot" We have an implementation for this that we have been testing internally. We have two new APIs that together provide this functionality. (1) ExportColumnFamily() - This API is modelled after CreateCheckpoint() as below. // Exports all live SST files of a specified Column Family onto export_dir, // returning SST files information in metadata. // - SST files will be created as hard links when the directory specified // is in the same partition as the db directory, copied otherwise. // - export_dir should not already exist and will be created by this API. // - Always triggers a flush. virtual Status ExportColumnFamily(ColumnFamilyHandle* handle, const std::string& export_dir, ExportImportFilesMetaData metadata); Internally, the API will DisableFileDeletions(), GetColumnFamilyMetaData(), Parse through metadata, creating links/copies of all the sst files, EnableFileDeletions() and complete the call by returning the list of file metadata. (2) CreateColumnFamilyWithImport() - This API is modeled after IngestExternalFile(), but invoked only during a CF creation as below. // CreateColumnFamilyWithImport() will create a new column family with // column_family_name and import external SST files specified in metadata into // this column family. // (1) External SST files can be created using SstFileWriter. // (2) External SST files can be exported from a particular column family in // an existing DB. // Option in import_options specifies whether the external files are copied or // moved (default is copy). When option specifies copy, managing files at // external_file_path is caller's responsibility. 
When option specifies a // move, the call ensures that the specified files at external_file_path are // deleted on successful return and files are not modified on any error // return. // On error return, column family handle returned will be nullptr. // ColumnFamily will be present on successful return and will not be present // on error return. ColumnFamily may be present on any crash during this call. virtual Status CreateColumnFamilyWithImport( const ColumnFamilyOptions& options, const std::string& column_family_name, const ImportColumnFamilyOptions& import_options, const ExportImportFilesMetaData& metadata, ColumnFamilyHandle handle); Internally, this API creates a new CF, parses all the sst files and adds them to the specified column family, at the same level and with the same sequence number as in the metadata. It also performs safety checks with respect to overlaps between the sst files being imported. If the incoming sequence number is higher than the current local sequence number, the local sequence number is updated to reflect this. Note, as the sst files are being moved across Column Families, the Column Family name in an sst file will no longer match the actual column family on the destination DB. The API does not modify the Column Family name or id in the sst files being imported. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5495 Differential Revision: D16018881 fbshipit-source-id: 9ae2251025d5916d35a9fc4ea4d6707f6be16ff9	2019-07-17 12:27:14 -07:00
sdong	699a569c52	Remove RandomAccessFileReader.for_compaction_ (#5572 ) Summary: RandomAccessFileReader.for_compaction_ doesn't seem to be used anymore. Remove it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5572 Test Plan: USE_CLANG=1 make all check -j Differential Revision: D16286178 fbshipit-source-id: aa338049761033dfbe5e8b1707bbb0be2df5be7e	2019-07-16 16:32:18 -07:00
Levi Tamasi	3bde41b5a3	Move the filter readers out of the block cache (#5504 ) Summary: Currently, when the block cache is used for the filter block, it is not really the block itself that is stored in the cache but a FilterBlockReader object. Since this object is not pure data (it has, for instance, pointers that might dangle, including in one case a back pointer to the TableReader), it's not really sharable. To avoid the issues around this, the current code erases the cache entries when the TableReader is closed (which, BTW, is not sufficient since a concurrent TableReader might have picked up the object in the meantime). Instead of doing this, the patch moves the FilterBlockReader out of the cache altogether, and decouples the filter reader object from the filter block. In particular, instead of the TableReader owning, or caching/pinning the FilterBlockReader (based on the customer's settings), with the change the TableReader unconditionally owns the FilterBlockReader, which in turn owns/caches/pins the filter block. This change also enables us to reuse the code paths historically used for data blocks for filters as well. Note: Eviction statistics for filter blocks are temporarily broken. We plan to fix this in a separate phase. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5504 Test Plan: make asan_check Differential Revision: D16036974 Pulled By: ltamasi fbshipit-source-id: 770f543c5fb4ed126fd1e04bfd3809cf4ff9c091	2019-07-16 13:14:58 -07:00
Jim Lin	cd2520361d	Fix memory leak in `rocksdb_wal_iter_get_batch` function (#5515 ) Summary: `wal_batch.writeBatchPtr.release()` gives up the ownership of the original `WriteBatch`, but there is no new owner, which causes a memory leak. The patch is simple: removing `release()` prevents the ownership change. `std::move` is for speed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5515 Differential Revision: D16264281 Pulled By: riversand963 fbshipit-source-id: 51c556b7a1c977325c3aa24acb636303847151fa	2019-07-15 12:59:39 -07:00
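The ownership problem described in the `rocksdb_wal_iter_get_batch` entry above can be illustrated with a minimal, self-contained sketch. It does not reproduce the RocksDB code: `Batch`, `ExtractLeaky`, and `ExtractFixed` are hypothetical stand-ins, and the instance counter exists only to make the leak observable.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for WriteBatch. `live` counts existing instances.
struct Batch {
  static int live;
  std::string rep;
  explicit Batch(std::string r) : rep(std::move(r)) { ++live; }
  Batch(Batch&& other) noexcept : rep(std::move(other.rep)) { ++live; }
  ~Batch() { --live; }
};
int Batch::live = 0;

// Leaky pattern: release() gives up ownership of the heap object, the
// contents are moved out, and nobody ever deletes the released object.
Batch ExtractLeaky(std::unique_ptr<Batch>& holder) {
  assert(holder != nullptr);
  return Batch(std::move(*holder.release()));
}

// Fixed pattern: move the contents out but leave ownership with the
// unique_ptr, which still deletes the (moved-from) object later.
Batch ExtractFixed(std::unique_ptr<Batch>& holder) {
  assert(holder != nullptr);
  return Batch(std::move(*holder));
}
```

In the fixed variant the holder keeps owning a moved-from object, so the move only saves a copy of the payload; the object's lifetime is unchanged.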
Zhongyi Xie	b0259e45e0	add more tracing for stats history (#5566 ) Summary: Sample info log output from db_bench: In-memory: ``` 2019/07/12-21:39:19.478490 7fa01b3f5700 [_impl/db_impl.cc:702] ------- PERSISTING STATS ------- 2019/07/12-21:39:19.478633 7fa01b3f5700 [_impl/db_impl.cc:753] Storing 145 stats with timestamp 1562992759 to in-memory stats history 2019/07/12-21:39:19.478670 7fa01b3f5700 [_impl/db_impl.cc:766] [Pre-GC] In-memory stats history size: 1051218 bytes, slice count: 103 2019/07/12-21:39:19.478704 7fa01b3f5700 [_impl/db_impl.cc:775] [Post-GC] In-memory stats history size: 1051218 bytes, slice count: 102 ``` On-disk: ``` 2019/07/12-21:48:53.862548 7f24943f5700 [_impl/db_impl.cc:702] ------- PERSISTING STATS ------- 2019/07/12-21:48:53.862553 7f24943f5700 [_impl/db_impl.cc:709] Reading 145 stats from statistics 2019/07/12-21:48:53.862852 7f24943f5700 [_impl/db_impl.cc:737] Writing 145 stats with timestamp 1562993333 to persistent stats CF succeeded ``` ``` 2019/07/12-21:48:51.861711 7f24943f5700 [_impl/db_impl.cc:702] ------- PERSISTING STATS ------- 2019/07/12-21:48:51.861729 7f24943f5700 [_impl/db_impl.cc:709] Reading 145 stats from statistics 2019/07/12-21:48:51.861921 7f24943f5700 [_impl/db_impl.cc:732] Writing to persistent stats CF failed -- Result incomplete: Write stall ... 2019/07/12-21:48:51.873032 7f2494bf6700 [WARN] [lumn_family.cc:749] [default] Stopping writes because we have 2 immutable memtables (waiting for flush), max_write_buffer_number is set to 2 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5566 Differential Revision: D16258187 Pulled By: miasantreble fbshipit-source-id: 292497099b941418590ed4312411bee36e244dc5	2019-07-15 11:49:17 -07:00
Sergei Petrunia	61876614dc	Fix MyRocks compile warnings-treated-as-errors on Fedora 30, gcc 9.1.1 (#5553 ) Summary: - Provide assignment operator in CompactionStats - Provide a copy constructor for FileDescriptor - Remove std::move from "return std::move(t)" in BoundedQueue Pull Request resolved: https://github.com/facebook/rocksdb/pull/5553 Differential Revision: D16230170 fbshipit-source-id: fd7c6e52390b2db1be24141e25649cf62424d078	2019-07-12 17:30:51 -07:00
sdong	cb19e7411f	Fix bugs in DBWALTest.kTolerateCorruptedTailRecords triggered by #5520 (#5550 ) Summary: https://github.com/facebook/rocksdb/pull/5520 caused a buffer overflow bug in DBWALTest.kTolerateCorruptedTailRecords. Fix it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5550 Test Plan: Run the test in UBSAN. It used to fail. Now it succeeds. Differential Revision: D16165516 fbshipit-source-id: 42c56a6bc64eb091f054b87757fcbef60da825f7	2019-07-09 11:18:32 -07:00
Yanqin Jin	7c76a7fba2	Support GetAllKeyVersions() for non-default cf (#5544 ) Summary: Previously `GetAllKeyVersions()` supported the default column family only. This PR adds support for other column families. Test plan (devserver): ``` $make clean && COMPILE_WITH_ASAN=1 make -j32 db_basic_test $./db_basic_test --gtest_filter=DBBasicTest.GetAllKeyVersions ``` All other unit tests must pass. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5544 Differential Revision: D16147551 Pulled By: riversand963 fbshipit-source-id: 5a61aece2a32d789e150226a9b8d53f4a5760168	2019-07-07 22:43:52 -07:00
Zhongyi Xie	8d34806972	setup wal_in_db_path_ for secondary instance (#5545 ) Summary: PR https://github.com/facebook/rocksdb/pull/5520 adds DBImpl:: wal_in_db_path_ and initializes it in DBImpl::Open, this PR fixes the valgrind error for secondary instance: ``` ==236417== Conditional jump or move depends on uninitialised value(s) ==236417== at 0x62242A: rocksdb::DeleteDBFile(rocksdb::ImmutableDBOptions const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, bool) (file_util.cc:96) ==236417== by 0x512432: rocksdb::DBImpl::DeleteObsoleteFileImpl(int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::FileType, unsigned long) (db_impl_files.cc:261) ==236417== by 0x515A7A: rocksdb::DBImpl::PurgeObsoleteFiles(rocksdb::JobContext&, bool) (db_impl_files.cc:492) ==236417== by 0x499153: rocksdb::ColumnFamilyHandleImpl::~ColumnFamilyHandleImpl() (column_family.cc:75) ==236417== by 0x499880: rocksdb::ColumnFamilyHandleImpl::~ColumnFamilyHandleImpl() (column_family.cc:84) ==236417== by 0x4C9AF9: rocksdb::DB::DestroyColumnFamilyHandle(rocksdb::ColumnFamilyHandle) (db_impl.cc:3105) ==236417== by 0x44E853: CloseSecondary (db_secondary_test.cc:53) ==236417== by 0x44E853: rocksdb::DBSecondaryTest::~DBSecondaryTest() (db_secondary_test.cc:31) ==236417== by 0x44EC77: ~DBSecondaryTest_PrimaryDropColumnFamily_Test (db_secondary_test.cc:443) ==236417== by 0x44EC77: rocksdb::DBSecondaryTest_PrimaryDropColumnFamily_Test::~DBSecondaryTest_PrimaryDropColumnFamily_Test() (db_secondary_test.cc:443) ==236417== by 0x83D1D7: HandleSehExceptionsInMethodIfSupported<testing::Test, void> (gtest-all.cc:3824) ==236417== by 0x83D1D7: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test, void 
(testing::Test::)(), char const*) (gtest-all.cc:3860) ==236417== by 0x8346DB: testing::TestInfo::Run() [clone .part.486] (gtest-all.cc:4078) ==236417== by 0x8348D4: Run (gtest-all.cc:4047) ==236417== by 0x8348D4: testing::TestCase::Run() [clone .part.487] (gtest-all.cc:4190) ==236417== by 0x834D14: Run (gtest-all.cc:6100) ==236417== by 0x834D14: testing::internal::UnitTestImpl::RunAllTests() (gtest-all.cc:6062) ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5545 Differential Revision: D16146224 Pulled By: miasantreble fbshipit-source-id: 184c90e451352951da4e955f054d4b1a1f29ea29	2019-07-07 21:32:50 -07:00
anand76	e0d9d57750	Fix bugs in WAL trash file handling (#5520 ) Summary: 1. Cleanup WAL trash files on open 2. Don't apply deletion rate limit if WAL dir is different from db dir Pull Request resolved: https://github.com/facebook/rocksdb/pull/5520 Test Plan: Add new unit tests and make check Differential Revision: D16096750 Pulled By: anand1976 fbshipit-source-id: 6f07858ad864b754b711db416f0389c45ede599b	2019-07-06 21:07:32 -07:00
Yi Wu	4f66ec977d	Fix lower bound check error when iterating across file boundary (#5540 ) Summary: Since https://github.com/facebook/rocksdb/issues/5468 `LevelIterator` compares the lower bound and the file's smallest key in `NewFileIterator` and caches the result to reduce per-key lower bound checks. However, when iterating across a file boundary, it doesn't update the cached result since `Valid()=false`, because `Valid()` still reflects the status of the previous file iterator. Fix it by removing the `Valid()` check from `CheckMayBeOutOfLowerBound()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5540 Test Plan: See the new test. Signed-off-by: Yi Wu <yiwu@pingcap.com> Differential Revision: D16127653 fbshipit-source-id: a0691e1164658d485c17971aaa97028812f74678	2019-07-04 17:28:30 -07:00
haoyuhuang	6edc5d0719	Block cache tracing: Associate a unique id with Get and MultiGet (#5514 ) Summary: This PR associates a unique id with Get and MultiGet. This enables us to track how many blocks a Get/MultiGet request accesses. We can also measure the impact of row cache vs block cache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5514 Test Plan: make clean && COMPILE_WITH_ASAN=1 make check -j32 Differential Revision: D16032681 Pulled By: HaoyuHuang fbshipit-source-id: 775b05f4440badd58de6667e3ec9f4fc87a0af4c	2019-07-03 19:35:41 -07:00
Andrew Kryczka	0d57d93a06	Support jemalloc compiled with `--with-jemalloc-prefix` (#5521 ) Summary: Previously, if the jemalloc was built with nonempty string for `--with-jemalloc-prefix`, then `HasJemalloc()` would return false on Linux, so jemalloc would not be used at runtime. On Mac, it would cause a linker failure due to no definitions found for the weak functions declared in "port/jemalloc_helper.h". This should be a rare problem because (1) on Linux the default `--with-jemalloc-prefix` value is the empty string, and (2) Homebrew's build explicitly sets `--with-jemalloc-prefix` to the empty string. However, there are cases where `--with-jemalloc-prefix` is nonempty. For example, when building jemalloc from source on Mac, the default setting is `--with-jemalloc-prefix=je_`. Such jemalloc builds should be usable by RocksDB. The fix is simple. Defining `JEMALLOC_MANGLE` before including "jemalloc.h" causes it to define unprefixed symbols that are aliases for each of the prefixed symbols. Thanks to benesch for figuring this out and explaining it to me. Fixes https://github.com/facebook/rocksdb/issues/1462. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5521 Test Plan: build jemalloc with prefixed symbols: ``` $ ./configure --with-jemalloc-prefix=lol $ make ``` compile rocksdb against it: ``` $ WITH_JEMALLOC_FLAG=1 JEMALLOC=1 EXTRA_LDFLAGS="-L/home/andrew/jemalloc/lib/" EXTRA_CXXFLAGS="-I/home/andrew/jemalloc/include/" make -j12 ./db_bench ``` run db_bench and verify jemalloc actually used: ``` $ ./db_bench -benchmarks=fillrandom -statistics=true -dump_malloc_stats=true -stats_dump_period_sec=1 $ grep jemalloc /tmp/rocksdbtest-1000/dbbench/LOG 2019/06/29-12:20:52.088658 7fc5fb7f6700 [_impl/db_impl.cc:837] ___ Begin jemalloc statistics ___ ... ``` Differential Revision: D16092758 fbshipit-source-id: c2c358346190ed62ceb2a3547a6c4c180b12f7c4	2019-07-02 12:07:01 -07:00
Yi Wu	662ce62044	Reduce iterator key comparison for upper/lower bound check (2nd attempt) (#5468 ) Summary: This is a second attempt for https://github.com/facebook/rocksdb/issues/5111, with the fix to redo iterate bounds check after `SeekXXX()`. This is because MyRocks may change iterate bounds between seek. See https://github.com/facebook/rocksdb/issues/5111 for original benchmark result and discussion. Closes https://github.com/facebook/rocksdb/issues/5463. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5468 Test Plan: Existing rocksdb tests, plus myrocks test `rocksdb.optimizer_loose_index_scans` and `rocksdb.group_min_max`. Differential Revision: D15863332 fbshipit-source-id: ab4aba5899838591806b8673899bd465f3f53e18	2019-07-02 11:48:46 -07:00
haoyuhuang	66464d1fde	Remove multiple declarations of kMicrosInSecond. Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5526 Test Plan: OPT=-g V=1 make J=1 unity_test -j32 make clean && make -j32 Differential Revision: D16079315 Pulled By: HaoyuHuang fbshipit-source-id: 294ab439cf0db8dd5da44e30eabf0cbb2bb8c4f6	2019-07-01 15:15:12 -07:00
Yanqin Jin	1e87f2b68b	Ref and unref cfd before and after calling WaitForFlushMemTables (#5513 ) Summary: This is to prevent bg flush thread from unrefing and deleting the cfd that has been dropped by a concurrent thread. Before RocksDB calls `DBImpl::WaitForFlushMemTables`, we should increase the refcount of each `ColumnFamilyData` so that its ref count will not drop to 0 even if the column family is dropped by another thread. Otherwise the bg flush thread can deref the cfd and deletes it, causing a segfault in `WaitForFlushMemtables` upon accessing `cfd`. Test plan (on devserver): ``` $make clean && COMPILE_WITH_ASAN=1 make -j32 $make check ``` All unit tests must pass. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5513 Differential Revision: D16062898 Pulled By: riversand963 fbshipit-source-id: 37dc511f1dc99f036d0201bbd7f0a8f5677c763d	2019-07-01 14:12:02 -07:00
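The ref/unref pattern from the `WaitForFlushMemTables` entry above can be sketched as follows. This is a single-threaded illustration under stated assumptions, not the RocksDB implementation: `RefCounted` and `WaitWithPins` are hypothetical names, and the real code would guard the count with the DB mutex.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the reference count kept by ColumnFamilyData.
class RefCounted {
 public:
  void Ref() { ++refs_; }
  // Returns true when the last reference was dropped, i.e. the caller
  // is now responsible for deleting the object.
  bool Unref() {
    assert(refs_ > 0);
    return --refs_ == 0;
  }
  int refs() const { return refs_; }

 private:
  int refs_ = 0;
};

// Takes one extra reference on every object before running `wait`, and
// drops it afterwards. Objects whose count reaches zero on the final
// Unref() are returned so the caller can delete them safely.
template <typename WaitFn>
std::vector<RefCounted*> WaitWithPins(const std::vector<RefCounted*>& cfds,
                                      WaitFn wait) {
  for (RefCounted* cfd : cfds) cfd->Ref();
  wait();  // other code may Unref() here without reaching zero
  std::vector<RefCounted*> to_delete;
  for (RefCounted* cfd : cfds) {
    if (cfd->Unref()) to_delete.push_back(cfd);
  }
  return to_delete;
}
```

If another party drops its reference while the wait is in progress, the extra reference keeps the count above zero, so the object is still valid when the waiter touches it again.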
Zhongyi Xie	3886dddc3b	force flushing stats CF to avoid holding old logs (#5509 ) Summary: WAL records RocksDB writes to all column families. When a user flushes a column family, the old WAL will not accept new writes but cannot be deleted yet because it may still contain live data for other column families. (See https://github.com/facebook/rocksdb/wiki/Write-Ahead-Log#life-cycle-of-a-wal for detailed explanation) Because of this, if there is a column family that receives very infrequent writes and no manual flush is called for it, it could prevent a lot of WALs from being deleted. PR https://github.com/facebook/rocksdb/pull/5046 introduced the persistent stats column family, which is a good example of such column families. Depending on the config, it may have long intervals between writes, and the user is unaware of it, which makes it difficult to call manual flush for it. This PR addresses the problem for the persistent stats column family by forcing a flush for it when 1) another column family is flushed and 2) the persistent stats column family's log number is the smallest among all column families. This way the persistent stats column family will keep advancing its log number when necessary, allowing RocksDB to delete old WAL files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5509 Differential Revision: D16045896 Pulled By: miasantreble fbshipit-source-id: 286837b633e988417f0096ff38384742d3b40ef4	2019-07-01 11:56:43 -07:00
anand76	7259e28d91	MultiGet parallel IO (#5464 ) Summary: Enhancement to MultiGet batching to read data blocks required for keys in a batch in parallel from disk. It uses Env::MultiRead() API to read multiple blocks and reduce latency. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5464 Test Plan: 1. make check 2. make asan_check 3. make asan_crash Differential Revision: D15911771 Pulled By: anand1976 fbshipit-source-id: 605036b9af0f90ca0020dc87c3a86b4da6e83394	2019-06-30 20:56:04 -07:00
Yanqin Jin	c08c0ae731	Add C binding for secondary instance (#5505 ) Summary: Add C binding for secondary instance as well as unit test. Test plan (on devserver) ``` $make clean && COMPILE_WITH_ASAN=1 make -j20 all $./c_test $make check ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5505 Differential Revision: D16000043 Pulled By: riversand963 fbshipit-source-id: 3361ef6bfdf4ce12438cee7290a0ac203b5250bd	2019-06-27 08:58:54 -07:00
Mike Kolupaev	b4d7209428	Add an option to put first key of each sst block in the index (#5289 ) Summary: The first key is used to defer reading the data block until this file gets to the top of merging iterator's heap. For short range scans, most files never make it to the top of the heap, so this change can reduce read amplification by a lot sometimes. Consider the following workload. There are a few data streams (we'll be calling them "logs"), each stream consisting of a sequence of blobs (we'll be calling them "records"). Each record is identified by log ID and a sequence number within the log. RocksDB key is concatenation of log ID and sequence number (big endian). Reads are mostly relatively short range scans, each within a single log. Writes are mostly sequential for each log, but writes to different logs are randomly interleaved. Compactions are disabled; instead, when we accumulate a few tens of sst files, we create a new column family and start writing to it. So, a typical sst file consists of a few ranges of blocks, each range corresponding to one log ID (we use FlushBlockPolicy to cut blocks at log boundaries). A typical read would go like this. First, iterator Seek() reads one block from each sst file. Then a series of Next()s move through one sst file (since writes to each log are mostly sequential) until the subiterator reaches the end of this log in this sst file; then Next() switches to the next sst file and reads sequentially from that, and so on. Often a range scan will only return records from a small number of blocks in small number of sst files; in this case, the cost of initial Seek() reading one block from each file may be bigger than the cost of reading the actually useful blocks. Neither iterate_upper_bound nor bloom filters can prevent reading one block from each file in Seek(). 
But this PR can: if the index contains first key from each block, we don't have to read the block until this block actually makes it to the top of merging iterator's heap, so for short range scans we won't read any blocks from most of the sst files. This PR does the deferred block loading inside value() call. This is not ideal: there's no good way to report an IO error from inside value(). As discussed with siying offline, it would probably be better to change InternalIterator's interface to explicitly fetch deferred value and get status. I'll do it in a separate PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5289 Differential Revision: D15256423 Pulled By: al13n321 fbshipit-source-id: 750e4c39ce88e8d41662f701cf6275d9388ba46a	2019-06-24 20:54:04 -07:00
Yi Wu	2730fe693e	Fix ingested file and directory not being synced (#5435 ) Summary: It is not safe to assume the application has synced the SST file before ingesting it into the DB. Also, the directory holding the ingested file needs to be fsynced, otherwise the file can be lost. For the integrity of RocksDB we need to sync the ingested file and directory before applying the change to the manifest. Also, syncing after writing the global sequence number when write_global_seqno=true was removed in https://github.com/facebook/rocksdb/issues/4172. Adding it back. Fixes https://github.com/facebook/rocksdb/issues/5287. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5435 Test Plan: Test ingest file with ldb command and observe fsync/fdatasync in strace output. Tried both move_files=true and move_files=false. https://gist.github.com/yiwu-arbug/650a4023f57979056d83485fa863bef9 More test suggestions are welcome. Differential Revision: D15941675 Pulled By: riversand963 fbshipit-source-id: 389533f3923065a96df2cdde23ff4724a1810d78	2019-06-21 10:15:38 -07:00
haoyuhuang	705b8eecb4	Add more callers for table reader. (#5454 ) Summary: This PR adds more callers for table readers. This information is only used for block cache analysis so that we can know which caller accesses a block. 1. It renames the BlockCacheLookupCaller to TableReaderCaller as passing the caller from upstream requires changes to table_reader.h and TableReaderCaller is a more appropriate name. 2. It adds more table reader callers in table/table_reader_caller.h, e.g., kCompactionRefill, kExternalSSTIngestion, and kBuildTable. This PR is long as it requires modification of interfaces in table_reader.h, e.g., NewIterator. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5454 Test Plan: make clean && COMPILE_WITH_ASAN=1 make check -j32. Differential Revision: D15819451 Pulled By: HaoyuHuang fbshipit-source-id: b6caa704c8fb96ddd15b9a934b7e7ea87f88092d	2019-06-20 14:31:48 -07:00
Zhongyi Xie	24f73436fb	sanitize and limit block_size under 4GB (#5492 ) Summary: `Block::restart_index_`, `Block::restarts_`, and `Block::current_` are defined as uint32_t but `BlockBasedTableOptions::block_size` is defined as a size_t so user might see corruption as in https://github.com/facebook/rocksdb/issues/5486. This PR adds a check in `BlockBasedTableFactory::SanitizeOptions` to disallow such configurations. yiwu-arbug Pull Request resolved: https://github.com/facebook/rocksdb/pull/5492 Differential Revision: D15914047 Pulled By: miasantreble fbshipit-source-id: c943f153d967e15aee7f2795730ab8259e2be201	2019-06-20 11:45:08 -07:00
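The type mismatch behind the `block_size` entry above can be shown with a small sketch. The helper names are hypothetical and this is not the check in `BlockBasedTableFactory::SanitizeOptions`; it only illustrates why a configured size of 4GB or more cannot be tracked by `uint32_t` offsets.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Largest value the 32-bit block offsets can represent.
constexpr std::uint64_t kMaxBlockOffset =
    std::numeric_limits<std::uint32_t>::max();

// Hypothetical validation: a block size is acceptable only if every
// offset inside the block fits in 32 bits, i.e. the size is under 4GB.
bool IsBlockSizeRepresentable(std::uint64_t block_size) {
  return block_size <= kMaxBlockOffset;
}

// What happens without the check: the high bits are silently dropped.
std::uint32_t UncheckedNarrow(std::uint64_t offset) {
  return static_cast<std::uint32_t>(offset);
}

// Checked conversion for offsets that passed validation.
std::uint32_t ToBlockOffset(std::uint64_t offset) {
  assert(offset <= kMaxBlockOffset);
  return static_cast<std::uint32_t>(offset);
}
```

An offset of 4GB plus 5 bytes narrows to 5, which is the kind of silent wraparound that can surface as apparent corruption.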
Vijay Nadimpalli	24b118ad98	Combine the read-ahead logic for user reads and compaction reads (#5431 ) Summary: Currently the read-ahead logic for user reads and compaction reads go through different code paths where compaction reads create new table readers and use `ReadaheadRandomAccessFile`. This change is to unify read-ahead logic to use read-ahead in BlockBasedTableReader::InitDataBlock(). As a result of the change `ReadAheadRandomAccessFile` class and `new_table_reader_for_compaction_inputs` option will no longer be used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5431 Test Plan: make check Here is the benchmarking - https://gist.github.com/vjnadimpalli/083cf423f7b6aa12dcdb14c858bc18a5 Differential Revision: D15772533 Pulled By: vjnadimpalli fbshipit-source-id: b71dca710590471ede6fb37553388654e2e479b9	2019-06-19 14:10:46 -07:00
Simon Grätzer	fe90ed7a70	Replace Corruption with TryAgain status when new tail is not visible to TransactionLogIterator (#5474 ) Summary: When tailing the WAL with TransactionLogIterator, it used to return Corruption status to indicate that the WAL has new tail that is not visible to the iterator, which is a misleading status. The patch replaces it with TryAgain which is more descriptive of a status, indicating that the user needs to create a new iterator to fetch the recent tail. Fixes https://github.com/facebook/rocksdb/issues/5455 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5474 Differential Revision: D15898953 Pulled By: maysamyabandeh fbshipit-source-id: 40966f6457cb539e1aeb104daeada6b0e46059fc	2019-06-19 08:10:08 -07:00
Yanqin Jin	f287f8dc93	Fix a bug caused by secondary not skipping the beginning of new MANIFEST (#5472 ) Summary: While the secondary is replaying after the primary, the primary may switch to a new MANIFEST. The secondary is already able to detect and follow the primary to the new MANIFEST. However, the current implementation has a bug, described as follows. The new MANIFEST's first records have been generated by VersionSet::WriteSnapshot to describe the current state of the column families and the db as of the MANIFEST creation. Since the secondary instance has already finished recovering upon start, there is no need for the secondary to process these records. Actually, if the secondary were to replay these records, the secondary may end up adding the same SST files again to each column family, causing consistency checks done by VersionBuilder to fail. Therefore, we record the number of records to skip at the beginning of the new MANIFEST and ignore them. Test plan (on dev server) ``` $make clean && make -j32 all $./db_secondary_test ``` All existing unit tests must pass as well. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5472 Differential Revision: D15866771 Pulled By: riversand963 fbshipit-source-id: a1eec4837fb2ad13059398efb0f437e74fd53bed	2019-06-18 11:21:37 -07:00
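The skip logic described in the MANIFEST entry above can be sketched in a few lines. `ManifestTailer` is a hypothetical class, records are modeled as plain strings, and how the number of snapshot records is obtained is left as an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical replayer that follows a primary's MANIFEST files.
class ManifestTailer {
 public:
  // Called when switching to a new MANIFEST. Its first `snapshot_records`
  // records only restate state the follower has already recovered.
  void OnNewManifest(std::size_t snapshot_records) {
    to_skip_ = snapshot_records;
  }

  // Applies a record unless it belongs to the leading snapshot.
  // Returns true if the record was applied.
  bool Replay(const std::string& record) {
    assert(!record.empty());  // records are never empty in this sketch
    if (to_skip_ > 0) {
      --to_skip_;
      return false;
    }
    applied_.push_back(record);
    return true;
  }

  const std::vector<std::string>& applied() const { return applied_; }

 private:
  std::size_t to_skip_ = 0;
  std::vector<std::string> applied_;
};
```

Skipping the snapshot records avoids re-adding files that are already part of the follower's state, which is the duplication the entry says trips the consistency checks.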
Zhongyi Xie	ddd088c8b9	fix rocksdb lite and clang contrun test failures (#5477 ) Summary: recent commit `671d15cbdd` introduced some test failures: ``` ===== Running stats_history_test [==========] Running 9 tests from 1 test case. [----------] Global test environment set-up. [----------] 9 tests from StatsHistoryTest [ RUN ] StatsHistoryTest.RunStatsDumpPeriodSec monitoring/stats_history_test.cc:63: Failure dbfull()->SetDBOptions({{"stats_dump_period_sec", "0"}}) Not implemented: Not supported in ROCKSDB LITE db/db_options_test.cc:28:11: error: unused variable 'kMicrosInSec' [-Werror,-Wunused-const-variable] const int kMicrosInSec = 1000000; ``` This PR fixes these failures Pull Request resolved: https://github.com/facebook/rocksdb/pull/5477 Differential Revision: D15871814 Pulled By: miasantreble fbshipit-source-id: 0a7023914d2c1784d9d2d3f5bfb47310d4855394	2019-06-17 21:16:29 -07:00
Yanqin Jin	7d8d56413d	Override check consistency for DBImplSecondary (#5469 ) Summary: `DBImplSecondary` calls `CheckConsistency()` during open. In the past, `DBImplSecondary` did not override this function, so `DBImpl::CheckConsistency()` was called. The following can happen. The secondary instance is performing a consistency check which calls `GetFileSize(file_path)` but the file at `file_path` has been deleted by the primary instance. `DBImpl::CheckConsistency` does not account for this and fails the consistency check. This is undesirable. The solution is that we call `DBImpl::CheckConsistency()` first. If it passes, then we are good. If not, we give it a second chance and handle the case of file(s) being deleted. Test plan (on dev server): ``` $make clean && make -j20 all $./db_secondary_test ``` All other existing unit tests must pass as well. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5469 Differential Revision: D15861845 Pulled By: riversand963 fbshipit-source-id: 507d72392508caed3cd003bb2e2aa43f993dd597	2019-06-17 15:39:55 -07:00
Zhongyi Xie	671d15cbdd	Persistent Stats: persist stats history to disk (#5046 ) Summary: This PR continues the work in https://github.com/facebook/rocksdb/pull/4748 and https://github.com/facebook/rocksdb/pull/4535 by adding a new DBOption `persist_stats_to_disk` which instructs RocksDB to persist stats history to RocksDB itself. When statistics is enabled, and both options `stats_persist_period_sec` and `persist_stats_to_disk` are set, RocksDB will periodically write stats to a built-in column family in the following form: key -> (timestamp in microseconds)#(stats name), value -> stats value. The existing API `GetStatsHistory` will detect the current value of `persist_stats_to_disk` and either read from in-memory data structure or from the hidden column family on disk. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5046 Differential Revision: D15863138 Pulled By: miasantreble fbshipit-source-id: bb82abdb3f2ca581aa42531734ac799f113e931b	2019-06-17 15:21:50 -07:00
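The key layout given in the persistent stats entry above, `(timestamp in microseconds)#(stats name)`, can be sketched as below. The helper names are hypothetical, and the zero-padding to a fixed width is an assumption added here so that keys sort chronologically as byte strings; the entry itself does not specify the encoding.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Assumed fixed width: the number of decimal digits in the largest uint64_t.
constexpr std::size_t kTimestampWidth = 20;

// Builds "<zero-padded timestamp>#<stats name>".
std::string MakeStatsKey(std::uint64_t timestamp_micros,
                         const std::string& stats_name) {
  const std::string digits = std::to_string(timestamp_micros);
  assert(digits.size() <= kTimestampWidth);
  return std::string(kTimestampWidth - digits.size(), '0') + digits + '#' +
         stats_name;
}

// Splits a key back into its parts. Returns false on malformed input.
bool ParseStatsKey(const std::string& key, std::uint64_t* timestamp_micros,
                   std::string* stats_name) {
  const std::size_t sep = key.find('#');
  if (sep == std::string::npos || sep == 0) return false;
  for (std::size_t i = 0; i < sep; ++i) {
    if (!std::isdigit(static_cast<unsigned char>(key[i]))) return false;
  }
  try {
    *timestamp_micros = std::stoull(key.substr(0, sep));
  } catch (const std::out_of_range&) {
    return false;  // digit run does not fit in 64 bits
  }
  *stats_name = key.substr(sep + 1);
  return true;
}
```

With fixed-width timestamps, a range scan between two keys returns every stat recorded in that time window, grouped by timestamp.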
Sagar Vemuri	f1219644ec	Validate CF Options when creating a new column family (#5453 ) Summary: It seems like CF Options are not properly validated when creating a new column family with `CreateColumnFamily` API; only a selected few checks are done. Calling `ColumnFamilyData::ValidateOptions`, which is the single source for all CFOptions validations, will help fix this. (`ColumnFamilyData::ValidateOptions` is already called at the time of `DB::Open`). Test Plan: Added a new test: `DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions` ``` TEST_TMPDIR=/dev/shm ./db_test --gtest_filter=DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions ``` Also ran gtest-parallel to make sure the new test is not flaky. ``` TEST_TMPDIR=/dev/shm ~/gtest-parallel/gtest-parallel ./db_test --gtest_filter=DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions --repeat=10000 [10000/10000] DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions (15 ms) ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5453 Differential Revision: D15816851 Pulled By: sagar0 fbshipit-source-id: 9e702b9850f5c4a7e0ef8d39e1e6f9b81e7fe1e5	2019-06-14 14:11:10 -07:00
haoyuhuang	bb4178066d	Integrate block cache tracer into db_impl (#5433 ) Summary: This PR integrates the block cache tracer class into db_impl.cc. db_impl.cc contains a member variable of AtomicBlockCacheTraceWriter class and passes its reference to the block_based_table_reader. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5433 Differential Revision: D15728016 Pulled By: HaoyuHuang fbshipit-source-id: 23d5659e8c82d556833dcc1a5558aac8c1f7db71	2019-06-13 15:43:10 -07:00
Levi Tamasi	a3b8c76d8e	Add missing check before calling PurgeObsoleteFiles in EnableFileDeletions (#5448 ) Summary: Calling PurgeObsoleteFiles with a JobContext for which HaveSomethingToDelete is false is a precondition violation. This would trigger an assertion in debug builds; however, in release builds with assertions disabled, this can result in the pending_purge_obsolete_files_ counter in DBImpl underflowing, which in turn can lead to the process hanging during database close. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5448 Differential Revision: D15792569 Pulled By: ltamasi fbshipit-source-id: 82d92c9b4f6a9efcdc69dbb3d5a52a1ae2dd2472	2019-06-13 14:43:13 -07:00
Maysam Yabandeh	f43edff9ac	Disable kPipelinedWrite in MultiThreaded (#5442 ) Summary: TSAN tests report a race condition. We temporarily exclude kPipelinedWrite from MultiThreaded until the race condition is fixed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5442 Differential Revision: D15782349 Pulled By: maysamyabandeh fbshipit-source-id: 42b4f9b3fa9137f0675e13ad132c0a06800c1bdd	2019-06-12 10:37:40 -07:00
Levi Tamasi	ba64a4cf52	Revert "Reduce iterator key comparison for upper/lower bound check (#5111 )" (#5440 ) Summary: This reverts commit `f3a7847598`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5440 Differential Revision: D15765967 Pulled By: ltamasi fbshipit-source-id: d027fe24132e3729289cd7c01857a7eb449d9dd0	2019-06-11 16:23:41 -07:00
Yanqin Jin	7177dc46a1	Handle missing WAL in secondary mode (#5323 ) Summary: In secondary mode, it is possible that the secondary lists the primary's WAL directory, finds a WAL and tries to open it. It is possible that the primary deletes the WAL after secondary listing dir but before the secondary opening it. Then the secondary will fail to open the WAL file with a PathNotFound status. In this case, we can return OK without replaying WAL and optionally replay more MANIFEST. Test Plan (on my dev machine): Without this PR, the following will fail several times out of 100 runs. ``` ~/gtest-parallel/gtest-parallel -r 100 -w 16 ./db_secondary_test --gtest_filter=DBSecondaryTest.SwitchToNewManifestDuringOpen ``` With this PR, the above should always succeed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5323 Differential Revision: D15763878 Pulled By: riversand963 fbshipit-source-id: c7164fa7cb8d9001abc258b6a2dc93613e4f38ff	2019-06-11 13:08:28 -07:00
sdong	58c4aee42e	TransactionUtil::CheckKey() to skip unnecessary history (#4941 ) Summary: If a memtable definitely covers a key, there isn't a need to check older memtables. We can skip them by checking the earliest sequence number. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4941 Differential Revision: D13932666 fbshipit-source-id: b9d52f234b8ad9dd3bf6547645cd457175a3ca9b	2019-06-11 11:46:42 -07:00
Levi Tamasi	a94aef6596	Fix DBTest.DynamicMiscOptions so it passes even with Snappy disabled (#5438 ) Summary: This affects our "no compression" automated tests. Since PR #5368, DBTest.DynamicMiscOptions has been failing with: db/db_test.cc:4889: Failure dbfull()->SetOptions({{"compression", "kSnappyCompression"}}) Invalid argument: Compression type Snappy is not linked with the binary. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5438 Differential Revision: D15752100 Pulled By: ltamasi fbshipit-source-id: 3f19eff7cafc03b333965be0203c5853d2a9cb71	2019-06-10 18:47:58 -07:00
Maysam Yabandeh	c8c1a549f0	Avoid deadlock between mutex_ and log_write_mutex_ (#5437 ) Summary: To avoid deadlock mutex_ should never be acquired before log_write_mutex_. The patch documents that and also fixes one case in ::FlushWAL that acquires mutex_ through ::WriteStatusCheck when it already holds lock on log_write_mutex_. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5437 Differential Revision: D15749722 Pulled By: maysamyabandeh fbshipit-source-id: f57b69c44b4b80cc6d7ddf3d3fdf4a9eb5a5a45a	2019-06-10 17:06:50 -07:00
Maysam Yabandeh	b2584577fa	Remove global locks from FlushScheduler (#5372 ) Summary: FlushScheduler's methods are instrumented with debug-time locks to check the scheduler state against a simple container definition. Since https://github.com/facebook/rocksdb/pull/2286 the scope of such locks are widened to the entire methods' body. The result is that the concurrency tested during testing (in debug mode) is stricter than the concurrency level manifested at runtime (in release mode). The patch reverts this change to reduce the scope of such locks. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5372 Differential Revision: D15545831 Pulled By: maysamyabandeh fbshipit-source-id: 01d69191afb1dd807d4bdc990fc74813ae7b5426	2019-06-10 16:50:26 -07:00
Yanqin Jin	641cc8d541	Use CreateLoggerFromOptions function (#5427 ) Summary: Use `CreateLoggerFromOptions` function to reduce code duplication. Test plan (on my machine) ``` $make clean && make -j32 db_secondary_test $KEEP_DB=1 ./db_secondary_test ``` Verify all info logs of the secondary instance are properly logged. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5427 Differential Revision: D15748922 Pulled By: riversand963 fbshipit-source-id: bad7261df1b8373efc504f141efc7871e375a311	2019-06-10 16:00:30 -07:00
haoyuhuang	5efa0d6b0d	Create a BlockCacheLookupContext to enable fine-grained block cache tracing. (#5421 ) Summary: BlockCacheLookupContext only contains the caller for now. We will trace block accesses at five places: 1. BlockBasedTable::GetFilter. 2. BlockBasedTable::GetUncompressedDict. 3. BlockBasedTable::MaybeReadAndLoadToCache. (To trace access on data, index, and range deletion block.) 4. BlockBasedTable::Get. (To trace the referenced key and whether the referenced key exists in a fetched data block.) 5. BlockBasedTable::MultiGet. (To trace the referenced key and whether the referenced key exists in a fetched data block.) We create the context at: 1. BlockBasedTable::Get. (kUserGet) 2. BlockBasedTable::MultiGet. (kUserMGet) 3. BlockBasedTable::NewIterator. (either kUserIterator, kCompaction, or external SST ingestion calls this function.) 4. BlockBasedTable::Open. (kPrefetch) 5. Index/Filter::CacheDependencies. (kPrefetch) 6. BlockBasedTable::ApproximateOffsetOf. (kCompaction or kUserApproximateSize). I loaded 1 million key-value pairs into the database and ran the readrandom benchmark with a single thread. I gave the block cache 10 GB to make sure all reads hit the block cache after warmup. The throughput is comparable. Throughput of this PR: 231334 ops/s. Throughput of the master branch: 238428 ops/s. 
Experiment setup: RocksDB: version 6.2 Date: Mon Jun 10 10:42:51 2019 CPU: 24 * Intel Core Processor (Skylake) CPUCache: 16384 KB Keys: 20 bytes each Values: 100 bytes each (100 bytes after compression) Entries: 1000000 Prefix: 20 bytes Keys per prefix: 0 RawSize: 114.4 MB (estimated) FileSize: 114.4 MB (estimated) Write rate: 0 bytes/second Read rate: 0 ops/second Compression: NoCompression Compression sampling rate: 0 Memtablerep: skip_list Perf Level: 1 Load command: ./db_bench --benchmarks="fillseq" --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --statistics --cache_index_and_filter_blocks --cache_size=10737418240 --disable_auto_compactions=1 --disable_wal=1 --compression_type=none --min_level_to_compress=-1 --compression_ratio=1 --num=1000000 Run command: ./db_bench --benchmarks="readrandom,stats" --use_existing_db --threads=1 --duration=120 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --statistics --cache_index_and_filter_blocks --cache_size=10737418240 --disable_auto_compactions=1 --disable_wal=1 --compression_type=none --min_level_to_compress=-1 --compression_ratio=1 --num=1000000 --duration=120 TODOs: 1. Create a caller for external SST file ingestion and differentiate the callers for iterator. 2. Integrate tracer to trace block cache accesses. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5421 Differential Revision: D15704258 Pulled By: HaoyuHuang fbshipit-source-id: 4aa8a55f8cb1576ffb367bfa3186a91d8f06d93a	2019-06-10 15:33:27 -07:00
Yanqin Jin	6ce5580882	Improve memtable earliest seqno assignment for secondary instance (#5413 ) Summary: In a regular RocksDB instance, `MemTable::earliest_seqno_` is "db sequence number at the time of creation". However, we cannot use the db sequence number to set the value of `MemTable::earliest_seqno_` for a secondary instance, i.e. `DBImplSecondary`, due to the logic of MANIFEST and WAL replay. When replaying the log files of the primary, the secondary instance first replays MANIFEST and updates the db sequence number if necessary. Next, the secondary replays WAL files, creates new memtables if necessary and inserts key-value pairs into memtables. The following can occur when the db has two or more column families. Assume the db has column families "default" and "cf1". At a certain point in time, both "default" and "cf1" have data in memtables. 1. Primary triggers a flush and flushes "cf1". "default" is not flushed. 2. Secondary replays the MANIFEST and updates its db sequence number to the latest value learned from the MANIFEST. 3. Secondary starts to replay WAL that contains the writes to "default". It is possible that the write batches' sequence numbers are smaller than the db sequence number. In this case, these write batches will be skipped, and these updates will not be visible to readers until "default" is later flushed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5413 Differential Revision: D15637407 Pulled By: riversand963 fbshipit-source-id: 3de3fe35cfc6f1b9f844f3f926f0df29717b6580	2019-06-10 12:58:14 -07:00
Maysam Yabandeh	c292dc8540	WritePrepared: reduce prepared_mutex_ overhead (#5420 ) Summary: The patch reduces the contention over prepared_mutex_ using these techniques: 1) Move ::RemovePrepared() to be called from the commit callback when we have two write queues. 2) Use two separate mutexes for PreparedHeap: prepared_mutex_, needed for ::RemovePrepared, and ::push_pop_mutex(), needed for ::AddPrepared(). Given that we call ::AddPrepared only from the first write queue and ::RemovePrepared mostly from the second, the two write queues no longer compete with each other over a single mutex. ::RemovePrepared might occasionally need to acquire ::push_pop_mutex() if ::erase() ends up calling ::pop(). 3) Acquire ::push_pop_mutex() on the first callback of the write queue and release it on the last. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5420 Differential Revision: D15741985 Pulled By: maysamyabandeh fbshipit-source-id: 84ce8016007e88bb6e10da5760ba1f0d26347735	2019-06-10 11:53:31 -07:00
anand76	b703a56e5c	Potential fix for stress test failure due to "SST file ahead of WAL" error (#5412 ) Summary: I'm not able to prove it, but the stress test failure may be caused by the following sequence of events - 1. Crash db_stress while writing the log file. This should result in a corrupted WAL. 2. Run db_stress with recycle_log_file_num=1. Crash during recovery immediately after writing manifest and updating the current file. The old log from the previous run is left behind, but the memtable would have been flushed during recovery and the CF log number will point to the newer log 3. Run db_stress with recycle_log_file_num=0. During recovery, the old log file will be processed and the corruption will be detected. Since the CF has moved ahead, we get the "SST file is ahead of WAL" error Test - 1. stress_crash 2. make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/5412 Differential Revision: D15699120 Pulled By: anand1976 fbshipit-source-id: 9092ce81e7c4a0b4b4e66560c23ea4812a4d9cbe	2019-06-07 15:35:47 -07:00
Levi Tamasi	0f48e56f96	Revert to checking the upper bound on a per-key basis in BlockBasedTableIterator (#5428 ) Summary: PR #5111 reduced the number of key comparisons when iterating with upper/lower bounds; however, this caused a regression for MyRocks. Reverting to the previous behavior in BlockBasedTableIterator as a hotfix. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5428 Differential Revision: D15721038 Pulled By: ltamasi fbshipit-source-id: 5450106442f1763bccd17f6cfd648697f2ae8b6c	2019-06-07 15:17:05 -07:00
Zhongyi Xie	d68f9f4580	simplify include directive involving inttypes (#5402 ) Summary: When using `PRIu64` type of printf specifier, current code base does the following: ``` #ifndef __STDC_FORMAT_MACROS #define __STDC_FORMAT_MACROS #endif #include <inttypes.h> ``` However, this can be simplified to ``` #include <cinttypes> ``` as long as flag `-std=c++11` is used. This should solve issues like https://github.com/facebook/rocksdb/issues/5159 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5402 Differential Revision: D15701195 Pulled By: miasantreble fbshipit-source-id: 6dac0a05f52aadb55e9728038599d3d2e4b59d03	2019-06-06 13:56:07 -07:00
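A minimal sketch of the simplified include in use, assuming a C++11 or later compiler. `FormatU64` is a hypothetical helper written for this illustration and is not part of RocksDB:

```cpp
#include <cinttypes>  // defines PRIu64 in C++11, no __STDC_FORMAT_MACROS needed
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: formats a 64-bit unsigned value using the PRIu64 macro.
std::string FormatU64(uint64_t value) {
  char buf[32];  // a uint64_t has at most 20 decimal digits, plus the terminator
  std::snprintf(buf, sizeof(buf), "%" PRIu64, value);
  return std::string(buf);
}
```

The string literal `"%" PRIu64` relies on adjacent-literal concatenation, so the space between the two tokens is required in C++11.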
Yanqin Jin	340ed4fac7	Add support for timestamp in Get/Put (#5079 ) Summary: It's useful to be able to (optionally) associate key-value pairs with user-provided timestamps. This PR is an early effort towards this goal and continues the work of facebook#4942. A suite of new unit tests exist in DBBasicTestWithTimestampWithParam. Support for timestamp requires the user to provide timestamp as a slice in `ReadOptions` and `WriteOptions`. All timestamps of the same database must share the same length, format, etc. The format of the timestamp is the same throughout the same database, and the user is responsible for providing a comparator function (Comparator) to order the <key, timestamp> tuples. Once created, the format and length of the timestamp cannot change (at least for now). Test plan (on devserver): ``` $COMPILE_WITH_ASAN=1 make -j32 all $./db_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/* $make check ``` All tests must pass. We also run the following db_bench tests to verify whether there is regression on Get/Put while timestamp is not enabled. ``` $TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillseq,readrandom -num=1000000 $TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=1000000 ``` Repeat for 6 times for both versions. Results are as follows: ``` \| \| readrandom \| fillrandom \| \| master \| 16.77 MB/s \| 47.05 MB/s \| \| PR5079 \| 16.44 MB/s \| 47.03 MB/s \| ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5079 Differential Revision: D15132946 Pulled By: riversand963 fbshipit-source-id: 833a0d657eac21182f0f206c910a6438154c742c	2019-06-05 23:10:47 -07:00
haoyuhuang	227b5d52df	Make RocksDB secondary instance respect atomic groups in version edits. (#5411 ) Summary: With this commit, RocksDB secondary instance respects atomic groups in version edits. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5411 Differential Revision: D15617512 Pulled By: HaoyuHuang fbshipit-source-id: 913f4ede391d772dcaf5649e3cd2099fa292d120	2019-06-04 10:56:19 -07:00
Andrew Kryczka	ebe89ef9d8	Fix merging range tombstone covering put during flush/compaction (#5406 ) Summary: Flush/compaction use `MergeUntil` which has a special code path to handle a merge ending with a non-`Merge` point key. In particular if that key is a `Put` we forgot to check whether it is covered by a range tombstone. If it is covered then we must not include it in the following call to `TimedFullMerge`. Fixes #5392. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5406 Differential Revision: D15611144 Pulled By: sagar0 fbshipit-source-id: ba6a7863ca2d043f591de78fd0c4f4561f0c500e	2019-06-04 10:24:14 -07:00
anand76	5d6e8df1cf	Ignore shutdown error during compaction (#5400 ) Summary: The PR #5275 separated the column dropped and shutdown status codes. However, there were a couple of places in compaction where this change ended up treating a ShutdownInProgress() error as a real error and set bg_error. This caused MyRocks unit test to fail due to WAL writes during shutdown returning this error. Fix it by ignoring the shutdown status during compaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5400 Differential Revision: D15611680 Pulled By: anand1976 fbshipit-source-id: c602e97840e3ae24eb420d61e0ce95d3e6258632	2019-06-03 22:40:43 -07:00
Maysam Yabandeh	ae05a83e19	Call ValidateOptions from SetOptions (#5368 ) Summary: Currently we validate options in DB::Open. However the validation step is missing when options are dynamically updated in ::SetOptions. The patch fixes that. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5368 Differential Revision: D15540101 Pulled By: maysamyabandeh fbshipit-source-id: d27bbffd8f0252d1b50bcf59e0a70a278ed937f4	2019-06-03 19:49:57 -07:00
Siying Dong	5851cb7fdb	Move util/trace_replay.* to trace_replay/ (#5376 ) Summary: util/ is meant for lower-level libraries. trace_replay is highly integrated with DB and sometimes calls DB. Move it out to a separate directory. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5376 Differential Revision: D15550938 Pulled By: siying fbshipit-source-id: f46dce5ceffdc05a73f26379c7bb1b79ebe6c207	2019-06-03 13:25:26 -07:00
Siying Dong	000b9ec217	Move some logging related files to logging/ (#5387 ) Summary: Many logging related source files are under util/. It will be more structured if they are together. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5387 Differential Revision: D15579036 Pulled By: siying fbshipit-source-id: 3850134ed50b8c0bb40a0c8ae1f184fa4081303f	2019-05-31 17:23:59 -07:00
Vijay Nadimpalli	cae22c53fb	Make format Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5395 Differential Revision: D15581698 Pulled By: vjnadimpalli fbshipit-source-id: f415972f16e784b1361714c202b97defcab46767	2019-05-31 15:24:43 -07:00
Vijay Nadimpalli	49c5a12dbe	Organizing rocksdb/db directory Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5390 Differential Revision: D15579388 Pulled By: vjnadimpalli fbshipit-source-id: 5bfc95e31554b8ff05b97b76d6534113f527f366	2019-05-31 11:57:01 -07:00
Zhongyi Xie	ab8f6c01a6	move LevelCompactionPicker to a separate file (#5369 ) Summary: In order to improve code readability, this PR moves LevelCompactionBuilder and LevelCompactionPicker to compaction_picker_level.h and .cc Pull Request resolved: https://github.com/facebook/rocksdb/pull/5369 Differential Revision: D15540172 Pulled By: miasantreble fbshipit-source-id: c1a578b93f127cd63661b53f32b356e6edd349af	2019-05-30 21:38:24 -07:00
Sagar Vemuri	ff9d286877	Reorder DBImpl's private section (#5385 ) Summary: The methods and fields in the private section of DBImpl were all intermingled, making it hard to figure out where the fields/methods start and where they end. I cleaned up the code a little so that all the type declarations are at the beginning, followed by methods, and all the data fields are at the end. This follows Pull Request resolved: https://github.com/facebook/rocksdb/pull/5385 Differential Revision: D15566978 Pulled By: sagar0 fbshipit-source-id: 4618a7d819ad4e2d7cc9ae1af2c59f400140bb1b	2019-05-30 21:32:46 -07:00
Yanqin Jin	b9f5900658	Fix WAL replay by skipping old write batches (#5170 ) Summary: 1. Fix a bug in WAL replay in which write batches with old sequence numbers are mistakenly inserted into memtables. 2. Add support for benchmarking secondary instance to db_bench_tool. With changes made in this PR, we can start benchmarking secondary instance using two processes. It is also possible to vary the frequency at which the secondary instance tries to catch up with the primary. The info log of the secondary can be found in a directory whose path can be specified with '-secondary_path'. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5170 Differential Revision: D15564608 Pulled By: riversand963 fbshipit-source-id: ce97688ed3d33f69d3a0b9266ebbbbf887aa0ec8	2019-05-30 19:33:33 -07:00
Siying Dong	8843129ece	Move some memory related files from util/ to memory/ (#5382 ) Summary: Move arena, allocator, and memory tools under util to a separate memory/ directory. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5382 Differential Revision: D15564655 Pulled By: siying fbshipit-source-id: 9cd6b5d0d3d52b39606e19221fa154596e5852a5	2019-05-30 17:44:09 -07:00
Yanqin Jin	f1302ebab8	Add class-level comments to version-related classes (#5348 ) Summary: As title. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5348 Differential Revision: D15564595 Pulled By: riversand963 fbshipit-source-id: dd45aa86a70e0343c2e9ef702fad165163f548e6	2019-05-30 16:18:33 -07:00
Sagar Vemuri	1b59a490ef	Fix flaky DBTest2.PresetCompressionDict test (#5378 ) Summary: Fix flaky DBTest2.PresetCompressionDict test. This PR fixes two issues with the test: 1. Replaces `GetSstFiles` with `TotalSize`, which is based on `DB::GetColumnFamilyMetaData` so that only the size of the live SST files is taken into consideration when computing the total size of all sst files. Earlier, with `GetSstFiles`, even obsolete files were getting picked up. 2. In ZSTD compression, it is sometimes possible that using a trained dictionary is not better than using an untrained one. Using a trained dictionary performs well in 99% of the cases, but still in the remaining ~1% of the cases (out of 10000 runs) using an untrained dictionary gets better compression results. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5378 Differential Revision: D15559100 Pulled By: sagar0 fbshipit-source-id: c35adbf13871f520a2cec48f8bad9ff27ff7a0b4	2019-05-30 16:11:27 -07:00
Vijay Nadimpalli	50e470791d	Organizing rocksdb/table directory by format Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5373 Differential Revision: D15559425 Pulled By: vjnadimpalli fbshipit-source-id: 5d6d6d615582bedd96a4b879bb25d429a6de8b55	2019-05-30 14:51:11 -07:00
Sagar Vemuri	e62986260f	Fix env_options_for_read spelling in CompactionJob Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5380 Differential Revision: D15563386 Pulled By: sagar0 fbshipit-source-id: 8b26aef47cfc40ff8016daf815582f21cdd40df2	2019-05-30 14:04:53 -07:00
Levi Tamasi	1e35584251	Move the index readers out of the block cache (#5298 ) Summary: Currently, when the block cache is used for index blocks as well, it is not really the index block that is stored in the cache but an IndexReader object. Since this object is not pure data (it has, for instance, pointers that might dangle), it's not really sharable. To avoid the issues around this, the current code uses a dummy unique cache key for each TableReader to store the IndexReader, and erases the IndexReader entry when the TableReader is closed. Instead of doing this, the new code moves the IndexReader out of the cache altogether. In particular, instead of the TableReader owning, or caching/pinning the IndexReader based on the customer's settings, the TableReader unconditionally owns the IndexReader, which in turn owns/caches/pins the index block (which is itself sharable and thus can be safely put in the cache without any hacks). Note: the change has two side effects: 1) Partitions of partitioned indexes no longer affect the read amplification statistics. 2) Eviction statistics for index blocks are temporarily broken. We plan to fix this in a separate phase. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5298 Differential Revision: D15303203 Pulled By: ltamasi fbshipit-source-id: 935a69ba59d87d5e44f42e2310619b790c366e47	2019-05-30 11:53:27 -07:00
Siying Dong	e9e0101ca4	Move test related files under util/ to test_util/ (#5377 ) Summary: There are too many types of files under util/. Some test related files don't belong there or are just loosely related. Move them to a new directory test_util/, so that util/ is cleaner. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5377 Differential Revision: D15551366 Pulled By: siying fbshipit-source-id: 0f5c8653832354ef8caa31749c0143815d719e2c	2019-05-30 11:25:51 -07:00
anand76	a984040f0b	Increase Trash/DB size ratio in DBSSTTest.RateLimitedWALDelete (#5366 ) Summary: By increasing the ratio, we ensure that all files go through background deletion and eliminate flakiness due to timing of deletions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5366 Differential Revision: D15549992 Pulled By: anand1976 fbshipit-source-id: d137375cd791fc1a802841412755d6e2b8fd7688	2019-05-30 11:12:59 -07:00
Zhongyi Xie	87fe4bcab8	Fix FIFO dynamic options sanitization (#5367 ) Summary: When dynamically setting options, we check the option type info and skip options that are marked deprecated. However this check is only done at top level, which results in bugs where SetOptions will corrupt option values and cause unexpected system behavior iff a deprecated second level option is set dynamically. For example, the following call: ``` dbfull()->SetOptions( {{"compaction_options_fifo", "{allow_compaction=true;max_table_files_size=1024;ttl=731;}"}}); ``` was from pre 6.0 release when `ttl` was part of `compaction_options_fifo`. Now that it got moved out of `compaction_options_fifo`, this call will incorrectly set `compaction_options_fifo.max_table_files_size` to 731 (as `max_table_files_size` is the first one in `OptionsHelper::fifo_compaction_options_type_info` struct) and cause files to get evicted much faster than expected. This PR adds verification to second level options like `compaction_options_fifo.ttl` or `compaction_options_fifo.max_table_files_size` when set dynamically, and filters out those marked as deprecated. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5367 Differential Revision: D15530998 Pulled By: miasantreble fbshipit-source-id: 818258be5c3abe09cd82d62f3c083572d70fecdd	2019-05-30 10:46:28 -07:00
Siying Dong	545d206040	Move some file related files outside util/ (#5375 ) Summary: util/ is meant for lower-level libraries, so it's a good idea to move out the files that require knowledge of DB. Create a file/ directory and move some files there. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5375 Differential Revision: D15550935 Pulled By: siying fbshipit-source-id: 61a9715dcde5386eebfb43e93f847bba1ae0d3f2	2019-05-29 20:47:06 -07:00
Siying Dong	4d0c3b1f96	Add comments in compaction_picker.h Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5357 Differential Revision: D15522825 Pulled By: siying fbshipit-source-id: d775386b9d10c7179f5d3af2c821ed213abfacdf	2019-05-28 12:24:38 -07:00
Zhongyi Xie	a466120cd5	improve comments in db_impl_secondary Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5360 Differential Revision: D15502973 Pulled By: miasantreble fbshipit-source-id: 15b7f9d7928e771a6fac0643861173be8ba6b37a	2019-05-24 15:32:03 -07:00
anand76	029b98984e	Add some comments in table_cache.h Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5343 Differential Revision: D15485831 Pulled By: anand1976 fbshipit-source-id: 8735ccfba90d7ecb3559e63f792e34527f04ed29	2019-05-24 14:26:43 -07:00
Siying Dong	6267ed251a	Improve comment in db_impl.h (#5338 ) Summary: Add some comments in db_impl.h. Also reordered function order a little bit so that I can add a comment to flag the area of functions implementing DB interface. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5338 Differential Revision: D15498284 Pulled By: siying fbshipit-source-id: 3d7c59c8303577fe44d13c74ae84c7ce05164f77	2019-05-24 13:09:55 -07:00
Siying Dong	f69e63dc5f	Improve comments in compaction.h Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5356 Differential Revision: D15499033 Pulled By: siying fbshipit-source-id: 069ae48669484beaf668dd90389b8743b3309dc3	2019-05-24 12:24:28 -07:00
Siying Dong	596cc1547a	Update comments in column_family.h (#5347 ) Summary: Document relationships of data structures declared in column_family.h Pull Request resolved: https://github.com/facebook/rocksdb/pull/5347 Differential Revision: D15496941 Pulled By: siying fbshipit-source-id: 47b37835abba26aa31a94fabea6b2775483e0ccb	2019-05-24 12:07:15 -07:00
Zhongyi Xie	767d1f3ff1	Improve comments for StatsHistoryIterator and InMemoryStatsHistoryIterator Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5346 Differential Revision: D15497679 Pulled By: miasantreble fbshipit-source-id: c10caf10293c3d9663bfb398a0d331326d1e9e67	2019-05-24 11:40:05 -07:00
Zhongyi Xie	88ff80780b	improve comment for WalManager (#5350 ) Summary: att Pull Request resolved: https://github.com/facebook/rocksdb/pull/5350 Differential Revision: D15496467 Pulled By: miasantreble fbshipit-source-id: c29c0b143bf4df2040695a82be0feb9814ddb641	2019-05-24 10:40:30 -07:00
haoyuhuang	74a334a2eb	Provide an option so that SST ingestion won't fall back to copy after hard linking fails (#5333 ) Summary: RocksDB always tries to perform a hard link operation on the external SST file to ingest. This operation can fail if the external SST resides on a different device/FS, or the underlying FS does not support hard link. Currently RocksDB assumes that if the link fails, the user is willing to perform file copy, which is not true according to the post. This commit provides an option named 'failed_move_fall_back_to_copy' for users to choose which behavior they want. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5333 Differential Revision: D15457597 Pulled By: HaoyuHuang fbshipit-source-id: f3626e13f845db4f7ed970a53ec8a2b1f0d62214	2019-05-23 21:58:52 -07:00
Zhongyi Xie	09b534cc2f	improve comments for CompactionJob (#5341 ) Summary: add class/function level comments to the header file Pull Request resolved: https://github.com/facebook/rocksdb/pull/5341 Differential Revision: D15485442 Pulled By: miasantreble fbshipit-source-id: 9f11e2a1cd3ce0f4990f01353d0a6f4b050615cf	2019-05-23 16:57:46 -07:00
Siying Dong	02830a20f8	Add comments in db/dbformat.h Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5342 Differential Revision: D15485238 Pulled By: siying fbshipit-source-id: a56b374584cb1d815c1173907a807d90b37d4dd6	2019-05-23 16:44:20 -07:00
Siying Dong	dc30a9b69b	Add comments to db/db_iter.h (#5340 ) Summary: Add file comment in db/db_iter.h and minor changes in other parts. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5340 Differential Revision: D15484605 Pulled By: siying fbshipit-source-id: 173771f9d5bd51303de5410ee5afd0a4af9d6572	2019-05-23 16:11:38 -07:00
Thomas Fersch	3d9d77d900	Restrict L0->L0 compaction according to max_compaction_bytes option (#5329 ) Summary: Modified FindIntraL0Compaction to stop picking more files if total amount of compensated bytes would be larger than max_compaction_bytes option. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5329 Differential Revision: D15435728 Pulled By: ThomasFersch fbshipit-source-id: d118a6da88d5df8ee20944422ade37cf6b15d60c	2019-05-22 23:40:57 -07:00
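The stopping rule from the commit above might look roughly like this. It is a sketch under assumptions: `L0File` and `PickIntraL0Span` are hypothetical names, and the real `FindIntraL0Compaction` applies other conditions that are not modeled here.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a level-0 file; only its size matters here.
struct L0File {
  uint64_t compensated_bytes;
};

// Returns how many files, taken from the front of `files`, would be picked.
// Picking stops before a file that would push the running total of
// compensated bytes above max_compaction_bytes.
size_t PickIntraL0Span(const std::vector<L0File>& files,
                       uint64_t max_compaction_bytes) {
  uint64_t total = 0;
  size_t picked = 0;
  for (const L0File& f : files) {
    if (total + f.compensated_bytes > max_compaction_bytes) {
      break;
    }
    total += f.compensated_bytes;
    ++picked;
  }
  return picked;
}
```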
haoyuhuang	518cd1a62a	Use GetCurrentManifestPath to locate current MANIFEST file (#5331 ) Summary: In version_set.cc, there is a function GetCurrentManifestPath. The goal of this task is to refactor ListColumnFamilies function so that ListColumnFamilies calls GetCurrentManifestPath to search for MANIFEST. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5331 Differential Revision: D15444524 Pulled By: HaoyuHuang fbshipit-source-id: 1dcbd030bc0f2e835695741f450bba150f2f2903	2019-05-22 09:21:56 -07:00
Siying Dong	b2274da0e5	LogWriter to only flush after finish generating whole record (#5328 ) Summary: Right now, in log writer, we call flush after writing each physical record. I don't see the necessity of it. Right now, the underlying writer has a buffer, so there isn't a concern that the write request is too large either. On the other hand, in an Env where every flush is expensive, the current approach is significantly slower than only flushing after a whole record finishes, when the record is very large. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5328 Differential Revision: D15425032 Pulled By: siying fbshipit-source-id: 440ebef002dfbb60c59d8388c9ddfc83d79700aa	2019-05-21 12:33:17 -07:00
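To make the cost difference in the commit above concrete, the sketch below counts flushes for one logical record under both strategies. It is an illustration only: `CountFlushes` and `FlushCounts` are hypothetical names, the block and header sizes are passed in as parameters rather than taken from RocksDB, and the record is assumed to start at a block boundary.

```cpp
#include <cassert>
#include <cstddef>

// Flush counts for writing one logical record (hypothetical type).
struct FlushCounts {
  size_t per_fragment;  // flush after every physical record
  size_t per_record;    // flush once, after the whole logical record
};

// Assumes the record starts at a block boundary and that every physical
// record carries a fixed-size header inside its block.
FlushCounts CountFlushes(size_t record_bytes, size_t block_bytes,
                         size_t header_bytes) {
  assert(block_bytes > header_bytes);
  const size_t payload_per_block = block_bytes - header_bytes;
  // Ceiling division: number of physical records needed for the payload.
  size_t fragments = (record_bytes + payload_per_block - 1) / payload_per_block;
  if (fragments == 0) {
    fragments = 1;  // an empty record is still written as one fragment
  }
  return FlushCounts{fragments, 1};
}
```

The gap grows linearly with record size, which matches the commit's point that the old behavior hurts most for very large records on an Env where each flush is expensive.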
Siying Dong	cd43446d01	Improve DBTablePropertiesTest.GetPropertiesOfTablesInRange (#5302 ) Summary: DBTablePropertiesTest.GetPropertiesOfTablesInRange sometimes hits the assert that generated LSM-tree doesn't have L1 file. Tighten the compaction triggering condition even further, hoping it goes away. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5302 Differential Revision: D15325971 Pulled By: siying fbshipit-source-id: 3e032bdb16fe8d98d5fcfcd65dd8be9781f3d6ae	2019-05-20 13:50:53 -07:00
Vijay Nadimpalli	931c9df886	Use separate status code for column family drop and db shutdown in progress (#5275 ) Summary: Currently RocksDB uses Status::ShutdownInProgress to inform about column family drop. I would like to have a separate Status code for this event. https://github.com/facebook/rocksdb/blob/master/include/rocksdb/status.h#L55 Comment on this: `abc4202e47/db/version_set.cc (L2742)`:L2743 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5275 Differential Revision: D15204583 Pulled By: vjnadimpalli fbshipit-source-id: 95e99e34b27bc165b554ecb8a48a7f8e60f21e2a	2019-05-20 10:47:32 -07:00
Maysam Yabandeh	5c0e304170	WritePrepared: Clarify the need for two_write_queues in unordered_write (#5313 ) Summary: WritePrepared transactions, when configured with two_write_queues=true, offer higher throughput with the unordered_write feature without compromising the rocksdb guarantees. This is because they perform ordering among writes in a 2nd step that is not tied to memtable write speed. The 2nd step is naturally provided by 2PC when the commit phase does the ordering as well. Without 2PC, the 2nd step would only be provided when we use two_write_queues=true, where WritePrepared, after performing the writes, uses the 2nd queue in a 2nd step to assign order to the writes. The patch clarifies the need for two_write_queues=true in the HISTORY and inline comments of unordered_write. Moreover, it extends the stress tests of WritePrepared to unordered_write. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5313 Differential Revision: D15379977 Pulled By: maysamyabandeh fbshipit-source-id: 5b6f05b9b59285dcbf3b0532215ba9fe7d926e00	2019-05-20 07:49:20 -07:00
Yanqin Jin	fb4c6a31ce	Log replay integration for secondary instance (#5305 ) Summary: RocksDB secondary can replay both MANIFEST and WAL now. On the one hand, the memory usage by memtables will grow after replaying WAL for some time. On the other hand, replaying the MANIFEST can bring the database persistent data to a more recent point in time, giving us the opportunity to discard some memtables containing out-dated data. This PR coordinates the MANIFEST and WAL replay, using the updates from MANIFEST replay to update the active memtable and immutable memtable list of each column family. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5305 Differential Revision: D15386512 Pulled By: riversand963 fbshipit-source-id: a3ea6fc415f8382d8cf624f52a71ebdcffa3e355	2019-05-17 19:19:51 -07:00
yiwu-arbug	f3a7847598	Reduce iterator key comparison for upper/lower bound check (#5111 ) Summary: Previously, if an iterator upper/lower bound was present, `DBIter` would check the bound for every key. This patch turns the check into a per-file or per-data-block check when applicable, by checking against either the file's largest/smallest key or the block index key. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5111 Differential Revision: D15330061 Pulled By: siying fbshipit-source-id: 8a653fe3cd50d94d81eb2d13b087326c58ee2024	2019-05-17 10:28:31 -07:00
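The idea in the commit above, checking the bound once per file instead of once per key, can be sketched as below. This is a simplified model, not the `DBIter` code: `ScanFile` and `ScanResult` are hypothetical, keys are plain strings compared bytewise, and only an exclusive upper bound on a single sorted file is modeled.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Outcome of scanning one sorted file (hypothetical type).
struct ScanResult {
  size_t keys_returned;
  size_t bound_comparisons;
};

// Scans keys in order up to an exclusive upper bound. If the file's largest
// key is already below the bound, no per-key bound check is needed.
ScanResult ScanFile(const std::vector<std::string>& sorted_keys,
                    const std::string& upper_bound) {
  ScanResult r{0, 0};
  if (sorted_keys.empty()) {
    return r;
  }
  ++r.bound_comparisons;  // one check against the file's largest key
  const bool file_within_bound = sorted_keys.back() < upper_bound;
  for (const std::string& key : sorted_keys) {
    if (!file_within_bound) {
      ++r.bound_comparisons;  // fall back to the per-key check
      if (!(key < upper_bound)) {
        break;
      }
    }
    ++r.keys_returned;
  }
  return r;
}
```

When the whole file lies under the bound, the scan pays for a single comparison regardless of how many keys the file holds.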
Siying Dong	f82e693a31	RangeDelAggregator::StripeRep::Invalidate() to be skipped if empty (#5312 ) Summary: RangeDelAggregator::StripeRep::Invalidate() clears several vectors. If we know there is nothing to clear, we can save this small amount of CPU. Profiling shows that it sometimes takes a non-negligible amount of CPU. Worth a small optimization. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5312 Differential Revision: D15380511 Pulled By: siying fbshipit-source-id: 53c5f34c33b4cb1e743643c6086ac56d0b84ec2e	2019-05-16 15:24:28 -07:00
Yanqin Jin	1583cb402e	Fix a flaky test with test sync point (#5310 ) Summary: If DB is opened with `avoid_unnecessary_blocking_io` being true, then `~ColumnFamilyHandleImpl` enqueues a purge request and schedules a background thread to perform the deletion. Without test sync point, whether the SST file is purged or not at a later point in time is not deterministic. If the SST does not exist, it will cause an assertion failure. How to reproduce: ``` $git checkout `6492430eaf` $make -j20 deletefile_test $gtest-parallel --repeat 1000 --worker 16 ./deletefile_test --gtest_filter=DeleteFileTest.BackgroundPurgeCFDropTest ``` The test may fail a few times. With changes made in this PR, repeat the above commands, and the test should not fail. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5310 Differential Revision: D15361136 Pulled By: riversand963 fbshipit-source-id: c4308d5f8da83472c893bf7f8ceed347fbfa850f	2019-05-15 15:17:55 -07:00
Maysam Yabandeh	f0e8216197	WritePrepared: Fix deadlock in WriteRecoverableState (#5306 ) Summary: The recent improvement in https://github.com/facebook/rocksdb/pull/3661 could cause a deadlock: When writing recoverable state, we also commit its sequence number to commit table, which could result into evicting existing commit entry, which could result into advancing max_evicted_seq_, which would need to get snapshots from database, which requires obtaining db mutex. The patch releases db_mutex before calling the callback in WriteRecoverableState to avoid the potential deadlock. It also improves the stress tests to let the issue be manifested in the tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5306 Differential Revision: D15341458 Pulled By: maysamyabandeh fbshipit-source-id: 05dcbed7e21b789fd1e5fd5ee8eea08077162323	2019-05-15 13:53:54 -07:00
Thomas Fersch	a42757607d	Use pre-increment instead of post-increment for iterators (#5296 ) Summary: Google C++ style guide indicates pre-increment should be used for iterators: https://google.github.io/styleguide/cppguide.html#Preincrement_and_Predecrement. Replaced all instances of ' it++' by ' ++it' (where type is iterator). So this covers the cases where iterators are named 'it'. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5296 Differential Revision: D15301256 Pulled By: tfersch fbshipit-source-id: 2803483c1392504ad3b281d21db615429c71114b	2019-05-15 13:19:15 -07:00
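For readers unfamiliar with the style rule cited in the commit above, a minimal example of the preferred form follows. `SumValues` is a hypothetical helper written for this illustration and does not appear in RocksDB.

```cpp
#include <map>
#include <string>

// Sums the mapped values, advancing the iterator with pre-increment.
// `++it` advances in place, whereas `it++` must return a copy of the
// iterator's previous value, which the loop would simply discard.
int SumValues(const std::map<std::string, int>& m) {
  int total = 0;
  for (auto it = m.begin(); it != m.end(); ++it) {
    total += it->second;
  }
  return total;
}
```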
Maysam Yabandeh	3c3252a06a	Fix tsan complaint in ConcurrentMergeWrite test (#5308 ) Summary: The test was not using a separate MemTablePostProcessInfo per memtable insert thread, and thus tsan was complaining about a data race. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5308 Differential Revision: D15356420 Pulled By: maysamyabandeh fbshipit-source-id: 46c2f2d19fb02c3c775b587aa09ca9c0dae6ed04	2019-05-15 11:21:48 -07:00
anand76	6492430eaf	Fix a bug in db_stress and an incorrect assertion in FilePickerMultiGet (#5301 ) Summary: This PR has two fixes for crash test failures - 1. Fix a bug in TestMultiGet() in db_stress that was passing the list of keys to MultiGet() in the wrong order, thus ensuring that actual values don't match expected values 2. Remove an incorrect assertion in FilePickerMultiGet::GetNextFileInLevelWithKeys() that checks that files in a level are in sorted order. This is not true with MultiGet(), especially if there are duplicate keys and we may have to go back one file for the next key. Furthermore, this assertion makes more sense when a new version is created, rather than at lookup time Test - asan_crash and ubsan_crash tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/5301 Differential Revision: D15337383 Pulled By: anand1976 fbshipit-source-id: 35092cb15bbc1700e5e823cbe07bfa62f1e9e6c6	2019-05-14 11:58:04 -07:00
Maysam Yabandeh	f383641a1d	Unordered Writes (#5218 ) Summary: Performing unordered writes in rocksdb when unordered_write option is set to true. When enabled the writes to memtable are done without joining any write thread. This offers much higher write throughput since the upcoming writes would not have to wait for the slowest memtable write to finish. The tradeoff is that the writes visible to a snapshot might change over time. If the application cannot tolerate that, it should implement its own mechanisms to work around that. Using TransactionDB with WRITE_PREPARED write policy is one way to achieve that. Doing so increases the max throughput by 2.2x without however compromising the snapshot guarantees. The patch is prepared based on an original by siying. Existing unit tests are extended to include unordered_write option. Benchmark Results: ``` TEST_TMPDIR=/dev/shm/ ./db_bench_unordered --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions --unordered_write=1 ``` With WAL - Vanilla RocksDB: 78.6 MB/s - WRITE_PREPARED with unordered_write: 177.8 MB/s (2.2x) - unordered_write: 368.9 MB/s (4.7x with relaxed snapshot guarantees) Without WAL - Vanilla RocksDB: 111.3 MB/s - WRITE_PREPARED with unordered_write: 259.3 MB/s (2.3x) - unordered_write: 645.6 MB/s (5.8x with relaxed snapshot guarantees) - WRITE_PREPARED with unordered_write disable concurrency control: 185.3 MB/s (2.35x) Limitations: - The feature is not yet extended to `max_successive_merges` > 0. The feature is also incompatible with `enable_pipelined_write` = true as well as with `allow_concurrent_memtable_write` = false. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5218 Differential Revision: D15219029 Pulled By: maysamyabandeh fbshipit-source-id: 38f2abc4af8780148c6128acdba2b3227bc81759	2019-05-13 17:47:21 -07:00
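The limitations listed in the Unordered Writes commit above amount to a small compatibility check, sketched here. `WriteConfig` and `IsUnorderedWriteConfigValid` are hypothetical names for this illustration; only the four option names are taken from the commit message, and how RocksDB itself validates these options is not shown in the log.

```cpp
#include <cstddef>

// Hypothetical bundle of the options named in the commit message.
struct WriteConfig {
  bool unordered_write;
  bool enable_pipelined_write;
  bool allow_concurrent_memtable_write;
  size_t max_successive_merges;
};

// Returns true when the combination respects the stated limitations:
// unordered_write requires pipelined writes to be off, concurrent memtable
// writes to be on, and max_successive_merges to be 0.
bool IsUnorderedWriteConfigValid(const WriteConfig& c) {
  if (!c.unordered_write) {
    return true;  // the limitations only apply when the feature is enabled
  }
  return !c.enable_pipelined_write && c.allow_concurrent_memtable_write &&
         c.max_successive_merges == 0;
}
```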
Yanqin Jin	e626016545	Fix a race condition caused by unlocking db mutex (#5294 ) Summary: Previous code may call `~ColumnFamilyData` in `DBImpl::AtomicFlushMemTablesToOutputFiles` if the column family is dropped or `cfd->IsFlushPending() == false`. In `~ColumnFamilyData`, the db mutex is released briefly and re-acquired. This can cause correctness issue. The reason is as follows. Assume there are more bg flush threads. After bg_flush_thr1 releases the db mutex, bg_flush_thr2 can grab it and pop an element from the flush queue. This will cause bg_flush_thr2 to accidentally pick some memtables which should have been picked by bg_flush_thr1. To make the matter worse, bg_flush_thr2 can clear `flush_requested_` flag for the memtable list, causing a subsequent call to `MemTableList::IsFlushPending()` by bg_flush_thr1 to return false, which is wrong. The fix is to delay `ColumnFamilyData::Unref` and `~ColumnFamilyData` for column families not selected for flush until `AtomicFlushMemTablesToOutputFiles` returns. Furthermore, a bg flush thread should not clear `MemTableList::flush_requested_` in `MemTableList::PickMemtablesToFlush` unless atomic flush is not used or the memtable list does not have unpicked memtables. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5294 Differential Revision: D15295297 Pulled By: riversand963 fbshipit-source-id: 03b101205ca22c242647cbf488bcf0ed80b2ecbd	2019-05-10 17:56:48 -07:00
Jelte Fennema	6451673f37	Add C bindings for LowerThreadPoolIO/CPUPriority (#5285 ) Summary: There were no C bindings for lowering thread pool priority. This adds those. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5285 Differential Revision: D15290050 Pulled By: siying fbshipit-source-id: b2ed94d0c39d27434ace2204829a242b53d0d67a	2019-05-09 18:21:21 -07:00

3716 Commits