Summary:
Previously, a `ReadOptions` object was stored in every `BlockBasedTableIterator`
and every `LevelIterator`. This redundancy consumes extra memory,
resulting in the `Arena` making more allocations, and iteration
observing worse cache performance.
This PR migrates callers of `NewInternalIterator()` and
`MakeInputIterator()` to provide a `ReadOptions` object guaranteed to
outlive the returned iterator. When the iterator's lifetime will be managed by the
user, this lifetime guarantee is achieved by storing the `ReadOptions`
value in `ArenaWrappedDBIter`. Then, sub-iterators of `NewInternalIterator()` and
`MakeInputIterator()` can hold a reference-to-const `ReadOptions`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7210
Test Plan:
- `make check` under ASAN and valgrind
- benchmark: on a DB with 2 L0 files and 3 L1+ levels, this PR reduced `Arena` allocation 4792 -> 4160 bytes.
Reviewed By: anand1976
Differential Revision: D22861323
Pulled By: ajkr
fbshipit-source-id: 54aebb3e89c872eeab0f5793b4b6e42878d093ce
Summary:
Very small improvements to document the defaults.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7215
Reviewed By: zhichao-cao
Differential Revision: D22902286
fbshipit-source-id: a754d172a0d8e4c03754f6f1771d4a693d60a770
Summary:
Adds compaction statistics (total bytes read and written) for compactions that occur for delete-triggered, periodic, and TTL compaction reasons.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7165
Test Plan:
TTL and periodic can be checked by runnning db_bench with the options activated:
/db_bench --benchmarks="fillrandom,stats" --statistics --num=10000000 -base_background_compactions=16 -periodic_compaction_seconds=1
./db_bench --benchmarks="fillrandom,stats" --statistics --num=10000000 -base_background_compactions=16 -fifo_compaction_ttl=1
Setting the time to one second causes non-zero bytes read/written for those compaction reasons. Disabling them or setting them to times longer than the test run length causes the stats to return to zero as expected.
Delete-triggered compaction counting is tested in DBTablePropertiesTest.DeletionTriggeredCompactionMarking
Reviewed By: ajkr
Differential Revision: D22693050
Pulled By: akabcenell
fbshipit-source-id: d15cef4d94576f703015c8942d5f0d492f69401d
Summary:
NextAndGetResult() is not implemented in memtable and is very simply implemented in level iterator. The result is that for a normal leveled iterator, performance regression will be observed for calling PrepareValue() for most iterator Next(). Mitigate the problem by implementing the function for both iterators. In level iterator, the implementation cannot be perfect as when calling file iterator's SeekToFirst() we don't have information about whether the value is prepared. Fortunately, the first key should not cause a big portion of the CPu.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7179
Test Plan: Run normal crash test for a while.
Reviewed By: anand1976
Differential Revision: D22783840
fbshipit-source-id: c19f45cdf21b756190adef97a3b66ccde3936e05
Summary:
Make (most of) the env*_test pass when ASSERT_STATUS_CHECKED is enabled.
One test that opens a database is currently disabled in this mode, as there are many errors that need revisited for DB tests and status checks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7176
Reviewed By: cheng-chang
Differential Revision: D22799278
Pulled By: ajkr
fbshipit-source-id: 16d8a02eaeecd6df1060249b6a5811292801f2ed
Summary:
`USE_LTO=1` in `make` commands now enables LTO. The archiver (`ar`) needed
to change in this PR to use a wrapper that enables the LTO plugin.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7181
Test Plan:
build a few ways
```
$ make clean && USE_LTO=1 make -j48 db_bench
$ make clean && USE_CLANG=1 USE_LTO=1 make -j48 db_bench
$ make clean && ROCKSDB_NO_FBCODE=1 USE_LTO=1 make -j48 db_bench
```
Reviewed By: cheng-chang
Differential Revision: D22784994
Pulled By: ajkr
fbshipit-source-id: 9c45333bd49bf4615aa04c85b7c6fd3925421152
Summary:
Fix the data race warning by removing an unnecessary variable that causes the warning.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7177
Test Plan:
`COMPILE_WITH_TSAN=1 make backupable_db_test`
`./backupable_db_test --gtest_filter=*TableFileWithDbChecksumCorruptedDuringBackup*`
Reviewed By: riversand963
Differential Revision: D22774430
Pulled By: gg814
fbshipit-source-id: 3b0b1ac344d0375c64da564cc97f98745c289959
Summary:
On Travis, the old `alignment()` returned by `RandomAccessFileReaderTest` is inconsistent with the `GetRequiredBufferAlignment` returned in `RandomAccessFileReader`. This PR removes `alignment()` and consistently use `GetRequiredBufferAlignment` as page size.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7173
Test Plan:
make random_access_file_reader_test && ./random_access_file_reader_test
Watch Travis
Reviewed By: siying
Differential Revision: D22741606
Pulled By: cheng-chang
fbshipit-source-id: f28f29a7c993bbc3594ae70ecd186fa8bab9c4f2
Summary:
BackupableDBTest.RateLimiting test is failing due to timed out
on our test server. It might be because of nested loops run sequentially that test different type of combinations of parameters. This patch converts the test into parameterized test so that all combinations can be tested out.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7167
Test Plan: make check -j64
Reviewed By: zhichao-cao
Differential Revision: D22709531
Pulled By: akankshamahajan15
fbshipit-source-id: 95518153e87b3b5311a6c1960a191bca58898786
Summary:
SST Partitioner interface that allows to split SST files during compactions.
It basically instruct compaction to create a new file when needed. When one is using well defined prefixes and prefixed way of defining tables it is good to define also partitioning so that promotion of some SST file does not cover huge key space on next level (worst case complete space).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6957
Reviewed By: ajkr
Differential Revision: D22461239
fbshipit-source-id: 9ce07bba08b3ba89c2d45630520368f704d1316e
Summary:
There is a typo in TryMerge which may cause MultiRead to internally read more data than expected, but won't affect MultiRead results' correctness.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7157
Test Plan: make random_access_file_reader_test && ./random_access_file_reader_test
Reviewed By: siying
Differential Revision: D22670257
Pulled By: cheng-chang
fbshipit-source-id: d261289455a65aa496b348c6e5582b48b12963b7
Summary:
BlobDB currently syncs each blob file periodically after writing a certain amount of
data (as specified by the configuration option `BlobDBOptions::bytes_per_sync`)
and all open blob files when the base DB's memtables are flushed. With the patch,
in addition to the above, blob files are also synced right before being closed, after
the footer has been written. This will be beneficial for the new integrated blob file
write path as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7160
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D22672646
Pulled By: ltamasi
fbshipit-source-id: 62b34263543a7e74abcbb7adf011daa1e699998f
Summary:
TryMerge() overzealously creates one huge file read request in an attempt to merge smaller disjoint requests. For example, ~30 input requests of ~100 bytes output as 1 request of 100 MiB causing alarmingly large read throughputs to be repeatedly observed by the environment.
Signed-off-by: Jason Volk <jason@zemos.net>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6979
Reviewed By: siying
Differential Revision: D22668892
Pulled By: cheng-chang
fbshipit-source-id: 7506fe9621b7f1a747dadf6b8ddb1b1a141c1937
Summary:
`DBTest::SnapshotFiles` runs the tests in a `while` loop.
Currently, the snapshot directory is not cleaned up in each loop, so previous snapshot files may remain in the next loop's snapshot.
When I'm working on https://github.com/facebook/rocksdb/pull/7129, when checking the tracked WALs in MANIFEST, I find that this test always fails because it reads some unknown WAL. It turns out that the unknown WAL is left from previous loops.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7156
Test Plan: make db_test && ./db_test --gtest_filters=*SnapshotFiles
Reviewed By: siying
Differential Revision: D22668360
Pulled By: cheng-chang
fbshipit-source-id: 69d4aa3506038ba30e218e8ae966357935a99c6c
Summary:
When paraoid_files_checks=true, a rolling key-value hash is generated and compared to what is written to the file. If the values do not match, the SST file is rejected.
Code put in place for the check for both flush and compaction jobs. Corresponding test added to corruption_test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7134
Reviewed By: cheng-chang
Differential Revision: D22646149
fbshipit-source-id: 8fde1984a1a11edd3bd82a413acffc5ea7aa683f
Summary:
Issue https://github.com/facebook/rocksdb/issues/7133 reported that using `system_clock` in `FileOperationInfo::TimePoint` causes the duration of file flush operation (which can be a noop on MacOS in some scenarios) appears to be 0 and fail an assertion in listener_test. Using `steady_clock` supposedly fixed the problem.
`steady_clock` actually fits better into the use cases of `FileOperationInfo::TimePoint` as all usages care about durations but not wall clock time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7153
Test Plan: make check.
Reviewed By: riversand963
Differential Revision: D22654136
Pulled By: roghnin
fbshipit-source-id: 5980b1080734bdae496a18071a2c2b5887c67d85
Summary:
BackupEngine requires computing table checksums twice when backing up table files to the `shared_checksum` directory.
The repeated computation can be avoided by utilizing the db session id stored as a part of the table properties.
Filenames of table files in the `shared_checksum` directory depend on the following conditions:
1. the naming scheme is `kOptionalChecksumAndDbSessionId`,
2. `db_session_id` is not empty,
3. checksum is available in the DB manifest.
If 1,2,3 are satisfied, then the filenames will be of the form `<file_number>_<checksum>_<db_session_id>.sst`.
If 1,2 are satisfied, then the filenames will be of the form `<file_number>_<db_session_id>.sst`.
In all other cases, the filenames are of the form `<file_number>_<checksum>_<size>.sst`.
Additionally, if `kOptionalChecksumAndDbSessionId` is used (and not falling back to `kChecksumAndFileSize`), the `<checksum>` appeared in the filenames is hexadecimally encoded, instead of being plain `uint32_t` value.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7110
Test Plan: backupable_db_test and manual tests.
Reviewed By: ajkr
Differential Revision: D22508992
Pulled By: gg814
fbshipit-source-id: 5669f0ea9ad5a097f69f6d87aca4abba15032389
Summary:
Two TSAN tests occaionaly fail. Exclude them for now:
[ RUN ] DeleteFileTest.BackgroundPurgeCFDropTest
db/deletefile_test.cc:122: Failure
Expected equality of these values:
required_manifest
Which is: 1
manifest_cnt
Which is: 2
[ RUN ] FormatLatest/ColumnFamilyTest.FlushCloseWALFiles/0
db/column_family_test.cc:3004: Failure
Expected equality of these values:
2
env.num_open_wal_file_.load()
Which is: 1
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7152
Test Plan: Watch CircleCI restuls
Reviewed By: ajkr
Differential Revision: D22632285
fbshipit-source-id: 29fa348e8be917be0237c74812a8b0b04978e84e
Summary:
TSAN reports warning in one column_family_test:
WARNING: ThreadSanitizer: data race (pid=16352)
Write of size 8 at 0x7ffcdf042158 by main thread:
#0 pthread_cond_destroy <null> (column_family_test+0x471f65)
https://github.com/facebook/rocksdb/issues/1 rocksdb::port::CondVar::~CondVar() /home/circleci/project/port/port_posix.cc:101:49 (column_family_test+0x8a627a)
https://github.com/facebook/rocksdb/issues/2 rocksdb::test::SleepingBackgroundTask::~SleepingBackgroundTask() /home/circleci/project/./test_util/testutil.h:397:7 (column_family_test+0x54b6e2)
https://github.com/facebook/rocksdb/issues/3 rocksdb::ColumnFamilyTest_FlushCloseWALFiles_Test::TestBody() /home/circleci/project/db/column_family_test.cc:3008:1 (column_family_test+0x54b6e2)
......
Previous read of size 8 at 0x7ffcdf042158 by thread T2 (mutexes: write M0):
#0 pthread_cond_broadcast <null> (column_family_test+0x471dd2)
https://github.com/facebook/rocksdb/issues/1 rocksdb::port::CondVar::SignalAll() /home/circleci/project/port/port_posix.cc:139:28 (column_family_test+0x8a651a)
https://github.com/facebook/rocksdb/issues/2 rocksdb::test::SleepingBackgroundTask::DoSleep() /home/circleci/project/./test_util/testutil.h:412:12 (column_family_test+0x58574b)
......
Likely, SleepingBackgroundTask::DoSleep() started to execute after the main thread has finished everything, cancelled and waited for sleeping tasks to finish. At this time, although DoSlee() will not sleep, but it also accesses the mutex, creating a data race with destructor of the test. Fix this bug by waiting for the sleeping task to start sleeping after it is scheduled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7150
Test Plan: Run these modified tests and make sure it doesn't break.
Reviewed By: riversand963
Differential Revision: D22630716
fbshipit-source-id: cc5781cf69083685de406490438898238bdfc2d3
Summary:
PR https://github.com/facebook/rocksdb/issues/6944 transitioned `BlockIter` from using `Comparator*` to using
concrete `UserComparatorWrapper` and `InternalKeyComparator`. However,
adding them as instance variables to `BlockIter` was not optimal.
Bloating `BlockIter` caused the `ArenaWrappedDBIter`'s arena allocator to do more heap
allocations (in certain cases) which harmed performance of `DB::NewIterator()`. This PR
pushes down the concrete comparator objects to the point of usage, which
forces them to be on the stack. As a result, the `BlockIter` is back to
its original size prior to https://github.com/facebook/rocksdb/issues/6944 (actually a bit smaller since there
were two `Comparator*` before).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7149
Test Plan:
verified our internal `DB::NewIterator()`-heavy regression
test no longer reports regression.
Reviewed By: riversand963
Differential Revision: D22623189
Pulled By: ajkr
fbshipit-source-id: f6d69accfe5de51e0bd9874a480b32b29909bab6
Summary:
Remove the 3 testing cases that cause the time out in linux build by https://github.com/facebook/rocksdb/issues/6765 . Will fix them later.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7141
Test Plan: make asan_check, buck run
Reviewed By: ajkr
Differential Revision: D22593831
Pulled By: zhichao-cao
fbshipit-source-id: 14956c36476ecc3393f613178c22e13df843126e
Summary:
In CircleCI tests, we failed to fail tests properly if parallel doesn't return an error code. It's probably would happen when unit tests fail with signals, rather than return values. Fix them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7147
Test Plan: Manually ingest a failure and see it to fail.
Reviewed By: jay-zhuang
Differential Revision: D22611594
fbshipit-source-id: 88a42425a41d1213d29bd2e7c80731d2bdd5644b
Summary:
Previously when running `db_bench` with large value for `num_multi_dbs` and enabled `Options::dump_malloc_stats`, we would see most CPU spent in jemalloc locking. After this PR that no longer shows up at the top of the profile.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7145
Reviewed By: riversand963
Differential Revision: D22593031
Pulled By: ajkr
fbshipit-source-id: 3b3fc91f93249c6afee53f59f34c487c3fc5add6
Summary:
This PR does a few things:
- The "compression_opts" and "bottom_compression_opts" can now be read/written as name/value pairs of options (instead of only a colon-separated list;
- These options can now be read/written to the Options file;
- The parallel_threads value can now be set (either in the colon or name-value format).
The compression options are now stored and treated as a OptionTypeInfo::Struct by the options system, meaning they can be read and written like the other structs. This change allows them to be read/written easily to the options file.
Additionally, the colon-format was extended to allow support for setting parallel threads. Tests were added to test all of the option settings via the optional parameters in the colon format.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6817
Reviewed By: ajkr
Differential Revision: D22396004
Pulled By: zhichao-cao
fbshipit-source-id: 38bcf74b7e9cd5bc2a84540fac2e9ba4f765b2c8
Summary:
Periodic syncing of blob files is handled by a lower layer, namely by
`WritableFileWriter`; the `NeedsFsync` method of `BlobFile` and the
`last_fsync_` member variable are actually unused and thus can be
removed. See also https://github.com/facebook/rocksdb/pull/7125 .
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7138
Test Plan: `make check`
Reviewed By: zhichao-cao
Differential Revision: D22562981
Pulled By: ltamasi
fbshipit-source-id: c235aad94a7c27120528c9ec270a7a5b9154e49f
Summary:
Add "examples" build (which build examples folder in rocksdb) in TravisCI to CircleCI. This is helpful before pull request.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7136
Test Plan: Watch for CircleCI results to succeed
Reviewed By: jay-zhuang
Differential Revision: D22555528
Pulled By: akankshamahajan15
fbshipit-source-id: 6bca16647760d5f0131f064765fe9e88e034c578
Summary:
TSAN shows warning with clang with warning similar to this:
WARNING: ThreadSanitizer: data race (pid=10159)
Atomic write of size 8 at 0x7b5000002890 by thread T33:
#0 __tsan_atomic64_store <null> (db_test+0x4ca2b5)
https://github.com/facebook/rocksdb/issues/1 std::__atomic_base<unsigned long>::store(unsigned long, std::memory_order) /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/atomic_base.h:374:2 (db_test+0x774fde)
https://github.com/facebook/rocksdb/issues/2 rocksdb::VersionSet::SetLastSequence(unsigned long) /home/circleci/project/./db/version_set.h:1057:20 (db_test+0x774fde)
https://github.com/facebook/rocksdb/issues/3 rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool, unsigned long*, unsigned long, rocksdb::PreReleaseCallback*) /home/circleci/project/db/db_impl/db_impl_write.cc:449:18 (db_test+0x774fde)
......
Previous read of size 8 at 0x7b5000002890 by thread T5 (mutexes: write M1044689462619020832):
#0 rocksdb::DBImpl::ReleaseSnapshot(rocksdb::Snapshot const*) /home/circleci/project/db/db_impl/db_impl.cc (db_test+0x6f4ae7)
https://github.com/facebook/rocksdb/issues/1 rocksdb::(anonymous namespace)::MTThreadBody(void*) /home/circleci/project/db/db_test.cc:2514:13 (db_test+0x56ac59)
https://github.com/facebook/rocksdb/issues/2 rocksdb::(anonymous namespace)::StartThreadWrapper(void*) /home/circleci/project/env/env_posix.cc:443:3 (db_test+0x88c4cd)
It is not limited to ReleaseSnapshot() and rocksdb::DBImpl::MultiCFSnapshot().
While we are not 100% sure it doesn't indicate any correctness violation, we suppress them for now to keep TSAN clean with more tests so that we can cover more bugs with CI.
In the gcc runs we have been running, this warning rarely shows up.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7126
Test Plan: See the mini-TSAN test to pass with reasonable run time.
Reviewed By: ajkr
Differential Revision: D22552375
fbshipit-source-id: ebdd3854cb3becec3403970326a1ca961db2ab00
Summary:
In both cases where `BlobDBImpl::DecompressSlice` is called,
`compression_type` is already checked at the call site; thus, the check
inside the method is redundant and can be turned into an assertion.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7127
Test Plan: `make check`
Reviewed By: zhichao-cao
Differential Revision: D22533454
Pulled By: ltamasi
fbshipit-source-id: ae524443fc6abe0a5fb12327a3fe761a9cd2c831
Summary:
In current codebase, in write path, if Retryable IO Error happens, SetBGError is called. The retryable IO Error is converted to hard error and DB is in read only mode. User or application needs to resume it. In this PR, if Retryable IO Error happens in one DB, SetBGError will create a new thread to call Resume (auto resume). otpions.max_bgerror_resume_count controls if auto resume is enabled or not (if max_bgerror_resume_count<=0, auto resume will not be enabled). options.bgerror_resume_retry_interval controls the time interval to call Resume again if the previous resume fails due to the Retryable IO Error. If non-retryable error happens during resume, auto resume will terminate.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6765
Test Plan: Added the unit test cases in error_handler_fs_test and pass make asan_check
Reviewed By: anand1976
Differential Revision: D21916789
Pulled By: zhichao-cao
fbshipit-source-id: acb8b5e5dc3167adfa9425a5b7fc104f6b95cb0b
Summary:
Currently, RocksDB lets compaction to go through even in case of
corrupted keys, the number of which is reported in CompactionJobStats.
However, RocksDB does not check this value. We should let compaction run
in a stricter mode.
Temporarily disable two tests that allow corrupted keys in compaction.
With this PR, the two tests will assert(false) and terminate. Still need
to investigate what is the recommended google-test way of doing it.
Death test (EXPECT_DEATH) in gtest has warnings now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7124
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D22530722
Pulled By: riversand963
fbshipit-source-id: 6a5a6a992028c6d4f92cb74693c92db462ae4ad6
Summary:
https://github.com/facebook/rocksdb/pull/6850, which added compaction
filter support to BlobDB, reused elements of the BlobDB GC mechanism.
This patch updates some log messages in this logic to account for this
fact; namely, it replaces mentions of "GC" with "compaction/GC" to avoid
confusion in cases when GC is not enabled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7128
Test Plan: `make check`
Reviewed By: zhichao-cao
Differential Revision: D22535371
Pulled By: ltamasi
fbshipit-source-id: 1f14f3b02ab9983728bbca1cf680420208d9a195
Summary:
This fixes an issue introduced in 0c56fc4 whereby the location of Python is evaluated many times and leads to excessive logging of unknown python locations of CentOS 6.
The location is now only checked once.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7123
Reviewed By: zhichao-cao
Differential Revision: D22532274
Pulled By: ajkr
fbshipit-source-id: cade71b4b46e9a23d63ecb4dd36a4ac8ae217970
Summary:
It is helpful to add some TSAN coverage before a pull request is committed. This diff adds some of them.
Some slow tests are excluded for the running speed. Some are blacklisted because they show warnings. Will investigate these warnings and see whether we can fix or suppress them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7122
Test Plan: Watch CIrcleCI runs
Reviewed By: riversand963
Differential Revision: D22532133
fbshipit-source-id: 81ddd02d9df19c513a12811979e8ddabae911354
Summary:
Periodic syncing of blob files is performed by `WritableFileWriter`;
`bytes_per_sync_` and `next_sync_offset_` in `BlobLogWriter` are
actually unused (or more precisely, only used by methods that are
themselves unused). The patch removes all this dead code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7125
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D22531021
Pulled By: ltamasi
fbshipit-source-id: 6b293ad5a79d3e6bf15c5c68f7aedd7ce7a15f10