Summary:
Directory fsync might be expensive on btrfs and it may not be needed.
Here are 4 directory fsync cases:
1. creating a new file: dir-fsync is not needed on btrfs, as long as the
new file itself is synced.
2. renaming a file: dir-fsync is not needed if the renamed file is
synced. So an API `FsyncAfterFileRename(filename, ...)` is provided
to sync the file on btrfs. By default, it just calls dir-fsync.
3. deleting files: dir-fsync is forced by set
`IOOptions.force_dir_fsync = true`
4. renaming multiple files (like backup and checkpoint): dir-fsync is
forced, the same as above.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8903
Test Plan: run tests on btrfs and non btrfs
Reviewed By: ajkr
Differential Revision: D30885059
Pulled By: jay-zhuang
fbshipit-source-id: dd2730b31580b0bcaedffc318a762d7dbf25de4a
Summary:
Made SystemClock into a Customizable class, complete with CreateFromString.
Cleaned up some of the existing SystemClock implementations that were redundant (NoSleep was the same as the internal one for MockEnv).
Changed MockEnv construction to allow Clock to be passed to the Memory/MockFileSystem.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8636
Reviewed By: zhichao-cao
Differential Revision: D30483360
Pulled By: mrambacher
fbshipit-source-id: cd0e3a876c39f8c98fe13374c06e8edbd5b9f2a1
Summary:
For performance purposes, the lower level routines were changed to use a SystemClock* instead of a std::shared_ptr<SystemClock>. The shared ptr has some performance degradation on certain hardware classes.
For most of the system, there is no risk of the pointer being deleted/invalid because the shared_ptr will be stored elsewhere. For example, the ImmutableDBOptions stores the Env which has a std::shared_ptr<SystemClock> in it. The SystemClock* within the ImmutableDBOptions is essentially a "short cut" to gain access to this constant resource.
There were a few classes (PeriodicWorkScheduler?) where the "short cut" property did not hold. In those cases, the shared pointer was preserved.
Using db_bench readrandom perf_level=3 on my EC2 box, this change performed as well or better than 6.17:
6.17: readrandom : 28.046 micros/op 854902 ops/sec; 61.3 MB/s (355999 of 355999 found)
6.18: readrandom : 32.615 micros/op 735306 ops/sec; 52.7 MB/s (290999 of 290999 found)
PR: readrandom : 27.500 micros/op 871909 ops/sec; 62.5 MB/s (367999 of 367999 found)
(Note that the times for 6.18 are prior to revert of the SystemClock).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8033
Reviewed By: pdillinger
Differential Revision: D27014563
Pulled By: mrambacher
fbshipit-source-id: ad0459eba03182e454391b5926bf5cdd45657b67
Summary:
Introduces and uses a SystemClock class to RocksDB. This class contains the time-related functions of an Env and these functions can be redirected from the Env to the SystemClock.
Many of the places that used an Env (Timer, PerfStepTimer, RepeatableThread, RateLimiter, WriteController) for time-related functions have been changed to use SystemClock instead. There are likely more places that can be changed, but this is a start to show what can/should be done. Over time it would be nice to migrate most (if not all) of the uses of the time functions from the Env to the SystemClock.
There are several Env classes that implement these functions. Most of these have not been converted yet to SystemClock implementations; that will come in a subsequent PR. It would be good to unify many of the Mock Timer implementations, so that they behave similarly and be tested similarly (some override Sleep, some use a MockSleep, etc).
Additionally, this change will allow new methods to be introduced to the SystemClock (like https://github.com/facebook/rocksdb/issues/7101 WaitFor) in a consistent manner across a smaller number of classes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7858
Reviewed By: pdillinger
Differential Revision: D26006406
Pulled By: mrambacher
fbshipit-source-id: ed10a8abbdab7ff2e23d69d85bd25b3e7e899e90
Summary:
This PR merges the functionality of making the ColumnFamilyOptions, TableFactory, and DBOptions into Configurable into a single PR, resolving any merge conflicts
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5753
Reviewed By: ajkr
Differential Revision: D23385030
Pulled By: zhichao-cao
fbshipit-source-id: 8b977a7731556230b9b8c5a081b98e49ee4f160a
Summary:
And change the internal time value from seconds to microseconds.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7293
Reviewed By: pdillinger
Differential Revision: D23253751
Pulled By: jay-zhuang
fbshipit-source-id: 36aa9376b8801b85bd10163173590a17cf4f3a3a
Summary:
There's a potential deadlock caused by MockTimeEnv time value get to a large number, which causes TimedWait() wait forever. The test misuses the microseconds as seconds, making it more likely to happen.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7277
Reviewed By: pdillinger
Differential Revision: D23183873
Pulled By: jay-zhuang
fbshipit-source-id: 6fc38ebd40b4125a99551204b271f91a27e70086
Summary:
Have a global StatsDumpScheduler for all DB instance stats dumping, including `DumpStats()` and `PersistStats()`. Before this, there're 2 dedicate threads for every DB instance, one for DumpStats() one for PersistStats(), which could create lots of threads if there're hundreds DB instances.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7223
Reviewed By: riversand963
Differential Revision: D23056737
Pulled By: jay-zhuang
fbshipit-source-id: 0faa2311142a73433ebb3317361db7cbf43faeba
Summary:
```
int* value = new int;
ASSERT_NE(nullptr, value);
```
`ASSERT_NE` can expand the expression such that a memory leak is
reported by clang analyzer.
We can remove this ASSERT_NE since we can assume the memory allocation
must succeed. Otherwise a bad alloc exception will be thrown and the
process will be killed anyway.
Test plan (dev server):
```
USE_CLANG=1 make analyze
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7245
Reviewed By: jay-zhuang
Differential Revision: D23079641
Pulled By: riversand963
fbshipit-source-id: a6739a903f90f8715f6f1ef3e5c8a329245b8e78
Summary:
Timer (defined in timer.h) schedules and runs user-specified fuctions
regularly. Current implementation holds the mutex while running user
function, which will lead to contention and waiting.
To fix, Timer::Run releases mutex before running user function, and
re-acquires it afterwards.
This fix will impact how we can cancel a task. If the task is running,
it is not holding the mutex. The thread calling Cancel() should wait
until the current task finishes.
Test Plan (devserver):
make check
COMPILE_WITH_ASAN=1 make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7228
Reviewed By: jay-zhuang
Differential Revision: D23065487
Pulled By: riversand963
fbshipit-source-id: 07cb59741f506d3eb875c8ab90f73437568d3724
Summary:
We have a number of tests hanging on MacOS and windows due to
mishandling of code for mock sleeps. In addition, the code was in
terrible shape because the same variable (addon_time_) would sometimes
refer to microseconds and sometimes to seconds. One test even assumed it
was nanoseconds but was written to pass anyway.
This has been cleaned up so that DB tests generally use a SpecialEnv
function to mock sleep, for either some number of microseconds or seconds
depending on the function called. But to call one of these, the test must first
call SetMockSleep (precondition enforced with assertion), which also turns
sleeps in RocksDB into mock sleeps. To also removes accounting for actual
clock time, call SetTimeElapseOnlySleepOnReopen, which implies
SetMockSleep (on DB re-open). This latter setting only works by applying
on DB re-open, otherwise havoc can ensue if Env goes back in time with
DB open.
More specifics:
Removed some unused test classes, and updated comments on the general
problem.
Fixed DBSSTTest.GetTotalSstFilesSize using a sync point callback instead
of mock time. For this we have the only modification to production code,
inserting a sync point callback in flush_job.cc, which is not a change to
production behavior.
Removed unnecessary resetting of mock times to 0 in many tests. RocksDB
deals in relative time. Any behaviors relying on absolute date/time are likely
a bug. (The above test DBSSTTest.GetTotalSstFilesSize was the only one
clearly injecting a specific absolute time for actual testing convenience.) Just
in case I misunderstood some test, I put this note in each replacement:
// NOTE: Presumed unnecessary and removed: resetting mock time in env
Strengthened some tests like MergeTestTime, MergeCompactionTimeTest, and
FilterCompactionTimeTest in db_test.cc
stats_history_test and blob_db_test are each their own beast, rather deeply
dependent on MockTimeEnv. Each gets its own variant of a work-around for
TimedWait in a mock time environment. (Reduces redundancy and
inconsistency in stats_history_test.)
Intended follow-up:
Remove TimedWait from the public API of InstrumentedCondVar, and only
make that accessible through Env by passing in an InstrumentedCondVar and
a deadline. Then the Env implementations mocking time can fix this problem
without using sync points. (Test infrastructure using sync points interferes
with individual tests' control over sync points.)
With that change, we can simplify/consolidate the scattered work-arounds.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7101
Test Plan: make check on Linux and MacOS
Reviewed By: zhichao-cao
Differential Revision: D23032815
Pulled By: pdillinger
fbshipit-source-id: 7f33967ada8b83011fb54e8279365c008bd6610b
Summary:
Disable `TimerTest.SingleScheduleRepeatedlyTest` and `TimerTest.MultipleScheduleRepeatedlyTest`. This is to help people to not hit any hangs (https://github.com/facebook/rocksdb/issues/6698) during their development process while I investigate further; I could not reproduce the issue on my dev machine yet. Note that timer is not being utilized anywhere yet.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6833
Test Plan:
```
svemuri@devbig187 ~/rocksdb (timer-disable-test) $ TEST_TMPDIR=/dev/shm ./timer_test
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from TimerTest
[ RUN ] TimerTest.SingleScheduleOnceTest
[ OK ] TimerTest.SingleScheduleOnceTest (1 ms)
[ RUN ] TimerTest.MultipleScheduleOnceTest
[ OK ] TimerTest.MultipleScheduleOnceTest (0 ms)
[----------] 2 tests from TimerTest (1 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (1 ms total)
[ PASSED ] 2 tests.
YOU HAVE 2 DISABLED TESTS
```
Reviewed By: pdillinger
Differential Revision: D21502474
Pulled By: sagar0
fbshipit-source-id: ac67caee2011fd14ffb2476a8914a6286a4f9abe
Summary:
Adding a simple timer support to schedule work at a fixed time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6543
Test Plan: TODO: clean up the unit tests, and make them better.
Reviewed By: siying
Differential Revision: D20465390
Pulled By: sagar0
fbshipit-source-id: cba143f70b6339863e1d0f8b8bf92e51c2b3d678