Summary:
If ReadOptions.low_pri=true and compaction is behind, the write will either return immediate or be slowed down based on ReadOptions.no_slowdown.
Closes https://github.com/facebook/rocksdb/pull/2369
Differential Revision: D5127619
Pulled By: siying
fbshipit-source-id: d30e1cff515890af0eff32dfb869d2e4c9545eb0
Summary:
Improve write buffer manager in several ways:
1. Size is tracked when arena block is allocated, rather than every allocation, so that it can better track actual memory usage and the tracking overhead is slightly lower.
2. We start to trigger memtable flush when 7/8 of the memory cap hits, instead of 100%, and make 100% much harder to hit.
3. Allow a cache object to be passed into buffer manager and the size allocated by memtable can be costed there. This can help users have one single memory cap across block cache and memtable.
Closes https://github.com/facebook/rocksdb/pull/2350
Differential Revision: D5110648
Pulled By: siying
fbshipit-source-id: b4238113094bf22574001e446b5d88523ba00017
Summary:
Previously users could set `max_background_flushes=0` to force rocksdb to use a single thread pool for both background flushes and compactions. That'll no longer be possible since I'm going to deprecate `max_background_flushes` and `max_background_compactions` in favor of a single option. This diff introduces a new way to force a single thread pool: when high-pri pool has zero threads, all background jobs will be submitted to low-pri pool.
Note the majority of the code change is adding `Env::GetBackgroundThreads()`, which is necessary to check whether the user has provided a zero-sized thread pool.
Closes https://github.com/facebook/rocksdb/pull/2204
Differential Revision: D4936256
Pulled By: ajkr
fbshipit-source-id: 929a07a0c0705f7766f5339cd013ff74e90d6e01
Summary:
Allow an option for users to do some compaction in FIFO compaction, to pay some write amplification for fewer number of files.
Closes https://github.com/facebook/rocksdb/pull/2163
Differential Revision: D4895953
Pulled By: siying
fbshipit-source-id: a1ab608dd0627211f3e1f588a2e97159646e1231
Summary:
I'm going to add more DB tests for statistics as currently we have very few. I started a file dedicated to this purpose and moved the existing stats-specific tests there.
Closes https://github.com/facebook/rocksdb/pull/2211
Differential Revision: D4951558
Pulled By: ajkr
fbshipit-source-id: 05d11c35079c40ecabdfd2cf5556ccb761f694a4
Summary:
Move some files under util/ to new directories env/, monitoring/ options/ and cache/
Closes https://github.com/facebook/rocksdb/pull/2090
Differential Revision: D4833681
Pulled By: siying
fbshipit-source-id: 2fd8bef
Summary:
It is confusing to have auto_roll_logger to stay under db/, which has nothing to do with database. Move filename together as it is a dependency.
Closes https://github.com/facebook/rocksdb/pull/2080
Differential Revision: D4821141
Pulled By: siying
fbshipit-source-id: ca7d768
Summary:
PinnableSlice
Summary:
Currently the point lookup values are copied to a string provided by the
user. This incures an extra memcpy cost. This patch allows doing point lookup
via a PinnableSlice which pins the source memory location (instead of
copying their content) and releases them after the content is consumed
by the user. The old API of Get(string) is translated to the new API
underneath.
Here is the summary for improvements:
value 100 byte: 1.8% regular, 1.2% merge values
value 1k byte: 11.5% regular, 7.5% merge values
value 10k byte: 26% regular, 29.9% merge values
The improvement for merge could be more if we extend this approach to
pin the merge output and delay the full merge operation until the user
actually needs it. We have put that for future work.
PS:
Sometimes we observe a small decrease in performance when switching from
t5452014 to this patch but with the old Get(string) API. The d
Closes https://github.com/facebook/rocksdb/pull/1756
Differential Revision: D4391738
Pulled By: maysamyabandeh
fbshipit-source-id: 6f3edd3
Summary:
the 50%+ drained constraint wasn't working consistently in some of our test environments, maybe their resources are too low. relax the constraints a bit.
Closes https://github.com/facebook/rocksdb/pull/1970
Differential Revision: D4679419
Pulled By: ajkr
fbshipit-source-id: 3789cd8
Summary:
fix when elapsed time spans non-integral number of intervals since the rate limiter may still be drained during a partial interval.
Closes https://github.com/facebook/rocksdb/pull/1948
Differential Revision: D4651304
Pulled By: ajkr
fbshipit-source-id: b1f9e70
Summary:
This is the metric I plan to use for adaptive rate limiting. The statistics are updated only if the rate limiter is drained by flush or compaction. I believe (but am not certain) that this is the normal case.
The Statistics object is passed in RateLimiter::Request() to avoid requiring changes to client code, which would've been necessary if we passed it in the RateLimiter constructor.
Closes https://github.com/facebook/rocksdb/pull/1946
Differential Revision: D4646489
Pulled By: ajkr
fbshipit-source-id: d8e0161
Summary:
Separate a smal subset of tests in DBTest to DBBasicTest. Tests in DBTest don't have to run in CI tests on platforms like OSX, as long as they are covered by Linux.
Closes https://github.com/facebook/rocksdb/pull/1924
Differential Revision: D4616702
Pulled By: siying
fbshipit-source-id: 13e6549
Summary:
…action
The two options, min_partial_merge_operands and verify_checksums_in_compaction, are not seldom used. Remove them to reduce the total number of options. Also remove them from Java and C interface.
Closes https://github.com/facebook/rocksdb/pull/1902
Differential Revision: D4601219
Pulled By: siying
fbshipit-source-id: aad4cb2
Summary:
The option has been deprecated for two years and has no effect. Removing.
Closes https://github.com/facebook/rocksdb/pull/1866
Differential Revision: D4555203
Pulled By: yiwu-arbug
fbshipit-source-id: c48f627
Summary:
Remove disableDataSync, and another similarly named disable_data_sync options.
This is being done to simplify options, and also because the performance gains of this feature can be achieved by other methods.
Closes https://github.com/facebook/rocksdb/pull/1859
Differential Revision: D4541292
Pulled By: sagar0
fbshipit-source-id: 5b3a6ca
Summary:
introduce new methods into a public threadpool interface,
- allow submission of std::functions as they allow greater flexibility.
- add Joining methods to the implementation to join scheduled and submitted jobs with
an option to cancel jobs that did not start executing.
- Remove ugly `#ifdefs` between pthread and std implementation, make it uniform.
- introduce pimpl for a drop in replacement of the implementation
- Introduce rocksdb::port::Thread typedef which is a replacement for std::thread. On Posix Thread defaults as before std::thread.
- Implement WindowsThread that allocates memory in a more controllable manner than windows std::thread with a replaceable implementation.
- should be no functionality changes.
Closes https://github.com/facebook/rocksdb/pull/1823
Differential Revision: D4492902
Pulled By: siying
fbshipit-source-id: c74cb11
Summary:
Added method that returns approx num of entries as well as size for memtables.
Closes https://github.com/facebook/rocksdb/pull/1841
Differential Revision: D4511990
Pulled By: VitaliyLi
fbshipit-source-id: 9a4576e
Summary:
Added an option to GetApproximateSizes to exclude file stats, as MyRocks has those counted exactly and we need only stats from memtables.
Closes https://github.com/facebook/rocksdb/pull/1787
Differential Revision: D4441111
Pulled By: IslamAbdelRahman
fbshipit-source-id: c11f4c3
Summary:
Enable directIO on WritableFileImpl::Append
with offset being current length of the file.
Enable UniqueID tests on Windows, disable others but
leeting them to compile. Unique tests are valuable to
detect failures on different filesystems and upcoming
ReFS.
Clear output in WinEnv Getchildren.This is different from
previous strategy, do not touch output on failure.
Make sure DBTest.OpenWhenOpen works with windows error message
Closes https://github.com/facebook/rocksdb/pull/1746
Differential Revision: D4385681
Pulled By: IslamAbdelRahman
fbshipit-source-id: c07b702
Summary:
Currently the point lookup values are copied to a string provided by the user.
This incures an extra memcpy cost. This patch allows doing point lookup
via a PinnableSlice which pins the source memory location (instead of
copying their content) and releases them after the content is consumed
by the user. The old API of Get(string) is translated to the new API
underneath.
Here is the summary for improvements:
1. value 100 byte: 1.8% regular, 1.2% merge values
2. value 1k byte: 11.5% regular, 7.5% merge values
3. value 10k byte: 26% regular, 29.9% merge values
The improvement for merge could be more if we extend this approach to
pin the merge output and delay the full merge operation until the user
actually needs it. We have put that for future work.
PS:
Sometimes we observe a small decrease in performance when switching from
t5452014 to this patch but with the old Get(string) API. The difference
is a little and could be noise. More importantly it is safely
cancelled
Closes https://github.com/facebook/rocksdb/pull/1732
Differential Revision: D4374613
Pulled By: maysamyabandeh
fbshipit-source-id: a077f1a
Summary:
Current write stalling system has the problem of lacking of positive feedback if the restricted rate is already too low. Users sometimes stack in very low slowdown value. With the diff, we add a positive feedback (increasing the slowdown value) if we recover from slowdown state back to normal. To avoid the positive feedback to keep the slowdown value to be to high, we add issue a negative feedback every time we are close to the stop condition. Experiments show it is easier to reach a relative balance than before.
Also increase level0_stop_writes_trigger default from 24 to 32. Since level0_slowdown_writes_trigger default is 20, stop trigger 24 only gives four files as the buffer time to slowdown writes. In order to avoid stop in four files while 20 files have been accumulated, the slowdown value must be very low, which is amost the same as stop. It also doesn't give enough time for the slowdown value to converge. Increase it to 32 will smooth out the system.
Closes https://github.com/facebook/rocksdb/pull/1562
Differential Revision: D4218519
Pulled By: siying
fbshipit-source-id: 95e4088
Summary:
If the WriteOptions.no_slowdown flag is set AND we need to wait or sleep for
the write request, then fail immediately with Status::Incomplete().
Closes https://github.com/facebook/rocksdb/pull/1527
Differential Revision: D4191405
Pulled By: maysamyabandeh
fbshipit-source-id: 7f3ce3f
Summary:
Currently the compaction stats are printed to stdout. We want to export the compaction stats in a map format so that the upper layer apps (e.g., MySQL) could present
the stats in any format required by the them.
Closes https://github.com/facebook/rocksdb/pull/1477
Differential Revision: D4149836
Pulled By: maysamyabandeh
fbshipit-source-id: b3df19f
Summary:
The current 10 millisecond waiting for test results may not be sufficient in some test environments. Increase it to 60 seconds and check the results for every 1 milliseond.
Already reviewed: https://reviews.facebook.net/D65457
Closes https://github.com/facebook/rocksdb/pull/1437
Differential Revision: D4099443
Pulled By: siying
fbshipit-source-id: cf1f205
Summary:
The verification condition of the test DBTest.RepeatedWritesToSameKey doesn't hold anymore after 3ce3bb3da2.
Disable the test for now before we find a way to replace it.
Test Plan: Run the test and make sure it is disabled.
Summary:
Changes in the diff
API changes:
- Introduce IngestExternalFile to replace AddFile (I think this make the API more clear)
- Introduce IngestExternalFileOptions (This struct will encapsulate the options for ingesting the external file)
- Deprecate AddFile() API
Logic changes:
- If our file overlap with the memtable we will flush the memtable
- We will find the first level in the LSM tree that our file key range overlap with the keys in it
- We will find the lowest level in the LSM tree above the the level we found in step 2 that our file can fit in and ingest our file in it
- We will assign a global sequence number to our new file
- Remove AddFile restrictions by using global sequence numbers
Other changes:
- Refactor all AddFile logic to be encapsulated in ExternalSstFileIngestionJob
Test Plan:
unit tests (still need to add more)
addfile_stress (https://reviews.facebook.net/D65037)
Reviewers: yiwu, andrewkr, lightmark, yhchiang, sdong
Reviewed By: sdong
Subscribers: jkedgar, hcz, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D65061
Summary:
Add new Iterator API, `SeekForPrev`: find the last key that <= target key
support prefix_extractor
support prefix_same_as_start
support upper_bound
not supported in iterators without Prev()
Also add tests in db_iter_test and db_iterator_test
Pass all tests
Cheers!
Test Plan: make all check -j64
Reviewers: andrewkr, yiwu, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D64149
Summary: Use ImmutableDBOptions/MutableDBOptions internally and DBOptions only for user-facing APIs. MutableDBOptions is barely a placeholder for now. I'll start to move options to MutableDBOptions in following diffs.
Test Plan:
make all check
Reviewers: yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D64065
Summary:
ZSTD 1.0.0 is coming. We can finally add a support of ZSTD without worrying about compatibility.
Still keep ZSTDNotFinal for compatibility reason.
Test Plan: Run all tests. Run db_bench with ZSTD version with RocksDB built with ZSTD 1.0 and older.
Reviewers: andrewkr, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: cyan, igor, IslamAbdelRahman, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D63141
* Fix StatsLevel so that kExceptTimeForMutex leaves compression stats enabled and kExceptDetailedTimers disables mutex lock stats. Also change default stats level to kExceptDetailedTimers (disabling both compression and mutex timing).
* Changed order of StatsLevel enum to simplify logic for determining what stats to record.
Summary: To reduce number of options, merge source_compaction_factor, max_grandparent_overlap_bytes and expanded_compaction_factor into max_compaction_bytes.
Test Plan: Add two new unit tests. Run all existing tests, including jtest.
Reviewers: yhchiang, igor, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59829
* Added check_snapshot option in the DB's AddFile function
* change check_snapshot to skip_snapshot_check
* add unit test for skip_snapshot_check
* Add skip_snapshot_check comment
Summary:
DBTest.CompressionStatsTest on non_shm test where the storage device is slow
DBTest.CompressionStatsTest assumes that a flush happens to check the number of compressed blocks.
This is not always true if the Flush is slow, make the test more deterministic by forcing a flush before doing the check
Test Plan: Run the test locally
Reviewers: andrewkr, yiwu, lightmark, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61317
* Added new statistics and refactored to allow ioptions to be passed around as required to access environment and statistics pointers (and, as a convenient side effect, info_log pointer).
* Prevent incrementing compression counter when compression is turned off in options.
* Prevent incrementing compression counter when compression is turned off in options.
* Added two more supported compression types to test code in db_test.cc
* Prevent incrementing compression counter when compression is turned off in options.
* Added new StatsLevel that excludes compression timing.
* Fixed casting error in coding.h
* Fixed CompressionStatsTest for new StatsLevel.
* Removed unused variable that was breaking the Linux build