Commit Graph

2182 Commits

Author SHA1 Message Date
Andrew Kryczka
941543721d Bytes read stat for VerifyChecksum() and VerifyFileChecksums() APIs (#8741)
Summary:
- Clarified some comments on compatibility for adding new ticker stats
- Added read I/O stats for `VerifyChecksum()` and `VerifyFileChecksums()` APIs

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8741

Test Plan: new unit test

Reviewed By: zhichao-cao

Differential Revision: D30708578

Pulled By: ajkr

fbshipit-source-id: d06b961f7e199ae92c266b683e39870aa8f63449
2021-09-07 13:28:29 -07:00
Peter Dillinger
0ef88538c6 Improve support for using regexes (#8740)
Summary:
* Consolidate use of std::regex for testing to testharness.cc, to
minimize Facebook linters constantly flagging uses in non-production
code.
* Improve syntax and error messages for asserting some string matches a
regex in tests.
* Add a public Regex wrapper class to encapsulate existing usage in
ObjectRegistry.
* Remove unnecessary include <regex>
* Put warnings that use of Regex in production code could cause bad
performance or stack overflow.

Intended follow-up work:
* Replace std::regex with another underlying implementation like RE2
* Improve ObjectRegistry interface in terms of possibly confusing literal
string matching vs. regex and in terms of reporting invalid regex.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8740

Test Plan:
tests updated, basic unit test for public Regex, and some manual
testing of temporary changes to see example error messages:

utilities/backupable/backupable_db_test.cc:917: Failure
000010_1162373755_138626.blob (child.name)
does not match regex
[0-9]+_[0-9]+_[0-9]+[.]blobHAHAHA (pattern)

db/db_basic_test.cc:74: Failure
R3SHSBA8C4U0CIMV2ZB0 (sid3)
does not match regex [0-9A-Z]{20}HAHAHA

Reviewed By: mrambacher

Differential Revision: D30706246

Pulled By: pdillinger

fbshipit-source-id: ba845e8f563ccad39bdb58f44f04e9da8f78c3fd
2021-09-07 13:05:23 -07:00
Peter Dillinger
4750421ece Replace most typedef with using= (#8751)
Summary:
Old typedef syntax is confusing

Most but not all changes with

    perl -pi -e 's/typedef (.*) ([a-zA-Z0-9_]+);/using $2 = $1;/g' list_of_files
    make format

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8751

Test Plan: existing

Reviewed By: zhichao-cao

Differential Revision: D30745277

Pulled By: pdillinger

fbshipit-source-id: 6f65f0631c3563382d43347896020413cc2366d9
2021-09-07 11:31:59 -07:00
Peter Dillinger
c9cd5d25a8 Remove some unneeded code (#8736)
Summary:
* FullKey and ParseFullKey appear to serve no purpose in the public API
(or anything else) so removed. Only use in one test updated.
* NumberToString serves no purpose vs. ToString so removed, numerous
calls updated
* Remove unnecessary forward declarations in metadata.h by re-arranging
class definitions.
* Remove some unneeded semicolons

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8736

Test Plan: existing tests

Reviewed By: mrambacher

Differential Revision: D30700039

Pulled By: pdillinger

fbshipit-source-id: 1e436a576f511a6ed8b4d97af7cc8216bc729af2
2021-09-01 14:28:58 -07:00
Hui Xiao
240c4126fd Implement superior user & mid IO priority level in GenericRateLimiter (#8595)
Summary:
Context:
An extra IO_USER priority in rate limiter allows users to optionally charge WAL writes / SST reads to rate limiter at this priority level, which then has higher priority than IO_HIGH and IO_LOW. With an extra IO_USER priority, it allows users to better specify the relative urgency/importance among different requests in rate limiter. As a consequence, IO resource management can better prioritize and limit resource based on user's need.

The IO_USER is implemented as superior priority in GenericRateLimiter, in the sense that its request queue will always be iterated first without being constrained to fairness. The reason is that the notion of fairness is only meaningful in helping lower priorities in background IO (i.e, IO_HIGH/MID/LOW) to gain some fair chance to run so that it does not block foreground IO (i.e, the ones that are charged at the level of IO_USER). As we can see, the ultimate goal here is to not blocking foreground IO at IO_USER level, which justifies the superiority of IO_USER.

Similar benefits exist for IO_MID priority.
- Rewrote the logic of deciding the order of iterating request queues of high/low priorities to include the extra user/mid priority w/o affecting the existing behavior (see PR's [comment](https://github.com/facebook/rocksdb/pull/8595/files#r678749331))
- Included the request queue of user-pri/mid-pri in the code path of next-leader-candidate signaling and GenericRateLimiter's destructor
- Included the extra user/mid-pri in bookkeeping data structures: total_bytes_through_ and total_requests_
- Re-written the previous impl of explicitly iterating priorities with a loop from Env::IO_LOW to Env::IO_TOTAL

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8595

Test Plan:
- passed existing rate_limiter_test.cc
- passed added unit tests in rate_limiter_test.cc
- run performance test to verify performance with only high/low requests is not affected by this change
   - Set-up command:
   `TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=fillrandom --duration=5 --compression_type=none --num=100000000 --disable_auto_compactions=true --write_buffer_size=1048576 --writable_file_max_buffer_size=65536 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --level0_slowdown_writes_trigger=$(((1 << 31) - 1)) --level0_stop_writes_trigger=$(((1 << 31) - 1))`

    - Test command:
   `TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=overwrite --use_existing_db=true --disable_wal=true --duration=30 --compression_type=none --num=100000000 --write_buffer_size=1048576 --writable_file_max_buffer_size=65536 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --level0_slowdown_writes_trigger=$(((1 << 31) - 1)) --level0_stop_writes_trigger=$(((1 << 31) - 1)) --statistics=true --rate_limiter_bytes_per_sec=1048576 --rate_limiter_refill_period_us=1000  --threads=32 |& grep -E '(flush|compact)\.write\.bytes'`

   - Before (on branch upstream/master):
   `rocksdb.compact.write.bytes COUNT : 4014162`
   `rocksdb.flush.write.bytes COUNT : 26715832`
    rocksdb.flush.write.bytes/rocksdb.compact.write.bytes ~= 6.66

   - After (on branch rate_limiter_user_pri):
  `rocksdb.compact.write.bytes COUNT : 3807822`
  `rocksdb.flush.write.bytes COUNT : 26098659`
   rocksdb.flush.write.bytes/rocksdb.compact.write.bytes ~= 6.85

Reviewed By: ajkr

Differential Revision: D30577783

Pulled By: hx235

fbshipit-source-id: 0881f2705ffd13ecd331256bde7e8ec874a353f4
2021-08-31 11:24:27 -07:00
Peter Dillinger
13ded69484 Built-in support for generating unique IDs, bug fix (#8708)
Summary:
Env::GenerateUniqueId() works fine on Windows and on POSIX
where /proc/sys/kernel/random/uuid exists. Our other implementation is
flawed and easily produces collision in a new multi-threaded test.
As we rely more heavily on DB session ID uniqueness, this becomes a
serious issue.

This change combines several individually suitable entropy sources
for reliable generation of random unique IDs, with goal of uniqueness
and portability, not cryptographic strength nor maximum speed.

Specifically:
* Moves code for getting UUIDs from the OS to port::GenerateRfcUuid
rather than in Env implementation details. Callers are now told whether
the operation fails or succeeds.
* Adds an internal API GenerateRawUniqueId for generating high-quality
128-bit unique identifiers, by combining entropy from three "tracks":
  * Lots of info from default Env like time, process id, and hostname.
  * std::random_device
  * port::GenerateRfcUuid (when working)
* Built-in implementations of Env::GenerateUniqueId() will now always
produce an RFC 4122 UUID string, either from platform-specific API or
by converting the output of GenerateRawUniqueId.

DB session IDs now use GenerateRawUniqueId while DB IDs (not as
critical) try to use port::GenerateRfcUuid but fall back on
GenerateRawUniqueId with conversion to an RFC 4122 UUID.

GenerateRawUniqueId is declared and defined under env/ rather than util/
or even port/ because of the Env dependency.

Likely follow-up: enhance GenerateRawUniqueId to be faster after the
first call and to guarantee uniqueness within the lifetime of a single
process (imparting the same property onto DB session IDs).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8708

Test Plan:
A new mini-stress test in env_test checks the various public
and internal APIs for uniqueness, including each track of
GenerateRawUniqueId individually. We can't hope to verify anywhere close
to 128 bits of entropy, but it can at least detect flaws as bad as the
old code. Serial execution of the new tests takes about 350 ms on
my machine.

Reviewed By: zhichao-cao, mrambacher

Differential Revision: D30563780

Pulled By: pdillinger

fbshipit-source-id: de4c9ff4b2f581cf784fcedb5f39f16e5185c364
2021-08-30 15:20:41 -07:00
Merlin Mao
6c2bd28a61 Update comments, fix typos. (#8721)
Summary:
- Removed the default empty constructors of `TraceWriter` and `TraceReader`.
- Removed unused `ReadFooter()` from `ReplayerImpl`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8721

Test Plan: None

Reviewed By: zhichao-cao

Differential Revision: D30609743

Pulled By: autopear

fbshipit-source-id: 7e2626b015bd57ebb408a2836b4b4217cea10002
2021-08-27 13:16:32 -07:00
mrambacher
6e63e77af1 Make Configurable/Customizable options copyable (#8704)
Summary:
The atomic variable "is_prepared_" was keeping Configurable objects from being copy-constructed.  Removed the atomic to allow copies.

Since the variable is only changed from false to true (and never back), there is no reason it had to be atomic.

Added tests that simple Configurable and Customizable objects can be put on the stack and copied.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8704

Reviewed By: anand1976

Differential Revision: D30530526

Pulled By: ltamasi

fbshipit-source-id: 4dd4439b3e5ad7fa396573d0b25d9fb709160576
2021-08-25 17:48:08 -07:00
Hui Xiao
74cfe7db60 Refactor WriteBufferManager::CacheRep into CacheReservationManager (#8506)
Summary:
Context:
To help cap various memory usage by a single limit of the block cache capacity, we charge the memory usage through inserting/releasing dummy entries in the block cache. CacheReservationManager is such a class (non thread-safe) responsible for  inserting/removing dummy entries to reserve cache space for memory used by the class user.

- Refactored the inner private class CacheRep of WriteBufferManager into public CacheReservationManager class for reusability such as for https://github.com/facebook/rocksdb/pull/8428

- Encapsulated implementation details of cache key generation and dummy entries insertion/release in cache reservation as discussed in https://github.com/facebook/rocksdb/pull/8506#discussion_r666550838

- Consolidated increase/decrease cache reservation into one API - UpdateCacheReservation.

- Adjusted the previous dummy entry release algorithm in decreasing cache reservation to be loop-releasing dummy entries to stay symmetric to dummy entry insertion algorithm

- Made the previous dummy entry release algorithm in delayed decrease mode more aggressive for better decreasing cache reservation when memory used is less likely to increase back.

  Previously, the algorithms only release 1 dummy entries when new_mem_used < 3/4 * cache_allocated_size_ and cache_allocated_size_ - kSizeDummyEntry > new_mem_used.
Now, the algorithms loop-releases as many dummy entries as possible when new_mem_used < 3/4 * cache_allocated_size_.

- Updated WriteBufferManager's test cases to adapt to changes on the release algorithm mentioned above and left comment for some test cases for clarity

- Replaced the previous cache key prefix generation (utilizing object address related to the cache client) with one that utilizes Cache->NewID() to prevent cache-key collision among dummy entry clients sharing the same cache.

  The specific collision we are preventing happens when the object address is reused for a new cache-key prefix while the old cache-key using that same object address in its prefix still exists in the cache. This could happen due to that, under LRU cache policy, there is a possible delay in releasing a cache entry after the cache client object owning that cache entry get deallocated. In this case, the object address related to the cache client object can get reused for other client object to generate a new cache-key prefix.

  This prefix generation can be made obsolete after Peter's unification of all the code generating cache key, mentioned in https://github.com/facebook/rocksdb/pull/8506#discussion_r667265255

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8506

Test Plan:
- Passing the added unit tests cache_reservation_manager_test.cc
- Passing existing and adjusted write_buffer_manager_test.cc

Reviewed By: ajkr

Differential Revision: D29644135

Pulled By: hx235

fbshipit-source-id: 0fc93fbfe4a40bb41be85c314f8f2bafa8b741f7
2021-08-24 12:43:31 -07:00
Jay Zhuang
249b1078c9 Add extra information to RemoteCompaction APIs (#8680)
Summary:
Currently, we only provide job_id in RemoteCompaction APIs, the
main problem of `job_id` is it cannot uniquely identify a compaction job
between DB instances or between sessions.
Providing DB and session id to the user, which will make building cross
DB compaction service easier.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8680

Test Plan: unittest

Reviewed By: ajkr

Differential Revision: D30444859

Pulled By: jay-zhuang

fbshipit-source-id: fdf107f4286564049637f154193c6d94c3c59448
2021-08-23 16:27:38 -07:00
Levi Tamasi
8c9e689790 Update version.h and HISTORY.md for the 6.24 release (#8688)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8688

Reviewed By: ajkr, riversand963

Differential Revision: D30467746

Pulled By: ltamasi

fbshipit-source-id: 0fce0d42fe2fe3cb56d7a89607154b3b957f09b6
2021-08-20 22:28:16 -07:00
Peter Dillinger
04db764831 Embed original file number in SST table properties (#8686)
Summary:
I very recently realized that with https://github.com/facebook/rocksdb/issues/8669 we cannot later add
file numbers to external SST files (so that more can share db session
ids for better uniqueness properties), because of forward compatibility.
We would have a version of RocksDB that assumes session IDs are unique
on external SST files and therefore can't really break that invariant in
future files.

This change adds a table property for "orig_file_number" which is
populated by normal SST files and also external SST files generated by
SstFileWriter. SstFileWriter now keeps a db_session_id for life of the
object and increments its own file numbers for embedding in table
properties. (They are arguably "fake" file numbers because these numbers
and not embedded in the file name.)

While updating block_based_table_builder, I removed several unnecessary
fields from Rep, because following the pattern would have created
another unnecessary field.

This change also updates block_based_table_reader to use this new
property when available, which means that for newer SST files, we can
determine the stable/original <db_session_id,file_number> unique
identifier using just the file contents, not the file name. (It's a bit
complicated; detailed comments in block_based_table_reader.)

Also added DB host id to properties listing by sst_dump, which could be
useful in debugging.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8686

Test Plan: majorly overhauled StableCacheKeys test for this change

Reviewed By: zhichao-cao

Differential Revision: D30457742

Pulled By: pdillinger

fbshipit-source-id: 2e5ae7dddeb94fb9d8eac8a928486aed8b8cd445
2021-08-20 20:40:48 -07:00
Peter Dillinger
2a383f21f4 Add Bloom/Ribbon hybrid API support (#8679)
Summary:
This is essentially resurrection and fixing of the part of
https://github.com/facebook/rocksdb/issues/8198 that was reverted in https://github.com/facebook/rocksdb/issues/8212, using data added in https://github.com/facebook/rocksdb/issues/8246. Basically,
when configuring Ribbon filter, you can specify an LSM level before which
Bloom will be used instead of Ribbon. But Bloom is only considered for
Leveled and Universal compaction styles and file going into a known LSM
level. This way, SST file writer, FIFO compaction, etc. use Ribbon filter as
you would expect with NewRibbonFilterPolicy.

So that this can be controlled with a single int value and so that flushes
can be distinguished from intra-L0, we consider flush to go to level -1 for
the purposes of this option. (Explained in API comment.)

I also expect the most common and recommended Ribbon configuration to
use Bloom during flush, to minimize slowing down writes and because according
to my estimates, Ribbon only pays off if the structure lives in memory for
more than an hour. Thus, I have changed the default for NewRibbonFilterPolicy
to be this mild hybrid configuration. I don't really want to add something like
NewHybridFilterPolicy because at least the mild hybrid configuration (Bloom for
flush, Ribbon otherwise) should be considered a natural choice.

C APIs also updated, but because they don't support overloading,
rocksdb_filterpolicy_create_ribbon is kept pure ribbon for clarity and
rocksdb_filterpolicy_create_ribbon_hybrid must be called for a hybrid
configuration. While touching C API, I changed bits per key options from
int to double.

BuiltinFilterPolicy is needed so that LevelThresholdFilterPolicy doesn't inherit
unused fields from BloomFilterPolicy.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8679

Test Plan: new + updated tests, including crash test

Reviewed By: jay-zhuang

Differential Revision: D30445797

Pulled By: pdillinger

fbshipit-source-id: 6f5aeddfd6d79f7e55493b563c2d1d2d568892e1
2021-08-20 18:00:16 -07:00
Merlin Mao
baf22b4ee6 Add IteratorTraceExecutionResult for iterator related trace records. (#8687)
Summary:
- Allow to get `Valid()`, `status()`, `key()` and `value()` of an iterator from `IteratorTraceExecutionResult`.
- Move lower bound and upper bound from `IteratorSeekQueryTraceRecord` to `IteratorQueryTraceRecord`.

Added test in `DBTest2.TraceAndReplay`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8687

Reviewed By: zhichao-cao

Differential Revision: D30457630

Pulled By: autopear

fbshipit-source-id: be433099a25895b3aa6f0c00f95ad7b1d7489c1d
2021-08-20 15:35:56 -07:00
anand76
f35042ca40 Add a PerfContext counter for secondary cache hits (#8685)
Summary:
Add a PerfContext counter.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8685

Reviewed By: zhichao-cao

Differential Revision: D30453957

Pulled By: anand1976

fbshipit-source-id: 42888a3ced240e1c44446d52d3b04adfb01f5665
2021-08-20 15:17:30 -07:00
Merlin Mao
ff8953380f Add iterator's lower and upper bounds to TraceRecord (#8677)
Summary:
Trace file V2 added lower/upper bounds to `Iterator::Seek()` and `Iterator::SeekForPrev()`. They were not used anywhere during the execution of a `TraceRecord`. Now they are added to be used by `ReadOptions` during `Iterator::Seek()` and `Iterator::SeekForPrev()` if they are set.

Added test cases in `DBTest2.TraceAndManualReplay`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8677

Reviewed By: zhichao-cao

Differential Revision: D30438255

Pulled By: autopear

fbshipit-source-id: 82563006be0b69155990e506a74951c18af8d288
2021-08-19 17:27:12 -07:00
mrambacher
9eb002fcf0 Fix some minor issues in the Customizable infrastructure (#8566)
Summary:
- Fix issue with OptionType::Vector when the nested item is a Customizable with no names
- Fix issue with OptionType::Vector to appropriately wrap the elements in a Vector;
- Fix an issue with nested Customizable object with a null immutable object still appearing in the mutable options;
- Fix/Add tests for null/empty customizable objects
- Move the RegisterTestObjects from customizable_test into testutil.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8566

Reviewed By: zhichao-cao

Differential Revision: D30303724

Pulled By: mrambacher

fbshipit-source-id: 33fa8ea2a3b663210cb356da05e64aab7585b1b5
2021-08-19 10:10:47 -07:00
Merlin Mao
d10801e983 Allow Replayer to report the results of TraceRecords. (#8657)
Summary:
`Replayer::Execute()` can directly returns the result (e.g, request latency, DB::Get() return code, returned value, etc.)
`Replayer::Replay()` reports the results via a callback function.

New interface:
`TraceRecordResult` in "rocksdb/trace_record_result.h".

`DBTest2.TraceAndReplay` and `DBTest2.TraceAndManualReplay` are updated accordingly.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8657

Reviewed By: ajkr

Differential Revision: D30290216

Pulled By: autopear

fbshipit-source-id: 3c8d4e6b180ec743de1a9d9dcaee86064c74f0d6
2021-08-18 17:06:14 -07:00
Levi Tamasi
6878cedcc3 Add statistics support to integrated BlobDB (#8667)
Summary:
The patch adds statistics support to the integrated BlobDB implementation,
namely the tickers `BLOB_DB_BLOB_FILE_BYTES_READ` and
`BLOB_DB_GC_{NUM_KEYS,BYTES}_RELOCATED`, and the histograms
`BLOB_DB_(DE)COMPRESSION_MICROS`. (Some other statistics, like
`BLOB_DB_BLOB_FILE_BYTES_WRITTEN`, `BLOB_DB_BLOB_FILE_SYNCED`,
`BLOB_DB_BLOB_FILE_{READ,WRITE,SYNC}_MICROS` were already supported.)
Note that the vast majority of the old BlobDB's tickers/histograms are not
really applicable to the new implementation, since they e.g. pertain to calling
dedicated BlobDB APIs (which the integrated BlobDB does not have) or are
tied to the legacy BlobDB's design of writing blob files synchronously when
a write API is called. Such statistics are marked "legacy BlobDB only" in
`statistics.h`.

Fixes https://github.com/facebook/rocksdb/issues/8645 .

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8667

Test Plan: Ran `make check` and tested the new statistics using `db_bench`.

Reviewed By: riversand963

Differential Revision: D30356884

Pulled By: ltamasi

fbshipit-source-id: 5f8a833faee60401c5643c2f0a6c0415488190a4
2021-08-17 17:22:31 -07:00
anand76
add68bd28a Add a stat to count secondary cache hits (#8666)
Summary:
Add a stat for secondary cache hits. The ```Cache::Lookup``` API had an unused ```stats``` parameter. This PR uses that to pass the pointer to a ```Statistics``` object that ```LRUCache``` uses to record the stat.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8666

Test Plan: Update a unit test in lru_cache_test

Reviewed By: zhichao-cao

Differential Revision: D30353816

Pulled By: anand1976

fbshipit-source-id: 2046f78b460428877a26ffdd2bb914ae47dfbe77
2021-08-16 21:01:14 -07:00
Jay Zhuang
c55460c734 Add property LiveSstFilesSizeAtTemperature for tiered storage (#8644)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8644

Reviewed By: siying, zhichao-cao

Differential Revision: D30236535

Pulled By: jay-zhuang

fbshipit-source-id: 1758d1c46d83a5087560fb63d53a016bf999da81
2021-08-15 14:17:45 -07:00
Baptiste Lemaire
e51be2c5a1 Improve MemPurge sampling (#8656)
Summary:
Previously, the `MemPurge` sampling function was assessing whether a random entry from a memtable was garbage or not by simply querying the given memtable (see https://github.com/facebook/rocksdb/issues/8628 for more details).
In this diff, I am updating the sampling function by querying not only the memtable the entry was drawn from, but also all subsequent memtables that have a greater memtable ID.
I also added the size of the value for KV entries in the payload/useful payload estimates (which was also one of the reasons why sampling was not as good as mempurging all the time in terms of L0 SST files reduction).
Once these changes were made, I was able to clean obsolete objects and functions from the `MemtableList` struct, and did a bit of cleanup everywhere.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8656

Reviewed By: pdillinger

Differential Revision: D30288583

Pulled By: bjlemaire

fbshipit-source-id: 7646a545ec56f4715949daa59ab5eee74540feb3
2021-08-13 14:35:41 -07:00
Merlin Mao
74a652a45f Code cleanup for trace replayer (#8652)
Summary:
- Remove extra `;` in trace_record.h
- Remove some unnecessary `assert` in trace_record_handler.cc
- Initialize `env_` after` exec_handler_` in `ReplayerImpl` to let db be asserted in creating the handler before getting `db->GetEnv()`.
- Update history to include the new `TraceReader::Reset()`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8652

Reviewed By: ajkr

Differential Revision: D30276872

Pulled By: autopear

fbshipit-source-id: 476ee162e0f241490c6209307448343a5b326b37
2021-08-12 09:22:43 -07:00
Merlin Mao
f58d276764 Make TraceRecord and Replayer public (#8611)
Summary:
New public interfaces:
`TraceRecord` and `TraceRecord::Handler`, available in "rocksdb/trace_record.h".
`Replayer`, available in `rocksdb/utilities/replayer.h`.

User can use `DB::NewDefaultReplayer()` to create a Replayer to auto/manual replay a trace file.

Unit tests:
- `./db_test2 --gtest_filter="DBTest2.TraceAndReplay"`: Updated with the internal API changes.
- `./db_test2 --gtest_filter="DBTest2.TraceAndManualReplay"`: New for manual replay.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8611

Reviewed By: ajkr

Differential Revision: D30266329

Pulled By: autopear

fbshipit-source-id: 1ecb3cbbedae0f6a67c18f0cc82e002b4d81b6f8
2021-08-11 19:32:46 -07:00
Jay Zhuang
87e2358736 Add suggestion for btrfs user to disable preallocation (#8646)
Summary:
Add comment for `options.allow_fallocate` that btrfs
preallocated space are not freed and a suggestion to disable
preallocation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8646

Test Plan: No code change

Reviewed By: ajkr

Differential Revision: D30240050

Pulled By: jay-zhuang

fbshipit-source-id: 75b7190bc8276ce8d8ac2d0cb9064b386cbf4768
2021-08-11 14:53:37 -07:00
Baptiste Lemaire
e3a96c4823 Memtable sampling for mempurge heuristic. (#8628)
Summary:
Changes the API of the MemPurge process: the `bool experimental_allow_mempurge` and `experimental_mempurge_policy` flags have been replaced by a `double experimental_mempurge_threshold` option.
This change of API reflects another major change introduced in this PR: the MemPurgeDecider() function now works by sampling the memtables being flushed to estimate the overall amount of useful payload (payload minus the garbage), and then compare this useful payload estimate with the `double experimental_mempurge_threshold` value.
Therefore, when the value of this flag is `0.0` (default value), mempurge is simply deactivated. On the other hand, a value of `DBL_MAX` would be equivalent to always going through a mempurge regardless of the garbage ratio estimate.
At the moment, a `double experimental_mempurge_threshold` value else than 0.0 or `DBL_MAX` is opnly supported`with the `SkipList` memtable representation.
Regarding the sampling, this PR includes the introduction of a `MemTable::UniqueRandomSample` function that collects (approximately) random entries from the memtable by using the new `SkipList::Iterator::RandomSeek()` under the hood, or by iterating through each memtable entry, depending on the target sample size and the total number of entries.
The unit tests have been readapted to support this new API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8628

Reviewed By: pdillinger

Differential Revision: D30149315

Pulled By: bjlemaire

fbshipit-source-id: 1feef5390c95db6f4480ab4434716533d3947f27
2021-08-10 18:09:03 -07:00
sdong
e7c24168d8 Move old files to warm tier in FIFO compactions (#8310)
Summary:
Some FIFO users want to keep the data for longer, but the old data is rarely accessed. This feature allows users to configure FIFO compaction so that data older than a threshold is moved to a warm storage tier.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8310

Test Plan: Add several unit tests.

Reviewed By: ajkr

Differential Revision: D28493792

fbshipit-source-id: c14824ea634814dee5278b449ab5c98b6e0b5501
2021-08-09 12:51:14 -07:00
Roy Crihfield
d4b75d295f Add more C bindings for OptimisticTransactionDB (#8526)
Summary:
* `rocksdb_optimistictransactiondb_checkpoint_object_create`
* `rocksdb_optimistictransactiondb_write`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8526

Reviewed By: ajkr

Differential Revision: D30076822

Pulled By: jay-zhuang

fbshipit-source-id: a59956a8d5449e75d39a8087fbb2bad148cf697d
2021-08-06 19:10:48 -07:00
Peter Dillinger
a7fd1d0881 Make backup restore atomic, with sync option (#8568)
Summary:
Guarantees that if a restore is interrupted, DB::Open will fail. This works by
restoring CURRENT first to CURRENT.tmp then as a final step renaming to CURRENT.

Also makes restore respect BackupEngineOptions::sync (default true). When set,
the restore is guaranteed persisted by the time it returns OK. Also makes the above
atomicity guarantee work in case the interruption is power loss or OS crash (not just
process interruption or crash).

Fixes https://github.com/facebook/rocksdb/issues/8500

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8568

Test Plan:
added to backup mini-stress unit test. Passes with
gtest_repeat=100 (whereas fails 7 times without the CURRENT.tmp)

Reviewed By: akankshamahajan15

Differential Revision: D29812605

Pulled By: pdillinger

fbshipit-source-id: 24e9a993b305b1835ca95558fa7a7152e54cda8e
2021-08-06 09:50:21 -07:00
mrambacher
d057e8326d Make MergeOperator+CompactionFilter/Factory into Customizable Classes (#8481)
Summary:
- Changed MergeOperator, CompactionFilter, and CompactionFilterFactory into Customizable classes.
 - Added Options/Configurable/Object Registration for TTL and Cassandra variants
 - Changed the StringAppend MergeOperators to accept a string delimiter rather than a simple char.  Made the delimiter into a configurable option
 - Added tests for new functionality

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8481

Reviewed By: zhichao-cao

Differential Revision: D30136050

Pulled By: mrambacher

fbshipit-source-id: 271d1772835935b6773abaf018ee71e42f9491af
2021-08-06 08:27:25 -07:00
Akanksha Mahajan
fd2079938d Dynamically configure BlockBasedTableOptions.prepopulate_block_cache (#8620)
Summary:
Dynamically configure BlockBasedTableOptions.prepopulate_block_cache using DB::SetOptions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8620

Test Plan: Added new unit test

Reviewed By: anand1976

Differential Revision: D30091319

Pulled By: akankshamahajan15

fbshipit-source-id: fb586d1848a8dd525bba7b2f9eeac34f2fc6d82c
2021-08-05 19:44:51 -07:00
Akanksha Mahajan
8b2f60b668 Cache warming blocks during flush (#8561)
Summary:
Insert warm blocks  (data, uncompressed dict, index and filter blocks) during flush in Block cache which is enabled under option BlockBasedTableOptions.prepopulate_block_cache.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8561

Test Plan: Added unit test

Reviewed By: anand1976

Differential Revision: D29773411

Pulled By: akankshamahajan15

fbshipit-source-id: 6631123c10134340ef0bd7e90baafaa6deba0e66
2021-08-03 12:44:15 -07:00
Merlin Mao
4811115b3e Revert checkpoint fix (#8607)
Summary:
PR https://github.com/facebook/rocksdb/pull/8572 looses custom types in the options file. Need more API changes to fix this issue. Revert this PR.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8607

Reviewed By: ajkr

Differential Revision: D30058289

Pulled By: autopear

fbshipit-source-id: 78f5a154c0bf193e8441bae4a36fa79b95277fd4
2021-08-02 18:29:35 -07:00
mrambacher
3aee4fbd41 Make EventListener into a Customizable Class (#8473)
Summary:
- Added Type/CreateFromString
- Added ability to load EventListeners to DBOptions
- Since EventListeners did not previously have a Name(), defaulted to "".  If there is no name, the listener cannot be loaded from the ObjectRegistry.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8473

Reviewed By: zhichao-cao

Differential Revision: D29901488

Pulled By: mrambacher

fbshipit-source-id: 2d3a4aa6db1562ac03e7ad41b360e3521d486254
2021-07-27 07:47:02 -07:00
Baptiste Lemaire
4361d6d163 Add simple heuristics for experimental mempurge. (#8583)
Summary:
Add `experimental_mempurge_policy` option flag and introduce two new `MemPurge` (Memtable Garbage Collection) policies: 'ALWAYS' and 'ALTERNATE'. Default value: ALTERNATE.
`ALWAYS`: every flush will first go through a `MemPurge` process. If the output is too big to fit into a single memtable, then the mempurge is aborted and a regular flush process carries on. `ALWAYS` is designed for user that need to reduce the number of L0 SST file created to a strict minimum, and can afford a small dent in performance (possibly hits to CPU usage, read efficiency, and maximum burst write throughput).
`ALTERNATE`: a flush is transformed into a `MemPurge` except if one of the memtables being flushed is the product of a previous `MemPurge`. `ALTERNATE` is a good tradeoff between reduction in number of L0 SST files created and performance. `ALTERNATE` perform particularly well for completely random garbage ratios, or garbage ratios anywhere in (0%,50%], and even higher when there is a wild variability in garbage ratios.
This PR also includes support for `experimental_mempurge_policy` in `db_bench`.
Testing was done locally by replacing all the `MemPurge` policies of the unit tests with `ALTERNATE`, as well as local testing with `db_crashtest.py` `whitebox` and `blackbox`. Overall, if an `ALWAYS` mempurge policy passes the tests, there is no reasons why an `ALTERNATE` policy would fail, and therefore the mempurge policy was set to `ALWAYS` for all mempurge unit tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8583

Reviewed By: pdillinger

Differential Revision: D29888050

Pulled By: bjlemaire

fbshipit-source-id: e2cf26646d66679f6f5fb29842624615610759c1
2021-07-26 11:56:29 -07:00
Merlin Mao
55f7ded80d Checkpoint dir options fix (#8572)
Summary:
Originally the 2 options `db_log_dir` and `wal_dir` will be reused in a snapshot db since the options files are just copied. By default, if `wal_dir` was not set when a db was created, it is set to the db's dir. Therefore, the snapshot db will use the same WAL dir. If both the original db and the snapshot db write to or delete from the WAL dir, one may modify or delete files which belong to the other. The same applies to `db_log_dir` as well, but as info log files are not copied or linked, it is simpler for this option.

2 arguments are added to `Checkpoint::CreateCheckpoint()`, allowing to override these 2 options.

`wal_dir`:  If the function argument `wal_dir` is empty, or set to the original db location, or the checkpoint location, the snapshot's `wal_dir` option will be updated to the checkpoint location. Otherwise, the absolute path specified in the argument will be used. During checkpointing, live WAL files will be copied or linked the new location, instead of the current WAL dir specified in the original db.

`db_log_dir`: Same as `wal_dir`, but no files will be copied or linked.

A new unit test was added: `CheckpointTest.CheckpointWithOptionsDirsTest`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8572

Test Plan:
New unit test
```
checkpoint_test --gtest_filter="CheckpointTest.CheckpointWithOptionsDirsTest"
```

Output
```
Note: Google Test filter = CheckpointTest.CheckpointWithOptionsDirsTest
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from CheckpointTest
[ RUN      ] CheckpointTest.CheckpointWithOptionsDirsTest
[       OK ] CheckpointTest.CheckpointWithOptionsDirsTest (11712 ms)
[----------] 1 test from CheckpointTest (11712 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (11713 ms total)
[  PASSED  ] 1 test.
```
This test will fail without this patch. Just modify the code to remove the 2 arguments introduced in this patch in `CreateCheckpoint()`.

Reviewed By: zhichao-cao

Differential Revision: D29832761

Pulled By: autopear

fbshipit-source-id: e6a639b4d674380df82998c0839e79cab695fe29
2021-07-23 11:13:01 -07:00
Jay Zhuang
c4a503f3df Fix an race condition during multiple DB opening (#8574)
Summary:
ObjectLibrary is shared between multiple DB instances, the
Register() could have race condition.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8574

Test Plan: pass the failed test

Reviewed By: ajkr

Differential Revision: D29855096

Pulled By: jay-zhuang

fbshipit-source-id: 541eed0bd495d2c963d858d81e7eabf1ba16153c
2021-07-22 13:43:06 -07:00
Jay Zhuang
c04a86a0e9 Update HISTORY.md and version.h 6.23 release (#8552)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8552

Reviewed By: ajkr

Differential Revision: D29746828

Pulled By: jay-zhuang

fbshipit-source-id: 17d564895ae9cb675d455e73626b9a6717db6279
2021-07-16 17:52:14 -07:00
Merlin Mao
3455ab0e2b Remove extra double quote in options.h (#8550)
Summary:
There is an extra "  in options.h (`"index block""`)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8550

Test Plan: None

Reviewed By: jay-zhuang

Differential Revision: D29746077

Pulled By: autopear

fbshipit-source-id: 2e5117296e5414b7c7440d990926bc1e567a0b4f
2021-07-16 17:05:25 -07:00
Peter Dillinger
df5dc73bec Don't hold DB mutex for block cache entry stat scans (#8538)
Summary:
I previously didn't notice the DB mutex was being held during
block cache entry stat scans, probably because I primarily checked for
read performance regressions, because they require the block cache and
are traditionally latency-sensitive.

This change does some refactoring to avoid holding DB mutex and to
avoid triggering and waiting for a scan in GetProperty("rocksdb.cfstats").
Some tests have to be updated because now the stats collector is
populated in the Cache aggressively on DB startup rather than lazily.
(I hope to clean up some of this added complexity in the future.)

This change also ensures proper treatment of need_out_of_mutex for
non-int DB properties.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8538

Test Plan:
Added unit test logic that uses sync points to fail if the DB mutex
is held during a scan, covering the various ways that a scan might be
triggered.

Performance test - the known impact to holding the DB mutex is on
TransactionDB, and the easiest way to see the impact is to hack the
scan code to almost always miss and take an artificially long time
scanning. Here I've injected an unconditional 5s sleep at the call to
ApplyToAllEntries.

Before (hacked):

    $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
    randomtransaction :     433.219 micros/op 2308 ops/sec;    0.1 MB/s ( transactions:78999 aborts:0)
    rocksdb.db.write.micros P50 : 16.135883 P95 : 36.622503 P99 : 66.036115 P100 : 5000614.000000 COUNT : 149677 SUM : 8364856
    $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
    randomtransaction :     448.802 micros/op 2228 ops/sec;    0.1 MB/s ( transactions:75999 aborts:0)
    rocksdb.db.write.micros P50 : 16.629221 P95 : 37.320607 P99 : 72.144341 P100 : 5000871.000000 COUNT : 143995 SUM : 13472323

Notice the 5s P100 write time.

After (hacked):

    $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
    randomtransaction :     303.645 micros/op 3293 ops/sec;    0.1 MB/s ( transactions:98999 aborts:0)
    rocksdb.db.write.micros P50 : 16.061871 P95 : 33.978834 P99 : 60.018017 P100 : 616315.000000 COUNT : 187619 SUM : 4097407
    $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
    randomtransaction :     310.383 micros/op 3221 ops/sec;    0.1 MB/s ( transactions:96999 aborts:0)
    rocksdb.db.write.micros P50 : 16.270026 P95 : 35.786844 P99 : 64.302878 P100 : 603088.000000 COUNT : 183819 SUM : 4095918

P100 write is now ~0.6s. Not good, but it's the same even if I completely bypass all the scanning code:

    $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
    randomtransaction :     311.365 micros/op 3211 ops/sec;    0.1 MB/s ( transactions:96999 aborts:0)
    rocksdb.db.write.micros P50 : 16.274362 P95 : 36.221184 P99 : 68.809783 P100 : 649808.000000 COUNT : 183819 SUM : 4156767
    $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
    randomtransaction :     308.395 micros/op 3242 ops/sec;    0.1 MB/s ( transactions:97999 aborts:0)
    rocksdb.db.write.micros P50 : 16.106222 P95 : 37.202403 P99 : 67.081875 P100 : 598091.000000 COUNT : 185714 SUM : 4098832

No substantial difference.

Reviewed By: siying

Differential Revision: D29738847

Pulled By: pdillinger

fbshipit-source-id: 1c5c155f5a1b62e4fea0fd4eeb515a8b7474027b
2021-07-16 14:13:08 -07:00
Mark Rambacher
42ba60b3ba Make EncryptionProvider and BlockCipher into Customizable objects (#8354)
Summary:
Made the EncryptionProvider and BlockCipher classes inherit from Customizable.  Added/fixed the CreateFromString method to these classes to create instances from builtin or registered classes.  Added tests to verify that instances can be registered and retrieved as appropriate.

Added the ability to configure the builtin (CTR, ROT13) classes from configurable properties.  Added the appropriate tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8354

Reviewed By: zhichao-cao

Differential Revision: D29558949

Pulled By: mrambacher

fbshipit-source-id: c20286b32d179777e060f51a58943e9b0cf81d04
2021-07-16 07:58:51 -07:00
zaorangyang
b678cb1f86 The formal parameter types of CompressionOptions constructor should b… (#8510)
Summary:
…e consistent with the member variables's

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8510

Reviewed By: ajkr

Differential Revision: D29654067

Pulled By: mrambacher

fbshipit-source-id: 908baaddfb20c266db7c5aca6a87971393d62ee6
2021-07-15 18:06:50 -07:00
mrambacher
c8665611bc Make FlushBlockPolicyFactory into a Customizable class (#8432)
Summary:
Add ability to treat FlushBlockPolicyFactory as a Customizable.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8432

Reviewed By: zhichao-cao

Differential Revision: D29558941

Pulled By: mrambacher

fbshipit-source-id: 4a791af941ea4a65fc2f1fdfb1d7a95f42ca6774
2021-07-12 09:04:59 -07:00
qieqieplus
bb485e986a Add ribbon filter to C API (#8486)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8486

Reviewed By: jay-zhuang

Differential Revision: D29625501

Pulled By: pdillinger

fbshipit-source-id: e6e2a455ae62a71f3a202278a751b9bba17ad03c
2021-07-09 16:22:48 -07:00
sdong
f127d459ad Add comments to options.bottommost_compression (#8415)
Summary:
Add comments to options.bottommost_compression for options.num_levels=1

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8415

Reviewed By: ajkr

Differential Revision: D29181997

fbshipit-source-id: 5f0f49470f75d796320ecb24d5dc4ef4eb6fbe0f
2021-07-08 10:50:59 -07:00
mrambacher
570248aeff Make SecondaryCache Customizable (#8480)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8480

Reviewed By: zhichao-cao

Differential Revision: D29528740

Pulled By: mrambacher

fbshipit-source-id: fd0f70d15f66611c8498257a9973f7e98ca13839
2021-07-06 09:18:08 -07:00
Baptiste Lemaire
9dc887ece0 Memtable "MemPurge" prototype (#8454)
Summary:
Implement an experimental feature called "MemPurge", which consists in purging "garbage" bytes out of a memtable and reuse the memtable struct instead of making it immutable and eventually flushing its content to storage.
The prototype is by default deactivated and is not intended for use. It is intended for correctness and validation testing. At the moment, the "MemPurge" feature can be switched on by using the `options.experimental_allow_mempurge` flag. For this early stage, when the allow_mempurge flag is set to `true`, all the flush operations will be rerouted to perform a MemPurge. This is a temporary design decision that will give us the time to explore meaningful heuristics to use MemPurge at the right time for relevant workloads . Moreover, the current MemPurge operation only supports `Puts`, `Deletes`, `DeleteRange` operations, and handles `Iterators` as well as `CompactionFilter`s that are invoked at flush time .
Three unit tests are added to `db_flush_test.cc` to test if MemPurge works correctly (and checks that the previously mentioned operations are fully supported thoroughly tested).
One noticeable design decision is the timing of the MemPurge operation in the memtable workflow: for this prototype, the mempurge happens when the memtable is switched (and usually made immutable). This is an inefficient process because it implies that the entirety of the MemPurge operation happens while holding the db_mutex. Future commits will make the MemPurge operation a background task (akin to the regular flush operation) and aim at drastically enhancing the performance of this operation. The MemPurge is also not fully "WAL-compatible" yet, but when the WAL is full, or when the regular MemPurge operation fails (or when the purged memtable still needs to be flushed), a regular flush operation takes place. Later commits will also correct these behaviors.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8454

Reviewed By: anand1976

Differential Revision: D29433971

Pulled By: bjlemaire

fbshipit-source-id: 6af48213554e35048a7e03816955100a80a26dc5
2021-07-02 05:23:02 -07:00
mrambacher
41c4b665f4 Fix PrepareOptions for Customizable Classes (#8468)
Summary:
Added the Customizable::ConfigureNewObject method.  The method will configure the object if options are found and invoke PrepareOptions if the flag is set properly.

Added tests to test that PrepareOptions is properly called and to test if PrepareOptions fails.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8468

Reviewed By: zhichao-cao

Differential Revision: D29494703

Pulled By: mrambacher

fbshipit-source-id: d5767dee5d7a98620ac66190262101cd0aa9d2b7
2021-06-30 14:09:36 -07:00
Jay Zhuang
93a7389442 Add statistics support on CompactionService remote side (#8368)
Summary:
Add statistics option on CompactionService remote side.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8368

Test Plan: `make check`

Reviewed By: ajkr

Differential Revision: D28944427

Pulled By: jay-zhuang

fbshipit-source-id: 2a19217f4a69b6e511af87eed12391860ef00c5e
2021-06-29 11:48:14 -07:00
Jay Zhuang
3503f28982 Add sub-compaction support for RemoteCompaction (#8364)
Summary:
Change the job_id for remote compaction interface, which will include
both internal compaction job_id, also a sub_compaction_job_id. It is not
a backward compatible change. The user needs to update interface during
upgrade. (We will avoid backward incompatible change after the feature is
not experimental.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8364

Reviewed By: ajkr

Differential Revision: D28917301

Pulled By: jay-zhuang

fbshipit-source-id: 6d72a21f652bb517ad6954d0387b496797fc4e11
2021-06-29 10:42:19 -07:00
mrambacher
89f66d4484 Add customizable_util.h to the public API (#8301)
Summary:
Useful for allowing new classes to create and manage Customizable objects without using internal APIs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8301

Reviewed By: zhichao-cao

Differential Revision: D29428303

Pulled By: mrambacher

fbshipit-source-id: 3d33d5197cc8379fe35b54d3d169f91f0dfe7a47
2021-06-29 09:08:57 -07:00
mrambacher
be219089ad Add BlobMetaData retrieval methods (#8273)
Summary:
Added BlobMetaData to ColumnFamilyMetaData and LiveBlobMetaData and DB API GetLiveBlobMetaData to retrieve it.

First pass at struct.  More tests and maybe fields to come...

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8273

Reviewed By: ltamasi

Differential Revision: D29102400

Pulled By: mrambacher

fbshipit-source-id: 8a2383a4446328be6b91dced9841fdd3dfc80b73
2021-06-28 08:13:29 -07:00
lorinlee
e9b627fd56 Fix comments in Status (#8429)
Summary:
Signed-off-by: lorinlee <lorinlee1996@gmail.com>

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8429

Reviewed By: zhichao-cao

Differential Revision: D29256538

Pulled By: ajkr

fbshipit-source-id: fa483861148d82119f4c8775d0cf5f6c71b06215
2021-06-21 13:41:08 -07:00
Andrew Kryczka
7ed69fbe72 Update HISTORY.md and version.h 6.22 release (#8427)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8427

Reviewed By: zhichao-cao

Differential Revision: D29246916

Pulled By: ajkr

fbshipit-source-id: ccd44ca1a6dd5101dc37f19b8e1fe6c0e3883e0a
2021-06-19 12:12:46 -07:00
anand76
8ea0a2c1bd Parallelize secondary cache lookup in MultiGet (#8405)
Summary:
Implement the ```WaitAll()``` interface in ```LRUCache``` to allow callers to issue multiple lookups in parallel and wait for all of them to complete. Modify ```MultiGet``` to use this to parallelize the secondary cache lookups in order to reduce the overall latency. A call to ```cache->Lookup()``` returns a handle that has an incomplete value (nullptr), and the caller can call ```cache->IsReady()``` to check whether the lookup is complete, and pass a vector of handles to ```WaitAll``` to wait for completion. If any of the lookups fail, ```MultiGet``` will read the block from the SST file.

Another change in this PR is to rename ```SecondaryCacheHandle``` to ```SecondaryCacheResultHandle``` as it more accurately describes the return result of the secondary cache lookup, which is more like a future.

Tests:
1. Add unit tests in lru_cache_test
2. Benchmark results with no secondary cache configured
Master -
```
readrandom   :      41.175 micros/op 388562 ops/sec;  106.7 MB/s (7277999 of 7277999 found)
readrandom   :      41.217 micros/op 388160 ops/sec;  106.6 MB/s (7274999 of 7274999 found)
multireadrandom :      10.309 micros/op 1552082 ops/sec; (28908992 of 28908992 found)
multireadrandom :      10.321 micros/op 1550218 ops/sec; (29081984 of 29081984 found)
```

This PR -
```
readrandom   :      41.158 micros/op 388723 ops/sec;  106.8 MB/s (7290999 of 7290999 found)
readrandom   :      41.185 micros/op 388463 ops/sec;  106.7 MB/s (7287999 of 7287999 found)
multireadrandom :      10.277 micros/op 1556801 ops/sec; (29346944 of 29346944 found)
multireadrandom :      10.253 micros/op 1560539 ops/sec; (29274944 of 29274944 found)
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8405

Reviewed By: zhichao-cao

Differential Revision: D29190509

Pulled By: anand1976

fbshipit-source-id: 6f8eff6246712af8a297cfe22ea0d1c3b2a01bb0
2021-06-18 09:35:59 -07:00
Baptiste Lemaire
e817bc9628 Added memtable garbage statistics (#8411)
Summary:
**Summary**:
2 new statistics counters are added to RocksDB: `MEMTABLE_PAYLOAD_BYTES_AT_FLUSH` and `MEMTABLE_GARBAGE_BYTES_AT_FLUSH`. The former tracks how many raw bytes of useful data are present on the memtable at flush time, whereas the latter is tracks how many of these raw bytes are considered garbage, meaning that they ended up not being imported on the SSTables resulting from the flush operations.

**Unit test**: run `make db_flush_test -j$(nproc); ./db_flush_test` to run the unit test.
This executable includes 3 tests, that test support and correct stat calculations for workloads with inserts, deletes, and DeleteRanges. The parameters are set such that the workloads are performed on a single memtable, and a single SSTable is created as a result of the flush operation. The flush operation is manually called in the test file. The tests verify that the values of these 2 statistics counters introduced in this PR  can be exactly predicted, showing that we have a full understanding of the underlying operations.

**Performance testing**:
`./db_bench -statistics -benchmarks=fillrandom -num=10000000` repeated 10 times.
Timing done using "date" function in a bash script.
_Results_:
Original Rocksdb fork: mean 66.6 sec, std 1.18 sec.
This feature branch: mean 67.4 sec, std 1.35 sec.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8411

Reviewed By: akankshamahajan15

Differential Revision: D29150629

Pulled By: bjlemaire

fbshipit-source-id: 7b3c2e86d50c6aa34fa50fd134282eacb543a5b1
2021-06-18 04:57:27 -07:00
Akanksha Mahajan
5ba1b6e549 Cache warming data blocks during flush (#8242)
Summary:
This PR prepopulates warm/hot data blocks which are already in memory
into block cache at the time of flush. On a flush, the data block that is
in memory (in memtables) get flushed to the device. If using Direct IO,
additional IO is incurred to read this data back into memory again, which
is avoided by enabling newly added option.

 Right now, this is enabled only for flush for data blocks. We plan to
expand this option to cover compactions in the future and for other types
 of blocks.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8242

Test Plan: Add new unit test

Reviewed By: anand1976

Differential Revision: D28521703

Pulled By: akankshamahajan15

fbshipit-source-id: 7219d6958821cedce689a219c3963a6f1a9d5f05
2021-06-17 21:56:47 -07:00
Peter Dillinger
865a25101d Mark Ribbon filter and optimize_filters_for_memory as production (#8408)
Summary:
Marked the Ribbon filter and optimize_filters_for_memory features
as production-ready, each enabling memory savings for Bloom-like filters.
Use `NewRibbonFilterPolicy` in place of `NewBloomFilterPolicy` to use
Ribbon filters instead of Bloom, or `ribbonfilter` in place of
`bloomfilter` in configuration string.

Some small refactoring in db_stress.

Removed/refactored unused code in db_bench, in part preparing for future
default possibly being different from "disabled."

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8408

Test Plan:
Lots of prior automated, ad-hoc, and "real world" testing.
Updated tests for new API names. Quick db_bench test:

bloom fillrandom
77730 ops/sec
rocksdb.block.cache.filter.bytes.insert COUNT : 89929384

ribbon fillrandom
71492 ops/sec
rocksdb.block.cache.filter.bytes.insert COUNT : 64531384

Reviewed By: mrambacher

Differential Revision: D29140805

Pulled By: pdillinger

fbshipit-source-id: d742c922722421678f95ad85eeb0aaebc9f5e49a
2021-06-17 12:29:16 -07:00
mrambacher
d81c2d1e3d Add Inner method and support for wrapped Customizable (#8373)
Summary:
Inner method added for classes to override to return has-a relationship.  CheckedCast expanded to use Inner to return has-a relationship.

Future Customizable classes (Env, FileSystem, Statistics, etc) will use this feature

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8373

Reviewed By: pdillinger

Differential Revision: D29176369

Pulled By: mrambacher

fbshipit-source-id: cfb6d7702fa365ca4e40c4a50a19e3a534e5ac43
2021-06-17 08:39:36 -07:00
mrambacher
281ac9c89e Add CreateFrom methods to Env/FileSystem (#8174)
Summary:
- Added CreateFromString method to Env and FilesSystem to replace LoadEnv/Load.  This method/signature is a precursor to making these classes extend Customizable.

- Added CreateFromSystem to Env.  This method standardizes creating an Env from the environment variables.  Previously, some places would check TEST_ENV_URI and others would also check TEST_FS_URI.  Now the code is more command/standardized.

- Added CreateFromFlags to Env.  These method allows Env to be create from string options (such as GFLAGS options) in a more standard way.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8174

Reviewed By: zhichao-cao

Differential Revision: D28999603

Pulled By: mrambacher

fbshipit-source-id: 88e6911e7e91f908458a7fe10a20e93ecbc275fb
2021-06-15 03:43:48 -07:00
Peter Dillinger
416dc7bed7 Clarify some DB::Open,OpenForReadOnly semantics (#8379)
Summary:
Longstanding tech debt

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8379

Test Plan: Better than not having an API contract

Reviewed By: jay-zhuang

Differential Revision: D29011131

Pulled By: pdillinger

fbshipit-source-id: 2c4796177733651954024fc17875f8642ca08d09
2021-06-14 16:09:21 -07:00
Hui Xiao
dcddc1065e Make CompactionService derived from Customizable (#8395)
Summary:
(1)Make CompactionService derived from Customizable by defining two extra functions that are needed, as described in customizable.h comment section
(2)Revise the MyTestCompactionService class in compaction_service_test.cc to satisfy the class inheritance requirement
(3)Specify namespace of ToString() in compaction_service_test.cc to avoid function collision with CompactionService's ancestor classes

Test did:
make -j24 compaction_service_test
./compaction_service_test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8395

Reviewed By: jay-zhuang

Differential Revision: D29076068

Pulled By: hx235

fbshipit-source-id: c130100fa466939b3137e917f5fdc4b2ae8e37d4
2021-06-14 11:41:57 -07:00
mrambacher
6ad0810393 Make Comparator into a Customizable Object (#8336)
Summary:
Makes the Comparator class into a Customizable object.  Added/Updated the CreateFromString method to create Comparators.  Added test for using the ObjectRegistry to create one.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8336

Reviewed By: jay-zhuang

Differential Revision: D28999612

Pulled By: mrambacher

fbshipit-source-id: bff2cb2814eeb9fef6a00fddc61d6e34b6fbcf2e
2021-06-11 06:22:59 -07:00
David Devecsery
80a59a03a7 Cancel compact range (#8351)
Summary:
Added the ability to cancel an in-progress range compaction by storing to an atomic "canceled" variable pointed to within the CompactRangeOptions structure.

Tested via two tests added to db_tests2.cc.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8351

Reviewed By: ajkr

Differential Revision: D28808894

Pulled By: ddevec

fbshipit-source-id: cb321361c9e23b084b188bb203f11c375a22c2dd
2021-06-07 11:41:31 -07:00
anand76
799cf37cb1 Update HISTORY and version to 6.21 (#8363)
Summary:
Update HISTORY and version to 6.21 on master.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8363

Reviewed By: jay-zhuang

Differential Revision: D28888818

Pulled By: anand1976

fbshipit-source-id: 9e5fac3b99ecc9f3b7d9f21474a39fa50decb117
2021-06-03 19:32:14 -07:00
sdong
dfa6b408fe Improve comments of iterate_upper_bound (#8331)
Summary:
ReadOptions.iterate_upper_bound's comment is confusing. Improve it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8331

Reviewed By: jay-zhuang

Differential Revision: D28696635

fbshipit-source-id: 7d9fa6fd1642562572140998c89d434058db8dda
2021-05-26 18:23:03 -07:00
Peter Dillinger
956ce9bde2 Some API clarification for manual compaction and listeners (#8330)
Summary:
Avoid people hitting bugs

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8330

Test Plan: comments only

Reviewed By: siying

Differential Revision: D28683157

Pulled By: pdillinger

fbshipit-source-id: 2b34d3efb5e2fa34bea93d54c940cbd425212d25
2021-05-26 08:14:38 -07:00
Zhichao Cao
7303d02bdf Use new Insert and Lookup APIs in table reader to support secondary cache (#8315)
Summary:
Secondary cache is implemented to achieve the secondary cache tier for block cache. New Insert and Lookup APIs are introduced in https://github.com/facebook/rocksdb/issues/8271  . To support and use the secondary cache in block based table reader, this PR introduces the corresponding callback functions that will be used in secondary cache, and update the Insert and Lookup APIs accordingly.

benchmarking:
./db_bench --benchmarks="fillrandom" -num=1000000 -key_size=32 -value_size=256 -use_direct_io_for_flush_and_compaction=true -db=/tmp/rocks_t/db -partition_index_and_filters=true

./db_bench -db=/tmp/rocks_t/db -use_existing_db=true -benchmarks=readrandom -num=1000000 -key_size=32 -value_size=256 -use_direct_reads=true -cache_size=1073741824 -cache_numshardbits=5 -cache_index_and_filter_blocks=true -read_random_exp_range=17 -statistics -partition_index_and_filters=true -stats_dump_period_sec=30 -reads=50000000

master benchmarking results:
readrandom   :       3.923 micros/op 254881 ops/sec;   33.4 MB/s (23849796 of 50000000 found)
rocksdb.db.get.micros P50 : 2.820992 P95 : 5.636716 P99 : 16.450553 P100 : 8396.000000 COUNT : 50000000 SUM : 179947064

Current PR benchmarking results
readrandom   :       4.083 micros/op 244925 ops/sec;   32.1 MB/s (23849796 of 50000000 found)
rocksdb.db.get.micros P50 : 2.967687 P95 : 5.754916 P99 : 15.665912 P100 : 8213.000000 COUNT : 50000000 SUM : 187250053

About 3.8% throughput reduction.
P50: 5.2% increasing, P95, 2.09% increasing, P99 4.77% improvement

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8315

Test Plan: added the testing case

Reviewed By: anand1976

Differential Revision: D28599774

Pulled By: zhichao-cao

fbshipit-source-id: 098c4df0d7327d3a546df7604b2f1602f13044ed
2021-05-21 18:29:12 -07:00
Peter Dillinger
3469d60fcc Add table properties for number of entries added to filters (#8323)
Summary:
With Ribbon filter work and possible variance in actual bits
per key (or prefix; general term "entry") to achieve certain FP rates,
I've received a request to be able to track actual bits per key in
generated filters. This change adds a num_filter_entries table
property, which can be combined with filter_size to get bits per key
(entry).

This can vary from num_entries in at least these ways:
* Different versions of same key are only counted once in filters.
* With prefix filters, several user keys map to the same filter entry.
* A single filter can include both prefixes and user keys.

Note that FilterBlockBuilder::NumAdded() didn't do anything useful
except distinguish empty from non-empty.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8323

Test Plan: basic unit test included, others updated

Reviewed By: jay-zhuang

Differential Revision: D28596210

Pulled By: pdillinger

fbshipit-source-id: 529a111f3c84501e5a470bc84705e436ee68c376
2021-05-21 17:11:32 -07:00
sdong
2f1984dd45 Compare memtable insert and flush count (#8288)
Summary:
When a memtable is flushed, it will validate number of entries it reads, and compare the number with how many entries inserted into memtable. This serves as one sanity c\
heck against memory corruption. This change will also allow more counters to be added in the future for better validation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8288

Test Plan: Pass all existing tests

Reviewed By: ajkr

Differential Revision: D28369194

fbshipit-source-id: 7ff870380c41eab7f99eee508550dcdce32838ad
2021-05-20 16:07:28 -07:00
Jay Zhuang
3786181a90 Add remote compaction public API (#8300)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8300

Reviewed By: ajkr

Differential Revision: D28464726

Pulled By: jay-zhuang

fbshipit-source-id: 49e9f4fb791808a6cbf39a7b1a331373f645fc5e
2021-05-19 21:41:31 -07:00
Peter Dillinger
311a544c2a Use deleters to label cache entries and collect stats (#8297)
Summary:
This change gathers and publishes statistics about the
kinds of items in block cache. This is especially important for
profiling relative usage of cache by index vs. filter vs. data blocks.
It works by iterating over the cache during periodic stats dump
(InternalStats, stats_dump_period_sec) or on demand when
DB::Get(Map)Property(kBlockCacheEntryStats), except that for
efficiency and sharing among column families, saved data from
the last scan is used when the data is not considered too old.

The new information can be seen in info LOG, for example:

    Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0
    Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%)

And also through DB::GetProperty and GetMapProperty (here using
ldb just for demonstration):

    $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats
    rocksdb.block-cache-entry-stats.bytes.data-block: 0
    rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0
    rocksdb.block-cache-entry-stats.bytes.filter-block: 0
    rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0
    rocksdb.block-cache-entry-stats.bytes.index-block: 178992
    rocksdb.block-cache-entry-stats.bytes.misc: 0
    rocksdb.block-cache-entry-stats.bytes.other-block: 0
    rocksdb.block-cache-entry-stats.bytes.write-buffer: 0
    rocksdb.block-cache-entry-stats.capacity: 8388608
    rocksdb.block-cache-entry-stats.count.data-block: 0
    rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0
    rocksdb.block-cache-entry-stats.count.filter-block: 0
    rocksdb.block-cache-entry-stats.count.filter-meta-block: 0
    rocksdb.block-cache-entry-stats.count.index-block: 215
    rocksdb.block-cache-entry-stats.count.misc: 1
    rocksdb.block-cache-entry-stats.count.other-block: 0
    rocksdb.block-cache-entry-stats.count.write-buffer: 0
    rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290
    rocksdb.block-cache-entry-stats.percent.data-block: 0.000000
    rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000
    rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000
    rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000
    rocksdb.block-cache-entry-stats.percent.index-block: 2.133751
    rocksdb.block-cache-entry-stats.percent.misc: 0.000000
    rocksdb.block-cache-entry-stats.percent.other-block: 0.000000
    rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000
    rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052
    rocksdb.block-cache-entry-stats.secs_since_last_collection: 0

Solution detail - We need some way to flag what kind of blocks each
entry belongs to, preferably without changing the Cache API.
One of the complications is that Cache is a general interface that could
have other users that don't adhere to whichever convention we decide
on for keys and values. Or we would pay for an extra field in the Handle
that would only be used for this purpose.

This change uses a back-door approach, the deleter, to indicate the
"role" of a Cache entry (in addition to the value type, implicitly).
This has the added benefit of ensuring proper code origin whenever we
recognize a particular role for a cache entry; if the entry came from
some other part of the code, it will use an unrecognized deleter, which
we simply attribute to the "Misc" role.

An internal API makes for simple instantiation and automatic
registration of Cache deleters for a given value type and "role".

Another internal API, CacheEntryStatsCollector, solves the problem of
caching the results of a scan and sharing them, to ensure scans are
neither excessive nor redundant so as not to harm Cache performance.

Because code is added to BlocklikeTraits, it is pulled out of
block_based_table_reader.cc into its own file.

This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option
(could still be added), and with actual stat gathering.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297

Test Plan: manual testing with db_bench, and a couple of basic unit tests

Reviewed By: ltamasi

Differential Revision: D28488721

Pulled By: pdillinger

fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
2021-05-19 16:51:13 -07:00
Glebanister
748e3acc11 Add StartThread type checking wrapper (#8303)
Summary:
- Add class `FunctorWrapper` to invoke the function with given parameters
- Implement `StartThreadTyped` which wraps `StartThread` with type checking cover
- Demonstrate `StartThreadTyped` in test `util/thread_local_test.cc`

https://github.com/facebook/rocksdb/issues/8285

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8303

Reviewed By: ajkr

Differential Revision: D28539318

Pulled By: pdillinger

fbshipit-source-id: 624789c236bde31163deda95c1e1471aee68933e
2021-05-19 16:51:13 -07:00
anand76
13232e11d4 Allow cache_bench/db_bench to use a custom secondary cache (#8312)
Summary:
This PR adds a ```-secondary_cache_uri``` option to the cache_bench and db_bench tools to allow the user to specify a custom secondary cache URI. The object registry is used to create an instance of the ```SecondaryCache``` object of the type specified in the URI.

The main cache_bench code is packaged into a separate library, similar to db_bench.

An example invocation of db_bench with a secondary cache URI -
```db_bench --env_uri=ws://ws.flash_sandbox.vll1_2/ -db=anand/nvm_cache_2 -use_existing_db=true -benchmarks=readrandom -num=30000000 -key_size=32 -value_size=256 -use_direct_reads=true -cache_size=67108864 -cache_index_and_filter_blocks=true  -secondary_cache_uri='cachelibwrapper://filename=/home/anand76/nvm_cache/cache_file;size=2147483648;regionSize=16777216;admPolicy=random;admProbability=1.0;volatileSize=8388608;bktPower=20;lockPower=12' -partition_index_and_filters=true -duration=1800```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8312

Reviewed By: zhichao-cao

Differential Revision: D28544325

Pulled By: anand1976

fbshipit-source-id: 8f209b9af900c459dc42daa7a610d5f00176eeed
2021-05-19 15:26:18 -07:00
Stanislav Tkach
83d1a66598 Expose CompressionOptions::parallel_threads through C API (#8302)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8302

Reviewed By: jay-zhuang

Differential Revision: D28499262

Pulled By: ajkr

fbshipit-source-id: 7b17b79af871d874dfca76db9bca0d640a6cd854
2021-05-17 22:53:04 -07:00
sdong
0ed8cb666d Write file temperature information to manifest (#8284)
Summary:
As a part of tiered storage, writing tempeature information to manifest is needed so that after DB recovery, RocksDB still has the tiering information, to implement some further necessary functionalities.

Also fix some issues in simulated hybrid FS.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8284

Test Plan: Add a new unit test to validate that the information is indeed written and read back.

Reviewed By: zhichao-cao

Differential Revision: D28335801

fbshipit-source-id: 56aeb2e6ea090be0200181dd968c8a7278037def
2021-05-17 15:15:23 -07:00
anand76
feb06e83b2 Initial support for secondary cache in LRUCache (#8271)
Summary:
Defined the abstract interface for a secondary cache in include/rocksdb/secondary_cache.h, and updated LRUCacheOptions to take a std::shared_ptr<SecondaryCache>. An item is initially inserted into the LRU (primary) cache. When it ages out and evicted from memory, its inserted into the secondary cache. On a LRU cache miss and successful lookup in the secondary cache, the item is promoted to the LRU cache. Only support synchronous lookup currently. The secondary cache would be used to implement a persistent (flash cache) or compressed cache.

Tests:
Results from cache_bench and db_bench don't show any regression due to these changes.

cache_bench results before and after this change -
Command
```./cache_bench -ops_per_thread=10000000 -threads=1```
Before
```Complete in 40.688 s; QPS = 245774```
```Complete in 40.486 s; QPS = 246996```
```Complete in 42.019 s; QPS = 237989```
After
```Complete in 40.672 s; QPS = 245869```
```Complete in 44.622 s; QPS = 224107```
```Complete in 42.445 s; QPS = 235599```

db_bench results before this change, and with this change + https://github.com/facebook/rocksdb/issues/8213 and https://github.com/facebook/rocksdb/issues/8191 -
Commands
```./db_bench  --benchmarks="fillseq,compact" -num=30000000 -key_size=32 -value_size=256 -use_direct_io_for_flush_and_compaction=true -db=/home/anand76/nvm_cache/db -partition_index_and_filters=true```

```./db_bench -db=/home/anand76/nvm_cache/db -use_existing_db=true -benchmarks=readrandom -num=30000000 -key_size=32 -value_size=256 -use_direct_reads=true -cache_size=1073741824 -cache_numshardbits=6 -cache_index_and_filter_blocks=true -read_random_exp_range=17 -statistics -partition_index_and_filters=true -threads=16 -duration=300```
Before
```
DB path: [/home/anand76/nvm_cache/db]
readrandom   :      80.702 micros/op 198104 ops/sec;   54.4 MB/s (3708999 of 3708999 found)
```
```
DB path: [/home/anand76/nvm_cache/db]
readrandom   :      87.124 micros/op 183625 ops/sec;   50.4 MB/s (3439999 of 3439999 found)
```
After
```
DB path: [/home/anand76/nvm_cache/db]
readrandom   :      77.653 micros/op 206025 ops/sec;   56.6 MB/s (3866999 of 3866999 found)
```
```
DB path: [/home/anand76/nvm_cache/db]
readrandom   :      84.962 micros/op 188299 ops/sec;   51.7 MB/s (3535999 of 3535999 found)
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8271

Reviewed By: zhichao-cao

Differential Revision: D28357511

Pulled By: anand1976

fbshipit-source-id: d1cfa236f00e649a18c53328be10a8062a4b6da2
2021-05-13 22:58:40 -07:00
Jay Zhuang
d15fbae449 Refactor Option obj address from char* to void* (#8295)
Summary:
And replace `reinterpret_cast` with `static_cast` or no cast.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8295

Test Plan: `make check`

Reviewed By: mrambacher

Differential Revision: D28420303

Pulled By: jay-zhuang

fbshipit-source-id: 645be123a0df624dc2bea37cd54a35403fc494fa
2021-05-13 14:29:42 -07:00
Jay Zhuang
a79b46c503 Add De/Serialization for CompactionInput/Result (#8247)
Summary:
The functions will be used for remote compaction parameter
input and result.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8247

Test Plan: `make check`

Reviewed By: ajkr

Differential Revision: D28104680

Pulled By: jay-zhuang

fbshipit-source-id: c0a5178e6277125118384278efea2acbf90aa6cb
2021-05-12 12:36:43 -07:00
Peter Dillinger
78a309bf86 New Cache API for gathering statistics (#8225)
Summary:
Adds a new Cache::ApplyToAllEntries API that we expect to use
(in follow-up PRs) for efficiently gathering block cache statistics.
Notable features vs. old ApplyToAllCacheEntries:

* Includes key and deleter (in addition to value and charge). We could
have passed in a Handle but then more virtual function calls would be
needed to get the "fields" of each entry. We expect to use the 'deleter'
to identify the origin of entries, perhaps even more.
* Heavily tuned to minimize latency impact on operating cache. It
does this by iterating over small sections of each cache shard while
cycling through the shards.
* Supports tuning roughly how many entries to operate on for each
lock acquire and release, to control the impact on the latency of other
operations without excessive lock acquire & release. The right balance
can depend on the cost of the callback. Good default seems to be
around 256.
* There should be no need to disable thread safety. (I would expect
uncontended locks to be sufficiently fast.)

I have enhanced cache_bench to validate this approach:

* Reports a histogram of ns per operation, so we can look at the
ditribution of times, not just throughput (average).
* Can add a thread for simulated "gather stats" which calls
ApplyToAllEntries at a specified interval. We also generate a histogram
of time to run ApplyToAllEntries.

To make the iteration over some entries of each shard work as cleanly as
possible, even with resize between next set of entries, I have
re-arranged which hash bits are used for sharding and which for indexing
within a shard.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8225

Test Plan:
A couple of unit tests are added, but primary validation is manual, as
the primary risk is to performance.

The primary validation is using cache_bench to ensure that neither
the minor hashing changes nor the simulated stats gathering
significantly impact QPS or latency distribution. Note that adding op
latency histogram seriously impacts the benchmark QPS, so for a
fair baseline, we need the cache_bench changes (except remove simulated
stat gathering to make it compile). In short, we don't see any
reproducible difference in ops/sec or op latency unless we are gathering
stats nearly continuously. Test uses 10GB block cache with
8KB values to be somewhat realistic in the number of items to iterate
over.

Baseline typical output:

```
Complete in 92.017 s; Rough parallel ops/sec = 869401
Thread ops/sec = 54662

Operation latency (ns):
Count: 80000000 Average: 11223.9494  StdDev: 29.61
Min: 0  Median: 7759.3973  Max: 9620500
Percentiles: P50: 7759.40 P75: 14190.73 P99: 46922.75 P99.9: 77509.84 P99.99: 217030.58
------------------------------------------------------
[       0,       1 ]       68   0.000%   0.000%
(    2900,    4400 ]       89   0.000%   0.000%
(    4400,    6600 ] 33630240  42.038%  42.038% ########
(    6600,    9900 ] 18129842  22.662%  64.700% #####
(    9900,   14000 ]  7877533   9.847%  74.547% ##
(   14000,   22000 ] 15193238  18.992%  93.539% ####
(   22000,   33000 ]  3037061   3.796%  97.335% #
(   33000,   50000 ]  1626316   2.033%  99.368%
(   50000,   75000 ]   421532   0.527%  99.895%
(   75000,  110000 ]    56910   0.071%  99.966%
(  110000,  170000 ]    16134   0.020%  99.986%
(  170000,  250000 ]     5166   0.006%  99.993%
(  250000,  380000 ]     3017   0.004%  99.996%
(  380000,  570000 ]     1337   0.002%  99.998%
(  570000,  860000 ]      805   0.001%  99.999%
(  860000, 1200000 ]      319   0.000% 100.000%
( 1200000, 1900000 ]      231   0.000% 100.000%
( 1900000, 2900000 ]      100   0.000% 100.000%
( 2900000, 4300000 ]       39   0.000% 100.000%
( 4300000, 6500000 ]       16   0.000% 100.000%
( 6500000, 9800000 ]        7   0.000% 100.000%
```

New, gather_stats=false. Median thread ops/sec of 5 runs:

```
Complete in 92.030 s; Rough parallel ops/sec = 869285
Thread ops/sec = 54458

Operation latency (ns):
Count: 80000000 Average: 11298.1027  StdDev: 42.18
Min: 0  Median: 7722.0822  Max: 6398720
Percentiles: P50: 7722.08 P75: 14294.68 P99: 47522.95 P99.9: 85292.16 P99.99: 228077.78
------------------------------------------------------
[       0,       1 ]      109   0.000%   0.000%
(    2900,    4400 ]      793   0.001%   0.001%
(    4400,    6600 ] 34054563  42.568%  42.569% #########
(    6600,    9900 ] 17482646  21.853%  64.423% ####
(    9900,   14000 ]  7908180   9.885%  74.308% ##
(   14000,   22000 ] 15032072  18.790%  93.098% ####
(   22000,   33000 ]  3237834   4.047%  97.145% #
(   33000,   50000 ]  1736882   2.171%  99.316%
(   50000,   75000 ]   446851   0.559%  99.875%
(   75000,  110000 ]    68251   0.085%  99.960%
(  110000,  170000 ]    18592   0.023%  99.983%
(  170000,  250000 ]     7200   0.009%  99.992%
(  250000,  380000 ]     3334   0.004%  99.997%
(  380000,  570000 ]     1393   0.002%  99.998%
(  570000,  860000 ]      700   0.001%  99.999%
(  860000, 1200000 ]      293   0.000% 100.000%
( 1200000, 1900000 ]      196   0.000% 100.000%
( 1900000, 2900000 ]       69   0.000% 100.000%
( 2900000, 4300000 ]       32   0.000% 100.000%
( 4300000, 6500000 ]       10   0.000% 100.000%
```

New, gather_stats=true, 1 second delay between scans. Scans take about
1 second here so it's spending about 50% time scanning. Still the effect on
ops/sec and latency seems to be in the noise. Median thread ops/sec of 5 runs:

```
Complete in 91.890 s; Rough parallel ops/sec = 870608
Thread ops/sec = 54551

Operation latency (ns):
Count: 80000000 Average: 11311.2629  StdDev: 45.28
Min: 0  Median: 7686.5458  Max: 10018340
Percentiles: P50: 7686.55 P75: 14481.95 P99: 47232.60 P99.9: 79230.18 P99.99: 232998.86
------------------------------------------------------
[       0,       1 ]       71   0.000%   0.000%
(    2900,    4400 ]      291   0.000%   0.000%
(    4400,    6600 ] 34492060  43.115%  43.116% #########
(    6600,    9900 ] 16727328  20.909%  64.025% ####
(    9900,   14000 ]  7845828   9.807%  73.832% ##
(   14000,   22000 ] 15510654  19.388%  93.220% ####
(   22000,   33000 ]  3216533   4.021%  97.241% #
(   33000,   50000 ]  1680859   2.101%  99.342%
(   50000,   75000 ]   439059   0.549%  99.891%
(   75000,  110000 ]    60540   0.076%  99.967%
(  110000,  170000 ]    14649   0.018%  99.985%
(  170000,  250000 ]     5242   0.007%  99.991%
(  250000,  380000 ]     3260   0.004%  99.995%
(  380000,  570000 ]     1599   0.002%  99.997%
(  570000,  860000 ]     1043   0.001%  99.999%
(  860000, 1200000 ]      471   0.001%  99.999%
( 1200000, 1900000 ]      275   0.000% 100.000%
( 1900000, 2900000 ]      143   0.000% 100.000%
( 2900000, 4300000 ]       60   0.000% 100.000%
( 4300000, 6500000 ]       27   0.000% 100.000%
( 6500000, 9800000 ]        7   0.000% 100.000%
( 9800000, 14000000 ]        1   0.000% 100.000%

Gather stats latency (us):
Count: 46 Average: 980387.5870  StdDev: 60911.18
Min: 879155  Median: 1033777.7778  Max: 1261431
Percentiles: P50: 1033777.78 P75: 1120666.67 P99: 1261431.00 P99.9: 1261431.00 P99.99: 1261431.00
------------------------------------------------------
(  860000, 1200000 ]       45  97.826%  97.826% ####################
( 1200000, 1900000 ]        1   2.174% 100.000%

Most recent cache entry stats:
Number of entries: 1295133
Total charge: 9.88 GB
Average key size: 23.4982
Average charge: 8.00 KB
Unique deleters: 3
```

Reviewed By: mrambacher

Differential Revision: D28295742

Pulled By: pdillinger

fbshipit-source-id: bbc4a552f91ba0fe10e5cc025c42cef5a81f2b95
2021-05-11 16:17:10 -07:00
mrambacher
78e82410eb Added static methods for simple types to OptionTypeInfo (#8249)
Summary:
Added ParseType, SerializeType, and TypesAreEqual methods to OptionTypeInfo.  These methods can be used for serialization and deserialization of basic types.

Change the MutableCF/DB Options to use this format.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8249

Reviewed By: jay-zhuang

Differential Revision: D28351190

Pulled By: mrambacher

fbshipit-source-id: 72a78643b804f2f0bf59c32ffefa63346672ad16
2021-05-11 16:15:47 -07:00
mrambacher
9f2d255aed Add ObjectRegistry to ConfigOptions (#8166)
Summary:
This change enables a couple of things:
- Different ConfigOptions can have different registry/factory associated with it, thereby allowing things like a "Test" ConfigOptions versus a "Production"
- The ObjectRegistry is created fewer times and can be re-used

The ConfigOptions can also be initialized/constructed from a DBOptions, in which case it will grab some of its settings (Env, Logger) from the DBOptions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8166

Reviewed By: zhichao-cao

Differential Revision: D27657952

Pulled By: mrambacher

fbshipit-source-id: ae1d6200bb7ab127405cdeefaba43c7fe694dfdd
2021-05-11 06:47:22 -07:00
Andrew Kryczka
a639c02f8e Allow applying CompactionFilter outside of compaction (#8243)
Summary:
From HISTORY.md release note:

- Allow `CompactionFilter`s to apply in more table file creation scenarios such as flush and recovery. For compatibility, `CompactionFilter`s by default apply during compaction. Users can customize this behavior by overriding `CompactionFilterFactory::ShouldFilterTableFileCreation()`.
- Removed unused structure `CompactionFilterContext`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8243

Test Plan: added unit tests

Reviewed By: pdillinger

Differential Revision: D28088089

Pulled By: ajkr

fbshipit-source-id: 0799be7908e3b39fea09fc3f1ab00e13ad817fae
2021-05-07 16:01:40 -07:00
Peter Dillinger
c26b75baa5 Deprecate obsolete "backupable db" from public APIs (#8274)
Summary:
An early design of BackupEngine used stackable DB, so I guess a
DB had to opt-in to being backupable. Unfortunately the naming of that
obsolete design still infects our public API and implementation.

This change fixes the public API, with a deprecated
backward-compatibility header. `BackupableDBOptions` is renamed to
`BackupEngineOptions` (copy-replace in the public header) and
backup_engine.h replaces backupable_db.h (present for backward
compatibility). The only other change in backupable_db.h ->
backup_engine.h is cleaning up headers.

Later changes will fix the internal implementation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8274

Test Plan:
The internal implementation of BackupEngine uses the name
BackupEngineOptions, while the unit tests use the old name
BackupableDBOptions. This gives me confidence that both still work.

Reviewed By: mrambacher

Differential Revision: D28259471

Pulled By: pdillinger

fbshipit-source-id: a25dbe327b9772143488e7bb0ec7139ee42d0613
2021-05-07 13:53:15 -07:00
sdong
a4919d6b62 Cap automatic arena block size to 1 MB (#7907)
Summary:
Larger arena block size does provide the benefit of reducing allocation overhead, however it may cause other troubles. For example, allocator is more likely not to allocate them to physical memory and trigger page fault. Weighing the risk, we cap the arena block size to 1MB. Users can always use a larger value if they want.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7907

Test Plan: Run all existing tests

Reviewed By: pdillinger

Differential Revision: D26135269

fbshipit-source-id: b7f55afd03e6ee1d8715f90fa11b6c33944e9ea8
2021-05-07 13:15:34 -07:00
Peter Dillinger
3b981eaa1d Fix use-after-free threading bug in ClockCache (#8261)
Summary:
In testing for https://github.com/facebook/rocksdb/issues/8225 I found cache_bench would crash with
-use_clock_cache, as well as db_bench -use_clock_cache, but not
single-threaded. Smaller cache size hits failure much faster. ASAN
reported the failuer as calling malloc_usable_size on the `key` pointer
of a ClockCache handle after it was reportedly freed. On detailed
inspection I found this bad sequence of operations for a cache entry:

state=InCache=1,refs=1
[thread 1] Start ClockCacheShard::Unref (from Release, no mutex)
[thread 1] Decrement ref count
state=InCache=1,refs=0
[thread 1] Suspend before CalcTotalCharge (no mutex)

[thread 2] Start UnsetInCache (from Insert, mutex held)
[thread 2] clear InCache bit
state=InCache=0,refs=0
[thread 2] Calls RecycleHandle (based on pre-updated state)
[thread 2] Returns to Insert which calls Cleanup which deletes `key`

[thread 1] Resume ClockCacheShard::Unref
[thread 1] Read `key` in CalcTotalCharge

To fix this, I've added a field to the handle to store the metadata
charge so that we can efficiently remember everything we need from
the handle in Unref. We must not read from the handle again if we
decrement the count to zero with InCache=1, which means we don't own
the entry and someone else could eject/overwrite it immediately.

Note before this change, on amd64 sizeof(Handle) == 56 even though there
are only 48 bytes of data. Grouping together the uint32_t fields would
cut it down to 48, but I've added another uint32_t, which takes it
back up to 56. Not a big deal.

Also fixed DisownData to cooperate with ASAN as in LRUCache.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8261

Test Plan:
Manual + adding use_clock_cache to db_crashtest.py

Base performance
./cache_bench -use_clock_cache
Complete in 17.060 s; QPS = 2458513
New performance
./cache_bench -use_clock_cache
Complete in 17.052 s; QPS = 2459695

Any difference is easily buried in small noise.

Crash test shows still more bug(s) in ClockCache, so I'm expecting to
disable ClockCache from production code in a follow-up PR (if we
can't find and fix the bug(s))

Reviewed By: mrambacher

Differential Revision: D28207358

Pulled By: pdillinger

fbshipit-source-id: aa7a9322afc6f18f30e462c75dbbe4a1206eb294
2021-05-04 22:18:00 -07:00
sdong
c3ff14e2c1 Hint temperature of bottommost level files to FileSystem (#8222)
Summary:
As the first part of the effort of having placing different files on different storage types, this change introduces several things:
(1) An experimental interface in FileSystem that specify temperature to a new file created.
(2) A test FileSystemWrapper,  SimulatedHybridFileSystem, that simulates HDD for a file of "warm" temperature.
(3) A simple experimental feature ColumnFamilyOptions.bottommost_temperature. RocksDB would pass this value to FileSystem when creating any bottommost file.
(4) A db_bench parameter that applies the (2) and (3) to db_bench.

The motivation of the change is to introduce minimal changes that allow us to evolve tiered storage development.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8222

Test Plan:
./db_bench --benchmarks=fillrandom --write_buffer_size=2000000 -max_bytes_for_level_base=20000000  -level_compaction_dynamic_level_bytes --reads=100 -compaction_readahead_size=20000000 --reads=100000 -num=10000000

followed by

./db_bench --benchmarks=readrandom,stats --write_buffer_size=2000000 -max_bytes_for_level_base=20000000 -simulate_hybrid_fs_file=/tmp/warm_file_list -level_compaction_dynamic_level_bytes -compaction_readahead_size=20000000 --reads=500 --threads=16 -use_existing_db --num=10000000

and see results as expected.

Reviewed By: ajkr

Differential Revision: D28003028

fbshipit-source-id: 4724896d5205730227ba2f17c3fecb11261744ce
2021-05-03 13:34:04 -07:00
Peter Dillinger
d2ca04e3ed Add more LSM info to FilterBuildingContext (#8246)
Summary:
Add `num_levels`, `is_bottommost`, and table file creation
`reason` to `FilterBuildingContext`, in anticipation of more powerful
Bloom-like filter support.

To support this, added `is_bottommost` and `reason` to
`TableBuilderOptions`, which allowed removing `reason` parameter from
`rocksdb::BuildTable`.

I attempted to remove `skip_filters` from `TableBuilderOptions`, because
filter construction decisions should arise from options, not one-off
parameters. I could not completely remove it because the public API for
SstFileWriter takes a `skip_filters` parameter, and translating this
into an option change would mean awkwardly replacing the table_factory
if it is BlockBasedTableFactory with new filter_policy=nullptr option.
I marked this public skip_filters option as deprecated because of this
oddity. (skip_filters on the read side probably makes sense.)

At least `skip_filters` is now largely hidden for users of
`TableBuilderOptions` and is no longer used for implementing the
optimize_filters_for_hits option. Bringing the logic for that option
closer to handling of FilterBuildingContext makes it more obvious that
hese two are using the same notion of "bottommost." (Planned:
configuration options for Bloom-like filters that generalize
`optimize_filters_for_hits`)

Recommended follow-up: Try to get away from "bottommost level" naming of
things, which is inaccurate (see
VersionStorageInfo::RangeMightExistAfterSortedRun), and move to
"bottommost run" or just "bottommost."

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8246

Test Plan:
extended an existing unit test to exercise and check various
filter building contexts. Also, existing tests for
optimize_filters_for_hits validate some of the "bottommost" handling,
which is now closely connected to FilterBuildingContext::is_bottommost
through TableBuilderOptions::is_bottommost

Reviewed By: mrambacher

Differential Revision: D28099346

Pulled By: pdillinger

fbshipit-source-id: 2c1072e29c24d4ac404c761a7b7663292372600a
2021-04-30 13:50:13 -07:00
Peter Dillinger
85becd94c1 Refactor: use TableBuilderOptions to reduce parameter lists (#8240)
Summary:
Greatly reduced the not-quite-copy-paste giant parameter lists
of rocksdb::NewTableBuilder, rocksdb::BuildTable,
BlockBasedTableBuilder::Rep ctor, and BlockBasedTableBuilder ctor.

Moved weird separate parameter `uint32_t column_family_id` of
TableFactory::NewTableBuilder into TableBuilderOptions.

Re-ordered parameters to TableBuilderOptions ctor, so that `uint64_t
target_file_size` is not randomly placed between uint64_t timestamps
(was easy to mix up).

Replaced a couple of fields of BlockBasedTableBuilder::Rep with a
FilterBuildingContext. The motivation for this change is making it
easier to pass along more data into new fields in FilterBuildingContext
(follow-up PR).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8240

Test Plan: ASAN make check

Reviewed By: mrambacher

Differential Revision: D28075891

Pulled By: pdillinger

fbshipit-source-id: fddb3dbb8260a0e8bdcbb51b877ebabf9a690d4f
2021-04-29 07:00:50 -07:00
Duarte Nunes
3949731de3 Add WAL flush API to C client (#8226)
Summary:
The C client is missing the`manual_wal_flush` option and the `flush_wal` API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8226

Reviewed By: ajkr

Differential Revision: D28000869

Pulled By: jay-zhuang

fbshipit-source-id: ed44937e7e7e75bc0dfa870a14147fbeef0c38f8
2021-04-27 14:56:23 -07:00
Sahir Hoda
13c655a887 New C API to expose NewCompactOnDeletionCollectorFactory (#8233)
Summary:
New C API rocksdb_options_add_compact_on_deletion_collector_factory to expose NewCompactOnDeletionCollectorFactory

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8233

Reviewed By: mrambacher

Differential Revision: D28018381

Pulled By: anand1976

fbshipit-source-id: 674c9ed902c91ff0d9f09e7a60c5f37b907604c6
2021-04-27 10:14:04 -07:00
mrambacher
6bab3a34e9 Move RegisterOptions into the Configurable API (#8223)
Summary:
As previously coded, a Configurable extension would need access to code not in the public API.  This change moves RegisterOptions into the Configurable class and therefore available to public extensions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8223

Reviewed By: anand1976

Differential Revision: D27960188

Pulled By: mrambacher

fbshipit-source-id: ac88b19397183df633902def5b5701b9b65fbf40
2021-04-26 03:13:24 -07:00
Sahir Hoda
d65d7d657d Expose JemallocNodumpAllocator to C API (#8178)
Summary:
Add new C APIs to create the JemallocNodumpAllocator and set it on a Cache object.

`make test` passes with and without `DISABLE_JEMALLOC=1`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8178

Reviewed By: jay-zhuang

Differential Revision: D27944631

Pulled By: ajkr

fbshipit-source-id: 2531729aa285a8985c58f22f093c4d53029c4a7b
2021-04-22 22:22:34 -07:00
sdong
4985cea141 Add comment to DisableManualCompaction() (#8186)
Summary:
Add comment to DisableManualCompaction() which was missing.
Also explictly return from DBImpl::CompactRange() to avoid memtable flush when manual compaction is disabled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8186

Test Plan: Run existing unit tests.

Reviewed By: jay-zhuang

Differential Revision: D27744517

fbshipit-source-id: 449548a48905903b888dc9612bd17480f6596a71
2021-04-21 15:23:46 -07:00
Akanksha Mahajan
596e9008e4 Stall writes in WriteBufferManager when memory_usage exceeds buffer_size (#7898)
Summary:
When WriteBufferManager is shared across DBs and column families
to maintain memory usage under a limit, OOMs have been observed when flush cannot
finish but writes continuously insert to memtables.
In order to avoid OOMs, when memory usage goes beyond buffer_limit_ and DBs tries to write,
this change will stall incoming writers until flush is completed and memory_usage
drops.

Design: Stall condition: When total memory usage exceeds WriteBufferManager::buffer_size_
(memory_usage() >= buffer_size_) WriterBufferManager::ShouldStall() returns true.

DBImpl first block incoming/future writers by calling write_thread_.BeginWriteStall()
(which adds dummy stall object to the writer's queue).
Then DB is blocked on a state State::Blocked (current write doesn't go
through). WBStallInterface object maintained by every DB instance is added to the queue of
WriteBufferManager.

If multiple DBs tries to write during this stall, they will also be
blocked when check WriteBufferManager::ShouldStall() returns true.

End Stall condition: When flush is finished and memory usage goes down, stall will end only if memory
waiting to be flushed is less than buffer_size/2. This lower limit will give time for flush
to complete and avoid continous stalling if memory usage remains close to buffer_size.

WriterBufferManager::EndWriteStall() is called,
which removes all instances from its queue and signal them to continue.
Their state is changed to State::Running and they are unblocked. DBImpl
then signal all incoming writers of that DB to continue by calling
write_thread_.EndWriteStall() (which removes dummy stall object from the
queue).

DB instance creates WBMStallInterface which is an interface to block and
signal DBs during stall.
When DB needs to be blocked or signalled by WriteBufferManager,
state_for_wbm_ state is changed accordingly (RUNNING or BLOCKED).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7898

Test Plan: Added a new test db/db_write_buffer_manager_test.cc

Reviewed By: anand1976

Differential Revision: D26093227

Pulled By: akankshamahajan15

fbshipit-source-id: 2bbd982a3fb7033f6de6153aa92a221249861aae
2021-04-21 13:54:02 -07:00
Peter Dillinger
95f6add746 Revert Ribbon starting level support from #8198 (#8212)
Summary:
This partially reverts commit 10196d7edc.

The problem with this change is because of important filter use cases:
FIFO compaction and SST writer. FIFO "compaction" always uses level 0 so
would only use Ribbon filters if specifically including level 0 for the
Ribbon filter policy. SST writer sets level_at_creation=-1 to indicate
unknown level, and this would be treated the same as level 0 unless
fixed.

We are keeping the part about committing to permanent schema, which is
only changes to API comments and HISTORY.md.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8212

Test Plan: CI

Reviewed By: jay-zhuang

Differential Revision: D27896468

Pulled By: pdillinger

fbshipit-source-id: 50a775f7cba5d64fb729d9b982e355864020596e
2021-04-20 19:46:40 -07:00
Jay Zhuang
a89740fbc6 Fix unittest no space issue (#8204)
Summary:
Unittest reports no space from time to time, which can be reproduced on a small memory machine with SHM. It's caused by large WAL files generated during the test, which is preallocated, but didn't truncate during close(). Adding the missing APIs to set preallocation.
It added arm test as nightly build, as the test runs more than 1 hour.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8204

Test Plan: test on small memory arm machine

Reviewed By: mrambacher

Differential Revision: D27873145

Pulled By: jay-zhuang

fbshipit-source-id: f797c429d6bc13cbcc673bc03fcc72adda55f506
2021-04-20 08:42:28 -07:00
Akanksha Mahajan
531a5f88a1 Update release version to 6.20 (#8199)
Summary:
Update release version to 6.20

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8199

Test Plan: No code change

Reviewed By: ajkr

Differential Revision: D27838750

Pulled By: akankshamahajan15

fbshipit-source-id: f02f722fc6bdd37d626d47a0e932bbecea3507a8
2021-04-16 20:15:36 -07:00
Peter Dillinger
10196d7edc Ribbon long-term support, starting level support (#8198)
Summary:
Since the Ribbon filter schema seems good (compatible back to
6.15.0), this change commits to long term support of the SST schema,
even though we expect the API for enabling Ribbon to change (still
called NewExperimentalRibbonFilterPolicy).

This also adds support for "hybrid" configuration in which some levels
use Bloom (higher levels, lower numbered) for speed and the rest use
Ribbon (lower levels, higher numbered) for memory space efficiency.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8198

Test Plan: unit test added, crash test support

Reviewed By: jay-zhuang

Differential Revision: D27831232

Pulled By: pdillinger

fbshipit-source-id: 90e528677689474d293ed6710b42ba89fbd5b5ab
2021-04-16 15:43:08 -07:00
mrambacher
4c41e51c07 Add Blob Options to C API (#8148)
Summary:
Added the Blob option settings from the AdvancedColmnFamilyOptions to the C API.

There are no tests for getting/setting options in the C API currently, hence no specific test plans.  Should there be a some?

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8148

Reviewed By: ltamasi

Differential Revision: D27568495

Pulled By: mrambacher

fbshipit-source-id: 3a52b784467ea2c4bc58be5f75c5d41f0a5c55d6
2021-04-16 05:56:00 -07:00