Commit Graph

2245 Commits

Author SHA1 Message Date
Baptiste Lemaire
92822655fd Remove deprecated table_cache_remove_scan_count_limit option. (#9450)
Summary:
In RocksDB, this option was already marked as "NOT SUPPORTED" for a long time, and setting this option does not have any effect on the behavior of RocksDB library. Therefore, we are removing it in the preparations of the upcoming 7.0 release.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9450

Reviewed By: ajkr

Differential Revision: D33802466

Pulled By: bjlemaire

fbshipit-source-id: 97570985f1400525304053476450f7ef504c0cd5
2022-01-27 09:33:31 -08:00
Peter Dillinger
449029f865 Remove deprecated ObjectLibrary::Register() (and Regex public API) (#9439)
Summary:
Regexes are considered potentially problematic for use in
registering RocksDB extensions, so we are removing
ObjectLibrary::Register() and the Regex public API it depended on (now
unused).

In reference to https://github.com/facebook/rocksdb/issues/9389

Why?
* The power of Regexes can make it hard to reason about which extension
will match what. (The replacement API isn't perfect, but we are at least
"holding the line" on patterns we have seen in practice.)
* It is easy to make regexes that don't quite mean what you think they
mean, such as forgetting that the `.` in `foo.bar` can match any character
or that matching is nondeterministic, as in `a🅱️42` matching `.*:[0-9]+`.
* Some regexes and implementations can have disastrously bad
performance. This might not be much practical concern for ObjectLibray
here, but we don't want to encourage potentially dangerous further use
in production code. (Testing code is fine. See TestRegex.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9439

Test Plan: CI

Reviewed By: mrambacher

Differential Revision: D33792342

Pulled By: pdillinger

fbshipit-source-id: 4f64dcb04764e639162c8977a5fa196f67754cec
2022-01-26 16:22:44 -08:00
Siddhartha Roychowdhury
c27ca23644 Add option for WAL compression algorithm (#9432)
Summary:
Add an option to set the WAL compression algorithm - wal_compression.

TODO: WAL compression is not implemented and will only support zstd initially. Will be added in subsequent diffs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9432

Reviewed By: pdillinger

Differential Revision: D33797275

Pulled By: sidroyc

fbshipit-source-id: 8db81d9c9cea5e2e4f1445d3aecad8106137b8e7
2022-01-26 14:23:00 -08:00
Jay Zhuang
961d8dacf2 Remove unused option purge_redundant_kvs_while_flush (#9429)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9429

Test Plan: fake release for test: D33754513

Reviewed By: riversand963

Differential Revision: D33753637

Pulled By: jay-zhuang

fbshipit-source-id: 18db4701e8f28dda8f1ab660c2be9890a8312c12
2022-01-26 10:24:16 -08:00
Akanksha Mahajan
8822562d75 Remove deprecated function DB::AddFile (#9433)
Summary:
RocksDB has marked DB::AddFile() as "DEPRECATED_FUNC" for a long time, and
it will be removed in the upcoming 7.0 release.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9433

Test Plan: make check -j64; CircleCI

Reviewed By: riversand963

Differential Revision: D33763987

Pulled By: akankshamahajan15

fbshipit-source-id: a3407324479bb43689e1213e4e29d53095e7579a
2022-01-25 23:22:58 -08:00
Jay Zhuang
022b400cba Make bottommost_temperature dynamically changeable (#9402)
Summary:
Make `AdvancedColumnFamilyOptions.bottommost_temperature`
dynamically changeable with `SetOptions` API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9402

Test Plan: added unittest

Reviewed By: siying

Differential Revision: D33674487

Pulled By: jay-zhuang

fbshipit-source-id: 8943768156aa6197c63850a64238a8092527d517
2022-01-25 15:23:04 -08:00
Yanqin Jin
fa52376117 Move RADOS support to separate repo (#9206)
Summary:
This PR moves RADOS support from RocksDB repo to a separate repo. The new (temporary?) repo
in this PR serves as an example before we finalize the decision on where and who to host RADOS support. At this point,
people can start from the example repo and fork.

The goal is to include this commit in RocksDB 7.0 release.

Reference:
https://github.com/ajkr/dedupfs by ajkr

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9206

Test Plan:
Follow instructions in https://github.com/riversand963/rocksdb-rados-env/blob/main/README.md and build
test binary `env_librados_test` and run it.

Also, make check

Reviewed By: ajkr

Differential Revision: D33751690

Pulled By: riversand963

fbshipit-source-id: 30466c62afa9e4619847a48567ed158e62835e35
2022-01-24 22:50:07 -08:00
Yanqin Jin
50135c1bf3 Move HDFS support to separate repo (#9170)
Summary:
This PR moves HDFS support from RocksDB repo to a separate repo. The new (temporary?) repo
in this PR serves as an example before we finalize the decision on where and who to host hdfs support. At this point,
people can start from the example repo and fork.

Java/JNI is not included yet, and needs to be done later if necessary.

The goal is to include this commit in RocksDB 7.0 release.

Reference:
https://github.com/ajkr/dedupfs by ajkr

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9170

Test Plan:
Follow the instructions in https://github.com/riversand963/rocksdb-hdfs-env/blob/master/README.md. Build and run db_bench and db_stress.

make check

Reviewed By: ajkr

Differential Revision: D33751662

Pulled By: riversand963

fbshipit-source-id: 22b4db7f31762ed417a20239f5a08dcd1696244f
2022-01-24 20:23:54 -08:00
anand76
e8f116deab Update version to 6.29.0 (#9418)
Summary:
Update version for 6.29 release

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9418

Reviewed By: riversand963

Differential Revision: D33721048

Pulled By: anand1976

fbshipit-source-id: e73602ee1c829c2e47ce6e181bca4db7cb663979
2022-01-21 18:23:07 -08:00
Peter Dillinger
e7ac7363b4 Add to HISTORY and minor loose ends from #9294, #9254 (#9386)
Summary:
Loose ends relate to mmap on 32-bit systems. (Testing is more
complicated when the feature was completely disabled on 32-bit.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9386

Test Plan: CI

Reviewed By: ajkr

Differential Revision: D33590715

Pulled By: pdillinger

fbshipit-source-id: f2637036a538a552200adee65b6765fce8cae27b
2022-01-21 13:04:19 -08:00
Brian Chen
93a0e9f3fa Mark destructors as override (#9404)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9404

It is better practice to mark destructors as override. Without this
change there can be issues building with
-Wsuggest-destructor-override.

Reviewed By: riversand963

Differential Revision: D33671992

fbshipit-source-id: 75b0c15010cbab5fbc071c150fef1dc85d5d9d96
2022-01-20 08:44:27 -08:00
Peter Dillinger
ffe1e4b820 Make some FilterPolicy deprecations more clear (#9403)
Summary:
The old block-based filter has been deprecated for years, but
this makes that more clear by marking the functions specific to it and
logging a warning when the feature is used.

It is deprecated because of performance. In that old design, you have to
binary search through the full SST index before a bloom filter query, which
is much more expensive than a bloom query itself.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9403

Test Plan:
Used db_bench with and without -use_block_based_filter,
running at the same time

    TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom,readrandom -num=10000000 -duration=20 -disable_wal=1 -write_buffer_size=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0

No significant difference in construction time but 3x slower readrandom
with -use_block_based_filter:
readrandom   :     100.517 micros/op 9948 ops/sec;    1.1 MB/s
vs.
readrandom   :      33.368 micros/op 29968 ops/sec;    3.3 MB/s

Also saw deprecation message (just once) in LOG only with
-use_block_based_filter

Reviewed By: ajkr

Differential Revision: D33673202

Pulled By: pdillinger

fbshipit-source-id: 99f6f0eff619408d9e5f7ef546954ed0be6c7a5b
2022-01-19 18:12:10 -08:00
Andrew Kryczka
875bfd75a0 Add API warning for Iterator::Refresh() with range tombstones (#9398)
Summary:
Need this until we properly return an error or fix the combination. Reported in https://github.com/facebook/rocksdb/issues/9255.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9398

Reviewed By: riversand963

Differential Revision: D33641396

Pulled By: ajkr

fbshipit-source-id: 9fe804108f7b93912f5b9c7252ac49acedc4f805
2022-01-19 10:13:27 -08:00
Hui Xiao
f61df25cc2 Add missing comment to RateLimiter::Request() (#9392)
Summary:
**Context/Summary:**
There are two `RateLimiter::Request()` in public header. One of them is missing some comment that the other one has.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9392

Test Plan: rely on CI test

Reviewed By: pdillinger

Differential Revision: D33623609

Pulled By: hx235

fbshipit-source-id: 42dc06308ff0bcf5ee7ef67e0b1c0172fc239b20
2022-01-19 10:09:49 -08:00
Peter Dillinger
5576ded762 Add Options::DisableExtraChecks, clarify force_consistency_checks (#9363)
Summary:
In response to https://github.com/facebook/rocksdb/issues/9354, this PR adds a way for users to "opt out"
of extra checks that can impact peak write performance, which
currently only includes force_consistency_checks. I considered including
some other options but did not see a db_bench performance difference.

Also clarify in comment for force_consistency_checks that it can "slow
down saturated writing."

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9363

Test Plan:
basic coverage in unit tests

Using my perf test in https://github.com/facebook/rocksdb/issues/9354 comment, I see

force_consistency_checks=true -> 725360 ops/s
force_consistency_checks=false -> 783072 ops/s

Reviewed By: mrambacher

Differential Revision: D33636559

Pulled By: pdillinger

fbshipit-source-id: 25bfd006f4844675e7669b342817dd4c6a641e84
2022-01-18 17:31:03 -08:00
Sergei Petrunia
c9042db619 Range Locking: add support for escalation barriers (#9290)
Summary:
Range Locking supports Lock Escalation. Lock Escalation is invoked when
lock memory is nearly exhausted and it reduced the amount of memory used
by joining adjacent locks.

Bridging the gap between certain locks has adverse effects. For example,
in MyRocks it is not a good idea to bridge the gap between locks in
different indexes, as that get the lock to cover large portions of
indexes, or even entire indexes.

Resolve this by introducing Escalation Barrier. The escalation process
will call the user-provided barrier callback function:
   bool(const Endpoint& a, const Endpoint& b)

If the function returns true, there's a barrier between a and b and Lock
Escalation will not try to bridge the gap between a and b.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9290

Reviewed By: akankshamahajan15

Differential Revision: D33486753

Pulled By: riversand963

fbshipit-source-id: f97910b67aba0579ea1d35f523ca6863d3dd018e
2022-01-14 12:46:09 -08:00
Si Ke
93b1de4f45 Enable db_test running in Centos 32 bit OS and Alpine 32 bit OS (#9294)
Summary:
Closes https://github.com/facebook/rocksdb/issues/9271

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9294

Reviewed By: riversand963, hx235

Differential Revision: D33586002

Pulled By: pdillinger

fbshipit-source-id: 3d1a2fa71023e108613ff03dbd37a5f954fc4920
2022-01-14 11:58:18 -08:00
Niklas Fiekas
f8bdd5797f Take compression level_values as const pointer (#9376)
Summary:
Compatible change, more natural (especially in generated Rust bindings), no risk that the API will ever need mutable access because it has to make a copy anyway.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9376

Reviewed By: ajkr

Differential Revision: D33541435

Pulled By: pdillinger

fbshipit-source-id: 15c512a0d70b6e8694fa99d598b7d022751c1e59
2022-01-12 08:34:53 -08:00
mrambacher
1973fcba11 Restore Regex support for ObjectLibrary::Register, rename new APIs to allow old one to be deprecated in the future (#9362)
Summary:
In order to support old-style regex function registration, restored the original "Register<T>(string, Factory)" method using regular expressions.  The PatternEntry methods were left in place but renamed to AddFactory.  The goal is to allow for the deprecation of the original regex Registry method in an upcoming release.

Added modes to the PatternEntry kMatchZeroOrMore and kMatchAtLeastOne to match * or +, respectively (kMatchAtLeastOne was the original behavior).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9362

Reviewed By: pdillinger

Differential Revision: D33432562

Pulled By: mrambacher

fbshipit-source-id: ed88ab3f9a2ad0d525c7bd1692873f9bb3209d02
2022-01-11 06:33:48 -08:00
Yanqin Jin
55a2105258 Make RocksDB codebase compatible with newer compilers like clang-12 (#9370)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9370

GCC and newer clang, e.g. clang-12 treat `std::unique_ptr` slightly differently.
For the following code
```
#include <iostream>
#include <memory>
#include <type_traits>

struct A {
    std::unique_ptr<int> m1;
};

int main()
{
    std::cout << std::boolalpha;
    std::cout << std::is_standard_layout<A>::value << '\n';
    return 0;
}
```
GCC11(C++20) (tested on https://en.cppreference.com/w/cpp/types/is_standard_layout) will print "true", while newer clang, e.g. clang-12 will print "false". This breaks the usage of `offsetof()` on structs with non-static members of type `std::unique_ptr`.
Fixing this by replacing the builtin `offsetof` with a trick documented at https://gist.github.com/graphitemaster/494f21190bb2c63c5516.

Reviewed By: jay-zhuang

Differential Revision: D33420840

fbshipit-source-id: 02bde281dfa28809bec787ad0f7019e85dd9c607
2022-01-10 11:09:05 -08:00
Hui Xiao
f62efb9d35 Clarify Options::rate_limiter api (#9361)
Summary:
Context/Summary:
I believe we also rate-limit read rate using the rate limiter passed into db options, e.g, https://github.com/facebook/rocksdb/blob/6.27.fb/file/random_access_file_reader.cc#L159

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9361

Test Plan: Existing tests

Reviewed By: jay-zhuang

Differential Revision: D33420803

Pulled By: hx235

fbshipit-source-id: 0ef3c4d0aaacb9bee9a5d2caceddfc76588c8949
2022-01-06 10:13:53 -08:00
mrambacher
fe31dc53ca Make the Env class Customizable (#9293)
Summary:
Allows the Env to have options (Configurable) and loads like other Customizable classes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9293

Reviewed By: pdillinger, zhichao-cao

Differential Revision: D33181591

Pulled By: mrambacher

fbshipit-source-id: 55e823886c654d214eda9eedd45ccdc54dac14d7
2022-01-04 16:45:49 -08:00
mrambacher
1c39b7952b Remove/Reduce use of Regex in ObjectRegistry/Library (#9264)
Summary:
Added new ObjectLibrary::Entry classes to replace/reduce the use of Regex.  For simple factories that only do name matching, there are "StringEntry" and "AltStringEntry" classes.  For classes that use some semblance of regular expressions, there is a PatternEntry class that can match a name and prefixes.  There is also a class for Customizable::IndividualId format matches.

Added tests for the new derivative classes and got all unit tests to pass.

Resolves https://github.com/facebook/rocksdb/issues/9225.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9264

Reviewed By: pdillinger

Differential Revision: D33062001

Pulled By: mrambacher

fbshipit-source-id: c2d2143bd2d38bdf522705c8280c35381b135c03
2021-12-29 07:56:23 -08:00
Andrew Kryczka
aa2b3bf675 Added TraceOptions::preserve_write_order (#9334)
Summary:
This option causes trace records to be written in the serialized write thread. That way, the write records in the trace must follow the same order as writes that are logged to WAL and writes that are applied to the DB.

By default I left it disabled to match existing behavior. I enabled it in `db_stress`, though, as that use case requires order of write records in trace matches the order in WAL.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9334

Test Plan:
- See if below unsynced data loss crash test can run  for 24h straight. It used to crash after a few hours when reaching an unlucky trace ordering.

```
DEBUG_LEVEL=0 TEST_TMPDIR=/dev/shm /usr/local/bin/python3 -u tools/db_crashtest.py blackbox --interval=10 --max_key=100000 --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152 --value_size_mult=33 --sync_fault_injection=1 --test_batches_snapshots=0 --duration=86400
```

Reviewed By: zhichao-cao

Differential Revision: D33301990

Pulled By: ajkr

fbshipit-source-id: 82d97559727adb4462a7af69758449c8725b22d3
2021-12-28 15:04:26 -08:00
Andrew Kryczka
2ee20a669d Extend trace filtering to more operation types (#9335)
Summary:
- Extended trace filtering to cover `MultiGet()`, `Seek()`, and `SeekForPrev()`. Now all user ops that can be traced support filtering.
- Enabled the new filter masks in `db_stress` since it only cares to trace writes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9335

Test Plan:
- trace-heavy `db_stress` command reduced 30% elapsed time  (79.21 -> 55.47 seconds)

Benchmark command:
```
$ /usr/bin/time ./db_stress -ops_per_thread=100000 -sync_fault_injection=1 --db=/dev/shm/rocksdb_stress_db/ --expected_values_dir=/dev/shm/rocksdb_stress_expected/ --clear_column_family_one_in=0
```

- replay-heavy `db_stress` command reduced 12.4% elapsed time (23.69 -> 20.75 seconds)

Setup command:
```
$  ./db_stress -ops_per_thread=100000000 -sync_fault_injection=1 -db=/dev/shm/rocksdb_stress_db/ -expected_values_dir=/dev/shm/rocksdb_stress_expected --clear_column_family_one_in=0 & sleep 120; pkill -9 db_stress
```

Benchmark command:
```
$ /usr/bin/time ./db_stress -ops_per_thread=1 -reopen=0 -expected_values_dir=/dev/shm/rocksdb_stress_expected/ -db=/dev/shm/rocksdb_stress_db/ --clear_column_family_one_in=0 --destroy_db_initially=0
```

Reviewed By: zhichao-cao

Differential Revision: D33304580

Pulled By: ajkr

fbshipit-source-id: 0df10f87c1fc506e9484b6b42cea2ef96c7ecd65
2021-12-28 11:46:30 -08:00
slk
2e5f764294 Make IncreaseFullHistoryTsLow to a public API (#9221)
Summary:
As (https://github.com/facebook/rocksdb/issues/9210) discussed, the **full_history_ts_low** is a member of CompactRangeOptions currently, which means a CF's fullHistoryTsLow is advanced only when users submit a CompactRange request.
However, users may want to advance the fllHistoryTsLow without an immediate compact.
This merge make IncreaseFullHistoryTsLow to a public API so users can advance each CF's fullHistoryTsLow seperately.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9221

Reviewed By: akankshamahajan15

Differential Revision: D33201106

Pulled By: riversand963

fbshipit-source-id: 9cb1d013ba93260f72e16353e693ffee167b47ee
2021-12-23 11:03:51 -08:00
Sergei Petrunia
1b076e82db Expose locktree's wait count in RangeLockManagerHandle::Counters (#9289)
Summary:
locktree is a module providing Range Locking. It has a counter for
the number of times a lock acquisition request was blocked by an
existing conflicting lock and had to wait for it to be released.

Expose this counter in RangeLockManagerHandle::Counters::lock_wait_count.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9289

Reviewed By: jay-zhuang

Differential Revision: D33079182

Pulled By: riversand963

fbshipit-source-id: 25b1a362d9da247536ab5007bd15900b319f139e
2021-12-22 21:14:48 -08:00
Andrew Kryczka
782fcc44e1 Fix race condition in error_handler_fs_test (#9325)
Summary:
We saw the below assertion failure in `error_handler_fs_test`:

```
db/error_handler_fs_test.cc:2471: Failure
Expected equality of these values:
  listener->new_bg_error()
    Which is: 16-byte object <00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00>
  Status::Aborted()
    Which is: 16-byte object <0A-00 00-00 60-61 00-00 00-00 00-00 00-00 00-00>
terminate called after throwing an instance of 'testing::internal::GoogleTestFailureException'
  what():  db/error_handler_fs_test.cc:2471: Failure
Expected equality of these values:
  listener->new_bg_error()
    Which is: 16-byte object <00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00>
  Status::Aborted()
    Which is: 16-byte object <0A-00 00-00 60-61 00-00 00-00 00-00 00-00 00-00>
Received signal 6 (Aborted)
```

The problem was completing `OnErrorRecoveryCompleted()` would
wake up the main thread and allow it to proceed to that assertion. But
that assertion assumes `OnErrorRecoveryEnd()` has completed since
only `OnErrorRecoveryEnd()` affects `new_bg_error()`.

The fix is just to make `OnErrorRecoveryCompleted()` not wake up the
main thread, by means of not implementing it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9325

Test Plan:
- ran `while TEST_TMPDIR=/dev/shm ./error_handler_fs_test ; do : ; done` for a while
- injected sleep between `OnErrorRecovery{Completed,End}()` callbacks, which guaranteed repro before this PR

Reviewed By: anand1976

Differential Revision: D33249200

Pulled By: ajkr

fbshipit-source-id: 1659ee183cd09f90d4dbd898f65103473fcf84a8
2021-12-20 23:16:52 -08:00
Akanksha Mahajan
7bfad07194 Update to version 6.28 (#9312)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9312

Reviewed By: ajkr

Differential Revision: D33196324

Pulled By: akankshamahajan15

fbshipit-source-id: 471da75eaedc54d3151672adc28643bc1d6fdf23
2021-12-17 16:20:39 -08:00
mrambacher
423538a816 Make MemoryAllocator into a Customizable class (#8980)
Summary:
- Make MemoryAllocator and its implementations into a Customizable class.
- Added a "DefaultMemoryAllocator" which uses new and delete
- Added a "CountedMemoryAllocator" that counts the number of allocs and free
- Updated the existing tests to use these new allocators
- Changed the memkind allocator test into a generic test that can test the various allocators.
- Added tests for creating all of the allocators
- Added tests to verify/create the JemallocNodumpAllocator using its options.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8980

Reviewed By: zhichao-cao

Differential Revision: D32990403

Pulled By: mrambacher

fbshipit-source-id: 6fdfe8218c10dd8dfef34344a08201be1fa95c76
2021-12-17 04:20:47 -08:00
Yanqin Jin
5455cacd18 Fix link error reported in issue 9272 (#9278)
Summary:
As title, Closes https://github.com/facebook/rocksdb/issues/9272
Since TimestampAssigner-related classes needs to access
`WriteBatch::ProtectionInfo` objects which is for internal use only,
it's difficult to make `AssignTimestamp` methods a template and put them
in the same public header, `include/rocksdb/write_batch.h`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9278

Test Plan:
```
make check
# Also manually test following the repro-steps in issue 9272
```

Reviewed By: ltamasi

Differential Revision: D33012686

Pulled By: riversand963

fbshipit-source-id: 89f24a86a1170125bd0b94ef3b32e69aa08bd949
2021-12-10 20:33:46 -08:00
Yanqin Jin
bd513fd075 Add commit marker with timestamp (#9266)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9266

This diff adds a new tag `CommitWithTimestamp`. Currently, there is no API to trigger writing
this tag to WAL, thus it is unavailable to users.
This is an ongoing effort to add user-defined timestamp support to write-committed transactions.
This diff also indicates all column families that may potentially participate in the same
transaction must either disable timestamp or have the same timestamp format, since
`CommitWithTimestamp` tag is followed by a single byte-array denoting the commit
timestamp of the transaction. We will enforce this checking in a future diff. We keep this
diff small.

Reviewed By: ltamasi

Differential Revision: D31721350

fbshipit-source-id: e1450811443647feb6ca01adec4c8aaae270ffc6
2021-12-10 11:05:35 -08:00
Peter Dillinger
653c392e47 More refactoring ahead of footer & meta changes (#9240)
Summary:
I'm working on a new format_version=6 to support context
checksum (https://github.com/facebook/rocksdb/issues/9058) and this includes much of the refactoring and test
updates to support that change.

Test coverage data and manual inspection agree on dead code in
block_based_table_reader.cc (removed).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9240

Test Plan:
tests enhanced to cover more cases etc.

Extreme case performance testing indicates small % regression in fillseq (w/ compaction), though CPU profile etc. doesn't suggest any explanation. There is enhanced correctness checking in Footer::DecodeFrom, but this should be negligible.

TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=1 --disable_wal={false,true}

(Each is ops/s averaged over 50 runs, run simultaneously with competing configuration for load fairness)
Before w/ wal: 454512
After w/ wal: 444820 (-2.1%)
Before w/o wal: 1004560
After w/o wal: 998897 (-0.6%)

Since this doesn't modify WAL code, one would expect real effects to be larger in w/o wal case.

This regression will be corrected in a follow-up PR.

Reviewed By: ajkr

Differential Revision: D32813769

Pulled By: pdillinger

fbshipit-source-id: 444a244eabf3825cd329b7d1b150cddce320862f
2021-12-10 08:13:26 -08:00
anand76
ecf2bec613 Add a listener callback for end of auto error recovery (#9244)
Summary:
Previously, the OnErrorRecoveryCompleted callback was called when
RocksDB was able to successfully recover from a retryable error.
However, if the recovery failed and was eventually stopped, there was no
indication of the status. To fix that, a new OnErrorRecoveryEnd callback
is introduced that deprecates the OnErrorRecoveryCompleted callback. The
new callback is called with the original error and the new error status.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9244

Test Plan: Add a new unit test in error_handler_fs_test

Reviewed By: zhichao-cao

Differential Revision: D32922303

Pulled By: anand1976

fbshipit-source-id: f04e77a9cb92c5ea6385590682d3fcf559971b99
2021-12-08 14:30:57 -08:00
Hui Xiao
9daf07305c Replace TableProperties::properties_offsets map with external_sst_file_global_seqno_offset (#9212)
Summary:
**Context:**
Searching `TableProperties::properties_offsets` across the codebase reveals that internally it is only used to find the external SST file's global seqno offeset. Therefore we can narrow it down and replace this map property with a uint64_t property `external_sst_file_global_seqno_offset` to save memory usage related to table properties.

Note:
- See PR comments for discussion about potential impact on existing external usage of `TableProperties::properties_offsets`
- See PR comments for discussion on keeping external SST file global seqno's offset VS using a simple flag indicating seqno's existence.

**Summary:**
- Replaced `TableProperties::properties_offsets` with `TableProperties::external_sst_file_global_seqno_offset`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9212

Test Plan: - Relied on existing tests should be sufficient since `TableProperties::properties_offsets` existed before and should already be tested.

Reviewed By: ajkr

Differential Revision: D32665941

Pulled By: hx235

fbshipit-source-id: 718e44617346dc4f3b1276ee953e61c196277795
2021-12-02 08:30:36 -08:00
mrambacher
7cd5835a28 Make RateLimiter Customizable (#9141)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9141

Reviewed By: zhichao-cao

Differential Revision: D32432190

Pulled By: mrambacher

fbshipit-source-id: 7930ed88a02412128cd407b5063522484e45c6ce
2021-12-01 06:57:02 -08:00
Yanqin Jin
924616526a Update WriteBatch::AssignTimestamp() and Add (#9205)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9205

Update WriteBatch::AssignTimestamp() APIs so that they take an
additional argument, i.e. a function object called `checker` indicating the user-specified logic of performing
checks on timestamp sizes.

WriteBatch is a building block used by multiple other RocksDB components, each of which may track
timestamp information in different data structures. For example, transaction can either write to
`WriteBatchWithIndex` which is a `WriteBatch` with index, or write directly to raw `WriteBatch` if
`Transaction::DisableIndexing()` is called.
`WriteBatchWithIndex` keeps mapping from column family id to comparator, and transaction needs
to keep similar information for the `WriteBatch` if user calls `Transaction::DisableIndexing()` (dynamically)
so that we will know the size of each timestamp later. The bookkeeping info maintained by `WriteBatchWithIndex`
and `Transaction` should not overlap.
When we later call `WriteBatch::AssignTimestamp()`, we need to use these data structures to guarantee
that we do not accidentally assign timestamps for keys from column families that disable timestamp.

Reviewed By: ltamasi

Differential Revision: D31735186

fbshipit-source-id: 8b1709ed880ac72f995aa9e012e5873b290840a7
2021-11-30 22:33:00 -08:00
Artem Krylysov
552256cb1a Add rocksdb_livefiles_column_family_name C API (#9232)
Summary:
Extend C API to add new function `rocksdb_livefiles_column_family_name`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9232

Reviewed By: akankshamahajan15

Differential Revision: D32736516

Pulled By: ajkr

fbshipit-source-id: a854256a0f4652c903ab5ad8355ded051ac19987
2021-11-30 16:54:27 -08:00
mrambacher
7aa31ba4a9 Fix GetOptionsPtr for Wrapped Customizable; Allow null options map (#9213)
Summary:
1.  Fix GetOptionsPtr for Wrapped (Inner() != nullptr) Customizable objects.  This allows the inner options to be returned via this method.

2.  Allow the option type map to be nullptr.  This allows objects to be registered as options (for GetOptionsPtr) but not be used by the configuration methods.

Added tests as appropriate.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9213

Reviewed By: zhichao-cao

Differential Revision: D32718882

Pulled By: mrambacher

fbshipit-source-id: 563203d1f006a2629060feb31c5dff9a233e1e83
2021-11-30 13:23:25 -08:00
jsteemann
5384f0af6e Fix compile warnings (#9199)
Summary:
* added missing override specifiers for overriden methods
  this fixes compiler warnings emitted by g++ and clang++ when compile option `-Wsuggest-override` is turned on.
* fix compile warning with -Wmaybe-uninitialized
  g++-11 warns about a _potentially_ uninitialized variable when using `-Wmaybe_uninitialized`:
  ```
      env/env.cc: In member function ‘virtual rocksdb::Status rocksdb::Env::GetHostNameString(std::string*)’:
      env/env.cc:738:66: error: ‘hostname_buf’ may be used uninitialized [-Werror=maybe-uninitialized]
        738 |   Status s = GetHostName(hostname_buf.data(), hostname_buf.size());
            |                                                                  ^
      In file included from /usr/include/c++/11/tuple:39,
                       from /usr/include/c++/11/functional:54,
                       from ./include/rocksdb/env.h:22,
                       from env/env.cc:10:
      /usr/include/c++/11/array:176:7: note: by argument 1 of type ‘const std::array<char, 256>*’ to ‘constexpr std::array<_Tp, _Nm>::size_type std::array<_Tp, _Nm>::size() const [with _Tp = char; long unsigned int _Nm = 256]’ declared here
        176 |       size() const noexcept { return _Nm; }
            |       ^~~~
      env/env.cc:737:37: note: ‘hostname_buf’ declared here
        737 |   std::array<char, kMaxHostNameLen> hostname_buf;
  ```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9199

Reviewed By: jay-zhuang

Differential Revision: D32630703

Pulled By: pdillinger

fbshipit-source-id: 9ea3010b1105a582548e3c3c0db4475b201e4a10
2021-11-24 11:19:06 -08:00
Yanqin Jin
dc0ee3e5ca Fix internal build error (#9195)
Summary:
Internal build reported:
```
rocksdb/listener.h:470:3: error: extra ';' inside a struct [-Werror,-Wextra-semi]
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9195

Test Plan: import to fbcode and compile.

Reviewed By: akankshamahajan15

Differential Revision: D32590138

Pulled By: riversand963

fbshipit-source-id: ca4ed9cca210a1a9a12d3de17c789ef9151c57e8
2021-11-22 09:38:02 -08:00
Yanqin Jin
8101643611 Update HISTORY and version.h for 6.27 release (#9192)
Summary:
As title.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9192

Reviewed By: ltamasi

Differential Revision: D32578141

Pulled By: riversand963

fbshipit-source-id: 16216451c87e383ca8fd309acf15106e46172aaa
2021-11-19 22:11:56 -08:00
Levi Tamasi
dc5de45af8 Support readahead during compaction for blob files (#9187)
Summary:
The patch adds a new BlobDB configuration option `blob_compaction_readahead_size`
that can be used to enable prefetching data from blob files during compaction.
This is important when using storage with higher latencies like HDDs or remote filesystems.
If enabled, prefetching is used for all cases when blobs are read during compaction,
namely garbage collection, compaction filters (when the existing value has to be read from
a blob file), and `Merge` (when the value of the base `Put` is stored in a blob file).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9187

Test Plan: Ran `make check` and the stress/crash test.

Reviewed By: riversand963

Differential Revision: D32565512

Pulled By: ltamasi

fbshipit-source-id: 87be9cebc3aa01cc227bec6b5f64d827b8164f5d
2021-11-19 17:53:47 -08:00
Jay Zhuang
6cde8d2190 Deprecating iter_start_seqnum and preserve_deletes (#9091)
Summary:
`ReadOptions::iter_start_seqnum` and `DBOptions::preserve_deletes` are
deprecated, please try using user defined timestamp feature instead.
The feature is used to support differential snapshots, but not well
maintained (https://github.com/facebook/rocksdb/issues/6837, https://github.com/facebook/rocksdb/issues/8472) and the interface is not user friendly which
returns an internal key from the iterator. The user defined timestamp
feature is a more flexible feature to support similar usecase, please
switch to that if you have such usecase.
The deprecated feature will be removed in a future release.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9091

Test Plan:
check LOG

Fix https://github.com/facebook/rocksdb/issues/9090

Reviewed By: ajkr

Differential Revision: D32071750

Pulled By: jay-zhuang

fbshipit-source-id: b882c4668dd1bf26ce03c4c192f1bba584bf6104
2021-11-19 16:55:45 -08:00
Andrew Kryczka
8cf4294e25 Adhere to per-DB concurrency limit when bottom-pri compactions exist (#9179)
Summary:
- Fixed bug where bottom-pri manual compactions were counting towards `bg_compaction_scheduled_` instead of `bg_bottom_compaction_scheduled_`. It seems to have no negative effect.
- Fixed bug where automatic compaction scheduling did not consider `bg_bottom_compaction_scheduled_`. Now automatic compactions cannot be scheduled that exceed the per-DB compaction concurrency limit (`max_compactions`) when some existing compactions are bottommost.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9179

Test Plan: new unit test for manual/automatic. Also verified the existing automatic/automatic test ("ConcurrentBottomPriLowPriCompactions") hanged until changing it to explicitly enable concurrency.

Reviewed By: riversand963

Differential Revision: D32488048

Pulled By: ajkr

fbshipit-source-id: 20c4c0693678e81e43f85ed3cc3402fcf26e3310
2021-11-18 17:31:50 -08:00
Akanksha Mahajan
4a7c1dc375 Add listener API that notifies on IOError (#9177)
Summary:
Add a new API in listener.h that notifies about IOErrors on
Read/Write/Append/Flush etc. The API reports about IOStatus, filename, Operation
name, offset and length.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9177

Test Plan: Added new unit tests

Reviewed By: anand1976

Differential Revision: D32470627

Pulled By: akankshamahajan15

fbshipit-source-id: 189a717033590ae227b3beae8b1e7e185e4cdc12
2021-11-18 17:11:19 -08:00
Peter Dillinger
230660be73 Improve / clean up meta block code & integrity (#9163)
Summary:
* Checksums are now checked on meta blocks unless specifically
suppressed or not applicable (e.g. plain table). (Was other way around.)
This means a number of cases that were not checking checksums now are,
including direct read TableProperties in Version::GetTableProperties
(fixed in meta_blocks ReadTableProperties), reading any block from
PersistentCache (fixed in BlockFetcher), read TableProperties in
SstFileDumper (ldb/sst_dump/BackupEngine) before table reader open,
maybe more.
* For that to work, I moved the global_seqno+TableProperties checksum
logic to the shared table/ code, because that is used by many utilies
such as SstFileDumper.
* Also for that to work, we have to know when we're dealing with a block
that has a checksum (trailer), so added that capability to Footer based
on magic number, and from there BlockFetcher.
* Knowledge of trailer presence has also fixed a problem where other
table formats were reading blocks including bytes for a non-existant
trailer--and awkwardly kind-of not using them, e.g. no shared code
checking checksums. (BlockFetcher compression type was populated
incorrectly.) Now we only read what is needed.
* Minimized code duplication and differing/incompatible/awkward
abstractions in meta_blocks.{cc,h} (e.g. SeekTo in metaindex block
without parsing block handle)
* Moved some meta block handling code from table_properties*.*
* Moved some code specific to block-based table from shared table/ code
to BlockBasedTable class. The checksum stuff means we can't completely
separate it, but things that don't need to be in shared table/ code
should not be.
* Use unique_ptr rather than raw ptr in more places. (Note: you can
std::move from unique_ptr to shared_ptr.)

Without enhancements to GetPropertiesOfAllTablesTest (see below),
net reduction of roughly 100 lines of code.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9163

Test Plan:
existing tests and
* Enhanced DBTablePropertiesTest.GetPropertiesOfAllTablesTest to verify that
checksums are now checked on direct read of table properties by TableCache
(new test would fail before this change)
* Also enhanced DBTablePropertiesTest.GetPropertiesOfAllTablesTest to test
putting table properties under old meta name
* Also generally enhanced that same test to actually test what it was
supposed to be testing already, by kicking things out of table cache when
we don't want them there.

Reviewed By: ajkr, mrambacher

Differential Revision: D32514757

Pulled By: pdillinger

fbshipit-source-id: 507964b9311d186ae8d1131182290cbd97a99fa9
2021-11-18 11:43:44 -08:00
Hui Xiao
74544d582f Account Bloom/Ribbon filter construction memory in global memory limit (#9073)
Summary:
Note: This PR is the 4th part of a bigger PR stack (https://github.com/facebook/rocksdb/pull/9073) and will rebase/merge only after the first three PRs (https://github.com/facebook/rocksdb/pull/9070, https://github.com/facebook/rocksdb/pull/9071, https://github.com/facebook/rocksdb/pull/9130) merge.

**Context:**
Similar to https://github.com/facebook/rocksdb/pull/8428, this PR is to track memory usage during (new) Bloom Filter (i.e,FastLocalBloom) and Ribbon Filter (i.e, Ribbon128) construction, moving toward the goal of [single global memory limit using block cache capacity](https://github.com/facebook/rocksdb/wiki/Projects-Being-Developed#improving-memory-efficiency). It also constrains the size of the banding portion of Ribbon Filter during construction by falling back to Bloom Filter if that banding is, at some point, larger than the available space in the cache under `LRUCacheOptions::strict_capacity_limit=true`.

The option to turn on this feature is `BlockBasedTableOptions::reserve_table_builder_memory = true` which by default is set to `false`. We [decided](https://github.com/facebook/rocksdb/pull/9073#discussion_r741548409) not to have separate option for separate memory user in table building therefore their memory accounting are all bundled under one general option.

**Summary:**
- Reserved/released cache for creation/destruction of three main memory users with the passed-in `FilterBuildingContext::cache_res_mgr` during filter construction:
   - hash entries (i.e`hash_entries`.size(), we bucket-charge hash entries during insertion for performance),
   - banding (Ribbon Filter only, `bytes_coeff_rows` +`bytes_result_rows` + `bytes_backtrack`),
   - final filter (i.e, `mutable_buf`'s size).
      - Implementation details: in order to use `CacheReservationManager::CacheReservationHandle` to account final filter's memory, we have to store the `CacheReservationManager` object and `CacheReservationHandle` for final filter in `XXPH3BitsFilterBuilder` as well as  explicitly delete the filter bits builder when done with the final filter in block based table.
- Added option fo run `filter_bench` with this memory reservation feature

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9073

Test Plan:
- Added new tests in `db_bloom_filter_test` to verify filter construction peak cache reservation under combination of  `BlockBasedTable::Rep::FilterType` (e.g, `kFullFilter`, `kPartitionedFilter`), `BloomFilterPolicy::Mode`(e.g, `kFastLocalBloom`, `kStandard128Ribbon`, `kDeprecatedBlock`) and `BlockBasedTableOptions::reserve_table_builder_memory`
  - To address the concern for slow test: tests with memory reservation under `kFullFilter` + `kStandard128Ribbon` and `kPartitionedFilter` take around **3000 - 6000 ms** and others take around **1500 - 2000 ms**, in total adding **20000 - 25000 ms** to the test suit running locally
- Added new test in `bloom_test` to verify Ribbon Filter fallback on large banding in FullFilter
- Added test in `filter_bench` to verify that this feature does not significantly slow down Bloom/Ribbon Filter construction speed. Local result averaged over **20** run as below:
   - FastLocalBloom
      - baseline `./filter_bench -impl=2 -quick -runs 20 | grep 'Build avg'`:
         - **Build avg ns/key: 29.56295** (DEBUG_LEVEL=1), **29.98153** (DEBUG_LEVEL=0)
      - new feature (expected to be similar as above)`./filter_bench -impl=2 -quick -runs 20 -reserve_table_builder_memory=true | grep 'Build avg'`:
         - **Build avg ns/key: 30.99046** (DEBUG_LEVEL=1), **30.48867** (DEBUG_LEVEL=0)
      - new feature of RibbonFilter with fallback  (expected to be similar as above) `./filter_bench -impl=2 -quick -runs 20 -reserve_table_builder_memory=true -strict_capacity_limit=true | grep 'Build avg'` :
         - **Build avg ns/key: 31.146975** (DEBUG_LEVEL=1), **30.08165** (DEBUG_LEVEL=0)

    - Ribbon128
       - baseline `./filter_bench -impl=3 -quick -runs 20 | grep 'Build avg'`:
           - **Build avg ns/key: 129.17585** (DEBUG_LEVEL=1), **130.5225** (DEBUG_LEVEL=0)
       - new feature  (expected to be similar as above) `./filter_bench -impl=3 -quick -runs 20 -reserve_table_builder_memory=true | grep 'Build avg' `:
           - **Build avg ns/key: 131.61645** (DEBUG_LEVEL=1), **132.98075** (DEBUG_LEVEL=0)
       - new feature of RibbonFilter with fallback (expected to be a lot faster than above due to fallback) `./filter_bench -impl=3 -quick -runs 20 -reserve_table_builder_memory=true -strict_capacity_limit=true | grep 'Build avg'` :
          - **Build avg ns/key: 52.032965** (DEBUG_LEVEL=1), **52.597825** (DEBUG_LEVEL=0)
          - And the warning message of `"Cache reservation for Ribbon filter banding failed due to cache full"` is indeed logged to console.

Reviewed By: pdillinger

Differential Revision: D31991348

Pulled By: hx235

fbshipit-source-id: 9336b2c60f44d530063da518ceaf56dac5f9df8e
2021-11-18 09:42:20 -08:00
Zhichao Cao
b694cd0e0d Add tiered storage related read bytes stats to Statistic (#9123)
Summary:
Add the 3 read bytes counter to the Statistic, which will be used by storage tiering and get the information for files with different temperature.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9123

Test Plan: added new testing cases.

Reviewed By: siying

Differential Revision: D32154745

Pulled By: zhichao-cao

fbshipit-source-id: b7905d6dae469a72428742364ec07b634b6f15da
2021-11-16 15:17:17 -08:00
Peter Dillinger
f8c685c4fc Check for and disallow shared key space in block caches (#9172)
Summary:
We have three layers of block cache that often use the same key
but map to different physical data:
* BlockBasedTableOptions::block_cache
* BlockBasedTableOptions::block_cache_compressed
* BlockBasedTableOptions::persistent_cache

If any two of these happen to share an underlying implementation and key
space (insertion into one shows up in another), then memory safety is
broken. The simplest case is block_cache == block_cache_compressed.
(Credit mrambacher for asking about this case in a review.)

With this change, we explicitly check for overlap and preemptively and
safely fail with a Status code.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9172

Test Plan: test added. Crashes without new check

Reviewed By: anand1976

Differential Revision: D32465659

Pulled By: pdillinger

fbshipit-source-id: 3876b45b6dce6167e5a7a642725ddc86b96f8e40
2021-11-16 11:16:05 -08:00