Commit Graph

418 Commits

Author SHA1 Message Date
Baptiste Lemaire
e817bc9628 Added memtable garbage statistics (#8411)
Summary:
**Summary**:
2 new statistics counters are added to RocksDB: `MEMTABLE_PAYLOAD_BYTES_AT_FLUSH` and `MEMTABLE_GARBAGE_BYTES_AT_FLUSH`. The former tracks how many raw bytes of useful data are present on the memtable at flush time, whereas the latter is tracks how many of these raw bytes are considered garbage, meaning that they ended up not being imported on the SSTables resulting from the flush operations.

**Unit test**: run `make db_flush_test -j$(nproc); ./db_flush_test` to run the unit test.
This executable includes 3 tests, that test support and correct stat calculations for workloads with inserts, deletes, and DeleteRanges. The parameters are set such that the workloads are performed on a single memtable, and a single SSTable is created as a result of the flush operation. The flush operation is manually called in the test file. The tests verify that the values of these 2 statistics counters introduced in this PR  can be exactly predicted, showing that we have a full understanding of the underlying operations.

**Performance testing**:
`./db_bench -statistics -benchmarks=fillrandom -num=10000000` repeated 10 times.
Timing done using "date" function in a bash script.
_Results_:
Original Rocksdb fork: mean 66.6 sec, std 1.18 sec.
This feature branch: mean 67.4 sec, std 1.35 sec.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8411

Reviewed By: akankshamahajan15

Differential Revision: D29150629

Pulled By: bjlemaire

fbshipit-source-id: 7b3c2e86d50c6aa34fa50fd134282eacb543a5b1
2021-06-18 04:57:27 -07:00
mrambacher
8948dc8524 Make ImmutableOptions struct that inherits from ImmutableCFOptions and ImmutableDBOptions (#8262)
Summary:
The ImmutableCFOptions contained a bunch of fields that belonged to the ImmutableDBOptions.  This change cleans that up by introducing an ImmutableOptions struct.  Following the pattern of Options struct, this class inherits from the DB and CFOption structs (of the Immutable form).

Only one structural change (the ImmutableCFOptions::fs was changed to a shared_ptr from a raw one) is in this PR.  All of the other changes involve moving the member variables from the ImmutableCFOptions into the ImmutableOptions and changing member variables or function parameters as required for compilation purposes.

Follow-on PRs may do a further clean-up of the code, such as renaming variables (such as "ImmutableOptions cf_options") and potentially eliminating un-needed function parameters (there is no longer a need to pass both an ImmutableDBOptions and an ImmutableOptions to a function).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8262

Reviewed By: pdillinger

Differential Revision: D28226540

Pulled By: mrambacher

fbshipit-source-id: 18ae71eadc879dedbe38b1eb8e6f9ff5c7147dbf
2021-05-05 14:00:17 -07:00
Andrew Kryczka
1ba2b8a568 Add sample_for_compression results to table properties (#8139)
Summary:
Added `TableProperties::{fast,slow}_compression_estimated_data_size`.
These properties are present in block-based tables when
`ColumnFamilyOptions::sample_for_compression > 0` and the necessary
compression library is supported when the file is generated. They
contain estimates of what `TableProperties::data_size` would be if the
"fast"/"slow" compression library had been used instead. One
limitation is we do not record exactly which "fast" (ZSTD or Zlib)
or "slow" (LZ4 or Snappy) compression library produced the result.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8139

Test Plan:
- new unit test
- ran `db_bench` with `sample_for_compression=1`; verified the `data_size` property matches the `{slow,fast}_compression_estimated_data_size` when the same compression type is used for the output file compression and the sampled compression

Reviewed By: riversand963

Differential Revision: D27454338

Pulled By: ajkr

fbshipit-source-id: 9529293de93ddac7f03b2e149d746e9f634abac4
2021-03-31 18:21:50 -07:00
Jay Zhuang
a781b103da Fix getApproximateMemTableStats() return type (#8098)
Summary:
Which should return 2 long instead of an array.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8098

Reviewed By: mrambacher

Differential Revision: D27308741

Pulled By: jay-zhuang

fbshipit-source-id: 44beea2bd28cf6779b048bebc98f2426fe95e25c
2021-03-31 09:46:47 -07:00
Zhichao Cao
08ec5e7321 Add the statistics and info log for Error handler (#8050)
Summary:
Add statistics and info log for error handler: counters for bg error, bg io error, bg retryable io error, auto resume, auto resume total retry, and auto resume sucess; Histogram for auto resume retry count in each recovery call.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8050

Test Plan: make check and add test to error_handler_fs_test

Reviewed By: anand1976

Differential Revision: D26990565

Pulled By: zhichao-cao

fbshipit-source-id: 49f71e8ea4e9db8b189943976404205b56ab883f
2021-03-17 22:38:13 -07:00
Xiaopeng Zhang
c603f2f898 support getUsage and getPinnedUsage in JavaAPI for Cache (#7925)
Summary:
support getUsage and getPinnedUsage in JavaAPI for Cache
also fix a typo in LRUCacheTest.java that the highPriPoolRatio is not valid(set 5, I guess it means 0.05)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7925

Reviewed By: mrambacher

Differential Revision: D26900241

Pulled By: ajkr

fbshipit-source-id: 735d1e40a16fa8919c89c7c7154ba7f81208ec33
2021-03-17 09:30:33 -07:00
Andrew Kryczka
d904233d2f Limit buffering for collecting samples for compression dictionary (#7970)
Summary:
For dictionary compression, we need to collect some representative samples of the data to be compressed, which we use to either generate or train (when `CompressionOptions::zstd_max_train_bytes > 0`) a dictionary. Previously, the strategy was to buffer all the data blocks during flush, and up to the target file size during compaction. That strategy allowed us to randomly pick samples from as wide a range as possible that'd be guaranteed to land in a single output file.

However, some users try to make huge files in memory-constrained environments, where this strategy can cause OOM. This PR introduces an option, `CompressionOptions::max_dict_buffer_bytes`, that limits how much data blocks are buffered before we switch to unbuffered mode (which means creating the per-SST dictionary, writing out the buffered data, and compressing/writing new blocks as soon as they are built). It is not strict as we currently buffer more than just data blocks -- also keys are buffered. But it does make a step towards giving users predictable memory usage.

Related changes include:

- Changed sampling for dictionary compression to select unique data blocks when there is limited availability of data blocks
- Made use of `BlockBuilder::SwapAndReset()` to save an allocation+memcpy when buffering data blocks for building a dictionary
- Changed `ParseBoolean()` to accept an input containing characters after the boolean. This is necessary since, with this PR, a value for `CompressionOptions::enabled` is no longer necessarily the final component in the `CompressionOptions` string.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7970

Test Plan:
- updated `CompressionOptions` unit tests to verify limit is respected (to the extent expected in the current implementation) in various scenarios of flush/compaction to bottommost/non-bottommost level
- looked at jemalloc heap profiles right before and after switching to unbuffered mode during flush/compaction. Verified memory usage in buffering is proportional to the limit set.

Reviewed By: pdillinger

Differential Revision: D26467994

Pulled By: ajkr

fbshipit-source-id: 3da4ef9fba59974e4ef40e40c01611002c861465
2021-02-19 14:09:54 -08:00
Tomas Kolda
1001bc01c9 Read Options to support direct slice (#7132)
Summary:
This request is adding support for using DirectSlice in ReadOptions lower/upper bounds.

To be more efficient I have added setLength to DirectSlice so I can just update the length to be used by slice from direct buffer. It is also needed, because when one creates iterator it keep pointer to original slice so setting new slice in options does not help (it needs to reuse existing one). Using this approach one can modify the slice any time during operations with iterator.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7132

Reviewed By: zhichao-cao

Differential Revision: D25840092

Pulled By: jay-zhuang

fbshipit-source-id: 760167baf61568c9a35138145c4bf9b06824cb71
2021-01-15 17:05:18 -08:00
Tomas Kolda
ac956f2bea S390 Linux is failing tests ColumnFamilyOptionsTest.cfPaths (#7853)
Summary:
Fix ColumnFamilyOptionsTest.cfPaths and OptionsTest.cfPaths in 6.15 branch (and probably other branches including master)

has_exception variable was not initialized which was causing test failures and incorrect behavior on s390 platform (and maybe others as variable content is undefined).

adamretter please take a look.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7853

Reviewed By: akankshamahajan15

Differential Revision: D25901639

Pulled By: jay-zhuang

fbshipit-source-id: 151b5db27b495fc6d8ed54c0eccbde2508215ac5
2021-01-15 16:32:31 -08:00
Adam Retter
62afa968c2 Fix various small build issues, Java API naming (#7776)
Summary:
* Compatibility with older GCC.
* Compatibility with older jemalloc libraries.
* Remove Docker warning when building i686 binaries.
* Fix case inconsistency in Java API naming (potential update to HISTORY.md deferred)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7776

Reviewed By: akankshamahajan15

Differential Revision: D25607235

Pulled By: pdillinger

fbshipit-source-id: 7ab0fb7fa7a34e97ed0bec991f5081acb095777d
2020-12-18 16:12:26 -08:00
Ramkumar Vadivelu
9a690a74e1 In ParseInternalKey(), include corrupt key info in Status (#7515)
Summary:
Fixes Issue https://github.com/facebook/rocksdb/issues/7497

When allow_data_in_errors db_options is set, log error key details in `ParseInternalKey()`

Have fixed most of the calls. Have few TODOs still pending - because have to make more deeper changes to pass in the allow_data_in_errors flag. Will do those in a separate PR later.

Tests:
- make check
- some of the existing tests that exercise the "internal key too small" condition are: dbformat_test, cuckoo_table_builder_test
- some of the existing tests that exercise the corrupted key path are: corruption_test, merge_helper_test, compaction_iterator_test

Example of new status returns:
- Key too small - `Corrupted Key: Internal Key too small. Size=5`
- Corrupt key with allow_data_in_errors option set to false: `Corrupted Key: '<redacted>' seq:3, type:3`
- Corrupt key with allow_data_in_errors option set to true: `Corrupted Key: '61' seq:3, type:3`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7515

Reviewed By: ajkr

Differential Revision: D24240264

Pulled By: ramvadiv

fbshipit-source-id: bc48f5d4475ac19d7713e16df37505b31aac42e7
2020-10-28 10:12:58 -07:00
Adam Retter
ccbf468cb1 Small JNI improvements (#7371)
Summary:
* Avoid some unnecessary array copy operations on read/write
* Remove some duplicated code
* Don't leak arrays on some exceptions
* Fixed some doc comments

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7371

Reviewed By: jay-zhuang

Differential Revision: D24312932

Pulled By: pdillinger

fbshipit-source-id: 422fe6b98bbdb922a148922ac0d2d965c715176e
2020-10-14 22:23:56 -07:00
Tomasz Posluszny
6528ecc800 Add event listeners to RocksJava (#7425)
Summary:
Allows adding event listeners in RocksJava.

* Adds listeners getter and setter in `Options` and `DBOptions` classes.
* Adds `EventListener` Java interface and base class for implementing custom event listener callbacks - `AbstractEventListener`, which has an underlying native callback class implementing C++ `EventListener` class.
* `AbstractEventListener` class has mechanism for selectively enabling its callback methods in order to prevent invoking Java method if it is not implemented. This decreases performance cost in case only subset of event listener callback methods is needed - the JNI code for remaining "no-op" callbacks is not executed.
* The code is covered by unit tests in `EventListenerTest.java`, there are also tests added for setting/getting listeners field in `OptionsTest.java` and `DBOptionsTest.java`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7425

Reviewed By: pdillinger

Differential Revision: D24063390

Pulled By: jay-zhuang

fbshipit-source-id: 508c359538983d6b765e70d9989c351794a944ee
2020-10-14 11:33:52 -07:00
Akanksha Mahajan
38d0a365e3 Add Stats for MultiGet (#7366)
Summary:
Add following stats for MultiGet in Histogram to get more insight on MultiGet.
    1. Number of index and filter blocks read from file as part of MultiGet
    request per level.
    2. Number of data blocks read from file per level.
    3. Number of SST files loaded from file system per level.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7366

Reviewed By: anand1976

Differential Revision: D24127040

Pulled By: akankshamahajan15

fbshipit-source-id: e63a003056b833729b277edc0639c08fb432756b
2020-10-07 13:28:48 -07:00
Ramkumar Vadivelu
e04a50923d Change ParseInternalKey() to return Status instead of bool (#7457)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/7430

Change ParseInternalKey() to return Status instead of bool.

db_bench (seekrandom) based before/after results with value size of 100 bytes and 16 bytes can be found at (tests ran on an udb server):
https://www.dropbox.com/s/47bwamdy5ozngph/PIK_ret_Status_results.xlsx?dl=0

![db_bench_results](https://user-images.githubusercontent.com/62277872/94642825-2a21a800-029a-11eb-88f2-124136c83fd3.png)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7457

Reviewed By: ajkr

Differential Revision: D24002433

Pulled By: ramvadiv

fbshipit-source-id: ac253ecf577a29044c47c3fe254a01e71404c44c
2020-09-30 19:16:47 -07:00
Tomasz Posluszny
6efae4b00d Add missing Java API for boolean and numerical fields in DBOptions (#7387)
Summary:
Exposes the following previously missing DBOptions fields in the RocksJava API:
- persist_stats_to_disk
- max_write_batch_group_size_bytes
- skip_checking_sst_file_sizes_on_db_open
- avoid_unnecessary_blocking_io
- write_dbid_to_manifest
- log_readahead_size
- best_efforts_recovery
- max_bgerror_resume_count
- bgerror_resume_retry_interval

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7387

Reviewed By: siying

Differential Revision: D23707785

Pulled By: jay-zhuang

fbshipit-source-id: e5688c7d40d83128734605ef7b0720a55fdfa699
2020-09-18 12:28:40 -07:00
Adam Retter
3ac07a12fe RocksJava - Add errorIfLogFileExists parameter to RocksDB.openReadOnly (#7046)
Summary:
Expose from C++ API to Java API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7046

Reviewed By: riversand963

Differential Revision: D23726297

Pulled By: pdillinger

fbshipit-source-id: fc66bf626ce6fe9797e7d021ac849eacab91bf6d
2020-09-17 15:41:25 -07:00
Tomasz Posluszny
6b72342a12 Implement missing Java API for ColumnFamilyOptions (#7372)
Summary:
Covered methods:
- OldDefaults()
- OptimizeForSmallDb(std::shared_ptr<Cache>)

Covered fields:
- cf_paths

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7372

Reviewed By: pdillinger

Differential Revision: D23683449

Pulled By: jay-zhuang

fbshipit-source-id: 3e5a8b657cc382c19de3a48c666a3b0e8d96968d
2020-09-14 12:09:04 -07:00
Tomasz Posluszny
ec5add398c Implement Java API for ConcurrentTaskLimiter class and compaction_thread_limiter field in ColumnFamilyOptions (#7347)
Summary:
as title

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7347

Test Plan: unit tests included

Reviewed By: jay-zhuang

Differential Revision: D23592552

Pulled By: pdillinger

fbshipit-source-id: 1c3571b6f42bfd0cfd723ff49d01fbc02a1be45b
2020-09-09 12:44:20 -07:00
Aaron Kabcenell
56ed601df3 Compaction Read/Write Stats by Compaction Type (#7165)
Summary:
Adds compaction statistics (total bytes read and written) for compactions that occur for delete-triggered, periodic, and TTL compaction reasons.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7165

Test Plan:
TTL and periodic can be checked by runnning db_bench with the options activated:

/db_bench --benchmarks="fillrandom,stats" --statistics --num=10000000 -base_background_compactions=16 -periodic_compaction_seconds=1
./db_bench --benchmarks="fillrandom,stats" --statistics --num=10000000 -base_background_compactions=16 -fifo_compaction_ttl=1

Setting the time to one second causes non-zero bytes read/written for those compaction reasons. Disabling them or setting them to times longer than the test run length causes the stats to return to zero as expected.

Delete-triggered compaction counting is tested in DBTablePropertiesTest.DeletionTriggeredCompactionMarking

Reviewed By: ajkr

Differential Revision: D22693050

Pulled By: akabcenell

fbshipit-source-id: d15cef4d94576f703015c8942d5f0d492f69401d
2020-07-29 13:39:29 -07:00
Tomas Kolda
cd4592c220 SST Partitioner interface that allows to split SST files (#6957)
Summary:
SST Partitioner interface that allows to split SST files during compactions.

It basically instruct compaction to create a new file when needed. When one is using well defined prefixes and prefixed way of defining tables it is good to define also partitioning so that promotion of some SST file does not cover huge key space on next level (worst case complete space).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6957

Reviewed By: ajkr

Differential Revision: D22461239

fbshipit-source-id: 9ce07bba08b3ba89c2d45630520368f704d1316e
2020-07-24 13:44:49 -07:00
Adam Retter
a08f4031cb Align RocksJava BlockBasedTableOptions with C++ API (#7088)
Summary:
Closes https://github.com/facebook/rocksdb/issues/6729

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7088

Reviewed By: riversand963

Differential Revision: D22481624

Pulled By: pdillinger

fbshipit-source-id: 27c0ebd4168d374ae81f3595e034150c1c97f8b8
2020-07-10 14:32:58 -07:00
Adam Retter
899e59ecb7 Add DB::OpenAsSecondary to RocksJava (#7047)
Summary:
Closes https://github.com/facebook/rocksdb/issues/5852
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7047

Reviewed By: cheng-chang

Differential Revision: D22335162

Pulled By: pdillinger

fbshipit-source-id: 75f3c524deccea7ebc0ad288da41f1ea81406c1c
2020-07-06 11:48:57 -07:00
Adam Retter
0117cbfc96 Adds a function to RocksJava for retrieving the version (#7083)
Summary:
Adds the function `RocksDB#rocksdbVersion()` for retrieving the RocksDB version.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7083

Reviewed By: cheng-chang

Differential Revision: D22391628

Pulled By: pdillinger

fbshipit-source-id: e1cabcf28aa81f5ee8dcdce5c9eca6b3155a279e
2020-07-06 11:06:21 -07:00
Akanksha Mahajan
2677bd5967 Add logs and stats in DeleteScheduler (#6927)
Summary:
Add logs and stats for files marked as trash and files deleted immediately in DeleteScheduler
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6927

Test Plan: make check -j64

Reviewed By: riversand963

Differential Revision: D21869068

Pulled By: akankshamahajan15

fbshipit-source-id: e9f673c4fa8049ce648b23c75d742f2f9c6c57a1
2020-06-05 09:43:04 -07:00
Adam Retter
9060e6fa79 Add newer WBWI::NewIteratorWithBase functions to RocksJava (#6872)
Summary:
Exposes the `ReadOptions` arguments to `NewIteratorWithBase`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6872

Reviewed By: ajkr

Differential Revision: D21725867

Pulled By: pdillinger

fbshipit-source-id: 4079ba590cc13ba7a6244ed91439d89c40a543b6
2020-05-27 11:59:12 -07:00
mrambacher
4cbc19d2a1 Add a ConfigOptions for use in comparing objects and converting to/from strings (#6389)
Summary:
The methods in convenience.h are used to compare/convert objects to/from strings.  There is a mishmash of parameters in use here with more needed in the future.  This PR replaces those parameters with a single structure.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6389

Reviewed By: siying

Differential Revision: D21163707

Pulled By: zhichao-cao

fbshipit-source-id: f807b4cc7e2b0af3871536b69546b2604dfa81bd
2020-04-21 17:38:17 -07:00
Peter Dillinger
45d2b4efca Fix tabs and lint-ignores (#6734)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6734

Reviewed By: cheng-chang

Differential Revision: D21134556

Pulled By: pdillinger

fbshipit-source-id: 3636cc1d1333137b70031f8277458781c21631fb
2020-04-20 11:39:31 -07:00
Nicolas Pépin-Perreault
9e6f3efcd2 Add RocksIterator::Refresh (#6573)
Summary:
This PR exposes the `Iterator::Refresh` method to the Java API by adding it on the `RocksIteratorInterface` interface. There are three concrete implementations: `RocksIterator`, `SstFileReaderIterator`, and `WBWIRocksIterator`. For the first two cases, the JNI side simply delegates to the underlying `Iterator::Refresh` method; in the last case, as it doesn't share an ancestor, and per the discussion in https://github.com/facebook/rocksdb/issues/3465, a `Status::NotSupported` exception is thrown.

As the last PR had no activity in a while, I'm opening a new one - I'm completely fine with merging the previous PR if it gets completed before this is reviewed.

Let me know if there's anything missing or anything else I can do 👍
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6573

Reviewed By: cheng-chang

Differential Revision: D20604666

Pulled By: pdillinger

fbshipit-source-id: 4de17df1180c3b87b76cfdd77b674b81fc0563f7
2020-04-16 15:55:26 -07:00
Tomas Kolda
0b136308b0 Fix crash in JNI getApproximateSizes (#6652)
Summary:
This change is fixing a crash happening in getApproximateSizes JNI implementation. It also reenables Java test that was crashing most likelly because if this bug.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6652

Reviewed By: cheng-chang

Differential Revision: D20874865

Pulled By: pdillinger

fbshipit-source-id: da95516f15e5df2efe1a4e5690a2ce172cb53f87
2020-04-07 20:19:25 -07:00
Sahib Pandori
487ebe4fd5 Add Java API for rocksdb::CancelAllBackgroundWork() (#6657)
Summary:
Adding a Java API for rocksdb::CancelAllBackgroundWork() so that the user can call this (when required) before closing the DB. This is to **prevent the crashes when manual compaction is running and the user decides to close the DB**.

Calling CancelAllBackgroundWork() seems to be the recommended way to make sure that it's safe to close the DB (according to RocksDB FAQ: https://github.com/facebook/rocksdb/wiki/RocksDB-FAQ#basic-readwrite).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6657

Reviewed By: cheng-chang

Differential Revision: D20896395

Pulled By: pdillinger

fbshipit-source-id: 8a8208c10093db09bd35db9af362211897870d96
2020-04-07 20:15:38 -07:00
Otto Kekäläinen
f6c2777d95 Fix spelling: commited -> committed (#6481)
Summary:
In most places in the code the variable names are spelled correctly as
COMMITTED but in a couple places not. This fixes them and ensures the
variable is always called COMMITTED everywhere.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6481

Differential Revision: D20306776

Pulled By: pdillinger

fbshipit-source-id: b6c1bfe41db559b4bc6955c530934460c07f7022
2020-03-06 12:45:20 -08:00
sdong
fdf882ded2 Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433)
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433

Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.

Differential Revision: D19977691

fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
2020-02-20 12:09:57 -08:00
Tomas Kolda
e412a426d6 JNI direct buffer support for basic operations (#2283)
Summary:
It is very useful to support direct ByteBuffers in Java. It allows to have zero memory copy and some serializers are using that directly so one do not need to create byte[] array for it.

This change also contains some fixes for Windows JNI build.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/2283

Differential Revision: D19834971

Pulled By: pdillinger

fbshipit-source-id: 44173aa02afc9836c5498c592fd1ea95b6086e8e
2020-02-11 14:48:30 -08:00
Mike Kolupaev
637e64b9ac Add an option to prevent DB::Open() from querying sizes of all sst files (#6353)
Summary:
When paranoid_checks is on, DBImpl::CheckConsistency() iterates over all sst files and calls Env::GetFileSize() for each of them. As far as I could understand, this is pretty arbitrary and doesn't affect correctness - if filesystem doesn't corrupt fsynced files, the file sizes will always match; if it does, it may as well corrupt contents as well as sizes, and rocksdb doesn't check contents on open.

If there are thousands of sst files, getting all their sizes takes a while. If, on top of that, Env is overridden to use some remote storage instead of local filesystem, it can be *really* slow and overload the remote storage service. This PR adds an option to not do GetFileSize(); instead it does GetChildren() for parent directory to check that all the expected sst files are at least present, but doesn't check their sizes.

We can't just disable paranoid_checks instead because paranoid_checks do a few other important things: make the DB read-only on write errors, print error messages on read errors, etc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6353

Test Plan: ran the added sanity check unit test. Will try it out in a LogDevice test cluster where the GetFileSize() calls are causing a lot of trouble.

Differential Revision: D19656425

Pulled By: al13n321

fbshipit-source-id: c2c421b367633033760d1f56747bad206d1fbf82
2020-02-04 01:27:26 -08:00
Adam Retter
7242dae7fe Improve RocksJava Comparator (#6252)
Summary:
This is a redesign of the API for RocksJava comparators with the aim of improving performance. It also simplifies the class hierarchy.

**NOTE**: This breaks backwards compatibility for existing 3rd party Comparators implemented in Java... so we need to consider carefully which release branches this goes into.

Previously when implementing a comparator in Java the developer had a choice of subclassing either `DirectComparator` or `Comparator` which would use direct and non-direct byte-buffers resepectively (via `DirectSlice` and `Slice`).

In this redesign there we have eliminated the overhead of using the Java Slice classes, and just use `ByteBuffer`s. The `ComparatorOptions` supplied when constructing a Comparator allow you to choose between direct and non-direct byte buffers by setting `useDirect`.

In addition, the `ComparatorOptions` now allow you to choose whether a ByteBuffer is reused over multiple comparator calls, by setting `maxReusedBufferSize > 0`. When buffers are reused, ComparatorOptions provides a choice of mutex type by setting `useAdaptiveMutex`.

 ---
[JMH benchmarks previously indicated](https://github.com/facebook/rocksdb/pull/6241#issue-356398306) that the difference between C++ and Java for implementing a comparator was ~7x slowdown in Java.

With these changes, when reusing buffers and guarding access to them via mutexes the slowdown is approximately the same. However, these changes offer a new facility to not reuse mutextes, which reduces the slowdown to ~5.5x in Java. We also offer a `thread_local` mechanism for reusing buffers, which reduces slowdown to ~5.2x in Java (closes https://github.com/facebook/rocksdb/pull/4425).

These changes also form a good base for further optimisation work such as further JNI lookup caching, and JNI critical.

 ---
These numbers were captured without jemalloc. With jemalloc, the performance improves for all tests, and the Java slowdown reduces to between 4.8x and 5.x.

```
ComparatorBenchmarks.put                                                native_bytewise  thrpt   25  124483.795 ± 2032.443  ops/s
ComparatorBenchmarks.put                                        native_reverse_bytewise  thrpt   25  114414.536 ± 3486.156  ops/s
ComparatorBenchmarks.put              java_bytewise_non-direct_reused-64_adaptive-mutex  thrpt   25   17228.250 ± 1288.546  ops/s
ComparatorBenchmarks.put          java_bytewise_non-direct_reused-64_non-adaptive-mutex  thrpt   25   16035.865 ± 1248.099  ops/s
ComparatorBenchmarks.put                java_bytewise_non-direct_reused-64_thread-local  thrpt   25   21571.500 ±  871.521  ops/s
ComparatorBenchmarks.put                  java_bytewise_direct_reused-64_adaptive-mutex  thrpt   25   23613.773 ± 8465.660  ops/s
ComparatorBenchmarks.put              java_bytewise_direct_reused-64_non-adaptive-mutex  thrpt   25   16768.172 ± 5618.489  ops/s
ComparatorBenchmarks.put                    java_bytewise_direct_reused-64_thread-local  thrpt   25   23921.164 ± 8734.742  ops/s
ComparatorBenchmarks.put                              java_bytewise_non-direct_no-reuse  thrpt   25   17899.684 ±  839.679  ops/s
ComparatorBenchmarks.put                                  java_bytewise_direct_no-reuse  thrpt   25   22148.316 ± 1215.527  ops/s
ComparatorBenchmarks.put      java_reverse_bytewise_non-direct_reused-64_adaptive-mutex  thrpt   25   11311.126 ±  820.602  ops/s
ComparatorBenchmarks.put  java_reverse_bytewise_non-direct_reused-64_non-adaptive-mutex  thrpt   25   11421.311 ±  807.210  ops/s
ComparatorBenchmarks.put        java_reverse_bytewise_non-direct_reused-64_thread-local  thrpt   25   11554.005 ±  960.556  ops/s
ComparatorBenchmarks.put          java_reverse_bytewise_direct_reused-64_adaptive-mutex  thrpt   25   22960.523 ± 1673.421  ops/s
ComparatorBenchmarks.put      java_reverse_bytewise_direct_reused-64_non-adaptive-mutex  thrpt   25   18293.317 ± 1434.601  ops/s
ComparatorBenchmarks.put            java_reverse_bytewise_direct_reused-64_thread-local  thrpt   25   24479.361 ± 2157.306  ops/s
ComparatorBenchmarks.put                      java_reverse_bytewise_non-direct_no-reuse  thrpt   25    7942.286 ±  626.170  ops/s
ComparatorBenchmarks.put                          java_reverse_bytewise_direct_no-reuse  thrpt   25   11781.955 ± 1019.843  ops/s
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6252

Differential Revision: D19331064

Pulled By: pdillinger

fbshipit-source-id: 1f3b794e6a14162b2c3ffb943e8c0e64a0c03738
2020-02-03 12:30:13 -08:00
Adam Retter
e697da0b18 RocksDB#keyMayExist should not assume database values are unicode strings (#6186)
Summary:
Closes https://github.com/facebook/rocksdb/issues/6183
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6186

Differential Revision: D19201281

Pulled By: pdillinger

fbshipit-source-id: 1c96b4ea09e826f91e44b0009eba3de0991d9053
2019-12-20 14:27:58 -08:00
Adam Retter
b433bbefe9 Add missing mutable DBOptions to RocksJava (#6152)
Summary:
As requested in https://github.com/facebook/rocksdb/issues/6127
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6152

Differential Revision: D18955608

Pulled By: pdillinger

fbshipit-source-id: 3e1367d944e44d5f1675a422f7dd2451c86feb6f
2019-12-12 12:01:19 -08:00
Peter Dillinger
57f3032285 Allow fractional bits/key in BloomFilterPolicy (#6092)
Summary:
There's no technological impediment to allowing the Bloom
filter bits/key to be non-integer (fractional/decimal) values, and it
provides finer control over the memory vs. accuracy trade-off. This is
especially handy in using the format_version=5 Bloom filter in place
of the old one, because bits_per_key=9.55 provides the same accuracy as
the old bits_per_key=10.

This change not only requires refining the logic for choosing the best
num_probes for a given bits/key setting, it revealed a flaw in that logic.
As bits/key gets higher, the best num_probes for a cache-local Bloom
filter is closer to bpk / 2 than to bpk * 0.69, the best choice for a
standard Bloom filter. For example, at 16 bits per key, the best
num_probes is 9 (FP rate = 0.0843%) not 11 (FP rate = 0.0884%).
This change fixes and refines that logic (for the format_version=5
Bloom filter only, just in case) based on empirical tests to find
accuracy inflection points between each num_probes.

Although bits_per_key is now specified as a double, the new Bloom
filter converts/rounds this to "millibits / key" for predictable/precise
internal computations. Just in case of unforeseen compatibility
issues, we round to the nearest whole number bits / key for the
legacy Bloom filter, so as not to unlock new behaviors for it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6092

Test Plan: unit tests included

Differential Revision: D18711313

Pulled By: pdillinger

fbshipit-source-id: 1aa73295f152a995328cb846ef9157ae8a05522a
2019-11-26 15:59:34 -08:00
Chen, You
51185592fd Add unordered write option rocksjava (#5839)
Summary:
Add unordered_write option api and related ut to rocksjava
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5839

Differential Revision: D17604446

Pulled By: maysamyabandeh

fbshipit-source-id: c6b07e85ca9d5e3a92973ddb6ab2bc079e53c9c1
2019-09-27 10:19:40 -07:00
Yikun Jiang
2367656b6c Remove invalid comparison of va_list and nullptr (#5836)
Summary:
The comparison of va_list and nullptr is always False under any arch, and will raise invalid operands of types error in aarch64 env (`error: invalid operands of types ‘va_list {aka __va_list}’ and ‘std::nullptr_t’ to binary ‘operator!=’`).

This patch removes this invalid assert.

Closes: https://github.com/facebook/rocksdb/issues/4277
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5836

Differential Revision: D17532470

fbshipit-source-id: ca98078ecbc6a9416c69de3bd6ffcfa33a0f0185
2019-09-23 12:15:01 -07:00
sdong
c06b54d0c6 Apply formatter on recent 45 commits. (#5827)
Summary:
Some recent commits might not have passed through the formatter. I formatted recent 45 commits. The script hangs for more commits so I stopped there.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5827

Test Plan: Run all existing tests.

Differential Revision: D17483727

fbshipit-source-id: af23113ee63015d8a43d89a3bc2c1056189afe8f
2019-09-19 12:34:17 -07:00
Tomas Kolda
3a408eeae9 Adding support for deleteFilesInRanges in JNI (#4031)
Summary:
It is very useful method call to achieve https://github.com/facebook/rocksdb/wiki/Delete-A-Range-Of-Keys
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4031

Differential Revision: D13515418

Pulled By: vjnadimpalli

fbshipit-source-id: 930b48e0992ef07fd1edd0b0cb5f780fabb1b4b5
2019-09-18 09:45:11 -07:00
Richard He
699e1b5ede Added support for SstFileReader JNI interface (#5556)
Summary:
Feature request as per https://github.com/facebook/rocksdb/issues/5538 issue.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5556

Differential Revision: D17219008

fbshipit-source-id: e31f18dec318416eac9dea8213bab31da96e1f3a
2019-09-09 18:12:53 -07:00
Wilfried Goesgens
fbab9913e2 upgrade gtest 1.7.0 => 1.8.1 for json result writing
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5332

Differential Revision: D17242232

fbshipit-source-id: c0d4646556a1335e51ac7382b986ca7f6ced7b64
2019-09-09 11:24:11 -07:00
Zhongyi Xie
2f41ecfe75 Refactor trimming logic for immutable memtables (#5022)
Summary:
MyRocks currently sets `max_write_buffer_number_to_maintain` in order to maintain enough history for transaction conflict checking. The effectiveness of this approach depends on the size of memtables. When memtables are small, it may not keep enough history; when memtables are large, this may consume too much memory.
We are proposing a new way to configure memtable list history: by limiting the memory usage of immutable memtables. The new option is `max_write_buffer_size_to_maintain` and it will take precedence over the old `max_write_buffer_number_to_maintain` if they are both set to non-zero values. The new option accounts for the total memory usage of flushed immutable memtables and mutable memtable. When the total usage exceeds the limit, RocksDB may start dropping immutable memtables (which is also called trimming history), starting from the oldest one.
The semantics of the old option actually works both as an upper bound and lower bound. History trimming will start if number of immutable memtables exceeds the limit, but it will never go below (limit-1) due to history trimming.
In order the mimic the behavior with the new option, history trimming will stop if dropping the next immutable memtable causes the total memory usage go below the size limit. For example, assuming the size limit is set to 64MB, and there are 3 immutable memtables with sizes of 20, 30, 30. Although the total memory usage is 80MB > 64MB, dropping the oldest memtable will reduce the memory usage to 60MB < 64MB, so in this case no memtable will be dropped.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5022

Differential Revision: D14394062

Pulled By: miasantreble

fbshipit-source-id: 60457a509c6af89d0993f988c9b5c2aa9e45f5c5
2019-08-23 13:55:34 -07:00
Maysam Yabandeh
208556ee13 WritePrepared: fix Get without snapshot (#5664)
Summary:
if read_options.snapshot is not set, ::Get will take the last sequence number after taking a super-version and uses that as the sequence number. Theoretically max_eviceted_seq_ could advance this sequence number. This could lead ::IsInSnapshot that will be invoked by the ReadCallback to notice the absence of the snapshot. In this case, the ReadCallback should have passed a non-value to snap_released so that it could be set by the ::IsInSnapshot. The patch does that, and adds a unit test to verify it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5664

Differential Revision: D16614033

Pulled By: maysamyabandeh

fbshipit-source-id: 06fb3fd4aacd75806ed1a1acec7961f5d02486f2
2019-08-05 13:41:21 -07:00
Mike Kolupaev
b4d7209428 Add an option to put first key of each sst block in the index (#5289)
Summary:
The first key is used to defer reading the data block until this file gets to the top of merging iterator's heap. For short range scans, most files never make it to the top of the heap, so this change can reduce read amplification by a lot sometimes.

Consider the following workload. There are a few data streams (we'll be calling them "logs"), each stream consisting of a sequence of blobs (we'll be calling them "records"). Each record is identified by log ID and a sequence number within the log. RocksDB key is concatenation of log ID and sequence number (big endian). Reads are mostly relatively short range scans, each within a single log. Writes are mostly sequential for each log, but writes to different logs are randomly interleaved. Compactions are disabled; instead, when we accumulate a few tens of sst files, we create a new column family and start writing to it.

So, a typical sst file consists of a few ranges of blocks, each range corresponding to one log ID (we use FlushBlockPolicy to cut blocks at log boundaries). A typical read would go like this. First, iterator Seek() reads one block from each sst file. Then a series of Next()s move through one sst file (since writes to each log are mostly sequential) until the subiterator reaches the end of this log in this sst file; then Next() switches to the next sst file and reads sequentially from that, and so on. Often a range scan will only return records from a small number of blocks in small number of sst files; in this case, the cost of initial Seek() reading one block from each file may be bigger than the cost of reading the actually useful blocks.

Neither iterate_upper_bound nor bloom filters can prevent reading one block from each file in Seek(). But this PR can: if the index contains first key from each block, we don't have to read the block until this block actually makes it to the top of merging iterator's heap, so for short range scans we won't read any blocks from most of the sst files.

This PR does the deferred block loading inside value() call. This is not ideal: there's no good way to report an IO error from inside value(). As discussed with siying offline, it would probably be better to change InternalIterator's interface to explicitly fetch deferred value and get status. I'll do it in a separate PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5289

Differential Revision: D15256423

Pulled By: al13n321

fbshipit-source-id: 750e4c39ce88e8d41662f701cf6275d9388ba46a
2019-06-24 20:54:04 -07:00
Jermy Li
c92c58f84d JNI: Do not create 8M block cache for negative blockCacheSize values (#5465)
Summary:
As [BlockBasedTableConfig setBlockCacheSize()](1966a7c055/java/src/main/java/org/rocksdb/BlockBasedTableConfig.java (L728)) said, If cacheSize is non-positive, then cache will not be used. but when we configure a negative number or 0, there is an unexpected result: the block cache becomes 8M.

- Allow 0 as a valid size. When block cache size is 0, an 8MB block cache is created, as it is the default C++ API behavior. Also updated the comment.
- Set no_block_cache true if negative value is passed to block cache size, and no block cache will be created.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5465

Differential Revision: D15968788

Pulled By: sagar0

fbshipit-source-id: ee02d6e95841c9e2c316a64bfdf192d46ff5638a
2019-06-24 11:37:04 -07:00
Siying Dong
000b9ec217 Move some logging related files to logging/ (#5387)
Summary:
Many logging related source files are under util/. It will be more structured if they are together.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5387

Differential Revision: D15579036

Pulled By: siying

fbshipit-source-id: 3850134ed50b8c0bb40a0c8ae1f184fa4081303f
2019-05-31 17:23:59 -07:00