Summary:
This PR moves HDFS support from RocksDB repo to a separate repo. The new (temporary?) repo
in this PR serves as an example before we finalize the decision on where and who to host hdfs support. At this point,
people can start from the example repo and fork.
Java/JNI is not included yet, and needs to be done later if necessary.
The goal is to include this commit in RocksDB 7.0 release.
Reference:
https://github.com/ajkr/dedupfs by ajkr
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9170
Test Plan:
Follow the instructions in https://github.com/riversand963/rocksdb-hdfs-env/blob/master/README.md. Build and run db_bench and db_stress.
make check
Reviewed By: ajkr
Differential Revision: D33751662
Pulled By: riversand963
fbshipit-source-id: 22b4db7f31762ed417a20239f5a08dcd1696244f
Summary:
As title.
This is part of an fb-internal task.
First, remove all `using namespace` statements if applicable.
Next, utilize multiple build platforms and see if anything is broken.
Should anything become broken, fix the compilation errors with as little extra change as possible.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9369
Test Plan:
internal build and make check
make clean && make static_lib && cd examples && make all
Reviewed By: pdillinger
Differential Revision: D33517260
Pulled By: riversand963
fbshipit-source-id: 3fc4ce6402a073421dfd9a9b2d1c79441dca7a40
Summary:
Existing multiGet() in java calls multi_get_helper() which then calls DB::std::vector MultiGet(). This doesn't take advantage of io_uring.
This change adds another JNI level method that runs a parallel code path using the DB::void MultiGet(), using ByteBuffers at the JNI level. We call it multiGetDirect(). In addition to using the io_uring path, this code internally returns pinned slices which we can copy out of into our direct byte buffers; this should reduce the overall number of copies in the code path to/from Java. Some jmh benchmark runs (100k keys, 1000 key multiGet) suggest that for value sizes > 1k, we see about a 20% performance improvement, although performance is slightly reduced for small value sizes, there's a little bit more overhead in the JNI methods.
Closes https://github.com/facebook/rocksdb/issues/8407
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9224
Reviewed By: mrambacher
Differential Revision: D32951754
Pulled By: jay-zhuang
fbshipit-source-id: 1f70df7334be2b6c42a9c8f92725f67c71631690
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9266
This diff adds a new tag `CommitWithTimestamp`. Currently, there is no API to trigger writing
this tag to WAL, thus it is unavailable to users.
This is an ongoing effort to add user-defined timestamp support to write-committed transactions.
This diff also indicates all column families that may potentially participate in the same
transaction must either disable timestamp or have the same timestamp format, since
`CommitWithTimestamp` tag is followed by a single byte-array denoting the commit
timestamp of the transaction. We will enforce this checking in a future diff. We keep this
diff small.
Reviewed By: ltamasi
Differential Revision: D31721350
fbshipit-source-id: e1450811443647feb6ca01adec4c8aaae270ffc6
Summary:
**Context:**
Searching `TableProperties::properties_offsets` across the codebase reveals that internally it is only used to find the external SST file's global seqno offeset. Therefore we can narrow it down and replace this map property with a uint64_t property `external_sst_file_global_seqno_offset` to save memory usage related to table properties.
Note:
- See PR comments for discussion about potential impact on existing external usage of `TableProperties::properties_offsets`
- See PR comments for discussion on keeping external SST file global seqno's offset VS using a simple flag indicating seqno's existence.
**Summary:**
- Replaced `TableProperties::properties_offsets` with `TableProperties::external_sst_file_global_seqno_offset`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9212
Test Plan: - Relied on existing tests should be sufficient since `TableProperties::properties_offsets` existed before and should already be tested.
Reviewed By: ajkr
Differential Revision: D32665941
Pulled By: hx235
fbshipit-source-id: 718e44617346dc4f3b1276ee953e61c196277795
Summary:
Add the 3 read bytes counter to the Statistic, which will be used by storage tiering and get the information for files with different temperature.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9123
Test Plan: added new testing cases.
Reviewed By: siying
Differential Revision: D32154745
Pulled By: zhichao-cao
fbshipit-source-id: b7905d6dae469a72428742364ec07b634b6f15da
Summary:
https://github.com/facebook/rocksdb/pull/7898 enable write buffer manager to stall write when memory_usage exceeds buffer_size, this is really useful for container running case to limit the memory usage. However, this feature is not visiable for rocksJava yet.
This PR targets to introduce this feature for rocksJava.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9076
Reviewed By: akankshamahajan15
Differential Revision: D31931092
Pulled By: anand1976
fbshipit-source-id: 5531c16a87598663a02368c07b5e13a503164578
Summary:
Implementation of https://github.com/facebook/rocksdb/issues/8221, plus/including extension of Java options API to allow the get() of options from RocksDB. The extension allows more comprehensive testing of options at the Java side, by validating that the options are set at the C++ side.
Variations on methods:
MutableColumnFamilyOptions.MutableColumnFamilyOptionsBuilder getOptions()
MutableDBOptions.MutableDBOptionsBuilder getDBOptions()
retrieve the options via RocksDB C++ interfaces, and parse the resulting string into one of the Java-style option objects.
This necessitated generalising the parsing of option strings in Java, which now parses the full range of option strings returned by the C++ interface, rather than a useful subset. This necessitates the list-separator being changed to :(colon) from , (comma).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8999
Reviewed By: jay-zhuang
Differential Revision: D31655487
Pulled By: ltamasi
fbshipit-source-id: c38e98145c81c61dc38238b0df580db176ce4efd
Summary:
closes https://github.com/facebook/rocksdb/issues/8039
Unnecessary use of multiple local JNI references at the same time, 1 per key, was limiting the size of the key array. The local references don't need to be held simultaneously, so if we rearrange the code we can make it work for bigger key arrays.
Incidentally, make errors throw helpful exception messages rather than returning a null pointer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9012
Reviewed By: mrambacher
Differential Revision: D31580862
Pulled By: jay-zhuang
fbshipit-source-id: ce05831d52ede332e1b20e74d2dc621d219b9616
Summary:
* Don't hardcode namespace rocksdb (use ROCKSDB_NAMESPACE)
* Don't #include <rocksdb/...> (use double quotes)
* Support putting NOCOMMIT (any case) in source code that should not be
committed/pushed in current state.
These will be run with `make check` and in GitHub actions
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8821
Test Plan: existing tests, manually try out new checks
Reviewed By: zhichao-cao
Differential Revision: D30791726
Pulled By: pdillinger
fbshipit-source-id: 399c883f312be24d9e55c58951d4013e18429d92
Summary:
Gets `Statistics` from the options associated with the `DB` undergoing backup, and populates new ticker stats with the thread-local `IOContext` read/write counters for the threads doing backup work.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8819
Reviewed By: pdillinger
Differential Revision: D30779238
Pulled By: ajkr
fbshipit-source-id: 75ccafc355f90906df5cf80367f7245b985772d8
Summary:
- Clarified some comments on compatibility for adding new ticker stats
- Added read I/O stats for `VerifyChecksum()` and `VerifyFileChecksums()` APIs
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8741
Test Plan: new unit test
Reviewed By: zhichao-cao
Differential Revision: D30708578
Pulled By: ajkr
fbshipit-source-id: d06b961f7e199ae92c266b683e39870aa8f63449
Summary:
* FullKey and ParseFullKey appear to serve no purpose in the public API
(or anything else) so removed. Only use in one test updated.
* NumberToString serves no purpose vs. ToString so removed, numerous
calls updated
* Remove unnecessary forward declarations in metadata.h by re-arranging
class definitions.
* Remove some unneeded semicolons
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8736
Test Plan: existing tests
Reviewed By: mrambacher
Differential Revision: D30700039
Pulled By: pdillinger
fbshipit-source-id: 1e436a576f511a6ed8b4d97af7cc8216bc729af2
Summary:
Add a stat for secondary cache hits. The ```Cache::Lookup``` API had an unused ```stats``` parameter. This PR uses that to pass the pointer to a ```Statistics``` object that ```LRUCache``` uses to record the stat.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8666
Test Plan: Update a unit test in lru_cache_test
Reviewed By: zhichao-cao
Differential Revision: D30353816
Pulled By: anand1976
fbshipit-source-id: 2046f78b460428877a26ffdd2bb914ae47dfbe77
Summary:
Some FIFO users want to keep the data for longer, but the old data is rarely accessed. This feature allows users to configure FIFO compaction so that data older than a threshold is moved to a warm storage tier.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8310
Test Plan: Add several unit tests.
Reviewed By: ajkr
Differential Revision: D28493792
fbshipit-source-id: c14824ea634814dee5278b449ab5c98b6e0b5501
Summary:
An arbitrary string can be used as a delimiter in StringAppend merge operator
flavor. In particular, it allows using an empty string, combining binary values for
the same key byte-to-byte one next to another.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8536
Reviewed By: mrambacher
Differential Revision: D29962120
Pulled By: zhichao-cao
fbshipit-source-id: 4ef5d846a47835cf428a11200409e30e2dbffc4f
Summary:
**Summary**:
2 new statistics counters are added to RocksDB: `MEMTABLE_PAYLOAD_BYTES_AT_FLUSH` and `MEMTABLE_GARBAGE_BYTES_AT_FLUSH`. The former tracks how many raw bytes of useful data are present on the memtable at flush time, whereas the latter is tracks how many of these raw bytes are considered garbage, meaning that they ended up not being imported on the SSTables resulting from the flush operations.
**Unit test**: run `make db_flush_test -j$(nproc); ./db_flush_test` to run the unit test.
This executable includes 3 tests, that test support and correct stat calculations for workloads with inserts, deletes, and DeleteRanges. The parameters are set such that the workloads are performed on a single memtable, and a single SSTable is created as a result of the flush operation. The flush operation is manually called in the test file. The tests verify that the values of these 2 statistics counters introduced in this PR can be exactly predicted, showing that we have a full understanding of the underlying operations.
**Performance testing**:
`./db_bench -statistics -benchmarks=fillrandom -num=10000000` repeated 10 times.
Timing done using "date" function in a bash script.
_Results_:
Original Rocksdb fork: mean 66.6 sec, std 1.18 sec.
This feature branch: mean 67.4 sec, std 1.35 sec.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8411
Reviewed By: akankshamahajan15
Differential Revision: D29150629
Pulled By: bjlemaire
fbshipit-source-id: 7b3c2e86d50c6aa34fa50fd134282eacb543a5b1
Summary:
The ImmutableCFOptions contained a bunch of fields that belonged to the ImmutableDBOptions. This change cleans that up by introducing an ImmutableOptions struct. Following the pattern of Options struct, this class inherits from the DB and CFOption structs (of the Immutable form).
Only one structural change (the ImmutableCFOptions::fs was changed to a shared_ptr from a raw one) is in this PR. All of the other changes involve moving the member variables from the ImmutableCFOptions into the ImmutableOptions and changing member variables or function parameters as required for compilation purposes.
Follow-on PRs may do a further clean-up of the code, such as renaming variables (such as "ImmutableOptions cf_options") and potentially eliminating un-needed function parameters (there is no longer a need to pass both an ImmutableDBOptions and an ImmutableOptions to a function).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8262
Reviewed By: pdillinger
Differential Revision: D28226540
Pulled By: mrambacher
fbshipit-source-id: 18ae71eadc879dedbe38b1eb8e6f9ff5c7147dbf
Summary:
Added `TableProperties::{fast,slow}_compression_estimated_data_size`.
These properties are present in block-based tables when
`ColumnFamilyOptions::sample_for_compression > 0` and the necessary
compression library is supported when the file is generated. They
contain estimates of what `TableProperties::data_size` would be if the
"fast"/"slow" compression library had been used instead. One
limitation is we do not record exactly which "fast" (ZSTD or Zlib)
or "slow" (LZ4 or Snappy) compression library produced the result.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8139
Test Plan:
- new unit test
- ran `db_bench` with `sample_for_compression=1`; verified the `data_size` property matches the `{slow,fast}_compression_estimated_data_size` when the same compression type is used for the output file compression and the sampled compression
Reviewed By: riversand963
Differential Revision: D27454338
Pulled By: ajkr
fbshipit-source-id: 9529293de93ddac7f03b2e149d746e9f634abac4
Summary:
Which should return 2 long instead of an array.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8098
Reviewed By: mrambacher
Differential Revision: D27308741
Pulled By: jay-zhuang
fbshipit-source-id: 44beea2bd28cf6779b048bebc98f2426fe95e25c
Summary:
Add statistics and info log for error handler: counters for bg error, bg io error, bg retryable io error, auto resume, auto resume total retry, and auto resume sucess; Histogram for auto resume retry count in each recovery call.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8050
Test Plan: make check and add test to error_handler_fs_test
Reviewed By: anand1976
Differential Revision: D26990565
Pulled By: zhichao-cao
fbshipit-source-id: 49f71e8ea4e9db8b189943976404205b56ab883f
Summary:
support getUsage and getPinnedUsage in JavaAPI for Cache
also fix a typo in LRUCacheTest.java that the highPriPoolRatio is not valid(set 5, I guess it means 0.05)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7925
Reviewed By: mrambacher
Differential Revision: D26900241
Pulled By: ajkr
fbshipit-source-id: 735d1e40a16fa8919c89c7c7154ba7f81208ec33
Summary:
For dictionary compression, we need to collect some representative samples of the data to be compressed, which we use to either generate or train (when `CompressionOptions::zstd_max_train_bytes > 0`) a dictionary. Previously, the strategy was to buffer all the data blocks during flush, and up to the target file size during compaction. That strategy allowed us to randomly pick samples from as wide a range as possible that'd be guaranteed to land in a single output file.
However, some users try to make huge files in memory-constrained environments, where this strategy can cause OOM. This PR introduces an option, `CompressionOptions::max_dict_buffer_bytes`, that limits how much data blocks are buffered before we switch to unbuffered mode (which means creating the per-SST dictionary, writing out the buffered data, and compressing/writing new blocks as soon as they are built). It is not strict as we currently buffer more than just data blocks -- also keys are buffered. But it does make a step towards giving users predictable memory usage.
Related changes include:
- Changed sampling for dictionary compression to select unique data blocks when there is limited availability of data blocks
- Made use of `BlockBuilder::SwapAndReset()` to save an allocation+memcpy when buffering data blocks for building a dictionary
- Changed `ParseBoolean()` to accept an input containing characters after the boolean. This is necessary since, with this PR, a value for `CompressionOptions::enabled` is no longer necessarily the final component in the `CompressionOptions` string.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7970
Test Plan:
- updated `CompressionOptions` unit tests to verify limit is respected (to the extent expected in the current implementation) in various scenarios of flush/compaction to bottommost/non-bottommost level
- looked at jemalloc heap profiles right before and after switching to unbuffered mode during flush/compaction. Verified memory usage in buffering is proportional to the limit set.
Reviewed By: pdillinger
Differential Revision: D26467994
Pulled By: ajkr
fbshipit-source-id: 3da4ef9fba59974e4ef40e40c01611002c861465
Summary:
This request is adding support for using DirectSlice in ReadOptions lower/upper bounds.
To be more efficient I have added setLength to DirectSlice so I can just update the length to be used by slice from direct buffer. It is also needed, because when one creates iterator it keep pointer to original slice so setting new slice in options does not help (it needs to reuse existing one). Using this approach one can modify the slice any time during operations with iterator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7132
Reviewed By: zhichao-cao
Differential Revision: D25840092
Pulled By: jay-zhuang
fbshipit-source-id: 760167baf61568c9a35138145c4bf9b06824cb71
Summary:
Fix ColumnFamilyOptionsTest.cfPaths and OptionsTest.cfPaths in 6.15 branch (and probably other branches including master)
has_exception variable was not initialized which was causing test failures and incorrect behavior on s390 platform (and maybe others as variable content is undefined).
adamretter please take a look.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7853
Reviewed By: akankshamahajan15
Differential Revision: D25901639
Pulled By: jay-zhuang
fbshipit-source-id: 151b5db27b495fc6d8ed54c0eccbde2508215ac5
Summary:
Fixes Issue https://github.com/facebook/rocksdb/issues/7497
When allow_data_in_errors db_options is set, log error key details in `ParseInternalKey()`
Have fixed most of the calls. Have few TODOs still pending - because have to make more deeper changes to pass in the allow_data_in_errors flag. Will do those in a separate PR later.
Tests:
- make check
- some of the existing tests that exercise the "internal key too small" condition are: dbformat_test, cuckoo_table_builder_test
- some of the existing tests that exercise the corrupted key path are: corruption_test, merge_helper_test, compaction_iterator_test
Example of new status returns:
- Key too small - `Corrupted Key: Internal Key too small. Size=5`
- Corrupt key with allow_data_in_errors option set to false: `Corrupted Key: '<redacted>' seq:3, type:3`
- Corrupt key with allow_data_in_errors option set to true: `Corrupted Key: '61' seq:3, type:3`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7515
Reviewed By: ajkr
Differential Revision: D24240264
Pulled By: ramvadiv
fbshipit-source-id: bc48f5d4475ac19d7713e16df37505b31aac42e7
Summary:
Allows adding event listeners in RocksJava.
* Adds listeners getter and setter in `Options` and `DBOptions` classes.
* Adds `EventListener` Java interface and base class for implementing custom event listener callbacks - `AbstractEventListener`, which has an underlying native callback class implementing C++ `EventListener` class.
* `AbstractEventListener` class has mechanism for selectively enabling its callback methods in order to prevent invoking Java method if it is not implemented. This decreases performance cost in case only subset of event listener callback methods is needed - the JNI code for remaining "no-op" callbacks is not executed.
* The code is covered by unit tests in `EventListenerTest.java`, there are also tests added for setting/getting listeners field in `OptionsTest.java` and `DBOptionsTest.java`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7425
Reviewed By: pdillinger
Differential Revision: D24063390
Pulled By: jay-zhuang
fbshipit-source-id: 508c359538983d6b765e70d9989c351794a944ee
Summary:
Add following stats for MultiGet in Histogram to get more insight on MultiGet.
1. Number of index and filter blocks read from file as part of MultiGet
request per level.
2. Number of data blocks read from file per level.
3. Number of SST files loaded from file system per level.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7366
Reviewed By: anand1976
Differential Revision: D24127040
Pulled By: akankshamahajan15
fbshipit-source-id: e63a003056b833729b277edc0639c08fb432756b
Summary:
as title
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7347
Test Plan: unit tests included
Reviewed By: jay-zhuang
Differential Revision: D23592552
Pulled By: pdillinger
fbshipit-source-id: 1c3571b6f42bfd0cfd723ff49d01fbc02a1be45b
Summary:
Adds compaction statistics (total bytes read and written) for compactions that occur for delete-triggered, periodic, and TTL compaction reasons.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7165
Test Plan:
TTL and periodic can be checked by runnning db_bench with the options activated:
/db_bench --benchmarks="fillrandom,stats" --statistics --num=10000000 -base_background_compactions=16 -periodic_compaction_seconds=1
./db_bench --benchmarks="fillrandom,stats" --statistics --num=10000000 -base_background_compactions=16 -fifo_compaction_ttl=1
Setting the time to one second causes non-zero bytes read/written for those compaction reasons. Disabling them or setting them to times longer than the test run length causes the stats to return to zero as expected.
Delete-triggered compaction counting is tested in DBTablePropertiesTest.DeletionTriggeredCompactionMarking
Reviewed By: ajkr
Differential Revision: D22693050
Pulled By: akabcenell
fbshipit-source-id: d15cef4d94576f703015c8942d5f0d492f69401d
Summary:
SST Partitioner interface that allows to split SST files during compactions.
It basically instruct compaction to create a new file when needed. When one is using well defined prefixes and prefixed way of defining tables it is good to define also partitioning so that promotion of some SST file does not cover huge key space on next level (worst case complete space).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6957
Reviewed By: ajkr
Differential Revision: D22461239
fbshipit-source-id: 9ce07bba08b3ba89c2d45630520368f704d1316e
Summary:
The methods in convenience.h are used to compare/convert objects to/from strings. There is a mishmash of parameters in use here with more needed in the future. This PR replaces those parameters with a single structure.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6389
Reviewed By: siying
Differential Revision: D21163707
Pulled By: zhichao-cao
fbshipit-source-id: f807b4cc7e2b0af3871536b69546b2604dfa81bd