Commit Graph

1924 Commits

Author SHA1 Message Date
Jay Zhuang
a79b46c503 Add De/Serialization for CompactionInput/Result (#8247)
Summary:
The functions will be used for remote compaction parameter
input and result.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8247

Test Plan: `make check`

Reviewed By: ajkr

Differential Revision: D28104680

Pulled By: jay-zhuang

fbshipit-source-id: c0a5178e6277125118384278efea2acbf90aa6cb
2021-05-12 12:36:43 -07:00
Andrew Kryczka
c70bae1b05 Fix ConcurrentTaskLimiter token release for shutdown (#8253)
Summary:
Previously the shutdown process did not properly wait for all
`compaction_thread_limiter` tokens to be released before proceeding to
delete the DB's C++ objects. When this happened, we saw tests like
"DBCompactionTest.CompactionLimiter" flake with the following error:

```
virtual
rocksdb::ConcurrentTaskLimiterImpl::~ConcurrentTaskLimiterImpl():
Assertion `outstanding_tasks_ == 0' failed.
```

There is a case where a token can still be alive even after the shutdown
process has waited for BG work to complete. In particular, this happens
because the shutdown process only waits for flush/compaction scheduled/unscheduled counters to all
reach zero. These counters are decremented in `BackgroundCallCompaction()`
functions. However, tokens are released in `BGWork*Compaction()` functions, which
actually wrap the `BackgroundCallCompaction()` function.

A simple sleep could repro the race condition:

```
$ diff --git a/db/db_impl/db_impl_compaction_flush.cc
b/db/db_impl/db_impl_compaction_flush.cc
index 806bc548a..ba59efa89 100644
 --- a/db/db_impl/db_impl_compaction_flush.cc
+++ b/db/db_impl/db_impl_compaction_flush.cc
@@ -2442,6 +2442,7 @@ void DBImpl::BGWorkCompaction(void* arg) {
       static_cast<PrepickedCompaction*>(ca.prepicked_compaction);
   static_cast_with_check<DBImpl>(ca.db)->BackgroundCallCompaction(
       prepicked_compaction, Env::Priority::LOW);
+  sleep(1);
   delete prepicked_compaction;
 }

$ ./db_compaction_test --gtest_filter=DBCompactionTest.CompactionLimiter
db_compaction_test: util/concurrent_task_limiter_impl.cc:24: virtual rocksdb::ConcurrentTaskLimiterImpl::~ConcurrentTaskLimiterImpl(): Assertion `outstanding_tasks_ == 0' failed.
Received signal 6 (Aborted)
#0   /usr/local/fbcode/platform007/lib/libc.so.6(gsignal+0xcf) [0x7f02673c30ff] ??      ??:0
https://github.com/facebook/rocksdb/issues/1   /usr/local/fbcode/platform007/lib/libc.so.6(abort+0x134) [0x7f02673ac934] ??       ??:0
...
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8253

Test Plan: sleeps to expose race conditions

Reviewed By: akankshamahajan15

Differential Revision: D28168064

Pulled By: ajkr

fbshipit-source-id: 9e5167c74398d323e7975980c5cc00f450631160
2021-05-04 17:27:24 -07:00
mrambacher
0ca6d6297f Rename variables in ImmutableCFOptions to avoid conflicts with ImmutableDBOptions (#8227)
Summary:
Renaming ImmutableCFOptions::info_log and statistics to logger and stats.  This is stage 2 in creating an ImmutableOptions class.  It is necessary because the names match those in ImmutableOptions and have different types.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8227

Reviewed By: jay-zhuang

Differential Revision: D28000967

Pulled By: mrambacher

fbshipit-source-id: 3bf2aa04e8f1e8724d825b7deacf41080c14420b
2021-04-26 12:43:45 -07:00
mrambacher
01e460d538 Make types of Immutable/Mutable Options fields match that of the underlying Option (#8176)
Summary:
This PR is a first step at attempting to clean up some of the Mutable/Immutable Options code.  With this change, a DBOption and a ColumnFamilyOption can be reconstructed from their Mutable and Immutable equivalents, respectively.

readrandom tests do not show any performance degradation versus master (though both are slightly slower than the current 6.19 release).

There are still fields in the ImmutableCFOptions that are not CF options but DB options.  Eventually, I would like to move those into an ImmutableOptions (= ImmutableDBOptions+ImmutableCFOptions).  But that will be part of a future PR to minimize changes and disruptions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8176

Reviewed By: pdillinger

Differential Revision: D27954339

Pulled By: mrambacher

fbshipit-source-id: ec6b805ba9afe6e094bffdbd76246c2d99aa9fad
2021-04-22 20:43:54 -07:00
Peter Dillinger
95f6add746 Revert Ribbon starting level support from #8198 (#8212)
Summary:
This partially reverts commit 10196d7edc.

The problem with this change is because of important filter use cases:
FIFO compaction and SST writer. FIFO "compaction" always uses level 0 so
would only use Ribbon filters if specifically including level 0 for the
Ribbon filter policy. SST writer sets level_at_creation=-1 to indicate
unknown level, and this would be treated the same as level 0 unless
fixed.

We are keeping the part about committing to permanent schema, which is
only changes to API comments and HISTORY.md.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8212

Test Plan: CI

Reviewed By: jay-zhuang

Differential Revision: D27896468

Pulled By: pdillinger

fbshipit-source-id: 50a775f7cba5d64fb729d9b982e355864020596e
2021-04-20 19:46:40 -07:00
Peter Dillinger
10196d7edc Ribbon long-term support, starting level support (#8198)
Summary:
Since the Ribbon filter schema seems good (compatible back to
6.15.0), this change commits to long term support of the SST schema,
even though we expect the API for enabling Ribbon to change (still
called NewExperimentalRibbonFilterPolicy).

This also adds support for "hybrid" configuration in which some levels
use Bloom (higher levels, lower numbered) for speed and the rest use
Ribbon (lower levels, higher numbered) for memory space efficiency.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8198

Test Plan: unit test added, crash test support

Reviewed By: jay-zhuang

Differential Revision: D27831232

Pulled By: pdillinger

fbshipit-source-id: 90e528677689474d293ed6710b42ba89fbd5b5ab
2021-04-16 15:43:08 -07:00
Levi Tamasi
303cb23a0f Introduce a ThreadGuard class and use it in ExternalSSTFileTest.PickedLevelBug (#8112)
Summary:
The patch adds a resource management/RAII class called `ThreadGuard`,
which can be used to ensure that the managed thread is joined when the
`ThreadGuard` is destroyed, regardless of whether it is due to the
object going out of scope, an early return, an exception etc. This is
important because if an `std::thread` object is destroyed without having
been joined (or detached) first, the process is aborted (via
`std::terminate`).

For now, `ThreadGuard` is only used in the test case
`ExternalSSTFileTest.PickedLevelBug`; however, it could come in handy
elsewhere in the codebase as well (both in test code and "real" code).
Case in point: in the `PickedLevelBug` test case, with the earlier code we
could end up in the above situation when the following assertion (which is
before the threads are joined) is triggered:

```
ASSERT_FALSE(bg_compact_started.load());
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8112

Test Plan:
```
make check
gtest-parallel --repeat=10000 ./external_sst_file_test --gtest_filter="*PickedLevelBug"
```

Reviewed By: riversand963

Differential Revision: D27343185

Pulled By: ltamasi

fbshipit-source-id: 2a8c3aa68bc78cc03ec0dbae909fb25c2cd15c69
2021-03-25 22:08:58 -07:00
Jay Zhuang
45c65d6dcf Use thread-safe strerror_r() to get error message (#8087)
Summary:
`strerror()` is not thread-safe, using `strerror_r()` instead. The API could be different on the different platforms, used the code from 0deef031cb/folly/String.cpp (L457)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8087

Reviewed By: mrambacher

Differential Revision: D27267151

Pulled By: jay-zhuang

fbshipit-source-id: 4b8856d1ec069d5f239b764750682c56e5be9ddb
2021-03-24 23:07:27 -07:00
Peter Dillinger
da6b90ab48 Improve bloom_test bits_per_key flag (#8093)
Summary:
Improved handling of -bits_per_key other than 10, but at least
the OptimizeForMemory test is simply not designed for generally handling
other settings. (ribbon_test does have a statistical framework for this
kind of testing, but it's not important to do that same for Bloom right
now.)

Closes https://github.com/facebook/rocksdb/issues/7019

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8093

Test Plan: for I in `seq 1 20`; do ./bloom_test --gtest_filter=-*OptimizeForMemory* --bits_per_key=$I &> /dev/null || echo FAILED; done

Reviewed By: mrambacher

Differential Revision: D27275875

Pulled By: pdillinger

fbshipit-source-id: 7362e8ac2c41ea11f639412e4f30c8b375f04388
2021-03-23 21:42:40 -07:00
mrambacher
3dff28cf9b Use SystemClock* instead of std::shared_ptr<SystemClock> in lower level routines (#8033)
Summary:
For performance purposes, the lower level routines were changed to use a SystemClock* instead of a std::shared_ptr<SystemClock>.  The shared ptr has some performance degradation on certain hardware classes.

For most of the system, there is no risk of the pointer being deleted/invalid because the shared_ptr will be stored elsewhere.  For example, the ImmutableDBOptions stores the Env which has a std::shared_ptr<SystemClock> in it.  The SystemClock* within the ImmutableDBOptions is essentially a "short cut" to gain access to this constant resource.

There were a few classes (PeriodicWorkScheduler?) where the "short cut" property did not hold.  In those cases, the shared pointer was preserved.

Using db_bench readrandom perf_level=3 on my EC2 box, this change performed as well or better than 6.17:

6.17: readrandom   :      28.046 micros/op 854902 ops/sec;   61.3 MB/s (355999 of 355999 found)
6.18: readrandom   :      32.615 micros/op 735306 ops/sec;   52.7 MB/s (290999 of 290999 found)
PR: readrandom   :      27.500 micros/op 871909 ops/sec;   62.5 MB/s (367999 of 367999 found)

(Note that the times for 6.18 are prior to revert of the SystemClock).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8033

Reviewed By: pdillinger

Differential Revision: D27014563

Pulled By: mrambacher

fbshipit-source-id: ad0459eba03182e454391b5926bf5cdd45657b67
2021-03-15 04:34:11 -07:00
Peter Dillinger
01c2ec3fcb Add ROCKSDB_GTEST_BYPASS (#8048)
Summary:
This is for cases that do not meet the Facebook criteria for
SKIP (see new comments). Also made ROCKSDB_GTEST_{SKIP,BYPASS} print the
message because gtest doesn't ever seem to.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8048

Test Plan: manual inspection of ./ribbon_test output, CI

Reviewed By: mrambacher

Differential Revision: D26953688

Pulled By: pdillinger

fbshipit-source-id: c914eaffe7d419db6ab90a193d474531e23582e5
2021-03-12 16:02:06 -08:00
Yanqin Jin
82b3888433 Enable backward iterator for keys with user-defined timestamp (#8035)
Summary:
This PR does the following:

- Enable backward iteration for keys with user-defined timestamp. Note that merge, single delete, range delete are not supported yet.
- Introduces a new helper API `Comparator::EqualWithoutTimestamp()`.
- Fix a typo in `SetTimestamp()`.
- Add/update unit tests

Run db_bench (built with DEBUG_LEVEL=0) to demonstrate that no overhead is introduced for CPU-intensive workloads with a lot of `Prev()`. Also provided results of iterating keys with timestamps.

1. Disable timestamp, run:
```
./db_bench -db=/dev/shm/rocksdb -disable_wal=1 -benchmarks=fillseq,seekrandom[-W1-X6] -reverse_iterator=1 -seek_nexts=5
```
Results:
> Baseline
> - seekrandom [AVG    6 runs] : 96115 ops/sec;   53.2 MB/sec
> - seekrandom [MEDIAN 6 runs] : 98075 ops/sec;   54.2 MB/sec
>
> This PR
> - seekrandom [AVG    6 runs] : 95521 ops/sec;   52.8 MB/sec
> - seekrandom [MEDIAN 6 runs] : 96338 ops/sec;   53.3 MB/sec

2. Enable timestamp, run:
```
./db_bench -user_timestamp_size=8  -db=/dev/shm/rocksdb -disable_wal=1 -benchmarks=fillseq,seekrandom[-W1-X6] -reverse_iterator=1 -seek_nexts=5
```
Result:
> Baseline: not supported
>
> This PR
> - seekrandom [AVG    6 runs] : 90514 ops/sec;   50.1 MB/sec
> - seekrandom [MEDIAN 6 runs] : 90834 ops/sec;   50.2 MB/sec

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8035

Reviewed By: ltamasi

Differential Revision: D26926668

Pulled By: riversand963

fbshipit-source-id: 95330cc2242397c03e09d29e5417dfb0adc98ef5
2021-03-10 11:15:46 -08:00
David CARLIER
7a3444bf1f Mac M1 crc32 intrinsics ARM64 check support proposal (#7893)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7893

Reviewed By: ajkr

Differential Revision: D26050966

Pulled By: jay-zhuang

fbshipit-source-id: 9df2bb65d82defd7fad49d5369979b03e22d39c2
2021-03-10 09:05:56 -08:00
Peter Dillinger
4b18c46d10 Refactor: add LineFileReader and Status::MustCheck (#8026)
Summary:
Removed confusing, awkward, and undocumented internal API
ReadOneLine and replaced with very simple LineFileReader.

In refactoring backupable_db.cc, this has the side benefit of
removing the arbitrary cap on the size of backup metadata files.

Also added Status::MustCheck to make it easy to mark a Status as
"must check." Using this, I can ensure that after
LineFileReader::ReadLine returns false the caller checks GetStatus().

Also removed some excessive conditional compilation in status.h

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8026

Test Plan: added unit test, and running tests with ASSERT_STATUS_CHECKED

Reviewed By: mrambacher

Differential Revision: D26831687

Pulled By: pdillinger

fbshipit-source-id: ef749c265a7a26bb13cd44f6f0f97db2955f6f0f
2021-03-09 20:12:38 -08:00
Levi Tamasi
cb25bc1128 Update compaction statistics to include the amount of data read from blob files (#8022)
Summary:
The patch does the following:
1) Exposes the amount of data (number of bytes) read from blob files from
`BlobFileReader::GetBlob` / `Version::GetBlob`.
2) Tracks the total number and size of blobs read from blob files during a
compaction (due to garbage collection or compaction filter usage) in
`CompactionIterationStats` and propagates this data to
`InternalStats::CompactionStats` / `CompactionJobStats`.
3) Updates the formulae for write amplification calculations to include the
amount of data read from blob files.
4) Extends the compaction stats dump with a new column `Rblob(GB)` and
a new line containing the total number and size of blob files in the current
`Version` to complement the information about the shape and size of the LSM tree
that's already there.
5) Updates `CompactionJobStats` so that the number of files and amount of data
written by a compaction are broken down per file type (i.e. table/blob file).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8022

Test Plan: Ran `make check` and `db_bench`.

Reviewed By: riversand963

Differential Revision: D26801199

Pulled By: ltamasi

fbshipit-source-id: 28a5f072048a702643b28cb5971b4099acabbfb2
2021-03-04 00:43:48 -08:00
Peter Dillinger
a8b3b9a20c Refine Ribbon configuration, improve testing, add Homogeneous (#7879)
Summary:
This change only affects non-schema-critical aspects of the production candidate Ribbon filter. Specifically, it refines choice of internal configuration parameters based on inputs. The changes are minor enough that the schema tests in bloom_test, some of which depend on this, are unaffected. There are also some minor optimizations and refactorings.

This would be a schema change for "smash" Ribbon, to fix some known issues with small filters, but "smash" Ribbon is not accessible in public APIs. Unit test CompactnessAndBacktrackAndFpRate updated to test small and medium-large filters. Run with --thoroughness=100 or so for much better detection power (not appropriate for continuous regression testing).

Homogenous Ribbon:
This change adds internally a Ribbon filter variant we call Homogeneous Ribbon, in collaboration with Stefan Walzer. The expected "result" value for every key is zero, instead of computed from a hash. Entropy for queries not to be false positives comes from free variables ("overhead") in the solution structure, which are populated pseudorandomly. Construction is slightly faster for not tracking result values, and never fails. Instead, FP rate can jump up whenever and whereever entries are packed too tightly. For small structures, we can choose overhead to make this FP rate jump unlikely, as seen in updated unit test CompactnessAndBacktrackAndFpRate.

Unlike standard Ribbon, Homogeneous Ribbon seems to scale to arbitrary number of keys when accepting an FP rate penalty for small pockets of high FP rate in the structure. For example, 64-bit ribbon with 8 solution columns and 10% allocated space overhead for slots seems to achieve about 10.5% space overhead vs. information-theoretic minimum based on its observed FP rate with expected pockets of degradation. (FP rate is close to 1/256.) If targeting a higher FP rate with fewer solution columns, Homogeneous Ribbon can be even more space efficient, because the penalty from degradation is relatively smaller. If targeting a lower FP rate, Homogeneous Ribbon is less space efficient, as more allocated overhead is needed to keep the FP rate impact of degradation relatively under control. The new OptimizeHomogAtScale tool in ribbon_test helps to find these optimal allocation overheads for different numbers of solution columns. And Ribbon widths, with 128-bit Ribbon apparently cutting space overheads in half vs. 64-bit.

Other misc item specifics:
* Ribbon APIs in util/ribbon_config.h now provide configuration data for not just 5% construction failure rate (95% success), but also 50% and 0.1%.
  * Note that the Ribbon structure does not exhibit "threshold" behavior as standard Xor filter does, so there is a roughly fixed space penalty to cut construction failure rate in half. Thus, there isn't really an "almost sure" setting.
  * Although we can extrapolate settings for large filters, we don't have a good formula for configuring smaller filters (< 2^17 slots or so), and efforts to summarize with a formula have failed. Thus, small data is hard-coded from updated FindOccupancy tool.
* Enhances ApproximateNumEntries for public API Ribbon using more precise data (new API GetNumToAdd), thus a more accurate but not perfect reversal of CalculateSpace. (bloom_test updated to expect the greater precision)
* Move EndianSwapValue from coding.h to coding_lean.h to keep Ribbon code easily transferable from RocksDB
* Add some missing 'const' to member functions
* Small optimization to 128-bit BitParity
* Small refactoring of BandingStorage in ribbon_alg.h to support Homogeneous Ribbon
* CompactnessAndBacktrackAndFpRate now has an "expand" test: on construction failure, a possible alternative to re-seeding hash functions is simply to increase the number of slots (allocated space overhead) and try again with essentially the same hash values. (Start locations will be different roundings of the same scaled hash values--because fastrange not mod.) This seems to be as effective or more effective than re-seeding, as long as we increase the number of slots (m) by roughly m += m/w where w is the Ribbon width. This way, there is effectively an expansion by one slot for each ribbon-width window in the banding. (This approach assumes that getting "bad data" from your hash function is as unlikely as it naturally should be, e.g. no adversary.)
* 32-bit and 16-bit Ribbon configurations are added to ribbon_test for understanding their behavior, e.g. with FindOccupancy. They are not considered useful at this time and not tested with CompactnessAndBacktrackAndFpRate.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7879

Test Plan: unit test updates included

Reviewed By: jay-zhuang

Differential Revision: D26371245

Pulled By: pdillinger

fbshipit-source-id: da6600d90a3785b99ad17a88b2a3027710b4ea3a
2021-02-26 08:50:42 -08:00
sherriiiliu
e017af15c1 Fix testcase failures on windows (#7992)
Summary:
Fixed 5 test case failures found on Windows 10/Windows Server 2016
1. In `flush_job_test`, the DestroyDir function fails in deconstructor because some file handles are still being held by VersionSet. This happens on Windows Server 2016, so need to manually reset versions_ pointer to release all file handles.
2. In `StatsHistoryTest.InMemoryStatsHistoryPurging` test, the capping memory cost of stats_history_size on Windows becomes 14000 bytes with latest changes, not just 13000 bytes.
3. In `SSTDumpToolTest.RawOutput` test, the output file handle is not closed at the end.
4. In `FullBloomTest.OptimizeForMemory` test, ROCKSDB_MALLOC_USABLE_SIZE is undefined on windows so `total_mem` is always equal to `total_size`. The internal memory fragmentation assertion does not apply in this case.
5. In `BlockFetcherTest.FetchAndUncompressCompressedDataBlock` test, XPRESS cannot reach 87.5% compression ratio with original CreateTable method, so I append extra zeros to the string value to enhance compression ratio. Beside, since XPRESS allocates memory internally, thus does not support for custom allocator verification, we will skip the allocator verification for XPRESS

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7992

Reviewed By: jay-zhuang

Differential Revision: D26615283

Pulled By: ajkr

fbshipit-source-id: 3632612f84b99e2b9c77c403b112b6bedf3b125d
2021-02-23 14:35:06 -08:00
Andrew Kryczka
d904233d2f Limit buffering for collecting samples for compression dictionary (#7970)
Summary:
For dictionary compression, we need to collect some representative samples of the data to be compressed, which we use to either generate or train (when `CompressionOptions::zstd_max_train_bytes > 0`) a dictionary. Previously, the strategy was to buffer all the data blocks during flush, and up to the target file size during compaction. That strategy allowed us to randomly pick samples from as wide a range as possible that'd be guaranteed to land in a single output file.

However, some users try to make huge files in memory-constrained environments, where this strategy can cause OOM. This PR introduces an option, `CompressionOptions::max_dict_buffer_bytes`, that limits how much data blocks are buffered before we switch to unbuffered mode (which means creating the per-SST dictionary, writing out the buffered data, and compressing/writing new blocks as soon as they are built). It is not strict as we currently buffer more than just data blocks -- also keys are buffered. But it does make a step towards giving users predictable memory usage.

Related changes include:

- Changed sampling for dictionary compression to select unique data blocks when there is limited availability of data blocks
- Made use of `BlockBuilder::SwapAndReset()` to save an allocation+memcpy when buffering data blocks for building a dictionary
- Changed `ParseBoolean()` to accept an input containing characters after the boolean. This is necessary since, with this PR, a value for `CompressionOptions::enabled` is no longer necessarily the final component in the `CompressionOptions` string.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7970

Test Plan:
- updated `CompressionOptions` unit tests to verify limit is respected (to the extent expected in the current implementation) in various scenarios of flush/compaction to bottommost/non-bottommost level
- looked at jemalloc heap profiles right before and after switching to unbuffered mode during flush/compaction. Verified memory usage in buffering is proportional to the limit set.

Reviewed By: pdillinger

Differential Revision: D26467994

Pulled By: ajkr

fbshipit-source-id: 3da4ef9fba59974e4ef40e40c01611002c861465
2021-02-19 14:09:54 -08:00
Max Neunhoeffer
cf14cb3e29 Avoid self-move-assign in pop operation of binary heap. (#7942)
Summary:
The current implementation of a binary heap in `util/heap.h` does a move-assign in the `pop` method. In the case that there is exactly one element stored in the heap, this ends up being a self-move-assign. This can cause trouble with certain classes, which are not prepared for this. Furthermore, it trips up the glibc STL debugger (`-D_GLIBCXX_DEBUG`), which produces an assertion failure in this case.

This PR addresses this problem by not doing the (unnecessary in this case) move-assign if there is only one element in the heap.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7942

Reviewed By: jay-zhuang

Differential Revision: D26528739

Pulled By: ajkr

fbshipit-source-id: 5ca570e0c4168f086b10308ad766dff84e6e2d03
2021-02-19 13:47:25 -08:00
Wilfried Goesgens
8a05c21e32 add string separation while composing error message (#7919)
Summary:
This will fix a missing string separation between `msg[n]` and `state_`.
Example of an error message how its looking now:
```
IO error: No space left on deviceWhile appending to file: /home/willi/src/stable-3.7/tmp/arangosh_CL6EFQ/shell_client/single1/data/engine-rocksdb/126426.sst: No space left on device
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7919

Reviewed By: ajkr

Differential Revision: D26242246

Pulled By: jay-zhuang

fbshipit-source-id: 5d9a0997a410aecfb3781478e57395d3d937bb84
2021-02-18 12:25:35 -08:00
Zhichao Cao
d1c510baec Handoff checksum Implementation (#7523)
Summary:
in PR https://github.com/facebook/rocksdb/issues/7419 , we introduce the new Append and PositionedAppend APIs to WritableFile at File System, which enable RocksDB to pass the data verification information (e.g., checksum of the data) to the lower layer. In this PR, we use the new API in WritableFileWriter, such that the file created via WritableFileWrite can pass the checksum to the storage layer. To control which types file should apply the checksum handoff, we add checksum_handoff_file_types to DBOptions. User can use this option to control which file types (Currently supported file tyes: kLogFile, kTableFile, kDescriptorFile.) should use the new Append and PositionedAppend APIs to handoff the verification information.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7523

Test Plan: add new unit test, pass make check/ make asan_check

Reviewed By: pdillinger

Differential Revision: D24313271

Pulled By: zhichao-cao

fbshipit-source-id: aafd69091ae85c3318e3e17cbb96fe7338da11d0
2021-02-10 22:20:32 -08:00
Peter Dillinger
e4f1e64c30 Add prefetching (batched MultiGet) for experimental Ribbon filter (#7889)
Summary:
Adds support for prefetching data in Ribbon queries,
which especially optimizes batched Ribbon queries for MultiGet
(~222ns/key to ~97ns/key) but also single key queries on cold memory
(~333ns to ~226ns) because many queries span more than one cache line.

This required some refactoring of the query algorithm, and there
does not appear to be a noticeable regression in "hot memory" query
times (perhaps from 48ns to 50ns).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7889

Test Plan:
existing unit tests, plus performance validation with
filter_bench:

Each data point is the best of two runs. I saturated the machine
CPUs with other filter_bench runs in the background.

Before:

    $ ./filter_bench -impl=3 -m_keys_total_max=200 -average_keys_per_filter=100000 -m_queries=50
    WARNING: Assertions are enabled; benchmarks unnecessarily slow
    Building...
    Build avg ns/key: 125.86
    Number of filters: 1993
    Total size (MB): 168.166
    Reported total allocated memory (MB): 183.211
    Reported internal fragmentation: 8.94626%
    Bits/key stored: 7.05341
    Prelim FP rate %: 0.951827
    ----------------------------
    Mixed inside/outside queries...
      Single filter net ns/op: 48.0111
      Batched, prepared net ns/op: 222.384
      Batched, unprepared net ns/op: 343.908
      Skewed 50% in 1% net ns/op: 252.916
      Skewed 80% in 20% net ns/op: 320.579
      Random filter net ns/op: 332.957

After:

    $ ./filter_bench -impl=3 -m_keys_total_max=200 -average_keys_per_filter=100000 -m_queries=50
    WARNING: Assertions are enabled; benchmarks unnecessarily slow
    Building...
    Build avg ns/key: 128.117
    Number of filters: 1993
    Total size (MB): 168.166
    Reported total allocated memory (MB): 183.211
    Reported internal fragmentation: 8.94626%
    Bits/key stored: 7.05341
    Prelim FP rate %: 0.951827
    ----------------------------
    Mixed inside/outside queries...
      Single filter net ns/op: 49.8812
      Batched, prepared net ns/op: 97.1514
      Batched, unprepared net ns/op: 222.025
      Skewed 50% in 1% net ns/op: 197.48
      Skewed 80% in 20% net ns/op: 212.457
      Random filter net ns/op: 226.464

Bloom comparison, for reference:

    $ ./filter_bench -impl=2 -m_keys_total_max=200 -average_keys_per_filter=100000 -m_queries=50
    WARNING: Assertions are enabled; benchmarks unnecessarily slow
    Building...
    Build avg ns/key: 35.3042
    Number of filters: 1993
    Total size (MB): 238.488
    Reported total allocated memory (MB): 262.875
    Reported internal fragmentation: 10.2255%
    Bits/key stored: 10.0029
    Prelim FP rate %: 0.965327
    ----------------------------
    Mixed inside/outside queries...
      Single filter net ns/op: 9.09931
      Batched, prepared net ns/op: 34.21
      Batched, unprepared net ns/op: 88.8564
      Skewed 50% in 1% net ns/op: 139.75
      Skewed 80% in 20% net ns/op: 181.264
      Random filter net ns/op: 173.88

Reviewed By: jay-zhuang

Differential Revision: D26378710

Pulled By: pdillinger

fbshipit-source-id: 058428967c55ed763698284cd3b4bbe3351b6e69
2021-02-10 21:04:56 -08:00
Andrew Kryczka
78ee8564ad Integrity protection for live updates to WriteBatch (#7748)
Summary:
This PR adds the foundation classes for key-value integrity protection and the first use case: protecting live updates from the source buffers added to `WriteBatch` through the destination buffer in `MemTable`. The width of the protection info is not yet configurable -- only eight bytes per key is supported. This PR allows users to enable protection by constructing `WriteBatch` with `protection_bytes_per_key == 8`. It does not yet expose a way for users to get integrity protection via other write APIs (e.g., `Put()`, `Merge()`, `Delete()`, etc.).

The foundation classes (`ProtectionInfo.*`) embed the coverage info in their type, and provide `Protect.*()` and `Strip.*()` functions to navigate between types with different coverage. For making bytes per key configurable (for powers of two up to eight) in the future, these classes are templated on the unsigned integer type used to store the protection info. That integer contains the XOR'd result of hashes with independent seeds for all covered fields. For integer fields, the hash is computed on the raw unadjusted bytes, so the result is endian-dependent. The most significant bytes are truncated when the hash value (8 bytes) is wider than the protection integer.

When `WriteBatch` is constructed with `protection_bytes_per_key == 8`, we hold a `ProtectionInfoKVOTC` (i.e., one that covers key, value, optype aka `ValueType`, timestamp, and CF ID) for each entry added to the batch. The protection info is generated from the original buffers passed by the user, as well as the original metadata generated internally. When writing to memtable, each entry is transformed to a `ProtectionInfoKVOTS` (i.e., dropping coverage of CF ID and adding coverage of sequence number), since at that point we know the sequence number, and have already selected a memtable corresponding to a particular CF. This protection info is verified once the entry is encoded in the `MemTable` buffer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7748

Test Plan:
- an integration test to verify a wide variety of single-byte changes to the encoded `MemTable` buffer are caught
- add to stress/crash test to verify it works in variety of configs/operations without intentional corruption
- [deferred] unit tests for `ProtectionInfo.*` classes for edge cases like KV swap, `SliceParts` and `Slice` APIs are interchangeable, etc.

Reviewed By: pdillinger

Differential Revision: D25754492

Pulled By: ajkr

fbshipit-source-id: e481bac6c03c2ab268be41359730f1ceb9964866
2021-01-29 12:18:58 -08:00
mrambacher
4a09d632c4 Remove Legacy and Custom FileWrapper classes from header files (#7851)
Summary:
Removed the uses of the Legacy FileWrapper classes from the source code.  The wrappers were creating an additional layer of indirection/wrapping, as the Env already has a FileSystem.

Moved the Custom FileWrapper classes into the CustomEnv, as these classes are really for the private use the the CustomEnv class.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7851

Reviewed By: anand1976

Differential Revision: D26114816

Pulled By: mrambacher

fbshipit-source-id: db32840e58d969d3a0fa6c25aaf13d6dcdc74150
2021-01-28 22:10:32 -08:00
mrambacher
0a9a05ae12 Make builds reproducible (#7866)
Summary:
Closes https://github.com/facebook/rocksdb/issues/7035

Changed how build_version.cc was generated:
- Included the GIT tag/branch in the build_version file
- Changed the "Build Date" to be:
      - If the GIT branch is "clean" (no changes), the date of the last git commit
      - If the branch is not clean, the current date
 - Added APIs to access the "build information", rather than accessing the strings directly.

The build_version.cc file is now regenerated whenever the library objects are rebuilt.

Verified that the built files remain the same size across builds on a "clean build" and the same information is reported by sst_dump --version

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7866

Reviewed By: pdillinger

Differential Revision: D26086565

Pulled By: mrambacher

fbshipit-source-id: 6fcbe47f6033989d5cf26a0ccb6dfdd9dd239d7f
2021-01-28 17:42:16 -08:00
mrambacher
12f1137355 Add a SystemClock class to capture the time functions of an Env (#7858)
Summary:
Introduces and uses a SystemClock class to RocksDB.  This class contains the time-related functions of an Env and these functions can be redirected from the Env to the SystemClock.

Many of the places that used an Env (Timer, PerfStepTimer, RepeatableThread, RateLimiter, WriteController) for time-related functions have been changed to use SystemClock instead.  There are likely more places that can be changed, but this is a start to show what can/should be done.  Over time it would be nice to migrate most (if not all) of the uses of the time functions from the Env to the SystemClock.

There are several Env classes that implement these functions.  Most of these have not been converted yet to SystemClock implementations; that will come in a subsequent PR.  It would be good to unify many of the Mock Timer implementations, so that they behave similarly and be tested similarly (some override Sleep, some use a MockSleep, etc).

Additionally, this change will allow new methods to be introduced to the SystemClock (like https://github.com/facebook/rocksdb/issues/7101 WaitFor) in a consistent manner across a smaller number of classes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7858

Reviewed By: pdillinger

Differential Revision: D26006406

Pulled By: mrambacher

fbshipit-source-id: ed10a8abbdab7ff2e23d69d85bd25b3e7e899e90
2021-01-25 22:09:11 -08:00
Andrew Kryczka
e18a4df62a workaround race conditions during PeriodicWorkScheduler registration (#7888)
Summary:
This provides a workaround for two race conditions that will be fixed in
a more sophisticated way later. This PR:

(1) Makes the client serialize calls to `Timer::Start()` and `Timer::Shutdown()` (see https://github.com/facebook/rocksdb/issues/7711). The long-term fix will be to make those functions thread-safe.
(2) Makes `PeriodicWorkScheduler` atomically add/cancel work together with starting/shutting down its `Timer`. The long-term fix will be for `Timer` API to offer more specialized APIs so the client will not need to synchronize.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7888

Test Plan: ran the repro provided in https://github.com/facebook/rocksdb/issues/7881

Reviewed By: jay-zhuang

Differential Revision: D25990891

Pulled By: ajkr

fbshipit-source-id: a97fdaebbda6d7db7ddb1b146738b68c16c5be38
2021-01-21 08:50:38 -08:00
Andrew Kryczka
5b748b9e68 Cover all status codes in Status::ToString() (#7872)
Summary:
- Completed the switch statement for all possible `Code` values (the only one missing was `kCompactionTooLarge`).
- Removed the default case so compiler can alert us if a new value is added to `Code` without handling it in `Status::ToString()`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7872

Test Plan:
verified the log message for this scenario looks right

```
2021/01/15-17:26:34.564450 7fa6845fe700 [ERROR] [/db_impl/db_impl_compaction_flush.cc:2621] Waiting after background compaction error: Compaction too large: , Accumulated background error counts: 1
```

Reviewed By: ramvadiv

Differential Revision: D25934539

Pulled By: ajkr

fbshipit-source-id: 2e0b3c0d993e356a4987276d6f8a163f0ee8be7a
2021-01-16 04:28:50 -08:00
mrambacher
c1a65a4de4 Make StringEnv, StringSink, StringSource use FS classes (#7786)
Summary:
Change the StringEnv and related classes to be based on FileSystem APIs rather than the corresponding Env ones.  The StringSink and StringSource classes were changed to be based on the corresponding FS file classes.

Part of a cleanup to use the newer interfaces.  This change also eliminates some of the casts/wrappers to LegacyFile classes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7786

Reviewed By: jay-zhuang

Differential Revision: D25761460

Pulled By: anand1976

fbshipit-source-id: 428ae8e32b3db97dbeeca08c9d3bb0d9d4d3a38f
2021-01-04 16:01:01 -08:00
Andrew Kryczka
61e324422e fix thread status synchronization in thread_list_test (#7825)
Summary:
The test was flaky because the BG threads could increase
`running_count_` up to `job_count_` before applying their thread status
updates. Then the test thread would see non-deterministic results when
counting threads with each status. The fix is to acquire mutex in test
thread so it sees `running_count_` and thread status updated atomically.
I think simply reordering the two updates would have been insufficient
since the thread status update uses `memory_order_relaxed`. This change
happens to also eliminate an undesirable sleep loop.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7825

Test Plan:
injected sleeps to verify the failure repros before this PR and does not
repro after.

Reviewed By: jay-zhuang

Differential Revision: D25742409

Pulled By: ajkr

fbshipit-source-id: 926a2223fe856e20bc4c0c27df6736ee5cb02c97
2021-01-04 10:46:24 -08:00
mrambacher
55e99688cc No elide constructors (#7798)
Summary:
Added "no-elide-constructors to the ASSERT_STATUS_CHECK builds.  This flag gives more errors/warnings for some of the Status checks where an inner class checks a Status and later returns it.  In this case,  without the elide check on, the returned status may not have been checked in the caller, thereby bypassing the checked code.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7798

Reviewed By: jay-zhuang

Differential Revision: D25680451

Pulled By: pdillinger

fbshipit-source-id: c3f14ed9e2a13f0a8c54d839d5fb4d1fc1e93917
2020-12-23 16:55:53 -08:00
mrambacher
02418194d7 Add more tests for assert status checked (#7524)
Summary:
Added 10 more tests that pass the ASSERT_STATUS_CHECKED test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7524

Reviewed By: akankshamahajan15

Differential Revision: D24323093

Pulled By: ajkr

fbshipit-source-id: 28d4106d0ca1740c3b896c755edf82d504b74801
2020-12-22 23:45:58 -08:00
Peter Dillinger
239d17a19c Support optimize_filters_for_memory for Ribbon filter (#7774)
Summary:
Primarily this change refactors the optimize_filters_for_memory
code for Bloom filters, based on malloc_usable_size, to also work for
Ribbon filters.

This change also replaces the somewhat slow but general
BuiltinFilterBitsBuilder::ApproximateNumEntries with
implementation-specific versions for Ribbon (new) and Legacy Bloom
(based on a recently deleted version). The reason is to emphasize
speed in ApproximateNumEntries rather than 100% accuracy.

Justification: ApproximateNumEntries (formerly CalculateNumEntry) is
only used by RocksDB for range-partitioned filters, called each time we
start to construct one. (In theory, it should be possible to reuse the
estimate, but the abstractions provided by FilterPolicy don't really
make that workable.) But this is only used as a heuristic estimate for
hitting a desired partitioned filter size because of alignment to data
blocks, which have various numbers of unique keys or prefixes. The two
factors lead us to prioritize reasonable speed over 100% accuracy.

optimize_filters_for_memory adds extra complication, because precisely
calculating num_entries for some allowed number of bytes depends on state
with optimize_filters_for_memory enabled. And the allocator-agnostic
implementation of optimize_filters_for_memory, using malloc_usable_size,
means we would have to actually allocate memory, many times, just to
precisely determine how many entries (keys) could be added and stay below
some size budget, for the current state. (In a draft, I got this
working, and then realized the balance of speed vs. accuracy was all
wrong.)

So related to that, I have made CalculateSpace, an internal-only API
only used for testing, non-authoritative also if
optimize_filters_for_memory is enabled. This simplifies some code.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7774

Test Plan:
unit test updated, and for FilterSize test, range of tested
values is greatly expanded (still super fast)

Also tested `db_bench -benchmarks=fillrandom,stats -bloom_bits=10 -num=1000000 -partition_index_and_filters -format_version=5 [-optimize_filters_for_memory] [-use_ribbon_filter]` with temporary debug output of generated filter sizes.

Bloom+optimize_filters_for_memory:

      1 Filter size: 197 (224 in memory)
    134 Filter size: 3525 (3584 in memory)
    107 Filter size: 4037 (4096 in memory)
    Total on disk: 904,506
    Total in memory: 918,752

Ribbon+optimize_filters_for_memory:

      1 Filter size: 3061 (3072 in memory)
    110 Filter size: 3573 (3584 in memory)
     58 Filter size: 4085 (4096 in memory)
    Total on disk: 633,021 (-30.0%)
    Total in memory: 634,880 (-30.9%)

Bloom (no offm):

      1 Filter size: 261 (320 in memory)
      1 Filter size: 3333 (3584 in memory)
    240 Filter size: 3717 (4096 in memory)
    Total on disk: 895,674 (-1% on disk vs. +offm; known tolerable overhead of offm)
    Total in memory: 986,944 (+7.4% vs. +offm)

Ribbon (no offm):

      1 Filter size: 2949 (3072 in memory)
      1 Filter size: 3381 (3584 in memory)
    167 Filter size: 3701 (4096 in memory)
    Total on disk: 624,397 (-30.3% vs. Bloom)
    Total in memory: 690,688 (-30.0% vs. Bloom)

Note that optimize_filters_for_memory is even more effective for Ribbon filter than for cache-local Bloom, because it can close the unused memory gap even tighter than Bloom filter, because of 16 byte increments for Ribbon vs. 64 byte increments for Bloom.

Reviewed By: jay-zhuang

Differential Revision: D25592970

Pulled By: pdillinger

fbshipit-source-id: 606fdaa025bb790d7e9c21601e8ea86e10541912
2020-12-18 14:31:03 -08:00
Peter Dillinger
003e72b201 Use size_t for filter APIs, protect against overflow (#7726)
Summary:
Deprecate CalculateNumEntry and replace with
ApproximateNumEntries (better name) using size_t instead of int and
uint32_t, to minimize confusing casts and bad overflow behavior
(possible though probably not realistic). Bloom sizes are now explicitly
capped at max size supported by implementations: just under 4GiB for
fv=5 Bloom, and just under 512MiB for fv<5 Legacy Bloom. This
hardening could help to set up for fuzzing.

Also, since RocksDB only uses this information as an approximation
for trying to hit certain sizes for partitioned filters, it's more important
that the function be reasonably fast than for it to be completely
accurate. It's hard enough to be 100% accurate for Ribbon (currently
reversing CalculateSpace) that adding optimize_filters_for_memory
into the mix is just not worth trying to be 100% accurate for num
entries for bytes.

Also:
- Cleaned up filter_policy.h to remove MSVC warning handling and
potentially unsafe use of exception for "not implemented"
- Correct the number of entries limit beyond which current Ribbon
implementation falls back on Bloom instead.
- Consistently use "num_entries" rather than "num_entry"
- Remove LegacyBloomBitsBuilder::CalculateNumEntry as it's essentially
obsolete from general implementation
BuiltinFilterBitsBuilder::CalculateNumEntries.
- Fix filter_bench to skip some tests that don't make sense when only
one or a small number of filters has been generated.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7726

Test Plan:
expanded existing unit tests for CalculateSpace /
ApproximateNumEntries. Also manually used filter_bench to verify Legacy and
fv=5 Bloom size caps work (much too expensive for unit test). Note that
the actual bits per key is below requested due to space cap.

    $ ./filter_bench -impl=0 -bits_per_key=20 -average_keys_per_filter=256000000 -vary_key_count_ratio=0 -m_keys_total_max=256 -allow_bad_fp_rate
    ...
    Total size (MB): 511.992
    Bits/key stored: 16.777
    ...
    $ ./filter_bench -impl=2 -bits_per_key=20 -average_keys_per_filter=2000000000 -vary_key_count_ratio=0 -m_keys_total_max=2000
    ...
    Total size (MB): 4096
    Bits/key stored: 17.1799
    ...
    $

Reviewed By: jay-zhuang

Differential Revision: D25239800

Pulled By: pdillinger

fbshipit-source-id: f94e6d065efd31e05ec630ae1a82e6400d8390c4
2020-12-11 22:18:12 -08:00
pkubaj
66e54c5984 Fix build on FreeBSD/powerpc64(le) (#7732)
Summary:
To build on FreeBSD, arch_ppc_probe needs to be adapted to FreeBSD.

Since FreeBSD uses elf_aux_info as an getauxval equivalent, use it and include necessary headers:
- machine/cpu.h for PPC_FEATURE2_HAS_VEC_CRYPTO,
- sys/auxv.h for elf_aux_info,
- sys/elf_common.h for AT_HWCAP2.

elf_aux_info isn't checked for being available, because it's available since FreeBSD 12.0. rocksdb assumes using Clang on FreeBSD, but powerpc* platforms switch to Clang only since 13.0.

This patch makes rocksdb build on FreeBSD on powerpc64 and powerpc64le platforms.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7732

Reviewed By: ltamasi

Differential Revision: D25399194

Pulled By: pdillinger

fbshipit-source-id: 9c905147d75f98cd2557dd2f86a940b8e6c5afcd
2020-12-08 15:31:56 -08:00
Vincent Milum Jr
93c6c18cf9 Adding ARM AT_HWCAP support for FreeBSD (#7750)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7750

Reviewed By: ltamasi

Differential Revision: D25400609

Pulled By: pdillinger

fbshipit-source-id: 13b15e2f490acc011b648fbd9615ea8e580cccc7
2020-12-08 13:33:21 -08:00
Adam Retter
ee4bd4780b Fix compilation on Apple Silicon (#7714)
Summary:
Closes - https://github.com/facebook/rocksdb/issues/7710

I tested this on an Apple DTK (Developer Transition Kit) with an Apple A12Z Bionic CPU and macOS Big Sur (11.0.1).

Previously the arm64 specific CRC optimisations were limited to Linux only OS... Well now Apple Silicon is also arm64 but runs macOS ;-)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7714

Reviewed By: ltamasi

Differential Revision: D25287349

Pulled By: pdillinger

fbshipit-source-id: 639b168bf0ac2652907531e9604936ac4974b577
2020-12-04 15:22:33 -08:00
Peter Dillinger
3b9bfe8f14 Skip minimum rate check in Sandcastle (#7728)
Summary:
The minimum rate check in RateLimiterTest.Rate can fail in
Facebook's CI system Sandcastle, presumably due to heavily loaded
machines. This change disables the minimum rate check for Sandcastle
runs, and cleans up the code disabling it on other CI environments. (The
amount of conditionally compiled code shall be minimized.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7728

Test Plan: try new test with and without setting envvar SANDCASTLE=1

Reviewed By: ltamasi

Differential Revision: D25247642

Pulled By: pdillinger

fbshipit-source-id: d786233af37af9a874adbb3a9e2707ec52c27a5a
2020-12-02 15:16:49 -08:00
Adam Retter
b937be3779 Fix Compilation on ppc64le using Clang 11 (#7713)
Summary:
Closes https://github.com/facebook/rocksdb/issues/7691

The optimised CRC code for PPC64le which was originally imported in https://github.com/facebook/rocksdb/pull/2353 is not compatible with Clang 11. It looks like the code most likely originated from https://github.com/antonblanchard/crc32-vpmsum.

The code relied on a GCC header file `ppc-asm.h` which is not available in Clang.

To solve this, I have taken the same approach as the the upstream project from which the CRC code came ffc8018efc (diff-ec3e62c56fbcddeb07230f2a4673c1abd7f0f1cc8e48a2aa560056cfc1b25d60) and simply imported a copy of the GCC header file into our code-base which will be used when Clang is the compiler on pcc64le.

**NOTE**: The new file `util/ppc-asm.h` may have licensing implications which I guess need to be approved by RocksDB/Facebook before this is merged

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7713

Reviewed By: jay-zhuang

Differential Revision: D25222645

Pulled By: pdillinger

fbshipit-source-id: e3fec9136f26ce1eb7a027048bcf77a6cb3c769c
2020-12-01 11:21:44 -08:00
Peter Dillinger
0baa5055f1 Add Ribbon schema test to bloom_test (#7696)
Summary:
These new unit tests should ensure that we don't accidentally
change the interpretation of bits for what I call Standard128Ribbon
filter internally, available publicly as NewExperimentalRibbonFilterPolicy.
There is very little intuitive reason for the values we check against in
these tests; I just plug in the right expected values upon watching the
test fail initially.

Most (but not all) of the tests are essentially "whitebox" "round-trip." We
create a filter from fixed keys, and first compare the checksum of those
filter bytes against a saved value. We also run queries against other fixed
keys, comparing which return false positives against a saved set.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7696

Test Plan: test addition and refactoring only

Reviewed By: jay-zhuang

Differential Revision: D25082289

Pulled By: pdillinger

fbshipit-source-id: b5ca646fdcb5a1c2ad2085eda4a1fd44c4287f67
2020-11-22 19:52:04 -08:00
Peter Dillinger
60af964372 Experimental (production candidate) SST schema for Ribbon filter (#7658)
Summary:
Added experimental public API for Ribbon filter:
NewExperimentalRibbonFilterPolicy(). This experimental API will
take a "Bloom equivalent" bits per key, and configure the Ribbon
filter for the same FP rate as Bloom would have but ~30% space
savings. (Note: optimize_filters_for_memory is not yet implemented
for Ribbon filter. That can be added with no effect on schema.)

Internally, the Ribbon filter is configured using a "one_in_fp_rate"
value, which is 1 over desired FP rate. For example, use 100 for 1%
FP rate. I'm expecting this will be used in the future for configuring
Bloom-like filters, as I expect people to more commonly hold constant
the filter accuracy and change the space vs. time trade-off, rather than
hold constant the space (per key) and change the accuracy vs. time
trade-off, though we might make that available.

### Benchmarking

```
$ ./filter_bench -impl=2 -quick -m_keys_total_max=200 -average_keys_per_filter=100000 -net_includes_hashing
Building...
Build avg ns/key: 34.1341
Number of filters: 1993
Total size (MB): 238.488
Reported total allocated memory (MB): 262.875
Reported internal fragmentation: 10.2255%
Bits/key stored: 10.0029
----------------------------
Mixed inside/outside queries...
  Single filter net ns/op: 18.7508
  Random filter net ns/op: 258.246
    Average FP rate %: 0.968672
----------------------------
Done. (For more info, run with -legend or -help.)
$ ./filter_bench -impl=3 -quick -m_keys_total_max=200 -average_keys_per_filter=100000 -net_includes_hashing
Building...
Build avg ns/key: 130.851
Number of filters: 1993
Total size (MB): 168.166
Reported total allocated memory (MB): 183.211
Reported internal fragmentation: 8.94626%
Bits/key stored: 7.05341
----------------------------
Mixed inside/outside queries...
  Single filter net ns/op: 58.4523
  Random filter net ns/op: 363.717
    Average FP rate %: 0.952978
----------------------------
Done. (For more info, run with -legend or -help.)
```

168.166 / 238.488 = 0.705  -> 29.5% space reduction

130.851 / 34.1341 = 3.83x construction time for this Ribbon filter vs. lastest Bloom filter (could make that as little as about 2.5x for less space reduction)

### Working around a hashing "flaw"

bloom_test discovered a flaw in the simple hashing applied in
StandardHasher when num_starts == 1 (num_slots == 128), showing an
excessively high FP rate.  The problem is that when many entries, on the
order of number of hash bits or kCoeffBits, are associated with the same
start location, the correlation between the CoeffRow and ResultRow (for
efficiency) can lead to a solution that is "universal," or nearly so, for
entries mapping to that start location. (Normally, variance in start
location breaks the effective association between CoeffRow and
ResultRow; the same value for CoeffRow is effectively different if start
locations are different.) Without kUseSmash and with num_starts > 1 (thus
num_starts ~= num_slots), this flaw should be completely irrelevant.  Even
with 10M slots, the chances of a single slot having just 16 (or more)
entries map to it--not enough to cause an FP problem, which would be local
to that slot if it happened--is 1 in millions. This spreadsheet formula
shows that: =1/(10000000*(1 - POISSON(15, 1, TRUE)))

As kUseSmash==false (the setting for Standard128RibbonBitsBuilder) is
intended for CPU efficiency of filters with many more entries/slots than
kCoeffBits, a very reasonable work-around is to disallow num_starts==1
when !kUseSmash, by making the minimum non-zero number of slots
2*kCoeffBits. This is the work-around I've applied. This also means that
the new Ribbon filter schema (Standard128RibbonBitsBuilder) is not
space-efficient for less than a few hundred entries. Because of this, I
have made it fall back on constructing a Bloom filter, under existing
schema, when that is more space efficient for small filters. (We can
change this in the future if we want.)

TODO: better unit tests for this case in ribbon_test, and probably
update StandardHasher for kUseSmash case so that it can scale nicely to
small filters.

### Other related changes

* Add Ribbon filter to stress/crash test
* Add Ribbon filter to filter_bench as -impl=3
* Add option string support, as in "filter_policy=experimental_ribbon:5.678;"
where 5.678 is the Bloom equivalent bits per key.
* Rename internal mode BloomFilterPolicy::kAuto to kAutoBloom
* Add a general BuiltinFilterBitsBuilder::CalculateNumEntry based on
binary searching CalculateSpace (inefficient), so that subclasses
(especially experimental ones) don't have to provide an efficient
implementation inverting CalculateSpace.
* Minor refactor FastLocalBloomBitsBuilder for new base class
XXH3pFilterBitsBuilder shared with new Standard128RibbonBitsBuilder,
which allows the latter to fall back on Bloom construction in some
extreme cases.
* Mostly updated bloom_test for Ribbon filter, though a test like
FullBloomTest::Schema is a next TODO to ensure schema stability
(in case this becomes production-ready schema as it is).
* Add some APIs to ribbon_impl.h for configuring Ribbon filters.
Although these are reasonably covered by bloom_test, TODO more unit
tests in ribbon_test
* Added a "tool" FindOccupancyForSuccessRate to ribbon_test to get data
for constructing the linear approximations in GetNumSlotsFor95PctSuccess.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7658

Test Plan:
Some unit tests updated but other testing is left TODO. This
is considered experimental but laying down schema compatibility as early
as possible in case it proves production-quality. Also tested in
stress/crash test.

Reviewed By: jay-zhuang

Differential Revision: D24899349

Pulled By: pdillinger

fbshipit-source-id: 9715f3e6371c959d923aea8077c9423c7a9f82b8
2020-11-12 20:46:14 -08:00
Yanqin Jin
8b6b6aeb1a Refactor with VersionEditHandler (#6581)
Summary:
Added a few classes in the same class hierarchy to remove code duplication and
refactor the logic of reading and processing MANIFEST files.

New classes are as follows.
```
class VersionEditHandlerBase;
class ListColumnFamiliesHandler : VersionEditHandlerBase;
class FileChecksumRetriever : VersionEditHandlerBase;
class DumpManifestHandler : VersionEditHandler;
```
Classes that already existed before this PR are as follows.
```
class VersionEditHandler : VersionEditHandlerBase;
```

With these classes, refactored functions: `VersionSet::Recover()`,
`VersionSet::ListColumnFamilies()`, `VersionSet::DumpManifest()`,
`GetFileChecksumFromManifest()`.

Test Plan (devserver):
```
make check
COMPILE_WITH_ASAN=1 make check
```
These refactored code, especially recovery-related logic, will be tested intensively by
all existing unit tests and stress tests. For example, run
```
make crash_test
```
Verified 3 successful runs on devserver.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6581

Reviewed By: ajkr

Differential Revision: D20616217

Pulled By: riversand963

fbshipit-source-id: 048c7743aa4be2623ccd0cc3e61c0027e604e78b
2020-11-11 08:00:14 -08:00
Peter Dillinger
c57f914482 Use NPHash64 in more places (#7632)
Summary:
Since the hashes should not be persisted in output_validator
nor mock_env.

Also updated NPHash64 to use 64-bit seed, and comments.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7632

Test Plan:
make check, and new build setting that enables modification
to NPHash64, to check for behavior depending on specific values. Added
that setting to one of the CircleCI configurations.

Reviewed By: jay-zhuang

Differential Revision: D24833780

Pulled By: pdillinger

fbshipit-source-id: 02a57652ccf1ac105fbca79e77875bb7bf7c071f
2020-11-10 23:42:13 -08:00
Peter Dillinger
8b8a2e9f05 Ribbon: major re-work of hashing, seeds, and more (#7635)
Summary:
* Fully optimized StandardHasher, in terms of efficiently generating Start, CoeffRow, and ResultRow from a stock hash value, with sufficient independence between them to have no measurably degraded behavior. (Degraded behavior would be an FP rate higher than explainable by 2^-b and, if using a 32-bit stock hash function, expected stock hash collisions.) Details in code comments.
* Our standard 64-bit and 32-bit hash functions do not exhibit sufficient independence on sequential seeds (for one Ribbon construction attempt to have independent probability from the next). I have worked around this in the Ribbon code by "pre-mixing" "ordinal seeds," sequentially tried and appropriate for storage in persisted metadata, into "raw seeds," ready for application and appropriate for in-memory storage. This way the pre-mixing step (though fast) is only applied on loading or configuring the structure, not on each query or banding add.
* Fix a subtle flaw in which backtracking not clearing ResultRow data could lead to elevated FP rate on keys that were backtracked on and should (for generality) exhibit the same FP rate as novel keys.
* Added a basic test for PhsfQuery and construction algorithms (map or "retrieval structure" rather than set or filter), and made a few trivial related fixes.
* Better random configuration generation in unit tests
* Some other minor cleanup / clarification / etc.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7635

Test Plan: unit tests included

Reviewed By: jay-zhuang

Differential Revision: D24738978

Pulled By: pdillinger

fbshipit-source-id: f9d03599d9e2ca3e30e9d3e7d81cd936b56f76f0
2020-11-07 17:22:54 -08:00
Peter Dillinger
746909ceda Ribbon: InterleavedSolutionStorage (#7598)
Summary:
The core algorithms for InterleavedSolutionStorage and the
implementation SerializableInterleavedSolution make Ribbon fast for
filter queries. Example output from new unit test:

    Simple      outside query, hot, incl hashing, ns/key: 117.796
    Interleaved outside query, hot, incl hashing, ns/key: 42.2655
    Bloom       outside query, hot, incl hashing, ns/key: 24.0071

Also includes misc cleanup of previous Ribbon code and comments.

Some TODOs and FIXMEs remain for futher work / investigation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7598

Test Plan: unit tests included (integration work and tests coming later)

Reviewed By: jay-zhuang

Differential Revision: D24559209

Pulled By: pdillinger

fbshipit-source-id: fea483cd354ba782aea3e806f2bc96e183d59441
2020-11-03 12:46:36 -08:00
Andrew Kryczka
1adbceb581 Expand effect of dictionary settings in ColumnFamilyOptions::compression_opts (#7619)
Summary:
In dictionary compression's initial implementation, in order to save CPU overhead, we only enabled it
for bottom level under the assumption that the vast majority of data is
stored there. At that time, there was no
such thing as `ColumnFamilyOptions::bottommost_compression_opts`, so we just
hardcoded disabling dictionary compression in flush and compactions to
non-bottommost level. Now, we have users who generate all their files
through flush and are considering using dictionary compression.

To support such a use case, this PR expands the scope of `ColumnFamilyOptions::compression_opts` to
additionally include flushed files and files generated by compaction to
a non-bottommost level. Users can still get the old behavior by moving
their dictionary settings to `ColumnFamilyOptions::bottommost_compression_opts`
and explicitly enabling both that and `ColumnFamilyOptions::bottommost_compression`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7619

Reviewed By: ltamasi

Differential Revision: D24665610

Pulled By: ajkr

fbshipit-source-id: 656b90bce1033fe21c71e09af931ef5bde3e464c
2020-11-02 19:21:11 -08:00
Yanqin Jin
394210f280 Remove unused includes (#7604)
Summary:
This is a PR generated **semi-automatically** by an internal tool to remove unused includes and `using` statements.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7604

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D24579392

Pulled By: riversand963

fbshipit-source-id: c4bfa6c6b08da1de186690d37eb73d8fff45aecd
2020-10-28 23:22:27 -07:00
mrambacher
f35f7f2704 Fix many tests to run with MEM_ENV and ENCRYPTED_ENV; Introduce a MemoryFileSystem class (#7566)
Summary:
This PR does a few things:

1.  The MockFileSystem class was split out from the MockEnv.  This change would theoretically allow a MockFileSystem to be used by other Environments as well (if we created a means of constructing one).  The MockFileSystem implements a FileSystem in its entirety and does not rely on any Wrapper implementation.

2.  Make the RocksDB test suite work when MOCK_ENV=1 and ENCRYPTED_ENV=1 are set.  To accomplish this, a few things were needed:
- The tests that tried to use the "wrong" environment (Env::Default() instead of env_) were updated
- The MockFileSystem was changed to support the features it was missing or mishandled (such as recursively deleting files in a directory or supporting renaming of a directory).

3.  Updated the test framework to have a ROCKSDB_GTEST_SKIP macro.  This can be used to flag tests that are skipped.  Currently, this defaults to doing nothing (marks the test as SUCCESS) but will mark the tests as SKIPPED when RocksDB is upgraded to a version of gtest that supports this (gtest-1.10).

I have run a full "make check" with MEM_ENV, ENCRYPTED_ENV,  both, and neither under both MacOS and RedHat.  A few tests were disabled/skipped for the MEM/ENCRYPTED cases.  The error_handler_fs_test fails/hangs for MEM_ENV (presumably a timing problem) and I will introduce another PR/issue to track that problem.  (I will also push a change to disable those tests soon).  There is one more test in DBTest2 that also fails which I need to investigate or skip before this PR is merged.

Theoretically, this PR should also allow the test suite to run against an Env loaded from the registry, though I do not have one to try it with currently.

Finally, once this is accepted, it would be nice if there was a CircleCI job to run these tests on a checkin so this effort does not become stale.  I do not know how to do that, so if someone could write that job, it would be appreciated :)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7566

Reviewed By: zhichao-cao

Differential Revision: D24408980

Pulled By: jay-zhuang

fbshipit-source-id: 911b1554a4d0da06fd51feca0c090a4abdcb4a5f
2020-10-27 10:33:09 -07:00
Peter Dillinger
25d54c799c Ribbon: initial (general) algorithms and basic unit test (#7491)
Summary:
This is intended as the first commit toward a near-optimal alternative to static Bloom filters for SSTs. Stephan Walzer and I have agreed upon the name "Ribbon" for a PHSF based on his linear system construction in "Efficient Gauss Elimination for Near-Quadratic Matrices with One Short Random Block per Row, with Applications" ("SGauss") and my much faster "on the fly" algorithm for gaussian elimination (or for this linear system, "banding"), which can be faster than peeling while also more compact and flexible. See util/ribbon_alg.h for more detailed introduction and background. RIBBON = Rapid Incremental Boolean Banding ON-the-fly

This commit just adds generic (templatized) core algorithms and a basic unit test showing some features, including the ability to construct structures within 2.5% space overhead vs. information theoretic lower bound. (Compare to cache-local Bloom filter's ~50% space overhead -> ~30% reduction anticipated.) This commit does not include the storage scheme necessary to make queries fast, especially for filter queries, nor fractional "result bits", but there is some description already and those implementations will come soon. Nor does this commit add FilterPolicy support, for use in SST files, but that will also come soon.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7491

Reviewed By: jay-zhuang

Differential Revision: D24517954

Pulled By: pdillinger

fbshipit-source-id: 0119ee597e250d7e0edd38ada2ba50d755606fa7
2020-10-25 20:44:49 -07:00
Peter Dillinger
a16d1b2fd3 Add Encode/DecodeFixedGeneric, coding_lean.h (#7587)
Summary:
To minimize dependencies for Ribbon filter code in progress,
core part of coding.h for fixed sizes has been moved to coding_lean.h.
Also, generic versions of these functions have been added to math128.h
(since the generic versions are likely only to be used along with
Unsigned128).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7587

Test Plan: Unit tests added for new functions

Reviewed By: jay-zhuang

Differential Revision: D24486718

Pulled By: pdillinger

fbshipit-source-id: a69768f742379689442135fa52237c01dfe2647e
2020-10-23 14:11:15 -07:00