Summary:
A user friendly sst file reader is useful when we want to access sst
files outside of RocksDB. For example, we can generate an sst file
with SstFileWriter and send it to other places, then use SstFileReader
to read the file and process the entries in other ways.
Also rename the original SstFileReader to SstFileDumper because of
name conflict, and seems SstFileDumper is more appropriate for tools.
TODO: there is only a very simple test now, because I want to get some feedback first.
If the changes look good, I will add more tests soon.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4717
Differential Revision: D13212686
Pulled By: ajkr
fbshipit-source-id: 737593383264c954b79e63edaf44aaae0d947e56
Summary:
…ons (#4676)"
This reverts commit b32d087dbb.
`MemoryAllocator` needs to be with `Cache`, since cache entry can
outlive DB and block based table. The cache needs to hold reference to
memory allocator when deleting cache entry.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4697
Differential Revision: D13133490
Pulled By: yiwu-arbug
fbshipit-source-id: 8ef7e8a51263bfd929f892fd062665ff4ce9ce5a
Summary:
Rather than storing a `vector<RangeTombstone>`, we now store a
`vector<RangeTombstoneStack>` and a `vector<SequenceNumber>`. A
`RangeTombstoneStack` contains the start and end keys of a range tombstone
fragment, and indices into the seqnum vector to indicate which sequence
numbers the fragment is located at. The diagram below illustrates an
example:
```
tombstones_: [a, b) [c, e) [h, k)
| \ / \ / |
| \ / \ / |
v v v v
tombstone_seqs_: [ 5 3 10 7 2 8 6 ]
```
This format allows binary searching the tombstone list to use less key
comparisons, which helps in cases where there are many overlapping
tombstones. Also, this format makes it easier to add DBIter-like
semantics to `FragmentedRangeTombstoneIterator` in the future.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4632
Differential Revision: D13053103
Pulled By: abhimadan
fbshipit-source-id: e8220cc712fcf5be4d602913bb23ace8ea5f8ef0
Summary:
We carry compression type and "cachable" variables for every block in the block cache, while they take well-known values. 8-byte is wasted for each block (2-byte for useful information but it takes 8 bytes because of padding). With this change, these two variables are removed.
The cachable information is only useful in the process of reading the block. We use other information to infer from it. For compressed blocks, the compression type is a part of the block content itself so we can get it from there.
Some code is slightly refactored so that the cachable information can flow better.
Another change is to only use class BlockContents for compressed block, and narrow the class Block to only be used for uncompressed blocks, including blocks in compressed block cache. This can make the Block class less confusing. It also saves tens of bytes for each block in compressed block cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4650
Differential Revision: D12969070
Pulled By: siying
fbshipit-source-id: 548b62724e9eb66993026429fd9c7c3acd1f95ed
Summary:
Per offline discussion with siying, `MemoryAllocator` and `Cache` should be decouple. The idea is that memory allocator handles memory allocation, while cache handle cache policy.
It is normal that external cache libraries pack couple the two components for better optimization. If we want to integrate with such library in the future, we can make a wrapper of the library implementing both `Cache` and `MemoryAllocator` interface.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4676
Differential Revision: D13047662
Pulled By: yiwu-arbug
fbshipit-source-id: cd42e246d80ab600b4de47d073f7d2db308ce6dd
Summary:
Ran the following commands to recursively change all the files under RocksDB:
```
find . -type f -name "*.cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} +
```
Running `make format` updated some formatting on the files touched.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638
Differential Revision: D12934992
Pulled By: sagar0
fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8
Summary:
valgrind tests with 1 thread run too long. To make it shorter, black list some long tests. These are already blacklisted in parallel valgrind tests, but they are not in non-parallel mode
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4642
Differential Revision: D12945237
Pulled By: siying
fbshipit-source-id: 04cf977d435996480fe87aa09f14b17975b74f7d
Summary:
Since the number of range deletions are reported in
TableProperties, it is confusing to not report the number of merge
operands and point deletions as top-level properties; they are
accessible through the public API, but since they are not the "main"
properties, they do not appear in aggregated table properties, or the
string representation of table properties.
This change promotes those two property keys to
`rocksdb/table_properties.h`, adds corresponding uint64 members for
them, deprecates the old access methods `GetDeletedKeys()` and
`GetMergeOperands()` (though they are still usable for now), and removes
`InternalKeyPropertiesCollector`. The property key strings are the same
as before this change, so this should be able to read DBs written from older
versions (though I haven't tested this yet).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4594
Differential Revision: D12826893
Pulled By: abhimadan
fbshipit-source-id: 9e4e4fbdc5b0da161c89582566d184101ba8eb68
Summary:
Rename the interface, as it is mean to be a generic interface for memory allocation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4590
Differential Revision: D10866340
Pulled By: yiwu-arbug
fbshipit-source-id: 85cb753351a40cb856c046aeaa3f3b369eef3d16
Summary:
This allows tombstone fragmenting to only be performed when the table is opened, and cached for subsequent accesses.
On the same DB used in #4449, running `readrandom` results in the following:
```
readrandom : 0.983 micros/op 1017076 ops/sec; 78.3 MB/s (63103 of 100000 found)
```
Now that Get performance in the presence of range tombstones is reasonable, I also compared the performance between a DB with range tombstones, "expanded" range tombstones (several point tombstones that cover the same keys the equivalent range tombstone would cover, a common workaround for DeleteRange), and no range tombstones. The created DBs had 5 million keys each, and DeleteRange was called at regular intervals (depending on the total number of range tombstones being written) after 4.5 million Puts. The table below summarizes the results of a `readwhilewriting` benchmark (in order to provide somewhat more realistic results):
```
Tombstones? | avg micros/op | stddev micros/op | avg ops/s | stddev ops/s
----------------- | ------------- | ---------------- | ------------ | ------------
None | 0.6186 | 0.04637 | 1,625,252.90 | 124,679.41
500 Expanded | 0.6019 | 0.03628 | 1,666,670.40 | 101,142.65
500 Unexpanded | 0.6435 | 0.03994 | 1,559,979.40 | 104,090.52
1k Expanded | 0.6034 | 0.04349 | 1,665,128.10 | 125,144.57
1k Unexpanded | 0.6261 | 0.03093 | 1,600,457.50 | 79,024.94
5k Expanded | 0.6163 | 0.05926 | 1,636,668.80 | 154,888.85
5k Unexpanded | 0.6402 | 0.04002 | 1,567,804.70 | 100,965.55
10k Expanded | 0.6036 | 0.05105 | 1,667,237.70 | 142,830.36
10k Unexpanded | 0.6128 | 0.02598 | 1,634,633.40 | 72,161.82
25k Expanded | 0.6198 | 0.04542 | 1,620,980.50 | 116,662.93
25k Unexpanded | 0.5478 | 0.0362 | 1,833,059.10 | 121,233.81
50k Expanded | 0.5104 | 0.04347 | 1,973,107.90 | 184,073.49
50k Unexpanded | 0.4528 | 0.03387 | 2,219,034.50 | 170,984.32
```
After a large enough quantity of range tombstones are written, range tombstone Gets can become faster than reading from an equivalent DB with several point tombstones.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4493
Differential Revision: D10842844
Pulled By: abhimadan
fbshipit-source-id: a7d44534f8120e6aabb65779d26c6b9df954c509
Summary:
Previously, range tombstones were accumulated from every level, which
was necessary if a range tombstone in a higher level covered a key in a lower
level. However, RangeDelAggregator::AddTombstones's complexity is based on
the number of tombstones that are currently stored in it, which is wasteful in
the Get case, where we only need to know the highest sequence number of range
tombstones that cover the key from higher levels, and compute the highest covering
sequence number at the current level. This change introduces this optimization, and
removes the use of RangeDelAggregator from the Get path.
In the benchmark results, the following command was used to initialize the database:
```
./db_bench -db=/dev/shm/5k-rts -use_existing_db=false -benchmarks=filluniquerandom -write_buffer_size=1048576 -compression_type=lz4 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -value_size=112 -key_size=16 -block_size=4096 -level_compaction_dynamic_level_bytes=true -num=5000000 -max_background_jobs=12 -benchmark_write_rate_limit=20971520 -range_tombstone_width=100 -writes_per_range_tombstone=100 -max_num_range_tombstones=50000 -bloom_bits=8
```
...and the following command was used to measure read throughput:
```
./db_bench -db=/dev/shm/5k-rts/ -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=5000000 -reads=100000 -threads=32
```
The filluniquerandom command was only run once, and the resulting database was used
to measure read performance before and after the PR. Both binaries were compiled with
`DEBUG_LEVEL=0`.
Readrandom results before PR:
```
readrandom : 4.544 micros/op 220090 ops/sec; 16.9 MB/s (63103 of 100000 found)
```
Readrandom results after PR:
```
readrandom : 11.147 micros/op 89707 ops/sec; 6.9 MB/s (63103 of 100000 found)
```
So it's actually slower right now, but this PR paves the way for future optimizations (see #4493).
----
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4449
Differential Revision: D10370575
Pulled By: abhimadan
fbshipit-source-id: 9a2e152be1ef36969055c0e9eb4beb0d96c11f4d
Summary:
PR https://github.com/facebook/rocksdb/pull/4226 introduced per-level perf context which allows breaking down perf context by levels.
This PR takes advantage of the feature to populate a few counters related to bloom filters
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4581
Differential Revision: D10518010
Pulled By: miasantreble
fbshipit-source-id: 011244561783ec860d32d5b0fa6bce6e78d70ef8
Summary:
This fixes three tests that fail with relatively recent tools and libraries:
The tests are:
* `spatial_db_test`
* `table_test`
* `db_universal_compaction_test`
I'm using:
* `gcc` 7.3.0
* `glibc` 2.27
* `snappy` 1.1.7
* `gflags` 2.2.1
* `zlib` 1.2.11
* `bzip2` 1.0.6.0.1
* `lz4` 1.8.2
* `jemalloc` 5.0.1
The versions used in the Travis environment (which is two Ubuntu LTS versions behind the current one and doesn't use `lz4` or `jemalloc`) don't seem to have a problem. However, to be safe, I verified that these tests pass with and without my changes in a trusty Docker container without `lz4` and `jemalloc`.
However, I do get an unrelated set of other failures when using a trusty Docker container that uses `lz4` and `jemalloc`:
```
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[ FAILED ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/0, where GetParam() = (1, false) (1189 ms)
[ RUN ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/1
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[ FAILED ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/1, where GetParam() = (1, true) (1246 ms)
[ RUN ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/2
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[ FAILED ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/2, where GetParam() = (3, false) (1237 ms)
[ RUN ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/3
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[ FAILED ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/3, where GetParam() = (3, true) (1195 ms)
[ RUN ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/4
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[ FAILED ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/4, where GetParam() = (5, false) (1161 ms)
[ RUN ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/5
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[ FAILED ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/5, where GetParam() = (5, true) (1229 ms)
```
I haven't attempted to fix these since I'm not using trusty and Travis doesn't use `lz4` and `jemalloc`. However, the final commit in this PR does at least fix the compilation errors that occur when using trusty's version of `lz4`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4562
Differential Revision: D10510917
Pulled By: maysamyabandeh
fbshipit-source-id: 59534042015ec339270e5fc2f6ac4d859370d189
Summary:
Current implementation of perf context is level agnostic. Making it hard to do performance evaluation for the LSM tree. This PR adds `PerfContextByLevel` to decompose the counters by level.
This will be helpful when analyzing point and range query performance as well as tuning bloom filter
Also replaced __thread with thread_local keyword for perf_context
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4226
Differential Revision: D10369509
Pulled By: miasantreble
fbshipit-source-id: f1ced4e0de5fcebdb7f9cff36164516bc6382d82
Summary:
We would like to collect file-system-level statistics including file name, offset, length, return code, latency, etc., which requires to add callbacks to intercept file IO function calls when RocksDB is running.
To collect file-system-level statistics, users can inherit the class `EventListener`, as in `TestFileOperationListener `. Note that `TestFileOperationListener::ShouldBeNotifiedOnFileIO()` returns true.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3933
Differential Revision: D10219571
Pulled By: riversand963
fbshipit-source-id: 7acc577a2d31097766a27adb6f78eaf8b1e8ff15
Summary:
This is a conceptually simple change, but it touches many files to
pass the allocator through function calls.
We introduce CacheAllocator, which can be used by clients to configure
custom allocator for cache blocks. Our motivation is to hook this up
with folly's `JemallocNodumpAllocator`
(f43ce6d686/folly/experimental/JemallocNodumpAllocator.h),
but there are many other possible use cases.
Additionally, this commit cleans up memory allocation in
`util/compression.h`, making sure that all allocations are wrapped in a
unique_ptr as soon as possible.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4437
Differential Revision: D10132814
Pulled By: yiwu-arbug
fbshipit-source-id: be1343a4b69f6048df127939fea9bbc96969f564
Summary:
Value delta encoding in format_version 4 requires the differences between the size of two consecutive handles to be sent to BlockBuilder::Add. This applies not only to indexes on blocks but also the indexes on indexes and filters in partitioned indexes and filters respectively. The patch fixes a bug where the partitioned filters would encode the entire size of the handle rather than the difference of the size with the last size.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4381
Differential Revision: D9879505
Pulled By: maysamyabandeh
fbshipit-source-id: 27a22e49b482b927fbd5629dc310c46d63d4b6d1
Summary:
This is a follow up to #4370. The earlier comment is not correct.
Thanks to ajkr for pointing this out.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4380
Differential Revision: D9874667
Pulled By: sagar0
fbshipit-source-id: f4e092d86b29c715258210b770643d367e38caae
Summary:
With #3983 the size of IndexBlockIter was increased. This had resulted in a regression on P50 latencies in one of our benchmarks. The patch reduces IndexBlockIter size be eliminating active_comparator_ field from the class.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4358
Differential Revision: D9781737
Pulled By: maysamyabandeh
fbshipit-source-id: 71e2b28d90ff0813db9e04b737ae73e185583c52
Summary:
Reverting is needed to unblock a user building against master, who is blocked for multiple days due to a thread-safety issue in `GetEmptyDict`. We haven't been able to fix it quickly, so reverting.
Simply ran `git revert 6c40806e51a89386d2b066fddf73d3fd03a36f65`. There were no merge conflicts.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4347
Differential Revision: D9668365
Pulled By: ajkr
fbshipit-source-id: 0c56334f0a23cf5ee0233d4e4679eae6709739cd
Summary:
As you know, almost all compilers support "pragma once" keyword instead of using include guards. To be keep consistency between header files, all header files are edited.
Besides this, try to fix some warnings about loss of data.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4339
Differential Revision: D9654990
Pulled By: ajkr
fbshipit-source-id: c2cf3d2d03a599847684bed81378c401920ca848
Summary:
In RocksDB, for a given SST file, all data blocks are compressed with the same dictionary. When we compress a block using the dictionary's raw bytes, the compression library first has to digest the dictionary to get it into a usable form. This digestion work is redundant and ideally should be done once per file.
ZSTD offers APIs for the caller to create and reuse a digested dictionary object (`ZSTD_CDict`). In this PR, we call `ZSTD_createCDict` once per file to digest the raw bytes. Then we use `ZSTD_compress_usingCDict` to compress each data block using the pre-digested dictionary. Once the file's created `ZSTD_freeCDict` releases the resources held by the digested dictionary.
There are a couple other changes included in this PR:
- Changed the parameter object for (un)compression functions from `CompressionContext`/`UncompressionContext` to `CompressionInfo`/`UncompressionInfo`. This avoids the previous pattern, where `CompressionContext`/`UncompressionContext` had to be mutated before calling a (un)compression function depending on whether dictionary should be used. I felt that mutation was error-prone so eliminated it.
- Added support for digested uncompression dictionaries (`ZSTD_DDict`) as well. However, this PR does not support reusing them across uncompression calls for the same file. That work is deferred to a later PR when we will store the `ZSTD_DDict` objects in block cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4251
Differential Revision: D9257078
Pulled By: ajkr
fbshipit-source-id: 21b8cb6bbdd48e459f1c62343780ab66c0a64438
Summary:
When returning `kNoEntry` from HashIndex lookup, previously we invalidate the
`biter` by set `current_=restarts_`, so that the search can continue to the next
block in case the search result may reside in the next block.
There is one problem: when we are searching for a missing key, if the search
finds a `kNoEntry` and continue the search to the next block, there is also a
non-trivial possibility that the HashIndex return `kNoEntry` too, and the
expensive index iterator `Next()` will happen several times for nothing.
The solution is that if the hash table returns `kNoEntry`, `SeekForGetImpl()` just search the last restart interval for the key. It will stop at the first key that is large than the seek_key, or to the end of the block, and each case will be handled correctly.
Microbenchmark script:
```
TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=fillseq,readtocache,readmissing \
--cache_size=20000000000 --use_data_block_hash_index={true|false}
```
`readmissing` performance (lower is better):
```
binary: 3.6098 micros/op
hash (before applying diff): 4.1048 micros/op
hash (after applying diff): 3.3502 micros/op
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4296
Differential Revision: D9419159
Pulled By: fgwu
fbshipit-source-id: 21e3eedcccbc47a249aa8eb4bf405c9def0b8a05
Summary:
We want to sample the file I/O issued by RocksDB and report the function calls. This requires us to include the file paths otherwise it's hard to tell what has been going on.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4039
Differential Revision: D8670178
Pulled By: riversand963
fbshipit-source-id: 97ee806d1c583a2983e28e213ee764dc6ac28f7a
Summary:
`BlockBasedTableBuilder::Add()` eventually calls
`DataBlockHashIndexBuilder::EstimateSize()`. The previous implementation
divides the `num_keys` by the `util_ratio_` to get an estimizted `num_buckets`.
Such division is expensive as it happens in every
`BlockBasedTableBuilder::Add()`.
This diff estimates the `num_buckets` by double addition instead of double
division. Specifically, in each `Add()`, we add `bucket_per_key_`
(inverse of `util_ratio_`) to the current `estimiated_num_buckets_`. The cost is
that we are gonna have the `estimated_num_buckets_` stored as one extra field
in the DataBlockHashIndexBuilder.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4293
Differential Revision: D9412472
Pulled By: fgwu
fbshipit-source-id: 2925c8509a401e7bd3c1ab1d9e9c7244755c277a
Summary:
Suppress two CLANG analyze warnings. They don't seem to be real bugs
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4291
Differential Revision: D9407333
Pulled By: siying
fbshipit-source-id: 2ed63d88fa0b217fdccb1572d7508467c2203dc8
Summary:
Followup for #4266. There is one more place in **get_context.cc** where **MergeOperator::ShouldMerge** should be called with reversed list of operands.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4284
Differential Revision: D9380008
Pulled By: sagar0
fbshipit-source-id: 70ec26e607e5b88465e1acbdcd6c6171bd76b9f2
Summary:
Add `--data_block_index_type` and `--data_block_hash_table_util_ratio` option to `db_bench`.
`--data_block_index_type` can be either of `binary` (default) or `binary_and_hash`;
`--data_block_hash_table_util_ratio` will be a double. The default value is `0.75`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4281
Differential Revision: D9361476
Pulled By: fgwu
fbshipit-source-id: dc53e01acef9db81b9eec5e8a96f3bc8ed718c10
Summary:
Add hash index support to data blocks, which helps to reduce the CPU utilization of point-lookup operations. This feature is backward compatible with the data block created without the hash index. It is disabled by default unless `BlockBasedTableOptions::data_block_index_type` is set to `data_block_index_type = kDataBlockBinaryAndHash.`
The DB size would be bigger with the hash index option as a hash table is added at the end of each data block. If the hash utilization ratio is 1:1, the space overhead is one byte per key. The hash table utilization ratio is adjustable using `BlockBasedTableOptions::data_block_hash_table_util_ratio`. A lower utilization ratio will improve more on the point-lookup efficiency, but take more space too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4174
Differential Revision: D8965914
Pulled By: fgwu
fbshipit-source-id: 1c6bae5d1fc39c80282d8890a72e9e67bc247198
Summary:
Add a unit test to check that iterators release data blocks after it has moved away from it. Verify the same for compaction input iterators.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4170
Differential Revision: D8962513
Pulled By: siying
fbshipit-source-id: 05a5b604d7d29887fb488f2cda7286f554a14407
Summary:
After refactoring in https://github.com/facebook/rocksdb/pull/4158 the properties block is written after the index block. This breaks the existing logic in estimating the index size in partitioned indexes. The patch fixes that by using the accurate index block size, which is available since by the time we write the properties block, the index block is already written.
The patch also fixes an issue in estimating the partition size with format_version=3 which was resulting into partitions smaller than the configured metadata_block_size.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4259
Differential Revision: D9274454
Pulled By: maysamyabandeh
fbshipit-source-id: c82d045505cca3e7ed1a44ee1eaa26e4f25a4272
Summary:
Given that index value is a BlockHandle, which is basically an <offset, size> pair we can apply delta encoding on the values. The first value at each index restart interval encoded the full BlockHandle but the rest encode only the size. Refer to IndexBlockIter::DecodeCurrentValue for the detail of the encoding. This reduces the index size which helps using the block cache more efficiently. The feature is enabled with using format_version 4.
The feature comes with a bit of cpu overhead which should be paid back by the higher cache hits due to smaller index block size.
Results with sysbench read-only using 4k blocks and using 16 index restart interval:
Format 2:
19585 rocksdb read-only range=100
Format 3:
19569 rocksdb read-only range=100
Format 4:
19352 rocksdb read-only range=100
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3983
Differential Revision: D8361343
Pulled By: maysamyabandeh
fbshipit-source-id: f882ee082322acac32b0072e2bdbb0b5f854e651
Summary:
If jemalloc is disabled or the API is incorrectly referenced (jemalloc api on windows have a prefix je_) memory usage is incorrectly reported for all block sizes. This is because sizeof(char) is always 1. sizeof() is calculated at compile time and *(char*) is char. The patch uses the size of the slice to fix that.
Fixes#4245
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4246
Differential Revision: D9233958
Pulled By: maysamyabandeh
fbshipit-source-id: 9646933b24504e2814c7379f06a31148829c6b4e
Summary:
RocksDB used to store global_seqno in external SST files written by
SstFileWriter. During file ingestion, RocksDB uses `pwrite` to update the
`global_seqno`. Since random write is not supported in some non-POSIX compliant
file systems, external SST file ingestion is not supported on these file
systems. To address this limitation, we no longer update `global_seqno` during
file ingestion. Later RocksDB uses the MANIFEST and other information in table
properties to deduce global seqno for externally-ingested SST files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4172
Differential Revision: D8961465
Pulled By: riversand963
fbshipit-source-id: 4382ec85270a96be5bc0cf33758ca2b167b05071
Summary:
Added DataBlockIndexType option in BlockBasedTableOptions.
```
enum DataBlockIndexType : char {
kDataBlockBinarySearch = 0, // traditional block type
kDataBlockHashIndex = 1, // additional hash index appended to the end.
};
```
The default type is the traditional binary seek option: `kDataBlockBinarySearch`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4150
Differential Revision: D8895958
Pulled By: fgwu
fbshipit-source-id: 480adef48104cf11d30db3bad9a73f98b4a80c10
Summary:
92ee3350e0 introduces an out-of-bound check in BlockBasedTableIterator::Valid(). However, this flag is not reset when re-seeking in backward direction. This caused the iterator to be invalide by mistake. Fix it by always resetting the out-of-bound flag in every seek.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4187
Differential Revision: D8996600
Pulled By: siying
fbshipit-source-id: b6235ea614f71381e50e7904c4fb036300604ac1
Summary:
The first step of the `DataBlockHashIndex` implementation. A string based hash table is implemented and unit-tested.
`DataBlockHashIndexBuilder`: `Add()` takes pairs of `<key, restart_index>`, and formats it into a string when `Finish()` is called.
`DataBlockHashIndex`: initialized by the formatted string, and can interpret it as a hash table. Lookup for a key is supported by iterator operation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4139
Reviewed By: sagar0
Differential Revision: D8866764
Pulled By: fgwu
fbshipit-source-id: 7f015f0098632c65979a22898a50424384730b10
Summary:
Currently in `Version::Get` when reporting ticker stats stored in `GetContext`, there is a big for-loop through all `Ticker` which adds unnecessary cost to overall CPU usage. We can optimize by storing only ticker values that are used in `Get()` calls in a new struct `GetContextStats` since only a small fraction of all tickers are used in `Get()` calls. For comparison, with the new approach we only need to visit 17 values while old approach will require visiting 100+ `Ticker`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3490
Differential Revision: D6969154
Pulled By: miasantreble
fbshipit-source-id: fc27072965a3a94125a3e6883d20dafcf5b84029
Summary:
Lint is not happy with some new code recently committed. Format them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4161
Differential Revision: D8940582
Pulled By: siying
fbshipit-source-id: c9b43b1ef8c88b5e923911058b44eb77234b36b7
Summary:
Right now we use one hard-coded prefetch size to prefetch data from the tail of the SST files. However, this may introduce a waste for some use cases, while not efficient for others.
Introduce a way to adjust this prefetch size by tracking 32 recent times, and pick a value with which the wasted read is less than 10%
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4156
Differential Revision: D8916847
Pulled By: siying
fbshipit-source-id: 8413f9eb3987e0033ed0bd910f83fc2eeaaf5758
Summary:
The properties meta-block should come at the end since we always need to
read it when opening a file, unlike index/filter/other meta-blocks, which
are sometimes read depending on the user's configuration. This ordering
will allow us to (in a future PR) do a small readahead on the end of the file
to read properties and meta-index blocks with one I/O.
The bulk of this PR is a refactoring of the `BlockBasedTableBuilder::Finish`
function. It was previously too large with inconsistent error handling, which
made it difficult to change. So I broke it up into one function per meta-block
write, and tried to make error handling consistent within those functions.
Then reordering the metablocks was trivial -- just reorder the calls to these
helper functions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4158
Differential Revision: D8921705
Pulled By: ajkr
fbshipit-source-id: 96c9cc3182eb1adf11af46adab79dbeba7b12fcc
Summary:
In all cases during `BlockBasedTable::Open`, we issue at least three read requests to the file's tail: (1) footer, (2) metaindex block, and (3) properties block. Depending on the config, we may also read other metablocks like filter and index.
This PR issues smaller readahead when we expect to do only the three necessary reads mentioned above. Then, 4KB should be enough (ignoring the case where there are lots of user-defined properties). We can keep doing 512KB readahead when additional reads are expected.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4159
Differential Revision: D8924002
Pulled By: ajkr
fbshipit-source-id: cfc713275de4d05ce11f18571f1d72e27ccd3356
Summary:
Refactor IndexBlockIter to reduce conditional branches on key_includes_seq_. IndexBlockIter::Prev is also separated from DataBlockIter::Prev, not to cache the prev entries as they are of less importance when iterating over the index block.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4141
Differential Revision: D8866437
Pulled By: maysamyabandeh
fbshipit-source-id: fdac76880426fc2be7d3c6354c09ab98f6657d4b
Summary:
Some logic only related to IndexBlockIter is separated from BlockIter to IndexBlockIter. This is done by writing an exclusive Seek() and SeekForPrev() for DataBlockIter, and all metadata block iter and tombstone block iter now use data block iter. Dealing with the BinarySeek() sharing problem by passing in the comparator to use.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4136
Reviewed By: maysamyabandeh
Differential Revision: D8859673
Pulled By: siying
fbshipit-source-id: 703e5e6824b82b7cbf4721f3594b94127797ca9e
Summary:
Fixes#3391.
This change adds a `DeleteRange` method to `SstFileWriter` and adds
support for ingesting SSTs with range deletion tombstones. This is
important for applications that need to atomically ingest SSTs while
clearing out any existing keys in a given key range.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3778
Differential Revision: D8821836
Pulled By: anand1976
fbshipit-source-id: ca7786c1947ff129afa703dab011d524c7883844
Summary:
The patch makes sure that two parallel test threads will operate on different db paths. This enables using open source tools such as gtest-parallel to run the tests of a file in parallel.
Example: ``` ~/gtest-parallel/gtest-parallel ./table_test```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4135
Differential Revision: D8846653
Pulled By: maysamyabandeh
fbshipit-source-id: 799bad1abb260e3d346bcb680d2ae207a852ba84
Summary:
- Avoid `strdup` to use jemalloc on Windows
- Use `size_t` for consistency
- Add GCC 8 to Travis
- Add CMAKE_BUILD_TYPE=Release to Travis
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3433
Differential Revision: D6837948
Pulled By: sagar0
fbshipit-source-id: b8543c3a4da9cd07ee9a33f9f4623188e233261f
Summary:
BlockIter is getting crowded including details that specific only to either index or data blocks. The patch moves down such details to DataBlockIter and IndexBlockIter, both inheriting from BlockIter.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4121
Differential Revision: D8816832
Pulled By: maysamyabandeh
fbshipit-source-id: d492e74155c11d8a0c1c85cd7ee33d24c7456197
Summary:
Run the basic range deletion tests against the standard set of
configurations. This testing exposed that files with hash indexes and
partitioned indexes were not handling the case where the file contained
only range deletions--i.e., where the index was empty.
Additionally file a TODO about the fact that range deletions are broken
when allow_mmap_reads = true is set.
/cc ajkr nvanbenschoten
Best viewed with ?w=1: https://github.com/facebook/rocksdb/pull/4021/files?w=1
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4021
Differential Revision: D8811860
Pulled By: ajkr
fbshipit-source-id: 3cc07e6d6210a2a00b932866481b3d5c59775343
Summary:
This was caught by crash test, and the following is a simple way to reproduce it and verify the fix.
One way to trigger this code path is to use the following configuration:
- Compress SST file
- Enable direct IO and prefetch buffer
- Do NOT use compressed block cache
Closes https://github.com/facebook/rocksdb/pull/4096
Differential Revision: D8742009
Pulled By: riversand963
fbshipit-source-id: f13381078bbb0dce92f60bd313a78ab602bcacd2
Summary:
Increase the size of each shard so that the number of cache hit/miss match
expectation. Otherwise FilterBlockInBlockCache test will fail.
Closes https://github.com/facebook/rocksdb/pull/4090
Differential Revision: D8736158
Pulled By: riversand963
fbshipit-source-id: 5cdbc06b02390389fd5b72a6d251d88949ad3d91
Summary:
We can have prefetch_index without prefetch_filter but not the other way around. The assert statement is fixed.
Closes https://github.com/facebook/rocksdb/pull/4077
Differential Revision: D8694472
Pulled By: maysamyabandeh
fbshipit-source-id: ccd2804d9d9cdafb1c3e65062c7bc38603e69004
Summary:
Currently the block cache is charged only by the size of the raw data block and excludes the overhead of the c++ objects that contain the raw data block. The patch improves the accuracy of the charge by including the c++ object overhead into it.
Closes https://github.com/facebook/rocksdb/pull/4073
Differential Revision: D8686552
Pulled By: maysamyabandeh
fbshipit-source-id: 8472f7fc163c0644533bc6942e20cdd5725f520f
Summary:
with https://github.com/facebook/rocksdb/pull/3601 and https://github.com/facebook/rocksdb/pull/3899, `prefix_extractor_` is not really being used in block based filter and full filter's version of `PrefixMayMatch` because now `prefix_extractor` is passed as an argument. Also it is now possible that prefix_extractor_ may be initialized to nullptr when a non-standard prefix_extractor is used and also for ROCKSDB_LITE. Removing these checks should not break any existing tests.
Closes https://github.com/facebook/rocksdb/pull/4067
Differential Revision: D8669002
Pulled By: miasantreble
fbshipit-source-id: 0e701ba912b8a26734fadb72d15bb1b266b6176a
Summary:
…ression
For `CompressionType` we have options `compression` and `bottommost_compression`. Thus, to make the compression options consitent with the compression type when bottommost_compression is enabled, we add the bottommost_compression_opts
Closes https://github.com/facebook/rocksdb/pull/3985
Reviewed By: riversand963
Differential Revision: D8385911
Pulled By: zhichao-cao
fbshipit-source-id: 07bc533dd61bcf1cef5927d8d62901c13d38d5fc
Summary:
https://github.com/facebook/rocksdb/pull/3881 fixed a bug where PinnableSlice pin mmap files which could be deleted with background compaction. This is however a non-issue for ReadOnlyDB when there is no compaction running and max_open_files is -1. This patch reenables the pinning feature for that case.
Closes https://github.com/facebook/rocksdb/pull/4053
Differential Revision: D8662546
Pulled By: maysamyabandeh
fbshipit-source-id: 402962602eb0f644e17822748332999c3af029fd
Summary:
This is implemented by extending ReadCallback with another function `MaxUnpreparedSequenceNumber` which returns the largest visible sequence number for the current transaction, if there is uncommitted data written to DB. Otherwise, it returns zero, indicating no uncommitted data.
There are the places where reads had to be modified.
- Get and Seek/Next was just updated to seek to max(snapshot_seq, MaxUnpreparedSequenceNumber()) instead, and iterate until a key was visible.
- Prev did not need need updates since it did not use the Seek to sequence number optimization. Assuming that locks were held when writing unprepared keys, and ValidateSnapshot runs, there should only be committed keys and unprepared keys of the current transaction, all of which are visible. Prev will simply iterate to get the last visible key.
- Reseeking to skip keys optimization was also disabled for write unprepared, since it's possible to hit the max_skip condition even while reseeking. There needs to be some way to resolve infinite looping in this case.
Closes https://github.com/facebook/rocksdb/pull/3955
Differential Revision: D8286688
Pulled By: lth
fbshipit-source-id: 25e42f47fdeb5f7accea0f4fd350ef35198caafe
Summary:
Add a new table property, rocksdb.num.range-deletions, which tracks the
number of range deletions in a block-based table. Range deletions are no
longer counted in rocksdb.num.entries; as discovered in PR #3778, there
are various code paths that implicitly assume that rocksdb.num.entries
counts only true keys, not range deletions.
/cc ajkr nvanbenschoten
Closes https://github.com/facebook/rocksdb/pull/4016
Differential Revision: D8527575
Pulled By: ajkr
fbshipit-source-id: 92e7edbe78fda53756a558013c9fb496e7764fd7
Summary:
Previously in https://github.com/facebook/rocksdb/pull/3601 bloom filter will only be checked if `prefix_extractor` in the mutable_cf_options matches the one found in the SST file.
This PR relaxes the requirement by checking if all keys in the range [user_key, iterate_upper_bound) all share the same prefix after transforming using the BF in the SST file. If so, the bloom filter is considered compatible and will continue to be looked at.
Closes https://github.com/facebook/rocksdb/pull/3899
Differential Revision: D8157459
Pulled By: miasantreble
fbshipit-source-id: 18d17cba56a1005162f8d5db7a27aba277089c41
Summary:
Pass in `for_compaction` to `BlockBasedTableIterator` via `BlockBasedTableReader::NewIterator`.
In 7103559f49, `for_compaction` was set in `BlockBasedTable::Rep` via `BlockBasedTable::SetupForCompaction`. In hindsight it was not the right decision; it also caused TSAN to complain.
Closes https://github.com/facebook/rocksdb/pull/4048
Differential Revision: D8601056
Pulled By: sagar0
fbshipit-source-id: 30127e898c15c38c1080d57710b8c5a6d64a0ab3
Summary:
Top-level index in partitioned index/filter blocks are small and could be pinned in memory. So far we use that by cache_index_and_filter_blocks to false. This however make it difficult to keep account of the total memory usage. This patch introduces pin_top_level_index_and_filter which in combination with cache_index_and_filter_blocks=true keeps the top-level index in cache and yet pinned them to avoid cache misses and also cache lookup overhead.
Closes https://github.com/facebook/rocksdb/pull/4037
Differential Revision: D8596218
Pulled By: maysamyabandeh
fbshipit-source-id: 3a5f7f9ca6b4b525b03ff6bd82354881ae974ad2
Summary:
This PR extends the improvements in #3282 to also work when using Direct IO.
We see **4.5X performance improvement** in seekrandom benchmark doing long range scans, when using direct reads, on flash.
**Description:**
This change improves the performance of iterators doing long range scans (e.g. big/full index or table scans in MyRocks) by using readahead and prefetching additional data on each disk IO, and storing in a local buffer. This prefetching is automatically enabled on noticing more than 2 IOs for the same table file during iteration. The readahead size starts with 8KB and is exponentially increased on each additional sequential IO, up to a max of 256 KB. This helps in cutting down the number of IOs needed to complete the range scan.
**Implementation Details:**
- Used `FilePrefetchBuffer` as the underlying buffer to store the readahead data. `FilePrefetchBuffer` can now take file_reader, readahead_size and max_readahead_size as input to the constructor, and automatically do readahead.
- `FilePrefetchBuffer::TryReadFromCache` can now call `FilePrefetchBuffer::Prefetch` if readahead is enabled.
- `AlignedBuffer` (which is the underlying store for `FilePrefetchBuffer`) now takes a few additional args in `AlignedBuffer::AllocateNewBuffer` to allow copying data from the old buffer.
- Made sure not to re-read partial chunks of data that were already available in the buffer, from device again.
- Fixed a couple of cases where `AlignedBuffer::cursize_` was not being properly kept up-to-date.
**Constraints:**
- Similar to #3282, this gets currently enabled only when ReadOptions.readahead_size = 0 (which is the default value).
- Since the prefetched data is stored in a temporary buffer allocated on heap, this could increase the memory usage if you have many iterators doing long range scans simultaneously.
- Enabled only for user reads, and disabled for compactions. Compaction reads are controlled by the options `use_direct_io_for_flush_and_compaction` and `compaction_readahead_size`, and the current feature takes precautions not to mess with them.
**Benchmarks:**
I used the same benchmark as used in #3282.
Data fill:
```
TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=fillrandom -num=1000000000 -compression_type="none" -level_compaction_dynamic_level_bytes
```
Do a long range scan: Seekrandom with large number of nexts
```
TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=seekrandom -use_direct_reads -duration=60 -num=1000000000 -use_existing_db -seek_nexts=10000 -statistics -histogram
```
```
Before:
seekrandom : 37939.906 micros/op 26 ops/sec; 29.2 MB/s (1636 of 1999 found)
With this change:
seekrandom : 8527.720 micros/op 117 ops/sec; 129.7 MB/s (6530 of 7999 found)
```
~4.5X perf improvement. Taken on an average of 3 runs.
Closes https://github.com/facebook/rocksdb/pull/3884
Differential Revision: D8082143
Pulled By: sagar0
fbshipit-source-id: 4d7a8561cbac03478663713df4d31ad2620253bb
Summary:
https://github.com/facebook/rocksdb/pull/3764 introduced an optimization feature to skip duplicate prefix entires in full bloom filters. Unfortunately it also introduces a bug in partitioned full filters, where the duplicate prefix should still be inserted if it is in a new partition. The patch fixes the bug by resetting the duplicate detection logic each time a partition is cut.
This bug could result into false negatives, which means that DB could skip an existing key.
Closes https://github.com/facebook/rocksdb/pull/4024
Differential Revision: D8518866
Pulled By: maysamyabandeh
fbshipit-source-id: 044f4d988e606a330ecafd8c79daceb68b8796bf
Summary:
b555ed30a4 makes the BlockBasedTableIterator to be invalidated if the current position if over the upper bound. However, this can bring performance regression to the case of multiple Seek()s hitting the same data block but all out of upper bound.
For example, if an SST file has a data block containing following keys : {a, z}
The user sets the upper bound to be "x", and it executed following queries:
Seek("b")
Seek("c")
Seek("d")
Before the upper bound optimization, these queries always come to this same current data block of the iterator, but now inside each Seek() the data block is read from the block cache but is returned again.
To prevent this regression case, we keep the current data block iterator if it is upper bound.
Closes https://github.com/facebook/rocksdb/pull/4004
Differential Revision: D8463192
Pulled By: siying
fbshipit-source-id: 8710628b30acde7063a097c3184d6c4333a8ef81
Summary:
The Block object assumes contents are uncompressed. Block's constructor tries to read the number of restarts, but does not get an accurate number when its contents are compressed, which is causing issues like https://github.com/facebook/rocksdb/issues/3843.
This PR address this issue by skipping reconstruction of restart points when blocks are known to be compressed. Somehow the restart points can be read directly when Snappy is used and some tests (for example https://github.com/facebook/rocksdb/blob/master/db/db_block_cache_test.cc#L196) expects blocks to be fully constructed even when Snappy compression is used, so here we keep the restart point logic for Snappy.
Closes https://github.com/facebook/rocksdb/pull/3996
Differential Revision: D8416186
Pulled By: miasantreble
fbshipit-source-id: 002c0b62b9e5d89fb7736563d354ce0023c8cb28
Summary:
A recent change pushed down the upper bound checking to child iterators. However, this causes the logic of following sequence wrong:
Seek(key);
if (!Valid()) SeekToLast();
Because !Valid() may be caused by upper bounds, rather than the end of the iterator. In this case SeekToLast() points to totally wrong places. This can cause wrong results, infinite loops, or segfault in some cases.
This sequence is called when changing direction from forward to backward. And this by itself also implicitly happen during reseeking optimization in Prev().
Fix this bug by using SeekForPrev() rather than this sequuence, as what is already done in prefix extrator case.
Closes https://github.com/facebook/rocksdb/pull/3989
Differential Revision: D8385422
Pulled By: siying
fbshipit-source-id: 429e869990cfd2dc389421e0836fc496bed67bb4
Summary:
The sixth argument should be `key_includes_seq` bool, the seventh a `GetContext*`. We were mistakenly passing the `GetContext*` as the sixth argument and relying on the default (nullptr) for the seventh. This would make statistics inaccurate, at least.
Blame: 402b7aa0
Closes https://github.com/facebook/rocksdb/pull/3974
Differential Revision: D8344907
Pulled By: ajkr
fbshipit-source-id: 3ad865a0541d6d30f75dfc726352788118cfe12e
Summary:
Property block will be read sequentially and cached in a heap located
object, so there's no need for restart points. Thus we set the restart
interval to infinity to save space.
Closes https://github.com/facebook/rocksdb/pull/3970
Differential Revision: D8332586
Pulled By: fgwu
fbshipit-source-id: 899c3267832a81d0f084ec2db6b387332f461134
Summary:
BadOptions test creates a temporary db path changed to
table_block_based_bad_options_test to avoid collide with that created by
the PrefixAndWholeKeyTest
Closes https://github.com/facebook/rocksdb/pull/3965
Differential Revision: D8316080
Pulled By: fgwu
fbshipit-source-id: bb8e0fdfdb9abf0e5ce94494b4388cd1622ee032
Summary:
format_version 3 changes the format of index blocks by storing user keys instead of the internal keys, which saves 8-bytes per key. This patch extends the format to top-level indexes in partitioned index/filters.
Closes https://github.com/facebook/rocksdb/pull/3958
Differential Revision: D8294615
Pulled By: maysamyabandeh
fbshipit-source-id: 17666cc16b8076c363972e2308e31547e835f0fe
Summary:
PR https://github.com/facebook/rocksdb/pull/3838 made some changes that triggers lint warnings.
Run `make format` to fix formatting as suggested by siying .
Also piggyback two changes:
1) fix singleton destruction order for windows and posix env
2) fix two clang warnings
Closes https://github.com/facebook/rocksdb/pull/3954
Differential Revision: D8272041
Pulled By: miasantreble
fbshipit-source-id: 7c4fd12bd17aac13534520de0c733328aa3c6c9f
Summary:
This fixes a regression in one of myrocks regression tests (readwhilewriting), introduced in 8bf555f487
This PR changes two lines of code: one of them actually fixes the observed regression, the other is a mostly unrelated small fix that I'm piggy-backing here. EDIT: Nevermind, it fixes one line. More details in inline comments.
Closes https://github.com/facebook/rocksdb/pull/3953
Differential Revision: D8270664
Pulled By: al13n321
fbshipit-source-id: a7d91e196807d1e816551591257c700f70e4ccac
Summary:
format_version=3 changes the format of SST index. This is however not being tested currently since tests only work with the default format_version which is currently 2. The patch extends the most related tests to also test for format_version=3.
Closes https://github.com/facebook/rocksdb/pull/3942
Differential Revision: D8238413
Pulled By: maysamyabandeh
fbshipit-source-id: 915725f55753dd8e9188e802bf471c23645ad035
Summary:
Windows does not have LD_PRELOAD mechanism to override all memory allocation functions and ZSTD makes use of C-tuntime calloc. During flushes and compactions default system allocator fragments and the system slows down considerably.
For builds with jemalloc we employ an advanced ZSTD context creation API that re-directs memory allocation to jemalloc. To reduce the cost of context creation on each block we cache ZSTD context within the block based table builder while a new SST file is being built, this will help all platform builds including those w/o jemalloc. This avoids system allocator fragmentation and improves the performance.
The change does not address random reads and currently on Windows reads with ZSTD regress as compared with SNAPPY compression.
Closes https://github.com/facebook/rocksdb/pull/3838
Differential Revision: D8229794
Pulled By: miasantreble
fbshipit-source-id: 719b622ab7bf4109819bc44f45ec66f0dd3ee80d
Summary:
For iterator reads, a `SuperVersion` is pinned to preserve a snapshot of SST files, and `Block`s are pinned to allow `key()` and `value()` to return pointers directly into a RocksDB memory region. This works for both non-mmap reads, where the block owns the memory region, and mmap reads, where the file owns the memory region.
For point reads with `PinnableSlice`, only the `Block` object is pinned. This works for non-mmap reads because the block owns the memory region, so even if the file is deleted after compaction, the memory region survives. However, for mmap reads, file deletion causes the memory region to which the `PinnableSlice` refers to be unmapped. The result is usually a segfault upon accessing the `PinnableSlice`, although sometimes it returned wrong results (I repro'd this a bunch of times with `db_stress`).
This PR copies the value into the `PinnableSlice` when it comes from mmap'd memory. We can tell whether the `Block` owns its memory using `Block::cachable()`, which is unset when reads do not use the provided buffer as is the case with mmap file reads. When that is false we ensure the result of `Get()` is copied.
This feels like a short-term solution as ideally we'd have the `PinnableSlice` pin the mmap'd memory so we can do zero-copy reads. It seemed hard so I chose this approach to fix correctness in the meantime.
Closes https://github.com/facebook/rocksdb/pull/3881
Differential Revision: D8076288
Pulled By: ajkr
fbshipit-source-id: 31d78ec010198723522323dbc6ea325122a46b08
Summary:
This should resolve the performance regression caused by the unnecessary copying of the shared_ptr.
Closes https://github.com/facebook/rocksdb/pull/3937
Differential Revision: D8232330
Pulled By: miasantreble
fbshipit-source-id: 7885bf7cd190b6f87164c52d6edd328298c13f97
Summary:
DBTestBase::OptionConfig includes the scenarios that unit tests could iterate over them by calling ChangeOptions(). Some of the options have been mistakenly put after kEnd which makes them essentially invisible to ChangeOptions() caller. This patch fixes it except for kUniversalSubcompactions which is left as TODO since it would break some unit tests.
Closes https://github.com/facebook/rocksdb/pull/3935
Differential Revision: D8230748
Pulled By: maysamyabandeh
fbshipit-source-id: edddb8fffcd161af1809fef24798ce118f8593db
Summary:
The very old sst formats do not have table_properties and rep_->table_properties is thus nullptr. The recent patch in https://github.com/facebook/rocksdb/pull/3894 does not check for nullptr and hence makes it backward incompatible. This patch adds the check.
Closes https://github.com/facebook/rocksdb/pull/3918
Differential Revision: D8188638
Pulled By: maysamyabandeh
fbshipit-source-id: b1d986665ecf0b4d1c442adfa8a193b97707d47b
Summary:
Index blocks have the same format as data blocks. The keys therefore similarly to the keys in the data blocks are internal keys, which means that in addition to the user key it also has 8 bytes that encodes sequence number and value type. This extra 8 bytes however is not necessary in index blocks since the index keys act as an separator between two data blocks. The only exception is when the last key of a block and the first key of the next block share the same user key, in which the sequence number is required to act as a separator.
The patch excludes the sequence from index keys only if the above special case does not happen for any of the index keys. It then records that in the property block. The reader looks at the property block to see if it should expect sequence numbers in the keys of the index block.s
Closes https://github.com/facebook/rocksdb/pull/3894
Differential Revision: D8118775
Pulled By: maysamyabandeh
fbshipit-source-id: 915479f028b5799ca91671d67455ecdefbd873bd
Summary:
This was missed in a refactor of `ReadBlockContents` (2f1a3a4).
Closes https://github.com/facebook/rocksdb/pull/3906
Differential Revision: D8172648
Pulled By: ajkr
fbshipit-source-id: 27e453b19795fea974bfed4721105be6f3a12090
Summary:
Please refer to earlier discussion in [issue 3609](https://github.com/facebook/rocksdb/issues/3609).
There was also an alternative fix in [PR 3888](https://github.com/facebook/rocksdb/pull/3888), but the proposed solution requires complex change.
To summarize the cause of the problem. Upon creation of a column family, a `BlockBasedTableFactory` object is `new`ed and encapsulated by a `std::shared_ptr`. Since there is no other `std::shared_ptr` pointing to this `BlockBasedTableFactory`, when the column family is dropped, the `ColumnFamilyData` is `delete`d, causing the destructor of `std::shared_ptr`. Since there is no other `std::shared_ptr`, the underlying memory is also freed.
Later when the db exits, it releases all the table readers, including the table readers that have been operating on the dropped column family. This needs to access the `table_options` owned by `BlockBasedTableFactory` that has already been deleted. Therefore, a segfault is raised.
Previous workaround is to purge all obsolete files upon `ColumnFamilyData` destruction, which leads to a force release of table readers of the dropped column family. However this does not work when the user disables file deletion.
Our solution in this PR is making a copy of `table_options` in `BlockBasedTable::Rep`. This solution increases memory copy and usage, but is much simpler.
Test plan
```
$ make -j16
$ ./column_family_test --gtest_filter=ColumnFamilyTest.CreateDropAndDestroy:ColumnFamilyTest.CreateDropAndDestroyWithoutFileDeletion
```
Expected behavior:
All tests should pass.
Closes https://github.com/facebook/rocksdb/pull/3898
Differential Revision: D8149421
Pulled By: riversand963
fbshipit-source-id: eaecc2e064057ef607fbdd4cc275874f866c3438
Summary:
Currently when ldb built from master tries to open a DB from version 2.2, there will be a segfault because table_properties didn't exist back then.
Closes https://github.com/facebook/rocksdb/pull/3890
Differential Revision: D8100914
Pulled By: miasantreble
fbshipit-source-id: b255e8aedc54695432be2e704839c857dabdd65a
Summary:
Currently it is not possible to change bloom filter config without restart the db, which is causing a lot of operational complexity for users.
This PR aims to make it possible to dynamically change bloom filter config.
Closes https://github.com/facebook/rocksdb/pull/3601
Differential Revision: D7253114
Pulled By: miasantreble
fbshipit-source-id: f22595437d3e0b86c95918c484502de2ceca120c
Summary:
`RangeDelAggregator` holds the pointers returned by `BlockIter::key()` and `BlockIter::value()` so requires the data to which they point is pinned. `BlockIter::key()` points into block memory and is guaranteed to be pinned if and only if prefix encoding is disabled (or, equivalently, restart interval is set to one). I think `BlockIter::value()` is always pinned. Added an assert for these and removed the wrong TODO about increasing restart interval, which would enable key prefix encoding and break the assertion.
Closes https://github.com/facebook/rocksdb/pull/3875
Differential Revision: D8063667
Pulled By: ajkr
fbshipit-source-id: 60b5ebcc0cdd610dd6aad9e74a23378793672c41
Summary:
Right now, every Block::NewIterator() reads num_restarts_ from the block, which is already read in Block::Block(). This sometimes cause a CPU cache miss. Although fetching this cacheline can usually benefit follow-up block restart offset reading, as they are close to each other, it's almost free to get ride of this read by storing it in the Block class.
Closes https://github.com/facebook/rocksdb/pull/3869
Differential Revision: D8052493
Pulled By: siying
fbshipit-source-id: 9c72360f0c2d7329f3c198ce4eaedd2bc14b87c1
Summary:
Before this PR, Iterator/InternalIterator may simultaneously have non-ok status() and Valid() = true. That state means that the last operation failed, but the iterator is nevertheless positioned on some unspecified record. Likely intended uses of that are:
* If some sst files are corrupted, a normal iterator can be used to read the data from files that are not corrupted.
* When using read_tier = kBlockCacheTier, read the data that's in block cache, skipping over the data that is not.
However, this behavior wasn't documented well (and until recently the wiki on github had misleading incorrect information). In the code there's a lot of confusion about the relationship between status() and Valid(), and about whether Seek()/SeekToLast()/etc reset the status or not. There were a number of bugs caused by this confusion, both inside rocksdb and in the code that uses rocksdb (including ours).
This PR changes the convention to:
* If status() is not ok, Valid() always returns false.
* Any seek operation resets status. (Before the PR, it depended on iterator type and on particular error.)
This does sacrifice the two use cases listed above, but siying said it's ok.
Overview of the changes:
* A commit that adds missing status checks in MergingIterator. This fixes a bug that actually affects us, and we need it fixed. `DBIteratorTest.NonBlockingIterationBugRepro` explains the scenario.
* Changes to lots of iterator types to make all of them conform to the new convention. Some bug fixes along the way. By far the biggest changes are in DBIter, which is a big messy piece of code; I tried to make it less big and messy but mostly failed.
* A stress-test for DBIter, to gain some confidence that I didn't break it. It does a few million random operations on the iterator, while occasionally modifying the underlying data (like ForwardIterator does) and occasionally returning non-ok status from internal iterator.
To find the iterator types that needed changes I searched for "public .*Iterator" in the code. Here's an overview of all 27 iterator types:
Iterators that didn't need changes:
* status() is always ok(), or Valid() is always false: MemTableIterator, ModelIter, TestIterator, KVIter (2 classes with this name anonymous namespaces), LoggingForwardVectorIterator, VectorIterator, MockTableIterator, EmptyIterator, EmptyInternalIterator.
* Thin wrappers that always pass through Valid() and status(): ArenaWrappedDBIter, TtlIterator, InternalIteratorFromIterator.
Iterators with changes (see inline comments for details):
* DBIter - an overhaul:
- It used to silently skip corrupted keys (`FindParseableKey()`), which seems dangerous. This PR makes it just stop immediately after encountering a corrupted key, just like it would for other kinds of corruption. Let me know if there was actually some deeper meaning in this behavior and I should put it back.
- It had a few code paths silently discarding subiterator's status. The stress test caught a few.
- The backwards iteration code path was expecting the internal iterator's set of keys to be immutable. It's probably always true in practice at the moment, since ForwardIterator doesn't support backwards iteration, but this PR fixes it anyway. See added DBIteratorTest.ReverseToForwardBug for an example.
- Some parts of backwards iteration code path even did things like `assert(iter_->Valid())` after a seek, which is never a safe assumption.
- It used to not reset status on seek for some types of errors.
- Some simplifications and better comments.
- Some things got more complicated from the added error handling. I'm open to ideas for how to make it nicer.
* MergingIterator - check status after every operation on every subiterator, and in some places assert that valid subiterators have ok status.
* ForwardIterator - changed to the new convention, also slightly simplified.
* ForwardLevelIterator - fixed some bugs and simplified.
* LevelIterator - simplified.
* TwoLevelIterator - changed to the new convention. Also fixed a bug that would make SeekForPrev() sometimes silently ignore errors from first_level_iter_.
* BlockBasedTableIterator - minor changes.
* BlockIter - replaced `SetStatus()` with `Invalidate()` to make sure non-ok BlockIter is always invalid.
* PlainTableIterator - some seeks used to not reset status.
* CuckooTableIterator - tiny code cleanup.
* ManagedIterator - fixed some bugs.
* BaseDeltaIterator - changed to the new convention and fixed a bug.
* BlobDBIterator - seeks used to not reset status.
* KeyConvertingIterator - some small change.
Closes https://github.com/facebook/rocksdb/pull/3810
Differential Revision: D7888019
Pulled By: al13n321
fbshipit-source-id: 4aaf6d3421c545d16722a815b2fa2e7912bc851d
Summary:
Now BlockBasedTableIterator directly uses BlockIter. By making BlockIter final, we can prevent unintended virtual function overriding.
Closes https://github.com/facebook/rocksdb/pull/3828
Differential Revision: D7933816
Pulled By: siying
fbshipit-source-id: 026a08cb5c5b6d3d6f44743152b4251da4756f2c
Summary:
Currently HarnessTest.Randomized is already split but some of the splits are faster than the others. The reason is that each split takes a continuous range of the generated args and the test with later args takes longer to finish. The patch evenly split the args among splits in a round robin fashion.
Before:
```
[ OK ] HarnessTest.Randomized1n2 (2278 ms)
[ OK ] HarnessTest.Randomized3n4 (1095 ms)
[ OK ] HarnessTest.Randomized5 (658 ms)
[ OK ] HarnessTest.Randomized6 (1258 ms)
[ OK ] HarnessTest.Randomized7 (6476 ms)
[ OK ] HarnessTest.Randomized8 (8182 ms)
```
After
```
[ OK ] HarnessTest.Randomized1 (2649 ms)
[ OK ] HarnessTest.Randomized2 (2645 ms)
[ OK ] HarnessTest.Randomized3 (2577 ms)
[ OK ] HarnessTest.Randomized4 (2490 ms)
[ OK ] HarnessTest.Randomized5 (2553 ms)
[ OK ] HarnessTest.Randomized6 (2560 ms)
[ OK ] HarnessTest.Randomized7 (2501 ms)
[ OK ] HarnessTest.Randomized8 (2574 ms)
```
Closes https://github.com/facebook/rocksdb/pull/3808
Differential Revision: D7882663
Pulled By: maysamyabandeh
fbshipit-source-id: 09b749a9684b6d7d65466aa4b00c5334a49e833e
Summary:
When the dummy record insertion fails, there is no need to explicitly delete the block as it will be registered for cleanup regardless.
Closes https://github.com/facebook/rocksdb/pull/3688
Differential Revision: D7537741
Pulled By: miasantreble
fbshipit-source-id: fcd3a3d3d382ee8e2c7ced0a4980e683d93a16d6
Summary:
7a6353bd1c prevents empty filter blocks from being written for SST files containing range deletions only. However the assertion this PR removes is still a problem as we could be reading from a DB generated by a RocksDB build without the 7a6353bd1c patch. So remove the assertion. We already don't do this check when `cache_index_and_filter_blocks=false`, so it should be safe.
Closes https://github.com/facebook/rocksdb/pull/3773
Differential Revision: D7769964
Pulled By: ajkr
fbshipit-source-id: 7285762446f2cd2ccf16efd7a988a106fbb0d8d3
Summary:
bc0da4b512 optimized bloom filters by skipping duplicate entires when the whole key and prefixes are both added to the bloom. It however used empty string as the initial value of the last entry added to the bloom. This is incorrect since empty key/prefix are valid entires by themselves. This patch fixes that.
Closes https://github.com/facebook/rocksdb/pull/3776
Differential Revision: D7778803
Pulled By: maysamyabandeh
fbshipit-source-id: d5a065daebee17f9403cac51e9d5626aac87bfbc
Summary:
Currently we rely on FilterBitsBuilder to skip the duplicate keys. It does that by comparing that hash of the key to the hash of the last added entry. This logic breaks however when we have whole_key_filtering mixed with prefix blooms as their addition to FilterBitsBuilder will be interleaved. The patch fixes that by comparing the last whole key and last prefix with the whole key and prefix of the new key respectively and skip the call to FilterBitsBuilder if it is a duplicate.
Closes https://github.com/facebook/rocksdb/pull/3764
Differential Revision: D7744413
Pulled By: maysamyabandeh
fbshipit-source-id: 15df73bbbafdfd754d4e1f42ea07f47b03bc5eb8
Summary:
Currently the partitioned index iterator creates a new ReadOptions which ignores the fill_cache config set to ReadOptions passed by the user. The patch propagates fill_cache from the user's ReadOptions to that of partition index iterator.
Also it clarifies the contract of fill_cache that i) it does not apply to filters, ii) it still charges block cache for the size of the data block, it still pin the block if it is already in the block cache.
Closes https://github.com/facebook/rocksdb/pull/3739
Differential Revision: D7678308
Pulled By: maysamyabandeh
fbshipit-source-id: 53ed96424ae922e499e2d4e3580ddc3f0db893da