Summary:
Closes - https://github.com/facebook/rocksdb/issues/7710
I tested this on an Apple DTK (Developer Transition Kit) with an Apple A12Z Bionic CPU and macOS Big Sur (11.0.1).
Previously the arm64-specific CRC optimisations were limited to Linux. Apple Silicon is also arm64 but runs macOS ;-)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7714
Reviewed By: ltamasi
Differential Revision: D25287349
Pulled By: pdillinger
fbshipit-source-id: 639b168bf0ac2652907531e9604936ac4974b577
Summary:
Issue: https://github.com/facebook/rocksdb/issues/7042
Without a runtime check for PMULL, the code hits SIGILL on a Raspberry Pi 4.
Use 'getauxval' to read the hardware capabilities (HWCAP) and detect at
runtime whether the target platform supports PMULL.
Also handle the case where the target platform supports crc32 but not PMULL.
In that case the code should still use the crc32 instructions
rather than skipping all hardware crc32 instructions.
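The fallback order described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: CrcPath, SelectCrcPath and DetectCrcPath are hypothetical names, and the HWCAP query is only compiled on Linux/aarch64 (elsewhere the sketch simply reports the software path).

```cpp
#if defined(__linux__) && defined(__aarch64__)
#include <asm/hwcap.h>  // HWCAP_CRC32, HWCAP_PMULL
#include <sys/auxv.h>   // getauxval, AT_HWCAP
#endif

// Hypothetical names, for illustration only.
enum class CrcPath { kSoftware, kCrc32Only, kCrc32WithPmull };

// PMULL is only an extra on top of crc32: without crc32 fall back to
// software, with crc32 but no PMULL still use the crc32 instructions.
inline CrcPath SelectCrcPath(bool has_crc32, bool has_pmull) {
  if (!has_crc32) return CrcPath::kSoftware;
  return has_pmull ? CrcPath::kCrc32WithPmull : CrcPath::kCrc32Only;
}

inline CrcPath DetectCrcPath() {
#if defined(__linux__) && defined(__aarch64__)
  // Ask the kernel which optional instructions this CPU implements.
  const unsigned long hwcap = getauxval(AT_HWCAP);
  return SelectCrcPath((hwcap & HWCAP_CRC32) != 0,
                       (hwcap & HWCAP_PMULL) != 0);
#else
  // Assumption of this sketch: no arm64 hardware path on other targets.
  return CrcPath::kSoftware;
#endif
}
```

Keeping the decision in a pure function makes the "crc32 without PMULL" case easy to unit test on any machine.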
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7233
Reviewed By: jay-zhuang
Differential Revision: D23790116
fbshipit-source-id: a3ebd821fbd4a38dd2f59064adbb7c3013ee8140
Summary:
UBSAN shows the following warning:
util/crc32c_arm64.cc:111:11: runtime error: load of misaligned address 0x00001afcda86 for type 'const uint64_t', which requires 8 byte alignment
0x00001afcda86: note: pointer points here
cc c1 2d 00 01 81 40 24 30 66 39 66 30 37 30 63 2d 32 36 63 34 2d 34 62 61 61 2d 38 35 33 31 2d
^
Suppress it, just as we already do in the x86 CRC code.
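For context, the warning comes from reading a uint64_t through a pointer that is not 8-byte aligned. The change here suppresses the report; a different technique, shown only as a sketch, is to perform the load through memcpy, which has no alignment requirement. LoadU64Unaligned is a hypothetical helper name and is not part of the source.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: read 8 bytes from any address without requiring
// 8-byte alignment. Compilers typically turn this into a single load on
// targets that allow unaligned access.
inline uint64_t LoadU64Unaligned(const void* p) {
  uint64_t v;
  std::memcpy(&v, p, sizeof(v));
  return v;
}
```

A call such as LoadU64Unaligned(buf + 1) is then well defined even when buf + 1 is misaligned.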
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6827
Test Plan: Run the same UBSAN test and confirm that it now passes.
Reviewed By: ltamasi
Differential Revision: D21471838
fbshipit-source-id: 02943dd39a7030d2b03e5d894dcb23ed72b6c9c3
Summary:
Check for sys/auxv.h and getauxval before using them as they are not
always available (for example on uclibc)
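One way to express such a guard directly in the source is sketched below. It is an approximation only: it checks for the header with __has_include and assumes that a Linux libc shipping sys/auxv.h also provides getauxval, whereas the change described here checks for both at build time. SKETCH_HAVE_SYS_AUXV_H and ReadHwcapOrZero are hypothetical names.

```cpp
// __has_include is standard since C++17 and a common extension before that.
#if defined(__linux__) && defined(__has_include)
#if __has_include(<sys/auxv.h>)
#include <sys/auxv.h>  // getauxval, AT_HWCAP
#define SKETCH_HAVE_SYS_AUXV_H 1
#endif
#endif

// Returns the HWCAP word when it can be queried, or 0 ("no optional
// features known") when the header is missing, e.g. on uclibc.
inline unsigned long ReadHwcapOrZero() {
#if defined(SKETCH_HAVE_SYS_AUXV_H) && defined(AT_HWCAP)
  return getauxval(AT_HWCAP);
#else
  return 0;
#endif
}
```

Returning 0 when the query is unavailable means callers fall back to the portable code path instead of failing to build.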
Signed-off-by: Fabrice Fontaine <fontaine.fabrice@gmail.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6359
Differential Revision: D20239797
fbshipit-source-id: 175a098094d81545628c2372e7c388e70a32fd48
Summary:
Further apply formatter to more recent commits.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5830
Test Plan: Run all existing tests.
Differential Revision: D17488031
fbshipit-source-id: 137458fd94d56dd271b8b40c522b03036943a2ab
Summary:
Prefetch data for the following block to avoid cache misses while calculating the CRC.
Performance was tested on a Kunpeng 920 server (Armv8, 64 cores @ 2.6 GHz):
./db_bench --benchmarks=crc32c --block_size=500000000
before optimisation: 587313.500 micros/op 1 ops/sec; 811.9 MB/s (500000000 per op)
after optimisation: 289248.500 micros/op 3 ops/sec; 1648.5 MB/s (500000000 per op)
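The idea of prefetching the next block while the current one is being processed can be sketched as below. This is an illustration under stated assumptions, not the code from the PR: the 1024-byte block and 64-byte cache line are assumed values, UpdateBitwise is a slow portable stand-in for the hardware crc32c instructions, and SKETCH_PREFETCH wraps the GCC/Clang builtin __builtin_prefetch.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_PREFETCH(addr) __builtin_prefetch((addr), 0 /* read */, 3)
#else
#define SKETCH_PREFETCH(addr) ((void)0)
#endif

// Bitwise CRC32C state update (reflected polynomial 0x82F63B78).
inline uint32_t UpdateBitwise(uint32_t state, const uint8_t* p, size_t n) {
  for (size_t i = 0; i < n; ++i) {
    state ^= p[i];
    for (int k = 0; k < 8; ++k)
      state = (state >> 1) ^ (0x82F63B78u & (0u - (state & 1u)));
  }
  return state;
}

inline uint32_t Crc32cWithPrefetch(const uint8_t* data, size_t n) {
  constexpr size_t kBlock = 1024;    // assumed block size
  constexpr size_t kCacheLine = 64;  // assumed cache line size
  uint32_t state = 0xFFFFFFFFu;
  while (n >= kBlock) {
    // Request the next block's cache lines before working on this one,
    // but only when a full next block actually exists.
    if (n >= 2 * kBlock) {
      for (size_t off = 0; off < kBlock; off += kCacheLine)
        SKETCH_PREFETCH(data + kBlock + off);
    }
    state = UpdateBitwise(state, data, kBlock);
    data += kBlock;
    n -= kBlock;
  }
  state = UpdateBitwise(state, data, n);  // remaining tail bytes
  return ~state;
}
```

The prefetch is only a hint, so the result is identical to the plain loop; any gain depends on the hardware and on how far ahead the hint is issued.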
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5773
Differential Revision: D17347339
fbshipit-source-id: bfcd74f0f0eb4b322b959be68019ddcaae1e3341
Summary:
Crc32c parallel computation coding optimization:
Macro unfolding removes the "for" loop, which reduces branch misses on the arm64 microarchitecture.
The 1024 bytes are divided into three parts: 8 (head) + 1008 (6 * 7 * 3 * 8) + 8 (tail).
The 42 loop iterations are unfolded into 6 CRC32C7X24BYTES macros,
with each CRC32C7X24BYTES containing 7 CRC32C24BYTES macros.
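The unfolding scheme above can be sketched as follows. The macro names carry a SKETCH_ prefix to mark them as illustrative rather than the macros from the PR; Step8 is a portable stand-in for one 8-byte hardware crc32c step, and Lanes, UnrolledLanes and LoopLanes are hypothetical names. The sketch covers only the 1008-byte middle part, processed as three lanes of 42 * 8 bytes.

```cpp
#include <cstddef>
#include <cstdint>

// Portable stand-in for one 8-byte hardware crc32c step.
inline uint32_t Step8(uint32_t state, const uint8_t* p) {
  for (int i = 0; i < 8; ++i) {
    state ^= p[i];
    for (int k = 0; k < 8; ++k)
      state = (state >> 1) ^ (0x82F63B78u & (0u - (state & 1u)));
  }
  return state;
}

constexpr size_t kLaneBytes = 42 * 8;  // bytes per lane
static_assert(3 * kLaneBytes == 6 * 7 * 3 * 8, "middle part is 1008 bytes");
static_assert(8 + 3 * kLaneBytes + 8 == 1024, "head + middle + tail");

struct Lanes { uint32_t crc0, crc1, crc2; };

// One step: 8 bytes from each of the three lanes (24 bytes in total).
#define SKETCH_CRC32C24BYTES(i)       \
  do {                                \
    crc0 = Step8(crc0, b0 + 8 * (i)); \
    crc1 = Step8(crc1, b1 + 8 * (i)); \
    crc2 = Step8(crc2, b2 + 8 * (i)); \
  } while (0)

// Seven consecutive steps, written out without a loop.
#define SKETCH_CRC32C7X24BYTES(j)      \
  do {                                 \
    SKETCH_CRC32C24BYTES((j) * 7 + 0); \
    SKETCH_CRC32C24BYTES((j) * 7 + 1); \
    SKETCH_CRC32C24BYTES((j) * 7 + 2); \
    SKETCH_CRC32C24BYTES((j) * 7 + 3); \
    SKETCH_CRC32C24BYTES((j) * 7 + 4); \
    SKETCH_CRC32C24BYTES((j) * 7 + 5); \
    SKETCH_CRC32C24BYTES((j) * 7 + 6); \
  } while (0)

// Unfolded version: 6 * 7 = 42 steps with no loop branch.
inline Lanes UnrolledLanes(const uint8_t* p, Lanes s) {
  const uint8_t* b0 = p;
  const uint8_t* b1 = p + kLaneBytes;
  const uint8_t* b2 = p + 2 * kLaneBytes;
  uint32_t crc0 = s.crc0, crc1 = s.crc1, crc2 = s.crc2;
  SKETCH_CRC32C7X24BYTES(0);
  SKETCH_CRC32C7X24BYTES(1);
  SKETCH_CRC32C7X24BYTES(2);
  SKETCH_CRC32C7X24BYTES(3);
  SKETCH_CRC32C7X24BYTES(4);
  SKETCH_CRC32C7X24BYTES(5);
  return Lanes{crc0, crc1, crc2};
}

// Loop version of the same 42 steps, for comparison.
inline Lanes LoopLanes(const uint8_t* p, Lanes s) {
  for (size_t i = 0; i < 42; ++i) {
    s.crc0 = Step8(s.crc0, p + 8 * i);
    s.crc1 = Step8(s.crc1, p + kLaneBytes + 8 * i);
    s.crc2 = Step8(s.crc2, p + 2 * kLaneBytes + 8 * i);
  }
  return s;
}
```

Both functions produce the same lane states; the unfolded form trades code size for the absence of a loop-back branch.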
1. crc32c_test
[==========] Running 4 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 4 tests from CRC
[ RUN ] CRC.StandardResults
[ OK ] CRC.StandardResults (1 ms)
[ RUN ] CRC.Values
[ OK ] CRC.Values (0 ms)
[ RUN ] CRC.Extend
[ OK ] CRC.Extend (0 ms)
[ RUN ] CRC.Mask
[ OK ] CRC.Mask (0 ms)
[----------] 4 tests from CRC (1 ms total)
[----------] Global test environment tear-down
[==========] 4 tests from 1 test case ran. (1 ms total)
[ PASSED ] 4 tests.
2. db_bench --benchmarks="crc32c"
crc32c : 0.218 micros/op 4595390 ops/sec; 17950.7 MB/s (4096 per op)
3. crc32c_test case repeated 60000 times
perf stat -e branch-miss -- ./crc32c_test
before optimization:
739,426,504 branch-miss
after optimization:
1,128,572 branch-miss
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5675
Differential Revision: D16989210
fbshipit-source-id: 7204e6069bb6ed066d49c2d1b3ac385065a98557
Summary:
Crc32c Parallel computation optimization:
Algorithm comes from Intel whitepaper: [crc-iscsi-polynomial-crc32-instruction-paper](https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf)
Input data is divided into three equal-sized blocks
Three parallel blocks (crc0, crc1, crc2) for 1024 Bytes
One Block: 42(BLK_LENGTH) * 8(step length: crc32c_u64) bytes
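Why three independently computed blocks can be merged into one CRC is sketched below. This is not the implementation from the PR: the real code folds the partial results with carry-less multiplication (PMULL) and precomputed constants, while this portable sketch advances a state across a block by clocking it through zero bytes, which is slow but shows the same linearity. All function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Bitwise CRC32C state update (reflected polynomial 0x82F63B78).
inline uint32_t UpdateBitwise(uint32_t state, const uint8_t* p, size_t n) {
  for (size_t i = 0; i < n; ++i) {
    state ^= p[i];
    for (int k = 0; k < 8; ++k)
      state = (state >> 1) ^ (0x82F63B78u & (0u - (state & 1u)));
  }
  return state;
}

// Advance a state as if n zero bytes had been fed in. Stand-in for the
// PMULL-based multiplication by a precomputed constant.
inline uint32_t ShiftZeros(uint32_t state, size_t n) {
  for (size_t i = 0; i < 8 * n; ++i)
    state = (state >> 1) ^ (0x82F63B78u & (0u - (state & 1u)));
  return state;
}

constexpr size_t kBlockBytes = 42 * 8;  // one block, as described above

inline uint32_t Crc32cSerial(const uint8_t* p, size_t n) {
  return ~UpdateBitwise(0xFFFFFFFFu, p, n);
}

// CRC32C of exactly 3 * kBlockBytes bytes, computed as three blocks.
inline uint32_t Crc32cThreeWay(const uint8_t* p) {
  // The three updates do not depend on each other, so hardware can
  // overlap them. Only the first block carries the initial value.
  const uint32_t crc0 = UpdateBitwise(0xFFFFFFFFu, p, kBlockBytes);
  const uint32_t crc1 = UpdateBitwise(0u, p + kBlockBytes, kBlockBytes);
  const uint32_t crc2 = UpdateBitwise(0u, p + 2 * kBlockBytes, kBlockBytes);
  // The update is linear over GF(2):
  //   Update(s, B) == ShiftZeros(s, len(B)) ^ Update(0, B)
  uint32_t state = ShiftZeros(crc0, kBlockBytes) ^ crc1;
  state = ShiftZeros(state, kBlockBytes) ^ crc2;
  return ~state;
}
```

For any 1008-byte input, Crc32cThreeWay(buf) equals Crc32cSerial(buf, 1008); the head and tail bytes of a 1024-byte chunk would be handled separately.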
1. crc32c_test:
```
[==========] Running 4 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 4 tests from CRC
[ RUN ] CRC.StandardResults
[ OK ] CRC.StandardResults (1 ms)
[ RUN ] CRC.Values
[ OK ] CRC.Values (0 ms)
[ RUN ] CRC.Extend
[ OK ] CRC.Extend (0 ms)
[ RUN ] CRC.Mask
[ OK ] CRC.Mask (0 ms)
[----------] 4 tests from CRC (1 ms total)
[----------] Global test environment tear-down
[==========] 4 tests from 1 test case ran. (1 ms total)
[ PASSED ] 4 tests.
```
2. RocksDB benchmark: db_bench --benchmarks="crc32c"
```
Linear Arm crc32c:
crc32c: 1.005 micros/op 995133 ops/sec; 3887.2 MB/s (4096 per op)
```
```
Parallel optimization with Armv8 crypto extension:
crc32c: 0.419 micros/op 2385078 ops/sec; 9316.7 MB/s (4096 per op)
```
This is a ~2.4x speedup compared to the linear Arm crc32c instructions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5494
Differential Revision: D16340806
fbshipit-source-id: 95dae9a5b646fd20a8303671d82f17b2e162e945