Summary:
Crc32c Parallel computation optimization:
Algorithm comes from Intel whitepaper: [crc-iscsi-polynomial-crc32-instruction-paper](https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf)
Input data is divided into three equal-sized blocks
Three parallel blocks (crc0, crc1, crc2) for 1024 Bytes
One Block: 42(BLK_LENGTH) * 8(step length: crc32c_u64) bytes
1. crc32c_test:
```
[==========] Running 4 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 4 tests from CRC
[ RUN ] CRC.StandardResults
[ OK ] CRC.StandardResults (1 ms)
[ RUN ] CRC.Values
[ OK ] CRC.Values (0 ms)
[ RUN ] CRC.Extend
[ OK ] CRC.Extend (0 ms)
[ RUN ] CRC.Mask
[ OK ] CRC.Mask (0 ms)
[----------] 4 tests from CRC (1 ms total)
[----------] Global test environment tear-down
[==========] 4 tests from 1 test case ran. (1 ms total)
[ PASSED ] 4 tests.
```
2. RocksDB benchmark: db_bench --benchmarks="crc32c"
```
Linear Arm crc32c:
crc32c: 1.005 micros/op 995133 ops/sec; 3887.2 MB/s (4096 per op)
```
```
Parallel optimization with Armv8 crypto extension:
crc32c: 0.419 micros/op 2385078 ops/sec; 9316.7 MB/s (4096 per op)
```
It gets ~2.4x speedup compared to linear Arm crc32c instructions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5494
Differential Revision: D16340806
fbshipit-source-id: 95dae9a5b646fd20a8303671d82f17b2e162e945