
Optimize xxh32 and xxh64 with ARM SVE instructions #737

Open
hzhuang1 opened this issue Sep 7, 2022 · 6 comments

Comments

@hzhuang1 (Contributor) commented Sep 7, 2022

With pull request #713, XXH3 is optimized with ARM SVE instructions. XXH3 divides its input into blocks, so vector instructions can process the data in parallel.

XXH32 and XXH64, however, consume their input as a single sequential stream. So a different method, multi-buffer, can be used to apply vector instructions. The implementation is in https://github.com/hzhuang1/isa-l_crypto/tree/debug_xxh32. Multi-buffer means running multiple independent hashing jobs in parallel, and using vector instructions to advance all of the jobs at once.
(Screenshot from 2022-09-07 14-22-59: benchmark results)

The performance data above was collected on two machines: one with SVE512 (Fujitsu) and one with SVE256 (AWS).
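A rough sketch of the multi-buffer idea in plain C (the lane count and function names here are illustrative; the actual implementation is in SVE assembly): each lane holds the accumulators of an independent XXH32 job, and one step advances every lane by one stripe, which is the part a vector unit can execute in parallel.

```c
#include <stdint.h>

#define XXH_PRIME32_1 0x9E3779B1u
#define XXH_PRIME32_2 0x85EBCA77u
#define NLANES 4  /* illustrative lane count; SVE width would decide this */

static uint32_t rotl32(uint32_t x, int r) { return (x << r) | (x >> (32 - r)); }

/* One XXH32 accumulator round, as in the reference implementation. */
static uint32_t xxh32_round(uint32_t acc, uint32_t input)
{
    acc += input * XXH_PRIME32_2;
    acc  = rotl32(acc, 13);
    acc *= XXH_PRIME32_1;
    return acc;
}

/* Scalar model of one multi-buffer step: each lane holds the 4 accumulators
 * of an independent XXH32 job, and one call advances every lane by one
 * 16-byte stripe.  A vector implementation replaces the inner lane loop
 * with one vector operation per accumulator. */
static void mb_round_step(uint32_t acc[NLANES][4],
                          const uint32_t stripe[NLANES][4])
{
    for (int i = 0; i < 4; i++)                    /* accumulator index */
        for (int lane = 0; lane < NLANES; lane++)  /* vectorizable loop */
            acc[lane][i] = xxh32_round(acc[lane][i], stripe[lane][i]);
}
```

Because the lanes never exchange data, each lane's accumulators evolve exactly as they would in a standalone XXH32 computation.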

@Cyan4973 (Owner) commented Sep 7, 2022

High-level question: does the multi-buffer method generate the same final hash value as the regular one?

@hzhuang1 (Contributor, Author) commented Sep 7, 2022

> High-level question: does the multi-buffer method generate the same final hash value as the regular one?

Only part of the code can be vectorized in multi-buffer, just like the block operation in XXH3. The remaining calculation is handled within each job. The final hash value is the same as the regular one.
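The claim above can be illustrated with a toy hash that has the same shape as XXH32 (a per-word round plus a final avalanche); this is a simplified sketch, not the real algorithm, and the multi-buffer function below is a hypothetical scalar model, not the actual assembly. Since each job carries its own accumulator through the lockstep loop and is finalized on its own, interleaving jobs never mixes their states, so each output matches the single-stream result.

```c
#include <stdint.h>
#include <stddef.h>

static uint32_t rotl32(uint32_t x, int r) { return (x << r) | (x >> (32 - r)); }

/* Toy per-word round and final avalanche; constants borrow XXH32's primes
 * but the overall function is illustrative only. */
static uint32_t toy_round(uint32_t acc, uint32_t in)
{
    return rotl32(acc + in * 0x85EBCA77u, 13) * 0x9E3779B1u;
}

static uint32_t toy_finalize(uint32_t acc)
{
    acc ^= acc >> 15;
    acc *= 0x85EBCA77u;
    acc ^= acc >> 13;
    return acc;
}

/* Regular single-stream processing of one job. */
static uint32_t toy_hash(const uint32_t *words, size_t n)
{
    uint32_t acc = 0x165667B1u;  /* arbitrary fixed start value */
    for (size_t i = 0; i < n; i++)
        acc = toy_round(acc, words[i]);
    return toy_finalize(acc);
}

/* Multi-buffer processing of njobs equal-length jobs: the outer loop
 * advances every job by one word in lockstep (the vectorizable part);
 * finalization runs per job, so out[j] equals toy_hash(jobs[j], n). */
static void toy_hash_mb(const uint32_t *jobs[], size_t n, size_t njobs,
                        uint32_t *out)
{
    for (size_t j = 0; j < njobs; j++)
        out[j] = 0x165667B1u;
    for (size_t i = 0; i < n; i++)            /* lockstep over words */
        for (size_t j = 0; j < njobs; j++)    /* vectorizable lanes  */
            out[j] = toy_round(out[j], jobs[j][i]);
    for (size_t j = 0; j < njobs; j++)
        out[j] = toy_finalize(out[j]);
}
```

For simplicity the sketch assumes all jobs have equal length; real multi-buffer schedulers must handle jobs of differing lengths by retiring finished lanes.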

@Cyan4973 (Owner) commented Sep 7, 2022

> The final hash value must be same as the regular one.

XXH3 was designed with parallel blocks in mind, so it works there,
but XXH32 and XXH64 were not,
so it's a bit less obvious to me that multi-buffer can generate the same final hash value as single stream.

Has it already been validated?
I assume xxhash_mb_rand_test.c's role is to do that,
but I would welcome a confirmation.

A high level explanation would also be welcome,
as currently the implementation is in assembly format,
and the high-level logic is not self-obvious nor documented.

@hzhuang1 (Contributor, Author) commented Sep 7, 2022

> Has it been already validated ? I assume xxhash_mb_rand_test.c's role is to do that, but I would welcome a confirmation.

Yes, xxhash_mb_rand_test.c compares the hash values with the results from the regular xxHash.

> A high level explanation would also be welcome, as currently the implementation is in assembly format, and the high-level logic is not self-obvious nor documented.

I'll try to add more explanation while I re-organize the patches.

@Cyan4973 (Owner) commented

This issue is primarily focused on a target implementation of XXH32 and XXH64 using SVE,

but an important generic claim made here is that
it's possible to process large inputs with XXH32 and XXH64 by breaking them into multiple buffers, processing those in parallel (employing multiple threads), and still generating the same result as if they were processed as a single stream.

If that's confirmed, this would be a fairly important property that could be applied beyond SVE, across multiple instruction sets.
So a description of how to achieve this objective would be greatly appreciated.

@klauspost commented

@Cyan4973 It seems to be a multi-lane approach, requiring the coordination of several inputs to get any benefit.

I have tried making this approach feasible for MD5 and SHA256, and even at MD5 speeds, the practical task of coordinating several inputs means there is rarely any practical benefit.

The only benefit I see is when you have several inputs lined up in memory, ready to be processed, so you can dispatch many blocks without coordination. In practice I've only seen that in benchmarks. The numbers seem to support this, with almost no benefit until you reach a high number of concurrent streams and big blocks.

So while the numbers look good, in practice it is rarely a benefit, since coordinating between your streams is more expensive than just doing single-lane processing.
