aarch64_neon and target neoverse_n1 expand #163

Open · wants to merge 2 commits into main from feature/aarch64_neon_n1_expand
Conversation

@gtsoul-tech commented Feb 27, 2025

This PR was made in the context of optimizing a new NEON Blake2b implementation using SLOTHY.
Unfortunately, no performance gain was achieved, only regression, possibly due to the nature of the Blake2b algorithm.

I left out `stp d8, d9, [sp, #-80]!`, as I couldn't expand it with `sp` as it is,
and I couldn't implement `.rodata` section recognition (for example `ldr <Qa>, .rodata + <imm>`),
so I modified my loads to use existing patterns.
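For reference, a sketch of what an expansion of that pre-index store would look like (illustrative only, not part of this PR): the write-back addressing mode folds the `sp` adjustment into the store, and a semantically equivalent two-instruction form separates them.

```asm
// Pre-index write-back form: decrements sp by 80 and stores in one instruction.
stp d8, d9, [sp, #-80]!

// Equivalent expanded form (illustrative sketch): the sp adjustment becomes
// a separate instruction, the kind of pattern an instruction model can
// describe without write-back support.
sub sp, sp, #80          // pre-decrement the stack pointer
stp d8, d9, [sp]         // store the register pair at the new sp
```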
Additionally, I added `tbl`, `mov_hh`, and `mov_hl` to `aarch64_neon` and to the target microarchitecture `neoverse_n1`,
along with missing entries for
`VShiftImmediateBasic`,
`d_stp_stack_with_inc`, `q_stp_with_postinc`, `q_stp_with_inc`, and `q_ldp_with_inc`.

@hanno-becker (Collaborator) commented Feb 27, 2025

Hi @gtsoul-tech! Can you provide more background on your effort to optimize blake2b? Is the implementation available somewhere? How did you invoke SLOTHY? In my experience, Neoverse N1 is quite amenable to speedups through good scheduling.

@gtsoul-tech (Author)

> Hi @gtsoul-tech! Can you provide more background on your effort to optimize blake2b? What processor were you targeting? Is the implementation available somewhere?
Hello @hanno-becker,

The processor I was targeting is Ampere Neoverse-N1, as Slothy has some support for it. If the optimization worked, I planned to extend it to Neoverse-V2 (Graviton4).

The implementation is available here: https://github.com/VectorCamp/libsodium/blob/devel/blake2b-neon-implementation/src/libsodium/crypto_generichash/blake2b/ref/blake2b-compress-neon-assembly.S

@hanno-becker (Collaborator)

@gtsoul-tech How did you invoke SLOTHY? Did you use heuristics? If so, is there a way to roll the code back into a loop, and then use SW pipelining?

@gtsoul-tech (Author)

@hanno-becker I invoked SLOTHY using the Neoverse-N1 experimental target, with heuristics enabled.

The current Blake2b implementation is fully unrolled, so I didn't roll it back into a loop. Since performance regressed, it might be worth experimenting with a looped version.

@hanno-becker (Collaborator)

@gtsoul-tech Yes, if you roll the code back into a loop, you may have a shot at running SLOTHY with SW pipelining, but without heuristics.

How did you use the split heuristics? How big was the optimization 'window' compared to the whole size of the code?
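For readers unfamiliar with SW pipelining, a minimal illustrative sketch (not from this PR, and with prologue/epilogue handling elided): instructions from the next loop iteration are interleaved with the current one so that load latency is hidden behind computation.

```asm
// Naive loop body: the add depends immediately on the load, stalling on latency.
loop:
    ldr  q0, [x0], #16
    add  v1.2d, v1.2d, v0.2d
    subs x1, x1, #1
    b.ne loop

// Software-pipelined form (sketch): the load for iteration i+1 issues while
// iteration i's add executes. An epilogue to avoid the final over-read is elided.
    ldr  q0, [x0], #16        // prologue: load for the first iteration
loop:
    ldr  q2, [x0], #16        // early load for the *next* iteration
    add  v1.2d, v1.2d, v0.2d  // compute on the current iteration's data
    mov  v0.16b, v2.16b       // rotate: next iteration's data becomes current
    subs x1, x1, #1
    b.ne loop
```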

@gtsoul-tech (Author) commented Feb 27, 2025

@hanno-becker It was the whole size of the code.
First pass: functional verification and register allocation.
Second pass: split heuristic with tuning parameters
(`split_heuristic_stepsize = 0.05`,
`split_heuristic_factor = 10`,
`split_heuristic_repeat = 2`)
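To make those parameters concrete, here is a rough Python sketch of the windowing idea behind a split heuristic (illustrative only, not SLOTHY's actual implementation): each window covers roughly `1/factor` of the code, the window advances by `stepsize` of the code per step, and `repeat` controls how many full sweeps are run.

```python
# Illustrative sketch of split-heuristic windowing (NOT SLOTHY's real code):
# a long piece of straight-line code is carved into overlapping windows,
# each of which is optimized in isolation.

def split_windows(num_lines, factor=10, stepsize=0.05):
    """Yield (start, end) line ranges for each optimization window.

    factor   -- each window covers roughly 1/factor of the code
    stepsize -- the window advances by this fraction of the code per step
    """
    window = max(1, num_lines // factor)
    step = max(1, int(num_lines * stepsize))
    start = 0
    while start < num_lines:
        yield (start, min(start + window, num_lines))
        start += step

# With repeat = 2, the full sweep over all windows is performed twice,
# feeding the output of the first sweep into the second.
windows = list(split_windows(200, factor=10, stepsize=0.05))
```

With 200 lines, factor 10, and stepsize 0.05, each window spans 20 lines and slides by 10, so consecutive windows overlap by half, which lets instructions migrate across window boundaries over successive steps.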

@gtsoul-tech force-pushed the feature/aarch64_neon_n1_expand branch from aa20c4f to 3671ab2 on March 5, 2025 at 13:20