aarch64_neon and target neoverse_n1 expand #163

Open · wants to merge 2 commits into main from feature/aarch64_neon_n1_expand
Conversation

@gtsoul-tech commented Feb 27, 2025

This PR was made in the context of optimizing a new NEON Blake2b implementation using SLOTHY.
Unfortunately, no performance gain was achieved, only regression, possibly due to the nature of the Blake2b algorithm.

I left out `stp d8, d9, [sp, #-80]!`, as I couldn't expand it with `sp` as it is,
and I couldn't implement `.rodata` section recognition (for example `ldr <Qa>, .rodata + <imm>`),
so I modified my loads to use existing patterns.
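For reference, a sketch of what an expansion of that pre-index store would look like (illustrative only, not part of this PR): the write-back addressing mode folds the `sp` adjustment into the store, and a semantically equivalent two-instruction form separates them.

```asm
// Pre-index write-back form: decrements sp by 80 and stores in one instruction.
stp d8, d9, [sp, #-80]!

// Equivalent expanded form (illustrative sketch): the sp adjustment becomes
// a separate instruction, the kind of pattern an instruction model can
// describe without write-back support.
sub sp, sp, #80          // pre-decrement the stack pointer
stp d8, d9, [sp]         // store the register pair at the new sp
```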
Additionally, I added `tbl`, `mov_hh`, and `mov_hl` to `aarch64_neon` and to the target microarchitecture `neoverse_n1`,
along with missing entries for
`VShiftImmediateBasic`,
`d_stp_stack_with_inc`, `q_stp_with_postinc`, `q_stp_with_inc`, and `q_ldp_with_inc`.

@hanno-becker (Collaborator) commented Feb 27, 2025

Hi @gtsoul-tech! Can you provide more background on your effort to optimize blake2b? Is the implementation available somewhere? How did you invoke SLOTHY? In my experience, Neoverse N1 is quite amenable to speedups through good scheduling.

@gtsoul-tech (Author)

> Hi @gtsoul-tech! Can you provide more background on your effort to optimize blake2b? What processor were you targeting? Is the implementation available somewhere?
Hello @hanno-becker,

The processor I was targeting is Ampere Neoverse-N1, as Slothy has some support for it. If the optimization worked, I planned to extend it to Neoverse-V2 (Graviton4).

The implementation is available here: https://github.com/VectorCamp/libsodium/blob/devel/blake2b-neon-implementation/src/libsodium/crypto_generichash/blake2b/ref/blake2b-compress-neon-assembly.S

@hanno-becker (Collaborator)

@gtsoul-tech How did you invoke SLOTHY? Did you use heuristics? If so, is there a way to roll the code back into a loop, and then use SW pipelining?

@gtsoul-tech (Author)

@hanno-becker I invoked SLOTHY using the Neoverse-N1 experimental target, with heuristics enabled.

The current Blake2b implementation is fully unrolled, so I didn't roll it back into a loop. Since performance regressed, it might be worth experimenting with a looped version.

@hanno-becker (Collaborator)

@gtsoul-tech Yes, if you roll the code back into a loop, you may have a shot at running SLOTHY with SW pipelining, but without heuristics.

How did you use the split heuristics? How big was the optimization 'window' compared to the whole size of the code?
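For readers unfamiliar with SW pipelining, a minimal illustrative sketch (not from this PR, and with prologue/epilogue handling elided): instructions from the next loop iteration are interleaved with the current one so that load latency is hidden behind computation.

```asm
// Naive loop body: the add depends immediately on the load, stalling on latency.
loop:
    ldr  q0, [x0], #16
    add  v1.2d, v1.2d, v0.2d
    subs x1, x1, #1
    b.ne loop

// Software-pipelined form (sketch): the load for iteration i+1 issues while
// iteration i's add executes. An epilogue to avoid the final over-read is elided.
    ldr  q0, [x0], #16        // prologue: load for the first iteration
loop:
    ldr  q2, [x0], #16        // early load for the *next* iteration
    add  v1.2d, v1.2d, v0.2d  // compute on the current iteration's data
    mov  v0.16b, v2.16b       // rotate: next iteration's data becomes current
    subs x1, x1, #1
    b.ne loop
```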

@gtsoul-tech (Author) commented Feb 27, 2025

@hanno-becker It was the whole size of the code.
First pass: functional verification and register allocation.
Second pass: split heuristic with tuning parameters
(`split_heuristic_stepsize = 0.05`,
`split_heuristic_factor = 10`,
`split_heuristic_repeat = 2`)
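To make those parameters concrete, here is a rough Python sketch of the windowing idea behind a split heuristic (illustrative only, not SLOTHY's actual implementation): each window covers roughly `1/factor` of the code, the window advances by `stepsize` of the code per step, and `repeat` controls how many full sweeps are run.

```python
# Illustrative sketch of split-heuristic windowing (NOT SLOTHY's real code):
# a long piece of straight-line code is carved into overlapping windows,
# each of which is optimized in isolation.

def split_windows(num_lines, factor=10, stepsize=0.05):
    """Yield (start, end) line ranges for each optimization window.

    factor   -- each window covers roughly 1/factor of the code
    stepsize -- the window advances by this fraction of the code per step
    """
    window = max(1, num_lines // factor)
    step = max(1, int(num_lines * stepsize))
    start = 0
    while start < num_lines:
        yield (start, min(start + window, num_lines))
        start += step

# With repeat = 2, the full sweep over all windows is performed twice,
# feeding the output of the first sweep into the second.
windows = list(split_windows(200, factor=10, stepsize=0.05))
```

With 200 lines, factor 10, and stepsize 0.05, each window spans 20 lines and slides by 10, so consecutive windows overlap by half, which lets instructions migrate across window boundaries over successive steps.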

@gtsoul-tech force-pushed the feature/aarch64_neon_n1_expand branch from aa20c4f to 3671ab2 on March 5, 2025 at 13:20