aarch64_neon and target neoverse_n1 expand #163
Hi @gtsoul-tech! Can you provide more background on your effort to optimize blake2b? Is the implementation available somewhere? How did you invoke SLOTHY? In my experience, Neoverse N1 is quite amenable to speedup through good scheduling.
The processor I was targeting is Ampere Neoverse-N1, as Slothy has some support for it. If the optimization worked, I planned to extend it to Neoverse-V2 (Graviton4). The implementation is available here: https://github.com/VectorCamp/libsodium/blob/devel/blake2b-neon-implementation/src/libsodium/crypto_generichash/blake2b/ref/blake2b-compress-neon-assembly.S
@gtsoul-tech How did you invoke SLOTHY? Did you use heuristics? If so, is there a way to roll the code back into a loop, and then use SW pipelining?
@hanno-becker I invoked Slothy using the Neoverse-N1 experimental target, using heuristics. The current Blake2b implementation is fully unrolled, so I didn't roll it back into a loop. Since the performance regressed, it might be worth experimenting with a looped version.
@gtsoul-tech Yes, if you roll the code back into a loop, you may have a shot at running SLOTHY with SW pipelining, but without heuristics. How did you use the split heuristics? How big was the optimization 'window' compared to the whole size of the code?
@hanno-becker It was the whole size of the code.
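To make the 'window' question concrete, here is a minimal sketch of splitting straight-line code into contiguous optimization windows. It is not SLOTHY's implementation: the function name `split_into_windows` and the `factor` parameter are hypothetical and only illustrate the idea that a factor of 1 yields a single window spanning the whole code.

```python
def split_into_windows(instructions, factor):
    """Split a list of instructions into at most `factor` contiguous windows.

    Hypothetical helper for illustration only; not part of SLOTHY.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    if not instructions:
        return []
    # Ceiling division, so that every instruction lands in some window.
    size = -(-len(instructions) // factor)
    return [instructions[i:i + size] for i in range(0, len(instructions), size)]


# Ten placeholder instructions standing in for an unrolled kernel.
code = [f"insn_{i}" for i in range(10)]

whole = split_into_windows(code, 1)   # one window covering all of the code
parts = split_into_windows(code, 4)   # several smaller windows
print(len(whole), [len(w) for w in parts])
```

With a single window, the optimizer has to consider every instruction at once; smaller windows reduce the problem size per step at the cost of not reordering across window boundaries.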
…diateBasic and q_ldp_with_inc, q_stp_with_postinc, q_stp_with_inc (aa20c4f to 3671ab2)
This PR was made in the context of optimizing a new NEON Blake2b implementation using Slothy.
Unfortunately, no performance gain was achieved, only a regression, possibly due to the nature of the Blake2b algorithm.
I left out stp d8, d9, [sp, #-80]! because I couldn't expand it with sp as it is, and I couldn't implement .rodata section recognition (for example, ldr <Qa>, .rodata + <imm>), so I modified my loads to match existing patterns.
Additionally, I added tbl, mov_hh, and mov_hl to aarch64_neon and to the target microarchitecture neoverse_n1, along with missing entries for VShiftImmediateBasic, d_stp_stack_with_inc, q_stp_with_postinc, q_stp_with_inc, and q_ldp_with_inc.
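As a rough illustration of what recognizing an instruction such as stp d8, d9, [sp, #-80]! involves, the following standalone sketch matches the pre-indexed (writeback) form with a regular expression. It does not use SLOTHY's instruction classes; the names `STP_D_PREINDEX` and `parse_stp_d_preindex` are hypothetical, and a real architecture model would also need to record that the base register is both read and written.

```python
import re

# Hypothetical pattern for "stp d<a>, d<b>, [<base>, #<imm>]!" where the
# trailing "!" denotes pre-index addressing with base-register writeback.
STP_D_PREINDEX = re.compile(
    r"^\s*stp\s+d(?P<da>\d+)\s*,\s*d(?P<db>\d+)\s*,\s*"
    r"\[\s*(?P<base>sp|x\d+)\s*,\s*#(?P<imm>-?\d+)\s*\]!\s*$"
)


def parse_stp_d_preindex(line):
    """Return the operands of a pre-indexed D-register stp, or None."""
    m = STP_D_PREINDEX.match(line)
    if m is None:
        return None
    return {
        "regs": (int(m["da"]), int(m["db"])),  # the two stored D registers
        "base": m["base"],                     # read and updated by writeback
        "imm": int(m["imm"]),                  # signed byte offset
    }


print(parse_stp_d_preindex("stp d8, d9, [sp, #-80]!"))
print(parse_stp_d_preindex("stp d8, d9, [sp, #-80]"))  # no writeback: no match
```

The second call returns None because the line lacks the writeback marker, which is the kind of distinction a per-form pattern has to make.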