[Enhancement] Optimize code in arm (backport #55072) #55488

mergify · 2025-01-27T03:19:28Z

Why I'm doing:

ARM is slower than x86 in some cases.

What I'm doing:

1. Vectorize the runtime filter's (rf) insert_hash using Neon intrinsics.
2. streamvbyte's CMakeLists is wrong, which causes a performance regression on ARM because vectorization cannot work properly.
3. ARM's int128_mul_overflow is very slow because of its divide operation. __builtin_mul_overflow(int128_t a, int128_t b, int128_t* c) is fast enough when compiled with gcc, but gcc's __builtin_mul_overflow is at least 5 times faster than clang's on ARM. We have already reported this to the community: [clang++][aarch64] help optimize __builtin_mul_overflow performance llvm/llvm-project#123262. So we keep gcc as the default compiler and use __builtin_mul_overflow to replace the original int128_mul_overflow implementation.
4. Casting int128 to double is very slow on ARM with gcc because of the poor implementation of __floattidf. clang compiler-rt's implementation is 20 times faster than gcc's, so I used clang compiler-rt's implementation to replace gcc's version.
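
For reviewers who want to see the shape of item 3, here is a minimal sketch, not the code from this PR. The helper names `int128_mul_overflow_builtin` and `int128_mul_overflow_div`, and the constants `kInt128Max` / `kInt128Min`, are hypothetical and exist only for illustration. The division-based variant is an assumed stand-in for "a check that needs a 128-bit divide"; it is not copied from the original int128_mul_overflow.

```cpp
#include <cstdint>

using int128_t = __int128;            // gcc/clang extension
using uint128_t = unsigned __int128;  // gcc/clang extension

// Hypothetical constants, built without relying on std::numeric_limits<__int128>.
constexpr int128_t kInt128Max = static_cast<int128_t>(~static_cast<uint128_t>(0) >> 1);
constexpr int128_t kInt128Min = -kInt128Max - 1;

// Builtin-based check: returns true on overflow; *c receives the wrapped product.
inline bool int128_mul_overflow_builtin(int128_t a, int128_t b, int128_t* c) {
    return __builtin_mul_overflow(a, b, c);
}

// Illustrative division-based check (assumption: NOT the original implementation).
// It multiplies with wrap-around and then divides to verify the result, which is
// the kind of 128-bit divide that is expensive on aarch64.
inline bool int128_mul_overflow_div(int128_t a, int128_t b, int128_t* c) {
    *c = static_cast<int128_t>(static_cast<uint128_t>(a) * static_cast<uint128_t>(b));
    if (a == 0 || b == 0) {
        return false;
    }
    // INT128_MIN * -1 overflows, and dividing INT128_MIN by -1 would be undefined.
    if ((a == -1 && b == kInt128Min) || (b == -1 && a == kInt128Min)) {
        return true;
    }
    return *c / b != a;
}
```

Both helpers return the same overflow flag for the same inputs, so the builtin can act as a drop-in replacement wherever the flag and the wrapped product are all the caller needs.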

After this PR, ARM is faster than x86 in most cases.

| Query   | arm-opt | x86 |
|---------|--------|--------|
| QUERY01 | 36     | 61     |
| QUERY02 | 39     | 62     |
| QUERY14 | 1510   | 1514   |
| QUERY15 | 1407   | 1496   |
| QUERY17 | 21     | 88     |
| QUERY20 | 151    | 279    |
| QUERY21 | 1526   | 1529   |
| QUERY24 | 1399   | 1504   |
| QUERY26 | 32     | 122    |
| QUERY27 | 1493   | 1519   |
| QUERY90 | 3399   | 4030   |
| QUERY97 | 3859   | 4776   |
| QUERY98 | 2763   | 3208   |
| QUERY99 | 868    | 1259   |

What type of PR is this:

Does this PR entail a change in behavior?

Yes, this PR will result in a change in behavior.
No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

Interface/UI changes: syntax, type conversion, expression evaluation, display information
Parameter changes: default values, similar parameters but with different default values
Policy changes: use new policy to replace old one, functionality automatically enabled
Feature removed
Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

I have added test cases for my bug fix or my new feature
This pr needs user documentation (for new or modified features or behaviors)
I have added documentation for my new feature or new function
This is a backport pr

Signed-off-by: before-Sunrise <[email protected]> (cherry picked from commit e88bb85)

mergify bot mentioned this pull request Jan 27, 2025

[Enhancement] Optimize code in arm #55072

Merged

24 tasks

github-actions bot assigned before-Sunrise Jan 27, 2025

github-actions bot added the automerge label Jan 27, 2025

wanpengfei-git enabled auto-merge (squash) January 27, 2025 03:20

mergify bot commented Jan 27, 2025 • edited by wanpengfei-git