-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve the codegen of the vector accelerated System.Numerics.*
types
#81335
Conversation
Note regarding the This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, to please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change. |
Tagging subscribers to this area: @dotnet/area-system-numerics Issue DetailsThis improves the codegen for more scenarios involving Note about Quaternion and PlaneIn an ideal world, particularly for What this means is that even though So what ends up happening is that we explicitly track I considered limiting things to a couple "bitcast" intrinsics, Note about other changesThe other APIs are split between intrinsic and managed handling. For the intrinsic handling, I limited it to APIs which already had intrinsic handling (e.g. Other APIs such as
|
Many diffs are cases such as: -; 0 inlinees with PGO data; 4 single block inlinees; 0 inlinees without PGO data
+; 0 inlinees with PGO data; 2 single block inlinees; 0 inlinees without PGO data or -;* V04 tmp2 [V04 ] ( 0, 0 ) simd8 -> zero-ref ld-addr-op "NewObj constructor temp"
-;* V05 tmp3 [V05 ] ( 0, 0 ) simd8 -> zero-ref ld-addr-op "NewObj constructor temp"
-;* V06 tmp4 [V06 ] ( 0, 0 ) simd8 -> zero-ref V01.X(offs=0x00) P-INDEP "field V01.X (fldOffset=0x0)"
-;* V07 tmp5 [V07 ] ( 0, 0 ) simd8 -> zero-ref V01.Y(offs=0x08) P-INDEP "field V01.Y (fldOffset=0x8)"
-;* V08 tmp6 [V08 ] ( 0, 0 ) simd8 -> zero-ref V01.Z(offs=0x10) P-INDEP "field V01.Z (fldOffset=0x10)"
-;* V09 tmp7 [V09,T01] ( 0, 0 ) simd8 -> zero-ref V03.X(offs=0x00) P-INDEP "field V03.X (fldOffset=0x0)"
-;* V10 tmp8 [V10,T02] ( 0, 0 ) simd8 -> zero-ref V03.Y(offs=0x08) P-INDEP "field V03.Y (fldOffset=0x8)"
-;* V11 tmp9 [V11,T03] ( 0, 0 ) simd8 -> zero-ref V03.Z(offs=0x10) P-INDEP "field V03.Z (fldOffset=0x10)"
+;* V04 tmp2 [V04 ] ( 0, 0 ) simd8 -> zero-ref V01.X(offs=0x00) P-INDEP "field V01.X (fldOffset=0x0)"
+;* V05 tmp3 [V05 ] ( 0, 0 ) simd8 -> zero-ref V01.Y(offs=0x08) P-INDEP "field V01.Y (fldOffset=0x8)"
+;* V06 tmp4 [V06 ] ( 0, 0 ) simd8 -> zero-ref V01.Z(offs=0x10) P-INDEP "field V01.Z (fldOffset=0x10)"
+;* V07 tmp5 [V07,T01] ( 0, 0 ) simd8 -> zero-ref V03.X(offs=0x00) P-INDEP "field V03.X (fldOffset=0x0)"
+;* V08 tmp6 [V08,T02] ( 0, 0 ) simd8 -> zero-ref V03.Y(offs=0x08) P-INDEP "field V03.Y (fldOffset=0x8)"
+;* V09 tmp7 [V09,T03] ( 0, 0 ) simd8 -> zero-ref V03.Z(offs=0x10) P-INDEP "field V03.Z (fldOffset=0x10)" Both of these represent less work for the JIT to be doing. |
Some of the instance methods do show small regressions. These are because the instance method takes the first parameter as a In practice, such methods are intrinsic themselves and so this code isn't actually hit and doesn't represent any real regression. |
The change to morph to recognize - vdpps xmm3, xmm2, xmm2, 113
- vsqrtss xmm3, xmm3
- vbroadcastss xmm3, xmm3
+ vdpps xmm3, xmm2, xmm2, 127
+ vsqrtps xmm3, xmm3 Other functions have "regressions" which are just code size increases due to methods that were supposed to already be getting inlined actually being inlined now thanks to them being intrinsic. And then of course APIs involving ; Assembly listing for method System.Numerics.Quaternion:get_IsIdentity():bool:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 2 single block inlinees; 1 inlinees without PGO data
; Final local variable assignments
;
; V00 this [V00,T01] ( 3, 3 ) byref -> rcx this single-def
;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [rsp+00H] "OutgoingArgSpace"
; V02 tmp1 [V02,T00] ( 5, 10 ) struct (16) [rsp+08H] do-not-enreg[SF] "impAppendStmt"
;* V03 tmp2 [V03 ] ( 0, 0 ) struct (16) zero-ref do-not-enreg[SF] "struct address for call/obj"
;* V04 tmp3 [V04 ] ( 0, 0 ) struct (16) zero-ref ld-addr-op "NewObj constructor temp"
; V05 tmp4 [V05,T02] ( 3, 2 ) bool -> rax "Inline return value spill temp"
;* V06 tmp5 [V06 ] ( 0, 0 ) struct (16) zero-ref "Inlining Arg"
;* V07 tmp6 [V07 ] ( 0, 0 ) struct (16) zero-ref "Inlining Arg"
;* V08 tmp7 [V08,T07] ( 0, 0 ) float -> zero-ref V04.X(offs=0x00) P-INDEP "field V04.X (fldOffset=0x0)"
;* V09 tmp8 [V09,T08] ( 0, 0 ) float -> zero-ref V04.Y(offs=0x04) P-INDEP "field V04.Y (fldOffset=0x4)"
;* V10 tmp9 [V10,T09] ( 0, 0 ) float -> zero-ref V04.Z(offs=0x08) P-INDEP "field V04.Z (fldOffset=0x8)"
;* V11 tmp10 [V11,T10] ( 0, 0 ) float -> zero-ref V04.W(offs=0x0c) P-INDEP "field V04.W (fldOffset=0xc)"
; V12 tmp11 [V12,T03] ( 2, 2 ) float -> mm0 V06.X(offs=0x00) P-INDEP "field V06.X (fldOffset=0x0)"
; V13 tmp12 [V13,T04] ( 2, 1.50) float -> mm1 V06.Y(offs=0x04) P-INDEP "field V06.Y (fldOffset=0x4)"
; V14 tmp13 [V14,T05] ( 2, 1.50) float -> mm2 V06.Z(offs=0x08) P-INDEP "field V06.Z (fldOffset=0x8)"
; V15 tmp14 [V15,T06] ( 2, 1.50) float -> mm3 V06.W(offs=0x0c) P-INDEP "field V06.W (fldOffset=0xc)"
;* V16 tmp15 [V16 ] ( 0, 0 ) float -> zero-ref V07.X(offs=0x00) P-INDEP "field V07.X (fldOffset=0x0)"
;* V17 tmp16 [V17,T11] ( 0, 0 ) float -> zero-ref V07.Y(offs=0x04) P-INDEP "field V07.Y (fldOffset=0x4)"
;* V18 tmp17 [V18,T12] ( 0, 0 ) float -> zero-ref V07.Z(offs=0x08) P-INDEP "field V07.Z (fldOffset=0x8)"
;* V19 tmp18 [V19,T13] ( 0, 0 ) float -> zero-ref V07.W(offs=0x0c) P-INDEP "field V07.W (fldOffset=0xc)"
;
; Lcl frame size = 24
G_M51825_IG01:
sub rsp, 24
vzeroupper
;; size=7 bbWeight=1 PerfScore 1.25
G_M51825_IG02:
vmovups xmm0, xmmword ptr [rcx]
vmovups xmmword ptr [rsp+08H], xmm0
vmovss xmm0, dword ptr [rsp+08H]
vmovss xmm1, dword ptr [rsp+0CH]
vmovss xmm2, dword ptr [rsp+10H]
vmovss xmm3, dword ptr [rsp+14H]
vxorps xmm4, xmm4
vucomiss xmm0, xmm4
jp SHORT G_M51825_IG05
jne SHORT G_M51825_IG05
;; size=46 bbWeight=1 PerfScore 21.33
G_M51825_IG03:
vxorps xmm0, xmm0
vucomiss xmm1, xmm0
jp SHORT G_M51825_IG05
jne SHORT G_M51825_IG05
vxorps xmm0, xmm0
vucomiss xmm2, xmm0
jp SHORT G_M51825_IG05
jne SHORT G_M51825_IG05
xor eax, eax
vucomiss xmm3, dword ptr [reloc @RWD00]
setnp al
jp SHORT G_M51825_IG04
sete al
;; size=42 bbWeight=0.50 PerfScore 7.96
G_M51825_IG04:
jmp SHORT G_M51825_IG06
;; size=2 bbWeight=0.50 PerfScore 1.00
G_M51825_IG05:
xor eax, eax
;; size=2 bbWeight=0.50 PerfScore 0.12
G_M51825_IG06:
add rsp, 24
ret
;; size=5 bbWeight=1 PerfScore 1.25
RWD00 dd 3F800000h ; 1
; Total bytes of code 104, prolog size 7, PerfScore 43.32, instruction count 29, allocated bytes for code 104 (MethodHash=46a9358e) for method System.Numerics.Quaternion:get_IsIdentity():bool:this
; ============================================================ But is now: ; Assembly listing for method System.Numerics.Quaternion:get_IsIdentity():bool:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
; V00 this [V00,T00] ( 3, 3 ) byref -> rcx this single-def
;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [rsp+00H] "OutgoingArgSpace"
;
; Lcl frame size = 0
G_M51825_IG01:
vzeroupper
;; size=3 bbWeight=1 PerfScore 1.00
G_M51825_IG02:
vmovups xmm0, xmmword ptr [rcx]
vcmpps xmm0, xmm0, xmmword ptr [reloc @RWD00], 0
vmovmskps xrax, xmm0
cmp eax, 15
sete al
movzx rax, al
;; size=26 bbWeight=1 PerfScore 10.50
G_M51825_IG03:
ret
;; size=1 bbWeight=1 PerfScore 1.00
RWD00 dq 0000000000000000h, 3F80000000000000h
; Total bytes of code 30, prolog size 3, PerfScore 15.50, instruction count 8, allocated bytes for code 30 (MethodHash=46a9358e) for method System.Numerics.Quaternion:get_IsIdentity():bool:this
; ============================================================ |
There are still a few APIs that can be optimized further, but they are all on the managed side of things and involve a more complex refactoring of their code. I plan on handling that in a separate PR |
/azp run runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-arm, runtime-coreclr outerloop |
Azure Pipelines successfully started running 3 pipeline(s). |
This just needs sign-off on the Mono side before it can be merged. |
Just tracking some of the Mono issues logged related to this PR:
As discussed with @fanyang-mono offline. My current plan, after this PR is merged, is to put up a PR which changes |
/azp run runtime-extra-platforms |
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking care of Mono.
else | ||
{ | ||
MonoClass *klass = mono_class_from_mono_type_internal (vector_type); | ||
g_assert ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if we want to assert here, since mono won't disable assertion in release build. I think it is better to return false when the class is not one of those expected ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍, didn't realize Mono asserts were preserved in release code.
Yes, looks like I didn't add the change to my commit. I'll fix this and the formatting issues after extra-platforms finishes to avoid additional CI churn. |
Link to extra platforms failures is here: https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=156061 Other than a timeout due to an offline machine, the test failures match what the scheduled runs are hitting: https://dev.azure.com/dnceng-public/public/_build/results?buildId=155434&view=ms.vss-test-web.build-test-results-tab |
Brought in latest main to pickup the CI fix |
This improves the codegen for more scenarios involving
Vector2/3/4
,Quaternion
, andPlane
.Note about Quaternion and Plane
In an ideal world, particularly for
Quaternion
andPlane
, a large portion of the JIT side changes of this PR would be unnecessary. However, we don't live in that world and these types were all exposed with public fields which limits the ability to change their internal implementation details.What this means is that even though
Quaternion
should be wrapping a singleVector4
field and deferring most of its operations toVector4
, it can't. And unlikeMatrix4x4
where it is sufficiently large and therefore the limitation could be worked around usingUnsafe.As
to convert thebyref
, avoid the copy, and still end up getting promoted; the same is not trivially true forQuaternion
orPlane
which are enregisterable.So what ends up happening is that we explicitly track
Quaternion
andPlane
as being aTYP_SIMD16
much as we do forVector4
. This allows us to light up all the same methods with all the same intrinsic handling.I considered limiting things to a couple "bitcast" intrinsics,
AsVector4()
andAsQuaternion/Plane
, as this would allow implementingoperator +
as(value1.AsVector4() + value2.AsVector4()).AsQuaternion
. However, given we already have the intrinsic handling foroperator +
for Vector4 directly and this avoids the need to handle 3 intrinsics + inlining + eliding the copies, I opted to do it this way instead.Note about other changes
The other APIs are split between intrinsic and managed handling.
For the intrinsic handling, I limited it to APIs which already had intrinsic handling (e.g.
Add
andoperator +
can both be trivially done) and otherwise to APIs which are functionally "primitive" operations for these types such asLength
. This allows us to avoid inlining, extra intrinsic recognition/matching, and better handle problematic scenarios such as these beinginstance methods
(when ideally they'd bestatic extensions
like they are forVector64/128/256/512<T>
) or returning scalar types just to immediately convert back to a vector.Other APIs such as
Transform
,Cross
,Reflect
, etc are considered "complex" and are left entirely to managed. They do not optimize down to 1-2 instruction sequences and are expected to be more expensive and so are left to managed code.