Performance optimization opportunities in common pixel formats. #2232

Closed
4 tasks done
JimBobSquarePants opened this issue Sep 16, 2022 · 13 comments · Fixed by #2645
Comments

@JimBobSquarePants
Member

Prerequisites

  • I have written a descriptive issue title
  • I have verified that I am running the latest version of ImageSharp
  • I have verified that the problem exists in both DEBUG and RELEASE mode
  • I have searched open and closed issues to ensure it has not already been reported

ImageSharp version

v3 alpha +

Other ImageSharp packages and versions

NA

Environment (Operating system, version and so on)

NA

.NET Framework version

NA

Description

As described here, there are several performance optimizations that can be implemented in many of our pixel format types. This should be fairly low-hanging fruit with a good return.

Notably on .NET 6/7, you could make this even more efficient by doing something like:

public void Pack(Vector4 vector)
{
    vector *= MaxBytes;
    vector += Half;
    vector = Vector4.Clamp(vector, Vector4.Zero, MaxBytes);

    Vector128<byte> result = Sse2.ConvertToVector128Int32WithTruncation(vector.AsVector128()).AsByte();
    // In .NET 7+ the above can be `result = Vector128.ConvertToInt32(vector.AsVector128()).AsByte()` so it works on Arm64 too

    R = result.GetElement(0);
    G = result.GetElement(4);
    B = result.GetElement(8);
    A = result.GetElement(12);
}

This converts all 4 elements at once and then extracts the truncated bytes directly:

vzeroupper
vmovupd xmm0, [0x7ffd160105c0]
vmovaps xmm1, xmm0
vmulps xmm1, xmm1, [rdx]
vmovupd [rdx], xmm1
vmovupd xmm1, [0x7ffd160105d0]
vaddps xmm1, xmm1, [rdx]
vmovupd [rdx], xmm1
vmovupd xmm1, [rdx]
vxorps xmm2, xmm2, xmm2
vmaxps xmm1, xmm1, xmm2
vminps xmm0, xmm1, xmm0
vmovupd [rdx], xmm0
vcvttps2dq xmm0, [rdx]
vpextrb eax, xmm0, 0
mov [rcx+2], al
vpextrb eax, xmm0, 4
mov [rcx+1], al
vpextrb eax, xmm0, 8
mov [rcx], al
vpextrb eax, xmm0, 0xc
mov [rcx+3], al
ret
  • We'll also be improving the codegen around vpextrb more in the future so it can be just vpextrb [rcx+2], xmm0, 0 instead of vpextrb eax, xmm0, 0 followed by mov [rcx+2], al.

You can also optimize in .NET 6+ by directly using Vector128.Create(). This creates a method-local constant and avoids the static initializer entirely:

    private static Vector4 MaxBytes => Vector128.Create(255f).AsVector4();
    private static Vector4 Half => Vector128.Create(0.5f).AsVector4();
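
For reference, here is a hedged sketch of the .NET 7+ cross-platform variant hinted at in the comment inside Pack above; it assumes the same MaxBytes/Half constants and byte R/G/B/A fields shown earlier:

public void Pack(Vector4 vector)
{
    vector *= MaxBytes;
    vector += Half;
    vector = Vector4.Clamp(vector, Vector4.Zero, MaxBytes);

    // In .NET 7+, Vector128.ConvertToInt32 performs the truncating conversion on
    // both x64 and Arm64, so no Sse2-specific intrinsic is needed.
    Vector128<byte> result = Vector128.ConvertToInt32(vector.AsVector128()).AsByte();

    R = result.GetElement(0);
    G = result.GetElement(4);
    B = result.GetElement(8);
    A = result.GetElement(12);
}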

Steps to Reproduce

NA

Images

No response

@tannergooding
Contributor

ImageSharp is .NET 6+ now, correct?

If you want to assign this to me, I can work on getting a fix up and take a peek at some of the other SIMD code for other .NET 6+ specific improvements.

@brianpopow
Collaborator

> ImageSharp is .NET 6+ now, correct?
>
> If you want to assign this to me, I can work on getting a fix up and take a peek at some of the other SIMD code for other .NET 6+ specific improvements.

@tannergooding yes, it's .NET 6+ now. I have assigned you to this task, thanks!

@JimBobSquarePants
Member Author

@tannergooding Any and all suggestions will be deeply appreciated 😄

@tannergooding
Contributor

Few questions...

  1. How would you like me to structure the PR(s)?
  2. What's the consideration for #ifdef vs providing helper methods?
  3. What's the recommended way to benchmark and present numbers as part of a PR?
  4. How much "freedom" is there in changing internal implementation details of structs/methods?

For 1, I'm mostly asking because I can either do several "smallish" PRs, each targeting a specific type, or larger PRs that cover changes more holistically across the repo.

For 2, I'm mostly asking about the preferred way of handling cases where Arm64 support might be added. For .NET 6, we have to use AdvSimd directly, while for .NET 7 you can just use the methods/operators directly on Vector128<float> and the JIT will do the "optimal" thing for x64, AdvSimd, WASM, or other future platforms.
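
As a concrete illustration of that difference, here is a rough sketch; the SimdHelpers type and the multiply-add operation are hypothetical and not taken from the ImageSharp codebase:

using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

internal static class SimdHelpers
{
    // .NET 6 style: pick the ISA-specific intrinsic per platform.
    public static Vector128<float> MultiplyAdd(Vector128<float> a, Vector128<float> b, Vector128<float> c)
    {
        if (Fma.IsSupported)
        {
            return Fma.MultiplyAdd(b, c, a); // (b * c) + a on x64
        }

        if (AdvSimd.IsSupported)
        {
            return AdvSimd.FusedMultiplyAdd(a, b, c); // a + (b * c) on Arm64
        }

        // Fallback via Vector4, which is accelerated on .NET 6.
        return ((b.AsVector4() * c.AsVector4()) + a.AsVector4()).AsVector128();

        // On .NET 7+ the whole helper can collapse to `a + (b * c)` because the
        // operators on Vector128<float> are lowered optimally per platform.
    }
}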

For 3, naturally I'd like to show numbers actually indicating changes are wins and in a format expected by the repo. I didn't see any callouts in the contributing docs, but maybe I missed it.

For 4, as an example there are a number of structs that contain 2/3/4 float fields (often exposed via auto-props). Due to the JIT specializing Vector2/3/4 but not user-defined structs today (hopefully I can get this fixed eventually, so all qualifying types get this same specialization), you will likely get better codegen by wrapping a single Vector2/3/4 instead.

Additionally, there is the potential to wrap Vector128<float> for "even better" perf. However, that option means the type can no longer be used for interop, if that's a consideration at all. There is an easy workaround of getting at the backing Vector128<float> field and using .AsVector4() if interop is needed, but it is something to think about.
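
To make the two wrapping options concrete, here is a minimal sketch; the type names are hypothetical and not from the ImageSharp codebase:

using System.Numerics;
using System.Runtime.Intrinsics;

// Option A: wrap a System.Numerics vector. The JIT specializes Vector2/3/4, so
// arithmetic on the wrapper gets SIMD codegen that three separate float fields
// would not.
internal readonly struct RgbVector
{
    private readonly Vector3 value;

    public RgbVector(float r, float g, float b) => this.value = new Vector3(r, g, b);

    public float R => this.value.X;
    public float G => this.value.Y;
    public float B => this.value.Z;
}

// Option B: wrap Vector128<float> for "even better" perf and expose a Vector4
// view via the zero-cost AsVector4() conversion when interop-friendly access is
// needed.
internal readonly struct RgbaVector128
{
    private readonly Vector128<float> value;

    public RgbaVector128(Vector128<float> value) => this.value = value;

    public Vector4 AsVector4() => this.value.AsVector4();
}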

There's also a slightly more complex consideration in that while Vector2/3/4 are currently accelerated and generally performant, types like Matrix3x2, Matrix4x4, and other neighboring types are not (I'm going to try to improve that in .NET 8, but no guarantees). I could write up more performant helpers and utilize Unsafe.As to achieve significant perf wins for .NET 6/7 without changing the public surface area away from taking the System.Numerics types.
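
Here is a rough sketch of that Unsafe.As approach: the helper keeps Matrix4x4 in its signature but treats the matrix as four Vector4 rows internally (the helper itself is hypothetical):

using System.Numerics;
using System.Runtime.CompilerServices;

internal static class MatrixHelpers
{
    public static Vector4 Transform(Vector4 v, in Matrix4x4 m)
    {
        // Matrix4x4 is 16 sequential floats, so its rows can be reinterpreted as
        // Vector4 values and the transform written as accelerated Vector4 math.
        ref Vector4 row0 = ref Unsafe.As<Matrix4x4, Vector4>(ref Unsafe.AsRef(in m));
        ref Vector4 row1 = ref Unsafe.Add(ref row0, 1);
        ref Vector4 row2 = ref Unsafe.Add(ref row0, 2);
        ref Vector4 row3 = ref Unsafe.Add(ref row0, 3);

        return (row0 * v.X) + (row1 * v.Y) + (row2 * v.Z) + (row3 * v.W);
    }
}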

There are also some cases where Vector128<float> is a better choice than Vector4 in .NET 6, but where they are effectively the same in .NET 7. The .AsVector4() and .AsVector128() APIs allow zero-cost conversion in .NET 6+, so using Vector128<float> in internal methods can provide wins.

@JimBobSquarePants
Member Author

Apologies for the slow reply, I was offline this weekend.

I'll try to summarise a response to each question for you.

  1. Since the focus is mostly on pixel formats here, I think we can err towards the larger side. I can see a lot of repetition on the horizon so it should be OK to review.
  2. I'd like to avoid #ifdef where possible since we are only targeting the single framework. If we add Arm64 (something we really need to get a performant build process in place for - QEMU is far too slow!) we can use AdvSimd, and when .NET 8 comes around we'll switch target to that and incrementally replace the methods to use the new APIs as and when we can.
  3. Benchmarking is a little ad hoc. There is a folder in the benchmark project under General/PixelConversion where you can place benchmarks in whatever format you see fit.
  4. You have absolute freedom there, as long as the changes do not make maintaining the codebase more difficult. I'd personally love to put a lot of focus into streamlining and simplifying some of the internal code, but there are still some other chunky tasks - Optimize pixel blending with integer arithmetics #1433 - that need to be done.

Thanks again for your help here!

@tannergooding
Contributor

Thanks a lot for the responses, they all make sense to me!

@tannergooding
Contributor

tannergooding commented Oct 4, 2022

Just wanted to give an update on this. I've been familiarizing myself with the codebase and the various SIMD code, making notes of possible improvements as I go.

I'll be closing on my first house and moving over the next couple of weeks, so I will end up taking a short break, but I hope to have a PR up and a bullet list of potential additional improvements closer to the end of the month.

@JimBobSquarePants
Member Author

Congratulations on the house!

No worries, enjoy your break. Looking forward to seeing what you come up with!

@tannergooding
Contributor

Am still working on this. Codegen wasn't doing what I expected, so I spent some time fixing that up in the JIT (believe you've seen the PR already 😄).

With the changes I've made, the codegen on my local changes is significantly better. So should provide some nice wins off the bat and even better gains once 8.0 comes around.

@JimBobSquarePants
Member Author

Yeah, I saw that PR 😃 awesome stuff!

.NET 8 seems like it's going to allow massive improvements to the ImageSharp codebase. Simplified SIMD with out-of-the-box ARM support will be a game changer.

@tannergooding
Contributor

Also got a rewrite of Matrix4x4 and Matrix3x2 in: dotnet/runtime#80091

This resulted in perf improvements of 2x up to 48x and should have a huge positive impact on ImageSharp. There is also opportunity for me to rewrite/improve Plane and Quaternion, but I didn't see any broad impact in my initial profile captures.

I plan on rerunning ImageSharp perf benchmarks once I have a dotnet/installer build available, but that represents the largest point of unexpected perf I was seeing. After that, I can finish up on the changes I've been working on here and it should be even better.

@JimBobSquarePants
Member Author

Haha.... You're a whole new level of awesome. I was expecting a few small changes; you're rewriting the runtime for the benefit of all. Truly amazing!

@JimBobSquarePants
Member Author

I think our ColorMatrix type can benefit from copying those changes.
