Modernize and optimize pixel format operations across platforms. #2645

Merged
JimBobSquarePants merged 14 commits into main from js/pixelsformats on Jan 30, 2024

Conversation

JimBobSquarePants
Member

@JimBobSquarePants JimBobSquarePants commented Jan 15, 2024

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

Fixes #2232 and #594

  • Updates API to use static interface methods for improved usability and extensibility.
  • Optimizes and modernizes SIMD pixel shuffle operations, adding ARM support.
  • Optimizes and modernizes SIMD Vector4 packing operations. (Deferred: I'm leaving this for a follow-up PR.)

I don't expect anyone to reasonably find the time to analyze this line-by-line as it's a massive piece of work. However, the changes on the whole are simple and the developer benefit is massive.

API change Elevator Pitch

The abomination that is:

TPixel pixel = default;
pixel.FromScaledVector(vector);

Becomes, as nature intended:

TPixel pixel = TPixel.FromScaledVector(vector);

All other conversion methods follow suit where applicable.
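
For readers who have not used static abstract interface members (C# 11 / .NET 7 and later), here is a minimal sketch of the shape this enables. `IDemoPixel<TSelf>`, `DemoRgba32` and `PixelDemo` are hypothetical names invented for illustration and are not the library's actual types; only the `TPixel.FromScaledVector(vector)` call shape comes from the description above.

```csharp
using System;
using System.Numerics;

// Hypothetical, trimmed-down stand-in for a pixel interface (not the real one).
public interface IDemoPixel<TSelf>
    where TSelf : unmanaged, IDemoPixel<TSelf>
{
    // Static abstract member: invoked on the type, not on an instance.
    static abstract TSelf FromScaledVector(Vector4 source);

    Vector4 ToScaledVector4();
}

// Hypothetical 4-byte pixel used only for illustration.
public readonly struct DemoRgba32 : IDemoPixel<DemoRgba32>
{
    public readonly byte R;
    public readonly byte G;
    public readonly byte B;
    public readonly byte A;

    public DemoRgba32(byte r, byte g, byte b, byte a)
    {
        R = r;
        G = g;
        B = b;
        A = a;
    }

    public static DemoRgba32 FromScaledVector(Vector4 source)
    {
        // Clamp to [0, 1], scale to [0, 255], then round to the nearest integer.
        Vector4 v = Vector4.Clamp(source, Vector4.Zero, Vector4.One) * 255f;
        return new DemoRgba32(
            (byte)MathF.Round(v.X),
            (byte)MathF.Round(v.Y),
            (byte)MathF.Round(v.Z),
            (byte)MathF.Round(v.W));
    }

    public Vector4 ToScaledVector4() => new Vector4(R, G, B, A) / 255f;
}

public static class PixelDemo
{
    // Generic code creates a pixel without first declaring a throwaway default instance.
    public static TPixel Convert<TPixel>(Vector4 scaled)
        where TPixel : unmanaged, IDemoPixel<TPixel>
        => TPixel.FromScaledVector(scaled);
}
```

Calling `PixelDemo.Convert<DemoRgba32>(new Vector4(1f, 0f, 0f, 1f))` goes straight through the static member, which is the usability gain the pitch describes.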

@JimBobSquarePants JimBobSquarePants added this to the v4.0.0 milestone Jan 15, 2024
/// <param name="numBytes">The number of bytes to shift by.</param>
/// <returns>The <see cref="Vector128{Byte}"/>.</returns>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector128<byte> ShiftRightBytesInVector(Vector128<byte> value, [ConstantExpected(Max = (byte)15)] byte numBytes)
Member Author

There's something not right in one of the methods below on Arm64.

@tannergooding @saucecontrol Would either of you be able to do a quick readthrough and set me right?

Member Author

Figured it out. It was left shift. I'd forgotten to offset.

/// Calculates <paramref name="x"/> % 4
/// </summary>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static nuint Modulo4(nuint x) => x & 3;
Contributor

@tannergooding tannergooding Jan 16, 2024

Just noting that x % 4 should already get optimized to x & 3 by the JIT since x is nuint (and therefore definitely unsigned)

Member Author

Thanks yeah, I just kept it the same for consistency.
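
As a quick sanity check of the point made in this thread, the sketch below compares the mask against the remainder operator for unsigned values and shows why the "definitely unsigned" part matters. `Modulo4Demo` and its members are hypothetical names for illustration; only `Modulo4` mirrors the helper under review.

```csharp
using System;

public static class Modulo4Demo
{
    // Same shape as the helper under review.
    public static nuint Modulo4(nuint x) => x & 3;

    // Returns true when the mask and the remainder agree for every x in [0, limit).
    public static bool AgreesWithRemainder(nuint limit)
    {
        for (nuint x = 0; x < limit; x++)
        {
            if (Modulo4(x) != x % 4)
            {
                return false;
            }
        }

        return true;
    }

    // For a negative signed value the two expressions differ,
    // which is why the rewrite is only trivially valid for unsigned operands.
    public static (int Remainder, int Masked) SignedExample()
    {
        int x = -5;
        return (x % 4, x & 3); // (-1, 3)
    }
}
```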

Comment on lines +260 to +266
ref Vector512<float> vs0 = ref Unsafe.Add(ref sourceBase, i);
ref Vector512<float> vd0 = ref Unsafe.Add(ref destinationBase, i);

- vd0 = Avx.Permute(vs0, control);
- Unsafe.Add(ref vd0, 1) = Avx.Permute(Unsafe.Add(ref vs0, 1), control);
- Unsafe.Add(ref vd0, 2) = Avx.Permute(Unsafe.Add(ref vs0, 2), control);
- Unsafe.Add(ref vd0, 3) = Avx.Permute(Unsafe.Add(ref vs0, 3), control);
+ vd0 = Vector512Utilities.Shuffle(vs0, control);
+ Unsafe.Add(ref vd0, (nuint)1) = Vector512Utilities.Shuffle(Unsafe.Add(ref vs0, (nuint)1), control);
+ Unsafe.Add(ref vd0, (nuint)2) = Vector512Utilities.Shuffle(Unsafe.Add(ref vs0, (nuint)2), control);
+ Unsafe.Add(ref vd0, (nuint)3) = Vector512Utilities.Shuffle(Unsafe.Add(ref vs0, (nuint)3), control);
Contributor

@tannergooding tannergooding Jan 16, 2024

Worth noting that you could do this like the following:

Vector512Utilities.Shuffle(Vector512.LoadUnsafe(ref sourceBase, i + 0), control).StoreUnsafe(ref destinationBase, i + 0);
Vector512Utilities.Shuffle(Vector512.LoadUnsafe(ref sourceBase, i + 1), control).StoreUnsafe(ref destinationBase, i + 1);
Vector512Utilities.Shuffle(Vector512.LoadUnsafe(ref sourceBase, i + 2), control).StoreUnsafe(ref destinationBase, i + 2);
Vector512Utilities.Shuffle(Vector512.LoadUnsafe(ref sourceBase, i + 3), control).StoreUnsafe(ref destinationBase, i + 3);

It shouldn't result in any codegen differences, but avoids needing to manipulate the byref and uses direct hardware intrinsic APIs, rather than the Unsafe helpers, so can be easier to read/understand at least IMO.

Member Author

sourceBase is Vector512<float> so Vector512.LoadUnsafe(ref sourceBase, i + 0) would yield Vector512<Vector512<float>>. I could use float as the base ref type but that complicates the offsetting which confuses me and causes me to make mistakes.
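
To make the offsetting trade-off concrete, here is a rough sketch of the `LoadUnsafe`/`StoreUnsafe` style with a `float` base ref. It uses `Vector128` and the built-in `Vector128.Shuffle` as stand-ins (assumptions chosen so the example is self-contained; the PR itself uses `Vector512<float>` and its own `Vector512Utilities.Shuffle`). `ShuffleSketch` is a hypothetical name. The point is that the element offset is counted in floats, so each step advances by `Vector128<float>.Count` rather than by 1.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

public static class ShuffleSketch
{
    // Reorders every group of four floats in source according to control
    // and writes the result to destination.
    public static void ShuffleGroupsOfFour(
        ReadOnlySpan<float> source,
        Span<float> destination,
        Vector128<int> control)
    {
        int lanes = Vector128<float>.Count; // 4 floats per 128-bit vector

        if (source.Length % lanes != 0 || destination.Length < source.Length)
        {
            throw new ArgumentException("Source length must be a multiple of 4 and fit in destination.");
        }

        ref float sourceBase = ref MemoryMarshal.GetReference(source);
        ref float destinationBase = ref MemoryMarshal.GetReference(destination);

        // With a float base ref the offset is in elements (floats), not vectors.
        for (nuint i = 0; i < (nuint)source.Length; i += (nuint)lanes)
        {
            Vector128.Shuffle(Vector128.LoadUnsafe(ref sourceBase, i), control)
                     .StoreUnsafe(ref destinationBase, i);
        }
    }
}
```

Keeping the base ref typed as a vector, as the PR does, lets the offsets stay `i + 1`, `i + 2`, and so on, at the cost of going through `Unsafe.Add`.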

Contributor

@tannergooding tannergooding left a comment

LGTM, just left a couple comments on newer APIs that might help with readability

/// </summary>
/// <returns>The <see cref="PixelTypeInfo"/>.</returns>
#pragma warning disable CA1000
static abstract PixelTypeInfo GetPixelTypeInfo();
Contributor

Just curious. Is there a specific reason to kick it from the interface and only have it in the implementations? Probably it was never called from the interface and only from specific implementations?

Member Author

Actually, this is just temporary. I had to hack to get the solution to build because we were misusing ApproximateFloatComparer in the tests.

@@ -1,6 +1,7 @@
<?xml version="1.0" encoding="utf-8"?>
<RuleSet Name="ImageSharp" ToolsVersion="17.0">
<Include Path="..\shared-infrastructure\sixlabors.ruleset" Action="Default" />
<Rules AnalyzerId="StyleCop.Analyzers" RuleNamespace="StyleCop.Analyzers">
<Rules AnalyzerId="Microsoft.CodeAnalysis.CSharp.Features" RuleNamespace="Microsoft.CodeAnalysis.CSharp.Features">
<Rule Id="IDE0290" Action="None" />
Contributor

This should also be suppressible via .editorconfig using csharp_style_prefer_primary_constructors = false:none (or a different severity level if you want to block use of primary constructors altogether, such as silent, suggestion, warning, or error)

Member Author

Thanks. Yeah, I’ll do that upstream. Not sold on primary constructors. The syntax doesn’t sit right with me.

@JimBobSquarePants JimBobSquarePants marked this pull request as ready for review January 21, 2024 12:17
@JimBobSquarePants JimBobSquarePants changed the title from "WIP : Modernize and optimize pixel format operations across platforms." to "Modernize and optimize pixel format operations across platforms." Jan 21, 2024
@JimBobSquarePants JimBobSquarePants requested review from stefannikolei and tocsoft and removed request for stefannikolei January 21, 2024 12:17
@@ -492,7 +468,7 @@ private void UncompressRle4(BufferedReadStream stream, int w, Span<byte> buffer,
int max = cmd[1];
int bytesToRead = (int)(((uint)max + 1) / 2);

- Span<byte> run = bytesToRead <= 128 ? scratchBuffer.Slice(0, bytesToRead) : new byte[bytesToRead];
+ Span<byte> run = bytesToRead <= 128 ? scratchBuffer[..bytesToRead] : new byte[bytesToRead];
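
The hunk above swaps `Slice(0, n)` for the range form. A small sketch, assuming nothing beyond the standard `Span<T>` API, confirming the two spellings produce the same span (`RangeDemo` is a hypothetical name):

```csharp
using System;

public static class RangeDemo
{
    // Returns true when Slice(0, n) and [..n] describe the same memory.
    public static bool SameSlice(int bytesToRead)
    {
        Span<byte> scratchBuffer = stackalloc byte[128];

        Span<byte> viaSlice = scratchBuffer.Slice(0, bytesToRead);
        Span<byte> viaRange = scratchBuffer[..bytesToRead];

        // Span equality compares the starting reference and the length, not the contents.
        return viaSlice == viaRange;
    }
}
```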
Contributor

I am curious whether there is an analyzer to enforce this style. I know that Rider has an editorconfig setting for this.

@@ -185,7 +185,7 @@ public ImageInfo Identify(BufferedReadStream stream, CancellationToken cancellat
{
uint frameCount = 0;
ImageFrameMetadata? previousFrame = null;
- List<ImageFrameMetadata> framesMetadata = new();
+ List<ImageFrameMetadata> framesMetadata = [];
Contributor

If you want, I can look at enforcing this style in a separate PR.

Member Author

That would be nice thanks. The IDE has been making the suggestion for me, that and range usage also.

Vector4 ToVector4();
/// <param name="source">The <see cref="Abgr32"/> value.</param>
/// <returns>The <typeparamref name="TSelf"/>.</returns>
static abstract TSelf FromAbgr32(Abgr32 source);
Member

@antonfirsov antonfirsov Jan 22, 2024

This is creating a copy the JIT may not be able(?) to eliminate, which can hurt performance, i.e. I expect pixelSpan[i] = TPixel.FromArgb(source) to be slower than pixelSpan[i].FromArgb(source).

Have you benchmarked the impact on (non-optimized) batch pixel span conversion before/after?

Edit: The issue might be nonexistent for 32/64-bit pixel types but present for stuff like RgbaVector. We have to benchmark and/or check codegen to see.

Member Author

I've updated the benchmarks in PixelConversion_ConvertFromRgba32 to test the static byval implementation vs byref and byval for TestRgba, TestArgb, and TestRgbaVector. All benchmarks look good with the JIT doing a great job of normalizing the differences.

  BenchmarkDotNet v0.13.10, Windows 11 (10.0.22631.3007/23H2/2023Update/SunValley3)
  11th Gen Intel Core i7-11370H 3.30GHz, 1 CPU, 8 logical and 4 physical cores
  .NET SDK 8.0.200-preview.23624.5
    [Host]     : .NET 8.0.1 (8.0.123.58001), X64 RyuJIT AVX2
    DefaultJob : .NET 8.0.1 (8.0.123.58001), X64 RyuJIT AVX2

PixelConversion_ConvertFromRgba32_Compatible

| Method      | Count | Mean       | Error   | StdDev  | Ratio |
|-------------|-------|------------|---------|---------|-------|
| ByRef       | 256   | 103.4 ns   | 0.52 ns | 0.46 ns | 1.00  |
| ByVal       | 256   | 103.3 ns   | 1.48 ns | 1.38 ns | 1.00  |
| StaticByVal | 256   | 104.0 ns   | 0.36 ns | 0.30 ns | 1.01  |
| FromBytes   | 256   | 201.8 ns   | 1.30 ns | 1.15 ns | 1.95  |
| Inline      | 256   | 106.6 ns   | 0.40 ns | 0.34 ns | 1.03  |
| ByRef       | 2048  | 771.5 ns   | 3.68 ns | 3.27 ns | 1.00  |
| ByVal       | 2048  | 769.7 ns   | 3.39 ns | 2.83 ns | 1.00  |
| StaticByVal | 2048  | 773.2 ns   | 3.95 ns | 3.50 ns | 1.00  |
| FromBytes   | 2048  | 1,555.3 ns | 9.24 ns | 8.19 ns | 2.02  |
| Inline      | 2048  | 799.5 ns   | 5.91 ns | 4.93 ns | 1.04  |

PixelConversion_ConvertFromRgba32_Permuted_RgbaToArgb

| Method                         | Count | Mean        | Error     | StdDev   | Ratio | RatioSD |
|--------------------------------|-------|-------------|-----------|----------|-------|---------|
| ByRef                          | 256   | 203.48 ns   | 3.318 ns  | 3.104 ns | 1.00  | 0.00    |
| ByVal                          | 256   | 201.46 ns   | 2.242 ns  | 1.872 ns | 0.99  | 0.02    |
| StaticByVal                    | 256   | 201.45 ns   | 0.791 ns  | 0.701 ns | 0.99  | 0.02    |
| FromBytes                      | 256   | 200.76 ns   | 1.365 ns  | 1.140 ns | 0.99  | 0.01    |
| InlineShuffle                  | 256   | 221.65 ns   | 2.104 ns  | 1.968 ns | 1.09  | 0.02    |
| PixelConverter_Rgba32_ToArgb32 | 256   | 26.23 ns    | 0.277 ns  | 0.231 ns | 0.13  | 0.00    |
| ByRef                          | 2048  | 1,561.54 ns | 11.208 ns | 8.751 ns | 1.00  | 0.00    |
| ByVal                          | 2048  | 1,554.26 ns | 9.607 ns  | 8.517 ns | 1.00  | 0.01    |
| StaticByVal                    | 2048  | 1,562.48 ns | 8.937 ns  | 8.360 ns | 1.00  | 0.01    |
| FromBytes                      | 2048  | 1,552.68 ns | 7.445 ns  | 5.812 ns | 0.99  | 0.01    |
| InlineShuffle                  | 2048  | 1,711.28 ns | 7.559 ns  | 6.312 ns | 1.10  | 0.01    |
| PixelConverter_Rgba32_ToArgb32 | 2048  | 94.43 ns    | 0.363 ns  | 0.322 ns | 0.06  | 0.00    |

PixelConversion_ConvertFromRgba32_RgbaToRgbaVector

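| Method      | Count | Mean       | Error    | StdDev   | Ratio | RatioSD |
|-------------|-------|------------|----------|----------|-------|---------|
| ByRef       | 256   | 448.5 ns   | 4.86 ns  | 4.06 ns  | 1.00  | 0.00    |
| ByVal       | 256   | 447.0 ns   | 1.55 ns  | 1.21 ns  | 1.00  | 0.01    |
| StaticByVal | 256   | 447.4 ns   | 1.67 ns  | 1.30 ns  | 1.00  | 0.01    |
| ByRef       | 2048  | 3,577.7 ns | 53.80 ns | 47.69 ns | 1.00  | 0.00    |
| ByVal       | 2048  | 3,590.5 ns | 43.59 ns | 36.40 ns | 1.00  | 0.02    |
| StaticByVal | 2048  | 3,604.6 ns | 16.19 ns | 14.36 ns | 1.01  | 0.01    |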

Member Author

@antonfirsov Are there any additional concerns here or can I move on? I'm keen to keep moving.

@@ -14,14 +14,14 @@ namespace SixLabors.ImageSharp.PixelFormats;
public interface IPixel<TSelf> : IPixel, IEquatable<TSelf>
where TSelf : unmanaged, IPixel<TSelf>
{
#pragma warning disable CA1000 // Do not declare static members on generic types
Contributor

I'm going to open an issue for this on dotnet/roslyn-analyzers. Declaring static members on generic types is kind-of the purpose of static abstracts in interfaces 😆

Contributor

Looks like we have an issue already: dotnet/roslyn-analyzers#6424

Member

I think the reason that exists is to avoid unnecessary binary size bloat for when you eg. accidentally declare members that don't actually depend on type parameters at all. We had a few of those in ImageSharp, where the bokeh blur processor had a couple static fields with some readonly buffers (like, constants), and a bunch of pixel agnostic processing methods. We saved a nice amount of binary size moving those out to a non-generic type. It would be nice if the analyzer was smarter and just avoided warning in cases where you do in fact use the type parameter(s), but kept warning otherwise.

Contributor

@tannergooding tannergooding Jan 28, 2024

Nah, the analyzer exists not because of binary bloat but because MyClass.Foo(x) (given Foo<T>(T x)) works with generic inference while MyClass<T>.Foo(x) does not and since static members historically couldn't be abstract it didn't make sense to ever have them there.

So the analyzer really just hasn't been updated to account for the existence of static abstract in interfaces. Once it's been made aware, I'd expect it to suggest such members be explicitly implemented on the type and to defer to the MyClass.Foo<T>(...) implementation.
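
A minimal sketch of the inference difference described here, reusing the `MyClass` / `Foo` names from the comment. The method bodies and the `InferenceDemo` wrapper are assumptions added only so the example compiles.

```csharp
using System;

public static class MyClass
{
    // Generic method on a non-generic type: T is inferred from the argument.
    public static string Foo<T>(T x) => typeof(T).Name;
}

public static class MyClass<T>
{
    // Static member on a generic type: T has to be written at the call site.
    public static string Foo(T x) => typeof(T).Name;
}

public static class InferenceDemo
{
    // Resolves to the non-generic MyClass; T is inferred as int.
    public static string Inferred() => MyClass.Foo(42);

    // The type argument must be spelled out; there is no inference
    // from the method argument back to the containing type.
    public static string Explicit() => MyClass<int>.Foo(42);
}
```

Both calls return "Int32"; the difference is only in what the caller has to write, which is the readability concern CA1000 was originally guarding.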

Member Author

I’ve upvoted on the linked issue anyway

@JimBobSquarePants JimBobSquarePants mentioned this pull request Jan 29, 2024
4 tasks
@JimBobSquarePants
Member Author

Let's merge this. It's awesome.

@JimBobSquarePants JimBobSquarePants merged commit 2d979f9 into main Jan 30, 2024
6 checks passed
@JimBobSquarePants JimBobSquarePants deleted the js/pixelsformats branch January 30, 2024 02:21
Labels
API arch:arm64 area:pixelformats breaking Signifies a binary breaking change.
Development

Successfully merging this pull request may close these issues.

Performance optimization opportunities in common pixel formats.
6 participants