Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected performance difference with strange hack #109884

Open
Lanayx opened this issue Nov 16, 2024 · 1 comment
Open

Unexpected performance difference with strange hack #109884

Lanayx opened this issue Nov 16, 2024 · 1 comment
Labels
needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners tenet-performance Performance related issue untriaged New issue has not been triaged by the area owner

Comments

@Lanayx
Copy link

Lanayx commented Nov 16, 2024

Description

The following hack leads to 25% perf improvement

let mutable i = 0
while i < N do
    ...
    i <- i + 1

can be changed to

let mutable i = 0
while i < N do
    ...
    let mutable x = 1
    i <- i + x
// without hack:
// BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4391/23H2/2023Update/SunValley3)
// AMD Ryzen 5 5600H with Radeon Graphics, 1 CPU, 12 logical and 6 physical cores
// .NET SDK 9.0.100
//   [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2 DEBUG
//   DefaultJob : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2
//
//
// | Method     | Mean     | Error   | StdDev  | Gen0     | Gen1     | Gen2     | Allocated |
// |----------- |---------:|--------:|--------:|---------:|---------:|---------:|----------:|
// | WebUtility | 695.5 us | 4.14 us | 3.87 us | 332.0313 | 332.0313 | 332.0313 |   1370 KB |
// | Tools      | 668.1 us | 2.86 us | 2.39 us |  83.9844 |  68.3594 |        - |  691.6 KB |

// with hack:
// BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4391/23H2/2023Update/SunValley3)
// AMD Ryzen 5 5600H with Radeon Graphics, 1 CPU, 12 logical and 6 physical cores
// .NET SDK 9.0.100
//   [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2 DEBUG
//   DefaultJob : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2
//
//
// | Method     | Mean     | Error   | StdDev  | Gen0     | Gen1     | Gen2     | Allocated |
// |----------- |---------:|--------:|--------:|---------:|---------:|---------:|----------:|
// | WebUtility | 696.0 us | 2.37 us | 2.10 us | 332.0313 | 332.0313 | 332.0313 |   1370 KB |
// | Tools      | 498.0 us | 3.03 us | 2.83 us |  83.9844 |  68.3594 |        - |  691.6 KB |

Configuration

.NET SDK:
 Version:           9.0.100
 Commit:            59db016f11
 Workload version:  9.0.100-manifests.c6f19616
 MSBuild version:   17.12.7+5b8665660

Runtime Environment:
 OS Name:     Windows
 OS Version:  10.0.22631
 OS Platform: Windows
 RID:         win-x64
 Base Path:   C:\Program Files\dotnet\sdk\9.0.100\

.NET workloads installed:
There are no installed workloads to display.
Configured to use loose manifests when installing new manifests.

Host:
  Version:      9.0.0
  Architecture: x64
  Commit:       9d5a6a9aa4

Regression?

No, it's reproduced both with .NET8 and .NET9

Data

Repro is located here: Oxpecker, perfhack branch

  1. Clone repository
  2. Run tests/PerfTest project
  3. Observe results
  4. Remove dirty hack
  5. Run test again and observe results

Analysis

I don't understand why difference occurs, some people say it's variable hoisting, but I couldn't find detailed explanation.

@Lanayx Lanayx added the tenet-performance Performance related issue label Nov 16, 2024
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Nov 16, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Nov 16, 2024
@etki
Copy link

etki commented Nov 16, 2024

Sharplab reports the following difference for a simple example:

  IL_001e: call int32 _::get_i()
  IL_0023: ldc.i4.1
  IL_0024: add
  IL_0025: call void _::set_i(int32)

vs.

  IL_0048: ldc.i4.1
> IL_0049: stloc.0
  IL_004a: call int32 _::get_i()
> IL_004f: ldloc.0
  IL_0050: add
  IL_0051: call void _::set_i(int32)

Given that only one of the examples above gets improvement, and that the JIT output is identical, i think this could be just a budgeting issue

Reference loop:

    L0017: cmp dword ptr [eax+4], 0xc9
    L001e: jge short L0039
    L0020: mov edx, [eax+8]
    L0023: add edx, [eax+4]
    L0026: mov [eax+8], edx
    L0029: mov edx, [eax+4]
    L002c: inc edx
    L002d: mov [eax+4], edx
    L0030: cmp dword ptr [eax+4], 0xc9
    L0037: jl short L0020

Only labels and limit are different in the hacked loop placed below the reference one:

    L0039: cmp dword ptr [eax+4], 0x192
    L0040: jge short L005b
    L0042: mov edx, [eax+8]
    L0045: add edx, [eax+4]
    L0048: mov [eax+8], edx
    L004b: mov edx, [eax+4]
    L004e: inc edx
    L004f: mov [eax+4], edx
    L0052: cmp dword ptr [eax+4], 0x192
    L0059: jl short L0042

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners tenet-performance Performance related issue untriaged New issue has not been triaged by the area owner
Projects
None yet
Development

No branches or pull requests

2 participants