DenseMap performance 'old' Intel CPU #43

Closed
check4game opened this issue Jan 11, 2025 · 23 comments

@check4game

Hi, again:)

HardwareIntrinsics=AVX,AES,PCLMUL,POPCNT VectorSize=128

BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4170/22H2/2022Update)
Intel Core i7-2700K CPU 3.50GHz (Sandy Bridge), 1 CPU, 8 logical and 4 physical cores
.NET SDK 9.0.101
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX
  Job-LMREHN : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX

Force=True  Server=True  InvocationCount=1  
IterationCount=5  LaunchCount=1  RunStrategy=Monitoring  
UnrollFactor=1  WarmupCount=2  
| Method | Length | Mean | Error | StdDev | Ratio | RatioSD | Code Size | Allocated | Alloc Ratio |
|--- |--- |---:|---:|---:|---:|---:|---:|---:|---:|
| DenseMap | 80000000 | 12.581 s | 0.9408 s | 0.2443 s | 1.41 | 0.03 | 592 B | 64 B | 0.57 |
| DenseMapCPU | 80000000 | 7.525 s | 0.3704 s | 0.0962 s | 0.85 | 0.01 | 561 B | 112 B | 1.00 |
| Dictionary | 80000000 | 8.906 s | 0.3936 s | 0.1022 s | 1.00 | 0.01 | 227 B | 112 B | 1.00 |

HardwareIntrinsics=AVX,AES,PCLMUL,POPCNT VectorSize=128

BenchmarkDotNet v0.14.0, Windows 8 (6.2.9200.0)
Intel Core i7-3770 CPU 3.40GHz (Ivy Bridge), 1 CPU, 8 logical and 4 physical cores
Frequency: 3330092 Hz, Resolution: 300.292 ns, Timer: TSC
.NET SDK 9.0.101
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX
  Job-RHKLBE : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX

Force=True  Server=True  InvocationCount=1  
IterationCount=5  LaunchCount=1  RunStrategy=Monitoring  
UnrollFactor=1  WarmupCount=2  
| Method | Length | Mean | Error | StdDev | Ratio | RatioSD | Code Size | Allocated | Alloc Ratio |
|--- |--- |---:|---:|---:|---:|---:|---:|---:|---:|
| DenseMap | 80000000 | 13.047 s | 0.4102 s | 0.1065 s | 1.40 | 0.16 | 592 B | 112 B | 1.00 |
| DenseMapCPU | 80000000 | 7.952 s | 0.3423 s | 0.0889 s | 0.85 | 0.10 | 561 B | 112 B | 1.00 |
| Dictionary | 80000000 | 9.493 s | 5.4299 s | 1.4101 s | 1.02 | 0.18 | 227 B | 112 B | 1.00 |

HardwareIntrinsics=AVX2,AES,BMI1,BMI2,FMA,LZCNT,PCLMUL,POPCNT VectorSize=256

BenchmarkDotNet v0.14.0, Windows 10 (10.0.20348.2849)
Intel Core i7-10700K CPU 3.80GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.101
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2
  Job-XXDNRI : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2

Force=True  Server=True  InvocationCount=1  
IterationCount=5  LaunchCount=1  RunStrategy=Monitoring  
UnrollFactor=1  WarmupCount=2  
| Method | Length | Mean | Error | StdDev | Ratio | RatioSD | Code Size | Allocated | Alloc Ratio |
|--- |--- |---:|---:|---:|---:|---:|---:|---:|---:|
| DenseMap | 80000000 | 8.737 s | 0.3173 s | 0.0824 s | 1.60 | 0.04 | 589 B | 112 B | 1.75 |
| DenseMapCPU | 80000000 | 4.790 s | 0.3992 s | 0.1037 s | 0.87 | 0.03 | 561 B | 112 B | 1.75 |
| Dictionary | 80000000 | 5.479 s | 0.4981 s | 0.1293 s | 1.00 | 0.03 | 227 B | 64 B | 1.00 |

HardwareIntrinsics=AVX-512F+CD+BW+DQ+VL+VBMI,AES,BMI1,BMI2,FMA,LZCNT,PCLMUL,POPCNT,AvxVnni VectorSize=256

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.2605)
AMD Ryzen 7 9700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.101
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-AXOGOJ : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

Force=True  Server=True  InvocationCount=1  
IterationCount=5  LaunchCount=1  RunStrategy=Monitoring  
UnrollFactor=1  WarmupCount=2  
| Method | Length | Mean | Error | StdDev | Ratio | RatioSD | Code Size | Allocated | Alloc Ratio |
|--- |--- |---:|---:|---:|---:|---:|---:|---:|---:|
| DenseMap | 80000000 | 2.366 s | 0.3362 s | 0.0873 s | 0.58 | 0.02 | 585 B | 112 B | 1.75 |
| DenseMapCPU | 80000000 | 3.824 s | 0.1602 s | 0.0416 s | 0.94 | 0.01 | 561 B | 112 B | 1.75 |
| Dictionary | 80000000 | 4.059 s | 0.0739 s | 0.0192 s | 1.00 | 0.01 | 227 B | 64 B | 1.00 |

MyVer.zip

@Wsm2110
Owner

Wsm2110 commented Jan 11, 2025

Have to say, I love your resilience and dedication :)

The main idea is to use vectorization and make sure it runs well on any CPU. This scalar method works fine on older CPUs, but it can’t match the performance of vectorization on newer ones.
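
To make the scalar-versus-vectorized distinction concrete, here is a minimal sketch of both ways to find the first empty slot in a group of 16 control bytes. It is not DenseMap's actual code: the `EmptyBucket` value and the helper names `FirstEmptyScalar` and `FirstEmptyVector` are assumptions made for illustration.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

// Hypothetical marker for an empty slot; the real DenseMap value may differ.
const byte EmptyBucket = 0x80;

// Scalar probe: inspect the 16 control bytes of one group one at a time.
int FirstEmptyScalar(byte[] control, int start)
{
    for (int i = 0; i < 16; i++)
    {
        if (control[start + i] == EmptyBucket) return i;
    }
    return -1;
}

// Vectorized probe: compare all 16 control bytes at once,
// collapse the result into a 16-bit mask, and take the lowest set bit.
int FirstEmptyVector(byte[] control, int start)
{
    Vector128<byte> loaded = Vector128.Create(control, start);
    uint mask = Vector128.Equals(loaded, Vector128.Create(EmptyBucket))
                         .ExtractMostSignificantBits();
    return mask == 0 ? -1 : BitOperations.TrailingZeroCount(mask);
}

byte[] demo = new byte[16];
Array.Fill(demo, (byte)0x01);   // every slot occupied...
demo[5] = EmptyBucket;          // ...except slot 5

Console.WriteLine(FirstEmptyScalar(demo, 0)); // 5
Console.WriteLine(FirstEmptyVector(demo, 0)); // 5
```

Both functions return the same offset; what the benchmarks in this thread measure is how quickly each form executes on a given CPU.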

@check4game
Author

It seems to me that SIMD on old processors is mostly a marketing ploy to sell them to ordinary people at a high price.

Yes, it works, but with a lot of restrictions, and getting the code to actually run faster is not so easy.

It's like with a GPU: load the data once and process it quickly, but if you constantly need to load new data, the performance gain disappears.

It seems to me the problem is that in our case a byte array is processed rather than an array of words. I'll try to check this hypothesis.

@check4game
Author

DenseMapCPU now uses jumpDistance += 4; (the inline comment still says "Increase the jump distance by 16 to probe the next cluster").

Reducing the jump distance for EmplaceCPU gives the results below.
Looks cool; I hope I didn't make a mistake.

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.2605)
AMD Ryzen 7 9700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.101
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-YXPCFP : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

Force=True  Server=True  InvocationCount=1  
IterationCount=5  LaunchCount=1  RunStrategy=Monitoring  
UnrollFactor=1  WarmupCount=2  

| Method | Length | Mean | Error | StdDev | Ratio | RatioSD | Code Size | Allocated | Alloc Ratio |
|--- |--- |---:|---:|---:|---:|---:|---:|---:|---:|
| DenseMap | 80000000 | 2.449 s | 0.3059 s | 0.0794 s | 0.61 | 0.02 | 585 B | 112 B | 1.75 |
| DenseMapCPU | 80000000 | 2.330 s | 0.3142 s | 0.0816 s | 0.58 | 0.02 | 561 B | 112 B | 1.75 |
| Dictionary | 80000000 | 4.035 s | 0.0457 s | 0.0119 s | 1.00 | 0.00 | 227 B | 64 B | 1.00 |
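
For context on what the jump distance changes, here is a small sketch of a probe sequence. It assumes a scheme where the jump distance grows by a fixed step after every miss and the index wraps with a power-of-two mask; the function name `ProbeSequence` and its parameters are hypothetical, and the real EmplaceCPU loop may differ in detail.

```csharp
using System;

// Returns the first `count` slots visited when probing from `start`.
// After each miss the jump distance grows by `step`, then the index advances
// by the current jump distance and wraps around using `lengthMinusOne`.
uint[] ProbeSequence(uint start, uint step, uint lengthMinusOne, int count)
{
    var visited = new uint[count];
    uint index = start;
    uint jumpDistance = 0;
    for (int i = 0; i < count; i++)
    {
        visited[i] = index;
        jumpDistance += step;
        index = (index + jumpDistance) & lengthMinusOne;
    }
    return visited;
}

// Table of 1024 slots, so the wrap mask is 1023.
Console.WriteLine(string.Join(", ", ProbeSequence(0, 16, 1023, 5))); // 0, 16, 48, 96, 160
Console.WriteLine(string.Join(", ", ProbeSequence(0, 4, 1023, 5)));  // 0, 4, 12, 24, 40
```

With the smaller step, consecutive probes stay much closer together, which may keep them within fewer cache lines. Whether that explains the speedup measured above is not established in this thread.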

@Wsm2110
Owner

Wsm2110 commented Jan 12, 2025

I have a feeling that the performance hit of loading unaligned data on older CPUs is huge. I was considering using Vector128.ReadAligned(), where each index is rounded down to the start of its group of 16 (less than or equal to the index).
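
The rounding described here can be sketched as follows. The helper names `GroupStart` and `OffsetInGroup` are hypothetical. Note also that the aligned load in .NET is `Vector128.LoadAligned`, which takes a pointer and additionally requires the buffer itself to start on a 16-byte boundary, something a plain managed array does not guarantee.

```csharp
using System;

// Round an index down to the first slot of its 16-slot group.
uint GroupStart(uint index) => index & ~15u;

// Position of the index inside its group (0..15).
uint OffsetInGroup(uint index) => index & 15u;

uint probe = 37;
Console.WriteLine($"{GroupStart(probe)} + {OffsetInGroup(probe)}"); // 32 + 5
```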

@check4game
Author

> I have a feeling that the performance hit of loading unaligned data on older CPUs is huge. I was considering using Vector128.ReadAligned(), where each index is rounded down to the start of its group of 16 (less than or equal to the index).

The problem is not array alignment. I found which code slows down the insertion of new elements on older processors, but so far I don't understand why (:


BenchmarkDotNet v0.14.0, Windows 10 (10.0.20348.2849)
Intel Core i7-10700K CPU 3.80GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.101
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2
  Job-IKFKMC : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2

Force=True  Server=True  InvocationCount=1  
IterationCount=5  LaunchCount=1  RunStrategy=Monitoring  
UnrollFactor=1  WarmupCount=2  

| Method | Length | Mean | Error | StdDev | Ratio | RatioSD |
|--- |--- |---:|---:|---:|---:|---:|
| DenseMap | 80000000 | 8.654 s | 0.5200 s | 0.1350 s | 1.55 | 0.10 |
| DenseMapCPU | 80000000 | 5.110 s | 0.3751 s | 0.0974 s | 0.91 | 0.06 |
| DenseMapVFix | 80000000 | 3.776 s | 0.4238 s | 0.1101 s | 0.68 | 0.05 |
| Dictionary | 80000000 | 5.617 s | 1.6337 s | 0.4243 s | 1.00 | 0.09 |

@Wsm2110
Owner

Wsm2110 commented Jan 14, 2025

Mind benchmarking the read-vector-aligned branch? :)

Wondering if there is any difference.

Anyways your benchmark looks promising.

@check4game
Author

.NET aligns all memory to 8 bytes, but we need 16. Even that is not enough (: we also need to align the index to a multiple of 16.

_lengthMinusOne = (Length-1) & 0xFFFFFFF0
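
A short sketch of what this mask does, assuming the table length is a power of two (the length value and the helper name `StartIndex` are hypothetical): clearing the low four bits of the mask means every starting index derived from a hash is already a multiple of 16, so no per-probe rounding is needed.

```csharp
using System;

uint length = 1u << 20;                           // hypothetical table size, a power of two
uint lengthMinusOne = (length - 1) & 0xFFFFFFF0;  // low 4 bits cleared

// Map a hash to the first slot of a 16-slot group inside the table.
uint StartIndex(uint hash, uint mask) => hash & mask;

foreach (uint h in new uint[] { 0x12345678, 0xDEADBEEF, 17, 0xFFFFFFFF })
{
    uint slot = StartIndex(h, lengthMinusOne);
    Console.WriteLine($"{h:X8} -> {slot} (slot % 16 = {slot % 16})");
}
```

The largest index this can produce is `length - 16`, so a 16-byte load starting there still ends inside the table.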

I wrote a test suite just for the Add function (80_000_000)


BenchmarkDotNet v0.14.0, Windows 10 (10.0.20348.2849)
Intel Core i7-10700K CPU 3.80GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.101
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2
  Job-TPXRFP : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2

Force=True  Server=True  InvocationCount=1  
IterationCount=5  LaunchCount=1  RunStrategy=Monitoring  
UnrollFactor=1  WarmupCount=2  

| Method | Alignment | Mean | Error | StdDev | Ratio | RatioSD | Code Size | Allocated | Alloc Ratio |
|--- |--- |---:|---:|---:|---:|---:|---:|---:|---:|
| TestCPU | False | 2.558 s | 0.0580 s | 0.0151 s | 1.00 | 0.01 | 325 B | 112 B | 1.00 |
| TestCPUV | False | 4.485 s | 0.0494 s | 0.0128 s | 1.75 | 0.01 | 365 B | 112 B | 1.00 |
| TestCPUV2 | False | 4.384 s | 0.0303 s | 0.0079 s | 1.71 | 0.01 | 338 B | 112 B | 1.00 |
| TestV | False | 8.250 s | 0.2031 s | 0.0528 s | 3.23 | 0.03 | 364 B | 112 B | 1.00 |
| TestVFix | False | 2.726 s | 0.0482 s | 0.0125 s | 1.07 | 0.01 | 364 B | 112 B | 1.00 |
| TestVPtr | False | 8.307 s | 0.1741 s | 0.0452 s | 3.25 | 0.02 | 356 B | 400 B | 3.57 |
| TestVPtrFix | False | 2.675 s | 0.0610 s | 0.0158 s | 1.05 | 0.01 | 353 B | 112 B | 1.00 |
| TestVPtrA | False | NA | NA | NA | ? | ? | NA | NA | ? |
| TestVPtrAFix | False | NA | NA | NA | ? | ? | NA | NA | ? |
| TestCPU | True | 3.534 s | 0.0300 s | 0.0078 s | 1.00 | 0.00 | 325 B | 400 B | 1.00 |
| TestCPUV | True | 5.394 s | 0.0567 s | 0.0147 s | 1.53 | 0.00 | 365 B | 112 B | 0.28 |
| TestCPUV2 | True | 4.878 s | 0.0334 s | 0.0087 s | 1.38 | 0.00 | 338 B | 112 B | 0.28 |
| TestV | True | 8.300 s | 0.1851 s | 0.0481 s | 2.35 | 0.01 | 364 B | 400 B | 1.00 |
| TestVFix | True | 3.615 s | 0.0213 s | 0.0055 s | 1.02 | 0.00 | 364 B | 400 B | 1.00 |
| TestVPtr | True | 8.204 s | 0.0706 s | 0.0183 s | 2.32 | 0.01 | 356 B | 112 B | 0.28 |
| TestVPtrFix | True | 3.858 s | 0.0580 s | 0.0151 s | 1.09 | 0.00 | 353 B | 400 B | 1.00 |
| TestVPtrA | True | 8.196 s | 0.0968 s | 0.0251 s | 2.32 | 0.01 | 356 B | 400 B | 1.00 |
| TestVPtrAFix | True | 4.254 s | 0.0307 s | 0.0080 s | 1.20 | 0.00 | 353 B | 400 B | 1.00 |

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.2605)
AMD Ryzen 7 9700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.101
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-IQFXWS : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

Force=True  Server=True  InvocationCount=1  
IterationCount=5  LaunchCount=1  RunStrategy=Monitoring  
UnrollFactor=1  WarmupCount=2  

| Method | Alignment | Mean | Error | StdDev | Ratio | RatioSD | Code Size | Allocated | Alloc Ratio |
|--- |--- |---:|---:|---:|---:|---:|---:|---:|---:|
| TestCPU | False | 1.655 s | 0.0656 s | 0.0170 s | 1.00 | 0.01 | 325 B | 112 B | 1.00 |
| TestCPUV | False | 3.680 s | 0.0308 s | 0.0080 s | 2.22 | 0.02 | 365 B | 112 B | 1.00 |
| TestCPUV2 | False | 3.551 s | 0.1097 s | 0.0285 s | 2.15 | 0.03 | 338 B | 400 B | 3.57 |
| TestV | False | 1.501 s | 0.0361 s | 0.0094 s | 0.91 | 0.01 | 360 B | 400 B | 3.57 |
| TestVFix | False | 1.618 s | 0.0265 s | 0.0069 s | 0.98 | 0.01 | 360 B | 400 B | 3.57 |
| TestVPtr | False | 1.351 s | 0.0331 s | 0.0086 s | 0.82 | 0.01 | 352 B | 400 B | 3.57 |
| TestVPtrFix | False | 1.465 s | 0.0147 s | 0.0038 s | 0.88 | 0.01 | 349 B | 400 B | 3.57 |
| TestVPtrA | False | NA | NA | NA | ? | ? | NA | NA | ? |
| TestVPtrAFix | False | NA | NA | NA | ? | ? | NA | NA | ? |
| TestCPU | True | 3.047 s | 0.0794 s | 0.0206 s | 1.00 | 0.01 | 325 B | 112 B | 1.00 |
| TestCPUV | True | 4.533 s | 0.0843 s | 0.0219 s | 1.49 | 0.01 | 365 B | 400 B | 3.57 |
| TestCPUV2 | True | 4.377 s | 0.0807 s | 0.0210 s | 1.44 | 0.01 | 338 B | 112 B | 1.00 |
| TestV | True | 1.507 s | 0.0486 s | 0.0126 s | 0.49 | 0.00 | 360 B | 400 B | 3.57 |
| TestVFix | True | 2.462 s | 0.1705 s | 0.0443 s | 0.81 | 0.01 | 360 B | 112 B | 1.00 |
| TestVPtr | True | 1.311 s | 0.0170 s | 0.0044 s | 0.43 | 0.00 | 352 B | 112 B | 1.00 |
| TestVPtrFix | True | 2.247 s | 0.0485 s | 0.0126 s | 0.74 | 0.01 | 349 B | 112 B | 1.00 |
| TestVPtrA | True | 1.309 s | 0.0227 s | 0.0059 s | 0.43 | 0.00 | 352 B | 112 B | 1.00 |
| TestVPtrAFix | True | 2.239 s | 0.0256 s | 0.0066 s | 0.73 | 0.00 | 349 B | 112 B | 1.00 |

Intrinsics.zip

@check4game
Author

I don't know why this happens (: I think it's a .NET bug.

                var emptyMask = Vector128.Equals(source, _emptyBucketVector).ExtractMostSignificantBits();
                // Check for empty buckets in the current vector.
                if (emptyMask != 0)
                {

#if !SUPER_FAST_ON_OLD_CPU

                    for (var pos = index; pos <= (index + (uint)BitOperations.TrailingZeroCount(emptyMask)); pos++)
                    {
                        if (_emptyBucket == Find(_controlBytes, pos))
                        {
                            Find(_controlBytes, pos) = h2;

                            Find(_entries, pos) = key;

                            Count++;

                            return;
                        }
                    }

#elif FAST_ON_OLD_CPU

                    if (_emptyBucket == Find(_controlBytes, index))
                    {
                        Find(_controlBytes, index) = h2;
                        Find(_entries, index) = key;

                        Count++;

                        return;
                    }

                    index += (uint)BitOperations.TrailingZeroCount(emptyMask);

                    Find(_controlBytes, index) = h2;
                    Find(_entries, index) = key;

                    Count++;

                    return;
#else
                    // slow on OLD_CPU (: but fast on intel 12XXX+ & amd 9700X

                    index += (uint)BitOperations.TrailingZeroCount(emptyMask);

                    Find(_controlBytes, index) = h2;
                    Find(_entries, index) = key;

                    Count++;

                    return;

#endif
                }

@check4game
Author

simplified the code a bit

                var emptyMask = Vector128.Equals(source, _emptyBucketVector).ExtractMostSignificantBits();
                // Check for empty buckets in the current vector.
                if (emptyMask != 0)
                {

#if !SUPER_FAST_ON_OLD_CPU
                    while (_emptyBucket != Find(_controlBytes, index)) index++;

                    Find(_controlBytes, index) = h2;

                    Find(_entries, index) = key;

                    Count++;

                    return;

#elif FAST_ON_OLD_CPU

                    if (_emptyBucket != Find(_controlBytes, index))
                    {
                        index += (uint)BitOperations.TrailingZeroCount(emptyMask);
                    }

                    Find(_controlBytes, index) = h2;
                    Find(_entries, index) = key;

                    Count++;

                    return;
#else
                    // slow on OLD_CPU (: but fast on intel 12XXX+ & amd 9700X

                    index += (uint)BitOperations.TrailingZeroCount(emptyMask);

                    Find(_controlBytes, index) = h2;
                    Find(_entries, index) = key;

                    Count++;

                    return;
#endif
                }

@Wsm2110
Owner

Wsm2110 commented Jan 14, 2025

Wondering what will happen if you increase the load to 80-85%

@check4game
Author

> Wondering what will happen if you increase the load to 80-85%

Easy, but I no longer have the 9700X (: I built that machine for a friend.

I only have the i7-10700K.

@check4game
Author


BenchmarkDotNet v0.14.0, Windows 10 (10.0.20348.2849)
Intel Core i7-10700K CPU 3.80GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.101
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2
  Job-HRQMGA : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2

Force=True  Server=True  InvocationCount=1  
IterationCount=5  LaunchCount=1  RunStrategy=Monitoring  
UnrollFactor=1  WarmupCount=2  

| Method | Load | Capacity | Mean | Error | StdDev | Ratio |
|--- |--- |--- |---:|---:|---:|---:|
| TestCPUV | 0.5 | 16777216 | 319.9 ms | 11.52 ms | 2.99 ms | 1.00 |
| TestVSuperFast | 0.5 | 16777216 | 194.3 ms | 15.62 ms | 4.06 ms | 0.61 |
| TestVFixFor | 0.5 | 16777216 | 193.7 ms | 16.67 ms | 4.33 ms | 0.61 |
| TestCPUV | 0.6 | 16777216 | 395.9 ms | 9.45 ms | 2.46 ms | 1.00 |
| TestVSuperFast | 0.6 | 16777216 | 244.8 ms | 11.96 ms | 3.11 ms | 0.62 |
| TestVFixFor | 0.6 | 16777216 | 247.4 ms | 20.73 ms | 5.38 ms | 0.62 |
| TestCPUV | 0.7 | 16777216 | 480.4 ms | 14.74 ms | 3.83 ms | 1.00 |
| TestVSuperFast | 0.7 | 16777216 | 303.7 ms | 9.73 ms | 2.53 ms | 0.63 |
| TestVFixFor | 0.7 | 16777216 | 312.0 ms | 10.76 ms | 2.79 ms | 0.65 |
| TestCPUV | 0.8 | 16777216 | 578.6 ms | 7.77 ms | 2.02 ms | 1.00 |
| TestVSuperFast | 0.8 | 16777216 | 380.2 ms | 22.25 ms | 5.78 ms | 0.66 |
| TestVFixFor | 0.8 | 16777216 | 396.3 ms | 17.55 ms | 4.56 ms | 0.68 |
| TestCPUV | 0.85 | 16777216 | 633.5 ms | 7.13 ms | 1.85 ms | 1.00 |
| TestVSuperFast | 0.85 | 16777216 | 420.8 ms | 22.56 ms | 5.86 ms | 0.66 |
| TestVFixFor | 0.85 | 16777216 | 441.9 ms | 19.49 ms | 5.06 ms | 0.70 |
| TestCPUV | 0.9 | 16777216 | 701.4 ms | 10.85 ms | 2.82 ms | 1.00 |
| TestVSuperFast | 0.9 | 16777216 | 463.3 ms | 11.57 ms | 3.01 ms | 0.66 |
| TestVFixFor | 0.9 | 16777216 | 497.6 ms | 23.95 ms | 6.22 ms | 0.71 |

@check4game
Author


BenchmarkDotNet v0.14.0, Windows 10 (10.0.20348.2849)
Intel Core i7-10700K CPU 3.80GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.101
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2
  Job-HTIZKP : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2

Force=True  Server=True  InvocationCount=1  
IterationCount=5  LaunchCount=1  RunStrategy=Monitoring  
UnrollFactor=1  WarmupCount=2  

| Method | Load | Capacity | Mean | Error | StdDev | Ratio |
|--- |--- |--- |---:|---:|---:|---:|
| TestCPUV | 0.9 | 16777216 | 705.5 ms | 29.32 ms | 7.61 ms | 1.00 |
| TestVSuperFast | 0.9 | 16777216 | 492.7 ms | 16.95 ms | 4.40 ms | 0.70 |
| TestVFixFor | 0.9 | 16777216 | 502.8 ms | 19.71 ms | 5.12 ms | 0.71 |
| TestCPUV | 0.91 | 16777216 | 717.0 ms | 11.09 ms | 2.88 ms | 1.00 |
| TestVSuperFast | 0.91 | 16777216 | 474.0 ms | 13.75 ms | 3.57 ms | 0.66 |
| TestVFixFor | 0.91 | 16777216 | 517.4 ms | 11.44 ms | 2.97 ms | 0.72 |
| TestCPUV | 0.92 | 16777216 | 734.3 ms | 12.65 ms | 3.29 ms | 1.00 |
| TestVSuperFast | 0.92 | 16777216 | 476.3 ms | 15.46 ms | 4.02 ms | 0.65 |
| TestVFixFor | 0.92 | 16777216 | 523.7 ms | 13.81 ms | 3.59 ms | 0.71 |
| TestCPUV | 0.93 | 16777216 | 747.7 ms | 15.09 ms | 3.92 ms | 1.00 |
| TestVSuperFast | 0.93 | 16777216 | 495.4 ms | 24.84 ms | 6.45 ms | 0.66 |
| TestVFixFor | 0.93 | 16777216 | 541.2 ms | 14.64 ms | 3.80 ms | 0.72 |
| TestCPUV | 0.94 | 16777216 | 772.6 ms | 8.32 ms | 2.16 ms | 1.00 |
| TestVSuperFast | 0.94 | 16777216 | 495.2 ms | 9.75 ms | 2.53 ms | 0.64 |
| TestVFixFor | 0.94 | 16777216 | 559.5 ms | 11.59 ms | 3.01 ms | 0.72 |
| TestCPUV | 0.95 | 16777216 | 792.3 ms | 8.05 ms | 2.09 ms | 1.00 |
| TestVSuperFast | 0.95 | 16777216 | 516.8 ms | 8.62 ms | 2.24 ms | 0.65 |
| TestVFixFor | 0.95 | 16777216 | 578.0 ms | 10.02 ms | 2.60 ms | 0.73 |

@check4game
Author

Can you run the benchmark on an Intel 12xxx? https://github.com/check4game/DotNetBug2

I want to report the problem to Microsoft.

@Wsm2110
Owner

Wsm2110 commented Jan 15, 2025

Care to explain the problem first? Not really sure what I'm looking at.

Trying to find some time :)

@check4game
Author

check4game commented Jan 15, 2025

The problem is the same one I started this topic with :)

BenchmarkDotNet v0.14.0, Windows 10 (10.0.20348.2849)
Intel Core i7-10700K CPU 3.80GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.102
  [Host]     : .NET 9.0.1 (9.0.124.61010), X64 RyuJIT AVX2
  Job-PNDHCV : .NET 9.0.1 (9.0.124.61010), X64 RyuJIT AVX2

InvocationCount=1  IterationCount=5  LaunchCount=1  
RunStrategy=Monitoring  UnrollFactor=1  WarmupCount=2  

| Method | Load | Mean | Error | StdDev | Ratio |
|--- |--- |---:|---:|---:|---:|
| AddOptimal | 0.5 | 1,413.3 ms | 33.91 ms | 8.81 ms | 1.00 |
| AddOptimalFix1 | 0.5 | 567.7 ms | 13.66 ms | 3.55 ms | 0.40 |
| AddOptimalFix2 | 0.5 | 570.7 ms | 9.24 ms | 2.40 ms | 0.40 |

AddOptimal takes 1413 ms, and that is the problem! On modern CPUs AddOptimal is faster than AddOptimalFix1 or AddOptimalFix2.

https://github.com/check4game/DotNetBug2/blob/d5062aadf60fbefe4351b4ae48acad3e5de88448/DotNetBug2.cs#L159-L172

var emptyMask = Vector128.Equals(source, _emptyBucketVector).ExtractMostSignificantBits();

// Check for empty buckets in the current vector.

if (emptyMask != 0)
{
    index += (uint)BitOperations.TrailingZeroCount(emptyMask);

    Find(_controlBytes, index) = h2;
    Find(_entries, index) = key;

    Count++;

    return;
}

AddOptimalFix1 567ms
https://github.com/check4game/DotNetBug2/blob/d5062aadf60fbefe4351b4ae48acad3e5de88448/DotNetBug2.cs#L217-L233

var emptyMask = Vector128.Equals(source, _emptyBucketVector).ExtractMostSignificantBits();

// Check for empty buckets in the current vector.

if (emptyMask != 0)
{
    if (_emptyBucket != Find(_controlBytes, index))
    {
        index += (uint)BitOperations.TrailingZeroCount(emptyMask);
    }

    Find(_controlBytes, index) = h2;
    Find(_entries, index) = key;

    Count++;

    return;
}

AddOptimalFix2 570ms
https://github.com/check4game/DotNetBug2/blob/d5062aadf60fbefe4351b4ae48acad3e5de88448/DotNetBug2.cs#L278-L291

var emptyMask = Vector128.Equals(source, _emptyBucketVector).ExtractMostSignificantBits();

// Check for empty buckets in the current vector.

if (emptyMask != 0)
{
    while (_emptyBucket != Find(_controlBytes, index)) index++;

    Find(_controlBytes, index) = h2;
    Find(_entries, index) = key;

    Count++;

    return;
}
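
One thing worth noting about the three variants: if `source` is the 16 control bytes loaded starting at `index` (an assumption about the linked benchmark), they all choose the same slot, so the timing gap comes from how the CPU executes them rather than from what they compute. A minimal sketch that checks this, with a hypothetical `EmptyBucket` value and hypothetical helper names:

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

const byte EmptyBucket = 0x80; // hypothetical marker value

// Bit i is set when control[index + i] is empty (assumes the load starts at index).
uint EmptyMask(byte[] control, int index) =>
    Vector128.Equals(Vector128.Create(control, index), Vector128.Create(EmptyBucket))
             .ExtractMostSignificantBits();

// AddOptimal: always jump straight to the first empty slot found by the mask.
int SlotOptimal(byte[] control, int index) =>
    index + BitOperations.TrailingZeroCount(EmptyMask(control, index));

// AddOptimalFix1: check the slot at index with a scalar read first.
int SlotFix1(byte[] control, int index) =>
    control[index] == EmptyBucket
        ? index
        : index + BitOperations.TrailingZeroCount(EmptyMask(control, index));

// AddOptimalFix2: scan forward byte by byte (the mask only tells us an empty slot exists).
int SlotFix2(byte[] control, int index)
{
    while (control[index] != EmptyBucket) index++;
    return index;
}

byte[] demo = new byte[32];
Array.Fill(demo, (byte)0x01);
demo[11] = EmptyBucket;

// Probing from slot 4, all three variants land on slot 11.
Console.WriteLine($"{SlotOptimal(demo, 4)} {SlotFix1(demo, 4)} {SlotFix2(demo, 4)}"); // 11 11 11
```

All three helpers require at least one empty slot in the 16 bytes starting at `index`, which mirrors the `emptyMask != 0` guard in the code above.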

@check4game
Author

I don't have a modern processor at the moment (:

@Wsm2110
Owner

Wsm2110 commented Jan 15, 2025

I don't have time to analyze what you did, but here are the results:

| Method | Load | Mean | Error | StdDev | Ratio | RatioSD |
|--- |--- |---:|---:|---:|---:|---:|
| AddOptimal | 0.5 | 377.1 ms | 42.50 ms | 11.04 ms | 1.00 | 0.04 |
| AddOptimalFix1 | 0.5 | 361.1 ms | 17.20 ms | 4.47 ms | 0.96 | 0.03 |
| AddOptimalFix2 | 0.5 | 359.8 ms | 3.69 ms | 0.96 ms | 0.95 | 0.02 |

@check4game
Author

> I don't have time to analyze what you did, but here are the results:
>
> | Method | Load | Mean | Error | StdDev | Ratio | RatioSD |
> |--- |--- |---:|---:|---:|---:|---:|
> | AddOptimal | 0.5 | 377.1 ms | 42.50 ms | 11.04 ms | 1.00 | 0.04 |
> | AddOptimalFix1 | 0.5 | 361.1 ms | 17.20 ms | 4.47 ms | 0.96 | 0.03 |
> | AddOptimalFix2 | 0.5 | 359.8 ms | 3.69 ms | 0.96 ms | 0.95 | 0.02 |

Please add the BenchmarkDotNet header with the environment info.

@check4game
Author

Is this valid?

BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4602/23H2/2023Update/SunValley3)
12th Gen Intel Core i5-12500H, 1 CPU, 16 logical and 12 physical cores
.NET SDK 9.0.100-rc.2.24474.11
[Host] : .NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX2
Job-UIXUPV : .NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX2

InvocationCount=1 IterationCount=5 LaunchCount=1
RunStrategy=Monitoring UnrollFactor=1 WarmupCount=2

@Wsm2110
Owner

Wsm2110 commented Jan 15, 2025 via email

@check4game
Author

thanks

@Wsm2110
Owner

Wsm2110 commented Jan 22, 2025

Releasing a Hybrid solution soonish... with some promising results

BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4751/23H2/2023Update/SunValley3)
12th Gen Intel Core i5-12500H, 1 CPU, 16 logical and 12 physical cores
.NET SDK 9.0.200-preview.0.24575.35
[Host] : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2
Job-DFGIDV : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2

InvocationCount=1 IterationCount=5 LaunchCount=1
RunStrategy=Monitoring UnrollFactor=1 WarmupCount=5

| Method   | Length   | Mean    | Error    | StdDev   | Code Size | Allocated |
|--------- |--------- |--------:|---------:|---------:|----------:|----------:|
| BlitzMap | 80000000 | 1.969 s | 0.3128 s | 0.0812 s |     666 B |     400 B |
| DenseMap | 80000000 | 2.577 s | 0.4469 s | 0.1161 s |     593 B |     400 B |

@Wsm2110 Wsm2110 closed this as completed Jan 23, 2025