DenseMap performance on 'old' Intel CPUs #43

check4game · 2025-01-11T14:21:53Z

Hi, again:)

HardwareIntrinsics=AVX,AES,PCLMUL,POPCNT VectorSize=128

BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4170/22H2/2022Update)
Intel Core i7-2700K CPU 3.50GHz (Sandy Bridge), 1 CPU, 8 logical and 4 physical cores
.NET SDK 9.0.101
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX
  Job-LMREHN : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX

Force=True  Server=True  InvocationCount=1  
IterationCount=5  LaunchCount=1  RunStrategy=Monitoring  
UnrollFactor=1  WarmupCount=2

Method	Length	Mean	Error	StdDev	Ratio	RatioSD	Code Size	Allocated	Alloc Ratio
DenseMap	80000000	12.581 s	0.9408 s	0.2443 s	1.41	0.03	592 B	64 B	0.57
DenseMapCPU	80000000	7.525 s	0.3704 s	0.0962 s	0.85	0.01	561 B	112 B	1.00
Dictionary	80000000	8.906 s	0.3936 s	0.1022 s	1.00	0.01	227 B	112 B	1.00

HardwareIntrinsics=AVX,AES,PCLMUL,POPCNT VectorSize=128

BenchmarkDotNet v0.14.0, Windows 8 (6.2.9200.0)
Intel Core i7-3770 CPU 3.40GHz (Ivy Bridge), 1 CPU, 8 logical and 4 physical cores
Frequency: 3330092 Hz, Resolution: 300.292 ns, Timer: TSC
.NET SDK 9.0.101
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX
  Job-RHKLBE : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX

Force=True  Server=True  InvocationCount=1  
IterationCount=5  LaunchCount=1  RunStrategy=Monitoring  
UnrollFactor=1  WarmupCount=2

Method	Length	Mean	Error	StdDev	Ratio	RatioSD	Code Size	Allocated	Alloc Ratio
DenseMap	80000000	13.047 s	0.4102 s	0.1065 s	1.40	0.16	592 B	112 B	1.00
DenseMapCPU	80000000	7.952 s	0.3423 s	0.0889 s	0.85	0.10	561 B	112 B	1.00
Dictionary	80000000	9.493 s	5.4299 s	1.4101 s	1.02	0.18	227 B	112 B	1.00

HardwareIntrinsics=AVX2,AES,BMI1,BMI2,FMA,LZCNT,PCLMUL,POPCNT VectorSize=256

BenchmarkDotNet v0.14.0, Windows 10 (10.0.20348.2849)
Intel Core i7-10700K CPU 3.80GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.101
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2
  Job-XXDNRI : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2

Force=True  Server=True  InvocationCount=1  
IterationCount=5  LaunchCount=1  RunStrategy=Monitoring  
UnrollFactor=1  WarmupCount=2

Method	Length	Mean	Error	StdDev	Ratio	RatioSD	Code Size	Allocated	Alloc Ratio
DenseMap	80000000	8.737 s	0.3173 s	0.0824 s	1.60	0.04	589 B	112 B	1.75
DenseMapCPU	80000000	4.790 s	0.3992 s	0.1037 s	0.87	0.03	561 B	112 B	1.75
Dictionary	80000000	5.479 s	0.4981 s	0.1293 s	1.00	0.03	227 B	64 B	1.00

HardwareIntrinsics=AVX-512F+CD+BW+DQ+VL+VBMI,AES,BMI1,BMI2,FMA,LZCNT,PCLMUL,POPCNT,AvxVnni VectorSize=256

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.2605)
AMD Ryzen 7 9700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.101
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-AXOGOJ : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

Force=True  Server=True  InvocationCount=1  
IterationCount=5  LaunchCount=1  RunStrategy=Monitoring  
UnrollFactor=1  WarmupCount=2

Method	Length	Mean	Error	StdDev	Ratio	RatioSD	Code Size	Allocated	Alloc Ratio
DenseMap	80000000	2.366 s	0.3362 s	0.0873 s	0.58	0.02	585 B	112 B	1.75
DenseMapCPU	80000000	3.824 s	0.1602 s	0.0416 s	0.94	0.01	561 B	112 B	1.75
Dictionary	80000000	4.059 s	0.0739 s	0.0192 s	1.00	0.01	227 B	64 B	1.00

MyVer.zip


Wsm2110 · 2025-01-11T20:28:07Z

Have to say, I love your resilience and dedication :)

The main idea is to use vectorization and make sure it runs well on any CPU. This scalar method works fine on older CPUs, but it can’t match the performance of vectorization on newer ones.
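
To make the scalar versus vectorized distinction concrete, here is a minimal sketch of probing one 16-byte group of control bytes both ways. The class name, the method names, and the `Empty` marker value are hypothetical and are not taken from DenseMap; only the `Vector128` and `BitOperations` calls are real .NET APIs.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

public static class EmptySlotProbe
{
    // Hypothetical marker for an empty bucket; the value DenseMap uses may differ.
    public const sbyte Empty = -128;

    // Scalar probe: inspect the control bytes one at a time.
    public static int FindEmptyScalar(ReadOnlySpan<sbyte> group)
    {
        for (int i = 0; i < group.Length; i++)
        {
            if (group[i] == Empty) return i;
        }
        return -1;
    }

    // Vectorized probe: compare 16 control bytes at once, then take the
    // position of the lowest set bit in the resulting mask.
    public static int FindEmptyVector(ReadOnlySpan<sbyte> group)
    {
        if (group.Length < Vector128<sbyte>.Count)
            throw new ArgumentException("Need at least 16 control bytes.", nameof(group));

        Vector128<sbyte> source = Vector128.LoadUnsafe(ref MemoryMarshal.GetReference(group));
        uint mask = Vector128.Equals(source, Vector128.Create(Empty)).ExtractMostSignificantBits();
        return mask == 0 ? -1 : BitOperations.TrailingZeroCount(mask);
    }
}
```

Both methods return the offset of the first empty slot in the group, or -1 if there is none. Which one is faster depends on the CPU, which is what the benchmarks in this thread measure.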

check4game · 2025-01-11T21:08:07Z

It seems to me that SIMD on old processors was largely a marketing ploy to sell them to ordinary people at a high price.

Yes, it works, but with a lot of restrictions, and getting code to actually run faster is not so easy.

It's like a GPU: load the data once and it is processed quickly, but if you constantly have to load new data, the performance gain disappears.

I suspect the problem is that we process a byte array rather than an array of words. I'll try to check this hypothesis.

check4game · 2025-01-11T22:29:03Z

In DenseMapCPU: jumpDistance += 4; // Increase the jump distance by 4 (instead of 16) to probe the next cluster.

Reducing the jump distance in EmplaceCPU gives these results.
Looks cool; I hope I didn't make a mistake.

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.2605)
AMD Ryzen 7 9700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.101
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-YXPCFP : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

Force=True  Server=True  InvocationCount=1  
IterationCount=5  LaunchCount=1  RunStrategy=Monitoring  
UnrollFactor=1  WarmupCount=2

Method	Length	Mean	Error	StdDev	Ratio	RatioSD	Code Size	Allocated	Alloc Ratio
DenseMap	80000000	2.449 s	0.3059 s	0.0794 s	0.61	0.02	585 B	112 B	1.75
DenseMapCPU	80000000	2.330 s	0.3142 s	0.0816 s	0.58	0.02	561 B	112 B	1.75
Dictionary	80000000	4.035 s	0.0457 s	0.0119 s	1.00	0.00	227 B	64 B	1.00
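
For readers unfamiliar with the jump distance, the following sketch shows how the step size changes which slots get visited. It assumes the probe index advances by an accumulating jump distance and wraps with a power-of-two mask; the actual DenseMapCPU loop may differ, and `ProbeSequence` and its parameters are hypothetical names.

```csharp
using System.Collections.Generic;

public static class ProbeSequence
{
    // Returns the first `count` slot indices visited when probing from `start`.
    // `step` is the amount added to the jump distance after every probe
    // (16 in the original comment, 4 in the modified DenseMapCPU).
    // `lengthMinusOne` is assumed to be (power-of-two table length - 1).
    public static List<uint> Probe(uint start, uint step, uint lengthMinusOne, int count)
    {
        var visited = new List<uint>(count);
        uint index = start & lengthMinusOne;
        uint jumpDistance = 0;

        for (int i = 0; i < count; i++)
        {
            visited.Add(index);
            jumpDistance += step;                            // grow the jump
            index = (index + jumpDistance) & lengthMinusOne; // wrap around the table
        }

        return visited;
    }
}
```

Under these assumptions, a smaller step keeps successive probes closer together in memory, which may be one reason the scalar path benefits from it.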

Wsm2110 · 2025-01-12T05:38:32Z

I have a feeling that the performance hit of loading unaligned data on older CPUs is huge. I was considering using Vector128.ReadAligned(), where each index is rounded down to the nearest multiple of 16 (less than or equal to the index).

check4game · 2025-01-14T12:38:25Z

> I have a feeling that the performance hit of loading unaligned data on older CPUs is huge. I was considering using Vector128.ReadAligned(), where each index is rounded down to the nearest multiple of 16 (less than or equal to the index).

The problem is not array alignment. I found which code slows down the insertion of new elements on older processors, but so far I don't understand why (:


BenchmarkDotNet v0.14.0, Windows 10 (10.0.20348.2849)
Intel Core i7-10700K CPU 3.80GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.101
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2
  Job-IKFKMC : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2

Force=True  Server=True  InvocationCount=1  
IterationCount=5  LaunchCount=1  RunStrategy=Monitoring  
UnrollFactor=1  WarmupCount=2

Method	Length	Mean	Error	StdDev	Ratio	RatioSD
DenseMap	80000000	8.654 s	0.5200 s	0.1350 s	1.55	0.10
DenseMapCPU	80000000	5.110 s	0.3751 s	0.0974 s	0.91	0.06
DenseMapVFix	80000000	3.776 s	0.4238 s	0.1101 s	0.68	0.05
Dictionary	80000000	5.617 s	1.6337 s	0.4243 s	1.00	0.09

Wsm2110 · 2025-01-14T12:49:57Z

Mind benchmarking the read-vector-aligned branch? :)

Wondering if there is any difference.

Anyways your benchmark looks promising.

check4game · 2025-01-14T15:18:17Z

.NET aligns all memory to 8 bytes, but we need 16. That alone is not enough (: we also need to align the index to a multiple of 16.

_lengthMinusOne = (Length-1) & 0xFFFFFFF0
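
A small sketch of what that mask does, assuming the table length is a power of two (the thread does not state this explicitly). `GroupAlignment` and its members are hypothetical names used only for illustration.

```csharp
public static class GroupAlignment
{
    public const uint GroupSize = 16;

    // Round an index down to the start of its 16-slot group.
    public static uint AlignDown(uint index) => index & ~(GroupSize - 1);

    // Mask from the comment above: clearing the low four bits of (length - 1)
    // means that `hash & mask` always lands on a multiple of 16.
    public static uint AlignedMask(uint length) => (length - 1) & 0xFFFFFFF0;
}
```

For example, `AlignDown(37)` yields 32, and for a table of length 1024 the mask is 1008 (0x3F0), so every masked hash is a multiple of 16. As noted above, this only produces aligned loads if the start of the array is itself 16-byte aligned.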

I wrote a test suite just for the Add function (80_000_000)


BenchmarkDotNet v0.14.0, Windows 10 (10.0.20348.2849)
Intel Core i7-10700K CPU 3.80GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.101
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2
  Job-TPXRFP : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2

Force=True  Server=True  InvocationCount=1  
IterationCount=5  LaunchCount=1  RunStrategy=Monitoring  
UnrollFactor=1  WarmupCount=2

Method	Alignment	Mean	Error	StdDev	Ratio	RatioSD	Code Size	Allocated	Alloc Ratio
TestCPU	False	2.558 s	0.0580 s	0.0151 s	1.00	0.01	325 B	112 B	1.00
TestCPUV	False	4.485 s	0.0494 s	0.0128 s	1.75	0.01	365 B	112 B	1.00
TestCPUV2	False	4.384 s	0.0303 s	0.0079 s	1.71	0.01	338 B	112 B	1.00
TestV	False	8.250 s	0.2031 s	0.0528 s	3.23	0.03	364 B	112 B	1.00
TestVFix	False	2.726 s	0.0482 s	0.0125 s	1.07	0.01	364 B	112 B	1.00
TestVPtr	False	8.307 s	0.1741 s	0.0452 s	3.25	0.02	356 B	400 B	3.57
TestVPtrFix	False	2.675 s	0.0610 s	0.0158 s	1.05	0.01	353 B	112 B	1.00
TestVPtrA	False	NA	NA	NA	?	?	NA	NA	?
TestVPtrAFix	False	NA	NA	NA	?	?	NA	NA	?

TestCPU	True	3.534 s	0.0300 s	0.0078 s	1.00	0.00	325 B	400 B	1.00
TestCPUV	True	5.394 s	0.0567 s	0.0147 s	1.53	0.00	365 B	112 B	0.28
TestCPUV2	True	4.878 s	0.0334 s	0.0087 s	1.38	0.00	338 B	112 B	0.28
TestV	True	8.300 s	0.1851 s	0.0481 s	2.35	0.01	364 B	400 B	1.00
TestVFix	True	3.615 s	0.0213 s	0.0055 s	1.02	0.00	364 B	400 B	1.00
TestVPtr	True	8.204 s	0.0706 s	0.0183 s	2.32	0.01	356 B	112 B	0.28
TestVPtrFix	True	3.858 s	0.0580 s	0.0151 s	1.09	0.00	353 B	400 B	1.00
TestVPtrA	True	8.196 s	0.0968 s	0.0251 s	2.32	0.01	356 B	400 B	1.00
TestVPtrAFix	True	4.254 s	0.0307 s	0.0080 s	1.20	0.00	353 B	400 B	1.00


BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.2605)
AMD Ryzen 7 9700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.101
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-IQFXWS : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

Force=True  Server=True  InvocationCount=1  
IterationCount=5  LaunchCount=1  RunStrategy=Monitoring  
UnrollFactor=1  WarmupCount=2

Method	Alignment	Mean	Error	StdDev	Ratio	RatioSD	Code Size	Allocated	Alloc Ratio
TestCPU	False	1.655 s	0.0656 s	0.0170 s	1.00	0.01	325 B	112 B	1.00
TestCPUV	False	3.680 s	0.0308 s	0.0080 s	2.22	0.02	365 B	112 B	1.00
TestCPUV2	False	3.551 s	0.1097 s	0.0285 s	2.15	0.03	338 B	400 B	3.57
TestV	False	1.501 s	0.0361 s	0.0094 s	0.91	0.01	360 B	400 B	3.57
TestVFix	False	1.618 s	0.0265 s	0.0069 s	0.98	0.01	360 B	400 B	3.57
TestVPtr	False	1.351 s	0.0331 s	0.0086 s	0.82	0.01	352 B	400 B	3.57
TestVPtrFix	False	1.465 s	0.0147 s	0.0038 s	0.88	0.01	349 B	400 B	3.57
TestVPtrA	False	NA	NA	NA	?	?	NA	NA	?
TestVPtrAFix	False	NA	NA	NA	?	?	NA	NA	?

TestCPU	True	3.047 s	0.0794 s	0.0206 s	1.00	0.01	325 B	112 B	1.00
TestCPUV	True	4.533 s	0.0843 s	0.0219 s	1.49	0.01	365 B	400 B	3.57
TestCPUV2	True	4.377 s	0.0807 s	0.0210 s	1.44	0.01	338 B	112 B	1.00
TestV	True	1.507 s	0.0486 s	0.0126 s	0.49	0.00	360 B	400 B	3.57
TestVFix	True	2.462 s	0.1705 s	0.0443 s	0.81	0.01	360 B	112 B	1.00
TestVPtr	True	1.311 s	0.0170 s	0.0044 s	0.43	0.00	352 B	112 B	1.00
TestVPtrFix	True	2.247 s	0.0485 s	0.0126 s	0.74	0.01	349 B	112 B	1.00
TestVPtrA	True	1.309 s	0.0227 s	0.0059 s	0.43	0.00	352 B	112 B	1.00
TestVPtrAFix	True	2.239 s	0.0256 s	0.0066 s	0.73	0.00	349 B	112 B	1.00

Intrinsics.zip

check4game · 2025-01-14T18:17:08Z

I don't know why this happens (: I think it's a .NET bug.

                var emptyMask = Vector128.Equals(source, _emptyBucketVector).ExtractMostSignificantBits();
                // Check for empty buckets in the current vector.
                if (emptyMask != 0)
                {

#if !SUPER_FAST_ON_OLD_CPU

                    for (var pos = index; pos <= (index + (uint)BitOperations.TrailingZeroCount(emptyMask)); pos++)
                    {
                        if (_emptyBucket == Find(_controlBytes, pos))
                        {
                            Find(_controlBytes, pos) = h2;

                            Find(_entries, pos) = key;

                            Count++;

                            return;
                        }
                    }

#elif FAST_ON_OLD_CPU

                    if (_emptyBucket == Find(_controlBytes, index))
                    {
                        Find(_controlBytes, index) = h2;
                        Find(_entries, index) = key;

                        Count++;

                        return;
                    }

                    index += (uint)BitOperations.TrailingZeroCount(emptyMask);

                    Find(_controlBytes, index) = h2;
                    Find(_entries, index) = key;

                    Count++;

                    return;
#else
                    // slow on OLD_CPU (: but fast on intel 12XXX+ & amd 9700X

                    index += (uint)BitOperations.TrailingZeroCount(emptyMask);

                    Find(_controlBytes, index) = h2;
                    Find(_entries, index) = key;

                    Count++;

                    return;

#endif
                }

check4game · 2025-01-14T18:28:43Z

simplified the code a bit

                var emptyMask = Vector128.Equals(source, _emptyBucketVector).ExtractMostSignificantBits();
                // Check for empty buckets in the current vector.
                if (emptyMask != 0)
                {

#if !SUPER_FAST_ON_OLD_CPU
                    while (_emptyBucket != Find(_controlBytes, index)) index++;

                    Find(_controlBytes, index) = h2;

                    Find(_entries, index) = key;

                    Count++;

                    return;

#elif FAST_ON_OLD_CPU

                    if (_emptyBucket != Find(_controlBytes, index))
                    {
                        index += (uint)BitOperations.TrailingZeroCount(emptyMask);
                    }

                    Find(_controlBytes, index) = h2;
                    Find(_entries, index) = key;

                    Count++;

                    return;
#else
                    // slow on OLD_CPU (: but fast on intel 12XXX+ & amd 9700X

                    index += (uint)BitOperations.TrailingZeroCount(emptyMask);

                    Find(_controlBytes, index) = h2;
                    Find(_entries, index) = key;

                    Count++;

                    return;
#endif
                }

Wsm2110 · 2025-01-14T19:05:17Z

Wondering what will happen if you increase the load to 80-85%

check4game · 2025-01-14T20:26:01Z

> Wondering what will happen if you increase the load to 80-85%

Easy, but I no longer have the 9700X (: I built that machine for a friend.

I only have the i7-10700K.

check4game · 2025-01-14T21:02:44Z


BenchmarkDotNet v0.14.0, Windows 10 (10.0.20348.2849)
Intel Core i7-10700K CPU 3.80GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.101
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2
  Job-HRQMGA : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2

Force=True  Server=True  InvocationCount=1  
IterationCount=5  LaunchCount=1  RunStrategy=Monitoring  
UnrollFactor=1  WarmupCount=2

Method	Load	Capacity	Mean	Error	StdDev	Ratio
TestCPUV	0.5	16777216	319.9 ms	11.52 ms	2.99 ms	1.00
TestVSuperFast	0.5	16777216	194.3 ms	15.62 ms	4.06 ms	0.61
TestVFixFor	0.5	16777216	193.7 ms	16.67 ms	4.33 ms	0.61

TestCPUV	0.6	16777216	395.9 ms	9.45 ms	2.46 ms	1.00
TestVSuperFast	0.6	16777216	244.8 ms	11.96 ms	3.11 ms	0.62
TestVFixFor	0.6	16777216	247.4 ms	20.73 ms	5.38 ms	0.62

TestCPUV	0.7	16777216	480.4 ms	14.74 ms	3.83 ms	1.00
TestVSuperFast	0.7	16777216	303.7 ms	9.73 ms	2.53 ms	0.63
TestVFixFor	0.7	16777216	312.0 ms	10.76 ms	2.79 ms	0.65

TestCPUV	0.8	16777216	578.6 ms	7.77 ms	2.02 ms	1.00
TestVSuperFast	0.8	16777216	380.2 ms	22.25 ms	5.78 ms	0.66
TestVFixFor	0.8	16777216	396.3 ms	17.55 ms	4.56 ms	0.68

TestCPUV	0.85	16777216	633.5 ms	7.13 ms	1.85 ms	1.00
TestVSuperFast	0.85	16777216	420.8 ms	22.56 ms	5.86 ms	0.66
TestVFixFor	0.85	16777216	441.9 ms	19.49 ms	5.06 ms	0.70

TestCPUV	0.9	16777216	701.4 ms	10.85 ms	2.82 ms	1.00
TestVSuperFast	0.9	16777216	463.3 ms	11.57 ms	3.01 ms	0.66
TestVFixFor	0.9	16777216	497.6 ms	23.95 ms	6.22 ms	0.71

check4game · 2025-01-14T21:08:17Z


BenchmarkDotNet v0.14.0, Windows 10 (10.0.20348.2849)
Intel Core i7-10700K CPU 3.80GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.101
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2
  Job-HTIZKP : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2

Force=True  Server=True  InvocationCount=1  
IterationCount=5  LaunchCount=1  RunStrategy=Monitoring  
UnrollFactor=1  WarmupCount=2

Method	Load	Capacity	Mean	Error	StdDev	Ratio
TestCPUV	0.9	16777216	705.5 ms	29.32 ms	7.61 ms	1.00
TestVSuperFast	0.9	16777216	492.7 ms	16.95 ms	4.40 ms	0.70
TestVFixFor	0.9	16777216	502.8 ms	19.71 ms	5.12 ms	0.71

TestCPUV	0.91	16777216	717.0 ms	11.09 ms	2.88 ms	1.00
TestVSuperFast	0.91	16777216	474.0 ms	13.75 ms	3.57 ms	0.66
TestVFixFor	0.91	16777216	517.4 ms	11.44 ms	2.97 ms	0.72

TestCPUV	0.92	16777216	734.3 ms	12.65 ms	3.29 ms	1.00
TestVSuperFast	0.92	16777216	476.3 ms	15.46 ms	4.02 ms	0.65
TestVFixFor	0.92	16777216	523.7 ms	13.81 ms	3.59 ms	0.71

TestCPUV	0.93	16777216	747.7 ms	15.09 ms	3.92 ms	1.00
TestVSuperFast	0.93	16777216	495.4 ms	24.84 ms	6.45 ms	0.66
TestVFixFor	0.93	16777216	541.2 ms	14.64 ms	3.80 ms	0.72

TestCPUV	0.94	16777216	772.6 ms	8.32 ms	2.16 ms	1.00
TestVSuperFast	0.94	16777216	495.2 ms	9.75 ms	2.53 ms	0.64
TestVFixFor	0.94	16777216	559.5 ms	11.59 ms	3.01 ms	0.72

TestCPUV	0.95	16777216	792.3 ms	8.05 ms	2.09 ms	1.00
TestVSuperFast	0.95	16777216	516.8 ms	8.62 ms	2.24 ms	0.65
TestVFixFor	0.95	16777216	578.0 ms	10.02 ms	2.60 ms	0.73

check4game · 2025-01-15T15:27:08Z

Can you run the benchmark on an Intel 12xxx? https://github.com/check4game/DotNetBug2

I want to report the problem to Microsoft.

Wsm2110 · 2025-01-15T16:15:14Z

Care to explain the problem first? Not really sure what I'm looking at.

Trying to find some time :)

check4game · 2025-01-15T17:12:18Z

The problem is the same one I started this topic with :)

BenchmarkDotNet v0.14.0, Windows 10 (10.0.20348.2849)
Intel Core i7-10700K CPU 3.80GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.102
  [Host]     : .NET 9.0.1 (9.0.124.61010), X64 RyuJIT AVX2
  Job-PNDHCV : .NET 9.0.1 (9.0.124.61010), X64 RyuJIT AVX2

InvocationCount=1  IterationCount=5  LaunchCount=1  
RunStrategy=Monitoring  UnrollFactor=1  WarmupCount=2

Method	Load	Mean	Error	StdDev	Ratio
AddOptimal	0.5	1,413.3 ms	33.91 ms	8.81 ms	1.00
AddOptimalFix1	0.5	567.7 ms	13.66 ms	3.55 ms	0.40
AddOptimalFix2	0.5	570.7 ms	9.24 ms	2.40 ms	0.40

AddOptimal takes 1413 ms, and that is the problem! On modern CPUs AddOptimal is faster than AddOptimalFix1 or AddOptimalFix2.

https://github.com/check4game/DotNetBug2/blob/d5062aadf60fbefe4351b4ae48acad3e5de88448/DotNetBug2.cs#L159-L172

var emptyMask = Vector128.Equals(source, _emptyBucketVector).ExtractMostSignificantBits();

// Check for empty buckets in the current vector.

if (emptyMask != 0)
{
    index += (uint)BitOperations.TrailingZeroCount(emptyMask);

    Find(_controlBytes, index) = h2;
    Find(_entries, index) = key;

    Count++;

    return;
}

AddOptimalFix1 567ms
https://github.com/check4game/DotNetBug2/blob/d5062aadf60fbefe4351b4ae48acad3e5de88448/DotNetBug2.cs#L217-L233

var emptyMask = Vector128.Equals(source, _emptyBucketVector).ExtractMostSignificantBits();

// Check for empty buckets in the current vector.

if (emptyMask != 0)
{
    if (_emptyBucket != Find(_controlBytes, index))
    {
        index += (uint)BitOperations.TrailingZeroCount(emptyMask);
    }

    Find(_controlBytes, index) = h2;
    Find(_entries, index) = key;

    Count++;

    return;
}

AddOptimalFix2 570ms
https://github.com/check4game/DotNetBug2/blob/d5062aadf60fbefe4351b4ae48acad3e5de88448/DotNetBug2.cs#L278-L291

var emptyMask = Vector128.Equals(source, _emptyBucketVector).ExtractMostSignificantBits();

// Check for empty buckets in the current vector.

if (emptyMask != 0)
{
    while (_emptyBucket != Find(_controlBytes, index)) index++;

    Find(_controlBytes, index) = h2;
    Find(_entries, index) = key;

    Count++;

    return;
}
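
The three variants above differ only in how they locate the slot, so they should pick the same one. The following self-contained sketch checks that on a plain array. `SlotSelection`, the `Empty` value, and the use of `sbyte[]` are assumptions made for illustration; they are not the DotNetBug2 code.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

public static class SlotSelection
{
    // Hypothetical empty marker; the real _emptyBucket value may differ.
    public const sbyte Empty = -128;

    // Bit i is set when control[index + i] is empty. Needs 16 readable bytes.
    public static uint EmptyMask(sbyte[] control, uint index)
    {
        if (index + 16 > (uint)control.Length)
            throw new ArgumentOutOfRangeException(nameof(index));

        Vector128<sbyte> source = Vector128.LoadUnsafe(ref control[index]);
        return Vector128.Equals(source, Vector128.Create(Empty)).ExtractMostSignificantBits();
    }

    // AddOptimal: jump straight to the first empty slot.
    // All three methods assume the group at `index` has at least one empty slot.
    public static uint Optimal(sbyte[] control, uint index) =>
        index + (uint)BitOperations.TrailingZeroCount(EmptyMask(control, index));

    // AddOptimalFix1: test the slot at `index` first, fall back to the mask.
    public static uint Fix1(sbyte[] control, uint index)
    {
        uint mask = EmptyMask(control, index);
        if (control[index] != Empty)
        {
            index += (uint)BitOperations.TrailingZeroCount(mask);
        }
        return index;
    }

    // AddOptimalFix2: scalar scan forward until an empty slot is found.
    public static uint Fix2(sbyte[] control, uint index)
    {
        while (control[index] != Empty) index++;
        return index;
    }
}
```

Since the selected slot is the same in all three, the timing difference reported on the i7-10700K presumably comes from how the generated code executes on that CPU rather than from doing different work.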

check4game · 2025-01-15T17:36:48Z

I don't have a modern processor at the moment (:

Wsm2110 · 2025-01-15T17:52:16Z

I don't have time to analyze what you did, but here are the results:

Method	Load	Mean	Error	StdDev	Ratio	RatioSD
AddOptimal	0.5	377.1 ms	42.50 ms	11.04 ms	1.00	0.04
AddOptimalFix1	0.5	361.1 ms	17.20 ms	4.47 ms	0.96	0.03
AddOptimalFix2	0.5	359.8 ms	3.69 ms	0.96 ms	0.95	0.02

check4game · 2025-01-15T18:14:33Z

> I don't have time to analyze what you did, but here are the results:
>
> Method	Load	Mean	Error	StdDev	Ratio	RatioSD
> AddOptimal	0.5	377.1 ms	42.50 ms	11.04 ms	1.00	0.04
> AddOptimalFix1	0.5	361.1 ms	17.20 ms	4.47 ms	0.96	0.03
> AddOptimalFix2	0.5	359.8 ms	3.69 ms	0.96 ms	0.95	0.02

Please add the BenchmarkDotNet header with the environment info.

check4game · 2025-01-15T18:17:19Z

Is this valid?

BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4602/23H2/2023Update/SunValley3)
12th Gen Intel Core i5-12500H, 1 CPU, 16 logical and 12 physical cores
.NET SDK 9.0.100-rc.2.24474.11
[Host] : .NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX2
Job-UIXUPV : .NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX2

InvocationCount=1 IterationCount=5 LaunchCount=1
RunStrategy=Monitoring UnrollFactor=1 WarmupCount=2

Wsm2110 · 2025-01-15T18:20:17Z

It is valid.


check4game · 2025-01-15T18:35:55Z

thanks

Wsm2110 · 2025-01-22T19:06:50Z

Releasing a Hybrid solution soonish... with some promising results

BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4751/23H2/2023Update/SunValley3)
12th Gen Intel Core i5-12500H, 1 CPU, 16 logical and 12 physical cores
.NET SDK 9.0.200-preview.0.24575.35
[Host] : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2
Job-DFGIDV : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX2

InvocationCount=1 IterationCount=5 LaunchCount=1
RunStrategy=Monitoring UnrollFactor=1 WarmupCount=5

| Method   | Length   | Mean    | Error    | StdDev   | Code Size | Allocated |
|--------- |--------- |--------:|---------:|---------:|----------:|----------:|
| BlitzMap | 80000000 | 1.969 s | 0.3128 s | 0.0812 s |     666 B |     400 B |
| DenseMap | 80000000 | 2.577 s | 0.4469 s | 0.1161 s |     593 B |     400 B |

Wsm2110 closed this as completed Jan 23, 2025
