Reason for the subpar performance of the lookup validation algorithm in Java [discussion] #10

lemire · 2024-06-18T16:16:26Z

This Java code can be 'fast' but not nearly at the speed of a C implementation.

I believe that the reason is this code section:

    ByteVector byte1High = prev1.lanewise(LSHR, 4).selectFrom(lut.byte1HighLookup());
    ByteVector byte1Low = prev1.and((byte) 0x0f).selectFrom(lut.byte1LowLookup());
    ByteVector byte2High = input.lanewise(LSHR, 4).selectFrom(lut.byte2HighLookup());

At a glance, selectFrom looks like a standard vectorized table lookup like vpshufb or vtbl. But I think it is not, unfortunately. It appears to generate a long flow of instructions.

So the people working on the Java Vector API appear to have assumed that instructions like vpshufb or vtbl were of secondary importance and did not need to be exposed to the programmer. I believe that it is a mistake.

The C# approach is quite different. In .NET 8, you have Ssse3.Shuffle or AdvSimd.Arm64.VectorTableLookup for example.

See https://mail.openjdk.org/pipermail/panama-dev/2024-June/020476.html

The text was updated successfully, but these errors were encountered:

AugustNagro · 2024-06-19T19:07:21Z

Thanks for posting on the panama-dev channel, I'll be following the discussion

lemire · 2024-06-19T19:18:03Z

Thanks for posting on the panama-dev channel, I'll be following the discussion

I encourage you to chime in. If I am the only voice, I might not carry enough weight.

Just try to hack your code and I bet you will soon find that selectFrom carry a significant penalty.

They are more likely to listen to people who have been actively trying to use the Vector API.

lemire · 2024-06-19T21:38:55Z

BTW I ported the algorithm to C#, I am about the make it public. The .NET library has already a very fast UTF-8 validation function (it is internal but we can get at it with copy-paste). As you can see, I get about 25 GB/s per second on Twitter.json. It wasn't more difficult that the work you did.

Results (x64)

On an Intel Ice Lake system...

data set	AVX2 (GB/s)	.NET speed (GB/s)
Twitter.json	24	12
Arabic-Lipsum	9.0	2.3
Chinese-Lipsum	9.0	3.9
Emoji-Lipsum	7.1	0.9
Hebrew-Lipsum	8.0	2.3
Hindi-Lipsum	8.0	2.1
Japanese-Lipsum	8.0	3.5
Korean-Lipsum	8.0	1.3
Latin-Lipsum	76	96
Russian-Lipsum	8.0	1.2

Results (ARM)

On an Apple M2 system...

data set	ARM NEON (GB/s)	.NET speed (GB/s)
Twitter.json	25	14
Arabic-Lipsum	7.4	3.5
Chinese-Lipsum	7.4	4.8
Emoji-Lipsum	7.4	2.5
Hebrew-Lipsum	7.4	3.5
Hindi-Lipsum	7.3	3.0
Japanese-Lipsum	7.3	4.6
Korean-Lipsum	7.4	1.8
Latin-Lipsum	87	38
Russian-Lipsum	7.4	2.7

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reason for the subpar performance of the lookup validation algorithm in Java [discussion] #10

Reason for the subpar performance of the lookup validation algorithm in Java [discussion] #10

lemire commented Jun 18, 2024 •

edited

Loading

AugustNagro commented Jun 19, 2024

lemire commented Jun 19, 2024

lemire commented Jun 19, 2024

Reason for the subpar performance of the lookup validation algorithm in Java [discussion] #10

Reason for the subpar performance of the lookup validation algorithm in Java [discussion] #10

Comments

lemire commented Jun 18, 2024 • edited Loading

AugustNagro commented Jun 19, 2024

lemire commented Jun 19, 2024

lemire commented Jun 19, 2024

Results (x64)

Results (ARM)

lemire commented Jun 18, 2024 •

edited

Loading