Non-surrogate characters being incorrectly combined when `JsonWriteFeature.COMBINE_UNICODE_SURROGATES_IN_UTF8` is enabled #1359

jgosselin-accesso · 2024-11-13T21:33:16Z

Version: 2.18.0+

Hello! I believe we've found a bug in the encoding of UTF-8 JSON when the COMBINE_SURROGATES_AS_UTF8 feature is enabled whereby some characters in the BMP, not belonging to the surrogate blocks, are incorrectly marked as such and therefore combined. This leads to invalid encoding results around those characters.

This happens with some characters that are specifically in the BMP but beyond the surrogate blocks. Notably this includes half- and full-width characters. Characters in the surrogate blocks of D800 - DFFF are correctly treated as surrogates and properly combined when the feature is enabled, and characters before those blocks are also encoded correctly as usual.

Here is a reproducible example:

JsonFactory factory = new JsonFactory();  
ByteArrayOutputStream out = new ByteArrayOutputStream();  
JsonGenerator gen = factory.createGenerator(out);  
  
gen.writeStartObject();  
gen.enable(JsonGenerator.Feature.COMBINE_UNICODE_SURROGATES_IN_UTF8);  
  
// Inside the BMP, beyond surrogate block; 0xFF0C - full-width comma 
gen.writeStringField("test_full_width", "foo" + new String(Character.toChars(0xFF0C)) + "bar");  
  
// Inside the BMP, beyond surrogate block; 0xFE6A - small form percent
gen.writeStringField("test_small_form", "foo" + new String(Character.toChars(0xFE6A)) + "bar");  
  
// Inside the BMP, before the surrogate block; 0x3042 - Hiragana A
gen.writeStringField("test_hiragana", "foo" + new String(Character.toChars(0x3042)) + "bar");  
  
// Outside the BMP; 0x1F60A - emoji
gen.writeStringField("test_emoji", new String(Character.toChars(0x1F60A)));  
  
gen.writeEndObject();  
gen.close();  
  
System.out.write(out.toByteArray());

This prints the following (image to show the broken encoding):

Notice that in the first two cases, the characters are treated as surrogates and attempted to be combined with the "b" in "bar" as it is the character immediately to the right, which results in the broken encoding we observe here. The case of a character before the surrogate block is properly encoded, and the emoji is properly combined, as expected.

Disabling the feature, of course, gives us the correct results with the emoji encoded to the escaped surrogate pair:

{"test_full_width":"foo，bar","test_small_form":"foo﹪bar","test_hiragana":"fooあbar","test_emoji":"\uD83D\uDE0A"}

This appears to be caused by an incorrect mask used to check if a character is a surrogate:

jackson-core/src/main/java/com/fasterxml/jackson/core/json/UTF8JsonGenerator.java

Lines 2251 to 2254 in 93335d4

    
           // @since 2.18 
        
           private boolean _isSurrogateChar(int ch) { 
        
               return (ch & 0xD800) == 0xD800; 
        
           }

Since the surrogate blocks are in the range D800 - DFFF, they all have their first 5 bits as 11011. In the mask above, however, the 3rd bit can yield an incorrect truth statement on the conditional check. For example, consider U+FF08, the full-width open parentheses character. This would be 1111 1111 0000 1000, and when the mask is applied, we get 1101 1000 0000 0000 or D800. However, this character does not belong to any of the surrogate blocks and its first five bits are 11111. The issue is that the 3rd bit is 0 on the mask, which results in D800 after the mask is applied, even though FF08 is not a surrogate.

A quick fix might be to use 0xF800 as the mask so that the 3rd bit may be accounted for properly.

The text was updated successfully, but these errors were encountered:

pjfanning · 2024-11-13T21:42:28Z

Thanks @jgosselin-accesso - would you be interested in doing a PR?

pjfanning · 2024-11-13T21:45:05Z

I would also suggest that _isSurrogateChar is not a great name. Maybe _isFirstByteOfSurrogatePair is better?

I would like to make this method package private so that we can unit test it.

cowtowncoder · 2024-11-13T23:24:30Z

Assuming this is related to/caused by #223 ?

pjfanning · 2024-11-13T23:47:01Z

Assuming this is related to/caused by #223 ?

#1335 brought in the COMBINE_SURROGATES_AS_UTF8 feature

cowtowncoder · 2024-11-14T00:48:22Z

Right, that's PR to resolve #223. And yes, mask should probably be 0xF800 to check for 0xD800 prefix.

EDIT: realized it's 0xF800 for any surrogate, but 0xFC00 to specifically require starting surrogate.

cowtowncoder added the 2.18 Issues planned at earliest for 2.18 label Nov 13, 2024

pjfanning mentioned this issue Nov 14, 2024

fix issue with incorrect matches for some surrogate chars #1360

Merged

cowtowncoder changed the title ~~Non-surrogate characters being incorrectly combined when feature is enabled~~ Non-surrogate characters being incorrectly combined when JsonWriteFeature.COMBINE_UNICODE_SURROGATES_IN_UTF8 is enabled Nov 15, 2024

cowtowncoder modified the milestones: 2.18.0, 2.18.2 Nov 15, 2024

cowtowncoder closed this as completed Nov 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Non-surrogate characters being incorrectly combined when `JsonWriteFeature.COMBINE_UNICODE_SURROGATES_IN_UTF8` is enabled #1359

Non-surrogate characters being incorrectly combined when `JsonWriteFeature.COMBINE_UNICODE_SURROGATES_IN_UTF8` is enabled #1359

jgosselin-accesso commented Nov 13, 2024

pjfanning commented Nov 13, 2024

pjfanning commented Nov 13, 2024

cowtowncoder commented Nov 13, 2024

pjfanning commented Nov 13, 2024

cowtowncoder commented Nov 14, 2024 •

edited

Loading

Non-surrogate characters being incorrectly combined when JsonWriteFeature.COMBINE_UNICODE_SURROGATES_IN_UTF8 is enabled #1359

Non-surrogate characters being incorrectly combined when JsonWriteFeature.COMBINE_UNICODE_SURROGATES_IN_UTF8 is enabled #1359

Comments

jgosselin-accesso commented Nov 13, 2024

pjfanning commented Nov 13, 2024

pjfanning commented Nov 13, 2024

cowtowncoder commented Nov 13, 2024

pjfanning commented Nov 13, 2024

cowtowncoder commented Nov 14, 2024 • edited Loading

Non-surrogate characters being incorrectly combined when `JsonWriteFeature.COMBINE_UNICODE_SURROGATES_IN_UTF8` is enabled #1359

Non-surrogate characters being incorrectly combined when `JsonWriteFeature.COMBINE_UNICODE_SURROGATES_IN_UTF8` is enabled #1359

cowtowncoder commented Nov 14, 2024 •

edited

Loading