no tests for unicode codepoints vs. utf8-encoded characters #592

karenetheridge · 2022-08-28T23:54:43Z

no time to create a PR right now, so dropping this here to come back to, and for discussion in case for some reason this is controversial:

schema: { "maxLength": 3 }

passing test: "ಠ_ಠ"
failing test: "ಠ__ಠ"

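In Unicode, ಠ is the single code point U+0CA0 (\x{0ca0}), which UTF-8 encodes as 3 bytes: 0xE0 0xB2 0xA0 (\x{e0}\x{b2}\x{a0}). An implementation that counts bytes instead of code points would therefore get both tests wrong.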

(and also a similar test for minLength)
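A minimal sketch of the distinction being tested, assuming an implementation that measures string length in Unicode code points. The helper names (`code_point_length`, `utf8_byte_length`, `valid_max_length`) are hypothetical and not taken from any particular validator:

```python
def code_point_length(s: str) -> int:
    # A Python str is a sequence of Unicode code points, so len() counts them.
    return len(s)


def utf8_byte_length(s: str) -> int:
    # Number of bytes after UTF-8 encoding; this is NOT what maxLength measures.
    return len(s.encode("utf-8"))


def valid_max_length(instance: str, max_length: int) -> bool:
    # Hypothetical maxLength check based on code points.
    return code_point_length(instance) <= max_length


passing = "\u0ca0_\u0ca0"    # ಠ_ಠ
failing = "\u0ca0__\u0ca0"   # ಠ__ಠ

for text in (passing, failing):
    print(
        repr(text),
        "code points:", code_point_length(text),
        "UTF-8 bytes:", utf8_byte_length(text),
        "valid for maxLength 3:", valid_max_length(text, 3),
    )
```

A byte-counting implementation would see 7 bytes for the first string and wrongly reject it against `maxLength: 3`.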


OptimumCode · 2025-02-07T16:14:12Z

Hi, this issue caught my eye while I was looking through the open issues. @karenetheridge @Julian could you please clarify how UTF-8 relates to the number of code points in a string? I just want to understand the idea behind this issue.

There are already tests for minLength and maxLength using 💩 (https://www.compart.com/en/unicode/U+1F4A9), which is encoded in UTF-8 as 4 bytes but is still a single code point. Don't they cover the described case?

Julian · 2025-02-07T16:17:05Z

That would seem like it covers what was asked for here to me, yes.

karenetheridge · 2025-02-07T18:47:39Z

Yes, that test covers what I was talking about. However, for completeness, we could test characters in each of the UTF-8 encoding-length ranges (1, 2, 3 and 4 bytes respectively), as different languages/architectures may handle these differently:

U+0000 to U+007F
U+0080 to U+07FF
U+0800 to U+FFFF
U+010000 to U+10FFFF
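The four ranges above could be exercised with one sample character each. The sketch below is only an illustration: the sample characters are arbitrary choices, `make_cases` is a hypothetical helper, and the dictionary layout is loosely modeled on the suite's schema/tests/data/valid structure rather than copied from it.

```python
# One sample character per range; the choice of samples is arbitrary.
SAMPLES = {
    "U+0000 to U+007F": "\u0041",          # A
    "U+0080 to U+07FF": "\u00e9",          # é
    "U+0800 to U+FFFF": "\u0ca0",          # ಠ
    "U+010000 to U+10FFFF": "\U0001f4a9",  # 💩
}


def utf8_width(ch: str) -> int:
    # Number of bytes UTF-8 needs for this single code point.
    return len(ch.encode("utf-8"))


def make_cases(ch: str, limit: int = 2) -> dict:
    # Hypothetical helper: builds a maxLength test group for one character.
    return {
        "description": f"maxLength with U+{ord(ch):04X} ({utf8_width(ch)} UTF-8 bytes)",
        "schema": {"maxLength": limit},
        "tests": [
            {"description": "exactly at the limit", "data": ch * limit, "valid": True},
            {"description": "one code point too long", "data": ch * (limit + 1), "valid": False},
        ],
    }


for label, ch in SAMPLES.items():
    group = make_cases(ch)
    print(label, "->", group["description"])
```

The same shape would work for minLength by swapping the keyword and inverting the expected results.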

OptimumCode · 2025-02-07T20:46:12Z

Just a thought on "completeness": because the existing tests use a code point that is encoded with the maximum possible number of bytes (4, to be exact), an implementation that passes them would probably also pass tests using code points from the other groups (those that take 1, 2 or 3 bytes in UTF-8).
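To illustrate this reasoning, here is a rough sketch comparing three ways an implementation might measure length. The counters and the `is_caught` helper are hypothetical; the UTF-16 counter is included as an assumption about how some runtimes measure strings, not something stated in this thread.

```python
def count_code_points(s: str) -> int:
    # Intended behaviour: one unit per Unicode code point.
    return len(s)


def count_utf8_bytes(s: str) -> int:
    # Buggy behaviour: counts encoded bytes.
    return len(s.encode("utf-8"))


def count_utf16_units(s: str) -> int:
    # Buggy behaviour: counts UTF-16 code units (2 bytes each).
    return len(s.encode("utf-16-le")) // 2


def is_caught(counter, ch: str, limit: int = 2) -> bool:
    # True if a maxLength test pair built from `ch` exposes the counter,
    # i.e. the counter disagrees with code-point semantics on either string.
    at_limit, too_long = ch * limit, ch * (limit + 1)
    agrees = counter(at_limit) <= limit and counter(too_long) > limit
    return not agrees


three_byte = "\u0ca0"      # ಠ, 3 bytes in UTF-8
four_byte = "\U0001f4a9"   # 💩, 4 bytes in UTF-8

for name, counter in [
    ("code points", count_code_points),
    ("UTF-8 bytes", count_utf8_bytes),
    ("UTF-16 units", count_utf16_units),
]:
    print(name, "caught by 3-byte char:", is_caught(counter, three_byte),
          "| caught by 4-byte char:", is_caught(counter, four_byte))
```

Under these assumptions, the 4-byte character exposes both buggy counters, while the 3-byte character exposes only the byte-counting one, which is consistent with the point that the existing 💩 tests are the stricter check.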

karenetheridge self-assigned this Aug 28, 2022

Julian added the "missing test" and "enhancement" labels Nov 8, 2022
