no tests for unicode codepoints vs. utf8-encoded characters #592

Open
karenetheridge opened this issue Aug 28, 2022 · 4 comments
Assignees
Labels
enhancement An enhancement to the tooling or structure of the suite (as opposed to a new test). missing test A request to add a test to the suite that is currently not covered elsewhere.

Comments

@karenetheridge
Member

karenetheridge commented Aug 28, 2022

No time to create a PR right now, so I'm dropping this here to come back to, and for discussion in case this is controversial for some reason:

schema: { "maxLength": 3 }

passing test: "ಠ_ಠ"
failing test: "ಠ__ಠ"

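In Unicode, ಠ is U+0CA0 (\x{0ca0}), and it is UTF-8-encoded as 3 bytes: 0xE0 0xB2 0xA0 (\x{e0}\x{b2}\x{a0}). So "ಠ_ಠ" is 3 code points but 7 bytes, and a validator that counts bytes would wrongly reject it.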

(and also a similar test for minLength)
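To make the distinction concrete, here is a minimal Python sketch, assuming that string length for `maxLength` is counted in Unicode code points. The helper names `code_point_length` and `utf8_byte_length` are hypothetical and only for illustration; they are not part of the test suite.

```python
def code_point_length(s: str) -> int:
    """Count Unicode code points (a Python 3 str is a sequence of code points)."""
    return len(s)


def utf8_byte_length(s: str) -> int:
    """Count UTF-8 bytes: the number a byte-oriented implementation might use by mistake."""
    return len(s.encode("utf-8"))


MAX_LENGTH = 3  # from the schema {"maxLength": 3}

for candidate in ("ಠ_ಠ", "ಠ__ಠ"):
    points = code_point_length(candidate)
    octets = utf8_byte_length(candidate)
    verdict = "valid" if points <= MAX_LENGTH else "invalid"
    print(f"{candidate!r}: {points} code points, {octets} UTF-8 bytes -> {verdict}")
```

An implementation that compares the byte count against `maxLength` would reject both strings, which is the failure the proposed test is meant to catch.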

karenetheridge self-assigned this Aug 28, 2022
Julian added the "missing test" and "enhancement" labels Nov 8, 2022
@OptimumCode
Contributor

OptimumCode commented Feb 7, 2025

Hi, this issue caught my eye while I was looking through the open issues. @karenetheridge @Julian could you please clarify how UTF-8 is related to the number of codepoints in the string? I just want to understand the idea behind this issue.

There are already tests for minLength and maxLength using 💩 (https://www.compart.com/en/unicode/U+1F4A9), which is encoded in UTF-8 as 4 bytes but is still a single codepoint. Don't they cover the described case?
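A quick Python check of that claim about 💩 (U+1F4A9). The JSON escape form used here is an assumption about how such a character might be written in a JSON test file; the thread does not show the actual test data.

```python
import json

pile_of_poo = "\U0001F4A9"  # 💩, U+1F4A9

# One code point, four UTF-8 bytes (0xF0 0x9F 0x92 0xA9).
print(len(pile_of_poo))
print(pile_of_poo.encode("utf-8").hex(" "))

# Assumed JSON spelling: a \u-escaped surrogate pair, which a JSON
# parser should decode back to the single code point U+1F4A9.
decoded = json.loads('"\\ud83d\\udca9"')
print(decoded == pile_of_poo, len(decoded))
```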

@Julian
Member

Julian commented Feb 7, 2025

That would seem like it covers what was asked for here to me, yes.

@karenetheridge
Member Author

Yes, that test covers what I was talking about. However, for completeness, we could test characters in each of the UTF-8 encoded-length ranges (1 to 4 bytes, respectively), as different languages/architectures may handle these differently:

  • U+0000 to U+007F
  • U+0080 to U+07FF
  • U+0800 to U+FFFF
  • U+010000 to U+10FFFF
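The four ranges above correspond to UTF-8 encodings of 1, 2, 3 and 4 bytes. A small Python sketch that checks this at the range boundaries; the `SAMPLES` characters are hypothetical picks, one per range, not values taken from the test suite.

```python
# (first, last) code point of each range listed above
RANGES = [
    (0x0000, 0x007F),
    (0x0080, 0x07FF),
    (0x0800, 0xFFFF),
    (0x010000, 0x10FFFF),
]


def utf8_len(code_point: int) -> int:
    """Number of bytes in the UTF-8 encoding of a single code point."""
    return len(chr(code_point).encode("utf-8"))


for first, last in RANGES:
    # Both ends of a range encode to the same number of bytes.
    assert utf8_len(first) == utf8_len(last)
    print(f"U+{first:04X} to U+{last:04X}: {utf8_len(first)} byte(s)")

# Hypothetical sample characters, one from each range, that a test could use.
SAMPLES = ["a", "é", "ಠ", "💩"]
SAMPLE_LENGTHS = [utf8_len(ord(ch)) for ch in SAMPLES]
print(SAMPLE_LENGTHS)
```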

@OptimumCode
Contributor

Just a thought on "completeness": because the existing tests use a code point that is encoded with the maximum possible number of bytes in UTF-8 (4, to be exact), an implementation that passes them would probably also pass tests using the other code point groups (those encoded as 1, 2 or 3 bytes in UTF-8).
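One way to sanity-check this reasoning is to model two plausible counting mistakes and see which sample characters expose each. This is only a sketch: the two counters below are hypothetical stand-ins for buggy implementations (counting UTF-8 bytes, or counting UTF-16 code units), and the sample characters are arbitrary picks keyed by their UTF-8 byte length.

```python
def count_utf8_bytes(s: str) -> int:
    """Hypothetical bug: length measured in UTF-8 bytes."""
    return len(s.encode("utf-8"))


def count_utf16_units(s: str) -> int:
    """Hypothetical bug: length measured in UTF-16 code units."""
    return len(s.encode("utf-16-le")) // 2


BUGGY_COUNTERS = {"utf8_bytes": count_utf8_bytes, "utf16_units": count_utf16_units}

# One arbitrary sample per UTF-8 byte-length group.
SAMPLES = {1: "a", 2: "é", 3: "ಠ", 4: "💩"}

# For each group, list the buggy counters that disagree with the code point count.
EXPOSED = {
    group: [name for name, count in BUGGY_COUNTERS.items() if count(ch) != len(ch)]
    for group, ch in SAMPLES.items()
}

for group, names in EXPOSED.items():
    print(f"{group}-byte sample exposes: {names or 'nothing'}")
```

Under these assumptions the 4-byte sample is the only one that exposes both mistakes, while the 1-byte sample exposes neither, which is consistent with the argument above. It does not rule out implementation bugs that are specific to the 2- or 3-byte groups.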
