Confused about confusables #12

Open
kengruven opened this issue Mar 3, 2021 · 1 comment
@kengruven

I'm reading about confusable characters in CE and it warns "Note: This list is not guaranteed complete! Use it as a guide only. The Unicode character set will change at times in the future, so it's on you to keep up."

I think this is a noble goal, but I'm just not sure how it's going to work, especially as specified. Where is the list? I wasn't aware of a Unicode property called "confusable". Is there an ICU function to return the current set? (Sort of: they call it "spoof detection".)

So I googled and the first hit was this page, which says that (for example) letter O and digit 0 are confusable. Does that mean I can't use "O" or "0" in a CTE key without escaping it?
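For context, UTS #39 treats confusability as a relation between two strings, not a ban on individual characters: each string is reduced to a "skeleton" (NFD, map every character to its prototype, NFD again) and the skeletons are compared. A minimal Python sketch of that idea follows. The `PROTOTYPES` table is a tiny hand-written stand-in for the real confusables data, and the function names are made up for illustration; whether CE intends anything like this is exactly what this issue is asking.

```python
import unicodedata

# Hypothetical excerpt in the style of the UTS #39 confusables data:
# each source character maps to a "prototype" character.
# This is NOT the real table, only two illustrative entries.
PROTOTYPES = {
    "0": "O",       # DIGIT ZERO -> LATIN CAPITAL LETTER O
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES -> LATIN SMALL LETTER C
}


def skeleton(text: str) -> str:
    """Reduce text to a skeleton: NFD, map to prototypes, NFD again."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(PROTOTYPES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


def looks_alike(a: str, b: str) -> bool:
    """Two strings are treated as confusable when their skeletons match."""
    return skeleton(a) == skeleton(b)


if __name__ == "__main__":
    print(looks_alike("O", "0"))   # compares letter O with digit zero
    print(looks_alike("O", "Q"))   # unrelated letters
```

Under this model, "O" on its own is never invalid; it only becomes interesting when compared against another string such as "0".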

I dug around in the CE source code for a while (I'm not a Go programmer) and found this. That seems to match what's in the CE documentation, but it's very different from UTS#39.

I agree in principle that allowing "spoofing"/"confusables" can be problematic for human-readable text formats, but the rules I see here are so vague that, after researching this for an hour, I still can't tell if "O" is a valid string key or not.

@kstenerud
Owner

kstenerud commented Mar 13, 2021

The problem I'm trying to solve is the case where otherwise valid symbols make it difficult for humans to see what's going on:

c1
{
    something = [abc]
}

Is something pointing to a list or a string? Because the first character of the value only looks like [ and is actually a different code point, it's a string. And since it doesn't contain any reserved symbol characters, it can be printed without quotes, leading to the above situation where a computer understands what it is, but a human doesn't.
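To make the difference visible, a small Python sketch can print the code point and Unicode name of each character. The lookalike used here, U+FF3B FULLWIDTH LEFT SQUARE BRACKET, is an assumption chosen for illustration; the exact character in the example above may be a different one. The helper name `describe` is also made up.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"


real = "["            # the structural list opener
lookalike = "\uFF3B"  # assumed lookalike, for illustration only

if __name__ == "__main__":
    print(describe(real))
    print(describe(lookalike))
    print(real == lookalike)  # the two are different characters
```

A parser compares code points, so it never confuses the two; only a human reading the rendered glyphs does.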

This leads to the unfortunate problem of locking down a continually changing spec (Unicode) and deciding which characters are perceptually too close to characters that alter the structure of the document... I suppose maybe I'll have to go on an exhaustive search of all possible problematic characters, which will complicate all encoders, but the alternative is to allow confusing documents.
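One partial alternative to a hand-maintained list might be to let Unicode's own compatibility mappings find some of the lookalikes. The Python sketch below flags any character that is not structural itself but turns into a structural character under NFKC normalization. Everything here is an assumption for illustration: the `STRUCTURAL` set is guessed from the example above, the function names are hypothetical, and this heuristic only catches compatibility variants (such as fullwidth forms), not every visually similar character.

```python
import unicodedata

# Assumed set of structure-altering characters, taken from the example
# document above; the real set in the spec may differ.
STRUCTURAL = set("[]{}=")


def structural_lookalikes(text: str) -> list[str]:
    """Return characters that NFKC-fold to a structural character
    without being one themselves."""
    found = []
    for ch in text:
        if ch in STRUCTURAL:
            continue  # a real structural character, not a lookalike
        folded = unicodedata.normalize("NFKC", ch)
        if any(c in STRUCTURAL for c in folded):
            found.append(ch)
    return found


def must_quote(text: str) -> bool:
    """Hypothetical encoder rule: quote a string if it contains a
    structural character or something that folds to one."""
    has_structural = any(ch in STRUCTURAL for ch in text)
    return has_structural or bool(structural_lookalikes(text))


if __name__ == "__main__":
    print(must_quote("abc"))
    print(must_quote("\uFF3Babc\uFF3D"))  # fullwidth brackets around abc
```

An encoder using a rule like this would print the second string in quotes, so a reader would not mistake it for a list. Because the result depends on the Unicode version shipped with the runtime, it does not remove the need to track Unicode changes.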
