Replies: 1 comment
-
Whoops, I overlooked 'range', exemplified by a..z. For example, (0xC00..0xC7F) will serve to denote the characters of the Telugu block. That deals with the problem I wrongly highlighted in the second paragraph.
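For anyone wanting to sanity-check what that range covers, here is a minimal Python sketch of the same membership test done by code point. It only illustrates the arithmetic behind 0xC00..0xC7F; the names `TELUGU_FIRST`, `TELUGU_LAST` and `in_telugu_block` are made up for this example and are not part of any tool's API.

```python
# Hypothetical helper: the Telugu block spans U+0C00..U+0C7F inclusive.
TELUGU_FIRST = 0x0C00
TELUGU_LAST = 0x0C7F


def in_telugu_block(ch: str) -> bool:
    """Return True if the single character ch lies in the Telugu block."""
    return TELUGU_FIRST <= ord(ch) <= TELUGU_LAST


if __name__ == "__main__":
    print(in_telugu_block("\u0C05"))  # TELUGU LETTER A
    print(in_telugu_block("a"))       # ASCII letter, outside the block
```

The range is inclusive at both ends, so it covers 128 code points, including any that are currently unassigned within the block.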
-
How is one meant to include Unicode characters? Are they only supported as an arbitrary 32-bit alphabet as opposed to a 16- or 8-bit alphabet, or is there special support for them?
Is there some syntax for including Unicode characters in regular expressions? Mostly one can just include them as machines recognising a single character, e.g. /a/ 0x0302 /.*/ as opposed to the NFD form /â.*/, but I can't work out how to specify a range of characters by codepoint other than by exhaustively listing the entire range.
Is there any sanity-preserving way of combining Unicode and semantic conditions? Unicode actually only needs 21 bits, so there are 11 bits left over for use in semantic conditions. One could use UTF-8 or UTF-16, but UTF-8 is unpleasant and UTF-16 tends to preserve obscure bugs with lone surrogates.
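To make the bit budget concrete, here is a rough Python sketch of packing a code point into the low 21 bits of a 32-bit symbol and using the remaining 11 bits as flags. This is only an illustration of the arithmetic, not a description of how any tool actually encodes semantic conditions; `pack`, `unpack` and the constants are hypothetical names. It also assumes lone surrogates should be rejected, which is a design choice rather than a requirement.

```python
# Assumed layout (illustrative only): bits 0..20 hold the code point,
# bits 21..31 hold up to 11 condition flags.
CODEPOINT_BITS = 21
CODEPOINT_MASK = (1 << CODEPOINT_BITS) - 1   # 0x1FFFFF
FLAG_BITS = 32 - CODEPOINT_BITS              # 11
MAX_CODEPOINT = 0x10FFFF


def pack(codepoint: int, flags: int) -> int:
    """Combine a Unicode scalar value and 11 flag bits into one 32-bit symbol."""
    if not 0 <= codepoint <= MAX_CODEPOINT:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError("surrogate code points are rejected in this sketch")
    if not 0 <= flags < (1 << FLAG_BITS):
        raise ValueError("flags do not fit in 11 bits")
    return (flags << CODEPOINT_BITS) | codepoint


def unpack(symbol: int) -> tuple[int, int]:
    """Split a packed symbol back into (codepoint, flags)."""
    return symbol & CODEPOINT_MASK, symbol >> CODEPOINT_BITS


if __name__ == "__main__":
    symbol = pack(0x0C05, 0b101)
    print(hex(symbol), unpack(symbol))
```

Working on whole code points this way sidesteps the UTF-16 surrogate problem, at the cost of needing a decoding step in front of the scanner.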