Basic Unicode support #83

Richard57 · 2022-05-04T14:55:57Z

How is one meant to include Unicode characters? Are they only supported as an arbitrary 32-bit alphabet as opposed to a 16- or 8-bit alphabet, or is there special support for them?

Is there some syntax for including Unicode characters in regular expressions? Mostly one can just include them as machines recognising a single character e.g. /a/ 0x0302 /.*/ as opposed to NFD /â.*/, but I can't work out how to specify a range of characters by codepoint, other than by exhaustively listing the entire range.
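For readers unfamiliar with the NFD form mentioned above: under canonical decomposition, the precomposed character â (U+00E2) becomes the letter a followed by U+0302 COMBINING CIRCUMFLEX ACCENT, which is why the machine is written as /a/ followed by 0x0302. A minimal sketch in Python, using only the standard `unicodedata` module (the variable names are illustrative):

```python
import unicodedata

# "â" as a single precomposed code point (U+00E2).
composed = "\u00e2"

# Canonical decomposition (NFD) splits it into a base letter plus a combining mark.
decomposed = unicodedata.normalize("NFD", composed)

# List the code points of the decomposed form.
codepoints = [hex(ord(ch)) for ch in decomposed]
print(codepoints)
```

On a 32-bit alphabet, each of these code points would be one input symbol.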

Is there any sanity-preserving way of combining Unicode and semantic conditions? Unicode actually only needs 21 bits, so there are 11 bits left over for use in semantic conditions. One could use UTF-8 or UTF-16, but UTF-8 is unpleasant and UTF-16 tends to preserve obscure bugs with lone surrogates.
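The bit arithmetic behind that question can be checked directly. The sketch below is only an illustration of the idea of carrying condition flags in the spare high bits of a 32-bit symbol; `pack_symbol`, `unpack_symbol`, and the flag layout are hypothetical names and choices, not a description of how the tool itself represents semantic conditions.

```python
MAX_CODE_POINT = 0x10FFFF                      # highest Unicode code point
CODE_POINT_BITS = MAX_CODE_POINT.bit_length()  # bits needed for any code point
SPARE_BITS = 32 - CODE_POINT_BITS              # bits left over in a 32-bit symbol
CODE_POINT_MASK = (1 << CODE_POINT_BITS) - 1


def is_surrogate(code_point: int) -> bool:
    """True for U+D800..U+DFFF, which are not valid scalar values."""
    return 0xD800 <= code_point <= 0xDFFF


def pack_symbol(code_point: int, flags: int) -> int:
    """Combine a code point and hypothetical condition flags into one integer."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("not a Unicode code point")
    if is_surrogate(code_point):
        raise ValueError("lone surrogate")
    if not 0 <= flags < (1 << SPARE_BITS):
        raise ValueError("flags do not fit in the spare bits")
    return (flags << CODE_POINT_BITS) | code_point


def unpack_symbol(symbol: int) -> tuple[int, int]:
    """Split a packed symbol back into (code_point, flags)."""
    return symbol & CODE_POINT_MASK, symbol >> CODE_POINT_BITS


print(CODE_POINT_BITS, SPARE_BITS)
```

Rejecting surrogates at packing time is one way to keep the lone-surrogate problem out of the state machine's input entirely, assuming the input has already been decoded to code points.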

Richard57 · 2022-05-04T19:48:58Z

Whoops, I overlooked 'range', exemplified by a..z. For example, (0xC00..0xC7F) will serve to denote the characters of the Telugu block. That deals with the problem I wrongly highlighted in the second paragraph.
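As a sanity check on the bounds of that range, here is the same membership test sketched in Python. It assumes the `..` range is inclusive at both ends, as with a..z, so the Python `range` needs an end of 0x0C80; the names `TELUGU_BLOCK` and `in_telugu_block` are illustrative.

```python
# The Telugu block spans U+0C00..U+0C7F inclusive; Python's range excludes its end.
TELUGU_BLOCK = range(0x0C00, 0x0C7F + 1)


def in_telugu_block(ch: str) -> bool:
    """True if the single character ch lies in the Telugu block."""
    return ord(ch) in TELUGU_BLOCK


print(in_telugu_block("\u0c05"))  # U+0C05 TELUGU LETTER A
print(in_telugu_block("a"))
```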
