library for unicode text segmentation #16

mmozeiko · 2022-06-29T04:00:13Z

Unicode text can contain codepoints that must be grouped together with neighboring codepoints when the text is processed as characters for display. When you're processing text for rendering or other purposes, you must know where the grapheme boundaries are.

It would be nice to have a clean C API for processing text incrementally and determining these boundaries.
The Unicode spec for the algorithm that does this is here: https://unicode.org/reports/tr29/

The API should offer a state machine that accepts individual Unicode codepoints as input and reports where each grapheme boundary starts and ends. Don't worry about encodings; leave that to the user's code.
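To make the idea concrete, here is a rough sketch of what such a state machine could look like. All names (`gb_state`, `gb_feed`, `gb_classify`, `gb_count`) are made up for illustration and do not come from any existing library. The classifier is a hand-written stand-in that only knows CR, LF, C0/C1 controls, ZWJ, and the combining diacritics block U+0300..U+036F, and only rules GB1, GB3, GB4, GB5, GB9, and GB999 from UAX #29 are handled. A real implementation would need the full property data plus the remaining rules (Hangul, Prepend, SpacingMark, emoji ZWJ sequences, regional indicators).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; a simplified subset of UAX #29. */
typedef enum { GB_OTHER, GB_CR, GB_LF, GB_CONTROL, GB_EXTEND, GB_ZWJ } gb_class;

typedef struct {
    gb_class prev;    /* class of the previously fed codepoint */
    bool     started; /* false until the first codepoint is fed */
} gb_state;

/* Stand-in classifier: covers only a handful of ranges, not the real
 * Grapheme_Cluster_Break property data. */
gb_class gb_classify(uint32_t cp)
{
    if (cp == 0x000D) return GB_CR;
    if (cp == 0x000A) return GB_LF;
    if (cp < 0x20 || (cp >= 0x7F && cp <= 0x9F)) return GB_CONTROL;
    if (cp == 0x200D) return GB_ZWJ;
    if (cp >= 0x0300 && cp <= 0x036F) return GB_EXTEND;
    return GB_OTHER;
}

/* Feed one codepoint. Returns true if a grapheme boundary lies
 * immediately before it, false if it continues the current cluster. */
bool gb_feed(gb_state *s, uint32_t cp)
{
    gb_class cur = gb_classify(cp);
    bool brk;

    if (!s->started)
        brk = true;                                   /* GB1: start of text */
    else if (s->prev == GB_CR && cur == GB_LF)
        brk = false;                                  /* GB3: CR x LF */
    else if (s->prev == GB_CR || s->prev == GB_LF || s->prev == GB_CONTROL)
        brk = true;                                   /* GB4: break after controls */
    else if (cur == GB_CR || cur == GB_LF || cur == GB_CONTROL)
        brk = true;                                   /* GB5: break before controls */
    else if (cur == GB_EXTEND || cur == GB_ZWJ)
        brk = false;                                  /* GB9: x (Extend | ZWJ) */
    else
        brk = true;                                   /* GB999: break everywhere else */

    s->prev = cur;
    s->started = true;
    return brk;
}

/* Usage example: count grapheme clusters in an array of codepoints. */
size_t gb_count(const uint32_t *cps, size_t n)
{
    gb_state s = { GB_OTHER, false };
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (gb_feed(&s, cps[i]))
            count++;
    return count;
}
```

Reporting "is there a boundary before this codepoint" keeps the state down to one class value in this subset, and the caller never has to hand over a buffer or say anything about the encoding. The full rule set needs a bit more state (for example, regional indicator parity), but the shape of the interface can stay the same.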

Existing open source libraries offering this kind of functionality:

harfbuzz - https://harfbuzz.github.io/clusters.html
icu - https://unicode-org.github.io/icu/userguide/boundaryanalysis/

A good reference on how to store codepoint properties compactly is here: https://www.strchr.com/multi-stage_tables
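As a rough illustration of the multi-stage idea, the sketch below splits the codepoint range into 256-entry blocks, stores each distinct block only once, and keeps a first-stage array that maps a block number to its unique block. Everything here is an assumption made for the example: the names (`two_stage`, `ts_build`, `ts_lookup`, `demo_prop`), the block size, the cap of 64 unique blocks, and the toy property function. A real library would more likely generate the tables offline from the Unicode data files and ship them as constant arrays.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
    BLOCK_BITS = 8,
    BLOCK_SIZE = 1 << BLOCK_BITS,         /* 256 codepoints per block */
    NUM_BLOCKS = 0x110000 >> BLOCK_BITS,  /* 4352 blocks cover U+0000..U+10FFFF */
    MAX_UNIQUE = 64                       /* arbitrary cap chosen for this sketch */
};

typedef struct {
    uint8_t stage1[NUM_BLOCKS];             /* block number -> unique block index */
    uint8_t stage2[MAX_UNIQUE][BLOCK_SIZE]; /* deduplicated blocks of property values */
    size_t  unique;                         /* number of stage2 blocks in use */
} two_stage;

typedef uint8_t (*prop_fn)(uint32_t cp);

/* Toy property for the demo: 1 for combining diacritics U+0300..U+036F. */
uint8_t demo_prop(uint32_t cp)
{
    return (cp >= 0x0300 && cp <= 0x036F) ? 1 : 0;
}

/* Build both stages from a property function.
 * Returns false if more than MAX_UNIQUE distinct blocks are needed. */
bool ts_build(two_stage *t, prop_fn prop)
{
    uint8_t block[BLOCK_SIZE];
    t->unique = 0;
    for (uint32_t b = 0; b < NUM_BLOCKS; b++) {
        for (uint32_t i = 0; i < BLOCK_SIZE; i++)
            block[i] = prop((b << BLOCK_BITS) | i);

        /* Linear search for an identical block that is already stored. */
        size_t u = 0;
        while (u < t->unique && memcmp(t->stage2[u], block, BLOCK_SIZE) != 0)
            u++;
        if (u == t->unique) {
            if (t->unique == MAX_UNIQUE)
                return false;
            memcpy(t->stage2[t->unique++], block, BLOCK_SIZE);
        }
        t->stage1[b] = (uint8_t)u;
    }
    return true;
}

/* Two array reads per lookup; out-of-range input yields 0. */
uint8_t ts_lookup(const two_stage *t, uint32_t cp)
{
    if (cp > 0x10FFFF)
        return 0;
    return t->stage2[t->stage1[cp >> BLOCK_BITS]][cp & (BLOCK_SIZE - 1)];
}
```

With the toy property only two distinct blocks exist (the all-zero block and the one covering U+0300..U+03FF), so the table takes about 5 KB instead of one byte for each of the 1,114,112 codepoints.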

As a bonus, for a more complete Unicode text processing solution, you could consider offering these features too:

bidirectional algorithm: https://unicode.org/reports/tr9/
line breaking algorithm: https://www.unicode.org/reports/tr14/
normalization: https://www.unicode.org/reports/tr15/

calebarg · 2022-07-21T17:24:23Z

I'm interested in working on this.

lukekasz · 2022-08-14T19:17:19Z

C99 library that provides this functionality: https://libs.suckless.org/libgrapheme/
