This repository has been archived by the owner on Aug 31, 2024. It is now read-only.
library for unicode text segmentation #16
mmozeiko started this conversation in Libraries & Tools
Replies: 2 comments
I'm interested in working on this.
A C99 library that provides this functionality: https://libs.suckless.org/libgrapheme/
Unicode text can contain codepoints that must be grouped with neighboring codepoints when the text is processed as characters for display. When you're processing text for rendering or other purposes, you must know where the grapheme boundaries are.
It would be nice to have a clean C API for processing text incrementally and determining these boundaries.
The Unicode spec for the algorithm that does this is here: https://unicode.org/reports/tr29/
The API should offer a state machine that accepts individual Unicode codepoints as input and reports where a boundary starts or ends. Don't worry about encodings; leave that to the user's code.
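To make the state machine idea concrete, here is a minimal sketch in C of what such an API could look like. All names (`gb_state`, `gb_feed`, `gb_count`, and so on) are hypothetical, and the classifier covers only a handful of codepoint ranges and only rules GB1, GB3, GB4, GB5, GB9 and GB999 from TR29. A real implementation would need the full Grapheme_Cluster_Break property data and the remaining rules (Hangul, emoji, regional indicators, and others).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; this handles only a small subset of TR29. */
typedef enum { GB_OTHER, GB_CR, GB_LF, GB_CONTROL, GB_EXTEND, GB_ZWJ } gb_class;

typedef struct {
    gb_class prev;  /* class of the previously fed codepoint */
    bool started;   /* false until the first codepoint is fed */
} gb_state;

/* Toy classifier: a real one would look up the full Unicode property data. */
static gb_class gb_classify(uint32_t cp) {
    if (cp == 0x000D) return GB_CR;
    if (cp == 0x000A) return GB_LF;
    if (cp < 0x20 || (cp >= 0x7F && cp <= 0x9F)) return GB_CONTROL;
    if (cp == 0x200D) return GB_ZWJ;
    if (cp >= 0x0300 && cp <= 0x036F) return GB_EXTEND; /* combining diacritics */
    if (cp >= 0xFE00 && cp <= 0xFE0F) return GB_EXTEND; /* variation selectors */
    return GB_OTHER;
}

static void gb_init(gb_state *s) {
    s->prev = GB_OTHER;
    s->started = false;
}

/* Feed one codepoint; returns true if a boundary lies immediately before it. */
static bool gb_feed(gb_state *s, uint32_t cp) {
    gb_class cur = gb_classify(cp);
    bool brk;
    if (!s->started)
        brk = true;                                   /* GB1: start of text */
    else if (s->prev == GB_CR && cur == GB_LF)
        brk = false;                                  /* GB3: CR x LF */
    else if (s->prev == GB_CR || s->prev == GB_LF || s->prev == GB_CONTROL)
        brk = true;                                   /* GB4 */
    else if (cur == GB_CR || cur == GB_LF || cur == GB_CONTROL)
        brk = true;                                   /* GB5 */
    else if (cur == GB_EXTEND || cur == GB_ZWJ)
        brk = false;                                  /* GB9 */
    else
        brk = true;                                   /* GB999 */
    s->started = true;
    s->prev = cur;
    return brk;
}

/* Convenience helper: count grapheme clusters in an array of codepoints. */
static size_t gb_count(const uint32_t *cps, size_t n) {
    gb_state s;
    size_t count = 0;
    gb_init(&s);
    for (size_t i = 0; i < n; i++)
        if (gb_feed(&s, cps[i])) count++;
    return count;
}
```

The caller decodes UTF-8, UTF-16 or whatever it has into codepoints and feeds them one at a time, so the library itself stays encoding-agnostic. Keeping all state in a small caller-owned struct means no allocation is needed and several streams can be processed independently.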
Existing open source libraries offering this kind of functionality:
A good reference on how to compactly store properties for codepoints is here: https://www.strchr.com/multi-stage_tables
As a bonus, for a more complete Unicode text processing solution, you could consider offering these features too:
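As a rough illustration of the multi-stage table idea, the sketch below splits a codepoint into a block index (high bits) and an offset (low bits), and stores each distinct 256-entry block only once. The names, the 8-bit block size, the `MAX_UNIQUE` cap, and the toy property are all assumptions made for this example. In practice the tables would be generated offline from the Unicode data files and compiled in as constant arrays rather than built at runtime.

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_BITS 8
#define BLOCK_SIZE (1u << BLOCK_BITS)
#define NUM_BLOCKS (0x110000u >> BLOCK_BITS)  /* 4352 blocks cover U+0000..U+10FFFF */
#define MAX_UNIQUE 16                         /* assumed cap, enough for the toy property */

static uint16_t stage1[NUM_BLOCKS];           /* block number -> index of a unique block */
static uint8_t  stage2[MAX_UNIQUE][BLOCK_SIZE];
static unsigned unique_blocks;

/* Toy property for illustration: 1 for combining diacritical marks, else 0. */
static uint8_t toy_property(uint32_t cp) {
    return (cp >= 0x0300 && cp <= 0x036F) ? 1 : 0;
}

/* Build both stages, deduplicating identical blocks.
   Returns 0 on success, -1 if MAX_UNIQUE is too small. */
static int build_tables(void) {
    uint8_t block[BLOCK_SIZE];
    unique_blocks = 0;
    for (uint32_t b = 0; b < NUM_BLOCKS; b++) {
        for (uint32_t i = 0; i < BLOCK_SIZE; i++)
            block[i] = toy_property((b << BLOCK_BITS) | i);

        /* Linear search for an identical block already stored. */
        unsigned j;
        for (j = 0; j < unique_blocks; j++)
            if (memcmp(stage2[j], block, BLOCK_SIZE) == 0) break;

        if (j == unique_blocks) {
            if (unique_blocks == MAX_UNIQUE) return -1;
            memcpy(stage2[unique_blocks++], block, BLOCK_SIZE);
        }
        stage1[b] = (uint16_t)j;
    }
    return 0;
}

/* Two array reads per lookup, no branches besides the range check. */
static uint8_t lookup(uint32_t cp) {
    if (cp > 0x10FFFF) return 0;
    return stage2[stage1[cp >> BLOCK_BITS]][cp & (BLOCK_SIZE - 1)];
}
```

With this toy property only two distinct blocks exist (the all-zero block and the one covering U+0300..U+03FF), so the tables stay far smaller than a flat array with one entry per codepoint. Real properties produce more unique blocks, but most of the codespace still collapses into a few shared ones.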