Text processing can use some TLC #3217

dodexahedron (Collaborator) · 2024-01-27T17:51:14Z

A lot of very hot code paths in the library live in the StringExtensions, RuneExtensions, and TextFormatter classes.

That code - which does great stuff and you guys deserve plenty of kudos for - needs to be cleaned up before RTM of v2.

The biggest things:

- There are quite a few methods that only ever get touched in unit test code. That's fine if they're intentionally there for consumers to make use of, but otherwise they should be removed.
- A lot of this code is hot even in basic usage of the library, and it is also public and does useful stuff that consumers are likely to call. Its performance is... Let's just say that the 700x execution time I mentioned in another issue does not come close to describing the memory or CPU cost of some of what these classes do, sometimes in tight loops. A single method call, whether made directly by a consumer or triggered incidentally by something they did in TG, can end up with CPU and memory complexity (garbage, specifically, so also extra GC work) of anything from steep linear to quadratic, cubic, quartic, and beyond.

I'm really tempted to do at least some minor work on some very low-hanging fruit there while waiting on the formatting stuff. For example, there is a hot-looped string concatenation that, all by itself, is capable of wasting shocking amounts of memory very quickly, even for fairly small inputs. And that method is called by a ton of others, some of which even do the work more than once.
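To illustrate the general pattern (a made-up sketch, not the actual method in the library; `ConcatSketch`, `RepeatWithConcat`, and `RepeatWithBuilder` are hypothetical names), compare concatenating in a loop with appending to a `StringBuilder`:

```csharp
using System;
using System.Text;

// Hypothetical illustration only; none of these names exist in the library.
public static class ConcatSketch
{
    // Each += allocates a brand-new string and copies everything built so far,
    // so total copying (and garbage) grows roughly quadratically with count.
    public static string RepeatWithConcat(string piece, int count)
    {
        string result = string.Empty;
        for (int i = 0; i < count; i++)
        {
            result += piece;
        }
        return result;
    }

    // StringBuilder appends into a buffer sized up front, so the work stays
    // roughly linear and only the final ToString() allocates the result string.
    public static string RepeatWithBuilder(string piece, int count)
    {
        var sb = new StringBuilder(piece.Length * count);
        for (int i = 0; i < count; i++)
        {
            sb.Append(piece);
        }
        return sb.ToString();
    }
}
```

Both methods return the same string. The difference is only in how much intermediate garbage they produce along the way, which is what matters when something like this sits under a lot of other callers.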

Wanted to get some feedback on those classes before I convert this to an issue.

dodexahedron (Collaborator, Author) · 2024-01-27T17:57:08Z

Also, something that is relevant to wide characters:

Has normalizing strings, either before working on them or while enumerating their runes, ever been explored here?

If it hasn't (some of that stuff is pretty new)....

It takes care of the issues that otherwise arise from the fact that some characters can be represented by multiple code points, so we wouldn't have to explicitly handle those contingencies ourselves.
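A minimal sketch of what that could look like, assuming plain `string.Normalize` and `string.EnumerateRunes` from the BCL are acceptable here (`NormalizeSketch` and its methods are hypothetical names, not anything in the library):

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical illustration only; none of these names exist in the library.
public static class NormalizeSketch
{
    // Counts runes in the string exactly as given.
    public static int CountRunes(string text) =>
        text.EnumerateRunes().Count();

    // Composes to NFC first, so a base letter plus a combining mark collapses
    // into a single precomposed code point wherever Unicode defines one.
    public static int CountRunesNormalized(string text) =>
        text.Normalize(NormalizationForm.FormC).EnumerateRunes().Count();
}
```

For the input `"e\u0301"` (an `e` followed by a combining acute accent), `CountRunes` sees two runes, while `CountRunesNormalized` sees one, because NFC composes the pair into `\u00E9`. Not every combining sequence has a precomposed form, so this would reduce the special cases rather than eliminate all of them.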

dodexahedron (Collaborator, Author) · 2024-01-27T18:01:56Z

Also...

Yes, this is closely related to #3214, but that one is kind of a master issue, in lieu of a project.

This is more targeted at a specific area.
