Correction of incorrect offsets to apply corrections when there are characters encoded on 4 bytes in the text to be corrected #94
While using the library for a project, I noticed a strange behaviour when I was looking for errors and applying the suggestions given by LanguageTool in a text that included characters encoded on 4 bytes.
Take this code for example:
At present, in v2.8, the result is as follows:
The sun was setting 🌅, casting a warm glow over the park. Birds cchippedsoftly 🐦 as the day slowly fade into night.
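Why does it produce this result? The two emojis in the sentence are each encoded on 4 bytes in UTF-8, and it seems that LanguageTool, when calculating the offsets, counts each such character as 2 characters instead of 1. This would be consistent with LanguageTool running on Java, whose strings are UTF-16: a character outside the Basic Multilingual Plane takes two UTF-16 code units there, whereas Python counts it as a single character.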
As a result, every offset after the first emoji is shifted by 1, so the second correction (chirpped -> chipped) is applied one character too far to the right.
The first correction (seting -> setting) was applied correctly because it sits before any character encoded on 4 bytes, so its offset was not affected.
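To make the mismatch concrete, here is a small sketch that compares Python's character count with the number of UTF-16 code units for the same string. The sample string and the helper `utf16_offset` are mine, for illustration only; they are not part of the library.

```python
# Illustration only: the sample text and helper name are hypothetical.
text = "Sunset \U0001F305, birds \U0001F426"

# Python indexes strings by code point, so each emoji counts as 1.
code_points = len(text)

# In UTF-16, a character above U+FFFF takes 2 code units (a surrogate pair).
utf16_units = len(text.encode("utf-16-le")) // 2


def utf16_offset(s, index):
    """Return the UTF-16 code-unit offset of the character at Python index `index`."""
    return len(s[:index].encode("utf-16-le")) // 2


print(code_points, utf16_units)
print(text.index("birds"), utf16_offset(text, text.index("birds")))
```

The two counts differ by one for every 4-byte character that precedes the position, which matches the one-character shift seen in the second correction.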
With my update, here's the result:
The sun was setting 🌅, casting a warm glow over the park. Birds chipped softly 🐦 as the day slowly fade into night.
I added a function that finds the positions of all the characters encoded on 4 bytes, and the correction function now uses its result to adjust the offsets before applying each replacement.
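The idea can be sketched roughly as follows. This is not the code from the PR: the function names, the `(offset, length, replacement)` tuple format, and the shortened sample sentence are assumptions made for the example, and it assumes the reported offsets are UTF-16 code-unit offsets.

```python
# Sketch only: names and the edit format are hypothetical, not the library's API.

def four_byte_positions(text):
    """Python indices of characters that need 4 bytes in UTF-8 (above U+FFFF)."""
    return [i for i, ch in enumerate(text) if ord(ch) > 0xFFFF]


def to_python_offset(utf16_offset, positions):
    """Convert a UTF-16 code-unit offset into a Python string index."""
    shift = 0
    for pos in positions:
        # pos + shift is where this 4-byte character starts in UTF-16 units.
        if pos + shift < utf16_offset:
            shift += 1  # it lies before the target, so the target moves left by 1
        else:
            break
    return utf16_offset - shift


def apply_corrections(text, edits):
    """Apply (utf16_offset, utf16_length, replacement) edits to `text`."""
    positions = four_byte_positions(text)
    # Work from right to left so earlier offsets stay valid after each edit.
    for offset, length, replacement in sorted(edits, reverse=True):
        start = to_python_offset(offset, positions)
        end = to_python_offset(offset + length, positions)
        text = text[:start] + replacement + text[end:]
    return text


sample = "The sun was seting \U0001F305. Birds chirpped softly \U0001F426."
# Offsets as a UTF-16 based checker would report them (assumed for this example).
edits = [(12, 6, "setting"), (29, 8, "chipped")]
corrected = apply_corrections(sample, edits)
print(corrected)
```

Converting both the start and the end of each match, rather than only the start, also covers the case where the matched span itself contains a 4-byte character.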