Correction of incorrect offsets to apply corrections when there are characters encoded on 4 bytes in the text to be corrected #94

mdevolde · 2024-08-22T13:55:10Z

While using the library for a project, I noticed a strange behaviour when I was looking for errors and applying the suggestions given by LanguageTool in a text that included characters encoded on 4 bytes.

Take this code for example:

import language_tool_python

def patch_text(text):
    with language_tool_python.LanguageTool('en-US') as tool:
        errors = tool.check(text)
    patched_text = language_tool_python.utils.correct(text, errors)
    return patched_text


if __name__ == '__main__':
    text = """
The sun was seting 🌅, casting a warm glow over the park. Birds chirpped softly 🐦 as the day slowly fade into night.
    """
    print(patch_text(text))

At present, in v2.8, the result is as follows:
The sun was setting 🌅, casting a warm glow over the park. Birds cchippedsoftly 🐦 as the day slowly fade into night.
Why does it produce this result? The two emojis in the sentence are each encoded on 4 bytes in UTF-8, and LanguageTool appears to count each such character as 2 characters rather than 1 when calculating offsets. This is consistent with offsets measured in UTF-16 code units, where these characters take a surrogate pair, whereas Python string indices count each of them as a single character.
As a result, every offset after the first emoji is shifted by 1, so the second correction (chirpped -> chipped) was applied one character too far to the right.
The first correction (seting -> setting) was applied correctly because it comes before any character encoded on 4 bytes, so its offset was not shifted.
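The mismatch can be reproduced without LanguageTool. The snippet below is only an illustration: `utf16_units` is a hypothetical helper (not part of language_tool_python), and it assumes the reported offsets behave like UTF-16 code unit counts, as the observed shift suggests.

```python
# Illustration only: compares Python string indices with UTF-16 code unit counts.
# `utf16_units` is a hypothetical helper, not part of language_tool_python.

def utf16_units(s: str) -> int:
    """Number of UTF-16 code units needed to encode s."""
    return len(s.encode("utf-16-le")) // 2

sunrise = "\U0001F305"  # the sunrise emoji used in the example sentence

# Prefix of the example sentence, up to and including the second misspelled word.
prefix = "The sun was seting " + sunrise + ", casting a warm glow over the park. Birds chirpped"

python_index = prefix.index("chirpped")             # index as Python counts it
utf16_offset = utf16_units(prefix[:python_index])   # offset counted in UTF-16 units

print(len(sunrise), len(sunrise.encode("utf-8")), utf16_units(sunrise))
print(utf16_offset - python_index)  # shift caused by the one emoji before the word
```

The difference between the two offsets equals the number of 4-byte characters that precede the word, which is why only corrections after the first emoji are misplaced.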

With my update, here's the result:
The sun was setting 🌅, casting a warm glow over the park. Birds chipped softly 🐦 as the day slowly fade into night.

I added a function that finds the positions of all the characters encoded on 4 bytes, and the correction function now uses its result to adjust the offsets.
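For readers who want to see the idea in isolation, here is a minimal sketch of that approach. It is not the code from this pull request: the function names and the `(offset, length, replacement)` tuples are hypothetical stand-ins for the library's match objects, and the offsets are assumed to count each 4-byte character as 2.

```python
# Minimal sketch of the offset adjustment; all names here are hypothetical
# and this is not the implementation merged in the pull request.

def four_byte_positions(text: str) -> list[int]:
    """Python indices of characters that need 4 bytes in UTF-8."""
    return [i for i, ch in enumerate(text) if ord(ch) > 0xFFFF]

def to_python_offset(offset: int, positions: list[int]) -> int:
    """Convert an offset that counts 4-byte characters twice into a Python index."""
    shift = 0
    for pos in positions:
        # pos + shift is where this character starts in the double-counting scheme
        if pos + shift < offset:
            shift += 1
        else:
            break
    return offset - shift

def apply_corrections(text: str, matches: list[tuple[int, int, str]]) -> str:
    """Apply (offset, length, replacement) corrections, rightmost first."""
    positions = four_byte_positions(text)
    result = text
    # Working from right to left keeps the earlier offsets valid.
    for offset, length, replacement in sorted(matches, reverse=True):
        start = to_python_offset(offset, positions)
        end = to_python_offset(offset + length, positions)
        result = result[:start] + replacement + result[end:]
    return result

text = ("The sun was seting \U0001F305, casting a warm glow over the park. "
        "Birds chirpped softly \U0001F426 as the day slowly fade into night.")

# Hand-written matches: the second offset is one larger than the Python index
# because one emoji precedes it.
matches = [(12, 6, "setting"), (text.index("chirpped") + 1, 8, "chipped")]
print(apply_corrections(text, matches))
```

Converting both the start and the end of each span, rather than only the start, also covers the case where the flagged span itself contains a 4-byte character.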


jxmorris12 · 2024-08-22T16:21:30Z

Thank you!

mdevolde · 2024-08-22T18:10:46Z

@jxmorris12 In fact, the following issue should also be resolved by these corrections:

Offset position "longer" than text #83

Correction of incorrect offsets to apply corrections when there are c…

a1fdbc1

…haracters encoded on 4 bytes in the text to be corrected

jxmorris12 approved these changes Aug 22, 2024


jxmorris12 merged commit 75fbc2c into jxmorris12:master Aug 22, 2024
3 checks passed

mdevolde deleted the patch-4-bytes-encoded branch August 22, 2024 16:26

jxmorris12 mentioned this pull request Aug 22, 2024

Offset position "longer" than text #83

Closed
