Correction of incorrect offsets to apply corrections when there are characters encoded on 4 bytes in the text to be corrected #94

Merged
jxmorris12 merged 1 commit into jxmorris12:master from patch-4-bytes-encoded on Aug 22, 2024

Conversation

mdevolde
Contributor

While using the library for a project, I noticed strange behaviour when checking a text for errors and applying the suggestions given by LanguageTool, in a text that included characters encoded on 4 bytes.

Take this code for example:

import language_tool_python

def patch_text(text):
    with language_tool_python.LanguageTool('en-US') as tool:
        errors = tool.check(text)
    patched_text = language_tool_python.utils.correct(text, errors)
    return patched_text


if __name__ == '__main__':
    text = """
The sun was seting 🌅, casting a warm glow over the park. Birds chirpped softly 🐦 as the day slowly fade into night.
    """
    print(patch_text(text))

At present, in v2.8, the result is as follows:
The sun was setting 🌅, casting a warm glow over the park. Birds cchippedsoftly 🐦 as the day slowly fade into night.
Why does it produce this result? The two emojis in the sentence are each encoded on 4 bytes in UTF-8, and it seems that LanguageTool, when calculating offsets, counts each such character as 2 characters rather than 1 (which would be consistent with counting UTF-16 code units).
As a result, every offset after the first emoji is shifted by 1, so the second correction (chirpped -> chipped) was applied one character too far to the right.
The first correction (seting -> setting) was applied correctly because it comes before any character encoded on 4 bytes, so its offset was unaffected.
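
To make the shift concrete, here is a small sketch that compares Python's own character count with a UTF-16 code-unit count. The helper `utf16_length` is a hypothetical name introduced only for this illustration, and the assumption that the reported offsets follow UTF-16 code units is mine, based on the behaviour described above.

```python
# Shortened version of the example sentence, with one emoji before "chirpped".
text = "The sun was seting 🌅, casting a warm glow. Birds chirpped softly 🐦."


def utf16_length(s):
    """Count UTF-16 code units (hypothetical helper for this illustration)."""
    # Each UTF-16 code unit is 2 bytes; "utf-16-le" avoids adding a byte-order mark.
    return len(s.encode("utf-16-le")) // 2


emoji = "🌅"
emoji_code_points = len(emoji)                  # 1: Python counts code points
emoji_utf8_bytes = len(emoji.encode("utf-8"))   # 4: the "4 bytes" mentioned above
emoji_utf16_units = utf16_length(emoji)         # 2: a surrogate pair in UTF-16

# Offset of the misspelled word as Python sees it, and as a UTF-16-based
# counter would see it (assumption: this is how the reported offset is computed).
python_offset = text.index("chirpped")
utf16_offset = utf16_length(text[:python_offset])
shift = utf16_offset - python_offset            # 1: one emoji precedes the word

print(emoji_code_points, emoji_utf8_bytes, emoji_utf16_units, shift)
```

Each 4-byte character that precedes a match adds one to the gap between the two offsets, which is why only the corrections after the first emoji were misplaced.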

With my update, here's the result:
The sun was setting 🌅, casting a warm glow over the park. Birds chipped softly 🐦 as the day slowly fade into night.

I added a function that finds the positions of all characters encoded on 4 bytes, and the correction function now uses its result to adjust the offsets.
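
The idea can be sketched roughly as follows. This is not the code from the commit: the function names `four_byte_positions_utf16`, `to_python_offset` and `apply_replacement` are hypothetical, and the sketch assumes that reported offsets and lengths are expressed in UTF-16 code units.

```python
from bisect import bisect_left


def four_byte_positions_utf16(text):
    """Return the UTF-16 offset of every character that needs 4 bytes in UTF-8."""
    positions = []
    utf16_index = 0
    for ch in text:
        if ord(ch) > 0xFFFF:      # outside the BMP: 4 bytes in UTF-8, 2 UTF-16 units
            positions.append(utf16_index)
            utf16_index += 2
        else:
            utf16_index += 1
    return positions


def to_python_offset(utf16_offset, positions):
    """Convert a UTF-16 offset to a Python string index."""
    # Subtract one for every 4-byte character that starts before the offset.
    return utf16_offset - bisect_left(positions, utf16_offset)


def apply_replacement(text, utf16_offset, utf16_length, replacement, positions):
    """Apply a single replacement whose span is given in UTF-16 code units."""
    start = to_python_offset(utf16_offset, positions)
    end = to_python_offset(utf16_offset + utf16_length, positions)
    return text[:start] + replacement + text[end:]


text = "Sunset 🌅, birds chirpped softly."
positions = four_byte_positions_utf16(text)

# Simulate the offset a UTF-16-based checker would report (assumption).
reported_offset = len(text[: text.index("chirpped")].encode("utf-16-le")) // 2
length = len("chirpped")

# Using the reported offset directly reproduces the misplaced correction.
naive = text[:reported_offset] + "chipped" + text[reported_offset + length:]
fixed = apply_replacement(text, reported_offset, length, "chipped", positions)

print(naive)
print(fixed)
```

Converting both the start and the end of the span, rather than the start plus an unchanged length, also covers the case where the flagged span itself contains a 4-byte character.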

@jxmorris12
Owner

Thank you!

jxmorris12 merged commit 75fbc2c into jxmorris12:master on Aug 22, 2024
3 checks passed
mdevolde deleted the patch-4-bytes-encoded branch on August 22, 2024 16:26
@mdevolde
Contributor Author

@jxmorris12 In fact, the following issue should also be resolved by the corrections applied here:

Offset position "longer" than text #83
