normalize_extracts: inserting spaces around embedded dashes is not always appropriate #650

Open
mmartin9684-sil opened this issue Feb 10, 2025 · 1 comment
Assignees
Labels
invalid (This doesn't seem right); pipeline 2: extract (Issue related to extracting parallel corpora); pipeline 3: preprocess (Issue related to preprocessing)

Comments

@mmartin9684-sil
Collaborator

It's not always appropriate to normalize a word with embedded punctuation by inserting spaces before and after the embedded punctuation character.

A couple of counterexamples from target sentences in recent XRI datasets:

Original sentence: Niri poki pule kua-kua dabe edieng wao pihak ruha ihi partai nbe tenama tule hu'a gu mege wai.
Normalized sentence: Niri poki pule kua - kua dabe edieng wao pihak ruha ihi partai nbe tenama tule hu'a gu mege wai.

Original sentence: Ge a bi? nulu-waleng nu tenama dia wai dabe soro hulu mata nbe
Normalized sentence: Ge a bi? nulu - waleng nu tenama dia wai dabe soro hulu mata nbe

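Normalizing the word 'kua-kua' to 'kua - kua', or the word 'nulu-waleng' to 'nulu - waleng', is not correct: the hyphen is part of the word, and padding it with spaces splits one word into three whitespace-separated tokens.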
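
To make the expected behavior concrete, here is a minimal sketch of a dash rule that pads only dashes that are not embedded in a word. It is not the actual normalize_extracts code; the function name `pad_dashes_except_embedded` and the regex are hypothetical, and "embedded" is assumed to mean "has a word character on both sides".

```python
import re

# Hypothetical pattern (an assumption, not taken from normalize_extracts):
# matches a hyphen unless it has a word character on BOTH sides.
_NON_EMBEDDED_DASH = re.compile(r"(?<!\w)-|-(?!\w)")


def pad_dashes_except_embedded(sentence: str) -> str:
    """Pad free-standing dashes with spaces; leave word-internal hyphens alone."""
    padded = _NON_EMBEDDED_DASH.sub(" - ", sentence)
    # Collapse any runs of spaces introduced by the padding.
    return re.sub(r" {2,}", " ", padded).strip()


original = "Ge a bi? nulu-waleng nu tenama dia wai dabe soro hulu mata nbe"
# 'nulu-waleng' is flanked by letters, so the sentence comes back unchanged.
print(pad_dashes_except_embedded(original) == original)
```

With a rule like this, both 'kua-kua' and 'nulu-waleng' would pass through normalization untouched.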

mmartin9684-sil added the labels invalid, pipeline 2: extract, and pipeline 3: preprocess on Feb 10, 2025
@mmartin9684-sil
Collaborator Author

A valuable related improvement would be to remove extra spacing in these hyphenated words. For example:

Original sentence: Ape nbe ta hodi wao -wao hare tan ga ea nera gahu, peke hia ula meha.
Normalized sentence: Ape nbe ta hodi wao - wao hare tan ga ea nera gahu, peke hia ula meha.

Rather than normalizing 'wao -wao' to 'wao - wao', it would be valuable to be able to normalize the word to 'wao-wao'.
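
A rough sketch of that clean-up step, again with hypothetical names (`close_one_sided_hyphen_gaps` is not an existing function). It assumes that a hyphen with spacing on only one side, sitting between two word characters, is a mistyped word-internal hyphen. A dash with spaces on both sides is left alone because it may be a genuine separator.

```python
import re

# Hypothetical pattern: a hyphen between two word characters with spaces on
# exactly one side, e.g. 'wao -wao' or 'wao- wao'. This is a heuristic and
# would also rewrite inputs such as '5 -3', so it needs review on real data.
_ONE_SIDED_GAP = re.compile(r"(?<=\w)(?: +-|- +)(?=\w)")


def close_one_sided_hyphen_gaps(sentence: str) -> str:
    """Rejoin hyphenated words that have stray spacing on one side of the hyphen."""
    return _ONE_SIDED_GAP.sub("-", sentence)


original = "Ape nbe ta hodi wao -wao hare tan ga ea nera gahu, peke hia ula meha."
print(close_one_sided_hyphen_gaps(original))
# Ape nbe ta hodi wao-wao hare tan ga ea nera gahu, peke hia ula meha.
```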

mmartin9684-sil changed the title from "normalize_extracts: inserting spaces around embedded punctuation is not always appropriate" to "normalize_extracts: inserting spaces around embedded dashes is not always appropriate" on Feb 10, 2025