normalize_extracts: inserting spaces around embedded dashes is not always appropriate #650

mmartin9684-sil · 2025-02-10T18:26:43Z

It's not always appropriate to normalize a word with embedded punctuation by inserting spaces before and after the embedded punctuation character.

A couple of counterexamples from target sentences in recent XRI datasets:

Original sentence: Niri poki pule kua-kua dabe edieng wao pihak ruha ihi partai nbe tenama tule hu'a gu mege wai.
Normalized sentence: Niri poki pule kua - kua dabe edieng wao pihak ruha ihi partai nbe tenama tule hu'a gu mege wai.

Original sentence: Ge a bi? nulu-waleng nu tenama dia wai dabe soro hulu mata nbe
Normalized sentence: Ge a bi? nulu - waleng nu tenama dia wai dabe soro hulu mata nbe

Normalizing the word 'kua-kua' to 'kua - kua', or the word 'nulu-waleng' to 'nulu - waleng', is not correct.
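One possible rule, sketched below, is to pad a dash with spaces only when it is not flanked by word characters on both sides. This is not the current normalize_extracts code; the function name `pad_free_dashes` and the regex are assumptions made for illustration.

```python
import re

# Hypothetical rule: a dash is "free" unless it has a word character
# immediately before it AND immediately after it (as in 'kua-kua').
_FREE_DASH = re.compile(r"(?<!\w)-|-(?!\w)")


def pad_free_dashes(sentence: str) -> str:
    """Put spaces around free dashes; leave embedded hyphens untouched."""
    padded = _FREE_DASH.sub(" - ", sentence)
    # Padding can create runs of spaces, so collapse them afterwards.
    return re.sub(r" {2,}", " ", padded).strip()


original = "Ge a bi? nulu-waleng nu tenama dia wai dabe soro hulu mata nbe"
print(pad_free_dashes(original) == original)
```

With this rule both example sentences above pass through unchanged, while a dash that already has a space on either side is still padded.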


mmartin9684-sil · 2025-02-10T18:28:54Z

A valuable related improvement would be to remove extra spacing in these hyphenated words. For example:

Original sentence: Ape nbe ta hodi wao -wao hare tan ga ea nera gahu, peke hia ula meha.
Normalized sentence: Ape nbe ta hodi wao - wao hare tan ga ea nera gahu, peke hia ula meha.

Rather than normalizing 'wao -wao' to 'wao - wao', it would be valuable to be able to normalize the word to 'wao-wao'.
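A rough sketch of that repair, assuming a dash with whitespace on exactly one side and word characters on both outer sides is a mis-spaced hyphen. The function name `join_split_hyphens` is hypothetical, and the heuristic may well be too aggressive for text that uses a dash as sentence punctuation.

```python
import re

# Hypothetical heuristic: 'wao -wao' or 'wao- wao' is treated as a
# hyphenated word with a stray space; 'a - b' (spaces on both sides) is not.
_SPLIT_HYPHEN = re.compile(r"(?<=\w)(?:\s+-|-\s+)(?=\w)")


def join_split_hyphens(sentence: str) -> str:
    """Remove the stray whitespace next to a one-sided hyphen."""
    return _SPLIT_HYPHEN.sub("-", sentence)


print(join_split_hyphens("Ape nbe ta hodi wao -wao hare tan ga ea nera gahu, peke hia ula meha."))
```

Such a step would need to run before any dash padding, otherwise 'wao -wao' has already become 'wao - wao' and the one-sided spacing it relies on is gone.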

mmartin9684-sil added the invalid, pipeline 2: extract, and pipeline 3: preprocess labels Feb 10, 2025

mmartin9684-sil assigned rminsil Feb 10, 2025

mmartin9684-sil changed the title ~~normalize_extracts: inserting spaces around embedded punctuation is not always appropriate~~ normalize_extracts: inserting spaces around embedded dashes is not always appropriate Feb 10, 2025
