normalize_extracts: inserting spaces around embedded dashes is not always appropriate #650
Labels
invalid (This doesn't seem right)
pipeline 2: extract (Issue related to extracting parallel corpora)
pipeline 3: preprocess (Issue related to preprocessing)
Normalizing a word that contains embedded punctuation by inserting spaces before and after the punctuation character is not always appropriate.
A couple of counterexamples from target sentences in recent XRI datasets:
Original sentence:
Niri poki pule kua-kua dabe edieng wao pihak ruha ihi partai nbe tenama tule hu'a gu mege wai.
Normalized sentence:
Niri poki pule kua - kua dabe edieng wao pihak ruha ihi partai nbe tenama tule hu'a gu mege wai.
Original sentence:
Ge a bi? nulu-waleng nu tenama dia wai dabe soro hulu mata nbe
Normalized sentence:
Ge a bi? nulu - waleng nu tenama dia wai dabe soro hulu mata nbe
Normalizing the word 'kua-kua' to 'kua - kua', or the word 'nulu-waleng' to 'nulu - waleng', is not correct: in both cases the hyphen is part of the word.
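
One possible direction, as a rough sketch only: leave a hyphen untouched when it sits directly between two alphanumeric characters, and pad it with spaces otherwise. The function name `normalize_hyphens` is hypothetical and this is not the current `normalize_extracts` code. Whether free-standing hyphens should still be padded is also an assumption here.

```python
import re


def normalize_hyphens(sentence: str) -> str:
    """Pad hyphens with spaces unless they are embedded inside a word.

    Hypothetical helper for illustration; not the actual normalize_extracts logic.
    """

    def repl(match: re.Match) -> str:
        start, end = match.span()
        before = sentence[start - 1] if start > 0 else ""
        after = sentence[end] if end < len(sentence) else ""
        # A bare "-" with an alphanumeric character on each side is treated
        # as part of the word (e.g. "kua-kua") and is left as-is.
        if match.group() == "-" and before.isalnum() and after.isalnum():
            return "-"
        # Assumption: any other hyphen is a separator and gets padded.
        return " - "

    return re.sub(r"\s*-\s*", repl, sentence).strip()


if __name__ == "__main__":
    print(normalize_hyphens("Niri poki pule kua-kua dabe edieng wao"))
    print(normalize_hyphens("Ge a bi? nulu-waleng nu tenama dia wai"))
```

With this rule, both sentences above come back unchanged, while an input such as "foo -bar" would still be normalized to "foo - bar".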