Toolbox data can get misaligned (where tokens don't line up vertically in a column) for many reasons. Some corpora do well with the `ratio` method of realignment that groups tokens with the nearest token in the previous line, while other corpora do well with the `reanalyze` method that uses morpheme delimiters (`-`, `=`, etc.) to find groupings. But some corpora will need some combination of both, and perhaps other criteria, to obtain the best alignment.
A cost function for misalignments would help with creating a method to find the optimum alignment. This cost function might consider, for each token:
- the distance to the nearest column boundary
- whether it and the possible alignments share morpheme delimiters
- edit distance or relative length with possible alignments
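
As a rough illustration of the per-token criteria, here is a minimal sketch. The function name `token_cost`, its parameters, and the weights are hypothetical and not part of any existing code. It uses relative length rather than edit distance, and it approximates "share morpheme delimiters" by comparing delimiter counts.

```python
# Hypothetical sketch of a per-token cost; names and weights are assumptions.
DELIMS = "-="  # morpheme delimiters mentioned above; extend as needed


def delimiter_count(token):
    """Count morpheme delimiters in a token."""
    return sum(token.count(d) for d in DELIMS)


def token_cost(token, offset, candidate, boundaries,
               w_dist=1.0, w_delim=1.0, w_len=1.0):
    """Cost of aligning `token` (starting at character `offset`) with `candidate`.

    `boundaries` is a non-empty list of column start offsets from the
    previous line.
    """
    # Distance (in characters) to the nearest column boundary.
    dist = min(abs(offset - b) for b in boundaries)
    # Difference in the number of morpheme delimiters.
    delim = abs(delimiter_count(token) - delimiter_count(candidate))
    # Relative length difference, scaled to the range [0, 1].
    longest = max(len(token), len(candidate), 1)
    rel_len = abs(len(token) - len(candidate)) / longest
    return w_dist * dist + w_delim * delim + w_len * rel_len
```

For example, `token_cost("dog-s", 4, "perro-s", [0, 4, 12])` is low because the token sits on a boundary and has the same number of delimiters as the candidate, while `token_cost("dogs", 5, "perro-s", [0, 4, 12])` is penalized on all three criteria.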
There might be a separate cost function for the line as a whole, with criteria like:
- percentage of unaligned tokens
- deviation from uniform distribution of alignments
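
The line-level criteria could be sketched along these lines. This assumes an alignment is represented as a list giving, for each token, the index of the column it is aligned to, or `None` if unaligned. That representation, the name `line_cost`, and the weights are all hypothetical.

```python
# Hypothetical sketch of a line-level cost; the representation is an assumption.
from collections import Counter


def line_cost(alignment, n_columns, w_unaligned=1.0, w_spread=1.0):
    """Cost of a whole line's alignment.

    `alignment[i]` is the column index for token i, or None if unaligned.
    """
    if not alignment or n_columns <= 0:
        return 0.0
    aligned = [c for c in alignment if c is not None]
    # Fraction of tokens left without a column.
    unaligned = 1 - len(aligned) / len(alignment)
    # Mean absolute deviation of per-column counts from a uniform spread.
    counts = Counter(aligned)
    expected = len(aligned) / n_columns
    spread = sum(abs(counts.get(c, 0) - expected)
                 for c in range(n_columns)) / n_columns
    return w_unaligned * unaligned + w_spread * spread
```

With three columns, `line_cost([0, 1, 2], 3)` is zero (every token aligned, one per column), whereas `line_cost([0, 0, None], 3)` is penalized both for the unaligned token and for crowding column 0.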
These cost functions would be used in an algorithm that tries to spread out the tokens to find the lowest cost.
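
One naive way to search for the lowest cost is to enumerate every order-preserving assignment of tokens to columns and keep the cheapest. The sketch below assumes a caller-supplied `cost(token, column)` function. The name `best_alignment` is hypothetical, and it only sums per-token costs, so a line-level cost would have to be added separately.

```python
# Hypothetical brute-force search; a sketch, not a proposed implementation.
from itertools import combinations_with_replacement


def best_alignment(tokens, columns, cost):
    """Return (assignment, total_cost) minimizing the summed per-token cost.

    `assignment[i]` is the index in `columns` chosen for `tokens[i]`.
    Assignments are non-decreasing, so token order is preserved, and
    several tokens may share one column.
    """
    best, best_cost = None, float("inf")
    for assignment in combinations_with_replacement(range(len(columns)),
                                                    len(tokens)):
        total = sum(cost(t, columns[c]) for t, c in zip(tokens, assignment))
        if total < best_cost:
            best, best_cost = assignment, total
    return best, best_cost
```

For instance, with a toy cost of `lambda t, c: abs(len(t) - len(c))`, the tokens `["a", "bbb"]` against columns `["x", "yyy", "zz"]` come out assigned to columns 0 and 1. The number of candidate assignments grows quickly with line length, so a dynamic-programming or greedy search would likely be needed for real data.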