Min-cost realigner #1

Open
goodmami opened this issue Nov 23, 2015 · 0 comments

Toolbox data can become misaligned, meaning tokens no longer line up vertically in columns, for many reasons. Some corpora do well with the ratio method of realignment, which groups each token with the nearest token in the previous line; others do well with the reanalyze method, which uses morpheme delimiters (-, =, etc.) to find groupings. Still others will need a combination of both, and perhaps other criteria, to obtain the best alignment.

A cost function for misalignments would help with creating a method to find the optimum alignment. This cost function might consider, for each token:

  • the distance to the nearest column boundary
  • whether it and the possible alignments share morpheme delimiters
  • edit distance or relative length with possible alignments
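
As a rough illustration of how the per-token criteria above might combine, here is a minimal sketch. The function name `token_cost`, the `weights` parameter, and the `DELIMS` constant are all hypothetical and not part of any existing code; relative length is used in place of edit distance, and the weights are untuned placeholders.

```python
DELIMS = "-="  # assumed set of morpheme delimiters


def token_cost(token, start, candidate, boundaries, weights=(1.0, 1.0, 1.0)):
    """Hypothetical cost of aligning `token` with `candidate`.

    token      -- the token text on the line being realigned
    start      -- character offset where the token begins
    candidate  -- text of the token it might be aligned with
    boundaries -- non-empty list of column-boundary offsets
    """
    w_dist, w_delim, w_len = weights

    # Distance to the nearest column boundary.
    distance = min(abs(start - b) for b in boundaries)

    # Mismatch in the number of each morpheme delimiter.
    delim_mismatch = sum(
        abs(token.count(d) - candidate.count(d)) for d in DELIMS
    )

    # Relative length difference, scaled to the range 0..1.
    longest = max(len(token), len(candidate), 1)
    length_diff = abs(len(token) - len(candidate)) / longest

    return w_dist * distance + w_delim * delim_mismatch + w_len * length_diff
```

A token that sits on a boundary and shares delimiters with its candidate, such as `token_cost("dog-s", 0, "dog-PL", [0, 7])`, scores lower than one that is offset and lacks the delimiter, such as `token_cost("dogs", 3, "dog-PL", [0, 7])`.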

There might be a separate cost function for the line as a whole, with criteria like:

  • percentage of unaligned tokens
  • deviation from uniform distribution of alignments
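
The line-level criteria could be sketched in a similar way. This assumes an alignment is represented as a list with one entry per token, holding either a column index or `None` for an unaligned token; that representation, the name `line_cost`, and the use of mean absolute deviation as the "deviation from uniform" measure are assumptions for illustration only.

```python
from collections import Counter


def line_cost(assignment, n_columns, weights=(1.0, 1.0)):
    """Hypothetical cost for a whole line.

    assignment -- one entry per token: a column index, or None if unaligned
    n_columns  -- number of columns available on the reference line
    """
    w_unaligned, w_spread = weights
    if not assignment or n_columns == 0:
        return 0.0

    aligned = [c for c in assignment if c is not None]

    # Fraction of tokens that found no column.
    unaligned_ratio = 1 - len(aligned) / len(assignment)

    # Mean absolute deviation of tokens-per-column from a uniform spread.
    counts = Counter(aligned)
    mean = len(aligned) / n_columns
    spread = sum(abs(counts.get(c, 0) - mean) for c in range(n_columns)) / n_columns

    return w_unaligned * unaligned_ratio + w_spread * spread
```

With this measure, one token per column costs nothing, while piling every token into a single column or leaving tokens unaligned raises the cost.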

These cost functions would be used in an algorithm that tries to spread out the tokens to find the lowest cost.
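
One possible shape for such an algorithm, offered only as a sketch, is a dynamic program that assigns each token to a column while keeping column indices non-decreasing from left to right. The monotonicity constraint, the function name `min_cost_alignment`, and the default cost (plain distance to the boundary) are assumptions; a richer per-token cost could be passed in through the `cost` parameter.

```python
def min_cost_alignment(token_starts, boundaries, cost=lambda s, b: abs(s - b)):
    """Assign each token to a column so that column indices never decrease.

    token_starts -- character offsets of the tokens, left to right
    boundaries   -- character offsets of the column boundaries, left to right
    cost         -- cost(token_start, boundary) for a single assignment
    Returns (total_cost, list_of_column_indices).
    """
    n, m = len(token_starts), len(boundaries)
    if n == 0 or m == 0:
        return 0.0, []

    inf = float("inf")
    # best[i][j]: lowest total cost for tokens 0..i with token i in column j
    best = [[inf] * m for _ in range(n)]
    back = [[0] * m for _ in range(n)]

    for j in range(m):
        best[0][j] = cost(token_starts[0], boundaries[j])

    for i in range(1, n):
        # Running minimum of the previous row over columns <= j.
        run_min, run_arg = inf, 0
        for j in range(m):
            if best[i - 1][j] < run_min:
                run_min, run_arg = best[i - 1][j], j
            best[i][j] = run_min + cost(token_starts[i], boundaries[j])
            back[i][j] = run_arg

    # Trace back from the cheapest column for the last token.
    j = min(range(m), key=lambda k: best[n - 1][k])
    total = best[n - 1][j]
    columns = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        columns.append(j)
    columns.reverse()
    return total, columns
```

For example, `min_cost_alignment([0, 2, 12], [0, 12])` groups the first two tokens into the first column and puts the third in the second. This covers only the per-token part of the cost; a line-level cost would need to be layered on top, for instance by scoring several candidate alignments and keeping the cheapest.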
