Units of alignment in Python vs. Toolbox #7

Open
chiarcos opened this issue Dec 20, 2022 · 4 comments

Comments

@chiarcos

It seems that the alignment operates over different units than the visual alignment in Toolbox. This is a systematic problem in large swaths of our data. It may be due to variable-width fonts or to combining characters in UTF-8 (which are encoded as separate code points and bytes, but are not displayed as separate characters). Sample data is at https://github.com/acoli-repo/toolbox_py/blob/master/example/sliekkas_sample.txt (\id DzP_1503 \ref 06_Pater_noster_6).
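
To make the mismatch concrete, here is a minimal sketch of how one token can have three different lengths depending on the unit counted. The token is made up for illustration and is not taken from the sample file, and counting non-combining code points is only a rough stand-in for the columns a reader sees.

```python
import unicodedata

# Hypothetical token: "e" + COMBINING DOT ABOVE (U+0307) + "s".
token = "e\u0307s"

# Code points: what Python's str indexing and len() count.
code_points = len(token)

# UTF-8 bytes: what a byte-oriented tool would count.
utf8_bytes = len(token.encode("utf-8"))

# Approximate displayed columns: ignore combining marks.
columns = sum(1 for ch in token if unicodedata.combining(ch) == 0)

print(code_points, utf8_bytes, columns)  # prints: 3 4 2
```

If the padding between columns was computed in one of these units and the parser counts in another, every token after the first combining mark on a line is shifted.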

@goodmami
Owner

Hi, sorry I don't maintain this code anymore but there are a couple of things that might help.

I recall that we had switched from using string positions to using byte positions, which seemed to more closely reproduce what SIL's Toolbox was doing. Unfortunately that change wasn't made in this repository but as a preprocessing step in Xigt. See this comment on a PR that introduced the change for a description of the fix: xigt/xigt#48 (comment); and this relevant commit: xigt/xigt@90c87dc
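
As a rough illustration of why byte positions can line up where string positions do not, the sketch below measures token start offsets in both units. It is not the code from the Xigt commit: the helper `token_starts` and the two sample lines are hypothetical, and it assumes the gloss line was padded to the byte offsets of the text line.

```python
import re

def token_starts(line, unit="bytes"):
    """Return the start offset of each whitespace-delimited token.

    Offsets are counted in UTF-8 bytes, or in code points if unit="chars".
    """
    starts = []
    for match in re.finditer(r"\S+", line):
        prefix = line[:match.start()]
        if unit == "bytes":
            starts.append(len(prefix.encode("utf-8")))
        else:
            starts.append(len(prefix))
    return starts

# Hypothetical interlinear pair; "\u017e" (ž) takes two bytes in UTF-8.
text = "\u017eodis namas"
gloss = "word   house"

bytes_match = token_starts(text, "bytes") == token_starts(gloss, "bytes")
chars_match = token_starts(text, "chars") == token_starts(gloss, "chars")
print(bytes_match, chars_match)  # prints: True False
```

Under these assumptions the second token starts at offset 7 in both lines when counting bytes, but at 6 versus 7 when counting code points, which is the kind of drift described in this issue.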

If such a solution does not work, this toolbox module has some code to try and recover alignments via the errors parameter of the align_fields() function. See https://github.com/goodmami/toolbox/blob/master/tests.md#aligning-interlinear-columns-with-align_fields for a description and examples.

@chiarcos
Author

Thank you, the errors mode works to a large extent. If I manage to implement the byte-based representation, I'll send you a pull request.

@goodmami
Owner

Thanks, @chiarcos, I'm glad that the errors solution works for you. I'd be happy to have a PR implementing the bytes representation, but I can't guarantee that I'd review and merge it as I no longer maintain this repo. It might make sense for you to maintain your own fork if you use the code regularly?

@chiarcos
Author

chiarcos commented Jan 24, 2023 via email
