Wrong mapping with non-matching sentences #9
Comments
Hi, thanks for your interest! In the demo, we only print the aligned word pairs, so most words in the second English sentence are not showing up because our model does not find any corresponding target words for them (which is a good thing). "I" appears twice because the model thinks it is aligned to two words in the target sentence, and "today." is mapped to "(there)." because both words contain "." (remember that the inputs should be tokenized). I did test a few ways to generate alignment scores, but I am not sure whether the scores make sense or are well calibrated, so this is still worth investigating. The demo provides a (poor) visualization of the mappings, and I will try to make it nicer :)
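As an illustration of the tokenization point above, here is a minimal sketch that separates punctuation from words before alignment. The helper name `naive_tokenize` and the regex are assumptions made for this example, not part of awesome-align; a proper language-specific tokenizer is likely a better choice for real data.

```python
import re


def naive_tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    Hypothetical helper for illustration only: runs of word characters
    stay together, and every other non-space character becomes its own token.
    """
    return re.findall(r"\w+|[^\w\s]", text)


src = "I will meet you there. It is a very cool weather today."
# Join with spaces to get whitespace-separated tokens.
print(" ".join(naive_tokenize(src)))
```

With this preprocessing, "today." becomes two tokens, "today" and ".", so the period can be aligned on its own instead of pulling the whole word along.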
I have a similar question: what if a src token is not aligned to any target token (or a tgt token is not aligned to any src token)? In that case, how should we preprocess the gold alignment, and will such a token be printed in the hypothesis? How will the AER be calculated?
Hi @jinyiyang-jhu, the reference/outputs only contain aligned word pairs. If the i-th source word is not aligned to any target word, there will be no i-* pair in the reference/outputs.
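To make the AER question concrete, the sketch below computes the standard alignment error rate, AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), from whitespace-separated `i-j` pairs. The function names are hypothetical, and the code assumes every gold link is a sure link unless a separate set of possible links is passed in; it is not the project's own evaluation script.

```python
def parse_pairs(line):
    """Parse a line such as '0-0 1-1 3-2' into a set of (src, tgt) index pairs."""
    return {tuple(int(x) for x in pair.split("-")) for pair in line.split()}


def aer(hyp, sure, possible=None):
    """Alignment error rate for one sentence pair.

    hyp, sure, possible are sets of (src, tgt) pairs. Sure links are
    always counted as possible links as well.
    """
    possible = sure | (possible or set())
    if not hyp and not sure:
        return 0.0  # nothing to align and nothing predicted
    matched = len(hyp & sure) + len(hyp & possible)
    return 1.0 - matched / (len(hyp) + len(sure))


gold = parse_pairs("0-0 1-1 2-2")
hyp = parse_pairs("0-0 1-1 3-2")
print(round(aer(hyp, gold), 4))
```

An unaligned word needs no special handling here: it simply contributes no pair to either set, so it never enters the counts.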
I encapsulated the alignment score calculation in a separate method that uses a simple harmonic mean of the aligned-token rates on both sides. Comments on the implementation are welcome. @mzeidhassan
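One way to read the harmonic-mean idea is sketched below. This is not @juncaofish's implementation, which is not shown in this thread; the function name `alignment_score` and the toy inputs are assumptions for illustration. The score is the harmonic mean of the fraction of source tokens and the fraction of target tokens that take part in at least one alignment pair.

```python
def alignment_score(pairs, src_len, tgt_len):
    """Harmonic mean of the aligned-token rates on the source and target sides.

    pairs is an iterable of (src_index, tgt_index) tuples; src_len and
    tgt_len are the token counts of the two sentences.
    """
    if src_len == 0 or tgt_len == 0:
        return 0.0
    pairs = list(pairs)
    src_rate = len({i for i, _ in pairs}) / src_len
    tgt_rate = len({j for _, j in pairs}) / tgt_len
    if src_rate + tgt_rate == 0:
        return 0.0  # no token is aligned on either side
    return 2 * src_rate * tgt_rate / (src_rate + tgt_rate)


# Toy example: half of the source tokens and all target tokens are aligned.
score = alignment_score({(0, 0), (1, 1), (2, 2)}, src_len=6, tgt_len=3)
print(round(score, 4))
```

A sentence pair whose source contains extra, untranslated material gets a low source-side rate and therefore a lower score, which could serve as a filtering threshold. Whether such scores are well calibrated would still need to be checked on real data.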
Thanks a million, @juncaofish! I will give it a try when I have a chance.
Hi awesome-align team,
First, thanks for the great tool. It has really great potential.
I am following your Colab demo, and I tried to align English to Arabic.
Here are the 2 sentences:
src = 'I will meet you there. It is a very cool weather today.'
tgt = 'سوف أقابلك هناك.'
The Arabic sentence matches the first English sentence in src, i.e. "I will meet you there".
The second sentence in src "It is a very cool weather today." doesn't exist in Arabic.
When I run the code, I get a very strange result, and I am not sure where the culprit is.
This is what I get:
For some reason, most of the second English sentence is not showing up, and there are wrong mappings: for example, "(today)" is wrongly mapped to "(there)".
If I remove the second sentence in src, the result looks really good.
I want to use Awesome-Align to detect non-matching strings in a bilingual dataset, so I can exclude the wrong and non-aligned sentences.
Is there a way to add alignment scores, so that it is easy to filter out badly aligned sentences?
Also, is there a way to visualize the mapping? Something similar to SimAlign mapping.
After all, it could be that Awesome-Align is not designed for my purpose, but I hope you consider this idea in a future release.
Thanks in advance for your support, and thanks for the awesome tool :-)