-
Notifications
You must be signed in to change notification settings - Fork 896
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a constituency model for Icelandic #1389
base: dev
Are you sure you want to change the base?
Conversation
Hey, thanks for sending this. Very helpful. Are you familiar with squashing? If so, would you be up for squashing commit #1 with #2 and #3 with #4? If not, I can do it on my end. In your change comments, you talk about how there is a version which is simplified already. Is it necessary to use that for the parser? If yes, did you use a script to do those simplifications? |
…tuency parser Create convert_icepahc.py Update default_packages.py Added a BERT model for Icelandic Update default_packages.py
Hi! I have now squashed the commits! I tried using the original IcePaHC file for the parser but that didn't seem to work. I also removed some things that shouldn't be a part of the parsing like empty nodes (traces, comments etc.) and some of it was by code and some by hand so there is unfortunately not a script for that, I'm sorry. Also one other thing, you added * to the list of functional tags to cut off since * is used as a separator in the treebank file that is used for the training. I would actually prefer if it is not cut off! I removed the ending of most constituents, like ADJPOC, but I would like to keep some of them that I think would be useful, for example IPMAT. I trained a model where * was cut off and then also when it is not cut off and the difference in scoring is only like 0,25% (from 90,63% to 90,38%). If it would then be possible to change the * to - in the output from the parser, that would be greatly appreciated (or change * to - beforehand and not cutting - off for the Icelandic parser?). |
The tree reading code automatically removes empty nodes. Perhaps it would be possible to recreate the conversion by hand. I'm not too opposed to using editing files, but it does look like they update this corpus fairly often: https://github.com/antonkarl/icecorpus which suggests that being able to automatically convert their work to the internal Stanza format would be useful.
I see how that could be useful, in that English also has some tags such as |
Okay! So would I have to do (or rather update) the conversion script in order to make the Icelandic constituency model a part of Stanza or could that come as an update later on? |
In my ideal world, it would show up all at once :) If that doesn't work with you, we can also start from your converted version of the treebank. If you can send us the partially finished script, I can also help as needed, since you're already donating a bunch of work to us. |
Any room for me to help with the conversion script? |
Sorry for the late answer, I haven't been able to work on this for the past weeks. In the conversion script (since the tree reading code automatically removes empty nodes) it is mostly a matter of removing lemmas (so (NS-D löndum-land) becomes (NS-D löndum)) and combining word parts that start/end with $ (in most cases a noun and its determiner), for example should (NS-N tungur$-tunga) (D-N $nar-hinn) become (NS-N tungurnar) and (RP upp$-upp) (VBPI $heldur-halda) should become (VBPI uppheldur). I don't know exactly when I'll be able to continue working on this (hopefully next week) but if you have the time you could for sure help move it along. |
It's a little surprising to me that we would want to combine nodes that
have determiners with their nouns. Then again, I do see in the UD version
that there's a token uppheldur and no situation where the two tokens are
split in pieces.
Either way, these two operations should be pretty straightforward to do
with the Tsurgeon interface
https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/tregex/tsurgeon/Tsurgeon.html
... yuck, it seems I haven't written up how to use tsurgeon anywhere. I'll
do that today and write up a brief example of the lemma removal. The node
combination might be a little trickier, but if you tell me which tag to
choose when combining two tags, that should be possible as well
…On Fri, May 17, 2024 at 1:33 AM Ingunn Jóhanna Kristjánsdóttir < ***@***.***> wrote:
Sorry for the late answer, I haven't been able to work on this for the
past weeks. In the conversion script (since the tree reading code
automatically removes empty nodes) it is mostly a matter of removing lemmas
(so *(NS-D löndum-land)* becomes *(NS-D löndum)*) and combining word
parts that start/end with $ (in most cases a noun and its determiner), for
example should *(NS-N tungur$-tunga) (D-N $nar-hinn)* become *(NS-N
tungurnar)* and *(RP upp$-upp) (VBPI $heldur-halda)* should become *(VBPI
uppheldur)*. I don't know exactly when I'll be able to continue working
on this (hopefully next week) but if you have the time you could for sure
help move it along.
—
Reply to this email directly, view it on GitHub
<#1389 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AA2AYWJZBIAOOG6ZG3HNKITZCW6HLAVCNFSM6AAAAABGXGSW26VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMJXGAZTAOBYGQ>
.
You are receiving this because you commented.Message ID:
***@***.***>
|
I checked in a script which uses Tsurgeon to convert a single tree: If you have a list of cases where I put a brief introduction using this example on the github.io https://stanfordnlp.github.io/stanza/tsurgeon.html It doesn't really say anything other than the example I just checked in, though. PS sorry for reusing your script's name and therefore making it slightly more annoying to do this PR |
Any thoughts on the tregex / tsurgeon solution for this issue? |
@ingunnjk any thoughts on how best to proceed with the conversion of the Icelandic dataset? If you have a set of heuristics on how to combine the |
Can I follow up on this? Happy to add the model, but need some insight on adapting the conversion. Looking to make a new release in the next couple weeks |
e469357
to
a17029c
Compare
Hi, very sorry for the late answer! I wont be able to work on this in November but I will have time in the beginning of December. I will do my very best to finish this then, it would be great to be able to add the model.
From: John Bauer ***@***.***>
Date: Thursday, 24 October 2024 at 07:39
To: stanfordnlp/stanza ***@***.***>
Cc: Ingunn Jóhanna Kristjánsdóttir - HI ***@***.***>, Mention ***@***.***>
Subject: Re: [stanfordnlp/stanza] Add a constituency model for Icelandic (PR #1389)
Can I follow up on this? Happy to add the model, but need some insight on adapting the conversion. Looking to make a new release in the next couple weeks
—
Reply to this email directly, view it on GitHub<#1389 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/ATGDTHIJ4FCX2ZEL5VDB4CLZ5CPZ5AVCNFSM6AAAAABGXGSW26VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIMZUGUZDKNZZGE>.
You are receiving this because you were mentioned.Message ID: ***@***.***>
|
No worries. Now-ish or in the next couple weeks would be a perfect time, as I'm rebuilding the models for a new version of Stanza using a new version of UD. Later works too. If you can summarize some of the tree edits needed to get from the original treebank to the version used for training, I can help redo those in tregex & tsurgeon and also rebuild the models if needed. Thanks! |
BEFORE YOU START: please make sure your pull request is against the
dev
branch.We cannot accept pull requests against the
main
branch.See our contributing guide for details.
Description
I want to add an Icelandic constituency model for Icelandic. I have made changes to prepare_con_dataset.py and added the script convert_icepahc.py in stanza/utils/datasets/constituency. I have also added an Icelandic BERT model to stanza/resources/default_packages.py.
Fixes Issues
A list of issues/bugs with # references. (e.g., #123)
Unit test coverage
Are there unit tests in place to make sure your code is functioning correctly?
(see here for a simple example)
Known breaking changes/behaviors
None.