
Default combined CorefUD model has inconsistent outputs for English #1450

Open · amir-zeldes opened this issue Jan 24, 2025 · 9 comments

@amir-zeldes

I've been testing the CorefUD-trained Stanza model on English and seeing some inconsistent results, especially with regard to singletons. Since the model is trained on data that has singletons (but possibly also data that has no singletons? Is ParCorFull included for English? Or is the default a totally multilingual model?), it should produce predictions for most obvious noun phrases and for the most part it does:

[Image]

However other times it ignores very obvious mentions, perhaps because figuring out an antecedent is non-trivial:

[Image]

Notice that the model misses even completely reliable mention spans such as the pronouns "I" or "their", which are virtually guaranteed to be mentions (even if no antecedent can be found, a corpus with singletons would still annotate them).

What I'm actually after is English GUM-like results, and I'm wondering whether this behavior is the result of multi-dataset training and conflicting guidelines (especially regarding singletons). Is there any chance of getting a GUM-only trained model for English?

@Jemoka (Member) commented Jan 24, 2025

I think this is the one that's trained with the mixed multilingual backbone, and possibly on a mixture of data with and without singletons; we can ship a GUM-only model, or perhaps even OntoNotes + singletons. @amir-zeldes — do you have the OntoNotes augmented dataset somewhere? Would love to train a test model off of that.

@amir-zeldes (Author)

@Jemoka that would be amazing! I think we'd actually want all of those different models if possible. ON w/ singletons + GUM would be great for mention detection, but the two corpora have rather different coref guidelines, so mixing them could create a hodgepodge of inconsistent clustering predictions. It's an empirical question, but I could imagine that if you were scoring GUM-style coref including singletons, throwing in all predictions from both models might actually outperform either model by itself and prevent the low-recall issues of ON-only models. Then again, it might need some rule-based postprocessing...
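To illustrate what I mean by throwing the predictions together, here's a toy sketch. Nothing in it is shipped Stanza functionality: the helper names are mine, and the attribute names on the coref output (doc.coref, chain.mentions, mention.sentence / start_word / end_word) are my assumption, so verify them on your version.

def mention_spans(doc):
    # Collect (sentence, start, end) triples for every predicted mention.
    # Attribute names assumed from Stanza's coref output -- verify locally.
    spans = set()
    for chain in doc.coref:
        for m in chain.mentions:
            spans.add((m.sentence, m.start_word, m.end_word))
    return spans

def union_mentions(doc_on, doc_gum):
    # Union of the two models' mention spans; re-clustering the merged set
    # consistently is where the rule-based postprocessing would come in.
    return mention_spans(doc_on) | mention_spans(doc_gum)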

@yilunzhu has put the ON singleton predictions up on GitHub; I think this is the latest (Yilun, please correct me if there's something newer).

For training with GUM it might also be worth waiting a little - we're close to releasing GUM v11, with new data, probably in about 2 weeks. I can post to this thread when that happens if that's of interest.

@yilunzhu

Yes, this is the latest version.

@Jemoka (Member) commented Jan 24, 2025

> For training with GUM it might also be worth waiting a little - we're close to releasing GUM v11, with new data, probably in about 2 weeks. I can post to this thread when that happens if that's of interest.

Sounds good; will hold off on that. In the meantime I will train an English OntoNotes + singletons model and report back on this thread.

@Jemoka (Member) commented Jan 26, 2025

Update:
Great news! We have 0.812 head-match LEA on the dev set for this dataset using our approach + RoBERTa backbone.
Bad news! Manually running the model by hand reveals no corefs; something is wrong with our client inference procedure. Stay tuned.

Update 2:
Looks like the model got biased toward the length of OntoNotes documents; I arbitrarily updated my test to contain much longer inputs and it's doing better now. Tomorrow/Monday I'll run with an augmentation where we repeat the training data a few times across varying lengths (rough sketch below).
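Roughly the kind of augmentation I have in mind (an illustrative sketch only, not our actual training code; a real version would also drop or remap mention annotations that fall outside the window):

import random

def augment_by_length(docs, fractions=(0.25, 0.5, 1.0), copies=2):
    # Re-emit each document at several window sizes so the model sees
    # training inputs of varying length, not just OntoNotes-sized ones.
    augmented = []
    for sentences in docs:  # each doc is a list of sentences
        for frac in fractions:
            window = max(1, int(len(sentences) * frac))
            for _ in range(copies):
                start = random.randint(0, len(sentences) - window)
                augmented.append(sentences[start:start + window])
    return augmented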

@Jemoka (Member) commented Jan 29, 2025

Done! CC @amir-zeldes
For the English dev set:

span-match LEA: 72.19
head-word match CoNLL 2012: 82.90

Here's the weights: https://drive.google.com/drive/folders/14EwOVRSrdbp9cjARgTu-DNaCjryNePJW?usp=sharing

To use them:

import stanza

nlp = stanza.Pipeline('en', processors='tokenize,coref', coref_model_path="./the_path_to/roberta_lora.pt")
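As a quick smoke test (the sample sentence is arbitrary, and doc.coref is where the coref processor puts its predicted chains as far as I know — double-check on your Stanza version):

doc = nlp("Enver Solomon, CEO of the Refugee Council, said he welcomed the decision.")
# Printing the chains shows the predicted clusters, including singletons.
print(doc.coref)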

@AngledLuffa (Collaborator)

Thank you, @Jemoka!

To make it more convenient to get the model, I uploaded it to HuggingFace. You should be able to download it using Stanza version 1.10 with:

import stanza

pipe = stanza.Pipeline("en", processors="tokenize,coref", package={"coref": "ontonotes-singletons_roberta-large-lora"})
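To poke at the output, something like the following should work. I'm assuming the chain/mention attribute names (representative_text, mentions, sentence, start_word, end_word, zero-based and end-exclusive) from Stanza's coref output, so verify them on your version:

doc = pipe("I asked the CEO about their plans for asylum decisions.")
for chain in doc.coref:
    # Each chain groups the mentions of one entity; singletons are chains
    # with a single mention.
    print(chain.representative_text)
    for m in chain.mentions:
        words = doc.sentences[m.sentence].words[m.start_word:m.end_word]
        print("  ", " ".join(w.text for w in words))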

@amir-zeldes (Author)

OK, coref still has some issues for the sample text I was using above, but this is much, much better for mention detection:

[Image]

The only systematic concerns I have about it are direct results of ON guidelines, for example the treatment of appositions (so we get [CEO of the Refugee Council [Enver Solomon]] as a nested mention) and the absence of compound modifiers (e.g. no [asylum] in [asylum decisions]). Coordination is also a bit odd with [Albania and [France]]: I'd expect the big coordination box to oscillate (not annotated unless referred back to), but Albania by itself being missing is strange.

But either way, this is worlds better, thanks so much for making this model available!

We're getting close to wrapping up the latest GUM release, I'll post a link to the data as soon as it's ready.

@Jemoka (Member) commented Feb 1, 2025

Sounds good; once the next GUM is released I'll be glad to build a model for that. There's a chance that upping top-k in the initial filtering step will help with things like coordination with a lot of nesting (a rough illustration of that step is below).
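For context, that filtering step is standard coarse-to-fine pruning: score every candidate span cheaply and keep only the top-k for the expensive pairwise antecedent scoring. A generic illustration, not our exact code:

import torch

def prune_spans(span_scores: torch.Tensor, k: int) -> torch.Tensor:
    # Keep the k highest-scoring candidate spans; raising k lets more
    # nested/coordinated spans survive into the clustering stage.
    k = min(k, span_scores.numel())
    return torch.topk(span_scores, k).indices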
