How to annotate a corpus to train a SpanFinder #13100
-
Hi, I was wondering how to train a SpanFinder model on its own. I don't yet know all the classes I will assign to spans, but wish to train a model that will suggest spans to annotate in Prodigy. There is no available dataset for my specific task. I've found the existing example config and corpus referenced in this post, but it's unclear to me how to go about annotating a corpus with unlabelled spans using Prodigy. I know how to train a span_cat model, but, as I mentioned, at this stage I'd like to train a span_finder to suggest spans that I will classify later. My solution so far has been to annotate spans with a single label ( Thanks! PS I posted this on the Prodigy forum but was redirected here (see reply).
-
Hi @ddenz!
Can you elaborate on "But this doesn't separate the two tasks as I would like."?
-
Hi, yes, sure. I meant the tasks of span identification and span classification, which I would like to perform separately. I'd like to annotate a corpus with unlabelled spans, which I would then use to train a SpanFinder. Once that model is satisfactory and I have an idea of the different span classes I need, I will go on to annotate labelled spans and train a SpanCategorizer using spans suggested by the SpanFinder model. Maybe my current strategy is fine, as ultimately what I want is a corpus to train a SpanCategorizer model. I would still like to know whether it's possible to annotate unlabelled spans in Prodigy, and how.
-
Annotating spans without assigning any label isn't possible. In your case it's probably easiest to do one of these two things:
-
Thanks very much for your response @rmitsch - that's good to know. Seeing as I had existing annotated data, I have tried your first solution: removing all the labels from the dataset. However, I'm a little confused. Can the SpanFinder be trained directly on spans with labels? That is, as a trainable component, does it not simply ignore the "label" feature? Also, when generating a spaCy dataset from annotations, there appears to be no option to specify span_finder:
How can I produce a spaCy binary file without labels (and is this really what I need to do)? Or can I just use my existing data that contains labelled spans, specify the --spancat option to create the dataset, and then modify the config file accordingly to train the span_finder? More generally, would it not make sense to allow for the annotation of unlabelled spans in Prodigy, seeing as the span_finder is a trainable component? The generation of a default config for span_finder training would be helpful as well. I understand it is still an "experimental" feature, so maybe such things are on the todo list...
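For the "remove the labels" route discussed above, one option is to rewrite the exported annotations so that every span carries the same placeholder label. The following is only a minimal sketch: it assumes the annotations are available as Prodigy-style JSONL (one task per line, each with a `"spans"` list of dicts containing a `"label"` key), and the helper names and the `SPAN` placeholder are hypothetical, not part of Prodigy or spaCy.

```python
import json

# Hypothetical placeholder; any single label would serve the same purpose.
PLACEHOLDER_LABEL = "SPAN"


def collapse_span_labels(task: dict, label: str = PLACEHOLDER_LABEL) -> dict:
    """Return a copy of a Prodigy-style task with every span relabelled."""
    relabelled = dict(task)
    # Build new span dicts so the original task is left untouched.
    relabelled["spans"] = [
        {**span, "label": label} for span in task.get("spans", [])
    ]
    return relabelled


def collapse_jsonl(in_path: str, out_path: str, label: str = PLACEHOLDER_LABEL) -> int:
    """Rewrite a JSONL file task by task; returns the number of tasks written."""
    count = 0
    with open(in_path, encoding="utf8") as src, open(out_path, "w", encoding="utf8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            task = collapse_span_labels(json.loads(line), label)
            dst.write(json.dumps(task) + "\n")
            count += 1
    return count


# Example task; the "MOBILITY" label is made up for illustration.
example = {
    "text": "She has trouble walking up stairs.",
    "spans": [{"start": 4, "end": 33, "label": "MOBILITY"}],
}
print(collapse_span_labels(example)["spans"])
```

The span offsets are left as they are, so the relabelled file should be usable wherever the original export was, just with a single span class.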
-
I suppose I found a "workaround" by annotating spans using a single label. Perhaps my question now is "how do I train a span_finder without training a span_cat?", since I don't know what my labels are going to be.
I tried this but got all-zero scores for the span_finder model during training, so something is wrong with the data or config, I guess.
Does this not imply an unnecessary training overhead due to training two models (span_finder + span_cat) instead of just one?
-
Not sure I understand what you suggest I check but the data split was done from a single dataset using
I am annotating spans that indicate someone has difficulty with a particular task or activity around their home. For example, "has trouble walking up stairs", "finds it difficult to get into the shower", "has a hard time getting to the bathroom". There is a large-ish lexical variety, but there is a fairly limited set of syntactic configurations. I have managed to get the span_finder to train. I used a config generated by The config I used is below:
Training outputs:
I also had to use
Be sure that the `score_weights` include the right scores for the `span_finder`.

In practice, especially if you have longer texts, I've found that it doesn't work that well to train `span_finder` + `spancat` together from scratch. The `span_finder` needs to be trained for long enough to start giving reasonable suggestions before it makes sense to add `spancat` on top. And `spancat` can run out of memory if the untrained `span_finder` starts trying to suggest every single possible span in the text.

What you can do instead is train `span_finder` on its own until its performance is reasonable and then continue training `span_finder` + `spancat` by sourcing the `span_finder` in the combined config. You don't need to fre…
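
On the `score_weights` point above, a quick sanity check before training could look like the sketch below. It is an illustration only: it reads the config with Python's standard `configparser` rather than spaCy's own loader, the helper name is hypothetical, and the score names (`span_finder_sc_f` and so on) are an assumption based on the default `sc` spans key, so check them against the scores your own training run reports.

```python
import configparser

# Hypothetical excerpt of a training config; the score names assume the
# default spans key "sc" and should be checked against your own config.
EXAMPLE_CONFIG = """
[training.score_weights]
span_finder_sc_f = 1.0
span_finder_sc_p = 0.0
span_finder_sc_r = 0.0
"""


def span_finder_score_weights(config_text: str) -> dict[str, float]:
    """Collect the span_finder entries found under [training.score_weights]."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key names exactly as written
    parser.read_string(config_text)
    section = "training.score_weights"
    if not parser.has_section(section):
        return {}
    weights = {}
    for key, value in parser.items(section):
        # Skip entries that are disabled with "null" or left empty.
        if key.startswith("span_finder_") and value.strip() not in ("", "null"):
            weights[key] = float(value)
    return weights


found = span_finder_score_weights(EXAMPLE_CONFIG)
if not found:
    print("No span_finder scores are weighted in this config.")
else:
    print(found)
```

An empty result would suggest that the final score ignores the `span_finder` entirely, which is one possible reason for a score column that never moves during training.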