Training a relation extraction model with span categorization - MemoryError #12974
Hi! I'm trying to train the demo. I followed the instructions from this discussion: Training a relation extraction model with span categorization instead of NER. These are the steps that I followed:
# Parse the entities
spans = example["spans"]
entities = []
span_end_to_start = {}
for span in spans:
    entity = doc.char_span(
        span["start"], span["end"], label=span["label"]
    )
    span_end_to_start[span["token_end"]] = span["token_start"]
    entities.append(entity)
    span_starts.add(span["token_start"])
if not entities:
    msg.warn("Could not parse any entities from the JSON file.")
doc.spans[spans_key] = entities
@spacy.registry.misc("rel_span_instance_generator.v1")
def create_instances(max_length: int, span_key: str) -> Callable[[Doc], List[Tuple[Span, Span]]]:
    def get_instances(doc: Doc) -> List[Tuple[Span, Span]]:
        instances = []
        for ent1 in doc.spans[span_key]:
            for ent2 in doc.spans[span_key]:
                if ent1 != ent2:
                    if max_length and abs(ent2.start - ent1.start) <= max_length:
                        instances.append((ent1, ent2))
        return instances

    return get_instances
@Language.factory(
    "relation_extractor",
    requires=["doc.spans", "token.ent_iob", "token.ent_type"],
    assigns=["doc._.rel"],
    default_score_weights={
        "rel_micro_p": None,
        "rel_micro_r": None,
        "rel_micro_f": None,
    },
)
Adding
These modifications produce a MemoryError.
Our data has 318 examples with 258 tokens on average. Any suggestions on how to resolve this issue?
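For a sense of scale: the instance generator in this post pairs every span with every other span whose start is within `max_length` tokens, so the number of candidate instances grows roughly quadratically with the number of spans per document. The sketch below is only a rough, dependency-free illustration of that growth. It assumes, hypothetically, that every 1- to 3-token n-gram of a single 258-token document ends up in `doc.spans[span_key]`; the n-gram sizes and the `max_length` value of 100 are assumptions, not taken from the actual config, and `ngram_spans` and `count_instances` are made-up helper names.

```python
from itertools import product
from typing import List, Sequence, Tuple

TokenSpan = Tuple[int, int]  # (start, end) token offsets


def ngram_spans(n_tokens: int, sizes: Sequence[int]) -> List[TokenSpan]:
    """Return (start, end) offsets for every n-gram of the given sizes."""
    return [
        (start, start + size)
        for size in sizes
        for start in range(n_tokens - size + 1)
    ]


def count_instances(spans: Sequence[TokenSpan], max_length: int) -> int:
    """Count ordered span pairs using the same conditions as get_instances."""
    count = 0
    for span1, span2 in product(spans, repeat=2):
        if span1 != span2:
            if max_length and abs(span2[0] - span1[0]) <= max_length:
                count += 1
    return count


# Assumed worst case: one 258-token document, all n-grams of size 1 to 3.
spans = ngram_spans(258, sizes=(1, 2, 3))
print(len(spans))  # 771 candidate spans
# Assumed max_length of 100 tokens; prints the number of ordered pairs.
print(count_instances(spans, max_length=100))
```

Each of these pairs becomes one instance that the relation extraction model has to represent, per document, so the count is worth checking against the available memory before training.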
Replies: 1 comment
I think you're running into this problem because the `spancat` component starts out randomly initialized (untrained) and can produce nonsense, like annotating every single n-gram as an entity, which overwhelms the following relation extraction component.
Instead, try training `spancat` separately first until its performance is reasonably good, and then use `source` to include the `tok2vec` and `spancat` in the relation extraction config, similar to this example: https://spacy.io/usage/training#annotating-components. Using `tok2vec`, you'll need to include both `tok2vec` and `spancat` in the annotating components in this pipeline.
You can experiment with whether it works better to continue training `spancat` …
I suspect that you'll need a good bit more training data than this to see reasonable performance from …
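The setup described in this reply could look roughly like the config fragment embedded in the sketch below. This is an incomplete illustration under stated assumptions, not a working training config: the path `training/spancat/model-best` is a hypothetical placeholder for wherever the separately trained pipeline was saved, and the relation extractor's own settings are omitted. The stdlib `configparser` is used here only to keep the sketch dependency-free and to sanity-check the fragment; it is not how spaCy itself loads configs.

```python
import configparser
import json

# Hypothetical, incomplete config fragment; the model path is a placeholder.
CONFIG_SKETCH = """
[nlp]
pipeline = ["tok2vec", "spancat", "relation_extractor"]

[components.tok2vec]
source = "training/spancat/model-best"

[components.spancat]
source = "training/spancat/model-best"

[components.relation_extractor]
factory = "relation_extractor"

[training]
annotating_components = ["tok2vec", "spancat"]
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_SKETCH)

# List values in the fragment are written as JSON, so json.loads can read them.
pipeline = json.loads(parser["nlp"]["pipeline"])
annotating = json.loads(parser["training"]["annotating_components"])

# Components that are copied from a trained pipeline instead of created fresh.
sourced = [
    name for name in pipeline
    if parser.has_option(f"components.{name}", "source")
]
# Annotating components must also be part of the pipeline.
missing = [name for name in annotating if name not in pipeline]

print("sourced:", sourced)
print("annotating:", annotating)
print("missing from pipeline:", missing)
```

The point of the check is that both sourced components appear in `annotating_components`, so their predictions are set on the documents before the relation extractor sees them during training.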