Training a relation extraction model with span categorization - MemoryError #12974

Racana · 2023-09-11T22:23:04Z

Hi! I'm trying to train the demo rel_component pipeline jointly with spancat, but I can't get it to work: it raises a MemoryError during the training step. Any pointers on why this is happening, or on how to reduce the memory consumption, would be highly appreciated!

I followed the instructions from the discussion "Training a relation extraction model with span categorization instead of NER".

These are the steps I followed:

Clone the rel_component project
Modify scripts/parse_data.py to add spancat labels

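# Parse the entities and store them in a span group instead of doc.ents
# (excerpt: span_starts and spans_key are assumed to be defined earlier in the script)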
spans = example["spans"]
entities = []
span_end_to_start = {}
for span in spans:
    entity = doc.char_span(
        span["start"], span["end"], label=span["label"]
    )
    span_end_to_start[span["token_end"]] = span["token_start"]
    entities.append(entity)
    span_starts.add(span["token_start"])
if not entities:
    msg.warn("Could not parse any entities from the JSON file.")
doc.spans[spans_key] = entities

Modify scripts/rel_model.py as suggested in the original discussion

@spacy.registry.misc("rel_span_instance_generator.v1")
def create_instances(
    max_length: int, span_key: str
) -> Callable[[Doc], List[Tuple[Span, Span]]]:
    def get_instances(doc: Doc) -> List[Tuple[Span, Span]]:
        instances = []
        for ent1 in doc.spans[span_key]:
            for ent2 in doc.spans[span_key]:
                if ent1 != ent2:
                    if max_length and abs(ent2.start - ent1.start) <= max_length:
                        instances.append((ent1, ent2))
        return instances

    return get_instances

Modify scripts/rel_pipe.py to add doc.spans as a requirement of the relation_extractor factory

@Language.factory(
    "relation_extractor",
    requires=["doc.spans", "token.ent_iob", "token.ent_type"],
    assigns=["doc._.rel"],
    default_score_weights={
        "rel_micro_p": None,
        "rel_micro_r": None,
        "rel_micro_f": None,
    },
)

Finally, modify the config file configs/rel_tok2vec.cfg to include the spancat configuration and add spancat to annotating_components

[components.spancat]
factory = "spancat"
max_positive = null
scorer = {"@scorers":"spacy.spancat_scorer.v1"}
spans_key = "sc"
threshold = 0.5

[components.spancat.model]
@architectures = "spacy.SpanCategorizer.v1"

[components.spancat.model.reducer]
@layers = "spacy.mean_max_reducer.v1"
hidden_size = 128

[components.spancat.model.scorer]
@layers = "spacy.LinearLogistic.v1"
nO = null
nI = null

[components.spancat.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.width}

[components.spancat.suggester]
@misc = "spacy.ngram_suggester.v1"
sizes = [1,2,3]

I added spancat to annotating_components because I want both components to make predictions in the same pipeline.

[training]
annotating_components = ["spancat"]

These modifications produce the following error:

================================= train_cpu =================================
ℹ Re-running 'train_cpu': spaCy minor version changed (3.4.2 in
project.lock, 3.6.0 current)
Running command: 'C:\Users\John\anaconda3\python.exe' -m spacy train configs/rel_tok2vec.cfg --output training --paths.train data/train.spacy --paths.dev data/dev.spacy -c ./scripts/custom_functions.py
ℹ Saving to output directory: training
ℹ Using CPU

=========================== Initializing pipeline ===========================
[2023-09-11 15:40:28,810] [INFO] Set up nlp object from config
[2023-09-11 15:40:28,820] [INFO] Pipeline: ['tok2vec', 'spancat', 'relation_extractor']
[2023-09-11 15:40:28,824] [INFO] Created vocabulary
[2023-09-11 15:40:28,824] [INFO] Finished initializing nlp object
[2023-09-11 15:40:29,595] [INFO] Initialized pipeline components: ['tok2vec', 'spancat', 'relation_extractor']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'spancat', 'relation_extractor']
ℹ Set annotations on update for: ['spancat']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS SPANCAT  LOSS RELAT...  SPANS_SC_F  SPANS_SC_P  SPANS_SC_R  REL_MICRO_P  REL_MICRO_R  REL_MICRO_F  SCORE
---  ------  ------------  ------------  -------------  ----------  ----------  ----------  -----------  -----------  -----------  ------
⚠ Aborting and saving the final best model. Encountered exception:
MemoryError((82919916, 96), dtype('float32'))
Traceback (most recent call last):
  File "C:\Users\John\.conda\envs\venv\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\John\.conda\envs\venv\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\John\.conda\envs\venv\lib\site-packages\spacy\__main__.py", line 4, in <module>
    setup_cli()
  File "C:\Users\John\.conda\envs\venv\lib\site-packages\spacy\cli\_util.py", line 71, in setup_cli
    command(prog_name=COMMAND)
  File "C:\Users\John\.conda\envs\venv\lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\John\.conda\envs\venv\lib\site-packages\click\core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "C:\Users\John\.conda\envs\venv\lib\site-packages\click\core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Users\John\.conda\envs\venv\lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\John\.conda\envs\venv\lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "C:\Users\John\.conda\envs\venv\lib\site-packages\typer\main.py", line 532, in wrapper
    return callback(**use_params)  # type: ignore
  File "C:\Users\John\.conda\envs\venv\lib\site-packages\spacy\cli\train.py", line 45, in train_cli
    train(config_path, output_path, use_gpu=use_gpu, overrides=overrides)
  File "C:\Users\John\.conda\envs\venv\lib\site-packages\spacy\cli\train.py", line 75, in train
    train_nlp(nlp, output_path, use_gpu=use_gpu, stdout=sys.stdout, stderr=sys.stderr)
  File "C:\Users\John\.conda\envs\venv\lib\site-packages\spacy\training\loop.py", line 122, in train
    raise e
  File "C:\Users\John\.conda\envs\venv\lib\site-packages\spacy\training\loop.py", line 105, in train
    for batch, info, is_best_checkpoint in training_step_iterator:
  File "C:\Users\John\.conda\envs\venv\lib\site-packages\spacy\training\loop.py", line 203, in train_while_improving
    nlp.update(
  File "C:\Users\John\.conda\envs\venv\lib\site-packages\spacy\language.py", line 1170, in update
    proc.update(examples, sgd=None, losses=losses, **component_cfg[name])  # type: ignore
  File "C:\Users\John\OneDrive - Amplity Health\rel_component\scripts\rel_pipe.py", line 132, in update
    predictions, backprop = self.model.begin_update(docs)
  File "C:\Users\John\.conda\envs\venv\lib\site-packages\thinc\model.py", line 328, in begin_update
    return self._func(self, X, is_train=True)
  File "C:\Users\John\.conda\envs\venv\lib\site-packages\thinc\layers\chain.py", line 54, in forward
    Y, inc_layer_grad = layer(X, is_train=is_train)
  File "C:\Users\John\.conda\envs\venv\lib\site-packages\thinc\model.py", line 310, in __call__
    return self._func(self, X, is_train=is_train)
  File "C:\Users\John\OneDrive - Amplity Health\rel_component\scripts\rel_model.py", line 78, in instance_forward
    pooled, bp_pooled = pooling(entities, is_train)
  File "C:\Users\John\.conda\envs\venv\lib\site-packages\thinc\model.py", line 310, in __call__
    return self._func(self, X, is_train=is_train)
  File "C:\Users\John\.conda\envs\venv\lib\site-packages\thinc\layers\reduce_mean.py", line 18, in forward
    Y = model.ops.reduce_mean(cast(Floats2d, Xr.data), Xr.lengths)
  File "thinc\backends\numpy_ops.pyx", line 328, in thinc.backends.numpy_ops.NumpyOps.reduce_mean
numpy.core._exceptions.MemoryError: Unable to allocate 29.7 GiB for an array with shape (82919916, 96) and data type float32

Our data has 318 examples with 258 tokens on average.

Any suggestions on how to resolve this issue?
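
For reference, the size in the error message follows directly from the reported array shape and dtype. The short sketch below only reproduces that arithmetic; the helper name array_gib is made up for this illustration and is not part of spaCy, thinc, or NumPy.

```python
from math import prod
from typing import Tuple


def array_gib(shape: Tuple[int, ...], itemsize: int) -> float:
    """Memory needed for a dense array of the given shape, in GiB."""
    return prod(shape) * itemsize / 2**30


# Shape and dtype are taken from the MemoryError above; float32 uses 4 bytes per value.
print(f"{array_gib((82_919_916, 96), 4):.1f} GiB")  # 29.7 GiB
```

The 96 columns are the width of the vectors being pooled, so the problem is the roughly 83 million rows, not the vector size.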

Answered by adrianeboyd

adrianeboyd · 2023-09-12T06:33:34Z

I think you're running into this problem because the spancat component starts out randomly initialized (untrained) and can produce nonsense, like annotating every single n-gram as an entity, which overwhelms the relation extraction component that follows it.
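
As a rough illustration of the scale, here is a minimal sketch in plain Python (no spaCy needed) that counts the worst case for one document of average length. It assumes every span proposed by the n-gram suggester with sizes [1, 2, 3] is accepted, and it mirrors the nested loop in get_instances using (start, end) tuples in place of Span objects. The function names and the max_length value of 100 are assumptions made for this example, not values taken from the post.

```python
from itertools import product
from typing import List, Tuple

TokenSpan = Tuple[int, int]  # (start, end) token offsets


def ngram_spans(n_tokens: int, sizes: List[int]) -> List[TokenSpan]:
    """All (start, end) spans an n-gram suggester would propose for one doc."""
    return [
        (start, start + size)
        for size in sizes
        for start in range(n_tokens - size + 1)
    ]


def count_instances(spans: List[TokenSpan], max_length: int) -> int:
    """Count ordered pairs of distinct spans whose starts are at most
    max_length tokens apart, like the nested loop in get_instances."""
    return sum(
        1
        for a, b in product(spans, repeat=2)
        if a != b and abs(b[0] - a[0]) <= max_length
    )


spans = ngram_spans(258, [1, 2, 3])
print(len(spans))  # 771 candidate spans for a 258-token doc
print(count_instances(spans, max_length=100))  # assumed max_length
```

Without any distance limit, 771 spans would give 771 * 770 = 593,670 ordered pairs for a single document, and each pair contributes the token vectors of both spans to the pooling step, which is consistent with an array of tens of millions of rows for a batch.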

Instead, try training spancat separately first until its performance is reasonably good, and then use source to include the trained tok2vec and spancat in the relation extraction config, similar to this example: https://spacy.io/usage/training#annotating-components. Because the pipeline uses a tok2vec component, you'll need to include both tok2vec and spancat in annotating_components in this pipeline.

You can experiment with whether it works better to continue training spancat in combination with rel_component or whether it works better to freeze tok2vec and spancat.

I suspect that you'll need a good bit more training data than this to see reasonable performance from spancat, but this always depends on your data and config. If you train spancat separately, you can train it with a larger NER dataset, and then train the relation extractor on a smaller dataset.
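
A minimal sketch of what the relevant parts of the relation extraction config might look like with this approach. The path training_spancat/model-best is hypothetical and stands for wherever the separately trained spancat pipeline was saved. The frozen_components line is optional and corresponds to the "freeze" variant mentioned above; leave it out to continue training tok2vec and spancat together with the relation extractor.

```ini
[components.tok2vec]
# Hypothetical path to the separately trained pipeline
source = "training_spancat/model-best"

[components.spancat]
# Hypothetical path to the separately trained pipeline
source = "training_spancat/model-best"

[training]
# Both components annotate the docs before the relation extractor is updated
annotating_components = ["tok2vec", "spancat"]
# Optional: keep the sourced components fixed instead of training them further
frozen_components = ["tok2vec", "spancat"]
```

The sourced [components.spancat] block replaces the full spancat definition in the config, since the architecture and settings then come from the trained pipeline.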
