Creating custom DependencyParser from scratch in Spacy 3 #9239
-
I am trying to implement my own DependencyParser from scratch in spaCy 3. I create an empty model, create an empty DependencyParser, train it, and save its configuration. But when I try to load my custom parser config again, I can only do it successfully if the model is empty. If I am using a non-empty model, I keep getting this error: `ValueError: could not broadcast input array from shape (106,64) into shape (27,64)`.

```python
import spacy
import random
from spacy.tokens import Doc
from spacy.training import Example
from spacy.pipeline import DependencyParser
from typing import List, Tuple

PARSER_CONFIG = 'parser.cfg'

TRAINING_DATA = [
    ('find a high paying job with no experience', {
        'heads': [0, 4, 4, 4, 0, 7, 7, 4],
        'deps': ['ROOT', '-', 'QUALITY', 'QUALITY', 'ACTIVITY', '-', 'QUALITY', 'ATTRIBUTE']
    }),
    ('find good workout classes near home', {
        'heads': [0, 3, 3, 0, 5, 3],
        'deps': ['ROOT', 'QUALITY', 'QUALITY', 'ACTIVITY', 'QUALITY', 'ATTRIBUTE']
    })
]

def create_training_examples(training_data: List[Tuple]) -> List[Example]:
    """Create a list of training examples."""
    examples = []
    nlp = spacy.load('en_core_web_md')
    for text, annotations in training_data:
        print(f"{text} - {annotations}")
        examples.append(Example.from_dict(nlp(text), annotations))
    return examples

def save_parser_config(parser: DependencyParser):
    print(f"Save parser config to '{PARSER_CONFIG}' ... ", end='')
    parser.to_disk(PARSER_CONFIG)
    print("DONE")

def load_parser_config(parser: DependencyParser):
    print(f"Load parser config from '{PARSER_CONFIG}' ... ", end='')
    parser.from_disk(PARSER_CONFIG)
    print("DONE")

def main():
    nlp = spacy.blank('en')
    # Create new parser
    parser = nlp.add_pipe('parser', first=True)
    for text, annotations in TRAINING_DATA:
        for label in annotations['deps']:
            if label not in parser.labels:
                parser.add_label(label)
    print(f"Added labels: {parser.labels}")
    examples = create_training_examples(TRAINING_DATA)
    # Training
    # NOTE: The 'lambda: examples' part is mandatory in spaCy 3 - https://spacy.io/usage/v3#migrating-training-python
    optimizer = nlp.initialize(lambda: examples)
    print("Training ... ", end='')
    for i in range(25):
        print(f"{i} ", end='')
        random.shuffle(examples)
        nlp.update(examples, sgd=optimizer)
    print("... DONE")
    save_parser_config(parser)
    # I can load the parser config into a blank model ...
    nlp = spacy.blank('en')
    parser = nlp.add_pipe('parser')
    # ... but I cannot load the parser config into an already existing model.
    # Raises -> ValueError: could not broadcast input array from shape (106,64) into shape (27,64)
    # nlp = spacy.load('en_core_web_md')
    # parser = nlp.get_pipe('parser')
    load_parser_config(parser)
    print(f"Current pipeline is {nlp.meta['pipeline']}")
    doc = nlp(u'find a high paid job with no degree')
    print(f"Arcs: {[(w.text, w.dep_, w.head.text) for w in doc if w.dep_ != '-']}")

if __name__ == '__main__':
    main()
```

The custom parser itself is working as expected. You can test this by commenting out all the code from

My Environment
Replies: 1 comment 3 replies
-
What's happening is that you're taking a parser that was trained with one set of labels and then loading saved weights for a different label set on top of it. This is not something you should do, and the math doesn't work out, which is why you get the broadcast error.

To be clear on a bit of terminology, it looks like you are training a dependency parser model, not implementing a dependency parser from scratch - that would involve writing the model code, which is not happening here.

I am also confused by your overall approach. Our recommended way to train components is to use the Quickstart in the docs to generate a config, run the CLI-based training, and then load the trained pipeline and source components from it as necessary. Writing a custom training loop is possible, but it is prone to causing problems. While it is possible to save and load individual components, that is designed to be handled internally; the expectation is that, as a user, you will save and load pipelines.

Here's a way to make your code work with minimal changes, replacing your code from
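For what that pipeline-level workflow could look like, here is a rough, hedged sketch (not the exact snippet this reply refers to): instead of calling `parser.to_disk(...)` / `parser.from_disk(...)`, the whole trained `nlp` object from the training script is saved and loaded. The directory name `my_parser_pipeline` is just a placeholder.

```python
# Hedged sketch: persist the whole trained pipeline instead of the parser component.
# 'nlp' is the trained pipeline from the training script above; 'my_parser_pipeline'
# is a placeholder directory name.
nlp.to_disk('my_parser_pipeline')        # writes config, vocab and all components

# Later (or in another process), load it back as a complete pipeline.
import spacy

reloaded = spacy.load('my_parser_pipeline')
doc = reloaded('find a high paid job with no degree')
print([(w.text, w.dep_, w.head.text) for w in doc if w.dep_ != '-'])
```

Because the component's weights and its label set are serialized together with the pipeline, there is no mismatch between the saved weights and the parser they are loaded into.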
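To illustrate the "source components from it as necessary" part, here is a similarly hedged sketch of pulling the trained parser into `en_core_web_md` instead of loading its saved config into the pretrained parser (which is what triggers the broadcast error):

```python
# Hedged sketch: "source" the trained parser into another pipeline. The component
# is copied together with its weights, so its label set and weight shapes stay
# consistent. 'my_parser_pipeline' is the placeholder directory from above.
import spacy

trained = spacy.load('my_parser_pipeline')
nlp_md = spacy.load('en_core_web_md', exclude=['parser'])   # leave out the stock parser
nlp_md.add_pipe('parser', source=trained)                   # copy config *and* weights

doc = nlp_md('find a high paid job with no degree')
print([(w.text, w.dep_, w.head.text) for w in doc if w.dep_ != '-'])
# spaCy may warn if the source pipeline's vocab/vectors differ from the target's,
# so this works best when the component was trained with compatible settings.
```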