Creating custom DependencyParser from scratch in Spacy 3 #9239
-
I am trying to implement my own DependencyParser from scratch in spaCy 3. I create an empty model, create an empty DependencyParser, train it, and save its configuration. But when I try to load my custom parser config again, I can only do it successfully if the model is empty. If I am using a non-empty model, I keep getting this error: `ValueError: could not broadcast input array from shape (106,64) into shape (27,64)`.

```python
import spacy
import random
from spacy.tokens import Doc
from spacy.training import Example
from spacy.pipeline import DependencyParser
from typing import List, Tuple

PARSER_CONFIG = 'parser.cfg'

TRAINING_DATA = [
    ('find a high paying job with no experience', {
        'heads': [0, 4, 4, 4, 0, 7, 7, 4],
        'deps': ['ROOT', '-', 'QUALITY', 'QUALITY', 'ACTIVITY', '-', 'QUALITY', 'ATTRIBUTE']
    }),
    ('find good workout classes near home', {
        'heads': [0, 3, 3, 0, 5, 3],
        'deps': ['ROOT', 'QUALITY', 'QUALITY', 'ACTIVITY', 'QUALITY', 'ATTRIBUTE']
    })
]

def create_training_examples(training_data: List[Tuple]) -> List[Example]:
    """Create a list of training examples."""
    examples = []
    nlp = spacy.load('en_core_web_md')
    for text, annotations in training_data:
        print(f"{text} - {annotations}")
        examples.append(Example.from_dict(nlp(text), annotations))
    return examples

def save_parser_config(parser: DependencyParser):
    print(f"Save parser config to '{PARSER_CONFIG}' ... ", end='')
    parser.to_disk(PARSER_CONFIG)
    print("DONE")

def load_parser_config(parser: DependencyParser):
    print(f"Load parser config from '{PARSER_CONFIG}' ... ", end='')
    parser.from_disk(PARSER_CONFIG)
    print("DONE")

def main():
    nlp = spacy.blank('en')
    # Create new parser
    parser = nlp.add_pipe('parser', first=True)
    for text, annotations in TRAINING_DATA:
        for label in annotations['deps']:
            if label not in parser.labels:
                parser.add_label(label)
    print(f"Added labels: {parser.labels}")
    examples = create_training_examples(TRAINING_DATA)
    # Training
    # NOTE: The 'lambda: examples' part is mandatory in spaCy 3 - https://spacy.io/usage/v3#migrating-training-python
    optimizer = nlp.initialize(lambda: examples)
    print("Training ... ", end='')
    for i in range(25):
        print(f"{i} ", end='')
        random.shuffle(examples)
        nlp.update(examples, sgd=optimizer)
    print("... DONE")
    save_parser_config(parser)
    # I can load the parser config into a blank model ...
    nlp = spacy.blank('en')
    parser = nlp.add_pipe('parser')
    # ... but I cannot load the parser config into an already existing model.
    # Raises -> ValueError: could not broadcast input array from shape (106,64) into shape (27,64)
    # nlp = spacy.load('en_core_web_md')
    # parser = nlp.get_pipe('parser')
    load_parser_config(parser)
    print(f"Current pipeline is {nlp.meta['pipeline']}")
    doc = nlp(u'find a high paid job with no degree')
    print(f"Arcs: {[(w.text, w.dep_, w.head.text) for w in doc if w.dep_ != '-']}")

if __name__ == '__main__':
    main()
```

The custom parser itself is working as expected. You can test this by commenting out all the code from

My Environment
Replies: 1 comment 3 replies
-
What's happening is that you're taking a parser that was trained with one set of labels and then loading saved weights for a different label set on top of it. This is not something you should do, and the math doesn't work out, which is why you get the broadcast error.

To be clear on a bit of terminology, it looks like you are training a dependency parser model, not implementing a dependency parser from scratch - that would involve writing the model code, which is not happening here.

I am also confused by your overall approach. Our recommended way to train components is to use the Quickstart in the docs to generate a config, run the CLI-based training, and then load the trained pipeline and source components from it as necessary. Writing a custom training loop is possible, but it is prone to causing problems. While it is possible to save and load individual components, that is designed to be handled internally; the expectation is that, as a user, you will save and load pipelines.

Here's a way to make your code work with minimal changes, replacing your code from
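For what that pipeline-level workflow could look like, here is a rough, hedged sketch (not the exact snippet this reply refers to): instead of calling `parser.to_disk(...)` / `parser.from_disk(...)`, the whole trained `nlp` object from the training script is saved and loaded. The directory name `my_parser_pipeline` is just a placeholder.

```python
# Hedged sketch: persist the whole trained pipeline instead of the parser component.
# 'nlp' is the trained pipeline from the training script above; 'my_parser_pipeline'
# is a placeholder directory name.
nlp.to_disk('my_parser_pipeline')        # writes config, vocab and all components

# Later (or in another process), load it back as a complete pipeline.
import spacy

reloaded = spacy.load('my_parser_pipeline')
doc = reloaded('find a high paid job with no degree')
print([(w.text, w.dep_, w.head.text) for w in doc if w.dep_ != '-'])
```

Because the component's weights and its label set are serialized together with the pipeline, there is no mismatch between the saved weights and the parser they are loaded into.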
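To illustrate the "source components from it as necessary" part, here is a similarly hedged sketch of pulling the trained parser into `en_core_web_md` instead of loading its saved config into the pretrained parser (which is what triggers the broadcast error):

```python
# Hedged sketch: "source" the trained parser into another pipeline. The component
# is copied together with its weights, so its label set and weight shapes stay
# consistent. 'my_parser_pipeline' is the placeholder directory from above.
import spacy

trained = spacy.load('my_parser_pipeline')
nlp_md = spacy.load('en_core_web_md', exclude=['parser'])   # leave out the stock parser
nlp_md.add_pipe('parser', source=trained)                   # copy config *and* weights

doc = nlp_md('find a high paid job with no degree')
print([(w.text, w.dep_, w.head.text) for w in doc if w.dep_ != '-'])
# spaCy may warn if the source pipeline's vocab/vectors differ from the target's,
# so this works best when the component was trained with compatible settings.
```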