Training an NER from scratch while reusing the parser, tagger, and other components from the en_core_web_md model #9213

I believe I found the solution. The key was to freeze all of the reused components and to set the tok2vec source to en_core_web_md.

[components]

[components.tok2vec]
source = "en_core_web_md"

[components.tagger]
source = "en_core_web_md"

[components.parser]
source = "en_core_web_md"

[components.attribute_ruler]
source = "en_core_web_md"

[components.lemmatizer]
source = "en_core_web_md"

[components.ner]
source = "en_core_web_md"

[training]
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = ["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"…
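
With this config in place, training runs the usual way, e.g. python -m spacy train config.cfg --output ./output (plus whatever --paths.train / --paths.dev overrides your config expects). As a rough sketch of how to check the result, assuming ./output/model-best as the output location (that path and the sample sentence are placeholders, not from the original post), the trained pipeline can be loaded and inspected like this:

import spacy

# Load the pipeline produced by "spacy train"; ./output/model-best is an assumed path.
nlp = spacy.load("./output/model-best")

# The sourced components should still be present alongside the retrained ner.
print(nlp.pipe_names)
# e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")
print([(ent.text, ent.label_) for ent in doc.ents])   # entities from the retrained ner
print([(t.text, t.pos_, t.dep_) for t in doc])        # tags and parses from the frozen components

If the frozen components were carried over correctly, the tags, parses, and lemmas should match what en_core_web_md produces, while the entities reflect the new training data.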
