Update NER model with huggingface transformer #12914
-
I'm working on training a `ner` (named entity recognition) model using a Hugging Face transformer. Now I'm trying to continue training the model with another dataset. From what I've learned (from here), I need to keep the `transformer` layer frozen and just focus on updating the `ner` part, which I've also done (see #11547). Training with the frozen transformer works fine, but now I want to update NER with another dataset that I have. In other words, I want to update NER regularly as soon as I have a new dataset to train it on.

[screenshot of the training config]

Should I only swap in the new datasets and continue training with the config shown in the screenshot, or do I have to build the pipeline again for each training set?
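(For reference, the freezing described in #11547 typically comes down to a couple of lines in the training config; this is a rough sketch, not the exact config from the screenshot:)

```ini
[training]
# Update only the ner component; keep the transformer weights fixed.
frozen_components = ["transformer"]
# The frozen transformer still has to run during training so that the
# ner listener receives its predictions.
annotating_components = ["transformer"]
```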
-
Hi @shahryary, for the training run with your dataset A this config is fine. If you just swap out the dataset for the second run, though, your model will learn from scratch. Instead, source both the `transformer` and the `ner` from the pipeline that was trained on A. From then on you can use the same config, as long as you always source from the latest model you trained. In practice, the whole `[components.ner]` block would become just one line: `source = "./current_model"` (where `./current_model` is the location of the model trained with the previous dataset). The relevant sections might look like the sketch below.
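Assuming the pipeline trained on A was saved to `./current_model` and you keep the transformer frozen as before, the config sections for the second run could look like this (a sketch, not a complete config):

```ini
[components.transformer]
# Reuse the transformer weights from the previously trained pipeline.
source = "./current_model"

[components.ner]
# Continue training the ner component from the same pipeline.
source = "./current_model"

[training]
# Keep the transformer frozen, as in the original setup.
frozen_components = ["transformer"]
# A frozen transformer still needs to run during training so that the
# ner listener receives its predictions.
annotating_components = ["transformer"]
```

You'd then train as usual, e.g. `python -m spacy train config.cfg --output ./new_model --paths.train train_B.spacy --paths.dev dev_B.spacy` (the file names here are placeholders), and point `source` at `./new_model/model-best` for the next round.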
Note that catastrophic forgetting might become an issue if you train with datasets with diverging distributions over time. There are some resources on this related to spaCy: e.g. a blog post, and several forum posts. Basically, if you train on A, then B, then C, and so on, and the datasets vary between them, at some point you might find that the model starts performing worse on the original data (e.g. data A). This happens because the weights of the NER model are gradually overwritten by the later datasets. To remedy this, at least continue tracking the performance of the newly trained models on the older datasets (the test set of A, etc.). If you notice regressions, consider mixing in data from the older datasets when training on a new one, just to ensure that signal keeps getting reinforced and backpropped.
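One simple way to do that tracking is to evaluate each newly trained model on the held-out portions of the earlier datasets, for instance (paths are placeholders):

```bash
# Check the model trained on dataset B against the test set of dataset A
# to watch for catastrophic forgetting.
python -m spacy evaluate ./new_model/model-best test_A.spacy --output metrics_A.json
```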