When a HuggingFace transformers model has tokenizer_config.json and tokenizer.json, how do I configure the config.cfg file? #8907
-
How to reproduce the behaviour

I use the model https://huggingface.co/hfl/chinese-roberta-wwm-ext/tree/main, with a training config like below:

```
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "./chinese-roberta-wwm-ext"

[components.transformer.model.tokenizer_config]
use_fast = true
```

But the model directory contains tokenizer_config.json and tokenizer.json, and these files seem not to be used by spaCy.
Replies: 2 comments 4 replies
-
When you load a Transformer/HuggingFace model with spaCy, it uses the HuggingFace code, so even if the config file doesn't mention it, the model can do other things, which might include loading JSON config files. I'm a little unclear about what you're trying to do. Are you trying to customize the JSON config files, or does loading that model not work, or...? If you have tried something and it didn't work, it would be helpful to see any error messages you've gotten.
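For reference, everything under `tokenizer_config` in the spaCy config is forwarded as keyword arguments to HuggingFace's `AutoTokenizer.from_pretrained`, which itself reads tokenizer.json and tokenizer_config.json from the model directory; so those files are consumed by the HuggingFace side rather than by spaCy directly. A minimal sketch (the comments describe the forwarding behaviour, not extra settings you need):

```
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "./chinese-roberta-wwm-ext"

[components.transformer.model.tokenizer_config]
# Each entry here becomes a keyword argument to
# AutoTokenizer.from_pretrained(name, **tokenizer_config).
# The tokenizer.json / tokenizer_config.json files in the model
# directory are then read by the HuggingFace tokenizer itself.
use_fast = true
```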
-
I'm facing the same issue.