OOV tokens and TextCategorization #9375

frutik · 2021-10-05T14:00:03Z

Hi All.

I am trying to set up a pipeline for text classification (based on a few Medium publications, so my understanding may be pretty shallow).

When I play with the model I've built, I see the following result:

>>> doc = nlp('PME-Legend Trui PKW211303')
>>> for token in doc:
...     print(token.text, token.has_vector, token.vector_norm, token.is_oov)
...
PME-Legend False 0.0 True
Trui True 23.873672 False
PKW211303 False 0.0 True

So the brand name and model ID are out-of-vocabulary words.

But I do believe that, for my use case, the brand can be a very important signal for the classifier. I found issue #2624 on GitHub. According to it, I can add the brands to the vocabulary. But does that make any sense? I cannot provide vectors for them. Will spaCy use OOV tokens when training the classifier, or does it just ignore those tokens?

Answered by polm

Oct 6, 2021

polm · 2021-10-06T02:59:53Z

spaCy will be able to recognize that OOV tokens are different words and train the model to be cognizant of that using token attributes like ORTH (the raw text), LOWER, SHAPE, etc. However, it is true that the word vectors for OOV tokens will not be useful, so you should definitely try training your own word vectors.

Note that in general this still won't help with tokens that weren't seen at all at training time, like new product codes or something, though you can usually ignore those. It looks like you're classifying product titles, I assume into retail categories (like "men's clothing > tops > sweaters" or something), in which case the only necessary information in your title really would be "trui" (assuming that is Dutch for "sweater").
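To make the point about attributes concrete, here is a minimal pure-Python sketch of the kind of surface features involved. It is an approximation written for illustration, not spaCy's implementation: `rough_shape` and `surface_features` are hypothetical helper names, and the shape rule (letters mapped to X/x, digits to d, long runs truncated) is assumed to behave roughly like the SHAPE attribute. In spaCy itself the corresponding values are exposed as `token.orth_`, `token.lower_` and `token.shape_`.

```python
def rough_shape(text, max_run=4):
    """Approximate a SHAPE-like feature: X/x for letters, d for digits.

    Runs of the same mapped character are cut off after `max_run`
    characters. This is an illustrative assumption, not spaCy's code.
    """
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            mapped = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            mapped = "d"
        else:
            mapped = ch  # punctuation such as "-" is kept as-is
        run = run + 1 if mapped == last else 1
        last = mapped
        if run <= max_run:
            shape.append(mapped)
    return "".join(shape)


def surface_features(text):
    """Collect features that are available even when a token has no vector."""
    return {"ORTH": text, "LOWER": text.lower(), "SHAPE": rough_shape(text)}


# A product code seen in training and a hypothetical unseen one.
seen = surface_features("PKW211303")
unseen = surface_features("PKW211304")

for word in ["PME-Legend", "Trui", "PKW211303", "PKW211304"]:
    print(surface_features(word))
```

In this sketch the two product codes differ in ORTH but share a shape, which is why a brand name seen during training can carry signal through ORTH or LOWER, while a brand-new product code contributes little beyond its shape.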

4 replies

frutik Oct 6, 2021
Author

Thank you for the detailed explanation!

> in which case the only necessary information in your title really would be "trui"

Sometimes my current classifier (trained on top of the Dutch model) assigns products to completely wrong categories, for example "Car parts" instead of "Clothing". I assumed that brands could be specific to the top-level categories (I am using only one level in my class taxonomy). So spaCy would be able to "understand" that "PME Legend" was never seen in "Car parts" and is widely represented in "Clothing".

frutik Oct 6, 2021
Author

> you should definitely try training your own word vectors.

Any suggestions or links on that topic?

The only way I see to do that right now is to train word2vec with gensim and try to use it with spaCy. But I don't think this is the proper way to go.

polm Oct 6, 2021

> The only way to do that I see right now is to train word2vec with gensim and try to use it with spacy. But I don't think this is the proper way to go

That is totally fine. I would also suggest trying fastText (though I think you can technically train that with gensim). After you've trained vectors using another tool, take a look at the static vectors section in the docs.
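As a rough illustration of the hand-off between tools, externally trained vectors are commonly exchanged as a word2vec-style text file: a header line with the number of rows and the vector width, then one word per line followed by its values. The sketch below writes that format using only the standard library. The function name `write_word2vec_text` and the toy values are made up for this example; real vectors would come from whatever tool you trained them with.

```python
import io


def write_word2vec_text(vectors, stream):
    """Write a {word: [floats]} mapping in word2vec text format."""
    if not vectors:
        raise ValueError("no vectors to write")
    widths = {len(values) for values in vectors.values()}
    if len(widths) != 1:
        raise ValueError("all vectors must have the same width")
    width = widths.pop()
    # Header line: "<number of words> <vector width>"
    stream.write(f"{len(vectors)} {width}\n")
    for word, values in vectors.items():
        if any(ch.isspace() for ch in word):
            raise ValueError(f"word contains whitespace: {word!r}")
        row = " ".join(f"{value:.6f}" for value in values)
        stream.write(f"{word} {row}\n")


# Toy 3-dimensional vectors, invented purely for illustration.
toy_vectors = {
    "PME-Legend": [0.12, -0.40, 0.33],
    "Trui": [0.10, -0.38, 0.29],
    "PKW211303": [-0.55, 0.07, 0.91],
}

buffer = io.StringIO()
write_word2vec_text(toy_vectors, buffer)
text = buffer.getvalue()
print(text)
```

Saved to a file, output like this is the sort of input that `python -m spacy init vectors` takes (language code, vectors file, output directory), as far as I understand the static vectors docs; check there for the exact options.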

We are working on our own vector training implementation, but it's not quite ready for release yet.

adrianeboyd Nov 9, 2021

An update: spaCy v3.2.0 now includes support for "floret" vectors, which use subwords to provide vectors for OOV tokens: https://spacy.io/usage/v3-2#vectors
