OOV tokens and TextCategorization #9375
Hi all. I am trying to set up a pipeline for text classification (based on a few Medium publications, so my understanding may be pretty shallow). When I play with the model I've built, I see the following result
So, the brand name and model id are out-of-vocabulary words. But I do believe that, for my use case, the brand can be a very important signal for the classifier. I found this issue on GitHub: #2624. According to it, I can add the brands to the vocabulary. But does that make any sense? I cannot provide vectors for those. Will spaCy use OOV tokens to train the classifier, or does it just ignore those tokens?
Replies: 1 comment 4 replies
spaCy will be able to recognize that OOV tokens are different words, and it can train the model to be aware of that using lexical attributes like ORTH (the raw text), LOWER, SHAPE, etc. However, it is true that the word vectors for OOV tokens will not be useful, so you should definitely try training your own word vectors.

Note that in general this still won't help with tokens that weren't seen at all at training time, like new product codes, though you can usually ignore those. It looks like you're classifying product titles, I assume into retail categories (like "men's clothing > tops > sweaters" or something), in which case the only necessary information in your title really would be "trui" (assuming that is Dutch for "sweater").
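To make the point about lexical attributes a bit more concrete, here is a rough, dependency-free sketch of what ORTH-, LOWER- and SHAPE-style features look like for a product title. Note that `rough_shape` and `lexical_features` are hypothetical helpers written only for illustration and merely approximate the idea; they are not spaCy's code. In spaCy itself you would read these values from `token.orth_`, `token.lower_` and `token.shape_`. The title, brand ("Acme") and model ids are made up.

```python
def rough_shape(text):
    """Approximate a word-shape feature (illustrative, not spaCy's code).

    Uppercase letters become "X", lowercase letters "x", digits "d";
    other characters are kept. Runs of the same shape character are
    capped at four so long words collapse to a short pattern.
    """
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            s = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            s = "d"
        else:
            s = ch
        run = run + 1 if s == last else 1
        last = s
        if run <= 4:
            shape.append(s)
    return "".join(shape)


def lexical_features(title):
    """Split a title on whitespace and return per-token features."""
    return [
        {"orth": word, "lower": word.lower(), "shape": rough_shape(word)}
        for word in title.split()
    ]


# Hypothetical product title: "Acme" stands in for a brand, "XJ-900" for a model id.
for feats in lexical_features("Acme trui XJ-900"):
    print(feats)

# Two made-up product codes that differ as raw text but share one shape.
print(rough_shape("XJ-900") == rough_shape("QK-123"))
```

The last line shows why such attributes can matter for OOV tokens: two codes with different raw text can still map to the same shape, so a model may get some signal from a code it has never seen. How much that helps in practice depends on your data.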