OOV tokens and TextCategorization #9375

Oct 5, 2021 · 1 comment · 4 replies

spaCy can still tell OOV tokens apart and train the model to be aware of the differences, because it uses lexical attributes such as ORTH (the raw text), LOWER, and SHAPE. It is true, however, that the word vectors for OOV tokens will not be useful, so you should definitely try training your own word vectors.
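
To make the role of those attributes concrete, here is a minimal standard-library sketch of the kind of surface features involved. It is not spaCy's code: `approx_shape` and `surface_features` are hypothetical helper names, and the shape rule (letters become `x`/`X`, digits become `d`, runs capped at four characters) is only an approximation of what the SHAPE attribute encodes. The example tokens are made up.

```python
def approx_shape(text: str, max_run: int = 4) -> str:
    """Approximate a word-shape feature (assumption: not spaCy's exact rule)."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation is kept as-is
        # Count how long the current run of identical shape characters is.
        if shape_char == last:
            run += 1
        else:
            run = 1
            last = shape_char
        # Cap long runs so "48213" and "482135" get the same shape.
        if run <= max_run:
            shape.append(shape_char)
    return "".join(shape)


def surface_features(token: str) -> dict:
    """Features that are available even when the token has no word vector."""
    return {
        "orth": token,           # raw text
        "lower": token.lower(),  # case-normalized text
        "shape": approx_shape(token),
    }


# Hypothetical tokens from a product title.
for tok in ["Sweater", "SKU-48213", "XK-77120"]:
    print(surface_features(tok))
```

Even without a vector, each token still yields distinct ORTH and LOWER values, and the two made-up product codes end up with similar shapes, which is the sort of signal a model can learn from.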

Note that, in general, this still won't help with tokens that weren't seen at all at training time, such as new product codes, though you can usually ignore those. It looks like you're classifying product titles, presumably into retail categories (like "men's clothing > tops > sweaters"), in which case the only necessary information in your title really would …

Answer selected by frutik
Labels: feat / textcat (Feature: Text Classifier)
3 participants