FastText subword vectors support in Spacy3 #8454
As far as I can see, spaCy can't handle the OOV problem the way FastText does during NER/TEXTCAT training. This removes much of the incentive to train our own vectors in FastText, since they will just be converted into static vectors. I saw the discussion from 2019: apparently you were adding support for FastText, but it was still partial. Is there any more information on the state of that support? If it's still not supported, do you have any suggestions for handling OOV words in spaCy? Thanks a lot!
Replies: 1 comment 1 reply
The fasttext-ish code doesn't really work like fasttext, and it's not used by the statistical models, so I wouldn't recommend using it.
Static fasttext vectors are fine in spacy models as static vectors, so it's not a reason against using fasttext vs. glove, for instance, but no, it's not using the subword information for unknown tokens.
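To make the difference concrete, here is a minimal sketch (not spaCy's or fastText's actual code). A static table simply has no row for an unseen word, while a fastText-style model can still build a vector by averaging the vectors of the word's character n-grams. The helper names, the tiny dimension and the random weights are all made up for the example. Only the `<`/`>` word boundaries, the 3-6 n-gram range and the 2M bucket count follow fastText's defaults, and the hash is plain FNV-1a, which I'm not claiming is byte-compatible with fastText for non-ASCII text.

```python
import random

DIM = 4              # toy dimension, chosen only for readability
BUCKETS = 2_000_000  # fastText's default n-gram bucket count


def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams of the word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(wrapped) - n + 1)]


def fnv1a(text):
    """32-bit FNV-1a hash of the UTF-8 bytes (stand-in for fastText's hash)."""
    h = 2166136261
    for byte in text.encode("utf8"):
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
    return h


def bucket_row(index):
    """Hypothetical n-gram table: rows are generated lazily from the index
    so the example does not have to allocate 2M rows of random weights."""
    rng = random.Random(index)
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


# Hypothetical static table with a single known word.
STATIC_TABLE = {"apple": [0.1, 0.2, 0.3, 0.4]}


def static_lookup(word):
    """Static vectors: an unknown word gets an all-zero vector."""
    return STATIC_TABLE.get(word, [0.0] * DIM)


def subword_vector(word):
    """fastText-style OOV vector: the average of the n-gram vectors."""
    grams = char_ngrams(word)
    if not grams:
        return [0.0] * DIM
    total = [0.0] * DIM
    for gram in grams:
        row = bucket_row(fnv1a(gram) % BUCKETS)
        total = [t + r for t, r in zip(total, row)]
    return [t / len(grams) for t in total]


print(static_lookup("applesauce"))   # all zeros: the word is not in the table
print(subword_vector("applesauce"))  # non-zero: composed from n-grams
```

Once the vectors are exported as a static table, only the first kind of lookup is left, which is why the subword information is lost for unknown tokens.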
One of the ideas I've been working on a bit is to add an option for replacing the HashEmbed layer with a custom version of fasttext vectors where all tokens have a vector calculated from n-gram substrings, but with a compact embedding table rather than its default single-hash 2M ngram table. The fasttext part is implemented but the thinc/spacy part is not…
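A rough sketch of what that idea could look like, purely as an illustration and not the unfinished implementation mentioned above: every token, known or not, gets its vector from its character n-grams, and each n-gram is hashed several times into a small table whose rows are summed, instead of once into a 2M-row table. The table size, the number of seeds, the use of `zlib.crc32` as a seeded hash and all function names are assumptions made for the example.

```python
import random
import zlib

DIM = 8               # toy vector width
ROWS = 5000           # compact table, far smaller than 2M buckets
SEEDS = (0, 1, 2, 3)  # one hash per seed; the rows they select are summed

# Hypothetical embedding table with random weights. In a real model these
# would be learned (or derived from trained fastText vectors).
_rng = random.Random(0)
TABLE = [[_rng.uniform(-0.1, 0.1) for _ in range(DIM)] for _ in range(ROWS)]


def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams of the word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(wrapped) - n + 1)]


def ngram_rows(gram):
    """Map one n-gram to several table rows, one per seed.

    crc32 with a different start value is only a stand-in for a proper
    seeded hash function.
    """
    data = gram.encode("utf8")
    return [zlib.crc32(data, seed) % ROWS for seed in SEEDS]


def token_vector(word):
    """Sum the table rows selected by every n-gram of the word."""
    vec = [0.0] * DIM
    for gram in char_ngrams(word):
        for row in ngram_rows(gram):
            vec = [v + t for v, t in zip(vec, TABLE[row])]
    return vec


# Words that share substrings share n-grams, and therefore table rows.
shared = set(char_ngrams("running")) & set(char_ngrams("runner"))
print(sorted(shared))
print(token_vector("runningly"))  # works even for a word never seen before
```

Using several hashes per n-gram means two n-grams that collide in one row are still unlikely to collide in all of them, which is what makes a much smaller table workable.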