Spacy entity detection ner with non printable characters #9307

I'm a little confused about what you're asking here. The character you've included seems to be U+2022 BULLET, which is a normal printable character. A "non-printable character" would be something like the ASCII bell (U+0007).
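To make the distinction concrete, here is a minimal sketch using only the Python standard library. The helper name `describe_char` is made up for this example and is not part of spaCy; it just reports the code point, the Unicode general category, and whether Python considers the character printable.

```python
import unicodedata


def describe_char(ch: str) -> tuple:
    """Return (code point, Unicode general category, is printable) for one character.

    Hypothetical helper for illustration only; not a spaCy API.
    """
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), ch.isprintable())


# U+2022 BULLET is punctuation ("Po") and printable.
print(describe_char("\u2022"))  # ('U+2022', 'Po', True)

# The ASCII bell is a control character ("Cc") and not printable.
print(describe_char("\x07"))    # ('U+0007', 'Cc', False)
```

Running something like this over the characters in your text is a quick way to check whether you are really dealing with control characters or just unusual punctuation.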

It's not too surprising that these characters confused the models, since they wouldn't be in the training data. If you don't like the way spaCy is behaving around these characters, you can remove them, preprocess the text to surround them with spaces, or add a tokenizer exception.
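As a rough sketch of the first two options (removing the bullets, or padding them with spaces), something like the following could be run on the text before it is passed to the pipeline. It uses only the standard library; the function names `pad_bullets` and `strip_bullets` are hypothetical, and the sketch assumes the only problematic character is U+2022.

```python
import re

BULLET = "\u2022"  # assumption: the bullet is the only character causing trouble


def pad_bullets(text: str) -> str:
    """Surround every bullet with spaces so it ends up as its own whitespace-delimited token."""
    padded = text.replace(BULLET, f" {BULLET} ")
    # Collapse the runs of spaces/tabs introduced by padding; newlines are left alone.
    return re.sub(r"[ \t]+", " ", padded).strip()


def strip_bullets(text: str) -> str:
    """Replace every bullet with a space so neighbouring words do not get glued together."""
    without = text.replace(BULLET, " ")
    return re.sub(r"[ \t]+", " ", without).strip()


raw = "\u2022Apple\u2022Google"
print(pad_bullets(raw))    # • Apple • Google
print(strip_bullets(raw))  # Apple Google
```

The cleaned string would then be passed to the pipeline as usual, for example `nlp(pad_bullets(raw))`. Keep in mind that both functions change character offsets, so entity spans found on the cleaned text will not line up with positions in the original string.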

Also keep in mind the models are not perfect: #3052.

Answer selected by polm
Labels: usage (General spaCy usage), lang / en (English language data and models), feat / ner (Feature: Named Entity Recognizer)