spaCy entity detection (NER) with non-printable characters #9307
How do we handle text with non-printable characters so that spaCy's NER identifies all occurrences? In the example below, the output is:
Replies: 1 comment
I'm a little confused about what you're asking here. The character you've included seems to be U+2022 BULLET, which is a normal printable character. A "non-printable character" would be something like an ASCII bell.
It's not too surprising that these characters confused the models, since they wouldn't be in the training data. If you don't like the way spaCy is behaving around these characters, you can remove them, preprocess them to surround them with spaces, or add a tokenizer exception.
Also keep in mind the models are not perfect: #3052.
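For the first two options (removing the characters, or surrounding them with spaces), a minimal preprocessing sketch might look like the following. It uses only the Python standard library. The helper names `remove_bullets`, `pad_bullets`, and `strip_control_chars` are hypothetical, not part of spaCy, and the sample strings are made up for illustration. Note that collapsing whitespace changes character offsets relative to the original text.

```python
import re
import unicodedata

BULLET = "\u2022"  # U+2022 BULLET, the character identified in the reply


def remove_bullets(text: str) -> str:
    """Replace each bullet with a space, then collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.replace(BULLET, " ")).strip()


def pad_bullets(text: str) -> str:
    """Surround each bullet with spaces so it can be split off as its own token."""
    padded = text.replace(BULLET, f" {BULLET} ")
    return re.sub(r"\s+", " ", padded).strip()


def strip_control_chars(text: str) -> str:
    """Drop genuine control characters (Unicode category Cc, e.g. ASCII bell),
    keeping newlines and tabs."""
    return "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )


if __name__ == "__main__":
    sample = "Apple\u2022Google \u2022 Microsoft"  # hypothetical input
    print(remove_bullets(sample))
    print(pad_bullets(sample))
    print(strip_control_chars("ring\x07bell"))
```

The cleaned string would then be passed to the pipeline in place of the raw text, for example `nlp(pad_bullets(text))`, assuming `nlp` is an already loaded spaCy pipeline. Whether padding or removal gives better entity results probably depends on the data, so it is worth trying both.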