-
Notifications
You must be signed in to change notification settings - Fork 837
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement max character check #398
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much! Just note that the config is instantiated with the tokenizer.json, and as a result, relies on the exact same casing... which in this case, means it has to be max_input_chars_per_word
See https://huggingface.co/Xenova/bert-base-uncased/raw/main/tokenizer.json:
I also think it will be useful to add a test case for this here, something like:
https://github.com/xenova/transformers.js/blob/4e4148cb5ce7f4a9265f58b4eeb660c64bed0386/tests/tokenizers.test.js#L42-L50.
Co-authored-by: Joshua Lochner <[email protected]>
Co-authored-by: Joshua Lochner <[email protected]>
Co-authored-by: Joshua Lochner <[email protected]>
Thanks so much! 🚀 The same output as the python library is obtained now:
(and the time taken is now in the same range as the python library: 1.67ms for python, 1.56ms for JS... not extensive tests, just ran a few times and took the average) |
Amazing! I shall update my package :) |
A new release (2.8.1) will be out shortly, but in the meantime, you can install from source (GitHub) if this fix is urgent. |
Returns unknown token for strings greater than param maxInputCharsPerWord (defaulting to 100). Fixes #397