Implement max character check #398

samlhuillier · 2023-11-16T22:17:40Z

Returns the unknown token for any word longer than the param maxInputCharsPerWord (defaulting to 100 characters). Fixes #397
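A minimal sketch of the check described above, assuming a greedy longest-match WordPiece loop. The function name `wordPieceEncode`, its option names, and the toy vocabulary are hypothetical and are not taken from src/tokenizers.js; the sketch only illustrates where a per-word length cap could sit.

```javascript
// Hypothetical sketch of WordPiece encoding with a per-word length cap.
// This is NOT the library's implementation; all names here are illustrative.
function wordPieceEncode(words, vocab, {
  unkToken = '[UNK]',
  continuingSubwordPrefix = '##',
  maxInputCharsPerWord = 100,
} = {}) {
  const output = [];
  for (const word of words) {
    const chars = [...word]; // split by code point rather than UTF-16 unit

    // The length check: over-long words map straight to the unknown token,
    // so the nested substring search below never runs for them.
    if (chars.length > maxInputCharsPerWord) {
      output.push(unkToken);
      continue;
    }

    const pieces = [];
    let start = 0;
    let isUnknown = false;
    while (start < chars.length) {
      // Greedy longest-match: shrink the candidate from the right until it is in the vocab.
      let end = chars.length;
      let match = null;
      while (start < end) {
        let piece = chars.slice(start, end).join('');
        if (start > 0) piece = continuingSubwordPrefix + piece;
        if (vocab.has(piece)) {
          match = piece;
          break;
        }
        --end;
      }
      if (match === null) {
        isUnknown = true;
        break;
      }
      pieces.push(match);
      start = end;
    }

    if (isUnknown) {
      output.push(unkToken);
    } else {
      output.push(...pieces);
    }
  }
  return output;
}

// Toy vocabulary, made up for this example.
const vocab = new Set(['token', '##izer', '##s', '+', '=']);
const longWord = 'A'.repeat(101); // one character over the default cap
const result = wordPieceEncode(['tokenizers', longWord, '+'], vocab);
console.log(result); // [ 'token', '##izer', '##s', '[UNK]', '+' ]
```

In this sketch an unmatched long word would end up as the unknown token even without the cap, but only after the inner loop has tried every shrinking candidate. The early exit skips that search, which is the kind of cost that matters for inputs such as the long base64 strings mentioned in #397.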

xenova

Thanks so much! Just note that the config is instantiated from the tokenizer.json and, as a result, relies on the exact same casing, which in this case means the key has to be max_input_chars_per_word.

See https://huggingface.co/Xenova/bert-base-uncased/raw/main/tokenizer.json
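To illustrate the casing point, here is a small sketch of why a camelCase lookup would quietly miss the setting. The JSON fragment below is a made-up stand-in shaped like a tokenizer config, not the contents of the linked file, and the value 5 is chosen only to make the difference visible. The `?? 100` fallback is likewise an assumption of this sketch.

```javascript
// Hypothetical config fragment; the real tokenizer.json is not reproduced here.
const config = JSON.parse(`{
  "type": "WordPiece",
  "unk_token": "[UNK]",
  "max_input_chars_per_word": 5
}`);

// camelCase key: absent from the JSON, so the lookup yields undefined
// and the default silently takes over.
const fromCamelCase = config.maxInputCharsPerWord ?? 100;

// snake_case key: matches the JSON exactly, so the configured value is used.
// `??` falls back only on null or undefined, not on other falsy values.
const fromSnakeCase = config.max_input_chars_per_word ?? 100;

console.log(fromCamelCase); // 100
console.log(fromSnakeCase); // 5
```

No error is raised in the camelCase case, which is why a mismatch like this can be easy to overlook when the configured value happens to equal the default.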

I also think it will be useful to add a test case for this here, something like:
https://github.com/xenova/transformers.js/blob/4e4148cb5ce7f4a9265f58b4eeb660c64bed0386/tests/tokenizers.test.js#L42-L50.

src/tokenizers.js

Co-authored-by: Joshua Lochner <[email protected]>

xenova · 2023-11-17T19:55:03Z

Thanks so much! 🚀 The output now matches the Python library:

It took  1.5605939999222755 ms to tokenize
Encoded:  [101,  100, 1009, 100, 1027, 1027, 102]
Decoded:  [CLS] [UNK] + [UNK] = = [SEP]

(The time taken is now also in the same range as the Python library: 1.67 ms for Python versus 1.56 ms for JS. These are not extensive tests; I just ran each a few times and took the average.)

samlhuillier · 2023-11-18T02:04:58Z

Amazing! I shall update my package :)

xenova · 2023-11-18T02:32:58Z

A new release (2.8.1) will be out shortly, but in the meantime, you can install from source (GitHub) if this fix is urgent.

Implement max character check per token

a1d3526

samlhuillier changed the title ~~Implement max character check per token~~ Implement max character check Nov 16, 2023

samlhuillier mentioned this pull request Nov 16, 2023

[Question] Tokenizing a base64 for string is very slow? #397

Closed

xenova requested changes Nov 16, 2023

samlhuillier and others added 3 commits November 16, 2023 16:29

Update maxInputCharsPerWord to max_input_chars_per_word

b3d094b

Co-authored-by: Joshua Lochner <[email protected]>

Update maxInputCharsPerWord to max_input_chars_per_word

6d8155d

Co-authored-by: Joshua Lochner <[email protected]>

Update to ??

e639941

Co-authored-by: Joshua Lochner <[email protected]>

xenova merged commit c8bbdd4 into huggingface:main Nov 17, 2023
