BOS tokens not properly added in some circumstances #345

jquesnelle · 2025-02-13T15:12:06Z

When using DocumentTokenizer, if an eos_token is specified, the tokenizer's post-processor is replaced with one that appends an EOS token. However, this has the side effect of NOT placing a BOS token at the beginning of the sequence.

See here: https://github.com/huggingface/datatrove/blob/main/src/datatrove/utils/tokenization.py#L55

This can be reproduced by tokenizing with a tokenizer like Llama 3's and looking at the raw token values.
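To illustrate the mechanism, here is a minimal stand-alone sketch. It is a toy model of template-style post-processing, not datatrove's or the tokenizers library's actual code. The `post_process` helper, the template lists, and the token ids are all hypothetical, and it assumes a tokenizer whose default template adds only a BOS token.

```python
# Toy model of a template-style post-processor.
# Hypothetical names and ids; this is NOT datatrove's or tokenizers' real code.
BOS_ID, EOS_ID = 1, 2  # assumed special-token ids


def post_process(token_ids, template):
    """Expand a template such as ["<bos>", "$A", "<eos>"] around token_ids."""
    special = {"<bos>": BOS_ID, "<eos>": EOS_ID}
    out = []
    for piece in template:
        if piece == "$A":
            out.extend(token_ids)  # the tokenized text itself
        else:
            out.append(special[piece])  # a special token named in the template
    return out


text_ids = [10, 11, 12]  # stand-in for the ids of a tokenized document

# Assumed default template: BOS only.
original = post_process(text_ids, ["<bos>", "$A"])
# Replacement template that only appends EOS: the BOS is silently lost.
replaced = post_process(text_ids, ["$A", "<eos>"])
# A template that names both special tokens keeps the BOS.
fixed = post_process(text_ids, ["<bos>", "$A", "<eos>"])

print("original:", original)
print("replaced:", replaced)
print("fixed:   ", fixed)
```

Because the replacement template fully overrides the original one, anything the original added (here, the BOS) has to be restated explicitly in the new template.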


guipenedo · 2025-02-13T16:05:22Z

That's a good point, would you be willing to make a PR?

jquesnelle linked a pull request Feb 13, 2025 that will close this issue

fix bos token missing #346

Open
