Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BOS tokens not properly added in some circumstances #345

Open
jquesnelle opened this issue Feb 13, 2025 · 1 comment · May be fixed by #346
Open

BOS tokens not properly added in some circumstances #345

jquesnelle opened this issue Feb 13, 2025 · 1 comment · May be fixed by #346

Comments

@jquesnelle
Copy link

When using DocumentTokenizer if an eos_token is specified the tokenizer post processor is replaced with one that appends and EOS. However, this has the effect of NOT placing a BOS token at the beginning of the sequence.

See here: https://github.com/huggingface/datatrove/blob/main/src/datatrove/utils/tokenization.py#L55

This can be reproduced by tokenizing with a tokenizer like Llama 3 and looking at the raw token values

@guipenedo
Copy link
Collaborator

That's a good point, would you be willing to make a PR?

@jquesnelle jquesnelle linked a pull request Feb 13, 2025 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants