Resizing tokenizer leads to missing end token and garbage response? #1273

Open

Mark-DelGrande opened this issue Nov 10, 2024 · 1 comment

@Mark-DelGrande
I am using a ChatML template like this to format prompts:

def format_conversation(examples):
    conversations = examples['conversation']
    texts = []
    for convo in conversations:
        conversation_text = ''
        for turn in convo:
            role = turn['role']
            content = turn['content']
            # Format each turn using ChatML
            if role == 'user':
                conversation_text += f"<|im_start|>user\n{content}<|im_end|>\n"
            elif role == 'assistant':
                conversation_text += f"<|im_start|>assistant\n{content}<|im_end|>\n"
        texts.append(conversation_text)
    return {'text': texts}
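
For context, this is roughly how the formatter is applied (a minimal sketch; the JSON file name and the exact 'conversation' column layout are assumptions, not my literal setup):

from datasets import load_dataset

# hypothetical data file with a 'conversation' column of {'role', 'content'} turns
dataset = load_dataset("json", data_files="conversations.json")["train"]
dataset = dataset.map(format_conversation, batched=True)  # adds the formatted 'text' column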

When I used this code to resize the embeddings, I previously got back this response:

special_tokens_dict = {'additional_special_tokens': ['<|im_start|>', '<|im_end|>']}
tokenizer.add_special_tokens(special_tokens_dict)
model.resize_token_embeddings(len(tokenizer))

Embedding(128258, 4096)

Now I am getting back:

The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
The new lm_head weights will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
Embedding(128258, 4096, padding_idx=128004)

Not sure if this is related, but it feels like it might be. I tried setting `mean_resizing=False` and it still gave me back:

Embedding(128258, 4096, padding_idx=128004)
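
For reference, the flag from the warning is passed straight to resize_token_embeddings (a minimal sketch, assuming a transformers version recent enough to support the mean_resizing keyword):

model.resize_token_embeddings(len(tokenizer), mean_resizing=False)

Note that the returned Embedding repr looks the same either way, since the flag only changes how the new rows are initialized, not the embedding shape.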

After fine-tuning Llama 3.1 with the same code, my responses went from something like this:

Description: Active use case
Time left: 12:00

This is what I would like to get out of it, and it looks like the data I fine-tuned on, but the output has become:

Description: Active use case
Time left: 12:00actionDate
.<|end_of_text|><|begin_of_text|>://
<|end_of_text|><|begin_of_text|>://
�
�
팬
<|end_of_text|><|begin_of_text|>://
�
<|end_of_text|><|begin_of_text|>://
ி
고
")));<|end_of_text|><|begin_of_text|>://
<|end_of_text|><|begin_of_text|>://
<|end_of_text|><|begin_of_text|>://
안
"]);<|end_of_text|><|begin_of_text|>://
토크
��
<|end_of_text|><|begin_of_text|>://
t
o
")));<|end_of_text|><|begin_of_text|>://
y
i
"]);<|end_of_text|><|begin_of_text|>://
현재
")));<|end_of_text|><|begin_of_text|>://
<|end_of_text|><|begin_of_text|>://

프
")));<|end_of_text|><|begin_of_text|>://
<|end_of_text|><|begin_of_text|>://
")));<|end_of_text|><|begin_of_text|>://
멘
")));"),"...
�
actionDate
 ActiveForm

Does anyone have any ideas on whether something changed, and how I can get my end token to be detected again?

@danielhanchen
Contributor

Oh wait, please use https://github.com/unslothai/unsloth/wiki#adding-new-tokens, i.e.:

model, tokenizer = FastLanguageModel.from_pretrained(...)
from unsloth import add_new_tokens
# add the new tokens BEFORE creating the PEFT model
add_new_tokens(model, tokenizer, new_tokens = ["<CHARACTER_1>", "<THINKING>", "<SCRATCH_PAD>"])
model = FastLanguageModel.get_peft_model(...)
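
Adapted to the ChatML tokens from the original post, a minimal sketch might look like this (the model name and LoRA arguments are placeholders, not a confirmed config):

from unsloth import FastLanguageModel, add_new_tokens

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",  # placeholder model
    max_seq_length = 2048,
)

# add_new_tokens handles the embedding resize internally,
# so no separate resize_token_embeddings call is needed
add_new_tokens(model, tokenizer, new_tokens = ["<|im_start|>", "<|im_end|>"])

# create the LoRA adapter only after the tokens are added
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,  # placeholder LoRA rank
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
)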
