feat: llama3 tokenizer #64

omarkilani · 2024-05-21T23:37:03Z

I'm not sure if the project wants to support non-OpenAI models, but just in case someone comes along and needs this, here you are. :)

zurawiki

This code LGTM, just make sure to add the public URL for the llama3 tokenizer here:

tiktoken-rs/scripts/download_assets.sh

Lines 6 to 14 in 784f0e5

    
    export ASSETS=$(cat <<EOF
    https://openaipublic.blob.core.windows.net/gpt-2/encodings/main/vocab.bpe
    https://openaipublic.blob.core.windows.net/gpt-2/encodings/main/encoder.json
    https://openaipublic.blob.core.windows.net/encodings/r50k_base.tiktoken
    https://openaipublic.blob.core.windows.net/encodings/p50k_base.tiktoken
    https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken
    https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken
    EOF
    )

omarkilani · 2024-10-12T18:24:00Z

@zurawiki there are updates for 3.1 and 3.2 that need to be pushed also.

Will extract it from our stuff soon.

Thanks!

noppej · 2024-12-12T20:31:32Z

@zurawiki This will be a useful addition to make llama encodings accessible for Rust. @omarkilani ... thanks in advance for your contribution!

zurawiki · 2024-12-12T22:07:46Z

Hi guys, I'm open to supporting the llama tokenizer and merging this PR, but there are still two outstanding items here:

1. We need to create or update tokenizers for each of llama3.0, llama3.1, llama3.2, and (maybe) llama3.3.
2. All assets need to be added by URL to tiktoken-rs/scripts/download_assets.sh.
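
For the second item, a minimal sketch of how the asset list might be extended is shown here. The six OpenAI URLs are the ones quoted from `download_assets.sh` earlier in this thread; `LLAMA3_TOKENIZER_URL` and its value are hypothetical placeholders, since no public llama3 tokenizer URL is given in this conversation. The sketch also splits the assignment and the `export` into two statements so it behaves the same under stricter POSIX shells, which is an assumption about portability rather than something the project requires.

```shell
#!/usr/bin/env bash
# Sketch only: the asset list from scripts/download_assets.sh with one extra entry.
# LLAMA3_TOKENIZER_URL is a HYPOTHETICAL placeholder, not a real download location.
LLAMA3_TOKENIZER_URL="https://example.com/REPLACE_ME/llama3-tokenizer-placeholder"

# Assign first, then export, so the multi-line value is never word-split.
ASSETS=$(cat <<EOF
https://openaipublic.blob.core.windows.net/gpt-2/encodings/main/vocab.bpe
https://openaipublic.blob.core.windows.net/gpt-2/encodings/main/encoder.json
https://openaipublic.blob.core.windows.net/encodings/r50k_base.tiktoken
https://openaipublic.blob.core.windows.net/encodings/p50k_base.tiktoken
https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken
https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken
${LLAMA3_TOKENIZER_URL}
EOF
)
export ASSETS

# List what would be fetched; no network access happens in this sketch.
for url in $ASSETS; do
  echo "would download: ${url}"
done
```

Further entries for llama3.1, llama3.2, and llama3.3 would presumably follow the same pattern, one URL per line, once their public locations are known.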

feat: llama3 tokenizer

97a63b9

zurawiki requested changes Oct 12, 2024

noppej mentioned this pull request Dec 12, 2024

Support for other LLMs? #62

Open
