
tokenization issue for code #61

Open
brando90 opened this issue Jun 27, 2023 · 7 comments

Comments

@brando90

Is this still a bug for tokenization? I want to use this for code. Thanks!

@gjmulder

If you are talking about the fast tokenizer, it was fixed in the main branch of transformers. AFAIK the fix hasn't been tagged in a release yet.
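
A minimal sketch of the workaround being discussed, assuming the Hugging Face repo id `openlm-research/open_llama_7b` for illustration: until the fast-tokenizer fix lands in a tagged transformers release, the slow SentencePiece tokenizer can be forced with `use_fast=False` and compared against the fast one on a code snippet.

```python
# Sketch: compare slow vs. fast tokenizer round-trips on a code snippet.
# The repo id "openlm-research/open_llama_7b" is assumed for illustration.
from transformers import AutoTokenizer

code = "def add(a, b):\n    return a + b\n"

slow_tok = AutoTokenizer.from_pretrained("openlm-research/open_llama_7b", use_fast=False)
fast_tok = AutoTokenizer.from_pretrained("openlm-research/open_llama_7b", use_fast=True)

# If the installed transformers release still has the bug, the fast round-trip
# can drop or merge whitespace, which matters for code.
print(repr(slow_tok.decode(slow_tok(code)["input_ids"], skip_special_tokens=True)))
print(repr(fast_tok.decode(fast_tok(code)["input_ids"], skip_special_tokens=True)))
```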

@gjmulder

Probably a duplicate of #40?

@young-geng
Contributor

Check out our OpenLLaMA v2 model, which has a new tokenizer and is pretrained on a lot of code. The official release will happen very soon.
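
A minimal sketch of loading a v2 checkpoint with transformers; the repo id `openlm-research/open_llama_7b_v2` is an assumption based on the v1 naming scheme, and both the tokenizer and the weights are pulled from that same repo.

```python
# Sketch: load the OpenLLaMA v2 tokenizer and weights from one repo.
# "openlm-research/open_llama_7b_v2" is an assumed repo id.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openlm-research/open_llama_7b_v2"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```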

@brando90
Author

brando90 commented Jul 7, 2023 via email

@young-geng
Contributor

@brando90 The v2 model is a completely different one trained on a new data mixture, so you'll need to load the new weights too.
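
A quick sanity-check sketch of why tokenizer and weights must come from the same release: the v1 and v2 tokenizers were trained separately, so the same text generally maps to different token ids (both repo ids below are assumptions following the v1 naming).

```python
# Sketch: the v1 and v2 tokenizers are not interchangeable, so pairing the
# v2 tokenizer with v1 weights (or vice versa) feeds the model token ids it
# was never trained on. Repo ids are assumed.
from transformers import AutoTokenizer

tok_v1 = AutoTokenizer.from_pretrained("openlm-research/open_llama_7b", use_fast=False)
tok_v2 = AutoTokenizer.from_pretrained("openlm-research/open_llama_7b_v2", use_fast=False)

sample = "for i in range(10):\n    print(i)\n"
print(tok_v1(sample)["input_ids"])
print(tok_v2(sample)["input_ids"])  # generally a different id sequence
```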

@brando90
Author

brando90 commented Jul 7, 2023 via email

@young-geng
Contributor

@brando90 Yeah. I imagine you probably want to use v2 almost always since it is a better model overall.
