tokenization issue for code #61
If you are talking about the fast tokenizer, it was fixed in the main branch of transformers. AFAIK it hasn't been tagged in a release yet.
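A minimal sketch of the two workarounds this implies, until the fix lands in a tagged release. This is not an official recipe from the maintainers; the v1 repo name "openlm-research/open_llama_7b" is assumed from context.

```python
# Workaround 1: install transformers from the main branch to pick up the fix:
#   pip install git+https://github.com/huggingface/transformers.git
#
# Workaround 2: force the slow (SentencePiece) tokenizer, which does not
# exhibit the fast-tokenizer issue:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "openlm-research/open_llama_7b",  # assumed v1 repo name
    use_fast=False,  # sidestep the fast-tokenizer bug by using the slow tokenizer
)

# Inspect how whitespace-sensitive code is tokenized:
print(tokenizer.tokenize("def f(x):\n    return x"))
```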
Probably a duplicate of #40?
Check out our OpenLLaMA v2 model, which has a new tokenizer and is pretrained on a lot of code. The official release of that will happen very soon.
Can we use the old models, or how does this work? Do we just load the old model with the new tokenizer?
@brando90 The v2 model is a completely different one, trained on a new dataset mixture, so you'll need to load the new weights too.
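A hedged sketch of what "load the new weights too" might look like in practice, using the v2 repo linked above; the dtype choice and the generation call are illustrative assumptions, not instructions from the maintainers.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "openlm-research/open_llama_7b_v2"  # v2 repo linked in the thread

# Load the v2 tokenizer and v2 weights from the same repo; pairing the v2
# tokenizer with v1 weights would mismatch, as noted above.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```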
Got it, thanks!
I will assume v1 OpenLLaMA is basically unusable for code generation (what I want) and use only v2.
@brando90 Yeah. I imagine you probably want to use v2 almost always, since it is a better model overall.
Is this still a bug for tokenization? I want to use this for code. Thanks!