-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
language support #7
Comments
Hi, Thanks for the question. Technically there are two ways. One way is to extract as much data as possible, and apply a language filter, like what we did in this Chinese dataset. Another possible solution is to apply a logit processor at the beginning of the generator, which punishes all tokens that do not belong to a specific language. However, it is quite hard for some low-resource languages as they might share similar tokens. We plan to add a logit processor for low-resource language generation but it may take some time to figure out the most elegant solution. Once we have the solution, I will ping this issue and let you know! Here are some cases of using Magpie to extract Japanese and Korean dataset, which might be of your interest. Hope this response helps! 😀 |
Hi, It seems that there is another simple way to generate a specific language! You can refer to this gist for details. Basically, they just append 'spanish:' at the end of the pre-query template. |
@fly-dust thx for the great work! Do you think that adding a system message could help with generating data in a specific language or this would deteriorate the results? |
For anyone seeking information on this topic, I have written a simple article on it: |
Hi @anakin87, that's really nice work! Sorry for the delay in responding, as I was working on other projects... BTW, we also found that whether system prompts can control the language effectively depends on the model's data distribution. For example, we found that for multilingual tasks, the Qwen family seems to perform better. We hypothesize this is because they used more multilingual data for instruction tuning. |
Hi,
Thanks for the amazing work. Could you please tell me how I can extract instructions and responses in a specific language?
The text was updated successfully, but these errors were encountered: