Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

language support #7

Open
sazzadhrz opened this issue Jun 25, 2024 · 5 comments
Open

language support #7

sazzadhrz opened this issue Jun 25, 2024 · 5 comments

Comments

@sazzadhrz
Copy link

Hi,

Thanks for the amazing work. Could you please tell me how I can extract instructions and responses in a specific language?

@zhangchen-xu
Copy link
Contributor

Hi,

Thanks for the question. Technically there are two ways. One way is to extract as much data as possible, and apply a language filter, like what we did in this Chinese dataset.

Another possible solution is to apply a logit processor at the beginning of the generator, which punishes all tokens that do not belong to a specific language. However, it is quite hard for some low-resource languages as they might share similar tokens. We plan to add a logit processor for low-resource language generation but it may take some time to figure out the most elegant solution. Once we have the solution, I will ping this issue and let you know!

Here are some cases of using Magpie to extract Japanese and Korean dataset, which might be of your interest.

Hope this response helps! 😀

@zhangchen-xu
Copy link
Contributor

Hi,

It seems that there is another simple way to generate a specific language! You can refer to this gist for details. Basically, they just append 'spanish:' at the end of the pre-query template.

@anakin87
Copy link

@fly-dust thx for the great work!

Do you think that adding a system message could help with generating data in a specific language or this would deteriorate the results?

@anakin87
Copy link

For anyone seeking information on this topic, I have written a simple article on it:
🇮🇹🇯🇵🇧🇷 Generating multilingual instruction datasets with Magpie 🐦‍⬛

@zhangchen-xu
Copy link
Contributor

Hi @anakin87, that's really nice work! Sorry for the delay in responding, as I was working on other projects...

BTW, we also found that whether system prompts can control the language effectively depends on the model's data distribution. For example, we found that for multilingual tasks, the Qwen family seems to perform better. We hypothesize this is because they used more multilingual data for instruction tuning.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants