language support #7

sazzadhrz · 2024-06-25T19:57:13Z

Hi,

Thanks for the amazing work. Could you please tell me how I can extract instructions and responses in a specific language?

zhangchen-xu · 2024-06-26T02:13:05Z

Hi,

Thanks for the question. Technically there are two ways. One way is to extract as much data as possible, and apply a language filter, like what we did in this Chinese dataset.

Another possible solution is to apply a logit processor at the beginning of the generator, which punishes all tokens that do not belong to a specific language. However, it is quite hard for some low-resource languages as they might share similar tokens. We plan to add a logit processor for low-resource language generation but it may take some time to figure out the most elegant solution. Once we have the solution, I will ping this issue and let you know!

Here are some cases of using Magpie to extract Japanese and Korean dataset, which might be of your interest.

Hope this response helps! 😀

zhangchen-xu · 2024-07-03T03:46:28Z

Hi,

It seems that there is another simple way to generate a specific language! You can refer to this gist for details. Basically, they just append 'spanish:' at the end of the pre-query template.

anakin87 · 2024-09-12T00:02:34Z

@fly-dust thx for the great work!

Do you think that adding a system message could help with generating data in a specific language or this would deteriorate the results?

anakin87 · 2024-10-21T13:52:13Z

For anyone seeking information on this topic, I have written a simple article on it:
🇮🇹🇯🇵🇧🇷 Generating multilingual instruction datasets with Magpie 🐦‍⬛

zhangchen-xu · 2024-10-22T05:39:59Z

Hi @anakin87, that's really nice work! Sorry for the delay in responding, as I was working on other projects...

BTW, we also found that whether system prompts can control the language effectively depends on the model's data distribution. For example, we found that for multilingual tasks, the Qwen family seems to perform better. We hypothesize this is because they used more multilingual data for instruction tuning.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

language support #7

language support #7

sazzadhrz commented Jun 25, 2024

zhangchen-xu commented Jun 26, 2024

zhangchen-xu commented Jul 3, 2024

anakin87 commented Sep 12, 2024

anakin87 commented Oct 21, 2024

zhangchen-xu commented Oct 22, 2024

language support #7

language support #7

Comments

sazzadhrz commented Jun 25, 2024

zhangchen-xu commented Jun 26, 2024

zhangchen-xu commented Jul 3, 2024

anakin87 commented Sep 12, 2024

anakin87 commented Oct 21, 2024

zhangchen-xu commented Oct 22, 2024