
Question about the choice not to expand the Chinese vocabulary #16

Open

bytes-lost opened this issue Jun 9, 2023 · 2 comments

Comments

@bytes-lost

Section 3.1 of the paper says:

To improve the decoding efficiency of Chinese sentences, Cui et al. (2023) expand the vocabulary by adding common Chinese characters and re-training these newly added word embeddings along with the model parameters. However, our prior study shows that expanding the vocabulary does not seem to bring further improvement on downstream Chinese NLU tasks. We therefore choose to keep LLaMA’s vocabulary unchanged during the training.

Could you give a more detailed description of this prior study?
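For context, here is a minimal sketch of the vocabulary-expansion approach described in the quoted passage, assuming the Hugging Face transformers API and a placeholder LLaMA checkpoint; it is not the authors' code, and the token list is purely illustrative:

```python
from transformers import LlamaForCausalLM, LlamaTokenizer

model_name = "huggyllama/llama-7b"  # placeholder; any LLaMA checkpoint would do
tokenizer = LlamaTokenizer.from_pretrained(model_name)
model = LlamaForCausalLM.from_pretrained(model_name)

# Hypothetical new tokens; in Cui et al. (2023) these come from a SentencePiece
# model trained on a large Chinese corpus, not from a hand-written list.
new_tokens = ["中国", "我们", "学习"]
num_added = tokenizer.add_tokens(new_tokens)

# The new embedding rows are randomly initialized and must be (re-)trained,
# typically together with the rest of the model parameters.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens, vocabulary size is now {len(tokenizer)}")
```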

@AndrewZhe
Owner

We ran tests on several Chinese NLU tasks. Compared with whether or not the vocabulary is expanded, the number of training tokens probably has a significantly larger impact.

@Heepo

Heepo commented Jul 13, 2023

> We ran tests on several Chinese NLU tasks. Compared with whether or not the vocabulary is expanded, the number of training tokens probably has a significantly larger impact.

Same question here, and a follow-up: since the number of training tokens matters more, that means the model should be trained on more tokens, right? In that case, isn't training rather inefficient without a Chinese vocabulary? After all, for the same amount of Chinese text, the token count without vocabulary expansion is at least three times higher.
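The token-count claim can be checked directly against the original LLaMA tokenizer; a minimal sketch, assuming the Hugging Face transformers API and a placeholder checkpoint name:

```python
from transformers import LlamaTokenizer

# Placeholder checkpoint; the point is only the original, unexpanded LLaMA vocabulary.
tokenizer = LlamaTokenizer.from_pretrained("huggyllama/llama-7b")

text = "我们在部分中文自然语言理解任务上进行了测试。"
ids = tokenizer.encode(text, add_special_tokens=False)

# Without Chinese tokens in the vocabulary, most characters fall back to
# byte-level pieces, so the token count per character is well above 1.
print(f"characters: {len(text)}")
print(f"tokens: {len(ids)}")
print(f"tokens per character: {len(ids) / len(text):.2f}")
```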
