
Question about the choice not to expand the Chinese vocabulary #16

Open

bytes-lost opened this issue Jun 9, 2023 · 2 comments

Comments

@bytes-lost

Section 3.1 of the paper says:

To improve the decoding efficiency of Chinese sentences, Cui et al. (2023) expand the vocabulary by adding common Chinese characters and re-training these newly added word embeddings along with the model parameters. However, our prior study shows that expanding the vocabulary does not seem to bring further improvement on downstream Chinese NLU tasks. We therefore choose to keep LLaMA’s vocabulary unchanged during the training.

Could you give a more detailed description of this prior study?
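For context, here is a minimal sketch of the vocabulary-expansion approach described in the quoted passage, assuming the Hugging Face transformers API and a placeholder LLaMA checkpoint; it is not the authors' code, and the token list is purely illustrative:

```python
from transformers import LlamaForCausalLM, LlamaTokenizer

model_name = "huggyllama/llama-7b"  # placeholder; any LLaMA checkpoint would do
tokenizer = LlamaTokenizer.from_pretrained(model_name)
model = LlamaForCausalLM.from_pretrained(model_name)

# Hypothetical new tokens; in Cui et al. (2023) these come from a SentencePiece
# model trained on a large Chinese corpus, not from a hand-written list.
new_tokens = ["中国", "我们", "学习"]
num_added = tokenizer.add_tokens(new_tokens)

# The new embedding rows are randomly initialized and must be (re-)trained,
# typically together with the rest of the model parameters.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens, vocabulary size is now {len(tokenizer)}")
```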

@AndrewZhe
Owner

We ran tests on several Chinese NLU tasks. Compared with whether or not the vocabulary is expanded, the number of training tokens probably has a significantly larger impact.

@Heepo

Heepo commented Jul 13, 2023

> We ran tests on several Chinese NLU tasks. Compared with whether or not the vocabulary is expanded, the number of training tokens probably has a significantly larger impact.

Same question here, and a follow-up: since the number of training tokens matters more, that means the model should be trained on more tokens, right? In that case, isn't training rather inefficient without a Chinese vocabulary? After all, for the same amount of Chinese text, the token count without vocabulary expansion is at least three times higher.
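The token-count claim can be checked directly against the original LLaMA tokenizer; a minimal sketch, assuming the Hugging Face transformers API and a placeholder checkpoint name:

```python
from transformers import LlamaTokenizer

# Placeholder checkpoint; the point is only the original, unexpanded LLaMA vocabulary.
tokenizer = LlamaTokenizer.from_pretrained("huggyllama/llama-7b")

text = "我们在部分中文自然语言理解任务上进行了测试。"
ids = tokenizer.encode(text, add_special_tokens=False)

# Without Chinese tokens in the vocabulary, most characters fall back to
# byte-level pieces, so the token count per character is well above 1.
print(f"characters: {len(text)}")
print(f"tokens: {len(ids)}")
print(f"tokens per character: {len(ids) / len(text):.2f}")
```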
