Dear Authors, you have undoubtedly done an excellent job (domain-specific continued pre-training). I have a small question about the size of the FreeLaw data used in the paper. I downloaded the law data from https://huggingface.co/datasets/EleutherAI/pile/tree/refs%2Fconvert%2Fparquet/free_law/partial/train, but it appears to be much smaller than the 35 GB (16B tokens) reported in Table 7 of the paper: after processing with the LLaMA tokenizer I only get about 1.4B tokens. Could you confirm whether you used the data from this link or from another source?
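For reference, this is roughly how I estimated the 1.4B figure. A minimal sketch, not the authors' script: the tokenizer checkpoint (`huggyllama/llama-7b`), the local glob pattern, and the `text` column name are my assumptions.

```python
# Sketch: count LLaMA tokens in the downloaded FreeLaw parquet shards.
# Assumptions: shards saved under free_law/partial/train/, text column is "text",
# and huggyllama/llama-7b is a stand-in for the LLaMA tokenizer used in the paper.
from glob import glob

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")

# Parquet shards downloaded from the refs/convert/parquet branch of EleutherAI/pile.
files = sorted(glob("free_law/partial/train/*.parquet"))
ds = load_dataset("parquet", data_files=files, split="train")

def count_tokens(batch):
    # add_special_tokens=False so BOS/EOS markers don't inflate the estimate
    enc = tokenizer(batch["text"], add_special_tokens=False)
    return {"n_tokens": [len(ids) for ids in enc["input_ids"]]}

ds = ds.map(count_tokens, batched=True, remove_columns=ds.column_names, num_proc=8)
total = sum(ds["n_tokens"])
print(f"Total LLaMA tokens: {total / 1e9:.2f}B")
```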