cc_net/tools/dl_cc_100.py fails to extract complete dataset #25
Comments
Fortunately, the dataset is also available at http://data.statmt.org/cc-100/. Nevertheless, the code in the current repo should be fixed, and ideally a link to http://data.statmt.org/cc-100/ should be added to the README. Thanks.
@leezu Hi, thank you very much for pointing out this website. I found that its download speed is slow and that I cannot download multiple files simultaneously. How did you solve this problem? Thanks.
@wangyong1122 In principle, you could try to avoid the IP-based throttling of statmt.org by using multiple machines with different IP addresses at the same time.
@leezu I see. Thank you very much.
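For anyone who only has a single machine, a minimal sketch of a sequential, resumable download from the mirror might look like the following. It assumes that each language is published as one file named `<lang>.txt.xz` directly under http://data.statmt.org/cc-100/ and that the server honours HTTP `Range` requests; both are assumptions, not something confirmed in this thread. The helper names `cc100_url` and `download` are hypothetical.

```python
import shutil
import urllib.error
import urllib.request
from pathlib import Path

BASE_URL = "http://data.statmt.org/cc-100/"


def cc100_url(lang: str) -> str:
    # Assumed naming scheme: one xz-compressed text file per language code.
    return f"{BASE_URL}{lang}.txt.xz"


def download(lang: str, outdir: Path, chunk_size: int = 1 << 20) -> Path:
    """Download one language file, resuming a partial file if possible."""
    outdir.mkdir(parents=True, exist_ok=True)
    target = outdir / f"{lang}.txt.xz"
    done = target.stat().st_size if target.exists() else 0

    request = urllib.request.Request(cc100_url(lang))
    if done:
        # Ask the server for the remaining bytes only.
        request.add_header("Range", f"bytes={done}-")
    try:
        with urllib.request.urlopen(request) as response:
            # 206 means the Range header was honoured; otherwise start over.
            mode = "ab" if done and response.status == 206 else "wb"
            with open(target, mode) as out:
                shutil.copyfileobj(response, out, chunk_size)
    except urllib.error.HTTPError as err:
        # 416 (range not satisfiable) usually means the file is already complete.
        if not (done and err.code == 416):
            raise
    return target
```

Calling `download("af", Path("data/cc100"))` in a loop over the desired language codes fetches one file at a time, which stays within the single-connection limit described above and lets an interrupted run pick up where it stopped.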
Does this give the same data as downloading it with this repo (after specifying the desired language(s), deduplication, etc.)? By "the same" I mean the same format and formatting. I would compare the two myself, but I cannot use this repo to download on a remote server.
@gwenzek I encountered the same problem. Do you have any plans to update the code? Thanks!
python3.7 cc_net/tools/dl_cc_100.py --outdir data/cc100 --processes 96
provides only 99 GB (277 GB uncompressed) of data across 10 languages. The script should provide all 100 languages listed in Figure 1 of https://arxiv.org/pdf/1911.02116.pdf.
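To see which languages are missing after a run, a small check along these lines may help. It is only a sketch: it assumes that each output file sits directly in the output directory and that its name starts with the language code followed by `.` or `_`, which may not match what `dl_cc_100.py` actually writes. The function name `missing_languages` and the short `expected` list are hypothetical; the full list of codes would have to come from the paper.

```python
import re
from pathlib import Path
from typing import Iterable, List


def missing_languages(outdir: Path, expected: Iterable[str]) -> List[str]:
    """Return the expected language codes that have no non-empty file in outdir."""
    present = set()
    for path in outdir.iterdir():
        if path.is_file() and path.stat().st_size > 0:
            # Assumed layout: "<lang>.<ext>" or "<lang>_<anything>".
            present.add(re.split(r"[._]", path.name, maxsplit=1)[0])
    return sorted(code for code in set(expected) if code not in present)


# Example call with a hypothetical subset of codes:
# missing_languages(Path("data/cc100"), ["af", "am", "ar"])
```

Empty files are treated as missing, since an interrupted extraction can leave zero-byte outputs behind.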