cc_net/tools/dl_cc_100.py fails to extract complete dataset #25

leezu · 2021-04-01T16:02:51Z

Running python3.7 cc_net/tools/dl_cc_100.py --outdir data/cc100 --processes 96 yields only 99 GB (277 GB uncompressed) of data across 10 languages:

780M    /mnt/data/cc100/bn_IN
2.0G    /mnt/data/cc100/hi_IN
25G     /mnt/data/cc100/id_ID
12G     /mnt/data/cc100/ko_KR
89M     /mnt/data/cc100/my_MM
25G     /mnt/data/cc100/sv_SE
270M    /mnt/data/cc100/sw_KE
6.7G    /mnt/data/cc100/th_TH
475M    /mnt/data/cc100/tl_XX
21G     /mnt/data/cc100/vi_VN

The script should provide all 100 languages listed in Figure 1 of https://arxiv.org/pdf/1911.02116.pdf.
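To see exactly which languages are absent after a run, the output directory can be compared against the expected language codes. The following is only a minimal sketch: `missing_languages` is a hypothetical helper, not part of cc_net, and the list of expected codes has to be supplied by the reader (it is not reproduced in this thread). It assumes one subdirectory per language code, as in the listing above.

```python
from pathlib import Path
from typing import Iterable, List


def missing_languages(outdir: str, expected: Iterable[str]) -> List[str]:
    """Return the expected language codes that have no non-empty directory in outdir.

    Hypothetical helper for illustration; not part of cc_net.
    """
    root = Path(outdir)
    missing = []
    for code in expected:
        lang_dir = root / code
        # A language counts as present only if its directory exists
        # and contains at least one entry.
        if not lang_dir.is_dir() or not any(lang_dir.iterdir()):
            missing.append(code)
    return sorted(missing)
```

For example, `missing_languages("/mnt/data/cc100", codes)` with `codes` holding the expected language codes would return the ones that were never written.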


leezu · 2021-04-05T23:50:35Z

Fortunately, the dataset is also available at http://data.statmt.org/cc-100/. Nevertheless, the code in this repo should be fixed, and ideally a link to http://data.statmt.org/cc-100/ should be included in the README. Thanks.

wangyong1122 · 2021-04-08T14:21:35Z

@leezu Hi, thank you very much for providing this website. I found that this website's download speed is slow and I also cannot download multiple files simultaneously. How do you solve this problem? Thanks.

leezu · 2021-04-08T17:31:48Z

@wangyong1122 In principle, you could try to avoid the IP-based throttling of statmt.org by using multiple machines with different IP addresses at the same time.
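On a single machine, one way to cope with slow, one-at-a-time downloads is to fetch the files sequentially and resume partial files after an interruption. The following is a rough sketch, not a tested recipe for this server. It assumes that the files are named `<code>.txt.xz` (check the index page) and that the server honours HTTP `Range` requests; neither is stated in this thread. The function names are hypothetical.

```python
import os
import urllib.error
import urllib.request

BASE_URL = "http://data.statmt.org/cc-100/"  # URL given in this thread


def resume_headers(dest: str) -> dict:
    """Build a Range header that continues from the bytes already on disk."""
    done = os.path.getsize(dest) if os.path.exists(dest) else 0
    return {"Range": f"bytes={done}-"} if done else {}


def download(url: str, dest: str, chunk_size: int = 1 << 20) -> None:
    """Fetch url into dest, appending to a partial file when the server allows it."""
    request = urllib.request.Request(url, headers=resume_headers(dest))
    try:
        with urllib.request.urlopen(request) as response:
            # 206 means the server honoured the Range header; otherwise start over.
            mode = "ab" if response.status == 206 else "wb"
            with open(dest, mode) as out:
                for chunk in iter(lambda: response.read(chunk_size), b""):
                    out.write(chunk)
    except urllib.error.HTTPError as err:
        # Assumption: 416 (range not satisfiable) means the file is already complete.
        if err.code != 416:
            raise


def download_languages(codes, outdir: str) -> None:
    """Download one file at a time, since parallel requests are reportedly throttled."""
    os.makedirs(outdir, exist_ok=True)
    for code in codes:
        name = f"{code}.txt.xz"  # assumed naming scheme, not confirmed in this thread
        download(BASE_URL + name, os.path.join(outdir, name))
```

Rerunning `download_languages` after a dropped connection would continue each partial file instead of starting from zero, provided the server supports range requests.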

wangyong1122 · 2021-04-13T06:57:54Z

@leezu I see. Thank you very much.

izaskr · 2021-11-08T13:58:24Z

> @wangyong1122 in principle, you could try to avoid the IP-based throttling of statmt.org by using multiple machines with different IP addresses at the same time.

Does this give the same data as downloading it from this repo (after specifying the desired language(s), deduplication etc.)? By "the same" I mean the same format and formatting. I would compare the two myself, but I cannot use this repo to download on a remote server.

zhangfanTJU · 2021-11-26T09:20:28Z

@gwenzek I also encountered the same problem. Do you have any plans to update the code? Thanks!
