From ad48471099b80afca31314a3bf8aec5d58b9bbcb Mon Sep 17 00:00:00 2001
From: Pavel Stranak
Date: Sun, 27 Aug 2023 22:38:29 +0200
Subject: [PATCH 1/3] Update README.md

Corrected and updated 2 URLs, marked third one as unavailable.
---
 README.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 05e67d4..44812a3 100644
--- a/README.md
+++ b/README.md
@@ -120,7 +120,7 @@ Benchmarks spanning multiple tasks.
 
 ### Monolingual Corpus
 
-- [AIBharat IndicCorp](https://ai4bharat.iitm.ac.in/indic-corp): contains 8.9 billion tokens from 12 Indian languages (including Indian English).
+- [AIBharat IndicCorp](https://ai4bharat.iitm.ac.in/indic-corp): contains 8.9 billion tokens from 12 Indian languages (including Indian English). URL not available as of 2023-08-27
 - [Wikipedia Dumps](https://dumps.wikimedia.org)
 - Common Crawl
 - [OSCAR Corpus](https://traces1.inria.fr/oscar): Released in 2019, large-scaled processed CommonCrawl.
@@ -128,7 +128,7 @@ Benchmarks spanning multiple tasks.
 - [CC-100 Corpus](): Facebook CommonCrawl extracted data. They provide scripts for processing CommonCrawl. StatMT has built a replica of the CC-100 corpus using these scripts. You can find it [HERE](http://data.statmt.org/cc-100). This corpus also has romanized corpora for some Indian languages.
 - [WMT NEWS Crawl](http://data.statmt.org/news-crawl)
 - [LDCIL Monolingual Corpus](https://data.ldcil.org)
-- [Charles University Hindi Monolingual Corpus](https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-625F-0)
+- [Charles University Hindi Monolingual Corpus](http://hdl.handle.net/11858/00-097C-0000-0023-6260-A)
 - [Charles University Urdu Monolingual Corpus](https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-65A9-5)
 - [IIT Bombay Hindi Monolingual Corpus](http://www.cfilt.iitb.ac.in/iitb_parallel/iitb_corpus_download/monolingual.hi.tgz)
 - [EMILLE Corpus (multiple Indian languages)](https://www.lancaster.ac.uk/fass/projects/corpus/emille/)
@@ -192,7 +192,7 @@ Benchmarks spanning multiple tasks.
 - [PMIndia](http://data.statmt.org/pmindia): Parallel corpus for En-Indian languages mined from _Mann ki Baat_ speeches of the PM of India ([paper](https://arxiv.org/abs/2001.09907)).
 - [OPUS corpus](http://opus.nlpl.eu/)
 - [WAT 2018 Parallel Corpus](http://lotus.kuee.kyoto-u.ac.jp/WAT/indic-multilingual/index.html): There may significant overlap between WAT and OPUS.
-- [Charles University English-Hindi Parallel Corpus](https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0001-BD17-1): This is included in the IITB parallel corpus.
+- [Charles University English-Hindi Parallel Corpus](http://hdl.handle.net/11858/00-097C-0000-0023-625F-0)
 - [Charles University English-Tamil Parallel Corpus](http://ufal.mff.cuni.cz/~ramasamy/parallel/html/)
 - [Charles University English-Odia Parallel Corpus v1.0](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2879)
 - [Charles University English-Odia Parallel Corpus v2.0](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3211)

From f08a6c33d4283220d5125d3fcfedda6c0e8fc727 Mon Sep 17 00:00:00 2001
From: Pavel Stranak
Date: Sun, 27 Aug 2023 22:39:14 +0200
Subject: [PATCH 2/3] Update CONTRIBUTORS.md

---
 CONTRIBUTORS.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/CONTRIBUTORS.md b/CONTRIBUTORS.md
index 2adc4bc..0bf8e5f 100644
--- a/CONTRIBUTORS.md
+++ b/CONTRIBUTORS.md
@@ -23,3 +23,4 @@
 - Kaushal Bhosale
 - Tahir Javed
 - Maharaja Brahma
+– Pavel Straňák

From c6c25293d5d6369d3bca93c5cd4f92da5fdf50d7 Mon Sep 17 00:00:00 2001
From: Pavel Stranak
Date: Sun, 27 Aug 2023 22:39:42 +0200
Subject: [PATCH 3/3] Update CONTRIBUTORS.md

---
 CONTRIBUTORS.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/CONTRIBUTORS.md b/CONTRIBUTORS.md
index 0bf8e5f..dc0c29e 100644
--- a/CONTRIBUTORS.md
+++ b/CONTRIBUTORS.md
@@ -23,4 +23,4 @@
 - Kaushal Bhosale
 - Tahir Javed
 - Maharaja Brahma
-– Pavel Straňák
+- Pavel Straňák