---
layout: default
title: Datasets
nav_order: 4
---

# Dataset importers

Dataset importers can be used in the `datasets` section of the training config.

Example:

```yaml
  train:
    - opus_ada83/v1
    - mtdata_newstest2014_ruen
```

Each entry has the form `<prefix>_<name>`, where the prefix selects one of the importers from the table below.
| Data source | Prefix | Name examples | Type | Comments |
| --- | --- | --- | --- | --- |
| MTData | `mtdata` | `newstest2017_ruen` | corpus | Supports many datasets. Run `mtdata list -l ru-en` to see the datasets for a specific language pair. |
| OPUS | `opus` | `ParaCrawl/v7.1` | corpus | Many open-source datasets. Go to the website, choose a language pair, and check the links in the Moses column to see what name and version are used in a link. |
| SacreBLEU | `sacrebleu` | `wmt20` | corpus | Official evaluation datasets available in the SacreBLEU tool. Recommended to use in the `datasets: test` config section. Look up supported datasets and language pairs in the `sacrebleu.dataset` Python module. |
| Flores | `flores` | `dev`, `devtest` | corpus | Evaluation dataset from Facebook that supports 100 languages. |
| Custom parallel | `custom-corpus` | `/tmp/test-corpus` | corpus | Custom parallel dataset that is already downloaded to a local disk. The dataset name is an absolute path prefix without `.lang.gz`. |
| Paracrawl | `paracrawl-mono` | `paracrawl8` | mono | Datasets crawled from the web. Only monolingual datasets are used in this importer; the parallel corpus is available through the `opus` importer. |
| News crawl | `news-crawl` | `news.2019` | mono | Monolingual news datasets from WMT21. |
| Common crawl | `commoncrawl` | `wmt16` | mono | Huge web-crawl datasets. The links are posted on WMT21. |
| Custom mono | `custom-mono` | `/tmp/test-mono` | mono | Custom monolingual dataset that is already downloaded to a local disk. The dataset name is an absolute path prefix without `.lang.gz`. |
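
Putting the prefixes together, a fuller `datasets` section might look like the sketch below. The dataset names are only illustrative, and the monolingual section name is an assumption; check the full training config for the exact schema. The `custom-corpus` entry assumes the files `/tmp/test-corpus.en.gz` and `/tmp/test-corpus.ru.gz` already exist on disk.

```yaml
datasets:
  train:
    - opus_ParaCrawl/v7.1
    - mtdata_newstest2017_ruen
    # path prefix; resolves to /tmp/test-corpus.<lang>.gz
    - custom-corpus_/tmp/test-corpus
  test:
    - sacrebleu_wmt20
    - flores_devtest
  # section name for monolingual data is an assumption
  mono-src:
    - news-crawl_news.2019
```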

You can also use the `find-corpus` tool to find all datasets for an importer, formatted for use in the config.

Set up a local poetry environment first:

```bash
make install-utils
```

Then run the tool for each importer:

```bash
python utils/find-corpus.py en ru opus
python utils/find-corpus.py en ru mtdata
python utils/find-corpus.py en ru sacrebleu
```

Make sure to check the licenses of the datasets before using them.

## Adding a new importer

Add a shell script named `<prefix>.sh` to the `corpus` or `mono` folder; it should accept the same parameters as the other scripts in that folder.
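
For illustration, a minimal sketch of a new corpus importer is shown below. The parameter order, the output naming, and the download URL are all assumptions; mirror the signatures and conventions of the existing scripts in the folder.

```bash
#!/bin/bash
# corpus/myprefix.sh — hypothetical importer sketch, not the project's real API.
# Assumed invocation (check the neighboring scripts for the actual signature):
#   myprefix.sh <src> <trg> <output_prefix> <dataset>
set -euo pipefail

src=$1            # source language code, e.g. "en"
trg=$2            # target language code, e.g. "ru"
output_prefix=$3  # absolute path prefix for the output files
dataset=$4        # dataset name from the config, without the importer prefix

# Fetch each side of the corpus and store it gzip-compressed as
# <output_prefix>.<lang>.gz, matching the layout the custom importers expect.
for lang in "$src" "$trg"; do
  wget -qO- "https://example.com/${dataset}.${lang}.txt" \
    | gzip -c > "${output_prefix}.${lang}.gz"
done
```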