Add MLQA #2622

Merged
merged 7 commits into from
Jan 15, 2025
1 change: 1 addition & 0 deletions lm_eval/tasks/README.md
@@ -78,6 +78,7 @@
| medqa | Multiple choice question answering based on the United States Medical License Exams. | |
| [mgsm](mgsm/README.md) | Benchmark of multilingual grade-school math problems. | Spanish, French, German, Russian, Chinese, Japanese, Thai, Swahili, Bengali, Telugu |
| [minerva_math](minerva_math/README.md) | Mathematics-focused tasks requiring numerical reasoning and problem-solving skills. | English |
| [mlqa](mlqa/README.md) | MultiLingual Question Answering benchmark dataset for evaluating cross-lingual question answering performance. | English, Arabic, German, Spanish, Hindi, Vietnamese, Simplified Chinese |
| [mmlu](mmlu/README.md) | Massive Multitask Language Understanding benchmark for broad domain language evaluation. Several variants are supported. | English |
| [mmlu_pro](mmlu_pro/README.md) | A refined set of MMLU, integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. | English |
| [mmlusr](mmlusr/README.md) | Variation of MMLU designed to be more rigorous. | English |
101 changes: 101 additions & 0 deletions lm_eval/tasks/mlqa/README.md
@@ -0,0 +1,101 @@
# MLQA

### Paper

Title: `MLQA: Evaluating Cross-lingual Extractive Question Answering`

Abstract: `https://arxiv.org/abs/1910.07475`

MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance.
MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages: English, Arabic,
German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with each QA instance available in
4 different languages on average.

Homepage: `https://github.com/facebookresearch/MLQA`


### Citation

```
@misc{lewis2020mlqaevaluatingcrosslingualextractive,
title={MLQA: Evaluating Cross-lingual Extractive Question Answering},
author={Patrick Lewis and Barlas Oğuz and Ruty Rinott and Sebastian Riedel and Holger Schwenk},
year={2020},
eprint={1910.07475},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/1910.07475},
}
```

### Groups, Tags, and Tasks

#### Groups

* Not part of a group yet

#### Tasks

Each task is named `mlqa_<context-lang>_<question-lang>` and is defined in the YAML file of the same name:
* `mlqa_ar_ar.yaml`
* `mlqa_ar_de.yaml`
* `mlqa_ar_vi.yaml`
* `mlqa_ar_zh.yaml`
* `mlqa_ar_en.yaml`
* `mlqa_ar_es.yaml`
* `mlqa_ar_hi.yaml`
* `mlqa_de_ar.yaml`
* `mlqa_de_de.yaml`
* `mlqa_de_vi.yaml`
* `mlqa_de_zh.yaml`
* `mlqa_de_en.yaml`
* `mlqa_de_es.yaml`
* `mlqa_de_hi.yaml`
* `mlqa_vi_ar.yaml`
* `mlqa_vi_de.yaml`
* `mlqa_vi_vi.yaml`
* `mlqa_vi_zh.yaml`
* `mlqa_vi_en.yaml`
* `mlqa_vi_es.yaml`
* `mlqa_vi_hi.yaml`
* `mlqa_zh_ar.yaml`
* `mlqa_zh_de.yaml`
* `mlqa_zh_vi.yaml`
* `mlqa_zh_zh.yaml`
* `mlqa_zh_en.yaml`
* `mlqa_zh_es.yaml`
* `mlqa_zh_hi.yaml`
* `mlqa_en_ar.yaml`
* `mlqa_en_de.yaml`
* `mlqa_en_vi.yaml`
* `mlqa_en_zh.yaml`
* `mlqa_en_en.yaml`
* `mlqa_en_es.yaml`
* `mlqa_en_hi.yaml`
* `mlqa_es_ar.yaml`
* `mlqa_es_de.yaml`
* `mlqa_es_vi.yaml`
* `mlqa_es_zh.yaml`
* `mlqa_es_en.yaml`
* `mlqa_es_es.yaml`
* `mlqa_es_hi.yaml`
* `mlqa_hi_ar.yaml`
* `mlqa_hi_de.yaml`
* `mlqa_hi_vi.yaml`
* `mlqa_hi_zh.yaml`
* `mlqa_hi_en.yaml`
* `mlqa_hi_es.yaml`
* `mlqa_hi_hi.yaml`

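The list above is the full cross product of the seven language codes, with the context language first and the question language second. A minimal sketch that regenerates the names, assuming the language order used in the list (the helper `mlqa_task_names` is hypothetical and not part of this PR):

```python
from itertools import product

# Language codes in the order they appear in the task list above.
LANGS = ["ar", "de", "vi", "zh", "en", "es", "hi"]


def mlqa_task_names(langs=LANGS):
    """Return one task name per (context language, question language) pair."""
    return [f"mlqa_{ctx}_{q}" for ctx, q in product(langs, repeat=2)]


names = mlqa_task_names()
print(len(names))   # 49
print(names[:3])    # ['mlqa_ar_ar', 'mlqa_ar_de', 'mlqa_ar_vi']
```

Joining the names with commas (`",".join(names)`) should give a string suitable for the harness's `--tasks` option, since no group covering all 49 tasks is defined yet.
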
### Checklist

For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?


If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
48 changes: 48 additions & 0 deletions lm_eval/tasks/mlqa/generate_tasks.py
@@ -0,0 +1,48 @@
# ruff: noqa: E731, E741
"""
Script to generate task YAMLs for the mlqa dataset.
Based on `tasks/bigbench/generate_tasks.py`.
"""

from datasets import get_dataset_config_names


chosen_subtasks = []

language_dict = {
    "en": "english",
    "es": "spanish",
    "hi": "hindi",
    "vi": "vietnamese",
    "de": "german",
    "ar": "arabic",
    "zh": "chinese",
}


def main() -> None:
    configs = get_dataset_config_names("facebook/mlqa", trust_remote_code=True)
    for config in configs:
        if len(config.split(".")) == 2:
            continue
        else:
            chosen_subtasks.append(config)
    assert len(chosen_subtasks) == 49
    for task in chosen_subtasks:
        file_name = f"{task.replace('.', '_')}.yaml"
        context_lang = file_name.split("_")[1]
        # Not using yaml to avoid tagging issues with !function
        with open(file_name, "w", encoding="utf-8") as f:
            f.write("# Generated by generate_tasks.py\n")

            # Manually writing the YAML-like content inside files to avoid tagging issues
            f.write("include: mlqa_common_yaml\n")
            f.write(f"task: {task.replace('.', '_')}\n")
            f.write(f"dataset_name: {task}\n")
            f.write(
                f"process_results: !function utils.process_results_{context_lang}\n"
            )


if __name__ == "__main__":
    main()
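
The filtering and file-name logic of the script can be exercised without downloading the dataset. The sketch below mirrors those two steps as pure functions; the function names and the sample config list are hypothetical, and the two-part name `mlqa-translate-test.ar` is assumed to be representative of the configs the script skips.

```python
def is_cross_lingual_config(config: str) -> bool:
    """Keep three-part names such as 'mlqa.ar.de'; skip two-part names."""
    return len(config.split(".")) != 2


def render_task_yaml(config: str) -> str:
    """Build the YAML text that the script writes for one config."""
    task = config.replace(".", "_")
    # The first language code after the 'mlqa' prefix is the context language.
    context_lang = task.split("_")[1]
    return (
        "# Generated by generate_tasks.py\n"
        "include: mlqa_common_yaml\n"
        f"task: {task}\n"
        f"dataset_name: {config}\n"
        f"process_results: !function utils.process_results_{context_lang}\n"
    )


# Hypothetical sample; the real list comes from get_dataset_config_names.
sample_configs = ["mlqa-translate-test.ar", "mlqa.ar.de", "mlqa.en.en"]
kept = [c for c in sample_configs if is_cross_lingual_config(c)]
print(kept)
print(render_task_yaml("mlqa.ar.de"))
```

Returning the text instead of writing it directly makes the output easy to compare against the generated files that follow in this diff.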
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_ar_ar.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_ar_ar
dataset_name: mlqa.ar.ar
process_results: !function utils.process_results_ar
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_ar_de.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_ar_de
dataset_name: mlqa.ar.de
process_results: !function utils.process_results_ar
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_ar_en.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_ar_en
dataset_name: mlqa.ar.en
process_results: !function utils.process_results_ar
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_ar_es.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_ar_es
dataset_name: mlqa.ar.es
process_results: !function utils.process_results_ar
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_ar_hi.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_ar_hi
dataset_name: mlqa.ar.hi
process_results: !function utils.process_results_ar
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_ar_vi.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_ar_vi
dataset_name: mlqa.ar.vi
process_results: !function utils.process_results_ar
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_ar_zh.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_ar_zh
dataset_name: mlqa.ar.zh
process_results: !function utils.process_results_ar
22 changes: 22 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_common_yaml
@@ -0,0 +1,22 @@
dataset_path: facebook/mlqa
dataset_kwargs:
  trust_remote_code: true
test_split: test
validation_split: validation
output_type: generate_until
doc_to_text: "Context: {{context}}\n\nQuestion: {{question}}\n\nAnswer:"
doc_to_target: "{{answers}}"
process_docs: !function utils.process_docs
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
  - metric: f1
    aggregation: mean
    higher_is_better: true
generation_kwargs:
  until:
    - "\n"
  do_sample: false
metadata:
  version: 0.0
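
To make the shared config concrete, the following sketch shows the prompt that the `doc_to_text` template produces and the effect of the `until: ["\n"]` stop sequence. It uses plain string formatting in place of the harness's template rendering; the helper names, the sample document, and the sample generation are all hypothetical.

```python
def build_prompt(doc: dict) -> str:
    """Mirror the doc_to_text template with plain string formatting."""
    return f"Context: {doc['context']}\n\nQuestion: {doc['question']}\n\nAnswer:"


def truncate_at_stop(generation: str, stops=("\n",)) -> str:
    """Cut the generation at the earliest stop sequence, if any occurs."""
    cut = len(generation)
    for stop in stops:
        idx = generation.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return generation[:cut]


# Hypothetical document and model output, for illustration only.
doc = {
    "context": "Paris is the capital of France.",
    "question": "What is the capital of France?",
}
print(build_prompt(doc))
print(repr(truncate_at_stop(" Paris\nQuestion: next")))  # ' Paris'
```

Because generation stops at the first newline, only a single-line answer is scored, which suits short extractive spans.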
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_de_ar.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_de_ar
dataset_name: mlqa.de.ar
process_results: !function utils.process_results_de
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_de_de.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_de_de
dataset_name: mlqa.de.de
process_results: !function utils.process_results_de
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_de_en.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_de_en
dataset_name: mlqa.de.en
process_results: !function utils.process_results_de
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_de_es.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_de_es
dataset_name: mlqa.de.es
process_results: !function utils.process_results_de
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_de_hi.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_de_hi
dataset_name: mlqa.de.hi
process_results: !function utils.process_results_de
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_de_vi.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_de_vi
dataset_name: mlqa.de.vi
process_results: !function utils.process_results_de
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_de_zh.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_de_zh
dataset_name: mlqa.de.zh
process_results: !function utils.process_results_de
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_en_ar.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_en_ar
dataset_name: mlqa.en.ar
process_results: !function utils.process_results_en
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_en_de.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_en_de
dataset_name: mlqa.en.de
process_results: !function utils.process_results_en
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_en_en.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_en_en
dataset_name: mlqa.en.en
process_results: !function utils.process_results_en
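
Each task points at a per-language `process_results_<lang>` function in `utils.py`, which is not shown in this excerpt. As a rough guide to what the `exact_match` and `f1` metrics measure, here is a generic SQuAD-style sketch with English normalization. It is an assumption about the general technique, not the PR's implementation, and every function name in it is hypothetical.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def score(prediction: str, golds: list) -> dict:
    """Take the best score over all reference answers."""
    return {
        "exact_match": max(exact_match(prediction, g) for g in golds),
        "f1": max(f1_score(prediction, g) for g in golds),
    }


print(score("The Eiffel Tower", ["Eiffel Tower"]))
print(score("Eiffel Tower in Paris", ["Eiffel Tower"]))
```

Other languages would need their own article lists and, for Chinese, a tokenization that does not rely on whitespace, which is presumably why the configs select a function by context language.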
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_en_es.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_en_es
dataset_name: mlqa.en.es
process_results: !function utils.process_results_en
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_en_hi.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_en_hi
dataset_name: mlqa.en.hi
process_results: !function utils.process_results_en
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_en_vi.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_en_vi
dataset_name: mlqa.en.vi
process_results: !function utils.process_results_en
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_en_zh.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_en_zh
dataset_name: mlqa.en.zh
process_results: !function utils.process_results_en
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_es_ar.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_es_ar
dataset_name: mlqa.es.ar
process_results: !function utils.process_results_es
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_es_de.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_es_de
dataset_name: mlqa.es.de
process_results: !function utils.process_results_es
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_es_en.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_es_en
dataset_name: mlqa.es.en
process_results: !function utils.process_results_es
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_es_es.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_es_es
dataset_name: mlqa.es.es
process_results: !function utils.process_results_es
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_es_hi.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_es_hi
dataset_name: mlqa.es.hi
process_results: !function utils.process_results_es
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_es_vi.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_es_vi
dataset_name: mlqa.es.vi
process_results: !function utils.process_results_es
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_es_zh.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_es_zh
dataset_name: mlqa.es.zh
process_results: !function utils.process_results_es
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_hi_ar.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_hi_ar
dataset_name: mlqa.hi.ar
process_results: !function utils.process_results_hi
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_hi_de.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_hi_de
dataset_name: mlqa.hi.de
process_results: !function utils.process_results_hi
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_hi_en.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_hi_en
dataset_name: mlqa.hi.en
process_results: !function utils.process_results_hi
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_hi_es.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_hi_es
dataset_name: mlqa.hi.es
process_results: !function utils.process_results_hi
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_hi_hi.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_hi_hi
dataset_name: mlqa.hi.hi
process_results: !function utils.process_results_hi
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_hi_vi.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_hi_vi
dataset_name: mlqa.hi.vi
process_results: !function utils.process_results_hi
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_hi_zh.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_hi_zh
dataset_name: mlqa.hi.zh
process_results: !function utils.process_results_hi
5 changes: 5 additions & 0 deletions lm_eval/tasks/mlqa/mlqa_vi_ar.yaml
@@ -0,0 +1,5 @@
# Generated by generate_tasks.py
include: mlqa_common_yaml
task: mlqa_vi_ar
dataset_name: mlqa.vi.ar
process_results: !function utils.process_results_vi