A list of supported tasks and task groupings can be viewed with `lm-eval --tasks list`.
For more information, including a full list of task names and their precise meanings or sources, follow the links provided to the individual README.md files for each subfolder.
Task Family | Description | Language(s) |
---|---|---|
aclue | Tasks focusing on ancient Chinese language understanding and cultural aspects. | Ancient Chinese |
aexams | Tasks in Arabic related to various academic exams covering a range of subjects. | Arabic |
agieval | Tasks drawn from human-centric standardized exams, such as college entrance, law school admission, and math competition tests. | English, Chinese |
anli | Adversarial natural language inference tasks designed to test model robustness. | English |
arabicmmlu | Localized Arabic version of MMLU with multiple-choice questions from 40 subjects. | Arabic |
arc | Tasks involving complex reasoning over a diverse set of questions. | English |
arithmetic | Tasks involving numerical computations and arithmetic reasoning. | English |
asdiv | Tasks involving arithmetic and mathematical reasoning challenges. | English |
babi | Tasks designed as question and answering challenges based on simulated stories. | English |
basqueglue | Tasks designed to evaluate language understanding in Basque language. | Basque |
bbh | BIG-Bench Hard, a subset of challenging BIG-bench tasks focused on multi-step reasoning. | English, German |
belebele | Language understanding tasks in a variety of languages and scripts. | Multiple (122 languages) |
benchmarks | General benchmarking tasks that test a wide range of language understanding capabilities. | |
bertaqa | Local Basque cultural trivia QA tests in English and Basque languages. | English, Basque, Basque (MT) |
bigbench | Broad tasks from the BIG-bench benchmark designed to push the boundaries of large models. | Multiple |
blimp | Tasks testing grammatical phenomena to evaluate language model's linguistic capabilities. | English |
ceval | Tasks that evaluate language understanding and reasoning in an educational context. | Chinese |
cmmlu | Multi-subject multiple choice question tasks for comprehensive academic assessment. | Chinese |
code_x_glue | Tasks that involve understanding and generating code across multiple programming languages. | Go, Java, JS, PHP, Python, Ruby |
commonsense_qa | CommonsenseQA, a multiple-choice QA dataset for measuring commonsense knowledge. | English |
copal_id | Indonesian causal commonsense reasoning dataset that captures local nuances. | Indonesian |
coqa | Conversational question answering tasks to test dialog understanding. | English |
crows_pairs | Tasks designed to test model biases in various sociodemographic groups. | English, French |
csatqa | Tasks based on questions from the Korean College Scholastic Ability Test (CSAT). | Korean |
drop | Tasks requiring numerical reasoning, reading comprehension, and question answering. | English |
eq_bench | Tasks that assess emotional intelligence by having models rate the intensity of characters' emotions in dialogues. | English |
eus_exams | Tasks based on various professional and academic exams in the Basque language. | Basque |
eus_proficiency | Tasks designed to test proficiency in the Basque language across various topics. | Basque |
eus_reading | Reading comprehension tasks specifically designed for the Basque language. | Basque |
eus_trivia | Trivia and knowledge testing tasks in the Basque language. | Basque |
fda | Tasks for extracting key-value pairs from FDA documents to test information extraction. | English |
fld | Tasks involving multi-step deductive reasoning based on formal logic (Formal Logic Deduction). | English |
french_bench | Set of tasks designed to assess language model performance in French. | French |
glue | General Language Understanding Evaluation benchmark to test broad language abilities. | English |
gpqa | Graduate-level, "Google-proof" multiple-choice questions written by domain experts in biology, physics, and chemistry. | English |
gsm8k | A benchmark of grade school math problems aimed at evaluating reasoning capabilities. | English |
haerae | Tasks focused on assessing detailed factual and historical knowledge. | Korean |
headqa | A high-level education-based question answering dataset to test specialized knowledge. | Spanish, English |
hellaswag | Tasks to predict the ending of stories or scenarios, testing comprehension and creativity. | English |
hendrycks_ethics | Tasks designed to evaluate the ethical reasoning capabilities of models. | English |
hendrycks_math | Mathematical problem-solving tasks to test numerical reasoning and problem-solving. | English |
ifeval | Instruction-following evaluation tasks that check whether responses satisfy verifiable instructions. | English |
inverse_scaling | Multiple-choice tasks from the Inverse Scaling Prize, designed to find settings where larger language models perform worse. | English |
kmmlu | Knowledge-based multi-subject multiple choice questions for academic evaluation. | Korean |
kobest | A collection of tasks designed to evaluate understanding in Korean language. | Korean |
kormedmcqa | Medical question answering tasks in Korean to test specialized domain knowledge. | Korean |
lambada | Tasks designed to predict the endings of text passages, testing language prediction skills. | English |
lambada_cloze | Cloze-style LAMBADA dataset. | English |
lambada_multilingual | Multilingual LAMBADA dataset. This is a legacy version of the multilingual dataset, and users should instead use `lambada_multilingual_stablelm`. | German, English, Spanish, French, Italian |
lambada_multilingual_stablelm | Multilingual LAMBADA dataset. Users should prefer evaluating on this version of the multilingual dataset instead of on `lambada_multilingual`. | German, English, Spanish, French, Italian, Dutch, Portuguese |
leaderboard | Task group used by Hugging Face's Open LLM Leaderboard v2. These tasks are static and will not change over time. | English |
logiqa | Logical reasoning tasks requiring advanced inference and deduction. | English, Chinese |
logiqa2 | Large-scale logical reasoning dataset adapted from the Chinese Civil Service Examination. | English, Chinese |
mathqa | Question answering tasks involving mathematical reasoning and problem-solving. | English |
mc_taco | Question-answer pairs that require temporal commonsense comprehension. | English |
med_concepts_qa | Benchmark for evaluating LLMs on their ability to interpret medical codes and distinguish between medical concepts. | English |
medmcqa | Medical multiple choice questions assessing detailed medical knowledge. | English |
medqa | Multiple choice question answering based on the United States Medical Licensing Examination. | English |
mgsm | Benchmark of multilingual grade-school math problems. | Spanish, French, German, Russian, Chinese, Japanese, Thai, Swahili, Bengali, Telugu |
minerva_math | Mathematics-focused tasks requiring numerical reasoning and problem-solving skills. | English |
mmlu | Massive Multitask Language Understanding benchmark for broad domain language evaluation. Several variants are supported. | English |
mmlusr | Variation of MMLU designed to be more rigorous. | English |
model_written_evals | Evaluation tasks auto-generated for evaluating a collection of AI Safety concerns. | |
mutual | A retrieval-based dataset for multi-turn dialogue reasoning. | English |
nq_open | Open domain question answering tasks based on the Natural Questions dataset. | English |
okapi/arc_multilingual | Multilingual version of the ARC benchmark, produced by machine translation. | Multiple (31 languages) Machine Translated. |
okapi/hellaswag_multilingual | Multilingual version of the HellaSwag benchmark, produced by machine translation. | Multiple (30 languages) Machine Translated. |
okapi/mmlu_multilingual | Multilingual version of the MMLU benchmark, produced by machine translation. | Multiple (34 languages) Machine Translated. |
okapi/truthfulqa_multilingual | Multilingual version of the TruthfulQA benchmark, produced by machine translation. | Multiple (31 languages) Machine Translated. |
openbookqa | Open-book question answering tasks that require external knowledge and reasoning. | English |
paloma | Paloma is a comprehensive benchmark designed to evaluate open language models across a wide range of domains, ranging from niche artist communities to mental health forums on Reddit. | English |
paws-x | Paraphrase Adversaries from Word Scrambling, focusing on cross-lingual capabilities. | English, French, Spanish, German, Chinese, Japanese, Korean |
pile | Open source language modelling data set that consists of 22 smaller, high-quality datasets. | English |
pile_10k | The first 10K elements of The Pile, useful for debugging models trained on it. | English |
piqa | Physical Interaction Question Answering tasks to test physical commonsense reasoning. | English |
polemo2 | Sentiment analysis and emotion detection tasks based on Polish language data. | Polish |
prost | Tasks testing physical reasoning about objects through space and time. | English |
pubmedqa | Question answering tasks based on PubMed research articles for biomedical understanding. | English |
qa4mre | Question Answering for Machine Reading Evaluation, assessing comprehension and reasoning. | English |
qasper | Question Answering dataset based on academic papers, testing in-depth scientific knowledge. | English |
race | Reading comprehension assessment tasks based on English exams in China. | English |
realtoxicityprompts | Tasks to evaluate language models for generating text with potential toxicity. | |
sciq | Science Question Answering tasks to assess understanding of scientific concepts. | English |
scrolls | Tasks that involve long-form reading comprehension across various domains. | English |
siqa | Social Interaction Question Answering to evaluate common sense and social reasoning. | English |
squad_completion | A variant of the SQuAD question answering task designed for zero-shot evaluation of small LMs. | English |
squadv2 | Stanford Question Answering Dataset version 2, a reading comprehension benchmark. | English |
storycloze | Tasks to predict story endings, focusing on narrative logic and coherence. | English |
super_glue | A suite of challenging tasks designed to test a range of language understanding skills. | English |
swag | Situations With Adversarial Generations, predicting the next event in videos. | English |
swde | Information extraction tasks from semi-structured web pages. | English |
tinyBenchmarks | Evaluation of large language models with fewer examples using tiny versions of popular benchmarks. | English |
tmmluplus | An extended set of tasks under the TMMLU framework for broader academic assessments. | Traditional Chinese |
toxigen | Tasks designed to evaluate language models on their propensity to generate toxic content. | English |
translation | Tasks focused on evaluating the language translation capabilities of models. | Arabic, English, Spanish, Basque, Hindi, Indonesian, Burmese, Russian, Swahili, Telugu, Chinese |
triviaqa | A large-scale dataset for trivia question answering to test general knowledge. | English |
truthfulqa | A QA task aimed at evaluating the truthfulness and factual accuracy of model responses. | English |
unitxt | A number of tasks implemented using the unitxt library for flexible, shareable, and reusable data preparation and evaluation for generative AI. | English |
unscramble | Tasks involving the unscrambling of words whose characters have been rearranged, testing character-level manipulation. | English |
webqs | Question answering tasks from the WebQuestions dataset, built from real web search queries with answers grounded in Freebase. | English |
wikitext | Tasks based on text from Wikipedia articles to assess language modeling and generation. | English |
winogrande | A large-scale dataset for coreference resolution, inspired by the Winograd Schema Challenge. | English |
wmdp | A benchmark with the objective of minimizing performance, based on potentially-sensitive multiple-choice knowledge questions. | English |
wmt2016 | Tasks from the WMT 2016 shared task, focusing on translation between multiple languages. | English, Czech, German, Finnish, Russian, Romanian, Turkish |
wsc273 | The Winograd Schema Challenge, a test of commonsense reasoning and coreference resolution. | English |
xcopa | Cross-lingual Choice of Plausible Alternatives, testing reasoning in multiple languages. | Estonian, Haitian, Indonesian, Italian, Quechua, Swahili, Tamil, Thai, Turkish, Vietnamese, Chinese |
xnli | Cross-Lingual Natural Language Inference to test understanding across different languages. | Arabic, Bulgarian, German, Greek, English, Spanish, French, Hindi, Russian, Swahili, Thai, Turkish, Urdu, Vietnamese, Chinese |
xnli_eu | Cross-lingual Natural Language Inference tasks in Basque. | Basque |
xstorycloze | Cross-lingual narrative understanding tasks to predict story endings in multiple languages. | Russian, Simplified Chinese, Spanish, Arabic, Hindi, Indonesian, Telugu, Swahili, Basque, Burmese |
xwinograd | Cross-lingual Winograd schema tasks for coreference resolution in multiple languages. | English, French, Japanese, Portuguese, Russian, Chinese |
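
The table above can also be searched programmatically, for example to find every task family that lists a given language. The following is a minimal sketch, not part of the harness: `parse_rows` and `families_for_language` are hypothetical helper names, and the embedded `TABLE` string is a handful of rows copied from the table rather than the full file.

```python
# Sample rows copied from the table above. In practice the text would be
# read from the README file instead (assumption: same pipe-delimited layout).
TABLE = """\
Task Family | Description | Language(s) |
---|---|---|
aexams | Tasks in Arabic related to various academic exams covering a range of subjects. | Arabic |
eus_trivia | Trivia and knowledge testing tasks in the Basque language. | Basque |
xnli_eu | Cross-lingual Natural Language Inference tasks in Basque. | Basque |
paws-x | Paraphrase Adversaries from Word Scrambling, focusing on cross-lingual capabilities. | English, French, Spanish, German, Chinese, Japanese, Korean |
"""


def parse_rows(table_text):
    """Return (family, description, languages) tuples from a pipe-delimited table."""
    rows = []
    for line in table_text.splitlines()[2:]:  # skip the header and separator rows
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) != 3:
            continue  # ignore rows that do not have exactly three columns
        family, description, languages = cells
        language_list = [name.strip() for name in languages.split(",") if name.strip()]
        rows.append((family, description, language_list))
    return rows


def families_for_language(rows, language):
    """Return the task families whose language column lists `language` exactly."""
    return [family for family, _, langs in rows if language in langs]


rows = parse_rows(TABLE)
print(families_for_language(rows, "Basque"))
```

Matching here is exact, so entries such as `Multiple (122 languages)` are not expanded into individual languages, and rows with an empty language column match nothing.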