Question answering

Question answering is the task of producing an answer to a question posed in natural language.

ARC

The AI2 Reasoning Challenge (ARC) dataset is a question answering dataset that contains 7,787 genuine grade-school level, multiple-choice science questions. The dataset is partitioned into a Challenge Set and an Easy Set. The Challenge Set contains only questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. Models are evaluated based on accuracy.
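The partition rule and the accuracy metric can be sketched as follows. This is a minimal illustration, not the procedure used to build ARC: the record fields (`retrieval_correct`, `cooccurrence_correct`), the question IDs, and the answer labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Question:
    qid: str
    retrieval_correct: bool     # hypothetical flag: retrieval-based baseline was right
    cooccurrence_correct: bool  # hypothetical flag: word co-occurrence baseline was right


def partition(questions):
    """Split questions into (challenge, easy) following the rule described above."""
    challenge, easy = [], []
    for q in questions:
        # A question goes to the Challenge Set only if BOTH baselines got it wrong.
        if not q.retrieval_correct and not q.cooccurrence_correct:
            challenge.append(q)
        else:
            easy.append(q)
    return challenge, easy


def accuracy(predictions, gold):
    """Fraction of questions whose predicted answer label matches the gold label."""
    if not gold:
        raise ValueError("gold must be non-empty")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


questions = [
    Question("q1", retrieval_correct=False, cooccurrence_correct=False),
    Question("q2", retrieval_correct=True, cooccurrence_correct=False),
    Question("q3", retrieval_correct=True, cooccurrence_correct=True),
]
challenge, easy = partition(questions)
print([q.qid for q in challenge], [q.qid for q in easy])
print(accuracy(["A", "C", "B"], ["A", "B", "B"]))
```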

A public leaderboard is available on the ARC website.

Reading comprehension

Most current question answering datasets frame the task as reading comprehension, where the question is about a paragraph or document and the answer is often a span in the document. The Machine Reading group at UCL also provides an overview of reading comprehension tasks.

CNN / Daily Mail

The CNN / Daily Mail dataset is a Cloze-style reading comprehension dataset created from CNN and Daily Mail news articles using heuristics. Cloze-style means that a missing word has to be inferred. In this case, "questions" were created by replacing entities from bullet points summarizing one or several aspects of the article. Coreferent entities have been replaced with an entity marker @entityn where n is a distinct index. The model is tasked with inferring the missing entity in the bullet point based on the content of the corresponding article. Models are evaluated based on their accuracy on the test set.

	CNN	Daily Mail
# Train	380,298	879,450
# Dev	3,924	64,835
# Test	3,198	53,182

Example:

Passage	Question	Answer
( @entity4 ) if you feel a ripple in the force today , it may be the news that the official @entity6 is getting its first gay character . according to the sci-fi website @entity9 , the upcoming novel " @entity11 " will feature a capable but flawed @entity13 official named @entity14 who " also happens to be a lesbian . " the character is the first gay figure in the official @entity6 -- the movies , television shows , comics and books approved by @entity6 franchise owner @entity22 -- according to @entity24 , editor of " @entity6 " books at @entity28 imprint @entity26 .	characters in " @placeholder " movies have gradually become more diverse	@entity6
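To make the format concrete, the sketch below collects the candidate answers (the entity markers in a passage) and fills the `@placeholder` slot in the question. The passage is shortened from the example above, and the helper names are assumptions made for illustration; a real system would score the candidates with a trained model rather than just list them.

```python
import re

# Entity markers look like @entity4, @entity11, ...
ENTITY_PATTERN = re.compile(r"@entity\d+")


def candidate_entities(passage):
    """Return the distinct entity markers in order of first appearance."""
    seen = []
    for marker in ENTITY_PATTERN.findall(passage):
        if marker not in seen:
            seen.append(marker)
    return seen


def fill_placeholder(question, entity):
    """Substitute a candidate entity for the @placeholder slot."""
    return question.replace("@placeholder", entity)


# Shortened from the example passage above.
passage = (
    "( @entity4 ) if you feel a ripple in the force today , it may be the news "
    "that the official @entity6 is getting its first gay character . according "
    "to the sci-fi website @entity9 , the upcoming novel \" @entity11 \" will "
    "feature a capable but flawed @entity13 official named @entity14"
)
question = 'characters in " @placeholder " movies have gradually become more diverse'

candidates = candidate_entities(passage)
print(candidates)
print(fill_placeholder(question, "@entity6"))
```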

Model	CNN	Daily Mail	Paper / Source
Neural net (Chen et al., 2016)	72.4	75.8	A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task
Classifier (Chen et al., 2016)	67.9	68.3	A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task
Impatient Reader (Hermann et al., 2015)	63.8	68.0	Teaching Machines to Read and Comprehend

MS MARCO

MS MARCO (Human Generated MAchine Reading COmprehension Dataset) was designed and developed by Microsoft AI & Research. Link to paper

The questions are obtained from real anonymized user queries.
The answers are human generated. The context passages from which the answers are obtained are extracted from real documents using the latest Bing search engine.
The dataset contains 100,000 queries, a subset of which have multiple answers. The authors aim to release 1M queries in the future.

The leaderboards for multiple tasks are available on the MS MARCO leaderboard page.

MultiRC

MultiRC (Multi-Sentence Reading Comprehension) is a dataset of short paragraphs and multi-sentence questions that can be answered from the content of the paragraph. We have designed the dataset with three key challenges in mind:

The number of correct answer-options for each question is not pre-specified. This removes the over-reliance of current approaches on answer-options and forces them to decide on the correctness of each candidate answer independently of others. In other words, unlike previous work, the task here is not to simply identify the best answer-option, but to evaluate the correctness of each answer-option individually.
The correct answer(s) is not required to be a span in the text.
The paragraphs in our dataset have diverse provenance by being extracted from 7 different domains such as news, fiction, historical text etc., and hence are expected to be more diverse in their contents as compared to single-domain datasets.
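The first point, judging each answer-option on its own, can be illustrated with a small sketch. Everything here is hypothetical: the scores, the 0.5 threshold, and the per-question exact-match summary are chosen for illustration and are not taken from the MultiRC evaluation script.

```python
def predict_options(scores, threshold=0.5):
    """Label each answer-option independently: True if its score clears the threshold."""
    return [s >= threshold for s in scores]


def question_exact_match(predicted, gold):
    """A question counts as correct only if every option is labeled correctly."""
    return predicted == gold


# One question with four options; two are correct, but that count is not
# known to the model in advance.
gold = [True, False, True, False]
scores = [0.9, 0.2, 0.7, 0.4]  # hypothetical model scores, one per option

predicted = predict_options(scores)
print(predicted)
print(question_exact_match(predicted, gold))
```

Because each option is thresholded separately, the model can accept zero, one, or several options for the same question, which is what distinguishes this setup from picking a single best answer.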

The leaderboard for the dataset is available on the MultiRC website.

NewsQA

The NewsQA dataset is a reading comprehension dataset of over 100,000 human-generated question-answer pairs from over 10,000 news articles from CNN, with answers consisting of spans of text from the corresponding articles. Some challenging characteristics of this dataset are:

Answers are spans of arbitrary length;
Some questions have no answer in the corresponding article;
There are no candidate answers from which to choose.

Although very similar to the SQuAD dataset, NewsQA offered a greater challenge to existing models at the time of its introduction (e.g. the paragraphs are longer than those in SQuAD). Models are evaluated based on F1 and Exact Match.

Example:

Story	Question	Answer
MOSCOW, Russia (CNN) -- Russian space officials say the crew of the Soyuz space ship is resting after a rough ride back to Earth. A South Korean bioengineer was one of three people on board the Soyuz capsule. The craft carrying South Korea's first astronaut landed in northern Kazakhstan on Saturday, 260 miles (418 kilometers) off its mark, they said. Mission Control spokesman Valery Lyndin said the condition of the crew -- South Korean bioengineer Yi So-yeon, American astronaut Peggy Whitson and Russian flight engineer Yuri Malenchenko -- was satisfactory, though the three had been subjected to severe G-forces during the re-entry. [...]	Where did the Soyuz capsule land?	northern Kazakhstan
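For span answers such as the one above, Exact Match and F1 are typically computed over tokens. The sketch below is a simplified version under stated assumptions: it normalizes by lowercasing and splitting on whitespace only, whereas official scorers may also strip punctuation and articles, so its numbers are not guaranteed to match a leaderboard.

```python
from collections import Counter


def tokens(text):
    """Lowercase and split on whitespace (a deliberately simple normalization)."""
    return text.lower().split()


def exact_match(prediction, gold):
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(tokens(prediction) == tokens(gold))


def f1(prediction, gold):
    """Token-overlap F1 between a predicted span and a gold span."""
    pred, ref = tokens(prediction), tokens(gold)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both spans.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(exact_match("northern Kazakhstan", "northern Kazakhstan"))
print(f1("in northern Kazakhstan", "northern Kazakhstan"))
```

A prediction with one extra token ("in northern Kazakhstan") fails Exact Match but still earns partial credit under F1, which is why both metrics are usually reported together.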

The dataset can be downloaded here.

Model	F1	EM	Paper / Source
MINIMAL(Dyn) (Min et al., 2018)	63.2	50.1	Efficient and Robust Question Answering from Minimal Context over Documents
FastQAExt (Weissenborn et al., 2017)	56.1	43.7	Making Neural QA as Simple as Possible but not Simpler

QAngaroo

QAngaroo is a set of two reading comprehension datasets that require multiple steps of inference combining facts from multiple documents. The first dataset, WikiHop, is open-domain and focuses on Wikipedia articles. The second dataset, MedHop, is based on paper abstracts from PubMed.

The leaderboards for both datasets are available on the QAngaroo website.

RACE

The RACE dataset is a reading comprehension dataset collected from English examinations in China, which are designed for middle school and high school students. The dataset contains more than 28,000 passages and nearly 100,000 questions and can be downloaded here. Models are evaluated based on accuracy on middle school examinations (RACE-m), high school examinations (RACE-h), and on the total dataset (RACE).
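The three reported scores can be computed from one pass over the predictions. The sketch below assumes that the overall RACE score pools all questions, so it is weighted by subset size rather than being the mean of the two subset scores. The record layout and the sample answers are hypothetical.

```python
from collections import defaultdict


def race_accuracy(records):
    """Compute subset and overall accuracy.

    records: iterable of (subset, predicted, gold), where subset is
    'middle' or 'high'. Assumes both subsets are non-empty.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subset, predicted, gold in records:
        total[subset] += 1
        correct[subset] += int(predicted == gold)
    return {
        "RACE-m": correct["middle"] / total["middle"],
        "RACE-h": correct["high"] / total["high"],
        # Pooled over every question, not an average of the two scores above.
        "RACE": sum(correct.values()) / sum(total.values()),
    }


records = [
    ("middle", "A", "A"),
    ("middle", "B", "C"),
    ("high", "D", "D"),
    ("high", "A", "A"),
    ("high", "C", "B"),
]
result = race_accuracy(records)
print(result)
```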

Model	RACE-m	RACE-h	RACE	Paper / Source
Finetuned Transformer LM (Radford et al., 2018)	62.9	57.4	59.0	Improving Language Understanding by Generative Pre-Training
BiAttention MRU (Tay et al., 2018)	60.2	50.3	53.3	Multi-range Reasoning for Machine Comprehension

SQuAD

The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles. The answer to every question is a segment of text (a span) from the corresponding reading passage. Recently, SQuAD 2.0 has been released, which includes unanswerable questions.
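A span answer can be represented as a character offset into the passage plus the answer text, with a flag for unanswerable questions. The sketch below is an illustration only: the dictionary keys, the sample passage, and the questions are assumptions made for this example and should not be read as the exact schema of the released files.

```python
def span_text(context, answer_start, answer_text):
    """Return the span at answer_start and check it matches the stated answer text."""
    end = answer_start + len(answer_text)
    span = context[answer_start:end]
    if span != answer_text:
        raise ValueError("answer text does not match the context at this offset")
    return span


def gold_answer(example, context):
    """Return the gold span, or None when the question is marked unanswerable."""
    if example["is_impossible"]:
        return None
    return span_text(context, example["answer_start"], example["answer_text"])


# Hypothetical passage and questions, for illustration only.
context = "The dataset consists of questions posed by crowdworkers on Wikipedia articles."

answerable = {
    "question": "Who posed the questions?",
    "answer_text": "crowdworkers",
    "answer_start": context.index("crowdworkers"),
    "is_impossible": False,
}
unanswerable = {
    "question": "In which year were the articles written?",
    "answer_text": None,
    "answer_start": None,
    "is_impossible": True,
}

print(gold_answer(answerable, context))
print(gold_answer(unanswerable, context))
```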

The public leaderboard is available on the SQuAD website.

Story Cloze Test

The Story Cloze Test is a dataset for story understanding that provides systems with four-sentence stories and two possible endings. The systems must then choose the correct ending to the story.

Model	Accuracy	Paper / Source
Finetuned Transformer LM (Radford et al., 2018)	86.5	Improving Language Understanding by Generative Pre-Training
Hidden Coherence Model (Chaturvedi et al., 2017)	77.6	Story Comprehension for Predicting What Happens Next
val-LS-skip (Srinivasan et al., 2018)	76.5	A Simple and Effective Approach to the Story Cloze Test

Winograd Schema Challenge

The Winograd Schema Challenge is a dataset for common sense reasoning. It employs Winograd Schema questions that require the resolution of anaphora: the system must identify the antecedent of an ambiguous pronoun in a statement. Models are evaluated based on accuracy.

Example:

The trophy doesn’t fit in the suitcase because it is too big. What is too big? Answer 0: the trophy. Answer 1: the suitcase
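One way to frame the task computationally is to substitute each candidate for the pronoun and keep the candidate whose sentence a scoring function prefers. The sketch below shows only that framing: `toy_score` is a hard-coded stand-in for a real model score, and its values are made up for this example.

```python
def substitute(sentence, pronoun, candidate):
    """Replace the first standalone occurrence of the pronoun with a candidate."""
    words = sentence.split()
    index = words.index(pronoun)  # matches whole words only, so "fit" is skipped
    words[index] = candidate
    return " ".join(words)


def resolve(sentence, pronoun, candidates, score):
    """Return the index of the candidate whose substituted sentence scores highest."""
    scored = [score(substitute(sentence, pronoun, c)) for c in candidates]
    return max(range(len(candidates)), key=scored.__getitem__)


def toy_score(text):
    # Stand-in for a model score; the values are invented for illustration.
    return 1.0 if "because the trophy is" in text else 0.0


sentence = "The trophy doesn't fit in the suitcase because it is too big."
candidates = ["the trophy", "the suitcase"]

print(substitute(sentence, "it", candidates[0]))
print(resolve(sentence, "it", candidates, toy_score))
```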

Model	Score	Paper / Source
Word-LM-partial (Trinh and Le, 2018)	62.6	A Simple Method for Commonsense Reasoning
Char-LM-partial (Trinh and Le, 2018)	57.9	A Simple Method for Commonsense Reasoning
USSM + Supervised DeepNet + KB (Liu et al., 2017)	52.8	Combing Context and Commonsense Knowledge Through Neural Networks for Solving Winograd Schema Problems
