Benchmarks
- Part-of-speech tagging.
- Named entity recognition.
- Dependency parsing.
- Coreference resolution.
- Sentiment analysis.
The Penn Treebank III: Wall Street Journal (accuracy).
- Training: sections 0-18.
- Development: sections 19-21.
- Evaluation: sections 22-24.
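ALL is tagging accuracy over every test token; OOV is the same accuracy restricted to tokens whose word form never appears in the training sections. A minimal sketch of both measures, assuming gold and predicted (word, tag) sequences and the training vocabulary are already loaded (the names below are illustrative, not from any particular toolkit):

```python
def pos_accuracies(gold_sents, pred_sents, train_vocab):
    """gold_sents / pred_sents: parallel lists of [(word, tag), ...] per sentence.
    train_vocab: set of word forms observed in the training sections (0-18)."""
    all_correct = all_total = oov_correct = oov_total = 0
    for gold, pred in zip(gold_sents, pred_sents):
        for (word, gold_tag), (_, pred_tag) in zip(gold, pred):
            all_total += 1
            all_correct += gold_tag == pred_tag
            if word not in train_vocab:          # out-of-vocabulary token
                oov_total += 1
                oov_correct += gold_tag == pred_tag
    return 100.0 * all_correct / all_total, 100.0 * oov_correct / max(oov_total, 1)
```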
Model | ALL | OOV |
---|---|---|
Bohnet et al. (2018) | 97.96 | - |
Akbik et al. (2018) | 97.85 (±0.01) | - |
Ling et al. (2015) | 97.78 | - |
Yasunaga et al. (2018) | 97.58 | - |
Ma & Hovy (2016) | 97.55 | 93.16 |
Tsuboi (2014) | 97.51 | 91.64 |
Andor et al. (2016) | 97.45 | - |
Santos & Zadrozny (2014) | 97.32 | 89.86 |
- Contextual String Embeddings for Sequence Labeling, Akbik et al., COLING, 2018.
- Morphosyntactic Tagging with a Meta-BiLSTM Model over Context Sensitive Token Encodings, Bohnet et al., ACL, 2018.
- Robust Multilingual Part-of-Speech Tagging via Adversarial Training, Yasunaga et al., NAACL, 2018.
- End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF, Ma & Hovy, ACL, 2016.
- Globally Normalized Transition-Based Neural Networks, Andor et al., ACL, 2016.
- Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation, Ling et al., EMNLP, 2015.
- Neural Networks Leverage Corpus-wide Information for Part-of-speech Tagging, Tsuboi, EMNLP, 2014.
- Learning Character-level Representations for Part-of-Speech Tagging, Santos & Zadrozny, ICML, 2014.
Model | ALL | OOV | DNN |
---|---|---|---|
Choi (2016) | 97.64 | 92.03 | |
Søgaard (2011) | 97.50 | - | |
Spoustová et al. (2009) | 97.44 | 89.92 | |
Moore (2015) | 97.36 | 91.09 | |
Sun (2014) | 97.36 | - | |
Shen et al. (2007) | 97.33 | 89.61 |
Manning (2011) | 97.32 | 90.79 |
- Dynamic Feature Induction: The Last Gist to the State-of-the-Art, Choi, NAACL, 2016.
- An Improved Tag Dictionary for Faster Part-of-Speech Tagging, Moore, EMNLP, 2015.
- Structure Regularization for Structured Prediction, Sun, NIPS, 2014.
- Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics?, Manning, CICLing, 2011.
- Semisupervised condensed nearest neighbor for part-of-speech tagging, Søgaard, ACL, 2011.
- Semi-supervised Training for the Averaged Perceptron POS Tagger, Spoustová et al., EACL, 2009.
- Guided Learning for Bidirectional Sequence Classification, Shen et al., ACL, 2007.
The CoNLL'03 shared task dataset: English.
Approach | F1 Score | DNN |
---|---|---|
Akbik et al. (2018) | 93.09 (±0.12) | O |
Peters et al. (2018) | 92.22 (±0.10) | O |
Peters et al. (2017) | 91.93 (±0.19) | O |
Chiu & Nichols (2016) | 91.62 (±0.33) | O |
Ma & Hovy (2016) | 91.21 | O |
Luo et al. (2015) | 91.20 | |
Suzuki et al. (2011) | 91.02 | |
Choi (2016) | 91.00 | |
Lample et al. (2016) | 90.94 | O |
Passos et al. (2014) | 90.90 | |
Lin & Wu (2009) | 90.90 | |
Ratinov & Roth (2009) | 90.57 | |
Turian et al. (2010) | 90.36 |
- Contextual String Embeddings for Sequence Labeling, Akbik et al., COLING, 2018.
- Deep contextualized word representations, Peters et al., NAACL, 2018.
- Semi-supervised sequence tagging with bidirectional language models, Peters et al., ACL, 2017.
- Named Entity Recognition with Bidirectional LSTM-CNNs, Chiu & Nichols, TACL, 2016.
- End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF, Ma & Hovy, ACL, 2016.
- Neural Architectures for Named Entity Recognition, Lample et al., NAACL, 2016.
- Dynamic Feature Induction: The Last Gist to the State-of-the-Art, Choi, NAACL, 2016.
- Joint Named Entity Recognition and Disambiguation, Luo et al., EMNLP, 2015.
- Lexicon Infused Phrase Embeddings for Named Entity Resolution, Passos et al., CoNLL, 2014.
- Learning Condensed Feature Representations from Large Unsupervised Data Sets for Supervised Learning, Suzuki et al., ACL, 2011.
- Word Representations: A Simple and General Method for Semi-Supervised Learning, Turian et al., ACL, 2010.
- Design Challenges and Misconceptions in Named Entity Recognition, Ratinov & Roth, CoNLL, 2009.
- Phrase Clustering for Discriminative Learning, Lin & Wu, ACL, 2009.
The OntoNotes 5.0 dataset for the CoNLL'12 shared task: English.
Approach | F1 Score | DNN |
---|---|---|
Chiu & Nichols (2016) | 86.28 (±0.26) | O |
Durrett & Klein (2014) | 84.04 |
- Named Entity Recognition with Bidirectional LSTM-CNNs, Chiu & Nichols, TACL, 2016.
- A Joint Model for Entity Analysis: Coreference, Typing, and Linking, Durrett & Klein, TACL, 2014.
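The F1 scores in both NER tables are entity-level: a predicted mention is counted as correct only if its span and its type both match a gold mention exactly, as in the standard CoNLL evaluation. A minimal sketch, under the assumption that entities are given as (sentence_id, start, end, type) tuples:

```python
def entity_f1(gold_entities, pred_entities):
    """Exact-match precision/recall/F1 over entity tuples (sent_id, start, end, type)."""
    gold, pred = set(gold_entities), set(pred_entities)
    tp = len(gold & pred)                              # exact span + type matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 100.0 * 2 * precision * recall / (precision + recall)
```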
The Penn Treebank III: Wall Street Journal (attachment scores).
- Training: sections 2-21.
- Development: section 22.
- Evaluation: section 23.
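UAS (unlabeled attachment score) is the percentage of tokens assigned the correct head; LAS (labeled attachment score) additionally requires the correct dependency label. A minimal sketch, assuming each token is represented by its gold and predicted (head index, label) pair; note that published numbers differ in how they exclude punctuation, which is not handled here:

```python
def attachment_scores(gold_sents, pred_sents):
    """Each sentence is a list of (head_index, dependency_label) per token."""
    uas = las = total = 0
    for gold, pred in zip(gold_sents, pred_sents):
        for (gold_head, gold_label), (pred_head, pred_label) in zip(gold, pred):
            total += 1
            if gold_head == pred_head:
                uas += 1                      # correct head
                if gold_label == pred_label:
                    las += 1                  # correct head and label
    return 100.0 * uas / total, 100.0 * las / total
```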
Yamada & Matsumoto head rules | UAS | LAS | T/G | EXT |
---|---|---|---|---|
Honnibal et al. (2013) | 91.1 | 88.9 | T | |
Zhang and Zhang (2015) | 91.80 | 90.68 | T | |
Zhang and Nivre (2011) | 92.9 | 91.8 | T | |
Choi and McCallum (2013) | 92.96 | 91.93 | T | |
Koo and Collins (2010) | 93.04 | - | G | |
Ma et al. (2014) | 93.06 | - | T | |
Koo et al. (2008) | 93.16 | - | G | O |
Chen et al. (2009) | 93.16 | - | G | O |
Li et al. (2014) | 93.19 | - | G | O |
Zhou et al. (2015) | 93.28 | 92.35 | T | O |
Pei et al. (2015) | 93.29 | 92.13 | G | O |
Carreras et al. (2008) | 93.5 | - | G | O |
Chen et al. (2013) | 93.77 | - | G | O |
Suzuki et al. (2009) | 93.79 | - | G | O |
Suzuki et al. (2011) | 94.22 | - | G | O |
Stanford dependencies | UAS | LAS | T/G | EXT |
---|---|---|---|---|
Goldberg and Nivre (2012) | 90.96 | 88.72 | T | |
Honnibal et al. (2013) | 91.0 | 88.9 | T | |
Chen and Manning (2014) | 91.8 | 89.6 | T | |
Zhang and Nivre (2011) | 93.5 | 91.9 | T | |
Kiperwasser and Goldberg (2015) | 92.45 | - | | |
Dyer et al. (2015) | 93.1 | 90.9 | | |
Weiss et al. (2015) | 94.26 | 92.41 | | |
- Simple Semi-supervised Dependency Parsing, Koo et al., ACL, 2008.
- TAG, Dynamic Programming, and the Perceptron for Efficient, Feature-rich Parsing, Carreras et al., CoNLL, 2008.
- An Empirical Study of Semi-supervised Structured Conditional Models for Dependency Parsing, Suzuki et al., EMNLP, 2009.
- Improving Dependency Parsing with Subtrees from Auto-Parsed Data, Chen et al., EMNLP, 2009.
- Efficient Third-order Dependency Parsers, Koo and Collins, ACL, 2010.
- Learning Condensed Feature Representations from Large Unsupervised Data Sets for Supervised Learning, Suzuki et al., ACL, 2011.
- Transition-based Dependency Parsing with Rich Non-local Features, Zhang and Nivre, ACL, 2011.
- A Dynamic Oracle for Arc-Eager Dependency Parsing, Goldberg and Nivre, COLING, 2012.
- Transition-based Dependency Parsing with Selectional Branching, Choi and McCallum, ACL, 2013.
- Semi-supervised Feature Transformation for Dependency Parsing, Chen et al., EMNLP, 2013.
- Ambiguity-aware Ensemble Training for Semi-supervised Dependency Parsing, Li et al., ACL, 2014.
- Punctuation Processing for Projective Dependency Parsing, Ma et al., ACL, 2014.
- A Fast and Accurate Dependency Parser using Neural Networks, Chen and Manning, EMNLP, 2014.
- A Non-Monotonic Arc-Eager Transition System for Dependency Parsing, Honnibal et al., CoNLL, 2013.
- Combining Discrete and Continuous Features for Deterministic Transition-based Dependency Parsing, Zhang and Zhang, EMNLP, 2015.
- Transition-Based Dependency Parsing with Stack Long Short-Term Memory, Dyer et al., ACL, 2015.
- Structured Training for Neural Network Transition-Based Parsing, Weiss et al., ACL, 2015.
- An Effective Neural Network Model for Graph-based Dependency Parsing, Pei et al., ACL, 2015.
- A Neural Probabilistic Structured-Prediction Model for Transition-Based Dependency Parsing, Zhou et al., ACL, 2015.
- Semi-supervised Dependency Parsing using Bilexical Contextual Features from Auto-Parsed Data, Kiperwasser and Goldberg, EMNLP, 2015.
The CoNLL 2012 shared task dataset (F1 scores).
Model | MUC | B³ | CEAFe | CoNLL |
---|---|---|---|---|
Fernandes et al. (2014) | 70.51 | 57.58 | 53.86 | 60.65 |
Björkelund & Kuhn (2014) | 70.72 | 58.58 | 55.61 | 61.63 |
Durrett & Klein (2014) | 71.24 | 58.71 | 55.18 | 61.71 |
Martschat & Strube (2015) | 72.17 | 59.58 | 55.67 | 62.47 |
Clark & Manning (2015) | 72.59 | 60.44 | 56.02 | 63.02 |
Peng et al. (2015) | 72.22 | 60.50 | 56.37 | 63.03 |
Wiseman et al. (2015) | 72.60 | 60.52 | 57.05 | 63.39 |
Wiseman et al. (2016) | 73.42 | 61.50 | 57.70 | 64.21 |
Clark & Manning (2016) | 74.06 | 62.86 | 58.96 | 65.29 |
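The CoNLL column is the unweighted average of the MUC, B³, and CEAFe F1 scores, which can be verified directly from the rows above (up to rounding):

```python
def conll_score(muc_f1, b3_f1, ceafe_f1):
    """Official CoNLL coreference score: mean of the three metric F1s."""
    return (muc_f1 + b3_f1 + ceafe_f1) / 3.0

print(round(conll_score(70.51, 57.58, 53.86), 2))  # 60.65 -> Fernandes et al. (2014)
print(round(conll_score(74.06, 62.86, 58.96), 2))  # 65.29 -> Clark & Manning (2016)
```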
- Latent Trees for Coreference Resolution, Fernandes et al., CL, 2014.
- Learning Structured Perceptrons for Coreference Resolution with Latent Antecedents and Non-local Features, Björkelund and Kuhn, ACL, 2014.
- A Joint Model for Entity Analysis: Coreference, Typing, and Linking, Durrett and Klein, TACL, 2014.
- Latent Structures for Coreference Resolution, Martschat and Strube, TACL, 2015.
- Entity-Centric Coreference Resolution with Model Stacking, Clark and Manning, ACL, 2015.
- A Joint Framework for Coreference Resolution and Mention Head Detection, Peng et al., CoNLL, 2015.
- Learning Anaphoricity and Antecedent Ranking Features for Coreference Resolution, Wiseman et al., ACL, 2015.
- Learning Global Features for Coreference Resolution, Wiseman et al., NAACL, 2016.
- Improving Coreference Resolution by Learning Entity-Level Distributed Representations, Clark and Manning, ACL, 2016.
The Stanford Sentiment Treebank (accuracies).
Model | Fine-grained | Binary |
---|---|---|
Socher et al. (2013) | 45.7 | 85.4 |
Kim (2014) | 48.0 | 87.2 |
Kalchbrenner et al. (2014) | 48.5 | 86.8 |
Le and Mikolov (2014) | 48.7 | 87.8 |
Shin et al. (2016) | 48.8 | - |
Yin and Schütze (2015) | 49.6 | 89.4 |
Irsoy and Cardie (2014) | 49.8 | 86.6 |
Tai et al. (2015) | 51.0 | 88.0 |
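Fine-grained is 5-way accuracy (very negative to very positive) at the sentence level; binary collapses the scale to negative vs. positive and drops neutral sentences, following the usual Stanford Sentiment Treebank setup. A minimal sketch, assuming gold and predicted labels are integers in {0, ..., 4}:

```python
def fine_grained_accuracy(gold, pred):
    """5-way accuracy over all test sentences (labels 0-4)."""
    return 100.0 * sum(g == p for g, p in zip(gold, pred)) / len(gold)

def binary_accuracy(gold, pred):
    """Negative (0-1) vs. positive (3-4); neutral sentences (label 2) are excluded."""
    pairs = [(g > 2, p > 2) for g, p in zip(gold, pred) if g != 2]
    return 100.0 * sum(g == p for g, p in pairs) / len(pairs)
```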
- Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank, Socher et al., EMNLP, 2013.
- Distributed Representations of Sentences and Documents, Le and Mikolov, ICML, 2014.
- A Convolutional Neural Network for Modelling Sentences, Kalchbrenner et al., ACL, 2014.
- Convolutional Neural Networks for Sentence Classification, Kim, EMNLP, 2014.
- Deep Recursive Neural Networks for Compositionality in Language, Irsoy and Cardie, NIPS, 2014.
- Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks, Tai et al., ACL, 2015.
- Multichannel Variable-Size Convolution for Sentence Classification, Yin and Schütze, CoNLL, 2015.
- Lexicon Integrated CNN Models with Attention for Sentiment Analysis, Shin et al., arXiv, 2016.
- Parsing English into Abstract Meaning Representation Using Syntax-Based Machine Translation, Pust et al., EMNLP, 2015.
- Boosting Transition-based AMR Parsing with Refined Actions and Auxiliary Analyzers, Wang et al., ACL, 2015.
- Robust Subgraph Generation Improves Abstract Meaning Representation Parsing, Werling et al., ACL, 2015.
Copyright © 2015-2019 Emory University - All Rights Reserved.