Awesome Text VQA

Text-related VQA is a fine-grained direction of the VQA task that focuses only on questions whose answers require reading the textual content shown in the input image.

Datasets

NewsVideoQA dataset (WACV 2023) [Project][Paper]
ViteVQA dataset (NeurIPS 2022) [Project][Paper]
VisualMRC dataset (AAAI 2021) [Project][Paper]
EST-VQA dataset (CVPR 2020) [Project][Paper]
DOC-VQA dataset (CVPR Workshop 2020) [Project][Paper]
Text-VQA dataset (CVPR 2019) [Project][Paper]
ST-VQA dataset (ICCV 2019) [Project][Paper]
OCR-VQA dataset (ICDAR 2019) [Project][Paper]

Dataset	#Train+Val Img	#Train+Val Que	#Test Img	#Test Que	Image Source	Language
Text-VQA	25,119	39,602	3,353	5,734	[1]	EN
ST-VQA	19,027	26,308	2,993	4,163	[2, 3, 4, 5, 6, 7, 8]	EN
OCR-VQA	186,775	901,717	20,797	100,429	[9]	EN
EST-VQA	17,047	19,362	4,000	4,525	[4, 5, 8, 10, 11, 12, 13]	EN+CH
DOC-VQA	11,480	44,812	1,287	5,188	[14]	EN
VisualMRC	7,960	23,854	2,237	6,708	self-collected webpage screenshot	EN
ViteVQA (Task1Split1)	5,969	19,840	971	3,183	YouTube	EN

Image Source:
[1] OpenImages: A public dataset for large-scale multi-label and multi-class image classification (v3) [dataset]
[2] Imagenet: A large-scale hierarchical image database [dataset]
[3] Vizwiz grand challenge: Answering visual questions from blind people [dataset]
[4] ICDAR 2013 robust reading competition [dataset]
[5] ICDAR 2015 competition on robust reading [dataset]
[6] Visual Genome: Connecting language and vision using crowdsourced dense image annotations [dataset]
[7] Image retrieval using textual cues [dataset]
[8] Coco-text: Dataset and benchmark for text detection and recognition in natural images [dataset]
[9] Judging a book by its cover [dataset]
[10] Total Text [dataset]
[11] SCUT-CTW1500 [dataset]
[12] MLT [dataset]
[13] Chinese Street View Text [dataset]
[14] UCSF Industry Document Library [dataset]
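
As a quick way to compare how densely each dataset is annotated, the train+val counts in the dataset table above can be turned into a questions-per-image ratio. The sketch below is illustrative only: the counts are copied from the table, while `TRAIN_VAL_STATS` and `questions_per_image` are hypothetical names that do not come from any of the listed projects.

```python
# Train+val (image count, question count) pairs copied from the dataset table.
TRAIN_VAL_STATS = {
    "Text-VQA": (25_119, 39_602),
    "ST-VQA": (19_027, 26_308),
    "OCR-VQA": (186_775, 901_717),
    "EST-VQA": (17_047, 19_362),
    "DOC-VQA": (11_480, 44_812),
    "VisualMRC": (7_960, 23_854),
    "ViteVQA": (5_969, 19_840),  # Task 1, split 1 row of the table
}


def questions_per_image(stats):
    """Return {dataset: questions / images}, sorted from densest to sparsest."""
    ratios = {name: n_que / n_img for name, (n_img, n_que) in stats.items()}
    return dict(sorted(ratios.items(), key=lambda item: item[1], reverse=True))


if __name__ == "__main__":
    for name, ratio in questions_per_image(TRAIN_VAL_STATS).items():
        print(f"{name:<12}{ratio:.2f}")
```

The ratio only describes annotation density in the train+val portion; it says nothing about question difficulty or answer type.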

Related Challenges

ICDAR 2021 Competition on Document Visual Question Answering (DocVQA), Submission Deadline: 31 March 2021 [Challenge]
Document Visual Question Answering (CVPR 2020 Workshop on Text and Documents in the Deep Learning Era), Submission Deadline: ~~30 April 2020~~ [Challenge]

Papers

2023

[RUArt] RUArt: A Novel Text-Centered Solution for Text-Based Visual Question Answering (T-MM) [Paper][Project]
[BOV++] Beyond OCR + VQA: Towards end-to-end reading and reasoning for robust and accurate textvqa (PR) [Paper]

2022

[TAG] TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation (PR) [Paper][Project]
[ViteVQA] Towards Video Text Visual Question Answering: Benchmark and Baseline (NeurIPS) [Paper][Project]
[LaTr] LaTr: Layout-Aware Transformer for Scene-Text VQA (CVPR) [Paper][Unofficial Code]
[TIG] Text-instance graph: Exploring the relational semantics for text-based visual question answering (PR) [Paper]
[SMA] Structured Multimodal Attentions for TextVQA (T-PAMI) [Paper][Project]
[DA-Net] Toward 3D Spatial Reasoning for Human-like Text-based Visual Question Answering (arXiv) [Paper]
[SceneGATE] SceneGATE: Scene-Graph based co-Attention networks for TExt visual question answering (arXiv) [Paper]
[MLCI] Multi-level, multi-modal interactions for visual question answering over text in images (WWW) [Paper][Project]
[TWA] From Token to Word: OCR Token Evolution via Contrastive Learning and Semantic Matching for Text-VQA (ACM MM) [Paper][Project]
[TAG] TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation (BMVC) [Paper][Project]
[MGEN] Modality-Specific Multimodal Global Enhanced Network for Text-Based Visual Question Answering (ICME) [Paper]
[SC-Net] Towards Escaping from Language Bias and OCR Error: Semantics-Centered Text Visual Question Answering (ICME) [Paper]
[EKTVQA] EKTVQA: Generalized Use of External Knowledge to Empower Scene Text in Text-VQA (Access) [Paper]
[Two-stage fusion] Two-stage Multimodality Fusion for High-performance Text-based Visual Question Answering (ACCV) [Paper]

2021

[VisualMRC] VisualMRC: Machine Reading Comprehension on Document Images (AAAI) [Paper][Project]
[SSBaseline] Simple is not Easy: A Simple Strong Baseline for TextVQA and TextCaps (AAAI) [Paper][Code]

2020

[SA-M4C] Spatially Aware Multimodal Transformers for TextVQA (ECCV) [Paper][Project][Code]
[EST-VQA] On the General Value of Evidence, and Bilingual Scene-Text Visual Question Answering (CVPR) [Paper]
[M4C] Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA (CVPR) [Paper][Project]
[LaAP-Net] Finding the Evidence: Localization-aware Answer Prediction for Text Visual Question Answering (COLING) [Paper]
[CRN] Cascade Reasoning Network for Text-based Visual Question Answering (ACM MM) [Paper][Project]

2019

[Text-VQA/LoRRA] Towards VQA Models That Can Read (CVPR) [Paper][Code]
[ST-VQA] Scene Text Visual Question Answering (ICCV) [Paper]
[Text-KVQA] From Strings to Things: Knowledge-enabled VQA Model that can Read and Reason (ICCV) [Paper]
[OCR-VQA] OCR-VQA: Visual Question Answering by Reading Text in Images (ICDAR) [Paper]

Technical Reports

[DiagNet] DiagNet: Bridging Text and Image [Report][Code]
[DCD_ZJU] Winner of 2019 Text-VQA challenge [Slides]
[Schwail] Runner-up of 2019 Text-VQA challenge [Slides]

Benchmark

Acc.: Accuracy; I. E.: Image Encoder; Q. E.: Question Encoder; O. E.: OCR Token Encoder; Ensem.: Ensemble

Text-VQA

[official leaderboard(2019)] [official leaderboard(2020)]

Y-C./J.	Methods	Acc.	I. E.	Q. E.	OCR	O. E.	Output	Ensem.
2019--CVPR	LoRRA	26.64	Faster R-CNN	GloVe	Rosetta-ml	FastText	Classification	N
2019--N/A	DCD_ZJU	31.44	Faster R-CNN	BERT	Rosetta-ml	FastText	Classification	Y
2020--CVPR	M4C	40.46	Faster R-CNN (ResNet-101)	BERT	Rosetta-en	FastText	Decoder	N
2020--Challenge	Xiangpeng	40.77
2020--Challenge	colab_buaa	44.73
2020--Challenge	CVMLP(SAM)	44.80
2020--Challenge	NWPU_Adelaide_Team(SMA)	45.51	Faster R-CNN	BERT	BDN	Graph Attention	Decoder	N
2020--ECCV	SA-M4C	44.6*	Faster R-CNN (ResNext-152)	BERT	Google-OCR	FastText+PHOC	Decoder	N
2020--arXiv	TAP	53.97*	Faster R-CNN (ResNext-152)	BERT	Microsoft-OCR	FastText+PHOC	Decoder	N
2022--arXiv	TAG	53.63	Faster R-CNN (ResNext-152)	BERT	Microsoft-OCR	FastText+PHOC	Decoder	N

* Using external data for training.
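
The Text-VQA accuracies above follow the standard VQA soft-accuracy rule, under which a predicted answer earns min(number of matching human answers / 3, 1). The following is a minimal sketch of that rule, not the official evaluation script: it assumes plain lowercasing in place of the full answer normalization and skips the averaging over annotator subsets. The function name `vqa_soft_accuracy` is hypothetical.

```python
def vqa_soft_accuracy(prediction, human_answers):
    """Score one prediction against the human answers for a question.

    Simplified rule: min(#humans who gave this answer / 3, 1).
    Normalization here is only strip + lowercase (an assumption).
    """
    normalized = prediction.strip().lower()
    matches = sum(1 for answer in human_answers
                  if answer.strip().lower() == normalized)
    return min(matches / 3.0, 1.0)


def mean_accuracy(predictions, all_human_answers):
    """Average the per-question scores, reported as a percentage."""
    if not predictions:
        return 0.0
    scores = [vqa_soft_accuracy(p, answers)
              for p, answers in zip(predictions, all_human_answers)]
    return 100.0 * sum(scores) / len(scores)
```

Under this rule, an answer given by three or more annotators counts as fully correct, while an answer given by one or two annotators earns partial credit.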

ST-VQA

[official leaderboard]
T1: Strongly Contextualised Task; T2: Weakly Contextualised Task; T3: Open Dictionary

Y-C./J.	Methods	Acc. (T1/T2/T3)	I. E.	Q. E.	OCR	O. E.	Output	Ensem.
2020--CVPR	M4C	na/na/0.4621	Faster R-CNN (ResNet-101)	BERT	Rosetta-en	FastText	Decoder	N
2020--Challenge	SMA	0.5081/0.3104/0.4659	Faster R-CNN	BERT	BDN	Graph Attention	Decoder	N
2020--ECCV	SA-M4C	na/na/0.5042	Faster R-CNN (ResNext-152)	BERT	Google-OCR	FastText+PHOC	Decoder	N
2020--arXiv	TAP	na/na/0.5967	Faster R-CNN (ResNext-152)	BERT	Microsoft-OCR	FastText+PHOC	Decoder	N
2022--arXiv	TAG	na/na/0.6019	Faster R-CNN (ResNext-152)	BERT	Microsoft-OCR	FastText+PHOC	Decoder	N
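
The scores in the table above lie on a 0 to 1 scale, consistent with the Average Normalized Levenshtein Similarity (ANLS) metric used by the ST-VQA challenge. A minimal sketch of ANLS follows, assuming the usual threshold of 0.5 and plain strip + lowercase preprocessing; the function names are hypothetical and this is not the official evaluation code.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions, ground_truths, threshold=0.5):
    """Mean over questions of the best thresholded similarity to any answer."""
    if not predictions:
        return 0.0
    total = 0.0
    for prediction, answers in zip(predictions, ground_truths):
        prediction = prediction.strip().lower()
        best = 0.0
        for answer in answers:
            answer = answer.strip().lower()
            longest = max(len(prediction), len(answer))
            distance = levenshtein(prediction, answer) / longest if longest else 0.0
            # Similarity counts only when the normalized distance is below the threshold.
            if distance < threshold:
                best = max(best, 1.0 - distance)
        total += best
    return total / len(predictions)
```

The threshold gives partial credit for small OCR slips (one wrong character in a long word) while scoring unrelated answers as zero.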

OCR-VQA

Y-C./J.	Methods	Acc.	I. E.	Q. E.	OCR	O. E.	Output	Ensem.
2020--CVPR	M4C	63.9	Faster R-CNN	BERT	Rosetta-en	FastText	Decoder	N
