This repo collects OCR-related datasets. The datasets are grouped into six types: Natural Scene Text, Document Text, Handwritten Text, Historical Document Text, Video Text, and Synthetic Text.
- Natural Scene Text: the images are usually taken in natural scenes, so the difficulty lies in complex lighting variations, shooting angles, blur, varied fonts, etc.
- Document Text: focuses only on document images; the main difficulty is the variety of typesetting.
- Historical Document Text: is usually designed to assist social science research. For example, digitized antiquarian documents help preserve historical materials and make it easier for scholars to conduct related research.
- Video Text: aims at recognizing texts in videos, which introduces temporal information into the OCR task.
- Synthetic Text: generates images containing text, together with the corresponding annotations, by rendering text in different fonts onto natural photos. This type of dataset usually includes hundreds of thousands of samples because it does not require human annotation. However, because rendering techniques are limited, there is usually a large domain gap between synthetic images and real samples, so these datasets are often used for pre-training only.
Natural Scene Text | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Year/Venue | Name | Task | #Train(#wds) | #Val(#wds) | #Test(#wds) | Granu. | Anno. Form | Language | Scene | Paper | Size | |
2003-05/ICDAR | IC03/IC05 | Det. & Rec. | 258 (1110) | N/A | 251 (1156) | Word | Rect [x, y, w, h, "transcript"] | English | Natural | 112MB | ||
2011-15/ICDAR | Born-Digital-Image (IC2011-2015) | Det. & Rec. & Seg. | 410 (3564) | N/A | 141 (1439) | Word & Pixel | Rect [x, y, w, h, "transcript"] | English | Natural/Web/Email | 40MB | ||
2013-15/ICDAR | Focused Scene Text (IC13) | Det. & Rec. & Seg. | 229 (848) | N/A | 233 (1095) | Word & Pixel | Rect [x1, y1, x2, y2, "transcript"] & SegMap | English | Natural | 250MB | ||
2015/ICDAR | Incidental Scene Text (IC15) | Det. & Rec. | 1,000 (4468) | N/A | 500 (2077) | Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | English | Natural | 130MB | ||
2017/ICDAR | Multi-Lingual Scene Text (MLT2017) | Det. & Rec. | 7,200 | 1,800 | private | Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, Lan, 'trans'] | multi-lingual | Natural | - | 12GB | |
2019/ICDAR | Multi-Lingual Scene Text (MLT2019) | Det. & Rec. | 10,000 | N/A | 10,000 | Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, Lan, 'trans'] | multi-lingual | Natural | ~12GB | ||
2017/ICDAR | COCO-Text v2.0 | Det. & Rec. | 43,686 | 10,000 | 10,000 | Word | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] | En & NonEn | Natural | 13GB | ||
2019/ICDAR | ReCTS | Det. & Rec. | 20,000 | N/A | 5,000 | Word/Line | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | Chinese | Signboard | - | ~2.5GB | |
2017/ICDAR | Total-Text | Det. & Rec. | 1255 | N/A | 300 | Word & Pixel | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] | English | Natural | 441MB | ||
2019/PR | SCUT-CTW1500 | Det. & Rec. | 1,000 | N/A | 500 | Line | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] | En & Ch | Natural | 800MB | ||
2019/ICDAR | Arbitrary-Shaped Text (ART) | Det. & Rec. | 5,603 (50,029) | N/A | 4,563 (52,631) | Word(En)/Line(CH) | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], Lan, 'trans'] | En & Ch | Natural | - | 4.4GB | |
2017/ICDAR | RCTW-17 (CTW-12k) | Det. & Rec. | 11514 | N/A | 1000 | Line | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | Chinese | Mixture | 11GB | ||
2019/ICDAR/ICCV | Large-scale Street View Text (LSVT) | Det. & Rec. | 30,000 | N/A | 20,000 | Line | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] | En & Ch | Street View | 14GB | ||
2016/DAS | MLe2e | Det. & Script Identifica. | 450 | N/A | 261 | Word | Rect [x1, y1, x2, y2, language] | multi-lingual | Natural | 82MB | ||
2017/ICDAR | IIIT-ILST | Det. & Rec. | 893 | Word | Rect [x, y, w, h, "transcript"] | Indic | Google Images | 609MB | ||||
2017/CVPRW | UberText | Det. & Rec. | 117,969 (571,534) | Word | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] | English | Street View | 197GB | ||||
2009/VISAPP | Chars74k | Det. & Rec. | 1922 | Character | En & Kannada | Natural Scene | 739MB | |||||
2010/ICPR | KAIST | Det. & Rec. & Seg. | 3000 | Char & Word & Pixel | Rect [x, y, w, h, "transcript"] & SegMap | En & Korean | Mixture | 364MB | ||||
2010/ECCV | SVT | Det. & Rec. | 100 (211) | N/A | 250 (514) | Word | Rect [x, y, w, h, "transcript"] | English | Street View | 118MB | ||
2013/ICCV | SVTP (download code:vnis) | Rec. | 238 (639) | - | English | Street View | ~1MB | |||||
2011/NIPSw | SVHN | Det. & Rec. | 73,257+531,131 | N/A | 26,032 | Character | Rect [x, y, w, h, "transcript"] | Digit | House Number | ~3GB | ||
2011/ICDARw | NEOCR | Det. | 659 (5,238) | Line | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | multi-lingual | Natural Scene | 1.3GB | ||||
2012/CVPR | MSRA-TD500 | Det. | 300 | N/A | 200 | Line | RotRect [ind, difficult, x, y, w, h, theta] | multi-lingual | Street View | 96MB | ||
2012/BMVC | IIIT 5k-word | Rec. | 380 (2000) | N/A | 740 (3000) | Word | English | Natural | 106MB | |||
2014/ESWA | CUTE80 | Rec. | 80 | Line | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]]] | English | Street View | 44MB | ||||
2015/TPAMI | USTB-SV1K | Det. & Rec. | 500 | N/A | 500 | Word | RotRect [ind, difficult, x, y, w, h, theta, "trans"] | English | Street View | 36MB | ||
2019/JCST | Chinese Text in the Wild (CTW) | Det. & Rec. | 25,887(812,872chrs) | N/A | 3,269(103,519chrs) | Char & Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | Chinese | Street View | ~40GB | ||
2019/TITS | ShopSign | Det. & Rec. | 1258 sample images | Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | Chinese | Signboard | 3GB | ||||
2021/CVPR | TextOCR | Det. & Rec. & VQA | 24902 (822,572) | N/A | 3232 (80,497) | Word | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] | English | Natural Scene | ~8GB | ||
2021/CVPR | VinText | Det. & Rec. | 1,200 | N/A | 300+500 | Word | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] | Vietnamese | Natural Scene | 1GB | ||
2018/Competition | ICPR MTWI2018 | Det. & Rec. | 10,000 | N/A | 10,000 | Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | En & Ch | WEB Images | 2GB | ||
2019/Competition | Baidu Chinese Scene Text Recognition Competition (百度中文场景文字识别比赛) | Rec. | 50,000 | N/A | 10,000 | - | [h, w, 'trans'] | En & Ch | Street View | - | ||
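
The "Anno. Form" column mixes several box conventions (Rect, Quad, RotRect, Polygon). When combining datasets, it often helps to convert everything to a list of corner points first. The following is a minimal Python sketch, not code from any of the listed datasets. The function names are hypothetical, and the RotRect convention (top-left corner of the unrotated box, `theta` in radians, rotation about the box center) is an assumption that should be checked against each dataset's own documentation.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def rect_xywh_to_polygon(x: float, y: float, w: float, h: float) -> List[Point]:
    """Rect [x, y, w, h] -> four corners, starting top-left, going clockwise."""
    return [(x, y), (x + w, y), (x + w, y + h), (x, y + h)]


def rect_xyxy_to_polygon(x1: float, y1: float, x2: float, y2: float) -> List[Point]:
    """Rect [x1, y1, x2, y2] (two opposite corners) -> four corners."""
    return [(x1, y1), (x2, y1), (x2, y2), (x1, y2)]


def quad_to_polygon(coords: Sequence[float]) -> List[Point]:
    """Quad [x1, y1, ..., x4, y4] -> four (x, y) pairs in the given order."""
    if len(coords) != 8:
        raise ValueError("a Quad needs exactly 8 coordinate values")
    return [(coords[i], coords[i + 1]) for i in range(0, 8, 2)]


def rotrect_to_polygon(x: float, y: float, w: float, h: float, theta: float) -> List[Point]:
    """RotRect [x, y, w, h, theta] -> four rotated corners.

    Assumption: (x, y) is the top-left corner of the unrotated box and
    theta is in radians, applied about the box center.
    """
    cx, cy = x + w / 2.0, y + h / 2.0
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    corners = []
    for px, py in rect_xywh_to_polygon(x, y, w, h):
        dx, dy = px - cx, py - cy
        corners.append((cx + dx * cos_t - dy * sin_t, cy + dx * sin_t + dy * cos_t))
    return corners
```

Polygon annotations are already lists of points, so they can pass through unchanged once the other forms have been converted.
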
Document Text | ||||||||||||
Year/Venue | Name | Task | #Train | #Val | #Test | Granu. | Anno. Form | Language | Scene | Paper | Size | |
2011/ICDAR | RETAS | No public download link | Char & Word | No public download link | - | |||||||
2013/IJDAR | LRDE-DBD Document Binarization | Det. & Binarization | 125 | Line & Mask | Rect | French | Magazine | ~700MB | ||||
2015/ICDAR | SmartDOC | 3630 | N/A | 8470 | ~30GB | |||||||
2016/ICFHR | KPTI | Rec. | 11,910 | 2,552 | 2,553 | - | ['transcripts'] | Pashto | Document | ~100MB | ||
2017/ICDAR | DeText | Det. & Rec. | 100 | 100 | 300 | Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | English | Scientific | 10MB | ||
2019/ICDAR | SROIE | Det. & Rec. & Info Ext. | 600 | 400 | Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | English | Receipt | - | <1GB | ||
2019/ICDAR | FUNSD | Det. & Rec. & Info Ext. | 149 | N/A | 50 | Word | Rect [x1, y1, x2, y2, "transcript"] | English | Form | 16MB | ||
2019/ICDAR | NAF | Det. & Rec. & Info Ext. | 682 | 59 | 63 | Line | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | English | Form | |||
2020 | BID | Det. & Rec. | 28880 | Line | Poly | Latin | ID Document | |||||
2020/ISCSIC | DDI-100 | Det. & Rec. | ~ 100,000 (70% train, 30% val) | Char & Word & Mask | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | English | Distorted Document | ~300GB | ||||
Handwritten Text | ||||||||||||
Year/Venue | Name | Task | #Train | #Val | #Test | Granu. | Anno. Form | Language | Scene | Paper | Size | |
2008-11/ICDAR | RIMES | No public download link | Word & Line | No public download link | ||||||||
2010/DAS | HIT-OR3C | Rec. | Char set 832,650 chars / Doc set 77,168 chars | - | special format | Chinese | Handwritten | 1GB | ||||
2012/PR | KHATT | Rec. | 8,368 | 1,793 | 1,822 | - | ['transcripts'] | Arabic | Handwritten | |||
98-2014 | HANDS | No public download link | Japanese | Handwritten | ||||||||
- | Lao-SABAIDEE | 500 samples | No public download link | Lao | Handwritten | |||||||
2014/ICFHR | ORAND-CAR/CVL | Rec. | 5,000 | N/A | 5,000 | Word | ['image_name', 'trans'] | Digits | Handwritten Digits | 194MB | ||
2018/ICFHR | VNOnDB | Rec. | 1,146 paragraphs, 7,296 lines, 380,000 chars | Word/Line/Parag. | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'trans'] | Vietnamese | Handwritten | 200MB | ||||
2013-16/IJDAR | PE92/SERI95/HanDB (HangulDB) | Rec. | 1200 samples (90% Train/10% Test) | .HGU1 format | Korean | Handwritten | 800MB | |||||
95-2016 | NIST | Rec. | English | |||||||||
2011/ICDAR | CASIA-OLHWDB/HWDB | Rec. | Chinese | Handwritten | ||||||||
2021/ICDAR | IIT-INDIC-HW-WORDS | Rec. | 872,000 instances | Word | ['image_name', 'vocab_id'] & vocabulary | Indic | Handwritten | ~20GB | ||||
1999/ICDAR | IAM Handwriting Database | Rec. | 6,161 | 900+940 | 1,861 | Registration is Required | ||||||
2005/ICDAR | IAM ONLINE Handwriting Data | Rec. | 86,272 word instances | Registration is Required | ||||||||
2018/ICDAR | IAM-MonDo | Rec. | Registration is Required | |||||||||
2011-14/ICDAR | CROHME | Rec. | > 10,000 expressions | symbol & expression | InkML format, LaTeX | Symbol | Mathematical | 58MB | ||||
2017/ICDAR | MUSCIMA++ | Rec. | 140 | Symbol | Music Notation | |||||||
2018/Access | SCUT-EPT | Rec. | 40,000 | N/A | 10,000 | Chinese | Educational Doc. | 1.08GB | ||||
2020/ICFHR | HHD | Rec. | 3965 | 1134 | Hebrew | |||||||
2021/ArXiv | IMGUR5K | Det. & Rec. | (~108,000) | (~13,000) | (~14,000) | Word | Rect [x, y, w, h, "transcript"] | English | Handwritten | - | ||
2021/ArXiv | VML-MOC | Seg. & Rec. | Hebrew | |||||||||
2021/ICDAR | Bengali | Rec. | Bengali | |||||||||
2021/ICDAR | GNHK | Det. & Rec. | 687 | Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | English | ||||||
Historical Document Text | ||||||||||||
Year/Venue | Name | Task | #Train | #Val | #Test | Granu. | Anno. Form | Language | Scene | Paper | Size | |
2010-11/DAS | IAM-HistDB | Rec. | 127 | Word & Line | ['image_id', 'transcript'] | En & Ger & Latin | >200MB | |||||
2016/ICFHR | H-KWS (1. Botany 2. AK) | Det. & Rec. | 1849 | 3734 | N/A | Word & Line | Rect [x, y, w, h, "transcript"] | English | ||||
2016/ICFHR | READ | Registration is Required | German | ~600MB | ||||||||
2017/ICFHR | Palm Leaf Manuscript | Det. & Rec. | ~19,000 Balinese + ~20,000 Khmer | Char | No public download link | Khmer | Palm Leaf | |||||
2017/HIP | SleukRith-Set | Det. & Rec. | 658 | Char & Word | Polygon [[[x1,y1], [x2,y2], ..., [xn, yn]], 'transcript'] | Khmer | Palm Leaf | 1GB | ||||
2019/NCA | ARDIS | Rec. | 10,000 | Char & Word | ['transcript'] | Digits | Church Records | |||||
2019/ICDAR | Pinkas | Det. & Rec. | Word & Line | Hebrew | historical manuscripts | ~50MB | ||||||
2020/ICFHR | Cuneiform | |||||||||||
2020/ICFHR | MTHv2 | Det. & Rec. | 2,399 | N/A | 800 | Char & Line | Chinese | Ancient Book | 4.6GB | |||
2021/ICDAR | IHR-NomDB | Det. & Rec. | 267 | Line | Rect [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | ChuNom | Ancient Book | |||||
2021/ICDAR | VML-HP | Hebrew | ||||||||||
VML-AHTE | ||||||||||||
2019/ICDAR | IndiScapes | Seg | No public download link | Indic | ||||||||
Video Text | ||||||||||||
Year/Venue | Name | Task | #TrainVids (#frames) | #ValVids (#f) | #TestVids(#f) | Granu. | Anno. Form | Language | Scene | Paper | Size | |
2013/15/ICDAR | Text in Videos (IC13) | Det. & Rec. | 25 (13450) | 24 (14374) | Word | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | English | Natural | ||||
2015/ICDAR | CVSI2015 | No public link for download | multi-lingual | |||||||||
2017/ICDAR | DOST | Word | Quad | Japanese | ||||||||
2018/ICFHR | LectureVideoDB | Det. & Rec. | -52,225 | -27,900 | -36,460 | Word | English | Slides/Paper | 2.3GB | |||
2020/ICRA | RoadText-1K | Det. & Rec. | 500 (150,000) | 200 (60,000) | 300 (90,000) | Line | Rect [x1, y1, x2, y2, "transcript"] & SegMap | En & NonEn | Road/Traffic | |||
2020/ICMV | MIDV-500 & MIDV-2019 | Det. & Rec. & Others | 500 video clips | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | multi-lingual | Document | 32GB | |||||
2021/ICDAR | MIDV-LAIT | Det. & Rec. & Others | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | multi-lingual | Document | |||||||
2020/ICPR | AcTiVComp | Det. & Rec. | 2557 frames | Line | Rect [x1, y1, x2, y2, "transcript"] | Arabic | ||||||
Synthetic Text | ||||||||||||
Year/Venue | Name | Task | #Train | #Val | #Test | Granu. | Anno. Form | Language | Scene | Paper | Size | |
2016/CVPR | Synth800k | Det. & Rec. | 858,750 (7,266,866) | Char & Word & Line | Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans'] | English | Synthetic | 41GB | ||||
2020 | UnrealText | 728,000 En + 674,000 others | multi-lingual | |||||||||
- | Chinese_ocr | Det. & Rec. | ~ 364 million | Chinese | Document | |||||||
- | UPTI | Urdu | ||||||||||
- | APTI | 45,313,600 (> 250 million chars) | Word | Arabic | ||||||||
2021/ICDAR | SynthTiger | Rec. | ||||||||||
2021/ICDAR | DocSynth |
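
Many rows in the tables above list the annotation form `Quad [x1, y1, x2, y2, x3, y3, x4, y4, 'trans']`. As a rough illustration, the Python sketch below parses one such annotation, assuming it is stored as a single comma-separated text line in that order. That file layout is an assumption, not something the tables specify, and the function name `parse_quad_line` is hypothetical. Because a transcript may itself contain commas, the line is split at most eight times, so everything after the eighth comma is kept as the transcript.

```python
from typing import List, Tuple


def parse_quad_line(line: str) -> Tuple[List[Tuple[float, float]], str]:
    """Parse 'x1,y1,x2,y2,x3,y3,x4,y4,transcript' into (points, transcript).

    Assumes a comma-separated layout; real datasets may use other
    separators or extra fields (e.g. a language tag), so adapt as needed.
    """
    # Drop a possible UTF-8 byte-order mark and the trailing newline.
    cleaned = line.lstrip("\ufeff").rstrip("\r\n")
    # Split at most 8 times so commas inside the transcript survive.
    parts = cleaned.split(",", 8)
    if len(parts) != 9:
        raise ValueError(f"expected 8 coordinates and a transcript, got: {line!r}")
    coords = [float(p) for p in parts[:8]]
    points = list(zip(coords[0::2], coords[1::2]))
    return points, parts[8]


# Example with a made-up annotation line:
points, transcript = parse_quad_line("10,20,110,20,110,50,10,50,Hello, world\n")
```

Datasets whose Quad form includes an extra field, such as the `Lan` entry in the MLT rows, would need one more split and an additional return value.
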