Pretrained models for the ANN-based post-correction component.
All models use cor-asv-ann-train --width 512 --depth 2 and were initialised with weights from a language model via --init-model, then pretrained on 200k lines of clean text (input=output) from DTA, and then retrained via --load-model on GT4HistOCR and OCR-D GT, processed by various OCR models (input=OCR with confidence, output=GT). This last retraining step is allowed to change all weights (not just fine-tuning) and does not reset the encoder layer weights (i.e. --reset-encoder is not used).
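The two training stages described above can be sketched as command lines. This is a minimal illustration, not the exact invocation used for these models: it only builds and prints the commands without running them, all file names are hypothetical placeholders, and the --save-model option is assumed here for naming the output file (check cor-asv-ann-train --help before relying on it).

```python
import shlex

# Options shared by all models, as stated in the text.
COMMON = ["cor-asv-ann-train", "--width", "512", "--depth", "2"]


def pretrain_command(lm_weights, clean_text, out_model):
    """Stage 1: initialise from a language model, train on clean text (input=output)."""
    # --save-model is an assumption; the source only names --init-model.
    return COMMON + ["--init-model", lm_weights, "--save-model", out_model, clean_text]


def retrain_command(pretrained_model, ocr_gt_data, out_model):
    """Stage 2: continue from the pretrained model on OCR/GT pairs.

    --reset-encoder is deliberately absent, so the encoder weights are kept
    and all weights remain trainable.
    """
    return COMMON + ["--load-model", pretrained_model, "--save-model", out_model, ocr_gt_data]


# Hypothetical file names, for illustration only.
stage1 = pretrain_command("lm.h5", "dta-clean-lines.txt", "pretrained.h5")
stage2 = retrain_command("pretrained.h5", "ocr-gt-pairs.pkl", "final.h5")

print(shlex.join(stage1))
print(shlex.join(stage2))
```

Keeping the commands as argument lists makes it easy to check that the second stage never passes --reset-encoder.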
- dta19.Fraktur4: on 19th century Fraktur texts (GT4HistOCR/corpus/dta19) for the Tesseract 4 model script/Fraktur
- pre19.Fraktur4: on 15-18th century blackletter texts (GT4HistOCR others, OCR-D) for the Tesseract 4 model script/Fraktur
- pre19.deu-frak3: on 15-18th century blackletter texts (GT4HistOCR others, OCR-D) for the Tesseract 3 model deu-frak
- pre19.Latin4: on 15-18th century blackletter texts (GT4HistOCR others, OCR-D) for the Tesseract 4 model script/Latin
- pre19.deu4: on 15-18th century blackletter texts (GT4HistOCR others, OCR-D) for the Tesseract 4 model deu
- pre19.incunabula: on 15-18th century blackletter texts (GT4HistOCR others, OCR-D) for the Ocropus 1 model incunabula.pyrnn (included in GT4HistOCR)
- pre19.latinhist: on 15-18th century blackletter texts (GT4HistOCR others, OCR-D) for the Ocropus 1 model latinhist.pyrnn (included in GT4HistOCR)
- gt4histocr.s-ſ: on GT4HistOCR ground truth degraded by replacing ſ with s in the input (encouraging the network to learn its reconstruction)
...
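The degradation used for the gt4histocr.s-ſ entry above can be illustrated with a small sketch. It only shows the idea of building an (input, output) pair from one ground-truth line. The function name is hypothetical, and the actual data preparation may differ (for example in how lines are stored or batched).

```python
from typing import Tuple


def degrade_long_s(gt_line: str) -> Tuple[str, str]:
    """Return an (input, output) training pair for one ground-truth line.

    The input has every long s (ſ, U+017F) replaced by a round s;
    the output is the untouched ground truth, so the network has to
    learn where ſ must be restored.
    """
    return gt_line.replace("ſ", "s"), gt_line


source, target = degrade_long_s("Waſſer")
print(source, "->", target)  # Wasser -> Waſſer
```

Lines without any ſ pass through unchanged, so they act as identity examples.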