Optical Character Recognition (OCR)

clawsoftware edited this page Mar 21, 2023 · 4 revisions

Intro

Since version 0.8.7, clawPDF has built-in text recognition (OCR). It can convert any document to plain text or create a PDF with a text overlay. By default, English, Spanish, French and German are recognized. To install additional languages, see the section Setup of additional languages.

Demo

[Image: OCR demo]

Setup of additional languages

To add a language, download its language file (a .traineddata file) from the Tesseract tessdata_best or tessdata_fast page and copy it to the C:\Program Files (x86)\clawpdf\tessdata folder.
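The copy step can also be scripted. The following is a minimal Python sketch, not something shipped with clawPDF: the helper name install_language_file is hypothetical, and the default folder is the one named above, which assumes a standard installation. Writing to a folder under Program Files typically requires administrator rights.

```python
import shutil
from pathlib import Path

# Folder named on this page; adjust it if clawPDF is installed elsewhere.
TESSDATA_DIR = Path(r"C:\Program Files (x86)\clawpdf\tessdata")


def install_language_file(downloaded_file, tessdata_dir=TESSDATA_DIR):
    """Copy one downloaded .traineddata file into the tessdata folder.

    Hypothetical helper for illustration; it is not part of clawPDF.
    Returns the path of the copied file.
    """
    source = Path(downloaded_file)
    # Tesseract language files use the .traineddata extension.
    if source.suffix != ".traineddata":
        raise ValueError(f"not a Tesseract language file: {source.name}")
    if not source.is_file():
        raise FileNotFoundError(source)

    target_dir = Path(tessdata_dir)
    if not target_dir.is_dir():
        raise NotADirectoryError(target_dir)

    # Keep the original file name, since it carries the language abbreviation.
    target = target_dir / source.name
    shutil.copy2(source, target)
    return target


# Example call (the download path is an assumption):
# install_language_file(r"C:\Users\me\Downloads\ita.traineddata")
```

The file name is left unchanged on purpose, because the abbreviation in it (for example ita) is what has to be entered in the OCR tab.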

Correct use

  • To convert documents to text, select the OCR/TXT (print as text) profile or the OCR/TXT format.
  • To print any document to PDF with text overlay, choose the PDF/OCR (overlay with text) profile or the PDF/{color}-OCR format.
  • In the OCR tab, ensure that the language to be recognized matches the abbreviation of the language file (for example, eng for eng.traineddata).

Troubleshooting

The whole text is recognized incorrectly

Make sure that the document is printed in portrait orientation. The orientation has to be set at the start of the print job, in the Windows printer dialog.

Some characters are misrecognized

Make sure that the language file is present in the C:\Program Files (x86)\clawpdf\tessdata folder and that the correct language is set in the OCR tab.
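Both conditions can be checked quickly with a small script. This is a rough Python sketch under two assumptions: the folder is the default one named above, and each language file follows the Tesseract naming scheme of abbreviation plus .traineddata. The function names are hypothetical and not part of clawPDF.

```python
from pathlib import Path

# Folder named on this page; adjust it if clawPDF is installed elsewhere.
TESSDATA_DIR = Path(r"C:\Program Files (x86)\clawpdf\tessdata")


def installed_languages(tessdata_dir=TESSDATA_DIR):
    """Return the sorted abbreviations of all language files in the folder."""
    # "eng.traineddata" has the stem "eng", which is the abbreviation.
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))


def has_language(abbreviation, tessdata_dir=TESSDATA_DIR):
    """Check whether the language file for one abbreviation is present."""
    return (Path(tessdata_dir) / f"{abbreviation}.traineddata").is_file()


# Example calls:
# print(installed_languages())
# print(has_language("deu"))
```

If the abbreviation set in the OCR tab does not appear in the list returned by installed_languages, the matching language file is most likely missing from the folder.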