OCR Image to Text Conversion

This guide explains how to convert images to text with Optical Character Recognition (OCR), using Tesseract-OCR and Python.

Tesseract-OCR Method

Installation

Install Tesseract-OCR: sudo apt install tesseract-ocr
Download the Bulgarian language dictionary: bul.traineddata
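Move the downloaded dictionary to the Tesseract-OCR data directory (the 4.00 component depends on the installed Tesseract version, and writing there usually requires sudo): sudo mv bul.traineddata /usr/share/tesseract-ocr/4.00/tessdata/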
Install jq: sudo apt install jq
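
To confirm the installation steps above took effect, a small check like the following may help. It is only a sketch: the function names and the list of extra tessdata directories are assumptions for illustration, and only the 4.00 path comes from this guide.

```python
import shutil
from pathlib import Path

# Candidate tessdata locations. The first is the path used in this guide;
# the others are assumptions about where other Tesseract versions keep data.
DEFAULT_TESSDATA_DIRS = [
    "/usr/share/tesseract-ocr/4.00/tessdata",
    "/usr/share/tesseract-ocr/5/tessdata",
    "/usr/share/tessdata",
]


def find_traineddata(lang="bul", search_dirs=DEFAULT_TESSDATA_DIRS):
    """Return the first existing <lang>.traineddata path, or None."""
    for directory in search_dirs:
        candidate = Path(directory) / f"{lang}.traineddata"
        if candidate.is_file():
            return candidate
    return None


def check_install(lang="bul", search_dirs=DEFAULT_TESSDATA_DIRS):
    """Collect human-readable problems; an empty list means both checks passed."""
    problems = []
    if shutil.which("tesseract") is None:
        problems.append("tesseract binary not found on PATH")
    if find_traineddata(lang, search_dirs) is None:
        problems.append(f"{lang}.traineddata not found in {list(search_dirs)}")
    return problems


if __name__ == "__main__":
    for problem in check_install():
        print("Problem:", problem)
```

If the script prints nothing, both the tesseract binary and bul.traineddata were found.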

Install Python Libraries

pip3 install pytesseract
pip3 install pillow
pip3 install pandas
pip3 install argparse
pip3 install pdf2image
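
With these libraries installed, a single image can be OCR'd in a few lines. The sketch below assumes the Bulgarian language data from the installation step is available; ocr_image and output_path_for are hypothetical helper names, not part of this project's script.

```python
from pathlib import Path


def output_path_for(image_path):
    """Place the text file next to the image, swapping the suffix for .txt."""
    return Path(image_path).with_suffix(".txt")


def ocr_image(image_path, lang="bul"):
    """Return the text Tesseract recognises in one image."""
    # Imported lazily so this module still loads where the OCR stack is missing.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang)


def ocr_to_file(image_path, lang="bul"):
    """OCR one image and write the result as UTF-8 text; return the text path."""
    target = output_path_for(image_path)
    target.write_text(ocr_image(image_path, lang=lang), encoding="utf-8")
    return target
```

For example, ocr_to_file("output.png") would write the recognised text to output.txt.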

Convert PDF into Images

Install ImageMagick: sudo apt install imagemagick
Add this policy in /etc/ImageMagick-6/policy.xml:
Convert the PDF to PNG images at 300 DPI: convert -density 300 input.pdf output.png
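
The same conversion can also be done from Python with the pdf2image library installed earlier, which avoids the ImageMagick policy edit but relies on the poppler utilities being installed. This is a sketch of an alternative, not the method this guide's script necessarily uses; pdf_to_pngs and page_filename are hypothetical names, and the output naming is an illustrative choice.

```python
from pathlib import Path


def page_filename(stem, index):
    """Name pages as <stem>-<index>.png, e.g. input-0.png for the first page."""
    return f"{stem}-{index}.png"


def pdf_to_pngs(pdf_path, out_dir, dpi=300):
    """Render every page of a PDF to a PNG file and return the written paths."""
    # Imported lazily so this module still loads where pdf2image is missing.
    from pdf2image import convert_from_path

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(pdf_path).stem

    written = []
    # convert_from_path returns one PIL image per page.
    for index, page in enumerate(convert_from_path(str(pdf_path), dpi=dpi)):
        target = out_dir / page_filename(stem, index)
        page.save(target, "PNG")
        written.append(target)
    return written
```

Calling pdf_to_pngs("input.pdf", "pages") would render each page at 300 DPI into the pages directory.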

Instructions

Run the Python script to OCR the PDF file:

python your_script.py your_json_file.json --delete
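
This guide does not show the contents of your_script.py or the JSON file, so the following is only a sketch of how such a command line could be wired up with argparse. The JSON layout (an "images" list of file paths) and the meaning of --delete (remove each image once its text has been written) are assumptions made for illustration.

```python
import argparse
import json
from pathlib import Path


def build_parser():
    """Command line: one positional JSON file plus an optional --delete flag."""
    parser = argparse.ArgumentParser(description="OCR the images listed in a JSON file.")
    parser.add_argument("json_file", help="path to the JSON job file")
    parser.add_argument(
        "--delete",
        action="store_true",
        help="(assumed meaning) remove each image after it has been processed",
    )
    return parser


def load_image_paths(json_file):
    """Read image paths from the hypothetical layout {"images": ["a.png", ...]}."""
    with open(json_file, encoding="utf-8") as handle:
        data = json.load(handle)
    return [Path(p) for p in data["images"]]


def main(argv=None):
    args = build_parser().parse_args(argv)

    # Imported lazily so argument parsing works even without the OCR stack.
    import pytesseract
    from PIL import Image

    for image_path in load_image_paths(args.json_file):
        with Image.open(image_path) as img:
            text = pytesseract.image_to_string(img, lang="bul")
        image_path.with_suffix(".txt").write_text(text, encoding="utf-8")
        if args.delete:
            image_path.unlink()  # assumed behaviour of --delete
```

To run it as in the command above, the script would end with a call to main() under the usual if __name__ == "__main__": guard.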