This guide provides instructions on how to perform Optical Character Recognition (OCR) to convert images to text using Tesseract-OCR and Python.
- Install Tesseract-OCR:
sudo apt install tesseract-ocr
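- Download the Bulgarian language data file: bul.traineddata
- Move the downloaded file to the Tesseract-OCR data directory (the path below is for Tesseract 4.00 and may differ for other versions; writing to it requires root):
sudo mv bul.traineddata /usr/share/tesseract-ocr/4.00/tessdata/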
- Install jq:
sudo apt install jq
- Install the required Python packages (argparse is part of the Python 3 standard library and does not need to be installed separately):
pip3 install pytesseract pillow pandas pdf2image
- Install ImageMagick:
sudo apt install imagemagick
- Add this policy in /etc/ImageMagick-6/policy.xml:
- Convert the PDF to a PNG image at 300 DPI:
convert -density 300 input.pdf output.png
- Run the Python script to OCR the PDF file:
python3 your_script.py your_json_file.json --delete
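
The guide does not show the contents of your_script.py, so the following is only a minimal sketch of what such a script might look like, not the actual implementation. It assumes a hypothetical JSON schema of the form {"images": ["page-0.png", ...]} and assumes that --delete means "remove each image once its text has been saved". The --lang option and all function names are likewise illustrative. Only pytesseract.image_to_string and PIL's Image.open are real library calls.

```python
import argparse
import json
from pathlib import Path


def build_parser():
    """CLI mirroring the command above: a JSON file plus an optional --delete flag."""
    parser = argparse.ArgumentParser(description="OCR the images listed in a JSON file.")
    parser.add_argument("json_file", type=Path,
                        help="JSON file listing the images (hypothetical schema)")
    parser.add_argument("--delete", action="store_true",
                        help="remove each image after OCR (assumed meaning of the flag)")
    parser.add_argument("--lang", default="bul",
                        help="Tesseract language code (illustrative option)")
    return parser


def load_image_paths(json_file):
    """Read the assumed schema {"images": [...]} and return a list of Paths."""
    with open(json_file, encoding="utf-8") as fh:
        data = json.load(fh)
    return [Path(p) for p in data["images"]]


def ocr_image(path, lang="bul"):
    """Run Tesseract on one image and return the recognized text."""
    # Imported lazily so the CLI helpers can be used without the OCR stack installed.
    import pytesseract
    from PIL import Image

    with Image.open(path) as img:
        return pytesseract.image_to_string(img, lang=lang)


def main(argv=None):
    args = build_parser().parse_args(argv)
    for path in load_image_paths(args.json_file):
        text = ocr_image(path, lang=args.lang)
        # Save the text next to the image, e.g. page-0.png -> page-0.txt.
        path.with_suffix(".txt").write_text(text, encoding="utf-8")
        if args.delete:
            path.unlink()
```

To run this from the command line as in the step above, the file would also need to call main() at the bottom, typically behind an `if __name__ == "__main__":` guard. Passing lang="bul" relies on bul.traineddata being present in the Tesseract data directory.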