Saving the images OCRmyPDF temporarily creates, OR reusing existing PDF-to-image output #1457
You could create an OCRmyPDF plugin that uses Surya as its OCR engine instead of Tesseract, for example. The latest LLM-based OCR often has goals that conflict with those of conventional OCR: it is strongly focused on accuracy, which is great, but not on pixel-perfect text positioning or grouping text into paragraphs. In particular, Surya only obtains text-line-level positioning, so for typical variable-width fonts you'd see significant misalignment if it were used to generate a PDF. You could also have your OCR plugin run both Surya and Tesseract, keeping Tesseract's output for PDF generation and setting Surya's output aside for use elsewhere.
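As a rough illustration of the plugin idea, here is a minimal sketch that does not replace the OCR engine at all. It only keeps a copy of each page image OCRmyPDF is about to send to OCR, so that Surya (or anything else) can be run on those files separately. It relies on OCRmyPDF's `filter_ocr_image` plugin hook and on `page.pageno`, which I believe is zero-based. The `OCR_IMAGE_EXPORT_DIR` environment variable, the `page-0001.png` naming, and the `export_path` helper are assumptions made up for this example, not OCRmyPDF settings. The `ImportError` fallback is only there so the file can be read or imported where OCRmyPDF is not installed.

```python
"""Sketch: OCRmyPDF plugin that keeps a copy of each page image sent to OCR."""
import os
from pathlib import Path

try:
    from ocrmypdf import hookimpl
except ImportError:  # allows importing this sketch where OCRmyPDF is absent
    def hookimpl(func):
        return func

# Hypothetical environment variable for this sketch; not an OCRmyPDF option.
EXPORT_DIR_ENV = "OCR_IMAGE_EXPORT_DIR"


def export_path(pageno: int, export_dir: Path) -> Path:
    """Build a 1-based file name from a zero-based page number (assumed)."""
    return export_dir / f"page-{pageno + 1:04d}.png"


@hookimpl
def filter_ocr_image(page, image):
    """Save the image OCRmyPDF is about to OCR, then pass it on unchanged."""
    export_dir = Path(os.environ.get(EXPORT_DIR_ENV, "ocr-images"))
    export_dir.mkdir(parents=True, exist_ok=True)
    # `image` is a Pillow image; the format is inferred from the .png suffix.
    image.save(export_path(page.pageno, export_dir))
    return image
```

If this were saved as `save_ocr_images.py`, it could be loaded with `ocrmypdf --plugin save_ocr_images.py input.pdf output.pdf`. Keep in mind that the image seen by this hook has already been through OCRmyPDF's preprocessing and may differ from a plain rasterization of the page, so it is worth checking that it suits Surya before relying on it.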
Hi everyone,
I currently have an OCR workflow that uses Surya OCR for text and table recognition, generating text files and CSVs from scanned semi-technical PDF documents. These outputs are then fed into an LLM data extraction tool. While Surya performs well, it doesn't create searchable PDFs. To address this, I run OCRmyPDF on the original PDF as a final step to generate a searchable PDF.
However, this approach is inefficient, because both Surya OCR and OCRmyPDF perform their own PDF-to-image conversion and preprocessing. To optimize the workflow, I'm looking for a way to either:
a) Provide the images generated by Surya OCR as input to OCRmyPDF, avoiding the need for OCRmyPDF to convert the PDF to images;
OR
b) Change the order of my workflow to run OCRmyPDF first, retaining the images it creates during its pipeline. Then, modify the Surya OCR workflow to use these pre-generated images as input.
The goal is to perform the PDF to image conversion only once throughout the entire workflow.
If anyone has experience with modifying OCRmyPDF to achieve either option (a) or (b), or has an alternative suggestion for streamlining this process, I would greatly appreciate your insights and advice.
Thank you in advance for your help!
Best regards,
Jasper
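Independently of which tool does the rasterizing, the "convert only once" goal from the post can be sketched as a small on-disk cache: each page image is produced the first time it is needed and reused afterwards by every later step. Everything here is hypothetical scaffolding. `rasterize_once`, the file naming, and the `rasterize` callable (which stands in for whatever actually renders a page) are assumptions for illustration, not part of OCRmyPDF or Surya.

```python
from pathlib import Path
from typing import Callable, List


def rasterize_once(
    pdf: Path,
    page_count: int,
    cache_dir: Path,
    rasterize: Callable[[Path, int, Path], None],
) -> List[Path]:
    """Return one image path per page, rendering only pages not yet cached.

    `rasterize(pdf, pageno, target)` is a caller-supplied function that
    writes the image for zero-based page `pageno` to `target`.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    images = []
    for pageno in range(page_count):
        target = cache_dir / f"{pdf.stem}-page-{pageno + 1:04d}.png"
        if not target.exists():
            # First request for this page: do the expensive conversion.
            rasterize(pdf, pageno, target)
        images.append(target)
    return images
```

Both OCR steps would then call `rasterize_once` with the same `cache_dir`, so the second caller simply receives the paths the first one created.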