From 6a8eb7daaad60d36b0667e197d498ba59308f554 Mon Sep 17 00:00:00 2001
From: James Barlow
Date: Wed, 26 Jun 2024 01:16:31 -0700
Subject: [PATCH] docs: page seg mode

---
 docs/advanced.rst | 53 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 53 insertions(+)

diff --git a/docs/advanced.rst b/docs/advanced.rst
index 7a37ad030..7fbb7740b 100644
--- a/docs/advanced.rst
+++ b/docs/advanced.rst
@@ -228,6 +228,59 @@ then run ocrmypdf as follows (along with any other desired arguments):
 Some combinations of control parameters will break Tesseract or break
 assumptions that OCRmyPDF makes about Tesseract's output.
 
+Changing page segmentation mode
+-------------------------------
+
+The directive ``--tesseract-pagesegmode Nmode`` forwards the desired page segmentation
+mode to Tesseract OCR. The default is 3.
+
+Page segmentation can improve OCR results when you know that a PDF ought to be
+analyzed a particular way, such as PDFs whose pages contain only a single line of
+text. For the vast majority of users, changing the page segmentation mode will only
+make things worse.
+
+As of June 2024, the Tesseract page segmentation modes are:
+
++-----+----------------------------------------------------------------------------------+
+| ID  | Description                                                                      |
++=====+==================================================================================+
+| 0   | Orientation and script detection (OSD) only.                                     |
++-----+----------------------------------------------------------------------------------+
+| 1   | Automatic page segmentation with OSD.                                            |
++-----+----------------------------------------------------------------------------------+
+| 2   | Automatic page segmentation, but no OSD, or OCR. (not implemented)               |
++-----+----------------------------------------------------------------------------------+
+| 3   | Fully automatic page segmentation, but no OSD. (Default)                         |
++-----+----------------------------------------------------------------------------------+
+| 4   | Assume a single column of text of variable sizes.                                |
++-----+----------------------------------------------------------------------------------+
+| 5   | Assume a single uniform block of vertically aligned text.                        |
++-----+----------------------------------------------------------------------------------+
+| 6   | Assume a single uniform block of text.                                           |
++-----+----------------------------------------------------------------------------------+
+| 7   | Treat the image as a single text line.                                           |
++-----+----------------------------------------------------------------------------------+
+| 8   | Treat the image as a single word.                                                |
++-----+----------------------------------------------------------------------------------+
+| 9   | Treat the image as a single word in a circle.                                    |
++-----+----------------------------------------------------------------------------------+
+| 10  | Treat the image as a single character.                                           |
++-----+----------------------------------------------------------------------------------+
+| 11  | Sparse text. Find as much text as possible in no particular order.               |
++-----+----------------------------------------------------------------------------------+
+| 12  | Sparse text with OSD.                                                            |
++-----+----------------------------------------------------------------------------------+
+| 13  | Raw line. Treat the image as a single text line, bypassing hacks that are        |
+|     | Tesseract-specific.                                                              |
++-----+----------------------------------------------------------------------------------+
+
+Modes 0, 1, 2, and 12 (all of those that enable orientation and script detection)
+are not compatible with OCRmyPDF, which performs OSD in a separate step from OCR.
+Their use may interfere with ``--rotate-pages`` and other features.
+
+It is currently not possible to use advanced Tesseract OCR features, such as creating
+OCR information, when using Tesseract through OCRmyPDF.
+
 Changing the PDF renderer
 =========================
 