diff --git a/docs/apiref.rst b/docs/apiref.rst index 766c98863..48cf7a293 100644 --- a/docs/apiref.rst +++ b/docs/apiref.rst @@ -3,11 +3,11 @@ .. SPDX-License-Identifier: CC-BY-SA-4.0 ============= -API Reference +API reference ============= This page summarizes the rest of the public API. Generally speaking this -should mainly of interest to plugin developers. +should be mainly of interest to plugin developers. ocrmypdf ======== diff --git a/docs/contributing.rst b/docs/contributing.rst index 40a81e5c9..e1c8e138b 100644 --- a/docs/contributing.rst +++ b/docs/contributing.rst @@ -19,7 +19,7 @@ Code style ========== We use PEP8, ``black`` for code formatting and ``ruff`` for everything else. The -settings for these programs are in ``pyproject.toml`` and ``setup.cfg``. Pull +settings for these programs are in ``pyproject.toml``. Pull requests should follow the style guide. One difference we use from "black" style is that strings shown to the user are always in double quotes (``"``) and strings for internal uses are in single quotes (``'``). diff --git a/docs/introduction.rst b/docs/introduction.rst index 8a487f615..53f069688 100644 --- a/docs/introduction.rst +++ b/docs/introduction.rst @@ -6,23 +6,23 @@ Introduction ============ -OCRmyPDF is an application and library that adds text "layers" to images -in PDFs, making scanned image PDFs searchable. It uses OCR to guess what text -is contained in images. It is written in Python. OCRmyPDF supports plugins -that allow customization of its processing steps, and is very tolerant of -PDFs that contain scanned images and "born digital" content that needs no -text recognition. +OCRmyPDF is a Python application and library that adds text "layers" to images in +PDFs, making scanned image PDFs searchable. It uses OCR to guess the text +contained in images. 
OCRmyPDF also supports plugins +that enable customization of its processing steps, and it is highly tolerant +of PDFs containing scanned images and "born digital" content that doesn't +require text recognition. About OCR ========= `Optical character recognition `__ -is technology that converts images of typed or handwritten text, such as -in a scanned document, to computer text that can be selected, searched and copied. +is a technology that converts images of typed or handwritten text, such as +in a scanned document, into computer text that can be selected, searched and copied. OCRmyPDF uses -`Tesseract `__, the best +`Tesseract `__, a widely available open source OCR engine, to perform OCR. .. _raster-vector: @@ -30,19 +30,19 @@ available open source OCR engine, to perform OCR. About PDFs ========== -PDFs are page description files that attempts to preserve a layout +PDFs are page description files that attempt to preserve a layout exactly. They contain `vector graphics `__ -that can contain raster objects such as scanned images. Because PDFs can +that can contain raster objects, such as scanned images. Because PDFs can contain multiple pages (unlike many image formats) and can contain fonts -and text, it is a good format for exchanging scanned documents. +and text, they are a suitable format for exchanging scanned documents. |image| -A PDF page might contain multiple images, even if it only appears to -have one image. Some scanners or scanning software will segment pages -into monochromatic text and color regions for example, to improve the -compression ratio and appearance of the page. +A PDF page may contain multiple images, even if it appears to have only +one image. Some scanners or scanning software may segment pages into +monochromatic text and color regions, for example, to enhance the compression +ratio and appearance of the page. Rasterizing a PDF is the process of generating corresponding raster images. 
OCR engines like Tesseract work with images, not scalable vector graphics @@ -54,147 +54,131 @@ About PDF/A `PDF/A `__ is an ISO-standardized subset of the full PDF specification that is designed for archiving (the 'A' stands for Archive). PDF/A differs from PDF primarily by omitting -features that would make it difficult to read the file in the future, +features that could complicate future file readability, such as embedded Javascript, video, audio and references to external fonts. All fonts and resources needed to interpret the PDF must be contained within it. Because PDF/A disables Javascript and other types -of embedded content, it is probably more secure. +of embedded content, it is likely more secure. There are various conformance levels and versions, such as "PDF/A-2b". -Generally speaking, the best format for scanned documents is PDF/A. Some +In general, the preferred format for scanned documents is PDF/A. Some governments and jurisdictions, US Courts in particular, `mandate the use of PDF/A `__ for scanned documents. -Since most people who scan documents are interested in reading them -indefinitely into the future, OCRmyPDF generates PDF/A-2b by default. +Since most individuals scanning documents aim for long-term readability, +OCRmyPDF defaults to generating PDF/A-2b. -PDF/A has a few drawbacks. Some PDF viewers include an alert that the -file is a PDF/A, which may confuse some users. It also tends to produce -larger files than PDF, because it embeds certain resources even if they -are commonly available. PDF/A files can be digitally signed, but may not -be encrypted, to ensure they can be read in the future. Fortunately, -converting from PDF/A to a regular PDF is trivial, and any PDF viewer -can view PDF/A. +PDF/A does have a few drawbacks. Some PDF viewers display an alert +indicating that the file is in PDF/A format, which may confuse some users. 
+Additionally, it tends to result in larger files than standard PDFs because +it embeds certain resources, even if they are widely available. PDF/A +files can be digitally signed but may not be encrypted to ensure future +readability. Fortunately, converting from PDF/A to a regular PDF is +straightforward, and any PDF viewer can handle PDF/A files. What OCRmyPDF does ================== -OCRmyPDF analyzes each page of a PDF to determine the colorspace and -resolution (DPI) needed to capture all of the information on that page -without losing content. It uses -`Ghostscript `__ to rasterize the page, and -then performs OCR on the rasterized image to create an OCR "layer". -The layer is then grafted back onto the original PDF. +OCRmyPDF analyzes each page of a PDF to determine the required colorspace +and resolution (DPI) for capturing all the information on that page without +losing content. It uses +`Ghostscript `__ to rasterize each page and subsequently +performs OCR on the rasterized image to generate an OCR "layer." This layer +is then integrated back into the original PDF. -While one can use a program like Ghostscript or ImageMagick to get an -image and put the image through Tesseract, that actually creates a new -PDF and many details may be lost. OCRmyPDF can produce a minimally -changed PDF as output. +While it is possible to use a program like Ghostscript or ImageMagick to +obtain an image and then run that image through Tesseract OCR, this process +actually generates a new PDF, potentially resulting in the loss of various +details (such as the document's metadata). In contrast, OCRmyPDF can produce +a minimally altered PDF as the output. -OCRmyPDF also provides some image processing options, like deskew, which -improves the appearance of files and quality of OCR. When these are used, -the OCR layer is grafted onto the processed image instead. 
+OCRmyPDF also offers several image processing options, such as deskew, which +enhances the visual quality of files and the accuracy of OCR. When these +options are utilized, the OCR layer is integrated into the processed image. -By default, OCRmyPDF produces archival PDFs – PDF/A, which are a -stricter subset of PDF features designed for long term archives. If -regular PDFs are desired, this can be disabled with -``--output-type pdf``. +By default, OCRmyPDF generates archival PDFs in the PDF/A format, which is +a more rigid subset of PDF features designed for long-term archives. If you +prefer regular PDFs, you can disable this feature using the +``--output-type pdf`` option. Why you shouldn't do this manually ================================== A PDF is similar to an HTML file, in that it contains document structure -along with images. Sometimes a PDF does nothing more than present a full -page image, but often there is additional content that would be lost. - -A manual process could work like either of these: - -1. Rasterize each page as an image, OCR the images, and combine the - output into a PDF. This preserves the layout of each page, but - resamples all images (possibly losing quality, increasing file size, - introducing compression artifacts, etc.). -2. Extract each image, OCR, and combine the output into a PDF. This - loses the context in which images are used in the PDF, meaning that - cropping, rotation and scaling of pages may be lost. Some scanned - PDFs use multiple images segmented into black and white, grayscale +along with images. While some PDFs may solely display a full-page image, +they often contain additional content that would be forfeited if not preserved. + +A manual process could take one of these approaches: + +1. Rasterize each page as an image, perform OCR on the images, and then merge the + output into a PDF. 
This method preserves the layout of each page, but + resamples all images, potentially leading to quality loss, increased file size, + and the introduction of compression artifacts, among other issues. +2. Extract each image, perform OCR, and combine the output into a PDF. This approach + loses the context in which images are used in the PDF, potentially resulting + in loss of information related to scaling and position of images. Some scanned + PDFs contain multiple images segmented into black and white, grayscale and color regions, with stencil masks to prevent overlap, as this can - enhance the appearance of a file while reducing file size. Clearly, - reassembling these images will be easy. This also loses and text or - vector art on any pages in a PDF with both scanned and pure digital - content. - -In the case of a PDF that is nothing other than a container of images -(no rotation, scaling, cropping, one image per page), the second -approach can be lossless. - -OCRmyPDF uses several strategies depending on input options and the -input PDF itself, but generally speaking it rasterizes a page for OCR -and then grafts the OCR back onto the original. As such it can handle -complex PDFs and still preserve their contents as much as possible. - -OCRmyPDF also supports a many, many edge cases that have cropped over -several years of development. We support PDF features like images inside -of Form XObjects, and pages with UserUnit scaling. We support rare image -formats like non-monochrome 1-bit images. We warn about files you may -not to OCR. Thanks to pikepdf and QPDF, we auto-repair PDFs that are -damaged. (Not that you need to know what any of these are! You should be -able to throw any PDF at it.) + enhance the appearance of a file while reducing file size. + Reassembling these images can be challenging, and risks losing vector art + or text that is not part of an image. 
+ +In cases where a PDF solely serves as a container for images without any +rotation, scaling, or cropping, the second approach can be lossless. + +OCRmyPDF uses various strategies depending on input options and the input PDF +itself. Generally, it rasterizes a page for OCR and then integrates the OCR +data back into the original PDF. This approach allows it to handle complex +PDFs and preserve their content as much as possible. + +Furthermore, OCRmyPDF supports a wide range of edge cases that have emerged +during several years of development. It accommodates PDF features like +images within Form XObjects and pages with UserUnit scaling. It also +supports less common image formats like non-monochrome 1-bit images and +provides warnings about files you may not want to OCR. Thanks to tools +like pikepdf and QPDF, it can auto-repair damaged PDFs. You don't need to +understand the intricacies of these issues; you should be able to use +OCRmyPDF with any PDF file, and expect reasonable results. Limitations =========== -OCRmyPDF is limited by the Tesseract OCR engine. As such it experiences -these limitations, as do any other programs that rely on Tesseract: - -- The OCR is not as accurate as commercial OCR solutions. -- It is not capable of recognizing handwriting. -- It may find gibberish and report this as OCR output. -- If a document contains languages outside of those given in the - ``-l LANG`` arguments, results may be poor. -- It is not always good at analyzing the natural reading order of - documents. For example, it may fail to recognize that a document - contains two columns, and may try to join text across columns. -- Poor quality scans may produce poor quality OCR. Garbage in, garbage - out. -- It does not expose information about what font family text belongs - to. - -OCRmyPDF is also limited by the PDF specification: - -- PDF encodes the position of text glyphs but does not encode document - structure. 
There is no markup that divides a document in sections, - paragraphs, sentences, or even words (since blank spaces are not - represented). As such all elements of document structure including - the spaces between words must be derived heuristically. Some PDF - viewers do a better job of this than others. -- Because some popular open source PDF viewers have a particularly hard - time with spaces between words, OCRmyPDF appends a space to each text - element as a workaround (when using ``--pdf-renderer hocr``). While - this mixes document structure with graphical information that ideally - should be left to the PDF viewer to interpret, it improves - compatibility with some viewers and does not cause problems for - better ones. +OCRmyPDF is subject to limitations imposed by the Tesseract OCR engine. +These limitations are inherent to any software relying on Tesseract: + +- The OCR accuracy may not match that of commercial OCR solutions. +- It is incapable of recognizing handwriting. +- It may detect gibberish and report it as OCR output. +- Results may be subpar when a document contains languages not specified + in the ``-l LANG`` argument. +- Tesseract may struggle to analyze the natural reading order of documents. + For instance, it might fail to recognize two columns in a document and + attempt to join text across columns. +- Poor quality scans can result in subpar OCR quality. In other words, the + quality of the OCR output depends on the quality of the input. +- Tesseract does not provide information about the font family to which text + belongs. +- Tesseract does not divide text into paragraphs or headings. It only provides + the text and its bounding box. As such, the generated PDF does not + contain any information about the document's structure. Ghostscript also imposes some limitations: -- PDFs containing JBIG2-encoded content will be converted to CCITT - Group4 encoding, which has lower compression ratios, if Ghostscript - PDF/A is enabled. 
-- PDFs containing JPEG 2000-encoded content will be converted to JPEG +- PDFs containing JPEG 2000-encoded content may be converted to JPEG encoding, which may introduce compression artifacts, if Ghostscript PDF/A is enabled. -- Ghostscript may transcode grayscale and color images, either lossy to - lossless or lossless to lossy, based on an internal algorithm. This +- Ghostscript may transcode grayscale and color images, potentially + lossily, based on an internal algorithm. This behavior can be suppressed by setting ``--pdfa-image-compression`` to ``jpeg`` or ``lossless`` to set all images to one type or the other. - Ghostscript has no option to maintain the input image's format. + Ghostscript lacks an option to maintain the input image's format. (Modern Ghostscript can copy JPEG images without transcoding them.) - Ghostscript's PDF/A conversion removes any XMP metadata that is not one of the standard XMP metadata namespaces for PDFs. In particular, PRISM Metadata is removed. -- Ghostscript's PDF/A conversion seems to remove or deactivate +- Ghostscript's PDF/A conversion may remove or deactivate hyperlinks and other active content. You can use ``--output-type pdf`` to disable PDF/A conversion and produce @@ -202,7 +186,7 @@ a standard, non-archival PDF. Regarding OCRmyPDF itself: -- PDFs that use transparency are not currently represented in the test +- PDFs using transparency are not currently represented in the test suite Similar programs @@ -210,11 +194,7 @@ Similar programs To the author's knowledge, OCRmyPDF is the most feature-rich and thoroughly tested command line OCR PDF conversion tool. If it does not -meet your needs, contributions and suggestions are welcome. If not, -consider one of these similar open source programs: - -- pdf2pdfocr -- pdfsandwich +meet your needs, contributions and suggestions are welcome. Ghostscript recently added three "pdfocr" output devices. 
They work by rasterizing all content and converting all pages to a single colour space. @@ -222,16 +202,19 @@ rasterizing all content and converting all pages to a single colour space. Web front-ends ============== -The Docker image ``ocrmypdf`` provides a web service front-end -that allows files to submitted over HTTP and the results "downloaded". -This is an HTTP server intended to simplify web services deployments; it -is not intended to be deployed on the public internet and no real -security measures to speak of. +The Docker image of OCRmyPDF provides a web service front-end +that allows files to be submitted over HTTP, and the results can be downloaded. +This is an HTTP server intended to demonstrate how OCRmyPDF can be +integrated into a web service. It is not intended to be deployed on the +public internet and does not provide any security measures. In addition, the following third-party integrations are available: +- `Paperless-ngx `__ is a free software + document management system that uses OCRmyPDF to perform OCR on + uploaded documents. - `Nextcloud OCR `__ is a free software - plugin for the Nextcloud private cloud software + plugin for the Nextcloud private cloud software. OCRmyPDF is not designed to be secure against malware-bearing PDFs (see `Using OCRmyPDF online `__). Users should ensure they diff --git a/docs/jbig2.rst b/docs/jbig2.rst index befd7aadb..cc447d3aa 100644 --- a/docs/jbig2.rst +++ b/docs/jbig2.rst @@ -14,17 +14,20 @@ expired as of 2017, but it is possible that unknown patents exist. JBIG2 encoding is recommended for OCRmyPDF and is used to losslessly create smaller PDFs. If JBIG2 encoding is not available, lower quality -encodings will be used. +CCITT encoding will be used for monochrome images. JBIG2 decoding is not patented and is performed automatically by most PDF viewers. It is widely supported and has been part of the PDF specification since 2001. -On macOS, Homebrew packages jbig2enc and OCRmyPDF includes it by -default. 
The Docker image for OCRmyPDF also builds its own JBIG2 encoder -from source. +JBIG2 encoding is automatically provided by these OCRmyPDF packages: +- Docker image (both Ubuntu and Alpine) +- Snap package +- Arch Linux AUR package +- Alpine Linux package +- Homebrew on macOS -For all other Linux, you must build a JBIG2 encoder from source: +For all other platforms, you would need to build the JBIG2 encoder from source: .. code-block:: bash @@ -43,16 +46,21 @@ as libtool and leptonica-devel. Lossy mode JBIG2 ================ -OCRmyPDF provides lossy mode JBIG2 as an advanced feature. Users should +OCRmyPDF provides lossy mode JBIG2 as an advanced and potentially dangerous +feature. Users should `review the technical concerns with JBIG2 in lossy mode `__ -and decide if this feature is acceptable for their use case. +and decide if this feature is acceptable for their use case. In general, +this mode should not be used for archival purposes, should not be used when +the original document is not available or will be destroyed, and should +not be used when numbers present in the document are important, because +there is a risk of 6/8 and 8/6 substitution errors. JBIG2 lossy mode does achieve higher compression ratios than any other monochrome (bitonal) compression technology; for large text documents the savings are considerable. JBIG2 lossless still gives great compression ratios and is a major improvement over the older CCITT G4 -standard. As explained above, there is some risk of substitution errors. +standard. To turn on JBIG2 lossy mode, add the argument ``--jbig2-lossy``. ``--optimize {1,2,3}`` are necessary for the argument to take effect
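The pairing of ``--jbig2-lossy`` with ``--optimize {1,2,3}`` described above can be sketched as a small Python helper that assembles the command line. This is only an illustration, not part of OCRmyPDF: the function name ``lossy_jbig2_command`` and the file names are hypothetical, and only the two flags quoted in the text are used.

```python
def lossy_jbig2_command(input_pdf, output_pdf, optimize=1):
    """Return an argument list that runs ocrmypdf with lossy JBIG2 enabled.

    Hypothetical helper for illustration. The docs state that
    --jbig2-lossy only takes effect together with --optimize 1, 2 or 3,
    so any other level is rejected here rather than silently ignored.
    """
    if optimize not in (1, 2, 3):
        raise ValueError("--jbig2-lossy requires --optimize 1, 2 or 3")
    return [
        "ocrmypdf",
        "--optimize", str(optimize),
        "--jbig2-lossy",
        input_pdf,
        output_pdf,
    ]
```

The returned list can be passed to ``subprocess.run(cmd, check=True)`` on a system where ``ocrmypdf`` and a JBIG2 encoder are installed. Given the substitution risk noted above, this is best tried on copies of documents whose originals are retained.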