docs: update to discuss some v15 features not yet documented
jbarlow83 committed Sep 27, 2023
1 parent 01bbf7d commit 890b994
Showing 4 changed files with 35 additions and 11 deletions.
32 changes: 26 additions & 6 deletions docs/advanced.rst
@@ -117,14 +117,14 @@ exceed a certain number of megapixels with ``--skip-big``. (A 300 DPI,
OCR for huge images
-------------------

Separate from these settings, Tesseract has internal limits on the size
Tesseract has internal limits on the size
of images it will process. If you issue
``--tesseract-downsample-large-images``, OCRmyPDF will downsample images
to fit Tesseract limits. (The limits are usually encountered only for scanned
images of oversized media, such as large maps or blueprints exceeding
110 cm or 43 inches in either dimension, and at high DPI.)

``--tesseract-downsample-above`` adjusts the threshold at which images
``--tesseract-downsample-above Npixels`` adjusts the threshold at which images
will be downsampled. By default, only images that exceed any of Tesseract's
internal limits are downsampled.
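
As a rough illustration of how the two options above combine, the following
sketch builds an ``ocrmypdf`` command line in Python without running it. The
helper name, the file names, and the value ``5000`` are hypothetical and chosen
only for illustration; the two flags are the ones described above.

```python
import shlex


def downsample_command(input_pdf, output_pdf, above_pixels=None):
    """Build an ocrmypdf argv that enables downsampling of oversized images.

    Hypothetical helper: it only assembles arguments and does not run OCRmyPDF.
    """
    argv = ["ocrmypdf", "--tesseract-downsample-large-images"]
    if above_pixels is not None:
        # Optional: adjust the threshold at which images are downsampled.
        argv += ["--tesseract-downsample-above", str(above_pixels)]
    argv += [input_pdf, output_pdf]
    return argv


# Print a shell-quoted command line; 5000 is an arbitrary example value.
print(shlex.join(downsample_command("map.pdf", "map_ocr.pdf", above_pixels=5000)))
```

Omitting ``above_pixels`` leaves the default behavior in place, in which only
images that exceed Tesseract's internal limits are downsampled.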

@@ -195,10 +195,10 @@ In each case OCRmyPDF will search the ``PATH`` environment variable to
locate the binaries. By modifying the ``PATH`` environment variable, you
can override the binaries that OCRmyPDF uses.

Changing tesseract configuration variables
Changing Tesseract configuration variables
------------------------------------------

You can override tesseract's default `control
You can override Tesseract's default `control
parameters <https://tesseract-ocr.github.io/tessdoc/tess3/ControlParams.html>`__
with a configuration file.

@@ -273,7 +273,7 @@ Unlike ``sandwich`` this renderer is implemented within OCRmyPDF; anyone
looking to customize how OCR is presented should look here. A major
disadvantage of this renderer is that it is not capable of correctly handling
text outside the Latin alphabet (specifically, it supports the ISO 8859-1
character). Pull requests to improve the situation are welcome.
character set). Pull requests to improve the situation are welcome.

Currently, this renderer has the best compatibility with Mozilla's
PDF.js viewer.
@@ -286,11 +286,31 @@ Rendering and rasterizing options
.. versionadded:: 14.3.0

The ``--continue-on-soft-render-error`` option allows OCRmyPDF to
proceed if a page cannot be rasterized rendered. This is useful if you are
proceed if a page cannot be rasterized/rendered. This is useful if you are
trying to get the best possible OCR from a PDF that is not well-formed,
and you are willing to accept some pages that may not visually match the
input, and that may not OCR well.

Color conversion strategy
=========================

.. versionadded:: 15.0.0

OCRmyPDF uses Ghostscript to convert PDF to PDF/A. In some cases, this
conversion requires color conversion. The default is the
``LeaveColorUnchanged`` strategy, which preserves the original
color space wherever possible (some rare color spaces might still be
converted).

Usually document scanners produce PDFs in the sRGB color space, and do
not need to be converted, so the default strategy is appropriate.

Suppose that you have a document that was prepared for professional
printing in a Separation or CMYK color space, and text was converted to
curves. In this case, you may want to use a different color conversion
strategy. The ``--color-conversion-strategy`` option allows you to select a
different strategy, such as ``RGB``.
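
A minimal sketch of selecting a strategy, assuming the option takes the
strategy name as its value. The helper and file names are hypothetical, and the
code only assembles the command line rather than running OCRmyPDF.

```python
import shlex


def color_conversion_command(input_pdf, output_pdf, strategy="LeaveColorUnchanged"):
    """Build an ocrmypdf argv that selects a color conversion strategy.

    Hypothetical helper; the strategy name is passed through unchecked.
    """
    return [
        "ocrmypdf",
        "--color-conversion-strategy",
        strategy,
        input_pdf,
        output_pdf,
    ]


# Example: a print-ready CMYK document converted with the RGB strategy.
print(shlex.join(color_conversion_command("brochure.pdf", "brochure_ocr.pdf", "RGB")))
```

The default argument mirrors the documented default, so calling the helper
without ``strategy`` yields the same behavior as omitting the option.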

Return code policy
==================

6 changes: 3 additions & 3 deletions docs/installation.rst
@@ -420,12 +420,12 @@ to change the PATH.

As of early 2021, users have reported problems with the Microsoft Store version of
Python and OCRmyPDF. These issues affect many other third party Python packages.
Please download Python from Python.org or Chocolatey instead, and do not use the
Please download Python from Python.org or a package manager instead of the
Microsoft Store version.

.. warning::

32-bit Windows might work, but is not supported.
32-bit Windows is not supported.

Windows Subsystem for Linux
---------------------------
@@ -558,7 +558,7 @@ The following versions are required:
- unpaper 6.1

We recommend 64-bit versions of all software. (32-bit versions are not
supported, although they may still work.)
supported, although on Linux, they may still work.)

jbig2enc, pngquant, and unpaper are optional. If they are missing, certain
features are disabled. OCRmyPDF will discover them as soon as they are
4 changes: 4 additions & 0 deletions docs/jbig2.rst
@@ -59,5 +59,9 @@ To turn on JBIG2 lossy mode, add the argument ``--jbig2-lossy``.
also required. Also, a JBIG2 encoder must be installed as described in
the previous section.

You can adjust the threshold for JBIG2 compression with the
``--jbig2-threshold`` option. The default is 0.85, meaning that if two symbols
are 85% similar, they will be compressed together.
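
As a hedged sketch, the lossy mode and threshold flags might be combined as
follows. The helper and file names are hypothetical, the code only builds the
argument list, and the additional requirements noted earlier in this section
are not covered here.

```python
def jbig2_lossy_command(input_pdf, output_pdf, threshold=0.85):
    """Build an ocrmypdf argv enabling lossy JBIG2 with a similarity threshold.

    Hypothetical helper; 0.85 mirrors the documented default.
    """
    return [
        "ocrmypdf",
        "--jbig2-lossy",
        "--jbig2-threshold",
        str(threshold),
        input_pdf,
        output_pdf,
    ]


# Example: require symbols to be 90% similar before they are merged.
print(" ".join(jbig2_lossy_command("scan.pdf", "scan_small.pdf", threshold=0.9)))
```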

*Due to an oversight, ocrmypdf v7.0 and v7.1 used lossy mode by
default.*
4 changes: 2 additions & 2 deletions docs/optimizer.rst
@@ -45,8 +45,8 @@ Optimizations that always occurs
================================

OCRmyPDF will automatically replace obsolete or inferior compression schemes
such as RLE or LZW with superior schemes such as Deflate and converting
monochrome images to CCITT G4. Since this is harmless it always occurs and there
such as RLE or LZW with superior schemes such as Deflate, and convert
monochrome images to CCITT G4. Since this is lossless, it always occurs and there
is no way to disable it. Other non-image compressed objects are compressed as
well.

