docs: various updates
jbarlow83 committed Aug 14, 2023
1 parent a371655 commit 06a5e0c
Showing 10 changed files with 154 additions and 52 deletions.
22 changes: 13 additions & 9 deletions docs/advanced.rst
@@ -47,8 +47,8 @@ and clean up the margins of both.

Some ``unpaper`` features cause multiple input or output files to be
consumed or produced. OCRmyPDF requires ``unpaper`` to consume one
file and produce one file. An deviation from that condition will
result in errors.
file and produce one file; errors will result if this assumption is not
met.

.. note::

@@ -82,14 +82,17 @@ is stripped out. Then an image of each page is created with visible text
masked out. The page image is sent for OCR, and any additional text is
inserted as OCR. If a file contains a mix of text and bitmap images that
contain text, OCRmyPDF will locate the additional text in images without
disrupting the existing text.
disrupting the existing text. Some PDF OCR solutions render text as
technically printable or visible in some way, perhaps by drawing it and
then painting over it. OCRmyPDF cannot distinguish this type of OCR
text from real text, so it will not be "redone".

If ``--force-ocr`` is issued, then all pages will be rasterized to
images, discarding any hidden OCR text, and rasterizing any printable
text. This is useful for redoing OCR, for fixing OCR text with a damaged
character map (text is selectable but not searchable), and destroying
redacted information. Any forms and vector graphics will be rasterized
as well.
images, discarding any hidden OCR text, rasterizing any printable
text, and flattening form fields or interactive objects into their visual
representation. This is useful for redoing OCR, for fixing OCR text
with a damaged character map (text is selectable but not searchable),
and destroying redacted information.

Time and image size limits
--------------------------
@@ -154,7 +157,8 @@ In addition to tesseract, OCRmyPDF uses the following external binaries:
- ``jbig2``

In each case OCRmyPDF will search the ``PATH`` environment variable to
locate the binaries.
locate the binaries. By modifying the ``PATH`` environment variable, you
can override the binaries that OCRmyPDF uses.
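
As a rough illustration of what a ``PATH`` search does, the sketch below resolves a program name with Python's ``shutil.which`` and shows that a directory placed earlier in the search path wins. This is not OCRmyPDF's own lookup code; ``find_binary`` and the stand-in ``unpaper`` script are hypothetical names used only for the demonstration, and the example assumes a POSIX system.

```python
import os
import shutil
import stat
import tempfile


def find_binary(name, extra_dir=None):
    """Resolve ``name`` with a PATH-style search (hypothetical helper).

    If ``extra_dir`` is given, it is searched before the directories
    already listed in the PATH environment variable.
    """
    path = os.environ.get("PATH", os.defpath)
    if extra_dir is not None:
        path = os.pathsep.join([extra_dir, path])
    return shutil.which(name, path=path)


# Create a stand-in executable in a temporary directory and confirm
# that it is found first when that directory leads the search path.
with tempfile.TemporaryDirectory() as tmp:
    fake = os.path.join(tmp, "unpaper")
    with open(fake, "w") as f:
        f.write("#!/bin/sh\nexit 0\n")
    os.chmod(fake, os.stat(fake).st_mode | stat.S_IXUSR)
    resolved = find_binary("unpaper", extra_dir=tmp)
```

In a shell, the equivalent is to prepend the directory to ``PATH`` before running ``ocrmypdf``.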

Changing tesseract configuration variables
------------------------------------------
25 changes: 12 additions & 13 deletions docs/batch.rst
@@ -46,16 +46,6 @@ place, and printing each filename in between runs:
find . -printf '%p\n' -name '*.pdf' -exec ocrmypdf '{}' '{}' \;
Alternatively, with a Docker container and streaming the file through
standard input and output:

.. code-block:: bash
find . -name '*.pdf' -print0 | xargs -0 | while read pdf; do
pdfout=$(mktemp)
docker run --rm -i jbarlow83/ocrmypdf - - <$pdf >$pdfout && cp $pdfout $pdf
done
This only runs one ``ocrmypdf`` process at a time. This variation uses
``find`` to create a directory list and ``parallel`` to parallelize runs
of ``ocrmypdf``, again updating files in place.
@@ -70,6 +60,15 @@ In a Windows batch file, use
for /r %%f in (*.pdf) do ocrmypdf %%f %%f
With a Docker container, you will need to stream through standard input and output:

.. code-block:: bash

    find . -name '*.pdf' -print0 | xargs -0 | while read pdf; do
        pdfout=$(mktemp)
        docker run --rm -i jbarlow83/ocrmypdf - - <$pdf >$pdfout && cp $pdfout $pdf
    done

Sample script
-------------

@@ -88,9 +87,9 @@ package <https://www.synology.com/en-global/dsm/packages/Docker>`__ is
installed. Attached is a script to address particular quirks of using
OCRmyPDF on one of these devices.

This is only possible for x86-based Synology products. Some Synology
products use ARM or Power processors and do not support Docker. Further
adjustments might be needed to deal with the Synology's relatively
At the time this script was written, it only worked for x86-based Synology
products. It is not known if it will work on ARM-based Synology products.
Further adjustments might be needed to deal with the Synology's relatively
limited CPU and RAM.

.. literalinclude:: ../misc/synology.py
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -65,7 +65,7 @@
# General information about the project.
project = 'ocrmypdf'
copyright = (
'2022, James R. Barlow. Licensed under Creative Commons Attribution-ShareAlike 4.0.'
'2023, James R. Barlow. Licensed under Creative Commons Attribution-ShareAlike 4.0.'
)
author = 'James R. Barlow'

33 changes: 25 additions & 8 deletions docs/contributing.rst
@@ -29,12 +29,17 @@ Tests

New features should come with tests that confirm their correctness.

New Python dependencies
=======================
New dependencies
================

If you are proposing a change that will require a new Python dependency, we
If you are proposing a change that will require a new dependency, we
prefer dependencies that are already packaged by Debian or Red Hat. This makes
life much easier for our downstream package maintainers.
life much easier for our downstream package maintainers. A package that is only
available on PyPI or GitHub, and not more widely packaged, may not be accepted.

We are unlikely to accept a dependency on CUDA or other GPU-based libraries,
because these are still difficult to package and install on many systems.
We recommend implementing these changes as plugins.

Python dependencies must also be license-compatible. GPLv3 or AGPLv3 are likely
incompatible with the project's license, but LGPLv3 is compatible.
@@ -43,7 +48,19 @@ New non-Python dependencies
===========================

OCRmyPDF uses several external programs (Tesseract, Ghostscript and others) for
its functionality. In general we prefer to avoid adding new external programs.
its functionality. In general we prefer to avoid adding new external programs,
and if we are to add external programs, we prefer those that are already
packaged by Debian or Red Hat.

Plugins
=======

Some new features may be a good fit for a plugin. Plugins are a way to add
features to OCRmyPDF without adding them to the core program. Plugins are
installed separately from OCRmyPDF. They are written in Python and can be
installed from PyPI. See the `plugin documentation <https://ocrmypdf.readthedocs.io/en/latest/plugins.html>`_.

We are happy to link users to your plugin from the documentation.

Style guide: Is it OCRmyPDF or ocrmypdf?
========================================
@@ -53,8 +70,8 @@ The program/project is OCRmyPDF and the name of the executable or library is ocr
Copyright and license
=====================

For contributions over 10 lines of code, please include your name to list of
For contributions over 10 lines of code, please add your name to the list of
copyright holders for that file. The core program is licensed under MPL-2.0,
test files and documentation under CC-BY-SA 4.0, and miscellaneous files under
MIT. Please contribute code only that you wrote and you have the permission to
contribute or license to us.
MIT, with a few minor exceptions. Please contribute only content that you own
or have the right to contribute under these licenses.
16 changes: 12 additions & 4 deletions docs/cookbook.rst
@@ -231,13 +231,20 @@ Don't actually OCR my PDF
=========================

If you set ``--tesseract-timeout 0`` OCRmyPDF will apply its image
processing without performing OCR, if all you want to is to apply image
processing or PDF/A conversion.
processing without performing OCR (by causing OCR to time out). This works
if all you want is to apply image processing or PDF/A conversion.

.. code-block:: bash

    ocrmypdf --tesseract-timeout=0 --remove-background input.pdf output.pdf

.. versionchanged:: v14.1.0

Prior to this version, ``--tesseract-timeout 0`` would prevent other
uses of Tesseract, such as deskewing, from working. This is no longer
the case. Use ``--tesseract-non-ocr-timeout`` to control the timeout
for non-OCR operations, if needed.

Optimize images without performing OCR
--------------------------------------

@@ -261,9 +268,10 @@ Hyphens denote a range of pages and commas separate page numbers. If you prefer
to use spaces, quote all of the page numbers: ``--pages '2, 3, 5, 7'``.

OCRmyPDF will warn if your list of page numbers contains duplicates or
overlap pages. OCRmyPDF does not currently account for document page numbers,
overlapping pages. OCRmyPDF does not currently account for document page numbers,
such as an introduction section of a book that uses Roman numerals. It simply
counts the number of virtual pieces of paper since the start.
counts the number of virtual pieces of paper since the start. If your list of
pages is out of numerical order, OCRmyPDF will sort it for you.
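
The behavior described above (hyphenated ranges, comma-separated numbers, duplicate detection and sorting) can be sketched as follows. This is an illustrative parser written for this note, not OCRmyPDF's implementation; ``parse_page_ranges`` is a hypothetical name, and the real option may treat edge cases differently.

```python
def parse_page_ranges(spec):
    """Parse a string such as '1, 3-5, 2' into sorted page numbers.

    Returns a tuple (pages, duplicates), where duplicates lists the
    pages that were named more than once, for example by overlapping
    ranges. Hypothetical helper for illustration only.
    """
    pages = set()
    duplicates = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        for page in range(start, end + 1):
            if page in pages:
                duplicates.add(page)
            pages.add(page)
    return sorted(pages), sorted(duplicates)
```

A caller could warn when the second element of the result is non-empty and use the first element as the ordered list of pages to process.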

Regardless of the argument to ``--pages``, OCRmyPDF will optimize all pages/images
in the file and convert it to PDF/A, unless you disable those options. Both of these
32 changes: 32 additions & 0 deletions docs/design_notes.rst
@@ -0,0 +1,32 @@
.. SPDX-FileCopyrightText: 2023 James R. Barlow
.. SPDX-License-Identifier: CC-BY-SA-4.0
============
Design notes
============

Why doesn't OCRmyPDF use PyTesseract?
=====================================

PyTesseract is a Python wrapper around the Tesseract OCR engine. When OCRmyPDF was
first written, PyTesseract used ABI bindings to call the Tesseract library. This
was not a good fit for OCRmyPDF because ABI bindings can be fragile.

PyTesseract has since evolved to call the Tesseract executable, abandoning the ABI
approach and using the CLI instead, just like OCRmyPDF does. If it were written from
scratch today, OCRmyPDF might use PyTesseract.

PyTesseract has more features for uses that don't particularly need PDF output, but
fewer features than OCRmyPDF's API for creating PDFs.

What is ``executor()``?
=======================

OCRmyPDF uses a custom concurrent executor which can support either threads or
processes with the same interface. This is useful because OCRmyPDF can use
either threads or processes to parallelize work, whichever is more appropriate
for the task at hand.

The interface is currently private and subject to change. In particular, if
experiments with asyncio and anyio are successful, the interface will change.
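
To illustrate the general idea of one interface over threads or processes, here is a minimal sketch built on the standard library's ``concurrent.futures``. It is not OCRmyPDF's private executor; ``run_tasks`` and ``page_label`` are hypothetical names chosen for the example.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def run_tasks(func, items, *, use_threads, max_workers=2):
    """Apply ``func`` to each item using threads or processes.

    Both pool classes share the same Executor interface, so the caller
    only chooses which kind of worker to use. Results are returned in
    input order.
    """
    pool_class = ThreadPoolExecutor if use_threads else ProcessPoolExecutor
    with pool_class(max_workers=max_workers) as pool:
        return list(pool.map(func, items))


def page_label(page_number):
    # Stand-in for real per-page work.
    return f"page {page_number}"


# Threads work with any callable. For processes, the callable must be
# picklable, and on platforms that spawn workers the call should sit
# under an ``if __name__ == "__main__":`` guard.
results = run_tasks(page_label, [1, 2, 3], use_threads=True)
```

Threads tend to suit work that waits on external programs or I/O, while processes suit CPU-bound Python code; which one fits a given task is a judgment call.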

1 change: 1 addition & 0 deletions docs/index.rst
@@ -44,6 +44,7 @@ image processing and OCR to existing PDFs.
api
plugins
apiref
design_notes
contributing
maintainers

69 changes: 55 additions & 14 deletions docs/pdfsecurity.rst
@@ -54,31 +54,72 @@ into the existing PDF or it may essentially reconstruct ("re-fry") a
visually identical PDF that may be quite different at the binary level.
That said, OCRmyPDF is not a tool designed for sanitizing PDFs.

Password protection, digital signatures and certification
=========================================================
Password protected PDFs
=======================

Password protected PDFs usually have two passwords, an owner and user
password. When the user password is set to empty, PDF readers will open
the file automatically and marked it as "(SECURED)". While not as
reliable as a digital signature, this indicates that whoever set the
password approved of the file at that time. When the user password is
set, the document cannot be viewed without the password.
the file automatically and mark it as "(SECURED)". Password security can
also request certain restrictions on the PDF, but anyone can remove these
restrictions if they have either the owner *or* user password. Passwords
mainly present a barrier for casual users.

Either way, OCRmyPDF does not remove passwords from PDFs and exits with
an error on encountering them.
OCRmyPDF cannot remove passwords from PDFs. If you want to remove a
password from a PDF, you must use other software, such as ``qpdf``.

``qpdf`` can remove passwords. If the owner and user password are set, a
If the owner and user password are set, a
password is required for ``qpdf``. If only the owner password is set, then the
password can be stripped, even if one does not have the owner password.
password can be stripped, even if one does not have the owner password. To
remove the password from a PDF using QPDF, use:

After OCR is applied, password protection is not permitted on PDF/A
documents but the file can be converted to regular PDF.
.. code-block:: bash

    qpdf --decrypt --password='abc123' input.pdf no_password.pdf

Then you can run OCRmyPDF on the file.

In its default mode, OCRmyPDF generates PDF/A. Passwords may not be set on PDF/A
documents. If you want to set a password on the output PDF, you must
specify ``--output-type pdf``.

Signature images
================

Many programs exist which are capable of inserting an image of someone's
signature. On its own, this offers no security guarantees. It is trivial
to remove the signature image and apply it to other files. This practice
offers no real security.

Digital signatures
==================

Important documents can be digitally signed and certified to attest to
their authorship. OCRmyPDF cannot do this. Open source tools such as
pdfbox (Java) have this capability as does Adobe Acrobat.
their authorship, approval or execution of a legal agreement. OCRmyPDF
will detect signed PDFs and will not modify them, unless the
``--invalidate-digital-signatures`` option is used, which will
invalidate any signatures. (The signature may still be present in the PDF
if opened, but PDF readers will not validate it.)

A digital signature adds a cryptographic hash of the document to the
document, so tamper protection is provided. That also precludes OCRmyPDF
from modifying the document and preserving the signature.

Digital signatures are not the same as a signature image. A digital
signature is a cryptographic hash of the document that is encrypted with
the author's private key. The signature is decrypted with the author's
public key. The public key is usually distributed by a certificate
authority. The signature is then verified by the PDF reader. If the
document is modified, the signature will be invalidated.
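
The reason any modification invalidates a signature can be shown with a deliberately simplified sketch. It only compares SHA-256 digests and omits the private-key signing and certificate steps entirely, so it is not how PDF signatures are actually validated; ``document_digest`` and the sample bytes are assumptions made for illustration.

```python
import hashlib


def document_digest(data: bytes) -> str:
    """Return the SHA-256 digest of the document bytes as hex."""
    return hashlib.sha256(data).hexdigest()


# Digest recorded at signing time (the real scheme would also encrypt
# this value with the signer's private key).
original = b"%PDF-1.7 example document bytes"
signed_digest = document_digest(original)

# Any change to the bytes, such as adding an OCR text layer, produces
# a different digest, so the recorded one no longer matches.
modified = original + b" plus an added OCR text layer"
still_valid = document_digest(modified) == signed_digest
```

Here ``still_valid`` is ``False``, which is why a tool that rewrites the file cannot preserve an existing signature.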

Certificate-encrypted PDFs
==========================

PDFs can be encrypted with a certificate. This is a more secure form of
encryption than a password. The certificate is usually issued by a
certificate authority. A certificate is used to encrypt the document using
the public key for the benefit of a specific recipient who possesses
the private key.

OCRmyPDF cannot open certificate-encrypted PDFs. If you have the
certificate, you can use other PDF software, such as Acrobat, to
decrypt the PDF.
4 changes: 2 additions & 2 deletions docs/release_notes.rst
@@ -34,7 +34,7 @@ v14.4.0
- Digitally signed PDFs are now detected. If the PDF is signed, OCRmyPDF will
refuse to modify it. Previously, only encrypted PDFs were detected, not
those that were signed but not encrypted. :issue:`1040`
- In addition, `--invalidate-digital-signatures` can be used to override the
- In addition, ``--invalidate-digital-signatures`` can be used to override the
above behavior and modify the PDF anyway. :issue:`1040`
- tqdm progress bars replaced with "rich" progress bars. The rich library is
a new dependency. Certain APIs that used tqdm are now deprecated and will
@@ -67,7 +67,7 @@ v14.2.1
v14.2.0
=======

- Added `--tesseract-downsample-above` to downsample larger images even when
- Added ``--tesseract-downsample-above`` to downsample larger images even when
they do not exceed Tesseract's internal limits. This can be used to speed
up OCR, possibly sacrificing accuracy.
- Fixed resampling AttributeError on older Pillow. :issue:`1096`
2 changes: 1 addition & 1 deletion src/ocrmypdf/builtin_plugins/concurrency.py
@@ -68,7 +68,7 @@ def process_init(q: Queue, user_init: UserInit, loglevel) -> None:
# Windows and Cygwin do not have pthread_sigmask or SIGBUS
signal.signal(signal.SIGBUS, process_sigbus)

# Remove any log handlers that belong to the parent process
# Remove any log handlers inherited from the parent process
root = logging.getLogger()
remove_all_log_handlers(root)
