docs: various updates

ferdiga · Aug 14, 2023 · commit 06a5e0c
1 parent a371655

Showing 10 changed files with 154 additions and 52 deletions.
diff --git a/docs/advanced.rst b/docs/advanced.rst
@@ -47,8 +47,8 @@ and clean up the margins of both.
 
    Some ``unpaper`` features cause multiple input or output files to be
    consumed or produced. OCRmyPDF requires ``unpaper`` to consume one
-   file and produce one file. An deviation from that condition will
-   result in errors.
+   file and produce one file; errors will result if this assumption is not
+   met.
 
 .. note::
 
@@ -82,14 +82,17 @@ is stripped out. Then an image of each page is created with visible text
 masked out. The page image is sent for OCR, and any additional text is
 inserted as OCR. If a file contains a mix of text and bitmap images that
 contain text, OCRmyPDF will locate the additional text in images without
-disrupting the existing text.
+disrupting the existing text. Some PDF OCR solutions render text as
+technically printable or visible in some way, perhaps by drawing it and
+then painting over it. OCRmyPDF cannot distinguish this type of OCR
+text from real text, so it will not be "redone".
 
 If ``--force-ocr`` is issued, then all pages will be rasterized to
-images, discarding any hidden OCR text, and rasterizing any printable
-text. This is useful for redoing OCR, for fixing OCR text with a damaged
-character map (text is selectable but not searchable), and destroying
-redacted information. Any forms and vector graphics will be rasterized
-as well.
+images, discarding any hidden OCR text, rasterizing any printable
+text, and flattening form fields or interactive objects into their visual
+representation. This is useful for redoing OCR, for fixing OCR text
+with a damaged character map (text is selectable but not searchable),
+and destroying redacted information.
 
 Time and image size limits
 --------------------------
@@ -154,7 +157,8 @@ In addition to tesseract, OCRmyPDF uses the following external binaries:
 -  ``jbig2``
 
 In each case OCRmyPDF will search the ``PATH`` environment variable to
-locate the binaries.
+locate the binaries. By modifying the ``PATH`` environment variable, you
+can override the binaries that OCRmyPDF uses.
 
 Changing tesseract configuration variables
 ------------------------------------------

diff --git a/docs/batch.rst b/docs/batch.rst
@@ -46,16 +46,6 @@ place, and printing each filename in between runs:
 
    find . -printf '%p\n' -name '*.pdf' -exec ocrmypdf '{}' '{}' \;
 
-Alternatively, with a Docker container and streaming the file through
-standard input and output:
-
-.. code-block:: bash
-
-   find . -name '*.pdf' -print0 | xargs -0 | while read pdf; do
-       pdfout=$(mktemp)
-       docker run --rm -i jbarlow83/ocrmypdf - - <$pdf >$pdfout && cp $pdfout $pdf
-   done
-
 This only runs one ``ocrmypdf`` process at a time. This variation uses
 ``find`` to create a directory list and ``parallel`` to parallelize runs
 of ``ocrmypdf``, again updating files in place.
@@ -70,6 +60,15 @@ In a Windows batch file, use
 
    for /r %%f in (*.pdf) do ocrmypdf %%f %%f
 
+With a Docker container, you will need to stream through standard input and output:
+
+.. code-block:: bash
+
+   find . -name '*.pdf' -print0 | xargs -0 | while read pdf; do
+       pdfout=$(mktemp)
+       docker run --rm -i jbarlow83/ocrmypdf - - <$pdf >$pdfout && cp $pdfout $pdf
+   done
+
 Sample script
 -------------
 
@@ -88,9 +87,9 @@ package <https://www.synology.com/en-global/dsm/packages/Docker>`__ is
 installed. Attached is a script to address particular quirks of using
 OCRmyPDF on one of these devices.
 
-This is only possible for x86-based Synology products. Some Synology
-products use ARM or Power processors and do not support Docker. Further
-adjustments might be needed to deal with the Synology's relatively
+At the time this script was written, it only worked for x86-based Synology
+products. It is not known if it will work on ARM-based Synology products.
+Further adjustments might be needed to deal with the Synology's relatively
 limited CPU and RAM.
 
 .. literalinclude:: ../misc/synology.py

diff --git a/docs/conf.py b/docs/conf.py
@@ -65,7 +65,7 @@
 # General information about the project.
 project = 'ocrmypdf'
 copyright = (
-    '2022, James R. Barlow. Licensed under Creative Commons Attribution-ShareAlike 4.0.'
+    '2023, James R. Barlow. Licensed under Creative Commons Attribution-ShareAlike 4.0.'
 )
 author = 'James R. Barlow'
 

diff --git a/docs/contributing.rst b/docs/contributing.rst
@@ -29,12 +29,17 @@ Tests
 
 New features should come with tests that confirm their correctness.
 
-New Python dependencies
-=======================
+New dependencies
+================
 
-If you are proposing a change that will require a new Python dependency, we
+If you are proposing a change that will require a new dependency, we
 prefer dependencies that are already packaged by Debian or Red Hat. This makes
-life much easier for our downstream package maintainers.
+life much easier for our downstream package maintainers. A package that is only
+available on PyPI or GitHub, and not more widely packaged, may not be accepted.
+
+We are unlikely to accept a dependency on CUDA or other GPU-based libraries,
+because these are still difficult to package and install on many systems.
+We recommend implementing these changes as plugins.
 
 Python dependencies must also be license-compatible. GPLv3 or AGPLv3 are likely
 incompatible with the project's license, but LGPLv3 is compatible.
@@ -43,7 +48,19 @@ New non-Python dependencies
 ===========================
 
 OCRmyPDF uses several external programs (Tesseract, Ghostscript and others) for
-its functionality. In general we prefer to avoid adding new external programs.
+its functionality. In general we prefer to avoid adding new external programs,
+and if we are to add external programs, we prefer those that are already
+packaged by Debian or Red Hat.
+
+Plugins
+=======
+
+Some new features may be a good fit for a plugin. Plugins are a way to add
+features to OCRmyPDF without adding them to the core program. Plugins are
+installed separately from OCRmyPDF. They are written in Python and can be
+installed from PyPI. See the `plugin documentation <https://ocrmypdf.readthedocs.io/en/latest/plugins.html>`_.
+
+We are happy to link users to your plugin from the documentation.
 
 Style guide: Is it OCRmyPDF or ocrmypdf?
 ========================================
@@ -53,8 +70,8 @@ The program/project is OCRmyPDF and the name of the executable or library is ocr
 Copyright and license
 =====================
 
-For contributions over 10 lines of code, please include your name to list of
+For contributions over 10 lines of code, please add your name to the list of
 copyright holders for that file. The core program is licensed under MPL-2.0,
 test files and documentation under CC-BY-SA 4.0, and miscellaneous files under
-MIT. Please contribute code only that you wrote and you have the permission to
-contribute or license to us.
+MIT, with a few minor exceptions. Please contribute only content that you own
+or have the right to contribute under these licenses.
diff --git a/docs/cookbook.rst b/docs/cookbook.rst
@@ -231,13 +231,20 @@ Don't actually OCR my PDF
 =========================
 
 If you set ``--tesseract-timeout 0`` OCRmyPDF will apply its image
-processing without performing OCR, if all you want to is to apply image
-processing or PDF/A conversion.
+processing without performing OCR (by causing OCR to time out). This works
+if all you want is to apply image processing or PDF/A conversion.
 
 .. code-block:: bash
 
     ocrmypdf --tesseract-timeout=0 --remove-background input.pdf output.pdf
 
+.. versionchanged:: v14.1.0
+
+    Prior to this version, ``--tesseract-timeout 0`` would prevent other
+    uses of Tesseract, such as deskewing, from working. This is no longer
+    the case. Use ``--tesseract-non-ocr-timeout`` to control the timeout
+    for non-OCR operations, if needed.
+
 Optimize images without performing OCR
 --------------------------------------
 
@@ -261,9 +268,10 @@ Hyphens denote a range of pages and commas separate page numbers. If you prefer
 to use spaces, quote all of the page numbers: ``--pages '2, 3, 5, 7'``.
 
 OCRmyPDF will warn if your list of page numbers contains duplicates or
-overlap pages. OCRmyPDF does not currently account for document page numbers,
+overlapping pages. OCRmyPDF does not currently account for document page numbers,
 such as an introduction section of a book that uses Roman numerals. It simply
-counts the number of virtual pieces of paper since the start.
+counts the number of virtual pieces of paper since the start. If your list of
+pages is out of numerical order, OCRmyPDF will sort it for you.
 
 Regardless of the argument to ``--pages``, OCRmyPDF will optimize all pages/images
 in the file and convert it to PDF/A, unless you disable those options. Both of these

diff --git a/docs/design_notes.rst b/docs/design_notes.rst
@@ -0,0 +1,32 @@
+.. SPDX-FileCopyrightText: 2023 James R. Barlow
+.. SPDX-License-Identifier: CC-BY-SA-4.0
+
+============
+Design notes
+============
+
+Why doesn't OCRmyPDF use PyTesseract?
+=====================================
+
+PyTesseract is a Python wrapper around the Tesseract OCR engine. When OCRmyPDF was
+first written, PyTesseract used ABI bindings to call the Tesseract library. This
+was not a good fit for OCRmyPDF because ABI bindings can be fragile.
+
+PyTesseract has since evolved to call the Tesseract executable, abandoning the ABI
+approach and using the CLI instead, just like OCRmyPDF does. If it were written from
+scratch today, OCRmyPDF might use PyTesseract.
+
+PyTesseract has more features for users who don't particularly need PDF output,
+but fewer features than OCRmyPDF's API for creating PDFs.
+
+What is ``executor()``?
+=======================
+
+OCRmyPDF uses a custom concurrent executor which can support either threads or
+processes with the same interface. This is useful because OCRmyPDF can use
+either threads or processes to parallelize work, whichever is more appropriate
+for the task at hand.
+
+The interface is currently private and subject to change. In particular, if
+experiments with asyncio and anyio are successful, the interface will change.
+
diff --git a/docs/index.rst b/docs/index.rst
@@ -44,6 +44,7 @@ image processing and OCR to existing PDFs.
    api
    plugins
    apiref
+   design_notes
    contributing
    maintainers
 

diff --git a/docs/pdfsecurity.rst b/docs/pdfsecurity.rst
@@ -54,31 +54,72 @@ into the existing PDF or it may essentially reconstruct ("re-fry") a
 visually identical PDF that may be quite different at the binary level.
 That said, OCRmyPDF is not a tool designed for sanitizing PDFs.
 
-Password protection, digital signatures and certification
-=========================================================
+Password protected PDFs
+=======================
 
 Password protected PDFs usually have two passwords, an owner and user
 password. When the user password is set to empty, PDF readers will open
-the file automatically and marked it as "(SECURED)". While not as
-reliable as a digital signature, this indicates that whoever set the
-password approved of the file at that time. When the user password is
-set, the document cannot be viewed without the password.
+the file automatically and mark it as "(SECURED)". Password security can
+also request certain restrictions on the PDF, but anyone can remove these
+restrictions if they have either the owner *or* user password. Passwords
+mainly present a barrier for casual users.
 
-Either way, OCRmyPDF does not remove passwords from PDFs and exits with
-an error on encountering them.
+OCRmyPDF cannot remove passwords from PDFs. If you want to remove a
+password from a PDF, you must use other software, such as ``qpdf``.
 
-``qpdf`` can remove passwords. If the owner and user password are set, a
+If the owner and user password are set, a
 password is required for ``qpdf``. If only the owner password is set, then the
-password can be stripped, even if one does not have the owner password.
+password can be stripped, even if one does not have the owner password. To
+remove the password from a PDF using QPDF, use:
 
-After OCR is applied, password protection is not permitted on PDF/A
-documents but the file can be converted to regular PDF.
+.. code-block:: bash
+
+   qpdf --decrypt --password='abc123' input.pdf no_password.pdf
+
+Then you can run OCRmyPDF on the file.
+
+In its default mode, OCRmyPDF generates PDF/A. Passwords may not be set on PDF/A
+documents. If you want to set a password on the output PDF, you must
+specify ``--output-type pdf``.
+
+Signature images
+================
 
 Many programs exist which are capable of inserting an image of someone's
 signature. On its own, this offers no security guarantees. It is trivial
 to remove the signature image and apply it to other files. This practice
 offers no real security.
 
+Digital signatures
+==================
+
 Important documents can be digitally signed and certified to attest to
-their authorship. OCRmyPDF cannot do this. Open source tools such as
-pdfbox (Java) have this capability as does Adobe Acrobat.
+their authorship, approval or execution of a legal agreement. OCRmyPDF
+will detect signed PDFs and will not modify them, unless the
+``--invalidate-digital-signatures`` option is used, which will
+invalidate any signatures. (The signature may still be present in the PDF
+if opened, but PDF readers will not validate it.)
+
+A digital signature adds a cryptographic hash of the document to the
+document, so tamper protection is provided. That also precludes OCRmyPDF
+from modifying the document and preserving the signature.
+
+Digital signatures are not the same as a signature image. A digital
+signature is a cryptographic hash of the document that is encrypted with
+the author's private key. The signature is decrypted with the author's
+public key. The public key is usually distributed by a certificate
+authority. The signature is then verified by the PDF reader. If the
+document is modified, the signature will be invalidated.
+
+Certificate-encrypted PDFs
+==========================
+
+PDFs can be encrypted with a certificate. This is a more secure form of
+encryption than a password. The certificate is usually issued by a
+certificate authority. A certificate is used to encrypt the document using
+the public key for the benefit of a specific recipient who possesses
+the private key.
+
+OCRmyPDF cannot open certificate-encrypted PDFs. If you have the
+certificate, you can use other PDF software, such as Acrobat, to
+decrypt the PDF.
diff --git a/docs/release_notes.rst b/docs/release_notes.rst
@@ -34,7 +34,7 @@ v14.4.0
 -  Digitally signed PDFs are now detected. If the PDF is signed, OCRmyPDF will
    refuse to modify it. Previously, only encrypted PDFs were detected, not
    those that were signed but not encrypted. :issue:`1040`
--  In addition, `--invalidate-digital-signatures` can be used to override the
+-  In addition, ``--invalidate-digital-signatures`` can be used to override the
    above behavior and modify the PDF anyway. :issue:`1040`
 -  tqdm progress bars replaced with "rich" progress bars. The rich library is
    a new dependency. Certain APIs that used tqdm are now deprecated and will
@@ -67,7 +67,7 @@ v14.2.1
 v14.2.0
 =======
 
--  Added `--tesseract-downsample-above` to downsample larger images even when
+-  Added ``--tesseract-downsample-above`` to downsample larger images even when
    they do not exceed Tesseract's internal limits. This can be used to speed
    up OCR, possibly sacrificing accuracy.
 -  Fixed resampling AttributeError on older Pillow. :issue:`1096`

diff --git a/src/ocrmypdf/builtin_plugins/concurrency.py b/src/ocrmypdf/builtin_plugins/concurrency.py
@@ -68,7 +68,7 @@ def process_init(q: Queue, user_init: UserInit, loglevel) -> None:
         # Windows and Cygwin do not have pthread_sigmask or SIGBUS
         signal.signal(signal.SIGBUS, process_sigbus)
 
-    # Remove any log handlers that belong to the parent process
+    # Remove any log handlers inherited from the parent process
     root = logging.getLogger()
     remove_all_log_handlers(root)