Length-0 streams are read incorrectly, which breaks some PDFs #3052

benrg · 2025-01-13T23:23:31Z

DictionaryObject.read_from_stream contains this code:

            if length is None:  # if the PDF is damaged
                length = -1
            pstart = stream.tell()
            if length > 0:
                data["__streamdata__"] = stream.read(length)
            else:
                data["__streamdata__"] = read_until_regex(
                    stream, re.compile(b"endstream")
                )

Since read_until_regex doesn't strip the trailing newline, this will read almost all length-0 streams as b"\n" or b"\r\n" instead of b"".
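A minimal sketch of the effect, not pypdf's actual code: `read_until` below is a simplified, hypothetical stand-in for `read_until_regex`, and `read_stream_data` only mirrors the branch quoted above. The input is positioned just after the `stream` keyword and its end-of-line marker, as it would be for a zero-length stream.

```python
import io
import re


def read_until(stream, pattern):
    """Hypothetical stand-in for read_until_regex: return the bytes before the first match."""
    rest = stream.read()
    match = pattern.search(rest)
    if match is None:
        return rest
    # Leave the stream positioned at the start of the match.
    stream.seek(stream.tell() - len(rest) + match.start())
    return rest[: match.start()]


def read_stream_data(stream, length):
    """Mirror of the quoted branch: length 0 falls through to the regex path."""
    if length > 0:
        return stream.read(length)
    return read_until(stream, re.compile(b"endstream"))


# Zero-length stream in a file using LF line endings.
data_lf = read_stream_data(io.BytesIO(b"\nendstream\nendobj\n"), 0)
# Zero-length stream in a file using CRLF line endings.
data_crlf = read_stream_data(io.BytesIO(b"\r\nendstream\r\nendobj\r\n"), 0)

print(data_lf, data_crlf)
```

In both cases the end-of-line marker that precedes `endstream` ends up in the stream data, although /Length says the stream is empty.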

I have some PDFs with creator PFU ScanSnap Manager 5.1.30 #S1500 that contain JBIG2-encoded pages with /JBIG2Globals pointing to an empty stream object. After loading and saving them with pypdf, the /JBIG2Globals stream is invalid, and some (not all) PDF viewers fail to render the pages.

Suggested fix:

If there exist broken PDFs in the wild with /Length 0 followed by a stream of nonzero length that pypdf needs to support, check for stream\r?\n\r?\n?endstream as a special case first before falling back to read_until_regex, to ensure that valid PDFs with length-0 streams are always read correctly.
Or, if there are no such PDFs, and length > 0 was just meant to catch the -1 case, change the test to length >= 0.
In the read_until_regex case, if endstream is preceded by \r then strip it, or if it's preceded by \r\n then strip the \n, and strip the \r also iff stream was followed by \r. That isn't guaranteed to work, but it's probably the best one can do.
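The first and third suggestions could look roughly like the sketch below. The names `EMPTY_STREAM` and `strip_trailing_eol` are hypothetical and not part of pypdf. Stripping a bare `\n` before `endstream` is an assumption on my part, since the text above only spells out the `\r` and `\r\n` cases.

```python
import re

# Special case from the first suggestion: a stream with no data at all.
EMPTY_STREAM = re.compile(rb"stream\r?\n\r?\n?endstream")


def strip_trailing_eol(data: bytes, stream_eol: bytes) -> bytes:
    """Drop the end-of-line marker that precedes `endstream` (heuristic).

    `data` is what was read up to `endstream`; `stream_eol` is the
    end-of-line marker that followed the `stream` keyword.
    """
    if data.endswith(b"\r\n"):
        # Always strip "\n"; strip "\r" too only if `stream` was followed by "\r".
        return data[:-2] if stream_eol.startswith(b"\r") else data[:-1]
    if data.endswith(b"\r") or data.endswith(b"\n"):
        # Bare "\n" handling is an assumption, see the note above.
        return data[:-1]
    return data


print(strip_trailing_eol(b"\r\n", b"\r\n"))
print(bool(EMPTY_STREAM.match(b"stream\nendstream")))
```

As noted above, this is a heuristic: stream data that genuinely ends in `\r` or `\n` would lose its last byte, which is why a correct /Length should take precedence whenever it is available.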

stefan6419846 · 2025-01-14T06:31:06Z

Thanks for the report. You are invited to propose a corresponding PR including tests of course.

For the future, please report bugs with the actual bug template to not omit relevant aspects for reproduction.

benrg · 2025-01-19T00:00:51Z

Here's a pair of PDFs showing the problem. correct.pdf should display the text "Hello, World!" in any PDF reader (such as Acrobat Reader). broken.pdf fails to display the text in Acrobat Reader (with the message "Insufficient data for an image"), though some other readers such as SumatraPDF display it. The only difference between them is that the JBIG2Globals stream is empty in correct.pdf and contains a single 0A byte in broken.pdf. The latter is obtained from the former by

>>> import pypdf
>>> r = pypdf.PdfReader("correct.pdf")
>>> w = pypdf.PdfWriter(r)
>>> w.write("broken.pdf")

jbig2_with_0_length_symbols.zip

stefan6419846 added the generic, needs-pdf, and needs-example-code labels on Jan 14, 2025

stefan6419846 added the is-robustness-issue label and removed the needs-pdf and needs-example-code labels on Jan 20, 2025
