Length-0 streams are read incorrectly, which breaks some PDFs #3052

Open
benrg opened this issue Jan 13, 2025 · 2 comments
Labels: generic (The generic submodule is affected), is-robustness-issue (From a users perspective, this is about robustness)

benrg commented Jan 13, 2025

DictionaryObject.read_from_stream contains this code:

            if length is None:  # if the PDF is damaged
                length = -1
            pstart = stream.tell()
            if length > 0:
                data["__streamdata__"] = stream.read(length)
            else:
                data["__streamdata__"] = read_until_regex(
                    stream, re.compile(b"endstream")
                )

Since read_until_regex doesn't strip the trailing newline, this will read almost all length-0 streams as b"\n" or b"\r\n" instead of b"".
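
The effect can be reproduced without pypdf. The sketch below is only an illustration: `read_until` is a simplified, hypothetical stand-in for pypdf's `read_until_regex` (not its actual implementation), and `read_stream_data` mirrors just the branch quoted above. The input bytes assume the common layout in which an EOL follows the `stream` keyword and another EOL precedes `endstream`.

```python
import io
import re


def read_until(stream, pattern):
    """Hypothetical stand-in for read_until_regex: return the bytes before
    the first match of `pattern` and leave the stream positioned at the match."""
    start = stream.tell()
    rest = stream.read()
    match = pattern.search(rest)
    end = match.start() if match else len(rest)
    stream.seek(start + end)
    return rest[:end]


def read_stream_data(stream, length):
    """Mirror of the quoted branch: only a positive length is trusted."""
    if length is None:  # damaged PDF
        length = -1
    if length > 0:
        return stream.read(length)
    return read_until(stream, re.compile(b"endstream"))


# Both buffers start right after the EOL that follows the "stream" keyword
# of an object declaring /Length 0.
lf_file = io.BytesIO(b"\nendstream\nendobj\n")
crlf_file = io.BytesIO(b"\r\nendstream\r\nendobj\r\n")

data_lf = read_stream_data(lf_file, 0)
data_crlf = read_stream_data(crlf_file, 0)
print(data_lf, data_crlf)
```

Because `length > 0` is false for a declared length of 0, both reads fall through to the regex search and pick up the EOL that precedes `endstream`.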

I have some PDFs with creator PFU ScanSnap Manager 5.1.30 #S1500 that contain JBIG2-encoded pages with /JBIG2Globals pointing to an empty stream object. After loading and saving them with pypdf, the /JBIG2Globals stream is invalid, and some (not all) PDF viewers fail to render the pages.

Suggested fix:

  • If there exist broken PDFs in the wild with /Length 0 followed by a stream of nonzero length that pypdf needs to support, check for stream\r?\n\r?\n?endstream as a special case first before falling back to read_until_regex, to ensure that valid PDFs with length-0 streams are always read correctly.
  • Or, if there are no such PDFs, and length > 0 was just meant to catch the -1 case, change the test to length >= 0.
  • In the read_until_regex case, if endstream is preceded by \r then strip it, or if it's preceded by \r\n then strip the \n, and strip the \r also iff stream was followed by \r. That isn't guaranteed to work, but it's probably the best one can do.
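
A minimal sketch of the EOL-stripping heuristic from the last bullet, for illustration only. The function name `strip_trailing_eol` and its `stream_keyword_eol` parameter are hypothetical and not part of pypdf. Stripping a lone `\n` is an added assumption: the bullet does not list that case explicitly, but it is the `b"\n"` case described in this report.

```python
def strip_trailing_eol(data: bytes, stream_keyword_eol: bytes) -> bytes:
    """Remove the EOL that separates stream data from `endstream`.

    data: bytes read up to (not including) the `endstream` keyword.
    stream_keyword_eol: the EOL found right after the `stream` keyword,
        b"\n" or b"\r\n" (hypothetical parameter for this sketch).
    """
    if data.endswith(b"\r\n"):
        # Always drop the LF; drop the CR as well only if the file also
        # used CRLF after `stream`, otherwise the CR may be real data.
        if stream_keyword_eol == b"\r\n":
            return data[:-2]
        return data[:-1]
    if data.endswith(b"\r") or data.endswith(b"\n"):
        # Lone CR (from the bullet) or lone LF (assumption, see above).
        return data[:-1]
    return data


# Length-0 streams as they come back from the regex fallback.
empty_lf = strip_trailing_eol(b"\n", b"\n")
empty_crlf = strip_trailing_eol(b"\r\n", b"\r\n")
print(empty_lf, empty_crlf)
```

As the bullet says, this cannot be guaranteed to work: when `/Length` is missing or wrong, a trailing `\r` or `\n` may belong to the stream data itself.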

stefan6419846 (Collaborator) commented

Thanks for the report. You are invited to propose a corresponding PR including tests of course.

In the future, please report bugs using the actual bug template so that aspects relevant for reproduction are not omitted.

stefan6419846 added the generic, needs-pdf, and needs-example-code labels Jan 14, 2025

benrg commented Jan 19, 2025

Here's a pair of PDFs showing the problem. correct.pdf should display the text "Hello, World!" in any PDF reader (such as Acrobat Reader). broken.pdf fails to display the text in Acrobat Reader (with the message "Insufficient data for an image"), though some other readers such as SumatraPDF display it. The only difference between them is that the JBIG2Globals stream is empty in correct.pdf and contains a single 0A byte in broken.pdf. The latter is obtained from the former by

>>> import pypdf
>>> r = pypdf.PdfReader("correct.pdf")
>>> w = pypdf.PdfWriter(r)
>>> w.write("broken.pdf")

jbig2_with_0_length_symbols.zip

stefan6419846 added the is-robustness-issue label and removed the needs-pdf and needs-example-code labels Jan 20, 2025