Length-0 streams are read incorrectly, which breaks some PDFs #3052
Labels: generic, is-robustness-issue
`DictionaryObject.read_from_stream` reads the stream data directly only when `length > 0`; otherwise it falls back to `read_until_regex`. Since `read_until_regex` doesn't strip the trailing newline, this will read almost all length-0 streams as `b"\n"` or `b"\r\n"` instead of `b""`.

I have some PDFs with creator `PFU ScanSnap Manager 5.1.30 #S1500` that contain JBIG2-encoded pages with `/JBIG2Globals` pointing to an empty stream object. After loading and saving them with pypdf, the `/JBIG2Globals` stream is invalid, and some (not all) PDF viewers fail to render the pages.

Suggested fix:
- If there are PDFs with `/Length 0` followed by a stream of nonzero length that pypdf needs to support, check for `stream\r?\n\r?\n?endstream` as a special case first before falling back to `read_until_regex`, to ensure that valid PDFs with length-0 streams are always read correctly.
- If `length > 0` was just meant to catch the `-1` case, change the test to `length >= 0`.
- In the `read_until_regex` case, if `endstream` is preceded by `\r` then strip it, or if it's preceded by `\r\n` then strip the `\n`, and strip the `\r` also iff `stream` was followed by `\r`. That isn't guaranteed to work, but it's probably the best one can do.
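To make the suggestions concrete, here is a rough standalone sketch. It does not use pypdf or its internals: `read_naive`, `read_tolerant`, and `strip_eol_before_endstream` are hypothetical names, and both readers work on raw bytes that start at the `stream` keyword and assume well-formed input. `read_naive` only mimics the reported symptom (scan to `endstream`, keep the EOL). `read_tolerant` combines the special-case regex from the first suggestion with the EOL stripping from the last one. Stripping a lone `\n` is my own assumption, added because the symptom above is a stray `b"\n"`.

```python
import re

# Special-case pattern quoted in the suggested fix for an empty stream body.
EMPTY_STREAM = re.compile(rb"stream\r?\n\r?\n?endstream")
# The `stream` keyword plus the EOL that follows it (captured for later).
STREAM_KEYWORD = re.compile(rb"stream(\r?\n)")


def read_naive(raw: bytes) -> bytes:
    """Scan up to `endstream` without stripping the EOL (the reported symptom)."""
    start = STREAM_KEYWORD.match(raw).end()
    return raw[start:raw.index(b"endstream", start)]


def strip_eol_before_endstream(data: bytes, eol_after_stream: bytes) -> bytes:
    """Drop the EOL that separates the payload from `endstream`."""
    if data.endswith(b"\r\n"):
        # Always strip the \n; strip the \r only if `stream` was followed by \r.
        return data[:-2] if eol_after_stream.startswith(b"\r") else data[:-1]
    if data.endswith((b"\r", b"\n")):
        # Lone \r as in the suggestion; lone \n is an assumption of this sketch.
        return data[:-1]
    return data


def read_tolerant(raw: bytes, length: int) -> bytes:
    """Read the payload; `length` is the /Length value, or -1 if unusable."""
    head = STREAM_KEYWORD.match(raw)
    if head is None:
        raise ValueError("data does not start with the stream keyword")
    start = head.end()
    if length > 0:
        return raw[start:start + length]
    if length == 0 and EMPTY_STREAM.match(raw):
        return b""  # valid empty stream, no scanning needed
    # Fallback: scan for `endstream`, then strip the separating EOL.
    end = raw.index(b"endstream", start)
    return strip_eol_before_endstream(raw[start:end], head.group(1))


print(read_naive(b"stream\n\nendstream"))           # b'\n'
print(read_tolerant(b"stream\n\nendstream", 0))     # b''
print(read_tolerant(b"stream\nabc\nendstream", 0))  # b'abc'
```

The last call shows why the special case is tried before the fallback rather than replacing it: a file with a wrong `/Length 0` but a non-empty body still yields its payload. As noted in the list, the EOL heuristic is best effort and can still guess wrong for binary data that happens to end in `\r` or `\n`.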