PdfMerger breaks PDF/A compliance #1012

Open
MartinThoma opened this issue Jun 19, 2022 · 16 comments
Labels
  • is-bug: From a user's perspective, this is a bug, a violation of the expected behavior with a compliant PDF
  • is-pdf/a-compliance: Anything related to PDF/A compliance
  • PdfMerger: The PdfMerger component is affected

Comments

@MartinThoma
Member

When using PdfMerger with a single PDF/A-compliant document, I would expect the output file to be almost exactly the same as the input file. Instead, it is very different, and PDF/A compliance is broken.

Code + PDF

Using this as an example document: https://www.pdfa.org/wp-content/uploads/2011/08/PDFA-in-a-Nutshell_1b.pdf

And https://demo.verapdf.org/ to verify if the document is compliant.

from PyPDF2 import PdfReader, PdfMerger

reader = PdfReader("PDFA-in-a-Nutshell_1b.pdf")
metadata = reader.metadata

merger = PdfMerger()
merger.append(reader)
merger.add_metadata(metadata)

with open("merged.pdf", "wb") as fp:
    merger.write(fp)

Issues

  • PDFA-in-a-Nutshell_1b.pdf is 6.4 MB and is PDF/A compliant
  • merged.pdf is 5.0 MB and is NOT PDF/A compliant

verapdf.org reports that 100 issues were detected. It lists the following 4 rules:

  1. The document catalog dictionary of a conforming file shall contain the Metadata key. (docs) x1
  2. DeviceCMYK may be used only if the file has a PDF/A-1 OutputIntent that uses a CMYK colour space (docs) x97
  3. The file trailer dictionary shall contain the ID keyword. The file trailer referred to is either the last trailer dictionary in a PDF file, as described in PDF Reference 3.4.4 and 3.4.5, or the first page trailer in a linearized PDF file, as described in PDF Reference F.2 (docs) x1
  4. If a document information dictionary does appear at a document, then all of its entries that have analogous properties in predefined XMP schemas, shall also be embedded in the file in XMP form with equivalent values. (docs) x1
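
As a rough pre-check before uploading a file to a validator, the structures named in findings 1 to 3 can be looked for directly in the raw bytes. The sketch below is only an illustration: the function name `quick_pdfa_markers` and the two sample byte strings are made up, and the approach assumes a classic, unencrypted file whose catalog and trailer are not stored in object or cross-reference streams. It does not replace veraPDF, which checks far more than the mere presence of these keys.

```python
import re


def quick_pdfa_markers(pdf_bytes: bytes) -> dict:
    """Crude presence checks on raw PDF bytes (hypothetical helper)."""
    # Locate the object that declares itself the document catalog.
    chunks = pdf_bytes.split(b"endobj")
    catalog = next(
        (c for c in chunks if re.search(rb"/Type\s*/Catalog\b", c)), b""
    )
    # Whatever follows the last "trailer" keyword holds the last trailer dict.
    _, found, tail = pdf_bytes.rpartition(b"trailer")
    trailer = tail if found else b""
    return {
        "catalog_has_metadata": re.search(rb"/Metadata\b", catalog) is not None,
        "catalog_has_output_intents": b"/OutputIntents" in catalog,
        "trailer_has_id": re.search(rb"/ID\s*\[", trailer) is not None,
    }


# Tiny synthetic fragments, not real PDF files.
GOOD = (
    b"1 0 obj\n<< /Metadata 3 0 R /OutputIntents [ ] /Pages 7 0 R "
    b"/Type /Catalog >>\nendobj\n"
    b"trailer << /Root 1 0 R /ID [<aa><bb>] >>\nstartxref\n0\n%%EOF"
)
BAD = (
    b"1 0 obj\n<< /Pages 7 0 R /Type /Catalog >>\nendobj\n"
    b"trailer << /Root 1 0 R >>\nstartxref\n0\n%%EOF"
)

if __name__ == "__main__":
    print(quick_pdfa_markers(GOOD))
    print(quick_pdfa_markers(BAD))
```

For a real file, the bytes would come from something like `Path("merged.pdf").read_bytes()`.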
@MartinThoma added the is-bug label Jun 19, 2022
@MartinThoma
Member Author

The metadata section that is missing looks like this in the original file:

1 0 obj
<< /Metadata 3 0 R /Outlines 4 0 R /OutputIntents [ << /DestOutputProfile 5 0 R /Info (ISO Coated v2 \(ECI\)) /OutputConditionIdentifier (ISO Coated v2 \(ECI\)) /RegistryName (http://www.color.org) /S /GTS_PDFX /Type /OutputIntent >> << /DestOutputProfile 5 0 R /Info (ISO Coated v2 \(ECI\)) /OutputConditionIdentifier (ISO Coated v2 \(ECI\)) /RegistryName (http://www.color.org) /S /GTS_PDFA1 /Type /OutputIntent >> ] /PageLabels 6 0 R /Pages 7 0 R /Type /Catalog /ViewerPreferences << /Direction /L2R >> >>
endobj
2 0 obj
<< /Author (PDF/A Competence Center) /CreationDate (D:20110818145925+02'00') /Creator (Adobe InDesign CS5 \(7.0.4\)) /GTS_PDFXConformance (PDF/X-1a:2003) /GTS_PDFXVersion (PDF/X-1a:2003) /ModDate (D:20110818150035+02'00') /Producer (Adobe PDF Library 9.9) /Title (PDF/A in a Nutshell) /Trapped /False >>
endobj
3 0 obj
<< /Subtype /XML /Type /Metadata /Length 12889 >>
stream
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.2.1-c043 52.372728, 2009/01/18-15:56:37        ">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
            xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
            xmlns:stEvt="http://ns.adobe.com/xap/1.0/sType/ResourceEvent#">
         <xmpMM:InstanceID>uuid:2b5a9f85-f518-7b4a-a756-79898ed7b891</xmpMM:InstanceID>
         <xmpMM:OriginalDocumentID>adobe:docid:indd:fbe35371-5d32-11dc-b86a-8404c5e05271</xmpMM:OriginalDocumentID>
         <xmpMM:DocumentID>adobe:docid:indd:fbe35371-5d32-11dc-b86a-8404c5e05271</xmpMM:DocumentID>
         <xmpMM:RenditionClass>proof:pdf</xmpMM:RenditionClass>
         <xmpMM:VersionID>1</xmpMM:VersionID>
         <xmpMM:History>
            <rdf:Seq>
               <rdf:li rdf:parseType="Resource">
                  <stEvt:action>converted</stEvt:action>
                  <stEvt:instanceID>uuid:d6a6d0f6-aead-f743-bda5-13d4cf5c3c11</stEvt:instanceID>
                  <stEvt:parameters>converted to PDF/A-1b</stEvt:parameters>
                  <stEvt:softwareAgent>pdfaPilot</stEvt:softwareAgent>
                  <stEvt:when>2011-08-18T15:00:32+02:00</stEvt:when>
               </rdf:li>
            </rdf:Seq>
         </xmpMM:History>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:xmp="http://ns.adobe.com/xap/1.0/">
         <xmp:CreatorTool>Adobe InDesign CS5 (7.0.4)</xmp:CreatorTool>
         <xmp:CreateDate>2011-08-18T14:59:25+02:00</xmp:CreateDate>
         <xmp:ModifyDate>2011-08-18T15:00:35+02:00</xmp:ModifyDate>
         <xmp:MetadataDate>2011-08-18T15:00:35+02:00</xmp:MetadataDate>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:dc="http://purl.org/dc/elements/1.1/">
         <dc:format>application/pdf</dc:format>
         <dc:title>
            <rdf:Alt>
               <rdf:li xml:lang="x-default">PDF/A in a Nutshell</rdf:li>
            </rdf:Alt>
         </dc:title>
         <dc:creator>
            <rdf:Seq>
               <rdf:li>PDF/A Competence Center</rdf:li>
            </rdf:Seq>
         </dc:creator>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
         <pdf:Producer>Adobe PDF Library 9.9</pdf:Producer>
         <pdf:Trapped>False</pdf:Trapped>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:pdfxid="http://www.npes.org/pdfx/ns/id/">
         <pdfxid:GTS_PDFXVersion>PDF/X-1a:2003</pdfxid:GTS_PDFXVersion>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:pdfx="http://ns.adobe.com/pdfx/1.3/">
         <pdfx:GTS_PDFXVersion>PDF/X-1a:2003</pdfx:GTS_PDFXVersion>
         <pdfx:GTS_PDFXConformance>PDF/X-1a:2003</pdfx:GTS_PDFXConformance>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">
         <pdfaid:part>1</pdfaid:part>
         <pdfaid:conformance>B</pdfaid:conformance>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:pdfaExtension="http://www.aiim.org/pdfa/ns/extension/"
            xmlns:pdfaSchema="http://www.aiim.org/pdfa/ns/schema#"
            xmlns:pdfaProperty="http://www.aiim.org/pdfa/ns/property#">
         <pdfaExtension:schemas>
            <rdf:Bag>
               <rdf:li rdf:parseType="Resource">
                  <pdfaSchema:namespaceURI>http://ns.adobe.com/pdf/1.3/</pdfaSchema:namespaceURI>
                  <pdfaSchema:prefix>pdf</pdfaSchema:prefix>
                  <pdfaSchema:schema>Adobe PDF</pdfaSchema:schema>
                  <pdfaSchema:property>
                     <rdf:Seq>
                        <rdf:li rdf:parseType="Resource">
                           <pdfaProperty:category>internal</pdfaProperty:category>
                           <pdfaProperty:description>A name object indicating whether the document has been modified to include trapping information</pdfaProperty:description>
                           <pdfaProperty:name>Trapped</pdfaProperty:name>
                           <pdfaProperty:valueType>Text</pdfaProperty:valueType>
                        </rdf:li>
                     </rdf:Seq>
                  </pdfaSchema:property>
               </rdf:li>
               <rdf:li rdf:parseType="Resource">
                  <pdfaSchema:namespaceURI>http://ns.adobe.com/pdfx/1.3/</pdfaSchema:namespaceURI>
                  <pdfaSchema:prefix>pdfx</pdfaSchema:prefix>
                  <pdfaSchema:schema>PDF/X ID Schema</pdfaSchema:schema>
                  <pdfaSchema:property>
                     <rdf:Seq>
                        <rdf:li rdf:parseType="Resource">
                           <pdfaProperty:category>internal</pdfaProperty:category>
                           <pdfaProperty:description>ID of PDF/X standard</pdfaProperty:description>
                           <pdfaProperty:name>GTS_PDFXVersion</pdfaProperty:name>
                           <pdfaProperty:valueType>Text</pdfaProperty:valueType>
                        </rdf:li>
                        <rdf:li rdf:parseType="Resource">
                           <pdfaProperty:category>internal</pdfaProperty:category>
                           <pdfaProperty:description>Conformance level of PDF/X standard</pdfaProperty:description>
                           <pdfaProperty:name>GTS_PDFXConformance</pdfaProperty:name>
                           <pdfaProperty:valueType>Text</pdfaProperty:valueType>
                        </rdf:li>
                        <rdf:li rdf:parseType="Resource">
                           <pdfaProperty:category>internal</pdfaProperty:category>
                           <pdfaProperty:description>Company creating the PDF</pdfaProperty:description>
                           <pdfaProperty:name>Company</pdfaProperty:name>
                           <pdfaProperty:valueType>Text</pdfaProperty:valueType>
                        </rdf:li>
                        <rdf:li rdf:parseType="Resource">
                           <pdfaProperty:category>internal</pdfaProperty:category>
                           <pdfaProperty:description>Date when document was last modified</pdfaProperty:description>
                           <pdfaProperty:name>SourceModified</pdfaProperty:name>
                           <pdfaProperty:valueType>Text</pdfaProperty:valueType>
                        </rdf:li>
                     </rdf:Seq>
                  </pdfaSchema:property>
               </rdf:li>
               <rdf:li rdf:parseType="Resource">
                  <pdfaSchema:namespaceURI>http://ns.adobe.com/xap/1.0/mm/</pdfaSchema:namespaceURI>
                  <pdfaSchema:prefix>xmpMM</pdfaSchema:prefix>
                  <pdfaSchema:schema>XMP Media Management</pdfaSchema:schema>
                  <pdfaSchema:property>
                     <rdf:Seq>
                        <rdf:li rdf:parseType="Resource">
                           <pdfaProperty:category>internal</pdfaProperty:category>
                           <pdfaProperty:description>UUID based identifier for specific incarnation of a document</pdfaProperty:description>
                           <pdfaProperty:name>InstanceID</pdfaProperty:name>
                           <pdfaProperty:valueType>URI</pdfaProperty:valueType>
                        </rdf:li>
                        <rdf:li rdf:parseType="Resource">
                           <pdfaProperty:category>internal</pdfaProperty:category>
                           <pdfaProperty:description>The common identifier for all versions and renditions of a document.</pdfaProperty:description>
                           <pdfaProperty:name>OriginalDocumentID</pdfaProperty:name>
                           <pdfaProperty:valueType>URI</pdfaProperty:valueType>
                        </rdf:li>
                     </rdf:Seq>
                  </pdfaSchema:property>
               </rdf:li>
               <rdf:li rdf:parseType="Resource">
                  <pdfaSchema:namespaceURI>http://www.aiim.org/pdfa/ns/id/</pdfaSchema:namespaceURI>
                  <pdfaSchema:prefix>pdfaid</pdfaSchema:prefix>
                  <pdfaSchema:schema>PDF/A ID Schema</pdfaSchema:schema>
                  <pdfaSchema:property>
                     <rdf:Seq>
                        <rdf:li rdf:parseType="Resource">
                           <pdfaProperty:category>internal</pdfaProperty:category>
                           <pdfaProperty:description>Part of PDF/A standard</pdfaProperty:description>
                           <pdfaProperty:name>part</pdfaProperty:name>
                           <pdfaProperty:valueType>Integer</pdfaProperty:valueType>
                        </rdf:li>
                        <rdf:li rdf:parseType="Resource">
                           <pdfaProperty:category>internal</pdfaProperty:category>
                           <pdfaProperty:description>Amendment of PDF/A standard</pdfaProperty:description>
                           <pdfaProperty:name>amd</pdfaProperty:name>
                           <pdfaProperty:valueType>Text</pdfaProperty:valueType>
                        </rdf:li>
                        <rdf:li rdf:parseType="Resource">
                           <pdfaProperty:category>internal</pdfaProperty:category>
                           <pdfaProperty:description>Conformance level of PDF/A standard</pdfaProperty:description>
                           <pdfaProperty:name>conformance</pdfaProperty:name>
                           <pdfaProperty:valueType>Text</pdfaProperty:valueType>
                        </rdf:li>
                     </rdf:Seq>
                  </pdfaSchema:property>
               </rdf:li>
               <rdf:li rdf:parseType="Resource">
                  <pdfaSchema:namespaceURI>http://www.npes.org/pdfx/ns/id/</pdfaSchema:namespaceURI>
                  <pdfaSchema:prefix>pdfxid</pdfaSchema:prefix>
                  <pdfaSchema:schema>PDF/X ID Schema</pdfaSchema:schema>
                  <pdfaSchema:property>
                     <rdf:Seq>
                        <rdf:li rdf:parseType="Resource">
                           <pdfaProperty:category>internal</pdfaProperty:category>
                           <pdfaProperty:description>ID of PDF/X standard</pdfaProperty:description>
                           <pdfaProperty:name>GTS_PDFXVersion</pdfaProperty:name>
                           <pdfaProperty:valueType>Text</pdfaProperty:valueType>
                        </rdf:li>
                     </rdf:Seq>
                  </pdfaSchema:property>
               </rdf:li>
            </rdf:Bag>
         </pdfaExtension:schemas>
      </rdf:Description>
   </rdf:RDF>
</x:xmpmeta>
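
To see which PDF/A level such an XMP packet claims, the `pdfaid:part` and `pdfaid:conformance` values can be read with the standard library. This is a minimal sketch: the helper names `pdfa_claim` and `_lookup` and the shortened `SAMPLE` packet are illustrative assumptions, and only the `pdfaid` namespace URI is taken from the packet above. It handles both the element form shown here and the attribute form that some producers write.

```python
import xml.etree.ElementTree as ET

# Namespace URI as it appears in the XMP packet quoted above.
PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/"


def _lookup(root, name):
    """Return a pdfaid value stored as a child element or as an attribute."""
    tag = f"{{{PDFAID_NS}}}{name}"
    elem = root.find(f".//{tag}")
    if elem is not None and elem.text:
        return elem.text.strip()
    for node in root.iter():
        if tag in node.attrib:
            return node.attrib[tag].strip()
    return None


def pdfa_claim(xmp_xml):
    """Return (part, conformance) as strings, or None if either is missing."""
    root = ET.fromstring(xmp_xml)
    part = _lookup(root, "part")
    conformance = _lookup(root, "conformance")
    if part is None or conformance is None:
        return None
    return part, conformance


# Shortened stand-in for the packet above (illustrative only).
SAMPLE = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">
      <pdfaid:part>1</pdfaid:part>
      <pdfaid:conformance>B</pdfaid:conformance>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""

if __name__ == "__main__":
    print(pdfa_claim(SAMPLE))
```

If the catalog loses its /Metadata entry, there is no packet to parse at all, which is what finding 1 describes.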

@MartinThoma
Member Author

The file trailer of the original looks like this:

trailer << /Info 2 0 R /Root 1 0 R /Size 810 /ID [<cc24aff220034a578f97b292897ecfb3><482975899621bd41f614731fdea45046>] >>

whereas the merged one looks like this:

trailer << /Info 2 0 R /Root 1 0 R /Size 783 /ID [<1916dc78292f2472ad342a4f13641c7f><1916dc78292f2472ad342a4f13641c7f>] >>

So we do have the ID keyword, but we violate

(isLinearized == true && firstPageID != null) || ((isLinearized != true) && lastID != null)
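
The two trailers can be compared mechanically. The sketch below uses hypothetical helper names (`trailer_ids`, `id_rule_holds`): it extracts the /ID pair from a trailer string and restates the quoted rule expression as a Python predicate. How veraPDF itself determines `lastID` or `firstPageID` is not modelled here, so the predicate only shows what the expression evaluates to once those values are known.

```python
import re

ID_PATTERN = re.compile(
    r"/ID\s*\[\s*<([0-9A-Fa-f]*)>\s*<([0-9A-Fa-f]*)>\s*\]"
)


def trailer_ids(trailer):
    """Return the two hex strings of the /ID array, or None if absent."""
    match = ID_PATTERN.search(trailer)
    if match is None:
        return None
    return match.group(1).lower(), match.group(2).lower()


def id_rule_holds(is_linearized, first_page_id, last_id):
    """Python restatement of the rule expression quoted above."""
    return (is_linearized and first_page_id is not None) or (
        not is_linearized and last_id is not None
    )


# Trailers copied from this comment.
ORIGINAL = (
    "trailer << /Info 2 0 R /Root 1 0 R /Size 810 /ID "
    "[<cc24aff220034a578f97b292897ecfb3><482975899621bd41f614731fdea45046>] >>"
)
MERGED = (
    "trailer << /Info 2 0 R /Root 1 0 R /Size 783 /ID "
    "[<1916dc78292f2472ad342a4f13641c7f><1916dc78292f2472ad342a4f13641c7f>] >>"
)

if __name__ == "__main__":
    for label, text in (("original", ORIGINAL), ("merged", MERGED)):
        first, second = trailer_ids(text)
        print(label, "identical halves:", first == second)
```

One visible difference is that the original file carries two distinct identifiers, while the merged file repeats the same value in both positions.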

@MartinThoma added the PdfMerger label Jun 22, 2022
@MartinThoma
Member Author

https://avepdf.com/pdfa-validation might also help us

@MartinThoma added the is-pdf/a-compliance label Sep 24, 2022
@stefan6419846
Collaborator

This should probably be updated to reflect the deprecation of PdfMerger in favor of PdfWriter.

@stefan6419846
Collaborator

According to pdfinfo, PDFA-in-a-Nutshell_1b.pdf states its conformance as Level A, Accessible, and so does the generated PDF file:

>>> from pypdf import PdfReader, PdfWriter
>>> reader = PdfReader('PDFA-in-a-Nutshell_1b.pdf')
>>> metadata = reader.metadata
>>> writer = PdfWriter(clone_from=reader)
>>> writer.add_metadata(metadata)
>>> writer.write('merged.pdf')
(True, <_io.FileIO [closed]>)
>>> 

Running this through VeraPDF with the PDF/A-1A profile, I get some different issues:

  1. The document catalog dictionary shall include a MarkInfo dictionary with a Marked entry in it, whose value shall be true. (docs) ×1
  2. The font dictionary shall include a ToUnicode entry whose value is a CMap stream object that maps character codes to Unicode values, as described in PDF Reference 5.9, unless the font meets any of the following three conditions: (*) fonts that use the predefined encodings MacRomanEncoding, MacExpertEncoding or WinAnsiEncoding, or that use the predefined Identity-H or Identity-V CMaps; (*) Type 1 fonts whose character names are taken from the Adobe standard Latin character set or the set of named characters in the Symbol font, as defined in PDF Reference Appendix D; (*) Type 0 fonts whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1 or Adobe-Korea1 character collections. (docs) ×65
  3. The logical structure of the conforming file shall be described by a structure hierarchy rooted in the StructTreeRoot entry of the document catalog dictionary, as described in PDF Reference 9.6. (docs) ×1
  4. A Level A conforming file shall specify the value of pdfaid:conformance as A. (docs) ×1
  5. The file trailer dictionary shall contain the ID keyword. The file trailer referred to is either the last trailer dictionary in a PDF file, as described in PDF Reference 3.4.4 and 3.4.5, or the first page trailer in a linearized PDF file, as described in PDF Reference F.2. (docs) ×1

Using the automatically detected profile (PDF/A-1B), only item 5 is reported.

@pubpub-zz
Collaborator

@stefan6419846
We have to try with incremental writing. Also, the only way to ensure the output meets PDF/A-1A would be to have inputs that respect the standard.

What do you think we should do about this issue? Close it as not planned?

@stefan6419846
Collaborator

We are recommending PdfWriter as the replacement for PdfMerger. Thus, I would recommend at least verifying that, given the above PDF/A-compliant document, PdfWriter(clone_from="PDFA-in-a-Nutshell_1b.pdf").write("out.pdf") does not destroy the document (possibly using the incremental mode), and that this is indeed documented properly.

@ssjkamei
Contributor

ssjkamei commented Oct 7, 2024

The output file with the following code passed!

from pypdf import PdfWriter

writer = PdfWriter(clone_from="PDFA-in-a-Nutshell_1b.pdf")
writer.write("out.pdf")

@stefan6419846
Collaborator

Ideally, we find a way to check this in CI as well to ensure that our changes do not accidentally break anything about this.

@ssjkamei
Contributor

ssjkamei commented Oct 7, 2024

That is certainly true.
That would be a powerful checking tool.

@stefan6419846
Collaborator

fpdf2 seems to already have some parts of this implemented in the CI, although ignoring PDF/A issues: https://github.com/py-pdf/fpdf2/blob/7784099dadeec551aa78511c06a6d7f525428265/.github/workflows/continuous-integration-workflow.yml#L45-L58

@pubpub-zz
Collaborator

Ideally, we find a way to check this in CI as well to ensure that our changes do not accidentally break anything about this.

We should

The output file with the following code passed!

writer = PdfWriter(clone_from="PDFA-in-a-Nutshell_1b.pdf")
writer.write("out.pdf")

Can you indicate which standard you checked the document against, and which tool or website you used?
I tried veraPDF and still got some errors in the XMP form (also present in the original).

@ssjkamei
Contributor

ssjkamei commented Oct 8, 2024

Can you indicate which standard you checked the document against, and which tool or website you used?
I tried veraPDF and still got some errors in the XMP form (also present in the original).

Sorry, I chose “PDF/A-1b Basic”. I should have chosen “PDF/A-1a”.

@stefan6419846
Collaborator

Ideally, we find a way to check this in CI as well to ensure that our changes do not accidentally break anything about this.

We should

Seems like some words got lost here? ;)

@pubpub-zz
Collaborator

We should (or might) prepare a dedicated set of tests to confirm this. However, I see two limitations:
a) veraPDF could be a candidate: do we need to set it up in a workflow?
b) We need to identify files that pass at least PDF/A-1a, but preferably PDF/A-2a/b/u. We will have to be clear that pypdf has no capability to automatically create or convert to a file compliant with the PDF/A standard.

@stefan6419846
Collaborator

There are indeed multiple ways for verification. veraPDF is a Java application and should be no real issue in CI.

For the PDF/A standard, we should start with a basic example like the file initially referenced in this issue. IMHO we never claimed that we would be able to generate such a file, and I have no plans to change this for now. This does not prevent us from running basic validation as mentioned before, that is, checking that passing an existing PDF/A file through pypdf does not break it, simply to document the current behavior and to catch side effects of other changes.
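
A CI step along these lines could shell out to the veraPDF command-line tool after writing the round-tripped file. The following is only a sketch under stated assumptions: the helper names are hypothetical, the executable is assumed to be on PATH as `verapdf`, and the `--flavour` and `--format` options are written as I understand the veraPDF CLI, so they should be checked against `verapdf --help` for the installed version. How the exit code and report map to pass or fail is deliberately left to the caller.

```python
import shutil
import subprocess


def build_verapdf_command(pdf_path, flavour="1b", executable="verapdf"):
    """Assemble the argument list (flag names are assumptions, see lead-in)."""
    return [
        executable,
        "--flavour", flavour,   # assumed option selecting the PDF/A profile
        "--format", "text",     # assumed option selecting a plain-text report
        str(pdf_path),
    ]


def run_verapdf(pdf_path, flavour="1b"):
    """Run veraPDF and return (exit code, report text) without judging them."""
    executable = shutil.which("verapdf")
    if executable is None:
        raise RuntimeError("verapdf executable not found on PATH")
    completed = subprocess.run(
        build_verapdf_command(pdf_path, flavour, executable),
        capture_output=True,
        text=True,
        check=False,  # a non-compliant file should not raise here
    )
    return completed.returncode, completed.stdout


if __name__ == "__main__":
    # "out.pdf" stands for the file written by the round-trip under test.
    print(build_verapdf_command("out.pdf"))
```

A test could then write the PDF/A sample through PdfWriter, call `run_verapdf` on the result, and compare the report with the one produced for the untouched input.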
