Form rendering differences for textboxes and checkboxes when generated with pypdf and Acrobat #3115

Open
alpepi opened this issue Feb 10, 2025 · 13 comments
Labels
workflow-forms From a users perspective, forms is the affected feature/workflow

Comments

@alpepi

alpepi commented Feb 10, 2025

I have a PDF form I am trying to autofill. The form was made with Adobe Acrobat Pro, and it contains various textboxes and checkboxes.

I am able to add text entries in the textboxes, and when I open that output PDF in Adobe Acrobat, the file opens and the entries "render" properly.

However, the issue arises when I also check checkboxes. When I fill textboxes and checkboxes with pypdf and then open the PDF form in Adobe Acrobat, the checkboxes aren't truly checked and the formatting of all the textboxes is broken (i.e. not rendered). Let me explain further.

Text in pypdf-filled textboxes looks like this:
Image

The formatting only fixes itself if I manually modify the text entry with Adobe Acrobat:
Image

Also, visually the checkboxes are checked:
Image

However, if I manually click this checkbox once in Acrobat, the box remains checked (i.e. it was never ON in Acrobat's eyes?)
Image
A second click then unchecks the box. In contrast, checkboxes not touched by pypdf don't behave this way: the first click switches the checkbox to the opposite state, as expected.

Once I manually interact with the pypdf-filled checkboxes in Acrobat and save the document, all the textboxes render properly upon re-opening. So in summary, checking checkboxes with pypdf breaks the rendering of all textboxes in Acrobat, and it seems that Acrobat doesn't recognize the checkboxes checked by pypdf.

Environment

Python 3.10.11, pypdf 5.2.0
After pypdf outputs the PDF, I view in Adobe Acrobat.

Note that this doesn't seem to be an issue when opening the PDF with Chrome. But for my purposes I must use Adobe Acrobat.

Code + PDF

from pypdf import PdfReader, PdfWriter


def write_to_form(input_name: str, dict_to_write: dict, output_name: str, auto_regen: bool) -> None:
    """Fill the form fields on the first page of input_name and save to output_name."""
    reader = PdfReader(input_name)
    writer = PdfWriter()
    writer.append(reader)

    # Keys are field names (/T); checkbox values are appearance state names such as "/On".
    writer.update_page_form_field_values(
        writer.pages[0],
        dict_to_write,
        auto_regenerate=auto_regen,
    )

    with open(output_name, "wb") as output_stream:
        writer.write(output_stream)


textboxes = {"TextBox1": "my text entry."}
buttons = {"Other": "/On"}

to_write = buttons | textboxes
write_to_form("blank-form.pdf", to_write, "filled-out.pdf", False)

Unfortunately I can't share the PDF file itself.
Here is a redacted version of my PDF form. I cleared the PDF, other than the boxes of interest. The behaviour described above is still present in the remaining boxes.
blank-form.pdf

Test PDF Form - pypdf filled.pdf
Test PDF Form - Acrobat filled.pdf

@stefan6419846
Collaborator

Please fill the complete template, including the code and a PDF file to allow us to reproduce your issues. Additionally, debugging might be easier if you are able to provide the original and the form as filled by Acrobat to look for the relevant differences.

stefan6419846 added the "needs-pdf" and "needs-example-code" labels on Feb 10, 2025
@alpepi
Author

alpepi commented Feb 10, 2025

@stefan6419846 Apologies, I accidentally pressed Ctrl+Enter and submitted my issue half complete. I finished editing it now and it is complete. Let me know if more information is needed. Thanks!

stefan6419846 removed the "needs-example-code" label on Feb 10, 2025
@stefan6419846
Collaborator

No worries. Unfortunately, without the requested PDF files (either provided publicly or privately) there is not much we can/will do here as this leads to just guessing.

@alpepi
Author

alpepi commented Feb 10, 2025

Ok understood. I will try to make a new PDF form that I can share, and reproduce the error. Stay tuned.

@alpepi
Author

alpepi commented Feb 10, 2025

@stefan6419846 I attached a PDF form with the described issue.

@alpepi
Author

alpepi commented Feb 10, 2025

I did some more troubleshooting, and here are my observations:

  • I created a new PDF form from scratch using Word and Acrobat. I was able to re-create the "checkbox first-click behaviour", but not the text render/format behaviour.

  • In a copy of the original form, I added a new textbox using Acrobat. Only the old text box has formatting/rendering issues when filled by pypdf. The new textbox renders properly despite any checkbox state.
    Image

  • When pypdf fills the old textbox only (no other textboxes or checkboxes are filled by pypdf), the text renders properly only with auto_regenerate=True, not with auto_regenerate=False. New textboxes don't behave that way.

@stefan6419846
Collaborator

Thanks for providing an example file, although analysis probably would be easier if we have a version of the form filled using Acrobat as well - from our experience, there might just be one attribute which is different.

stefan6419846 added the "workflow-forms" label and removed the "needs-pdf" label on Feb 10, 2025
@alpepi
Author

alpepi commented Feb 10, 2025

Ok, I added two more PDF files to my post. One is the pypdf output with described issues, and one that I manually filled using Acrobat which doesn't have the issues. Both used the same blank-form.pdf to start.

Hope it helps. Thanks for your time!

@stefan6419846
Collaborator

The text box looks rather similar, except that Adobe seems to position each word separately (the first dump is the pypdf output, the second the Acrobat output):

17 0 obj

<<
/P 2 0 R
/T (TextBox1)
/V (My text box and checkbox filled by pypdf)
/DA (/Helv 9 Tf 0 g)
/Rect [38.9527 437.411 586.654 549.011]
/Ff 4096
/TU (TB1)
/Subtype /Widget
/F 4
/Type /Annot
/DR 
<<
/Font 
<<
/Helv 30 0 R
>>
/Encoding 
<<
/PDFDocEncoding 31 0 R
>>
>>
/FT /Tx
/AP 
<<
/N 32 0 R
>>
>>
endobj

32 0 obj

<<
/Subtype /Form
/Type /XObject
/Matrix [1 0.0 0.0 1 0.0 0.0]
/FormType 1
/Resources 
<<
/Font 
<<
/Helv 30 0 R
>>
>>
/BBox [0.0 0.0 547.7013 111.6]
/Length 271
>>
stream
q
/Tx BMC 
q
1 1 546.7013 110.59999999999997 re
W
BT
/Helv 9.0 Tf 0 g
2 101.59999999999997 Td
<004d00790020007400650078007400200062006f007800200061006e006400200063006800650063006b0062006f0078002000660069006c006c00650064002000620079002000700079007000640066> Tj
ET
Q
EMC
Q

endstream 
endobj
389 0 obj

<<
/P 189 0 R
/T (TextBox1)
/V (my text box and checkbox manually filled with Adobe Acrobat)
/DA (/Helv 9 Tf 0 g)
/Rect [38.9527 437.411 586.654 549.011]
/Ff 4096
/TU (TB1)
/Subtype /Widget
/F 4
/Type /Annot
/DR 
<<
/Font 
<<
/Helv 391 0 R
>>
/Encoding 
<<
/PDFDocEncoding 392 0 R
>>
>>
/FT /Tx
/AP 
<<
/N 393 0 R
>>
>>
endobj

393 0 obj

<<
/Resources 
<<
/ProcSet [/PDF]
/Font 
<<
/Helv 390 0 R
>>
>>
/BBox [0.0 0.0 547.701 111.6]
/Length 277
>>
stream
/Tx BMC 
BT
/Helv 9 Tf 0 g
0 g
2 99.196 Td
(my ) Tj
14.496 0 Td
(text ) Tj
16.992 0 Td
(box ) Tj
17.004 0 Td
(and ) Tj
17.508 0 Td
(checkbox ) Tj
40.512 0 Td
(manually ) Tj
38.508 0 Td
(filled ) Tj
20.994 0 Td
(with ) Tj
18.492 0 Td
(Adobe ) Tj
28.506 0 Td
(Acrobat) Tj
ET
EMC

endstream 
endobj
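
For anyone comparing the two appearance streams above, here is a minimal sketch. It assumes the hex operand of pypdf's single Tj can be read as UTF-16BE (an assumption that happens to reproduce the /V text), and it sums the relative Td offsets from the first lines of Acrobat's stream to recover where each word starts. The helper names are hypothetical and not part of pypdf.

```python
import re

# Hex operand of the single Tj in pypdf's appearance stream (object 32 above).
PYPDF_HEX = "004d00790020007400650078007400200062006f007800200061006e006400200063006800650063006b0062006f0078002000660069006c006c00650064002000620079002000700079007000640066"

# First operators of Acrobat's appearance stream (object 393 above).
ACROBAT_STREAM = """
2 99.196 Td
(my ) Tj
14.496 0 Td
(text ) Tj
16.992 0 Td
(box ) Tj
"""


def decode_hex_string(hex_digits: str) -> str:
    """Decode a PDF hex string, assuming two bytes per character (UTF-16BE)."""
    return bytes.fromhex(hex_digits).decode("utf-16-be")


def word_positions(stream: str) -> list[tuple[str, float]]:
    """Return (text, x) pairs; each Td offset is relative to the previous line start."""
    x = 0.0
    positions = []
    for line in stream.strip().splitlines():
        line = line.strip()
        move = re.fullmatch(r"(-?[\d.]+) (-?[\d.]+) Td", line)
        if move:
            x += float(move.group(1))
            continue
        show = re.fullmatch(r"\((.*)\) Tj", line)
        if show:
            positions.append((show.group(1), round(x, 3)))
    return positions


print(decode_hex_string(PYPDF_HEX))
print(word_positions(ACROBAT_STREAM))
```

The sketch only makes the structural difference visible: one two-byte hex string in pypdf's stream versus one single-byte literal string per word in Acrobat's. Whether that difference matters to Acrobat is not established in this thread.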

There is not much difference for the checkbox either:

16 0 obj

<<
/AS /On
/P 2 0 R
/T (Other)
/V (/On)
/DA (/ZaDb 0 Tf 0 g)
/MK 
<<
/CA (n)
>>
/Rect [41.76 578.28 50.0945 585.96]
/TU (Other)
/Subtype /Widget
/F 4
/Type /Annot
/FT /Btn
/AP 
<<
/D 
<<
/Off 26 0 R
/On 27 0 R
>>
/N 
<<
/On 28 0 R
>>
>>
>>
endobj


26 0 obj

<<
/Subtype /Form
/Matrix [1 0.0 0.0 1 0.0 0.0]
/Type /XObject
/FormType 1
/Resources 
<<
/ProcSet [/PDF]
>>
/BBox [0.0 0.0 8.33455 7.67999]
/Length 36
>>
stream
q
0.749023 g
0 0 8.3345 7.68 re
f
Q

endstream 
endobj

27 0 obj

<<
/Subtype /Form
/Matrix [1 0.0 0.0 1 0.0 0.0]
/Type /XObject
/FormType 1
/Resources 
<<
/ProcSet [/PDF /Text]
/Font 
<<
/ZaDb 29 0 R
>>
>>
/BBox [0.0 0.0 8.33455 7.67999]
/Length 119
>>
stream
q
0.749023 g
0 0 8.3345 7.68 re
f
Q
q
1 1 6.3345 5.68 re
W
n
BT
/ZaDb 4 Tf
2.6453 2.486 Td
3.852 TL
0 0 Td
(n) Tj
ET
Q

endstream 
endobj

29 0 obj

<<
/Name /ZaDb
/Subtype /Type1
/BaseFont /ZapfDingbats
/Type /Font
>>
endobj

28 0 obj

<<
/Subtype /Form
/Matrix [1 0.0 0.0 1 0.0 0.0]
/Type /XObject
/FormType 1
/Resources 
<<
/ProcSet [/PDF /Text]
/Font 
<<
/ZaDb 29 0 R
>>
>>
/BBox [0.0 0.0 8.33455 7.67999]
/Length 83
>>
stream
q
1 1 6.3345 5.68 re
W
n
BT
/ZaDb 4 Tf
2.6453 2.486 Td
3.852 TL
0 0 Td
(n) Tj
ET
Q

endstream 
endobj
388 0 obj

<<
/AS /On
/P 189 0 R
/T (Other)
/V /On
/DA (/ZaDb 0 Tf 0 g)
/MK 
<<
/CA (n)
>>
/Rect [41.76 578.28 50.0945 585.96]
/TU (Other)
/Subtype /Widget
/F 4
/Type /Annot
/FT /Btn
/AP 
<<
/D 
<<
/Off 383 0 R
/On 384 0 R
>>
/N 
<<
/On 381 0 R
>>
>>
>>
endobj

383 0 obj

<<
/Subtype /Form
/Matrix [1.0 0.0 0.0 1.0 0.0 0.0]
/Type /XObject
/FormType 1
/Resources 
<<
/ProcSet [/PDF]
>>
/BBox [0.0 0.0 8.33455 7.67999]
/Length 36
>>
stream
q
0.749023 g
0 0 8.3345 7.68 re
f
Q

endstream 
endobj

384 0 obj

<<
/Subtype /Form
/Matrix [1.0 0.0 0.0 1.0 0.0 0.0]
/Type /XObject
/FormType 1
/Resources 
<<
/ProcSet [/PDF /Text]
/Font 
<<
/ZaDb 382 0 R
>>
>>
/BBox [0.0 0.0 8.33455 7.67999]
/Length 119
>>
stream
q
0.749023 g
0 0 8.3345 7.68 re
f
Q
q
1 1 6.3345 5.68 re
W
n
BT
/ZaDb 4 Tf
2.6453 2.486 Td
3.852 TL
0 0 Td
(n) Tj
ET
Q

endstream 
endobj

382 0 obj

<<
/Name /ZaDb
/Subtype /Type1
/BaseFont /ZapfDingbats
/Type /Font
>>
endobj

381 0 obj

<<
/Subtype /Form
/Matrix [1.0 0.0 0.0 1.0 0.0 0.0]
/Type /XObject
/FormType 1
/Resources 
<<
/ProcSet [/PDF /Text]
/Font 
<<
/ZaDb 382 0 R
>>
>>
/BBox [0.0 0.0 8.33455 7.67999]
/Length 83
>>
stream
q
1 1 6.3345 5.68 re
W
n
BT
/ZaDb 4 Tf
2.6453 2.486 Td
3.852 TL
0 0 Td
(n) Tj
ET
Q

endstream 
endobj

You are of course invited to further look into it and check which changes do indeed make the difference here. For now, I do not have enough time to dig deeper into this stuff myself.

stefan6419846 changed the title from "Issue checking PDF form Check Boxes" to "Form rendering differences for textboxes and checkboxes when generated with pypdf and Acrobat" on Feb 11, 2025
@alpepi
Author

alpepi commented Feb 11, 2025

I've had time to compare the data behind the checkboxes. I spotted a difference in the /V values:

pypdf checkbox

16 0 obj

<<
/AS /On
/P 2 0 R
/T (Other)
/V (/On)
/DA (/ZaDb 0 Tf 0 g)
/MK 
<<

Acrobat checkbox

388 0 obj

<<
/AS /On
/P 189 0 R
/T (Other)
/V /On
/DA (/ZaDb 0 Tf 0 g)
/MK 
<<

It's subtle, but there are no parentheses around the "/On" value (/V) in Acrobat's file.

Test Fix
Using a text editor, I changed /V (/On) to /V /On in pypdf's file. I then opened it in Acrobat, and the box was checked. Clicking it once unchecked it, so it now behaves as expected in Acrobat!

Test PDF Form - pypdf filled - On value manual fix.pdf

P.S.
Also, when I open pypdf's PDF in Visual Studio Code (UTF-8), I actually see the following:

>>
/P 4 0 R
/Rect [ 41.76 578.28 50.0945 585.96 ]
/Subtype /Widget
/T (Other)
/TU (Other)
/Type /Annot
/V (\057On)
>>

Changing it to /V \057On makes the box disappear completely. Without parentheses, only /V /On seems to work.

At first glance, it seems like we can fix the checkbox behaviour, but the text formatting still remains "broken" without manual intervention. That's all I can dig into for now.

@stefan6419846
Collaborator

It's subtle, but there are no parentheses around the "/On" value (/V) in Acrobat's file.

Feel free to submit a PR for this, although the PDF 2.0 specification states in section 12.7.5.2.3:

The V entry in the field dictionary [...] holds a name object representing the check box’s appearance state, which shall be used to select the appropriate appearance from the appearance dictionary. The value of the V key shall also be the value of the AS key. If they are not equal, then the value of the AS key shall be used instead of the V key to determine which appearance to use.

(/On) is a string object and thus indeed wrong, while /On is correctly a name object (sections 7.3.4 and 7.3.5). Nevertheless, according to the last sentence, Acrobat should (in theory) not care about the (possibly wrong) /V value.
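
The quoted rule can be sketched as follows. This is only an illustration: PDF objects are modelled as hypothetical (kind, value) tuples rather than pypdf's object classes, and effective_state is an invented helper name.

```python
# Stand-in model: a PDF object is a (kind, value) tuple, e.g. ("name", "/On").
def effective_state(field: dict) -> tuple:
    """Select the appearance state: /AS takes precedence when it differs from /V."""
    value = field.get("/V")
    appearance = field.get("/AS")
    if appearance is not None and appearance != value:
        return appearance
    return value


# pypdf's output: /V is a string object, /AS is a name object.
pypdf_field = {"/V": ("string", "/On"), "/AS": ("name", "/On")}
# Acrobat's output: both entries are name objects.
acrobat_field = {"/V": ("name", "/On"), "/AS": ("name", "/On")}

print(effective_state(pypdf_field))
print(effective_state(acrobat_field))
```

Under this reading both files resolve to the same state, which is why a viewer that follows the rule strictly would not be expected to treat them differently.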

Changing it to /V \057On makes the box disappear completely. Without parentheses, only /V /On seems to work.

\057 is the encoded slash and only allowed in string objects, but not in name objects.
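
As a rough illustration of the distinction, the sketch below classifies a raw token by its delimiters and resolves octal escapes the way a literal string would be read. Both helpers are hypothetical simplifications and ignore most of the PDF syntax (nested parentheses, other escapes, #xx sequences in names).

```python
import re


def classify_token(token: str) -> str:
    """Very rough classification of a raw PDF token by its delimiters."""
    token = token.strip()
    if token.startswith("(") and token.endswith(")"):
        return "string"
    if token.startswith("/"):
        return "name"
    return "other"


def decode_octal_escapes(text: str) -> str:
    """Resolve backslash escapes with one to three octal digits, as used in literal strings."""
    return re.sub(r"\\([0-7]{1,3})", lambda m: chr(int(m.group(1), 8)), text)


print(classify_token(r"(\057On)"))        # parenthesised, so a string object
print(decode_octal_escapes(r"(\057On)"))  # octal 057 is the slash character
print(classify_token("/On"))              # leading slash, so a name object
print(classify_token(r"\057On"))          # neither form
```

This matches the observation above: (\057On) and (/On) are the same string object, while a bare \057On is neither a string nor a name.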

@stefan6419846
Collaborator

I have been able to pinpoint this to

pypdf/pypdf/_writer.py

Lines 1107 to 1114 in 2263dcb

parent_annotation[NameObject(FA.V)] = TextStringObject(value)
if parent_annotation.get(FA.FT) in ("/Btn"):
    # Checkbox button (no /FT found in Radio widgets)
    v = NameObject(value)
    if v not in annotation[NameObject(AA.AP)][NameObject("/N")]:
        v = NameObject("/Off")
    # other cases will be updated through the for loop
    annotation[NameObject(AA.AS)] = v

Line 1107 sets the /V value to a string object, regardless of the widget type. The following lines do not correct this to a name object.

Could you please test whether adding

                    parent_annotation[NameObject(FA.V)] = v

after line 1114 solves your issue with the checkboxes?
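
To make the effect of that extra assignment concrete, here is a self-contained stand-in that mimics the quoted logic without pypdf. Name and Text are hypothetical substitutes for NameObject and TextStringObject, set_checkbox_value is an invented helper, and the field and its widget are collapsed into a single dict (as in object 16 above).

```python
class Name(str):
    """Stand-in for a PDF name object such as /On (not pypdf's NameObject)."""


class Text(str):
    """Stand-in for a PDF string object such as (/On) (not pypdf's TextStringObject)."""


def set_checkbox_value(field: dict, value: str, apply_fix: bool) -> dict:
    # Mirrors line 1107: /V is always written as a string object first.
    field["/V"] = Text(value)
    if field.get("/FT") == "/Btn":
        state = Name(value)
        # Fall back to /Off when the normal appearance has no such state.
        if state not in field["/AP"]["/N"]:
            state = Name("/Off")
        field["/AS"] = state
        if apply_fix:
            # Proposed addition: also store /V as a name object.
            field["/V"] = state
    return field


template = {"/FT": "/Btn", "/AP": {"/N": {"/On": None}}}
before = set_checkbox_value(dict(template), "/On", apply_fix=False)
after = set_checkbox_value(dict(template), "/On", apply_fix=True)
print(type(before["/V"]).__name__, type(after["/V"]).__name__)
```

Without the extra assignment /V stays a string object; with it, /V and /AS hold the same name object, matching what Acrobat writes.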

@alpepi
Author

alpepi commented Feb 12, 2025

I added parent_annotation[NameObject(FA.V)] = v after line 1114 in _writer.py, and I think the issue is fixed: the checkbox is responsive and the text renders properly. Here is the output now, using the same blank form as in my initial post.

Test PDF Form - pypdf filled - line 1115 test fix.pdf

Note that my initial code used auto_regenerate=False. With this test fix, the text doesn't render automatically unless auto_regenerate=True (or auto_regenerate is omitted altogether). I used auto_regenerate=True to generate the attached PDF, so that is a slight difference from the initial code, but to me that is the intended behaviour.
