Browser Arabic Reading Inconsistent after update_page_form_field_values #1874

Open
yassinsameh opened this issue Jun 7, 2023 · 10 comments
Labels
workflow-forms From a users perspective, forms is the affected feature/workflow

Comments

@yassinsameh

yassinsameh commented Jun 7, 2023

After using update_page_form_field_values to update certain fields, some browsers render the filled Arabic fields incorrectly (unshaped and reversed):
Expected: ياسين حسام , Result in some cases: م ا س ح ن ي س ا ن (reversed, with the letters disconnected as if separated by spaces)

To clarify further, the Arabic text that is part of the form itself is rendered correctly; only the text filled in with update_page_form_field_values is not.

Working on: Chrome Mobile, Safari Mobile, Safari Web, Edge Mobile
Not Working: Chrome Web, Edge Web

This is a minimal, complete example that shows the issue:
PDF Sample:
github_issue_sample.pdf

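from pypdf import PdfReader, PdfWriter

# Setup assumed here; the original report only showed the update call
pdf_reader = PdfReader("github_issue_sample.pdf")
pdf_writer = PdfWriter()
pdf_writer.append(pdf_reader)

data = {"user_full_name": "ياسين حسام"}

# Update page 1; flags=1 sets the read-only field flag
pdf_writer.update_page_form_field_values(
    pdf_writer.pages[0], data,
    flags=1
)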
@yassinsameh yassinsameh changed the title Browser Arabic Reading Inconsistent Browser Arabic Reading Inconsistent after update_page_form_field_values Jun 7, 2023
@MartinThoma MartinThoma added workflow-arabic-text-extraction Related to text extraction, but with a focus on Arabic text and removed workflow-arabic-text-extraction Related to text extraction, but with a focus on Arabic text labels Jun 7, 2023
@pubpub-zz
Collaborator

@yassinsameh

Can you please provide the pypdf version?

@pubpub-zz
Collaborator

If the behavior differs between viewers, it means that the /NeedAppearances flag is not handled the same way by the different applications.
The solution would be to generate the appearance in the PDF. To do that, we use the font (and size) proposed in "/DR" and "/DA". Here, your document proposes a font that cannot display Arabic characters. Do you have a way to fix this?

@yassinsameh
Author

pypdf version: 3.9.1. I tried multiple fonts that support RTL, such as Adobe Arabic, but it still fails.

Could you clarify further what you mean by "The solution would be to generate the appearance in the PDF"?

@pubpub-zz Thank you for the efforts!

@pubpub-zz
Collaborator

I was a little too quick.
A PDF is essentially electronic paper. Fields are "special" areas where the user can input data. The "source" input data is stored in the "/V" entry, but at the same time the editor must generate the "printed" form of that input, which is stored in "/AP". Normally, so that all viewers produce the same output, they use the font type and size that the PDF producer application specified within the form, which are stored in "/DA" and "/DR".
Your document specifies a font that is not compatible with Arabic input.
As an alternative, PDF has a global flag, "/NeedAppearances", that asks the viewer to regenerate the appearances when opening the document. Based on your report, this fallback may not work for your use case with all viewers.
From my analysis, some viewers, such as Acrobat Reader, add their own font entry to be compatible with Arabic rendering.

As said, you should check whether you can complete or modify your empty form to embed a compatible font.
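
For anyone who wants to check these entries on their own form, here is a minimal sketch of a helper that summarizes the form-level keys mentioned above ("/NeedAppearances", "/DA", and the fonts under "/DR"). The function name `summarize_acroform` is hypothetical, and the `example` dictionary is a hand-written stand-in with illustrative values, not the contents of the attached sample.

```python
def summarize_acroform(acro_form):
    """Collect the form-level entries that decide how field text is drawn.

    `acro_form` is any dict-like object keyed by PDF names such as "/DA".
    """
    fonts = []
    if "/DR" in acro_form and "/Font" in acro_form["/DR"]:
        # Resource names of the fonts the form offers for field text
        fonts = sorted(str(name) for name in acro_form["/DR"]["/Font"])
    return {
        "need_appearances": (
            acro_form["/NeedAppearances"]
            if "/NeedAppearances" in acro_form
            else False
        ),
        "default_appearance": (
            str(acro_form["/DA"]) if "/DA" in acro_form else None
        ),
        "fonts": fonts,
    }


# Hand-written stand-in for an /AcroForm dictionary (illustrative values only)
example = {
    "/NeedAppearances": True,
    "/DA": "/Helv 0 Tf 0 g",
    "/DR": {"/Font": {"/Helv": {}, "/ZaDb": {}}},
}

summary = summarize_acroform(example)
print(summary)
```

With pypdf, the dictionary to pass in would presumably be `PdfReader("github_issue_sample.pdf").trailer["/Root"]["/AcroForm"]`; that usage is an assumption and has not been run against the sample from this issue. If none of the listed fonts covers Arabic, the generated "/AP" cannot show the text correctly regardless of what is stored in "/V".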

@pubpub-zz
Collaborator

@yassinsameh OK to close this issue, or do you have an updated PDF?

@yassinsameh
Author

Apologies for my late reply. Here is a PDF, github_issue_sample.pdf, with the Adobe Arabic font; it also does not work with the browsers and platforms mentioned above. I hope I understood you correctly. Do you need anything else to help debug this? @pubpub-zz

@yassinsameh
Author

@pubpub-zz Hello, is there any other info I can provide to help?

@pubpub-zz
Collaborator

Many fonts will not be able to handle Arabic.
I'm still looking to define/use a default font that is fully compatible with UTF-16. I'm starting to get some ideas, but it is a very tough problem 🤯

@arvo95

arvo95 commented Jun 29, 2023

Hi! I have previously filled out PDF forms with Arabic text and had the exact same problem. The issue is not on pypdf's side, but with the way Python treats Arabic. Since Arabic is written with the letters connected together, the neighbors of each letter determine what it looks like; these context-dependent renderings are called ligatures (a specific rendering of a letter). By default, Python treats Arabic exactly the same way as English: character by character, ignoring the different ways the letters need to be rendered to form coherent words, so the text comes out broken.
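
As a small illustration of that point, the sketch below uses only the standard library to list what a Python string actually holds for an Arabic word: base letters in reading (logical) order, with no information about which connected shape each letter should take. The escaped code points are assumed to match the first name from the report, and `describe` is a hypothetical helper name.

```python
import unicodedata

# Assumed code points of the first name from the report, written as escapes
word = "\u064a\u0627\u0633\u064a\u0646"


def describe(text):
    """Return (code point, Unicode name, bidi class) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.bidirectional(ch))
        for ch in text
    ]


for row in describe(word):
    print(*row)
```

Every entry is a plain base letter with bidi class `AL` (Arabic letter). Choosing the initial, medial, final, or isolated shape and laying the run out right to left is left to whatever renders the text.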

To fix this, you need to use a package called arabic_reshaper, which takes the letters and turns them into the appropriate ligatures so that the text comes out properly connected. To fix the text being reversed, you need to use the bidi package.

It kinda looks like this in the end:

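import arabic_reshaper
from bidi.algorithm import get_display
# Configure the reshaper for the ligatures supported by the target font
reshaper = arabic_reshaper.ArabicReshaper(arabic_reshaper.config_for_true_type_font(
    '/optional/directory/to_load_fonts_for_ligatures/font.ttf',
    arabic_reshaper.ENABLE_ALL_LIGATURES))
# Join the letters first, then reorder them for visual (left-to-right) display
text_fixed = get_display(reshaper.reshape(broken_text))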

It is very important to keep ENABLE_ALL_LIGATURES so that all of the special Arabic ligatures work.
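
To make the two steps more concrete, here is a standard-library-only toy version of the idea. It is not how arabic_reshaper or the bidi package are implemented; `contextual_forms` and `toy_shape` are hypothetical names, and the sketch ignores ligatures such as lam-alef, diacritics, and mixed-direction text. It picks each letter's shape from the Unicode presentation forms based on its neighbors, then reverses the run as a naive stand-in for reordering.

```python
import unicodedata


def contextual_forms(letter):
    """Map 'isolated'/'final'/'initial'/'medial' to the presentation-form
    character of one base letter (empty dict for non-Arabic characters)."""
    forms = {}
    for cp in range(0xFE70, 0xFF00):  # Arabic Presentation Forms-B block
        parts = unicodedata.decomposition(chr(cp)).split()
        # Keep entries such as '<initial> 064A' that map to exactly this letter
        if len(parts) == 2 and int(parts[1], 16) == ord(letter):
            forms[parts[0].strip("<>")] = chr(cp)
    return forms


def toy_shape(text):
    """Replace each letter with the shape its neighbors call for."""
    shaped = []
    for i, ch in enumerate(text):
        forms = contextual_forms(ch)
        prev_forms = contextual_forms(text[i - 1]) if i > 0 else {}
        next_forms = contextual_forms(text[i + 1]) if i + 1 < len(text) else {}
        # A letter connects backwards only if the previous letter can
        # connect forwards (it has an initial form) and this one has a
        # final form; the forward case mirrors that.
        joins_prev = "initial" in prev_forms and "final" in forms
        joins_next = "initial" in forms and "final" in next_forms
        if joins_prev and joins_next:
            key = "medial"
        elif joins_prev:
            key = "final"
        elif joins_next:
            key = "initial"
        else:
            key = "isolated"
        shaped.append(forms.get(key, ch))  # spaces etc. pass through
    return "".join(shaped)


word = "\u064a\u0627\u0633\u064a\u0646"  # assumed first name from the report
shaped = toy_shape(word)
visual = shaped[::-1]  # naive reordering, valid only for a pure RTL run

for ch in shaped:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

The shaped string still normalizes back to the original letters under NFKC, which is a convenient sanity check. For real forms, the two packages above handle the many cases this toy skips.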

@stefan6419846 stefan6419846 added the workflow-forms From a users perspective, forms is the affected feature/workflow label Feb 15, 2024
@stefan6419846
Collaborator

@arvo95 Are you interested in adding your example to the docs for further reference? Feel free to submit a corresponding PR in this case.

5 participants