fitz - pdf to image conversion - some text characters are getting converted to junk #1627

Raxidi · 2022-03-07T09:55:20Z

When a PDF file is converted to images, the text in some parts of the image is changed or turns into junk characters.

To Reproduce (mandatory)

import fitz
dpi = 200
dpi_matrix = fitz.Matrix(dpi / 72, dpi / 72)
file_path = "test.pdf"
with fitz.open(file_path) as pdf_file:
    for page in pdf_file:
        page_pixel = page.get_pixmap(matrix=dpi_matrix)
        page_pixel.set_dpi(dpi, dpi)
        page_pixel.save(f"{page.number}.png")

Expected behavior (optional)

Each image should be a faithful copy of the corresponding PDF page.

Screenshots (optional)

I cannot upload the files here due to some constraints, so I will send them to you by e-mail.

Screen shot from PDF file

Converted image of the same page.

Your configuration (mandatory)

OS Win10/Win11 64bit
3.8.6 (tags/v3.8.6:db45529, Sep 23 2020, 15:52:53) [MSC v.1927 64 bit (AMD64)]
win32

PyMuPDF 1.19.6: Python bindings for the MuPDF 1.19.0 library.
Version date: 2022-03-03 00:00:01.
Built for Python 3.8 on win32 (64-bit).

Thank you.

JorjMcKie · 2022-03-07T11:16:39Z

JorjMcKie
Mar 7, 2022
Maintainer

Thanks for the well-prepared report. I also received the material via e-mail.
Unfortunately I cannot do anything about this, because it is an upstream (MuPDF) issue. Your files were not created correctly:
They use non-embedded fonts like Times Roman, but with Identity encoding instead of, e.g., WinAnsiEncoding. Fonts that use Identity-H encoding must be embedded, however.
The problem shows up in any PDF viewer if you try to copy and paste the affected text portions: the result is garbage.
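
To check whether a given file shows the font pattern described above, something like the following sketch may help. It relies on PyMuPDF's page.get_fonts(), which returns tuples of the form (xref, ext, type, basefont, name, encoding). The function names find_suspect_fonts and scan_pdf are made up for this illustration, and treating ext == "n/a" as "not embedded" is an assumption: it is a heuristic, not a definitive test.

```python
from typing import Iterable, List, Tuple

# Shape of one entry from PyMuPDF's page.get_fonts():
# (xref, ext, type, basefont, name, encoding)
FontEntry = Tuple[int, str, str, str, str, str]


def find_suspect_fonts(font_entries: Iterable[FontEntry]) -> List[FontEntry]:
    """Return fonts that use an Identity encoding but look non-embedded.

    Assumption: ext == "n/a" (no extractable font file) is treated as
    "not embedded". This is a heuristic, not a definitive test.
    """
    suspects = []
    for entry in font_entries:
        ext, encoding = entry[1], entry[5]
        if ext == "n/a" and (encoding or "").startswith("Identity"):
            suspects.append(entry)
    return suspects


def scan_pdf(path: str) -> None:
    """Print suspect fonts page by page (requires PyMuPDF)."""
    import fitz  # imported lazily so the helper above works without PyMuPDF

    with fitz.open(path) as doc:
        for page in doc:
            for xref, ext, ftype, basefont, name, enc in find_suspect_fonts(page.get_fonts()):
                print(f"page {page.number}: {basefont} ({ftype}, {enc}), xref {xref}")
```

Calling scan_pdf("test.pdf") would list the fonts worth a closer look; an empty output does not prove that the file is fine.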

One could argue that most viewers still render the page correctly, so why does MuPDF not do the same?
This I cannot answer, so I must refer you to MuPDF's bug tracker: https://bugs.ghostscript.com/enter_bug.cgi.
Choose an unproblematic / non-confidential page and report that mutool draw -o page.png your.pdf produces a garbled image.
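
For the bug report, the rendering can be reproduced without PyMuPDF by calling MuPDF's command-line tool directly. The sketch below wraps the draw subcommand of mutool, using -r for the resolution and -o for the output file. It assumes mutool is installed and on the PATH; the helper names mutool_draw_command and render_with_mutool, and the default of 200 dpi (taken from the reproduction script), are illustrative choices.

```python
import subprocess
from typing import List


def mutool_draw_command(pdf_path: str, png_path: str, page: int = 1, dpi: int = 200) -> List[str]:
    """Build a `mutool draw` command that renders one page to a PNG.

    page is 1-based, as mutool expects; dpi is passed via the -r option.
    """
    return ["mutool", "draw", "-r", str(dpi), "-o", png_path, pdf_path, str(page)]


def render_with_mutool(pdf_path: str, png_path: str, page: int = 1, dpi: int = 200) -> None:
    """Run mutool; raises CalledProcessError if it exits with an error."""
    subprocess.run(mutool_draw_command(pdf_path, png_path, page, dpi), check=True)
```

For example, render_with_mutool("your.pdf", "page.png") renders the first page at 200 dpi. If that image is garbled as well, the problem is independent of the Python bindings.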

The font marked yellow for document-page1.pdf is the problem:

2 replies

Raxidi Mar 7, 2022
Author

Thank you very much for the quick response. I have reported this to MuPDF (Bug 705038).

JorjMcKie Mar 7, 2022
Maintainer

Thanks, I will add myself to their CC list.
