Fitz - Error when processing images from pdf to html #1410

carlaTV · 2021-11-11T11:20:11Z


Describe the bug (mandatory)

When page.get_text("html") is applied to some PDFs that contain images, the generated HTML shows the image upside down and adds a black image of the same size behind it. This does not happen with other documents.

To Reproduce (mandatory)

I attach two documents:

document_OK.pdf --> document where the error does not happen
document_OK.pdf
document_ERROR.pdf --> document where the error happens.
document_ERROR.pdf

In order to reproduce the error, please run the following code:

```python
import fitz  # PyMuPDF

file_ok = 'document_OK.pdf'
file_error = 'document_ERROR.pdf'

doc_ok = fitz.open(file_ok)
doc_error = fitz.open(file_error)

# Concatenate the HTML output of every page
html_text_ok = ''
for page in doc_ok:
    html_text_ok += page.get_text('html')

html_text_error = ''
for page in doc_error:
    html_text_error += page.get_text('html')
```

Expected behavior (optional)

The expected behavior is what happens with document_OK.pdf:

The image is emitted as base64 (see image_base64_OK.txt) and the resulting HTML is in document_OK.txt (uploading html files is not permitted, so I attach it as txt).

document_OK.txt
image_base64_OK.txt

However, the image in document_ERROR.pdf is converted to two images:

A black rectangle.
The image in the pdf turned upside down.

I also attach the base64 files of both images and the HTML file generated by fitz, document_ERROR.txt (again, I attach it as txt even though it is an html file).

document_ERROR.txt
image_ERROR1.txt
image_ERROR2.txt

Screenshots (optional)

Screenshots added in the previous section.

Your configuration (mandatory)

Operating system: MacOS 12.0.1

Python and PyMuPDF versions:
3.8.12 (default, Oct 22 2021, 18:39:35)
[Clang 13.0.0 (clang-1300.0.29.3)]
darwin

PyMuPDF 1.19.1: Python bindings for the MuPDF 1.19.0 library.
Version date: 2021-10-23 00:00:01.
Built for Python 3.8 on darwin (64-bit).

Additional context (optional)

Of course, I am aware that the problem might come from the pdf, but I don't really know why the image is recognized as two images, one of which is turned upside down.
I would really appreciate it if you could give me some insight into how to solve this. Thanks a lot!

JorjMcKie · 2021-11-11T13:50:12Z

Maintainer

First of all, I have to mention that I cannot do anything about (X)HTML / XML output, because these are direct wrappers of MuPDF code.

But apart from that:
The error PDF is really a weird thing: if you look at the list page.get_images() you will see 2 items, one at xref 25, one at xref 26. Both have the same image dimensions and both are displayed at the same bbox on the page - one (26) on top of the other (25).
This is why you cannot see xref 25 in the PDF display.
But image 25 is simply a black rectangle, which you can see when you save the pixmap fitz.Pixmap(doc, 25).
Image 26 is indeed stored as an upside-down image - again, try saving fitz.Pixmap(doc, 26).
So the page compensates for this before it displays the image. This can be seen right at the beginning of the page's /Contents object, which looks like this:

q
123.36 0 0 -33.36 42.48 799.4004 cm  % matrix, turns upside down (matrix.d = -33.36)
/Im0 Do   % xref 25
Q
q
123.36 0 0 -33.36 42.48 799.4004 cm  % matrix, turns upside down
/Im1 Do   % xref 26
Q
...

You can also verify this in Python directly:

>>> p1.get_image_rects(25, transform=True)
[(Rect(42.47999954223633, 42.599609375, 165.83999633789062, 75.95960998535156), Matrix(123.36000061035156, 0.0, -0.0, -33.36000061035156, 42.47999954223633, 75.95960998535156))]
>>> p1.get_image_rects(26, transform=True)
[(Rect(42.47999954223633, 42.599609375, 165.83999633789062, 75.95960998535156), Matrix(123.36000061035156, 0.0, -0.0, -33.36000061035156, 42.47999954223633, 75.95960998535156))]
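
The arithmetic behind that matrix can be checked by hand. The following is only a sketch: it applies the `cm` matrix from the /Contents excerpt to the corners of the image's unit square and converts the result to MuPDF's top-left coordinate system. The page height of 842 pt (A4) is an assumption that is not stated in this thread, although it is consistent with the numbers reported above.

```python
# Matrix [a b c d e f] taken from the "cm" operator in the /Contents excerpt.
a, b, c, d, e, f = 123.36, 0.0, 0.0, -33.36, 42.48, 799.4004

# Assumption (not stated in the thread): the page is 842 pt high (A4).
PAGE_HEIGHT = 842.0


def to_page(x, y):
    """Map a point of the image's unit square to PDF page coordinates (y grows upward)."""
    return (a * x + c * y + e, b * x + d * y + f)


bottom_left = to_page(0, 0)  # where the image's bottom-left corner lands
top_left = to_page(0, 1)     # where the image's top-left corner lands
top_right = to_page(1, 1)    # where the image's top-right corner lands

# With d < 0 the image's top edge lands below its bottom edge: a vertical flip.
flipped = top_left[1] < bottom_left[1]

# Convert to MuPDF coordinates (origin top-left, y grows downward) and normalize.
x0, x1 = sorted((bottom_left[0], top_right[0]))
y0, y1 = sorted((PAGE_HEIGHT - bottom_left[1], PAGE_HEIGHT - top_right[1]))
bbox = (x0, y0, x1, y1)

print(flipped, [round(v, 4) for v in bbox])
```

Under that page-height assumption, the computed bbox agrees with the Rect returned by get_image_rects up to float rounding, and the negative d is what flips the stored (upside-down) image back for display.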

All of that is, however, being ignored by the HTML output - I have no idea why!
BTW the page looks quite ok apart from the upside down image. The black image is there, but invisible.

Interesting: if you convert the page to an SVG with page.get_svg_image(...) (which can also be displayed by browsers), then the image is correct 🤔!

So what can you do?

You could suppress images in the output altogether (via the flags parameter of get_text)
You could send an error report to MuPDF at https://bugs.ghostscript.com/enter_bug.cgi
