Fitz - Error when processing images from PDF to HTML #1410
Replies: 1 comment
First of all, I have to mention that I cannot do anything about the (X)HTML / XML output, because these are direct wrappers of MuPDF code. But apart from that:
q
123.36 0 0 -33.36 42.48 799.4004 cm % matrix, turns upside down (matrix.d = -33.36)
/Im0 Do % xref 25
Q
q
123.36 0 0 -33.36 42.48 799.4004 cm % matrix, turns upside down
/Im1 Do % xref 26
Q
...
You can also verify this in Python directly:
>>> p1.get_image_rects(25, transform=True)
[(Rect(42.47999954223633, 42.599609375, 165.83999633789062, 75.95960998535156), Matrix(123.36000061035156, 0.0, -0.0, -33.36000061035156, 42.47999954223633, 75.95960998535156))]
>>> p1.get_image_rects(26, transform=True)
[(Rect(42.47999954223633, 42.599609375, 165.83999633789062, 75.95960998535156), Matrix(123.36000061035156, 0.0, -0.0, -33.36000061035156, 42.47999954223633, 75.95960998535156))]
All of that is, however, being ignored by the HTML output - I have no idea why!
Interesting: If you convert the page to an SVG
So what can you do?
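To make the effect of the matrix quoted above more concrete, here is a minimal pure-Python sketch (no PyMuPDF needed) that applies the six `cm` numbers to the unit square into which PDF paints an image. The names `PdfMatrix` and `image_placement` are hypothetical helpers, and the page height of 842 points is an assumption inferred from the numbers above (842 - 799.4004 = 42.5996, which matches the top of the reported Rect). The flip check is only meaningful for matrices without rotation or shear (b = c = 0), as is the case here.

```python
from typing import NamedTuple, Tuple


class PdfMatrix(NamedTuple):
    """The six numbers of a PDF 'cm' operator: a b c d e f (hypothetical helper)."""
    a: float
    b: float
    c: float
    d: float
    e: float
    f: float

    def apply(self, x: float, y: float) -> Tuple[float, float]:
        # Standard PDF point transformation: [x y 1] times the 3x3 matrix.
        return (self.a * x + self.c * y + self.e,
                self.b * x + self.d * y + self.f)


def image_placement(m: PdfMatrix, page_height: float):
    """Return ((x0, y0, x1, y1), flipped) for an image drawn under matrix m.

    The rectangle uses a top-left origin, like the Rect values printed above.
    page_height is an assumption supplied by the caller.
    """
    # PDF draws an image into the unit square; y = 1 is the image's top edge.
    corners = [m.apply(x, y) for x, y in ((0, 0), (1, 0), (0, 1), (1, 1))]
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    # Convert from PDF's bottom-left origin to a top-left origin.
    rect = (min(xs), page_height - max(ys), max(xs), page_height - min(ys))
    # Flipped vertically if the image's top edge lands below its bottom edge
    # in PDF space (for b = c = 0 this is the same as d < 0).
    flipped = m.apply(0, 1)[1] < m.apply(0, 0)[1]
    return rect, flipped


if __name__ == "__main__":
    m = PdfMatrix(123.36, 0, 0, -33.36, 42.48, 799.4004)
    rect, flipped = image_placement(m, page_height=842.0)  # 842 is assumed
    print(rect, flipped)
```

Under these assumptions the computed rectangle should come out close to the Rect printed above, and the flip flag is set because d is negative.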
Please provide all mandatory information!
Describe the bug (mandatory)
When page.get_text("html") is applied to some PDFs that contain images, the output shows the image turned upside down, and it adds a black image of the same size as the original image behind it. This does not happen with other documents.
To Reproduce (mandatory)
I attach two documents:
document_OK.pdf --> document where the error does not happen
document_ERROR.pdf --> document where the error happens.
In order to reproduce the error, please run the following code:
`
import fitz  # PyMuPDF
# Input files: one that converts correctly and one that shows the problem
file_ok = 'document_OK.pdf'
file_error = 'document_ERROR.pdf'
doc_ok = fitz.open(file_ok)
doc_error = fitz.open(file_error)
# Concatenate the HTML output of every page
html_text_ok = ''
for page in doc_ok:
    html_text_ok += page.get_text("html")
html_text_error = ''
for page in doc_error:
    html_text_error += page.get_text("html")
`
Expected behavior (optional)
The expected behavior is the following:
This generates the base64 image in image_base64_OK.txt and the HTML in document_OK.txt (uploading HTML files is not permitted, so I attach it as txt).
document_OK.txt
image_base64_OK.txt
However, the image in document_ERROR.pdf is converted to two images:
I also attach the base64 files of both images and the HTML file generated by fitz, document_ERROR.txt (again attached as txt even though it is an HTML file).
document_ERROR.txt
image_ERROR1.txt
image_ERROR2.txt
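For anyone who wants to check how many images ended up in the HTML without opening it in a browser, the following is a rough sketch using only the Python standard library. It assumes that the HTML output embeds images as `<img>` tags with base64 `data:` URIs, which is what the attached base64 files suggest; `ImgCollector` and `collect_images` are hypothetical helper names, not part of PyMuPDF.

```python
import base64
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collects every <img> tag whose src is a base64 data URI."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag != "img":
            return
        attr_map = dict(attrs)
        src = attr_map.get("src") or ""
        if src.startswith("data:") and ";base64," in src:
            header, payload = src.split(";base64,", 1)
            self.images.append({
                "mime": header[len("data:"):],         # e.g. "image/png"
                "style": attr_map.get("style") or "",  # positioning info, if any
                "data": base64.b64decode(payload),     # raw image bytes
            })


def collect_images(html_text):
    """Return a list of dicts describing the embedded images in html_text."""
    parser = ImgCollector()
    parser.feed(html_text)
    parser.close()
    return parser.images
```

Applied to `html_text_ok` and `html_text_error` from the reproduction code, comparing `len(collect_images(...))` for both would show whether the second document really yields one extra image, and the decoded bytes can be written to files for inspection.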
Screenshots (optional)
Screenshots added in the previous section.
Your configuration (mandatory)
Operating system: macOS 12.0.1
Python and PyMuPDF versions:
3.8.12 (default, Oct 22 2021, 18:39:35)
[Clang 13.0.0 (clang-1300.0.29.3)]
darwin
PyMuPDF 1.19.1: Python bindings for the MuPDF 1.19.0 library.
Version date: 2021-10-23 00:00:01.
Built for Python 3.8 on darwin (64-bit).
Additional context (optional)
Of course, I am aware that the problem might come from the PDF, but I don't really know why the image is recognized as two images, one of them turned upside down.
I would really appreciate it if you could give me some insight into how to solve this. Thanks a lot!