extract_text leads to Chinese characters instead of ASCII #1672
Comments
https://superuser.com/q/278562/64857 might be worth a try as well to fix the PDF.
I've analyzed the PDF and I'm full of doubt:
and the content of ToUnicode is:
The codespacerange shows a 2-byte encoding, as stated in :
When you decode the binary sequence with utf-16-be, as expected for 2-byte encoded glyphs, you get Chinese characters. This is why the decoding is wrong.
Adobe, pdfminer and pdf.js extract the text successfully, but I do not understand how they can guess that the decoding should be done on one byte only.
Help is welcome! 😣😫
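To illustrate the mismatch described above, here is a minimal sketch with made-up sample bytes (not taken from the PDF in question). The same byte string read one byte at a time gives Latin text, while reading it as utf-16-be pairs the bytes into code points that fall in the CJK Unified Ideographs block. The helper name `in_cjk_block` is hypothetical.

```python
# Hypothetical sample bytes; the real content-stream bytes of the PDF
# are not reproduced here.
raw = b"Plan"

# One-byte decoding: every byte is its own character code.
one_byte = raw.decode("latin-1")

# Two-byte decoding, as a 2-byte codespacerange would suggest.
two_byte = raw.decode("utf-16-be")


def in_cjk_block(ch: str) -> bool:
    """Return True if ch lies in CJK Unified Ideographs (U+4E00..U+9FFF)."""
    return 0x4E00 <= ord(ch) <= 0x9FFF


print(one_byte)
print([hex(ord(c)) for c in two_byte])
print(all(in_cjk_block(c) for c in two_byte))
```

Two ASCII letters packed into one 16-bit code unit frequently land in the CJK range, which would match the symptom in the issue title.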
@MasterOdin,
@MasterOdin any chance you could have a look?
You all are awesome. If you only use this document, the following code seems to work fine: Lines 57 to 59 in 3b89062
# 32000-1:2008 p267
# When the font has no Encoding entry, or the font descriptor's Symbolic flag is set (in which case the
# Encoding entry is ignored), this shall occur.
#
# - If the font contains a (3, 0) subtable, the range of character codes shall be one of these: 0x0000 - 0x00FF,
#   0xF000 - 0xF0FF, 0xF100 - 0xF1FF, 0xF200 - 0xF2FF. Depending on the range of codes, each byte from the
#   string shall be prepended with the high byte of the range, to form a two-byte character, which shall be
#   used to select the associated glyph description from the subtable.
# - Otherwise, if the font contains a (1, 0) subtable, single bytes from the string shall be used to look up
#   the associated glyph descriptions from the subtable.
#
# If a character cannot be mapped in any of the ways described previously, a conforming reader may supply a
# mapping of its choosing.
is_symbolic: bool = False
if "/FontDescriptor" in ft and "/Flags" in cast(
    DictionaryObject, ft["/FontDescriptor"]
):
    font_flags = "{0:b}".format(cast(int, ft["/FontDescriptor"]["/Flags"]))
    if font_flags[-3] == "1":
        is_symbolic = True
if is_symbolic:
    encoding = "charmap"

It's actually more complicated than that, since there is an implementation of "the font contains a (n, 0)".
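A side note on the flag test in the snippet above: the Symbolic flag is bit 3 of /Flags (value 4), so it can also be checked with a bitmask. That avoids an IndexError when the binary string has fewer than three digits. The sketch below is standalone and does not touch pypdf. `is_symbolic_flag` and `prepend_high_byte` are hypothetical helper names, and the second one only illustrates the (3, 0) rule quoted in the comment, not how any library implements it.

```python
from typing import List

SYMBOLIC_BIT = 1 << 2  # bit 3 of /Flags, i.e. the value 4


def is_symbolic_flag(flags: int) -> bool:
    """Return True if the Symbolic bit is set in a font descriptor's /Flags."""
    return bool(flags & SYMBOLIC_BIT)


def prepend_high_byte(data: bytes, high_byte: int) -> List[int]:
    """Form two-byte codes by prepending high_byte to each byte of data."""
    return [(high_byte << 8) | b for b in data]


print(is_symbolic_flag(4))   # Symbolic only
print(is_symbolic_flag(32))  # Nonsymbolic only
print(is_symbolic_flag(1))   # too short for the string-index check
print([hex(c) for c in prepend_high_byte(b"AB", 0xF0)])

# Python's built-in "charmap" codec maps each byte to the code point of the
# same value when no mapping table is given.
print(b"AB".decode("charmap"))
```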
I'm trying to extract text (see https://stackoverflow.com/q/75587416/562769 )
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
Linux-5.4.0-139-generic-x86_64-with-glibc2.31
$ python -c "import pypdf;print(pypdf.__version__)"
3.5.0
Code + PDF
This is a minimal, complete example that shows the issue:
The PDF: https://efast2-filings-public.s3.amazonaws.com/prd/2013/09/13/20130913143132P030383431491001.pdf
The extracted output
The expected output
Other interesting stuff
pdftotext gives:
But the 3Heights PDF validator says it's ok:
PyMuPDF (fitz) manages to get the right text (although the whitespace and text positions are not correct). I tried to clean the PDF with
mutool clean -daf 20130913143132P030383431491001.pdf in.pdf
and then fed it into pypdf. Still the same issue. Also, using
qpdf --linearize 20130913143132P030383431491001.pdf in.pdf
leads to the same result in pypdf.