extract_text leads to Chinese characters instead of ASCII #1672

MartinThoma · 2023-02-28T07:20:04Z

I'm trying to extract text (see https://stackoverflow.com/q/75587416/562769 )

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.4.0-139-generic-x86_64-with-glibc2.31

$ python -c "import pypdf;print(pypdf.__version__)"
3.5.0

Code + PDF

This is a minimal, complete example that shows the issue:

from io import BytesIO

from pypdf import PdfReader


def get_pdf_from_url(url: str, name: str):
    """Download the file"""
    import ssl
    import urllib.request
    from pathlib import Path
    from urllib.error import HTTPError

    cache_path = Path(name)
    ssl._create_default_https_context = ssl._create_unverified_context
    cpt = 3
    while cpt > 0:
        try:
            with urllib.request.urlopen(url) as response, cache_path.open(
                "wb",
            ) as out_file:
                out_file.write(response.read())
            cpt = 0  # success: stop retrying
        except HTTPError:
            cpt -= 1
            if cpt == 0:
                raise  # all attempts failed: surface the last error
    with open(cache_path, "rb") as fp:
        data = fp.read()
    return data


url = "https://efast2-filings-public.s3.amazonaws.com/prd/2013/09/13/20130913143132P030383431491001.pdf"
reader = PdfReader(BytesIO(get_pdf_from_url(url, "20130913143132P030383431491001.pdf")))
page_41 = reader.pages[40].extract_text()
print(page_41)

The PDF: https://efast2-filings-public.s3.amazonaws.com/prd/2013/09/13/20130913143132P030383431491001.pdf

The extracted output

Schedule H, line 4i
Schedule of A ssets (Held A t End of Year)
For the plan year beginning and ending
Name of plan
Employer Identification Number Three-digit
plan number
(a) (b) Identity of issue, borrower, lessor, or similar party(c) Description of investment including maturity date,
rate of interest, collateral, par, or maturity value(d) Cost (e) Current value〱⼰ㄯ㈰ㄲ ㄲ⼳ㄯ㈰ㄲ
1-800LOANMART 401(k) Plan
㤵ⴴ㠶㌳㠹 001
䩯桮⁈慮捯捫⁕十 䱩晥獴祬攠䅧杲敳獩癥 102,734
䩯桮⁈慮捯捫⁕十 䱩晥獴祬攠䝲潷瑨 159,791
䩯桮⁈慮捯捫⁕十 䱩晥獴祬攠䉡污湣敤 285,623
䩯桮⁈慮捯捫⁕十 䱩晥獴祬攠䵯摥牡瑥 9,130
䩯桮⁈慮捯捫⁕十 䱩晥獴祬攠䍯湳敲癡瑩癥 ㌰ⰶ㔰
䩯桮⁈慮捯捫⁕十 Real Est. Securities Fund ㄵⰷ㔹
䩯桮⁈慮捯捫⁕十 䑆䄠䕭敲杩湧⁍慲步瑳⁖慬略 3,957
䩯桮⁈慮捯捫⁕十 佰灥湨敩浥爠䑥癥汯灩湧⁍歴 ㈷ⰶ㤳
䩯桮⁈慮捯捫⁕十 䵩搠䍡瀠却潣欠䙵湤 156
䩯桮⁈慮捯捫⁕十 DFA U.S. Small Cap Fund ㄳⰵ㠲
䩯桮⁈慮捯捫⁕十 卭慬氠䍡瀠䝲潷瑨⁉湤數 7,444
䩯桮⁈慮捯捫⁕十 䥮瑬⁅煵楴礠䥮摥砠䙵湤 3,980
䩯桮⁈慮捯捫⁕十 EuroPacific Growth Fund 6,866
䩯桮⁈慮捯捫⁕十 International Growth Fund 138
䩯桮⁈慮捯捫⁕十 SSgA Mid Value Index Fund 1,739
䩯桮⁈慮捯捫⁕十 Small Cap Value Index 1,617
䩯桮⁈慮捯捫⁕十 噡汵攠䙵湤 5,106
䩯桮⁈慮捯捫⁕十 T. Rowe Price Sml Cap Val 5,510
䩯桮⁈慮捯捫⁕十 Fidelity ContraFund ㄴⰴ㔶
䩯桮⁈慮捯捫⁕十 噡汵攠䥮摥砠䙵湤 ㄴⰸ〸
䩯桮⁈慮捯捫⁕十 㔰〠䥮摥砠䙵湤 5,983
䩯桮⁈慮捯捫⁕十 䍡灩瑡氠䥮捯浥⁂畩汤敲 ㄰ⰵ㐰
䩯桮⁈慮捯捫⁕十 䅭敲楣慮⁂慬慮捥搠䙵湤 9,078
䩯桮⁈慮捯捫⁕十 PIMCO Global Bond ㄲⰸ㈶
䩯桮⁈慮捯捫⁕十 偉䵃传呯瑡氠剥瑵牮 ㄲⰸ㈶
䩯桮⁈慮捯捫⁕十 Money Market Fund 1,041
䩯桮⁈慮捯捫⁕十 卨潲琠呥牭⁆敤敲慬 0

The expected output

Schedule H, line 4i
Schedule of Assets (Held At End of Year)
For the plan year beginning   01/01/2012
and ending 12/31/2012
Name of plan        1-800LOANMART 401(k) Plan
Employer Identification Number   95-4863389
Three-digit
plan number    001
(a)       (b) Identity of issue, borrower, lessor, or similar party (c) Description of investment including maturity date,  rate of interest, collateral, par, or maturity value   (d) Cost     (e) Current value

John Hancock USA              Lifestyle Aggressive         102,734
John Hancock USA              Lifestyle Growth               159,791
John Hancock USA              Lifestyle Balanced            285,623
John Hancock USA              Lifestyle Moderate           9,130
John Hancock USA              Lifestyle Conservative     30,650
John Hancock USA              Real Est. Securities Fund  15,759
John Hancock USA              DFA Emerging Markets Value      3,957
John Hancock USA              Oppenheimer Developing Mkt        27,693
John Hancock USA              Mid Cap Stock Fund                      156
John Hancock USA             DFA U.S. Small Cap Fund          13,582
John Hancock USA              Small Cap Growth Index           7,444
John Hancock USA              Intl Equity Index Fund              3,980
John Hancock USA             EuroPacific Growth Fund            6,866
John Hancock USA             International Growth Fund         138
John Hancock USA             SSgA Mid Value Index Fund          1,739
John Hancock USA             Small Cap Value Index                     1,617
John Hancock USA            Value Fund                                   5,106
John Hancock USA            T. Rowe Price Sml Cap Val           5,510
John Hancock USA            Fidelity ContraFund                  14,456
John Hancock USA            Value Index Fund                   14,808
John Hancock USA           500 Index Fund                         5,983
John Hancock USA           Capital Income Builder            10,540
John Hancock USA           American Balanced Fund          9,078
John Hancock USA           PIMCO Global Bond                12,826
John Hancock USA          PIMCO Total Return                12,826
John Hancock USA          Money Market Fund             1,041
John Hancock USA          Short Term Federal               0
...

Other interesting stuff

pdftotext gives:

Internal Error: xref num 403 not found but needed, try to reconstruct<0a>

But the 3Heights PDF validator says it's ok:

The document does conform to the PDF 1.4 standard.

PyMuPDF (fitz) manages to get the right text (although the whitespace and text positions are not correct). I tried to clean the file with mutool clean -daf 20130913143132P030383431491001.pdf in.pdf and then fed the result into pypdf, but the issue remained.

Also using qpdf --linearize 20130913143132P030383431491001.pdf in.pdf leads to the same result in pypdf.


MartinThoma · 2023-02-28T09:06:03Z

https://superuser.com/q/278562/64857 might be worth a try as well to fix the PDF

pubpub-zz · 2023-03-06T20:45:47Z

I've analyzed the PDF, and I'm puzzled:

The content stream contains the text in fully readable form: it is encoded as 1-byte text.
The font referenced for this text is /F11.
The font dictionary is the following:

{'/Name': '/F11', '/Subtype': '/TrueType', '/FirstChar': 32, '/Type': '/Font', '/BaseFont': '/IMZSPX+CourierNew,Bold', '/FontDescriptor': IndirectObject(459, 0, 1920817586256), '/ToUnicode': IndirectObject(462, 0, 1920817586256), '/LastChar': 255, '/Widths': IndirectObject(463, 0, 1920817586256)}

and the content of ToUnicode is:

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo 3 dict dup begin
/Registry (Adobe) def
/Ordering (UCS) def
/Supplement 0 def
end def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
/WMode 0 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
3 beginbfchar
<0000> <0000>
<0001> <0000>
<0002> <0000>
endbfchar
endcmap
CMapName currentdict /CMap
defineresource pop
end end

The codespacerange indicates a 2-byte encoding, as described in:
https://adobe-type-tools.github.io/font-tech-notes/pdfs/5014.CIDFont_Spec.pdf (pages 49-50)
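
To illustrate how the code width follows from the codespacerange, here is a minimal sketch that reads the hex-string length of each range in a ToUnicode CMap. The helper name `codespace_byte_widths` is hypothetical and is not part of pypdf; the regular expressions assume the plain layout shown in the CMap above.

```python
import re


def codespace_byte_widths(cmap_text: str) -> set:
    """Return the byte widths declared in begincodespacerange blocks."""
    widths = set()
    blocks = re.findall(
        r"begincodespacerange(.*?)endcodespacerange", cmap_text, re.S
    )
    for block in blocks:
        # Each range is written as <low> <high>; two hex digits make one byte.
        for low, _high in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block):
            widths.add(len(low) // 2)
    return widths


cmap = """1 begincodespacerange
<0000> <FFFF>
endcodespacerange"""
print(codespace_byte_widths(cmap))  # {2}
```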

When you decode the byte sequence as UTF-16-BE, as expected for 2-byte encoded glyphs, you get Chinese characters. This is why the decoding goes wrong.
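
The effect can be reproduced without the PDF. The sketch below takes one of the expected strings from the report, treats its single-byte codes as UTF-16-BE code units, and then reverses the mistake. It only demonstrates the mechanism and is not how pypdf decodes text internally.

```python
# Single-byte codes as they would appear in the content stream.
raw = b"John Hancock USA"

# Reading pairs of bytes as UTF-16-BE code units yields CJK-looking text.
garbled = raw.decode("utf-16-be")
print(garbled)  # 䩯桮⁈慮捯捫⁕十

# For even-length runs the mistake is reversible:
# re-encode as UTF-16-BE, then decode one byte per character.
recovered = garbled.encode("utf-16-be").decode("latin-1")
print(recovered)  # John Hancock USA
```

The same arithmetic explains the numbers in the extracted output: the bytes of "30,650" pair up as 0x3330, 0x2C36, 0x3530, which print as ㌰ⰶ㔰.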

Adobe, pdfminer, and pdf.js extract the text successfully, but I do not understand how they determine that the decoding should use one byte per character.
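
One possible explanation, which nobody in this thread has confirmed: ISO 32000-1 describes simple fonts (such as /TrueType and /Type1) as using single-byte character codes, while multi-byte codes belong to composite (/Type0) fonts. A reader could therefore ignore a 2-byte codespacerange when the font is simple. The helper `bytes_per_code` below is a hypothetical sketch of that heuristic and does not describe what pypdf or the other tools actually do.

```python
# Font subtypes that the PDF specification treats as simple fonts.
SIMPLE_FONT_SUBTYPES = {"/Type1", "/MMType1", "/TrueType", "/Type3"}


def bytes_per_code(font: dict, cmap_width: int) -> int:
    """Hypothetical helper: choose the character-code width for a font."""
    if font.get("/Subtype") in SIMPLE_FONT_SUBTYPES:
        # Simple fonts use one byte per code, whatever the ToUnicode CMap says.
        return 1
    return cmap_width


# Reduced copy of the /F11 dictionary quoted above.
font_f11 = {"/Subtype": "/TrueType", "/BaseFont": "/IMZSPX+CourierNew,Bold"}
print(bytes_per_code(font_f11, cmap_width=2))  # 1
```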

Help is welcome! 😣😫

pubpub-zz · 2023-03-15T21:53:03Z

@MasterOdin, any ideas?

pubpub-zz · 2023-07-21T07:21:32Z

@MasterOdin, any chance you could have a look?

pubpub-zz · 2023-07-27T09:04:20Z

Note: to be analysed against the PDF 1.7 specification, page 432.

ssjkamei · 2024-09-27T04:16:25Z

You all are awesome.
I don't understand "the font contains a (3, 0)". Where can I find the answer?

If you only use this document, the following code seems to work fine:

pypdf/pypdf/_cmap.py

Lines 57 to 59 in 3b89062

    
    map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)

    # encoding can be either a string for decode

    # 32000-1:2008 p267
    # When the font has no Encoding entry, or the font descriptor's Symbolic flag is set (in which case the
    # Encoding entry is ignored), this shall occur.
    #
    # - If the font contains a (3, 0) subtable, the range of character codes shall be one of these: 0x0000 - 0x00FF,
    # 0xF000 - 0xF0FF, 0xF100 - 0xF1FF, 0xF200 - 0xF2FF. Depending on the range of codes, each byte from the
    # string shall be prepended with the high byte of the range, to form a two-byte character, which shall be
    # used to select the associated glyph description from the subtable.
    # - Otherwise, if the font contains a (1,0) subtable, single bytes from the string shall be used to look up
    # the associated glyph descriptions from the subtable.
    #
    # If a character cannot be mapped in any of the ways described previously, a conforming reader may supply a
    # mapping of its choosing.
    is_symbolic: bool = False
    if "/FontDescriptor" in ft and "/Flags" in cast(
        DictionaryObject, ft["/FontDescriptor"]
    ):
        font_flags = cast(int, ft["/FontDescriptor"]["/Flags"])
        # Bit 3 (value 4) of /Flags marks a symbolic font; a bit mask also works for values below 4.
        if font_flags & 4:
            is_symbolic = True

    if is_symbolic:
        encoding = "charmap"

It's actually more complicated than that, since there is an implementation of "the font contains a (n, 0)".
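
As a rough illustration of the (3, 0) rule quoted in the code comment above, the sketch below prepends each permitted high byte to a single-byte code and checks whether the result exists in the subtable. The helper name `symbol_lookup_code` and the `subtable_codes` set are hypothetical; this is not the pypdf implementation.

```python
from typing import Optional, Set

# High bytes of the four code ranges the quoted text allows for a (3, 0) subtable.
RANGE_HIGH_BYTES = (0x00, 0xF0, 0xF1, 0xF2)


def symbol_lookup_code(byte: int, subtable_codes: Set[int]) -> Optional[int]:
    """Map a single-byte code to the two-byte code present in the subtable."""
    for high in RANGE_HIGH_BYTES:
        code = (high << 8) | byte
        if code in subtable_codes:
            return code
    # No mapping found: the reader may fall back to a mapping of its choosing.
    return None


# Assumed example subtable whose codes live in the 0xF000 - 0xF0FF range.
subtable = {0xF041, 0xF042}
print(hex(symbol_lookup_code(0x41, subtable)))  # 0xf041
```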

MartinThoma added the workflow-text-extraction and is-robustness-issue labels on Feb 28, 2023

pubpub-zz added the help wanted label and removed the is-robustness-issue label on Mar 6, 2023
