Farsi (or other non latin characters) not in markdown from image #774

chregu · 2025-01-19T07:21:41Z

Question

When converting an image that contains e.g. Farsi text, the text shows up in the DoclingDocument but not in the generated markdown (since its parent is a picture, I guess). The Latin characters do appear in the markdown.

Is that the intended behavior?

I tried with EasyOCR and TesseractCli, also supplying the correct lang parameters (hopefully). The same happens when I supply an image with just Farsi and no Latin characters.

I can of course write my own DoclingDocument parser for that; I'm just wondering why the text doesn't show up in the default markdown converter, and why it's treated as part of a picture while the Latin characters are not.

This is the image

And this is the produced document.

 {
    "schema_name": "DoclingDocument",
    "version": "1.0.0",
    "name": "docling_4fc1bc1e-5b05-491e-a7a7-4dece794501e_f738b51a5a2c415281ab32b2b42397db-1762493015",
    "origin": {
        "mimetype": "application/pdf",
        "binary_hash": 12687658831323329000,
        "filename": "docling_4fc1bc1e-5b05-491e-a7a7-4dece794501e_f738b51a5a2c415281ab32b2b42397db-1762493015.jpg"
    },
    "furniture": {
        "self_ref": "#/furniture",
        "children": [],
        "name": "_root_",
        "label": "unspecified"
    },
    "body": {
        "self_ref": "#/body",
        "children": [
            {
                "$ref": "#/pictures/0"
            },
            {
                "$ref": "#/texts/1"
            },
            {
                "$ref": "#/texts/2"
            },
            {
                "$ref": "#/texts/3"
            }
        ],
        "name": "_root_",
        "label": "unspecified"
    },
    "groups": [],
    "texts": [
        {
            "self_ref": "#/texts/0",
            "parent": {
                "$ref": "#/pictures/0"
            },
            "children": [],
            "label": "section_header",
            "prov": [
                {
                    "page_no": 1,
                    "bbox": {
                        "l": 15.954619957683569,
                        "t": 834,
                        "r": 1073.544136848707,
                        "b": 584.6887967043365,
                        "coord_origin": "BOTTOMLEFT"
                    },
                    "charspan": [
                        0,
                        39
                    ]
                }
            ],
            "orig": "تو دیگرم شدی شدی شدم من من من تو ازي من",
            "text": "تو دیگرم شدی شدی شدم من من من تو ازي من",
            "level": 1
        },
        {
            "self_ref": "#/texts/1",
            "parent": {
                "$ref": "#/body"
            },
            "children": [],
            "label": "text",
            "prov": [
                {
                    "page_no": 1,
                    "bbox": {
                        "l": 63.33333206176758,
                        "t": 531.3333129882812,
                        "r": 1023.6666870117188,
                        "b": 447,
                        "coord_origin": "BOTTOMLEFT"
                    },
                    "charspan": [
                        0,
                        106
                    ]
                }
            ],
            "orig": "Mun tu shudam tu mun shudimun tun shudam tu jaan shudi Taakas na guyad baad azeen, ٨٧٨ deegaram tu deegari",
            "text": "Mun tu shudam tu mun shudimun tun shudam tu jaan shudi Taakas na guyad baad azeen, ٨٧٨ deegaram tu deegari"
        },
        {
            "self_ref": "#/texts/2",
            "parent": {
                "$ref": "#/body"
            },
            "children": [],
            "label": "text",
            "prov": [
                {
                    "page_no": 1,
                    "bbox": {
                        "l": 147,
                        "t": 411,
                        "r": 928,
                        "b": 250,
                        "coord_origin": "BOTTOMLEFT"
                    },
                    "charspan": [
                        0,
                        131
                    ]
                }
            ],
            "orig": "I have become you, and you I am the you soul; So that no one can say hereafter, That you are are someone, and me someone else body,",
            "text": "I have become you, and you I am the you soul; So that no one can say hereafter, That you are are someone, and me someone else body,"
        },
        {
            "self_ref": "#/texts/3",
            "parent": {
                "$ref": "#/body"
            },
            "children": [],
            "label": "section_header",
            "prov": [
                {
                    "page_no": 1,
                    "bbox": {
                        "l": 412,
                        "t": 205.6666717529297,
                        "r": 664,
                        "b": 162.6666717529297,
                        "coord_origin": "BOTTOMLEFT"
                    },
                    "charspan": [
                        0,
                        12
                    ]
                }
            ],
            "orig": "Ameer Khusro",
            "text": "Ameer Khusro",
            "level": 1
        }
    ],
    "pictures": [
        {
            "self_ref": "#/pictures/0",
            "parent": {
                "$ref": "#/body"
            },
            "children": [
                {
                    "$ref": "#/texts/0"
                }
            ],
            "label": "picture",
            "prov": [
                {
                    "page_no": 1,
                    "bbox": {
                        "l": 23.88103675842285,
                        "t": 831.9660034179688,
                        "r": 1062.5076904296875,
                        "b": 582.7559814453125,
                        "coord_origin": "BOTTOMLEFT"
                    },
                    "charspan": [
                        0,
                        0
                    ]
                }
            ],
            "captions": [],
            "references": [],
            "footnotes": [],
            "annotations": []
        }
    ],
    "tables": [],
    "key_value_items": [],
    "pages": {
        "1": {
            "size": {
                "width": 1080,
                "height": 1080
            },
            "page_no": 1
        }
    }
}


dolfim-ibm · 2025-01-20T07:51:03Z

Here are some findings:

The Farsi text is indeed detected as the content of a picture, hence it is not exported (by default) in the markdown.
The rest of the text, by contrast, is not detected as part of a picture.

We have an example (soon to be merged) that shows how to work with the text in pictures: https://github.com/DS4SD/docling/pull/624/files.
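As a rough illustration of the idea (not the example from the linked PR), here is a minimal sketch that works directly on the exported JSON dict rather than on the docling API. It walks the `children` of `#/body`, resolves each `$ref`, and also descends into pictures, so text whose parent is a picture is included. The helper names `resolve_ref`, `iter_texts` and `to_markdown` are hypothetical, the `SAMPLE` dict is a trimmed stand-in for the document above, and rendering `section_header` items as level-2 headings is an arbitrary choice.

```python
# Trimmed stand-in for the exported document: one header nested under a
# picture, one paragraph directly under the body.
SAMPLE = {
    "body": {"children": [{"$ref": "#/pictures/0"}, {"$ref": "#/texts/1"}]},
    "texts": [
        {"self_ref": "#/texts/0", "children": [], "label": "section_header",
         "text": "تو دیگرم شدی", "level": 1},
        {"self_ref": "#/texts/1", "children": [], "label": "text",
         "text": "Ameer Khusro"},
    ],
    "pictures": [
        {"self_ref": "#/pictures/0", "children": [{"$ref": "#/texts/0"}],
         "label": "picture"},
    ],
}


def resolve_ref(doc, ref):
    """Resolve an internal reference such as '#/texts/0' against the dict."""
    if not ref.startswith("#/"):
        raise ValueError(f"unsupported reference: {ref!r}")
    node = doc
    for part in ref[2:].split("/"):
        # List containers (texts, pictures, ...) are indexed by position.
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


def iter_texts(doc, node=None):
    """Yield text items in document order, descending into pictures too."""
    node = doc["body"] if node is None else node
    for child in node.get("children", []):
        item = resolve_ref(doc, child["$ref"])
        if "text" in item:
            yield item
        # Pictures carry no text themselves, but their children may.
        yield from iter_texts(doc, item)


def to_markdown(doc):
    """Join all text items, rendering section headers as '##' headings."""
    blocks = []
    for item in iter_texts(doc):
        if item.get("label") == "section_header":
            blocks.append("## " + item["text"])
        else:
            blocks.append(item["text"])
    return "\n\n".join(blocks)


if __name__ == "__main__":
    print(to_markdown(SAMPLE))
```

To try it on a real document, load the JSON shown above with `json.load` and pass the resulting dict to `to_markdown`. Whether OCR text found inside a picture belongs in the output at all is a judgment call, which is presumably why the default export leaves it out.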

chregu added the question label Jan 19, 2025

PeterStaar-IBM assigned cau-git Jan 20, 2025

dolfim-ibm self-assigned this Jan 20, 2025
