Farsi (or other non-Latin characters) not in Markdown from image #774

Open
chregu opened this issue Jan 19, 2025 · 1 comment

chregu commented Jan 19, 2025

Question

When converting an image that contains, e.g., Farsi text, the text shows up in the DoclingDocument but not in the generated Markdown (I guess because its parent is a picture). The Latin characters do appear in the Markdown.

Is that the intended behavior?

I tried both EasyOCR and TesseractCli, supplying what I hope are the correct lang parameters. The same thing happens when I supply an image with only Farsi and no Latin characters.

I can of course write my own DoclingDocument parser for that. I'm just wondering why the text doesn't show up in the default Markdown export, and why it is treated as part of a picture while the Latin characters are not.

This is the image
Image

And this is the produced document.

 {
    "schema_name": "DoclingDocument",
    "version": "1.0.0",
    "name": "docling_4fc1bc1e-5b05-491e-a7a7-4dece794501e_f738b51a5a2c415281ab32b2b42397db-1762493015",
    "origin": {
        "mimetype": "application/pdf",
        "binary_hash": 12687658831323329000,
        "filename": "docling_4fc1bc1e-5b05-491e-a7a7-4dece794501e_f738b51a5a2c415281ab32b2b42397db-1762493015.jpg"
    },
    "furniture": {
        "self_ref": "#/furniture",
        "children": [],
        "name": "_root_",
        "label": "unspecified"
    },
    "body": {
        "self_ref": "#/body",
        "children": [
            {
                "$ref": "#/pictures/0"
            },
            {
                "$ref": "#/texts/1"
            },
            {
                "$ref": "#/texts/2"
            },
            {
                "$ref": "#/texts/3"
            }
        ],
        "name": "_root_",
        "label": "unspecified"
    },
    "groups": [],
    "texts": [
        {
            "self_ref": "#/texts/0",
            "parent": {
                "$ref": "#/pictures/0"
            },
            "children": [],
            "label": "section_header",
            "prov": [
                {
                    "page_no": 1,
                    "bbox": {
                        "l": 15.954619957683569,
                        "t": 834,
                        "r": 1073.544136848707,
                        "b": 584.6887967043365,
                        "coord_origin": "BOTTOMLEFT"
                    },
                    "charspan": [
                        0,
                        39
                    ]
                }
            ],
            "orig": "تو دیگرم شدی شدی شدم من من من تو ازي من",
            "text": "تو دیگرم شدی شدی شدم من من من تو ازي من",
            "level": 1
        },
        {
            "self_ref": "#/texts/1",
            "parent": {
                "$ref": "#/body"
            },
            "children": [],
            "label": "text",
            "prov": [
                {
                    "page_no": 1,
                    "bbox": {
                        "l": 63.33333206176758,
                        "t": 531.3333129882812,
                        "r": 1023.6666870117188,
                        "b": 447,
                        "coord_origin": "BOTTOMLEFT"
                    },
                    "charspan": [
                        0,
                        106
                    ]
                }
            ],
            "orig": "Mun tu shudam tu mun shudimun tun shudam tu jaan shudi Taakas na guyad baad azeen, ٨٧٨ deegaram tu deegari",
            "text": "Mun tu shudam tu mun shudimun tun shudam tu jaan shudi Taakas na guyad baad azeen, ٨٧٨ deegaram tu deegari"
        },
        {
            "self_ref": "#/texts/2",
            "parent": {
                "$ref": "#/body"
            },
            "children": [],
            "label": "text",
            "prov": [
                {
                    "page_no": 1,
                    "bbox": {
                        "l": 147,
                        "t": 411,
                        "r": 928,
                        "b": 250,
                        "coord_origin": "BOTTOMLEFT"
                    },
                    "charspan": [
                        0,
                        131
                    ]
                }
            ],
            "orig": "I have become you, and you I am the you soul; So that no one can say hereafter, That you are are someone, and me someone else body,",
            "text": "I have become you, and you I am the you soul; So that no one can say hereafter, That you are are someone, and me someone else body,"
        },
        {
            "self_ref": "#/texts/3",
            "parent": {
                "$ref": "#/body"
            },
            "children": [],
            "label": "section_header",
            "prov": [
                {
                    "page_no": 1,
                    "bbox": {
                        "l": 412,
                        "t": 205.6666717529297,
                        "r": 664,
                        "b": 162.6666717529297,
                        "coord_origin": "BOTTOMLEFT"
                    },
                    "charspan": [
                        0,
                        12
                    ]
                }
            ],
            "orig": "Ameer Khusro",
            "text": "Ameer Khusro",
            "level": 1
        }
    ],
    "pictures": [
        {
            "self_ref": "#/pictures/0",
            "parent": {
                "$ref": "#/body"
            },
            "children": [
                {
                    "$ref": "#/texts/0"
                }
            ],
            "label": "picture",
            "prov": [
                {
                    "page_no": 1,
                    "bbox": {
                        "l": 23.88103675842285,
                        "t": 831.9660034179688,
                        "r": 1062.5076904296875,
                        "b": 582.7559814453125,
                        "coord_origin": "BOTTOMLEFT"
                    },
                    "charspan": [
                        0,
                        0
                    ]
                }
            ],
            "captions": [],
            "references": [],
            "footnotes": [],
            "annotations": []
        }
    ],
    "tables": [],
    "key_value_items": [],
    "pages": {
        "1": {
            "size": {
                "width": 1080,
                "height": 1080
            },
            "page_no": 1
        }
    }
}
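For anyone who wants the "own parser" route mentioned above, here is a minimal sketch that works on the exported JSON alone, with no Docling import. It assumes the layout shown in the dump: each entry in `texts` carries a `parent` with a `$ref`, and text nested in a picture points at `#/pictures/<n>`. The function name `texts_inside_pictures` and the sample data are hypothetical, and the sketch only catches direct children of a picture.

```python
from typing import Any


def texts_inside_pictures(doc: dict[str, Any]) -> list[str]:
    """Return the text of every item whose direct parent is a picture.

    Assumes the DoclingDocument JSON layout shown in this issue:
    texts[i]["parent"]["$ref"] looks like "#/pictures/0" or "#/body".
    """
    found = []
    for item in doc.get("texts", []):
        parent_ref = item.get("parent", {}).get("$ref", "")
        if parent_ref.startswith("#/pictures/"):
            found.append(item["text"])
    return found


# Trimmed, hypothetical stand-in for the document above.
sample = {
    "texts": [
        {"self_ref": "#/texts/0", "parent": {"$ref": "#/pictures/0"},
         "text": "text inside the picture"},
        {"self_ref": "#/texts/1", "parent": {"$ref": "#/body"},
         "text": "Ameer Khusro"},
    ],
    "pictures": [
        {"self_ref": "#/pictures/0", "parent": {"$ref": "#/body"},
         "children": [{"$ref": "#/texts/0"}]},
    ],
}

print(texts_inside_pictures(sample))
```

With the real export, the dict would come from `json.load` on the saved file, and the returned strings could be appended to the Markdown output.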
chregu added the "question" (Further information is requested) label Jan 19, 2025
dolfim-ibm (Contributor) commented

Here are some findings:

  1. The Farsi text is indeed detected as "content of a picture", hence it is not exported (by default) in the Markdown.
  2. The rest of the text, by contrast, is not detected as part of a picture.

We have an example (soon to be merged) which shows how to work with the text in pictures: https://github.com/DS4SD/docling/pull/624/files.
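Until that example lands, a rough sketch of the idea might look like the following. It is not the code from the linked PR. It assumes that `DoclingDocument.iterate_items` accepts `root` and `traverse_pictures` arguments and yields `(item, level)` pairs, which is how I understand recent docling-core versions to behave. The helper names `iter_picture_texts` and `convert_and_print` are hypothetical, as is the file path in the usage note.

```python
from typing import Any, Iterator


def iter_picture_texts(doc: Any) -> Iterator[str]:
    """Yield the text of every item nested under a picture.

    Assumption: `doc` behaves like a DoclingDocument, i.e. it has a
    `pictures` list and an `iterate_items(root=..., traverse_pictures=...)`
    method that yields (item, level) pairs.
    """
    for picture in doc.pictures:
        for item, _level in doc.iterate_items(root=picture, traverse_pictures=True):
            # Items without a `text` attribute (e.g. the picture itself)
            # are skipped.
            text = getattr(item, "text", None)
            if text:
                yield text


def convert_and_print(path: str) -> None:
    """Convert a file, then print the Markdown followed by picture text."""
    # Imported lazily so the helper above can be used without docling installed.
    from docling.document_converter import DocumentConverter

    doc = DocumentConverter().convert(path).document
    print(doc.export_to_markdown())
    for text in iter_picture_texts(doc):
        print(text)
```

Calling `convert_and_print("poem.jpg")` would then print the default Markdown and, after it, any text that was classified as picture content.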
