When converting an image that contains, e.g., Farsi text, the text shows up in the DoclingDocument but not in the generated Markdown (I guess because its parent is a picture). The Latin characters do appear in the Markdown.
Is that the intended behavior?
I tried both EasyOCR and TesseractCli, supplying (hopefully) the correct lang parameters. The same happens when I supply an image with only Farsi and no Latin characters.
I can of course write my own DoclingDocument parser for that; I'm just wondering why the text doesn't show up in the default Markdown converter, and why it is treated as part of a picture while the Latin characters are not.
This is the image
And this is the produced document.
{
"schema_name": "DoclingDocument",
"version": "1.0.0",
"name": "docling_4fc1bc1e-5b05-491e-a7a7-4dece794501e_f738b51a5a2c415281ab32b2b42397db-1762493015",
"origin": {
"mimetype": "application/pdf",
"binary_hash": 12687658831323329000,
"filename": "docling_4fc1bc1e-5b05-491e-a7a7-4dece794501e_f738b51a5a2c415281ab32b2b42397db-1762493015.jpg"
},
"furniture": {
"self_ref": "#/furniture",
"children": [],
"name": "_root_",
"label": "unspecified"
},
"body": {
"self_ref": "#/body",
"children": [
{
"$ref": "#/pictures/0"
},
{
"$ref": "#/texts/1"
},
{
"$ref": "#/texts/2"
},
{
"$ref": "#/texts/3"
}
],
"name": "_root_",
"label": "unspecified"
},
"groups": [],
"texts": [
{
"self_ref": "#/texts/0",
"parent": {
"$ref": "#/pictures/0"
},
"children": [],
"label": "section_header",
"prov": [
{
"page_no": 1,
"bbox": {
"l": 15.954619957683569,
"t": 834,
"r": 1073.544136848707,
"b": 584.6887967043365,
"coord_origin": "BOTTOMLEFT"
},
"charspan": [
0,
39
]
}
],
"orig": "تو دیگرم شدی شدی شدم من من من تو ازي من",
"text": "تو دیگرم شدی شدی شدم من من من تو ازي من",
"level": 1
},
{
"self_ref": "#/texts/1",
"parent": {
"$ref": "#/body"
},
"children": [],
"label": "text",
"prov": [
{
"page_no": 1,
"bbox": {
"l": 63.33333206176758,
"t": 531.3333129882812,
"r": 1023.6666870117188,
"b": 447,
"coord_origin": "BOTTOMLEFT"
},
"charspan": [
0,
106
]
}
],
"orig": "Mun tu shudam tu mun shudimun tun shudam tu jaan shudi Taakas na guyad baad azeen, ٨٧٨ deegaram tu deegari",
"text": "Mun tu shudam tu mun shudimun tun shudam tu jaan shudi Taakas na guyad baad azeen, ٨٧٨ deegaram tu deegari"
},
{
"self_ref": "#/texts/2",
"parent": {
"$ref": "#/body"
},
"children": [],
"label": "text",
"prov": [
{
"page_no": 1,
"bbox": {
"l": 147,
"t": 411,
"r": 928,
"b": 250,
"coord_origin": "BOTTOMLEFT"
},
"charspan": [
0,
131
]
}
],
"orig": "I have become you, and you I am the you soul; So that no one can say hereafter, That you are are someone, and me someone else body,",
"text": "I have become you, and you I am the you soul; So that no one can say hereafter, That you are are someone, and me someone else body,"
},
{
"self_ref": "#/texts/3",
"parent": {
"$ref": "#/body"
},
"children": [],
"label": "section_header",
"prov": [
{
"page_no": 1,
"bbox": {
"l": 412,
"t": 205.6666717529297,
"r": 664,
"b": 162.6666717529297,
"coord_origin": "BOTTOMLEFT"
},
"charspan": [
0,
12
]
}
],
"orig": "Ameer Khusro",
"text": "Ameer Khusro",
"level": 1
}
],
"pictures": [
{
"self_ref": "#/pictures/0",
"parent": {
"$ref": "#/body"
},
"children": [
{
"$ref": "#/texts/0"
}
],
"label": "picture",
"prov": [
{
"page_no": 1,
"bbox": {
"l": 23.88103675842285,
"t": 831.9660034179688,
"r": 1062.5076904296875,
"b": 582.7559814453125,
"coord_origin": "BOTTOMLEFT"
},
"charspan": [
0,
0
]
}
],
"captions": [],
"references": [],
"footnotes": [],
"annotations": []
}
],
"tables": [],
"key_value_items": [],
"pages": {
"1": {
"size": {
"width": 1080,
"height": 1080
},
"page_no": 1
}
}
}
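For reference, here is a minimal sketch of the "own parser" workaround mentioned above. It works on the exported JSON as a plain dictionary and does not use any Docling API; the helper names `resolve` and `iter_texts` and the trimmed `DOC` sample are assumptions made up for illustration. It walks `body.children` in order, follows each `$ref`, and also descends into the children of pictures, so text nested under `#/pictures/0` is collected along with the rest.

```python
# Trimmed copy of the document above: only the fields the walk needs.
DOC = {
    "body": {
        "self_ref": "#/body",
        "children": [
            {"$ref": "#/pictures/0"},
            {"$ref": "#/texts/1"},
            {"$ref": "#/texts/2"},
            {"$ref": "#/texts/3"},
        ],
    },
    "texts": [
        {"self_ref": "#/texts/0", "children": [],
         "text": "تو دیگرم شدی شدی شدم من من من تو ازي من"},
        {"self_ref": "#/texts/1", "children": [],
         "text": "Mun tu shudam tu mun shudimun tun shudam tu jaan shudi"},
        {"self_ref": "#/texts/2", "children": [],
         "text": "I have become you, and you I am the you soul;"},
        {"self_ref": "#/texts/3", "children": [], "text": "Ameer Khusro"},
    ],
    "pictures": [
        {"self_ref": "#/pictures/0", "children": [{"$ref": "#/texts/0"}]},
    ],
}


def resolve(doc, ref):
    """Follow a reference such as '#/texts/0' to the node it points at."""
    if not ref.startswith("#/"):
        raise ValueError(f"unsupported reference: {ref!r}")
    node = doc
    for part in ref[2:].split("/"):
        # List segments are numeric indices, dict segments are keys.
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


def iter_texts(doc, node=None):
    """Yield every 'text' value in reading order, including picture children."""
    node = doc["body"] if node is None else node
    if "text" in node:
        yield node["text"]
    for child in node.get("children", []):
        yield from iter_texts(doc, resolve(doc, child["$ref"]))


if __name__ == "__main__":
    for line in iter_texts(DOC):
        print(line)
```

With the full document, the same functions can be applied to the dictionary obtained by loading the saved JSON with `json.load`. The walk is depth-first, so the Farsi line comes out first because `#/pictures/0` is the first child of the body.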