ChatGPT OCR results are generated in different languages #864

tanreinama · 2025-02-16T05:03:19Z

When an image contains Japanese text, the output is translated into English instead of keeping the original Japanese.

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("example.jpg")  # image containing Japanese text
print(result.text_content)  # output comes back in English

In some cases the entire output is in English; in others only part of it (for example, just the title) is in English.

Depending on the requirements of your RAG pipeline, this may not be desirable. It would be better to be able to specify the output language, or to fix it to the language of the text found in the image.

Si-ddhartha · 2025-02-18T05:27:04Z

Hi, I was looking into this issue and couldn't find anything related to OCR in the code. Based on my understanding, the library processes the image by passing it to the provided LLM along with a prompt. If no custom prompt is given, it defaults to:

"Write a detailed caption for this image."

Since the prompt is in English, the LLM likely assumes the response should also be in English. This might explain why captions are always generated in English, even if the image contains text in another language.

One way to address this would be to let users specify a preferred output language, or to check whether the LLM itself supports automatic language detection and leverage that where possible.
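As a rough sketch of what that could look like from the caller's side: the helper below builds a prompt that either names a target language or asks the model to keep any visible text in its original language. `build_caption_prompt` and `convert_image` are hypothetical names used only for illustration, and the sketch assumes that `convert()` forwards an `llm_prompt` keyword argument to the image converter, which should be checked against the installed markitdown version.

```python
from typing import Optional

# Default prompt quoted earlier in this thread.
DEFAULT_PROMPT = "Write a detailed caption for this image."


def build_caption_prompt(language: Optional[str] = None) -> str:
    """Hypothetical helper: append a language instruction to the default prompt."""
    if language:
        instruction = f"Respond in {language}."
    else:
        instruction = (
            "Transcribe any visible text verbatim in its original language; "
            "do not translate it."
        )
    return f"{DEFAULT_PROMPT} {instruction}"


def convert_image(path: str, language: Optional[str] = None, model: str = "gpt-4o") -> str:
    """Hypothetical wrapper; requires markitdown, openai, and an OpenAI API key."""
    # Imported lazily so the prompt helper can be used without these packages.
    from markitdown import MarkItDown
    from openai import OpenAI

    md = MarkItDown(llm_client=OpenAI(), llm_model=model)
    # Assumption: convert() accepts llm_prompt and passes it on to the LLM call.
    result = md.convert(path, llm_prompt=build_caption_prompt(language))
    return result.text_content
```

Calling `convert_image("example.jpg", language="Japanese")` would then request Japanese output, while omitting `language` asks the model not to translate. Whether the model actually follows the instruction is not guaranteed, so this is a workaround rather than a fix.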

I’d love to work on this issue and implement a fix! Let me know if this approach makes sense or if you have any suggestions.
