ChatGPT OCR results are generated in different languages #864

Open
tanreinama opened this issue Feb 16, 2025 · 1 comment

Comments

@tanreinama

Japanese text in an image is translated into English in the output instead of being kept in Japanese.

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("example.jpg")  # image containing Japanese text
print(result.text_content)  # output comes back in English

In some cases the entire output is in English; in others only part of it (only the title) is in English.

Depending on the requirements of a RAG pipeline, this may be undesirable, so it would be better to be able to specify the output language, or to keep the output in the original language found in the image.

@Si-ddhartha

Hi, I was looking into this issue and couldn't find anything related to OCR in the code. Based on my understanding, the library processes the image by passing it to the provided LLM along with a prompt. If no custom prompt is given, it defaults to:

"Write a detailed caption for this image."

Since the prompt is in English, the LLM likely assumes the response should also be in English. This might explain why captions are always generated in English, even if the image contains text in another language.

One way to address this would be to let users specify a preferred output language. Another would be to check whether the LLM itself supports automatic language detection and, if so, leverage that.
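As a rough illustration of the first option, here is a minimal sketch that appends a language instruction to the default prompt and passes it as a custom prompt. The helper names `build_caption_prompt` and `convert_image` are hypothetical, and the sketch assumes the custom prompt is supplied through an `llm_prompt` keyword argument to `convert()`; that should be checked against the installed version of the library.

```python
from typing import Optional

# Default prompt quoted earlier in this thread.
DEFAULT_PROMPT = "Write a detailed caption for this image."


def build_caption_prompt(language: Optional[str] = None) -> str:
    """Hypothetical helper: extend the default prompt with a language instruction."""
    if language:
        # Caller asked for a specific output language.
        return (
            f"{DEFAULT_PROMPT} Write the caption in {language} and "
            "transcribe any text in the image without translating it."
        )
    # No language given: ask the model to follow the language in the image.
    return (
        f"{DEFAULT_PROMPT} Respond in the same language as the text in the "
        "image, and transcribe that text verbatim without translating it."
    )


def convert_image(path: str, language: Optional[str] = None) -> str:
    """Hypothetical wrapper around MarkItDown using the prompt above."""
    # Imported here so the prompt helper can be used without these packages.
    from markitdown import MarkItDown
    from openai import OpenAI

    md = MarkItDown(llm_client=OpenAI(), llm_model="gpt-4o")
    # ASSUMPTION: `llm_prompt` is the keyword that overrides the default prompt.
    result = md.convert(path, llm_prompt=build_caption_prompt(language))
    return result.text_content
```

With this, `convert_image("example.jpg", language="Japanese")` would request Japanese output explicitly, while omitting `language` asks the model to follow the language it finds in the image. Whether the model actually honors the instruction is not guaranteed and would need to be tested on real images.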

I’d love to work on this issue and implement a fix! Let me know if this approach makes sense or if you have any suggestions.
