Google Drive - Presentation to Markdown #3073

Open
that-dom opened this issue Jan 6, 2025 · 1 comment
Labels
enhancement New feature or request

Comments

@that-dom

that-dom commented Jan 6, 2025

Problem Description

When using the Google Drive integration, presentations are ingested as undifferentiated blobs of text. When that text is used as a RAG source, the LLM can have trouble understanding which pieces of content belong together. Visual elements are also lost.

Proposed Solution

Use a vision model to extract text and meaning from each slide's visuals.
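To make the idea concrete, here is a rough sketch of what a per-slide pipeline could look like. Everything in it is an assumption for illustration: `slides_to_markdown` and `describe_slide` are hypothetical names, the slides are assumed to be already rendered to image bytes, and the vision model is passed in as a plain callable so no particular model or API is implied.

```python
from typing import Callable, Iterable, List


def slides_to_markdown(
    slide_images: Iterable[bytes],
    describe_slide: Callable[[bytes], str],
) -> str:
    """Build one Markdown section per slide.

    `describe_slide` is a hypothetical stand-in for a vision model call:
    it receives the rendered slide image and returns extracted text.
    """
    sections: List[str] = []
    for number, image in enumerate(slide_images, start=1):
        description = describe_slide(image).strip()
        # A heading per slide keeps related content grouped for retrieval.
        sections.append(f"## Slide {number}\n\n{description}")
    return "\n\n".join(sections) + "\n"
```

Keeping each slide under its own heading is the point here: a chunker can then split on headings instead of guessing where one slide ends and the next begins.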

Alternatives

Additional Context

@seanstory
Member

Hi @that-dom, thanks for filing.

It's on our radar that formats like Markdown are much more useful for LLM consumption. Today, we're primarily relying on Apache Tika (either through the Data Extraction Service or through the Elasticsearch Attachment Ingest plugin) to get textual data from non-text file formats, and this is known to lose formatting.

Until we're able to integrate with different tooling, one option you have is to pre-process your files into Markdown and then skip "binary content extraction" for those files. This would allow you to preserve your original formatting.
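As a minimal sketch of that pre-processing step, assuming the slide text has already been pulled out of the presentation by some other tool, the conversion to Markdown could look roughly like this. The `Slide` dataclass and both function names are hypothetical and not part of the connector; how the resulting `.md` files are then ingested is not shown.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class Slide:
    """Hypothetical container for text already extracted from one slide."""
    title: str
    bullets: List[str] = field(default_factory=list)
    notes: str = ""


def slide_to_markdown(slide: Slide) -> str:
    """Render one slide as a Markdown section."""
    lines = [f"## {slide.title}", ""]
    lines.extend(f"- {bullet}" for bullet in slide.bullets)
    if slide.notes:
        # Speaker notes go in a blockquote so they stay attached to the slide.
        lines.extend(["", f"> Notes: {slide.notes}"])
    return "\n".join(lines).rstrip() + "\n"


def deck_to_markdown(deck_title: str, slides: Iterable[Slide]) -> str:
    """Render a whole presentation as one Markdown document."""
    parts = [f"# {deck_title}\n"]
    parts.extend(slide_to_markdown(slide) for slide in slides)
    return "\n".join(parts)
```

For example, `deck_to_markdown("Q1 Review", [Slide("Goals", ["Ship", "Measure"])])` produces a document with a top-level heading for the deck and a second-level heading plus bullet list for the slide.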

2 participants