Provide better control over the RAG ingestion stages (conversion, chunking, embedding, storing) #1061

ilya-kolchinsky · 2025-02-12T10:40:58Z

🚀 Describe the new functionality needed

As of now, RAG ingestion chunks documents using a trivial algorithm of overlapping chunks and converts PDFs (and PDFs only) using pypdf.
The entire ingestion process should be made more general, flexible and user-controllable by introducing the respective configuration settings - similarly to the way the embedding model can be specified today via config.
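
For reference, a "trivial algorithm of overlapping chunks" can be sketched as below. This is a minimal illustration and not the project's actual implementation: the function name `chunk_with_overlap`, its parameters, and the choice to measure chunks in characters are all assumptions made for the example.

```python
def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size windows; neighbours share `overlap` characters.

    Hypothetical helper for illustration only. Sizes are in characters here;
    a real pipeline might count tokens instead.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks
```

Note that the window boundaries ignore the document's structure entirely, which is the limitation this issue is about.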

💡 Why is this needed? What if we don't build it?

Having a higher degree of control over the ingestion process is an enabler for a wide range of customer use cases. To mention a few examples:

Structured chunking aware of the document format (e.g., .json, .md);
Controllable chunking granularity;
Defining document converters on a per-format basis;
Support for sparse DBs by omitting the embedding stage.
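
To make the examples above more concrete, here is a rough sketch of what a user-controllable ingestion configuration could look like. Everything in it is hypothetical: `IngestionConfig`, `ingest`, `markdown_chunker` and the record layout are names invented for this illustration and do not correspond to any existing API. Storing is left to the caller.

```python
import os
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical stage signatures, assumed for this sketch.
Converter = Callable[[bytes], str]                    # raw file -> text
Chunker = Callable[[str], list[str]]                  # text -> chunks
Embedder = Callable[[list[str]], list[list[float]]]   # chunks -> vectors


def whole_document_chunker(text: str) -> list[str]:
    """Fallback chunker: keep the document as a single chunk."""
    return [text]


def markdown_chunker(text: str) -> list[str]:
    """Format-aware chunker: start a new chunk at every Markdown heading."""
    chunks: list[str] = []
    current: list[str] = []
    for line in text.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks


@dataclass
class IngestionConfig:
    # Converters and chunkers are selected per file extension, e.g. ".md".
    converters: dict[str, Converter] = field(default_factory=dict)
    chunkers: dict[str, Chunker] = field(default_factory=dict)
    default_chunker: Chunker = whole_document_chunker
    # None means "skip the embedding stage" (e.g. for a sparse DB).
    embedder: Optional[Embedder] = None


def ingest(name: str, raw: bytes, config: IngestionConfig) -> list[dict]:
    """Run conversion, chunking and (optionally) embedding for one document."""
    ext = os.path.splitext(name)[1].lower()
    convert = config.converters.get(ext, lambda data: data.decode("utf-8"))
    text = convert(raw)
    chunks = config.chunkers.get(ext, config.default_chunker)(text)
    if config.embedder is not None:
        vectors = config.embedder(chunks)
    else:
        vectors = [None] * len(chunks)
    return [{"text": c, "embedding": v} for c, v in zip(chunks, vectors)]
```

With such a config, `ingest("notes.md", data, IngestionConfig(chunkers={".md": markdown_chunker}))` would produce one record per Markdown section and no embeddings, since no embedder is set.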

Other thoughts

No response

hardikjshah · 2025-02-12T17:10:41Z

Very timely! We (@ehhuang, @yanxi0830 and I) are also looking at how to make this more configurable while providing reasonable default starting points (and OOTB solutions) to make RAG performance much better. At the same time, we are already talking to some providers like mongo and elastic search to see how we can get them integrated.

Maybe we start with an RFC for how we might modularize this, covering the core primitives and data flow, before we jump right into implementation. Do you happen to have one ongoing, or would you want to propose something?

ilya-kolchinsky · 2025-02-12T18:59:49Z

I definitely agree that starting with an RFC would be a great way to proceed here.

As the first step and perhaps a partial/temporary solution, I was planning to do something very similar to #1062. Eventually though, we would like to introduce a dedicated endpoint for document preprocessing (including, but potentially not limited to, document parsing/conversion and chunking), going with Docling as our preferred default choice. The main open question is how to incorporate this endpoint in the ingestion pipeline in the most convenient and user-friendly way.

Designing this extension is a WIP which we would be happy to collaborate on. As soon as we have a complete draft (which I expect to happen very soon), I can go ahead and create an RFC.

yanxi0830 · 2025-02-12T19:23:42Z

@ilya-kolchinsky Thanks for the proposal! It definitely is aligned with what we are thinking. Looking forward to your RFC!

jwm4 · 2025-02-12T20:14:24Z

@ilya-kolchinsky, I recommend referencing #1048 in your RFC too. As I understand it, what you're proposing is bolder and more comprehensive than the issues in that discussion (which is at this point just a discussion without a real proposal), but the topics are closely related, so we want to make sure everything is cross-linked.

cdoern · 2025-02-12T20:43:47Z

are also looking at how to make this more configurable while providing reasonable default starting point

Does this idea of an OOTB configuration that a user can apply relate to #993 at all? If so, we should combine efforts here a bit. A general-purpose, user-friendly configuration API for setting up more complex workflows seems like a good idea to me.

ilya-kolchinsky · 2025-02-13T10:17:04Z

@jwm4, @cdoern - no worries, the RFC will go through your (and the team's) approval before being published, and of course I'll be more than happy to join forces if you'd like to contribute to it.

ilya-kolchinsky added the enhancement New feature or request label Feb 12, 2025

hardikjshah assigned ilya-kolchinsky Feb 12, 2025

yanxi0830 added the RAG Relates to RAG functionality of the agents API label Feb 12, 2025

ilya-kolchinsky mentioned this issue Feb 24, 2025

[RFC] Preprocessing endpoint for RAG and other uses #1232

Open
