Provide better control over the RAG ingestion stages (conversion, chunking, embedding, storing) #1061

Open
ilya-kolchinsky opened this issue Feb 12, 2025 · 6 comments
Labels
enhancement New feature or request RAG Relates to RAG functionality of the agents API

Comments

@ilya-kolchinsky

🚀 Describe the new functionality needed

As of now, RAG ingestion chunks documents using a trivial algorithm of overlapping chunks and converts PDFs (and PDFs only) using pypdf.
The entire ingestion process should be made more general, flexible and user-controllable by introducing the respective configuration settings, similar to the way the embedding model can be specified today via config.
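
For readers unfamiliar with the "trivial algorithm of overlapping chunks" mentioned above, here is a minimal sketch of the general technique. It is an illustration only, not the project's actual implementation; the function name `overlapping_chunks` and its parameters `chunk_size` and `overlap` are hypothetical.

```python
from typing import List, Sequence


def overlapping_chunks(tokens: Sequence[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split a token sequence into fixed-size windows that share `overlap` tokens.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    # Each window starts `step` tokens after the previous one,
    # so consecutive windows share exactly `overlap` tokens.
    step = chunk_size - overlap
    return [list(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]


if __name__ == "__main__":
    words = "a b c d e f g h i j".split()
    for chunk in overlapping_chunks(words, chunk_size=4, overlap=2):
        print(chunk)
```

A sliding window like this ignores document structure entirely (headings, JSON keys, tables), which is the limitation this issue is about.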

💡 Why is this needed? What if we don't build it?

Having a higher degree of control over the ingestion process is an enabler for a wide range of customer use cases. To mention a few examples:

  • Structured chunking aware of the document format (e.g., .json, .md);
  • Controllable chunking granularity;
  • Defining document converters on a per format basis;
  • Support for sparse DBs by omitting the embedding stage.
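
To make the examples above more concrete, here is one possible shape such a configuration could take. This is a rough sketch under stated assumptions, not a proposed API: `IngestionConfig`, `ingest`, and all field names are hypothetical, and the converters, chunkers and embedder are plain callables standing in for real providers.

```python
from dataclasses import dataclass, field
from pathlib import PurePath
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stage signatures (assumptions for illustration only).
Converter = Callable[[bytes], str]
Chunker = Callable[[str], List[str]]
Embedder = Callable[[List[str]], List[List[float]]]


def whole_document_chunker(text: str) -> List[str]:
    """Fallback chunker: treat the whole document as a single chunk."""
    return [text]


def paragraph_chunker(text: str) -> List[str]:
    """Format-aware example: split Markdown-like text on blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


@dataclass
class IngestionConfig:
    # Per-format converters and chunkers, keyed by file suffix such as ".md".
    converters: Dict[str, Converter] = field(default_factory=dict)
    chunkers: Dict[str, Chunker] = field(default_factory=dict)
    default_chunker: Chunker = whole_document_chunker
    # Leaving the embedder unset skips the embedding stage (e.g., for sparse DBs).
    embedder: Optional[Embedder] = None


def ingest(name: str, raw: bytes, config: IngestionConfig) -> Tuple[List[str], Optional[List[List[float]]]]:
    """Run conversion, chunking and (optionally) embedding for one document."""
    suffix = PurePath(name).suffix.lower()
    # Without a registered converter, assume the input is UTF-8 text.
    converter = config.converters.get(suffix, lambda data: data.decode("utf-8"))
    chunker = config.chunkers.get(suffix, config.default_chunker)
    chunks = chunker(converter(raw))
    vectors = config.embedder(chunks) if config.embedder is not None else None
    return chunks, vectors


if __name__ == "__main__":
    config = IngestionConfig(chunkers={".md": paragraph_chunker})
    print(ingest("notes.md", b"# A\n\ntext one\n\n# B", config))
```

Keeping each stage as an independent, swappable callable is one way to cover all four bullets at once; the storing stage is left out of this sketch.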

Other thoughts

No response

@ilya-kolchinsky ilya-kolchinsky added the enhancement New feature or request label Feb 12, 2025
@hardikjshah
Contributor

Very timely! We (@ehhuang, @yanxi0830 and I) are also looking at how to make this more configurable while providing reasonable default starting points (and OOTB solutions) to make RAG performance much better. At the same time, we are already talking to some providers, such as Mongo and Elasticsearch, to see how we can get them integrated.

Maybe we start with an RFC on how we might modularize this, covering the core primitives and data flow, before we jump right into implementation. Do you happen to have one in progress, or would you want to propose something?

@yanxi0830 yanxi0830 added the RAG Relates to RAG functionality of the agents API label Feb 12, 2025
@ilya-kolchinsky
Author

I definitely agree that starting with an RFC would be a great way to proceed here.

As the first step and perhaps a partial/temporary solution, I was planning to do something very similar to #1062. Eventually though, we would like to introduce a dedicated endpoint for document preprocessing (including, but potentially not limited to, document parsing/conversion and chunking), going with Docling as our preferred default choice. The main open question is how to incorporate this endpoint in the ingestion pipeline in the most convenient and user-friendly way.

Designing this extension is a WIP which we would be happy to collaborate on. As soon as we have a complete draft (which I expect to happen very soon), I can go ahead and create an RFC.

@yanxi0830
Contributor

@ilya-kolchinsky Thanks for the proposal! It definitely is aligned with what we are thinking. Looking forward to your RFC!

@jwm4
Contributor

jwm4 commented Feb 12, 2025

@ilya-kolchinsky, I recommend referencing #1048 in your RFC too. As I understand it, what you're proposing is bolder and more comprehensive than the issues in that discussion (which at this point is just a discussion without a real proposal), but the topics are closely related, so we want to make sure everything is cross-linked.

@cdoern
Contributor

cdoern commented Feb 12, 2025

are also looking at how to make this more configurable while providing reasonable default starting point

Does this idea of an OOTB configuration that a user can apply relate to #993 at all? If so, we should combine efforts here a bit. A general-purpose, user-friendly configuration API for setting up more complex workflows seems like a good idea to me.

@ilya-kolchinsky
Author

@jwm4, @cdoern - no worries, the RFC will go through your (and the team's) approval before being published, and of course I'll be more than happy to join forces if you'd like to contribute to it.

5 participants