Provide better control over the RAG ingestion stages (conversion, chunking, embedding, storing) #1061
Comments
Very timely! We (@ehhuang, @yanxi0830 and I) are also looking at how to make this more configurable while providing reasonable default starting points (and OOTB solutions) to make RAG performance much better. At the same time, we are already talking to some providers, such as Mongo and Elasticsearch, to see how we can get them integrated. Maybe we start with an RFC on how we might modularize this, covering the core primitives and data flow, before we jump right into implementation. Do you happen to have one ongoing, or would you want to propose something?
I definitely agree that starting with an RFC would be a great way to proceed here. As the first step and perhaps a partial/temporary solution, I was planning to do something very similar to #1062. Eventually though, we would like to introduce a dedicated endpoint for document preprocessing (including, but potentially not limited to, document parsing/conversion and chunking), going with Docling as our preferred default choice. The main open question is how to incorporate this endpoint in the ingestion pipeline in the most convenient and user-friendly way. Designing this extension is a WIP which we would be happy to collaborate on. As soon as we have a complete draft (which I expect to happen very soon), I can go ahead and create an RFC.
@ilya-kolchinsky Thanks for the proposal! It definitely is aligned with what we are thinking. Looking forward to your RFC!
@ilya-kolchinsky, I recommend referencing #1048 in your RFC too. As I understand it, what you're proposing is bolder and more comprehensive than the issues in that discussion (which is at this point just a discussion without a real proposal), but the topics are closely related, so we want to make sure everything is cross-linked.
Does this idea of an OOTB configuration that a user can apply relate to #993 at all? If so, we should combine efforts here a bit. A general-purpose, user-friendly configuration API for setting up more complex workflows seems like a good idea to me.
🚀 Describe the new functionality needed
As of now, the RAG ingestion process chunks documents using a trivial overlapping-chunk algorithm and converts PDFs (and PDFs only) using pypdf.
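For readers unfamiliar with the approach, overlapping chunking can be sketched roughly as follows. This is only an illustration of the general technique, not the project's actual code; the function name `chunk_with_overlap` and its parameters are hypothetical.

```python
from typing import List, Sequence


def chunk_with_overlap(tokens: Sequence[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split a token sequence into fixed-size windows that share `overlap` tokens.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        # Stop once the current window has reached the end of the input,
        # so no trailing chunk consists purely of already-covered tokens.
        if start + chunk_size >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog again".split()
    for chunk in chunk_with_overlap(words, chunk_size=4, overlap=2):
        print(chunk)
```

The window ignores sentence, paragraph, and section boundaries entirely, which is one reason a pluggable chunking stage could be valuable.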
The entire ingestion process should be made more general, flexible and user-controllable by introducing the respective configuration settings - similarly to the way the embedding model can be specified today via config.
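As a rough sketch of what such settings might look like, the snippet below models the four stages named in the title (conversion, chunking, embedding, storing) as one configuration object. Every class, field, and default value here is hypothetical and does not reflect any existing API; only the converter names `pypdf` and `docling` come from this thread.

```python
from dataclasses import dataclass


@dataclass
class IngestionConfig:
    """Hypothetical per-stage settings for a RAG ingestion pipeline."""

    converter: str = "pypdf"           # conversion stage; "docling" is the alternative discussed here
    chunker: str = "overlapping"       # chunking stage; strategy name is an assumption
    chunk_size: int = 512              # assumed default, in tokens
    chunk_overlap: int = 64            # assumed default, in tokens
    embedding_model: str = "example-embedding-model"  # placeholder, not a real model name
    vector_store: str = "example-vector-store"        # placeholder, not a real provider id

    def validate(self) -> None:
        """Reject settings that could not produce a valid chunking window."""
        if self.chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        if not 0 <= self.chunk_overlap < self.chunk_size:
            raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")


if __name__ == "__main__":
    config = IngestionConfig(converter="docling", chunk_size=256, chunk_overlap=32)
    config.validate()
    print(config)
```

Keeping one field per stage would let a user swap out a single stage, for example the converter, while keeping the defaults for the others.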
💡 Why is this needed? What if we don't build it?
Having a higher degree of control over the ingestion process is an enabler for a wide range of customer use cases. To mention a few examples:
Other thoughts
No response