This repository provides a framework for extracting and retrieving information from PDFs and text documents. It combines OCR with language models to process text-based or image-based PDFs, create semantic embeddings, and retrieve information using semantic chunking and multi-vector retrieval.
- Utilizes MiniCPM-Llama3-V-2_5, an advanced model from Hugging Face, for OCR-based text extraction from image-based PDFs.
- Provides simple text extraction for text-based PDFs using the `PyPDF2` library.
Model Reference: MiniCPM-Llama3-V-2_5 on Hugging Face
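The split between the two extraction paths can be sketched as follows. This is a minimal sketch under stated assumptions, not the repository's code: it assumes PyPDF2 2.x or later (`PdfReader`, `page.extract_text()`) for the text path, and the helper names `extract_page_texts` and `needs_ocr`, as well as the `min_chars` threshold, are hypothetical. Pages that yield almost no embedded text are treated as image-based and would be handed to the OCR model instead.

```python
from typing import List


def extract_page_texts(pdf_path: str) -> List[str]:
    """Return the embedded text of each page using PyPDF2 (text-based path)."""
    # Imported lazily so the routing helper below works without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() can return None or "" for scanned pages.
    return [(page.extract_text() or "") for page in reader.pages]


def needs_ocr(page_texts: List[str], min_chars: int = 20) -> bool:
    """Hypothetical heuristic: treat the PDF as image-based when the average
    amount of extractable text per page falls below min_chars."""
    if not page_texts:
        return True
    total = sum(len(text.strip()) for text in page_texts)
    return total / len(page_texts) < min_chars
```

A caller would run `extract_page_texts` first and fall back to OCR only when `needs_ocr` returns `True`, which avoids loading the vision model for PDFs that already contain a text layer.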
- Built on LangChain, a robust framework for building information retrieval systems.
- Employs Ollama to serve the chat and embedding models used for natural language interaction and retrieval.
LangChain Reference: LangChain Documentation
Ollama installation guide: Ollama Installation and Usage Guide
Ollama available models: Ollama Installation and Usage Guide
- Supports semantic chunking of text and multi-vector retrieval for enhanced accuracy.
- Leverages `Chroma` and `FAISS` for vector storage and similarity search.
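To make the chunking and similarity-search ideas concrete, here is a small, dependency-free sketch. It is an illustration only: the word-count `embed` function stands in for a real embedding model, a plain Python list stands in for a Chroma or FAISS index, and the function names and the `threshold` value are assumptions. It does not cover multi-vector retrieval.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: List[str], threshold: float = 0.2) -> List[str]:
    """Start a new chunk whenever a sentence is dissimilar to the previous one."""
    chunks: List[List[str]] = []
    for sentence in sentences:
        if chunks and cosine(embed(chunks[-1][-1]), embed(sentence)) >= threshold:
            chunks[-1].append(sentence)
        else:
            chunks.append([sentence])
    return [" ".join(chunk) for chunk in chunks]


def search(chunks: List[str], query: str, k: int = 1) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    scored = [(cosine(embed(c), q), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


sentences = [
    "Invoices are due within thirty days.",
    "Late invoices incur a fee after thirty days.",
    "The office cat sleeps on the printer.",
]
chunks = semantic_chunks(sentences)  # the two invoice sentences share one chunk
best = search(chunks, "When are invoices due?")
```

Splitting on similarity rather than on a fixed character count keeps related sentences together, so a retrieved chunk is more likely to contain a complete answer.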
`main.py` lets users pass a question (query) together with the path to a PDF document and retrieves an answer based on the retrieved context.
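The entry point's actual arguments are not documented here, so the following is only a sketch of how a script could accept a question and a PDF path. The `--pdf` and `--question` flags and the `build_parser` and `parse_cli` names are hypothetical and may differ from what `main.py` really uses.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical flags; the real main.py may name or order its arguments differently.
    parser = argparse.ArgumentParser(description="Ask a question about a PDF document.")
    parser.add_argument("--pdf", required=True, help="Path to the PDF document")
    parser.add_argument("--question", required=True, help="Question to answer from the document")
    return parser


def parse_cli(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the given argument list, or sys.argv when argv is None."""
    return build_parser().parse_args(argv)


# Example invocation with an explicit argument list.
args = parse_cli(["--pdf", "report.pdf", "--question", "What is the total?"])
```

Check the script's own help output or source for the real interface before relying on these names.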
- Clone the repository:
  git clone https://github.com/balbakri1/SimpleRAG.git
  cd SimpleRAG