This project implements a Retrieval-Augmented Generation (RAG) system to query PDF documents using advanced language models.
- Parsing PDF documents into markdown format
- Splitting content into chapters and chunks
- Creating a vector store for semantic search
- Querying the document using natural language
- Multi-query support to enhance search results
- LLMs (OpenAI and DeepSeek) to generate answers from the retrieved content
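A rough sketch of the parsing and splitting steps, assuming LlamaParse's markdown output and LangChain's text splitters (the function name, chunk sizes, and header levels are illustrative assumptions, not the project's actual code):

```python
# Minimal parsing/splitting sketch (hypothetical helper, not this project's code).
# Assumes LLAMA_CLOUD_API_KEY is set in the environment.
from llama_parse import LlamaParse
from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

def parse_and_split(pdf_path: str):
    # LlamaParse converts the PDF to markdown, preserving heading structure.
    parser = LlamaParse(result_type="markdown")
    docs = parser.load_data(pdf_path)
    markdown = "\n\n".join(d.text for d in docs)

    # Split on markdown headings first ("chapters"), then into smaller chunks.
    by_chapter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[("#", "chapter"), ("##", "section")]
    ).split_text(markdown)
    chunker = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    return chunker.split_documents(by_chapter)
```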
This project leverages various state-of-the-art technologies to ensure efficient document processing, semantic search, and interaction with language models.
- LlamaParse: For high-quality PDF parsing, we use LlamaParse by LlamaIndex. As stated by LlamaIndex: "At LlamaIndex we have a mission to connect your data to LLMs. A key factor in the effectiveness of presenting your data to LLMs is that it be easily understood by the model. Our experiments show that high-quality parsing makes a significant difference to the outcomes of your generative AI applications. So we compiled all of our expertise in document parsing into LlamaParse, to make it easy for you to get your data into the best possible shape for your LLMs."
- ChromaDB: A vector database that stores document embeddings and retrieves them for fast, accurate semantic search.
- OpenAI & DeepSeek LLMs: Used for natural language processing and for generating responses grounded in the retrieved document context.
- LangChain: A framework for integrating LLMs with external data sources, powering retrieval-augmented generation (RAG).
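Putting ChromaDB, the LLMs, and LangChain together, retrieval and answering might look like the sketch below. Module paths vary across LangChain versions, and the model and function names here are assumptions rather than this project's actual code:

```python
# Outline of retrieval and answering with the stack above (a sketch).
from langchain_chroma import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.retrievers.multi_query import MultiQueryRetriever

def answer(chunks, question: str, k: int = 1, persist_dir: str = "./chroma_db"):
    # Embed the chunks and persist them in ChromaDB for semantic search.
    store = Chroma.from_documents(
        chunks, OpenAIEmbeddings(), persist_directory=persist_dir
    )

    # DeepSeek could be swapped in via its OpenAI-compatible endpoint
    # (base_url="https://api.deepseek.com"); the model name is an assumption.
    llm = ChatOpenAI(model="gpt-4o-mini")

    # Multi-query retrieval: the LLM rephrases the question several ways,
    # and the merged results tend to improve recall over a single query.
    retriever = MultiQueryRetriever.from_llm(
        retriever=store.as_retriever(search_kwargs={"k": k}), llm=llm
    )
    context = "\n\n".join(d.page_content for d in retriever.invoke(question))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm.invoke(prompt).content
```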
- Python 3.8+
- API keys for OpenAI and/or DeepSeek
- An API key for LlamaCloud (used by LlamaParse)
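In a typical setup these keys are read from environment variables; a quick, hypothetical sanity check (the variable names are assumptions and should be matched to whatever main.py actually reads):

```python
# Hypothetical pre-flight check; the variable names are assumptions.
import os

for key in ("OPENAI_API_KEY", "DEEPSEEK_API_KEY", "LLAMA_CLOUD_API_KEY"):
    if not os.getenv(key):
        print(f"Warning: {key} is not set")
```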
Run the tool from the command line:

```bash
python main.py -f path/to/document.pdf -q "Your question here"
```
- Available parameters:
- -f/--file: Path to the PDF file to be processed
- -q/--query: Question to ask the system
- -k: Number of retrieved documents (default: 1)
- --persist-dir: Directory to save/load the vector store (default: ./chroma_db)
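For reference, a CLI with exactly these flags could be wired up with argparse along these lines (a sketch, not necessarily the project's actual main.py):

```python
# Minimal argparse setup matching the flags documented above (a sketch).
import argparse

parser = argparse.ArgumentParser(description="Query a PDF document with RAG")
parser.add_argument("-f", "--file", required=True,
                    help="Path to the PDF file to be processed")
parser.add_argument("-q", "--query", required=True,
                    help="Question to ask the system")
parser.add_argument("-k", type=int, default=1,
                    help="Number of retrieved documents")
parser.add_argument("--persist-dir", default="./chroma_db",
                    help="Directory to save/load the vector store")
args = parser.parse_args()
```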