A toolkit to create an optimal, production-ready Retrieval-Augmented Generation (RAG) setup for your data

KruxAI/ragbuilder

RagBuilder is a toolkit that automatically creates an optimal, production-ready Retrieval-Augmented Generation (RAG) setup for your data. It performs hyperparameter tuning over RAG parameters (e.g., chunking strategy: semantic, character, etc.; chunk size: 1000, 2000, etc.) and evaluates each configuration against a test dataset to identify the best-performing setup for your data. RagBuilder also includes several state-of-the-art, pre-defined RAG templates that have shown strong performance across diverse datasets. Just bring your data, and RagBuilder will generate a production-grade RAG setup in minutes.
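
The search described above can be pictured with a small, library-free sketch. It is not RagBuilder's implementation: the feature list names Bayesian optimization, whereas this example simply enumerates every combination. The search space and the `toy_score` function are hypothetical stand-ins for a real evaluation against a test dataset.

```python
from itertools import product

# Hypothetical search space; names and values are illustrative only.
SEARCH_SPACE = {
    "chunking_strategy": ["semantic", "character"],
    "chunk_size": [1000, 2000],
}

def candidate_configs(space):
    """Yield every combination of parameter values as a dict."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

def best_config(space, score_fn):
    """Return the (config, score) pair with the highest score."""
    scored = [(cfg, score_fn(cfg)) for cfg in candidate_configs(space)]
    return max(scored, key=lambda pair: pair[1])

def toy_score(cfg):
    """Stand-in for scoring a pipeline against a test dataset."""
    bonus = 0.1 if cfg["chunking_strategy"] == "semantic" else 0.0
    return bonus + 1000 / cfg["chunk_size"]

config, score = best_config(SEARCH_SPACE, toy_score)
print(config, round(score, 2))
```

An exhaustive grid grows multiplicatively with each added parameter, which is why a guided search with a fixed trial budget is attractive for larger spaces.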

Features

  • Hyperparameter Tuning: Efficiently optimize your RAG configurations using Bayesian optimization
  • Pre-defined RAG Templates: Use state-of-the-art templates that have demonstrated strong performance (e.g., Graph retriever, Contextual chunker)
  • Evaluation Dataset Options: Generate a synthetic test dataset or provide your own
  • Component Access: Direct access to vectorstore, retriever, and generator components
  • API Deployment: Easily deploy as an API service
  • Project Persistence: Save and load optimized RAG pipelines

Installation

# Create a new venv
uv venv ragbuilder

# Activate the new venv
source ragbuilder/bin/activate

# Install
uv pip install ragbuilder

See other installation options here (link)

Quick Start

from ragbuilder import RAGBuilder

# Initialize and optimize with defaults
builder = RAGBuilder.from_source_with_defaults(input_source='https://lilianweng.github.io/posts/2023-06-23-agent/')
results = builder.optimize()

# Run a query through the complete pipeline
response = results.invoke("What is HNSW?")

# View optimization summary
print(results.summary())

Setting Default Models

You can specify default LLM and embedding models that will be used throughout the pipeline:

from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from ragbuilder import RAGBuilder

# Initialize with custom defaults
builder = RAGBuilder.from_source_with_defaults(
    input_source='data.pdf',
    default_llm=AzureChatOpenAI(model="gpt-4o", temperature=0.0),
    default_embeddings=AzureOpenAIEmbeddings(model="text-embedding-3-large"),
    n_trials=20  # Set number of optimization trials
)

# Or when creating a RAGBuilder instance with fine-grained custom configuration
# (data_ingest_config is a DataIngestOptionsConfig; see Advanced Configuration)
builder = RAGBuilder(
    data_ingest_config=data_ingest_config,  # Custom data ingestion parameters
    default_llm=AzureChatOpenAI(model="gpt-4o", temperature=0.0),
    default_embeddings=AzureOpenAIEmbeddings(model="text-embedding-3-large")
)

Configuration Guide

Basic Configuration

For most use cases, the default configuration provides good results:

builder = RAGBuilder.from_source_with_defaults(
    input_source='path/to/your/data',
    test_dataset='path/to/test/data'  # Optional
)

Advanced Configuration

For fine-grained control over your RAG pipeline, you can customize every aspect:

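from ragbuilder.config import (
    DataIngestOptionsConfig,
    RetrievalOptionsConfig,
    GenerationOptionsConfig
)
# LLMConfig is used in the generation config below. This import path is an
# assumption; adjust it if LLMConfig lives in another ragbuilder module.
from ragbuilder.config import LLMConfig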

# Configure data ingestion
data_ingest_config = DataIngestOptionsConfig(
    input_source="data.pdf",
    document_loaders=[
        {"type": "pymupdf"},
        {"type": "unstructured"}
    ],
    chunking_strategies=[{
        "type": "RecursiveCharacterTextSplitter",
        "chunker_kwargs": {"separators": ["\n\n", "\n", " ", ""]}
    }],
    chunk_size={"min": 500, "max": 2000, "stepsize": 500},
    embedding_models=[{
        "type": "openai",
        "model_kwargs": {"model": "text-embedding-3-large"}
    }]
)

# Initialize with custom configs
builder = RAGBuilder(
    data_ingest_config=data_ingest_config,
    default_llm=AzureChatOpenAI(model="gpt-4o", temperature=0.0),
    default_embeddings=AzureOpenAIEmbeddings(model="text-embedding-3-large")
)

# Run individual module level optimization
builder.optimize_data_ingest()


# Configure retrieval options
retrieval_config = RetrievalOptionsConfig(
    retrievers=[
        {
            "type": "vector_similarity",
            "retriever_k": [20],
            "weight": 0.5
        },
        {
            "type": "bm25",
            "retriever_k": [20],
            "weight": 0.5
        }
    ],
    rerankers=[{
        "type": "BAAI/bge-reranker-base"
    }],
    top_k=[3, 5]
)


# Run retrieval optimization with custom config
builder.optimize_retrieval(retrieval_config)

# Configure generation options
gen_config = GenerationOptionsConfig(
    llms=[
        LLMConfig(type="azure_openai", model_kwargs={"model": "gpt-4o-mini", "temperature": 0.2}),
        LLMConfig(type="azure_openai", model_kwargs={"model": "gpt-4o", "temperature": 0.2}),
    ],
    optimization={
        "n_trials": 10, 
        "n_jobs": 1,
        "study_name": "lillog_agents_study",
        "optimization_direction": "maximize"
    },
    evaluation_config={"type": "ragas"},
)

# Run generation optimization with custom config
builder.optimize_generation(gen_config)

# Query the optimized pipeline
results = builder.optimization_results
response = results.invoke("What is HNSW?")
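
To get a feel for how large the data ingestion search space in this example is, the sketch below expands the chunk_size range and counts the combinations. It assumes the range is inclusive of both min and max, which this README does not state; `expand_range` is a hypothetical helper, not part of RagBuilder.

```python
def expand_range(spec):
    """Expand {"min", "max", "stepsize"} into candidate values (inclusive)."""
    if spec["stepsize"] <= 0:
        raise ValueError("stepsize must be positive")
    return list(range(spec["min"], spec["max"] + 1, spec["stepsize"]))

# Values copied from the data ingestion config above.
chunk_sizes = expand_range({"min": 500, "max": 2000, "stepsize": 500})
document_loaders = ["pymupdf", "unstructured"]
chunking_strategies = ["RecursiveCharacterTextSplitter"]
embedding_models = ["text-embedding-3-large"]

# Every option multiplies the number of distinct configurations.
n_combinations = (
    len(document_loaders)
    * len(chunking_strategies)
    * len(chunk_sizes)
    * len(embedding_models)
)
print(chunk_sizes, n_combinations)
```

Under that assumption there are only a handful of combinations here, so a small trial budget can cover them; wider ranges or more loaders grow the space quickly.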

Component Options Reference

Document Loaders

  • unstructured: General-purpose loader
  • pymupdf: Optimized for PDFs
  • pypdf: Alternative PDF loader
  • web: Web page loader
  • Custom loaders via custom_class

Chunking Strategies

  • RecursiveCharacterTextSplitter: Recursive character text splitter
  • CharacterTextSplitter: Character text splitter
  • MarkdownHeaderTextSplitter: Markdown-header based splitter
  • HTMLHeaderTextSplitter: HTML-header based splitter
  • SemanticChunker: Semantic chunker
  • TokenTextSplitter: Token-based splitter
  • Custom splitters via custom_class
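
As a rough illustration of the idea behind recursive character splitting, here is a simplified sketch. It is not the LangChain implementation: it tries the coarsest separator first and falls back to finer ones only for pieces that are still too long, but it does not merge small pieces back together or add overlap. The function name and defaults are chosen for this example.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces no longer than chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    head, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(head):
        # Pieces that already fit are kept; longer ones try the next separator.
        chunks.extend(recursive_split(piece, chunk_size, rest))
    return chunks

sample = "alpha beta\n\ngamma delta epsilon"
print(recursive_split(sample, chunk_size=12))
```

The first paragraph fits within the limit and stays whole, while the second is too long and is broken at spaces.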

Retrievers

  • vector_similarity: Vector similarity search
  • vector_mmr: Vector MMR search
  • bm25: Keyword-based search using BM25
  • multi_query: Multi-query retrievers
  • parent_doc_full: Parent document full-doc retrieval
  • parent_doc_large: Parent document large-chunks retrieval
  • graph: Graph-based retrieval (requires Neo4j)
  • Custom retrievers via custom_class
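
When several retrievers are combined with weights, as with the vector_similarity and bm25 entries in the advanced configuration, their ranked results have to be merged somehow. One common scheme is weighted reciprocal rank fusion, sketched below. Whether RagBuilder fuses results exactly this way is an assumption; the function name, the constant `c=60`, and the document ids are illustrative.

```python
from collections import defaultdict

def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists of document ids with weighted reciprocal rank fusion.

    Each document earns weight / (c + rank) from every list it appears in,
    with rank starting at 1; the contributions are summed.
    """
    if len(rankings) != len(weights):
        raise ValueError("need one weight per ranking")
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    # Highest score first; ties are broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

vector_hits = ["d1", "d2", "d3"]  # hypothetical vector search results
bm25_hits = ["d3", "d1", "d4"]    # hypothetical keyword search results
fused = weighted_rrf([vector_hits, bm25_hits], weights=[0.5, 0.5])
print(fused)
```

Documents that rank well in both lists rise to the top, while documents found by only one retriever are kept but ranked lower.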

Rerankers

  • BAAI/bge-reranker-base: BGE base reranker
  • mixedbread-ai/mxbai-rerank-base-v1: mxbai reranker base v1
  • mixedbread-ai/mxbai-rerank-large-v1: mxbai reranker large v1
  • cohere: Cohere's reranking model
  • jina: Jina reranker
  • flashrank: FlashRank reranker
  • rankllm: RankLLM reranker
  • colbert: ColBERT reranker
  • Custom rerankers via custom_class

Environment Variables

Create a .env file in your project directory:

# Required
OPENAI_API_KEY=your_key_here

# Optional - For additional features
MISTRAL_API_KEY=your_key_here
COHERE_API_KEY=your_key_here
AZURE_OPENAI_API_KEY=your_key_here
AZURE_OPENAI_ENDPOINT=your_endpoint_here

# For Graph-based RAG
NEO4J_URI=bolt://localhost:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=your_password
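
Before starting a long optimization run, it can help to check that the variables above are actually set. The sketch below uses only the standard library and assumes the .env values have already been loaded into the process environment; `missing_vars` is a hypothetical helper, not part of RagBuilder.

```python
import os

REQUIRED_VARS = ["OPENAI_API_KEY"]
GRAPH_VARS = ["NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD"]

def missing_vars(names, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Include GRAPH_VARS only if you plan to use graph-based retrieval.
missing = missing_vars(REQUIRED_VARS + GRAPH_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```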

Advanced Topics

Custom Evaluation Metrics

from ragbuilder import EvaluationConfig

config = EvaluationConfig(
    type="custom",
    custom_class="your_module.CustomEvaluator",
    evaluator_kwargs={
        "metrics": ["precision", "recall", "f1_score"]
    }
)
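
The interface a custom evaluator class must implement is not described here, so the following sketch only shows how the three metrics named above could be computed for a single query, treating retrieved and relevant documents as sets of ids. The function name and the sample ids are hypothetical.

```python
def retrieval_metrics(retrieved, relevant):
    """Compute set-based precision, recall, and F1 for one query."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    # F1 is the harmonic mean; define it as 0 when there are no hits.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {"precision": precision, "recall": recall, "f1_score": f1}

metrics = retrieval_metrics(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(metrics)
```

Averaging these per-query values over the test dataset gives a single score that an optimizer can maximize.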

Optimization Configuration

Fine-tune the optimization parameters:

from ragbuilder import OptimizationConfig

config = OptimizationConfig(
    n_trials=20,
    n_jobs=1,
    study_name="my_optimization",
    optimization_direction="maximize"
)

API Deployment

RAGBuilder can be deployed as an API service:

# Initialize and optimize
builder = RAGBuilder.from_source_with_defaults('data.pdf')
results = builder.optimize()

# Deploy as API
builder.serve(host="0.0.0.0", port=8000)

Access via:

  • POST /query - Run queries through the RAG pipeline
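
A client call might look like the sketch below, which uses only the standard library. The request body shape is an assumption: this README does not document the schema, so the "query" field name is hypothetical and should be checked against the running service.

```python
import json
import urllib.request

def build_query_request(base_url, question):
    """Build (but do not send) a JSON POST request for the /query endpoint."""
    # ASSUMPTION: the body is JSON with a "query" field; verify the real schema.
    payload = json.dumps({"query": question}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/query",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_query_request("http://localhost:8000", "What is HNSW?")
print(request.get_method(), request.full_url)

# To send it once the server is running:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp))
```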

Project Management

Save and load optimized RAG pipelines:

# Save project
builder.save('rag_project/')

# Load existing project
builder = RAGBuilder.load('rag_project/')

# Access components
vectorstore = builder.data_ingest.get_vectorstore()
retriever = builder.retrieval.get_retriever()
generator = builder.generation.get_generator()

Best Practices

  1. Start Simple

    • Begin with from_source_with_defaults()
    • Add complexity only when needed
  2. Test Data Quality

    • Provide representative test queries
    • Use domain-specific evaluation metrics
  3. Resource Management

    • Monitor memory usage with large datasets
    • Use chunking for large documents
  4. Production Deployment

    • Save optimized projects for reuse
    • Monitor API performance metrics
    • Implement rate limiting for API endpoints
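
For the rate limiting point, one simple option is a token bucket placed in front of the query handler. This is a generic sketch, not something RagBuilder provides; the class name and parameters are chosen for this example, and the clock is injectable so the behavior can be tested without sleeping.

```python
import time

class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        """Return True and consume a token if a request may proceed."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=10)
print(bucket.allow())
```

A handler would call `allow()` per request and return an HTTP 429 response when it yields False; keeping one bucket per client key gives per-client limits.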

Usage Analytics

We collect anonymous usage metrics to improve RAGBuilder:

  • Number of optimization runs
  • Success/failure rates
  • No personal or business data is collected

To opt out, set ENABLE_ANALYTICS=False in your .env file.

Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

License

This project is licensed under the MIT License - see the LICENSE file for details.