
Added PDF Wizard #1186

Merged: 4 commits merged into UTSAVS26:main on Mar 9, 2025

Conversation

@tulu-g559 (Contributor) commented Mar 8, 2025

Pull Request for PyVerse 💡

Requesting to submit a pull request to the PyVerse repository.


Issue Title

[Code Addition Request]: PDF Wizard - AI-Powered Document Q&A Tool with multiple pdfs📄✨

  • I have provided the issue title.

Info about the Related Issue

What's the goal of the project?
The goal of PDF Wizard is to provide an AI-powered interface that allows users to interact with multiple PDF documents. It extracts text from uploaded PDFs, converts them into embeddings using Google's Gemini AI, and stores them in a FAISS vector database. Users can then ask questions, and the system retrieves relevant content to generate accurate responses.

  • I have described the aim of the project.
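To make the described flow concrete, here is a minimal sketch of the indexing pipeline, condensed from the helpers this PR adds (get_pdf_text, get_text_chunks, get_vector_store); the function name build_index and its condensed shape are illustrative, not the exact submitted code:

```python
# Condensed, illustrative sketch of the PDF Wizard indexing pipeline;
# it mirrors the PR's helpers but is not the exact submitted code.
from PyPDF2 import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain.vectorstores import FAISS

def build_index(pdf_files, index_dir="faiss_index"):
    # 1. Extract raw text from every page of every uploaded PDF.
    text = ""
    for pdf in pdf_files:
        for page in PdfReader(pdf).pages:
            text += page.extract_text() or ""
    # 2. Split the text into overlapping chunks for retrieval.
    splitter = RecursiveCharacterTextSplitter(chunk_size=10000, chunk_overlap=1000)
    chunks = splitter.split_text(text)
    # 3. Embed the chunks with Gemini embeddings and persist a FAISS index.
    embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
    FAISS.from_texts(chunks, embedding=embeddings).save_local(index_dir)
```

At query time, the saved index is loaded, a similarity search retrieves the most relevant chunks, and those chunks are passed as context to the Gemini chat model, as shown in the review diffs later in this thread.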

Name

Please mention your name.
Arnab Ghosh

  • I have provided my name.

GitHub ID

Please mention your GitHub ID.
https://github.com/tulu-g559

  • I have provided my GitHub ID.

Email ID

Please mention your email ID for further communication.
[email protected]

  • I have provided my email ID.

Identify Yourself

Mention in which program you are contributing (e.g., WoB, GSSOC, SSOC, SWOC).
JWoC

  • I have mentioned my participant role.

Closes

Enter the issue number that will be closed through this PR.
Closes: #1185

  • I have provided the issue number.

Describe the Add-ons or Changes You've Made

Give a clear description of what you have added or modified.
Describe your changes here.

  • I have described my changes.

Type of Change

Select the type of change:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Code style update (formatting, local variables)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Describe how your changes have been tested.
Functionality Testing:

  • Uploaded multiple PDFs with different formats and structures.
  • Verified that text extraction from the PDFs is accurate.

AI Response Testing:

  • Asked context-based questions and compared the answers to ensure relevance.
  • Tested edge cases where answers might not be available in the context.

Performance & Efficiency:

  • Measured response time for similarity search in FAISS (see the timing sketch below).
  • Tested various chunk sizes to optimize retrieval speed.

  • I have described my testing process.
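As referenced above, a micro-benchmark along these lines could reproduce the FAISS timing measurement; this is an illustrative sketch, not part of the PR, and it assumes an index already built by the app's Submit & Process step (the query string is a stand-in):

```python
# Illustrative timing harness for FAISS similarity search; assumes the
# "faiss_index" directory was produced by the app beforehand.
import time
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain.vectorstores import FAISS

embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
db = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)

start = time.perf_counter()
docs = db.similarity_search("What does the document say about pricing?")  # sample query
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"Retrieved {len(docs)} chunks in {elapsed_ms:.1f} ms")
```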

Checklist

Please confirm the following:

  • My code follows the guidelines of this project.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly wherever it was hard to understand.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • I have added things that prove my fix is effective or that my feature works.
  • Any dependent changes have been merged and published in downstream modules.

Summary by CodeRabbit

  • New Features
    • Introduced an interactive web application that lets users upload PDFs, extract their text, and receive AI-powered responses to queries.
  • Documentation
    • Added comprehensive setup and usage guides, including installation instructions and an overview of key features.
  • Chores
    • Updated configuration settings for Google service integration and streamlined dependency management for a smoother setup experience.
    • Added a new environment variable for Google API key management.
    • Created a requirements file listing essential dependencies for the application.


coderabbitai bot commented Mar 8, 2025

Walkthrough

The pull request introduces a new environment variable (GOOGLE_API_KEY) in the .env file for storing a Google API key. Two new Python files implement Streamlit applications that process PDF files—extracting text, chunking it, creating a FAISS vector store, and handling conversational Q&A using Google Generative AI. Additionally, new documentation and a requirements file have been added to outline the application’s features, setup, and required packages.

Changes

| File(s) | Change Summary |
|---|---|
| Generative-AI/PDF Wizard/.env | Added a new GOOGLE_API_KEY environment variable with a placeholder value. |
| Generative-AI/PDF Wizard/app.py and Generative-AI/PDF Wizard/faiss_index/app.py | Introduced new Streamlit apps for PDF processing and conversational Q&A. Added functions for reading PDF text, splitting text into chunks, creating/storing a FAISS vector store, and setting up a conversational chain for answering user queries. |
| Generative-AI/PDF Wizard/readme_pdf_wiz.md | Added documentation explaining the project's features, installation instructions, usage details, tech stack, and project structure. |
| Generative-AI/PDF Wizard/requirements.txt | Created a dependency file listing required packages: streamlit, google-generativeai, python-dotenv, langchain, PyPDF2, faiss-cpu, and langchain_google_genai. |

Sequence Diagram(s)

sequenceDiagram
    participant U as User
    participant S as Streamlit App
    participant P as PDF Processor
    participant F as FAISS VectorStore
    participant Q as Conversational Chain

    U->>S: Upload PDF files
    S->>P: get_pdf_text()
    P-->>S: Return extracted text
    S->>P: get_text_chunks(text)
    P-->>S: Return text chunks
    S->>F: get_vector_store(text_chunks)
    F-->>S: Vector store created
    U->>S: Ask a question
    S->>F: user_input(question)
    F-->>S: Return similar docs
    S->>Q: get_conversational_chain()
    Q-->>S: Return generated answer
    S->>U: Display answer

Poem

I'm a hopping rabbit in a code-filled glen,
Turning pages and texts again and again.
With chunks and vectors, and an AI so keen,
I navigate PDFs in a digital scene.
Each line of code is a carrot so sweet—
Hopping with joy in every byte I meet! 🐇


🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.


github-actions bot commented Mar 8, 2025

👋 Thank you for opening this pull request! We're excited to review your contribution. Please give us a moment, and we'll get back to you shortly!

Feel free to join our community on Discord to discuss more!

@coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (15)
Generative-AI/PDF Wizard/.env (1)

1-1: Remove trailing semicolon for consistent formatting.

Typically, .env files follow the pattern KEY=VALUE without a semicolon at the end; consider removing it.

-GOOGLE_API_KEY = "Your_GEMIN_API_KEY";
+GOOGLE_API_KEY="Your_GEMIN_API_KEY"
Generative-AI/PDF Wizard/requirements.txt (1)

1-7: Consider pinning or bounding versions of these dependencies.

Pinning specific versions helps ensure consistent installation across environments and prevents unexpected issues caused by major version updates.

-streamlit
-google-generativeai
-python-dotenv
-langchain
-PyPDF2
-faiss-cpu
-langchain_google_genai
+streamlit==1.24.0
+google-generativeai==0.7.3
+python-dotenv==1.0.0
+langchain==0.0.223
+PyPDF2==3.1.1
+faiss-cpu==1.7.4
+langchain_google_genai==0.0.4
Generative-AI/PDF Wizard/readme_pdf_wiz.md (4)

4-4: Switch to atx-style headings for consistency and compliance with Markdown linting.

The setext-style headings (underscores or hyphens) trigger markdownlint warnings. Converting them to ## Heading or a suitable ATX level is recommended.

-📌 Overview
+## Overview

-🚀 Features
+## Features

-🛠️ Tech Stack
+## Tech Stack

-📦 Installation
+## Installation

-🎯 Usage
+## Usage

-📂 ScreenShots
+## ScreenShots

-📂 Project Structure
+## Project Structure

-🌟 Acknowledgments
+## Acknowledgments

-👤 Contributor 
+## Contributor

Also applies to: 9-9, 19-19, 31-31, 56-56, 66-66, 70-70, 81-81, 88-88

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

4-4: Heading style
Expected: atx; Actual: setext

(MD003, heading-style)


34-34: Specify a language for fenced code blocks.

Adding a language identifier (e.g., bash, python) improves syntax highlighting and readability.

-```
+```bash
 git clone https://github.com/UTSAVS26/PyVerse.git
 cd Generative-AI
 cd PDF-Wizard

-```
+```bash
 python -m venv venv

-```
+```bash
 source venv/bin/activate # For macOS/Linux
 venv\Scripts\activate # For Windows

-```
+```bash
 PDF-Wizard
 │-- faiss_index/
 ...

Also applies to: 40-40, 43-43, 72-72

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

34-34: Fenced code blocks should have a language specified
null

(MD040, fenced-code-language)

90-92: Use asterisk for unordered lists to match recommended style guidelines.

Markdown linting suggests * instead of - for unordered lists.
- - **Name:** Arnab Ghosh
- - **GitHub:** [tulug-559](https://github.com/tulu-g559)
- - **Contact:** [email]([email protected])
+ * **Name:** Arnab Ghosh
+ * **GitHub:** [tulug-559](https://github.com/tulu-g559)
+ * **Contact:** [email]([email protected])
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

90-90: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


91-91: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


92-92: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


92-92: Fix the typo in the contributor email address.

Consider correcting "[email protected]" to "[email protected]".

- **Contact:** [email]([email protected])
+ **Contact:** [email]([email protected])
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

92-92: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)

Generative-AI/PDF Wizard/app.py (5)

15-15: Remove unused os.getenv("GOOGLE_API_KEY") call.

This call does nothing with the returned value. Consider removing or assigning it if needed for validation/logging.

-os.getenv("GOOGLE_API_KEY")

19-19: Correct minor spelling/grammar in the comment.

Change "reads the pdd" to "reads the PDF".

-##Function that reads the pdd goes through each and every page
+## Function that reads the PDF and processes every page

30-34: Validate large chunk size to avoid memory overhead.

A chunk_size of 10000 may lead to excessive memory usage for large PDFs. Consider testing smaller sizes to balance performance and resource usage.

-def get_text_chunks(text):
-    text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000, chunk_overlap=1000)
+def get_text_chunks(text, chunk_size=2000, chunk_overlap=200):
+    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

20-26: Add docstrings and robust error handling.

The helper functions (get_pdf_text, get_text_chunks, get_vector_store, get_conversational_chain, user_input, and main) lack docstrings and might not handle edge cases such as invalid PDFs, empty documents, or missing environment variables. Consider adding docstrings that explain parameters, return values, and potential errors, plus relevant try/except blocks or validations where appropriate.

+# Example docstring snippet:
+def get_pdf_text(pdf_docs):
+    """
+    Extracts all text from the list of uploaded PDF files.
+    
+    :param pdf_docs: A list of PDF files.
+    :return: A concatenated string of text from all pages of the PDFs.
+    :raises ValueError: If no PDF files are provided or if any file is invalid.
+    """
     text = ""
     for pdf in pdf_docs:
         ...

Also applies to: 30-35, 39-49, 52-68, 70-92, 94-117


90-91: Remove or toggle off print statements in production code.

Consider using Streamlit logs or a dedicated logger instead of raw print statements for a more controlled logging approach.

-    print(response)
+    # st.write(response)  # or consider a logging system
Generative-AI/PDF Wizard/faiss_index/app.py (4)

33-37: Make chunk size and overlap configurable parameters

The chunk size and overlap values are hardcoded, which reduces flexibility. Consider making them configurable parameters with defaults.

-def get_text_chunks(text):
-    # Adjust chunk size and overlap as needed
-    text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000, chunk_overlap=1000)
+def get_text_chunks(text, chunk_size=10000, chunk_overlap=1000):
+    """Split text into chunks with specified size and overlap.
+    
+    Args:
+        text: The text to split
+        chunk_size: Size of each chunk (default: 10000)
+        chunk_overlap: Overlap between chunks (default: 1000)
+    
+    Returns:
+        List of text chunks
+    """
+    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
     chunks = text_splitter.split_text(text)
     return chunks

59-74: Extract prompt template and model parameters for better maintainability

The prompt template and model parameters are hardcoded in the function. Consider extracting them for better maintainability and configurability.

-def get_conversational_chain():
+def get_conversational_chain(model_name="gemini-pro", temperature=0.3):
+    """Create a conversational chain for question answering.
+    
+    Args:
+        model_name: Name of the LLM model to use
+        temperature: Temperature parameter for the model
+        
+    Returns:
+        A question answering chain
+    """
+    # Define the prompt template outside the function or load from a file
+    prompt_template = """
+    Answer the question as detailed as possible from the provided context, make sure to provide all the details, if the answer is not in
+    provided context just say, "answer is not available in the context", don't provide the wrong answer\n\n
+    Context:\n {context}?\n
+    Question: \n{question}\n
 
-    prompt_template = """
-    Answer the question as detailed as possible from the provided context, make sure to provide all the details, if the answer is not in
-    provided context just say, "answer is not available in the context", don't provide the wrong answer\n\n
-    Context:\n {context}?\n
-    Question: \n{question}\n
-
-    Answer:
-    """
-    model = ChatGoogleGenerativeAI(model="gemini-pro",temperature=0.3)
+    Answer:
+    """
+    
+    try:
+        model = ChatGoogleGenerativeAI(model=model_name, temperature=temperature)
+        prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
+        chain = load_qa_chain(model, chain_type="stuff", prompt=prompt)
+        return chain
+    except Exception as e:
+        st.error(f"Error creating conversation chain: {str(e)}")
+        return None
-    prompt = PromptTemplate(template = prompt_template, input_variables = ["context", "question"])
-    chain = load_qa_chain(model, chain_type="stuff", prompt=prompt)
-
-    return chain

107-125: Improve state management and user feedback in the main function

The main function lacks proper state management when users upload new PDFs after asking questions. Also, there's no loading state when processing user questions unlike PDF processing.

def main():
-    st.set_page_config("PDF Wizard")
+    st.set_page_config(page_title="PDF Wizard", page_icon="📄")
     st.header("Chat with multiple PDFs📄")
+    
+    # Initialize session state variables if they don't exist
+    if 'processed_pdfs' not in st.session_state:
+        st.session_state.processed_pdfs = False

     user_question = st.text_input("📎Ask a Question from the PDF Files")

     if user_question:
+        if not st.session_state.processed_pdfs:
+            st.warning("Please upload and process PDF files first")
+            return
         user_input(user_question)

     with st.sidebar:
         st.title("Menu:")
         pdf_docs = st.file_uploader("Upload your PDF Files and Click on the Submit & Process Button", accept_multiple_files=True)
         if st.button("Submit & Process"):
-            with st.spinner("Processing..."):
-                raw_text = get_pdf_text(pdf_docs)
-                text_chunks = get_text_chunks(raw_text)
-                get_vector_store(text_chunks)
-                st.success("Done")
+            if not pdf_docs:
+                st.error("Please upload at least one PDF file")
+            else:
+                with st.spinner("Processing PDFs..."):
+                    raw_text = get_pdf_text(pdf_docs)
+                    if raw_text:
+                        text_chunks = get_text_chunks(raw_text)
+                        get_vector_store(text_chunks)
+                        st.session_state.processed_pdfs = True
+                        st.success("PDF processing complete! You can now ask questions.")
+                    else:
+                        st.error("Could not extract text from the uploaded PDFs")

1-12: Add proper documentation and organize imports

The file lacks proper module-level documentation explaining its purpose and usage. Also, imports could be better organized by grouping standard library imports, third-party library imports, and local imports.

+"""
+PDF Wizard - A Streamlit application for interacting with PDF documents.
+
+This application allows users to upload multiple PDF files, which are processed to extract text 
+and convert it into vector embeddings using Google Generative AI. Users can then ask questions 
+about the content of the PDFs and receive accurate answers based on the content.
+
+Author: Arnab Ghosh ([email protected])
+"""
+
+# Standard library imports
+import os
+
+# Third-party imports
 import streamlit as st
 from PyPDF2 import PdfReader
 from langchain.text_splitter import RecursiveCharacterTextSplitter
-import os
 from langchain_google_genai import GoogleGenerativeAIEmbeddings
 import google.generativeai as genai
 from langchain.vectorstores import FAISS
 from langchain_google_genai import ChatGoogleGenerativeAI
 from langchain.chains.question_answering import load_qa_chain
 from langchain.prompts import PromptTemplate
 from dotenv import load_dotenv
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 09451f5 and a68606c.

⛔ Files ignored due to path filters (2)
  • Generative-AI/PDF Wizard/faiss_index/index.pkl is excluded by !**/*.pkl
  • Generative-AI/PDF Wizard/image.png is excluded by !**/*.png
📒 Files selected for processing (5)
  • Generative-AI/PDF Wizard/.env (1 hunks)
  • Generative-AI/PDF Wizard/app.py (1 hunks)
  • Generative-AI/PDF Wizard/faiss_index/app.py (1 hunks)
  • Generative-AI/PDF Wizard/readme_pdf_wiz.md (1 hunks)
  • Generative-AI/PDF Wizard/requirements.txt (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
Generative-AI/PDF Wizard/readme_pdf_wiz.md

4-4: Heading style
Expected: atx; Actual: setext

(MD003, heading-style)


9-9: Heading style
Expected: atx; Actual: setext

(MD003, heading-style)


19-19: Heading style
Expected: atx; Actual: setext

(MD003, heading-style)


31-31: Heading style
Expected: atx; Actual: setext

(MD003, heading-style)


34-34: Fenced code blocks should have a language specified
null

(MD040, fenced-code-language)


40-40: Fenced code blocks should have a language specified
null

(MD040, fenced-code-language)


43-43: Fenced code blocks should have a language specified
null

(MD040, fenced-code-language)


56-56: Heading style
Expected: atx; Actual: setext

(MD003, heading-style)


66-66: Heading style
Expected: atx; Actual: setext

(MD003, heading-style)


70-70: Heading style
Expected: atx; Actual: setext

(MD003, heading-style)


72-72: Fenced code blocks should have a language specified
null

(MD040, fenced-code-language)


81-81: Heading style
Expected: atx; Actual: setext

(MD003, heading-style)


88-88: Heading style
Expected: atx; Actual: setext

(MD003, heading-style)


90-90: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


91-91: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)


92-92: Unordered list style
Expected: asterisk; Actual: dash

(MD004, ul-style)

@tulu-g559 (Contributor, Author) commented:

@UTSAVS26

Please check the PR:

  • Assign it to me
  • Add the JWoC and a Hard label
  • Merge if OK; that would help me

@coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (4)
Generative-AI/PDF Wizard/app.py (4)

21-21: Fix typographical error in comment.
"pdd" should be "PDF" for clarity and consistency.

-##Function that reads the pdd goes through each and every page
+##Function that reads the PDF and processes each page

42-43: Assess large chunk size and overlap for performance.
A chunk size of 10,000 with an overlap of 1,000 may cause high memory usage and slower processing for large PDFs. Consider testing smaller values to optimize performance and memory usage.
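A quick way to act on this suggestion, as a sketch: sweep a few candidate sizes and compare chunk counts and split time. The sample text source and the candidate sizes here are assumptions, not part of the PR.

```python
# Illustrative sweep over chunk sizes to see how chunk count and split time
# scale; "extracted_text" stands in for the text returned by get_pdf_text().
import time
from langchain.text_splitter import RecursiveCharacterTextSplitter

extracted_text = "lorem ipsum " * 50_000  # stand-in for real PDF text

for chunk_size in (1_000, 2_000, 5_000, 10_000):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_size // 10
    )
    start = time.perf_counter()
    chunks = splitter.split_text(extracted_text)
    print(f"chunk_size={chunk_size}: {len(chunks)} chunks "
          f"in {time.perf_counter() - start:.3f}s")
```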


49-76: Consider storage alternatives or concurrency handling.
Storing the FAISS index on the local filesystem (“faiss_index”) can be sufficient for small-scale demos. For production usage, explore concurrency-safe or distributed storage mechanisms to enable faster parallel access and avoid file contention or potential corruption under heavy loads.


122-140: Enhance user feedback and validation for PDF uploads.
In the main function, consider validating PDF uploads or displaying specific error messages for invalid or empty uploads. This helps guide users better, especially if they accidentally upload non-PDF files or zero-page documents.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a68606c and 8c76b2e.

📒 Files selected for processing (1)
  • Generative-AI/PDF Wizard/app.py (1 hunks)

On Mar 9, 2025, @UTSAVS26 added the following labels: Contributor (denotes issues or PRs submitted by contributors to acknowledge their participation), Status: Requested Changes⚙️ (indicates that a reviewer has requested changes to a PR before it can be approved), ADVANCE, and JWOC.
@UTSAVS26 (Owner) commented Mar 9, 2025

@tulu-g559 please make the necessary changes

@coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (5)
Generative-AI/PDF Wizard/app.py (5)

21-37: Consider removing or fully enabling the commented-out exception handling block.
The code that reads PDF pages is duplicated in commented form (lines 31-36). If you need robust error handling for file reading, wrap the existing PdfReader logic in a try-except. Otherwise, cleaning up the commented code helps maintain clarity.

 for pdf in pdf_docs:
     pdf_reader = PdfReader(pdf)
     for page in pdf_reader.pages:
         text += page.extract_text()
-    # Optionally wrap the above block in try-except if needed:
-    # try:
-    #     pdf_reader = PdfReader(pdf)
-    #     for page in pdf_reader.pages:
-    #         text += page.extract_text()
-    # except Exception as e:
-    #     st.error(f"Error reading PDF '{pdf.name}': {str(e)}")

40-46: Offer user-configurable chunk sizes.
Hardcoding chunk_size=10000 and chunk_overlap=1000 might cause large memory usage for very large PDFs. Consider making these parameters configurable via Streamlit widgets or constants so users can tune performance.


50-77: Review concurrency risks and indexing strategy.
When multiple users process PDFs simultaneously, saving the FAISS index to the same directory could lead to race conditions or index corruption. If multi-user support is expected, consider adding locking mechanisms or storing each user’s index separately.
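One hedged way to realize the per-user index idea, assuming Streamlit's session state is available; the helper name and directory layout are illustrative, not part of the reviewed code:

```python
# Hypothetical per-session index directory to avoid cross-user contention
# on a shared "faiss_index" folder; not part of the reviewed code.
import os
import uuid
import streamlit as st

def session_index_dir(base="faiss_index"):
    # Allocate one stable, unique subdirectory per Streamlit session.
    if "index_id" not in st.session_state:
        st.session_state.index_id = uuid.uuid4().hex
    path = os.path.join(base, st.session_state.index_id)
    os.makedirs(path, exist_ok=True)
    return path

# Usage: vector_store.save_local(session_index_dir())
# and FAISS.load_local(session_index_dir(), embeddings, ...)
```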


98-120: Validate local index usage before Q&A.
Good job verifying the existence of the FAISS index file. However, if multiple users frequently re-upload PDFs, you could end up with outdated or partial data. Consider showing a timestamp or version of the index to help users confirm correctness.
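Surfacing the index's freshness could be as simple as the following sketch; the caption placement and formatting are assumptions:

```python
# Illustrative freshness indicator based on the index file's mtime.
import os
import datetime
import streamlit as st

index_file = os.path.join("faiss_index", "index.faiss")
if os.path.exists(index_file):
    built = datetime.datetime.fromtimestamp(os.path.getmtime(index_file))
    st.caption(f"Index last built: {built:%Y-%m-%d %H:%M:%S}")
```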


122-141: Consider adding automated tests for PDF processing and vector indexing.
Implementing unit and integration tests (e.g., a small test PDF) would ensure that PDF reading, chunking, vector storage, and Q&A flow all work correctly. This helps catch regressions early.
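As a starting point for such tests, a minimal pytest sketch of the chunking step might look like this; the synthetic input is an assumption, and a small fixture PDF would extend coverage to the reading step:

```python
# Minimal illustrative pytest for the chunking helper's contract;
# the synthetic input text is a stand-in, not a real PDF extraction.
from langchain.text_splitter import RecursiveCharacterTextSplitter

def test_chunks_respect_size_limit():
    text = "x" * 25_000
    splitter = RecursiveCharacterTextSplitter(chunk_size=10_000, chunk_overlap=1_000)
    chunks = splitter.split_text(text)
    assert chunks, "splitter should produce at least one chunk"
    assert all(len(c) <= 10_000 for c in chunks)
    # With a 9,000-character effective stride, 25,000 characters need >= 3 chunks.
    assert len(chunks) >= 3
```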

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8c76b2e and d28b2e2.

📒 Files selected for processing (1)
  • Generative-AI/PDF Wizard/app.py (1 hunks)
🔇 Additional comments (1)
Generative-AI/PDF Wizard/app.py (1)

14-18: Good practice checking for missing environment variables.
Using raise ValueError is a proper way to fail fast when GOOGLE_API_KEY is not found. This helps ensure the application doesn’t run in a misconfigured state.
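For readers without the diff in view, the fail-fast pattern being praised looks roughly like this (reconstructed from the review context; the exact message in app.py may differ):

```python
# Reconstructed fail-fast environment check; wording is approximate.
import os
from dotenv import load_dotenv
import google.generativeai as genai

load_dotenv()
api_key = os.getenv("GOOGLE_API_KEY")
if not api_key:
    raise ValueError("GOOGLE_API_KEY not found. Add it to your .env file.")
genai.configure(api_key=api_key)
```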

@tulu-g559 (Contributor, Author) commented:

@UTSAVS26
Done!! 🚀
Please label the PR as Hard.

@coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
Generative-AI/PDF Wizard/faiss_index/app.py (3)

33-37: Improve text chunking with parameters and validation

The current implementation uses hardcoded values for chunk size and overlap. Consider making these configurable parameters with defaults, and add validation for empty input and reasonable chunk sizes.

-def get_text_chunks(text):
-    # Adjust chunk size and overlap as needed
-    text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000, chunk_overlap=1000)
-    chunks = text_splitter.split_text(text)
-    return chunks
+def get_text_chunks(text, chunk_size=10000, chunk_overlap=1000):
+    """Split text into chunks using RecursiveCharacterTextSplitter.
+    
+    Args:
+        text: The text to split
+        chunk_size: Size of each chunk
+        chunk_overlap: Overlap between chunks
+        
+    Returns:
+        List of text chunks
+    """
+    if not text:
+        return []
+    
+    # Validate parameters
+    if chunk_size <= 0:
+        raise ValueError("Chunk size must be positive")
+    if chunk_overlap < 0 or chunk_overlap >= chunk_size:
+        raise ValueError("Chunk overlap must be non-negative and less than chunk size")
+    
+    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
+    chunks = text_splitter.split_text(text)
+    return chunks

59-74: Add error handling and model validation to conversational chain

The current implementation lacks error handling for API issues or model configuration. Consider adding validation and error handling to ensure robustness.

-def get_conversational_chain():
+def get_conversational_chain(model_name="gemini-1.5-flash", temperature=0.3):
+    """Create a conversation chain for question answering.
+    
+    Args:
+        model_name: Name of the LLM model to use
+        temperature: Temperature setting for the model
+        
+    Returns:
+        QA chain or None if an error occurs
+    """
     prompt_template = """
     Answer the question as detailed as possible from the provided context, make sure to provide all the details, if the answer is not in
     provided context just say, "answer is not available in the context", don't provide the wrong answer\n\n
     Context:\n {context}?\n
     Question: \n{question}\n

     Answer:
     """
-    model = ChatGoogleGenerativeAI(model="gemini-1.5-flash",temperature=0.3)
-
-    prompt = PromptTemplate(template = prompt_template, input_variables = ["context", "question"])
-    chain = load_qa_chain(model, chain_type="stuff", prompt=prompt)
-
-    return chain
+    try:
+        model = ChatGoogleGenerativeAI(model=model_name, temperature=temperature)
+        prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
+        chain = load_qa_chain(model, chain_type="stuff", prompt=prompt)
+        return chain
+    except Exception as e:
+        st.error(f"Error creating conversation chain: {str(e)}")
+        return None

107-125: Improve user experience and error handling in the main function

The main function doesn't provide adequate feedback if a user tries to ask a question without first uploading and processing PDFs, and lacks error handling for the PDF processing pipeline.

-def main():
-    st.set_page_config("PDF Wizard")
-    st.header("Chat with multiple PDFs📄")
-
-    user_question = st.text_input("📎Ask a Question from the PDF Files")
-
-    if user_question:
-        user_input(user_question)
-
-    with st.sidebar:
-        st.title("Menu:")
-        pdf_docs = st.file_uploader("Upload your PDF Files and Click on the Submit & Process Button", accept_multiple_files=True)
-        if st.button("Submit & Process"):
-            with st.spinner("Processing..."):
-                raw_text = get_pdf_text(pdf_docs)
-                text_chunks = get_text_chunks(raw_text)
-                get_vector_store(text_chunks)
-                st.success("Done")
+def main():
+    st.set_page_config(page_title="PDF Wizard", page_icon="📄")
+    st.header("Chat with multiple PDFs📄")
+    
+    # Create session state variables if they don't exist
+    if 'pdfs_processed' not in st.session_state:
+        st.session_state.pdfs_processed = False
+    if 'pdf_count' not in st.session_state:
+        st.session_state.pdf_count = 0
+    
+    # Main area for questions and answers
+    user_question = st.text_input("📎Ask a Question from the PDF Files")
+    
+    if user_question:
+        if not st.session_state.pdfs_processed:
+            st.warning("Please upload and process PDFs before asking questions.")
+        else:
+            user_input(user_question)
+    
+    # Sidebar for PDF upload and processing
+    with st.sidebar:
+        st.title("Menu:")
+        pdf_docs = st.file_uploader("Upload your PDF Files and Click on the Submit & Process Button", 
+                                   accept_multiple_files=True)
+        
+        process_button = st.button("Submit & Process")
+        
+        if process_button:
+            if not pdf_docs:
+                st.error("Please upload at least one PDF file.")
+            else:
+                with st.spinner(f"Processing {len(pdf_docs)} PDFs..."):
+                    try:
+                        # Process PDFs
+                        raw_text = get_pdf_text(pdf_docs)
+                        if not raw_text:
+                            st.error("No text could be extracted from the PDFs.")
+                        else:
+                            text_chunks = get_text_chunks(raw_text)
+                            get_vector_store(text_chunks)
+                            
+                            # Update session state
+                            st.session_state.pdfs_processed = True
+                            st.session_state.pdf_count = len(pdf_docs)
+                            
+                            st.success(f"Successfully processed {len(pdf_docs)} PDFs with {len(text_chunks)} text chunks.")
+                    except Exception as e:
+                        st.error(f"Error processing PDFs: {str(e)}")
+        
+        # Show processing status
+        if st.session_state.pdfs_processed:
+            st.success(f"{st.session_state.pdf_count} PDFs processed and ready for queries.")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d28b2e2 and 5e2827a.

📒 Files selected for processing (2)
  • Generative-AI/PDF Wizard/app.py (1 hunks)
  • Generative-AI/PDF Wizard/faiss_index/app.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • Generative-AI/PDF Wizard/app.py
🔇 Additional comments (4)
Generative-AI/PDF Wizard/faiss_index/app.py (4)

14-16: Add error handling for the API key retrieval

Line 15 retrieves the API key but doesn't store the result, making it redundant. Additionally, there's no validation to ensure the API key exists and is valid before configuring the GenAI client.

load_dotenv()
-os.getenv("GOOGLE_API_KEY")
-genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))
+api_key = os.getenv("GOOGLE_API_KEY")
+if not api_key:
+    raise ValueError("GOOGLE_API_KEY environment variable is missing. Please add it to your .env file.")
+genai.configure(api_key=api_key)

22-28: Add error handling for PDF operations

The function lacks error handling for PDF reading operations which could fail due to corrupted files, password-protected PDFs, or other issues. This might cause the application to crash with unhelpful error messages.

def get_pdf_text(pdf_docs):
+    if not pdf_docs:
+        return ""
     text=""
     for pdf in pdf_docs:
-        pdf_reader= PdfReader(pdf)
-        for page in pdf_reader.pages:
-            text+= page.extract_text()
+        try:
+            pdf_reader = PdfReader(pdf)
+            for page in pdf_reader.pages:
+                text += page.extract_text()
+        except Exception as e:
+            st.error(f"Error reading PDF '{pdf.name}': {str(e)}")
     return  text

44-54: Extract hardcoded values and add error handling to the vector store creation

The function uses hardcoded values for the embedding model and storage location. It also lacks error handling for the embedding and storage operations.

-def get_vector_store(text_chunks):
-    embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
-    vector_store = FAISS.from_texts(text_chunks, embedding=embeddings)
-    
-    # Ensure the directory exists
-    if not os.path.exists("faiss_index"):
-        os.makedirs("faiss_index")
-    
-    # Save the vector store index in the directory
-    vector_store.save_local("faiss_index")
+def get_vector_store(text_chunks, embedding_model="models/embedding-001", store_dir="faiss_index"):
+    """Create and save vector store from text chunks.
+    
+    Args:
+        text_chunks: List of text chunks to embed
+        embedding_model: Name of the embedding model to use
+        store_dir: Directory to save the vector store
+    
+    Returns:
+        None
+    """
+    if not text_chunks:
+        st.warning("No text to process. Please check the PDF content.")
+        return
+        
+    try:
+        embeddings = GoogleGenerativeAIEmbeddings(model=embedding_model)
+        vector_store = FAISS.from_texts(text_chunks, embedding=embeddings)
+        
+        # Ensure the directory exists
+        if not os.path.exists(store_dir):
+            os.makedirs(store_dir)
+        
+        # Save the vector store index in the directory
+        vector_store.save_local(store_dir)
+    except Exception as e:
+        st.error(f"Error creating vector store: {str(e)}")

82-102: Remove debugging print statement and address security concern

The function has a debugging print statement and uses allow_dangerous_deserialization=True without explaining the security implications. Also, the embedding model is duplicated from an earlier function.

-def user_input(user_question):
-    embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
+def user_input(user_question, embedding_model="models/embedding-001", store_dir="faiss_index"):
+    """Process user question and generate a response.
+    
+    Args:
+        user_question: The user's question
+        embedding_model: Name of the embedding model to use
+        store_dir: Directory where the vector store is saved
+        
+    Returns:
+        None
+    """
+    if not user_question.strip():
+        return
+        
+    try:
+        embeddings = GoogleGenerativeAIEmbeddings(model=embedding_model)
     
-    # Check if the faiss_index file exists before loading
-    if not os.path.exists("faiss_index/index.faiss"):
-        st.error("FAISS index file not found. Please process the PDF files first.")
-        return
+        # Check if the index file exists before loading
+        index_path = f"{store_dir}/index.faiss"
+        if not os.path.exists(index_path):
+            st.error("FAISS index file not found. Please process the PDF files first.")
+            return
     
-    new_db = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)
-    docs = new_db.similarity_search(user_question)
+        # Note about the security parameter:
+        # allow_dangerous_deserialization=True is required to load FAISS indexes
+        # but should be used with caution in production environments with untrusted data
+        new_db = FAISS.load_local(store_dir, embeddings, allow_dangerous_deserialization=True)
+        docs = new_db.similarity_search(user_question)
 
-    chain = get_conversational_chain()
+        chain = get_conversational_chain()
+        if not chain:
+            return
     
-    response = chain(
-        {"input_documents": docs, "question": user_question},
-        return_only_outputs=True
-    )
+        with st.spinner("Generating response..."):
+            response = chain(
+                {"input_documents": docs, "question": user_question},
+                return_only_outputs=True
+            )
+            st.write("Reply: ", response["output_text"])
+    except Exception as e:
+        st.error(f"Error processing question: {str(e)}")

-    print(response)
-    st.write("Reply: ", response["output_text"])

@UTSAVS26 merged commit fb79946 into UTSAVS26:main on Mar 9, 2025 (1 check passed).

On Mar 9, 2025, @UTSAVS26 added the Status: Approved ✔️ (PRs that have passed review and are approved for merging) and Hard labels, and removed the Status: Requested Changes⚙️ and ADVANCE labels.
@tulu-g559 (Contributor, Author) commented:

@UTSAVS26
Thanks!
