Added PDF Wizard #1186
Conversation
Walkthrough

The pull request introduces a new environment variable (

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant U as User
    participant S as Streamlit App
    participant P as PDF Processor
    participant F as FAISS VectorStore
    participant Q as Conversational Chain
    U->>S: Upload PDF files
    S->>P: get_pdf_text()
    P-->>S: Return extracted text
    S->>P: get_text_chunks(text)
    P-->>S: Return text chunks
    S->>F: get_vector_store(text_chunks)
    F-->>S: Vector store created
    U->>S: Ask a question
    S->>F: user_input(question)
    F-->>S: Return similar docs
    S->>Q: get_conversational_chain()
    Q-->>S: Return generated answer
    S->>U: Display answer
```
👋 Thank you for opening this pull request! We're excited to review your contribution. Please give us a moment, and we'll get back to you shortly! Feel free to join our community on Discord to discuss more!
Actionable comments posted: 4
🧹 Nitpick comments (15)
Generative-AI/PDF Wizard/.env (1)

1-1: Remove trailing semicolon for consistent formatting.

Typically, `.env` files follow the pattern `KEY=VALUE` without a semicolon at the end; consider removing it.

```diff
-GOOGLE_API_KEY = "Your_GEMIN_API_KEY";
+GOOGLE_API_KEY="Your_GEMIN_API_KEY"
```

Generative-AI/PDF Wizard/requirements.txt (1)
1-7: Consider pinning or bounding versions of these dependencies.

Pinning specific versions helps ensure consistent installation across environments and prevents unexpected issues caused by major version updates.

```diff
-streamlit
-google-generativeai
-python-dotenv
-langchain
-PyPDF2
-faiss-cpu
-langchain_google_genai
+streamlit==1.24.0
+google-generativeai==0.7.3
+python-dotenv==1.0.0
+langchain==0.0.223
+PyPDF2==3.1.1
+faiss-cpu==1.7.4
+langchain_google_genai==0.0.4
```

Generative-AI/PDF Wizard/readme_pdf_wiz.md (4)
4-4: Switch to atx-style headings for consistency and compliance with Markdown linting.

The setext-style headings (underscores or hyphens) trigger markdownlint warnings. Converting them to `## Heading` or a suitable ATX level is recommended.

```diff
-📌 Overview
+## Overview
-🚀 Features
+## Features
-🛠️ Tech Stack
+## Tech Stack
-📦 Installation
+## Installation
-🎯 Usage
+## Usage
-📂 ScreenShots
+## ScreenShots
-📂 Project Structure
+## Project Structure
-🌟 Acknowledgments
+## Acknowledgments
-👤 Contributor
+## Contributor
```

Also applies to: 9-9, 19-19, 31-31, 56-56, 66-66, 70-70, 81-81, 88-88
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)
4-4: Heading style
Expected: atx; Actual: setext
(MD003, heading-style)
34-34: Specify a language for fenced code blocks.

Adding a language identifier (e.g., `bash`, `python`) improves syntax highlighting and readability.

````diff
-```
+```bash
 git clone https://github.com/UTSAVS26/PyVerse.git
 cd Generative-AI
 cd PDF-Wizard
````

````diff
-```
+```bash
 python -m venv venv
````

````diff
-```
+```bash
 source venv/bin/activate   # For macOS/Linux
 venv\Scripts\activate      # For Windows
````

````diff
-```
+```bash
 PDF-Wizard
 │-- faiss_index/
 ...
````

Also applies to: 40-40, 43-43, 72-72

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

34-34: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)

90-92: Use asterisk for unordered lists to match recommended style guidelines.

Markdown linting suggests `*` instead of `-` for unordered lists.

```diff
- - **Name:** Arnab Ghosh
- - **GitHub:** [tulug-559](https://github.com/tulu-g559)
- - **Contact:** [email]([email protected])
+ * **Name:** Arnab Ghosh
+ * **GitHub:** [tulug-559](https://github.com/tulu-g559)
+ * **Contact:** [email]([email protected])
```
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)
90-90: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)

91-91: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)

92-92: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
92-92: Fix the typo in the contributor email address.

Consider correcting "[email protected]" to "[email protected]".

```diff
- **Contact:** [email]([email protected])
+ **Contact:** [email]([email protected])
```

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)
92-92: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
Generative-AI/PDF Wizard/app.py (5)
15-15: Remove unused `os.getenv("GOOGLE_API_KEY")` call.

This call does nothing with the returned value. Consider removing or assigning it if needed for validation/logging.

```diff
-os.getenv("GOOGLE_API_KEY")
```
19-19: Correct minor spelling/grammar in the comment.

Change "reads the pdd" to "reads the PDF".

```diff
-##Function that reads the pdd goes through each and every page
+## Function that reads the PDF and processes every page
```
30-34: Validate large chunk size to avoid memory overhead.

A `chunk_size` of 10000 may lead to excessive memory usage for large PDFs. Consider testing smaller sizes to balance performance and resource usage.

```diff
-def get_text_chunks(text):
-    text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000, chunk_overlap=1000)
+def get_text_chunks(text, chunk_size=2000, chunk_overlap=200):
+    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
```
20-26: Add docstrings and robust error handling.

The helper functions (`get_pdf_text`, `get_text_chunks`, `get_vector_store`, `get_conversational_chain`, `user_input`, and `main`) lack docstrings and might not handle edge cases such as invalid PDFs, empty documents, or missing environment variables. Consider adding docstrings that explain parameters, return values, and potential errors, plus relevant try/except blocks or validations where appropriate.

```diff
+# Example docstring snippet:
 def get_pdf_text(pdf_docs):
+    """
+    Extracts all text from the list of uploaded PDF files.
+
+    :param pdf_docs: A list of PDF files.
+    :return: A concatenated string of text from all pages of the PDFs.
+    :raises ValueError: If no PDF files are provided or if any file is invalid.
+    """
     text = ""
     for pdf in pdf_docs:
         ...
```

Also applies to: 30-35, 39-49, 52-68, 70-92, 94-117
90-91: Remove or toggle off print statements in production code.

Consider using Streamlit logs or a dedicated logger instead of raw print statements for a more controlled logging approach.

```diff
-    print(response)
+    # st.write(response)  # or consider a logging system
```

Generative-AI/PDF Wizard/faiss_index/app.py (4)
33-37: Make chunk size and overlap configurable parameters

The chunk size and overlap values are hardcoded, which reduces flexibility. Consider making them configurable parameters with defaults.

```diff
-def get_text_chunks(text):
-    # Adjust chunk size and overlap as needed
-    text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000, chunk_overlap=1000)
+def get_text_chunks(text, chunk_size=10000, chunk_overlap=1000):
+    """Split text into chunks with specified size and overlap.
+
+    Args:
+        text: The text to split
+        chunk_size: Size of each chunk (default: 10000)
+        chunk_overlap: Overlap between chunks (default: 1000)
+
+    Returns:
+        List of text chunks
+    """
+    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
     chunks = text_splitter.split_text(text)
     return chunks
```
59-74: Extract prompt template and model parameters for better maintainability

The prompt template and model parameters are hardcoded in the function. Consider extracting them for better maintainability and configurability.

```diff
-def get_conversational_chain():
+def get_conversational_chain(model_name="gemini-pro", temperature=0.3):
+    """Create a conversational chain for question answering.
+
+    Args:
+        model_name: Name of the LLM model to use
+        temperature: Temperature parameter for the model
+
+    Returns:
+        A question answering chain
+    """
+    # Define the prompt template outside the function or load from a file
     prompt_template = """
     Answer the question as detailed as possible from the provided context, make sure to provide all the details, if the answer is not in
     provided context just say, "answer is not available in the context", don't provide the wrong answer\n\n
     Context:\n {context}?\n
     Question: \n{question}\n

     Answer:
     """
-    model = ChatGoogleGenerativeAI(model="gemini-pro",temperature=0.3)
-    prompt = PromptTemplate(template = prompt_template, input_variables = ["context", "question"])
-    chain = load_qa_chain(model, chain_type="stuff", prompt=prompt)
-    return chain
+    try:
+        model = ChatGoogleGenerativeAI(model=model_name, temperature=temperature)
+        prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
+        chain = load_qa_chain(model, chain_type="stuff", prompt=prompt)
+        return chain
+    except Exception as e:
+        st.error(f"Error creating conversation chain: {str(e)}")
+        return None
```
107-125: Improve state management and user feedback in the main function

The main function lacks proper state management when users upload new PDFs after asking questions. Also, there's no loading state when processing user questions unlike PDF processing.

```diff
 def main():
-    st.set_page_config("PDF Wizard")
+    st.set_page_config(page_title="PDF Wizard", page_icon="📄")
     st.header("Chat with multiple PDFs📄")
+
+    # Initialize session state variables if they don't exist
+    if 'processed_pdfs' not in st.session_state:
+        st.session_state.processed_pdfs = False

     user_question = st.text_input("📎Ask a Question from the PDF Files")

     if user_question:
+        if not st.session_state.processed_pdfs:
+            st.warning("Please upload and process PDF files first")
+            return
         user_input(user_question)

     with st.sidebar:
         st.title("Menu:")
         pdf_docs = st.file_uploader("Upload your PDF Files and Click on the Submit & Process Button", accept_multiple_files=True)
         if st.button("Submit & Process"):
-            with st.spinner("Processing..."):
-                raw_text = get_pdf_text(pdf_docs)
-                text_chunks = get_text_chunks(raw_text)
-                get_vector_store(text_chunks)
-                st.success("Done")
+            if not pdf_docs:
+                st.error("Please upload at least one PDF file")
+            else:
+                with st.spinner("Processing PDFs..."):
+                    raw_text = get_pdf_text(pdf_docs)
+                    if raw_text:
+                        text_chunks = get_text_chunks(raw_text)
+                        get_vector_store(text_chunks)
+                        st.session_state.processed_pdfs = True
+                        st.success("PDF processing complete! You can now ask questions.")
+                    else:
+                        st.error("Could not extract text from the uploaded PDFs")
```
1-12: Add proper documentation and organize imports

The file lacks proper module-level documentation explaining its purpose and usage. Also, imports could be better organized by grouping standard library imports, third-party library imports, and local imports.

```diff
+"""
+PDF Wizard - A Streamlit application for interacting with PDF documents.
+
+This application allows users to upload multiple PDF files, which are processed to extract text
+and convert it into vector embeddings using Google Generative AI. Users can then ask questions
+about the content of the PDFs and receive accurate answers based on the content.
+
+Author: Arnab Ghosh ([email protected])
+"""
+
+# Standard library imports
+import os
+
+# Third-party imports
 import streamlit as st
 from PyPDF2 import PdfReader
 from langchain.text_splitter import RecursiveCharacterTextSplitter
-import os
 from langchain_google_genai import GoogleGenerativeAIEmbeddings
 import google.generativeai as genai
 from langchain.vectorstores import FAISS
 from langchain_google_genai import ChatGoogleGenerativeAI
 from langchain.chains.question_answering import load_qa_chain
 from langchain.prompts import PromptTemplate
 from dotenv import load_dotenv
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
Generative-AI/PDF Wizard/faiss_index/index.pkl is excluded by `!**/*.pkl`
Generative-AI/PDF Wizard/image.png is excluded by `!**/*.png`
📒 Files selected for processing (5)
Generative-AI/PDF Wizard/.env (1 hunks)
Generative-AI/PDF Wizard/app.py (1 hunks)
Generative-AI/PDF Wizard/faiss_index/app.py (1 hunks)
Generative-AI/PDF Wizard/readme_pdf_wiz.md (1 hunks)
Generative-AI/PDF Wizard/requirements.txt (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
Generative-AI/PDF Wizard/readme_pdf_wiz.md
4-4: Heading style
Expected: atx; Actual: setext
(MD003, heading-style)
9-9: Heading style
Expected: atx; Actual: setext
(MD003, heading-style)
19-19: Heading style
Expected: atx; Actual: setext
(MD003, heading-style)
31-31: Heading style
Expected: atx; Actual: setext
(MD003, heading-style)
34-34: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
40-40: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
43-43: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
56-56: Heading style
Expected: atx; Actual: setext
(MD003, heading-style)
66-66: Heading style
Expected: atx; Actual: setext
(MD003, heading-style)
70-70: Heading style
Expected: atx; Actual: setext
(MD003, heading-style)
72-72: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
81-81: Heading style
Expected: atx; Actual: setext
(MD003, heading-style)
88-88: Heading style
Expected: atx; Actual: setext
(MD003, heading-style)
90-90: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
91-91: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
92-92: Unordered list style
Expected: asterisk; Actual: dash
(MD004, ul-style)
plz check the PR
Actionable comments posted: 2
🧹 Nitpick comments (4)
Generative-AI/PDF Wizard/app.py (4)
21-21: Fix typographical error in comment.

"pdd" should be "PDF" for clarity and consistency.

```diff
-##Function that reads the pdd goes through each and every page
+##Function that reads the PDF and processes each page
```
42-43: Assess large chunk size and overlap for performance.
A chunk size of 10,000 with an overlap of 1,000 may cause high memory usage and slower processing for large PDFs. Consider testing smaller values to optimize performance and memory usage.
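One concrete way to act on this is a quick offline experiment: split a sample of the extracted text at several (chunk_size, chunk_overlap) settings and compare how many chunks each produces before picking a default. A minimal sketch, using a naive sliding-window splitter as a stand-in for `RecursiveCharacterTextSplitter` (which requires langchain installed); the helper name `split_text` is illustrative, not from the PR:

```python
def split_text(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Naive sliding-window splitter: each chunk starts
    chunk_size - chunk_overlap characters after the previous one."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Stand-in for text extracted from a large PDF
sample = "x" * 50_000

for size, overlap in [(10_000, 1_000), (2_000, 200), (500, 50)]:
    chunks = split_text(sample, size, overlap)
    print(f"chunk_size={size:>6}  overlap={overlap:>5}  -> {len(chunks)} chunks")
```

Smaller chunks mean more embedding calls but finer-grained retrieval; the trade-off is best measured rather than assumed.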
49-76: Consider storage alternatives or concurrency handling.
Storing the FAISS index on the local filesystem (“faiss_index”) can be sufficient for small-scale demos. For production usage, explore concurrency-safe or distributed storage mechanisms to enable faster parallel access and avoid file contention or potential corruption under heavy loads.
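If local storage is kept, a low-effort mitigation is to give each session its own index directory and serialize writers with an exclusive lock file. A stdlib-only sketch under that assumption; the helper names (`session_index_dir`, `with_write_lock`) are illustrative, not part of the PR:

```python
import os
import time
import uuid


def session_index_dir(base: str = "faiss_index") -> str:
    """Per-session directory avoids two users overwriting the same index."""
    path = os.path.join(base, uuid.uuid4().hex)
    os.makedirs(path, exist_ok=True)
    return path


def with_write_lock(directory: str, write_fn, timeout: float = 10.0) -> None:
    """Serialize writers via an O_EXCL lock file; raises TimeoutError if busy.

    write_fn receives the directory, e.g. lambda d: vector_store.save_local(d).
    """
    lock_path = os.path.join(directory, ".lock")
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails atomically if the lock already exists
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"could not lock {directory}")
            time.sleep(0.1)
    try:
        write_fn(directory)
    finally:
        os.close(fd)
        os.remove(lock_path)
```

This does not replace a proper concurrent vector store for heavy loads, but it prevents two simultaneous `save_local` calls from corrupting one index file.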
122-140: Enhance user feedback and validation for PDF uploads.
In the main function, consider validating PDF uploads or displaying specific error messages for invalid or empty uploads. This helps guide users better, especially if they accidentally upload non-PDF files or zero-page documents.
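A small validation helper along these lines could be called before processing. This is a sketch assuming each upload exposes `.name` and `.size`, as Streamlit's `UploadedFile` does; the function name `validate_uploads` is illustrative, not from the PR:

```python
def validate_uploads(pdf_docs):
    """Partition uploads into usable PDFs and human-readable rejection reasons.

    Assumes each item has .name and .size attributes, like Streamlit's
    UploadedFile. Returns (valid_files, error_messages).
    """
    valid, errors = [], []
    if not pdf_docs:
        return valid, ["No files uploaded - please add at least one PDF."]
    for f in pdf_docs:
        if not f.name.lower().endswith(".pdf"):
            errors.append(f"'{f.name}' is not a PDF file.")
        elif f.size == 0:
            errors.append(f"'{f.name}' is empty.")
        else:
            valid.append(f)
    return valid, errors
```

In `main`, the error messages could be shown with `st.error` and processing skipped when `valid` is empty.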
@tulu-g559 make necessary changes
Actionable comments posted: 1
🧹 Nitpick comments (5)
Generative-AI/PDF Wizard/app.py (5)
21-37: Consider removing or fully enabling the commented-out exception handling block.

The code that reads PDF pages is duplicated in commented form (lines 31-36). If you need robust error handling for file reading, wrap the existing `PdfReader` logic in a `try-except`. Otherwise, cleaning up the commented code helps maintain clarity.

```diff
 for pdf in pdf_docs:
     pdf_reader = PdfReader(pdf)
     for page in pdf_reader.pages:
         text += page.extract_text()
-    # Optionally wrap the above block in try-except if needed:
-    # try:
-    #     pdf_reader = PdfReader(pdf)
-    #     for page in pdf_reader.pages:
-    #         text += page.extract_text()
-    # except Exception as e:
-    #     st.error(f"Error reading PDF '{pdf.name}': {str(e)}")
```
40-46: Offer user-configurable chunk sizes.

Hardcoding `chunk_size=10000` and `chunk_overlap=1000` might cause large memory usage for very large PDFs. Consider making these parameters configurable via Streamlit widgets or constants so users can tune performance.
50-77: Review concurrency risks and indexing strategy.
When multiple users process PDFs simultaneously, saving the FAISS index to the same directory could lead to race conditions or index corruption. If multi-user support is expected, consider adding locking mechanisms or storing each user’s index separately.
98-120: Validate local index usage before Q&A.
Good job verifying the existence of the FAISS index file. However, if multiple users frequently re-upload PDFs, you could end up with outdated or partial data. Consider showing a timestamp or version of the index to help users confirm correctness.
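The timestamp idea can be implemented by writing a small metadata file next to the index when it is built and reading it back in the UI. A minimal sketch; the file name `metadata.json` and the helper names are assumptions for illustration, not part of the PR:

```python
import json
import os
import time


def write_index_metadata(store_dir: str, num_chunks: int) -> None:
    """Record when the index was built so the UI can show its freshness."""
    meta = {"built_at": time.time(), "num_chunks": num_chunks}
    with open(os.path.join(store_dir, "metadata.json"), "w") as f:
        json.dump(meta, f)


def describe_index(store_dir: str) -> str:
    """Human-readable status line for display next to the question box."""
    path = os.path.join(store_dir, "metadata.json")
    if not os.path.exists(path):
        return "No index found - please process PDFs first."
    with open(path) as f:
        meta = json.load(f)
    age_min = (time.time() - meta["built_at"]) / 60
    return f"Index built {age_min:.0f} min ago from {meta['num_chunks']} chunks."
```

`write_index_metadata` would be called at the end of `get_vector_store`, and `describe_index` rendered with `st.caption` so users can tell whether answers come from a fresh index.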
122-141: Consider adding automated tests for PDF processing and vector indexing.
Implementing unit and integration tests (e.g., a small test PDF) would ensure that PDF reading, chunking, vector storage, and Q&A flow all work correctly. This helps catch regressions early.
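A minimal sketch of what such tests might check, exercising chunk-coverage and overlap invariants with a plain-Python splitter stand-in (a real suite would import `get_text_chunks` and use a small fixture PDF; `split_text` here is illustrative):

```python
def split_text(text, chunk_size, chunk_overlap):
    """Stand-in sliding-window splitter for testing chunking invariants."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def test_chunks_cover_whole_text():
    text = "abcdefghij" * 100
    chunks = split_text(text, chunk_size=100, chunk_overlap=10)
    # Joining non-overlapping prefixes must reproduce the original text
    reassembled = "".join(c[:90] for c in chunks[:-1]) + chunks[-1]
    assert reassembled == text


def test_consecutive_chunks_share_overlap():
    text = "abcdefghij" * 100
    chunks = split_text(text, chunk_size=100, chunk_overlap=10)
    for a, b in zip(chunks, chunks[1:]):
        assert a[-10:] == b[:10]


test_chunks_cover_whole_text()
test_consecutive_chunks_share_overlap()
```

The same invariants (no text lost, overlap preserved) apply to the langchain splitter, so these tests would catch regressions if chunk parameters change later.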
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
Generative-AI/PDF Wizard/app.py
(1 hunks)
🔇 Additional comments (1)
Generative-AI/PDF Wizard/app.py (1)
14-18: Good practice checking for missing environment variables.

Using `raise ValueError` is a proper way to fail fast when `GOOGLE_API_KEY` is not found. This helps ensure the application doesn't run in a misconfigured state.
@UTSAVS26
Actionable comments posted: 0
🧹 Nitpick comments (3)
Generative-AI/PDF Wizard/faiss_index/app.py (3)
33-37: Improve text chunking with parameters and validation

The current implementation uses hardcoded values for chunk size and overlap. Consider making these configurable parameters with defaults, and add validation for empty input and reasonable chunk sizes.

```diff
-def get_text_chunks(text):
-    # Adjust chunk size and overlap as needed
-    text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000, chunk_overlap=1000)
-    chunks = text_splitter.split_text(text)
-    return chunks
+def get_text_chunks(text, chunk_size=10000, chunk_overlap=1000):
+    """Split text into chunks using RecursiveCharacterTextSplitter.
+
+    Args:
+        text: The text to split
+        chunk_size: Size of each chunk
+        chunk_overlap: Overlap between chunks
+
+    Returns:
+        List of text chunks
+    """
+    if not text:
+        return []
+
+    # Validate parameters
+    if chunk_size <= 0:
+        raise ValueError("Chunk size must be positive")
+    if chunk_overlap < 0 or chunk_overlap >= chunk_size:
+        raise ValueError("Chunk overlap must be non-negative and less than chunk size")
+
+    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
+    chunks = text_splitter.split_text(text)
+    return chunks
```
59-74: Add error handling and model validation to conversational chain

The current implementation lacks error handling for API issues or model configuration. Consider adding validation and error handling to ensure robustness.

```diff
-def get_conversational_chain():
+def get_conversational_chain(model_name="gemini-1.5-flash", temperature=0.3):
+    """Create a conversation chain for question answering.
+
+    Args:
+        model_name: Name of the LLM model to use
+        temperature: Temperature setting for the model
+
+    Returns:
+        QA chain or None if an error occurs
+    """
     prompt_template = """
     Answer the question as detailed as possible from the provided context, make sure to provide all the details, if the answer is not in
     provided context just say, "answer is not available in the context", don't provide the wrong answer\n\n
     Context:\n {context}?\n
     Question: \n{question}\n

     Answer:
     """
-    model = ChatGoogleGenerativeAI(model="gemini-1.5-flash",temperature=0.3)
-
-    prompt = PromptTemplate(template = prompt_template, input_variables = ["context", "question"])
-    chain = load_qa_chain(model, chain_type="stuff", prompt=prompt)
-
-    return chain
+    try:
+        model = ChatGoogleGenerativeAI(model=model_name, temperature=temperature)
+        prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
+        chain = load_qa_chain(model, chain_type="stuff", prompt=prompt)
+        return chain
+    except Exception as e:
+        st.error(f"Error creating conversation chain: {str(e)}")
+        return None
```
107-125: Improve user experience and error handling in the main function

The main function doesn't provide adequate feedback if a user tries to ask a question without first uploading and processing PDFs, and lacks error handling for the PDF processing pipeline.

```diff
 def main():
-    st.set_page_config("PDF Wizard")
+    st.set_page_config(page_title="PDF Wizard", page_icon="📄")
     st.header("Chat with multiple PDFs📄")

+    # Create session state variables if they don't exist
+    if 'pdfs_processed' not in st.session_state:
+        st.session_state.pdfs_processed = False
+    if 'pdf_count' not in st.session_state:
+        st.session_state.pdf_count = 0
+
+    # Main area for questions and answers
     user_question = st.text_input("📎Ask a Question from the PDF Files")

     if user_question:
-        user_input(user_question)
+        if not st.session_state.pdfs_processed:
+            st.warning("Please upload and process PDFs before asking questions.")
+        else:
+            user_input(user_question)

+    # Sidebar for PDF upload and processing
     with st.sidebar:
         st.title("Menu:")
         pdf_docs = st.file_uploader("Upload your PDF Files and Click on the Submit & Process Button",
                                     accept_multiple_files=True)
-        if st.button("Submit & Process"):
-            with st.spinner("Processing..."):
-                raw_text = get_pdf_text(pdf_docs)
-                text_chunks = get_text_chunks(raw_text)
-                get_vector_store(text_chunks)
-                st.success("Done")
+        process_button = st.button("Submit & Process")
+
+        if process_button:
+            if not pdf_docs:
+                st.error("Please upload at least one PDF file.")
+            else:
+                with st.spinner(f"Processing {len(pdf_docs)} PDFs..."):
+                    try:
+                        # Process PDFs
+                        raw_text = get_pdf_text(pdf_docs)
+                        if not raw_text:
+                            st.error("No text could be extracted from the PDFs.")
+                        else:
+                            text_chunks = get_text_chunks(raw_text)
+                            get_vector_store(text_chunks)
+
+                            # Update session state
+                            st.session_state.pdfs_processed = True
+                            st.session_state.pdf_count = len(pdf_docs)
+
+                            st.success(f"Successfully processed {len(pdf_docs)} PDFs with {len(text_chunks)} text chunks.")
+                    except Exception as e:
+                        st.error(f"Error processing PDFs: {str(e)}")
+
+        # Show processing status
+        if st.session_state.pdfs_processed:
+            st.success(f"{st.session_state.pdf_count} PDFs processed and ready for queries.")
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
Generative-AI/PDF Wizard/app.py (1 hunks)
Generative-AI/PDF Wizard/faiss_index/app.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- Generative-AI/PDF Wizard/app.py
🔇 Additional comments (4)
Generative-AI/PDF Wizard/faiss_index/app.py (4)
14-16: Add error handling for the API key retrieval

Line 15 retrieves the API key but doesn't store the result, making it redundant. Additionally, there's no validation to ensure the API key exists and is valid before configuring the GenAI client.

```diff
 load_dotenv()
-os.getenv("GOOGLE_API_KEY")
-genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))
+api_key = os.getenv("GOOGLE_API_KEY")
+if not api_key:
+    raise ValueError("GOOGLE_API_KEY environment variable is missing. Please add it to your .env file.")
+genai.configure(api_key=api_key)
```
22-28: Add error handling for PDF operations

The function lacks error handling for PDF reading operations which could fail due to corrupted files, password-protected PDFs, or other issues. This might cause the application to crash with unhelpful error messages.

```diff
 def get_pdf_text(pdf_docs):
+    if not pdf_docs:
+        return ""
     text = ""
     for pdf in pdf_docs:
-        pdf_reader = PdfReader(pdf)
-        for page in pdf_reader.pages:
-            text += page.extract_text()
+        try:
+            pdf_reader = PdfReader(pdf)
+            for page in pdf_reader.pages:
+                text += page.extract_text()
+        except Exception as e:
+            st.error(f"Error reading PDF '{pdf.name}': {str(e)}")
     return text
```
44-54: Extract hardcoded values and add error handling to the vector store creation

The function uses hardcoded values for the embedding model and storage location. It also lacks error handling for the embedding and storage operations.

```diff
-def get_vector_store(text_chunks):
-    embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
-    vector_store = FAISS.from_texts(text_chunks, embedding=embeddings)
-
-    # Ensure the directory exists
-    if not os.path.exists("faiss_index"):
-        os.makedirs("faiss_index")
-
-    # Save the vector store index in the directory
-    vector_store.save_local("faiss_index")
+def get_vector_store(text_chunks, embedding_model="models/embedding-001", store_dir="faiss_index"):
+    """Create and save vector store from text chunks.
+
+    Args:
+        text_chunks: List of text chunks to embed
+        embedding_model: Name of the embedding model to use
+        store_dir: Directory to save the vector store
+
+    Returns:
+        None
+    """
+    if not text_chunks:
+        st.warning("No text to process. Please check the PDF content.")
+        return
+
+    try:
+        embeddings = GoogleGenerativeAIEmbeddings(model=embedding_model)
+        vector_store = FAISS.from_texts(text_chunks, embedding=embeddings)
+
+        # Ensure the directory exists
+        if not os.path.exists(store_dir):
+            os.makedirs(store_dir)
+
+        # Save the vector store index in the directory
+        vector_store.save_local(store_dir)
+    except Exception as e:
+        st.error(f"Error creating vector store: {str(e)}")
```
82-102: Remove debugging print statement and address security concern

The function has a debugging print statement and uses `allow_dangerous_deserialization=True` without explaining the security implications. Also, the embedding model is duplicated from an earlier function.

```diff
-def user_input(user_question):
-    embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
-
-    # Check if the faiss_index file exists before loading
-    if not os.path.exists("faiss_index/index.faiss"):
-        st.error("FAISS index file not found. Please process the PDF files first.")
-        return
-
-    new_db = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)
-    docs = new_db.similarity_search(user_question)
-
-    chain = get_conversational_chain()
-
-    response = chain(
-        {"input_documents": docs, "question": user_question},
-        return_only_outputs=True
-    )
-
-    print(response)
-    st.write("Reply: ", response["output_text"])
+def user_input(user_question, embedding_model="models/embedding-001", store_dir="faiss_index"):
+    """Process user question and generate a response.
+
+    Args:
+        user_question: The user's question
+        embedding_model: Name of the embedding model to use
+        store_dir: Directory where the vector store is saved
+
+    Returns:
+        None
+    """
+    if not user_question.strip():
+        return
+
+    try:
+        embeddings = GoogleGenerativeAIEmbeddings(model=embedding_model)
+
+        # Check if the index file exists before loading
+        index_path = f"{store_dir}/index.faiss"
+        if not os.path.exists(index_path):
+            st.error("FAISS index file not found. Please process the PDF files first.")
+            return
+
+        # Note about the security parameter:
+        # allow_dangerous_deserialization=True is required to load FAISS indexes
+        # but should be used with caution in production environments with untrusted data
+        new_db = FAISS.load_local(store_dir, embeddings, allow_dangerous_deserialization=True)
+        docs = new_db.similarity_search(user_question)
+
+        chain = get_conversational_chain()
+        if not chain:
+            return
+
+        with st.spinner("Generating response..."):
+            response = chain(
+                {"input_documents": docs, "question": user_question},
+                return_only_outputs=True
+            )
+        st.write("Reply: ", response["output_text"])
+    except Exception as e:
+        st.error(f"Error processing question: {str(e)}")
```
@UTSAVS26
Pull Request for PyVerse 💡
Requesting to submit a pull request to the PyVerse repository.
Issue Title
[Code Addition Request]: PDF Wizard - AI-Powered Document Q&A Tool with multiple pdfs📄✨
Info about the Related Issue
What's the goal of the project?
The goal of PDF Wizard is to provide an AI-powered interface that allows users to interact with multiple PDF documents. It extracts text from uploaded PDFs, converts them into embeddings using Google's Gemini AI, and stores them in a FAISS vector database. Users can then ask questions, and the system retrieves relevant content to generate accurate responses.
Name
Please mention your name.
Arnab Ghosh
GitHub ID
Please mention your GitHub ID.
https://github.com/tulu-g559
Email ID
Please mention your email ID for further communication.
[email protected]
Identify Yourself
Mention in which program you are contributing (e.g., WoB, GSSOC, SSOC, SWOC).
JWoC
Closes
Enter the issue number that will be closed through this PR.
Closes: #1185
Describe the Add-ons or Changes You've Made
Give a clear description of what you have added or modified.
Type of Change
Select the type of change:
How Has This Been Tested?
Describe how your changes have been tested.
Functionality Testing:
Uploaded multiple PDFs with different formats and structures.
Verified that text extraction from PDFs is accurate.
AI Response Testing:
Asked context-based questions and compared the answers to ensure relevance.
Tested edge cases where answers might not be available in the context.
Performance & Efficiency:
Measured response time for similarity search in FAISS.
Tested with various chunk sizes to optimize retrieval speed.
Checklist
Please confirm the following: