Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated excel_knowledge_source.py to account for excel files with multiple tabs. #1921

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

Kking112
Copy link

Updated excel_knowledge_source.py to account for excel sheets that have multiple tabs. The old implementation contained a single df=pd.read_excel(excel_file_path), which only reads the first or most recently used excel sheet. The updated functionality reads all sheets in the excel workbook.

Kking112 and others added 2 commits January 18, 2025 18:02
…ve multiple tabs. The old implementation contained a single df=pd.read_excel(excel_file_path), which only reads the first or most recently used excel sheet. The updated functionality reads all sheets in the excel workbook.
@joaomdmoura
Copy link
Collaborator

Disclaimer: This review was made by a crew of AI Agents.

Code Review Comment for PR: ExcelKnowledgeSource Updates

Overview

This pull request significantly enhances the ExcelKnowledgeSource class by enabling it to handle multiple sheets within Excel workbooks, a feature that greatly improves its utility compared to the previous single-sheet implementation. Below are my insights based on the changes made, along with recommendations for improvement.

Positive Aspects

  1. Core Functionality Improvement: The ability to read multiple sheets addresses a critical limitation, allowing for more comprehensive data extraction from Excel files.
  2. Backward Compatibility Maintained: Existing functionality is preserved, ensuring no disruption for users still reliant on single-sheet reading.
  3. Error Handling Improvements: Robust error handling for dependencies can prevent runtime issues when required modules are not available.

Issues and Recommendations

1. Memory Efficiency Concerns

The current implementation loads all sheets into memory at once, which poses a risk of excessive memory usage for large files.

Recommendation:
Refactor the load_content method to handle sheets iteratively. Here's an example of how to read one sheet at a time:

def load_content(self) -> Dict[Path, Dict[str, str]]:
    content_dict = {}
    for file_path in self.safe_file_paths:
        with pd.ExcelFile(file_path) as xl:
            sheet_dict = {sheet_name: pd.read_excel(xl, sheet_name).to_csv(index=False)
                          for sheet_name in xl.sheet_names}
        content_dict[file_path] = sheet_dict
    return content_dict

2. CSV Conversion Implementation

Using string concatenation for CSV conversion is prone to errors, especially with special characters.

Recommendation:
Utilize the built-in CSV library for proper formatting. This enhances reliability:

import csv
from io import StringIO

def _sheet_to_csv(self, worksheet) -> str:
    output = StringIO()
    writer = csv.writer(output)
    for row in worksheet.values:
        writer.writerow(row)
    return output.getvalue()

3. Type Hints

The absence of type hints affects both maintainability and IDE functionality.

Recommendation:
Add type hints to all public methods to boost code clarity:

from typing import Dict, Path

def load_content(self) -> Dict[Path, Dict[str, str]]:
    ...

4. Debug Print Statement

Unnecessary print statements can clutter logs.

Recommendation:
Remove any debug prints such as:

# Remove this line
print(sheet_str)

5. Simplifying Content Processing in the add() Method

The current content processing logic is somewhat convoluted.

Recommendation:
Refactor to a more straightforward approach for improved readability:

def add(self) -> None:
    content_str = "\n".join(
        str(sheet_value) for file_content in self.content.values()
        for sheet_value in (file_content.values() if isinstance(file_content, dict) else [file_content])
    )
    new_chunks = self._chunk_text(content_str)
    self.chunks.extend(new_chunks)

6. Documentation Improvements

Docstrings should provide more detailed descriptions of method functionality.

Recommendation:
Enhance the documentation of the load_content method with clear explanations of the parameters and return types:

def load_content(self) -> Dict[Path, Dict[str, str]]:
    """Load and preprocess Excel file content from multiple sheets.
    
    Each sheet's content is converted to CSV format and stored.
    
    Returns:
        Dict[Path, Dict[str, str]]: A mapping of file paths to their respective sheet contents.
    
    Raises:
        ImportError: If required dependencies are missing.
        FileNotFoundError: If the specified Excel file cannot be opened.
    """

Additional Suggestions

  1. Consider validating Excel file formats to prevent unsupported formats from causing issues.
  2. Implement logging to track progress, especially when processing large files.
  3. Introduce unit tests specifically for multi-sheet scenarios to verify behavior across various conditions.

Impact Assessment

  • Positive: The enhancements vastly improve functionality for users dealing with complex Excel files.
  • Negative: There are potential memory problems that could arise with large files.
  • Risk Level: Medium due to memory handling issues, but improvements suggested can mitigate this risk.

Final Recommendations

I recommend approving the PR with the following changes. While the core enhancement is essential, refining the implementation for efficiency and robustness will ensure a higher quality product. Immediate focus should be given to optimizing memory management and improving the CSV conversion method.


This comment encompasses a detailed yet concise overview of the changes, the potential impacts, and specific improvement suggestions, ensuring a thorough code review process.

@Kking112 Kking112 marked this pull request as draft January 18, 2025 23:11
…e memory usage and provide better documentation
@Kking112 Kking112 marked this pull request as ready for review January 18, 2025 23:19
@Kking112
Copy link
Author

Made the changes suggested by the AI Agents. Please let me know if I should change anything else. Thank you!

@Kking112
Copy link
Author

I realized I made a mistake in the 2nd to last commit.. did not delete the old load_content function - corrected this in the latest commit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants