fix: fix two ETL issues #493

Merged
merged 6 commits into dev
Dec 24, 2024

Conversation

NingLu
Collaborator

@NingLu NingLu commented Dec 24, 2024

Fixes #

🤖 AI-Generated PR Description (Powered by Amazon Bedrock)

Description

This pull request includes modifications to several files related to the ETL and job processing functionality of the project. The changes are summarized as follows:

  • source/lambda/etl/sfn_handler.py: Modifications to the Step Function handler for the ETL process.
  • source/lambda/job/dep/llm_bot_dep/loaders/auto.py and source/lambda/job/dep/llm_bot_dep/loaders/xlsx.py: Updates to the automatic and XLSX file loaders for the job processing component.
  • source/lambda/job/glue-job-script.py: Changes to the Glue job script responsible for data processing tasks.

These changes are bug fixes to the ETL and job processing workflows: Excel files are now downloaded and parsed by the XLSX loader and excluded from chunking in the Glue job, and the chatbot event body built by the Step Function handler now includes group_name.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

This pull request introduces no breaking changes. Even so, review the changes carefully and test them before deploying to production environments.

File Stats Summary

Number of files involved in this PR: 4.

The file changes summary is as follows:

| File | Changes | Change Summary |
| --- | --- | --- |
| source/lambda/job/glue-job-script.py | 1 added, 1 removed | The code change modifies the condition for setting the gen_chunk_flag variable, which now checks if the file_type is "csv", "xlsx", or "xls" before disabling chunking. |
| source/lambda/etl/sfn_handler.py | 3 added, 1 removed | The code changes modify the construction of the chatbot_event object by creating a new chatbot_event_body dictionary based on the input_body, adding the group_name key, and then converting it to JSON format. |
| source/lambda/job/dep/llm_bot_dep/loaders/auto.py | 1 added, 1 removed | The code change removes the file_content argument from the process_xlsx function call, suggesting that the function no longer requires the file content as input. |
| source/lambda/job/dep/llm_bot_dep/loaders/xlsx.py | 12 added, 15 removed | The code changes update the process_xlsx function to handle Excel files instead of JSON-line files. It removes the jsonl parameter and adds logic to download the Excel file from S3, parse it using pandas, and create Document objects from the data, with page content and metadata extracted from the Excel rows. |
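
The sfn_handler.py change summarized above can be illustrated with a minimal sketch. It assumes input_body is an already-parsed dict and that the event carries its payload as a JSON string under a "body" key. The helper name build_chatbot_event, the "body" key, and the sample field values are hypothetical and are not taken from the PR diff.

```python
import json


def build_chatbot_event(input_body: dict, group_name: str) -> dict:
    """Build an event whose body is input_body plus group_name, JSON-encoded.

    Hypothetical helper: the real handler's names and event shape may differ.
    """
    # Copy first so the caller's input_body is not mutated.
    chatbot_event_body = dict(input_body)
    chatbot_event_body["group_name"] = group_name
    # Assumption: downstream code expects the body as a JSON string.
    return {"body": json.dumps(chatbot_event_body)}


# Sample values are made up for illustration.
event = build_chatbot_event({"chatbotId": "admin", "indexType": "qd"}, "Admin")
```

Copying input_body before adding the key keeps the original request payload unchanged, which may matter if the handler reuses it later.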
🤖 AI-Generated PR Description (Powered by Amazon Bedrock)

Description

This pull request includes modifications to the ETL (Extract, Transform, Load) process for handling data ingestion, as well as updates to the LLM (Large Language Model) bot dependencies and the Glue job script.

The main changes are:

  • sfn_handler.py: Updates to the Step Function handler for the ETL process.
  • llm_bot_dep/loaders/auto.py and llm_bot_dep/loaders/xlsx.py: Modifications to the automatic and XLSX file loaders for the LLM bot.
  • glue-job-script.py: Enhancements to the Glue job script, which is responsible for executing data transformations and loading data into the target data store.
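
As a rough illustration of the Glue job script change, the file summary in this PR says gen_chunk_flag is now set to False for "csv", "xlsx", and "xls" files rather than for "csv" alone. The sketch below expresses that condition as a set membership test. The constant NO_CHUNK_FILE_TYPES and the helper should_generate_chunks are hypothetical names, and the actual script may write the condition differently.

```python
# File types for which chunk generation is disabled.
# Assumption: the set mirrors the three types named in the change summary.
NO_CHUNK_FILE_TYPES = {"csv", "xlsx", "xls"}


def should_generate_chunks(file_type: str) -> bool:
    """Return False for the listed tabular types, True otherwise (hypothetical helper)."""
    return file_type not in NO_CHUNK_FILE_TYPES


gen_chunk_flag = should_generate_chunks("xlsx")
```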

The motivation behind these changes is to fix issues in the data ingestion and processing pipeline, chiefly in how Excel files are loaded and chunked.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

File Stats Summary

Number of files involved in this PR: 4.

The file changes summary is as follows:

| File | Changes | Change Summary |
| --- | --- | --- |
| source/lambda/job/dep/llm_bot_dep/loaders/auto.py | 1 added, 1 removed | The code change removes the file_content argument from the process_xlsx function call, potentially optimizing memory usage for large Excel files. |
| source/lambda/etl/sfn_handler.py | 3 added, 1 removed | The code changes update the chatbot_event body to include the input_body data along with the group_name, and pass it to the create_chatbot function. |
| source/lambda/job/glue-job-script.py | 1 added, 1 removed | The code change modifies the condition for setting the gen_chunk_flag variable to False when the file type is "csv", "xlsx", or "xls", instead of just "csv". |
| source/lambda/job/dep/llm_bot_dep/loaders/xlsx.py | 11 added, 14 removed | The code changes modify the process_xlsx function to process Excel files instead of JSON lines. The function now downloads an Excel file from S3, reads it using pandas, and creates a list of Document objects from the data in the file, with the content as page_content and metadata as specified. |
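
The last step of the xlsx.py change, turning spreadsheet rows into Document objects, might look roughly like the sketch below. It is not the PR's implementation: the S3 download and pandas parsing are omitted, Document is a stand-in dataclass rather than the type the loaders actually use, and rows_to_documents, the column names, the metadata keys, and the S3 path are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for the Document type used by the loaders (assumption)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def rows_to_documents(rows, content_column, file_path):
    """Create one Document per row; non-content columns become metadata."""
    docs = []
    for row_number, row in enumerate(rows):
        # Everything except the content column is carried along as metadata.
        metadata = {k: v for k, v in row.items() if k != content_column}
        metadata["file_path"] = file_path
        metadata["row"] = row_number
        docs.append(Document(page_content=str(row[content_column]), metadata=metadata))
    return docs


# Made-up rows standing in for a parsed spreadsheet.
rows = [
    {"question": "What is ETL?", "answer": "Extract, Transform, Load"},
    {"question": "What runs the job?", "answer": "AWS Glue"},
]
docs = rows_to_documents(rows, "question", "s3://example-bucket/faq.xlsx")
```

In a loader that parses the file with pandas, a list of row dicts like this could be obtained from `pandas.read_excel(...)` followed by `DataFrame.to_dict(orient="records")`.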

@NingLu NingLu merged commit 89b2381 into dev Dec 24, 2024
6 checks passed