fix: fix two ETL issues #493

NingLu · 2024-12-24T07:11:40Z

Fixes #

🤖 AI-Generated PR Description (Powered by Amazon Bedrock)

Description

This pull request includes modifications to several files related to the ETL and job processing functionality of the project. The changes are summarized as follows:

source/lambda/etl/sfn_handler.py: Modifications to the Step Function handler for the ETL process.
source/lambda/job/dep/llm_bot_dep/loaders/auto.py and source/lambda/job/dep/llm_bot_dep/loaders/xlsx.py: Updates to the automatic and XLSX file loaders for the job processing component.
source/lambda/job/glue-job-script.py: Changes to the Glue job script responsible for data processing tasks.

The motivation behind these changes is to fix two ETL issues. The Step Function handler now includes the request's input body alongside `group_name` in the chatbot event, and Excel files ("xlsx", "xls") are downloaded from S3, parsed with pandas, and, like CSV files, skip chunk generation.

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

The changes in this pull request are not expected to break existing functionality. Even so, they should be reviewed and tested before being deployed to production environments.

File Stats Summary

Number of files changed in this PR: 4

The file changes summary is as follows:

Files	Changes	Change Summary
source/lambda/job/glue-job-script.py	1 added, 1 removed	The code change modifies the condition for setting the `gen_chunk_flag` variable, which now checks if the `file_type` is "csv", "xlsx", or "xls" before disabling chunking.
source/lambda/etl/sfn_handler.py	3 added, 1 removed	The code changes modify the construction of the `chatbot_event` object by creating a new `chatbot_event_body` dictionary based on the `input_body`, adding the `group_name` key, and then converting it to JSON format.
source/lambda/job/dep/llm_bot_dep/loaders/auto.py	1 added, 1 removed	The code change removes the `file_content` argument from the `process_xlsx` function call, suggesting that the function no longer requires the file content as input.
source/lambda/job/dep/llm_bot_dep/loaders/xlsx.py	12 added, 15 removed	The code changes update the `process_xlsx` function to handle Excel files instead of JSON-line files. It removes the `jsonl` parameter and adds logic to download the Excel file from S3, parse it using pandas, and create `Document` objects from the data, with page content and metadata extracted from the Excel rows.
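
The `sfn_handler.py` row above can be illustrated with a minimal sketch. It assumes the event is a dictionary with a JSON-encoded `body` field; the helper name `build_chatbot_event` and the `chatbotId` key are hypothetical and are not taken from the PR diff.

```python
import json


def build_chatbot_event(input_body: dict, group_name: str) -> dict:
    """Build an event whose body carries the caller's input plus group_name.

    Hypothetical helper: the real handler's names and event shape may differ.
    """
    # Copy the input so the caller's dictionary is not mutated.
    chatbot_event_body = dict(input_body)
    # Add the group name alongside whatever the caller provided.
    chatbot_event_body["group_name"] = group_name
    # Serialize the body to JSON, as the change summary describes.
    return {"body": json.dumps(chatbot_event_body)}


# Assumed input: a request that already names a chatbot.
event = build_chatbot_event({"chatbotId": "my-bot"}, "Admin")
body = json.loads(event["body"])
```

If the body were built from `group_name` alone, any chatbot identifier supplied by the caller would be lost before chatbot creation. That may be what the commit title "sfn handler use default chatbot id even it is provided" refers to, though the diff itself is not shown here.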

🤖 AI-Generated PR Description (Powered by Amazon Bedrock)

Description

This pull request includes modifications to the ETL (Extract, Transform, Load) process for handling data ingestion, as well as updates to the LLM (Large Language Model) bot dependencies and the Glue job script.

The main changes are:

sfn_handler.py: Updates to the Step Function handler for the ETL process.
llm_bot_dep/loaders/auto.py and llm_bot_dep/loaders/xlsx.py: Modifications to the automatic and XLSX file loaders for the LLM bot.
glue-job-script.py: Enhancements to the Glue job script, which is responsible for executing data transformations and loading data into the target data store.

The motivation behind these changes is to improve the reliability of the data ingestion and processing pipeline by fixing two ETL issues.

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

File Stats Summary

Number of files changed in this PR: 4

The file changes summary is as follows:

Files	Changes	Change Summary
source/lambda/job/dep/llm_bot_dep/loaders/auto.py	1 added, 1 removed	The code change removes the `file_content` argument from the `process_xlsx` function call, potentially optimizing memory usage for large Excel files.
source/lambda/etl/sfn_handler.py	3 added, 1 removed	The code changes update the `chatbot_event` body to include the `input_body` data along with the `group_name`, and pass it to the `create_chatbot` function.
source/lambda/job/glue-job-script.py	1 added, 1 removed	The code change modifies the condition for setting the `gen_chunk_flag` variable to `False` when the file type is "csv", "xlsx", or "xls", instead of just "csv".
source/lambda/job/dep/llm_bot_dep/loaders/xlsx.py	11 added, 14 removed	The code changes modify the `process_xlsx` function to process Excel files instead of JSON lines. The function now downloads an Excel file from S3, reads it using pandas, and creates a list of Document objects from the data in the file, with the content as page_content and metadata as specified.
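
The `gen_chunk_flag` change described for `glue-job-script.py` can be sketched as a small predicate. This is an illustration only: the function name `should_generate_chunks` and the constant `NO_CHUNK_FILE_TYPES` are hypothetical, and the actual script may express the condition inline.

```python
# File types for which chunk generation is disabled, per the change summary.
NO_CHUNK_FILE_TYPES = {"csv", "xlsx", "xls"}


def should_generate_chunks(file_type: str) -> bool:
    """Return False for tabular file types, True for everything else."""
    return file_type not in NO_CHUNK_FILE_TYPES


# Before the change, only "csv" disabled chunking; now Excel types do too.
gen_chunk_flag = should_generate_chunks("xlsx")
```

Keeping the file types in one set means another tabular format can be added later without touching the condition itself.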

NingLu added 6 commits December 24, 2024 03:36

fix: sfn handler use default chatbot id even it is provided

8432486

fix: update excel loader

f00d64a

chore: fix excel exception

91ef89c

Merge branch 'dev' into lvn

af36a0b

chore: update job whl

059c9a7

fix: update excel logic

cfed475

NingLu merged commit 89b2381 into dev Dec 24, 2024
6 checks passed
