FEAT: Decrease run-time duration of ODS-QLIK Process #28

Merged
merged 2 commits into from
Nov 7, 2024

Conversation

@rymarczy rymarczy commented Nov 7, 2024

This change alters the way ods-qlik data is loaded into the dmap-import database.

Previously, the ods-qlik data loading steps were:

  1. load individual cdc csv.gz files into the _history table
  2. execute a very complicated query against the _history table to generate _fact table updates

This approach has proven not to scale: the size of the _history tables and the complexity of the _fact table query have resulted in unacceptably long load times.

The new process does the following:

  1. download batches of cdc csv.gz files onto the local machine
  2. merge file batches into a single csv file
  3. load the csv file into the _history table
  4. convert the csv file to a dataframe object
  5. perform INSERT, UPDATE and DELETE operations on the _fact table individually, using the dataframe object

The new process completely removes the need for a complicated query to load data from the _history table into the _fact table, and thus dramatically reduces loading time when running the ods-qlik process.
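As a rough illustration of steps 2, 4 and 5, here is a minimal standard-library sketch, not the code from this PR. It merges a batch of gzipped CSV files into one CSV and then applies the inserts, updates and deletes as three separate passes. The operation column name `header__change_oper`, its `I`/`U`/`D` values, the `id` key column, and both function names are assumptions made for the example; a plain dict stands in for the _fact table, and how the real loader orders repeated changes to the same key within a batch is not described here.

```python
import csv
import gzip


def merge_csv_gz(paths, out_path):
    """Merge gzipped CSV files that share a header into one plain CSV.

    Returns the number of data rows written (header excluded).
    """
    rows_written = 0
    header = None
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for path in paths:
            with gzip.open(path, "rt", newline="", encoding="utf-8") as src:
                reader = csv.reader(src)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # empty file, nothing to merge
                if header is None:
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError(f"header mismatch in {path}")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written


def apply_changes(fact, csv_path, key="id", oper="header__change_oper"):
    """Apply inserts, then updates, then deletes to `fact` (dict keyed by `key`).

    `key` and `oper` are hypothetical column names. Rows with any other
    operation code are ignored.
    """
    by_op = {"I": [], "U": [], "D": []}
    with open(csv_path, newline="", encoding="utf-8") as src:
        for rec in csv.DictReader(src):
            bucket = by_op.get(rec.pop(oper))
            if bucket is not None:
                bucket.append(rec)

    for rec in by_op["I"]:  # INSERT pass
        fact[rec[key]] = rec
    for rec in by_op["U"]:  # UPDATE pass: later rows win
        fact.setdefault(rec[key], {}).update(rec)
    for rec in by_op["D"]:  # DELETE pass
        fact.pop(rec[key], None)
    return fact
```

Running the three passes separately mirrors the "individually" wording in step 5; a production loader would issue the equivalent SQL statements against the database instead of mutating a dict.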

The new process is also completely compatible with existing ETL status files.

@rymarczy changed the title from "FEAT: Increase load speed of ODS-QLIK Process" to "FEAT: Decrease run-time duration of ODS-QLIK Process" Nov 7, 2024
@rymarczy rymarczy requested a review from grejdi-mbta November 7, 2024 17:11
"""
- find all available CDC dfm files for a Snapshot from Archive and Error buckets
+ find all available CDC csv.gz files for a Snapshot from Archive bucket
Contributor

Curious why you're dropping Error bucket.

Contributor Author

I dropped it while trying to simplify things; I can add it back.

Contributor

Up to you. But I think we should have it in there. The glue pipeline is unnecessarily strict so we shouldn't constrain this pipeline because of it.

Contributor Author

@rymarczy rymarczy Nov 7, 2024

Incorporated

src/cubic_loader/qlik/ods_qlik.py (two resolved review threads)
Contributor

@grejdi-mbta grejdi-mbta left a comment

Please make sure to test the error bucket addition.

Contributor Author

rymarczy commented Nov 7, 2024

> Please make sure to test the error bucket addition.

I checked it out locally. The only tricky part is returning a single list with everything sorted by the last timestamp in the filename; the sorted call with a lambda function takes care of this.
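For readers following along, the sorting idea could look roughly like the sketch below. This is an illustration, not the code merged in this PR: the filename timestamp format (`YYYYMMDD-HHMMSSmmm`), the object key layout, and the names `last_timestamp` and `merge_bucket_listings` are all assumptions.

```python
import re

# Hypothetical timestamp format embedded in CDC filenames, e.g. 20241107-171100123.
TS_PATTERN = re.compile(r"\d{8}-\d{9}")


def last_timestamp(key: str) -> str:
    """Return the last timestamp found in the filename part of an object key."""
    filename = key.rsplit("/", 1)[-1]
    matches = TS_PATTERN.findall(filename)
    if not matches:
        raise ValueError(f"no timestamp found in {key!r}")
    return matches[-1]


def merge_bucket_listings(archive_keys, error_keys):
    """Combine keys from both buckets into one list ordered by filename timestamp.

    Fixed-width numeric timestamps sort chronologically as plain strings.
    """
    return sorted([*archive_keys, *error_keys], key=lambda k: last_timestamp(k))
```

Sorting on the filename rather than the full key keeps the bucket or prefix from influencing the order, so Archive and Error files interleave by time.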

@rymarczy rymarczy merged commit 9f9ab45 into main Nov 7, 2024
5 checks passed