FEAT: Decrease run-time duration of ODS-QLIK Process #28

rymarczy · 2024-11-07T17:09:41Z

This change alters the way ods-qlik data is loaded into the dmap-import database.

Previously, the ods-qlik data loading steps were:

1. load individual cdc csv.gz files into the _history table
2. execute a very complicated query against the _history table to generate _fact table updates

This approach has not proven scalable: the size of the _history tables and the complexity of the _fact table query have resulted in unacceptably long load times.

The new process does the following:

1. download batches of cdc csv.gz files onto the local machine
2. merge each file batch into a single csv file
3. load the csv file into the _history table
4. convert the csv file to a dataframe object
5. perform INSERT, UPDATE, and DELETE operations on the _fact table individually, using the dataframe object

The new process completely removes the need for a complicated query to load data from the _history table into the _fact table, and thus dramatically reduces loading time when running the ods-qlik process.
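The merge and split steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names `merge_csv_gz` and `split_by_operation`, the `op` column, and its `I`/`U`/`D` values are all hypothetical, and a plain list of dicts stands in for the dataframe object so the sketch needs only the standard library.

```python
import csv
import gzip
from collections import defaultdict


def merge_csv_gz(batch, merged_path):
    """Concatenate gzipped CSV files that share a header into one plain CSV.

    Returns the number of data rows written (the header is not counted).
    """
    header = None
    rows_written = 0
    with open(merged_path, "w", newline="") as out:
        writer = csv.writer(out)
        for gz_path in batch:
            with gzip.open(gz_path, "rt", newline="") as src:
                reader = csv.reader(src)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip completely empty files
                if header is None:
                    # Write the header once, taken from the first file.
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError(f"header mismatch in {gz_path}")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written


def split_by_operation(merged_path, op_column="op"):
    """Group merged change records by operation code.

    ASSUMPTION: each record carries its operation in a column named `op`
    (for example I, U, D). The real CDC column name may differ.
    """
    groups = defaultdict(list)
    with open(merged_path, newline="") as src:
        for record in csv.DictReader(src):
            groups[record[op_column]].append(record)
    return dict(groups)
```

Each resulting group could then drive its own INSERT, UPDATE, or DELETE statement against the _fact table, which is what removes the need for a single query over the whole _history table.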

The new process is also completely compatible with existing ETL status files.

grejdi-mbta · 2024-11-07T18:41:39Z

src/cubic_loader/qlik/ods_qlik.py

    """
-    find all available CDC dfm files for a Snapshot from Archive and Error buckets
+    find all available CDC csv.gz files for a Snapshot from Archive bucket


Curious why you're dropping Error bucket.

I dropped it when I was trying to simplify things. I can add it back.

Up to you. But I think we should have it in there. The glue pipeline is unnecessarily strict so we shouldn't constrain this pipeline because of it.

Incorporated

src/cubic_loader/qlik/ods_qlik.py

grejdi-mbta

Please make sure to test the error bucket addition.

rymarczy · 2024-11-07T19:21:58Z

Please make sure to test the error bucket addition.

I checked it out locally. The only tricky thing is returning a list with everything sorted by the last timestamp in the filename; the sorted call with a lambda function takes care of this.
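A rough sketch of that sort, assuming object keys whose filenames contain one or more fixed-width timestamps of the form `YYYYMMDD-HHMMSSfff`. The key layout, the timestamp pattern, and the names `last_timestamp` and `sort_cdc_keys` are hypothetical and only illustrate the idea; they are not taken from `ods_qlik.py`.

```python
import re

# ASSUMPTION: timestamps look like 20241107-170941123 (8 digits, dash, 9 digits).
TIMESTAMP = re.compile(r"\d{8}-\d{9}")


def last_timestamp(key):
    """Return the last timestamp found in the filename part of an object key."""
    filename = key.rsplit("/", 1)[-1]
    matches = TIMESTAMP.findall(filename)
    if not matches:
        raise ValueError(f"no timestamp found in {key!r}")
    return matches[-1]


def sort_cdc_keys(archive_keys, error_keys):
    """Combine Archive and Error bucket keys, ordered by their last timestamp.

    Fixed-width timestamps sort chronologically when compared as strings,
    so no datetime parsing is needed.
    """
    combined = list(archive_keys) + list(error_keys)
    return sorted(combined, key=lambda key: last_timestamp(key))
```

Sorting on the extracted timestamp rather than the full key matters because the bucket prefix would otherwise dominate the ordering and group all Archive files ahead of all Error files.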

drop fact table load

4b0aa13

rymarczy changed the title from "FEAT: Increase load speed of ODS-QLIK Process" to "FEAT: Decrease run-time duration of ODS-QLIK Process" on Nov 7, 2024

rymarczy requested a review from grejdi-mbta November 7, 2024 17:11

grejdi-mbta reviewed Nov 7, 2024

add ERROR bucket back

04e6005

grejdi-mbta approved these changes Nov 7, 2024

rymarczy merged commit 9f9ab45 into main Nov 7, 2024
5 checks passed
