Added flat file handling #142

dogversioning · 2024-12-03T21:02:15Z

This makes the following changes:

- The S3Manager class is now decoupled from powerset_merge, and is in the shared folder (and intended to be used by alternate processing methods going forward)
- A new processing type, process_flat, moves flat files into new location types in S3
- Some changes to process_upload to shuttle things off to different queues

github-actions · 2024-12-03T21:03:19Z

☂️ Python Coverage

current status: ✅

Overall Coverage

| Lines | Covered | Coverage | Threshold | Status |
|---|---|---|---|---|
| 786 | 766 | 97% | 90% | 🟢 |

New Files

| File | Coverage | Status |
|---|---|---|
| src/shared/s3_manager.py | 100% | 🟢 |
| src/site_upload/process_flat/__init__.py | 100% | 🟢 |
| src/site_upload/process_flat/process_flat.py | 100% | 🟢 |
| TOTAL | 100% | 🟢 |

Modified Files

| File | Coverage | Status |
|---|---|---|
| src/shared/awswrangler_functions.py | 100% | 🟢 |
| src/shared/enums.py | 100% | 🟢 |
| src/shared/functions.py | 97% | 🟢 |
| src/site_upload/cache_api/cache_api.py | 94% | 🟢 |
| src/site_upload/powerset_merge/powerset_merge.py | 95% | 🟢 |
| src/site_upload/process_upload/process_upload.py | 98% | 🟢 |
| TOTAL | 98% | 🟢 |

updated for commit: a8d41a1 by action🐍

dogversioning · 2024-12-03T21:10:46Z

src/shared/s3_manager.py

+        # TODO: Taking out a folder layer to match the depth of non-site aggregates
+        # Revisit when targeted crawling is implemented


To expand on this a bit: this part of the Glue config specifies that table names are taken from the fifth directory down. Since introducing site as a concept adds a new layer, I elected to strip one element out of the "ideal" path.

#125 discusses a way around this, which basically amounts to creating a crawler on demand and pointing it at just the directory you want to crawl new data in.
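To make the depth issue concrete, here is a minimal sketch of how a fixed table level interacts with an extra folder layer. The bucket name, the folder names, and the `segment_at_level` helper are all made up for illustration, and counting the bucket as level 1 is an assumption about how levels are numbered; none of this is the real Glue config.

```python
def segment_at_level(s3_uri: str, level: int) -> str:
    """Return the path segment at a 1-indexed level (bucket assumed to be level 1)."""
    segments = [part for part in s3_uri.removeprefix("s3://").split("/") if part]
    return segments[level - 1]


# Hypothetical fixed level that table names are read from.
TABLE_LEVEL = 5

# Hypothetical layouts: the existing depth, the same layout with a site layer
# added, and the site layer folded into its neighbor to restore the depth.
existing = "s3://bucket/aggregates/study/package/version/data.parquet"
with_site_layer = "s3://bucket/flat/study/site/package/version/data.parquet"
condensed = "s3://bucket/flat/study/site__package/version/data.parquet"

for uri in (existing, with_site_layer, condensed):
    print(segment_at_level(uri, TABLE_LEVEL))
```

With these made-up layouts, the extra layer shifts the segment found at the fixed level, and folding two layers into one brings it back in line with the existing depth.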

dogversioning · 2024-12-03T21:12:42Z

src/site_upload/powerset_merge/powerset_merge.py

I didn't want to touch this too much since this is already pretty unwieldy, but maybe this module should be renamed 'process_powersets' or 'process_cubes'? I also need to do a pass and see if there's more stuff in here that can/should get moved to the manager, now that it's separate.

mikix · 2024-12-03T21:18:12Z

src/shared/awswrangler_functions.py

+        last_valid_df,
+        to_path,
+        index=False,
+        quoting=csv.QUOTE_MINIMAL,


Minimal is my own preferred dialect, but if this csv file is headed towards the dashboard, the way Vlad has talked about quoting makes me think he expects CSVs to be using the csv.QUOTE_STRINGS dialect (or at least csv.QUOTE_NONNUMERIC if you don't have Python 3.12).

I swear we had a contract at one point indicating this format was ok.

Maybe you do? I just said that based on how Vlad keeps talking about quotes casually in convos, but maybe in some spots this is the right quoting.
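For reference, a small sketch of how the two dialects differ, using only the standard library `csv` module rather than the pandas/awswrangler call in the diff. The sample rows and the `render` helper are made up for illustration.

```python
import csv
import io

# Made-up rows: a header, then a row with a string, an int, and a string
# containing both a comma and double quotes.
rows = [["site", "count", "note"], ["site_a", 12, 'says "hi", twice']]


def render(quoting: int) -> str:
    """Write the sample rows with the given quoting mode and return the text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=quoting, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


minimal = render(csv.QUOTE_MINIMAL)
nonnumeric = render(csv.QUOTE_NONNUMERIC)

print(minimal)
print(nonnumeric)
```

QUOTE_MINIMAL only quotes the field that contains a delimiter or quote character, while QUOTE_NONNUMERIC quotes every non-numeric field, headers included, and leaves the integer bare.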

mikix · 2024-12-03T21:32:17Z

src/site_upload/process_flat/process_flat.py

+    manager.update_local_metadata(enums.TransactionKeys.LAST_DATA_UPDATE.value, site=manager.site)
+    manager.update_local_metadata(
+        enums.ColumnTypesKeys.LAST_DATA_UPDATE.value,
+        value=column_dict,
+        metadata=manager.types_metadata,
+        meta_type=enums.JsonFilename.COLUMN_TYPES.value,
+    )


This is called twice with the same key and different values - can these be combined? (I might not be understanding how local metadata works)

This is one of those 'I made this choice for backwards compatibility at the time' issues that has caused downstream boondoggles - meta_type determines which of the two cached metadata dicts (transaction details, column types) this is written to. So these are actually writing to completely different dicts.

If we think that needs to be revisited, I might want to do that as a follow-on PR.

OK got it... Maybe this pattern would work better for my brain if the "where does this data get written" field had more prominence - like maybe as a mandatory kwarg or a positional arg, so it's easier to compare two calls and see "ah yes, this arg is different". In the calls above, the first positional arg felt like the "destination" info and the kwargs the values.

Honestly, a lot of these kwargs could use more clarity for a newcomer. Like... "update_local_metadata(value=x, metadata=y)" -- is value= the metadata being written and metadata= metadata about the metadata? Or is metadata= the metadata being written and value= something else?

yeah - part of my struggle in getting onboarded back to this was remembering what all this stuff does.

This is sort of addressed via #70, but the names of all these things could use a second pass.
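One possible shape for the "make the destination prominent" idea, as a rough sketch only. `MetaType` and `MetadataStore` are hypothetical names, not the actual `update_local_metadata` signature or the real enums; the point is just that a mandatory keyword-only destination makes two calls easy to compare.

```python
from enum import Enum
from typing import Any


class MetaType(Enum):
    """Hypothetical stand-in for the two cached metadata dicts."""

    TRANSACTIONS = "transactions"
    COLUMN_TYPES = "column_types"


class MetadataStore:
    """Toy store holding one dict per metadata type."""

    def __init__(self) -> None:
        self._caches: dict[MetaType, dict[str, Any]] = {
            meta_type: {} for meta_type in MetaType
        }

    def update(self, *, meta_type: MetaType, key: str, value: Any) -> None:
        # The bare `*` makes every argument keyword-only, so the destination
        # has to be spelled out at each call site.
        self._caches[meta_type][key] = value

    def get(self, *, meta_type: MetaType, key: str) -> Any:
        return self._caches[meta_type][key]


store = MetadataStore()
store.update(meta_type=MetaType.TRANSACTIONS, key="last_data_update", value="2024-12-03")
store.update(meta_type=MetaType.COLUMN_TYPES, key="last_data_update", value={"cnt": "integer"})
```

Here the same key is written twice, but the differing `meta_type=` argument shows at a glance that the writes land in different dicts.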

mikix · 2024-12-03T21:36:03Z

src/site_upload/process_upload/process_upload.py

-    data_package = path_params[2]
+    # This happens when we're processing flat files, due to having to condense one
+    # folder layer.
+    # TODO: revisit on targeted crawling
+    if "__" in path_params[2]:
+        data_package = path_params[2].split("__")[1]
+    else:
+        data_package = path_params[2]


nit: I feel like this repo is due for an AggPath class or similar that abstracts all of the path assumptions / evolutions over time. There's a lot of path[2] and splitting going on across the repo, instead of in one nice central place.

yeah, that's a really good idea. #143
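A very rough sketch of what such a class might look like, centralizing only the one rule visible in this diff. The `AggPath` name comes from the comment above; the fields, the assumption that `path_params` comes from splitting the key on "/", and the example keys are all hypothetical, and #143 may land on something quite different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AggPath:
    """Hypothetical wrapper that keeps path-layout assumptions in one place."""

    key: str

    @property
    def parts(self) -> list[str]:
        # Assumption: path params are the "/"-separated segments of the key.
        return self.key.split("/")

    @property
    def data_package(self) -> str:
        # Mirrors the rule in the diff: flat file paths condense one folder
        # layer, so the third segment may look like "<prefix>__<data_package>".
        segment = self.parts[2]
        if "__" in segment:
            return segment.split("__")[1]
        return segment


# Made-up keys, purely to exercise both branches.
flat = AggPath("flat/study/site__encounter/file.parquet")
regular = AggPath("aggregates/study/encounter/file.parquet")
print(flat.data_package, regular.data_package)
```

Callers would then ask for `data_package` instead of indexing and splitting on their own, so a future layout change only touches this class.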

mikix approved these changes Dec 3, 2024

dogversioning added 3 commits December 4, 2024 15:41

Added flat file handling

35150b7

Some coverage updates

6556e3c

more coverage, PR feedback

a8d41a1

dogversioning force-pushed the mg/type-handling branch from 773cf96 to a8d41a1 Compare December 4, 2024 21:02

dogversioning merged commit 512074e into main Dec 5, 2024
2 checks passed

dogversioning deleted the mg/type-handling branch December 5, 2024 14:31
