Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added flat file handling #142

Merged
merged 3 commits into from
Dec 5, 2024
Merged

Added flat file handling #142

merged 3 commits into from
Dec 5, 2024

Conversation

dogversioning
Copy link
Contributor

@dogversioning dogversioning commented Dec 3, 2024

This makes the following changes:

  • the S3Manager class is now decoupled from powerset_merge, and is in the shared folder (and intended to be used by alternate processing methods going forward)
  • A new processing type, process_flat, moves flat files into new location types in S3
    • Some changes to process_upload to shuttle things off to different queues

Copy link

github-actions bot commented Dec 3, 2024

☂️ Python Coverage

current status: ✅

Overall Coverage

Lines Covered Coverage Threshold Status
786 766 97% 90% 🟢

New Files

File Coverage Status
src/shared/s3_manager.py 100% 🟢
src/site_upload/process_flat/init.py 100% 🟢
src/site_upload/process_flat/process_flat.py 100% 🟢
TOTAL 100% 🟢

Modified Files

File Coverage Status
src/shared/awswrangler_functions.py 100% 🟢
src/shared/enums.py 100% 🟢
src/shared/functions.py 97% 🟢
src/site_upload/cache_api/cache_api.py 94% 🟢
src/site_upload/powerset_merge/powerset_merge.py 95% 🟢
src/site_upload/process_upload/process_upload.py 98% 🟢
TOTAL 98% 🟢

updated for commit: a8d41a1 by action🐍

src/shared/enums.py Show resolved Hide resolved
src/shared/functions.py Show resolved Hide resolved
Comment on lines +62 to +63
# TODO: Taking out a folder layer to match the depth of non-site aggregates
# Revisit when targeted crawling is implemented
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To expand on this a bit here: this bit of the glue config specifies that table names will be taken from the fifth directory down. Since we're introducing site as a concept here, adding a new layer, I elected to strip one element out of the "ideal" path.

#125 discusses a way around this, which basically amounts to creating a crawler on demand and pointing it at just the directory you want to crawl new data in.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't want to touch this too much since this is already pretty unwieldy, but this module maybe should be renamed 'process_powersets' or 'process_cubes'? I also need to do a pass and see if there's more stuff in here that can/should get moved to the manager, now that that's seperate.

tests/conftest.py Show resolved Hide resolved
last_valid_df,
to_path,
index=False,
quoting=csv.QUOTE_MINIMAL,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minimal is my own preferred dialect, but if this csv file is headed towards the dashboard, the way Vlad has talked about quoting makes me think he expects CSVs to be using the csv.QUOTE_STRINGS dialect (or at least csv.QUOTE_NONNUMERIC if you don't have Python 3.12).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I swear we had a contract at one point indicating this format was ok.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you do? I just said that based on how Vlad keeps talking about quotes casually in convos, but maybe in some spots this is the right quoting.

src/shared/functions.py Outdated Show resolved Hide resolved
src/shared/functions.py Outdated Show resolved Hide resolved
src/shared/s3_manager.py Outdated Show resolved Hide resolved
Comment on lines +48 to +54
manager.update_local_metadata(enums.TransactionKeys.LAST_DATA_UPDATE.value, site=manager.site)
manager.update_local_metadata(
enums.ColumnTypesKeys.LAST_DATA_UPDATE.value,
value=column_dict,
metadata=manager.types_metadata,
meta_type=enums.JsonFilename.COLUMN_TYPES.value,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is called twice with the same key and different values - can these be combined? (I might not be understanding how local metadata works)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is one of those 'i made this choice for backwards compatibility at the time' issues that has caused downstream boondoggles - meta_type determines which of the two cached metadata dicts (transaction details, column types) this is written to. So these actually are writing to completely different dicts.

If we think that needs to be revisited, I might want to do that as a followon PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK got it... Maybe this pattern would work better for my brain if the "where does this data get written" field had more prominence - like maybe as a mandatory kwarg or a positional arg, so it's easier to compare two calls and see "ah yes, this arg is different". In the calls above, the first positional arg felt like the "destination" info and the kwargs the values.

Honestly, a lot of these kwargs could use more clarity to a newcomer. Like... "write_local_metadata(value=x, metadata=y)" -- is value= the metadata being written and metadata= is metadata about the metadata? Or is metadata= the metadata being written and value= is something else.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah - part of my struggle in getting onboarded back to this was remembering what all this stuff does.

This is sort of address via #70, but the names of all these things could use a second pass.

src/site_upload/process_flat/process_flat.py Show resolved Hide resolved
Comment on lines -26 to +32
data_package = path_params[2]
# This happens when we're processing flat files, due to having to condense one
# folder layer.
# TODO: revisit on targeted crawling
if "__" in path_params[2]:
data_package = path_params[2].split("__")[1]
else:
data_package = path_params[2]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I feel like this repo is due for a AggPath class or similar that abstracts all of the path assumptions / evolutions over time. There's a lot of path[2] and splitting going on across the repo, instead of in one nice central place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, that's really good idea. #143

@dogversioning dogversioning merged commit 512074e into main Dec 5, 2024
2 checks passed
@dogversioning dogversioning deleted the mg/type-handling branch December 5, 2024 14:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants