Added flat file handling #142
Conversation
```python
# TODO: Taking out a folder layer to match the depth of non-site aggregates
# Revisit when targeted crawling is implemented
```
To expand on this a bit: this part of the Glue config specifies that table names are taken from the fifth directory level down. Since we're introducing site as a concept here, adding a new layer, I elected to strip one element out of the "ideal" path.

#125 discusses a way around this, which basically amounts to creating a crawler on demand and pointing it at just the directory you want to crawl new data in.
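For reference, the knob in question - a sketch of setting a crawler's table depth with boto3 (the crawler name here is hypothetical; the depth value is illustrative of what this PR works around):

```python
import json

import boto3

glue = boto3.client("glue")

# Glue's TableLevelConfiguration sets which directory depth the crawler
# treats as the table name; flat files need to land at that same depth.
glue.update_crawler(
    Name="aggregate-crawler",  # hypothetical name
    Configuration=json.dumps(
        {"Version": 1.0, "Grouping": {"TableLevelConfiguration": 5}}
    ),
)
```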
I didn't want to touch this too much since it's already pretty unwieldy, but maybe this module should be renamed 'process_powersets' or 'process_cubes'? I also need to do a pass and see if there's more stuff in here that can/should get moved to the manager, now that that's separate.
```python
last_valid_df,
to_path,
index=False,
quoting=csv.QUOTE_MINIMAL,
```
Minimal is my own preferred dialect, but if this CSV file is headed towards the dashboard, the way Vlad has talked about quoting makes me think he expects CSVs to be using the `csv.QUOTE_STRINGS` dialect (or at least `csv.QUOTE_NONNUMERIC` if you don't have Python 3.12).
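For comparison, here's what those dialects actually emit for a mixed row (standalone example, not from this PR):

```python
import csv
import io

row = ["site_a", 42, ""]
for name, q in [
    ("QUOTE_MINIMAL", csv.QUOTE_MINIMAL),
    ("QUOTE_NONNUMERIC", csv.QUOTE_NONNUMERIC),
]:
    buf = io.StringIO()
    csv.writer(buf, quoting=q).writerow(row)
    print(name, buf.getvalue(), end="")
# QUOTE_MINIMAL    site_a,42,
# QUOTE_NONNUMERIC "site_a",42,""
# csv.QUOTE_STRINGS (3.12+) matches QUOTE_NONNUMERIC here, but writes None
# unquoted, so missing values stay distinguishable from empty strings.
```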
I swear we had a contract at one point indicating this format was ok.
Maybe you do? I was just going off how Vlad keeps talking about quotes casually in convos - maybe in some spots this is the right quoting.
```python
manager.update_local_metadata(enums.TransactionKeys.LAST_DATA_UPDATE.value, site=manager.site)
manager.update_local_metadata(
    enums.ColumnTypesKeys.LAST_DATA_UPDATE.value,
    value=column_dict,
    metadata=manager.types_metadata,
    meta_type=enums.JsonFilename.COLUMN_TYPES.value,
)
```
This is called twice with the same key and different values - can these be combined? (I might not be understanding how local metadata works)
this is one of those "I made this choice for backwards compatibility at the time" issues that has caused downstream boondoggles - `meta_type` determines which of the two cached metadata dicts (transaction details, column types) this is written to. So these actually are writing to completely different dicts.

If we think that needs to be revisited, I might want to do that as a follow-on PR.
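Roughly, the shape of the dispatch (a simplified sketch of the behavior described above, not the real method body):

```python
# Simplified sketch only - the real method lives on the manager class.
def update_local_metadata(self, key, value=None, *, meta_type="transactions", **kwargs):
    # meta_type selects the cache: transaction details vs. column types,
    # so two calls with the same key write to different dicts.
    target = self.types_metadata if meta_type == "column_types" else self.metadata
    target[key] = value
```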
OK got it... Maybe this pattern would work better for my brain if the "where does this data get written" field had more prominence - like maybe as a mandatory kwarg or a positional arg, so it's easier to compare two calls and see "ah yes, this arg is different". In the calls above, the first positional arg felt like the "destination" info and the kwargs the values.

Honestly, a lot of these kwargs could use more clarity for a newcomer. Like... `write_local_metadata(value=x, metadata=y)` - is `value=` the metadata being written and `metadata=` metadata about the metadata? Or is `metadata=` the metadata being written and `value=` something else?
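For the first point, something like this is what I'm picturing (purely illustrative):

```python
# Purely illustrative - destination as a required first argument makes
# two adjacent calls visibly differ where it matters:
manager.update_local_metadata("transactions", key, value=...)
manager.update_local_metadata("column_types", key, value=column_dict)
```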
yeah - part of my struggle in getting onboarded back to this was remembering what all this stuff does.

This is sort of addressed via #70, but the names of all these things could use a second pass.
```python
# This happens when we're processing flat files, due to having to condense one
# folder layer.
# TODO: revisit on targeted crawling
if "__" in path_params[2]:
    data_package = path_params[2].split("__")[1]
else:
    data_package = path_params[2]
```
nit: I feel like this repo is due for an AggPath class or similar that abstracts all of the path assumptions / evolutions over time. There's a lot of path[2] indexing and splitting going on across the repo, instead of in one nice central place.
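An illustrative sketch of what I mean (the name and fields are mine, only centralizing the `path_params[2]` / `"__"` logic seen above):

```python
from dataclasses import dataclass

@dataclass
class AggPath:
    """Illustrative only - one home for the path_params[2] / "__" logic."""

    data_package: str

    @classmethod
    def from_params(cls, path_params: list[str]) -> "AggPath":
        raw = path_params[2]
        # flat-file paths condense a folder layer into "prefix__package"
        package = raw.split("__")[1] if "__" in raw else raw
        return cls(data_package=package)
```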
yeah, that's a really good idea. #143
Force-pushed from 773cf96 to a8d41a1
This makes the following changes:
- Adds `process_flat`, which moves flat files into new location types in S3
- Updates `process_upload` to shuttle things off to different queues