
Update measures for Nov 2024 #135

Open · wants to merge 13 commits into dev

Conversation

@lizgzil (Contributor) commented Nov 15, 2024


Note: the current GJE uses data and plots created in this PR

Description

This PR updates the OJO analysis to use the extra job adverts from Nov 2023 to Nov 2024.

  • Reading and deduplicating using the new data
  • New flows for the data update (which extract green measures for just the new batch of data and merge them with the old)
  • Updating readmes and configs
  • Aggregation scripts are replaced by a quicker method in create_aggregated_data.py (see the sketch after this list)
  • New script to nicely format all outputs for the GJE download option (with data descriptions)
  • Processing changes due to the new format of the datasets
  • Updating all plotting notebooks and Flourish-ready data outputs
  • Temporal analysis notebook
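For context on the aggregation change, the quicker method in create_aggregated_data.py presumably replaces per-group loops with a single vectorised pandas groupby. A minimal sketch of that pattern only; the column names below are illustrative, not the actual OJO schema:

import pandas as pd

# Illustrative only: column names are hypothetical, not the real schema.
green_measures = pd.DataFrame(
    {
        "itl_region": ["London", "London", "Wales"],
        "occupation": ["Engineer", "Cleaner", "Engineer"],
        "green_timeshare": [0.42, 0.05, 0.38],
    }
)

# One vectorised groupby replaces looping over each region/occupation pair.
aggregated = (
    green_measures.groupby(["itl_region", "occupation"])["green_timeshare"]
    .agg(["mean", "count"])
    .reset_index()
    .rename(columns={"mean": "mean_green_timeshare", "count": "num_job_ads"})
)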

Fixes #134 #132

In order to test the code in this PR you need to ...

Please pay special attention to ...

Checklist:

  • I have refactored my code out from notebooks/
  • I have checked the code runs
  • I have tested the code
  • I have run pre-commit and addressed any issues not automatically fixed
  • I have merged any new changes from dev
  • I have documented the code
    • Major functions have docstrings
    • Appropriate information has been added to READMEs
  • I have explained this PR above
  • I have requested a code review

@lizgzil lizgzil changed the title Create new update flow for industries Update measures for Nov 2024 Nov 18, 2024
Comment on lines +128 to +150
job_desc_chunks = list(partition_all(chunk_size, ojo_jobs_data))

t0 = time.time()
for i, job_desc_chunk in tqdm(enumerate(job_desc_chunks)):
    # Extract industry green measures for this chunk of job adverts
    ind_green_measures_dict = im.get_measures(job_desc_chunk)
    # Save each chunk's output to an interim file on S3
    save_to_s3(
        BUCKET_NAME,
        ind_green_measures_dict,
        os.path.join(
            inds_output_folder,
            f"ojo_newest_industry_green_measures_production_{production}_interim/{i}.json",
        ),
    )

# Read them back in and save altogether
ind_measures_locs = get_s3_data_paths(
    BUCKET_NAME,
    os.path.join(
        inds_output_folder,
        f"ojo_newest_industry_green_measures_production_{production}_interim",
    ),
    file_types=["*.json"],
)
lizgzil (Contributor, Author) commented:

@crispy-wonton here we process the job adverts in batches of 5,000 (which takes a while to run) and save the outputs to interim files. Then at the end we read them all back in and save them together.
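To make that final step concrete, the merge presumably reads each interim JSON back and combines them before one last save. A rough sketch, assuming a load_s3_data(bucket, key) helper that mirrors save_to_s3; the helper name and the final output path are assumptions, not taken from the diff:

# Sketch only: load_s3_data and the output filename are assumed, not confirmed.
ind_green_measures = {}
for loc in ind_measures_locs:
    # Each interim file maps job advert IDs to their industry green measures.
    ind_green_measures.update(load_s3_data(BUCKET_NAME, loc))

# Save the combined measures as a single output file.
save_to_s3(
    BUCKET_NAME,
    ind_green_measures,
    os.path.join(
        inds_output_folder,
        f"ojo_newest_industry_green_measures_production_{production}.json",
    ),
)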
