
Update measures for Nov 2024 #135

Open · wants to merge 13 commits into dev

Conversation

@lizgzil (Contributor) commented Nov 15, 2024


Note: the current GJE uses data and plots created in this PR

Description

This PR updates the OJO analysis to use the extra job adverts from Nov 2023 to Nov 2024.

  • Reading and deduplicating using the new data
  • New flows for the data update (which extract green measures for just the new batch of data and merge them with the old)
  • Updating readmes and configs
  • Aggregation scripts are replaced by a quicker method in create_aggregated_data.py (see the sketch after this list)
  • New script to nicely format all outputs for the GJE download option (with data descriptions)
  • Processing changes due to the new format of the datasets
  • Updating all plotting notebooks and Flourish-ready data outputs
  • Temporal analysis notebook
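For context on the aggregation change, the quicker method in create_aggregated_data.py presumably replaces per-group loops with a single vectorised pandas groupby. A minimal sketch of that pattern only; the column names below are illustrative, not the actual OJO schema:

import pandas as pd

# Illustrative only: column names are hypothetical, not the real schema.
green_measures = pd.DataFrame(
    {
        "itl_region": ["London", "London", "Wales"],
        "occupation": ["Engineer", "Cleaner", "Engineer"],
        "green_timeshare": [0.42, 0.05, 0.38],
    }
)

# One vectorised groupby replaces looping over each region/occupation pair.
aggregated = (
    green_measures.groupby(["itl_region", "occupation"])["green_timeshare"]
    .agg(["mean", "count"])
    .reset_index()
    .rename(columns={"mean": "mean_green_timeshare", "count": "num_job_ads"})
)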

Fixes #134 #132

In order to test the code in this PR you need to ...

Please pay special attention to ...

Checklist:

  • I have refactored my code out from notebooks/
  • I have checked the code runs
  • I have tested the code
  • I have run pre-commit and addressed any issues not automatically fixed
  • I have merged any new changes from dev
  • I have documented the code
    • Major functions have docstrings
    • Appropriate information has been added to READMEs
  • I have explained this PR above
  • I have requested a code review

@lizgzil lizgzil changed the title Create new update flow for industries Update measures for Nov 2024 Nov 18, 2024
Comment on lines +128 to +150
job_desc_chunks = list(partition_all(chunk_size, ojo_jobs_data))

t0 = time.time()
for i, job_desc_chunk in tqdm(enumerate(job_desc_chunks)):
    # Extract industry green measures for this chunk of job adverts
    ind_green_measures_dict = im.get_measures(job_desc_chunk)
    # Save each chunk's output to an interim file on S3
    save_to_s3(
        BUCKET_NAME,
        ind_green_measures_dict,
        os.path.join(
            inds_output_folder,
            f"ojo_newest_industry_green_measures_production_{production}_interim/{i}.json",
        ),
    )

# Read them back in and save altogether
ind_measures_locs = get_s3_data_paths(
    BUCKET_NAME,
    os.path.join(
        inds_output_folder,
        f"ojo_newest_industry_green_measures_production_{production}_interim",
    ),
    file_types=["*.json"],
)
lizgzil (Contributor, Author) commented:

@crispy-wonton here we process the job adverts in batches of 5,000 (which takes a while to run) and save the outputs to interim files. Then at the end we read them all back in and save them together.
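To make that final step concrete, the merge presumably reads each interim JSON back and combines them before one last save. A rough sketch, assuming a load_s3_data(bucket, key) helper that mirrors save_to_s3; the helper name and the final output path are assumptions, not taken from the diff:

# Sketch only: load_s3_data and the output filename are assumed, not confirmed.
ind_green_measures = {}
for loc in ind_measures_locs:
    # Each interim file maps job advert IDs to their industry green measures.
    ind_green_measures.update(load_s3_data(BUCKET_NAME, loc))

# Save the combined measures as a single output file.
save_to_s3(
    BUCKET_NAME,
    ind_green_measures,
    os.path.join(
        inds_output_folder,
        f"ojo_newest_industry_green_measures_production_{production}.json",
    ),
)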
