Update measures for Nov 2024 #135

Open: wants to merge 13 commits into base `dev`
1 change: 1 addition & 0 deletions .gitignore
@@ -1,5 +1,6 @@
.recipes/
.cookiecutter/state/
dap_prinz_green_jobs/notebooks/

*.lock

49 changes: 35 additions & 14 deletions dap_prinz_green_jobs/analysis/ojo_analysis/README.md
@@ -4,38 +4,59 @@ This folder contains scripts to aggregate data at the SIC-, SOC- and region-level

### Skills formatting

To speed up the aggregation step, we process the skills datasets into more manageable forms: all the skills per job advert are combined into a new dataset with a single skill per row. This is done by running:

```
python dap_prinz_green_jobs/analysis/ojo_analysis/process_full_skills_data.py
```
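The processing step reshapes per-advert skill lists into one skill per row. A minimal sketch of that reshaping with pandas (toy data; the column names are illustrative, not the project's actual schema):

```python
import pandas as pd

# Toy stand-in for the OJO sample: one row per job advert, with the
# skills extracted from that advert held in a list-valued column.
adverts = pd.DataFrame(
    {
        "id": [101, 102],
        "skill_label": [["data analysis", "recycling"], ["teamwork"]],
    }
)

# Explode to a single skill per row, mirroring the shape that the
# aggregation scripts consume.
exploded = adverts.explode("skill_label").reset_index(drop=True)
print(exploded)
```

Each advert id is repeated once per skill, so downstream scripts can group by skill or by advert without re-parsing the list column.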

Going forward, information about the skills and green-skill proportions is stored in three locations:

1. The skills extracted and mapped to ESCO (not just green) `s3://prinz-green-jobs/outputs/data/ojo_application/deduplicated_sample/20241114/latest_update_20241114_skills.parquet` (with columns ['id', 'skill_label', 'esco_label', 'esco_id']). A smaller version of this file with just the job advert and ESCO id columns is in `s3://prinz-green-jobs/outputs/data/ojo_application/extracted_green_measures/20241118/ojo_all_skills_exploded.parquet`.
2. The green skills extracted and mapped to green ESCO `outputs/data/ojo_application/extracted_green_measures/20241118/ojo_all_skills_green_measures_exploded_green.parquet` (with columns ['job_id', 'skill_label', 'extracted_green_skill', 'extracted_green_skill_id', 'green_skill_preferred_name'])
3. Information on the number of skills and proportion of green skills `outputs/data/ojo_application/extracted_green_measures/20241118/ojo_all_skills_green_measures_skill_metrics.parquet` (with columns ['job_id', 'prop_green_with_hs', 'NUM_ORIG_ENTS', 'NUM_SPLIT_ENTS', 'num_all_skills_ojo', 'count_green_skills_no_hs', 'PROP_GREEN'])

### Data aggregation

To aggregate OJO data with extracted green measures (as defined in `ojo_analysis.yaml`), run:

```
python dap_prinz_green_jobs/analysis/ojo_analysis/create_aggregated_data.py
```

to aggregate the data by SOC, SIC and ITL regions. This script draws on functions from `process_ojo_green_measures.py`, which are also used in the `notebooks/` directory to generate graphs for the Green Jobs Explorer tool.

This will also format the occupation-aggregated data into a form suitable for the Green Jobs Explorer website; these are very superficial changes, e.g. changing single to double quotation marks.

### Finding similar occupations based on the skills asked for

In `create_aggregated_data.py`, occupation similarities are also computed using functions from `occupation_similarity.py`. A matrix of the proportions of all skills per occupation is created, and each row of this matrix is compared to the others using cosine similarity to find the occupations closest to one another based on which skills are asked for. The output `occupation_aggregated_data_{DATE}_extra.csv` contains an additional column with the list of similar occupations.
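The row-wise cosine-similarity idea can be sketched with plain NumPy (toy occupation names and made-up skill proportions, not the project's real data):

```python
import numpy as np

# Toy matrix: rows = occupations, columns = proportion of adverts for
# that occupation mentioning each skill.
occupations = ["Engineer", "Analyst", "Gardener"]
P = np.array(
    [
        [0.6, 0.4, 0.0],
        [0.5, 0.5, 0.0],
        [0.0, 0.1, 0.9],
    ]
)

# Normalise each row, then pairwise cosine similarity is a matrix product.
unit = P / np.linalg.norm(P, axis=1, keepdims=True)
sim = unit @ unit.T

# For each occupation, the most similar *other* occupation: sort each row
# by descending similarity and skip index 0 (self-similarity of 1.0).
nearest = [occupations[j] for j in np.argsort(-sim, axis=1)[:, 1]]
print(dict(zip(occupations, nearest)))
```

In the real pipeline the matrix is far larger and sparse, but the comparison step is the same row-wise cosine similarity.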

### Final files

The previous scripts output many files to the `s3://prinz-green-jobs/outputs/data/ojo_application/extracted_green_measures/analysis/20241121/` folder; the most important for analysis are:

1. `occupation_aggregated_data_20241121_extra_gjeformat.csv`: The data which powers the Green Jobs Explorer website. This is the aggregated data per occupation (SOC_EXT), with occupations with fewer than 50 job adverts removed.
2. `industry_aggregated_data_20241121.csv`: The data aggregated by SIC.
3. `all_itl_aggregated_data_20241121.csv`: The data aggregated by each of ITL 1, 2 and 3.

### Data for the Green Jobs Explorer download

Although the data that powers the GJE is produced in `create_aggregated_data.py`, there is an additional step to create a nicely formatted xlsx dataset, with information sheets about the column names etc., for users to download.

This is created by running:

```
python dap_prinz_green_jobs/analysis/ojo_analysis/create_open_gje_data.py
```

This script essentially renames some columns, deletes others, and creates a data explanation sheet to go alongside the dataset.
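The renaming/dropping and the data-explanation sheet can be sketched with pandas (all column names and descriptions here are made up for illustration, not the project's real schema):

```python
import pandas as pd

# Toy aggregated data with internal column names (illustrative only).
df = pd.DataFrame(
    {
        "SOC_EXT": ["2136/01", "2425/02"],
        "PROP_GREEN": [0.42, 0.13],
        "NUM_SPLIT_ENTS": [120, 95],  # internal column, dropped for the download
    }
)

# Rename columns to reader-friendly names and drop internal ones.
open_df = df.rename(
    columns={"SOC_EXT": "Occupation code", "PROP_GREEN": "Proportion of green skills"}
).drop(columns=["NUM_SPLIT_ENTS"])

# A data-explanation sheet describing each remaining column, intended to
# sit alongside the data as a second sheet in the downloadable workbook.
explanation = pd.DataFrame(
    {
        "Column": open_df.columns,
        "Description": [
            "Extended SOC occupation code",
            "Mean proportion of skills per advert that are green",
        ],
    }
)
print(open_df.columns.tolist())
```

Writing `open_df` and `explanation` to two sheets of one workbook (e.g. via `pd.ExcelWriter`) would then produce the kind of xlsx download described above.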

The outputs are saved to:
- `s3://nesta-open-data/green_jobs_explorer/occupation_aggregated_data_20241121_GJE.xlsx`
- `s3://nesta-open-data/green_jobs_explorer/industry_aggregated_data_20241121_GJE.xlsx`
- `s3://nesta-open-data/green_jobs_explorer/region_aggregated_data_20241121_GJE.xlsx`
125 changes: 0 additions & 125 deletions dap_prinz_green_jobs/analysis/ojo_analysis/aggregate_by_region.py

This file was deleted.

105 changes: 0 additions & 105 deletions dap_prinz_green_jobs/analysis/ojo_analysis/aggregate_by_sic.py

This file was deleted.
