Update measures for Nov 2024 #135

Open: wants to merge 13 commits into base `dev`
1 change: 1 addition & 0 deletions .gitignore
@@ -1,5 +1,6 @@
.recipes/
.cookiecutter/state/
dap_prinz_green_jobs/notebooks/

*.lock

49 changes: 35 additions & 14 deletions dap_prinz_green_jobs/analysis/ojo_analysis/README.md
@@ -4,38 +4,59 @@ This folder contains scripts to aggregate data at the SIC-, SOC- and region-level

### Skills formatting

To speed up the aggregation step, we process the skills datasets into more manageable forms: all the skills per job advert are combined into a new dataset with a single skill per row. This is done by running:

```
python dap_prinz_green_jobs/analysis/ojo_analysis/process_full_skills_data.py
```
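The processing step reshapes per-advert skill lists into one skill per row. A minimal sketch of that reshaping with pandas (toy data; the column names are illustrative, not the project's actual schema):

```python
import pandas as pd

# Toy stand-in for the OJO sample: one row per job advert, with the
# skills extracted from that advert held in a list-valued column.
adverts = pd.DataFrame(
    {
        "id": [101, 102],
        "skill_label": [["data analysis", "recycling"], ["teamwork"]],
    }
)

# Explode to a single skill per row, mirroring the shape that the
# aggregation scripts consume.
exploded = adverts.explode("skill_label").reset_index(drop=True)
print(exploded)
```

Each advert id is repeated once per skill, so downstream scripts can group by skill or by advert without re-parsing the list column.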

Going forward, information about the skills and green-skill proportions is stored in three locations:

1. The skills extracted and mapped to ESCO (not just green) `s3://prinz-green-jobs/outputs/data/ojo_application/deduplicated_sample/20241114/latest_update_20241114_skills.parquet` (with columns ['id', 'skill_label', 'esco_label', 'esco_id']). A smaller version of this file with just the job advert and ESCO id columns is in `s3://prinz-green-jobs/outputs/data/ojo_application/extracted_green_measures/20241118/ojo_all_skills_exploded.parquet`.
2. The green skills extracted and mapped to green ESCO `outputs/data/ojo_application/extracted_green_measures/20241118/ojo_all_skills_green_measures_exploded_green.parquet` (with columns ['job_id', 'skill_label', 'extracted_green_skill', 'extracted_green_skill_id', 'green_skill_preferred_name'])
3. Information on the number of skills and proportion of green skills `outputs/data/ojo_application/extracted_green_measures/20241118/ojo_all_skills_green_measures_skill_metrics.parquet` (with columns ['job_id', 'prop_green_with_hs', 'NUM_ORIG_ENTS', 'NUM_SPLIT_ENTS', 'num_all_skills_ojo', 'count_green_skills_no_hs', 'PROP_GREEN'])

### Data aggregation

To aggregate OJO data with extracted green measures (as defined in `ojo_analysis.yaml`), run:

```
python dap_prinz_green_jobs/analysis/ojo_analysis/create_aggregated_data.py
```

to aggregate the data by SOC, SIC and ITL regions. This script draws on functions from `process_ojo_green_measures.py`, which are also used in the `notebooks/` directory to generate graphs for the Green Jobs Explorer tool.

This will also format the occupation-aggregated data into a form suitable for the Green Jobs Explorer website; these are very superficial changes, e.g. changing single to double quotation marks.

### Finding similar occupations based on the skills asked for

In `create_aggregated_data.py`, occupation similarities are also computed using functions from `occupation_similarity.py`. A matrix of the proportions of all skills per occupation is created, and each row of this matrix is compared to the others using cosine similarity to find the occupations closest to one another based on which skills are asked for. The output `occupation_aggregated_data_{DATE}_extra.csv` contains an additional column with the list of similar occupations.
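The row-wise cosine-similarity idea can be sketched with plain NumPy (toy occupation names and made-up skill proportions, not the project's real data):

```python
import numpy as np

# Toy matrix: rows = occupations, columns = proportion of adverts for
# that occupation mentioning each skill.
occupations = ["Engineer", "Analyst", "Gardener"]
P = np.array(
    [
        [0.6, 0.4, 0.0],
        [0.5, 0.5, 0.0],
        [0.0, 0.1, 0.9],
    ]
)

# Normalise each row, then pairwise cosine similarity is a matrix product.
unit = P / np.linalg.norm(P, axis=1, keepdims=True)
sim = unit @ unit.T

# For each occupation, the most similar *other* occupation: sort each row
# by descending similarity and skip index 0 (self-similarity of 1.0).
nearest = [occupations[j] for j in np.argsort(-sim, axis=1)[:, 1]]
print(dict(zip(occupations, nearest)))
```

In the real pipeline the matrix is far larger and sparse, but the comparison step is the same row-wise cosine similarity.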

### Final files

The previous scripts output many files to the `s3://prinz-green-jobs/outputs/data/ojo_application/extracted_green_measures/analysis/20241121/` folder; the most important for analysis are:

1. `occupation_aggregated_data_20241121_extra_gjeformat.csv`: The data which powers the Green Jobs Explorer website. This is the aggregated data per occupation (SOC_EXT), with occupations with fewer than 50 job adverts removed.
2. `industry_aggregated_data_20241121.csv`: The data aggregated by SIC.
3. `all_itl_aggregated_data_20241121.csv`: The data aggregated by each of ITL 1, 2 and 3.

### Data for the Green Jobs Explorer download

Although the data that powers the GJE is produced in `create_aggregated_data.py`, there is an additional step to create a nicely formatted xlsx dataset, with information sheets about the column names etc., for users to download.

This is created by running:

```
python dap_prinz_green_jobs/analysis/ojo_analysis/create_open_gje_data.py
```

This script essentially renames some columns, deletes others, and creates a data explanation sheet to go alongside the dataset.
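The renaming/dropping and the data-explanation sheet can be sketched with pandas (all column names and descriptions here are made up for illustration, not the project's real schema):

```python
import pandas as pd

# Toy aggregated data with internal column names (illustrative only).
df = pd.DataFrame(
    {
        "SOC_EXT": ["2136/01", "2425/02"],
        "PROP_GREEN": [0.42, 0.13],
        "NUM_SPLIT_ENTS": [120, 95],  # internal column, dropped for the download
    }
)

# Rename columns to reader-friendly names and drop internal ones.
open_df = df.rename(
    columns={"SOC_EXT": "Occupation code", "PROP_GREEN": "Proportion of green skills"}
).drop(columns=["NUM_SPLIT_ENTS"])

# A data-explanation sheet describing each remaining column, intended to
# sit alongside the data as a second sheet in the downloadable workbook.
explanation = pd.DataFrame(
    {
        "Column": open_df.columns,
        "Description": [
            "Extended SOC occupation code",
            "Mean proportion of skills per advert that are green",
        ],
    }
)
print(open_df.columns.tolist())
```

Writing `open_df` and `explanation` to two sheets of one workbook (e.g. via `pd.ExcelWriter`) would then produce the kind of xlsx download described above.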

The outputs are saved to:
- `s3://nesta-open-data/green_jobs_explorer/occupation_aggregated_data_20241121_GJE.xlsx`
- `s3://nesta-open-data/green_jobs_explorer/industry_aggregated_data_20241121_GJE.xlsx`
- `s3://nesta-open-data/green_jobs_explorer/region_aggregated_data_20241121_GJE.xlsx`
125 changes: 0 additions & 125 deletions dap_prinz_green_jobs/analysis/ojo_analysis/aggregate_by_region.py

This file was deleted.

105 changes: 0 additions & 105 deletions dap_prinz_green_jobs/analysis/ojo_analysis/aggregate_by_sic.py

This file was deleted.
