diff --git a/README.md b/README.md
index 5d90d4b..b33f36d 100644
--- a/README.md
+++ b/README.md
@@ -1,19 +1,38 @@
 # rialto-airflow
 
 [![.github/workflows/test.yml](https://github.com/sul-dlss-labs/rialto-airflow/actions/workflows/test.yml/badge.svg)](https://github.com/sul-dlss-labs/rialto-airflow/actions/workflows/test.yml)
-
-Airflow for harvesting data for open access analysis and research intelligence. The workflow is integrates data from [sul_pub](https://github.com/sul-dlss/sul_pub), [rialto-orgs](https://github.com/sul-dlss/rialto-orgs), [OpenAlex](https://openalex.org/) and [Dimensions](https://www.dimensions.ai/) APIs to provide a view of publication data for Stanford University research. The basic workflow is: fetch Stanford Research publications from sul_pub, look those publications up in OpenAlex and Dimensions using the DOI, merge the the author/department information found in [rialto_orgs], and publish the data to our JupyterHub environment.
+
+Airflow for harvesting data for open access analysis and research intelligence. The workflow integrates data from [sul_pub](https://github.com/sul-dlss/sul_pub), [rialto-orgs](https://github.com/sul-dlss/rialto-orgs), [OpenAlex](https://openalex.org/) and [Dimensions](https://www.dimensions.ai/) APIs to provide a view of publication data for Stanford University research. The basic workflow is: fetch Stanford Research publications from SUL-Pub, OpenAlex, and Dimensions, enrich them with additional metadata from OpenAlex and Dimensions using the DOI, merge the organizational data found in [rialto_orgs], and publish the data to our JupyterHub environment.
 
 ```mermaid
 flowchart TD
-  last_harvest(Determine last harvest) --> sul_pub(Publications from sul_pub)
-  sul_pub --> extract_doi(Extract DOIs)
-  extract_doi -- DOI --> openalex(OpenAlex)
-  extract_doi -- DOI --> dimensions(Dimensions)
-  dimensions --> merge_pubs(Merge Publications)
-  openalex --> merge_pubs(Merge Publications)
-  merge_pubs -- SUNETID --> join_departments(Join Departments)
-  join_departments --> publish(Publish)
+  last_harvest(Determine last harvest) --> sul_pub_harvest(SUL-Pub harvest)
+  sul_pub_harvest --> sul_pub_pubs[/SUL-Pub publications/]
+  rialto_orgs_export(Manual RIALTO app export) --> org_data[/Stanford organizational data/]
+  last_harvest --> dimensions_harvest_orcid(Dimensions harvest ORCID)
+  last_harvest --> openalex_harvest_orcid(OpenAlex harvest ORCID)
+  org_data --> dimensions_harvest_orcid
+  org_data --> openalex_harvest_orcid
+  dimensions_harvest_orcid --> dimensions_orcid_doi_dict[/Dimensions ORCID-DOI dictionary/]
+  openalex_harvest_orcid --> openalex_orcid_doi_dict[/OpenAlex ORCID-DOI dictionary/]
+  dimensions_orcid_doi_dict -- DOI --> doi_set(DOI set)
+  openalex_orcid_doi_dict -- DOI --> doi_set(DOI set)
+  sul_pub_pubs -- DOI --> doi_set(DOI set)
+  doi_set --> dois[/All unique DOIs/]
+  dois --> dimensions_enrich(Dimensions harvest DOI)
+  dois --> openalex_enrich(OpenAlex harvest DOI)
+  dimensions_enrich --> dimensions_enriched[/Dimensions publications/]
+  openalex_enrich --> openalex_enriched[/OpenAlex publications/]
+  dimensions_enriched -- DOI --> merge_pubs(Merge publications)
+  openalex_enriched -- DOI --> merge_pubs
+  sul_pub_pubs -- DOI --> merge_pubs
+  merge_pubs --> all_enriched_publications[/All publications/]
+  all_enriched_publications --> join_org_data(Join organizational data)
+  org_data --> join_org_data
+  join_org_data --> publication_set[/Publication set/]
+  publication_set -- DOI & (ORCID & SUNET) --> contributions(Publications to contributions)
+  contributions --> contributions_set[/Contributions set/]
+  contributions_set --> publish(Publish)
 ```
 
 ## Running Locally with Docker
@@ -53,7 +72,7 @@ done
 uv venv
 ```
 
-This will create the virtual environment at the default location of `.venv/`. `uv` automatically looks for a venv at this location when installing dependencies.
+This will create the virtual environment at the default location of `.venv/`. `uv` automatically looks for a venv at this location when installing dependencies.
 
 3. Activate the virtual environment:
 ```
@@ -70,7 +89,7 @@ To add a dependency:
 2. Add the dependency to `pyproject.toml`.
 3. To re-generate the locked dependencies in `requirements.txt`:
 ```
-uv pip compile pyproject.toml -o requirements.txt
+uv pip compile pyproject.toml -o requirements.txt
 ```
 
 Unlike poetry, uv's dependency resolution is not platform-agnostic. If we find we need to generate a requirements.txt for linux, we can use [uv's multi-platform resolution options](https://github.com/astral-sh/uv?tab=readme-ov-file#multi-platform-resolution).