Commit

Merge pull request #20 from sul-dlss-labs/docs
revise docs
edsu authored Jun 19, 2024
2 parents 9fa6a6e + 709064e commit 5ff1640
Showing 1 changed file with 31 additions and 12 deletions.
43 changes: 31 additions & 12 deletions README.md
@@ -1,19 +1,38 @@
# rialto-airflow

[![.github/workflows/test.yml](https://github.com/sul-dlss-labs/rialto-airflow/actions/workflows/test.yml/badge.svg)](https://github.com/sul-dlss-labs/rialto-airflow/actions/workflows/test.yml)
Airflow for harvesting data for open access analysis and research intelligence. The workflow integrates data from [sul_pub](https://github.com/sul-dlss/sul_pub), [rialto-orgs](https://github.com/sul-dlss/rialto-orgs), [OpenAlex](https://openalex.org/) and [Dimensions](https://www.dimensions.ai/) APIs to provide a view of publication data for Stanford University research. The basic workflow is: fetch Stanford Research publications from sul_pub, look those publications up in OpenAlex and Dimensions using the DOI, merge the author/department information found in [rialto_orgs], and publish the data to our JupyterHub environment.

Airflow for harvesting data for open access analysis and research intelligence. The workflow integrates data from [sul_pub](https://github.com/sul-dlss/sul_pub), [rialto-orgs](https://github.com/sul-dlss/rialto-orgs), [OpenAlex](https://openalex.org/) and [Dimensions](https://www.dimensions.ai/) APIs to provide a view of publication data for Stanford University research. The basic workflow is: fetch Stanford Research publications from SUL-Pub, OpenAlex, and Dimensions, enrich them with additional metadata from OpenAlex and Dimensions using the DOI, merge the organizational data found in [rialto_orgs], and publish the data to our JupyterHub environment.

```mermaid
flowchart TD
last_harvest(Determine last harvest) --> sul_pub(Publications from sul_pub)
sul_pub --> extract_doi(Extract DOIs)
extract_doi -- DOI --> openalex(OpenAlex)
extract_doi -- DOI --> dimensions(Dimensions)
dimensions --> merge_pubs(Merge Publications)
openalex --> merge_pubs(Merge Publications)
merge_pubs -- SUNETID --> join_departments(Join Departments)
join_departments --> publish(Publish)
last_harvest(Determine last harvest) --> sul_pub_harvest(SUL-Pub harvest)
sul_pub_harvest --> sul_pub_pubs[/SUL-Pub publications/]
rialto_orgs_export(Manual RIALTO app export) --> org_data[/Stanford organizational data/]
last_harvest --> dimensions_harvest_orcid(Dimensions harvest ORCID)
last_harvest --> openalex_harvest_orcid(OpenAlex harvest ORCID)
org_data --> dimensions_harvest_orcid
org_data --> openalex_harvest_orcid
dimensions_harvest_orcid --> dimensions_orcid_doi_dict[/Dimensions ORCID-DOI dictionary/]
openalex_harvest_orcid --> openalex_orcid_doi_dict[/OpenAlex ORCID-DOI dictionary/]
dimensions_orcid_doi_dict -- DOI --> doi_set(DOI set)
openalex_orcid_doi_dict -- DOI --> doi_set(DOI set)
sul_pub_pubs -- DOI --> doi_set(DOI set)
doi_set --> dois[/All unique DOIs/]
dois --> dimensions_enrich(Dimensions harvest DOI)
dois --> openalex_enrich(OpenAlex harvest DOI)
dimensions_enrich --> dimensions_enriched[/Dimensions publications/]
openalex_enrich --> openalex_enriched[/OpenAlex publications/]
dimensions_enriched -- DOI --> merge_pubs(Merge publications)
openalex_enriched -- DOI --> merge_pubs
sul_pub_pubs -- DOI --> merge_pubs
merge_pubs --> all_enriched_publications[/All publications/]
all_enriched_publications --> join_org_data(Join organizational data)
org_data --> join_org_data
join_org_data --> publication_set[/Publication set/]
publication_set -- DOI & (ORCID & SUNET) --> contributions(Publications to contributions)
contributions --> contributions_set[/Contributions set/]
contributions_set --> publish(Publish)
```
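
The "DOI set" and "Merge publications" steps in the flowchart can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the rialto-airflow code: the function names (`normalize_doi`, `doi_set`, `merge_pubs`), the shape of the ORCID-DOI dictionaries (ORCID mapped to a list of DOIs), and the `doi` field on each record are all hypothetical.

```python
from collections import defaultdict


def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip a resolver prefix (hypothetical helper)."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def doi_set(sul_pub_dois, dimensions_orcid_dois, openalex_orcid_dois):
    """Union the DOIs from the three harvests into one set of unique DOIs.

    Assumes sul_pub_dois is an iterable of DOI strings and that the two
    ORCID-DOI dictionaries map an ORCID to a list of DOI strings.
    """
    unique = {normalize_doi(doi) for doi in sul_pub_dois}
    for orcid_dois in (dimensions_orcid_dois, openalex_orcid_dois):
        for dois in orcid_dois.values():
            unique.update(normalize_doi(doi) for doi in dois)
    return unique


def merge_pubs(**sources):
    """Group per-source records by DOI as {doi: {source_name: record}}.

    Assumes every record is a dict with a "doi" key.
    """
    merged = defaultdict(dict)
    for name, records in sources.items():
        for record in records:
            merged[normalize_doi(record["doi"])][name] = record
    return dict(merged)
```

Normalizing each DOI before it enters the set is what lets the same publication, reported with different casing or a resolver prefix by different sources, collapse to a single key. The real pipeline may handle this differently.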

## Running Locally with Docker
@@ -53,7 +72,7 @@ done
uv venv
```

This will create the virtual environment at the default location of `.venv/`. `uv` automatically looks for a venv at this location when installing dependencies.

3. Activate the virtual environment:
```
@@ -70,7 +89,7 @@ To add a dependency:
2. Add the dependency to `pyproject.toml`.
3. To re-generate the locked dependencies in `requirements.txt`:
```
uv pip compile pyproject.toml -o requirements.txt
```

Unlike poetry, uv's dependency resolution is not platform-agnostic. If we find we need to generate a requirements.txt for linux, we can use [uv's multi-platform resolution options](https://github.com/astral-sh/uv?tab=readme-ov-file#multi-platform-resolution).