
Add new tasks create_doi_sunet and contribs #67

Merged: 1 commit merged into main on Jul 1, 2024
Conversation

edsu (Contributor) commented Jul 1, 2024

Once the initial DOI collection process is complete we know the population of DOIs we are working with in the dataset. We are also able to map the DOI to a SUNETID using either the orcidid (for Dimensions and OpenAlex) or cap_profile_id (for sul_pub).

The new doi_sunet task will create a mapping of doi -> [sunetid] using the pickle files, sul_pub csv and the authors csv. This is then used by the doi_set task to generate the list of DOIs needed for harvesting.

Once the publications datasets are merged, the new contribs task uses the doi_sunet mapping to add the sunetid column and split the publications into contributions, so that each row has a unique sunetid. Finally, the contributions are joined with the authors.csv.
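The contribs step described above can be sketched roughly as follows. This is an illustrative sketch, not the actual implementation; the function and column names are assumptions:

```python
import pandas as pd


def pubs_to_contribs_sketch(pubs: pd.DataFrame, doi_sunet: dict) -> pd.DataFrame:
    """Illustrative sketch: split merged publications into per-author contributions.

    doi_sunet is assumed to map each DOI to a list of sunetids.
    """
    pubs = pubs.copy()
    # attach the list of sunetids for each publication's DOI
    pubs["sunetid"] = pubs["doi"].map(doi_sunet)
    # one row per (publication, sunetid) pair
    contribs = pubs.explode("sunetid")
    # drop publications whose DOI had no sunetid mapping
    return contribs[contribs["sunetid"].notna()]
```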

@edsu edsu force-pushed the doi-sunet-contribs branch from 168600f to d866690 Compare July 1, 2024 11:55
@edsu edsu marked this pull request as ready for review July 1, 2024 12:17
from rialto_airflow.harvest.sul_pub import sul_pub_csv
from rialto_airflow.harvest.doi_set import create_doi_set

from rialto_airflow.utils import create_snapshot_dir, rialto_authors_file
edsu (Contributor, Author):

This was mostly ruff reordering the imports.

@@ -70,12 +70,22 @@ def sul_pub_harvest(snapshot_dir):
return str(csv_file)

@task()
def doi_set(dimensions, openalex, sul_pub):
edsu (Contributor, Author):

This is the new task to create the DOI -> [SUNETID] mapping pickle file.
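A task that writes such a mapping pickle could look roughly like this. A minimal sketch only, assuming the mapping is already built and that the file is named doi_sunet.pickle inside the snapshot directory:

```python
import pickle
from pathlib import Path


def write_doi_sunet(mapping: dict, snapshot_dir: str) -> str:
    """Sketch: persist a doi -> [sunetid] mapping as a pickle file.

    Returns the path to the pickle, mirroring how other tasks in the
    DAG pass file paths between steps.
    """
    path = Path(snapshot_dir) / "doi_sunet.pickle"
    with path.open("wb") as fh:
        pickle.dump(mapping, fh)
    return str(path)
```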

@@ -105,18 +115,12 @@ def merge_publications(sul_pub, openalex_pubs, dimensions_pubs, snapshot_dir):
return str(output)

@task()
edsu (Contributor, Author):

The join to authors is handled as part of pubs_to_contribs now.
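As a rough illustration of that join (the column names are assumptions, not taken from the actual code):

```python
import pandas as pd


def join_contribs_authors(contribs: pd.DataFrame, authors: pd.DataFrame) -> pd.DataFrame:
    """Sketch: left-join contributions with the authors table on sunetid,
    keeping every contribution row even if the author lookup misses."""
    return contribs.merge(authors, on="sunetid", how="left")
```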

"""
return create_doi_set(dimensions, openalex, sul_pub)
return list(pickle.load(open(doi_sunet_pickle, "rb")).keys())
edsu (Contributor, Author):

The step of creating the list of DOIs from the doi_sunet.pickle is now so simple it's put inline here.

edsu (Contributor, Author):

harvest.doi_set was renamed to harvest.doi_sunet

edsu (Contributor, Author):

This is a complicated bit of code. The doi_sunet module contains the logic for creating a doi -> [sunetid] mapping by using:

  • the doi -> [orcid] mappings in the dimensions and openalex pickle files
  • the doi -> [cap_profile_id] mapping present in the sul_pub.csv

and then generating a doi -> [sunetid] by looking up the sunetid in the authors.csv using the respective orcid or cap_profile_id.
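That two-stage lookup can be sketched as below. This is a simplified illustration of the logic described above, not the doi_sunet module itself; the function name and dict shapes are assumptions:

```python
from collections import defaultdict


def combine_maps(doi_orcid, doi_cap, orcid_to_sunet, cap_to_sunet):
    """Sketch: resolve orcids and cap_profile_ids to sunetids and merge
    the results into a single doi -> [sunetid] mapping."""
    doi_sunet = defaultdict(set)
    # doi -> [orcid] mappings from the Dimensions and OpenAlex pickles
    for doi, orcids in doi_orcid.items():
        for orcid in orcids:
            if orcid in orcid_to_sunet:
                doi_sunet[doi].add(orcid_to_sunet[orcid])
    # doi -> [cap_profile_id] mapping from sul_pub.csv
    for doi, cap_ids in doi_cap.items():
        for cap_id in cap_ids:
            if cap_id in cap_to_sunet:
                doi_sunet[doi].add(cap_to_sunet[cap_id])
    # a set avoids duplicate sunetids when both sources agree on an author
    return {doi: sorted(ids) for doi, ids in doi_sunet.items()}
```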

df = df[df["doi"].notna()]

def extract_cap_ids(authors):
return [a["cap_profile_id"] for a in eval(authors)]
Contributor (reviewer):

I think this needs to account for the status being approved, and not use those that have a status of unknown or denied.
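The reviewer's suggestion could look roughly like this. A sketch under stated assumptions: each author dict is assumed to carry a status key alongside cap_profile_id, and ast.literal_eval is used here instead of the eval in the diff, since it only accepts literals:

```python
import ast


def extract_cap_ids_approved(authors_field: str) -> list:
    """Sketch: parse the serialized authors column and keep only
    cap_profile_ids whose authorship status is approved."""
    authors = ast.literal_eval(authors_field)
    return [
        a["cap_profile_id"]
        for a in authors
        if a.get("status") == "approved"  # skips unknown/denied
    ]
```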

Once the initial DOI collection process is complete we know the
population of DOIs we are working with in the dataset. We are also able
to join the data to the authors.csv using either the `orcidid` (for
Dimensions and OpenAlex) or `cap_profile_id` (for sul_pub).

The doi_sunet task will create a mapping of `doi -> [sunetid]` using the
pickle files, sul_pub harvest and the authors.csv.

Once the publications datasets are merged the `doi_sunet` mapping is
used to add the `sunetid` column, and then join with the `authors.csv`.

Closes #33
Closes #34
@edsu edsu force-pushed the doi-sunet-contribs branch from 21420d7 to 9a02e82 Compare July 1, 2024 19:16
@lwrubel lwrubel merged commit 75c3720 into main Jul 1, 2024
1 check passed
@lwrubel lwrubel deleted the doi-sunet-contribs branch July 1, 2024 19:21