
Add new tasks create_doi_sunet and contribs #67

Merged: 1 commit merged into main on Jul 1, 2024
Conversation

edsu (Contributor) commented Jul 1, 2024

Once the initial DOI collection process is complete we know the population of DOIs we are working with in the dataset. We are also able to map the DOI to a SUNETID using either the orcidid (for Dimensions and OpenAlex) or cap_profile_id (for sul_pub).

The new doi_sunet task will create a mapping of doi -> [sunetid] using the pickle files, sul_pub csv and the authors csv. This is then used by the doi_set task to generate the list of DOIs needed for harvesting.

Once the publications datasets are merged, the new contribs task uses the doi_sunet mapping to add the sunetid column and split the publications into contributions, so that each row has a unique sunetid. Finally, the contributions are joined with the authors.csv.
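The contribs step described above can be sketched roughly as follows. This is an illustrative sketch, not the actual implementation; the function and column names are assumptions:

```python
import pandas as pd


def pubs_to_contribs_sketch(pubs: pd.DataFrame, doi_sunet: dict) -> pd.DataFrame:
    """Illustrative sketch: split merged publications into per-author contributions.

    doi_sunet is assumed to map each DOI to a list of sunetids.
    """
    pubs = pubs.copy()
    # attach the list of sunetids for each publication's DOI
    pubs["sunetid"] = pubs["doi"].map(doi_sunet)
    # one row per (publication, sunetid) pair
    contribs = pubs.explode("sunetid")
    # drop publications whose DOI had no sunetid mapping
    return contribs[contribs["sunetid"].notna()]
```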

@edsu edsu force-pushed the doi-sunet-contribs branch from 168600f to d866690 Compare July 1, 2024 11:55
@edsu edsu marked this pull request as ready for review July 1, 2024 12:17
from rialto_airflow.harvest.sul_pub import sul_pub_csv
from rialto_airflow.harvest.doi_set import create_doi_set

from rialto_airflow.utils import create_snapshot_dir, rialto_authors_file
edsu (Contributor, Author):

This was mostly ruff reordering the imports.

@@ -70,12 +70,22 @@ def sul_pub_harvest(snapshot_dir):
return str(csv_file)

@task()
def doi_set(dimensions, openalex, sul_pub):
edsu (Contributor, Author):

This is the new task to create the DOI -> [SUNETID] mapping pickle file.
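A task that writes such a mapping pickle could look roughly like this. A minimal sketch only, assuming the mapping is already built and that the file is named doi_sunet.pickle inside the snapshot directory:

```python
import pickle
from pathlib import Path


def write_doi_sunet(mapping: dict, snapshot_dir: str) -> str:
    """Sketch: persist a doi -> [sunetid] mapping as a pickle file.

    Returns the path to the pickle, mirroring how other tasks in the
    DAG pass file paths between steps.
    """
    path = Path(snapshot_dir) / "doi_sunet.pickle"
    with path.open("wb") as fh:
        pickle.dump(mapping, fh)
    return str(path)
```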

@@ -105,18 +115,12 @@ def merge_publications(sul_pub, openalex_pubs, dimensions_pubs, snapshot_dir):
return str(output)

@task()
edsu (Contributor, Author):

The join to authors is handled as part of pubs_to_contribs now.
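As a rough illustration of that join (the column names are assumptions, not taken from the actual code):

```python
import pandas as pd


def join_contribs_authors(contribs: pd.DataFrame, authors: pd.DataFrame) -> pd.DataFrame:
    """Sketch: left-join contributions with the authors table on sunetid,
    keeping every contribution row even if the author lookup misses."""
    return contribs.merge(authors, on="sunetid", how="left")
```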

"""
return create_doi_set(dimensions, openalex, sul_pub)
return list(pickle.load(open(doi_sunet_pickle, "rb")).keys())
edsu (Contributor, Author):

The step of creating the list of DOIs from the doi_sunet.pickle is now so simple it's put inline here.

edsu (Contributor, Author):

harvest.doi_set was renamed to harvest.doi_sunet

edsu (Contributor, Author):

This is a complicated bit of code. The doi_sunet module contains the logic for creating a doi -> [sunetid] mapping by using:

  • the doi -> [orcid] mappings in the dimensions and openalex pickle files
  • the doi -> [cap_profile_id] mapping present in the sul_pub.csv

and then generating a doi -> [sunetid] by looking up the sunetid in the authors.csv using the respective orcid or cap_profile_id.
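That two-stage lookup can be sketched as below. This is a simplified illustration of the logic described above, not the doi_sunet module itself; the function name and dict shapes are assumptions:

```python
from collections import defaultdict


def combine_maps(doi_orcid, doi_cap, orcid_to_sunet, cap_to_sunet):
    """Sketch: resolve orcids and cap_profile_ids to sunetids and merge
    the results into a single doi -> [sunetid] mapping."""
    doi_sunet = defaultdict(set)
    # doi -> [orcid] mappings from the Dimensions and OpenAlex pickles
    for doi, orcids in doi_orcid.items():
        for orcid in orcids:
            if orcid in orcid_to_sunet:
                doi_sunet[doi].add(orcid_to_sunet[orcid])
    # doi -> [cap_profile_id] mapping from sul_pub.csv
    for doi, cap_ids in doi_cap.items():
        for cap_id in cap_ids:
            if cap_id in cap_to_sunet:
                doi_sunet[doi].add(cap_to_sunet[cap_id])
    # a set avoids duplicate sunetids when both sources agree on an author
    return {doi: sorted(ids) for doi, ids in doi_sunet.items()}
```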

df = df[df["doi"].notna()]

def extract_cap_ids(authors):
return [a["cap_profile_id"] for a in eval(authors)]
Contributor (reviewer):

I think this needs to account for the status being approved, and not use those that have a status of unknown or denied.
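The reviewer's suggestion could look roughly like this. A sketch under stated assumptions: each author dict is assumed to carry a status key alongside cap_profile_id, and ast.literal_eval is used here instead of the eval in the diff, since it only accepts literals:

```python
import ast


def extract_cap_ids_approved(authors_field: str) -> list:
    """Sketch: parse the serialized authors column and keep only
    cap_profile_ids whose authorship status is approved."""
    authors = ast.literal_eval(authors_field)
    return [
        a["cap_profile_id"]
        for a in authors
        if a.get("status") == "approved"  # skips unknown/denied
    ]
```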

Once the initial DOI collection process is complete we know the
population of DOIs we are working with in the dataset. We are also able
to join the data to the authors.csv using either the `orcidid` (for
Dimensions and OpenAlex) or `cap_profile_id` (for sul_pub).

The doi_sunet task will create a mapping of `doi -> [sunetid]` using the
pickle files, sul_pub harvest and the authors.csv.

Once the publications datasets are merged the `doi_sunet` mapping is
used to add the `sunetid` column, and then join with the `authors.csv`.

Closes #33
Closes #34
@edsu edsu force-pushed the doi-sunet-contribs branch from 21420d7 to 9a02e82 Compare July 1, 2024 19:16
@lwrubel lwrubel merged commit 75c3720 into main Jul 1, 2024
1 check passed
@lwrubel lwrubel deleted the doi-sunet-contribs branch July 1, 2024 19:21