Add new tasks create_doi_sunet and contribs #67
Conversation
```python
from rialto_airflow.harvest.sul_pub import sul_pub_csv
from rialto_airflow.harvest.doi_set import create_doi_set
from rialto_airflow.utils import create_snapshot_dir, rialto_authors_file
```
This was mostly ruff reordering the imports.
```diff
@@ -70,12 +70,22 @@ def sul_pub_harvest(snapshot_dir):
     return str(csv_file)


 @task()
 def doi_set(dimensions, openalex, sul_pub):
```
This is the new task to create the DOI -> [SUNETID] mapping pickle file.
```diff
@@ -105,18 +115,12 @@ def merge_publications(sul_pub, openalex_pubs, dimensions_pubs, snapshot_dir):
     return str(output)


 @task()
```
The join to authors is handled as part of pubs_to_contribs now.
```diff
     """
-    return create_doi_set(dimensions, openalex, sul_pub)
+    return list(pickle.load(open(doi_sunet_pickle, "rb")).keys())
```
The step of creating the list of DOIs from the `doi_sunet.pickle` is now so simple it's put inline here.
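For context, that inline expression just loads the pickled mapping and takes its keys. A minimal round-trip with a toy mapping (the file name and contents here are made up for illustration):

```python
import pickle
import tempfile

# A toy doi -> [sunetid] mapping standing in for the real doi_sunet.pickle.
mapping = {"10.1234/abc": ["jdoe"], "10.5678/def": ["asmith", "jdoe"]}

with tempfile.NamedTemporaryFile(suffix=".pickle", delete=False) as fh:
    pickle.dump(mapping, fh)
    doi_sunet_pickle = fh.name

# The doi_set task now just extracts the DOIs (the dict keys).
dois = list(pickle.load(open(doi_sunet_pickle, "rb")).keys())
print(dois)  # ['10.1234/abc', '10.5678/def']
```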
rialto_airflow/harvest/doi_set.py
`harvest.doi_set` was renamed to `harvest.doi_sunet`.
This is a complicated bit of code. The `doi_sunet` module contains the logic for creating a `doi -> [sunetid]` mapping by using:

- the `doi -> [orcid]` mappings in the dimensions and openalex pickle files
- the `doi -> [cap_profile_id]` mapping present in the sul_pub.csv

and then generating a `doi -> [sunetid]` by looking up the `sunetid` in the authors.csv using the respective `orcid` or `cap_profile_id`.
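The composition described above can be sketched as follows. This is a minimal illustration, not the module's real API: the function and variable names are invented, and the lookup tables stand in for data loaded from the pickle files and authors.csv.

```python
def doi_sunet(doi_orcid, doi_cap_id, orcid_to_sunet, cap_id_to_sunet):
    """Combine doi -> [orcid] and doi -> [cap_profile_id] mappings into
    doi -> [sunetid], using lookup tables built from authors.csv."""
    mapping = {}
    for doi, orcids in doi_orcid.items():
        ids = {orcid_to_sunet[o] for o in orcids if o in orcid_to_sunet}
        mapping.setdefault(doi, set()).update(ids)
    for doi, cap_ids in doi_cap_id.items():
        ids = {cap_id_to_sunet[c] for c in cap_ids if c in cap_id_to_sunet}
        mapping.setdefault(doi, set()).update(ids)
    # sort for stable output; duplicates across sources collapse via the set
    return {doi: sorted(ids) for doi, ids in mapping.items()}


# Toy inputs: one DOI known from both sources, one only from sul_pub.
doi_orcid = {"10.1/a": ["0000-0001"]}
doi_cap_id = {"10.1/a": ["cap42"], "10.2/b": ["cap43"]}
orcid_to_sunet = {"0000-0001": "jdoe"}
cap_id_to_sunet = {"cap42": "jdoe", "cap43": "asmith"}

print(doi_sunet(doi_orcid, doi_cap_id, orcid_to_sunet, cap_id_to_sunet))
# {'10.1/a': ['jdoe'], '10.2/b': ['asmith']}
```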
rialto_airflow/harvest/doi_sunet.py
```python
df = df[df["doi"].notna()]


def extract_cap_ids(authors):
    return [a["cap_profile_id"] for a in eval(authors)]
```
I think this needs to account for the `status` being `approved`, and not use those that have a status of `unknown` or `denied`.
Once the initial DOI collection process is complete we know the population of DOIs we are working with in the dataset. We are also able to map each DOI to a SUNETID using either the `orcidid` (for Dimensions and OpenAlex) or the `cap_profile_id` (for sul_pub).

The new `doi_sunet` task will create a mapping of `doi -> [sunetid]` using the pickle files, the sul_pub CSV and the authors CSV. This is then used by the `doi_set` task to generate the list of DOIs needed for harvesting.

Once the publications datasets are merged, the new `contribs` task uses the `doi_sunet` mapping to add the `sunetid` column and split the publications into contributions, where each row has a unique `sunetid`. Finally the contributions are joined with the `authors.csv`.

Closes #33
Closes #34
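The split-and-join step can be sketched with pandas. This is an illustration of the approach, not the task's actual code: the column names and toy data are assumptions.

```python
import pandas as pd

# Toy merged publications and authors tables.
pubs = pd.DataFrame({"doi": ["10.1/a", "10.2/b"], "title": ["A", "B"]})
doi_sunet = {"10.1/a": ["jdoe", "asmith"], "10.2/b": ["jdoe"]}
authors = pd.DataFrame(
    {"sunetid": ["jdoe", "asmith"], "department": ["History", "Physics"]}
)

# Attach the sunetid list to each publication, then explode so each row
# is a single (doi, sunetid) contribution.
pubs["sunetid"] = pubs["doi"].map(doi_sunet)
contribs = pubs.explode("sunetid")

# Join the contributions to the authors table on sunetid.
contribs = contribs.merge(authors, on="sunetid", how="left")
print(contribs[["doi", "sunetid", "department"]])
```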