HELP-290 HELP-334 GlyTouCan IDs masterlist for June submission #2

Closed
jeremywalter opened this issue Nov 21, 2022 · 4 comments

@jeremywalter
Contributor

Attached is the CSV dataset containing all GlyGen GlyTouCan accessions, each with its PubChem mapping status and cross-references (xrefs). Here are a few rows. Let me know if this works.

| glytoucan_ac | status | xref_id | xref_key |
|---|---|---|---|
| G00023MO | PubChem crossref exists | 91846235:252277270 | glycan_xref_pubchem_compound:glycan_xref_pubchem_substance |
| G00024MO | PubChem crossref exists | 11375554:252288623 | glycan_xref_pubchem_compound:glycan_xref_pubchem_substance |
| G00025AJ | PubChem crossref exists | 91857678:252290930 | glycan_xref_pubchem_compound:glycan_xref_pubchem_substance |
| G00025MO | PubChem crossref exists | 5288428:252293186 | glycan_xref_pubchem_compound:glycan_xref_pubchem_substance |
| G00025YC | No PubChem crossref exists | | |
| G00026MO | No PubChem crossref exists | | |
| G00027JG | No PubChem crossref exists | | |
| G00027MO | PubChem crossref exists | 91859643:252293273 | glycan_xref_pubchem_compound:glycan_xref_pubchem_substance |
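
For anyone consuming the file, here is a minimal sketch of one way to unpack the colon-paired `xref_id` and `xref_key` columns shown in the rows above. The function name `parse_xrefs` is hypothetical, and the comma-delimited layout of `SAMPLE` is an assumption based only on the file being described as a CSV; the actual attachment may be laid out differently.

```python
import csv
import io

# Two sample rows taken from the message above.
# ASSUMPTION: the attachment is comma-delimited with this header row.
SAMPLE = (
    "glytoucan_ac,status,xref_id,xref_key\n"
    "G00023MO,PubChem crossref exists,91846235:252277270,"
    "glycan_xref_pubchem_compound:glycan_xref_pubchem_substance\n"
    "G00025YC,No PubChem crossref exists,,\n"
)


def parse_xrefs(row):
    """Pair each colon-separated xref_key with the xref_id at the same position."""
    if not row["xref_id"]:
        return {}  # row has no PubChem cross-reference
    ids = row["xref_id"].split(":")
    keys = row["xref_key"].split(":")
    if len(ids) != len(keys):
        raise ValueError(f"mismatched xref fields for {row['glytoucan_ac']}")
    return dict(zip(keys, ids))


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
xrefs = {row["glytoucan_ac"]: parse_xrefs(row) for row in rows}
```

With these sample rows, `xrefs["G00023MO"]` maps the compound key to `91846235` and the substance key to `252277270`, while `xrefs["G00025YC"]` is an empty dict.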

Best,

Jeet Vora
Senior Research Associate
Scientific Coordinator for GlyGen.org

Project Manager for Glycosciences-NIH CFDE

The George Washington University
Ross Hall, Room 559
2300 Eye Street N.W.
Washington, DC 20052
[email protected]

Pronouns - He/him/his

On Fri, Jun 3, 2022 at 2:37 PM Jeet Vora <[email protected]> wrote:

Hi Arthur,

I can provide you with the dataset as requested. For this release I will share it via email or from an online folder, but for the next release it will have a stable URL on data.glygen.org.

Will share the dataset once compiled.

Best,

Jeet Vora

On Thu, Jun 2, 2022 at 2:32 PM Rene Ranzinger <[email protected]> wrote:

________________________________________
From: Arthur Brady <[email protected]>
Sent: Thursday, June 2, 2022 1:28 PM
To: Rene Ranzinger
Subject: HELP-290: GlyTouCan IDs masterlist for June submission (IN PROGRESS)

Arthur Brady commented:

Summary: we need a way to access an up-to-date map from GlyTouCan IDs to equivalent PubChem IDs. You can provide it to us however you would like.* I would request that whatever format you choose (API or file) be able to express whether or not a given GlyTouCan ID exists at all: i.e. it should recognize GlyTouCan terms with no associated PubChem ID and return a “no PubChem crossref exists” response which is distinct from the “requested GlyTouCan ID doesn’t exist” response.

*as long as your API can handle either (1) lots of little queries, fast, or (2) a bulk query for the whole dataset, because I’ll need to essentially grab the whole thing so we can properly process any incoming IDs.
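
The three distinct outcomes requested above can be sketched as follows. This is only an illustration under stated assumptions: `XrefStatus`, `lookup`, and the in-memory `masterlist` dict are hypothetical names, the two entries are illustrative, and `G99999ZZ` is a made-up ID standing in for one that is not registered at all.

```python
from enum import Enum


class XrefStatus(Enum):
    HAS_CROSSREF = "PubChem crossref exists"
    NO_CROSSREF = "No PubChem crossref exists"
    UNKNOWN_ID = "requested GlyTouCan ID doesn't exist"


def lookup(masterlist, glytoucan_ac):
    """Return (status, pubchem_ids) for one GlyTouCan accession.

    `masterlist` maps every known GlyTouCan accession to a (possibly
    empty) list of PubChem IDs, so "known but unmapped" and "unknown"
    stay distinguishable.
    """
    if glytoucan_ac not in masterlist:
        return XrefStatus.UNKNOWN_ID, []
    pubchem_ids = masterlist[glytoucan_ac]
    if pubchem_ids:
        return XrefStatus.HAS_CROSSREF, pubchem_ids
    return XrefStatus.NO_CROSSREF, []


# ASSUMPTION: illustrative entries only; a real masterlist would be
# loaded in bulk from whatever file or API is eventually provided.
masterlist = {
    "G00023MO": ["91846235", "252277270"],
    "G00025YC": [],
}

for accession in ("G00023MO", "G00025YC", "G99999ZZ"):
    status, ids = lookup(masterlist, accession)
    print(accession, status.value, ids)
```

Loading the whole map once and answering lookups locally corresponds to option (2), the bulk query.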

Sent on June 2, 2022 5:28:54 PM GMT

@jeremywalter
Contributor Author

[email protected]
August 17, 2022 at 3:53 PM
Hi Arthur,

Based on the June 15 email (see below), I have checked the filtered-out proteins in GlyGen and compiled the reason each one was filtered out (you also provided example reasons). The main cause is that GlyGen and CFDE use different UniProt releases. The problem arises when accessions have become obsolete or have been merged into other entries. We will filter out such accessions in the future.

However, I found no reason for the following three accessions. They are unreviewed mouse proteins and exist in UniProt. If you know why they were filtered out, please let us know.

D3YTX5
D3Z7A4
E9Q7U8


[email protected]
June 23, 2022 at 4:43 PM
Hi Arthur,

Yes, thanks for the explanation of the excluded protein entries. As I mentioned, they result from the different UniProt versions.

For example, Q6ZW33 is a protein in GlyGen but has recently been replaced by O94851. Once the submission is done, I will look into all the excluded entries.

As for G06850XD, yes, it is not in the mapping file. I am checking which file it appears in and why it is not part of the masterlist and mapping file.

No action is needed from your end for these entries. We are omitting them for now.

Again, thanks for your help and explanation.

Arthur Brady
June 23, 2022 at 4:19 PM
Edited
Hi Jeet,

I addressed these below; please reread the comment here for an explanation of why those IDs are failing. As I said below, you will need to remove IDs that have been deleted from UniProt (e.g. Q6ZRZ4); remove IDs that came from the wrong database entirely (e.g. Q6ZW33 is a PRO ID, not a UniProtKB accession); and ensure that you use only primary accessions and not secondary accessions (e.g. P62861 is the primary accession for P35544).
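
The clean-up rule described above could be sketched roughly like this. The names `classify_accession`, `primary`, and `secondary_to_primary` are hypothetical, and the two lookup tables here hold only the examples from this message; in practice they would have to be built from the UniProt release the submission is validated against, which is not shown.

```python
def classify_accession(acc, primary, secondary_to_primary):
    """Decide what to do with one accession before submission.

    Returns ("keep", acc) for a primary accession,
    ("replace", primary_acc) for a secondary accession, and
    ("remove", None) for anything else (deleted entries or IDs
    that are not UniProtKB accessions at all).
    """
    if acc in primary:
        return ("keep", acc)
    if acc in secondary_to_primary:
        return ("replace", secondary_to_primary[acc])
    return ("remove", None)


# ASSUMPTION: toy tables containing only the examples quoted above.
primary = {"P62861"}
secondary_to_primary = {"P35544": "P62861"}

for acc in ("P62861", "P35544", "Q6ZRZ4", "Q6ZW33"):
    print(acc, classify_accession(acc, primary, secondary_to_primary))
```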

As for the glycan G06850XD, there is no mention of that ID in the GlyTouCan ID mapping file you sent me.

Best,

Arthur


[email protected]
June 22, 2022 at 6:15 PM
ccing Rene


[email protected]
June 22, 2022 at 5:05 PM

Thanks Arthur,

While running the prep_script we came across 1 glycan and 25 proteins that were flagged. I am looking into them to find the reason for their exclusion. For now we have submitted 18 files, and there are a couple of other issues that need to be resolved. These may be arising because we are using different UniProt versions.

Glycan
G06850XD

Protein
A0A087WX78
A0A0B4J2J1
A6NLF2
A8MVS1
D3YTX5
D3Z7A4
E9Q7U8
P0C7X5
P35544
Q3SY89
Q4G091
Q5ND19
Q6ZRZ4
Q6ZW33
Q8C7R2
Q8N6G1
Q8NG57
Q8VCG1
Q8WTZ3
Q96KH6
Q96NR2
Q9D5U9
Q9NY84
Q9UJ94

@jeremywalter jeremywalter changed the title HELP-290 GlyTouCan IDs masterlist for June submission HELP-290 HELP 334 GlyTouCan IDs masterlist for June submission Nov 21, 2022
@jeremywalter jeremywalter changed the title HELP-290 HELP 334 GlyTouCan IDs masterlist for June submission HELP-290 HELP-334 GlyTouCan IDs masterlist for June submission Nov 21, 2022
@ReneRanzinger
Member

This is related to glygener/glygen.cfde.generator#18.

@jonathancrabtree

Closing this case: as per my recent comment in issue #4, I've confirmed that Arthur incorporated the attached mapping file (gtc_pubchem_xref_status.csv, MD5 b6e820ac60c0ba0b2633cbb1a58938a8) back in mid June of 2022. See issue #4 for progress/updates on incorporating the latest version from the GlyGen-provided URL.
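
For anyone who wants to confirm they have the same copy of the mapping file, a checksum comparison along these lines should work. This is a minimal sketch: `md5_of_file` and `matches_expected` are hypothetical helper names, and the local path passed in is up to the caller.

```python
import hashlib

# MD5 quoted in the comment above for gtc_pubchem_xref_status.csv.
EXPECTED_MD5 = "b6e820ac60c0ba0b2633cbb1a58938a8"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """True if the file at `path` has the expected MD5 digest."""
    return md5_of_file(path) == expected
```

For example, `matches_expected("gtc_pubchem_xref_status.csv")` would return `True` only for a byte-identical copy of the attached file.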

With respect to the 3 mouse UniProt accessions mentioned in the August 17 e-mail above (D3YTX5, D3Z7A4, E9Q7U8), I don't know if this was resolved, but to me it looks like--strictly speaking--those are prefixes of UniProt names, rather than UniProt accessions per se. If I search the protein.tsv.gz file for those 3 ids, I can find them, but not in the 'id' column, only in the 'name' column, with "_MOUSE" as a suffix:

E9Q8U4  D3YTX5_MOUSE    Taste receptor type 2   []      NCBI:txid10090
E9Q2E3  E9Q7U8_MOUSE            []      NCBI:txid10090
E9QA13  D3Z7A4_MOUSE            []      NCBI:txid10090

I'd suggest using the actual accession numbers (E9Q8U4, E9Q2E3, E9QA13), as this looks like it may be another instance of the issue that Arthur had already flagged in his June 23rd e-mail, namely ensuring that only primary accessions are used to reference UniProt proteins.
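
The check I ran can be approximated with the sketch below. It is only illustrative: `locate` is a hypothetical helper, the three data rows are the ones quoted above, and the header row is an assumption beyond the 'id' and 'name' columns (the other column names are guesses at what the remaining fields are called).

```python
import csv
import io

# ASSUMPTION: header names other than 'id' and 'name' are guesses.
PROTEIN_TSV = (
    "id\tname\tdescription\tsynonyms\torganism\n"
    "E9Q8U4\tD3YTX5_MOUSE\tTaste receptor type 2\t[]\tNCBI:txid10090\n"
    "E9Q2E3\tE9Q7U8_MOUSE\t\t[]\tNCBI:txid10090\n"
    "E9QA13\tD3Z7A4_MOUSE\t\t[]\tNCBI:txid10090\n"
)


def locate(query_ids, tsv_text):
    """Report where each query ID appears: the 'id' column or a 'name' prefix.

    Returns {query: (column, accession_of_matching_row)} with None for
    queries that match neither column.
    """
    found = {q: None for q in query_ids}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        name_prefix = row["name"].split("_")[0]
        for q in query_ids:
            if row["id"] == q:
                found[q] = ("id", row["id"])  # exact accession match wins
            elif found[q] is None and name_prefix == q:
                found[q] = ("name", row["id"])
    return found


result = locate(["D3YTX5", "D3Z7A4", "E9Q7U8", "E9Q2E3"], PROTEIN_TSV)
for query, hit in result.items():
    print(query, hit)
```

On these rows, the three flagged IDs match only as name prefixes and resolve to E9Q8U4, E9QA13, and E9Q2E3 respectively, whereas E9Q2E3 itself matches in the 'id' column.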
