Missing variants #32

Open
bcrone opened this issue Apr 13, 2017 · 4 comments

Comments

@bcrone

bcrone commented Apr 13, 2017

Hello,

I have come across a handful of variants that were present in previous versions of the curated ClinVar TSV and are present in the ClinVar database, but are missing from the most recent version. I am using both the clinvar_alleles.single.b37.tsv and clinvar_alleles.single.b37.vcf committed on 2017-03-13:

10 55582650 . A ATGTT
2 179325735 rs17304212 C G
2 179325735 rs17304212 C T
2 179325816 rs79399438 G A
3 150690352 rs111033258 A C
5 149740732 rs56180593 C T

Is there a reason these are no longer present? Were they inadvertently dropped? Or am I missing them in either the TSV or VCF?

@simnim

simnim commented Apr 26, 2017

I'm not seeing the first one either, but the others look present to me:

gzcat clinvar_alleles.single.b37.tsv.gz | egrep '\s(55582650|179325735|179325816|150690352|149740732)\s' | cut -f-4
2	179325735	C	G
2	179325735	C	T
2	179325816	G	A
3	150690352	A	C
5	149740732	C	T

Actually, after looking more closely, the first one looks like it's now https://www.ncbi.nlm.nih.gov/clinvar/variation/227010/
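
The egrep check above matches on position only, so an entry at the same position with different alleles would also count as a hit. A minimal Python sketch of a stricter check follows. It assumes the first four TSV columns are chrom, pos, ref and alt (as the `cut -f-4` output suggests); the function name `find_missing` and the file path in the usage comment are illustrative, not part of the repository.

```python
import gzip


def find_missing(lines, wanted):
    """Return the wanted (chrom, pos, ref, alt) keys that never appear in lines.

    Assumes tab-separated lines whose first four columns are
    chrom, pos, ref, alt. A header line simply fails to match any key.
    """
    remaining = set(wanted)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        remaining.discard(tuple(fields[:4]))
    return remaining


# The six variants listed in the original report, as string tuples.
WANTED = {
    ("10", "55582650", "A", "ATGTT"),
    ("2", "179325735", "C", "G"),
    ("2", "179325735", "C", "T"),
    ("2", "179325816", "G", "A"),
    ("3", "150690352", "A", "C"),
    ("5", "149740732", "C", "T"),
}

# Usage against the real file (the path is an assumption):
# with gzip.open("clinvar_alleles.single.b37.tsv.gz", "rt") as handle:
#     print(sorted(find_missing(handle, WANTED)))
```

Matching on the full (chrom, pos, ref, alt) tuple also distinguishes the two alleles reported at 2:179325735.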

@XiaoleiZ
Collaborator

XiaoleiZ commented May 9, 2017

I could not find the first one on the ClinVar website either. If you can provide its ClinVar Allele ID, I can check it.

@lacek

lacek commented Aug 29, 2017

I am not sure whether what is described below is related to the original issue from @bcrone, but it does fit the topic.

There are indeed ClinVar records (both XML and TXT) that lack an alternate allele or a reference allele. For example, the SNP rs397704725 with Allele ID 15083 can be found in the ClinVar VCF/TXT/XML but not in clinvar_alleles.single.b37.tsv. This is because the parser script simply skips records without ref/alt.

For records in dbSNP, we could look up ref/alt from dbSNP data. Otherwise, it probably requires parsing the HGVS representation, which is not a trivial task.

@oliverking

As noted by @lacek above, the script parse_clinvar_xml.py currently ignores those ClinVarSet elements that do not have referenceAllele and alternateAllele defined. But as of the July 2017 ClinVar XML release, the XML attributes referenceAlleleVCF, alternateAlleleVCF and positionVCF are defined for many of the ClinVarSet elements; these often appear in addition to the referenceAllele/alternateAllele attributes, but sometimes also in their absence.

For the March 2018 ClinVar XML release, this parser would output nearly 4000 additional RCV accessions (almost all of them indels) if it were modified to accept the referenceAlleleVCF/alternateAlleleVCF attributes as well as the referenceAllele/alternateAllele attributes. Ignoring the referenceAllele/alternateAllele attributes altogether and using only the referenceAlleleVCF/alternateAlleleVCF attributes would cause the parser to miss fewer than 100 RCV accessions. At a quick glance, some of these have irregularities (e.g. alternateAllele Y and R), so perhaps they lack referenceAlleleVCF/alternateAlleleVCF attributes for a reason.
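
A rough way to reproduce this kind of tally is to classify each location element by which allele attribute pairs it carries. The sketch below is only illustrative: the attribute names come from the discussion above, while the `SequenceLocation` element name, the `Chr` attribute, the `classify` helper and the inline sample XML are assumptions made for the example, not taken from parse_clinvar_xml.py.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def classify(attrib):
    """Report which ref/alt attribute pairs are present on one element."""
    has_plain = "referenceAllele" in attrib and "alternateAllele" in attrib
    has_vcf = "referenceAlleleVCF" in attrib and "alternateAlleleVCF" in attrib
    if has_plain and has_vcf:
        return "both"
    if has_vcf:
        return "vcf_only"    # skipped by a parser that requires the plain pair
    if has_plain:
        return "plain_only"  # lost if only the VCF pair is consulted
    return "neither"


# Hypothetical sample data; a real ClinVar release is far larger.
SAMPLE = """<Root>
  <SequenceLocation Chr="1" start="100" referenceAllele="A" alternateAllele="G"
                    positionVCF="100" referenceAlleleVCF="A" alternateAlleleVCF="G"/>
  <SequenceLocation Chr="2" start="201" positionVCF="200"
                    referenceAlleleVCF="CT" alternateAlleleVCF="C"/>
  <SequenceLocation Chr="3" start="300" referenceAllele="A" alternateAllele="Y"/>
  <SequenceLocation Chr="4" start="400"/>
</Root>"""

root = ET.fromstring(SAMPLE)
counts = Counter(classify(loc.attrib) for loc in root.iter("SequenceLocation"))
print(dict(counts))
```

For a full release, `ET.iterparse` would be a more memory-friendly choice than loading the whole document at once.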

I haven't checked the impact on all the downstream steps in the master.py script. It appears that parse_clinvar_xml.py should be modified to use the "positionVCF" attribute instead of the "start" attribute for the "pos" column of its output when referenceAlleleVCF is used, as these alleles appear to have been left-normalized; this makes the subsequent normalize.py step largely redundant for those entries with referenceAlleleVCF attributes, but it does still filter out a few hundred entries with alt = ref.
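
The position handling described above could look roughly like the following. This is a sketch under assumptions, not the actual parser code: `extract_variant` is a hypothetical helper, it takes a plain dict of XML attributes, and it prefers the VCF-style attributes with a fallback to the older ones, which is only one of the possible orderings.

```python
def extract_variant(attrib):
    """Return (pos, ref, alt) for one location's attributes, or None to skip it.

    Prefers positionVCF/referenceAlleleVCF/alternateAlleleVCF when all three
    are present, otherwise falls back to start/referenceAllele/alternateAllele.
    """
    pos = attrib.get("positionVCF")
    ref = attrib.get("referenceAlleleVCF")
    alt = attrib.get("alternateAlleleVCF")
    if not (pos and ref and alt):
        # Fall back to the attributes the parser relies on today.
        pos = attrib.get("start")
        ref = attrib.get("referenceAllele")
        alt = attrib.get("alternateAllele")
    if not (pos and ref and alt):
        return None  # no usable ref/alt pair at all
    if ref == alt:
        return None  # the alt = ref case that normalize.py filters out
    return int(pos), ref, alt


# A deletion where positionVCF differs from start (hypothetical values).
print(extract_variant({
    "start": "201",
    "positionVCF": "200",
    "referenceAlleleVCF": "CT",
    "alternateAlleleVCF": "C",
}))
```

Keeping the position and the alleles from the same attribute family avoids pairing a left-normalized allele with an unshifted start coordinate.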
