Missing variants #32
Comments
I'm not seeing the first one either, but the others look present to me
Actually, after looking more closely, the first one looks like it's now https://www.ncbi.nlm.nih.gov/clinvar/variation/227010/
I could not find the first one on the ClinVar website either. If you could provide its ClinVar Allele ID, that would make it easier to check.
I am not sure whether what I describe below is related to @bcrone's original issue, but it does fit the topic. There are indeed ClinVar records (in both XML and TXT) that lack an alternate allele or a reference allele. For example, the SNP rs397704725 with Allele ID 15083 can be found in the ClinVar VCF/TXT/XML but not in clinvar_alleles.single.b37.tsv. This is because the parser script simply skips records without ref/alt. For records that are in dbSNP, we could look up ref/alt from the dbSNP data. Otherwise, it probably requires parsing the HGVS representation, which is not a trivial task.
As noted by @lacek above, the script parse_clinvar_xml.py currently ignores ClinVarSet elements that do not have referenceAllele and alternateAllele defined. As of the July 2017 ClinVar XML release, however, the XML attributes referenceAlleleVCF, alternateAlleleVCF and positionVCF are defined for many of the ClinVarSet elements; these often appear in addition to the referenceAllele/alternateAllele attributes, but sometimes also in their absence.
For the March 2018 ClinVar XML release, this parser would output nearly 4000 additional RCV accessions, almost all of them indels, if it were modified to accept the referenceAlleleVCF/alternateAlleleVCF attributes as well as the referenceAllele/alternateAllele attributes. Ignoring the referenceAllele/alternateAllele attributes altogether and using only the referenceAlleleVCF/alternateAlleleVCF attributes would cause the parser to miss fewer than 100 RCV accessions. At a quick glance, some of these have irregularities (e.g. alternateAllele Y and R), so perhaps they lack referenceAlleleVCF/alternateAlleleVCF attributes for a reason. I haven't checked the impact on all the downstream steps in the master.py script.
It appears that parse_clinvar_xml.py should be modified to use the "positionVCF" attribute instead of the "start" attribute for the "pos" column of its output when referenceAlleleVCF is used, as these alleles appear to have been left-normalized. This makes the subsequent normalize.py step largely redundant for entries with referenceAlleleVCF attributes, although it does still filter out a few hundred entries with alt = ref.
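To make the proposed fallback concrete, here is a minimal sketch, not the actual parse_clinvar_xml.py code. It assumes the allele attributes live on `SequenceLocation` elements, and the function name `pick_alleles` and the sample XML are made up for illustration. It prefers the `*VCF` attributes (with `positionVCF` as the position) and falls back to `referenceAllele`/`alternateAllele` with `start` otherwise.

```python
import xml.etree.ElementTree as ET


def pick_alleles(attrib):
    """Return (pos, ref, alt, source) from a mapping of XML attributes, or None.

    Hypothetical helper: prefers the VCF-style attributes and falls back
    to the older referenceAllele/alternateAllele pair.
    """
    ref_vcf = attrib.get("referenceAlleleVCF")
    alt_vcf = attrib.get("alternateAlleleVCF")
    pos_vcf = attrib.get("positionVCF")
    if ref_vcf and alt_vcf and pos_vcf:
        # Use positionVCF, not start, when the VCF alleles are used.
        return int(pos_vcf), ref_vcf, alt_vcf, "vcf"

    ref = attrib.get("referenceAllele")
    alt = attrib.get("alternateAllele")
    start = attrib.get("start")
    if ref and alt and start:
        return int(start), ref, alt, "legacy"

    # Neither pair is complete: the record would still be skipped.
    return None


# Made-up sample data, not taken from a real ClinVar release.
SAMPLE = """<Measure>
  <SequenceLocation Assembly="GRCh37" Chr="10" start="100"
                    referenceAllele="A" alternateAllele="G"/>
  <SequenceLocation Assembly="GRCh37" Chr="10" start="205" positionVCF="204"
                    referenceAlleleVCF="CT" alternateAlleleVCF="C"/>
  <SequenceLocation Assembly="GRCh37" Chr="10" start="300"/>
</Measure>"""

results = [
    pick_alleles(loc.attrib)
    for loc in ET.fromstring(SAMPLE).iter("SequenceLocation")
]
for item in results:
    print(item)
```

The returned `source` tag is only there so that downstream steps could tell which entries already carry left-normalized alleles; whether normalize.py should treat them differently is an open question.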
Hello,
I have come across a handful of variants that were present in previous versions of the curated ClinVar TSV and are still present in the ClinVar database, but are missing from the most recent version. I am using both the clinvar_alleles.single.b37.tsv and the clinvar_alleles.single.b37.vcf committed on 2017-03-13:
10 55582650 . A ATGTT
2 179325735 rs17304212 C G
2 179325735 rs17304212 C T
2 179325816 rs79399438 G A
3 150690352 rs111033258 A C
5 149740732 rs56180593 C T
Is there a reason these are no longer present? Were they inadvertently dropped? Or am I missing them in either the TSV or VCF?
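For anyone who wants to repeat this check, a rough sketch of how the lookup could be scripted is below. It assumes the TSV has tab-separated `chrom`, `pos`, `ref` and `alt` columns; the helper name `missing_variants` and the two-row demo table are made up for illustration and are not real file contents.

```python
import csv
import io


def missing_variants(tsv_handle, wanted):
    """Return the (chrom, pos, ref, alt) tuples from `wanted` not found in the TSV."""
    present = set()
    for row in csv.DictReader(tsv_handle, delimiter="\t"):
        present.add((row["chrom"], int(row["pos"]), row["ref"], row["alt"]))
    return [variant for variant in wanted if variant not in present]


# The six variants listed in this issue.
WANTED = [
    ("10", 55582650, "A", "ATGTT"),
    ("2", 179325735, "C", "G"),
    ("2", 179325735, "C", "T"),
    ("2", 179325816, "G", "A"),
    ("3", 150690352, "A", "C"),
    ("5", 149740732, "C", "T"),
]

# Made-up stand-in for the real file; in practice, open the TSV instead, e.g.
# open("clinvar_alleles.single.b37.tsv", newline="").
demo_tsv = io.StringIO(
    "chrom\tpos\tref\talt\n"
    "2\t179325735\tC\tG\n"
    "2\t179325816\tG\tA\n"
)

missing = missing_variants(demo_tsv, WANTED)
for variant in missing:
    print(*variant, sep="\t")
```

Matching on the full (chrom, pos, ref, alt) key rather than on the rsID matters here, since two of the listed variants share rs17304212 and differ only in the alternate allele.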