-
Notifications
You must be signed in to change notification settings - Fork 2
HGVS cDNA and genomic coordinate conversion
Ensembl transcripts completely match the reference genome they are annotated against. Where the underlying reference sequence changed in the move from GRCh37 to GRCh38, they incremented the transcript version.
RefSeq transcripts exists independently of a genome build, that is, it's just a sequence.
These sequences "may be created in part or entirely from genomic accessions" (as well as transcripts) see FAQ: Why not transcript? and FAQ: What sequence?
This means their sequence won't always match a build, (or all builds/patch versions)
RefSeq and Ensembl regularly release gene annotation GFF/GTFs which contain transcript exon/CDS etc coordinates for a particular genome build. These are made by aligning cDNA sequences to the genome using a mapping tool like "spline".
Some variant databases / curation systems align transcripts to genome builds themselves, so there are HGVSs out there using transcripts that have never appeared in official GTF/GFF releases.
Because refseq has sequence differences, the alignment may have gaps.
In RefSeq GFF/GTFs, there are "cDNA_match" records with "gap_count" attributes. We can also detect them by comparing the transcript sequence length (from Fasta or API) and the sum of exons lengths in build coordinates. Comparing lengths won't be able to find alignment gaps with an equal number of insertions and deletions.
Transcript versions are build independent sequences - VariantGrid stores these sequences (from Fasta or API) and length in the TranscriptVersionSequenceInfo record (genome build independent)
A VariantGrid TranscriptVersion is from a RefSeq/Ensembl GFF/GTF and thus is tied to a genome build (there may be a copy per build). We should probably rename TranscriptVersionSequenceInfo to TranscriptVersion and make the current TranscriptVersion TranscriptVersionCoordinate
For local (ie Python library) HGVS conversion, we need the genome coordinates, which means we can only locally convert transcript versions that have appeared in RefSeq/Ensembl GTF/GFFs (unless we align them ourselves). We use previous releases to obtain historical transcript versions
Python has 2 HGVS libraries, pyhgvs and biocommons HGVS
BioCommons HGVS explains the gap problem:
PyHGVS doesn't handle gaps, I can't get UTA to load due to Docker issues and firewalls preventing direct SQL access.
BioCommons UTA has a slightly different issue with unaligned regions (slightly different than gaps) See issue
- Mutalyzer
- ClinGen Allele Registry
VEP can convert genome coordinates to HGVS correctly locally (though only with transcript/versions in the VEP annotation release) VEP can only perform HGVS to coordinates calculations online (ie via API) and that appears to have the gap issue.