Skip to content

HGVS cDNA and genomic coordinate conversion

Dave Lawrence edited this page Sep 27, 2021 · 9 revisions

Ensembl transcripts completely match the reference genome they are annotated against. Where the underlying reference sequence changed in the move from GRCh37 to GRCh38, they incremented the transcript version.

RefSeq transcripts exists independently of a genome build, that is, it's just a sequence.

These sequences "may be created in part or entirely from genomic accessions" (as well as transcripts) see FAQ: Why not transcript? and FAQ: What sequence?

This means their sequence won't always match a build, (or all builds/patch versions)

Transcript genome coordinates

RefSeq and Ensembl regularly release gene annotation GFF/GTFs which contain transcript exon/CDS etc coordinates for a particular genome build. These are made by aligning cDNA sequences to the genome using a mapping tool like "spline".

Some variant databases / curation systems align transcripts to genome builds themselves, so there are HGVSs out there using transcripts that have never appeared in official GTF/GFF releases.

RefSeq Alignment gaps

Because refseq has sequence differences, the alignment may have gaps.

In RefSeq GFF/GTFs, there are "cDNA_match" records with "gap_count" attributes. We can also detect them by comparing the transcript sequence length (from Fasta or API) and the sum of exons lengths in build coordinates. Comparing lengths won't be able to find alignment gaps with an equal number of insertions and deletions.

Transcript versions and genome builds

Transcript versions are build independent sequences - VariantGrid stores these sequences (from Fasta or API) and length in the TranscriptVersionSequenceInfo record (genome build independent)

A VariantGrid TranscriptVersion is from a RefSeq/Ensembl GFF/GTF and thus is tied to a genome build (there may be a copy per build). We should probably rename TranscriptVersionSequenceInfo to TranscriptVersion and make the current TranscriptVersion TranscriptVersionCoordinate

Python HGVS libraries

For local (ie Python library) HGVS conversion, we need the genome coordinates, which means we can only locally convert transcript versions that have appeared in RefSeq/Ensembl GTF/GFFs (unless we align them ourselves). We use previous releases to obtain historical transcript versions

Python has 2 HGVS libraries, pyhgvs and biocommons HGVS

BioCommons HGVS explains the gap problem:

PyHGVS doesn't handle gaps, I can't get UTA to load due to Docker issues and firewalls preventing direct SQL access.

BioCommons UTA has a slightly different issue with unaligned regions (slightly different than gaps) See issue

Web services

  • Mutalyzer
  • ClinGen Allele Registry

Other conversion programs

VEP can convert genome coordinates to HGVS correctly locally (though only with transcript/versions in the VEP annotation release) VEP can only perform HGVS to coordinates calculations online (ie via API) and that appears to have the gap issue.

Clone this wiki locally