Generating Transcript Version Data
We use a modified version of PyHGVS for HGVS resolution - see Python HGVS library discussion
Ensembl and RefSeq GFFs contain only the latest version of each transcript, so to resolve historical HGVS strings we need copies of every transcript version we can find. We therefore download all the GFFs, keep the most recent copy of each transcript version, and store the URL of the GFF it was produced from.
PyHGVS contains scripts that download all known GFF files, then create PyHGVS data from the most recent copy of each transcript version.
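The "keep the most recent copy of each transcript version" step can be sketched roughly as follows. This is a minimal illustration, not the PyHGVS script itself: the function name `merge_transcript_versions`, the record layout (one dict per release, keyed by versioned transcript accession), the placeholder accessions, and the example URLs are all hypothetical.

```python
def merge_transcript_versions(releases):
    """Merge per-release transcript dicts, given oldest release first.

    `releases` is a list of (url, transcripts) pairs (hypothetical layout),
    where `transcripts` maps a versioned accession to that transcript's data.
    Later releases overwrite earlier ones, so every transcript version ends
    up with its most recent copy plus the URL that copy came from.
    """
    merged = {}
    for url, transcripts in releases:
        for accession, data in transcripts.items():
            # Build a new dict so the caller's data is not mutated.
            merged[accession] = {**data, "url": url}
    return merged


# Hypothetical input: two releases, ordered oldest to newest.
releases = [
    ("https://example.org/release_1.gff.gz",
     {"TX_A.3": {"gene_name": "GENE_A"}, "TX_B.1": {"gene_name": "GENE_B"}}),
    ("https://example.org/release_2.gff.gz",
     {"TX_A.3": {"gene_name": "GENE_A"}, "TX_B.2": {"gene_name": "GENE_B"}}),
]
merged = merge_transcript_versions(releases)
```

In this sketch `TX_B.1` survives only from the older release, which is what makes historical versions resolvable, while `TX_A.3` is taken from the newer release.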
python3 -m pip install pyreference
git clone https://github.com/SACGF/pyhgvs
./hgvs/generate_transcript_data/ensembl_gene_annotation_grch37.sh
./hgvs/generate_transcript_data/ensembl_gene_annotation_grch38.sh
./hgvs/generate_transcript_data/refseq_gene_annotation_grch37.sh
./hgvs/generate_transcript_data/refseq_gene_annotation_grch38.sh
The merged files, built from multiple GTF/GFF releases, have names like "pyhgvs_transcripts_refseq_grch38.json.gz"
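To check what a generated file contains, it can be opened directly in Python. This is a minimal sketch that assumes only what the file extension suggests (gzip-compressed JSON); the helper name `load_json_gz` is hypothetical, and nothing is assumed about the layout of the data inside.

```python
import gzip
import json


def load_json_gz(path):
    """Load a gzip-compressed JSON file and return the decoded object."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)
```

For example, `data = load_json_gz("pyhgvs_transcripts_refseq_grch38.json.gz")` followed by `type(data)` shows the top-level structure before any further processing.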
The gene symbol associated with a transcript can change over time. We need to be able to freeze this so that, e.g., an analysis that filters to a gene list returns the same results even if the gene/transcript link later changes.
You will therefore need to create a Gene Annotation Release for each VEP annotation version. Do this AFTER the normal annotations have been imported (e.g. the merged file that also contains the individual release).
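To illustrate why the freeze matters, here is a minimal sketch using hypothetical names and data, not VariantGrid's actual models. Each release keeps its own transcript-to-gene-symbol mapping, and a gene-list filter is always evaluated against the release the analysis was pinned to.

```python
# Hypothetical data: transcript -> gene symbol, frozen per annotation release.
GENE_ANNOTATION_RELEASES = {
    "release_A": {"TX_1.1": "GENE_OLD", "TX_2.1": "GENE_X"},
    # In the later release, TX_1.1 is linked to a different symbol.
    "release_B": {"TX_1.1": "GENE_NEW", "TX_2.1": "GENE_X"},
}


def transcripts_in_gene_list(release, gene_list):
    """Return transcripts whose symbol, as frozen in `release`, is in `gene_list`."""
    symbols = GENE_ANNOTATION_RELEASES[release]
    wanted = set(gene_list)
    return sorted(tx for tx, symbol in symbols.items() if symbol in wanted)
```

An analysis pinned to `release_A` keeps matching `TX_1.1` for the gene list `["GENE_OLD"]`, even though the same query against `release_B` no longer does.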
To find the GTF used for your VEP release, see VEP Cache or Issue. If you just have a GFF, the release number can be found via zcat gff | head
These are the GTFs used in VEP v100:
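Where zcat is not available, the same check can be done in Python. This is a minimal sketch; the function name `gff_head` is an assumption for illustration. It simply returns the first lines of the file, which is where GFF header comments (lines starting with `#`) normally appear.

```python
import gzip
from itertools import islice


def gff_head(path, n=10):
    """Return the first `n` lines of a gzip-compressed GFF, like `zcat gff | head`."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        return [line.rstrip("\n") for line in islice(f, n)]
```

Calling `print("\n".join(gff_head(path)))` on a downloaded GFF shows the header lines to look through for the release number.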
# RefSeq GRCh37
python3 ./hgvs/generate_transcript_data/pyreference_to_pyhgvs_json.py --pyreference-json GCF_000001405.25_GRCh37.p13_genomic.105.20190906.gff.gz.json.gz --output pyhgvs_transcripts_refseq_grch37_v105.20190906.json.gz
# RefSeq GRCh38
python3 ./hgvs/generate_transcript_data/pyreference_to_pyhgvs_json.py --pyreference-json GCF_000001405.39_GRCh38.p13_genomic.109.20190607.gff.gz.json.gz --output pyhgvs_transcripts_refseq_grch38_v109.20190607.json.gz
# Ensembl GRCh37 (VEP 100 uses latest GRCh37 release v87)
python3 ./hgvs/generate_transcript_data/pyreference_to_pyhgvs_json.py --pyreference-json Homo_sapiens.GRCh37.87.gff3.gz.json.gz --output pyhgvs_transcripts_ensembl_grch37_v87.json.gz
# Ensembl GRCh38 (VEP 100)
python3 ./hgvs/generate_transcript_data/pyreference_to_pyhgvs_json.py --pyreference-json Homo_sapiens.GRCh38.100.gff3.gz.json.gz --output pyhgvs_transcripts_ensembl_grch38_v100.json.gz
SACGF fork of PyHGVS library to handle alignment gaps (this is in VariantGrid requirements.txt)
PyReference - Python lib to convert GFF3/GTF to JSON for easier processing.