Generating Transcript Version Data

Dave Lawrence edited this page Apr 27, 2022 · 9 revisions

Background

See Transcript Versions and the Python HGVS library discussion. We use a modified version of PyHGVS for HGVS resolution, and have a spin-off project, cdot (https://github.com/SACGF/cdot/), which collects transcripts in JSON.gz format and has loaders for both Python HGVS libraries.

How to produce transcript files

The transcript/gene and gene/symbol relationships (including the HGNC symbols themselves) change over time. When we merge data from historical files, we always keep the most recent record. See the cdot instructions for creating transcript JSON.gz files.
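The "keep the latest one" rule can be sketched as follows. This is a minimal illustration, not cdot's actual merge code: the function name, the record fields, and the accessions and symbols are all made up for the example, and the files are assumed to be supplied oldest first.

```python
def merge_keep_latest(historical):
    """Merge transcript records; `historical` must be ordered oldest -> newest."""
    merged = {}
    for release, transcripts in historical:
        for accession, record in transcripts.items():
            # A later file overwrites any earlier record for the same transcript
            merged[accession] = {**record, "source": release}
    return merged


# Hypothetical data: the symbol for NM_000001.1 changes between releases
historical = [
    ("release_A", {"NM_000001.1": {"gene": "GENE1", "symbol": "OLDSYM"}}),
    ("release_B", {"NM_000001.1": {"gene": "GENE1", "symbol": "NEWSYM"},
                   "NM_000002.1": {"gene": "GENE2", "symbol": "SYM2"}}),
]
merged = merge_keep_latest(historical)
print(merged["NM_000001.1"])
```

A transcript that only appears in an older file is kept as-is; one that appears in several files takes its gene and symbol from the newest of them.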

This produces a lot of data. You only need the final file for each genome build / annotation consortium combination, which you then load into VariantGrid:

python3 manage.py import_gene_annotation --json-file ${DATA_DIR}/cdot-0.2.5.refseq.grch37.json.gz --annotation-consortium=RefSeq --genome-build=GRCh37 --clear-obsolete

python3 manage.py import_gene_annotation --json-file ${DATA_DIR}/cdot-0.2.5.refseq.grch38.json.gz --annotation-consortium=RefSeq --genome-build=GRCh38 --clear-obsolete

python3 manage.py import_gene_annotation --json-file ${DATA_DIR}/cdot-0.2.5.ensembl.grch37.json.gz --annotation-consortium=Ensembl --genome-build=GRCh37 --clear-obsolete

python3 manage.py import_gene_annotation --json-file ${DATA_DIR}/cdot-0.2.5.ensembl.grch38.json.gz --annotation-consortium=Ensembl --genome-build=GRCh38 --clear-obsolete
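The four commands above differ only in consortium and genome build, so they can be generated rather than typed. A small sketch, assuming the file naming pattern shown above; the helper name `import_commands` and the `CDOT_VERSION` constant are introduced here for illustration only.

```python
import itertools

CDOT_VERSION = "0.2.5"  # version used in the commands above; adjust to your files


def import_commands(data_dir, consortia=("RefSeq", "Ensembl"),
                    builds=("GRCh37", "GRCh38")):
    """Build one import_gene_annotation argument list per consortium/build pair."""
    commands = []
    for consortium, build in itertools.product(consortia, builds):
        json_file = (f"{data_dir}/cdot-{CDOT_VERSION}."
                     f"{consortium.lower()}.{build.lower()}.json.gz")
        commands.append([
            "python3", "manage.py", "import_gene_annotation",
            "--json-file", json_file,
            f"--annotation-consortium={consortium}",
            f"--genome-build={build}",
            "--clear-obsolete",
        ])
    return commands


# Print the commands for review; nothing is executed here
for cmd in import_commands("${DATA_DIR}"):
    print(" ".join(cmd))
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` with a real directory path instead of `${DATA_DIR}`, once the printed commands look right.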

Gene Annotation Release

In the step above, we keep the latest versions after merging many GTFs, i.e. the final transcript/gene/symbol relationship comes from the most recent file containing the transcript.

For consistent analyses, we need to keep a snapshot of a particular state, e.g. the gene/symbol mappings for Ensembl release 100 (the release used by GRCh38 Ensembl variant annotation), so that a gene list filter always returns the same results.

To do this, we create a GeneAnnotationRelease, which is loaded from a single RefSeq/Ensembl GTF for each Variant Annotation Version. You need to do this AFTER the normal annotations have been imported (i.e. the merged file, which also contains the individual release).
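To see why a frozen release matters, consider a gene list filter that matches on symbol. The toy example below is not VariantGrid code; the gene IDs, symbols, and function name are invented purely to show how the merged "latest" mapping and a release snapshot can give different answers for the same gene list.

```python
def genes_for_symbols(symbol_by_gene, wanted_symbols):
    """Return the gene IDs whose symbol is in the gene list."""
    return sorted(gene for gene, symbol in symbol_by_gene.items()
                  if symbol in wanted_symbols)


# Hypothetical mappings: GENE_A was renamed in a file newer than the snapshot
release_snapshot = {"GENE_A": "SYM1", "GENE_B": "SYM2"}
latest_merged = {"GENE_A": "SYM1_RENAMED", "GENE_B": "SYM2"}

gene_list = {"SYM1", "SYM2"}
snapshot_hits = genes_for_symbols(release_snapshot, gene_list)
merged_hits = genes_for_symbols(latest_merged, gene_list)
print(snapshot_hits, merged_hits)
```

Filtering against the snapshot keeps returning both genes regardless of later imports, while filtering against the merged data drops GENE_A once its symbol changes.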

To find the GTF used for your VEP release, see VEP Cache or Issue. If you just have a GFF, the release number can be found via zcat gff | head
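If you prefer to do the zcat | head check from Python, a minimal equivalent is sketched below. It only returns the leading comment lines of the file and makes no assumption about which header line carries the release number, so you still need to read the output yourself. The function name is hypothetical.

```python
import gzip
import itertools


def gff_header_lines(path, max_lines=20):
    """Return the leading '#' lines of a gzipped GFF/GTF, like `zcat file | head`."""
    header = []
    with gzip.open(path, "rt") as handle:
        for line in itertools.islice(handle, max_lines):
            if not line.startswith("#"):
                break  # first feature line reached, the header is over
            header.append(line.rstrip("\n"))
    return header
```

Usage would be something like `print("\n".join(gff_header_lines("my_annotation.gff.gz")))`, where the file name stands in for your own GFF.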

To get the single GTF into the right format, run "cdot_json.py merge_historical" on just that one file.

VEP 100 data files

These are the GTFs used in VEP v100. The commands below expect that the default cdot files have already been generated (e.g. by all_transcripts.sh).

# RefSeq GRCh37
${CDOT_DIR}/generate_transcript_data/cdot_json.py merge_historical --genome-build=GRCh37 --output cdot-0.2.5.refseq.grch37.release_105.20190906.json.gz GCF_000001405.25_GRCh37.p13_genomic.105.20190906.gff.json.gz

# RefSeq GRCh38
${CDOT_DIR}/generate_transcript_data/cdot_json.py merge_historical --genome-build=GRCh38 --output cdot-0.2.5.refseq.grch38.109.20190607.json.gz GCF_000001405.39_GRCh38.p13_genomic.109.20190607.gff.json.gz

# Ensembl GRCh37 - 37 has been staying on release 87
${CDOT_DIR}/generate_transcript_data/cdot_json.py merge_historical --genome-build=GRCh37 --output cdot-0.2.5.ensembl.grch37.release_87.json.gz Homo_sapiens.GRCh37.87.gff3.json.gz

# Ensembl GRCh38 - 38 updates with each Ensembl Release
${CDOT_DIR}/generate_transcript_data/cdot_json.py merge_historical --genome-build=GRCh38 --output cdot-0.2.5.ensembl.grch38.release_100.json.gz Homo_sapiens.GRCh38.100.gff3.json.gz
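Before importing, it can be worth a quick look at what merge_historical wrote. The sketch below only assumes the output is a gzipped JSON object; it does not rely on any particular cdot key names, and simply reports each top-level key with its size (for containers) or its value (for scalars). The function name is made up for this example.

```python
import gzip
import json


def summarise_json_gz(path):
    """Map each top-level key to its length (dict/list) or its scalar value."""
    with gzip.open(path, "rt") as handle:
        data = json.load(handle)
    summary = {}
    for key, value in data.items():
        summary[key] = len(value) if isinstance(value, (dict, list)) else value
    return summary
```

For example, `summarise_json_gz("cdot-0.2.5.ensembl.grch38.release_100.json.gz")` should show at a glance whether the file is non-empty. Note that it loads the whole file into memory, which may be slow for the large merged files.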

Inserting Gene Annotation Releases

# RefSeq GRCh37
python3 manage.py import_gene_annotation --annotation-consortium=RefSeq --genome-build=GRCh37 --json-file cdot-0.2.5.refseq.grch37.release_105.20190906.json.gz

# RefSeq GRCh38
python3 manage.py import_gene_annotation --annotation-consortium=RefSeq --genome-build=GRCh38 --json-file cdot-0.2.5.refseq.grch38.109.20190607.json.gz

# Ensembl GRCh37 - 37 has been staying on release 87
python3 manage.py import_gene_annotation --annotation-consortium=Ensembl --genome-build=GRCh37 --json-file cdot-0.2.5.ensembl.grch37.release_87.json.gz

# Ensembl GRCh38 - 38 updates with each Ensembl Release
python3 manage.py import_gene_annotation --annotation-consortium=Ensembl --genome-build=GRCh38 --json-file cdot-0.2.5.ensembl.grch38.release_100.json.gz

See also

SACGF fork of PyHGVS library to handle alignment gaps (this is in VariantGrid requirements.txt)

cdot - Generate JSON.gz from GTF files, and provide loaders for each HGVS library
