Generating Transcript Version Data

Dave Lawrence edited this page May 17, 2022 · 9 revisions

Background

See Transcript Versions and Python HGVS library discussion. We use a modified version of PyHGVS for HGVS resolution, and have a spin-off project, cdot (https://github.com/SACGF/cdot/), which collects transcripts in a JSON.gz format and provides loaders for both Python HGVS libraries.

Gene Annotation Release

To get a single GTF into the right format, run "cdot_json.py merge_historical" with that one file as the only input.

VEP 100 data files

These are the GTFs used in VEP v100. The commands below assume the default cdot files have already been generated (e.g. by all_transcripts.sh).

# RefSeq GRCh37
${CDOT_DIR}/generate_transcript_data/cdot_json.py merge_historical --genome-build=GRCh37 --output cdot-0.2.5.refseq.grch37.release_105.20190906.json.gz GCF_000001405.25_GRCh37.p13_genomic.105.20190906.gff.json.gz

# RefSeq GRCh38
${CDOT_DIR}/generate_transcript_data/cdot_json.py merge_historical --genome-build=GRCh38 --output cdot-0.2.5.refseq.grch38.109.20190607.json.gz GCF_000001405.39_GRCh38.p13_genomic.109.20190607.gff.json.gz

# Ensembl GRCh37 - GRCh37 has stayed on release 87
${CDOT_DIR}/generate_transcript_data/cdot_json.py merge_historical --genome-build=GRCh37 --output cdot-0.2.5.ensembl.grch37.release_87.json.gz Homo_sapiens.GRCh37.87.gff3.json.gz

# Ensembl GRCh38 - 38 updates with each Ensembl Release
${CDOT_DIR}/generate_transcript_data/cdot_json.py merge_historical --genome-build=GRCh38 --output cdot-0.2.5.ensembl.grch38.release_100.json.gz Homo_sapiens.GRCh38.100.gff3.json.gz

VEP 106 data files

# TODO: RefSeq

# Ensembl GRCh37 - GRCh37 has stayed on release 87
${CDOT_DIR}/generate_transcript_data/cdot_json.py merge_historical --genome-build=GRCh37 --output cdot-0.2.5.ensembl.grch37.release_87.json.gz Homo_sapiens.GRCh37.87.gff3.json.gz

# Ensembl GRCh38 - 38 updates with each Ensembl Release
${CDOT_DIR}/generate_transcript_data/cdot_json.py merge_historical --genome-build=GRCh38 --output cdot-0.2.5.ensembl.grch38.release_106.json.gz Homo_sapiens.GRCh38.106.gff3.json.gz
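
The Ensembl output names used on this page follow one pattern: cdot-<cdot version>.ensembl.<lower-case genome build>.release_<Ensembl release>.json.gz. As a minimal sketch, a small shell function can build that name so the version and release only need to be typed once. The function name cdot_ensembl_output is hypothetical and not part of cdot; it only formats a string and does not run cdot_json.py.

```shell
# Hypothetical helper (not part of cdot): build an Ensembl output file name
# following the naming pattern used on this page.
# Usage: cdot_ensembl_output CDOT_VERSION GENOME_BUILD ENSEMBL_RELEASE
cdot_ensembl_output() {
    cdot_version="$1"
    genome_build="$2"
    release="$3"
    # File names on this page use a lower-case build (grch37 / grch38)
    build_lc=$(printf '%s' "$genome_build" | tr '[:upper:]' '[:lower:]')
    printf 'cdot-%s.ensembl.%s.release_%s.json.gz\n' "$cdot_version" "$build_lc" "$release"
}

# Example (assumes CDOT_DIR is set and the input file exists):
# ${CDOT_DIR}/generate_transcript_data/cdot_json.py merge_historical \
#     --genome-build=GRCh38 \
#     --output "$(cdot_ensembl_output 0.2.5 GRCh38 106)" \
#     Homo_sapiens.GRCh38.106.gff3.json.gz
```

The RefSeq file names above also carry a release date and are not covered by this sketch.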

See also

SACGF fork of PyHGVS library to handle alignment gaps (this is in VariantGrid requirements.txt)

cdot - Generates JSON.gz from GTF files, and provides loaders for each HGVS library