Generating Transcript Version Data
Dave Lawrence edited this page May 17, 2022 · 9 revisions
See Transcript Versions and the Python HGVS library discussion. We use a modified version of PyHGVS for HGVS resolution, and have a spin-off project, cdot (https://github.com/SACGF/cdot/), which collects transcripts in a JSON.gz format and provides loaders for both Python HGVS libraries.
To get a single GTF into the right format, run "cdot_json.py merge_historical" on that single file.
These are the GTFs used in VEP v100. The commands below expect that the default files from cdot have already been generated (e.g. via all_transcripts.sh).
# RefSeq GRCh37
${CDOT_DIR}/generate_transcript_data/cdot_json.py merge_historical --genome-build=GRCh37 --output cdot-0.2.5.refseq.grch37.release_105.20190906.json.gz GCF_000001405.25_GRCh37.p13_genomic.105.20190906.gff.json.gz
# RefSeq GRCh38
${CDOT_DIR}/generate_transcript_data/cdot_json.py merge_historical --genome-build=GRCh38 --output cdot-0.2.5.refseq.grch38.109.20190607.json.gz GCF_000001405.39_GRCh38.p13_genomic.109.20190607.gff.json.gz
# Ensembl GRCh37 - 37 has been staying on release 87
${CDOT_DIR}/generate_transcript_data/cdot_json.py merge_historical --genome-build=GRCh37 --output cdot-0.2.5.ensembl.grch37.release_87.json.gz Homo_sapiens.GRCh37.87.gff3.json.gz
# Ensembl GRCh38 - 38 updates with each Ensembl Release
${CDOT_DIR}/generate_transcript_data/cdot_json.py merge_historical --genome-build=GRCh38 --output cdot-0.2.5.ensembl.grch38.release_100.json.gz Homo_sapiens.GRCh38.100.gff3.json.gz
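Since the commands above depend on input files produced by an earlier cdot run, a small pre-flight check can save a failed merge. The following is only a sketch: `check_inputs` is a hypothetical helper and not part of cdot, and it assumes the per-release .json.gz files sit in the current directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of cdot): report which of the given
# input files are missing before running merge_historical.
check_inputs() {
    local missing=0
    local f
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            # Report each missing file on stderr and remember the failure
            echo "missing input: $f" >&2
            missing=1
        fi
    done
    # Returns 0 when every file exists, 1 otherwise
    return "$missing"
}

# Example usage with two input names taken from the commands on this page:
# check_inputs Homo_sapiens.GRCh37.87.gff3.json.gz \
#              Homo_sapiens.GRCh38.100.gff3.json.gz \
#     || echo "Generate the default cdot files first"
```

The function only tests that the files exist. It does not validate their contents or check which cdot version produced them.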
# TODO: RefSeq
# Ensembl GRCh37 - 37 has been staying on release 87
${CDOT_DIR}/generate_transcript_data/cdot_json.py merge_historical --genome-build=GRCh37 --output cdot-0.2.5.ensembl.grch37.release_87.json.gz Homo_sapiens.GRCh37.87.gff3.json.gz
# Ensembl GRCh38 - 38 updates with each Ensembl Release
${CDOT_DIR}/generate_transcript_data/cdot_json.py merge_historical --genome-build=GRCh38 --output cdot-0.2.5.ensembl.grch38.release_106.json.gz Homo_sapiens.GRCh38.106.gff3.json.gz
SACGF fork of PyHGVS library to handle alignment gaps (this is in VariantGrid requirements.txt)
cdot - Generate JSON.gz from GTF files, and provide loaders for each HGVS library