Generating Transcript Version Data
See Transcript Versions and the Python HGVS library discussion. We use a modified version of PyHGVS for HGVS resolution, and maintain a spin-off project, cdot (https://github.com/SACGF/cdot/), which collects transcripts in JSON.gz format and provides loaders for both Python HGVS libraries.
The gene/transcript and gene/symbol relationships, as well as HGNC symbols, change over time. When we merge data from historical files, we always keep the latest version. See the cdot instructions for creating transcript JSON.gz files.
This produces a lot of data. You only need the final file for each genome build/annotation consortium combination. Load these into VariantGrid:
python3 manage.py import_gene_annotation --json-file ${DATA_DIR}/cdot-0.2.5.refseq.grch37.json.gz --annotation-consortium=RefSeq --genome-build=GRCh37 --clear-obsolete
python3 manage.py import_gene_annotation --json-file ${DATA_DIR}/cdot-0.2.5.refseq.grch38.json.gz --annotation-consortium=RefSeq --genome-build=GRCh38 --clear-obsolete
python3 manage.py import_gene_annotation --json-file ${DATA_DIR}/cdot-0.2.5.ensembl.grch37.json.gz --annotation-consortium=Ensembl --genome-build=GRCh37 --clear-obsolete
python3 manage.py import_gene_annotation --json-file ${DATA_DIR}/cdot-0.2.5.ensembl.grch38.json.gz --annotation-consortium=Ensembl --genome-build=GRCh38 --clear-obsolete
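The four commands above differ only in annotation consortium and genome build, so they can be generated in a loop. The following is a minimal sketch, assuming the cdot-0.2.5 file naming shown above; `build_import_commands` is a hypothetical helper, not part of VariantGrid, and it only builds the argument lists rather than running them.

```python
import itertools
import os

CONSORTIA = ["RefSeq", "Ensembl"]
BUILDS = ["GRCh37", "GRCh38"]


def build_import_commands(data_dir, cdot_version="0.2.5"):
    """Build one import_gene_annotation command per consortium/build pair.

    Hypothetical helper: assumes the merged cdot files are named
    cdot-<version>.<consortium>.<build>.json.gz inside data_dir.
    """
    commands = []
    for consortium, build in itertools.product(CONSORTIA, BUILDS):
        json_file = os.path.join(
            data_dir,
            f"cdot-{cdot_version}.{consortium.lower()}.{build.lower()}.json.gz",
        )
        commands.append([
            "python3", "manage.py", "import_gene_annotation",
            "--json-file", json_file,
            f"--annotation-consortium={consortium}",
            f"--genome-build={build}",
            "--clear-obsolete",
        ])
    return commands


if __name__ == "__main__":
    # Print the commands for review; DATA_DIR mirrors the variable used above.
    for cmd in build_import_commands(os.environ.get("DATA_DIR", ".")):
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` from the directory containing manage.py once the printed commands look right.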
In the step above, we keep the latest versions after merging many GTFs, eg the final transcript/gene/symbol relationship will be from the most recent file containing the transcript.
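The "latest wins" rule can be illustrated with a small sketch. This is not cdot's actual code: `merge_latest`, the record layout and the accessions below are hypothetical, chosen only to show that a transcript's final record comes from the most recent file that contains it.

```python
def merge_latest(releases):
    """Merge per-file transcript dicts, ordered oldest to newest.

    Later files overwrite earlier ones, so each transcript keeps the
    record from the most recent file that contains it.
    """
    merged = {}
    for transcripts in releases:  # oldest first
        merged.update(transcripts)
    return merged


# Hypothetical records: transcript accession -> gene/symbol relationship
older = {
    "NM_000001.1": {"gene": "GENE_A", "symbol": "OLD_SYMBOL"},
    "NM_000002.1": {"gene": "GENE_B", "symbol": "SYMBOL_B"},
}
newer = {
    "NM_000001.1": {"gene": "GENE_A", "symbol": "NEW_SYMBOL"},
}

merged = merge_latest([older, newer])
```

Here `NM_000001.1` takes its symbol from the newer file, while `NM_000002.1`, which is absent from the newer file, keeps its record from the older one.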
For consistent analyses, we need to keep a snapshot of a certain state, eg the gene/symbol mappings for Ensembl release 100 (the release used by GRCh38 Ensembl variant annotation) so a gene list filter always returns the same results.
To do this, we create a GeneAnnotationRelease, which is loaded from a single RefSeq/Ensembl GTF for each Variant Annotation Version. You need to do this AFTER the normal annotations have been imported (eg the merged file that also contains the individual release).
To find the GTF used for your VEP release, see VEP Cache or Issue. If you only have a GFF, the release number can be found in the file header via zcat gff | head
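The `zcat gff | head` check can also be done from Python. The following is a minimal sketch; `gff_header_lines` is a hypothetical helper, and since the header line that carries the release number depends on where the file came from, it returns the comment lines for inspection rather than parsing them.

```python
import gzip
import itertools


def gff_header_lines(path, max_lines=10):
    """Return the '#' comment lines among the first max_lines of a gzipped GFF/GTF."""
    with gzip.open(path, "rt") as handle:
        head = itertools.islice(handle, max_lines)
        return [line.rstrip("\n") for line in head if line.startswith("#")]
```

For example, `print("\n".join(gff_header_lines("Homo_sapiens.GRCh38.100.gff3.gz")))` would list the header comments of that file, assuming it exists in the current directory.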
To get the single GTF into the right format, you need to run "cdot_json.py merge_historical" on a single file.
These are the GTFs used in VEP v100. This expects that the default files from cdot have already been generated (eg all_transcripts.sh)
# RefSeq GRCh37
${CDOT_DIR}/generate_transcript_data/cdot_json.py merge_historical --genome-build=GRCh37 --output cdot-0.2.5.refseq.grch37.release_105.20190906.json.gz GCF_000001405.25_GRCh37.p13_genomic.105.20190906.gff.json.gz
# RefSeq GRCh38
${CDOT_DIR}/generate_transcript_data/cdot_json.py merge_historical --genome-build=GRCh38 --output cdot-0.2.5.refseq.grch38.109.20190607.json.gz GCF_000001405.39_GRCh38.p13_genomic.109.20190607.gff.json.gz
# Ensembl GRCh37 - 37 has been staying on release 87
${CDOT_DIR}/generate_transcript_data/cdot_json.py merge_historical --genome-build=GRCh37 --output cdot-0.2.5.ensembl.grch37.release_87.json.gz Homo_sapiens.GRCh37.87.gff3.json.gz
# Ensembl GRCh38 - 38 updates with each Ensembl Release
${CDOT_DIR}/generate_transcript_data/cdot_json.py merge_historical --genome-build=GRCh38 --output cdot-0.2.5.ensembl.grch38.release_100.json.gz Homo_sapiens.GRCh38.100.gff3.json.gz
Then import the resulting single-release files into VariantGrid:
# RefSeq GRCh37
python3 manage.py import_gene_annotation --annotation-consortium=RefSeq --genome-build=GRCh37 --json-file cdot-0.2.5.refseq.grch37.release_105.20190906.json.gz
# RefSeq GRCh38
python3 manage.py import_gene_annotation --annotation-consortium=RefSeq --genome-build=GRCh38 --json-file cdot-0.2.5.refseq.grch38.109.20190607.json.gz
# Ensembl GRCh37 - 37 has been staying on release 87
python3 manage.py import_gene_annotation --annotation-consortium=Ensembl --genome-build=GRCh37 --json-file cdot-0.2.5.ensembl.grch37.release_87.json.gz
# Ensembl GRCh38 - 38 updates with each Ensembl Release
python3 manage.py import_gene_annotation --annotation-consortium=Ensembl --genome-build=GRCh38 --json-file cdot-0.2.5.ensembl.grch38.release_100.json.gz
SACGF fork of PyHGVS library to handle alignment gaps (this is in VariantGrid requirements.txt)
cdot - Generate JSON.gz from GTF files, and provide loaders for each HGVS library