Input Data
- Genomic sequence in FASTA format (e.g. "genomic.fna" from RefSeq or "dna.toplevel.fa" from Ensembl)
- Genome annotation in GFF3 format (e.g. "genomic.gff" from RefSeq or ".gff3" from Ensembl)
- Pairwise orthologs of all proteins between the reference and each core species
- Tab-separated file containing the position and sequence of each miRNA for which a model should be constructed
When preparing your input data, you will need one pairwise ortholog file per core species. Each pairwise ortholog file should:
- be tab-separated
- have the IDs of the reference species in the first column
- have the IDs of the orthologs in the core species in the second column
- optionally contain any number of further columns, which ncOrtho ignores
Example:
#H_sapiens G_gorilla type group
HUMAN00001 GORGO02140 1:1 604910
HUMAN00008 GORGO02141 1:1 995108
HUMAN00023 GORGO02142 1:1 807473
HUMAN00028 GORGO02144 1:1 978496
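To check a pairwise ortholog file before running ncOrtho, a small reader like the following can help. This is a minimal sketch and not ncOrtho's own parser: the function name `read_pairwise_orthologs` is hypothetical, and treating lines that start with `#` as header lines is an assumption based on the example above.

```python
import csv
import io


def read_pairwise_orthologs(handle):
    """Return a dict mapping reference IDs to lists of core-species IDs."""
    orthologs = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and header lines starting with '#' (assumption).
        if not row or row[0].startswith("#"):
            continue
        if len(row) < 2:
            raise ValueError(f"expected at least two columns, got: {row!r}")
        ref_id, core_id = row[0], row[1]  # any further columns are ignored
        orthologs.setdefault(ref_id, []).append(core_id)
    return orthologs


example = (
    "#H_sapiens\tG_gorilla\ttype\tgroup\n"
    "HUMAN00001\tGORGO02140\t1:1\t604910\n"
    "HUMAN00008\tGORGO02141\t1:1\t995108\n"
)
pairs = read_pairwise_orthologs(io.StringIO(example))
print(pairs)
```

Storing a list per reference ID keeps the sketch usable for files that also contain one-to-many relations.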
An easy way to obtain this data is by using the OMA Genome Pair View. If your reference/core species set contains taxa that are not part of the OMA database, you might want to use OMA standalone or another ortholog detection tool of your choice.
Keep in mind, however, that the IDs in the pairwise ortholog file must match one of the ID types in the GFF3 file you are using. ncOrtho can parse four different types of IDs from a (RefSeq) GFF file: "ID", "Name", "GeneID" and "CDS". The two example lines from a GFF file below show which ID corresponds to which keyword:
NC_000001.11 BestRefSeq gene 65419 71585 . + . ID=gene-OR4F5;Dbxref=GeneID:79501,HGNC:HGNC:14825;Name=OR4F5;description=olfactory receptor family 4 subfamily F member 5;gbkey=Gene;gene=OR4F5;gene_biotype=protein_coding
NC_000001.11 BestRefSeq CDS 65565 65573 . + 0 ID=cds-NP_001005484.2;Parent=rna-NM_001005484.2;Dbxref=CCDS:CCDS30547.1,Ensembl:ENSP00000493376.2,GeneID:79501,Genbank:NP_001005484.2,HGNC:HGNC:14825;Name=NP_001005484.2;gbkey=CDS;gene=OR4F5;product=olfactory receptor 4F5;protein_id=NP_001005484.2;tag=MANE Select
Keyword | Example |
---|---|
ID | OR4F5 |
Name | OR4F5 |
GeneID | 79501 |
CDS | NP_001005484 |
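To see which of the four keywords fits the IDs in your ortholog file, it can help to pull the candidate values out of a few GFF lines yourself. The sketch below is an illustration only, not ncOrtho's parsing code. The helper names `parse_attributes` and `candidate_ids` are hypothetical, and stripping the `gene-`/`cds-` prefix and the trailing version number is an assumption inferred from the table above.

```python
import re


def parse_attributes(column9):
    """Split the ninth GFF3 column into a key -> value dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def candidate_ids(gff_line):
    """Return the values matching the four keywords for one GFF line.

    Assumption: prefixes ('gene-', 'cds-') and the version suffix ('.2')
    are removed, as the table suggests; ncOrtho itself may behave differently.
    """
    cols = gff_line.rstrip("\n").split("\t")
    feature, attrs = cols[2], parse_attributes(cols[8])
    dbxref = dict(
        ref.split(":", 1) for ref in attrs.get("Dbxref", "").split(",") if ":" in ref
    )
    found = {}
    if "GeneID" in dbxref:
        found["GeneID"] = dbxref["GeneID"]
    if feature == "gene":
        found["ID"] = attrs["ID"].split("-", 1)[-1]
        found["Name"] = attrs.get("Name")
    elif feature == "CDS":
        found["CDS"] = re.sub(r"\.\d+$", "", attrs["ID"].split("-", 1)[-1])
    return found


gene_line = "\t".join([
    "NC_000001.11", "BestRefSeq", "gene", "65419", "71585", ".", "+", ".",
    "ID=gene-OR4F5;Dbxref=GeneID:79501,HGNC:HGNC:14825;Name=OR4F5;gbkey=Gene",
])
cds_line = "\t".join([
    "NC_000001.11", "BestRefSeq", "CDS", "65565", "65573", ".", "+", "0",
    "ID=cds-NP_001005484.2;Parent=rna-NM_001005484.2;Dbxref=GeneID:79501;gbkey=CDS",
])
print(candidate_ids(gene_line))
print(candidate_ids(cds_line))
```

The attribute strings in the two example lines are shortened versions of the GFF lines shown above.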
If you have difficulties with the automatic parsing, you can instead supply ncOrtho with an ordered tab-separated file of gene locations, for example:
#Contig Start End Strand ID_from_orthology_file
NC_003070.9 3631 5899 + NP_171609
NC_003070.9 6788 9130 - NP_001030923
NC_003070.9 11649 13714 - NP_171611
NC_003070.9 23121 31227 + NP_001184881
NC_003070.9 31170 33171 - NP_001322481
Important: Irrespective of the type of ID used, the start and end position should be given once per gene, not once per isoform of a gene.
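A hand-made gene-location file can be checked with a few lines of Python before it is passed to ncOrtho. The sketch below assumes that "ordered" means start positions never decrease within one contig. That reading, and the function names `read_gene_locations` and `is_ordered`, are assumptions made for this example and not part of ncOrtho.

```python
import io


def read_gene_locations(handle):
    """Parse lines of: contig, start, end, strand, gene ID."""
    genes = []
    for line in handle:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and the '#' header line
        contig, start, end, strand, gene_id = line.split("\t")[:5]
        if strand not in ("+", "-"):
            raise ValueError(f"unexpected strand {strand!r} for {gene_id}")
        genes.append((contig, int(start), int(end), strand, gene_id))
    return genes


def is_ordered(genes):
    """True if start positions never decrease between neighbours on one contig."""
    for prev, cur in zip(genes, genes[1:]):
        if prev[0] == cur[0] and cur[1] < prev[1]:
            return False
    return True


example = (
    "#Contig\tStart\tEnd\tStrand\tID_from_orthology_file\n"
    "NC_003070.9\t3631\t5899\t+\tNP_171609\n"
    "NC_003070.9\t6788\t9130\t-\tNP_001030923\n"
    "NC_003070.9\t11649\t13714\t-\tNP_171611\n"
)
genes = read_gene_locations(io.StringIO(example))
print(len(genes), is_ordered(genes))
```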
Information about all reference miRNAs has to be supplied in a single tab-separated file with 7 columns:
- Unique miRNA id
- Contig/Chromosome id (needs to match the one in the reference GFF file!)
- Start
- Stop
- Strand (+ or -)
- pre-miRNA sequence
- mature miRNA sequence (no features that use the mature sequence are implemented yet, so this column can be filled with a placeholder such as "NA" or "None")
Example:
hsa-mir-552 NC_000001.11 34669599 34669694 - AACCAUUCAAAUAUACCACAGUUUGUUUAACCUUUUGCCUGUUGGUUGAAGAUGCCUUUCAACAGGUGACUGGUUAGACAAACUGUGGUAUAUACA NA
hsa-mir-30e NC_000001.11 40754355 40754446 + GGGCAGUCUUUGCUACUGUAAACAUCCUUGACUGGAAGCUGUAAGGUGUUCAGAGGAGCUUUCAGUCGGAUGUUUACAGCGGCAGGCUGCCA NA
hsa-mir-30c-1 NC_000001.11 40757284 40757372 + ACCAUGCUGUAGUGUGUGUAAACAUCCUACACUCUCAGCUGUGAGCUCAAGGUGGCUGGGAGAGGGUUGUUUACUCCUUCUGCCAUGGA NA
hsa-mir-6733 NC_000001.11 43171652 43171712 - GUGCUUGGGAAAGACAAACUCAGAGUUCCCUUCUUGUGAGCUCAGUGUCUGGAUUUCCUAG NA
You can retrieve this information from databases like MirGeneDB or miRBase.
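Since the miRNA file is usually assembled by hand from such databases, a quick sanity check of the seven columns can catch formatting slips early. The following is a minimal sketch under the column layout described above. The function name `read_mirnas` and the specific checks (column count, unique ID, strand symbol, integer coordinates) are choices made for this example and not ncOrtho's own validation.

```python
import io


def read_mirnas(handle):
    """Parse the 7-column miRNA table and run basic sanity checks."""
    records, seen = [], set()
    for number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines
        cols = line.split("\t")
        if len(cols) != 7:
            raise ValueError(f"line {number}: expected 7 columns, found {len(cols)}")
        mirna_id, contig, start, stop, strand, pre_seq, mature_seq = cols
        if mirna_id in seen:
            raise ValueError(f"line {number}: duplicate miRNA id {mirna_id}")
        if strand not in ("+", "-"):
            raise ValueError(f"line {number}: strand must be + or -, got {strand!r}")
        seen.add(mirna_id)
        records.append({
            "id": mirna_id, "contig": contig,
            "start": int(start), "stop": int(stop),  # int() rejects non-numeric input
            "strand": strand, "pre": pre_seq, "mature": mature_seq,
        })
    return records


example = (
    "hsa-mir-6733\tNC_000001.11\t43171652\t43171712\t-\t"
    "GUGCUUGGGAAAGACAAACUCAGAGUUCCCUUCUUGUGAGCUCAGUGUCUGGAUUUCCUAG\tNA\n"
)
records = read_mirnas(io.StringIO(example))
print(records[0]["id"], records[0]["start"], records[0]["stop"])
```

The check that the contig ID also occurs in the reference GFF file is left out here, because it needs the annotation file as a second input.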
To provide better readability and a more user-friendly input, ncOrtho reads the paths to the input files from a parameter file in YAML format. Before each analysis, modify the example file to match your species of interest and supply the paths to the appropriate files on your local system.
Do not change the values of the "type" key in the parameter file ("reference" and "core") to other strings. You can, however, add as many core species as you like by adding the corresponding blocks.
Example file:
---
type: reference
name: Homo_sapiens
genome: </path/to/GCF_000001405.39_genomic.fna>
annotation: </path/to/GCF_000001405.39_genomic.gff>
---
type: core
name: Gorilla_gorilla
genome: </path/to/GCF_008122165.1_genomic.fna>
annotation: </path/to/GCF_008122165.1_genomic.gff>
orthologs: </path/to/oma/PairwiseOrthologs/Homo_sapiens_rep-Gorilla_gorilla_rep.txt>
---
type: core
name: Macaca_mulatta
genome: </path/to/GCF_003339765.1_genomic.fna>
annotation: </path/to/GCF_003339765.1_genomic.gff>
orthologs: </path/to/oma/PairwiseOrthologs/Homo_sapiens_rep-Macaca_mulatta_rep.txt>
---
type: core
name: Nomascus_leucogenys
genome: </path/to/GCF_006542625.1_genomic.fna>
annotation: </path/to/GCF_006542625.1_genomic.gff>
orthologs: </path/to/oma/PairwiseOrthologs/Homo_sapiens_rep-Nomascus_leucogenys_rep.txt>
---
type: core
name: Pongo_abelii
genome: </path/to/GCF_002880775.1_genomic.fna>
annotation: </path/to/GCF_002880775.1_genomic.gff>
orthologs: </path/to/oma/PairwiseOrthologs/Homo_sapiens_rep-Pongo_abelii_rep.txt>
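Before starting a run, it can be worth checking that every block in the parameter file carries the expected keys. The sketch below uses only the standard library and reads just the flat `key: value` layout shown in the example, so it is not a general YAML parser. The required keys per block type are inferred from the example file, and the expectation of exactly one reference block is an assumption. The names `read_parameter_blocks` and `check_blocks` are hypothetical.

```python
# Keys per block type, inferred from the example parameter file (assumption).
REQUIRED = {
    "reference": {"type", "name", "genome", "annotation"},
    "core": {"type", "name", "genome", "annotation", "orthologs"},
}


def read_parameter_blocks(text):
    """Split the text on '---' lines and read 'key: value' pairs per block."""
    blocks, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if line == "---":
            if current:
                blocks.append(current)
            current = {}
        elif line and not line.startswith("#"):
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    if current:
        blocks.append(current)
    return blocks


def check_blocks(blocks):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    for i, block in enumerate(blocks, start=1):
        kind = block.get("type")
        if kind not in REQUIRED:
            problems.append(f"block {i}: unknown type {kind!r}")
            continue
        missing = REQUIRED[kind] - block.keys()
        if missing:
            problems.append(f"block {i} ({block.get('name')}): missing {sorted(missing)}")
    # Assumption: a run uses exactly one reference species.
    if sum(b.get("type") == "reference" for b in blocks) != 1:
        problems.append("expected exactly one reference block")
    return problems


example = """\
---
type: reference
name: Homo_sapiens
genome: </path/to/GCF_000001405.39_genomic.fna>
annotation: </path/to/GCF_000001405.39_genomic.gff>
---
type: core
name: Gorilla_gorilla
genome: </path/to/GCF_008122165.1_genomic.fna>
annotation: </path/to/GCF_008122165.1_genomic.gff>
orthologs: </path/to/oma/PairwiseOrthologs/Homo_sapiens_rep-Gorilla_gorilla_rep.txt>
"""
blocks = read_parameter_blocks(example)
print(check_blocks(blocks))
```

An empty list means that no problem was found. The sketch does not test whether the listed paths exist on disk.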