Input Data

Data type Overview

Genomic sequence in FASTA format (e.g. "genomic.fna" from RefSeq or "dna.toplevel.fa" from Ensembl)
Genome annotation in GFF3 format (e.g. "genomic.gff" from RefSeq or ".gff3" from Ensembl)
Pairwise orthologs of all proteins between the reference and each core species
Tab-separated file containing the position and sequence of each miRNA for which a model should be constructed

Pairwise orthologs

You need one pairwise ortholog file per core species. Each of these files must meet the following requirements:

Tab-separated
Orthologs of the reference species in the first column
Orthologs of the core species in the second column
Any number of optional columns may follow; ncOrtho ignores them

Example:

#H_sapiens	G_gorilla	type	group
HUMAN00001	GORGO02140	1:1	604910
HUMAN00008	GORGO02141	1:1	995108
HUMAN00023	GORGO02142	1:1	807473
HUMAN00028	GORGO02144	1:1	978496

An easy way to obtain this data is by using the OMA Genome Pair View. If your reference/core species set contains taxa that are not part of the OMA database, you might want to use OMA standalone or another ortholog detection tool of your choice.
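Before starting an analysis, it can help to confirm that a pairwise ortholog file has the expected layout. The following is a minimal sketch, not part of ncOrtho: the function name `read_pairwise_orthologs` is hypothetical, and skipping lines that start with "#" is an assumption based on the header line in the example above.

```python
import csv
import io


def read_pairwise_orthologs(handle):
    """Return a dict mapping each reference ID to a list of core-species IDs."""
    orthologs = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Assumption: blank lines and '#' header/comment lines carry no data.
        if not row or row[0].startswith("#"):
            continue
        if len(row) < 2:
            raise ValueError(f"expected at least 2 tab-separated columns: {row!r}")
        # Column 1: reference species, column 2: core species.
        # Any further columns are ignored.
        ref_id, core_id = row[0], row[1]
        orthologs.setdefault(ref_id, []).append(core_id)
    return orthologs


example = (
    "#H_sapiens\tG_gorilla\ttype\tgroup\n"
    "HUMAN00001\tGORGO02140\t1:1\t604910\n"
    "HUMAN00008\tGORGO02141\t1:1\t995108\n"
)
pairs = read_pairwise_orthologs(io.StringIO(example))
print(pairs)
```

For a real file, pass an open file handle instead of the `io.StringIO` object.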

Note that the IDs in the pairwise orthologs file must match one of the ID types in the GFF3 file you are using. ncOrtho can parse four ID types from a (RefSeq) GFF file: "ID", "Name", "GeneID" and "CDS". The two example GFF lines below are followed by a table showing which value corresponds to which keyword:

NC_000001.11	BestRefSeq	gene	65419	71585	.	+	.	ID=gene-OR4F5;Dbxref=GeneID:79501,HGNC:HGNC:14825;Name=OR4F5;description=olfactory receptor family 4 subfamily F member 5;gbkey=Gene;gene=OR4F5;gene_biotype=protein_coding
NC_000001.11	BestRefSeq	CDS	65565	65573	.	+	0	ID=cds-NP_001005484.2;Parent=rna-NM_001005484.2;Dbxref=CCDS:CCDS30547.1,Ensembl:ENSP00000493376.2,GeneID:79501,Genbank:NP_001005484.2,HGNC:HGNC:14825;Name=NP_001005484.2;gbkey=CDS;gene=OR4F5;product=olfactory receptor 4F5;protein_id=NP_001005484.2;tag=MANE Select

Keyword	Example
ID	OR4F5
Name	OR4F5
GeneID	79501
CDS	NP_001005484
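To check which ID type your ortholog file matches, you can pull the candidate values out of a GFF line yourself. The sketch below reproduces the table above under stated assumptions; it is not ncOrtho's parser. The function name `gff_attribute_ids` is hypothetical, and three rules are guesses inferred from the example values: the feature-type prefix (such as "gene-") is stripped from "ID", "GeneID" is read from the "Dbxref" attribute, and "CDS" is the "protein_id" attribute without its version suffix.

```python
import re


def gff_attribute_ids(attribute_field):
    """Extract candidate IDs from the ninth column of a RefSeq GFF3 line."""
    # Split "key=value;key=value" pairs into a dictionary.
    attrs = dict(
        item.split("=", 1) for item in attribute_field.split(";") if "=" in item
    )
    ids = {}
    if "ID" in attrs:
        # Assumption: drop the feature-type prefix, e.g. "gene-OR4F5" -> "OR4F5".
        ids["ID"] = attrs["ID"].split("-", 1)[-1]
    if "Name" in attrs:
        ids["Name"] = attrs["Name"]
    # Assumption: the GeneID is taken from the Dbxref cross-references.
    match = re.search(r"GeneID:(\d+)", attrs.get("Dbxref", ""))
    if match:
        ids["GeneID"] = match.group(1)
    if "protein_id" in attrs:
        # Assumption: drop the version suffix, e.g. "NP_001005484.2" -> "NP_001005484".
        ids["CDS"] = attrs["protein_id"].rsplit(".", 1)[0]
    return ids


gene_attrs = (
    "ID=gene-OR4F5;Dbxref=GeneID:79501,HGNC:HGNC:14825;Name=OR4F5;"
    "description=olfactory receptor family 4 subfamily F member 5;"
    "gbkey=Gene;gene=OR4F5;gene_biotype=protein_coding"
)
cds_attrs = (
    "ID=cds-NP_001005484.2;Parent=rna-NM_001005484.2;"
    "Dbxref=CCDS:CCDS30547.1,Ensembl:ENSP00000493376.2,GeneID:79501,"
    "Genbank:NP_001005484.2,HGNC:HGNC:14825;Name=NP_001005484.2;gbkey=CDS;"
    "gene=OR4F5;product=olfactory receptor 4F5;protein_id=NP_001005484.2;"
    "tag=MANE Select"
)
gene_ids = gff_attribute_ids(gene_attrs)
cds_ids = gff_attribute_ids(cds_attrs)
print(gene_ids)
print(cds_ids["CDS"])
```

If none of the extracted values appear in your ortholog file, the two inputs probably use different ID types.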

If automatic parsing fails, you can instead supply ncOrtho with an ordered, tab-separated file of gene locations, for example:

#Contig	Start	End	Strand	ID_from_orthology_file
NC_003070.9	3631	5899	+	NP_171609
NC_003070.9	6788	9130	-	NP_001030923
NC_003070.9	11649	13714	-	NP_171611
NC_003070.9	23121	31227	+	NP_001184881
NC_003070.9	31170	33171	-	NP_001322481

Important: Irrespective of the ID type used, give one start and end position per gene, not one per isoform.
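A hand-written gene-location file is easy to get subtly wrong, so a quick consistency check may be worthwhile. The sketch below is not part of ncOrtho; the function name `check_gene_locations` is hypothetical. It assumes that "ordered" means sorted by start position within each contig and that lines starting with "#" are comments. Both are readings of the example above rather than documented behaviour.

```python
import csv
import io


def check_gene_locations(handle):
    """Return a list of problems found in a gene-location file."""
    problems = []
    last_start = {}  # last start position seen per contig
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        # Assumption: blank lines and '#' lines are not data rows.
        if not row or row[0].startswith("#"):
            continue
        if len(row) != 5:
            problems.append(f"line {line_no}: expected 5 columns, found {len(row)}")
            continue
        contig, start, end, strand, gene_id = row
        start, end = int(start), int(end)
        if strand not in ("+", "-"):
            problems.append(f"line {line_no}: invalid strand {strand!r}")
        # Assumption: genes are sorted by start position within a contig.
        if start < last_start.get(contig, 0):
            problems.append(
                f"line {line_no}: {gene_id} starts before the previous gene on {contig}"
            )
        last_start[contig] = start
    return problems


example = (
    "#Contig\tStart\tEnd\tStrand\tID_from_orthology_file\n"
    "NC_003070.9\t3631\t5899\t+\tNP_171609\n"
    "NC_003070.9\t6788\t9130\t-\tNP_001030923\n"
    "NC_003070.9\t11649\t13714\t-\tNP_171611\n"
)
problems = check_gene_locations(io.StringIO(example))
print(problems)
```

An empty list means the file passed these checks; it does not confirm that the IDs match your ortholog files.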

Reference miRNAs

Information about all reference miRNAs must be supplied in a single tab-separated file with 7 columns:

Unique miRNA id
Contig/Chromosome id (needs to match the one in the reference GFF file!)
Start
Stop
Strand (+ or -)
pre-miRNA sequence
mature miRNA sequence (no feature currently uses the mature sequence, so this column can hold a placeholder such as "NA" or "None")

Example:

hsa-mir-552	NC_000001.11	34669599	34669694	-	AACCAUUCAAAUAUACCACAGUUUGUUUAACCUUUUGCCUGUUGGUUGAAGAUGCCUUUCAACAGGUGACUGGUUAGACAAACUGUGGUAUAUACA	NA
hsa-mir-30e	NC_000001.11	40754355	40754446	+	GGGCAGUCUUUGCUACUGUAAACAUCCUUGACUGGAAGCUGUAAGGUGUUCAGAGGAGCUUUCAGUCGGAUGUUUACAGCGGCAGGCUGCCA	NA
hsa-mir-30c-1	NC_000001.11	40757284	40757372	+	ACCAUGCUGUAGUGUGUGUAAACAUCCUACACUCUCAGCUGUGAGCUCAAGGUGGCUGGGAGAGGGUUGUUUACUCCUUCUGCCAUGGA	NA
hsa-mir-6733	NC_000001.11	43171652	43171712	-	GUGCUUGGGAAAGACAAACUCAGAGUUCCCUUCUUGUGAGCUCAGUGUCUGGAUUUCCUAG	NA

You can retrieve this information from databases like MirGeneDB or miRBase.
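After assembling the miRNA file from one of these databases, a short script can confirm that every row follows the 7-column layout described above. This is a minimal sketch, not part of ncOrtho; the function name `read_reference_mirnas` is hypothetical. The check that the start does not exceed the stop is an assumption based on the example rows, where this holds on both strands.

```python
import csv
import io


def read_reference_mirnas(handle):
    """Parse the reference miRNA file and raise ValueError on malformed rows."""
    mirnas = []
    seen_ids = set()
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 7:
            raise ValueError(f"line {line_no}: expected 7 columns, found {len(row)}")
        mirna_id, contig, start, stop, strand, pre_seq, mature_seq = row
        if mirna_id in seen_ids:
            raise ValueError(f"line {line_no}: duplicate miRNA id {mirna_id!r}")
        seen_ids.add(mirna_id)
        if strand not in ("+", "-"):
            raise ValueError(f"line {line_no}: strand must be '+' or '-'")
        start, stop = int(start), int(stop)
        # Assumption: start <= stop regardless of strand, as in the example rows.
        if start > stop:
            raise ValueError(f"line {line_no}: start is larger than stop")
        mirnas.append(
            {
                "id": mirna_id,
                "contig": contig,
                "start": start,
                "stop": stop,
                "strand": strand,
                "pre_mirna": pre_seq,
                "mature": mature_seq,  # may be a placeholder such as "NA"
            }
        )
    return mirnas


example = (
    "hsa-mir-6733\tNC_000001.11\t43171652\t43171712\t-\t"
    "GUGCUUGGGAAAGACAAACUCAGAGUUCCCUUCUUGUGAGCUCAGUGUCUGGAUUUCCUAG\tNA\n"
)
mirnas = read_reference_mirnas(io.StringIO(example))
print(mirnas[0]["id"], mirnas[0]["start"], mirnas[0]["stop"])
```

The script does not verify that the contig IDs occur in the reference GFF file; that still has to be checked separately.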

Parameters file

To keep the input readable, ncOrtho reads the paths to the input files from a parameters file in YAML format. Before each analysis, adapt the example file to your species of interest and supply the paths to the corresponding files on your local system.

Do not change the values of the "type" key ("reference" and "core") in the parameters file. You can, however, add as many core species as you like by adding further "core" blocks.

Example file:

---
type: reference
name: Homo_sapiens
genome: </path/to/GCF_000001405.39_genomic.fna>
annotation: </path/to/GCF_000001405.39_genomic.gff>
---
type: core
name: Gorilla_gorilla
genome: </path/to/GCF_008122165.1_genomic.fna>
annotation: </path/to/GCF_008122165.1_genomic.gff>
orthologs: </path/to/oma/PairwiseOrthologs/Homo_sapiens_rep-Gorilla_gorilla_rep.txt>
---
type: core
name: Macaca_mulatta
genome: </path/to/GCF_003339765.1_genomic.fna>
annotation: </path/to/GCF_003339765.1_genomic.gff>
orthologs: </path/to/oma/PairwiseOrthologs/Homo_sapiens_rep-Macaca_mulatta_rep.txt>
---
type: core
name: Nomascus_leucogenys
genome: </path/to/GCF_006542625.1_genomic.fna>
annotation: </path/to/GCF_006542625.1_genomic.gff>
orthologs: </path/to/oma/PairwiseOrthologs/Homo_sapiens_rep-Nomascus_leucogenys_rep.txt>
---
type: core
name: Pongo_abelii
genome: </path/to/GCF_002880775.1_genomic.fna>
annotation: </path/to/GCF_002880775.1_genomic.gff>
orthologs: </path/to/oma/PairwiseOrthologs/Homo_sapiens_rep-Pongo_abelii_rep.txt>
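Because a mistyped key in this file tends to surface only once the analysis is running, a quick structural check beforehand can save time. The sketch below is not part of ncOrtho, and the names `read_parameter_blocks`, `check_parameter_blocks` and `REQUIRED_KEYS` are hypothetical. It uses only the standard library and handles just the flat "key: value" layout of the example; it is not a YAML parser, and for anything more complex PyYAML's `yaml.safe_load_all` is the more robust choice. The required keys and the rule of exactly one reference block are inferred from the example file.

```python
import io

# Assumption: required keys per block type, inferred from the example file.
REQUIRED_KEYS = {
    "reference": {"name", "genome", "annotation"},
    "core": {"name", "genome", "annotation", "orthologs"},
}


def read_parameter_blocks(handle):
    """Split the file on '---' lines and parse flat 'key: value' pairs."""
    blocks, current = [], {}
    for raw in handle:
        line = raw.strip()
        if line == "---":
            if current:
                blocks.append(current)
            current = {}
        elif line and not line.startswith("#"):
            # Split at the first colon only, so paths containing ':' stay intact.
            key, sep, value = line.partition(":")
            if not sep:
                raise ValueError(f"not a 'key: value' line: {line!r}")
            current[key.strip()] = value.strip()
    if current:
        blocks.append(current)
    return blocks


def check_parameter_blocks(blocks):
    """Return a list of problems; an empty list means the layout looks fine."""
    problems = []
    # Assumption: exactly one reference block, as in the example file.
    if sum(b.get("type") == "reference" for b in blocks) != 1:
        problems.append("expected exactly one block with type: reference")
    for number, block in enumerate(blocks, start=1):
        kind = block.get("type")
        if kind not in REQUIRED_KEYS:
            problems.append(f"block {number}: unknown type {kind!r}")
            continue
        missing = REQUIRED_KEYS[kind] - block.keys()
        if missing:
            problems.append(f"block {number}: missing keys {sorted(missing)}")
    return problems


example = """---
type: reference
name: Homo_sapiens
genome: </path/to/GCF_000001405.39_genomic.fna>
annotation: </path/to/GCF_000001405.39_genomic.gff>
---
type: core
name: Gorilla_gorilla
genome: </path/to/GCF_008122165.1_genomic.fna>
annotation: </path/to/GCF_008122165.1_genomic.gff>
orthologs: </path/to/oma/PairwiseOrthologs/Homo_sapiens_rep-Gorilla_gorilla_rep.txt>
"""
blocks = read_parameter_blocks(io.StringIO(example))
print(check_parameter_blocks(blocks))
```

The check covers only the structure of the file; it does not test whether the listed paths exist.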
