Input Data

felixlangschied edited this page Feb 16, 2023 · 1 revision

Data type Overview

  • Genomic sequence in FASTA format (e.g. "genomic.fna" from RefSeq or "dna.toplevel.fa" from Ensembl)
  • Genome annotation in GFF3 format (e.g. "genomic.gff" from RefSeq or ".gff3" from Ensembl)
  • Pairwise orthologs of all proteins between the reference and each core species
  • Tab-separated file containing the position and sequence of each miRNA for which a model should be constructed

Pairwise orthologs

When preparing your input data, you will need one pairwise orthologs file per core species. Each pairwise orthologs file should:

  • be tab-separated
  • list the IDs of the reference species in the first column
  • list the IDs of the orthologs in the core species in the second column
  • optionally contain any number of further columns, which ncOrtho ignores

Example:

#H_sapiens	G_gorilla	type	group
HUMAN00001	GORGO02140	1:1	604910
HUMAN00008	GORGO02141	1:1	995108
HUMAN00023	GORGO02142	1:1	807473
HUMAN00028	GORGO02144	1:1	978496
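
The layout described above can be read with a few lines of Python. This is only a minimal sketch to illustrate the format, not ncOrtho's own parser: the function name `read_pairwise_orthologs` is hypothetical, and treating lines that start with "#" as headers is an assumption based on the example.

```python
import csv
import io


def read_pairwise_orthologs(handle):
    """Return a dict mapping reference IDs to lists of core-species IDs."""
    orthologs = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Assumption: blank lines and lines starting with '#' carry no data.
        if not row or row[0].startswith("#"):
            continue
        if len(row) < 2:
            raise ValueError(f"expected at least 2 columns, got: {row!r}")
        ref_id, core_id = row[0], row[1]  # any further columns are ignored
        orthologs.setdefault(ref_id, []).append(core_id)
    return orthologs


example = (
    "#H_sapiens\tG_gorilla\ttype\tgroup\n"
    "HUMAN00001\tGORGO02140\t1:1\t604910\n"
    "HUMAN00008\tGORGO02141\t1:1\t995108\n"
)
pairs = read_pairwise_orthologs(io.StringIO(example))
```

Storing a list per reference ID keeps the sketch usable for files that also contain one-to-many relationships.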

An easy way to obtain this data is by using the OMA Genome Pair View. If your reference/core species set contains taxa that are not part of the OMA database, you might want to use OMA standalone or another ortholog detection tool of your choice.

Note, however, that the IDs in the pairwise orthologs file must match an ID type present in the GFF3 file you are using. ncOrtho can parse four types of IDs from a (RefSeq) GFF file: "ID", "Name", "GeneID" and "CDS". The two example GFF lines below are followed by a table showing which value corresponds to which keyword:

NC_000001.11	BestRefSeq	gene	65419	71585	.	+	.	ID=gene-OR4F5;Dbxref=GeneID:79501,HGNC:HGNC:14825;Name=OR4F5;description=olfactory receptor family 4 subfamily F member 5;gbkey=Gene;gene=OR4F5;gene_biotype=protein_coding
NC_000001.11	BestRefSeq	CDS	65565	65573	.	+	0	ID=cds-NP_001005484.2;Parent=rna-NM_001005484.2;Dbxref=CCDS:CCDS30547.1,Ensembl:ENSP00000493376.2,GeneID:79501,Genbank:NP_001005484.2,HGNC:HGNC:14825;Name=NP_001005484.2;gbkey=CDS;gene=OR4F5;product=olfactory receptor 4F5;protein_id=NP_001005484.2;tag=MANE Select

Keyword	Example
ID	OR4F5
Name	OR4F5
GeneID	79501
CDS	NP_001005484
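
To check which keyword fits the IDs in your orthologs file, it can help to pull the four values out of a GFF line yourself. The following is a rough sketch, not ncOrtho's parsing code: the function names are hypothetical, and the rules for stripping the "gene-"/"cds-" prefixes and the version suffix are assumptions inferred from the table above.

```python
def parse_attributes(column9):
    """Split the ninth GFF3 column into a dict of key -> value."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key] = value
    return attributes


def strip_prefix(text, prefix):
    """Remove a leading prefix if it is present."""
    return text[len(prefix):] if text.startswith(prefix) else text


def candidate_ids(gff_line):
    """Return the ID values of one GFF line, keyed by keyword."""
    cols = gff_line.rstrip("\n").split("\t")
    feature_type = cols[2]
    attrs = parse_attributes(cols[8])
    # Dbxref holds comma-separated "database:accession" pairs.
    dbxref = dict(
        ref.split(":", 1)
        for ref in attrs.get("Dbxref", "").split(",")
        if ":" in ref
    )
    ids = {"GeneID": dbxref.get("GeneID")}
    if feature_type == "gene":
        # Assumption: the "gene-" prefix is dropped, as in the table.
        ids["ID"] = strip_prefix(attrs["ID"], "gene-")
        ids["Name"] = attrs.get("Name")
    elif feature_type == "CDS":
        # Assumption: the "cds-" prefix and the ".2" version are dropped.
        ids["CDS"] = strip_prefix(attrs["ID"], "cds-").rsplit(".", 1)[0]
    return ids


# Shortened versions of the two example lines (fewer attributes).
gene_line = (
    "NC_000001.11\tBestRefSeq\tgene\t65419\t71585\t.\t+\t.\t"
    "ID=gene-OR4F5;Dbxref=GeneID:79501,HGNC:HGNC:14825;Name=OR4F5"
)
cds_line = (
    "NC_000001.11\tBestRefSeq\tCDS\t65565\t65573\t.\t+\t0\t"
    "ID=cds-NP_001005484.2;Parent=rna-NM_001005484.2;"
    "Dbxref=GeneID:79501,Genbank:NP_001005484.2;Name=NP_001005484.2"
)
gene_ids = candidate_ids(gene_line)
cds_ids = candidate_ids(cds_line)
```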

If automatic parsing fails, you can instead supply ncOrtho with an ordered, tab-separated file of gene locations, for example:

#Contig	Start	End	Strand	ID_from_orthology_file
NC_003070.9	3631	5899	+	NP_171609
NC_003070.9	6788	9130	-	NP_001030923
NC_003070.9	11649	13714	-	NP_171611
NC_003070.9	23121	31227	+	NP_001184881
NC_003070.9	31170	33171	-	NP_001322481

Important: Irrespective of the ID type used, give the start and end positions of each gene, not of each possible isoform of a gene.
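
A gene-locations file of this shape could be written as sketched below. The helper `write_gene_locations` is hypothetical, and two points are assumptions rather than documented behaviour: that "ordered" means sorted by contig and then by start position, and that a repeated ID indicates isoform-level rows that should be rejected.

```python
import io


def write_gene_locations(genes, handle):
    """Write (contig, start, end, strand, gene_id) tuples as tab-separated rows."""
    seen = set()
    # Assumption: "ordered" means sorted by contig, then by start coordinate.
    for contig, start, end, strand, gene_id in sorted(
        genes, key=lambda gene: (gene[0], gene[1])
    ):
        if strand not in ("+", "-"):
            raise ValueError(f"invalid strand for {gene_id}: {strand!r}")
        if start > end:
            raise ValueError(f"start is larger than end for {gene_id}")
        if gene_id in seen:
            # One row per gene is expected, not one row per isoform.
            raise ValueError(f"duplicate gene ID: {gene_id}")
        seen.add(gene_id)
        handle.write(f"{contig}\t{start}\t{end}\t{strand}\t{gene_id}\n")


genes = [
    ("NC_003070.9", 6788, 9130, "-", "NP_001030923"),
    ("NC_003070.9", 3631, 5899, "+", "NP_171609"),
]
buffer = io.StringIO()
write_gene_locations(genes, buffer)
```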

Reference miRNAs

Information about all reference miRNAs must be supplied in a single tab-separated file with 7 columns:

  1. Unique miRNA id
  2. Contig/Chromosome id (needs to match the one in the reference GFF file!)
  3. Start
  4. Stop
  5. Strand (+ or -)
  6. pre-miRNA sequence
  7. Mature miRNA sequence (no feature currently uses the mature sequence, so this column can be filled with a placeholder such as "NA" or "None")

Example:

hsa-mir-552	NC_000001.11	34669599	34669694	-	AACCAUUCAAAUAUACCACAGUUUGUUUAACCUUUUGCCUGUUGGUUGAAGAUGCCUUUCAACAGGUGACUGGUUAGACAAACUGUGGUAUAUACA	NA
hsa-mir-30e	NC_000001.11	40754355	40754446	+	GGGCAGUCUUUGCUACUGUAAACAUCCUUGACUGGAAGCUGUAAGGUGUUCAGAGGAGCUUUCAGUCGGAUGUUUACAGCGGCAGGCUGCCA	NA
hsa-mir-30c-1	NC_000001.11	40757284	40757372	+	ACCAUGCUGUAGUGUGUGUAAACAUCCUACACUCUCAGCUGUGAGCUCAAGGUGGCUGGGAGAGGGUUGUUUACUCCUUCUGCCAUGGA	NA
hsa-mir-6733	NC_000001.11	43171652	43171712	-	GUGCUUGGGAAAGACAAACUCAGAGUUCCCUUCUUGUGAGCUCAGUGUCUGGAUUUCCUAG	NA
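
Before running an analysis, it may be worth checking the miRNA file against the seven-column layout. The sketch below is an illustration only: `read_mirnas` is a hypothetical helper, and the `length_matches` flag assumes 1-based, inclusive coordinates, which is what the example rows suggest but is not stated as a requirement.

```python
import io


def read_mirnas(handle):
    """Parse the 7-column reference miRNA file into a list of dicts."""
    records = []
    seen_ids = set()
    for line_no, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) != 7:
            raise ValueError(f"line {line_no}: expected 7 columns, got {len(fields)}")
        mirna_id, contig, start, stop, strand, pre_seq, mature_seq = fields
        if mirna_id in seen_ids:
            raise ValueError(f"line {line_no}: miRNA id {mirna_id!r} is not unique")
        if strand not in ("+", "-"):
            raise ValueError(f"line {line_no}: strand must be '+' or '-'")
        seen_ids.add(mirna_id)
        start, stop = int(start), int(stop)
        records.append({
            "id": mirna_id,
            "contig": contig,
            "start": start,
            "stop": stop,
            "strand": strand,
            "pre": pre_seq,
            "mature": mature_seq,  # may be a placeholder such as "NA"
            # Assumption: coordinates are 1-based and inclusive.
            "length_matches": stop - start + 1 == len(pre_seq),
        })
    return records


example = (
    "hsa-mir-6733\tNC_000001.11\t43171652\t43171712\t-\t"
    "GUGCUUGGGAAAGACAAACUCAGAGUUCCCUUCUUGUGAGCUCAGUGUCUGGAUUUCCUAG\tNA\n"
)
records = read_mirnas(io.StringIO(example))
```

A `length_matches` value of `False` does not necessarily mean the row is wrong, but it is a cheap hint that coordinates and sequence may come from different sources or assemblies.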

You can retrieve this information from databases like MirGeneDB or miRBase.

Parameters file

For readability and ease of use, ncOrtho reads the paths to the input files from a parameters file in YAML format. Before each analysis, modify the example file to match your species of interest and supply the paths to the corresponding files on your local system.

Do not change the values of the "type" key ("reference" and "core") in the parameters file to other strings. You can, however, add as many core species as you like by adding the corresponding blocks.

Example file:

---
type: reference
name: Homo_sapiens
genome: </path/to/GCF_000001405.39_genomic.fna>
annotation: </path/to/GCF_000001405.39_genomic.gff>
---
type: core
name: Gorilla_gorilla
genome: </path/to/GCF_008122165.1_genomic.fna>
annotation: </path/to/GCF_008122165.1_genomic.gff>
orthologs: </path/to/oma/PairwiseOrthologs/Homo_sapiens_rep-Gorilla_gorilla_rep.txt>
---
type: core
name: Macaca_mulatta
genome: </path/to/GCF_003339765.1_genomic.fna>
annotation: </path/to/GCF_003339765.1_genomic.gff>
orthologs: </path/to/oma/PairwiseOrthologs/Homo_sapiens_rep-Macaca_mulatta_rep.txt>
---
type: core
name: Nomascus_leucogenys
genome: </path/to/GCF_006542625.1_genomic.fna>
annotation: </path/to/GCF_006542625.1_genomic.gff>
orthologs: </path/to/oma/PairwiseOrthologs/Homo_sapiens_rep-Nomascus_leucogenys_rep.txt>
---
type: core
name: Pongo_abelii
genome: </path/to/GCF_002880775.1_genomic.fna>
annotation: </path/to/GCF_002880775.1_genomic.gff>
orthologs: </path/to/oma/PairwiseOrthologs/Homo_sapiens_rep-Pongo_abelii_rep.txt>
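
A quick sanity check of a parameters file before starting a run might look like the sketch below. It is deliberately limited to the Python standard library and only understands flat "key: value" blocks separated by "---", which is all the example file uses; a real YAML parser would be the more robust choice. The function names, the required keys per block type, and the rule of exactly one reference block are assumptions derived from the example, not from ncOrtho's code.

```python
def read_parameter_blocks(text):
    """Split the parameters text into one dict per '---' separated block."""
    blocks, current = [], {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line == "---":
            if current:
                blocks.append(current)
                current = {}
            continue
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition(":")
        if not separator:
            raise ValueError(f"not a 'key: value' line: {raw_line!r}")
        current[key.strip()] = value.strip()
    if current:
        blocks.append(current)
    return blocks


# Assumption: keys required per block type, taken from the example file.
REQUIRED_KEYS = {
    "reference": {"name", "genome", "annotation"},
    "core": {"name", "genome", "annotation", "orthologs"},
}


def check_blocks(blocks):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    n_reference = sum(block.get("type") == "reference" for block in blocks)
    if n_reference != 1:
        problems.append(f"expected 1 reference block, found {n_reference}")
    for index, block in enumerate(blocks, start=1):
        kind = block.get("type")
        if kind not in REQUIRED_KEYS:
            problems.append(f"block {index}: unknown type {kind!r}")
            continue
        missing = sorted(REQUIRED_KEYS[kind] - block.keys())
        if missing:
            problems.append(f"block {index}: missing keys {missing}")
    return problems


# Hypothetical paths, for illustration only.
example = """---
type: reference
name: Homo_sapiens
genome: /data/human.fna
annotation: /data/human.gff
---
type: core
name: Gorilla_gorilla
genome: /data/gorilla.fna
annotation: /data/gorilla.gff
orthologs: /data/human_gorilla.txt
"""
blocks = read_parameter_blocks(example)
problems = check_blocks(blocks)
```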
