GETHOG3 needs the following software packages: omamer, Nextflow, Biopython, dendropy, ete3, fasttree and mafft (multiple sequence aligner).
You can start by creating a conda environment.
conda create --name gethog3 python=3.9
conda activate gethog3
To install omamer, please check its GitHub page. (You may need to install omamer with scipy==1.4.1 numpy==1.20.0 pytables==3.6.1.)
You can install the rest using conda.
conda install -c conda-forge biopython ete3
conda install -c bioconda mafft iqtree fasttree nextflow
And for wrappers, we need
conda install -c bioconda dendropy
conda install -c conda-forge pyparsing
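Once the environment is set up, a quick check can confirm that the tools are visible from it. The snippet below is a minimal sketch and not part of GETHOG3: the helper name missing_dependencies is hypothetical, and the executable names (omamer, nextflow, mafft, fasttree) and Python module names (Bio, dendropy, ete3, pyparsing) are assumptions about how the conda packages expose themselves. Adjust them if your installation differs (FastTree, for instance, is often installed under a capitalized name).

```python
import importlib.util
import shutil

# Assumed names; adjust them to match your installation.
EXECUTABLES = ["omamer", "nextflow", "mafft", "fasttree"]
MODULES = ["Bio", "dendropy", "ete3", "pyparsing"]


def missing_dependencies(executables=EXECUTABLES, modules=MODULES):
    """Return (executables not found on PATH, Python modules that cannot be located)."""
    missing_exe = [name for name in executables if shutil.which(name) is None]
    missing_mod = [name for name in modules if importlib.util.find_spec(name) is None]
    return missing_exe, missing_mod


if __name__ == "__main__":
    exe, mod = missing_dependencies()
    print("Missing executables:", exe or "none")
    print("Missing Python modules:", mod or "none")
```

Run it inside the activated gethog3 environment; anything it lists still needs to be installed or renamed in the two lists.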
1- Sets of protein sequences in FASTA format (with .fa extension) in the folder proteome. The name of each FASTA file is the name of the species.
2- The omamer database, which you can download from here. The full file is 13 GB and contains all the gene families of the Tree of Life; a subset of them, e.g. Primates (352 MB), can be used instead.
3- Species tree in nwk or phyloxml format. Note that the internal node names should not contain any special character (e.g. \ or space). The reason is that gethog3 writes some files whose names contain the internal node's name.
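The naming rules above can be checked before a run. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, the set of "safe" characters (letters, digits, underscore, hyphen, dot) is an assumption rather than GETHOG3's actual rule, and the regular expression handles plain Newick only (no quoted labels, no phyloxml).

```python
import re
from pathlib import Path


def species_names(proteome_dir):
    """Return the species names implied by the .fa file names in proteome_dir."""
    return sorted(p.stem for p in Path(proteome_dir).glob("*.fa"))


def internal_node_labels(newick):
    """Return the labels that directly follow a closing parenthesis in a Newick string."""
    # An internal node label sits between ')' and the next ':', ',', '(', ')' or ';'.
    found = re.findall(r"\)([^:,;()]*)", newick)
    return [label.strip() for label in found if label.strip()]


def unsafe_labels(newick):
    """Return internal node labels with characters outside the assumed safe set."""
    safe = re.compile(r"[A-Za-z0-9_.-]+")
    return [label for label in internal_node_labels(newick) if not safe.fullmatch(label)]


# The label "root node" is reported because it contains a space.
print(unsafe_labels("((HUMAN,PANTR)inter1,GORGO)root node;"))
```

An empty list from unsafe_labels means no internal node name was flagged under these assumptions.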
Orthology information as HOG structure in OrthoXML format.
First, download the GETHOG3 package:
mv gethog3-master gethog3
or clone it
git clone git@github.com:sinamajidian/gethog3.git
Then, cd to the testdata folder and download the omamer database.
cd gethog3/testdata
wget # 352MB
mv Primates.h5 working_folder
If you are using an omamer database with a different name (e.g. LUCA.h5), please change params.omamer_db in gethog3/gethog3/nextflow.config accordingly.
Next, set the path to working_folder (as an absolute path) in two places: gethog3/gethog3/ and gethog3/gethog3/nextflow.config
1- The variable working_folder in the first file:
working_folder = "/work/FAC/FBM/DBC/cdessim2/default/smajidi1/test/gethog3/testdata/working_folder"+ "/"
2- The variable params.working_folder in the file nextflow.config (the same as item 1).
params.working_folder= "/work/folder/gethog3/testdata/working_folder"+"/"
params.gethog3= "/work/folder/gethog3/gethog3"+"/"
Make sure that each path ends with "/".
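Both settings expect an absolute path with a trailing slash. As an illustration only, the helper below (a hypothetical name, not part of GETHOG3) shows one way to build such a value in Python from a relative or unterminated path.

```python
import os


def as_folder_path(path):
    """Return `path` as an absolute path that ends with the path separator."""
    # abspath() drops any trailing separator; joining with "" adds exactly one back.
    return os.path.join(os.path.abspath(path), "")


working_folder = as_folder_path("gethog3/testdata/working_folder")
print(working_folder)  # an absolute path ending in "/" on Linux and macOS
```

The printed value can be pasted into both configuration files, so the two settings stay identical.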
Finally, run the package using nextflow as below:
cd gethog3/testdata
nextflow ../gethog3/
After a few minutes, the run on the test data finishes. Then, the following files and folders should appear in the folder gethog3/testdata:
gene_id_dic_xml.pickle hogmap output_hog_.orthoxml pickle_rhogs
Primates.h5 proteome rhogs_all rhogs_big rhogs_rest species_tree.nwk
among which output_hog_.orthoxml is the final output. Its content looks like this:
<?xml version="1.0" ?>
<orthoXML xmlns="" origin="OMA" originVersion="Nov 2021" version="0.3">
<species name="MYCGE" NCBITaxId="1">
<database name="QFO database " version="2020">
<gene id="1000000000" protId="sp|P47500|RF1_MYCGE"/>
<gene id="1000000001" protId="sp|P13927|EFTU_MYCGE"/>
<gene id="1000000002" protId="sp|P47639|ATPB_MYCGE"/>
<orthologGroup id="HOG:B0885011_sub10003">
<property name="TaxRange" value="inter1"/>
<geneRef id="1002000004"/>
<geneRef id="1001000004"/>
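To inspect the output programmatically, the standard library XML parser is enough for a first look. The sketch below is not part of GETHOG3. The function names are hypothetical, and it assumes only the element and attribute names visible in the excerpt above (gene with id and protId, orthologGroup with id, geneRef). It ignores the XML namespace, so it does not depend on the xmlns value.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def summarize_orthoxml(root):
    """Return ({gene id: protId}, [(group id, number of geneRef descendants), ...])."""
    genes = {}
    groups = []
    for elem in root.iter():
        name = local_name(elem.tag)
        if name == "gene":
            genes[elem.get("id")] = elem.get("protId")
        elif name == "orthologGroup":
            # Counts every geneRef below this group, including those in nested groups.
            n_refs = sum(1 for sub in elem.iter() if local_name(sub.tag) == "geneRef")
            groups.append((elem.get("id"), n_refs))
    return genes, groups


def summarize_file(path):
    """Parse an OrthoXML file from disk and summarize it."""
    return summarize_orthoxml(ET.parse(path).getroot())
```

For example, summarize_file("output_hog_.orthoxml") returns the gene table and one entry per orthologGroup element; nested groups without an id attribute appear with None as their id.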
Please first try the test data. Now you should have the GETHOG3 package.
GETHOG3 is based on Nextflow. We consider a working folder which contains the omamer database Primates.h5, the proteome folder of the species of interest, proteome (with fa files inside), and the species tree species_tree.phyloxml (or nwk).
$ ls working_folder
Primates.h5 proteome species_tree.phyloxml
After running the package, the outputs will appear in this working folder.
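Before launching Nextflow, it can help to verify that the working folder has the expected layout. The following is a minimal sketch that reports which of the three inputs are missing. The function name is hypothetical, its defaults are taken from the file names used in this README, and species_tree.nwk is an assumed name for the nwk variant of the tree.

```python
from pathlib import Path


def missing_inputs(working_folder, omamer_db="Primates.h5"):
    """Return a list of problems; an empty list means the layout looks complete."""
    folder = Path(working_folder)
    problems = []
    if not (folder / omamer_db).is_file():
        problems.append(f"omamer database {omamer_db} not found")
    proteome = folder / "proteome"
    # The proteome folder must exist and hold at least one .fa file.
    if not proteome.is_dir() or not any(proteome.glob("*.fa")):
        problems.append("proteome folder with .fa files not found")
    trees = [folder / "species_tree.phyloxml", folder / "species_tree.nwk"]
    if not any(tree.is_file() for tree in trees):
        problems.append("species tree (species_tree.phyloxml or species_tree.nwk) not found")
    return problems
```

Calling missing_inputs("working_folder") on a complete folder returns an empty list. If you use a different omamer database, pass its file name as omamer_db.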
Please set the working_folder in two places: gethog3/gethog3/ and gethog3/gethog3/nextflow.config
1- The variable working_folder in the first file.
2- The variable params.working_folder in the file nextflow.config (the same as item 1).
Then, provide the path to the gethog3 code as params.gethog3 in the file nextflow.config
Finally, you can run:
nextflow gethog3/gethog3/
prerelease v.0.0.4 (Feb 15 2022)