GetHOG v3. (Under development)

Prerequisites

GETHOG3 needs the following software packages: omamer, Nextflow, Biopython, dendropy, ete3, fasttree and mafft (a multiple sequence aligner).

You can start by creating a conda environment.

conda create --name gethog3 python=3.9
conda activate gethog3

To install omamer, please check its GitHub page. (You may need to install omamer with scipy==1.4.1, numpy==1.20.0 and pytables==3.6.1.)

You can install the rest using conda.

conda install -c conda-forge biopython ete3  
conda install -c bioconda mafft iqtree fasttree nextflow

For the wrappers, we also need:

conda install -c bioconda dendropy
conda install -c conda-forge pyparsing

Input and Output:

Input:

1- Sets of protein sequences in FASTA format (with the .fa extension) in the folder proteome. The name of each FASTA file is the name of the species.

2- The omamer database, which you can download from the OMA browser. The full file is 13 GB and contains all the gene families of the Tree of Life; you can also use a subset of them, e.g. Primates (352 MB).

3- Species tree in nwk or phyloxml format. Note that the internal node names should not contain any special characters (e.g. \ / or space), because gethog3 writes some files whose names contain the internal node's name.
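The input conventions above can be checked before starting a run. The following is a minimal sketch and not part of GETHOG3: the helper names `species_from_proteome` and `bad_internal_labels` are hypothetical, and the Newick handling is a simple regular expression that only looks at the label following each closing parenthesis, so it is an approximation rather than a full parser.

```python
import re
from pathlib import Path

# Characters the README warns against in internal node names:
# backslash, slash and whitespace.
FORBIDDEN = re.compile(r"[\\/\s]")


def species_from_proteome(proteome_dir):
    """Return the species names, i.e. the stems of the .fa files in the folder."""
    return sorted(p.stem for p in Path(proteome_dir).glob("*.fa"))


def bad_internal_labels(newick):
    """Return internal node labels containing a backslash, slash or whitespace.

    Assumption: an internal node label is whatever follows a ")" up to the
    next Newick delimiter. Unnamed internal nodes give empty labels and pass.
    """
    labels = [lab.strip() for lab in re.findall(r"\)([^():,;]*)", newick)]
    return [lab for lab in labels if FORBIDDEN.search(lab)]


# Hypothetical tree: the second internal node has a problematic name.
tree = "((HUMAN,PANTR)inter1,(MOUSE,RATNO)bad/name)root;"
print(bad_internal_labels(tree))
```

An empty list from `bad_internal_labels` means no internal node name was flagged by this simple check.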

Output:

Orthology information as HOG structure in OrthoXML format.

How to run GETHOG3 on the test data

First, download the GETHOG3 package:

wget https://github.com/sinamajidian/gethog3/archive/refs/heads/master.zip
unzip master.zip
mv gethog3-master gethog3

or clone it

git clone git@github.com:sinamajidian/gethog3.git

Then, cd to the testdata folder and download the omamer database.

cd gethog3/testdata
wget https://omabrowser.org/All/Primates.h5    # 352MB
mv Primates.h5  working_folder

If you are using an omamer database with a different name (e.g. LUCA.h5), please change params.omamer_db in gethog3/gethog3/nextflow.config.

Next, set the absolute path to working_folder in two places, gethog3/gethog3/_config.py and gethog3/gethog3/nextflow.config:

1- The variable working_folder in the file _config.py

working_folder = "/work/FAC/FBM/DBC/cdessim2/default/smajidi1/test/gethog3/testdata/working_folder"+ "/"

2- The variable params.working_folder in the file nextflow.config (the same as item 1).

params.working_folder= "/work/folder/gethog3/testdata/working_folder"+"/"
params.gethog3= "/work/folder/gethog3/gethog3"+"/"

Make sure that each path ends with "/".
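Since both settings expect an absolute path with a trailing "/", a small helper can produce that form from any path. This is only a sketch: `as_folder_path` is a hypothetical name, it is not part of GETHOG3, and it assumes POSIX-style paths like those used above.

```python
import os


def as_folder_path(path):
    """Return `path` as an absolute path with exactly one trailing "/".

    Hypothetical helper for illustration; relative paths are resolved
    against the current directory.
    """
    return os.path.abspath(path).rstrip("/") + "/"


# Example: print the value to paste into _config.py and nextflow.config.
print(as_folder_path("testdata/working_folder"))
```

The printed string can then be copied into the variable working_folder and into params.working_folder.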

Finally, run the package using nextflow as below:

cd gethog3/testdata
nextflow ../gethog3/gethog3_script.nf

After a few minutes, the run on the test data finishes. The following files and folders should then appear in the folder gethog3/testdata/working_folder.

gene_id_dic_xml.pickle  hogmap  output_hog_.orthoxml  pickle_rhogs 
 Primates.h5  proteome  rhogs_all  rhogs_big  rhogs_rest  species_tree.nwk

Among these, output_hog_.orthoxml is the final output. Its content looks like this:

<?xml version="1.0" ?>
<orthoXML xmlns="http://orthoXML.org/2011/" origin="OMA" originVersion="Nov 2021" version="0.3">
   <species name="MYCGE" NCBITaxId="1">
      <database name="QFO database " version="2020">
         <genes>
            <gene id="1000000000" protId="sp|P47500|RF1_MYCGE"/>
            <gene id="1000000001" protId="sp|P13927|EFTU_MYCGE"/>
            <gene id="1000000002" protId="sp|P47639|ATPB_MYCGE"/>
            
 ...
      <orthologGroup id="HOG:B0885011_sub10003">
         <property name="TaxRange" value="inter1"/>
         <geneRef id="1002000004"/>
         <geneRef id="1001000004"/>
      </orthologGroup>
   </groups>
</orthoXML>
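To inspect such an output, the OrthoXML file can be read with Python's standard library. The sketch below is not part of GETHOG3: the function name `hog_members` is hypothetical, and `SAMPLE` is a small made-up document that only mirrors the element and attribute names of the excerpt above, not real output.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the orthoXML root element in the excerpt above.
NS = {"ox": "http://orthoXML.org/2011/"}

# Made-up, complete document for illustration only (ids and names are invented).
SAMPLE = """<?xml version="1.0" ?>
<orthoXML xmlns="http://orthoXML.org/2011/" origin="OMA" originVersion="Nov 2021" version="0.3">
   <species name="SP1" NCBITaxId="1">
      <database name="example" version="2020">
         <genes>
            <gene id="1" protId="prot_a"/>
         </genes>
      </database>
   </species>
   <species name="SP2" NCBITaxId="1">
      <database name="example" version="2020">
         <genes>
            <gene id="2" protId="prot_b"/>
         </genes>
      </database>
   </species>
   <groups>
      <orthologGroup id="HOG:example">
         <property name="TaxRange" value="inter1"/>
         <geneRef id="1"/>
         <geneRef id="2"/>
      </orthologGroup>
   </groups>
</orthoXML>
"""


def hog_members(xml_text):
    """Map each top-level orthologGroup id to the protIds of its genes."""
    root = ET.fromstring(xml_text)
    # Lookup table from numeric gene id to protein identifier.
    prot = {g.get("id"): g.get("protId") for g in root.iterfind(".//ox:gene", NS)}
    groups = {}
    for grp in root.iterfind("ox:groups/ox:orthologGroup", NS):
        # Collect geneRefs at any depth, so nested sub-groups are included.
        refs = [r.get("id") for r in grp.iterfind(".//ox:geneRef", NS)]
        groups[grp.get("id")] = [prot.get(i, i) for i in refs]
    return groups


print(hog_members(SAMPLE))
```

For a real run, the text of output_hog_.orthoxml would be passed in place of `SAMPLE`.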

How to config and run GETHOG3

Please try the test data first; after that, you should already have the GETHOG3 package.

GETHOG3 is based on Nextflow. We consider a working folder which contains the omamer database Primates.h5, the folder proteome with the proteomes of the species of interest (the fa files inside), and the species tree species_tree.phyloxml (or nwk).

$ ls working_folder
Primates.h5  proteome  species_tree.phyloxml

After running the package, the outputs will appear in this working folder.
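Before launching Nextflow, it can help to confirm that the working folder has the layout shown above. The following is a minimal sketch under stated assumptions: `missing_inputs` is a hypothetical helper, not part of GETHOG3, and it assumes the database has the .h5 extension and the tree is named species_tree.nwk or species_tree.phyloxml, as in the listing.

```python
from pathlib import Path


def missing_inputs(working_folder):
    """Return descriptions of expected inputs that are absent from the folder."""
    wf = Path(working_folder)
    missing = []
    # Any .h5 file is accepted here, e.g. Primates.h5 or LUCA.h5.
    if not any(wf.glob("*.h5")):
        missing.append("omamer database (*.h5)")
    proteome = wf / "proteome"
    if not (proteome.is_dir() and any(proteome.glob("*.fa"))):
        missing.append("proteome folder with .fa files")
    tree_names = ("species_tree.nwk", "species_tree.phyloxml")
    if not any((wf / name).is_file() for name in tree_names):
        missing.append("species tree (species_tree.nwk or species_tree.phyloxml)")
    return missing


# Example: report what is missing in a folder named working_folder.
print(missing_inputs("working_folder"))
```

An empty list means all three inputs were found by this check; it does not validate their contents.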

Please set the working_folder in two places, gethog3/gethog3/_config.py and gethog3/gethog3/nextflow.config:

1- The variable working_folder in the file _config.py

2- The variable params.working_folder in the file nextflow.config (the same as item 1).

Then, provide the address of the gethog3 code as params.gethog3 in the file nextflow.config.

Finally you can run:

nextflow gethog3/gethog3/gethog3_script.nf

Change log

prerelease v.0.0.4 (Feb 15 2022)
