Config File

ebersber edited this page Nov 4, 2018 · 10 revisions

The protTrace Configuration File

For each protTrace run you need to specify a config file that controls the protTrace workflow. The initial config file is generated when you run the generate_conf.pl script. For subsequent runs you can either edit the config file manually using any text editor, or re-run the generate_conf.pl script with the option -update.

General options

  • species [Five Letter Abbreviation] - Default YEAST
    • Use this option to specify the seed species, i.e. the species your protein of interest was derived from. Currently, only species abbreviations as used by OMA are supported. If you are unsure how to abbreviate your species, take a look at the species information provided online by OMA.
  • nr_of_processors [INT] - Default 1
    • The number of processors to be used by protTrace
  • delete_temporary_files [YES|NO] - Default YES
    • This option cleans all the temporary files produced by protTrace. It is highly recommended to use this option.
  • reuse_cache [YES|NO] - Default NO
    • If set to YES, protTrace will use pre-existing information present in the cache and output directories. NOTE: This information will be used without any further sanity check. If unsure, choose 'NO'.
  • map_traceability_tree [YES|NO] - Default NO
    • Selecting this option will colorize a taxonomy tree according to the traceability values of the seed protein in the corresponding species. In addition, a mapping file will be generated that specifies the traceability index for the seed protein in each species.
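
As a rough illustration of how the general options and their defaults fit together, the following Python sketch overlays user settings on the documented defaults and checks the value types. The `key:value` line layout, the `#` comment handling, and the function names `parse_config` and `validate_general` are assumptions made for this example; they are not taken from the protTrace code, so check a generated config file for the actual syntax.

```python
# Defaults as documented in the "General options" list above.
GENERAL_DEFAULTS = {
    "species": "YEAST",
    "nr_of_processors": "1",
    "delete_temporary_files": "YES",
    "reuse_cache": "NO",
    "map_traceability_tree": "NO",
}
YES_NO_OPTIONS = ("delete_temporary_files", "reuse_cache", "map_traceability_tree")


def parse_config(text, sep=":"):
    """Overlay 'key<sep>value' lines on the defaults (ASSUMED file layout)."""
    options = dict(GENERAL_DEFAULTS)
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments (assumption)
        if not line or sep not in line:
            continue
        key, value = line.split(sep, 1)
        options[key.strip()] = value.strip()
    return options


def validate_general(options):
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    species = options["species"]
    if len(species) != 5 or not species.isalnum():
        errors.append(f"species: expected a five-character code, got {species!r}")
    nproc = options["nr_of_processors"]
    if not nproc.isdigit() or int(nproc) < 1:
        errors.append(f"nr_of_processors: expected a positive integer, got {nproc!r}")
    for key in YES_NO_OPTIONS:
        if options[key] not in ("YES", "NO"):
            errors.append(f"{key}: expected YES or NO, got {options[key]!r}")
    return errors


if __name__ == "__main__":
    opts = parse_config("species:HUMAN\nnr_of_processors:4\nreuse_cache:YES\n")
    print(opts)
    print(validate_general(opts))
```

The check only tests the shape of the species code; whether a code is actually known to OMA has to be verified against the OMA species list.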

Preprocessing

The following tasks are accomplished in the preprocessing step:

  • Preparation of BLAST directory for subsequent traceability calculations
  • Compilation of orthologs (Default: OMA)
  • Multiple sequence alignment of orthologous sequences (MAFFT_LINSI)
  • Sequence tree reconstruction (IQ-TREE)
  • Scaling factor (k) calculation
  • Insertion / deletion (indel) rates calculation (IQ-TREE)
  • Preparation of the input parameter XML file for REvolver simulations
  • OPTIONALLY: Feature architecture similarity (FAS) score computation between seed protein and ortholog. This requires the installation of the HaMStR package

Basic preprocessing

The following parameters are the main switches controlling the first step in the protTrace workflow.

  • preprocessing [YES|NO] - Default YES
    • specifies whether the preprocessing procedure is performed
  • orthologs_prediction [YES|NO] - Default YES
    • specifies whether you want protTrace to compile orthologs to the seed protein. When choosing 'NO', you will have to specify a file with an existing orthologous group. Details on how to format the corresponding custom ortholog file are provided at projects:prottrace:options:general:custom_orthologs.
  • search_oma_database [YES|NO] - Default YES
    • specifies whether you want protTrace to use pre-computed orthologous groups from OMA. If set to 'NO', you will have to set the option run_hamstrOneSeq to 'YES'
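
The dependencies between these switches can be written down as a small consistency check. This is a minimal sketch, not protTrace's own validation: the function name `check_ortholog_settings` and the `custom_ortholog_file` argument are hypothetical, and the rules are only the two stated in the list above.

```python
def check_ortholog_settings(options, custom_ortholog_file=None):
    """Return a list of conflicts between the ortholog-related switches.

    `options` maps option names to 'YES'/'NO' strings; missing keys fall
    back to the documented defaults. `custom_ortholog_file` is a
    hypothetical stand-in for however the custom group is supplied.
    """
    def is_yes(key, default):
        return options.get(key, default).upper() == "YES"

    problems = []
    # Rule 1: without ortholog prediction, an existing group must be provided.
    if not is_yes("orthologs_prediction", "YES") and not custom_ortholog_file:
        problems.append(
            "orthologs_prediction is NO, but no custom ortholog file was given"
        )
    # Rule 2: without the OMA groups, HaMStR-OneSeq has to compile the orthologs.
    if not is_yes("search_oma_database", "YES") and not is_yes("run_hamstrOneSeq", "NO"):
        problems.append(
            "search_oma_database is NO, so run_hamstrOneSeq must be set to YES"
        )
    return problems


if __name__ == "__main__":
    print(check_ortholog_settings({"search_oma_database": "NO"}))
```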

Advanced Preprocessing

With the help of the following options you can fine-tune the behaviour of protTrace during the preprocessing step.

  • run_hamstr [YES|NO] - Default NO
  • run_hamstrOneSeq [YES|NO] - Default NO
    • set to 'YES' when you want to use HaMStR-OneSeq for compiling the orthologous group for the seed protein on the fly instead of using pre-computed orthologous groups. Requires a fully functional installation of the HaMStR package. See projects:prottrace:options:hamstr for an overview of how to set up HaMStR
  • include_paralogs [YES|NO] - Default NO
    • Set to 'YES' to include inparalogs in the analysis
  • fas_score [YES|NO] - Default NO (STILL EXPERIMENTAL)
    • Set to 'YES' to activate the computation of the feature architecture similarity between the seed protein and each of its orthologs. Requires a fully functional installation of the HaMStR package. See projects:prottrace:options:hamstr for an overview of how to set up HaMStR

Scaling Factor

  • calculate_scaling_factor [YES|NO] - Default YES
    • Set to 'YES' to compute the relative substitution rate of the seed protein by comparing the pairwise distances of the orthologs to the pairwise distances of the corresponding species.
  • default_scaling_factor [FLOAT] - Default 1.57
    • Allows you to adjust the relative substitution rate for cases where the number of orthologs is too small for an empirical inference.
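
To make the idea of a relative substitution rate concrete, the sketch below compares protein distances with species distances pair by pair and falls back to the default when no usable pair exists. Taking the median of the per-pair ratios is an illustrative choice made here; the estimator protTrace actually uses may differ, and the names `scaling_factor` and `min_pairs` are hypothetical.

```python
from statistics import median

DEFAULT_SCALING_FACTOR = 1.57  # documented default_scaling_factor


def scaling_factor(protein_dist, species_dist,
                   default=DEFAULT_SCALING_FACTOR, min_pairs=1):
    """Estimate the relative substitution rate k of the seed protein.

    Both arguments map frozenset({species_a, species_b}) to a pairwise
    distance. The median of the ratios is an ILLUSTRATIVE summary only.
    """
    ratios = []
    for pair, d_prot in protein_dist.items():
        d_spec = species_dist.get(pair)
        if d_spec:  # skip pairs that are missing or have zero species distance
            ratios.append(d_prot / d_spec)
    if len(ratios) < min_pairs:
        return default  # too little data for an empirical estimate
    return median(ratios)


if __name__ == "__main__":
    prot = {
        frozenset({"YEAST", "HUMAN"}): 2.0,
        frozenset({"YEAST", "MOUSE"}): 3.0,
        frozenset({"HUMAN", "MOUSE"}): 0.5,
    }
    spec = {
        frozenset({"YEAST", "HUMAN"}): 1.0,
        frozenset({"YEAST", "MOUSE"}): 2.0,
        frozenset({"HUMAN", "MOUSE"}): 0.5,
    }
    print(scaling_factor(prot, spec))  # ratios 2.0, 1.5, 1.0
    print(scaling_factor({}, spec))    # falls back to the default
```

A value above 1 means the protein evolves faster than the species-level average distances would suggest; a value below 1 means it evolves more slowly.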

Indel parameter

  • perform_msa [YES|NO] - Default YES
    • Set to 'NO' if you plan to provide protTrace with an already existing MSA for the ortholog group. Make sure to follow the naming convention.
  • calculate_indel [YES|NO] - Default YES
    • Computes the indel rate and length distribution parameter from the alignment and the sequence tree. Set to 'NO' to skip the indel rate computation
  • default_indel [FLOAT] - Default 0.08
    • Specifies the default indel rate used when the number of orthologs for a seed protein is too small (<4) to compute an empirical indel rate. The default value is derived from the analysis of the yeast gene set.
  • default_indel_distribution [FLOAT] - Default 0.25
    • Specifies the default shape parameter of the geometric distribution used to model the indel length distribution.
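
The two indel defaults can be illustrated with a short sketch: a fallback rule for the indel rate and a geometric length distribution. It assumes the common parameterization P(L = k) = p(1 - p)^(k - 1) for k = 1, 2, ..., which gives a mean indel length of 1/p (4 residues for p = 0.25); whether REvolver uses exactly this parameterization is not stated on this page. The function names are hypothetical.

```python
import math
import random

DEFAULT_INDEL = 0.08               # documented default_indel
DEFAULT_INDEL_DISTRIBUTION = 0.25  # documented default_indel_distribution
MIN_ORTHOLOGS = 4                  # fewer orthologs: no empirical indel rate


def indel_rate(empirical_rate, n_orthologs, default=DEFAULT_INDEL):
    """Use the empirical rate only if enough orthologs were available."""
    if empirical_rate is None or n_orthologs < MIN_ORTHOLOGS:
        return default
    return empirical_rate


def geometric_pmf(k, p=DEFAULT_INDEL_DISTRIBUTION):
    """P(indel length == k) under the ASSUMED parameterization."""
    if k < 1:
        return 0.0
    return p * (1.0 - p) ** (k - 1)


def sample_indel_length(rng, p=DEFAULT_INDEL_DISTRIBUTION):
    """Draw one indel length (>= 1) by inverse-transform sampling."""
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    return int(math.log(u) / math.log(1.0 - p)) + 1


if __name__ == "__main__":
    rng = random.Random(42)
    lengths = [sample_indel_length(rng) for _ in range(10)]
    print(indel_rate(0.2, n_orthologs=3), lengths)
```

A larger value of default_indel_distribution shifts the probability mass towards short indels; a smaller value produces longer indels on average.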

Traceability calculation

  • traceability_calculation [YES|NO] - Default YES
    • set to 'NO' to omit the traceability computation
  • aa_substitution_matrix [JTT|WAG|LG|Blosum62|mtMAM|mtREV|mtART] - Default WAG
    • lets you specify the substitution model for the simulation of protein sequence evolution
  • simulation_runs [INTEGER] - Default 100
    • specifies the number of REvolver simulations

Program paths

  • iqtree [PATH TO EXECUTABLE] - NONE (essential)
    • determined by create_conf.pl (works only if the program is in the default PATH); alternatively, provide the absolute path to the executable including the program name.
  • linsi [PATH TO EXECUTABLE] - NONE (essential)
    • determined by create_conf.pl (works only if the program is in the default PATH); alternatively, provide the absolute path to the executable including the program name.
  • hmmfetch [PATH TO EXECUTABLE] - NONE (essential)
    • determined by create_conf.pl (works only if the program is in the default PATH); alternatively, provide the absolute path to the executable including the program name.
  • hmmscan [PATH TO EXECUTABLE] - NONE (essential)
    • determined by create_conf.pl (works only if the program is in the default PATH); alternatively, provide the absolute path to the executable including the program name.
  • blastp [PATH TO EXECUTABLE] - NONE (essential)
    • determined by create_conf.pl (works only if the program is in the default PATH); alternatively, provide the absolute path to the executable including the program name.
  • makeblastdb [PATH TO EXECUTABLE] - NONE (essential)
    • determined by create_conf.pl (works only if the program is in the default PATH); alternatively, provide the absolute path to the executable including the program name.
  • Rscript [PATH TO EXECUTABLE] - NONE (essential)
    • determined by create_conf.pl (works only if the program is in the default PATH); alternatively, provide the absolute path to the executable including the program name.
  • hamstr [PATH TO EXECUTABLE] - NONE (optional)
    • provide the absolute path to the executable including the program name.
  • oneseq [PATH TO EXECUTABLE] - NONE (optional)
    • provide the absolute path to the executable including the program name.
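
If you fill in the program paths by hand, you can first check which executables are reachable through your PATH. The sketch below uses Python's `shutil.which` for the lookup. It assumes that each executable is installed under the same name as its option (on some systems the binaries are named differently, for example `mafft-linsi` or `iqtree2`), and it is not how create_conf.pl itself performs the lookup.

```python
import shutil

# Option names from the "Program paths" list; executable names are ASSUMED
# to be identical to the option names.
ESSENTIAL = ["iqtree", "linsi", "hmmfetch", "hmmscan",
             "blastp", "makeblastdb", "Rscript"]
OPTIONAL = ["hamstr", "oneseq"]


def resolve_programs(names, lookup=shutil.which):
    """Map each program name to its absolute path, or None if not on PATH."""
    return {name: lookup(name) for name in names}


def missing_essential(resolved):
    """Return the sorted names of essential programs without a resolved path."""
    return sorted(name for name in ESSENTIAL if not resolved.get(name))


if __name__ == "__main__":
    resolved = resolve_programs(ESSENTIAL + OPTIONAL)
    for name, path in resolved.items():
        print(f"{name}: {path or 'NOT FOUND'}")
    print("missing essential programs:", missing_essential(resolved))
```

For every program reported as missing, enter the absolute path to the executable, including the program name, in the config file.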

Paths to files

  • REvolver [REvolver.jar] - Default Path2protTrace-dir/used_files/REvolver.jar
  • simulation_tree [stepWiseTree.nw] - Default Path2protTrace-dir/used_files/stepWiseTree.nw
  • decay_script [r_nonlinear_leastsquare.R] - Default Path2protTrace-dir/used_files/r_nonlinear_leastsquare.R
  • plot_figtree [plotPdf.jar] - Default Path2protTrace-dir/used_files/plotPdf.jar
  • Xref_mapping_file [speciesTreeMapping.txt] - Default Path2protTrace-dir/used_files/speciesTreeMapping.txt
    • Cross-reference file that links species identifier, species name, NCBI taxonomy ID, and species abbreviation
  • reference_species_tree [speciesTree.nw] - Default Path2protTrace-dir/used_files/speciesTree.nw
    • Species tree on which the traceability values are plotted. protTrace ships with an NCBI taxonomy tree
  • species_MaxLikMatrix [speciesLikelihoodMatrix.txt] - Default Path2protTrace-dir/used_files/speciesLikelihoodMatrix.txt
    • Pairwise likelihood distance matrix of the species represented in the tree provided with the option reference_species_tree
  • path_oma_seqs [oma-seqs.fa] - Default Path2protTrace-dir/used_files/oma-seqs.fa
    • specifies the file containing the sequences corresponding to the OMA orthologous groups. Optionally downloaded and processed by the create_conf.pl script
  • path_oma_group [oma-groups.txt] - Default Path2protTrace-dir/used_files/oma-groups.txt
    • specifies the file containing the OMA orthologous group assignments. Optionally downloaded by the create_conf.pl script
  • pfam_database [Pfam-A.hmm] - Default Path2protTrace-dir/used_files/Pfam-A.hmm
    • Profile hidden Markov models of the Pfam database. Optionally downloaded and processed by the create_conf.pl script
  • fas_annotations [Path2HaMStR/weight_dir] - Default NONE
    • sub-directory of the HaMStR directory containing the feature annotated gene sets
  • hamstr_environment [Path2HaMStR] - Default NONE
  • path_output_dir [output] - Default Path2protTrace/output
    • Directory where protTrace stores the results. If the specified directory does not exist, protTrace will attempt to create it
  • path_cache [cache] - Default Path2protTrace/cache
    • Directory where protTrace stores meta-results
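
The default locations listed above all follow the same pattern, which the following sketch reproduces for a given protTrace directory. It covers only a subset of the file options, the helper names `default_file_paths` and `ensure_dir` are hypothetical, and the directory creation merely mirrors the documented behaviour for path_output_dir rather than protTrace's actual code.

```python
from pathlib import Path

# Option name -> file name inside <protTrace-dir>/used_files (subset of the
# list above, shown for illustration).
USED_FILES = {
    "REvolver": "REvolver.jar",
    "simulation_tree": "stepWiseTree.nw",
    "decay_script": "r_nonlinear_leastsquare.R",
    "plot_figtree": "plotPdf.jar",
    "Xref_mapping_file": "speciesTreeMapping.txt",
    "reference_species_tree": "speciesTree.nw",
    "species_MaxLikMatrix": "speciesLikelihoodMatrix.txt",
}


def default_file_paths(prottrace_dir):
    """Build the documented default path for each file option."""
    base = Path(prottrace_dir) / "used_files"
    return {option: base / filename for option, filename in USED_FILES.items()}


def ensure_dir(path):
    """Create the directory (and its parents) if it does not exist yet."""
    directory = Path(path)
    directory.mkdir(parents=True, exist_ok=True)
    return directory


if __name__ == "__main__":
    for option, path in default_file_paths("/opt/protTrace").items():
        status = "ok" if path.is_file() else "missing"
        print(f"{option}: {path} [{status}]")
```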