
Config File

Ingo Ebersberger edited this page Nov 4, 2018 · 10 revisions

The protTrace configuration file

For each protTrace run you need to specify a config file that controls the protTrace workflow. The initial config file is generated when you run the generate_conf.pl script. For subsequent runs you can either edit the config file manually in any text editor or re-run the generate_conf.pl script with the option -update.

General options

  • species [Five Letter Abbreviation] - Default YEAST
  • nr_of_processors [INT] - Default 1
    • The number of processors to be used by protTrace
  • delete_temporary_files [YES|NO] - Default YES
    • This option cleans all the temporary files produced by protTrace. It is highly recommended to use this option.
  • reuse_cache [YES|NO] - Default NO
    • If set to YES, protTrace will use pre-existing information present in the cache and output directories. NOTE: This information will be used without further check for sanity. If unsure, choose 'NO'.
  • map_traceability_tree [YES|NO] - Default NO
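
As a rough illustration of how these general options and their defaults fit together, the following Python sketch reads them from config text and validates them. This page does not show the syntax of the config file, so the one-`name:value`-pair-per-line format, the `#` comment handling, and the function name `parse_general_options` are all assumptions made for the example, not protTrace's actual parser.

```python
# Hypothetical parser: the `name:value` line syntax is an assumption of
# this sketch; this page does not specify the config file format.
GENERAL_DEFAULTS = {
    "species": "YEAST",
    "nr_of_processors": "1",
    "delete_temporary_files": "YES",
    "reuse_cache": "NO",
    "map_traceability_tree": "NO",
}
YES_NO_OPTIONS = ("delete_temporary_files", "reuse_cache", "map_traceability_tree")


def parse_general_options(text):
    """Return the general options, falling back to the documented defaults."""
    options = dict(GENERAL_DEFAULTS)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue  # skip blanks, comments, and lines without a separator
        name, value = (part.strip() for part in line.split(":", 1))
        if name in options:
            options[name] = value
    # YES|NO switches accept exactly these two values.
    for name in YES_NO_OPTIONS:
        if options[name] not in ("YES", "NO"):
            raise ValueError(f"{name} must be YES or NO, got {options[name]!r}")
    # nr_of_processors is documented as an integer; int() raises ValueError otherwise.
    processors = int(options["nr_of_processors"])
    if processors < 1:
        raise ValueError("nr_of_processors must be at least 1")
    options["nr_of_processors"] = processors
    return options
```

Options that are absent from the text keep the defaults listed above, so an empty config yields the documented default settings.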

===== Preprocessing =====

The preprocessing step performs the following series of tasks:

  • Preparation of BLAST directory for subsequent traceability calculations
  • Compilation of orthologs (Default: OMA)
  • Multiple sequence alignment of orthologous sequences (MAFFT_LINSI)
  • Sequence tree reconstruction (IQ-TREE)
  • Scaling factor (k) calculation
  • Insertion / deletion (indel) rates calculation (IQ-TREE)
  • Preparation of the input parameter XML file for REvolver simulations
  • OPTIONALLY: Feature architecture similarity (FAS) score computation between seed protein and ortholog. This requires the installation of the HaMStR package.

==== Basic preprocessing ====

The following parameters are the main switches controlling the first step in the protTrace workflow.

  • preprocessing [YES|NO] - Default YES
    • specifies whether the preprocessing procedure is performed
  • orthologs_prediction [YES|NO] - Default YES
    • specifies whether you want protTrace to compile orthologs to the seed protein. When choosing 'NO' you will have to specify a file with an existing orthologous group. Details about how to format the corresponding custom ortholog files are provided at projects:prottrace:options:general:custom_orthologs.
  • search_oma_database [YES|NO] - Default YES
    • specifies whether you want protTrace to use pre-computed orthologous groups from OMA. If set to 'NO', you will have to set the option run_hamstrOneSeq to 'YES'
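
The dependencies between these switches can be checked before a run. The sketch below is a minimal illustration, assuming the options are already available as a plain dict of option name to 'YES'/'NO'; the dict representation and the function name `check_ortholog_options` are hypothetical and not part of protTrace.

```python
def check_ortholog_options(options):
    """Return messages about the ortholog-related switches.

    `options` maps option names to 'YES' or 'NO' (a hypothetical
    representation). Missing entries fall back to the documented defaults.
    """
    messages = []
    predict = options.get("orthologs_prediction", "YES")
    use_oma = options.get("search_oma_database", "YES")
    one_seq = options.get("run_hamstrOneSeq", "NO")
    # Without pre-computed OMA groups, HaMStR-OneSeq has to be switched on.
    if use_oma == "NO" and one_seq != "YES":
        messages.append("search_oma_database is NO, so run_hamstrOneSeq must be YES")
    # Without ortholog compilation, the user has to provide the group.
    if predict == "NO":
        messages.append("orthologs_prediction is NO: supply a custom ortholog file")
    return messages
```

An empty result means the combination is consistent with the rules stated above.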

==== Advanced Preprocessing ====

With the help of the following options you can fine-tune the behaviour of protTrace during the preprocessing step.

  • run_hamstr [YES|NO] - Default NO
  • run_hamstrOneSeq [YES|NO] - Default NO
    • set to 'YES' when you want to use HaMStR-OneSeq for compiling the orthologous group for the seed protein on the fly instead of using pre-computed orthologous groups. Requires a fully functional installation of the https://github.com/BIONF/HaMStR package. See projects:prottrace:options:hamstr for an overview of how to set up HaMStR.
  • include_paralogs [YES|NO] - Default NO
    • Set to 'YES' to include inparalogs in the analysis
  • fas_score [YES|NO] - Default NO (STILL EXPERIMENTAL)
    • Set to 'YES' to activate the computation of the feature architecture similarity between the seed protein and its orthologs. Requires a fully functional installation of the https://github.com/BIONF/HaMStR package. See projects:prottrace:options:hamstr for an overview of how to set up HaMStR.

===== Scaling Factor =====

  • calculate_scaling_factor [YES|NO] - Default YES
    • Set to 'YES' to compute the relative substitution rate of the seed protein by comparing the pairwise distances of the orthologs to the pairwise distances of the corresponding species.
  • default_scaling_factor [FLOAT] - Default 1.57
    • Allows you to adjust the relative substitution rate for cases where the number of orthologs does not allow the empirical inference.
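
This page does not state how the comparison of the two sets of pairwise distances is reduced to a single number. Purely as an illustration of the idea, the sketch below takes the median of the per-pair ratios (ortholog distance divided by species distance) and falls back to the default when too few pairs are available. The choice of the median, the `min_pairs` threshold, and the function name are assumptions of this example and should not be read as protTrace's actual procedure.

```python
from statistics import median

DEFAULT_SCALING_FACTOR = 1.57  # documented default_scaling_factor


def scaling_factor(protein_dist, species_dist, min_pairs=1,
                   default=DEFAULT_SCALING_FACTOR):
    """Estimate a relative substitution rate from pairwise distances.

    Both arguments map a species pair, e.g. ("YEAST", "HUMAN"), to a distance.
    Using the median of the per-pair ratios is an assumption of this sketch;
    `min_pairs` is a hypothetical threshold for falling back to the default.
    """
    ratios = [
        protein_dist[pair] / species_dist[pair]
        for pair in protein_dist
        if pair in species_dist and species_dist[pair] > 0
    ]
    if len(ratios) < min_pairs:
        return default  # not enough orthologs for an empirical estimate
    return median(ratios)
```

A value above 1 would indicate that the protein accumulates substitutions faster than the species distances alone suggest, and a value below 1 that it evolves more slowly.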

===== Indel parameter =====

  • perform_msa [YES|NO] - Default YES
    • Set to 'NO' if you plan to provide protTrace with an already existing MSA for the ortholog group. Make sure to follow the naming convention.
  • calculate_indel [YES|NO] - Default YES
    • Computes the indel rate and length distribution parameter from the alignment and the sequence tree. Set to 'NO' to skip the indel rate computation
  • default_indel [FLOAT] - Default 0.08
    • Specifies the default indel rate used when a seed protein has too few orthologs (fewer than 4) to compute an empirical indel rate. The default value is derived from the analysis of the yeast gene set.
  • default_indel_distribution [FLOAT] - Default 0.25
    • Specifies the default shape parameter of the geometric distribution used to model the indel length distribution.

===== Traceability calculation =====

  • traceability_calculation [YES|NO] - Default YES
    • set to 'NO' to omit the traceability computation
  • aa_substitution_matrix [JTT|WAG|LG|Blosum62|mtMAM|mtREV|mtART] - Default WAG
    • lets you specify the substitution model for the simulation of protein sequence evolution
  • simulation_runs [INTEGER] - Default 100
    • specifies the number of REvolver simulations

===== Program paths =====

  • iqtree [PATH TO EXECUTABLE] - Default NONE (essential)
    • Determined by create_conf.pl (works only if the program is in the default PATH). Alternatively, provide the absolute path to the executable, including the program name.
  • linsi [PATH TO EXECUTABLE] - Default NONE (essential)
    • Determined by create_conf.pl (works only if the program is in the default PATH). Alternatively, provide the absolute path to the executable, including the program name.
  • hmmfetch [PATH TO EXECUTABLE] - Default NONE (essential)
    • Determined by create_conf.pl (works only if the program is in the default PATH). Alternatively, provide the absolute path to the executable, including the program name.
  • hmmscan [PATH TO EXECUTABLE] - Default NONE (essential)
    • Determined by create_conf.pl (works only if the program is in the default PATH). Alternatively, provide the absolute path to the executable, including the program name.
  • blastp [PATH TO EXECUTABLE] - Default NONE (essential)
    • Determined by create_conf.pl (works only if the program is in the default PATH). Alternatively, provide the absolute path to the executable, including the program name.
  • makeblastdb [PATH TO EXECUTABLE] - Default NONE (essential)
    • Determined by create_conf.pl (works only if the program is in the default PATH). Alternatively, provide the absolute path to the executable, including the program name.
  • Rscript [PATH TO EXECUTABLE] - Default NONE (essential)
    • Determined by create_conf.pl (works only if the program is in the default PATH). Alternatively, provide the absolute path to the executable, including the program name.
  • hamstr [PATH TO EXECUTABLE] - Default NONE (optional)
    • Provide the absolute path to the executable, including the program name.
  • oneseq [PATH TO EXECUTABLE] - Default NONE (optional)
    • Provide the absolute path to the executable, including the program name.
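
To check ahead of a run whether the essential programs listed above can be found, a lookup along the following lines may help. It is a minimal Python sketch, not the logic of create_conf.pl: it uses the standard-library `shutil.which` to search the PATH, and the `configured` dict and both function names are hypothetical.

```python
import os
import shutil

# Program names marked as essential in the list above.
ESSENTIAL = ["iqtree", "linsi", "hmmfetch", "hmmscan", "blastp", "makeblastdb", "Rscript"]


def resolve_programs(configured=None, names=ESSENTIAL):
    """Map each program name to an absolute path, or None if it is not found.

    `configured` holds paths taken from the config file (a hypothetical dict).
    Names without a usable entry are looked up on PATH with shutil.which.
    """
    configured = configured or {}
    resolved = {}
    for name in names:
        candidate = configured.get(name)
        if not (candidate and os.path.isfile(candidate)):
            candidate = shutil.which(name)  # returns None when not on PATH
        resolved[name] = os.path.abspath(candidate) if candidate else None
    return resolved


def missing_programs(resolved):
    """Return the sorted names for which no executable was located."""
    return sorted(name for name, path in resolved.items() if path is None)
```

Any name reported by `missing_programs` would need an absolute path entered by hand in the config file.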

===== Paths to files =====

  • REvolver [REvolver.jar] - Default Path2protTrace-dir/used_files/REvolver.jar
  • simulation_tree [stepWiseTree.nw] - Default Path2protTrace-dir/used_files/stepWiseTree.nw
  • decay_script [r_nonlinear_leastsquare.R] - Default Path2protTrace-dir/used_files/r_nonlinear_leastsquare.R
  • plot_figtree [plotPdf.jar] - Default Path2protTrace-dir/used_files/plotPdf.jar
  • Xref_mapping_file [speciesTreeMapping.txt] - Default Path2protTrace-dir/used_files/speciesTreeMapping.txt
    • Cross-reference file that links species identifiers, species names, NCBI taxonomy IDs, and species abbreviations
  • reference_species_tree [speciesTree.nw] - Default Path2protTrace-dir/used_files/speciesTree.nw
    • Species tree on which the traceability values are plotted. protTrace ships with an NCBI taxonomy tree
  • species_MaxLikMatrix [speciesLikelihoodMatrix.txt] - Default Path2protTrace-dir/used_files/speciesLikelihoodMatrix.txt
    • Pairwise likelihood distance matrix of the species represented in the tree provided with the option reference_species_tree
  • path_oma_seqs [oma-seqs.fa] - Default Path2protTrace-dir/used_files/oma-seqs.fa
    • specifies the file containing the sequences corresponding to the OMA orthologous groups. Optionally downloaded and processed by the create_conf.pl script
  • path_oma_group [oma-groups.txt] - Default Path2protTrace-dir/used_files/oma-groups.txt
    • specifies the file containing the OMA orthologous group assignments. Optionally downloaded by the create_conf.pl script
  • pfam_database [Pfam-A.hmm] - Default Path2protTrace-dir/used_files/Pfam-A.hmm
    • Profile Hidden Markov models of the Pfam database. Optionally downloaded and processed by the create_conf.pl script
  • fas_annotations [Path2HaMStR/weight_dir] - Default NONE
    • sub-directory of the HaMStR directory containing the feature annotated gene sets
  • hamstr_environment [Path2HaMStR] - Default NONE
  • path_output_dir [output] - Default Path2protTrace/output
    • Directory where protTrace stores the results. If the specified directory does not exist, protTrace will attempt to create it
  • path_cache [cache] - Default Path2protTrace/cache
    • Directory where protTrace stores meta-results
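
The default locations listed above all derive from the protTrace directory, so they can be assembled and checked programmatically. The following Python sketch is one way to do that under stated assumptions: `prottrace_dir` stands for the placeholder Path2protTrace-dir, and the helper names `default_paths` and `prepare` are hypothetical. It creates the output directory if it is missing, mirroring the behaviour described for path_output_dir, and reports which of the shipped or downloaded files are absent.

```python
import os

# File names under used_files/, as given in the defaults above.
USED_FILES = {
    "REvolver": "REvolver.jar",
    "simulation_tree": "stepWiseTree.nw",
    "decay_script": "r_nonlinear_leastsquare.R",
    "plot_figtree": "plotPdf.jar",
    "Xref_mapping_file": "speciesTreeMapping.txt",
    "reference_species_tree": "speciesTree.nw",
    "species_MaxLikMatrix": "speciesLikelihoodMatrix.txt",
    "path_oma_seqs": "oma-seqs.fa",
    "path_oma_group": "oma-groups.txt",
    "pfam_database": "Pfam-A.hmm",
}


def default_paths(prottrace_dir):
    """Build the documented default paths relative to the protTrace directory."""
    paths = {
        name: os.path.join(prottrace_dir, "used_files", filename)
        for name, filename in USED_FILES.items()
    }
    paths["path_output_dir"] = os.path.join(prottrace_dir, "output")
    paths["path_cache"] = os.path.join(prottrace_dir, "cache")
    return paths


def prepare(paths):
    """Create the output directory if needed and list the missing input files."""
    os.makedirs(paths["path_output_dir"], exist_ok=True)
    return sorted(name for name in USED_FILES if not os.path.isfile(paths[name]))
```

The options fas_annotations and hamstr_environment are left out because they have no default and depend on the local HaMStR installation.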