
# Documentation for Version 0.5

## What's New

Everything!


## Overview

As next-generation DNA sequencing becomes increasingly commonplace, the demand for powerful, sophisticated, yet easy-to-use analysis software has increased dramatically. The Marth lab at Boston College is at the forefront of genomic software development, addressing a large fraction of the analysis problems from read mapping to variant analysis. To best serve the research community, the _gkno_ package has been developed to address the following requirements of next-generation data analysis.
  1. A unified launcher bringing together software tools into a single environment,
  2. a general framework for generating linked lists of tools allowing integrated pipelines to be constructed,
  3. and a web environment providing easy access to documentation and tutorials, user forums, blogs and bug reporting.

The web environment keeps people up to date on the work being performed with gkno and useful information that different users post in the forum. The documentation, including the tutorials, provides clear instructions on how to download and execute gkno as well as more in-depth information about the included tools, pipelines and configuration files. A core goal of the package is to enable inexperienced users to simply download and execute predetermined analysis pipelines in order to generate sensible results for their research projects. The intricacies of the pipelines (including which processing tools to use and sensible parameter sets) are all hidden in configuration files, and only advanced users need interrogate them.

## Installing gkno

Download and installation instructions.

## gkno launcher description

The gkno launcher is designed to bring the wealth of next-generation DNA sequencing analysis software into a single, easy-to-use command line. The power of the launcher is the ability to bring together multiple tools into a single analysis pipeline with the minimum of required user input. The pipeline is defined in a configuration file that can be quickly and easily constructed and is then available for repeated use. When the command line is executed, gkno generates a Makefile that is automatically executed (unless specified otherwise by the user) using the GNU make framework. This system ensures that each tool is aware of its file dependencies and includes rules to determine how all of the necessary files are to be created. If a tool fails, any files created in the failed step are deleted and the user is informed of where the problems occurred. This ensures that no partially constructed files are made available to the user, which could lead to analysis based on incomplete data.

### Tool mode

Each tool in gkno is described by a json configuration file. This file describes the executable commands, the tool location, all of the allowed command line arguments, the expected parameters, data types and default values. In general, the user should have no need to deal with the configuration files, but a complete description of the format of the configuration files is given in the 'Configuration files' section. A list of all the available tools can be seen by typing:

gkno --help

To run a tool, the user simply needs to specify its name. To get extra information (e.g. the available command line arguments), help can be displayed by typing:

gkno <tool> --help

### Pipeline mode

The gkno launcher can be used to launch any of the available pipelines. Including the term pipe as the first argument instructs gkno to operate in pipeline mode. To see a list of all available pipelines, type:

gkno pipe --help

In order to see all of the available command line arguments for a particular pipeline, the following command line can be used:

gkno pipe <pipeline name> --help

Executing the command line above lists all of the arguments available as part of the specified pipeline. The pipeline arguments are not, however, the complete set of arguments available to all of the constituent tools. If the user wishes to set a parameter in one of the pipeline's tools, but this is not an available pipeline command line argument, all of the tool's arguments are accessible by setting the pipeline task as an argument and then including arguments for that task in quotation marks. For example, if the fastq-vcf pipeline is executed and the --use-best-n-alleles argument in freebayes requires modification, the following command line is valid:

gkno pipe fastq-vcf --freebayes "--use-best-n-alleles 5"

All of the commands for the specified task (in this example, freebayes) are contained within the quotation marks. The pipelines are designed in such a way that the commonly accessed commands for each of the constituent tools are accessible via the standard command line, but advanced options may require using this syntax.

### GNU make

As previously mentioned, the gkno package uses the GNU make system to execute pipelines. On execution of a gkno pipeline, a Makefile is generated. The general framework of the Makefile is a list of blocks describing what files are required by a 'rule' and the files that are output when the 'rule' is executed. The rule is itself one or more command lines. When executed (using the command make --file <Makefile name>), make searches for the final required output file and all of its dependencies, i.e. the files that are required to make the output file. If the final file does not exist, or any of the dependencies are missing or were created more recently than the output, make will try to execute the rule. In the absence of some of the dependencies, make will search for a rule describing how to generate each missing file, and so on.

The important thing to note is that after the pipeline has been executed, it can be rerun at any point using the make --file <Makefile name> command. If everything is up to date, nothing will happen. If any files have been modified or deleted, the pipeline will be executed from the point of the first missing/modified file, and will not waste time rerunning commands that are not required (as the files already exist).

This is important when running the same pipeline for multiple input files for the following reason. Consider the pipeline 'fastq-vcf'. This begins by generating reference files that the Mosaik software requires for performing alignments, which can take a reasonable amount of time for a large reference genome. Once this is complete, the remainder of the pipeline deals with aligning and variant calling on a pair of fastq files (containing sequencing reads). If the pipeline is now run again for a different pair of fastq files, the reference building steps will be skipped since the reference files already exist. The important thing to remember is:

Only the required tasks in a pipeline will be run. If files already exist, the pipeline will not waste time recreating them.

See the 'Using GNU make' tutorial for worked examples of using the GNU make framework.

### Logging

gkno usage is logged in order to keep track of which tools/pipelines are most commonly used in the community. Every time gkno is launched, an ID of the form tool/ or pipe/ is generated and sent back to the Marth lab. No information about the user/location etc. is tracked, just the used tool.

## Configuration files

The python code describing the gkno launcher is general and has no knowledge of the individual tools it comprises. In order to generate executable scripts (Makefiles are created that are executed using GNU make), configuration files are required to describe the individual tool command lines and how the different tools interact in a pipeline. These configuration files are in json format, and the file contents for tools and pipelines are different.

This section of the documentation describes the format of the json configuration files in some detail and is not intended for the user just wanting to get started with the gkno package. For a more hands-on description of how to use gkno or modify specific aspects of the configuration files, specific tutorials with worked examples have been developed. These are included in the documentation, but are also available on the gkno website under the Tutorials tab.

### Tool configuration files

The tool configuration files describe all of the information necessary to run each of the individual tools. Each individual tool configuration file may contain multiple 'tools', each describing a different mode of operation. For example, the software tool MosaikBuild can be used to construct a reference file that is readable by the Mosaik aligner from a standard fasta reference file, or it can be used to generate a read archive from input fastq files. Each mode of operation is distinct and has different command line arguments, and so they appear separately in the configuration file (as mosaik-build-reference and mosaik-build-fastq).

#### json elements

Each tool is described in json format using the following elements. Unless otherwise stated, the element is required in the configuration file and its omission will cause gkno to terminate.

  • description: a brief description of the tool and its role. This text appears in the pipeline help and so its inclusion is necessary in order to ensure clarity.
  • path: the location of the executable file within the gkno package
  • executable: the name of the executable file
  • precommand: command line arguments to be included prior to the executable file, for example java -Xmx4g -jar
  • modifier: text to be included after the executable, but prior to any of the tool's command line arguments, for example, sort or index in bamtools
  • help: the help command for this tool (usually --help or -h)
  • arguments: a list of all the valid command line arguments for this tool. Each argument is supplied with all the information necessary for gkno. In order to do this, the following elements are supplied for each argument (unless specified as optional).
    • description: a brief description of the command line argument used in the help messages.
    • input: A Boolean indicating if this argument is associated with an input file.
    • output: A Boolean indicating if this argument is associated with an output file.
    • resource: A Boolean indicating if the file associated with this argument is a resource file. This will (unless overridden by the user) assume that the file is in the gkno resource directory.
    • required: A Boolean indicating if the file associated with this argument is required for successful operation of the tool. If required is set to true and the file is not provided, gkno will terminate highlighting that this file is missing.
    • dependent: indicates that the tool is dependent on the existence of this file. The executable script is a Makefile executed using the GNU make system. Before running the provided command line, the existence of the dependent files is checked. If one of these files does not exist, make will check to see if the script contains a rule for how to create this file. If it does, the file will be created; otherwise the script will fail.
    • type: the expected data type associated with this argument. This can be one of the following: string, int, float or flag. On the command line, all arguments will expect a value to be provided unless the data type is set to flag.
    • short form argument: a short form version of the command line argument. For example, the argument could be --fastq and the alternative would likely be -f.
    • extension (optional): the suffix of the file associated with this option. If there is no such file, then this element does not need to be set. If multiple extensions are allowed, they can be separated by a pipe. For example, fa and fasta are valid extensions for a fasta reference file and so this field would be populated with fa|fasta.
    • default (optional): the default parameter to be given to this command line argument.
    • use for filenames (optional): if the output file is not defined and there are multiple input files provided to the tool, the input file with this value set to true will be used to construct the output filename. The input extension will be replaced with the output extension. For example, if the input filename is input_test.fq and the output file extension is defined as bam, the output filename will be input_test.bam if not defined by the user.
    • stub (optional): if the output from a tool is a set of files and the output argument does not contain the file extension, then the output is a stub and this option is set as true. In this case, the following argument (outputs) is also required.
    • outputs (optional): a list of the output suffixes that will be generated by the tool.
    • if input is stream: this option is available for input file arguments. Ordinarily, the tool accepts a file as input and so the input argument would be set to the filename. If the input is a stream, this argument allows the command line to be modified so that instead of a filename, this argument is provided (for example bamtools would require the stdin instead of the filename). It is possible that if the input is a stream, this argument should not appear on the command line, but should be replaced with a different argument (for example, freebayes ordinarily expects the command line argument --bam, but if a stream is used as input, the argument --stdin is expected in place of this). If this is the case, this argument is set to replace and the replace argument with entry must be provided.
  • replace argument with (optional): to be included only if if input is stream (explained above) is set to replace. If set, this needs two inputs:
    • argument: the command line argument that is to be used as a replacement,
    • value: the value which is supplied. This can be set as blank.
  • hide tool (optional): some of the tools included in gkno use a command line syntax that is non-standard (here, standard is assumed to be command line arguments of the form --argument <value> or the short form equivalent -argument <value>, where <value> can be omitted if the argument is a flag). Tools indicated as hidden are invisible in tool mode and cannot be accessed on the pipeline command line; however, they can be built into pipelines like any other tool. Default: not hidden.
  • argument delimiter (optional): the standard format of command line arguments is --argument <value>, where <value> can be omitted for flags. Some tools do not conform to this format (for example, some tools have the format argument=value). The argument delimiter element allows the delimiter to be set (see the sketch after this list). If omitted, the default value is a single space.
  • input is stream (optional): some tools only operate on the stream and, as such, do not have command line arguments for the input files as the stream is assumed (ogap and bamleftalign are examples of such tools). By setting input is stream in the tool configuration file, gkno will not attempt to find the input arguments and determine if they are compatible with the stream.
  • output is stream (optional): as input is stream except for tools only outputting to the stream.
  • additional files: ADD TEXT
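
For instance, for a hypothetical tool whose command line takes the form argument=value, the argument delimiter element described above could be set as in the following sketch:

```javascript
"argument delimiter" : "="
```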
#### Example tool configuration file

As an example, a section of the configuration file for _freebayes_ is included below. The actual file can be found in the <gkno_path>/config_files/tools directory. This example contains a sample of the provided arguments, but shows the expected syntax of the file.

```javascript
{
  "tools" : {
    "freebayes" : {
      "description" : "Bayesian variant and haplotype calling",
      "path" : "freebayes/bin",
      "executable" : "freebayes",
      "help" : "--help|-h",
      "arguments" : {
        "--bam" : {
          "description" : "Add FILE to the set of BAM files to be analyzed.",
          "input" : "true",
          "use for filenames" : "true",
          "output" : "false",
          "resource" : "false",
          "required" : "true",
          "dependent" : "true",
          "short form argument" : "-b",
          "extension" : "bam",
          "if input is stream" : "replace",
          "replace argument with" : {
            "argument" : "--stdin",
            "value" : ""
          },
          "type" : "string"
        },
        "--no-snps" : {
          "description" : "Ignore SNP alleles.",
          "input" : "false",
          "output" : "false",
          "resource" : "false",
          "required" : "false",
          "dependent" : "false",
          "short form argument" : "-I",
          "type" : "flag"
        },
        ...
      }
    }
  }
}
```

### Pipeline configuration files

The pipeline configuration file describes the tools to be used, the order of use and any linkage between the tools. A small number of pipeline command line arguments are also defined in this configuration file. In general, this always includes the input and output paths describing where files are to be found and where to write output files. Any major options within constituent tools that the user will likely want to modify are also included. There are a number of sections in the pipeline configuration file and these are listed below. Each section is required unless otherwise noted.

  • description: a brief description of the pipeline,
  • resource path: the default location for the resources for this pipeline,
  • workflow: an ordered list of the tasks to be executed,
  • tools: links the tasks in the workflow section to actual tools,
  • linkage: describes interrelationships between the constituent tools,
  • arguments: definitions of the command line arguments that can be set for this pipeline,
  • tools outputting to stream: describes which tasks output to a stream, allowing tasks to be piped together,
  • construct filenames: describes methods for constructing the output filenames for constituent tasks,
  • delete files: describes which files are intermediate files and should be deleted during execution of the pipeline.

Each of these individual components of the pipeline configuration file is discussed in detail in the following sections. Tutorials are provided that give worked examples on how to construct a basic pipeline configuration file (Building a pipeline configuration file) and how to add the further options. Please refer to these for examples.

#### Pipeline workflow

The workflow is a simple list of the tasks appearing in the pipeline, in the order in which they are to be executed. Each of the tasks is given a unique name describing its role. There is no formal requirement on what these tasks should be named, only that each task is named uniquely. Descriptive names for the tasks ensure that other users of the pipeline can be clear about the role of each task in the pipeline. Each task has an associated set of parameters which can be modified if desired.

#### Tasks to tools

Each of the tasks listed in the workflow section is associated with a specific tool. This section simply tells gkno which tool is required to run each of the tasks.

#### Task linkage

The linkage section provides connections between the different tasks in the pipeline. These linkages do not necessarily apply between consecutive tasks, but can link parameters and files between any tasks in the pipeline. Each task in the pipeline that has any files coming from previous tasks, or requires parameters used in previous tasks, can be given a section within the linkage block. Each of these task blocks contains an entry for each of the command line arguments that depends on the output of a previous task. The information that can be provided for the command line argument is shown in the list below. Unless otherwise noted, the information is required.

  • link to this task: the task whose output is to be used as the current task's input,
  • link to this argument: the argument from the task specified above that generates the output to be used,
  • extension (optional): in the case where the specified task generates an output stub, the extension of the file to be linked is required. For example, the MosaikAligner output command (-out) is a stub and the output extensions are defined as '.bam', '.special.bam' and '.multiple.bam'. If a later tool is linked to the MosaikAligner output, the extension is required to define which of the files is required (e.g. .bam).
  • json block (optional): ADD TEXT

As an example, consider the following hypothetical pipeline. The first task in the pipeline uses the premo tool and requires the -fq1 argument to define the input fastq file. The next task in the pipeline uses mosaik-build-fastq, which requires the same fastq file as input in the -q argument. It is important that both tasks are supplied with the same input fastq file, but it is overkill to require the user to specify the input to both tasks, since they are the same file. The linkage block described above can be used to link the -q argument in mosaik-build-fastq to the -fq1 argument in the premo task, so that they are forced to be the same. Of course, the user can always override any of the linkages by defining the inputs on the command line.
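
As an illustration, a minimal sketch of a linkage block for this hypothetical premo/mosaik-build-fastq example, using the element names described above (the task and argument names are those of the example, not taken from a shipped pipeline):

```javascript
"linkage" : {
  "mosaik-build-fastq" : {
    "-q" : {
      "link to this task" : "premo",
      "link to this argument" : "-fq1"
    }
  }
}
```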

#### Arguments

The arguments block contains information about all the allowable command line arguments for the pipeline. These arguments are then linked to the relevant task within the pipeline. The arguments blocks can be supplied with the following information, which is required unless otherwise stated (a sketch follows the list):

  • link to this task: the task with which the command line argument should be associated. For input files, this would typically be the first task in the pipeline that uses the input files. If later tasks in the pipeline require the same input files, the linkage section would be used to link this later task to the first task.
  • link to this argument: the command line argument for the task specified by the above.
  • short form argument: an alternative, short-form version of the argument (e.g. -hs for --hash-size).
  • type: the expected data type associated with this argument. This can be one of the following: string, int, float or flag. On the command line, all arguments will expect a value to be provided unless the data type is set to flag.
  • default: the default value given to the argument.
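
As a sketch, an entry in the arguments block using the elements listed above might look like the following (the argument name, task and values are purely illustrative):

```javascript
"arguments" : {
  "--hash-size" : {
    "link to this task" : "build-jump-database",
    "link to this argument" : "-hs",
    "short form argument" : "-hs",
    "type" : "int",
    "default" : "15"
  }
}
```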

Each of the pipelines has a set of command line arguments that are always available and do not appear in the pipeline configuration file. These are:

  • --input-path (-ip): the input path for all input files whose path is unspecified. If a path is given with a filename, that path is used; otherwise the assumption is that the file resides in the current working directory. Setting --input-path will force gkno to assume all unspecified input files (except for resource files, see below) are available in the path specified by --input-path (see the example after this list).
  • --output-path (-op): similar to the input path. All output files are output to the --output-path unless a path is provided with the filename.
  • --resource-path (-rp): all files listed in the tool configuration file as resources files are assumed to be available in the resource path directory. By default, this is in the gkno root directory, but can be modified using this command.
  • --execute (-ex): a Boolean defining whether gkno should execute the scripts after creating them. Default: True
  • --verbose (-vb): a Boolean used to tell gkno whether to output verbose information to screen as gkno runs. Default: True
  • --export-config (-ec): tells gkno to generate a new configuration file. See the specific tutorial for further information on this option.
  • --multiple-runs (-mr): informs gkno that a json file is provided containing multiple sets of input files/parameters. See the specific tutorial for further information on this option.
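
For example, a pipeline could be pointed at separate input and output directories using these arguments (the pipeline name and paths are purely illustrative):

gkno pipe <pipeline name> --input-path /data/fastq --output-path /data/results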

It is desirable to include all commonly used tool command line arguments in the pipeline configuration file. More in-depth or esoteric commands are not included, but are always accessible using the tool-specific commands.

#### Streaming between tasks (optional)

There are times where it is preferable to link several tasks together with pipes, so that each task sends its output to the stream and the following task accepts the stream as input. For cases where the intermediate files are not required, it can save a lot of disk space to just pass information on the stream. The downside to this process is that if a task in the stream fails, GNU make cannot give information on the specific task that failed. As such, this option should only be used if there is a high degree of confidence in the tasks.

If included, this section is just a list of tasks that output to the stream. In the makefile, all tasks contained in this list output to the stream and so will be linked to the next task in the pipeline workflow by a pipe. If consecutive tasks appear in this list, then there will be multiple tasks linked by pipes in a single command line.
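
Assuming the simple list of task names described above, the section might look like the following sketch (the task names are illustrative):

```javascript
"tools outputting to stream" : [
  "align-reads",
  "sort-bam"
]
```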

For further information on this feature and worked examples demonstrating its use, see the relevant pipeline tutorial.

#### Construct filenames (optional)

One of the design choices in gkno is to minimise the number of command line arguments that have to be supplied by the user. The hope is that keeping the command line as simple as possible will ensure that the launcher is as easy to use as possible and allow people with less computational experience to use gkno. Tasks within the pipeline can produce multiple output files and so it can become difficult to handle naming all of the output files if the user isn't providing any input.

gkno handles the naming of output files in the following way. If there is only a single input file, then gkno uses this as the basis of the output filename (replacing the extension of the input with that of the output). If there are multiple input files, however, it is less obvious how to proceed. If the json entry for one of the input files in the arguments section of a tool configuration file contains the term 'use for filenames', then this input will be used as the basis for the output file, again replacing the extensions. There is one more way of defining the output filenames, which overrides the methods above, if present. This requires inclusion of the "construct filenames" section in the pipeline configuration.

This section describes to gkno what filenames, text or parameters to use in constructing the output filename. Usually, the filename is built using one of the input filenames (without extension) as a base; parameters used in the pipeline can then be included, so that on completion of the pipeline, it is clear what parameters were used in generating the files. The json section is built up by defining the task and then all command line arguments for that task for which a filename is required. For each of these arguments, the following information is required.

  • filename root: this is what generates the root of the output filename. This can either be text or, if an input file is to be used for generating the output filename, this should be set to from argument. If from argument is set, the next three values are required.
  • get root from task (optional - required if 'filename root' is from argument): the task from which to get the filename to be used as the root of this filename.
  • get root from argument (optional - required if 'filename root' is from argument): the argument from which to get the filename to be used as the root of this filename.
  • remove input extension (optional - required if 'filename root' is from argument): instructs gkno to remove the extension from the argument being used as the root if set to true. If the filename is being built from an argument that has no extension, this isn't required and is thus set to false.

If the output filename is to include additional text from parameters in the pipeline, the additional text from parameters block can be defined. This must contain the following elements:

  • order: a list showing the order of the parameters to be used in constructing the filename. filename root appears in this list (usually first) and all of the other parameters can be given any name. These names must match up with a description block of the same name in the additional text from parameters section.
  • description block: for each name in the order list (except filename root), a block with the following elements must be provided:
    • get parameter from task: the task from which to get the parameter and,
    • get parameter from argument: the argument from which to get the parameter.
    • extension: if the parameter being used is another filename, the path will be removed and, if this extension is set to true, the filename extension will be removed prior to integration in the output filename.
  • separator (optional): the delimiter to use when connecting together the root and the parameters into a filename. Default: '_'.

An example of the construct filenames block is provided below. The output filename (from the build-jump-database -out command line argument) will be constructed using the build-reference -fr argument (for example test.dat) with the extension removed. Thus the filename root is test. The output filename is this root followed by the parameter set with the command line argument -hs in the build-jump-database task, or its default value. Thus the filename would be test_15, since the delimiter is unset and defaults to '_'. The output extension is set later when the command lines are being constructed.

"build-jump-database" : {  
  "-out" : {  
    "filename root" : "from argument",  
    "get root from task" : "build-reference",  
    "get root from argument" : "-fr",  
    "remove input extension" : "true",  
    "additional text from variables" : {  
      "order" : [  
        "filename root",  
        "hash-size"  
      ],  
      "hash-size" : {  
        "get parameter from task" : "build-jump-database",  
        "get parameter from argument" : "-hs"  
      }  
    }  
  }  
}   

For worked examples on using this feature, see the "Constructing output filenames" tutorial.

#### Delete intermediate files (optional)

When a pipeline contains a large number of tasks, a large number of files are created during execution of the pipeline. A lot of these files are not required by the user and so can be discarded. It is prudent to discard these files at the earliest convenience in the pipeline, as keeping them (especially when dealing with large datasets) can cause problems with disk space, since a lot of very large files are being created. We can define which files should be deleted by the pipeline (these are termed intermediate files) and at which point they should be deleted. This is achieved with the delete files section of the pipeline configuration file. This is a relatively simple inclusion and consists of defining the command line argument that produces an intermediate file and the task after which to delete the file. There are two configurations of this depending on whether the output is a stub or not.

For worked examples on deleting intermediate files, see the 'Deleting intermediate files' tutorial.

##### Output is not a stub

The most common example is for a single output file. In this case, the required information is as follows:

"<task A>" : {  
  "<argument>" : {  
    "delete after task" : "<task B>"  
  }  
}  

The above description states that the output file created by <task A> and defined by the command line argument <argument> should be deleted. After <task A> has been run, the output file will remain available until after <task B> has been run, and then it is deleted.

##### Output is a stub

The requirements for an output stub are very similar to the above. If the output file is a stub, then it doesn't include an extension; instead, a list of extensions that will be produced is defined. For example, if <argument> in <task A> is a stub, then it will produce a set of output files, for example <stub>_file1.ext and <stub>_file2.ext. For these cases, we need to specify not only the argument, but also the extension of the particular file that is to be deleted. This is shown below:

"<task A>" : {  
  "<argument>" : {  
    "<extension>" : {  
      "delete after task" : "<task B>"  
    }  
  }  
}  

##### Intermediate status

In addition to the delete command being added after the specified task has been run, the files designated as intermediate are listed in the .INTERMEDIATE section of the GNU makefile. If the files were simply deleted, rerunning the command:

make --file <Makefile name>

would cause GNU make to recreate the deleted files. By listing them as intermediate, GNU make treats these files as intermediate and will not try to recreate them when it discovers that they are missing.

#### Example configuration file

As an example, the pipeline configuration file for the singleSampleAlignment pipeline is shown below. This is a simple pipeline whose workflow consists of three tasks.

```javascript
{
  "workflow" : [
    "fastqCheck",
    "MosaikBuildFastq",
    "MosaikAligner"
  ],
  "linkage" : {
    "MosaikBuildFastq" : {
      "-q" : {
        "tool" : "fastqCheck",
        "output command" : "-q"
      },
      "-q2" : {
        "tool" : "fastqCheck",
        "output command" : "-q2"
      }
    },
    "MosaikAligner" : {
      "-in" : {
        "tool" : "MosaikBuildFastq",
        "output command" : "-out"
      },
      "json parameters" : {
        "tool" : "fastqCheck",
        "output command" : "-out"
      }
    }
  },
  "arguments" : {
    "--name" : {
      "tool" : "pipeline",
      "command" : "null",
      "alternative" : "-n",
      "default" : "output"
    },
    "--input-path" : {
      "tool" : "pipeline",
      "command" : "null",
      "alternative" : "-ip",
      "default" : "$(PWD)"
    },
    "--output-path" : {
      "tool" : "pipeline",
      "command" : "null",
      "alternative" : "-op",
      "default" : "$(PWD)"
    },
    "--resource-path" : {
      "tool" : "pipeline",
      "command" : "null",
      "alternative" : "-rp",
      "default" : "$(RESOURCES)"
    },
    "--fastq" : {
      "tool" : "fastqCheck",
      "command" : "-q",
      "alternative" : "-q"
    },
    "--fastq2" : {
      "tool" : "fastqCheck",
      "command" : "-q2",
      "alternative" : "-q2"
    },
    "--median-fragment-length" : {
      "tool" : "MosaikBuildFastq",
      "command" : "-mfl",
      "alternative" : "-mfl"
    },
    "--sequencing-technology" : {
      "tool" : "MosaikBuildFastq",
      "command" : "-st",
      "alternative" : "-st",
      "default" : "illumina"
    }
  }
}
```
## Available tools

The toolkit is dynamic and extra tools can be added by the Marth lab or others (in collaboration with the Marth lab). A list of currently available tools, along with a brief description and links to references, is included below:

### Mosaik

Mosaik is the Marth lab's sequence read alignment software and comprises multiple elements, each of which is described below.

**MosaikBuild**

MosaikBuild is used to convert a fasta format reference file into a native format used by the alignment software. Sequence reads themselves also require conversion into a format that the aligner can read. This is also achieved using MosaikBuild.

**MosaikJump**

A hash-based algorithm is used to perform alignments within Mosaik. To facilitate this, a jump database is required. This database is generated using the MosaikJump utility.

**MosaikAligner**

MosaikAligner description.

### Bamtools

Bamtools description.

### Freebayes

Freebayes description.

# Tutorials

All of the tutorials included on the _gkno_ website are included below for completeness. For each tutorial, a list of files included with _gkno_ is provided. These files are provided so that the user can actually run _gkno_ and modify configuration files in order to learn how to do everything necessary for their individual analysis needs.

The tutorials are listed such that general tutorials (e.g. how to download/install gkno and how to run basic commands) come first, followed by general methods for building and modifying pipelines. Finally, tutorials related to individual tools and pipelines are last.

## General tutorials

## Building and modifying pipelines

## Constructing output filenames

#### Tutorial files

  • <gkno_path>/config_files/pipes/tutorial-construct-filenames1.json
  • <gkno_path>/config_files/pipes/tutorial-construct-filenames2.json

#### Description

We don't want to have to input the output filenames for every tool (especially if there are a lot of tools in the pipeline) as this would mean we need to type a very long command line and keep track of what we are naming all of our outputs. As long as the input file to a task is unambiguous, gkno will use this file as the basis for constructing the output filename by switching the input file's extension to that of the output file. Sometimes, though, we want the output filename to be more informative and contain some information about the parameters used in its creation. We provide the framework for constructing output filenames from any number of input parameters, and the method is described in this tutorial.

The configuration file provided for this tutorial is a simple Mosaik reference file builder. The first task (build-reference) takes the resource file test_genome.fa as an input. Since the mosaik tool configuration file instructs gkno to use this input as a basis for the output file, the output from the first task is the file test_genome.dat. No information is required in the pipeline configuration file to make this happen.

The second task in the pipeline is build-jump-database and depends on the input parameter --hash-size (-hs). The default value for this has been set to 10 for this tutorial. The input to this task is test_genome.dat and we want the output to be test_genome_10. So what we really want is for the output filename to be the input filename without the extension, with the hash-size added on. We can do this by adding the following json code to the pipeline configuration file. Look at the contents of the configuration file (less <gkno_path>/config_files/pipes/tutorial-construct-filenames1.json) to see how this fits in with the rest of the configuration file.

"construct filenames" : {  
  "build-jump-database" : {  
    "-out" : {  
      "filename root" : "from argument",  
      "get root from task" : "build-reference",  
      "get root from argument" : "-fr",  
      "remove input extension" : "true",  
      "additional text from parameters" : {  
        "order" : [  
          "filename root",  
          "hash-size"  
        ],  
        "hash-size" : {  
          "get parameter from task" : "build-jump-database",  
          "get parameter from argument" : "-hs"  
        }  
      }  
    }  
  }  
}  

The individual elements of this file are described below:

##### 'filename root'

We use this to tell gkno the root of the filename. We can just type in some text and this will be used, or we can type 'from argument' if we want to use another argument, for example the input file, as the root for the filename. If we use from argument, we need to include the next three pieces of information.

##### 'get root from task'

If we are building the filename root from another filename, this tells gkno which task the filename comes from.

##### 'get root from argument'

This tells gkno which command line argument within the task provides the filename that we want to use as the filename root.

##### 'remove input extension'

We don't have to remove the extension from the input file used as the root of the output filename, but we usually do. Setting this to true or false will tell gkno what to do.

##### 'additional text from parameters'

This is where we tell gkno what other parameters to build into the filename. First, we tell gkno the order that the parameters should appear in the output filename, usually beginning with the filename root. In the example json code above, we want the filename root followed by the hash-size. The name hash-size is arbitrary and you can call the elements comprising the output filename by any name you wish. For each of the elements in the order list, we need to provide gkno with the task and the command line argument within that task from which to get the parameter. In this case, we are getting the hash-size from the -hs argument within the build-jump-database task.

If the parameter being added is itself a filename, gkno will remove the path from the filename and, if we include extension along with the other information and set it to true, the extension will also be removed.

By default, parameters in the filename are separated by a '_'. This can be changed by including the separator key/value pair in this section. This is demonstrated below in the worked example.

#### Worked example

Create a new directory, move into it and then type the following command:

gkno pipe tutorial-construct-filenames1

You should see the reference files get created and on completion, the following files should be present:

  • test_genome_10_keys.jmp
  • test_genome_10_meta.jmp
  • test_genome_10_positions.jmp
  • test_genome.dat (the mosaik reference file)
  • tutorial-construct-filenames11.make (the Makefile)

As you can see, the jump database filenames are the mosaik reference filename (test_genome.dat) without the extension followed by the hash-size (10).

We are allowed to use other filenames in constructing the output filename. If you want to try this yourself, make a copy of the tutorial-construct-filenames1.json configuration file (in the <gkno_path>/config_files/pipes directory) and extend the filename construction by performing the following steps:

  • Add a new item to the order list (let's call it extra-name),
  • add a description of the information to add (see the sketch after this list):
    • add a comma after the end of the hash-size section,
    • add a new section (just like the hash-size section),
    • set get parameter from task to build-reference,
    • set get parameter from argument to -fr,
    • add a new line (putting a comma after the -fr) called extension and set its value to true.
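
Following these steps, the additional text from parameters block should look something like the sketch below (compare your copy against tutorial-construct-filenames2.json to check):

```javascript
"additional text from parameters" : {
  "order" : [
    "filename root",
    "hash-size",
    "extra-name"
  ],
  "hash-size" : {
    "get parameter from task" : "build-jump-database",
    "get parameter from argument" : "-hs"
  },
  "extra-name" : {
    "get parameter from task" : "build-reference",
    "get parameter from argument" : "-fr",
    "extension" : "true"
  }
}
```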

This should now add the filename test_genome after the hash-size in the output filename. Check this by running your new pipeline. The pipeline has already been constructed as <gkno_path>/config_files/pipes/tutorial-construct-filenames2.json. This can be run using the command:

gkno pipe tutorial-construct-filenames2

You should now see that the files:

  • test_genome_10_test_genome_keys.jmp
  • test_genome_10_test_genome_meta.jmp
  • test_genome_10_test_genome_positions.jmp

have been created, as expected.

#### Cleanup

Remove any temporary configuration files that you produced in this tutorial.

## Deleting intermediate files

## Performing multiple runs of a pipeline

#### Tutorial files

  • <gkno_path>/config_files/pipes/tutorial-multiple-runs.json
  • <gkno_path>/resources/tutorial-multiple-runs-list.json
  • <gkno_path>/resources/simulated_reads_1.fq
  • <gkno_path>/resources/simulated_reads_2.fq
  • <gkno_path>/resources/simulated_reads_set2_1.fq
  • <gkno_path>/resources/simulated_reads_set2_2.fq

#### Description

When we have data from multiple samples, we want to analyse each set of data using the same pipeline. If we only have a couple of samples, it is easy enough to run the same pipeline a couple of times and just specify the different input files on the command line, but for larger data sets, this becomes a real hassle. We have constructed gkno in a way that allows us to specify all of the files that we want to analyse in a json file. gkno reads this file and creates a file to perform each separate analysis. These files are automatically executed sequentially, unless otherwise requested.

This method isn't restricted to specifying different input files. We also find that it is extremely useful to analyse the same data set with different parameters. We can use the json file here to specify different parameters and again let gkno perform the analysis for each.

#### Method

Execution of the pipeline is almost identical to ordinary operation. The only modification is to add the --multiple-runs (-mr) argument on the command line, followed by the name of the json file containing the list of files/parameters. The json file needs to contain two things: 1) a description of the data contained in the file and 2) a list of input files/parameters. This is an example of a simple json file.

```javascript
{
  "format of data list" : [
    "argument 1",
    "argument 2"
  ],
  "data list" : [
    "A1",
    "A2",
    "B1",
    "B2"
  ]
}
```

The "format of data list" describes what to expect in the "data list". In the example above, we are providing values for the two pipeline command line arguments, argument 1 and argument 2 (see the worked example section below for a real example). This means that the "data list" must contain values for these arguments (in this order). So in the above example, gkno will use A1 as the value for argument 1 and A2 as the argument for argument 2 and generate a script to execute the pipeline. Next, another script is generated using the values B1 and B2 and so on for as many sets of data contained in the list. Each argument must be given a value for each run and so the number of elements in the "data list" must equal the number of runs to perform multiplied by the number of arguments in the "format of data list" section.

#### Worked example

Now let's look at an actual example of this. In a new directory, execute the following pipeline with gkno:

gkno pipe tutorial-multiple-runs --hash-size 10 --multiple-runs <gkno_path>/resources/tutorial-multiple-runs-list.json

The --hash-size argument is just used as an example argument that will be applied to each run (it also makes the pipeline run quicker!). When finished, your directory should now contain the outputs from both runs. Specifically, the final outputs from each run are:

  • simulated_reads_1.bam
  • simulated_reads_set2_1.bam

Now let's try modifying the input file list so that each run has a different hash size. First create a new json file in the local directory:

cp <gkno_path>/resources/tutorial-multiple-runs-list.json ./test.json

and then edit the file manually:

vi test.json

If you are unfamiliar with vi, feel free to search the web for information, but we will explain all the necessary steps anyway. If you are familiar with vi, forgive the extra detail! Let's demand that for each run, we provide not only the files, but also the hash sizes. Type 'i' to enter insert mode in vi and then modify the "format of data list" to the following:

```javascript
"format of data list" : [
  "--fastq",
  "--fastq2",
  "--hash-size"
],
```

Note the comma after "--fastq2". We have now added the requirement that in the data list, each run must supply (in order) the first and second fastq files and the hash size. Let's use a hash size of 10 for the first run and 9 for the second. We do this by modifying the data list to look like the following:

"data list" : [
  "<gkno_path>/resources/simulated_reads_1.fq",
  "<gkno_path>/resources/simulated_reads_2.fq",
  "10",
  "<gkno_path>/resources/simulated_reads_set2_1.fq",
  "<gkno_path>/resources/simulated_reads_set2_2.fq",
  "9"
]

Again, note that each line (except for the one ending with "9") ends with a comma. We have finished modifying the file, so press 'esc' to exit insert mode in vi, then type ":wq" (you will see this appear at the bottom of the screen) and press 'Return'. Now, rerun gkno, but this time use the modified file list:

gkno pipe tutorial-multiple-runs -mr ./test.json

We used the short form of the --multiple-runs command line argument above for simplicity. Now if you scroll back through the output to screen, you will notice two things. Firstly, nothing happened for the first run. This is because we provided exactly the same inputs as the first time we ran gkno, and so all the output files are already there and nothing needed to be done. This is stated in the following lines:

Executing makefile: tutorial-multiple-runs1.make
make: Nothing to be done for `all'.

gkno completed tasks successfully.

The second run, however, is different as we used a hash size of 9. Looking through the output messages, we can see that the build-reference task (the first in the pipeline, used to generate the file test_genome.dat) was not run, for the same reason as above: the file already exists and does not need updating. The rest of the pipeline (using a hash size of 9) was run, as this creates all new files. If you look at the contents of your directory, you'll see the new jump database files (required by Mosaik):

  • test_genome_9_keys.jmp
  • test_genome_9_meta.jmp
  • test_genome_9_positions.jmp

and the final file from the pipeline:

  • simulated_reads_set2_1.bam

was recreated using this new jump database.

## Available tools

## Available pipelines