From 2157cc9652ae4da05bdd4e102b32c4d7c2a0cb9a Mon Sep 17 00:00:00 2001 From: Ivory Date: Sun, 17 Jan 2021 23:42:30 -0500 Subject: [PATCH] Fix reference path for user guide page. --- docs/index.html | 4 ++-- docs/search/search_index.json | 2 +- docs/sitemap.xml.gz | Bin 216 -> 216 bytes mkdocs/user-guide/docs/index.md | 2 +- 4 files changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/index.html b/docs/index.html index 6e0e29c48..fbc7ac213 100644 --- a/docs/index.html +++ b/docs/index.html @@ -263,7 +263,7 @@

BioLockJ User Guide: FAQ & Troubleshooting

-

BioLockJ Developers Guide

+

BioLockJ Developers Guide

Repository of functional tests
https://github.com/BioLockJ-Dev-Team/sheepdog_testing_suite

The user guide for our latest stable version
@@ -326,5 +326,5 @@

Citing BioLockJ * {@value #LAUNCHER_DESC} */ protected static final String LAUNCHER = \"genMod.launcher\"; private static final String LAUNCHER_DESC = \"Define executable language command if it is not included in your $PATH\"; In this example, the descriptions for PARAM and SCRIPT are written in the addNewProperty() method. The description for LAUNCHER is stored as its own string ( LAUNCHER_DESC ), and that string is referenced in the addNewProperty method and in the javadoc description for LAUNCHER . This rather verbose option IS NOT necessary, but it allows the description to be viewed through the API AND through javadocs and IDEs; this is appropriate if you expect other classes to use the properties defined in your module. The descriptions for properties should be brief. Additional details such as interactions between properties or the effects of different values should be part of the getDetails() method. It should always be clear to a user what will happen if the value is \"null\". If there is a logical default for the property, that can be passed as an additional argument to addNewProperty() . This value will only be used if there is no value given for the property in the config file (including any defaultProps layers and standard.properties). If your module uses any general properties (beyond any uses by the super class), then you should register them in the module's constructor using the addGeneralProperty() method. public QiimeClosedRefClassifier() { super(); addGeneralProperty( Constants.EXE_AWK ); } The existing description and type for this property (defined in biolockj.Properties) will be returned if the module is queried about this property. For a list of general properties, run: biolockj_api listProps Finally, to be very polished, you should override the isValidProp() method. Be sure to include the call to super. @Override public Boolean isValidProp( String property ) throws Exception { Boolean isValid = super.isValidProp( property ); switch(property) { case HN2_KEEP_UNINTEGRATED: try { Config.getBoolean( this, HN2_KEEP_UNINTEGRATED ); isValid = true; } catch(Exception e) { isValid = false; } break; case HN2_KEEP_UNMAPPED: try { Config.getBoolean( this, HN2_KEEP_UNMAPPED ); isValid = true; } catch(Exception e) { isValid = false; } break; } return isValid; } In the example above, the Humann2Parser module uses two properties that are not used by any super class. The call to super.isValidProp( property ) tests the property if it is used by a super class. This class only adds checks for its newly defined properties. Any property that is not tested, but is registered in the module's constructor, will return true. This method is called through the API, and should be used to test one property at a time as if that is the only property in the config file. Tests to make sure that multiple properties are compatible with each other should go in the checkDependencies() method. Generate user guide pages # For modules in the main BioLockJ project, the user guide pages are generated using the ApiModule methods as part of the deploy process. Third party developers can use the same utilities to create matching documentation. Suppose you have created one or more modules in a package com.joesCode and saved the compiled code in a jar file, /Users/joe/dev/JoesMods.jar . Set up a mkdocs project: # See https://www.mkdocs.org/#installation pip install mkdocs mkdocs --version mkdocs new joes-modules mkdir joes-modules/docs/GENERATED This mkdocs project will render markdown (.md) files into an html site.
Mkdocs supports many nice features, including a polished default template. Generate the .md files from your modules: java -cp $BLJ/dist/BioLockJ.jar:/Users/joe/dev/JoesMods.jar \\ biolockj.api.BuildDocs \\ joes-modules/docs/GENERATED \\ com.joesCode Put a link to your list of modules in the main index page. cd joes-modules echo \"[view module list](GENERATED/all-modules.md)\" >> docs/index.md The BuildDocs utility creates the .md files, but it assumes that these are part of a larger project, and you will need to make appropriate links to the generated pages from your main page. Preview your user guide: mkdocs serve Open up http://127.0.0.1:8000/ in your browser, and you'll see the default home page, with a link at the bottom to view module list , which links to a page listing all of the modules in the com.joesCode package. You can build this documentation locally using mkdocs build and then push to your preferred hosting site, or set up a service such as ReadTheDocs to render and host your documentation from your docs folder. Even if you choose not to build user guide pages for your module, you should still implement the ApiModule interface. Anyone who uses your module can generate the user guide pages if they want them, and even incorporate them into a custom copy of the main BioLockJ user guide. Any other support program, such as a GUI, could make use of the ApiModule methods as well. Using External Modules # To use a module that you have created yourself or acquired from a third party, you need to: Save the compiled code in a folder on your machine, for example: /Users/joe/biolockjModules/JoesMods.jar Include your module in the module run order in your config file, for example: #BioModule com.joesCode.biolockj.RunTool Be sure to include any properties your module needs in the config file. Use the --external-modules option when you call biolockj: biolockj --external-modules /Users/joe/biolockjModules myPipeline.properties Any other modules you have made or acquired can also be in the /Users/joe/biolockjModules folder. Finding and Sharing Modules # The official repository for external BioLockJ modules is blj_ext_modules . Each module has a folder at the top level of the repository and should include the Java code as well as a config file to test the module alone, a test file to run a multi-module pipeline that includes the module, and (where applicable) a dockerfile. This is work in progress.","title":"Building Modules"},{"location":"Building-Modules/#building-new-modules","text":"Any Java class that implements the BioModule interface can be added to a BioLockJ pipeline. The BioLockJ v1.0 implementation is currently focused on metagenomics analysis, but the generalized application framework is not limited to this domain. Users can implement new BioModules to automate a wide variety of bioinformatics and report analytics.
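To make this concrete, here is a minimal sketch of a script-generating module. It is a sketch only: the package, class name, and property are hypothetical, and the method signatures are inferred from the BioModuleImpl and ScriptModuleImpl tables below. package com.example.biolockj; import java.io.File; import java.util.ArrayList; import java.util.List; import biolockj.Config; import biolockj.module.ScriptModuleImpl; public class LineCounter extends ScriptModuleImpl { /* hypothetical property; see Documenting Properties for how to register it */ protected static final String KEEP_HEADERS = \"lineCounter.keepHeaders\"; @Override public void checkDependencies() throws Exception { super.checkDependencies(); /* fail before any module executes if the value cannot be parsed as a boolean */ Config.getBoolean( this, KEEP_HEADERS ); } @Override public List<List<String>> buildScript( final List<File> files ) throws Exception { final List<List<String>> data = new ArrayList<>(); for( final File seq: files ) { final List<String> lines = new ArrayList<>(); /* each nested list holds the bash lines required to process 1 sample */ lines.add( \"wc -l \" + seq.getAbsolutePath() + \" > \" + getOutputDir().getAbsolutePath() + File.separator + seq.getName() + \"_count.txt\" ); data.add( lines ); } return data; } }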
The BioModule interface was designed so that users can develop new modules on their own.","title":"Building New Modules"},{"location":"Building-Modules/#beginners","text":"See the BioModule hello world tutorial.","title":"Beginners"},{"location":"Building-Modules/#coding-your-module","text":"To create a new BioModule , simply extend one of the abstract Java superclasses, code its abstract methods, and add it to your pipeline with a #BioModule tag in your Config file:","title":"Coding your module"},{"location":"Building-Modules/#to-support-a-new-classifier-create-3-modules-that-implement-the-following-interfaces","text":"ClassifierModule : Implement to generate the bash scripts needed to call the classifier program ParserModule : Implement to parse classifier output, configured as a classifier post-requisite OtuNode : Classifier-specific implementation that holds the OTU information for 1 sequence","title":"To support a new classifier, create 3 modules that implement the following interfaces:"},{"location":"Building-Modules/#biomoduleimpl-is-the-top-level-superclass-for-all-modules","text":"Method Description checkDependencies() Must override. Called before executeTask() to identify Configuration errors and perform runtime validations. executeTask() Must override. Executes core module logic. cleanUp() Called after executeTask() to run cleanup operations, update Config properties, etc. getInputFiles() Return previous module output. getModuleDir() Return module root directory. getOutputDir() Return module output directory. getPostRequisiteModules() Returns a list of BioModules to run after the current module. getPreRequisiteModules() Returns a list of BioModules to run before the current module. getSummary() Return output directory summary. Most modules override this method by adding module-specific summary details to super.getSummary(). getTempDir() Return module temp directory. setModuleDir(path) Set module directory.","title":"BioModuleImpl is the top-level superclass for all modules."},{"location":"Building-Modules/#scriptmoduleimpl-extends-biomoduleimpl-superclass-for-script-generating-modules","text":"Method Description buildScript(files) Must override. Called by executeTask() for datasets with forward reads only. The return type is a list of lists. Each nested list contains the bash script lines required to process 1 sample. Obtains sequence files from getInputFiles(). buildScriptForPairedReads(files) Calls back to buildScript(files) by default. Subclasses override this method to generate unique scripts for datasets containing paired reads. checkDependencies() Called before executeTask() to validate script.batchSize , script.exitOnError , script.numThreads , script.permissions , script.timeout getJobParams() Return shell command to execute the MAIN script. getScriptDir() Return module script directory. getSummary() Adds the script directory summary to super.getSummary(). Most modules override this method by adding module-specific summary details to super.getSummary(). getTimeout() Return script.timeout . getWorkerScriptFunctions() Return bash script lines for any functions needed in the worker scripts.","title":"ScriptModuleImpl extends BioModuleImpl: superclass for script-generating modules."},{"location":"Building-Modules/#javamoduleimpl-extends-scriptmoduleimpl-superclass-for-pure-java-modules","text":"To avoid running code on the cluster head node, a temporary instance of BioLockJ is spawned on a cluster node, launched by the sole worker script from the job queue.
Method Description runModule() Must override. Executes core module logic. buildScript(files) This method returns a single line calling java on the BioLockJ source code, passing the -d parameter to run in direct mode and the full class name of the JavaModule to indicate the module to run. getSource() Determines whether code is running from the Jar or from source code, in order to write valid bash script lines. getTimeout() Return java.timeout . moduleComplete() Create the script success indicator file. moduleFailed() Create the script failure indicator file.","title":"JavaModuleImpl extends ScriptModuleImpl: superclass for pure Java modules."},{"location":"Building-Modules/#classifiermoduleimpl-extends-scriptmoduleimpl-biolockjmoduleclassifier-superclass","text":"Method Description buildScriptForPairedReads(files) Called by executeTask() for datasets with paired reads. The return type is a list of lists, where each nested list contains the bash script lines required to process 1 sample. Obtains sequence files from SeqUtil .getPairedReads(getInputFiles()). checkDependencies() Validate Configuration properties exe.classifier and exe.classifierParams , verify sequence file format, log classifier version info, and verify no biolockj.module.seq modules are configured to run after the ClassifierModule . Subclasses should call super.checkDependencies() if overriding this method to retain these verifications. executeTask() Call buildScript(files) or buildScriptForPairedReads(files) based on input sequence format and call BashScriptBuilder to generate the main script + 1 worker script for every script.batchSize samples. To change the batch scheme, override this method to call the alternate BashScriptBuilder .buildScripts() method signature and hard-code the batch size. All biolockj.module.classifier modules override this method. getClassifierExe() Return Configuration property exe.classifier to call the classifier program in the bash scripts. If the classifier is not included in cluster.modules , validate that the value is a valid file path. If exe.classifier is undefined, replace the property prefix exe with the lowercase prefix of the module class name (less the standard module suffix classifier ). For example, use rdp.classifier for RdpClassifier and kraken.classifier for KrakenClassifier . This allows users to define all classifier programs in a default Configuration file rather than setting exe.classifier in each project Configuration file. getClassifierParams() Return Configuration property exe.classifierParams which may contain a list of parameters (without hyphens) to pass to the classifier program in the bash scripts. If exe.classifierParams is undefined, replace the property prefix exe with the lowercase prefix of the module class name as described for exe.classifier . getSummary() Adds input directory summary to super.getSummary(). Most modules override this method to add module-specific summary details to super.getSummary(). logVersion() Run exe.classifier --version to log version info. RDP overrides this method to return null since the version switch is not supported.","title":"ClassifierModuleImpl extends ScriptModuleImpl: biolockj.module.classifier superclass."},{"location":"Building-Modules/#parsermoduleimpl-extends-javamoduleimpl-biolockjmoduleimplicitparser-superclass","text":"Method Description parseSamples() Must override. Called by executeTask() to populate the Set returned by getParsedSamples(). Each classifier requires a unique parser module to decode its output.
This method should iterate through the classifier reports to build OtuNode s for each sample-OTU found in the report. The OtuNode s are stored in a ParsedSample and cached via addParsedSample( ParsedSample ). addParsedSample( sample ) Add the ParsedSample to the Set returned by getParsedSamples(). buildOtuTables() Generate OTU abundance tables from ClassifierModule output. checkDependencies() Validate Configuration properties ( report.minOtuCount , report.minOtuThreshold , report.logBase ) and verify no biolockj.module.classifier modules are configured to run after the ParserModule . executeTask() If report.numHits =Y, add \"Num_Hits\" column to metadata containing the number of reads that map to any OTU for each sample. Calls buildOtuTables() to generate module output. getParsedSample(id) Return the ParsedSample from the Set returned by getParsedSamples() for a given id. getParsedSamples() Return 1 ParsedSample for each classified sample in the dataset.","title":"ParserModuleImpl extends JavaModuleImpl: biolockj.module.implicit.parser superclass."},{"location":"Building-Modules/#otunodeimpl-is-the-superclass-for-the-biolockjnode-package","text":"Method Description addOtu(level, otu) A node represents a single OTU; each level in the taxonomic hierarchy is populated with this method. getCount() Get the OTU count. getLine() Get the classifier report line used to create the node. getOtuMap() This map may contain 1 element for each of the report.taxonomyLevels and is populated by addOtu(level, otu). getSampleId() Get the sample ID to which the OTU belongs. report() Print node info to log file as DEBUG line - not visible unless pipeline.logLevel=DEBUG . setCount(num) Set the OTU count. setLine(line) Set the classifier report line used to create the node. setSampleId(id) Set the sample ID to which the OTU belongs. OtuNodeImpl methods do not need to be overridden. New OtuNode implementations should call existing methods from their constructor.","title":"OtuNodeImpl is the superclass for the biolockj.node package."},{"location":"Building-Modules/#document-your-module","text":"The BioLockJ API allows outside resources to get information about the BioLockJ program and any available modules. To interface with the API, your module will need to implement the ApiModule interface .","title":"Document your module"},{"location":"Building-Modules/#api-generated-html-documentation","text":"The BioLockJ documentation is stored in markdown files and rendered into html using mkdocs. The BioLockJ API is designed to generate a markdown document, which is ready to be rendered into an html file using mkdocs.","title":"API-generated html documentation"},{"location":"Building-Modules/#built-in-descriptions","text":"Override the getCitationString() method. This should include citation information for any tool that your module wraps and a credit to yourself for creating the wrapper. Override the getDescription() method to return a short description of what your module does; this should be one to two sentences. For a more extensive description, including details about properties, expected inputs, assumptions, etc., override the getDetails() method (optional).
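As a sketch, these overrides might look like the following (the module purpose and wording are invented; only the three method names come from this guide, and String return types are assumed): @Override public String getDescription() { /* one to two sentences; shown in the generated module list */ return \"Counts the lines in each input sequence file.\"; } @Override public String getDetails() { /* longer free-form text: property interactions, expected inputs, assumptions */ return \"This module assumes one input file per sample.\"; } @Override public String getCitationString() { /* cite the wrapped tool and credit the module author */ return \"Module developed by Joe Developer.\"; }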
If your module has any pre-requisite modules or post-requisite modules, the module's Details should include the names of these modules and information about when and why these modules are added.","title":"Built-in descriptions"},{"location":"Building-Modules/#documenting-properties","text":"If your module introduces any NEW configuration properties, those properties should be registered to the module so the API can retrieve them. Register properties using the addNewProperty() method in the module's constructor. For example, the GenMod module defines three properties: public GenMod() { super(); addNewProperty( PARAM, Properties.STRING_TYPE, \"parameters to pass to the user's script\" ); addNewProperty( SCRIPT, Properties.FILE_PATH, \"path to user script\" ); addNewProperty( LAUNCHER, Properties.STRING_TYPE, LAUNCHER_DESC ); } protected static final String PARAM = \"genMod.param\"; protected static final String SCRIPT = \"genMod.scriptPath\"; /** * {@link biolockj.Config} property: {@value #LAUNCHER}
* {@value #LAUNCHER_DESC} */ protected static final String LAUNCHER = \"genMod.launcher\"; private static final String LAUNCHER_DESC = \"Define executable language command if it is not included in your $PATH\"; In this example, the descriptions for PARAM and SCRIPT are written in the addNewProperty() method. The description for LAUNCHER is stored as its own string ( LAUNCHER_DESC ), and that string is referenced in the addNewProperty method and in the javadoc description for LAUNCHER . This rather verbose option IS NOT necessary, but it allows the description to be viewed through the API AND through javadocs and IDEs; this is appropriate if you expect other classes to use the properties defined in your module. The descriptions for properties should be brief. Additional details such as interactions between properties or the effects of different values should be part of the getDetails() method. It should always be clear to a user what will happen if the value is \"null\". If there is a logical default for the property, that can be passed as an additional argument to addNewProperty() . This value will only be used if there is no value given for the property in the config file (including any defaultProps layers and standard.properties). If your module uses any general properties (beyond any uses by the super class), then you should register them in the module's constructor using the addGeneralProperty() method. public QiimeClosedRefClassifier() { super(); addGeneralProperty( Constants.EXE_AWK ); } The existing description and type for this property (defined in biolockj.Properties) will be returned if the module is queried about this property. For a list of general properties, run: biolockj_api listProps Finally, to be very polished, you should override the isValidProp() method. Be sure to include the call to super. @Override public Boolean isValidProp( String property ) throws Exception { Boolean isValid = super.isValidProp( property ); switch(property) { case HN2_KEEP_UNINTEGRATED: try { Config.getBoolean( this, HN2_KEEP_UNINTEGRATED ); isValid = true; } catch(Exception e) { isValid = false; } break; case HN2_KEEP_UNMAPPED: try { Config.getBoolean( this, HN2_KEEP_UNMAPPED ); isValid = true; } catch(Exception e) { isValid = false; } break; } return isValid; } In the example above, the Humann2Parser module uses two properties that are not used by any super class. The call to super.isValidProp( property ) tests the property if it is used by a super class. This class only adds checks for its newly defined properties. Any property that is not tested, but is registered in the module's constructor, will return true. This method is called through the API, and should be used to test one property at a time as if that is the only property in the config file. Tests to make sure that multiple properties are compatible with each other should go in the checkDependencies() method.","title":"Documenting Properties"},{"location":"Building-Modules/#generate-user-guide-pages","text":"For modules in the main BioLockJ project, the user guide pages are generated using the ApiModule methods as part of the deploy process. Third party developers can use the same utilities to create matching documentation. Suppose you have created one or more modules in a package com.joesCode and saved the compiled code in a jar file, /Users/joe/dev/JoesMods.jar .
Set up a mkdocs project: # See https://www.mkdocs.org/#installation pip install mkdocs mkdocs --version mkdocs new joes-modules mkdir joes-modules/docs/GENERATED This mkdocs project will render markdown (.md) files into an html site. Mkdocs supports many nice features, including a polished default template. Generate the .md files from your modules: java -cp $BLJ/dist/BioLockJ.jar:/Users/joe/dev/JoesMods.jar \\ biolockj.api.BuildDocs \\ joes-modules/docs/GENERATED \\ com.joesCode Put a link to your list of modules in the main index page. cd joes-modules echo \"[view module list](GENERATED/all-modules.md)\" >> docs/index.md The BuildDocs utility creates the .md files, but it assumes that these are part of a larger project, and you will need to make appropriate links to the generated pages from your main page. Preview your user guide: mkdocs serve Open up http://127.0.0.1:8000/ in your browser, and you'll see the default home page, with a link at the bottom to view module list , which links to a page listing all of the modules in the com.joesCode package. You can build this documentation locally using mkdocs build and then push to your preferred hosting site, or set up a service such as ReadTheDocs to render and host your documentation from your docs folder. Even if you choose not to build user guide pages for your module, you should still implement the ApiModule interface. Anyone who uses your module can generate the user guide pages if they want them, and even incorporate them into a custom copy of the main BioLockJ user guide. Any other support program, such as a GUI, could make use of the ApiModule methods as well.","title":"Generate user guide pages"},{"location":"Building-Modules/#using-external-modules","text":"To use a module that you have created yourself or acquired from a third party, you need to: Save the compiled code in a folder on your machine, for example: /Users/joe/biolockjModules/JoesMods.jar Include your module in the module run order in your config file, for example: #BioModule com.joesCode.biolockj.RunTool Be sure to include any properties your module needs in the config file. Use the --external-modules option when you call biolockj: biolockj --external-modules /Users/joe/biolockjModules myPipeline.properties Any other modules you have made or acquired can also be in the /Users/joe/biolockjModules folder.","title":"Using External Modules"},{"location":"Building-Modules/#finding-and-sharing-modules","text":"The official repository for external BioLockJ modules is blj_ext_modules . Each module has a folder at the top level of the repository and should include the Java code as well as a config file to test the module alone, a test file to run a multi-module pipeline that includes the module, and (where applicable) a dockerfile. This is work in progress.","title":"Finding and Sharing Modules"},{"location":"Built-in-modules/","text":"BioModules # Some modules are packaged with BioLockJ (see below). To use modules created by a third party, add the compiled files (jar file) to your biolockj extensions folder. When you call biolockj , use the --external-modules arg to pass in the location of the extra modules: biolockj --external-modules To create your own modules, see Building-Modules . In all cases, add modules to your BioModule order section to include them in your pipeline.
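For example, a config file that mixes a packaged module with a third-party module might include these lines in its BioModule order section (module paths reused from the examples in this guide): #BioModule biolockj.module.seq.PearMergeReads #BioModule com.joesCode.biolockj.RunTool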
Built-in BioModules: # classifiers # r16s classifiers wgs classifiers implicit modules # implicit parsers module.implicit.parser.r16s.md module.implicit.parser.wgs.md implicit qiime modules report modules # humann2 report by otu report by taxon R reports taxa table modules # BuildTaxaTables AddPseudoCount NormalizeTaxaTables NormalizeByReadsPerMillion LogTransformTaxaTables AddMetadataToTaxaTables sequence modules # BioLockJ comes packaged with several modules for sequence pre-processing. AwkFastaConverter Gunzipper KneadData Multiplexer PearMergeReads RarefySeqs SeqFileValidator TrimPrimers DIY modules # GenMod Rmarkdown List All # See generated docs for all modules .","title":"BioModules"},{"location":"Built-in-modules/#biomodules","text":"Some modules are packaged with BioLockJ (see below). To use modules created by a third party, add the compiled files (jar file) to your biolockj extensions folder. When you call biolockj , use the --external-modules arg to pass in the location of the extra modules: biolockj --external-modules To create your own modules, see Building-Modules . In all cases, add modules to your BioModule order section to include them in your pipeline.","title":"BioModules"},{"location":"Built-in-modules/#built-in-biomodules","text":"","title":"Built-in BioModules:"},{"location":"Built-in-modules/#classifiers","text":"r16s classifiers wgs classifiers","title":"classifiers"},{"location":"Built-in-modules/#implicit-modules","text":"implicit parsers module.implicit.parser.r16s.md module.implicit.parser.wgs.md implicit qiime modules","title":"implicit modules"},{"location":"Built-in-modules/#report-modules","text":"humann2 report by otu report by taxon R reports","title":"report modules"},{"location":"Built-in-modules/#taxa-table-modules","text":"BuildTaxaTables AddPseudoCount NormalizeTaxaTables NormalizeByReadsPerMillion LogTransformTaxaTables AddMetadataToTaxaTables","title":"taxa table modules"},{"location":"Built-in-modules/#sequence-modules","text":"BioLockJ comes packaged with several modules for sequence pre-processing. AwkFastaConverter Gunzipper KneadData Multiplexer PearMergeReads RarefySeqs SeqFileValidator TrimPrimers","title":"sequence modules"},{"location":"Built-in-modules/#diy-modules","text":"GenMod Rmarkdown","title":"DIY modules"},{"location":"Built-in-modules/#list-all","text":"See generated docs for all modules .","title":"List All"},{"location":"Check-Dependencies/","text":"BioLockJ is designed to find all problems in one sitting. Every module includes a check dependencies method, which quickly detects issues that would cause an error during execution. This is run for all modules in a pipeline before the first module executes. When BioLockJ runs, it has three major phases: pipeline formation - string together the modules specified in the config file along with any additional modules that the program adds on the user's behalf, and initiate the utilities needed for the pipeline (such as docker, metadata, and input type detection). check dependencies - scan the pipeline for anything that may cause an error during execution run pipeline - execute each module in the sequence. Precheck a pipeline # By including the --precheck-only argument (or -p ) when running biolockj , you are running in precheck mode. BioLockJ will do the first two phases, and then stop. This allows you to quickly test changes to your pipeline configuration without actually running a pipeline.
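For example (the config file name here is hypothetical): biolockj --precheck-only myPipeline.properties BioLockJ forms the pipeline and checks dependencies for every module, then stops before any module executes.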
It also allows you to see any modules that are automatically added to your pipeline.","title":"Check Dependencies"},{"location":"Check-Dependencies/#precheck-a-pipeline","text":"By including the --precheck-only argument (or -p ) when running biolockj , you are running in precheck mode. BioLockJ will do the first two phases, and then stop. This allows you to quickly test changes to your pipeline configuration without actually running a pipeline. It also allows you to see any modules that are automatically added to your pipeline.","title":"Precheck a pipeline"},{"location":"Commands/","text":"The BioLockJ program is launched through the biolockj script. See biolockj --help . Support programs can access information about BioLockJ modules and properties through biolockj-api . There are also several helper scripts for small specific tasks; these are all found under $BLJ/script and added to the $PATH after the basic installation: Bash Commands # Command Description last-pipeline Get the path to the most recent pipeline. Ideal for: cd $(last-pipeline) ls `last-pipeline` cd-blj Go to most recent pipeline & list contents. This is not a script; it is an alias that is added to your bash profile by the install script. The line defining it should look like: alias cd-blj='cd $(last-pipeline); quick_pipeline_view' quick_pipeline_view essentially just pwd and ls ; designed for the cd-blj alias. blj_reset Reset pipeline status to incomplete. If restarted, execution will start with the current module. Deprecated Commands # Command Description (Replacement) blj_log Tail last 1K lines from current or most recent pipeline log file. Replacement : cd $(last-pipeline); tail -1000 *.log blj_summary Print current or most recent pipeline summary. Replacement : cd $(last-pipeline); cat summary.txt blj_complete Manually completes the current module and pipeline status. This functionality should never be needed. For the rare occasions when it is appropriate, it can be done manually. Replacement : touch biolockjComplete blj_reset Reset pipeline status to incomplete. If restarted, execution will start with the current module. The need for this functionality is common, and a bash wrapper script still exists. Alternative : java -cp ${BLJ}/dist/BioLockJ.jar biolockj.launch.Reset ${PWD} blj_download If on a cluster, extract and print the command syntax from the summary.txt file to download pipeline results to your local workstation directory: pipeline.downloadDir . no replacement : You will need to review your pipeline's summary file to find the download command.","title":"Commands"},{"location":"Commands/#bash-commands","text":"Command Description last-pipeline Get the path to the most recent pipeline. Ideal for: cd $(last-pipeline) ls `last-pipeline` cd-blj Go to most recent pipeline & list contents. This is not a script; it is an alias that is added to your bash profile by the install script. The line defining it should look like: alias cd-blj='cd $(last-pipeline); quick_pipeline_view' quick_pipeline_view essentially just pwd and ls ; designed for the cd-blj alias. blj_reset Reset pipeline status to incomplete. If restarted, execution will start with the current module.","title":"Bash Commands"},{"location":"Commands/#deprecated-commands","text":"Command Description (Replacement) blj_log Tail last 1K lines from current or most recent pipeline log file. Replacement : cd $(last-pipeline); tail -1000 *.log blj_summary Print current or most recent pipeline summary.
Replacement : cd $(last-pipeline); cat summary.txt blj_complete Manually completes the current module and pipeline status. This functionality should never be needed. For the rare occasions when it is appropriate, it can be done manually. Replacement : touch biolockjComplete blj_reset Reset pipeline status to incomplete. If restarted, execution will start with the current module. The need for this functionality is common, and a bash wrapper script still exists. Alternative : java -cp ${BLJ}/dist/BioLockJ.jar biolockj.launch.Reset ${PWD} blj_download If on a cluster, extract and print the command syntax from the summary.txt file to download pipeline results to your local workstation directory: pipeline.downloadDir . no replacement : You will need to review your pipeline's summary file to find the download command.","title":"Deprecated Commands"},{"location":"Configuration/","text":"A configuration file encapsulates an analysis pipeline. BioLockJ takes a single configuration file as a runtime parameter. biolockj config.properties Every line in a BioLockJ configuration file is one of: BioModule (line starts with #BioModule ) comment (all other lines that start with # ; these have no effect) property ( name=value ) BioModule execution order # To include a BioModule in your pipeline, add a #BioModule line to the top of your configuration file, as shown in the examples found in templates . Each line has the #BioModule keyword followed by the path for that module. For example: #BioModule biolockj.module.seq.PearMergeReads #BioModule biolockj.module.classifier.wgs.Kraken2Classifier #BioModule biolockj.module.report.r.R_PlotMds This line is given at the top of the user guide page for each module. BioModules will be executed in the order they are listed here. A typical pipeline contains one classifier module . Any number of sequence pre-processing modules may come before the classifier module. Any number of report modules may come after the classifier module. In addition to the BioModules specified in the configuration file, BioLockJ may add implicit modules that are required by the specified modules. See Example Pipeline . A module can be given an alias by using the AS keyword in its execution line: #BioModule biolockj.module.seq.PearMergeReads AS Pear This is generally used for modules that are used more than once in the same pipeline. Given this alias, the folder for this module will be called 01_Pear instead of 01_PearMergeReads , and any general properties directed to this module would use the prefix Pear instead of PearMergeReads . An alias must start with a capital letter, and cannot duplicate a name/alias of any other module in the same pipeline. Properties # Properties are defined as name-value pairs. List-values are comma separated. Leading and trailing whitespace is removed so \"propName=x,y\" is equivalent to \"propName = x, y\". See the list of available properties . Variables # Bash variables can be referenced in the config. They must be \"fully dressed\": ${VAR} There are two variables that BioLockJ requires: BLJ is the file path to the BioLockJ directory and BLJ_PROJ is the directory where pipelines created by BioLockJ are stored and run. After installation these are defined in the shell profile. These can be referenced in the config file. The ~ (\"tilde\") is replaced with ${HOME} if (and only if) the ~ is the first character.
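So, for example, these two assignments are equivalent ( download.dir is the user-specific property mentioned under User-specified Defaults; the path itself is hypothetical): download.dir=~/pipeline_downloads download.dir=${HOME}/pipeline_downloads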
Variables can also be defined in the config file and referenced in the same way: DIR=/path/to/big/data/dir sra.destinationDir=${DIR}/seqs sra.sraAccList=${DIR}/SraAccList.txt input.dirPaths=${DIR}/seqs Variables that are defined in the config file can be referenced within the config file; however, these variables are not added to the module script environment. If you are referencing environment variables and running in docker, you will need to use the -e parameter to biolockj to pass the variables into the docker environment (even if the variable is defined in the config file). For example: biolockj --docker -e SHEP=$SHEP,DIR=/path/to/big/data/dir config.properties Most environment variables will NOT be part of the module script environment. However, any environment variable that is referenced in the configuration file is considered necessary for the pipeline, and it is passed into the main program environment, docker containers, and module runtime environments. Environment variables are not the best way to get information to a script because they can be difficult to trace / troubleshoot. However, if your script or tool requires a particular environment variable, you can define it in your local environment, and reference it in the config file using an arbitrary property name, for example: my.variable=${QIIME_CONFIG_FP} This has essentially the same effect as using the -e QIIME_CONFIG_FP=$QIIME_CONFIG_FP argument in the biolockj command. If this variable is required, this is one way to communicate that the value of QIIME_CONFIG_FP may change from one system to the next, but that the existence of QIIME_CONFIG_FP is essential for the pipeline to run. Relative file paths # File paths can be given using relative paths. The path should start with ./ . The location . is interpreted as being the directory where the primary configuration file is. Example file structure: /users/joe/analysis01/ config.properties metadata.txt /sra/ SraAccList.txt Properties in config.properties can use relative paths: metadata.filePath=./metadata.txt sra.sraAccList=./sra/SraAccList.txt Note: ../ is also supported but it does not stack ( ../../../data/ is not supported). With this design, the \"analysis01\" folder could be shared or moved and the configuration file would not need to be updated to reflect the new location of the project files it references. Special properties # Some properties invoke special handling. pipeline.defaultProps # pipeline.defaultProps is handled before any other property. It is used to link another properties file. The properties from that file are added to the MASTER set. The pipeline.defaultProps property itself is not included in the MASTER properties set. Module-specific forms # Many pipeline properties (usually those used by pipeline utilities) can be directed to a specific module. For example, script.numThreads is a general property that specifies the number of threads allotted to each script launched by any module; and PearMergeReads.numThreads overrides that property ONLY for the PearMergeReads module. exe.* properties # exe. properties are used to specify the path to common executables. exe. properties are special in that they have the automatic default of returning the property name minus the exe. prefix as their value. Modules are sometimes written to use a common tool, such as Rscript or bowtie . These modules will write scripts with the assumption that this command is on the $PATH when the script is executed UNLESS exe.Rscript is given specifying a path to use.
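For example (the path is hypothetical): exe.Rscript=/usr/local/bin/Rscript With this line in the config (or in a defaultProps file), generated scripts call /usr/local/bin/Rscript ; without it, they simply call Rscript and rely on the $PATH .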
The exe. properties are often specified in a defaultProps file for a given environment rather than in individual project properties files. Most often, docker containers are used because of the executables baked into them, and any exe. configurations are only applicable when not running in docker. In a pipeline running in docker, all references to an exe. property will return the default value (by removing the exe. prefix), regardless of how the exe. property is configured. In the rare case where you do need to give the path to an executable within a container, you can specify this by using the prefix dockerExe. in place of exe. . In the even rarer case where you want to use an executable from your local machine, while running a pipeline in docker, you can specify this by using the prefix hostExe. in place of exe. . Chaining configuration files # Although all properties can be configured in one file, we recommend chaining default files through the pipeline.defaultProps option. This can often improve the portability, maintainability, and readability of the project-specific configuration files. Standard Properties # BioLockJ will always apply the standard.properties file packaged with BioLockJ under resources/config/default/ ; you do not need to specify this file in your pipeline.defaultProps chain. If (and only if) running a pipeline in docker, BioLockJ will apply the docker.properties file packaged with BioLockJ under resources/config/default/ . User-specified Defaults # We recommend creating an environment.properties file to assign environment-specific defaults. Set cluster & script properties Set paths to key executables through exe properties Override standard.properties as needed. This information is the same for many (or all) projects run in this environment, and entering the info anew for each project is tedious, time-consuming and error-prone. If using a shared system, consider using a user.properties file. Set user-specific properties such as download.dir and mail.to. For shared projects, use a path that will be updated per-user, such as ~/biolock_user.properties Other logical intermediates may also present themselves. For example, some group of projects may need to override several of the defaults set in environment.properties, but others still use those defaults. Projects in this set can use pipeline.defaultProps=group2.properties and the group2.properties files may include pipeline.defaultProps=environment.properties Project Properties # Create a new configuration file for each pipeline to assign project-specific properties: Set the BioModule execution order Set pipeline.defaultProps = environment.properties You may use multiple default config files: pipeline.defaultProps=environment.properties,groupSettings.properties Override environment.properties and standard.properties as needed Example project configuration files can be found in templates . If the same property is given in multiple config files, the highest priority goes to the file used to launch the pipeline. Standard.properties always has the lowest priority. A copy of each configuration file is stored in the pipeline root directory to serve as primary project documentation.","title":"Configuration"},{"location":"Configuration/#biomodule-execution-order","text":"To include a BioModule in your pipeline, add a #BioModule line to the top of your configuration file, as shown in the examples found in templates . Each line has the #BioModule keyword followed by the path for that module.
For example: #BioModule biolockj.module.seq.PearMergeReads #BioModule biolockj.module.classifier.wgs.Kraken2Classifier #BioModule biolockj.module.report.r.R_PlotMds This line is given at the top of the user guide page for each module. BioModules will be executed in the order they are listed here. A typical pipeline contains one classifier module . Any number of sequence pre-processing modules may come before the classifier module. Any number of report modules may come after the classifier module. In addition to the BioModules specified in the configuration file, BioLockJ may add implicit modules that are required by the specified modules. See Example Pipeline . A module can be given an alias by using the AS keyword in its execution line: #BioModule biolockj.module.seq.PearMergeReads AS Pear This is generally used for modules that are used more than once in the same pipeline. Given this alias, the folder for this module will be called 01_Pear instead of 01_PearMergeReads , and any general properties directed to this module would use the prefix Pear instead of PearMergeReads . An alias must start with a capital letter, and cannot duplicate a name/alias of any other module in the same pipeline.","title":"BioModule execution order"},{"location":"Configuration/#properties","text":"Properties are defined as name-value pairs. List-values are comma separated. Leading and trailing whitespace is removed so \"propName=x,y\" is equivalent to \"propName = x, y\". See the list of available properties .","title":"Properties"},{"location":"Configuration/#variables","text":"Bash variables can be referenced in the config. They must be \"fully dressed\": ${VAR} There are two variables that BioLockJ requires: BLJ is the file path to the BioLockJ directory and BLJ_PROJ is the directory where pipelines created by BioLockJ are stored and run. After installation these are defined in the shell profile. These can be referenced in the config file. The ~ (\"tilde\") is replaced with ${HOME} if (and only if) the ~ is the first character. Variables can also be defined in the config file and referenced in the same way: DIR=/path/to/big/data/dir sra.destinationDir=${DIR}/seqs sra.sraAccList=${DIR}/SraAccList.txt input.dirPaths=${DIR}/seqs Variables that are defined in the config file can be referenced within the config file; however, these variables are not added to the module script environment. If you are referencing environment variables and running in docker, you will need to use the -e parameter to biolockj to pass the variables into the docker environment (even if the variable is defined in the config file). For example: biolockj --docker -e SHEP=$SHEP,DIR=/path/to/big/data/dir config.properties Most environment variables will NOT be part of the module script environment. However, any environment variable that is referenced in the configuration file is considered necessary for the pipeline, and it is passed into the main program environment, docker containers, and module runtime environments. Environment variables are not the best way to get information to a script because they can be difficult to trace / troubleshoot. However, if your script or tool requires a particular environment variable, you can define it in your local environment, and reference it in the config file using an arbitrary property name, for example: my.variable=${QIIME_CONFIG_FP} This has essentially the same effect as using the -e QIIME_CONFIG_FP=$QIIME_CONFIG_FP argument in the biolockj command.
If this variable is required, this is one way to communicate that the value of QIIME_CONFIG_FP may change from one system to the next, but that the existence of QIIME_CONFIG_FP is essential for the pipeline to run.","title":"Variables"},{"location":"Configuration/#relative-file-paths","text":"File paths can be given using relative paths. The path should start with ./ . The location . is interpreted as being the directory where the primary configuration file is. Example file structure: /users/joe/analysis01/ config.properties metadata.txt /sra/ SraAccList.txt Properties in config.properties can use relative paths: metadata.filePath=./metadata.txt sra.sraAccList=./sra/SraAccList.txt Note: ../ is also supported but it does not stack ( ../../../data/ is not supported). With this design, the \"analysis01\" folder could be shared or moved and the configuration file would not need to be updated to reflect the new location of the project files it references.","title":"Relative file paths"},{"location":"Configuration/#special-properties","text":"Some properties invoke special handling.","title":"Special properties"},{"location":"Configuration/#pipelinedefaultprops","text":"pipeline.defaultProps is handled before any other property. It is used to link another properties file. The properties from that file are added to the MASTER set. The pipeline.defaultProps property itself is not included in the MASTER properties set.","title":"pipeline.defaultProps"},{"location":"Configuration/#module-specific-forms","text":"Many pipeline properties (usually those used by pipeline utilities) can be directed to a specific module. For example, script.numThreads is a general property that specifies the number of threads allotted to each script launched by any module; and PearMergeReads.numThreads overrides that property ONLY for the PearMergeReads module.","title":"Module-specific forms"},{"location":"Configuration/#exe-properties","text":"exe. properties are used to specify the path to common executables. exe. properties are special in that they have the automatic default of returning the property name minus the exe. prefix as their value. Modules are sometimes written to use a common tool, such as Rscript or bowtie . These modules will write scripts with the assumption that this command is on the $PATH when the script is executed UNLESS exe.Rscript is given specifying a path to use. The exe. properties are often specified in a defaultProps file for a given environment rather than in individual project properties files. Most often, docker containers are used because of the executables baked into them, and any exe. configurations are only applicable when not running in docker. In a pipeline running in docker, all references to an exe. property will return the default value (by removing the exe. prefix), regardless of how the exe. property is configured. In the rare case where you do need to give the path to an executable within a container, you can specify this by using the prefix dockerExe. in place of exe. . In the even rarer case where you want to use an executable from your local machine, while running a pipeline in docker, you can specify this by using the prefix hostExe. in place of exe. .","title":"exe.* properties"},{"location":"Configuration/#chaining-configuration-files","text":"Although all properties can be configured in one file, we recommend chaining default files through the pipeline.defaultProps option.
This can often improve the portability, maintainability, and readability of the project-specific configuration files.","title":"Chaining configuration files"},{"location":"Configuration/#standard-properties","text":"BioLockJ will always apply the standard.properties file packaged with BioLockJ under resources/config/default/ ; you do not need to specify this file in your pipeline.defaultProps chain. If (and only if) running a pipeline in docker, BioLockJ will apply the docker.properties file packaged with BioLockJ under resources/config/default/ .","title":"Standard Properties"},{"location":"Configuration/#user-specified-defaults","text":"We recommend creating an environment.properties file to assign environment-specific defaults. Set cluster & script properties Set paths to key executables through exe properties Override standard.properties as needed. This information is the same for many (or all) projects run in this environment, and entering the info anew for each project is tedious, time-consuming and error-prone. If using a shared system, consider using a user.properties file. Set user-specific properties such as download.dir and mail.to. For shared projects, use a path that will be updated per-user, such as ~/biolock_user.properties Other logical intermediates may also present themselves. For example, some group of projects may need to override several of the defaults set in environment.properties, but others still use those defaults. Projects in this set can use pipeline.defaultProps=group2.properties and the group2.properties files may include pipeline.defaultProps=environment.properties","title":"User-specified Defaults"},{"location":"Configuration/#project-properties","text":"Create a new configuration file for each pipeline to assign project-specific properties: Set the BioModule execution order Set pipeline.defaultProps = environment.properties You may use multiple default config files: pipeline.defaultProps=environment.properties,groupSettings.properties Override environment.properties and standard.properties as needed Example project configuration files can be found in templates . If the same property is given in multiple config files, the highest priority goes to the file used to launch the pipeline. Standard.properties always has the lowest priority. A copy of each configuration file is stored in the pipeline root directory to serve as primary project documentation.","title":"Project Properties"},{"location":"Dependencies/","text":"BioLockJ requires Java 1.8+ and a Unix-like operating system such as Darwin/macOS ; see Notes about environments . BioLockJ is a pipeline manager, designed to integrate and manage external tools. These external tools are not packaged into the BioLockJ program. BioLockJ must run in an environment where these other tools have been installed, OR run through docker using docker images that have the tools installed. The core program, and all modules packaged with it, have corresponding docker images. Dependencies are required by modules listed in the BioModule Function column. Users DO NOT NEED TO INSTALL dependencies if not interested in the listed modules. For example, if you intend to classify 16S samples with RDP and WGS samples with Kraken, do not install: Bowtie2, GNU Awk, GNU Gzip, MetaPhlAn2, Python, QIIME 1, or Vsearch.
# Program Version BioModule Function Link 1 Bowtie2 2.3.2 Metaphlan2Classifier : Build reference indexes download 2 GNU Awk 4.0.2 AwkFastaConverter : Convert Fastq to Fasta BuildQiimeMapping : Format metadata as QIIME mapping QiimeClosedRefClassifier : Build batch mapping files download 3 GNU Gzip 1.5 AwkFastaConverter : Decompress .gz files Gunzipper : Decompress .gz files download 4 Kraken 0.10.5-beta KrakenClassifier : Report WGS taxonomic summary download 5 MetaPhlAn2 2.0 Metaphlan2Classifier : Report WGS taxonomic summary download 6 Python 2.7.12 BuildQiimeMapping : Run validate_mapping_file.py MergeQiimeOtuTables : Run merge_otu_tables.py QiimeClosedRefClassifier : Run pick_closed_reference_otus.py QiimeDeNovoClassifier : Run pick_de_novo_otus.py QiimeOpenRefClassifier : Run pick_open_reference_otus.py QiimeClassifier : Run add_alpha_to_mapping_file.py, add_qiime_labels.py, alpha_diversity.py, filter_otus_from_otu_table.py, print_qiime_config.py, and summarize_taxa.py Metaphlan2Classifier : Run metaphlan2.py download 7 PEAR 0.9.8 Paired-End reAd merger PearMergeReads Merge paired Fastq files since some classifiers ( RDP & QIIME ) will not accept paired reads. download 8 QIIME 1 1.9.1 Quantitative Insights Into Microbial Ecology BuildQiimeMapping : Validate QIIME mapping MergeQiimeOtuTables : Merge otu_table.biom files QiimeClosedRefClassifier : Pick OTUs by reference QiimeDeNovoClassifier : Pick OTUs by clustering QiimeOpenRefClassifier : Pick OTUs by reference and clustering QiimeClassifier : Report 16S taxonomic summary download 9 R 3.5.0 R_CalculateStats : Statistical modeling R_PlotPvalHistograms : Plot p-value histograms for each reportable metadata field R_PlotOtus : Build OTU-metadata boxplots and scatterplots R_PlotMds : Plot by top MDS axis R_PlotEffectSize : Build barplot of effect magnitude by OTU/taxa download 10 R-coin 1.2 COnditional Inference procedures in a permutatioN test framework R_CalculateStats : Compute exact Wilcox_test p-values download 11 R-ggpubr 0.1.8 R_PlotPvalHistograms : Set color palette R_PlotMds : Set color palette R_PlotEffectSize : Set color palette download 12 R-Kendall 2.2 R_CalculateStats : Compute rank correlation p-values for continuous data types download 13 R-properties 0.0-9 R_Module : Reads in the MASTER configuration properties file from the pipeline root directory download 14 R-stringr 1.2.0 R_Module : For string manipulation for handling Configuration properties download 15 R-vegan 2.5-2 R_PlotMds : Ordination methods, diversity analysis and other functions for ecologists. download 16 RDP 2.12 Ribosomal Database Project RdpClassifier : Report 16S taxonomic summary download 17 Vsearch 2.4.3 QiimeDeNovoClassifier : Chimera detection QiimeOpenRefClassifier : Chimera detection download Version Dependencies # The Version column contains the version tested during BioLockJ development, but other versions can often be substituted. Major releases (such as Python 2 vs. Python 3) contain API changes that will not integrate with the current BioLockJ code. Application APIs often change over time, so not all versions are supported. For example, Bowtie2 did not add the large index functionality until version 2.3.2.","title":"Dependencies"},{"location":"Dependencies/#version-dependencies","text":"The Version column contains the version tested during BioLockJ development, but other versions can often be substituted. Major releases (such as Python 2 vs. Python 3) contain API changes that will not integrate with the current BioLockJ code.
Application APIs often change over time, so not all versions are supported. For example, Bowtie2 did not add the large index functionality until version 2.3.2.","title":"Version Dependencies"},{"location":"DevNotes-main/","text":"BioLockJ Developers Guide # Release process # Release process Javadocs # https://BioLockJ-Dev-Team.github.io/BioLockJ/javadocs/ Guidelines for new modules # Building Modules","title":"BioLockJ Developers Guide"},{"location":"DevNotes-main/#biolockj-developers-guide","text":"","title":"BioLockJ Developers Guide"},{"location":"DevNotes-main/#release-process","text":"Release process","title":"Release process"},{"location":"DevNotes-main/#javadocs","text":"https://BioLockJ-Dev-Team.github.io/BioLockJ/javadocs/","title":"Javadocs"},{"location":"DevNotes-main/#guidelines-for-new-modules","text":"Building Modules","title":"Guidelines for new modules"},{"location":"DevNotes-releaseProcess/","text":"Release process # The release process must be performed by someone with write permission on the main BioLockJ repository. Since that repository is owned by a GitHub group, anyone with owner permission in the group can perform the steps. Merge any pull requests that should be included in the release. Edit the version file to show the release version (i.e., remove the \"-dev\" suffix) Render all documentation: cd $BLJ/resources; ant userguide Commit these changes, often with the message \"version++ to vx.y.z; render docs\" Tag the current master with the tag \"v.x.y.z-rc\" (\"release candidate\") Run release tests ( see details below ) Tag the current main branch of the BioLockJ repository with the official release tag. After saving the results of tests, use the same tag for the sheepdog_testing_suite main branch. Push the commits and tags to the central main: git push --tags upstream Build the distribution tarball ( see details below ) In GitHub, go to tags, select the new release tag, edit it, and upload the tarball you just created. Trigger DockerHub builds by pushing to the linked github repository ( see details below ) Set the new dev version Use the next patch release (even if the next release is expected to be major). After release v1.3.14, set the version file to say \"v1.3.15-dev\". Commit this with the message \"Dev continues toward v1.3.15\". Review : Use the link to the latest release on the Getting-Started page, and make sure the release appears correct. Make sure the user guide link(s) in the top repo README both reflect the latest release The view through github.io is controlled under the Settings for the BioLockJ repository. The view through readthedocs is controlled by the biolockj project, which has multiple admins. Look for failed docker builds . The auto builds are configured through the biolockjdevteam organization on DockerHub, which as of late 2020 is a paid account and has multiple admins. Running release tests # Use the tools in the repository: BioLockJ_Dev_Team/sheepdog_testing_suite. The tools in this suite will automatically build the BioLockJ program from source, but they will not build the updated docker image. Many tests run in docker use the --blj arg so that the current BioLockJ folder is mapped in, so there is no need to update the image to test a local copy of BioLockJ.
For individual modules, the corresponding docker image probably hasn't changed since the last version, so you can save a bit of time during testing by simply re-tagging the old images with the new version: $BLJ/resources/docker/docker_build_scripts/retagForLocalTests.sh v1.3.15 v1.3.16 Any image whose dockerfile was changed should be built, and the biolockj_controller should be built (since presumably that has changed since the last version). To build all images, use the buildDockerImages.sh script with no args. With one arg, any image matching that string will be built. $BLJ/resources/docker/docker_build_scripts/buildDockerImages.sh controller The sheepdog_testing_suite has further instructions for setting up the tests. Use the main branch, and tag it with the same release candidate tag used for the BioLockJ repository. Run each of the /test/run_*_testCollection.sh scripts in the corresponding environment. Save results files under archived_testCollection_results (see existing examples for which files to save) (recommended) Locally save the pipelines for all tests for later reference. But DO NOT commit these in either repository. If tests fail (that previously passed), reconsider the release. Make and commit quick fixes if that is feasible. Assuming tests pass, proceed with the release process. Building for deployment # Best practice for packaging the official release is to download a fresh copy of the official repo, and build within a docker image. The fresh clone ensures that git-ignored files that are in the local repo copy are not incorporated in the official deployment. Using the docker image promotes consistency, and reduces the chances of invisible dependencies. (Not to mention, it's downright convenient!) git clone https://github.com/BioLockJ-Dev-Team/BioLockJ.git cd BioLockJ docker run --rm -v $PWD:/biolockj biolockjdevteam/build_and_deploy If needed, the git clone command could be replaced with wget https://github.com/BioLockJ-Dev-Team/BioLockJ/archive/main.zip , or any other download command. Triggering docker builds # BioLockJ docker images, most importantly biolockj_controller, are hosted on docker hub under the organization "biolockjdevteam". The images for modules that are packaged with the main program, and the image for the BioLockJ program itself, are set up to build on docker hub infrastructure automatically. For the modules, this typically creates an identical image, and gives it a new tag matching the current release version. This automated build is triggered when a tag matching our version format (ie v1.2.3) is pushed to the linked github repository. As of this writing, dockerhub and github have a nice integration, but it does not allow for linking to a repository owned by an organization (like our BioLockJ repository is owned by the biolockj_dev_team organization). So we have a separate fork of the repository that exists solely to trigger builds on dockerhub. The bot user is "biolockjBuilder". In order to push to this repo, you will need permission. Any new user who will do the release process will need to be added as a collaborator to that repository. (first time only) Set up the biolockjBuilder fork as a remote for your BioLockJ git repository: git remote add DockerBuilder https://github.com/biolockjBuilder/BioLockJ.git Push the release tag to this repository. git push DockerBuilder --tags Within a few minutes there should be builds scheduled on DockerHub for the auto-build repositories. They may take some time to actually build.
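To confirm that the tag actually reached the build-trigger fork before waiting on DockerHub, a quick check might look like this (a sketch; it assumes the DockerBuilder remote configured above and uses v1.3.15 as a stand-in version):

git ls-remote --tags DockerBuilder | grep v1.3.15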
After a few hours, check the repositories to see that new builds exist and that no builds failed. Failed docker builds # Sometimes there are random failures (maybe a website was down temporarily) and you will need to build the image locally and push it with the desired tag. If the build fails for the biolockj_controller image, that is a big problem and you need to figure out why. If the build fails for one of the modules, that usually means that a url in the dockerfile needs to be updated. In some cases, some dependency is no longer available (no longer hosted). In that case, pull the previous version of the image, retag it with the current tag and push to dockerhub. Make an issue to resolve the problem before the next release. If the dockerfile can be updated to create a functional image to run the module, great, do that. If that is not possible, then the most recent image is the image, and the module's docker tag method should no longer use the current biolockj version, but should instead be hard-coded to the most recent version. Turn off auto-builds for that image. This is probably a red flag that the software is no longer supported, and the module will (eventually) need to be replaced.","title":"Release process"},{"location":"DevNotes-releaseProcess/#release-process","text":"The release process must be performed by someone with write permission on the main BioLockJ repository. Since that repository is owned by a GitHub group, anyone with owner permission in the group can perform the steps. Merge any pull requests that should be included in the release. Edit the version file to show the release version (ie, remove the \"-dev\" suffix) Render all documentation: cd $BLJ/resources; ant userguide Commit these changes, often with the message \"version++ to vx.y.z; render docs\" Tag the current master with the tag \"vx.y.z-rc\" (\"release candidate\") Run release tests ( see details below ) Tag the current main branch of the BioLockJ repository with the official release tag. After saving the results of tests, use the same tag for the sheepdog_testing_suite main branch. Push the commits and tags to the central main: git push --tags upstream Build the distribution tarball ( see details below ) In GitHub, go to tags, select the new release tag, edit it, and upload the tarball you just created. Trigger DockerHub builds by pushing to the linked github repository ( see details below ) Set new dev version Use the next patch release (even if the next release is expected to be major). After release v1.3.14, set the version file to say \"v1.3.15-dev\". Commit this with the message \"Dev continues toward v1.3.15\". Review : Use the link to the latest release on the Getting-Started page, and make sure the release appears correct. Make sure the user guide link(s) in the top repo README both reflect the latest release The view through github.io is controlled under the Settings for the BioLockJ repository. The view through readthedocs is controlled by the biolockj project which has multiple admins. Look for failed docker builds . The auto builds are configured through the biolockjdevteam organization on DockerHub, which as of late 2020 is a paid account, and has multiple admins.","title":"Release process"},{"location":"DevNotes-releaseProcess/#running-release-tests","text":"Use the tools in the repository: BioLockJ_Dev_Team/sheepdog_testing_suite. The tools in this suite will automatically build the BioLockJ program from source, but they will not build the updated docker image.
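For the fall-back described under Failed docker builds above (re-publishing the previous image under the current release tag), the commands might look like the following sketch; some_module is a hypothetical image name and the version tags are illustrative:

docker pull biolockjdevteam/some_module:v1.3.15
docker tag biolockjdevteam/some_module:v1.3.15 biolockjdevteam/some_module:v1.3.16
docker push biolockjdevteam/some_module:v1.3.16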
Many tests run in docker use the --blj arg so that the current BioLockJ folder is mapped in, so there is no need to update the image to test a local copy of BioLockJ. For individual modules, the corresponding docker image probably hasn't changed since the last version, so you can save a bit of time during testing by simply re-tagging the old images with the new version: $BLJ/resources/docker/docker_build_scripts/retagForLocalTests.sh v1.3.15 v1.3.16 Any image whose dockerfile was changed should be built, and the biolockj_controller should be built (since presumably that has changed since the last version). To build all images, use the buildDockerImages.sh script with no args. With one arg, any image matching that string will be built. $BLJ/resources/docker/docker_build_scripts/buildDockerImages.sh controller The sheepdog_testing_suite has further instructions for setting up the tests. Use the main branch, and tag it with the same release candidate tag used for the BioLockJ repository. Run each of the /test/run_*_testCollection.sh scripts in the corresponding environment. Save results files under archived_testCollection_results (see existing examples for which files to save) (recommended) Locally save the pipelines for all tests for later reference. But DO NOT commit these in either repository. If tests fail (that previously passed), reconsider the release. Make and commit quick fixes if that is feasible. Assuming tests pass, proceed with the release process.","title":"Running release tests"},{"location":"DevNotes-releaseProcess/#building-for-deployment","text":"Best practice for packaging the official release is to download a fresh copy of the official repo, and build within a docker image. The fresh clone ensures that git-ignored files that are in the local repo copy are not incorporated in the official deployment. Using the docker image promotes consistency, and reduces the chances of invisible dependencies. (Not to mention, it's downright convenient!) git clone https://github.com/BioLockJ-Dev-Team/BioLockJ.git cd BioLockJ docker run --rm -v $PWD:/biolockj biolockjdevteam/build_and_deploy If needed, the git clone command could be replaced with wget https://github.com/BioLockJ-Dev-Team/BioLockJ/archive/main.zip , or any other download command.","title":"Building for deployment"},{"location":"DevNotes-releaseProcess/#triggering-docker-builds","text":"BioLockJ docker images, most importantly biolockj_controller, are hosted on docker hub under the organization "biolockjdevteam". The images for modules that are packaged with the main program, and the image for the BioLockJ program itself, are set up to build on docker hub infrastructure automatically. For the modules, this typically creates an identical image, and gives it a new tag matching the current release version. This automated build is triggered when a tag matching our version format (ie v1.2.3) is pushed to the linked github repository. As of this writing, dockerhub and github have a nice integration, but it does not allow for linking to a repository owned by an organization (like our BioLockJ repository is owned by the biolockj_dev_team organization). So we have a separate fork of the repository that exists solely to trigger builds on dockerhub. The bot user is "biolockjBuilder". In order to push to this repo, you will need permission. Any new user who will do the release process will need to be added as a collaborator to that repository.
(first time only) Set up the biolockjBuilder fork as a remote for your BioLockJ git repository: git remote add DockerBuilder https://github.com/biolockjBuilder/BioLockJ.git Push the release tag to this repository. git push DockerBuilder --tags Within a few minutes there should be builds scheduled on DockerHub for the auto-build repositories. They may take some time to actually build. After a few hours, check the repositories to see that new builds exist and that no builds failed.","title":"Triggering docker builds"},{"location":"DevNotes-releaseProcess/#failed-docker-builds","text":"Sometimes there are random failures (maybe a website was down temporarily) and you will need to build the image locally and push it with the desired tag. If the build fails for the biolockj_controller image, that is a big problem and you need to figure out why. If the build fails for one of the modules, that usually means that a url in the dockerfile needs to be updated. In some cases, some dependency is no longer available (no longer hosted). In that case, pull the previous version of the image, retag it with the current tag and push to dockerhub. Make an issue to resolve the problem before the next release. If the dockerfile can be updated to create a functional image to run the module, great, do that. If that is not possible, then the most recent image is the image, and the module's docker tag method should no longer use the current biolockj version, but should instead be hard-coded to the most recent version. Turn off auto-builds for that image. This is probably a red flag that the software is no longer supported, and the module will (eventually) need to be replaced.","title":"Failed docker builds"},{"location":"Example-Pipeline/","text":"In our example analysis, we investigate the differences between the microbiome of 20 rural and 20 recently urbanized subjects from the Chinese province of Hunan. For more information on this dataset, please review the analysis the Fodor Lab published in the Sep 2017 issue of the journal Microbiome: https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-017-0338-7 Step 1: Prepare BioLockJ Config File # The BioLockJ project Config chinaKrakenFullDB.properties lists 5 BioModules to run (lines 3-7) + 13 properties: #BioModule biolockj.module.implicit.RegisterNumReads #BioModule biolockj.module.classifier.wgs.KrakenClassifier #BioModule biolockj.module.report.taxa.NormalizeTaxaTables #BioModule biolockj.module.report.r.R_PlotPvalHistograms #BioModule biolockj.module.report.r.R_PlotOtus In addition to the 5 listed BioModules, 4 additional implicit BioModules will also run: Mod# Module Description 1 ImportMetadata Always run 1st (for all pipelines) 2 KrakenParser Always run after KrakenClassifier 3 AddMetadataToOtuTables Always run just before the 1st R module 4 CalculateStats Always run as the 1st R module. Key properties: Line# Property Description 08 cluster.jobHeader Each script will run on 1 node, 16 cores, and 128GB RAM for up to 30 minutes 10 pipeline.defaultProps Default config file defines most properties \u2013 in this case copperhead.properties 12 input.dirPaths Directory path containing 40 gzipped whole genome sequencing (WGS) fastq files 18 metadata.filePath Metadata file path: chinaMetadata.tsv BioLockJ must associate sequence files in input.dirPaths with the correct metadata row. This is done by matching sequence file names to the 1st column in the metadata file. If the Sample ID is not found in your file names, the file names must be updated.
Use the following properties to ignore a file prefix or suffix when matching the sample IDs. input.suffixFw input.suffixRv input.trimPrefix input.trimSuffix Sample IDs from 1st column of the metadata file: 081A, 082A, 083A...etc. Sequence file names: 081A_R1.fq.gz, 082A_R1.fq.gz, 083A_R1.fq.gz...etc. The default Config file, copperhead.properties, has its own default Config file standard.properties which defines the property input.suffixFw=_R1 . As a result, all characters starting with (and including) \u201c_R1\u201d are ignored when matching the file name to the metadata sample ID. Step 2: Run BioLockJ Pipeline # > biolockj ~/chinaKrakenFullDB.properties Look in the BioLockJ pipeline output directory defined by $BLJ_PROJ for a new pipeline directory named after the property file + today\u2019s date: ~/projects/chinaKrakenFullDB_2018Apr09 The 5 configured modules have run in order, with the addition of 2 implicit modules (1st and last) which are added to all pipelines automatically. The biolockjComplete file indicates the pipeline ran successfully. Step 3: Review Pipeline Summary # Run the blj_summary command to review the pipeline execution summary. > blj_summary Pipeline Summary Step 4: Download R Reports # Run the blj_download command to get the command needed to download the analysis. > blj_download > rsync Step 5: Analyze R Reports # Open downloadDir on your local filesystem to review the analysis. This directory contains: Output Description /temp Directory where R log files are saved if R script runs locally. /tables Directory containing the OTU tables. /local Directory where R script output is saved if R script runs locally and r.debug=Y . *.RData The saved R sessions for R modules run if r.saveRData=Y . chinaKrakenFullDB.log The pipeline Java log file. MAIN_*.R Each R script for each module that generated reports has been updated to run on your local filesystem. *.tsv files Spreadsheets containing p-value and R^2 statistics for each OTU in the taxonomy level. *.pdf files P-value histograms, and bar-charts or scatterplots for each OTU in the taxonomy level. Each R module generates a report for each report.taxonomyLevel configured: Open chinaKrakenFullDB_Log10_genus.pdf # The report begins with the unadjusted P-Value Distributions: Since r.numHistogramBreaks=20 , the 1st bar represents the p-values < 0.05. The ruralUrban attribute appears significant, as indicated by the high number of p-values < 0.05. For each OTU, a bar-chart or scatterplot is output with adjusted parametric and non-parametric p-values formatted in the plot header. The p-value format is defined by r.pValFormat . The p-adjust method is defined by rStats.pAdjustMethod .
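As a concrete illustration of the sample ID matching described in Step 1: if the metadata sample IDs were 081A, 082A, 083A...etc. but the sequence files were named like run3_081A_R1.fq.gz, a hypothetical Config could trim the extra pieces with the properties below (run3_ is an invented prefix for illustration; the example dataset itself only needs input.suffixFw=_R1, which it inherits from standard.properties):

input.trimPrefix=run3_
input.suffixFw=_R1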
P-values that meet the r.pvalCutoff threshold are highlighted with r.colorHighlight .","title":"Example Pipeline"},{"location":"Example-Pipeline/#step-1-prepare-biolockj-config-file","text":"The BioLockJ project Config chinaKrakenFullDB.properties lists 5 BioModules to run (lines 3-7) + 13 properties: #BioModule biolockj.module.implicit.RegisterNumReads #BioModule biolockj.module.classifier.wgs.KrakenClassifier #BioModule biolockj.module.report.taxa.NormalizeTaxaTables #BioModule biolockj.module.report.r.R_PlotPvalHistograms #BioModule biolockj.module.report.r.R_PlotOtus In addition to the 5 listed BioModules, 4 additional implicit BioModules will also run: Mod# Module Description 1 ImportMetadata Always run 1st (for all pipelines) 2 KrakenParser Always run after KrakenClassifier 3 AddMetadataToOtuTables Always run just before the 1st R module 4 CalculateStats Always run as the 1st R module. Key properties: Line# Property Description 08 cluster.jobHeader Each script will run on 1 node, 16 cores, and 128GB RAM for up to 30 minutes 10 pipeline.defaultProps Default config file defines most properties \u2013 in this case copperhead.properties 12 input.dirPaths Directory path containing 40 gzipped whole genome sequencing (WGS) fastq files 18 metadata.filePath Metadata file path: chinaMetadata.tsv BioLockJ must associate sequence files in input.dirPaths with the correct metadata row. This is done by matching sequence file names to the 1st column in the metadata file. If the Sample ID is not found in your file names, the file names must be updated. Use the following properties to ignore a file prefix or suffix when matching the sample IDs. input.suffixFw input.suffixRv input.trimPrefix input.trimSuffix Sample IDs from 1st column of the metadata file: 081A, 082A, 083A...etc. Sequence file names: 081A_R1.fq.gz, 082A_R1.fq.gz, 083A_R1.fq.gz...etc. The default Config file, copperhead.properties, has its own default Config file standard.properties which defines the property input.suffixFw=_R1 . As a result, all characters starting with (and including) \u201c_R1\u201d are ignored when matching the file name to the metadata sample ID.","title":"Step 1: Prepare BioLockJ Config File"},{"location":"Example-Pipeline/#step-2-run-biolockj-pipeline","text":"> biolockj ~/chinaKrakenFullDB.properties Look in the BioLockJ pipeline output directory defined by $BLJ_PROJ for a new pipeline directory named after the property file + today\u2019s date: ~/projects/chinaKrakenFullDB_2018Apr09 The 5 configured modules have run in order, with the addition of 2 implicit modules (1st and last) which are added to all pipelines automatically. The biolockjComplete file indicates the pipeline ran successfully.","title":"Step 2: Run BioLockJ Pipeline"},{"location":"Example-Pipeline/#step-3-review-pipeline-summary","text":"Run the blj_summary command to review the pipeline execution summary. > blj_summary Pipeline Summary","title":"Step 3: Review Pipeline Summary"},{"location":"Example-Pipeline/#step-4-download-r-reports","text":"Run the blj_download command to get the command needed to download the analysis. > blj_download > rsync","title":"Step 4: Download R Reports"},{"location":"Example-Pipeline/#step-5-analyze-r-reports","text":"Open downloadDir on your local filesystem to review the analysis. This directory contains: Output Description /temp Directory where R log files are saved if R script runs locally. /tables Directory containing the OTU tables. 
/local Directory where R script output is saved if R script runs locally and r.debug=Y . *.RData The saved R sessions for R modules run if r.saveRData=Y . chinaKrakenFullDB.log The pipeline Java log file. MAIN_*.R Each R script for each module that generated reports has been updated to run on your local filesystem. *.tsv files Spreadsheets containing p-value and R^2 statistics for each OTU in the taxonomy level. *.pdf files P-value histograms, and bar-charts or scatterplots for each OTU in the taxonomy level. Each R module generates a report for each report.taxonomyLevel configured:","title":"Step 5: Analyze R Reports"},{"location":"Example-Pipeline/#open-chinakrakenfulldb_log10_genuspdf","text":"The report begins with the unadjusted P-Value Distributions: Since r.numHistogramBreaks=20 , the 1st bar represents the p-values < 0.05. The ruralUrban attribute appears significant, as indicated by the high number of p-values < 0.05. For each OTU, a bar-chart or scatterplot is output with adjusted parametric and non-parametric p-values formatted in the plot header. The p-value format is defined by r.pValFormat . The p-adjust method is defined by rStats.pAdjustMethod . P-values that meet the r.pvalCutoff threshold are highlighted with r.colorHighlight .","title":"Open chinaKrakenFullDB_Log10_genus.pdf"},{"location":"FAQ/","text":"FAQ, Troubleshooting and Special Cases # Question: How much does it cost to use BioLockJ ? # Answer: BioLockJ itself is free and open-source. BioLockJ is designed for large datasets, and it is often necessary to purchase computational resources to handle large datasets and to run the processes that BioLockJ will manage. This cost often comes in the form of buying an effective computer, subscribing to a cluster, or purchasing cloud computing power. Question: What are the system requirements for running BioLockJ ? # Answer: Either unix-and-java or docker, details below. Easy mode: you have a unix system and you can run docker. You're covered. BioLockJ requires java, but if you can run docker, then all of the java-components can run inside the docker container. Easy-ish mode: no unix, but you can run docker. See Pure-Docker . Local host mode: No docker. You need to have a unix-like system and java 1.8 or later. The launch process for BioLockJ will be easy, but the majority of modules have essential dependencies and you will have to install each of those dependencies on your own system. See Dependencies . In terms of memory, RAM, and CPUs, the amount required really depends on the size of the data you are processing and the needs of the algorithms you are running. In general, processing sequence data requires a computer cluster or a cloud-computing system (more than a typical individual-user machine). After sequence data have been summarized as tables, all subsequent steps are orders of magnitude smaller and can usually run on a laptop within a matter of minutes. Most datasets can be dramatically sub-sampled to allow a laptop user to run a test of the pipeline; this does not produce usable results, but allows the user to test and troubleshoot the pipeline in a convenient setting before moving it to a bigger system. Question: BioLockJ says that my pipeline is running...now what? # Answer: Check on your pipeline's progress. See the Getting Started page . If you are using a unix-like system, you can use the cd-blj alias to jump to the most recent pipeline.
On any system, the path to the new pipeline is printed during the launch process; it will be a folder immediately under your $BLJ_PROJ folder. Look in that directory. When a pipeline forms, it creates the "precheckStarted" flag and then replaces that with the "precheckComplete" flag when all dependencies/settings are confirmed. Then the pipeline starts the first module, and the flag is replaced with "biolockjStarted". This generally takes a few seconds or less. The subfolder for the current module will also have the "biolockjStarted" flag. When a module is finished, the module flag is replaced with "biolockjComplete". When the last module is finished, the pipeline flag is finally changed to "biolockjComplete". From the pipeline folder, ls 0* is a quick way to see the current progress, because that will show the flag files and subfolders for each of the first ten modules. (That's "LS zero star", or "LS one star" if you have more than ten modules.) If any module encounters an error, and cannot complete, then that module is marked with the "biolockjFailed" flag, the pipeline shuts down, and the pipeline is also marked with "biolockjFailed". Extensive information is available in the pipeline's log file. A more concise message describing the error, and sometimes solutions, is written to the biolockjFailed flag. If your pipeline fails, use cat biolockjFailed to see the error message. Question: My pipeline failed...now what? # Answer: See Failure Recovery . Most often, there is a concise error message that may even have instructions for fixing the pipeline. cd-blj cat biolockjFailed Don't be discouraged. It is normal to go through several, even many, failed attempts as you figure out how all the parts come together. Question: If biolockj indicates that my pipeline may have failed to start, how do I determine the cause of the failure? # Answer: Use -f . By default, BioLockJ runs the java component in the background, and only a minimal, helpful message is printed on the screen. If there was some problem in getting that short, helpful message to the screen, you can use the --foreground or -f option to force biolockj to run in the foreground, thus printing everything to the screen. Often the print-out ends shortly after a helpful message. Question: Sometimes BioLockJ adds modules to my pipeline. How can I tell what modules will be added? # Answer: Read the docs; or use -p . With the --precheck-only or -p option, BioLockJ will create the pipeline and go through the check-dependencies phase for each module, but even without finding errors it will not actually run the pipeline. This allows you to see what modules will be run, see the pipeline folder layout, and see if any errors will prevent the pipeline from starting. This is also ideal when you know you want to change more options or add more modules before you run the pipeline, but you want to check if there is anything that needs to be fixed in what you have so far. In the documentation for each module, there is a section called "Adds modules". A module may give the class path of another module that it adds before or after itself. Many modules say "none found" to indicate that this module does not add any other modules before or after itself. Sometimes this section will say "pipeline-dependent" and more details are given in the "Details" section to explain which other modules might be added and when / why. Modules that are added by other modules are called pre-requisite modules .
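A minimal sketch of this precheck workflow, assuming myPipeline.properties is your config file (the file name is illustrative):

biolockj -p myPipeline.properties   # build the pipeline folder and check dependencies without running
cd-blj                              # jump to the most recent pipeline folder
ls 0*                               # list the first modules, including any pre-requisite or implicit modules that were added
cat biolockjFailed                  # only present if the precheck failed; holds the concise error message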
Modules that are added by the BioLockJ backbone are called implicit modules . These can be disabled with the properties pipeline.disableAddPreReqModules and pipeline.disableAddImplicitModules , respectively. Question: I get an error message about a property, but I have that property configured correctly. What gives? # Answer: Use -u . This is often the result of a typo somewhere. Generally, BioLockJ runs a check-dependencies protocol on each module, and all required properties should be checked during that process, and it stops when it first finds a problem. With the --unused-props or -u option, biolockj will check dependencies for all modules, even after one fails, and any properties that were never used will be printed to the screen. This often highlights typos in property names, or properties that are not used by the currently configured modules. Keep in mind, this only reports properties in your primary config file, not in any of your defaultProps files. Question: A module script is failing because an environment variable is missing. But I know I defined that variable, and I can see it with echo . Why can't the script see it? # Answer: Use -e ; or reference it in your configuration file in the ${VAR} format . Where possible, avoid relying on environment variables. Consider defining a value in your config file and/or adding the value to a parameter list that will be used with the script. Variables from your local environment must be explicitly passed into the module environments. See the Configuration page . Question: On a cluster system, I need a particular module to run on the head node. # Answer: Use module-specific properties to control the cluster properties for that module. See the Configuration page for more details about module-specific forms of general properties. Example: # On this cluster, the compute nodes do not have internet access, only the head node does. The first module in the pipeline is the SraDownload module to get the data, which requires internet access. All pipelines run on this cluster include a reference to the properties set up specifically for this cluster: pipeline.defaultProps=${BLJ}/ourCluster.properties This group chose to store their system configurations in the BioLockJ folder, which they reference using the fully dressed ${BLJ} variable. In this file, they have configurations for launching jobs: cluster.batchCommand = qsub SraDownload.batchCommand = /bin/bash BioLockJ launches jobs using qsub