Containerise update software #16

Open

wants to merge 20 commits into base: dev
2 changes: 1 addition & 1 deletion README.md
@@ -66,7 +66,7 @@ Cohort samples:
## Prerequisites

- **Prerequisite hardware:** [NVIDIA GPUs](https://www.nvidia.com/en-gb/graphics-cards/) (for GPU accelerated runs) (tested with NVIDIA V100)
- **Prerequisite software:** [NVIDIA CLARA parabricks and dependencies](https://www.nvidia.com/en-us/docs/parabricks/local-installation/) (for GPU accelerated runs) (tested with parabricks version 3.6.1-1), [Git](https://git-scm.com/) (tested with version 1.8.3.1), [Mamba](https://github.com/TheSnakePit/mamba) (tested with version 0.19.1) with [Conda](https://docs.conda.io/projects/conda/en/latest/index.html) (tested with version 4.11.0), [gsutil](https://pypi.org/project/gsutil/) (tested with version 4.34), [gunzip](https://linux.die.net/man/1/gunzip) (tested with version 1.5), [R](https://www.r-project.org/) (tested with version 3.5.1)
- **Prerequisite software:** [NVIDIA CLARA parabricks and dependencies](https://www.nvidia.com/en-us/docs/parabricks/local-installation/) (for GPU accelerated runs) (tested with parabricks version 3.6.1-1), [Git](https://git-scm.com/) (tested with version 1.8.3.1), [Conda](https://docs.conda.io/projects/conda/en/latest/index.html) (tested with version 4.11.0), [Mamba](https://github.com/TheSnakePit/mamba) (tested with version 0.19.1)

## Test vcf_annotation_pipeline

123 changes: 76 additions & 47 deletions docs/running_on_a_hpc.md
@@ -9,27 +9,28 @@
- [3. Setup files and directories](#3-setup-files-and-directories)
- [Test data](#test-data)
- [4. Get prerequisite software/hardware](#4-get-prerequisite-softwarehardware)
- [5. Create a local copy of the GATK resource bundle (either b37 or hg38)](#5-create-a-local-copy-of-the-gatk-resource-bundle-either-b37-or-hg38)
- [5. Create and activate a conda environment with software for downloading databases](#5-create-and-activate-a-conda-environment-with-software-for-downloading-databases)
- [6. Create a local copy of the GATK resource bundle (either b37 or hg38)](#6-create-a-local-copy-of-the-gatk-resource-bundle-either-b37-or-hg38)
- [b37](#b37)
- [hg38](#hg38)
- [6. Create a local copy of other databases (either GRCh37 or GRCh38)](#6-create-a-local-copy-of-other-databases-either-grch37-or-grch38)
- [7. Create a local copy of other databases (either GRCh37 or GRCh38)](#7-create-a-local-copy-of-other-databases-either-grch37-or-grch38)
- [GRCh37](#grch37)
- [GRCh38](#grch38)
- [7. Modify the configuration file](#7-modify-the-configuration-file)
- [8. Modify the configuration file](#8-modify-the-configuration-file)
- [Overall workflow](#overall-workflow)
- [Pipeline resources](#pipeline-resources)
- [Variant filtering](#variant-filtering)
- [Single samples](#single-samples)
- [Cohort samples](#cohort-samples)
- [VCF annotation](#vcf-annotation)
- [8. Configure to run on a HPC](#8-configure-to-run-on-a-hpc)
- [9. Modify the run scripts](#9-modify-the-run-scripts)
- [10. Create and activate a conda environment with python and snakemake installed](#10-create-and-activate-a-conda-environment-with-python-and-snakemake-installed)
- [11. Run the pipeline](#11-run-the-pipeline)
- [12. Evaluate the pipeline run](#12-evaluate-the-pipeline-run)
- [13. Commit and push to your forked version of the github repo](#13-commit-and-push-to-your-forked-version-of-the-github-repo)
- [14. Repeat step 13 each time you re-run the analysis with different parameters](#14-repeat-step-13-each-time-you-re-run-the-analysis-with-different-parameters)
- [15. Raise issues, create feature requests or create a pull request with the upstream repo to merge any useful changes to the pipeline (optional)](#15-raise-issues-create-feature-requests-or-create-a-pull-request-with-the-upstream-repo-to-merge-any-useful-changes-to-the-pipeline-optional)
- [9. Configure to run on a HPC](#9-configure-to-run-on-a-hpc)
- [10. Modify the run scripts](#10-modify-the-run-scripts)
- [11. Create and activate a conda environment with software for running the pipeline](#11-create-and-activate-a-conda-environment-with-software-for-running-the-pipeline)
- [12. Run the pipeline](#12-run-the-pipeline)
- [13. Evaluate the pipeline run](#13-evaluate-the-pipeline-run)
- [14. Commit and push to your forked version of the github repo](#14-commit-and-push-to-your-forked-version-of-the-github-repo)
- [15. Repeat step 14 each time you re-run the analysis with different parameters](#15-repeat-step-14-each-time-you-re-run-the-analysis-with-different-parameters)
- [16. Raise issues, create feature requests or create a pull request with the upstream repo to merge any useful changes to the pipeline (optional)](#16-raise-issues-create-feature-requests-or-create-a-pull-request-with-the-upstream-repo-to-merge-any-useful-changes-to-the-pipeline-optional)

## 1. Fork the pipeline repo to a personal or lab account

@@ -106,48 +107,63 @@ bash ./test/setup_test.sh -a cohort

## 4. Get prerequisite software/hardware

For GPU accelerated runs, you'll need [NVIDIA GPUs](https://www.nvidia.com/en-gb/graphics-cards/) and [NVIDIA CLARA PARABRICKS and dependencies](https://www.nvidia.com/en-us/docs/parabricks/local-installation/). Talk to your system administrator to see if the HPC has this hardware and software available.
For GPU accelerated runs, you'll need [NVIDIA GPUs](https://www.nvidia.com/en-gb/graphics-cards/) (tested with NVIDIA V100) and [NVIDIA CLARA PARABRICKS and dependencies](https://www.nvidia.com/en-us/docs/parabricks/local-installation/) (tested with parabricks version 3.6.1-1). Talk to your system administrator to see if the HPC has this hardware and software available.

Other software required to set up and run the pipeline:

- [Git](https://git-scm.com/) (tested with version 2.7.4)
- [Conda](https://docs.conda.io/projects/conda/en/latest/index.html) (tested with version 4.8.2)
- [Mamba](https://github.com/TheSnakePit/mamba) (tested with version 0.4.4) (note. [mamba can be installed via conda with a single command](https://mamba.readthedocs.io/en/latest/installation.html#existing-conda-install))
- [gsutil](https://pypi.org/project/gsutil/) (tested with version 4.52)
- [gunzip](https://linux.die.net/man/1/gunzip) (tested with version 1.6)
- [Git](https://git-scm.com/) (tested with version 1.8.3.1)
- [Conda](https://docs.conda.io/projects/conda/en/latest/index.html) (tested with version 4.11.0)
- [Mamba](https://github.com/TheSnakePit/mamba) (tested with version 0.19.1) (note: [mamba can be installed via conda with a single command](https://mamba.readthedocs.io/en/latest/installation.html#existing-conda-install))

Most of this software is commonly pre-installed on HPC's, likely available as modules that can be loaded. Talk to your system administrator if you need help with this.
This software is commonly pre-installed on HPCs, often available as modules that can be loaded. Talk to your system administrator if you need help with this.
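
If you're not sure what's already available, a quick check of the versions on the HPC (before or after loading any relevant modules; module names differ between clusters) might look like this:

```bash
# Confirm the prerequisite software is on the PATH and check versions
git --version
conda --version
mamba --version
```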

## 5. Create a local copy of the [GATK resource bundle](https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle) (either b37 or hg38)
## 5. Create and activate a conda environment with software for downloading databases

This installs [gsutil](https://cloud.google.com/storage/docs/gsutil), [ensembl-vep](https://grch37.ensembl.org/info/docs/tools/vep/index.html), [wget](https://www.gnu.org/software/wget/), and their dependencies.

```bash
cd ./workflow/
mamba env create -f ./envs/vap_download_db_env.yaml
conda activate vap_download_db_env
```
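
As an optional sanity check, confirm the tools this environment is expected to provide are now on the PATH:

```bash
# These should resolve to the vap_download_db_env environment just activated
gsutil version
vep --help | head -n 5
wget --version | head -n 1
```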

## 6. Create a local copy of the [GATK resource bundle](https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle) (either b37 or hg38)

### b37

Download from [Google Cloud Bucket](https://console.cloud.google.com/storage/browser/gatk-legacy-bundles/b37?prefix=)

```bash
gsutil cp -r gs://gatk-legacy-bundles/b37 /where/to/download/
gsutil -m cp -r gs://gatk-legacy-bundles/b37 /where/to/download/
```

### hg38

Download from [Google Cloud Bucket](https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0)

```bash
gsutil cp -r gs://genomics-public-data/resources/broad/hg38 /where/to/download/
gsutil -m cp -r gs://genomics-public-data/resources/broad/hg38 /where/to/download/
```
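
These bundles are large, so check the destination has enough free space first. If a transfer is interrupted, something like `gsutil rsync` can pick it up again rather than re-copying everything (shown here for hg38; the same pattern works for b37):

```bash
# Resume or refresh a partial copy; -m parallelises and -r recurses
gsutil -m rsync -r gs://genomics-public-data/resources/broad/hg38 /where/to/download/hg38
```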

## 6. Create a local copy of other databases (either GRCh37 or GRCh38)
## 7. Create a local copy of other databases (either GRCh37 or GRCh38)

### GRCh37

Download the [Ensembl-VEP](https://asia.ensembl.org/info/docs/tools/vep/index.html) database using a [conda version of Ensembl-VEP](https://anaconda.org/bioconda/ensembl-vep)
Download the [Ensembl-VEP](https://asia.ensembl.org/info/docs/tools/vep/index.html) database

```bash
conda create -n download_data_env python=3.7
conda activate download_data_env
conda install -c bioconda ensembl-vep=99.2
vep_install -a cf -s homo_sapiens -y GRCh37 -c /output/file/path/GRCh37 --CONVERT
conda deactivate
```

The same version of [Ensembl-VEP](https://asia.ensembl.org/info/docs/tools/vep/index.html) that is run in the pipeline needs to be used to create the database; therefore, if prompted to install a newer version of ensembl-vep, choose `continue (n)`. For example:

```bash

Version check reports a newer release of 'ensembl-vep' is available (installed: 105, available: 106)

You should exit this installer and re-download 'ensembl-vep' if you wish to update

Do you wish to exit so you can get updates (y) or continue (n): n
```
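
Once the installer finishes, it's worth confirming the cache was written to the output path before moving on; the directory name depends on the cache version, e.g. something like `99_GRCh37`:

```bash
# The homo_sapiens cache should contain a <version>_GRCh37 subdirectory
ls /output/file/path/GRCh37/homo_sapiens/
```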

Download the [CADD database](https://cadd.gs.washington.edu/download) and its associated index file.
@@ -161,14 +177,21 @@ Create a custom [dbNSFP database](https://sites.google.com/site/jpopgen/dbNSFP)

### GRCh38

Download [Ensembl-VEP](https://asia.ensembl.org/info/docs/tools/vep/index.html) database using a [conda install of Ensembl-VEP](https://anaconda.org/bioconda/ensembl-vep)
Download the [Ensembl-VEP](https://asia.ensembl.org/info/docs/tools/vep/index.html) database

```bash
mamba create -n download_data_env python=3.7
conda activate download_data_env
mamba install -c bioconda ensembl-vep=99.2
vep_install -a cf -s homo_sapiens -y GRCh38 -c /output/file/path/GRCh38 --CONVERT
conda deactivate
```

The same version of [Ensembl-VEP](https://asia.ensembl.org/info/docs/tools/vep/index.html) that is run in the pipeline needs to be used to create the database; therefore, if prompted to install a newer version of ensembl-vep, choose `continue (n)`. For example:

```bash

Version check reports a newer release of 'ensembl-vep' is available (installed: 105, available: 106)

You should exit this installer and re-download 'ensembl-vep' if you wish to update

Do you wish to exit so you can get updates (y) or continue (n): n
```

Download the [CADD database](https://cadd.gs.washington.edu/download) and its associated index file.
@@ -180,7 +203,7 @@ wget https://krishna.gs.washington.edu/download/CADD/v1.5/GRCh38/whole_genome_SN

Create a custom [dbNSFP database](https://sites.google.com/site/jpopgen/dbNSFP) build by following [this documentation](https://github.com/GenomicsAotearoa/dbNSFP_build)

## 7. Modify the configuration file
## 8. Modify the configuration file

Edit 'config.yaml' found within the config directory.

@@ -262,7 +285,7 @@ Set the maximum number of GPU's to be used per rule/sample for gpu-accelerated r

```yaml
GPU: 1
```

It is a good idea to consider the number of samples that you are processing. For example, if you set `THREADS: "8"` and set the maximum number of cores to be used by the pipeline in the run script to `-j/--cores 32` (see [step 9](#9-modify-the-run-scripts)), a maximum of 3 samples will be able to run at one time for these rules (if they are deployed at the same time), but each sample will complete faster. In contrast, if you set `THREADS: "1"` and `-j/--cores 32`, a maximum of 32 samples could be run at one time, but each sample will take longer to complete. This also needs to be considered when setting `MAXMEMORY` + `--resources mem_mb` and `GPU` + `--resources gpu`.
It is a good idea to consider the number of samples that you are processing. For example, if you set `THREADS: "8"` and set the maximum number of jobs in the run script to `-j/--jobs 32` (see [step 10](#10-modify-the-run-scripts)), a maximum of 3 samples will be able to run at one time for these rules (if they are deployed at the same time), but each sample will complete faster. In contrast, if you set `THREADS: "1"` and `-j/--jobs 32`, a maximum of 32 samples could be run at one time, but each sample will take longer to complete. This also needs to be considered when setting `MAXMEMORY` + `--resources mem_mb` and `GPU` + `--resources gpu`.

### Variant filtering

@@ -341,7 +364,7 @@ dbNSFP: "/scratch/publicData/dbNSFP/GRCh37/dbNSFPv4.0a.hg19.custombuild.gz"

```yaml
CADD: "/scratch/publicData/CADD/GRCh37/whole_genome_SNVs.tsv.gz"
```

## 8. Configure to run on a HPC
## 9. Configure to run on a HPC

*This will deploy the non-GPU accelerated rules to Slurm and deploy the GPU accelerated rules locally (pbrun_cnnscorevariants). Therefore, if running the pipeline GPU accelerated, the pipeline should be deployed from the machine with the GPUs.*

@@ -362,16 +385,18 @@ Configure `account:` and `partition:` in the default section of 'cluster.json' i

Many additional Slurm parameters can be configured (including per rule). If you set additional Slurm parameters, remember to pass them to the `--cluster` flag in the run scripts. See [here](https://snakemake-on-nesi.sschmeier.com/snake.html#slurm-and-nesi-specific-setup) for a good working example of deploying a Snakemake workflow to [NeSI](https://www.nesi.org.nz/).
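
As a sketch of what this might look like (assuming, for illustration, that matching `time` and `mem` keys have been added to 'cluster.json'; the existing `--cluster` string in your run scripts will differ):

```bash
# Hypothetical sketch: extra Slurm options appended to the sbatch string that
# snakemake passes to --cluster (adjust paths and keys to your setup)
snakemake \
--jobs 32 \
--cluster-config ../config/cluster.json \
--cluster "sbatch --account {cluster.account} --partition {cluster.partition} --time {cluster.time} --mem {cluster.mem}"
```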

## 9. Modify the run scripts
## 10. Modify the run scripts

Set the singularity bind location to a directory that contains your pipeline working directory with the `--singularity-args '-B'` flag. Set the number maximum number of cores to be used with the `--cores` flag and the maximum amount of memory to be used (in megabytes) with the `resources mem_mb=` flag. If running GPU accelerated, also set the maximum number of GPU's to be used with the `--resources gpu=` flag. For example:
Set the Singularity bind location to a directory that contains your pipeline working directory with the `--singularity-args '-B'` flag. Set the maximum number of jobs to be deployed with the `--jobs` flag and the maximum amount of memory to be used (in megabytes) with the `--resources mem_mb=` flag. If running GPU accelerated, also set the maximum number of GPUs to be used with the `--resources gpu=` flag. For example:

Dry run (dryrun_hpc.sh):

```bash
#!/bin/bash -x

snakemake \
--dryrun \
--cores 32 \
--jobs 32 \
--resources mem_mb=150000 \
--resources gpu=2 \
--use-conda \
```

@@ -389,8 +414,10 @@ snakemake \

Full run (run_hpc.sh):

```bash
#!/bin/bash -x

snakemake \
--cores 32 \
--jobs 32 \
--resources mem_mb=150000 \
--resources gpu=2 \
--use-conda \
Expand All @@ -407,15 +434,16 @@ snakemake \

See the [snakemake documentation](https://snakemake.readthedocs.io/en/v4.5.1/executable.html#all-options) for additional run parameters.

## 10. Create and activate a conda environment with python and snakemake installed
## 11. Create and activate a conda environment with software for running the pipeline

This installs [snakemake](https://snakemake.github.io/) and its dependencies.

```bash
cd ./workflow/
mamba env create -f pipeline_run_env.yml
mamba env create -f ./envs/pipeline_run_env.yaml
conda activate pipeline_run_env
```
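
A quick check that the environment activated correctly:

```bash
# Should print the Snakemake version installed into pipeline_run_env
snakemake --version
```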

## 11. Run the pipeline
## 12. Run the pipeline

First carry out a dry run

@@ -429,24 +457,25 @@ If there are no issues, start a full run

```bash
bash run_hpc.sh
```
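
Since the non-GPU accelerated rules are submitted to Slurm, the usual Slurm tools can be used to keep an eye on the run, for example:

```bash
# Jobs currently queued/running under your user
squeue -u $USER
# Resource usage and state of a finished job (replace <jobid>)
sacct -j <jobid> --format=JobID,JobName,Elapsed,MaxRSS,State
```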

## 12. Evaluate the pipeline run
## 13. Evaluate the pipeline run

Generate an interactive html report

```bash
bash report.sh
```
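
The report is an html file, so if the pipeline was run on a remote HPC you can copy it back to your local machine to view it in a browser; for example (hypothetical host and path):

```bash
# Run from your local machine; adjust the username, host and path to your setup
scp myuser@hpc.example.org:/path/to/vcf_annotation_pipeline/report.html .
```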

## 13. Commit and push to your forked version of the github repo
## 14. Commit and push to your forked version of the github repo

To maintain reproducibility, commit and push (see the example commands after this list):

- All documentation
- All configuration files
- All run scripts
- The final report
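
A minimal sketch of what this might look like, assuming you are working on the `dev` branch of your fork (adjust the paths to whatever you actually changed and wherever the report was written):

```bash
git add docs/ config/ *.sh report.html
git commit -m "Re-run analysis with updated parameters"
git push origin dev
```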

## 14. Repeat step 13 each time you re-run the analysis with different parameters
## 15. Repeat step 14 each time you re-run the analysis with different parameters

## 15. Raise issues, create feature requests or create a pull request with the [upstream repo](https://github.com/ESR-NZ/vcf_annotation_pipeline) to merge any useful changes to the pipeline (optional)
## 16. Raise issues, create feature requests or create a pull request with the [upstream repo](https://github.com/ESR-NZ/vcf_annotation_pipeline) to merge any useful changes to the pipeline (optional)

See [the README](https://github.com/ESR-NZ/vcf_annotation_pipeline/blob/dev/README.md#contribute-back) for info on how to contribute back to the pipeline!