
Long read assembly pipeline


1. Guppy basecalling

We used Guppy version 5.0.7 to basecall the fast5 files produced by the MinION runs. It was run on NeSI's GPU partition.

Script for Guppy

#!/bin/bash -e

#SBATCH --job-name=Guppy_EW                 
#SBATCH --account=uoo02752              
#SBATCH --time=10:00:00                
#SBATCH --partition=gpu                 #Guppy runs faster on NeSI's GPU partition than on the other partitions
#SBATCH --gres=gpu:1                    #GPU resource request for the GPU partition, as suggested by NeSI support
#SBATCH --mem=6G                                
#SBATCH --ntasks=4                              
#SBATCH --cpus-per-task=1               
#SBATCH --output=%x-%j.out              
#SBATCH --error=%x-%j.err               
#SBATCH --mail-type=ALL
#SBATCH [email protected]

module load ont-guppy-gpu/5.0.7
# --recursive also picks up fast5 files in subfolders.
# --disable_qscore_filtering turns off quality filtering so all fastq files end up in one
# folder instead of being split into pass/fail; we do QC later.
guppy_basecaller -i ../path/to/all/fast5/files/ \
                 -s ./path/to/output/directory/ \
                 --flowcell FLO-MIN106 \
                 --kit SQK-LSK109 \
                 --num_callers 4 -x auto \
                 --recursive \
                 --trim_barcodes --disable_qscore_filtering

Then we concatenated all the fastq files produced into a single file, EW_nanopore_merged.fastq, as shown below.
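A minimal sketch of this step, assuming the fastq files were written to the Guppy output directory used above (paths are placeholders):

# merge every fastq produced by Guppy into a single file
cat ./path/to/output/directory/*.fastq > EW_nanopore_merged.fastq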

2. PycoQC

We used PycoQC to check the quality of our long-read data. It produces an interactive HTML file with a detailed description of data quality. It takes the sequencing_summary.txt file generated by Guppy as input.

Script for PycoQC

#!/bin/bash -e

#SBATCH --nodes 1
#SBATCH --cpus-per-task 1
#SBATCH --ntasks 5
#SBATCH --job-name PycoQc
#SBATCH --mem=2G
#SBATCH --time=00:20:00
#SBATCH --account=uoo02752
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --mail-type=ALL
#SBATCH [email protected]
#SBATCH --hint=nomultithread

export PATH="/nesi/nobackup/uoo02752/nematode/bin/miniconda3/bin:$PATH"

#sequencing_summary.txt is produced by Guppy in its output folder
pycoQC -f path/to/sequencing_summary.txt \
       -o path/to/output/file/pycoQC_output.html

3. NanoLyse

We used the DNA CS (lambda DNA control) during nanopore sequencing library preparation, so we used NanoLyse to remove lambda DNA reads from our fastq file.

Script for Nanolyse

#!/bin/bash -e

#SBATCH --nodes 1
#SBATCH --cpus-per-task 1
#SBATCH --ntasks 5
#SBATCH --job-name Nanolyse.EW
#SBATCH --mem=1G
#SBATCH --time=01:00:00
#SBATCH --account=uoo02752
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --mail-type=ALL
#SBATCH [email protected]
#SBATCH --hint=nomultithread

module load NanoLyse/1.2.0-gimkl-2020a

cat path/to/EW_nanopore_merged.fastq | NanoLyse | gzip > EW_nanopore_merged_filtered.fastq.gz

4. Porechop

Guppy only removes adapters attached to the ends of reads, so we ran Porechop on top of that to remove any adapters remaining in the middle of reads.

Script for Porechop

#!/bin/bash -e

#SBATCH --nodes 1
#SBATCH --cpus-per-task 1
#SBATCH --ntasks 10
#SBATCH --job-name PorechopEW
#SBATCH --mem=40G
#SBATCH --time=02:00:00
#SBATCH --account=uoo02752
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --mail-type=ALL
#SBATCH [email protected]
#SBATCH --hint=nomultithread

module load Porechop/0.2.4-gimkl-2020a-Python-3.8.2

porechop -i /path/to/EW_nanopore_merged_filtered.fastq.gz -o /path/to/output/EW_nanopore_merged_filtered_porechop.fastq.gz --threads 10

5. NanoQC

NanoQC produces an HTML report on the quality of the reads.

Script for NanoQC

#!/bin/bash -e

#SBATCH --nodes 1
#SBATCH --cpus-per-task 1
#SBATCH --ntasks 10
#SBATCH --job-name NanoQC.EW
#SBATCH --mem=10G
#SBATCH --time=03:00:00
#SBATCH --account=uoo02752
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --mail-type=ALL
#SBATCH [email protected]
#SBATCH --hint=nomultithread

export PATH="/nesi/nobackup/uoo02752/nematode/bin/miniconda3/bin:$PATH"   #NanoQC was installed in a conda environment

nanoQC /path/to/EW_nanopore_merged_filtered_porechop.fastq.gz -o /path/to/output/directory

6. Flye

To assemble the genome from long reads, we tried several of the available long-read genome assemblers, and Flye performed the best. We also tried different versions of Flye; v2.7.1 (not the latest) produced a better assembly than newer releases, so we ran Flye v2.7.1 with three iterations of polishing.

Script for Flye

#!/bin/bash -e

#SBATCH --nodes 1
#SBATCH --cpus-per-task 10              #matches the 10 threads requested with -t below
#SBATCH --job-name Flye.EW.v7
#SBATCH --mem=80G
#SBATCH --time=5-00:00:00
#SBATCH --account=uoo02752
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --mail-type=ALL
#SBATCH [email protected]
#SBATCH --hint=nomultithread

module load Flye/2.7.1-gimkl-2020a-Python-2.7.18

#Flye 2.7.1 does not have the --nano-hq preset (added in v2.9); use --nano-raw for raw nanopore reads
flye --nano-raw /path/to/EW_nanopore_merged_filtered_porechop.fastq.gz -o /path/to/output/directory-EW_Flye.v2.7 -t 10 -i 3