The following steps cover installation of the tools required for performing the de novo assembly of the Fat-tailed Dunnart genome using Falcon on Pawsey's HPC system Zeus, starting from a standard sign-in to Zeus. The original Falcon documentation that this workflow is derived from can be found here.
On Zeus, you will need to run your jobs on a work node; to do so, start an interactive session as follows:
> salloc -n 1 -t 1:00:00
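If jobs under your Pawsey allocation require an explicit project account, you can pass it with the standard Slurm --account flag (whether this is needed depends on your account configuration):
> salloc --account=$PAWSEY_PROJECT -n 1 -t 1:00:00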
The Dunnart repository contains the scripts and config files required for running the workflow. Clone it to your current working directory, e.g. /group/$PAWSEY_PROJECT/$USER:
> cd /group/$PAWSEY_PROJECT/$USER
> git clone https://github.com/AustralianBioCommons/falcon.git
> cd falcon
You will need both the fasta and bam formats of the raw sequencing data. Download them to your current working directory, which should be the falcon/ directory.
Set the raw-data filenames in the falcon-conda.sh script by running the commands below, replacing YOUR_FASTA_FILE_NAME and YOUR_BAM_FILE_NAME with your actual filenames.
Note: Your fasta and bam filenames should have the suffixes .subreads.fasta.gz and .subreads.bam, respectively.
> sed -i "s|F1_bull_test.subreads.fasta.gz|YOUR_FASTA_FILE_NAME|g" falcon-conda.sh
> sed -i "s|F1_bull_test.subreads.bam|YOUR_BAM_FILE_NAME|g" falcon-conda.sh
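For example, with hypothetical raw files named dunnart_sample1.subreads.fasta.gz and dunnart_sample1.subreads.bam, the commands would be:
> sed -i "s|F1_bull_test.subreads.fasta.gz|dunnart_sample1.subreads.fasta.gz|g" falcon-conda.sh
> sed -i "s|F1_bull_test.subreads.bam|dunnart_sample1.subreads.bam|g" falcon-conda.sh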
Set your HiC filename in the Nextflow main.nf script by running the command below, replacing YOUR_HIC_FILE_NAME with your HiC filename glob; it should look similar to sample1.HiC_R*.fastq.gz. The * in the sed pattern is escaped so that sed matches it literally.
Note: The files should have the suffixes .HiC_R1.fastq.gz and .HiC_R2.fastq.gz.
> sed -i "s|F1_bull_test.HiC_R\*.fastq.gz|YOUR_HIC_FILE_NAME|g" main.nf
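As a quick sanity check (a convenience suggestion, not part of the original workflow), confirm that the substitutions took effect:
> grep -n "subreads" falcon-conda.sh
> grep -n "HiC_R" main.nf
Each command should print lines containing your filenames rather than the F1_bull_test placeholders.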
Run the FALCON script falcon-conda.sh.
> bash falcon-conda.sh
Once the FALCON set-up script has finished running, exit the session.
> exit
Run FALCON with the sbatch script as follows:
> sbatch --account=$PAWSEY_PROJECT sbatch_nextflow.sh
While FALCON is running, you can check the progress of your jobs.
Jobs left:
> ls 0-rawreads/daligner-chunks/ | wc -l
Jobs completed:
> find 0-rawreads/daligner-runs/j_*/uow-00 -name "daligner.done" | wc -l
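If you would like these counts refreshed automatically, one option is a watch one-liner (a convenience sketch using the standard watch utility; adjust the interval to taste):
> watch -n 60 'echo "left: $(ls 0-rawreads/daligner-chunks/ | wc -l)  done: $(find 0-rawreads/daligner-runs/j_*/uow-00 -name daligner.done | wc -l)"'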
Stats for reads and pre-assembled reads:
> singularity exec pb-assembly_0.0.8.sif DBstats 0-rawreads/build/raw_reads.db
> singularity exec pb-assembly_0.0.8.sif DBstats 1-preads_ovl/build/preads.db
Check pre-assembly performance:
> cat 0-rawreads/report/pre_assembly_stats.json
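The report is plain JSON, so you can pretty-print it with Python's built-in json.tool module (a convenience suggestion):
> python -m json.tool 0-rawreads/report/pre_assembly_stats.json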
Check assembly performance:
> python get_asm_stats.py 2-asm-falcon/p_ctg.fasta
Check haplotype resolution:
> python get_asm_stats.py 3-unzip/all_p_ctg.fasta
> python get_asm_stats.py 3-unzip/all_h_ctg.fasta
> head 3-unzip/all_h_ctg.paf
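The .paf file is in the standard PAF format, where column 1 is the query (haplotig) name and column 2 is its length. For a quick overview of the largest haplotigs, you could run (a convenience sketch):
> awk '{print $1, $2}' 3-unzip/all_h_ctg.paf | sort -k2,2nr | head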
Check phase polishing:
> python get_asm_stats.py 4-polish/cns-output/cns_p_ctg.fasta
> python get_asm_stats.py 4-polish/cns-output/cns_h_ctg.fasta
See haplotig placement file:
> head 5-phase/placement-output/haplotig.placement
See final output stats:
> python get_asm_stats.py 5-phase/output/phased.0.fasta
> python get_asm_stats.py 5-phase/output/phased.1.fasta
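To collect all of the above assembly stats in one pass, a small shell loop works (a convenience sketch over the same paths used above):
> for f in 2-asm-falcon/p_ctg.fasta 3-unzip/all_p_ctg.fasta 3-unzip/all_h_ctg.fasta 4-polish/cns-output/cns_p_ctg.fasta 4-polish/cns-output/cns_h_ctg.fasta 5-phase/output/phased.0.fasta 5-phase/output/phased.1.fasta; do echo "== $f"; python get_asm_stats.py "$f"; done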
We encountered a bug in the 2-asm-falcon ovlp_filtering stage, where preads.m4 had an erroneous '---' at the end of the file. We fixed it by following this GitHub issue: PacificBiosciences/pbbioconda#294. This step is now handled automatically in the Nextflow pipeline.
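If you hit the same problem when running the stages manually, the workaround described in that issue amounts to deleting the stray '---' line from preads.m4 before the filtering step, e.g. (a sketch; the location of preads.m4 in your run directory may differ, so inspect the file first):
> sed -i '/^---$/d' preads.m4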
Falcon depends on the pbgcpp package. You might find that the software version needs to be changed, depending on the chemistry used for your sequencing. If the pbgcpp version clashes with the sequencing chemistry, you will encounter an error in the quiver-run step, specifically something along the lines of the following:
20211209 03:48:25.313 -|- FATAL -|- Run -|- 0x2af6570cb7c0|| -|- gcpp ERROR: [pbbam] chemistry compatibility ERROR: unsupported sequencing chemistry combination:
binding kit: 100-862-200
sequencing kit: 100-861-800
basecaller version: 4.0.0.189308
See SMRT Link release documentation for details about consumables compatibility or contact PacBio Technical Support.
It might not be obvious which version of pbgcpp is required. In this case, you can test each version, working backwards from the most recent, until you find one that works. The steps would be as follows:
> conda activate pb-assembly
> conda install -c bioconda pbgcpp=2.0.2
[test falcon unzip; if the same error occurs, roll back to the next version]
> conda install -c bioconda pbgcpp=2.0.0
[test falcon unzip; if the same error occurs, roll back to the next version]
> conda install -c bioconda pbgcpp=1.9.0
[test falcon unzip; if the same error occurs, roll back to the next version]
> conda install -c bioconda pbgcpp=1.0.0
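If you prefer to script the rollback, a minimal sketch is below. Here test_unzip.sh is a hypothetical wrapper that reruns the falcon unzip step and exits non-zero on the chemistry error; the loop stops at the first version that works:
> for v in 2.0.2 2.0.0 1.9.0 1.0.0; do conda install -y -c bioconda pbgcpp=$v && bash test_unzip.sh && break; done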
Falcon works best on smaller genomes, less than 0.7 Gbp. Below this size, it tends to run with a reasonable wall time (<24 hours) and creates <500,000 small intermediate files for the fc_run and fc_unzip steps combined. Falcon struggled with larger reference datasets, either exceeding the file number limit on the /scratch partition (1,000,000 files per user at Pawsey) or taking so long on each step that the total job time would run into weeks or months. This appears to be linked to the inefficiency of reading and writing such a large number of small files. HiFi data is meant to reduce some of the computational overhead and may therefore allow Falcon to be used for larger genomes; this remains to be tested in our hands, as of Jan 2022.
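To see how close you are to the file number limit, you can query your Lustre quota (lfs quota is a standard Lustre utility; the exact mount point on your system may differ):
> lfs quota -u $USER /scratch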
Exemplar 1:
Parameter | Data |
---|---|
Genome size (bp) | 2571010 |
Wall time | 11:00:36 |
CPU time | 23:55:06 |
Memory Efficiency | 100.61% of 112.00 GB |
Exemplar 2: Saccharomyces cerevisiae (yeast), 28 cores, 112GB memory, fc_run and fc_unzip steps combined
Parameter | Data |
---|---|
Genome size (bp) | 12000000 |
Wall time | 01:46:50 |
CPU time | 03:54:12 |
Memory Efficiency | 97.44% of 112.00 GB |
Exemplar 3: Anabas testudineus (climbing perch), 28 cores, 112GB memory, fc_run and fc_unzip steps combined
Parameter | Data |
---|---|
Genome size (bp) | 660000000 |
Wall time | 09:52:33 |
CPU time | 3-13:28:05 |
Memory Efficiency | 21.53% of 112.00 GB |
Number of files produced | 485,086 |