An efficient assembly toolkit for organellar genomes
If you encounter any problems while using PMAT2, please e-mail the authors (Changwei Bi: [email protected]; Fuchuan Han: [email protected]) to join the WeChat group. Please include your name, organization, and "PMAT2" in the message.
Install using git
git clone https://github.com/aiPGAB/PMAT2
cd PMAT2
make
./PMAT --help
Install by downloading the source code
wget https://github.com/aiPGAB/PMAT2/archive/refs/tags/v2.1.0.tar.gz
tar -zxvf v2.1.0.tar.gz
cd PMAT2-2.1.0
make
./PMAT --help
- BLASTn > 2.2.29 needs to be installed in PATH.
- Singularity or Apptainer is required for PMAT2. Installation instructions are available in the Singularity and Apptainer documentation.
- Canu > v2.0 or NextDenovo is required for CLR or ONT sequencing data.
- zlib needs to be installed in PATH.
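The requirements above can be checked before building. The following is a minimal sketch, not part of PMAT2: the helper names `check_tool` and `version_gt` are hypothetical, the executable names `canu` and `nextDenovo` are assumptions about how those tools are installed on your system, and `sort -V` assumes GNU coreutils.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a command is visible on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
    fi
}

# Hypothetical helper: succeed when version $1 is strictly greater than $2.
# Relies on "sort -V" (version sort) from GNU coreutils.
version_gt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# Executable names below are assumptions; adjust them to your installation.
for tool in blastn singularity apptainer canu nextDenovo; do
    check_tool "$tool"
done

# "blastn -version" prints a first line such as "blastn: 2.12.0+".
if command -v blastn >/dev/null 2>&1; then
    blast_version="$(blastn -version | awk 'NR==1 {sub(/\+$/, "", $2); print $2}')"
    if version_gt "$blast_version" "2.2.29"; then
        echo "BLASTn $blast_version satisfies > 2.2.29"
    else
        echo "BLASTn $blast_version is too old (need > 2.2.29)"
    fi
fi
```

Only one of Singularity or Apptainer needs to be reported as found, and Canu or NextDenovo only matters for CLR or ONT data.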
Run PMAT autoMito --help to view the usage guide.
Usage: PMAT autoMito [-i INPUT] [-o OUTPUT] [-t SEQTYPE] [options]
Example:
PMAT autoMito -i hifi.fastq.gz -o hifi_assembly -t hifi -m -T 8
PMAT autoMito -i ont.fastq.gz -o ont_assembly -t ont -S nextdenovo -C canu -N nextdenovo
PMAT autoMito -i clr.fastq.gz -o clr_assembly -t clr -S canu -C canu
Required options:
-i, --input Input sequence file (fasta/fastq)
-o, --output Output directory
-t, --seqtype Sequence type (hifi/ont/clr)
Optional options:
-k, --kmer kmer size for estimating genome size (default: 31)
-g, --genomesize Genome size (g/m/k), skip genome size estimation if set
-p, --task Task type (0/1), skip error correction for ONT/CLR by selecting 0, otherwise 1 (default: 1)
-G, --organelles Genome organelles (mt/pt/all, default: mt)
-x, --taxo Specify the organism type (0/1), 0: plants, 1: animals (default: 0)
-S, --correctsoft Error correction software (canu/nextdenovo, default: nextdenovo)
-C, --canu Canu path
-N, --nextdenovo NextDenovo path
-n, --cfg Config file for nextdenovo (default: temprun.cfg)
-F, --factor Subsample factor (default: 1)
-D, --subseed Random number seeding when extracting subsets (default: 6)
-K, --breaknum Break long reads (>30k) with this (default: 20000)
-I, --minidentity Set minimum overlap identity (default: 90)
-L, --minoverlaplen Set minimum overlap length (default: 40)
-T, --cpu Number of threads (default: 8)
-m, --mem Keep sequence data in memory to speed up computation
-h, --help Show this help message and exit
Notes:
- Make sure BLASTn was installed in PATH.
- If you want to use nextdenovo for ONT/CLR error correction, you can skip providing a cfg file, and the program will generate a temporary cfg file automatically.
-k: If seqtype is hifi, kmer frequency estimation and genome size estimation are skipped.
-m: Keep sequence data in memory to speed up computation.
-I: Minimum overlap identity; the default is 90. If the assembly graph is complex, you can increase it appropriately.
-L: Minimum overlap length; the default is 40. For HiFi data, you can increase it appropriately.
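As an illustration of how the options above fit together, here is a small sketch that validates the sequence type and prints an autoMito command line. The function `automito_cmd` is a hypothetical wrapper, not part of PMAT2; it only uses flags listed in the help text (`-i`, `-o`, `-t`, `-T`, `-I`, `-L`) and their documented defaults.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: check the sequence type, then print the command.
# Arguments: input, output dir, seqtype, threads, min identity, min overlap length.
automito_cmd() {
    local input="$1" outdir="$2" seqtype="$3"
    local threads="${4:-8}" identity="${5:-90}" overlap="${6:-40}"
    case "$seqtype" in
        hifi|ont|clr) ;;
        *)
            echo "error: seqtype must be hifi, ont, or clr" >&2
            return 1
            ;;
    esac
    echo "PMAT autoMito -i $input -o $outdir -t $seqtype -T $threads -I $identity -L $overlap"
}

# Print the command with default thresholds for a HiFi run.
automito_cmd hifi.fastq.gz hifi_assembly hifi 8
```

Printing the command instead of running it makes it easy to review the thresholds first; pipe the output to `sh` once it looks right.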
If PMAT fails to generate the assembly graph in 'autoMito' mode, you can use this command to manually select seeds for assembly.
Run PMAT graphBuild --help to view the usage guide.
Usage: PMAT graphBuild [-i SUBSAMPLE] [-a ASSEMBLY] [-o OUTPUT] [options]
Example:
PMAT graphBuild -i assembly_test1/subsample -a assembly_test1/assembly_result -o graphBuild_result -s 1 312 356 -T 8
PMAT graphBuild -i assembly_test1/subsample -a assembly_test1/assembly_result -o graphBuild_result -d 5 -s 1 312 356 -T 8
Required options:
-i, --subsample Input subsample directory (assembly_test1/subsample)
-a, --graphinfo Input assembly result directory (assembly_test1/assembly_result)
-o, --output Output directory
Optional options:
-G, --organelles Genome organelles (mt: mitochondria/pt: plastid, default: mt)
-x, --taxo Specify the organism type (0/1), 0: plants, 1: animals (default: 0)
-d, --depth Contig depth threshold
-s, --seeds ContigID for extending. Multiple contigIDs should be separated by space. For example: 1 312 356
-T, --cpu Number of threads (default: 8)
-h, --help Show this help message and exit
Notes:
- Make sure BLASTn was installed in PATH.
-i: assembly_test1/subsample generated by the autoMito command.
-a: assembly_test1/assembly_result generated by the autoMito command.
-s: Manually select the seeds for the extension. Separate different seed IDs with spaces, e.g. 1 312 356.
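Because `-s` expects space-separated contig IDs, a seed list copied from a spreadsheet or a comma-separated note needs converting first. A minimal sketch follows; `normalize_seeds` is a hypothetical helper name, and the directories are the ones used in the example above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn "1,312,356" or "1, 312, 356" into "1 312 356".
normalize_seeds() {
    # tr swaps commas for spaces; xargs collapses repeated whitespace.
    echo "$1" | tr ',' ' ' | xargs
}

seeds="$(normalize_seeds "1,312,356")"

# Print the resulting graphBuild command for review.
echo "PMAT graphBuild -i assembly_test1/subsample -a assembly_test1/assembly_result -o graphBuild_result -s $seeds -T 8"
```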
## download the dataset
wget https://github.com/bichangwei/PMAT/releases/download/v1.1.0/Arabidopsis_thaliana_550Mb.fa.gz
## run autoMito command
PMAT autoMito -i Arabidopsis_thaliana_550Mb.fa.gz -o ./test1 -t hifi -m
## run graphBuild command (when autoMito fails)
PMAT graphBuild -i ./test1/subsample/ -a ./test1/assembly_result/ -o ./test1_gfa -s 1 2 3 -d 5
The PMAT_orgAss.txt file contains the following information:
==========================================================
Mitochondrial Assembly Assessment
==========================================================
Basic Statistics:
----------------------------------------------------------
Total contigs: 16
Total length: 367.8 kb
Average depth: 28.4 x
Total genes found: 24/24 (100.0%)
Duplicated contigs: 3
Per-contig Details:
----------------------------------------------------------
Contig ID Genes Gene List
----------------------------------------------------------
300 4 atp1,cox1,nad1,nad2
2150 1 atp6
908 4 atp9,ccmB,cox2,nad9
1221 2 atp4,nad4L
729 4 ccmC,ccmFn,cox3,nad3
727 1 nad3
1524 1 atp9
2150 1 atp6
749 6 atp8,matR,mttB,na...
298 3 ccmFc,cob,nad6
----------------------------------------------------------
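The "Total genes found" line in the report above can also be checked from a script. The sketch below assumes the report keeps the line format shown here and sits at ./test1/PMAT_orgAss.txt (the output directory used in this demo); `genes_found` and `is_complete` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<found> <total>" from the "Total genes found" line.
genes_found() {
    sed -n 's/^[[:space:]]*Total genes found:[[:space:]]*\([0-9][0-9]*\)\/\([0-9][0-9]*\).*/\1 \2/p' "$1"
}

# Hypothetical helper: succeed only when every expected gene was found.
is_complete() {
    local counts found total
    counts="$(genes_found "$1")"
    [ -n "$counts" ] || return 1
    found="${counts% *}"
    total="${counts#* }"
    [ "$found" -eq "$total" ]
}

# Assumed report location: the -o directory from the demo command.
report="./test1/PMAT_orgAss.txt"
if [ -f "$report" ]; then
    if is_complete "$report"; then
        echo "all expected genes found"
    else
        echo "some genes missing: $(genes_found "$report")"
    fi
fi
```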
## download the dataset
wget https://github.com/bichangwei/PMAT/releases/download/v1.1.0/Malus_domestica.540Mb.fasta.gz
## run autoMito command
PMAT autoMito -i Malus_domestica.540Mb.fasta.gz -o ./test2 -t hifi -m
## run graphBuild command (when autoMito fails)
PMAT graphBuild -i ./test2/subsample/ -a ./test2/assembly_result/ -o ./test2_gfa -s 10 20 30 -d 5
The PMAT_orgAss.txt file contains the following information:
==========================================================
Mitochondrial Assembly Assessment
==========================================================
Basic Statistics:
----------------------------------------------------------
Total contigs: 4
Total length: 397.0 kb
Average depth: 31.1 x
Total genes found: 24/24 (100.0%)
Duplicated contigs: 1
Per-contig Details:
----------------------------------------------------------
Contig ID Genes Gene List
----------------------------------------------------------
1 20 atp1,atp4,atp8,at...
2 6 atp6,atp9,matR,na...
----------------------------------------------------------
- Download tested CLR data for Phaseolus vulgaris using IBM Aspera:
ascp -v -QT -l 400m -P33001 -k1 -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR291/006/SRR2912756/SRR2912756_subreads.fastq.gz .
- then run the autoMito command for one-click assembly (CLR):
PMAT autoMito -i SRR2912756_subreads.fastq.gz -o ./test_clr -t clr -N path/nextDenovo -m
- Download tested ONT data for Populus deltoides using IBM Aspera:
ascp -v -QT -l 400m -P33001 -k1 -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR122/038/SRR12202038/SRR12202038_1.fastq.gz .
- then run the autoMito command for one-click assembly (ONT):
PMAT autoMito -i SRR12202038_1.fastq.gz -o ./test_ont -t ont -S canu -C path/canu -m
Dataset | Size | Options | Run time | Coverage |
---|---|---|---|---|
Arabidopsis thaliana | 550Mb | -T 50 | 6m27s | 4x |
Arabidopsis thaliana | 550Mb | -T 50 -m | 6m38s | 4x |
Malus domestica | 540Mb | -T 50 | 7m38s | <1x |
Malus domestica | 540Mb | -T 50 -m | 7m19s | <1x |
Juncus effusus | 216Mb | -T 50 | 4m56s | <1x |
Juncus effusus | 216Mb | -T 50 -m | 4m48s | <1x |
output_dir/
├── assembly_result/
│ ├── PMATAllContigs.fna # Assembly contigs
│ └── PMATContigGraph.txt # Contig relationships
├── gfa_result/
│ ├── PMAT_mt_raw.gfa # Initial mitogenome graph
│ ├── PMAT_mt_main.gfa # Optimized mitogenome graph
│ ├── PMAT_mt.fasta # Final mitogenome assembly
│ ├── PMAT_pt_raw.gfa # Initial chloroplast graph
│ ├── PMAT_pt_main.gfa # Optimized chloroplast graph
│ └── PMAT_pt_main.fa # Final chloroplast assembly
├── gkmer_result/
│ ├── gkmer_histo.txt # Kmer frequency
│ └── summary.txt # Genome size estimation
├── subsample/
│ └── PMAT_cut_seq.fa # Subsampled reads for assembly
└── PMAT_orgAss.txt # Organellar assembly assessment
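To confirm that a run produced the layout above, the expected paths can be tested one by one. This is a sketch under assumptions: `check_outputs` is a hypothetical helper, it only covers the mitochondrial files from the tree (the default `-G mt`), and ./test1 stands in for your own output directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which expected files exist under an output directory.
# Succeeds only when none of them is missing.
check_outputs() {
    local outdir="$1" missing=0 f
    for f in \
        assembly_result/PMATAllContigs.fna \
        assembly_result/PMATContigGraph.txt \
        gfa_result/PMAT_mt_raw.gfa \
        gfa_result/PMAT_mt_main.gfa \
        subsample/PMAT_cut_seq.fa \
        PMAT_orgAss.txt
    do
        if [ -e "$outdir/$f" ]; then
            echo "ok      $f"
        else
            echo "missing $f"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

# ./test1 is an assumed output directory; replace it with the value passed to -o.
check_outputs ./test1 || echo "some expected files are missing under ./test1"
```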
PMAT version 2.0.1 (24/11/21)
Updates:
- Optimized the assembly strategy for organellar genomes, enabling faster and more accurate capture of organellar genome sequences.
- Implemented the assembly of animal and plant organellar genomes.
- Enhanced the genome graph untangling functionality for organellar genomes, enabling resolution of more complex structures.
- Parallelized key steps in the workflow, significantly improving runtime efficiency.
PMAT version 2.0.1 (25/2/1)
Updates:
- Added the orgAss module to evaluate the completeness of the assembly results.
Bi C, Shen F, Han F, Qu Y, et al. (2024). PMAT: an efficient plant mitogenome assembly toolkit using ultra-low coverage HiFi sequencing data. Horticulture Research, uhae023. https://doi.org/10.1093/hr/uhae023
Bi C, Qu Y, Hou J, Wu K, Ye N, and Yin T. (2022). Deciphering the multi-chromosomal mitochondrial genome of Populus simonii. Front. Plant Sci. 13:914635. https://doi.org/10.3389/fpls.2022.914635