PBSIM3 is a tool for simulating PacBio and ONT (Oxford Nanopore Technologies) reads. This guide will walk you through the installation process and provide an example of basic usage.
PBSIM3 requires the following dependencies:
- GNU make: Typically installed on most Unix systems.
- GCC or compatible compiler: For compiling the source code.
- zlib: Necessary for decompression in read simulation.
You may install these dependencies using a package manager such as apt (for Ubuntu) or yum (for CentOS), if they are not already available.
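Before building, it can help to confirm that the build tools are on your PATH. The following is a minimal sketch and not part of PBSIM3: `check_tool` is a hypothetical helper name, and the check covers only make and gcc, because zlib is a library rather than a command and cannot be detected this way.

```shell
# Report whether a command is available on PATH.
# check_tool is an illustrative helper, not something PBSIM3 provides.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
    fi
}

# Check the two build tools named in the dependency list.
for tool in make gcc; do
    check_tool "$tool"
done
```

If something is reported missing, the packages to install are typically build-essential and zlib1g-dev on Ubuntu, or gcc, make, and zlib-devel on CentOS.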
To install PBSIM3, follow these steps:
- Clone the repository:
    git clone https://github.com/yukiteruono/pbsim3.git
    cd pbsim3
- Configure the installation directory by replacing PREFIX with your desired installation path. For example, to install in /usr/local, use --prefix=/usr/local:
    ./configure --prefix=PREFIX
- Compile and install the program:
    make
    make install
- After installation, the pbsim executable will be located in the pbsim3/bin directory under the specified PREFIX path.
The main executable for PBSIM3 is pbsim, which allows you to simulate PacBio and ONT reads. Below is a basic example to simulate reads from a FASTA file.
./pbsim
To simulate reads from a reference genome in FASTA format:
pbsim [options]
--prefix: Set output file prefix (default: "sd").
--id-prefix: Prefix for read IDs (default: "S").
--seed: Seed for the random number generator (default: Unix time).
--strategy: Set sequencing strategy to wgs (whole genome sequencing).
--genome: Input genome file in FASTA format.
--depth: Set coverage depth (default: 20.0).
--length-min/--length-max: Minimum and maximum read length (default: 100, 1,000,000).
--strategy: Set to trans for transcriptome sequencing.
--transcript: Input transcript file in PBSIM3's own ("original") format.
--length-min/--length-max: Minimum and maximum read length (default: 100, 1,000,000).
--strategy: Set to templ for template sequencing.
--template: Input template file in FASTA format.
--method: Set to qshmm for quality score modeling.
--qshmm: Quality score model file.
--length-mean/--length-sd: Mean and standard deviation of read length (default: 9000, 7000).
--accuracy-mean: Mean accuracy (default: 0.85).
--pass-num: Number of sequencing passes (default: 1).
--difference-ratio: Error ratio as substitution:insertion:deletion (default: 6:55:39).
--hp-del-bias: Bias in homopolymer deletions (default: 1).
--method: Set to errhmm for error modeling.
--errhmm: Error model file.
Other options are similar to those in the quality score model.
--sample: Input FASTQ sample file.
--sample-profile-id: Profile ID for sampled reads.
--accuracy-min/--accuracy-max: Minimum and maximum accuracy (default: 0.75, 1.0).
--difference-ratio: Error ratio as substitution:insertion:deletion for the sample-based method (default: 6:55:39).
--hp-del-bias: Bias in homopolymer deletions for sample-based reads (default: 1).
./pbsim --strategy wgs --method qshmm --qshmm pbsim3/data/QSHMM-RSII.model --depth 15 --genome hg38.fasta --pass-num 7 --prefix hg38
- --strategy wgs
  - Specifies the simulation strategy as Whole Genome Sequencing (WGS).
  - The tool will simulate reads from the entire genome.
- --method qshmm
  - Indicates the use of the Quality Score Hidden Markov Model (QSHMM) to simulate quality scores and errors in the reads.
- --qshmm pbsim3/data/QSHMM-RSII.model
  - Provides the quality score model file. In this case, the model file 'QSHMM-RSII.model' (included in PBSIM3's data folder) is specific to the PacBio RS II sequencer.
- --depth 15
  - Specifies the sequencing depth as 15x. This means that, on average, each base of the input genome will be covered by 15 reads.
- --genome hg38.fasta
  - Sets the input genome file to 'hg38.fasta'. This is the reference genome in FASTA format from which reads will be simulated.
- --pass-num 7
  - Specifies the number of sequencing passes per read. A higher pass count often improves consensus accuracy for sequencing methods like PacBio.
- --prefix hg38
  - Defines the prefix for output files. All output files generated by the simulation will start with 'hg38'.
This command simulates PacBio RS II sequencing reads based on the human genome ('hg38.fasta') with:
- A depth of 15x coverage.
- Quality scores modeled using QSHMM.
- Seven sequencing passes per read (the same segment is read 7 times), which affects read accuracy.
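To get a feel for what 15x depth implies, the arithmetic can be sketched in the shell. This is a rough estimate under stated assumptions: the genome size of 3.1 billion bases is an assumed round figure for hg38, the mean read length of 9000 is the --length-mean default listed earlier, and the variable names are illustrative. It ignores the effect of --length-min/--length-max on the length distribution and says nothing about how multiple passes are counted.

```shell
# Assumed inputs (not read from any PBSIM3 file):
genome_size=3100000000      # approximate hg38 size in bases (assumption)
depth=15                    # value passed to --depth
mean_read_length=9000       # default --length-mean from the option list

# Depth is total simulated bases divided by genome size,
# so total bases = genome size * depth.
total_bases=$((genome_size * depth))

# Dividing by the mean read length gives an approximate read count.
approx_reads=$((total_bases / mean_read_length))

echo "total bases: $total_bases"
echo "approx reads: $approx_reads"
```

With these assumed inputs, the estimate comes to about 46.5 billion simulated bases, or roughly 5.2 million reads, which is a useful sanity check on disk space and run time before starting.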
The output will include files such as:
- .fastq file(s): Containing simulated reads.
- .maf file: Containing read-to-reference alignments.
- .ref file: Containing each sequence from the reference genome.
- .sam file: Containing read alignments, useful for downstream analysis.
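A quick way to sanity-check the simulated reads is to count the records in each FASTQ file. The sketch below assumes four lines per FASTQ record and assumes the output files match the pattern hg38_*.fastq in the current directory; both are assumptions, so adjust the pattern to whatever files your run actually produced. `count_fastq_reads` is a hypothetical helper name.

```shell
# Count FASTQ records, assuming exactly four lines per record.
count_fastq_reads() {
    awk 'END { print NR / 4 }' "$1"
}

prefix="hg38"   # the value given to --prefix in the example command

# The file name pattern is an assumption; files that do not exist are skipped.
for f in "$prefix"_*.fastq; do
    [ -e "$f" ] || continue
    echo "$f: $(count_fastq_reads "$f") reads"
done
```

If the loop prints nothing, no files matched the assumed pattern in the current directory.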
Note: For reads simulated with the error model (errhmm), every quality code is "!".
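Because --seed defaults to Unix time, two runs of the same command will generally produce different reads. A minimal sketch of pinning the seed for reproducible runs follows. The function name `build_wgs_cmd` and the seed value 42 are illustrative, and the function only prints the command so it can be inspected before being run.

```shell
# Model file path taken from the example command in this guide.
QSHMM_MODEL="pbsim3/data/QSHMM-RSII.model"

# Print a WGS simulation command with an explicit seed.
# Arguments: $1 genome FASTA, $2 depth, $3 seed, $4 output prefix.
# build_wgs_cmd is a hypothetical helper, not part of PBSIM3.
build_wgs_cmd() {
    printf 'pbsim --strategy wgs --method qshmm --qshmm %s --depth %s --genome %s --seed %s --prefix %s\n' \
        "$QSHMM_MODEL" "$2" "$1" "$3" "$4"
}

build_wgs_cmd hg38.fasta 15 42 hg38
```

Recording the printed command alongside the output files makes it easier to regenerate the same data set later. Paths containing spaces would need quoting before the printed line is pasted into a shell.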