PBSIM3 Guide

PBSIM3 is a tool for simulating PacBio and ONT (Oxford Nanopore Technologies) reads. This guide will walk you through the installation process and provide an example of basic usage.

Prerequisites

PBSIM3 requires the following dependencies:

  • GNU make: Typically installed on most Unix systems.
  • GCC or compatible compiler: For compiling the source code.
  • zlib: Necessary for decompression in read simulation.

You may install these dependencies using a package manager such as apt (for Ubuntu) or yum (for CentOS), if they are not already available.
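Before building, it can help to confirm that the build tools are on your PATH. The sketch below is only an illustration: the helper name check_tools is hypothetical, and the package names in the comments are the usual ones for Ubuntu and CentOS, which may differ on your system.

```shell
# check_tools is a hypothetical helper: it reports which of the named
# commands are on PATH and collects the missing ones in $missing.
check_tools() {
    missing=""
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing="$missing $tool"
        fi
    done
}

check_tools make gcc

# Typical install commands (package names are assumptions; verify locally):
#   Ubuntu:  sudo apt install build-essential zlib1g-dev
#   CentOS:  sudo yum install gcc make zlib-devel
```

This check does not cover zlib, which is a library rather than a command; a failing link step during make is the usual sign that its development headers are absent.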

Installation

To install PBSIM3, follow these steps:

  1. Clone the repository:

    git clone https://github.com/yukiteruono/pbsim3.git
    cd pbsim3
  2. Configure the installation directory by replacing PREFIX with your desired installation path. For example, to install in /usr/local, use --prefix=/usr/local:

    ./configure --prefix=PREFIX
  3. Compile and install the program:

    make
    make install
  4. After installation, the pbsim executable is located in the bin directory under the specified PREFIX path (that is, PREFIX/bin).
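To call pbsim from any directory, you can add the install location to your PATH. This is a minimal sketch that assumes the standard layout in which make install places executables in PREFIX/bin; the PREFIX value shown is a hypothetical example, so substitute the one you passed to ./configure.

```shell
# PREFIX is a hypothetical install location; use the value given to ./configure.
PREFIX="$HOME/pbsim3-install"

# Put PREFIX/bin at the front of PATH for the current shell session.
export PATH="$PREFIX/bin:$PATH"

# Confirm that the shell can now find the executable.
command -v pbsim || echo "pbsim not found on PATH; check PREFIX"
```

Adding the export line to your shell start-up file makes the change persistent.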

Usage

The main executable for PBSIM3 is pbsim, which allows you to simulate PacBio and ONT reads. Below is a basic example to simulate reads from a FASTA file.

1. To check that the installation succeeded and to view the available options, run pbsim without arguments from the directory that contains the executable:

./pbsim

2. Basic Command Example:

To simulate reads from a reference genome in FASTA format, use the general form below; the options are described in the sections that follow:

pbsim [options]

General Options

--prefix: Set output file prefix (default: "sd").

--id-prefix: Prefix for read IDs (default: "S").

--seed: Seed for the random number generator (default: Unix time).

Whole Genome Sequencing Options

--strategy: Set sequencing strategy to wgs (whole genome sequencing).

--genome: Input genome file in FASTA format.

--depth: Set coverage depth (default: 20.0).

--length-min/--length-max: Minimum and maximum read length (defaults: 100 and 1,000,000).
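As a rough guide to what a given --depth implies, the total number of simulated bases is about depth × genome length, and dividing that by the mean read length gives an approximate read count. The sketch below works through this arithmetic on a tiny made-up FASTA file. It is a back-of-the-envelope estimate and not a description of how PBSIM3 decides when to stop; the file name and variable names are illustrative.

```shell
# Build a tiny FASTA so the sketch runs on its own (hypothetical data).
workdir=$(mktemp -d)
printf '>chr_demo\nACGTACGTAC\nGGCC\n' > "$workdir/demo.fasta"

# Sum the lengths of all non-header lines to get the genome size in bases.
genome_len=$(awk '!/^>/ { n += length($0) } END { print n + 0 }' "$workdir/demo.fasta")

depth=20        # the --depth default listed above
mean_len=9000   # the --length-mean default listed in this guide

# Target bases, and a rough read count rounded up to a whole read.
total_bases=$((genome_len * depth))
approx_reads=$(( (total_bases + mean_len - 1) / mean_len ))

echo "genome=$genome_len bases, target=$total_bases bases, ~$approx_reads reads"
```

For a real genome, point the awk command at your own FASTA file instead of the demo file.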

Transcriptome Sequencing Options

--strategy: Set to trans for transcriptome sequencing.

--transcript: Input transcript file in PBSIM3's own (original) format.

--length-min/--length-max: Minimum and maximum read length (defaults: 100 and 1,000,000).

Template Sequencing Options

--strategy: Set to templ for template sequencing.

--template: Input template file in FASTA format.

Quality Score Model Options

--method: Set to qshmm for quality score modeling.

--qshmm: Quality score model file.

--length-mean/--length-sd: Mean and standard deviation of read length (defaults: 9000 and 7000).

--accuracy-mean: Mean accuracy (default: 0.85).

--pass-num: Number of sequencing passes (default: 1).

--difference-ratio: Error ratio as substitution:insertion:deletion (default: 6:55:39).

--hp-del-bias: Bias in homopolymer deletions (default: 1).
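To see what the defaults above imply together, one can split the overall error rate (1 minus the mean accuracy) according to the difference ratio. The sketch below does this arithmetic with awk. It assumes the three ratio fields are substitution, insertion, and deletion, and it is only an approximation of the expected per-base rates, not PBSIM3's internal calculation.

```shell
ratio="6:55:39"   # --difference-ratio default
accuracy=0.85     # --accuracy-mean default

# Split the total error rate (1 - accuracy) in proportion to the ratio fields.
rates=$(echo "$ratio" | awk -F: -v acc="$accuracy" '{
    total = $1 + $2 + $3
    err = 1 - acc
    printf "sub=%.4f ins=%.4f del=%.4f", err * $1 / total, err * $2 / total, err * $3 / total
}')

echo "$rates"   # prints: sub=0.0090 ins=0.0825 del=0.0585
```

Under these assumptions, insertions dominate the error profile at roughly 8 per 100 bases.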

Error Model Options

--method: Set to errhmm for error modeling.

--errhmm: Error model file.

Other options are similar to those in the quality score model.
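An error-model run uses the same overall command shape as a quality-score-model run, with --method errhmm and --errhmm in place of the qshmm pair. The sketch below assembles such a command from the options listed in this guide. The model and genome paths and the errhmm_demo prefix are placeholders, and pbsim is only invoked if it is on PATH and both input files exist; otherwise the command is printed.

```shell
# Placeholder paths; point these at your own model file and genome.
ERRHMM_MODEL="${ERRHMM_MODEL:-path/to/errhmm.model}"
GENOME="${GENOME:-genome.fasta}"

# Collect the arguments as positional parameters so paths with spaces stay intact.
set -- --strategy wgs --method errhmm --errhmm "$ERRHMM_MODEL" \
       --depth 20 --genome "$GENOME" --prefix errhmm_demo

if command -v pbsim >/dev/null 2>&1 && [ -f "$ERRHMM_MODEL" ] && [ -f "$GENOME" ]; then
    pbsim "$@"
else
    echo "would run: pbsim $*"
fi
```

Setting ERRHMM_MODEL and GENOME in the environment before running the snippet overrides the placeholders.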

Sample-Based Simulation Options

(Applicable for wgs strategy only)

--sample: Input FASTQ sample file.

--sample-profile-id: Profile ID for sampled reads.

--accuracy-min/--accuracy-max: Minimum and maximum accuracy (defaults: 0.75 and 1.0).

--difference-ratio: Substitution:insertion:deletion error ratio for the sample-based method (default: 6:55:39).

--hp-del-bias: Bias in homopolymer deletions for sample-based reads (default: 1).

Sample Command

./pbsim --strategy wgs --method qshmm --qshmm pbsim3/data/QSHMM-RSII.model --depth 15 --genome hg38.fasta --pass-num 7 --prefix hg38

Breakdown of Options

  1. --strategy wgs

    • Specifies the simulation strategy as Whole Genome Sequencing (WGS).
    • The tool will simulate reads from the entire genome.
  2. --method qshmm

    • Indicates the use of the Quality Score Hidden Markov Model (QSHMM) to simulate quality scores and errors in the reads.
  3. --qshmm pbsim3/data/QSHMM-RSII.model

    • Provides the quality score model file. In this case, the model file 'QSHMM-RSII.model' (included in PBSIM3's data folder) is specific to the PacBio RS II sequencer.
  4. --depth 15

    • Specifies the sequencing depth as 15x. This means that, on average, each base of the input genome will be covered by 15 reads.
  5. --genome hg38.fasta

    • Sets the input genome file to 'hg38.fasta'. This is the reference genome in FASTA format from which reads will be simulated.
  6. --pass-num 7

    • Specifies the number of sequencing passes per read. With more than one pass, the same template is read repeatedly, which generally improves consensus accuracy, as in PacBio multi-pass sequencing.
  7. --prefix hg38

    • Defines the prefix for output files. All output files generated by the simulation will start with 'hg38'.
Expected Outcome:

This command simulates PacBio RS II sequencing reads based on the human genome ('hg38.fasta') with:

  • A depth of 15x coverage.
  • Quality scores modeled using QSHMM.
  • Multi-pass sequencing with 7 passes, so each template is read 7 times.

3. Output Files:

The output will include files such as:

  • .fastq file(s): Containing simulated reads.
  • .maf file: Containing read-to-reference alignments.
  • .ref file: Containing each sequence from the reference genome.
  • .sam file: Containing read alignments, useful for downstream analysis.
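A quick sanity check on FASTQ output is to count the reads in each file. The sketch below does this on a small made-up FASTQ file so that it runs without a real simulation. The file name hg38_0001.fastq is an assumed example of a name starting with the chosen prefix, and the count assumes unwrapped four-line FASTQ records.

```shell
# Build a two-read FASTQ so the sketch is self-contained (hypothetical data).
workdir=$(mktemp -d)
printf '@S1_1\nACGT\n+\n!!!!\n@S1_2\nGGCCA\n+\n!!!!!\n' > "$workdir/hg38_0001.fastq"

# Each FASTQ record spans four lines, so reads = lines / 4.
for fq in "$workdir"/hg38_*.fastq; do
    n_reads=$(awk 'END { print NR / 4 }' "$fq")
    echo "$(basename "$fq"): $n_reads reads"
done
```

To check real output, replace "$workdir" with the directory in which pbsim was run.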

Note: For reads simulated with the error model (errhmm), every quality code is "!".