Skip to content

Latest commit

 

History

History
130 lines (101 loc) · 10.8 KB

README.md

File metadata and controls

130 lines (101 loc) · 10.8 KB

creo_pipelines

Introduction

This application makes RNA-Seq analysis: quality control, pre-processing, alignment, transcript quantification and differential expression analysis on BAM files. Given the input files and the working directory, the pipeline is completely automated. First, quality control on FastQ files is performed with FastQC and FastQ Screen. FastQC makes quality control and creates one report for sample. FastQ Screen estimates approximately the percentage of reads that can be mapped on genomes other than human, like ribosomal genomes, PhiX genome and mouse genome. This allows to evaluate the presence of contaminating genomes. Pre-processing follows quality control: the reads are aligned on PhiX genome and ribosomal genome to eliminate contaminations. Alignment can be performed with TopHat2 or HISAT2; in the first case quantification is performed with Cufflinks and DEA with cummeRbund, in the second case quantification is performed with featureCounts and DEA with DESeq2 or edgeR. A second intermediate quality control analysis is also performed on the aligned BAM files with some of the RSeQC scripts and in particular: inner_distance, junction_annotation, junction_saturation, bam_stat, read_distribution, geneBody_coverage. A final report is generated from the results using multiqc. It is possible to perform an optional meta-analysis on the results. It consists in Gene Ontology enrichment analysis and KEGG Pathway enrichment analysis on the differentially expressed genes (with absolute Fold Change value higher than 1.5 and adjusted p-value lower than 0.05).

Prerequisites

For application to work properly, you must first make sure that you have installed the necessary applications for the pipeline.

Applications

For Graphical User Interface and scripts running:

For quality control on fastq files and pre-processing:

For alignment and quality control on BAM files:

For quantification:

Input files

The necessary input files are as follows:

  • FastQ files of the samples. Since this is a differential analysis, at least two cases and two controls must be present, furthermore, if it is a paired-end analysis, files of read 1 and read 2 must be provided for each sample while if it is a single-end analysis only files of read 1 must be provided.

  • Reference genome for Bowtie2 with relative indexes, necessary for alignment with TopHat2. The path in which the file with its indexes is searched by default is as follows: /opt/genome/human/hg19/index/bowtie2/hg19. It is not necessary to report the extension, but only the name of the file and the indexes must be present in the same folder.

  • Reference genome with relative indexes, necessary for alignment with HISAT2. The path in which the file with its indexes is searched by default is as follows: /opt/genome/human/hg19/index/hisat2/hg19. The reference genome file is the same as TopHat, but the indexes are obtained differently (see below).

  • BED file, necessary for quality control on BAM files. The path in which the file is searched by default is as follows: /opt/genome/human/hg19/annotation/hg19.refseq.bed12.

  • PhiX genome with relative indexes, necessary for pre-processing. The path in which the file with its indexes is searched by default is as follows: /opt/genome/control/phix174/bwa/phiX174.fa.

  • Ribosomal genome 5s with relative indexes, necessary for pre-processing. The path in which the file with its indexes is searched by default is as follows: /opt/genome/human/hg19/contam/bwa/hum5SrDNA.fa.

  • Ribosomal genome with relative indexes, necessary for pre-processing. The path in which the file with its indexes is searched by default is as follows: /opt/genome/human/hg19/contam/bwa/humRibosomal.fa.

  • GTF file, necessary for alignment with TopHat2 and quantification with featureCounts and Cufflinks. The path in which the file is searched by default is as follows: /opt/genome/human/hg19/annotation/hg19.refgene.sorted.gtf.

  • Reference genome with relative indexes, necessary for quantification with Cufflinks. The path in which the file with its indexes is searched by default is as follows: /opt/genome/human/hg19/index/hg19.fa. The reference genome file is the same, but the indexes change.

Creating indexes:

The indexes can be created starting from the Fasta file and require different commands based on the alignment software for which they will be used.

In the case of TopHat, the necessary command is the following:

bowtie2-build file.fa file

In the case of HISAT, the necessary command is:

hisat2-build file.fa file

In the case of BWA, instead, the necessary command is:

bwa index file.fa

Finally, in the case of BWA, the necessary command is:

samtools faidx file.fa

Running application

Command line

After installing the necessary tools and getting the required input files, you can launch the application using command-line. The following is the use for analysis in single-end and paired-end, with the arguments that must necessarily be inserted, as they do not provide default options, and the optional arguments, which can be omitted if the paths and options you want match the standard ones.

Usage

Paired-end

RNAseq_pipeline -n <project_name> -pn <pool_name> -sn <sample_A,sample_B,sample_C,sample_D> -r1 <sample_A_read1,sample_B_read1,sample_C_read1,sample_D_read1> -r2 <sample_A_read2,sample_B_read2,sample_C_read2,sample_D_read2> -type <cntrl,cntrl,treat,treat> -comp <cntrl_VS_treat> -o <output_directory> [options]*

Single-end

RNAseq_pipeline -n <project_name> -pn <pool_name> -sn <sample_A,sample_B,sample_C,sample_D> -r1 <sample_A_read1,sample_B_read1,sample_C_read1,sample_D_read1> -type <cntrl,cntrl,treat,treat> -comp <cntrl_VS_treat> -o <output_directory> [options]*

Arguments

-n Project name. No default options.
-pn Pool name. No default options.
-sn Sample names (',' sep). No default options.
-r1 Read 1 FastQ path (',' sep). No default options. Files must appear in the same order as sample names.
-r2 Read 2 FastQ path (',' sep). Required only for paired-end analysis. Files must appear in the same order as sample names.
-type Sample types (',' sep). No default options. Types must appear in the same order as sample names.
-comp Comparisons between samples (allows multiple comparisons comma separated). No default option.
-o Output directory. No default options.

Options

-rb Reference genome file path for bowtie. Default: /opt/genome/human/hg19/index/bowtie2/hg19
-rh Reference genome file path for HISAT2. Default: /opt/genome/human/hg19/index/hisat2/hg19
-bed Reference genome annotation file path. Default: /opt/genome/human/hg19/annotation/hg19.refseq.bed12
-ph PhiX genome file path. Default: /opt/genome/control/phix174/bwa/phiX174.fa
-rib1 Ribosomal genome file path. Default: /opt/genome/human/hg19/contam/bwa/hum5SrDNA.fa
-rib2 Ribosomal genome file path. Default: /opt/genome/human/hg19/contam/bwa/humRibosomal.fa
-t Max thread number. Default: 12.
-g GTF file path. Default: /opt/genome/human/hg19/annotation/hg19.refgene.sorted.gtf
-a Alignment method. Default: hisat; alternative: tophat.
-l Library type. Default: fr-firststrand; alternative: fr-secondstrand.
-q Quantification method. Default: featureCounts; alternative: Cufflinks.
-r Reference genome file path. Default: /opt/genome/human/hg19/index/hg19.fa
-dea Differential Expression Analysis method. Default: edgeR; alternatives: DESeq2, cummeRbund.
-r_path Script directory (alignment, quantification and DEA). Default: /opt/applications/src/creo_pipelines/pipelines.
-meta Analysis with or without final meta-analysis. Default: full; alternative: quant.
-cat Max number of categories showed in R plots for meta-analysis. Default: 5.

Example

To make the launch of the command easier, we have created a script run.sh where you can enter all the desired variables. Once done, you can launch the pipeline with the following command:

bash run.sh

Graphical User Interface

Usage

GUI_RNAseq_pipeline

A series of windows allows to indicate input files and to choose between various options.

Example

A first window will appear in which you will be asked to enter the data relating to the experiment, as in the example below.

This is followed by a series of windows that allow you to choose between different options or ask to select the input files or folders, as shown in the following two images.

If they exist, the default paths will be shown automatically. Finally, once all the inputs and the parameters have been selected, a summary window will appear. If the data entered are correct, it will be sufficient to confirm to start the analysis, otherwise it is possible to cancel and start from the beginning.

Shiny app

To observe the results of the analysis in an interactive form, it is possible to launch a shiny app.