diff --git a/content/DifferentialAbundance.md b/content/DifferentialAbundance.md
index 4b3ea4b..c99b6f6 100644
--- a/content/DifferentialAbundance.md
+++ b/content/DifferentialAbundance.md
@@ -4,7 +4,50 @@ title = 'nf-core/DifferentialAbundance'
 ![DifferentialAbundance_pipeline](DifferentialAbundance_pipeline.png)
-#### nf-core/DifferentialAbundance
+nf-core/differentialabundance is a bioinformatics pipeline that can be used to analyse data represented as matrices, comparing groups of observations to generate differential statistics and downstream analyses. The pipeline supports RNA-seq data such as that generated by the nf-core rnaseq workflow, and Affymetrix arrays via .CEL files. Other types of matrix may also work with appropriate changes to parameters, and PRs to support additional specific modalities are welcomed.
+
+1. Optionally generate a list of genomic feature annotations using the input GTF file (if a table is not explicitly supplied).
+2. Cross-check matrices, sample annotations, feature set and contrasts to ensure consistency.
+3. Run differential analysis over all contrasts specified.
+4. Optionally run a differential gene set analysis.
+5. Generate exploratory and differential analysis plots for interpretation.
+6. Optionally build and (if specified) deploy a Shiny app for fully interactive mining of results.
+7. Build an HTML report based on R markdown, with interactive plots (where possible) and tables.
+
+# Set-up run
+***for those with more confidence*** this run can be prepared from scratch in a new directory.
+- the pipeline will take ~40 mins to run
+- for this reason there is a completed run in ~/workshop/nfDifferentialAbundance/, where you can go through the motions
+- a good chance to demonstrate the **resume** feature of nextflow
+
+
+```bash
+mkdir ~/workshop/nfDifferentialAbundance2
+cd ~/workshop/nfDifferentialAbundance2
+ls -l
+```
+***optional*** you can use the nf-core launch command to build a launch command, but these instructions will be using a 'nextflow run' command
+
+The minimal input requirements are
+1. Sample sheet
+- containing the sample information, metadata and group relationships
+
+2
+
+
+
+
+nf-core launch
+```
+
+
+
+
+
+
+
+####
+nf-core/DifferentialAbundance
 There are a few ways of installing the nf-core pipeline. But this happens automatically when you use a *nexflow run nf-core/* commands.
 You can check the ~/.nextflow/assets folders to see what is already installed
diff --git a/content/nfRNAseq.md b/content/nfRNAseq.md
index afcd136..1000c69 100644
--- a/content/nfRNAseq.md
+++ b/content/nfRNAseq.md
@@ -13,8 +13,8 @@ nf-core list -s stars
 lets use nf-core tools to build a command to run the nf-core/RNAseq pipeline
 ```bash
-mkdir ~/workshop/nfcore
-cd ~/workshop/nfcore
+mkdir -p ~/workshop/nfRNAseq
+cd ~/workshop/nfRNAseq
 nf-core launch -h
 ```
@@ -60,7 +60,7 @@ nf-core launch --id 1728482936_6064841c2138
 ```
 - alternatively you can use a nextflow run command
-- no need to run this command, in the interest of time (and the lack of disk space on your intance), I've pre-prepared the outputs for this run. We will run the next pipeline to completion.
+- no need to run this command, in the interest of time (and the lack of disk space on your instance), I've pre-prepared the outputs for this run
 navigate to run directory to see the nextflow run command
 ```bash
@@ -84,81 +84,97 @@ The dataset used throughout this workshop is as follow:
 - treatment vs control
 - 4 replicates for each
 ```
-- the information is reflected in the samplesheet and run command
-
-# Results
-
-
-
-    1. Merge re-sequenced FastQ files (cat)
-    2. Sub-sample FastQ files and auto-infer strandedness (fq, Salmon)
-    3. Read QC (FastQC)
-    4. UMI extraction (UMI-tools)
-    5. Adapter and quality trimming (Trim Galore!)
-    6. Removal of genome contaminants (BBSplit)
-    7. Removal of ribosomal RNA (SortMeRNA)
-    8. Choice of multiple alignment and quantification routes:
-        1. STAR -> Salmon
-        2. STAR -> RSEM
-        3. HiSAT2 -> NO QUANTIFICATION
-    9. Sort and index alignments (SAMtools)
-    10. UMI-based deduplication (UMI-tools)
-    11. Duplicate read marking (picard MarkDuplicates)
-    12. Transcript assembly and quantification (StringTie)
-    13. Create bigWig coverage files (BEDTools, bedGraphToBigWig)
-    14. Extensive quality control:
-        1. RSeQC
-        2. Qualimap
-        3. dupRadar
-        4. Preseq
-        5. DESeq2
-    15. Pseudoalignment and quantification (Salmon or ‘Kallisto’; optional)
-    16. Present QC for raw read, alignment, gene biotype, sample similarity, and strand-specificity checks (MultiQC, R)
-
-In this section we will work through setting up the nf-core/RNAseq pipeline. We will be chosing specific parameters and finally, going through was the output looks like.
-
-# setting up nf-core/RNAseq
+- most of this information is reflected in the samplesheet and run command
-#### dataset
-The dataset used throughout this workshop is as follow:
+#### Run Summary
+List the results directory. All nf-core pipelines structure their outputs consistently.
+```bash
+tree /home/workshop/workshop/nfRNAseq/outs
 ```
-- 16 Samples sequenced with an MGI400 sequencer at SAGC using the Tecan Universal RNA-seq library protocol.
-- 2 different cancer cell lines (human)
-- treatment vs control
-- 4 replicates for each
-```
-***NOTE*** some concessions had to be made to work with this for workshop. Taking into account the large file sizes, the long run times and need for high compute resources.
-#### what are the inputs to an RNAseq pipeline
+[link to execution timeline](../execution_timeline_2024-10-05_16-02-39.html)
-# .fastq /.fastq.gz
-```
+
+[link to execution report](../execution_report_2024-10-05_16-02-39.html)
+
+# Genomics Files Background
+Quick background on genomic file formats to help describe the inputs and outputs of an NGS experiment.
+Knowing what these files are is not only important for finding which files to use for a pipeline, but is also a key foundation of genomic bioinformatics.
+Being able to use and manipulate each file opens up many opportunities, and is often required for troubleshooting.
+Many are plain text files, which means they can be manipulated with basic text-editing tools.
+
+##### .fastq /.fastq.gz
-```
 ![](../fastqfile.png)
-# .fasta
+#### .fasta
+The genome.fa file is a plain text representation of the genome sequence. This is the 'reference' to which the sequencing files (.fastq) are aligned.
-# .gtf
+```bash
+head ./smallRNA/outs/mirtrace/[:]/mirtrace/qc_passed_reads.rnatype_unknown.collapsed/Acontrol1.fastp.fasta
+```
+#### .gtf
+genes.gtf: a genomic interval file that references the genome. This file describes the genes and transcripts. It is required for counting where reads are mapped, and often carries a lot more annotation data about each gene/transcript.
-# .bed
+#### .bed
 ![](../bedfile.png)
-#### Genomic file formats
--- now that we have gone through how to run the nf-core/RNAseq pipeline. Let's look at the inputs and outputs in detail.
-- using this chance to describe what are the key genomic file formats used in RNAseq and beyond
-- Knowing what these files are isn't only important in finding which files to use for a pipeline, but a key foundation of genomic bioinformati$
-- being able to use and manipulate each file open's up many opportunities, and is often required for troubleshooting wehn something has gone w$
-- many files are plain text files, this means they can be manipluated with basic text editing. I'll be going through some examples.
-- For a more visual perspective, we'll also be using a genome browser, IGV (Integrative Genome Browser) to get a feel for what information eac$
-
-# bam
+#### bam
 ![](../bamfile.png)
-# bigwig
-
+# Walkthrough of nf-core/RNAseq outputs
+The MultiQC summary is a real strength of nf-core pipelines.
+A lot of the key analyses are captured in it.
 [link to multiqc report](../multiqc_report.html)
+
+```
+Overview of the key processes run with nf-core/RNAseq.
+- identifying steps for troubleshooting a run
+- far more than just mapping (star) and counting (RSEM)
+- demonstrating the need for a workflow manager
+```
+### Preprocessing
+1. cat - Merge re-sequenced FastQ files
+2. FastQC - Raw read QC
+
+[link to fastqc.html](../Acontrol1_1_fastqc.html)
+
+3. UMI-tools extract - UMI barcode extraction
+4. TrimGalore - Adapter and quality trimming
+5. BBSplit - Removal of genome contaminants
+6. SortMeRNA - Removal of ribosomal RNA
+
+### Alignment and quantification
+1. STAR and Salmon - Fast spliced aware genome alignment and transcriptome quantification
+2. STAR via RSEM - Alignment and quantification of expression levels
+3. HISAT2 - Memory efficient splice aware alignment to a reference
+
+### Alignment post-processing
+1. SAMtools - Sort and index alignments
+2. UMI-tools dedup - UMI-based deduplication
+3. picard MarkDuplicates - Duplicate read marking
+
+### Other steps
+1. StringTie - Transcript assembly and quantification
+2. BEDTools and bedGraphToBigWig - Create bigWig coverage files
+
+### Quality control
+1. RSeQC - Various RNA-seq QC metrics
+2. Qualimap - Various RNA-seq QC metrics
+3. dupRadar - Assessment of technical / biological read duplication
+4. Preseq - Estimation of library complexity
+5. featureCounts - Read counting relative to gene biotype
+6. DESeq2 - PCA plot and sample pairwise distance heatmap and dendrogram
+7. MultiQC - Present QC for raw reads, alignment, read counting and sample similarity
+
+### Pseudoalignment and quantification
+1. Salmon - Wicked fast gene and isoform quantification relative to the transcriptome
+2. Kallisto - Near-optimal probabilistic RNA-seq quantification
+3. Workflow reporting and genomes
+4. Reference genome files - Saving reference genome indices/files
+5. Pipeline information - Report metrics generated during the workflow execution
+
+[link to execution report](../execution_report_2024-10-05_16-02-39.html)
diff --git a/content/setup_environment.md b/content/setup_environment.md
index d67d860..70d0f18 100644
--- a/content/setup_environment.md
+++ b/content/setup_environment.md
@@ -245,6 +245,7 @@ rm samtools-1.21.tar.bz2
 cd samtools-1.21/
 make
 make install
+export PATH=$PATH:/home/workshop/bin/bin
 ```
 [link to RNAseq Background slides](../Workshop_RNAseq_Intro.pdf)
diff --git a/public/differentialabundance/index.html b/public/differentialabundance/index.html
index 35a154e..20856d2 100644
--- a/public/differentialabundance/index.html
+++ b/public/differentialabundance/index.html
@@ -12,43 +12,23 @@
+for those with more confidence this run can be prepared from scratch in a new directory.">
@@ -154,23 +134,66 @@



nf-core/differentialabundance is a bioinformatics pipeline that can be used to analyse data represented as matrices, comparing groups of observations to generate differential statistics and downstream analyses. The pipeline supports RNA-seq data such as that generated by the nf-core rnaseq workflow, and Affymetrix arrays via .CEL files. Other types of matrix may also work with appropriate changes to parameters, and PRs to support additional specific modalities are welcomed.

  1. Optionally generate a list of genomic feature annotations using the input GTF file (if a table is not explicitly supplied).
  2. Cross-check matrices, sample annotations, feature set and contrasts to ensure consistency.
  3. Run differential analysis over all contrasts specified.
  4. Optionally run a differential gene set analysis.
  5. Generate exploratory and differential analysis plots for interpretation.
  6. Optionally build and (if specified) deploy a Shiny app for fully interactive mining of results.
  7. Build an HTML report based on R markdown, with interactive plots (where possible) and tables.
+ +

+ Set-up run +


for those with more confidence this run can be prepared from scratch in a new directory.

+ + + -

- nf-core/DifferentialAbundance -


There are a few ways of installing the nf-core pipeline. But this happens automatically when you use a nexflow run nf-core/ commands. -You can check the ~/.nextflow/assets folders to see what is already installed

+mkdir ~/workshop/nfDifferentialAbundance2
+cd ~/workshop/nfDifferentialAbundance2
+ls -l

optional you can use the nf-core launch command to build a launch command, but these instructions will be using a ’nextflow run’ command


The minimal input requirements are

  1. Sample sheet
  2. +
+ +



nf-core launch

ls -l ~/.nextflow/assets/nf-core/

If you don’t see the pipeline, you can pull it from the nf-core website.


+There are a few ways of installing the nf-core pipeline, but this happens automatically when you use a *nextflow run nf-core/* command.
+You can check the ~/.nextflow/assets folder to see what is already installed.

ls -l ~/.nextflow/assets/nf-core/

nextflow pull nf-core/differentialabundance
If you don't see the pipeline, you can pull it from the nf-core website. 
+nextflow pull nf-core/differentialabundance

but this happens automatically when running nextflow run nf-core/differentialabundance
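
The check-then-pull logic described above can be sketched as a small shell helper. This is only an illustration: the function name `pipeline_installed` is made up for this example, and it assumes the default assets location (`~/.nextflow/assets`, or whatever `NXF_ASSETS` points to). It only reports what it finds and does not call nextflow itself.

```bash
# Hypothetical helper: succeeds if a pipeline folder exists in the assets directory.
# $1 = pipeline name (e.g. nf-core/differentialabundance)
# $2 = assets directory (optional; defaults to NXF_ASSETS or ~/.nextflow/assets)
pipeline_installed() {
    local pipeline="$1"
    local assets_dir="${2:-${NXF_ASSETS:-$HOME/.nextflow/assets}}"
    [ -d "$assets_dir/$pipeline" ]
}

if pipeline_installed nf-core/differentialabundance; then
    echo "nf-core/differentialabundance is already in the local assets folder"
else
    echo "not found locally; 'nextflow pull nf-core/differentialabundance' would fetch it"
fi
```

Either way, `nextflow run nf-core/differentialabundance` fetches the pipeline on first use, so this check is purely informational.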


diff --git a/public/execution_report_2024-10-05_16-02-39.html b/public/execution_report_2024-10-05_16-02-39.html new file mode 100644 index 0000000..98be74c --- /dev/null +++ b/public/execution_report_2024-10-05_16-02-39.html @@ -0,0 +1,1041 @@ + + + + + + + + + + + [fabulous_blackwell] Nextflow Workflow Report + + + + + + + +
+ +

Nextflow workflow report



+ + +
+ Workflow execution completed successfully! +
+ + +
Run times
+ 05-Oct-2024 16:02:40 - 06-Oct-2024 05:08:34 + (duration: 12h 5m 54s) +
+ +
  511 succeeded  
  0 cached  
  0 ignored  
  6 failed  
+ +
Nextflow command
nextflow run nf-core/rnaseq -profile sahmri -c /cancer/storage/SAGC/workshop/Workshop_data/nfRNAseq/nextflow.config -r 3.14.0 --input /cancer/storage/SAGC/workshop/Workshop_data/nfRNAseq/nfSampleSheet.csv --outdir /cancer/storage/SAGC/workshop/Workshop_data/nfRNAseq/outs --fasta /homes/daniel.thomson/References/GRCh38/Ensembl_download/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa --gtf /homes/daniel.thomson/References/GRCh38/Ensembl_download/Homo_sapiens.GRCh38.111.gtf --skip_dupradar --skip_qualimap -resume
+ +
730.1 (33.9% failed)
+ +
Launch directory
+ +
Work directory
+ +
Project directory
+ + +
Script name
+ + + +
Script ID
+ + +
Workflow session
+ + +
Workflow repository
https://github.com/nf-core/rnaseq, revision 3.14.0 (commit hash b89fac32650aacc86fcda9ee77e00612a1d77066)
+ + +
Workflow profile
+ + + +
Nextflow version
version 23.10.0, build 5889 (15-10-2023 15:07 UTC)
+ +

Resource Usage


These plots give an overview of the distribution of resource usage for each process.


Job Duration




This table shows information about each task in the workflow. Use the search box on the right + to filter rows for specific values. Clicking headers will sort the table by that value and + scrolling side to side will reveal more columns.

+ + +
+ +
+ (tasks table omitted because the dataset is too big) +
+ + + + + + + diff --git a/public/nfrnaseq/index.html b/public/nfrnaseq/index.html index 428304e..77b8199 100644 --- a/public/nfrnaseq/index.html +++ b/public/nfrnaseq/index.html @@ -22,8 +22,8 @@ -mkdir ~/workshop/nfcore -cd ~/workshop/nfcore +mkdir -p ~/workshop/nfRNAseq +cd ~/workshop/nfRNAseq nf-core launch -h Launch a pipeline using a web GUI or command line prompts. @@ -145,8 +145,8 @@

-mkdir ~/workshop/nfcore
-cd ~/workshop/nfcore
+mkdir -p ~/workshop/nfRNAseq
+cd ~/workshop/nfRNAseq
 nf-core launch -h

Launch a pipeline using a web GUI or command line prompts.
@@ -193,7 +193,7 @@

alternatively you can use a nextflow run command

  • -

    no need to run this command, in the interest of time (and the lack of disk space on your intance), I’ve pre-prepared the outputs for this run. We will run the next pipeline to completion.


    no need to run this command, in the interest of time (and the lack of disk space on your instance), I’ve pre-prepared the outputs for this run

  • navigate to run directory to see the nextflow run command

    @@ -224,128 +224,167 @@

    - treatment vs control - 4 replicates for each


    - Results -

    -1. Merge re-sequenced FastQ files (cat)
    -2. Sub-sample FastQ files and auto-infer strandedness (fq, Salmon)
    -3. Read QC (FastQC)
    -4. UMI extraction (UMI-tools)
    -5. Adapter and quality trimming (Trim Galore!)
    -6. Removal of genome contaminants (BBSplit)
    -7. Removal of ribosomal RNA (SortMeRNA)
    -8. Choice of multiple alignment and quantification routes:
    -	1. STAR -> Salmon
    -	2. STAR -> RSEM
    -9. Sort and index alignments (SAMtools)
    -10. UMI-based deduplication (UMI-tools)
    -11. Duplicate read marking (picard MarkDuplicates)
    -12. Transcript assembly and quantification (StringTie)
    -13. Create bigWig coverage files (BEDTools, bedGraphToBigWig)
    -14. Extensive quality control:
    -	1. RSeQC
    -	2. Qualimap
    -	3. dupRadar
    -	4. Preseq
    -	5. DESeq2
    -15. Pseudoalignment and quantification (Salmon or ‘Kallisto’; optional)
    -16. Present QC for raw read, alignment, gene biotype, sample similarity, and strand-specificity checks (MultiQC, R)

    In this section we will work through setting up the nf-core/RNAseq pipeline. We will be chosing specific parameters and finally, going through was the output looks like.

    - - - -

    - setting up nf-core/RNAseq -

    - - - -

    - dataset +

    + Run Summary


    The dataset used throughout this workshop is as follow:


    List the results directory. All nf-core pipelines structure their outputs consistently.

    - 16 Samples sequenced with an MGI400 sequencer at SAGC using the Tecan Universal RNA-seq library protocol.
    -- 2 different cancer cell lines (human)
    -- treatment vs control
    -- 4 replicates for each

    NOTE some concessions had to be made to work with this for workshop. Taking into account the large file sizes, the long run times and need for high compute resources.

    tree /home/workshop/workshop/nfRNAseq/outs

    link to execution timeline


    link to execution report


    - what are the inputs to an RNAseq pipeline -


    + Genomics Files Background +


    Quick background on genomic file formats to help describe the inputs and outputs of an NGS experiment.
    +Knowing what these files are is not only important for finding which files to use for a pipeline, but is also a key foundation of genomic bioinformatics.
    +Being able to use and manipulate each file opens up many opportunities, and is often required for troubleshooting.
    +Many are plain text files, which means they can be manipulated with basic text-editing tools.
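
As a minimal sketch of what manipulating a plain text genomics file can look like, the commands below build a two-read FASTQ file and count its reads. The file name `demo.fastq` and the read names are made up for illustration; the only assumption is the standard four-lines-per-record FASTQ layout.

```bash
# Write a tiny, made-up FASTQ file (two reads, four lines per read).
cat > demo.fastq <<'EOF'
@read1
ACGTACGT
+
IIIIIIII
@read2
TTGACCAA
+
IIIIIIII
EOF

# Each record is exactly four lines, so reads = lines / 4.
n_lines=$(wc -l < demo.fastq)
n_reads=$((n_lines / 4))
echo "reads: $n_reads"

# The sequence is the 2nd line of every record.
awk 'NR % 4 == 2' demo.fastq
```

For a gzipped file the same counting works after decompressing to stdout, for example by piping `gzip -dc` into `wc -l`.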



    .fastq /.fastq.gz -
    - - - -



    .fasta -

    - + +

    The genome.fa file is a plain text representation of the genome sequence. This is the ‘reference’ to which the sequencing files (.fastq) are aligned.

    + -

    - .gtf -

    head ./smallRNA/outs/mirtrace/[:]/mirtrace/qc_passed_reads.rnatype_unknown.collapsed/Acontrol1.fastp.fasta
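
Beyond `head`, a couple of one-liners are often enough to summarise a FASTA file. The sketch below uses a made-up file (`demo.fa`, with invented sequence names) rather than the workshop data, and assumes only that header lines start with `>`.

```bash
# Write a tiny, made-up FASTA file with two sequences.
cat > demo.fa <<'EOF'
>chrA example sequence
ACGTACGTAC
GGTTAA
>chrB example sequence
TTTTCCCC
EOF

# Header lines start with '>', so counting them counts the sequences.
n_seqs=$(grep -c '^>' demo.fa)
echo "sequences: $n_seqs"

# Sum the length of the sequence lines under each header (name<TAB>length).
seq_lengths=$(awk '
    /^>/ { if (name != "") print name "\t" len; name = substr($1, 2); len = 0; next }
         { len += length($0) }
    END  { if (name != "") print name "\t" len }
' demo.fa)
echo "$seq_lengths"
```

The same two commands can be pointed at a real reference such as genome.fa, although the length summary will take longer on a full genome.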

    - .bed -



    + .gtf +


    genes.gtf: a genomic interval file that references the genome. This file describes the genes and transcripts. It is required for counting where reads are mapped, and often carries a lot more annotation data about each gene/transcript.
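
To make the GTF layout concrete, here is a small sketch on a made-up file (`demo.gtf`; the gene IDs and names are invented). It assumes the usual nine tab-separated GTF columns, with the feature type in column 3 and the attributes in column 9.

```bash
# Write a tiny, made-up GTF file (9 tab-separated columns per line).
printf 'chrA\tdemo\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_name "ALPHA";\n' > demo.gtf
printf 'chrA\tdemo\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n' >> demo.gtf
printf 'chrA\tdemo\texon\t100\t300\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n' >> demo.gtf
printf 'chrB\tdemo\tgene\t50\t400\t.\t-\t.\tgene_id "G2"; gene_name "BETA";\n' >> demo.gtf

# Column 3 is the feature type; count how many of each are present.
cut -f 3 demo.gtf | sort | uniq -c

# Count gene records only.
n_genes=$(awk -F '\t' '$3 == "gene"' demo.gtf | wc -l)
echo "genes: $n_genes"

# Pull the gene_id out of the attribute column (column 9) for gene records.
gene_ids=$(awk -F '\t' '$3 == "gene" { split($9, a, "\""); print a[2] }' demo.gtf)
echo "$gene_ids"
```

Real annotation files usually start with `#` comment lines; these have no third column, so the `$3 == "gene"` filters skip them.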


    - Genomic file formats +

    + .bed

    - +



    bam -



    - bigwig +

    + Walkthrough of nf-core/RNAseq outputs


    The MultiQC summary is a real strength of nf-core pipelines.
    +A lot of the key analyses are captured in it.

    link to multiqc report

    Overview of the key processes run with nf-core/RNAseq.
    +- identifying steps for troubleshooting a run
    +- far more than just mapping (star) and counting (RSEM)
    +- demonstrating the need for a workflow manager
    + + + +

    + Preprocessing +

    1. cat - Merge re-sequenced FastQ files
    2. FastQC - Raw read QC

    link to fastqc.html

    3. UMI-tools extract - UMI barcode extraction
    4. TrimGalore - Adapter and quality trimming
    5. BBSplit - Removal of genome contaminants
    6. SortMeRNA - Removal of ribosomal RNA
    + + + +

    + Alignment and quantification +

    1. STAR and Salmon - Fast spliced aware genome alignment and transcriptome quantification
    2. STAR via RSEM - Alignment and quantification of expression levels
    3. HISAT2 - Memory efficient splice aware alignment to a reference
    + + + +

    + Alignment post-processing +

    1. SAMtools - Sort and index alignments
    2. UMI-tools dedup - UMI-based deduplication
    3. picard MarkDuplicates - Duplicate read marking
    + + + +

    + Other steps +

    1. StringTie - Transcript assembly and quantification
    2. BEDTools and bedGraphToBigWig - Create bigWig coverage files
    + + + +

    + Quality control +

    1. RSeQC - Various RNA-seq QC metrics
    2. Qualimap - Various RNA-seq QC metrics
    3. dupRadar - Assessment of technical / biological read duplication
    4. Preseq - Estimation of library complexity
    5. featureCounts - Read counting relative to gene biotype
    6. DESeq2 - PCA plot and sample pairwise distance heatmap and dendrogram
    7. MultiQC - Present QC for raw reads, alignment, read counting and sample similarity
    + + + +

    + Pseudoalignment and quantification +

    1. Salmon - Wicked fast gene and isoform quantification relative to the transcriptome
    2. Kallisto - Near-optimal probabilistic RNA-seq quantification
    3. Workflow reporting and genomes
    4. Reference genome files - Saving reference genome indices/files
    5. Pipeline information - Report metrics generated during the workflow execution

    link to execution report

    + + + diff --git a/public/setup_environment/index.html b/public/setup_environment/index.html index a2ca0c4..395cdfa 100644 --- a/public/setup_environment/index.html +++ b/public/setup_environment/index.html @@ -463,7 +463,8 @@

    rm samtools-1.21.tar.bz2
    cd samtools-1.21/
    make
    -make install
    +make install
    +export PATH=$PATH:/home/workshop/bin/bin

    link to RNAseq Background slides