diff --git a/content/DifferentialAbundance.md b/content/DifferentialAbundance.md index 4b3ea4b..c99b6f6 100644 --- a/content/DifferentialAbundance.md +++ b/content/DifferentialAbundance.md @@ -4,7 +4,50 @@ title = 'nf-core/DifferentialAbundance' ![DifferentialAbundance_pipeline](DifferentialAbundance_pipeline.png) -#### nf-core/DifferentialAbundance +nf-core/differentialabundance is a bioinformatics pipeline that can be used to analyse data represented as matrices, comparing groups of observations to generate differential statistics and downstream analyses. The pipeline supports RNA-seq data such as that generated by the nf-core rnaseq workflow, and Affymetrix arrays via .CEL files. Other types of matrix may also work with appropriate changes to parameters, and PRs to support additional specific modalities are welcomed. + +1. Optionally generate a list of genomic feature annotations using the input GTF file (if a table is not explicitly supplied). +2. Cross-check matrices, sample annotations, feature set and contrasts to ensure consistency. +3. Run differential analysis over all contrasts specified. +4. Optionally run a differential gene set analysis. +5. Generate exploratory and differential analysis plots for interpretation. +6. Optionally build and (if specified) deploy a Shiny app for fully interactive mining of results. +7. Build an HTML report based on R markdown, with interactive plots (where possible) and tables. + +# Set-up run +***for those with more confidence*** this run can be prepared from scratch in a new directory. 
+- the pipeline will take ~40 min to run +- for this reason there is a completed run in ~/workshop/nfDifferentialAbundance/, where you can go through the motions +- a good chance to demonstrate the **resume** feature of nextflow + + +```bash +mkdir ~/workshop/nfDifferentialAbundance2 +cd ~/workshop/nfDifferentialAbundance2 +ls -l +``` +***optional*** you can use the nf-core launch command to build a launch command, but these instructions will be using a 'nextflow run' command + +The minimal input requirements are +1. Sample sheet +- containing the sample information, metadata and group relationships + +2 + + + + +nf-core launch +``` + + + + + + + +#### +nf-core/DifferentialAbundance There are a few ways of installing the nf-core pipeline. But this happens automatically when you use a *nextflow run nf-core/* command. You can check the ~/.nextflow/assets folders to see what is already installed diff --git a/content/nfRNAseq.md b/content/nfRNAseq.md index afcd136..1000c69 100644 --- a/content/nfRNAseq.md +++ b/content/nfRNAseq.md @@ -13,8 +13,8 @@ nf-core list -s stars let's use nf-core tools to build a command to run the nf-core/RNAseq pipeline ```bash -mkdir ~/workshop/nfcore -cd ~/workshop/nfcore +mkdir -p ~/workshop/nfRNAseq +cd ~/workshop/nfRNAseq nf-core launch -h ``` @@ -60,7 +60,7 @@ nf-core launch --id 1728482936_6064841c2138 ``` - alternatively you can use a nextflow run command -- no need to run this command, in the interest of time (and the lack of disk space on your intance), I've pre-prepared the outputs for this run. We will run the next pipeline to completion. 
+- no need to run this command; in the interest of time (and the lack of disk space on your instance), I've pre-prepared the outputs for this run navigate to run directory to see the nextflow run command ```bash @@ -84,81 +84,97 @@ The dataset used throughout this workshop is as follow: - treatment vs control - 4 replicates for each ``` -- the information is reflected in the samplesheet and run command - -# Results - - - - 1. Merge re-sequenced FastQ files (cat) - 2. Sub-sample FastQ files and auto-infer strandedness (fq, Salmon) - 3. Read QC (FastQC) - 4. UMI extraction (UMI-tools) - 5. Adapter and quality trimming (Trim Galore!) - 6. Removal of genome contaminants (BBSplit) - 7. Removal of ribosomal RNA (SortMeRNA) - 8. Choice of multiple alignment and quantification routes: - 1. STAR -> Salmon - 2. STAR -> RSEM - 3. HiSAT2 -> NO QUANTIFICATION - 9. Sort and index alignments (SAMtools) - 10. UMI-based deduplication (UMI-tools) - 11. Duplicate read marking (picard MarkDuplicates) - 12. Transcript assembly and quantification (StringTie) - 13. Create bigWig coverage files (BEDTools, bedGraphToBigWig) - 14. Extensive quality control: - 1. RSeQC - 2. Qualimap - 3. dupRadar - 4. Preseq - 5. DESeq2 - 15. Pseudoalignment and quantification (Salmon or ‘Kallisto’; optional) - 16. Present QC for raw read, alignment, gene biotype, sample similarity, and strand-specificity checks (MultiQC, R) - -In this section we will work through setting up the nf-core/RNAseq pipeline. We will be chosing specific parameters and finally, going through was the output looks like. - -# setting up nf-core/RNAseq +- most of this information is reflected in the samplesheet and run command -#### dataset -The dataset used throughout this workshop is as follow: +#### Run Summary +List the results directory. 
All nf-core outputs have a consistent structure. +```bash +tree /home/workshop/workshop/nfRNAseq/outs ``` -- 16 Samples sequenced with an MGI400 sequencer at SAGC using the Tecan Universal RNA-seq library protocol. -- 2 different cancer cell lines (human) -- treatment vs control -- 4 replicates for each -``` -***NOTE*** some concessions had to be made to work with this for workshop. Taking into account the large file sizes, the long run times and need for high compute resources. -#### what are the inputs to an RNAseq pipeline +[link to execution timeline](../execution_timeline_2024-10-05_16-02-39.html) -# .fastq /.fastq.gz -``` +[link to execution report](../execution_report_2024-10-05_16-02-39.html) + +# Genomics Files Background +Quick background on genomic file formats to help describe the inputs and outputs of an NGS experiment. +Knowing what these files are isn't only important for finding which files to use for a pipeline; it is also a key foundation of genomic bioinformatics. +Being able to use and manipulate each file opens up many opportunities, and is often required for troubleshooting. +Many are plain text files, which means they can be manipulated with basic text editing. + +##### .fastq /.fastq.gz -``` ![](../fastqfile.png) -# .fasta +#### .fasta +The genome.fa file is a plain text representation of the genome sequence. This is the 'reference' to which the sequencing reads (.fastq) are aligned. -# .gtf +```bash +head ./smallRNA/outs/mirtrace/[:]/mirtrace/qc_passed_reads.rnatype_unknown.collapsed/Acontrol1.fastp.fasta +``` +#### .gtf +genes.gtf: a genomic interval file referencing the genome. This file describes the genes/transcripts. It's required for counting where reads are mapped, and often carries a lot more annotation data about each gene/transcript. -# .bed +#### .bed ![](../bedfile.png) -#### Genomic file formats -- now that we have gone through how to run the nf-core/RNAseq pipeline. Let's look at the inputs and outputs in detail. 
-- using this chance to describe what are the key genomic file formats used in RNAseq and beyond -- Knowing what these files are isn't only important in finding which files to use for a pipeline, but a key foundation of genomic bioinformati$ -- being able to use and manipulate each file open's up many opportunities, and is often required for troubleshooting wehn something has gone w$ -- many files are plain text files, this means they can be manipluated with basic text editing. I'll be going through some examples. -- For a more visual perspective, we'll also be using a genome browser, IGV (Integrative Genome Browser) to get a feel for what information eac$ - -# bam +#### bam ![](../bamfile.png) -# bigwig - +# Walkthrough of nf-core/RNAseq outputs +The MultiQC summary is a real strength of nf-core pipelines. +A lot of the key analyses are captured [link to multiqc report](../multiqc_report.html) + +``` +Overview of the key processes run with nf-core/RNAseq. +- identifying steps for troubleshooting a run. +- far more than just mapping (star) and counting (RSEM) +- demonstrating the need for a workflow manager +``` +### Preprocessing +1. cat - Merge re-sequenced FastQ files +2. FastQC - Raw read QC + +[link to fastqc.html](../Acontrol1_1_fastqc.html) + +3. UMI-tools extract - UMI barcode extraction +4. TrimGalore - Adapter and quality trimming +5. BBSplit - Removal of genome contaminants +6. SortMeRNA - Removal of ribosomal RNA + +### Alignment and quantification +1. STAR and Salmon - Fast spliced aware genome alignment and transcriptome quantification +2. STAR via RSEM - Alignment and quantification of expression levels +3. HISAT2 - Memory efficient splice aware alignment to a reference + +### Alignment post-processing +1. SAMtools - Sort and index alignments +2. UMI-tools dedup - UMI-based deduplication +3. picard MarkDuplicates - Duplicate read marking + +### Other steps +1. StringTie - Transcript assembly and quantification +2. 
BEDTools and bedGraphToBigWig - Create bigWig coverage files + +### Quality control +1. RSeQC - Various RNA-seq QC metrics +2. Qualimap - Various RNA-seq QC metrics +3. dupRadar - Assessment of technical / biological read duplication +4. Preseq - Estimation of library complexity +5. featureCounts - Read counting relative to gene biotype +6. DESeq2 - PCA plot and sample pairwise distance heatmap and dendrogram +7. MultiQC - Present QC for raw reads, alignment, read counting and sample similarity + +### Pseudoalignment and quantification +1. Salmon - Wicked fast gene and isoform quantification relative to the transcriptome +2. Kallisto - Near-optimal probabilistic RNA-seq quantification +3. Workflow reporting and genomes +4. Reference genome files - Saving reference genome indices/files +5. Pipeline information - Report metrics generated during the workflow execution + +[link to execution report](../execution_report_2024-10-05_16-02-39.html) diff --git a/content/setup_environment.md b/content/setup_environment.md index d67d860..70d0f18 100644 --- a/content/setup_environment.md +++ b/content/setup_environment.md @@ -245,6 +245,7 @@ rm samtools-1.21.tar.bz2 cd samtools-1.21/ make make install +export PATH=$PATH:/home/workshop/bin/bin ``` [link to RNAseq Background slides](../Workshop_RNAseq_Intro.pdf) diff --git a/public/differentialabundance/index.html b/public/differentialabundance/index.html index 35a154e..20856d2 100644 --- a/public/differentialabundance/index.html +++ b/public/differentialabundance/index.html @@ -12,43 +12,23 @@ +for those with more confidence this run can be prepared from scratch in a new directory."> @@ -154,23 +134,66 @@

DifferentialAbundance_pipeline

+

nf-core/differentialabundance is a bioinformatics pipeline that can be used to analyse data represented as matrices, comparing groups of observations to generate differential statistics and downstream analyses. The pipeline supports RNA-seq data such as that generated by the nf-core rnaseq workflow, and Affymetrix arrays via .CEL files. Other types of matrix may also work with appropriate changes to parameters, and PRs to support additional specific modalities are welcomed.

+
    +
  1. Optionally generate a list of genomic feature annotations using the input GTF file (if a table is not explicitly supplied).
  2. +
  3. Cross-check matrices, sample annotations, feature set and contrasts to ensure consistency.
  4. +
  5. Run differential analysis over all contrasts specified.
  6. +
  7. Optionally run a differential gene set analysis.
  8. +
  9. Generate exploratory and differential analysis plots for interpretation.
  10. +
  11. Optionally build and (if specified) deploy a Shiny app for fully interactive mining of results.
  12. +
  13. Build an HTML report based on R markdown, with interactive plots (where possible) and tables.
  14. +
+ +
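The consistency cross-check in the step list above can be pictured with a toy example. This is only a sketch of the idea, not how nf-core/differentialabundance implements it: the file names, column layout and sample IDs below are made up for illustration.

```bash
# Toy inputs written to a temporary directory (hypothetical file names and IDs)
work=$(mktemp -d)
printf 'sample,condition\nA1,control\nA2,control\nB1,treated\n' > "$work/samplesheet.csv"
printf 'gene_id\tA1\tA2\tB1\nG1\t10\t12\t30\n' > "$work/matrix.tsv"

# Sample IDs from the first column of the sample sheet (header row skipped)
tail -n +2 "$work/samplesheet.csv" | cut -d, -f1 | sort > "$work/sheet_ids.txt"

# Sample IDs from the matrix header (the first column is the feature ID)
head -n 1 "$work/matrix.tsv" | tr '\t' '\n' | tail -n +2 | sort > "$work/matrix_ids.txt"

# comm -3 prints IDs found in only one of the two sorted lists;
# empty output means the sample sheet and the matrix agree
mismatches=$(comm -3 "$work/sheet_ids.txt" "$work/matrix_ids.txt")
if [ -z "$mismatches" ]; then
    echo "sample IDs are consistent"
else
    echo "inconsistent sample IDs:"
    echo "$mismatches"
fi
```

Renaming one sample in either toy file makes the check report the mismatch, which is the kind of problem worth catching before a long run.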

+ Set-up run +

+

for those with more confidence this run can be prepared from scratch in a new directory.

+ + + -

- nf-core/DifferentialAbundance -

-

There are a few ways of installing the nf-core pipeline. But this happens automatically when you use a nexflow run nf-core/ commands. -You can check the ~/.nextflow/assets folders to see what is already installed

+
mkdir ~/workshop/nfDifferentialAbundance2
+cd ~/workshop/nfDifferentialAbundance2
+ls -l
+

optional you can use the nf-core launch command to build a launch command, but these instructions will be using a ’nextflow run’ command

+

The minimal input requirements are

+
    +
  1. Sample sheet
  2. +
+ +

2

+

nf-core launch

-
ls -l ~/.nextflow/assets/nf-core/
-

If you don’t see the pipeline, you can pull it from the nf-core website.

+

+
+
+
+
+
+
+#### 
+nf-core/DifferentialAbundance
+
+There are a few ways of installing the nf-core pipeline. But this happens automatically when you use a *nextflow run nf-core/* command.
+You can check the ~/.nextflow/assets folders to see what is already installed
+

ls -l ~/.nextflow/assets/nf-core/

-
nextflow pull nf-core/differentialabundance
+
If you don't see the pipeline, you can pull it from the nf-core website. 
+```bash
+nextflow pull nf-core/differentialabundance

but this happens automatically when running nextflow run nf-core/differentialabundance
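A minimal sketch of checking the assets folder before deciding whether a pull is needed. The function name `pipeline_cached` and the `ASSETS_DIR` override are hypothetical helpers introduced for illustration; only the `~/.nextflow/assets/nf-core/` location comes from the text above.

```bash
# Hypothetical helper: succeed if a pipeline is already cached locally.
# ASSETS_DIR is an assumed override for testing; by default the
# ~/.nextflow/assets/nf-core folder mentioned above is used.
pipeline_cached() {
    assets_dir="${ASSETS_DIR:-$HOME/.nextflow/assets/nf-core}"
    [ -d "$assets_dir/$1" ]
}

if pipeline_cached differentialabundance; then
    echo "differentialabundance is already cached"
else
    echo "not cached yet; nextflow pull or nextflow run will fetch it"
fi
```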

nfpull

diff --git a/public/execution_report_2024-10-05_16-02-39.html b/public/execution_report_2024-10-05_16-02-39.html new file mode 100644 index 0000000..98be74c --- /dev/null +++ b/public/execution_report_2024-10-05_16-02-39.html @@ -0,0 +1,1041 @@ + + + + + + + + + + + [fabulous_blackwell] Nextflow Workflow Report + + + + + + + +
+
+ +

Nextflow workflow report

+

[fabulous_blackwell]

+ + +
+ Workflow execution completed successfully! +
+ + +
+
Run times
+
+ 05-Oct-2024 16:02:40 - 06-Oct-2024 05:08:34 + (duration: 12h 5m 54s) +
+ +
+
+
  511 succeeded  
+
  0 cached  
+
  0 ignored  
+
  6 failed  
+
+
+ +
Nextflow command
+
nextflow run nf-core/rnaseq -profile sahmri -c /cancer/storage/SAGC/workshop/Workshop_data/nfRNAseq/nextflow.config -r 3.14.0 --input /cancer/storage/SAGC/workshop/Workshop_data/nfRNAseq/nfSampleSheet.csv --outdir /cancer/storage/SAGC/workshop/Workshop_data/nfRNAseq/outs --fasta /homes/daniel.thomson/References/GRCh38/Ensembl_download/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa --gtf /homes/daniel.thomson/References/GRCh38/Ensembl_download/Homo_sapiens.GRCh38.111.gtf --skip_dupradar --skip_qualimap -resume
+
+ +
+
CPU-Hours
+
730.1 (33.9% failed)
+ +
Launch directory
+
/hpc/capacity/SAGC/workshop/Workshop_data/nfRNAseq
+ +
Work directory
+
/hpc/capacity/SAGC/workshop/Workshop_data/nfRNAseq/work
+ +
Project directory
+
/homes/daniel.thomson/.nextflow/assets/nf-core/rnaseq
+ + +
Script name
+
main.nf
+ + + +
Script ID
+
746820de9bf04c0ec23e1e0295aaf91b
+ + +
Workflow session
+
b8dc6ea0-b938-4179-ba77-9e115af63492
+ + +
Workflow repository
+
https://github.com/nf-core/rnaseq, revision 3.14.0 (commit hash b89fac32650aacc86fcda9ee77e00612a1d77066)
+ + +
Workflow profile
+
sahmri
+ + + +
Nextflow version
+
version 23.10.0, build 5889 (15-10-2023 15:07 UTC)
+
+
+
+ +
+

Resource Usage

+

These plots give an overview of the distribution of resource usage for each process.

+ +

CPU

+ +
+
+
+
+
+
+
+ +
+ +

Memory

+ +
+
+
+
+
+
+
+
+
+
+
+ +

Job Duration

+ +
+
+
+
+
+
+
+
+ +

I/O

+ +
+
+
+
+
+
+
+
+
+ +
+
+

Tasks

+

This table shows information about each task in the workflow. Use the search box on the right + to filter rows for specific values. Clicking headers will sort the table by that value and + scrolling side to side will reveal more columns.

+
+ + +
+
+
+
+
+ +
+ (tasks table omitted because the dataset is too big) +
+
+ + + + + + + diff --git a/public/nfrnaseq/index.html b/public/nfrnaseq/index.html index 428304e..77b8199 100644 --- a/public/nfrnaseq/index.html +++ b/public/nfrnaseq/index.html @@ -22,8 +22,8 @@ -mkdir ~/workshop/nfcore -cd ~/workshop/nfcore +mkdir -p ~/workshop/nfRNAseq +cd ~/workshop/nfRNAseq nf-core launch -h Launch a pipeline using a web GUI or command line prompts. @@ -145,8 +145,8 @@

-
mkdir ~/workshop/nfcore
-cd ~/workshop/nfcore  
+
mkdir -p  ~/workshop/nfRNAseq
+cd ~/workshop/nfRNAseq 
 
 nf-core launch -h

Launch a pipeline using a web GUI or command line prompts.
@@ -193,7 +193,7 @@

alternatively you can use a nextflow run command

  • -

    no need to run this command, in the interest of time (and the lack of disk space on your intance), I’ve pre-prepared the outputs for this run. We will run the next pipeline to completion.

    +

    no need to run this command; in the interest of time (and the lack of disk space on your instance), I’ve pre-prepared the outputs for this run

  • navigate to run directory to see the nextflow run command

    @@ -224,128 +224,167 @@

    - treatment vs control - 4 replicates for each

    -

    - Results -

    -
    1. Merge re-sequenced FastQ files (cat)
    -2. Sub-sample FastQ files and auto-infer strandedness (fq, Salmon)
    -3. Read QC (FastQC)
    -4. UMI extraction (UMI-tools)
    -5. Adapter and quality trimming (Trim Galore!)
    -6. Removal of genome contaminants (BBSplit)
    -7. Removal of ribosomal RNA (SortMeRNA)
    -8. Choice of multiple alignment and quantification routes:
    -	1. STAR -> Salmon
    -	2. STAR -> RSEM
    -	3. HiSAT2 -> NO QUANTIFICATION
    -9. Sort and index alignments (SAMtools)
    -10. UMI-based deduplication (UMI-tools)
    -11. Duplicate read marking (picard MarkDuplicates)
    -12. Transcript assembly and quantification (StringTie)
    -13. Create bigWig coverage files (BEDTools, bedGraphToBigWig)
    -14. Extensive quality control:
    -	1. RSeQC
    -	2. Qualimap
    -	3. dupRadar
    -	4. Preseq
    -	5. DESeq2
    -15. Pseudoalignment and quantification (Salmon or ‘Kallisto’; optional)
    -16. Present QC for raw read, alignment, gene biotype, sample similarity, and strand-specificity checks (MultiQC, R)
    -
    -

    In this section we will work through setting up the nf-core/RNAseq pipeline. We will be chosing specific parameters and finally, going through was the output looks like.

    - - - -

    - setting up nf-core/RNAseq -

    - - - -

    - dataset +

    + Run Summary

    -

    The dataset used throughout this workshop is as follow:

    +

    List the results directory. All nf-core outputs have a consistent structure.

    -
    - 16 Samples sequenced with an MGI400 sequencer at SAGC using the Tecan Universal RNA-seq library protocol.
    -- 2 different cancer cell lines (human)
    -- treatment vs control
    -- 4 replicates for each
    -

    NOTE some concessions had to be made to work with this for workshop. Taking into account the large file sizes, the long run times and need for high compute resources.

    +
    tree /home/workshop/workshop/nfRNAseq/outs
    +

    link to execution timeline

    +

    link to execution report

    -

    - what are the inputs to an RNAseq pipeline -

    +

    + Genomics Files Background +

    +

    Quick background on genomic file formats to help describe the inputs and outputs of an NGS experiment. +Knowing what these files are isn’t only important for finding which files to use for a pipeline; it is also a key foundation of genomic bioinformatics. +Being able to use and manipulate each file opens up many opportunities, and is often required for troubleshooting.
    +Many are plain text files, which means they can be manipulated with basic text editing.

    -

    +

    .fastq /.fastq.gz -
    - - - -
    +

    -

    +

    .fasta -

    - + +

    The genome.fa file is a plain text representation of the genome sequence. This is the ‘reference’ to which the sequencing reads (.fastq) are aligned.

    + -

    - .gtf -

    +
    head ./smallRNA/outs/mirtrace/[:]/mirtrace/qc_passed_reads.rnatype_unknown.collapsed/Acontrol1.fastp.fasta
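Because FASTA is plain text, standard tools are enough to summarise it. The sketch below uses a made-up two-sequence file rather than the workshop data, and assumes nothing beyond the usual layout of `>` header lines followed by sequence lines.

```bash
# Toy FASTA written to a temporary file (made-up sequences, for illustration only)
fa=$(mktemp)
printf '>chrA\nACGTAC\nGT\n>chrB\nNNNNACGT\n' > "$fa"

# Header lines start with '>', so counting them counts the sequences
n_seqs=$(grep -c '^>' "$fa")

# Summing the length of every non-header line gives the total number of bases
n_bases=$(awk '!/^>/ { total += length($0) } END { print total }' "$fa")

echo "sequences: $n_seqs, bases: $n_bases"
rm -f "$fa"
```

The same two commands can be pointed at a real genome.fa, though on a full genome they take noticeably longer.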
    -

    - .bed -

    -

    +

    + .gtf +

    +

    genes.gtf: a genomic interval file referencing the genome. This file describes the genes/transcripts. It’s required for counting where reads are mapped, and often carries a lot more annotation data about each gene/transcript.
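A GTF is tab-separated, with the feature type in column 3 and the annotation attributes in column 9, so it can be explored with basic text tools. The records below are invented for illustration; a real genes.gtf has the same column layout but many more rows and attributes.

```bash
# Toy GTF written to a temporary file (made-up records, for illustration only)
gtf=$(mktemp)
printf '#!genome-build toy\n' > "$gtf"
printf 'chr1\ttoy\tgene\t100\t500\t.\t+\t.\tgene_id "G1"; gene_name "ABC";\n' >> "$gtf"
printf 'chr1\ttoy\ttranscript\t100\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n' >> "$gtf"
printf 'chr1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n' >> "$gtf"
printf 'chr2\ttoy\tgene\t50\t90\t.\t-\t.\tgene_id "G2"; gene_name "XYZ";\n' >> "$gtf"

# Tally the feature types in column 3, skipping '#' comment lines
grep -v '^#' "$gtf" | cut -f3 | sort | uniq -c

# Count only the gene records (tr strips padding that some wc versions add)
n_genes=$(awk -F'\t' '$3 == "gene"' "$gtf" | wc -l | tr -d ' ')
echo "genes: $n_genes"
```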

    -

    - Genomic file formats +

    + .bed

    - +

    -

    +

    bam -

    +

    -

    - bigwig +

    + Walkthrough of nf-core/RNAseq outputs

    +

    The MultiQC summary is a real strength of nf-core pipelines. +A lot of the key analyses are captured

    link to multiqc report

    +
    Overview of the key processes run with nf-core/RNAseq.
    +- identifying steps for troubleshooting a run.
    +- far more than just mapping (star) and counting (RSEM)
    +- demonstrating the need for a workflow manager
    + + + +

    + Preprocessing +

    +
      +
    1. cat - Merge re-sequenced FastQ files
    2. +
    3. FastQC - Raw read QC
    4. +
    +

    link to fastqc.html

    +
      +
    1. UMI-tools extract - UMI barcode extraction
    2. +
    3. TrimGalore - Adapter and quality trimming
    4. +
    5. BBSplit - Removal of genome contaminants
    6. +
    7. SortMeRNA - Removal of ribosomal RNA
    8. +
    + + + +

    + Alignment and quantification +

    +
      +
    1. STAR and Salmon - Fast spliced aware genome alignment and transcriptome quantification
    2. +
    3. STAR via RSEM - Alignment and quantification of expression levels
    4. +
    5. HISAT2 - Memory efficient splice aware alignment to a reference
    6. +
    + + + +

    + Alignment post-processing +

    +
      +
    1. SAMtools - Sort and index alignments
    2. +
    3. UMI-tools dedup - UMI-based deduplication
    4. +
    5. picard MarkDuplicates - Duplicate read marking
    6. +
    + + + +

    + Other steps +

    +
      +
    1. StringTie - Transcript assembly and quantification
    2. +
    3. BEDTools and bedGraphToBigWig - Create bigWig coverage files
    4. +
    + + + +

    + Quality control +

    +
      +
    1. RSeQC - Various RNA-seq QC metrics
    2. +
    3. Qualimap - Various RNA-seq QC metrics
    4. +
    5. dupRadar - Assessment of technical / biological read duplication
    6. +
    7. Preseq - Estimation of library complexity
    8. +
    9. featureCounts - Read counting relative to gene biotype
    10. +
    11. DESeq2 - PCA plot and sample pairwise distance heatmap and dendrogram +7 MultiQC - Present QC for raw reads, alignment, read counting and sample similarity
    12. +
    + + + +

    + Pseudoalignment and quantification +

    +
      +
    1. Salmon - Wicked fast gene and isoform quantification relative to the transcriptome
    2. +
    3. Kallisto - Near-optimal probabilistic RNA-seq quantification
    4. +
    5. Workflow reporting and genomes
    6. +
    7. Reference genome files - Saving reference genome indices/files
    8. +
    9. Pipeline information - Report metrics generated during the workflow execution
    10. +
    +

    link to execution report

    + + + diff --git a/public/setup_environment/index.html b/public/setup_environment/index.html index a2ca0c4..395cdfa 100644 --- a/public/setup_environment/index.html +++ b/public/setup_environment/index.html @@ -463,7 +463,8 @@

    rm samtools-1.21.tar.bz2 cd samtools-1.21/ make -make install +make install +export PATH=$PATH:/home/workshop/bin/bin

    link to RNAseq Background slides