diff --git a/content/DifferentialAbundance.md b/content/DifferentialAbundance.md index 4b3ea4b..c99b6f6 100644 --- a/content/DifferentialAbundance.md +++ b/content/DifferentialAbundance.md @@ -4,7 +4,50 @@ title = 'nf-core/DifferentialAbundance'  -#### nf-core/DifferentialAbundance +nf-core/differentialabundance is a bioinformatics pipeline that can be used to analyse data represented as matrices, comparing groups of observations to generate differential statistics and downstream analyses. The pipeline supports RNA-seq data such as that generated by the nf-core rnaseq workflow, and Affymetrix arrays via .CEL files. Other types of matrix may also work with appropriate changes to parameters, and PRs to support additional specific modalities are welcomed. + +1. Optionally generate a list of genomic feature annotations using the input GTF file (if a table is not explicitly supplied). +2. Cross-check matrices, sample annotations, feature set and contrasts to ensure consistency. +3. Run differential analysis over all contrasts specified. +4. Optionally run a differential gene set analysis. +5. Generate exploratory and differential analysis plots for interpretation. +6. Optionally build and (if specified) deploy a Shiny app for fully interactive mining of results. +7. Build an HTML report based on R markdown, with interactive plots (where possible) and tables. + +# Set-up run +***for those with more confidence*** this run can be prepared from scratch in a new directory. 
+- the pipeline will take ~40 mins to run +- for this reason there is a completed run in ~/workshop/nfDifferentialAbundance/, where you can go through the motions +- a good chance to demonstrate the **resume** feature of nextflow + + +```bash +mkdir ~/workshop/nfDifferentialAbundance2 +cd ~/workshop/nfDifferentialAbundance2 +ls -l +``` +***optional*** you can use the nf-core launch command to build a launch command, but these instructions will be using a 'nextflow run' command + +The minimal input requirements are +1. Sample sheet +- containing the sample information, metadata and group relationships + +2 + + + + +nf-core launch +``` + + + + + + + +#### +nf-core/DifferentialAbundance There are a few ways of installing the nf-core pipeline. But this happens automatically when you use a *nextflow run nf-core/* command. You can check the ~/.nextflow/assets folders to see what is already installed diff --git a/content/nfRNAseq.md b/content/nfRNAseq.md index afcd136..1000c69 100644 --- a/content/nfRNAseq.md +++ b/content/nfRNAseq.md @@ -13,8 +13,8 @@ nf-core list -s stars lets use nf-core tools to build a command to run the nf-core/RNAseq pipeline ```bash -mkdir ~/workshop/nfcore -cd ~/workshop/nfcore +mkdir -p ~/workshop/nfRNAseq +cd ~/workshop/nfRNAseq nf-core launch -h ``` @@ -60,7 +60,7 @@ nf-core launch --id 1728482936_6064841c2138 ``` - alternatively you can use a nextflow run command -- no need to run this command, in the interest of time (and the lack of disk space on your intance), I've pre-prepared the outputs for this run. We will run the next pipeline to completion. 
+- no need to run this command, in the interest of time (and the lack of disk space on your instance), I've pre-prepared the outputs for this run navigate to run directory to see the nextflow run command ```bash @@ -84,81 +84,97 @@ The dataset used throughout this workshop is as follow: - treatment vs control - 4 replicates for each ``` -- the information is reflected in the samplesheet and run command - -# Results - - - - 1. Merge re-sequenced FastQ files (cat) - 2. Sub-sample FastQ files and auto-infer strandedness (fq, Salmon) - 3. Read QC (FastQC) - 4. UMI extraction (UMI-tools) - 5. Adapter and quality trimming (Trim Galore!) - 6. Removal of genome contaminants (BBSplit) - 7. Removal of ribosomal RNA (SortMeRNA) - 8. Choice of multiple alignment and quantification routes: - 1. STAR -> Salmon - 2. STAR -> RSEM - 3. HiSAT2 -> NO QUANTIFICATION - 9. Sort and index alignments (SAMtools) - 10. UMI-based deduplication (UMI-tools) - 11. Duplicate read marking (picard MarkDuplicates) - 12. Transcript assembly and quantification (StringTie) - 13. Create bigWig coverage files (BEDTools, bedGraphToBigWig) - 14. Extensive quality control: - 1. RSeQC - 2. Qualimap - 3. dupRadar - 4. Preseq - 5. DESeq2 - 15. Pseudoalignment and quantification (Salmon or ‘Kallisto’; optional) - 16. Present QC for raw read, alignment, gene biotype, sample similarity, and strand-specificity checks (MultiQC, R) - -In this section we will work through setting up the nf-core/RNAseq pipeline. We will be chosing specific parameters and finally, going through was the output looks like. - -# setting up nf-core/RNAseq +- most of this information is reflected in the samplesheet and run command -#### dataset -The dataset used throughout this workshop is as follow: +#### Run Summary +list the results directory. 
All nf-core outputs have a consistent structure +```bash +tree /home/workshop/workshop/nfRNAseq/outs ``` -- 16 Samples sequenced with an MGI400 sequencer at SAGC using the Tecan Universal RNA-seq library protocol. -- 2 different cancer cell lines (human) -- treatment vs control -- 4 replicates for each -``` -***NOTE*** some concessions had to be made to work with this for workshop. Taking into account the large file sizes, the long run times and need for high compute resources. -#### what are the inputs to an RNAseq pipeline +[link to execution timeline](../execution_timeline_2024-10-05_16-02-39.html) -# .fastq /.fastq.gz -``` +[link to execution report](../execution_report_2024-10-05_16-02-39.html) + +# Genomics Files Background +Quick background on genomic file formats to help describe the inputs and outputs of an NGS experiment. +Knowing what these files are isn't only important for finding which files to use for a pipeline, it is also a key foundation of genomic bioinformatics. +Being able to use and manipulate each file opens up many opportunities, and is often required for troubleshooting. +Many are plain text files, which means they can be manipulated with basic text editing. + +#### .fastq /.fastq.gz -```  -# .fasta +#### .fasta +the genome.fa file is a plain text representation of the genome sequence. This is the 'reference' to which the sequencing files (.fastq) are aligned. -# .gtf +```bash +head ./smallRNA/outs/mirtrace/[:]/mirtrace/qc_passed_reads.rnatype_unknown.collapsed/Acontrol1.fastp.fasta +``` +#### .gtf +genes.gtf: a genomic interval file referencing the genome. This file describes the genes/transcripts. It's required for counting where reads are mapped to, and often has a lot more annotation data about each gene/transcript. -# .bed +#### .bed  -#### Genomic file formats -- now that we have gone through how to run the nf-core/RNAseq pipeline. Let's look at the inputs and outputs in detail. 
-- using this chance to describe what are the key genomic file formats used in RNAseq and beyond -- Knowing what these files are isn't only important in finding which files to use for a pipeline, but a key foundation of genomic bioinformati$ -- being able to use and manipulate each file open's up many opportunities, and is often required for troubleshooting wehn something has gone w$ -- many files are plain text files, this means they can be manipluated with basic text editing. I'll be going through some examples. -- For a more visual perspective, we'll also be using a genome browser, IGV (Integrative Genome Browser) to get a feel for what information eac$ - -# bam +#### bam  -# bigwig - +# Walkthrough of nf-core/RNAseq outputs +the multiqc summary is a real strength of nf-core pipelines. +a lot of the key analyses are captured [link to multiqc report](../multiqc_report.html) + +``` +Overview of the key processes run with nf-core/RNAseq. +- identifying steps for troubleshooting the success of a run. +- far more than just mapping (star) and counting (RSEM) +- demonstrating the need for a workflow manager +``` +### Preprocessing +1. cat - Merge re-sequenced FastQ files +2. FastQC - Raw read QC + +[link to fastqc.html](../Acontrol1_1_fastqc.html) + +3. UMI-tools extract - UMI barcode extraction +4. TrimGalore - Adapter and quality trimming +5. BBSplit - Removal of genome contaminants +6. SortMeRNA - Removal of ribosomal RNA + +### Alignment and quantification +1. STAR and Salmon - Fast spliced aware genome alignment and transcriptome quantification +2. STAR via RSEM - Alignment and quantification of expression levels +3. HISAT2 - Memory efficient splice aware alignment to a reference + +### Alignment post-processing +1. SAMtools - Sort and index alignments +2. UMI-tools dedup - UMI-based deduplication +3. picard MarkDuplicates - Duplicate read marking + +### Other steps +1. StringTie - Transcript assembly and quantification +2. 
BEDTools and bedGraphToBigWig - Create bigWig coverage files + +### Quality control +1. RSeQC - Various RNA-seq QC metrics +2. Qualimap - Various RNA-seq QC metrics +3. dupRadar - Assessment of technical / biological read duplication +4. Preseq - Estimation of library complexity +5. featureCounts - Read counting relative to gene biotype +6. DESeq2 - PCA plot and sample pairwise distance heatmap and dendrogram +7. MultiQC - Present QC for raw reads, alignment, read counting and sample similarity + +### Pseudoalignment and quantification +1. Salmon - Wicked fast gene and isoform quantification relative to the transcriptome +2. Kallisto - Near-optimal probabilistic RNA-seq quantification + +### Workflow reporting and genomes +1. Reference genome files - Saving reference genome indices/files +2. Pipeline information - Report metrics generated during the workflow execution + +[link to execution report](../execution_report_2024-10-05_16-02-39.html) diff --git a/content/setup_environment.md b/content/setup_environment.md index d67d860..70d0f18 100644 --- a/content/setup_environment.md +++ b/content/setup_environment.md @@ -245,6 +245,7 @@ rm samtools-1.21.tar.bz2 cd samtools-1.21/ make make install +export PATH=$PATH:/home/workshop/bin/bin ``` [link to RNAseq Background slides](../Workshop_RNAseq_Intro.pdf) diff --git a/public/differentialabundance/index.html b/public/differentialabundance/index.html index 35a154e..20856d2 100644 --- a/public/differentialabundance/index.html +++ b/public/differentialabundance/index.html @@ -12,43 +12,23 @@ +for those with more confidence this run can be prepared from scratch in a new directory."> @@ -154,23 +134,66 @@ 
nf-core/differentialabundance is a bioinformatics pipeline that can be used to analyse data represented as matrices, comparing groups of observations to generate differential statistics and downstream analyses. The pipeline supports RNA-seq data such as that generated by the nf-core rnaseq workflow, and Affymetrix arrays via .CEL files. Other types of matrix may also work with appropriate changes to parameters, and PRs to support additional specific modalities are welcomed.
+for those with more confidence this run can be prepared from scratch in a new directory.
+There are a few ways of installing the nf-core pipeline. But this happens automatically when you use a nextflow run nf-core/ command. -You can check the ~/.nextflow/assets folders to see what is already installed
+mkdir ~/workshop/nfDifferentialAbundance2
+cd ~/workshop/nfDifferentialAbundance2
+ls -l
optional you can use the nf-core launch command to build a launch command, but these instructions will be using a ’nextflow run’ command
+The minimal input requirements are
+2
+nf-core launch
-ls -l ~/.nextflow/assets/nf-core/
-If you don’t see the pipeline, you can pull it from the nf-core website.
+
+
+
+
+
+
+
+####
+nf-core/DifferentialAbundance
+
+There are a few ways of installing the nf-core pipeline. But this happens automatically when you use a *nextflow run nf-core/* command.
+You can check the ~/.nextflow/assets folders to see what is already installed
+ls -l ~/.nextflow/assets/nf-core/
-nextflow pull nf-core/differentialabundance
If you don't see the pipeline, you can pull it from the nf-core website.
+```bash
+nextflow pull nf-core/differentialabundance
but this happens automatically when running nextflow run nf-core/differentialabundance
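
The check described above can be sketched in a few lines of shell. This is only an illustration, assuming the default assets location `~/.nextflow/assets` mentioned earlier; the `pipeline_installed` function and the `ASSETS_DIR` variable are hypothetical helpers, not part of nextflow or nf-core.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report whether a pipeline has already been pulled.
# ASSETS_DIR defaults to the folder checked above; override it to look elsewhere.
ASSETS_DIR="${ASSETS_DIR:-$HOME/.nextflow/assets}"

pipeline_installed() {
    # $1 is the pipeline name, e.g. nf-core/differentialabundance
    [ -d "$ASSETS_DIR/$1" ]
}

if pipeline_installed nf-core/differentialabundance; then
    echo "already installed"
else
    echo "not installed: run 'nextflow pull nf-core/differentialabundance'"
fi
```

The sketch only prints a hint rather than pulling, since `nextflow run` fetches a missing pipeline on its own.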
nextflow run nf-core/rnaseq -profile sahmri -c /cancer/storage/SAGC/workshop/Workshop_data/nfRNAseq/nextflow.config -r 3.14.0 --input /cancer/storage/SAGC/workshop/Workshop_data/nfRNAseq/nfSampleSheet.csv --outdir /cancer/storage/SAGC/workshop/Workshop_data/nfRNAseq/outs --fasta /homes/daniel.thomson/References/GRCh38/Ensembl_download/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa --gtf /homes/daniel.thomson/References/GRCh38/Ensembl_download/Homo_sapiens.GRCh38.111.gtf --skip_dupradar --skip_qualimap -resume
746820de9bf04c0ec23e1e0295aaf91b
b8dc6ea0-b938-4179-ba77-9e115af63492
https://github.com/nf-core/rnaseq
, revision 3.14.0
(commit hash b89fac32650aacc86fcda9ee77e00612a1d77066
)These plots give an overview of the distribution of resource usage for each process.
+ +This table shows information about each task in the workflow. Use the search box on the right + to filter rows for specific values. Clicking headers will sort the table by that value and + scrolling side to side will reveal more columns.
+ +mkdir ~/workshop/nfcore
-cd ~/workshop/nfcore
+mkdir -p ~/workshop/nfRNAseq
+cd ~/workshop/nfRNAseq
nf-core launch -h
Launch a pipeline using a web GUI or command line prompts.
@@ -193,7 +193,7 @@
alternatively you can use a nextflow run command
-no need to run this command, in the interest of time (and the lack of disk space on your intance), I’ve pre-prepared the outputs for this run. We will run the next pipeline to completion.
+no need to run this command, in the interest of time (and the lack of disk space on your instance), I’ve pre-prepared the outputs for this run
navigate to run directory to see the nextflow run command
@@ -224,128 +224,167 @@
- treatment vs control
- 4 replicates for each
1. Merge re-sequenced FastQ files (cat)
-2. Sub-sample FastQ files and auto-infer strandedness (fq, Salmon)
-3. Read QC (FastQC)
-4. UMI extraction (UMI-tools)
-5. Adapter and quality trimming (Trim Galore!)
-6. Removal of genome contaminants (BBSplit)
-7. Removal of ribosomal RNA (SortMeRNA)
-8. Choice of multiple alignment and quantification routes:
- 1. STAR -> Salmon
- 2. STAR -> RSEM
- 3. HiSAT2 -> NO QUANTIFICATION
-9. Sort and index alignments (SAMtools)
-10. UMI-based deduplication (UMI-tools)
-11. Duplicate read marking (picard MarkDuplicates)
-12. Transcript assembly and quantification (StringTie)
-13. Create bigWig coverage files (BEDTools, bedGraphToBigWig)
-14. Extensive quality control:
- 1. RSeQC
- 2. Qualimap
- 3. dupRadar
- 4. Preseq
- 5. DESeq2
-15. Pseudoalignment and quantification (Salmon or ‘Kallisto’; optional)
-16. Present QC for raw read, alignment, gene biotype, sample similarity, and strand-specificity checks (MultiQC, R)
-
-In this section we will work through setting up the nf-core/RNAseq pipeline. We will be chosing specific parameters and finally, going through was the output looks like.
- - - -The dataset used throughout this workshop is as follow:
+list the results directory. All nf-core outputs have a consistent structure
-- 16 Samples sequenced with an MGI400 sequencer at SAGC using the Tecan Universal RNA-seq library protocol.
-- 2 different cancer cell lines (human)
-- treatment vs control
-- 4 replicates for each
-NOTE some concessions had to be made to work with this for workshop. Taking into account the large file sizes, the long run times and need for high compute resources.
+tree /home/workshop/workshop/nfRNAseq/outs
Quick background on genomic file formats to help describe the inputs and outputs of an NGS experiment.
+Knowing what these files are isn’t only important for finding which files to use for a pipeline, it is also a key foundation of genomic bioinformatics.
+Being able to use and manipulate each file opens up many opportunities, and is often required for troubleshooting.
+Many are plain text files, which means they can be manipulated with basic text editing.
+
the genome.fa file is a plain text representation of the genome sequence. This is the ‘reference’ to which the sequencing files (.fastq) are aligned.
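
Because FASTA is plain text, standard command-line tools are enough for a first look. A small sketch, using a made-up two-sequence file (`toy.fa` and its contents are illustrative, not workshop data): each record starts with a `>` header line, so counting those lines counts the sequences.

```bash
# Build a tiny, made-up FASTA file in a temporary directory (not workshop data).
workdir="$(mktemp -d)"
cat > "$workdir/toy.fa" <<'EOF'
>chrA toy sequence
ACGTACGTAC
GGTTAACC
>chrB toy sequence
TTTTCCCC
EOF

# Each record starts with a '>' header line, so counting headers counts sequences.
n_seqs="$(grep -c '^>' "$workdir/toy.fa")"

# Total bases: drop header lines, strip newlines, count the remaining characters.
n_bases="$(grep -v '^>' "$workdir/toy.fa" | tr -d '\n' | wc -c | tr -d ' ')"

echo "sequences: $n_seqs"
echo "bases: $n_bases"
```

The same two commands should work on a real genome.fa, although the base count will take a while on a full human reference.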
+ -head ./smallRNA/outs/mirtrace/[:]/mirtrace/qc_passed_reads.rnatype_unknown.collapsed/Acontrol1.fastp.fasta
genes.gtf: a genomic interval file referencing the genome. This file describes the genes/transcripts. It’s required for counting where reads are mapped to, and often has a lot more annotation data about each gene/transcript.
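
As a rough illustration of how a GTF can be inspected, the sketch below tallies the feature types in a tiny made-up file (`toy.gtf` is hypothetical, not real annotation). It relies on the standard GTF layout of nine tab-separated columns, with the feature type (gene, transcript, exon, ...) in column 3.

```bash
# Build a tiny, made-up GTF file (tab-separated, 9 columns); not real annotation.
workdir="$(mktemp -d)"
{
    printf '#!toy-annotation\n'
    printf 'chrA\ttoy\tgene\t1\t100\t.\t+\t.\tgene_id "G1";\n'
    printf 'chrA\ttoy\ttranscript\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
    printf 'chrA\ttoy\texon\t1\t40\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
    printf 'chrA\ttoy\texon\t60\t100\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
} > "$workdir/toy.gtf"

# Column 3 holds the feature type; skip '#' comment lines and tally each type.
feature_counts="$(awk -F'\t' '!/^#/ { n[$3]++ } END { for (f in n) print f, n[f] }' "$workdir/toy.gtf" | sort)"

echo "$feature_counts"
```

Pointing the `awk` command at the workshop's genes.gtf instead of the toy file should give a quick summary of how many genes, transcripts and exons the annotation contains.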
-the multiqc summary is a real strength of nf-core pipeline. +a lot of the key analyses are captured
+Overview of the key processes run with nf-core/RNAseq.
+- identifying steps for troubleshooting the success of a run.
+- far more than just mapping (star) and counting (RSEM)
+- demonstrating the need for a workflow manager
+
+
+
+