updates to tutorials

RosCraddock · Jun 6, 2024 · e2a9c26
1 parent 573c14a
commit e2a9c26

Showing 4 changed files with 85 additions and 87 deletions.
diff --git a/06_aggregate_tests.Rmd b/06_aggregate_tests.Rmd
@@ -241,7 +241,7 @@ The SMMAT test found two significant genes, ENSG00000253184 and ENSG00000251354,
 
 ## Exercise 6.1 (Application)
 
-Use the `GENESIS Aggregate Association Testing` app on the BioData Catalyst powered by Seven Bridges platform to perform gene-based burden tests for trait_1 using the null model previously fit in the `02_GWAS.Rmd` tutorial. Only include variants with alternate allele frequency < 1% and use the Wu weights to upweight rarer variants. Use the genotype data in the genome-wide GDS files you created previously. \n
+Use the `GENESIS Aggregate Association Testing` app on the BioData Catalyst powered by Seven Bridges platform to perform gene-based SMMAT tests for trait_1 using the null model previously fit in the `02_GWAS.Rmd` tutorial. Only include variants with alternate allele frequency < 1% and use the Wu weights to upweight rarer variants. Use the genotype data in the genome-wide GDS files you created previously. \n
 
 The `GENESIS Aggregate Association Testing` app currently requires Variant group files that are RData data.frames (i.e. our GRanges objects with gene definitions will not work). Fortunately, it is easy to transform our GRanges object to the required data.frame. The files you need to run the application are already in the project files on the SBG platform. 
 
@@ -275,22 +275,23 @@ The steps to perform this analysis are as follows:
     - aggregate_list > Aggregate type: position
     - assoc_aggregate > Alt Freq Max: 0.01
     - assoc_aggregate > Memory GB: 32 (increase to make sure enough available)
-    - assoc_aggregate > Test: burden
+    - assoc_aggregate > Test: smmat
     - assoc_aggregate > Weight Beta: "1 25"
-    - Output prefix: "1KG_trait_1_burden" (or any other string to name the output file)
+    - Output prefix: "1KG_trait_1_smmat" (or any other string to name the output file)
     - GENESIS Association results plotting > Plot MAC threshold: 5
   - Click: Run
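
Note on the `Weight Beta: "1 25"` setting: this corresponds to the Wu weights mentioned in the exercise text, where each variant is weighted by the Beta(1, 25) density evaluated at its minor allele frequency. A minimal base-R sketch, assuming made-up allele frequencies (the `maf` values are hypothetical and not taken from the tutorial data):

```r
# Hypothetical minor allele frequencies, for illustration only
maf <- c(0.0005, 0.001, 0.005, 0.01)

# Wu weights: Beta(1, 25) density evaluated at each frequency
wu_weights <- dbeta(maf, shape1 = 1, shape2 = 25)

# Rarer variants receive larger weights
data.frame(maf = maf, weight = wu_weights)
```

The weights decrease as the frequency increases, which is how rarer variants are upweighted in the aggregate test.
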
 
 The analysis will take a few minutes to run. You can find your analysis in the Tasks menu of your Project to check on its progress and see the results once it has completed.
 
 The output of this analysis will be 22 `<output_prefix>_chr<CHR>.RData` files with the association test results for each chromosome as well as a `<output_prefix>_manh.png` file with the Manhattan plot and a `<output_prefix>_qq.png` file with the QQ plot. Review the Manhattan plot -- are there any significant gene associations?
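
Note on the output naming: the file names described above can be built programmatically, which is convenient when checking that all chromosomes finished. A rough sketch, assuming the output prefix `1KG_trait_1_smmat` from the app settings and the `/sbgenomics/project-files` directory used in Exercise 6.2 (the variable names are illustrative only):

```r
# Assumed output prefix and project directory (taken from the exercise text)
prefix <- "1KG_trait_1_smmat"
project_dir <- "/sbgenomics/project-files"

# One association results file per autosome
assoc_files <- file.path(project_dir, sprintf("%s_chr%d.RData", prefix, 1:22))

# Manhattan and QQ plot files
plot_files <- file.path(project_dir, paste0(prefix, c("_manh.png", "_qq.png")))

# Which expected result files are present in this environment?
present <- file.exists(assoc_files)
sum(present)
```

Outside the platform none of these paths will exist, so `sum(present)` is only informative inside a Data Studio session.
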
 
-You can find the expected output of this analysis by looking at the existing task `11 Burden Association Test trait_1` in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to look at the output.
+You can find the expected output of this analysis by looking at the existing task `11 SMMAT Association Test trait_1` in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to look at the output.
 
 
 ## Exercise 6.2 (Data Studio)
 
-After running an Application, you may want to load the results into RStudio to explore them interactively. All of the output files are saved in the directory `/sbgenomics/project-files/`. Load the chr 8 burden results into RStudio and find the significant genes.  
+After running an Application, you may want to load the results into RStudio to explore them interactively. All of the output files are saved in the directory `/sbgenomics/project-files/`. Load the chromosome 8 SMMAT results into RStudio and find the significant genes.  
+
 ```{r}
 # your solution here 
 #
@@ -306,21 +307,21 @@ After running an Application, you may want to load the results into RStudio to e
 
 ### Solution 6.2 (Data Studio)
 
-After running an Application, you may want to load the results into RStudio to explore them interactively. All of the output files are saved in the directory `/sbgenomics/project-files/`. Load the chr 8 burden results into RStudio and find the significant genes.
+After running an Application, you may want to load the results into RStudio to explore them interactively. All of the output files are saved in the directory `/sbgenomics/project-files/`. Load the chromosome 8 SMMAT results into RStudio and find the significant genes.
 
 ```{r, eval = FALSE}
 # load
-assoc <- get(load('/sbgenomics/project-files/1KG_trait_1_burden_chr8.RData'))
+assoc <- get(load('/sbgenomics/project-files/1KG_trait_1_smmat_chr8.RData'))
 names(assoc)
 
 head(assoc$results)
 
 # filter to cumulative MAC >= 5
-burden <- assoc$results[assoc$results$n.alt >= 5, ]
+smmat <- assoc$results[assoc$results$n.alt >= 5, ]
 
 # significant genes
-burden[burden$Score.pval < 0.05/nrow(burden), ]
+smmat[smmat$pval_SMMAT < 0.05/nrow(smmat), ]
 ```
 
-Gene ENSG00000186510.7 has the smallest burden p-value ($p = 4.8x10^{-4}$).
+Gene ENSG00000253184 has SMMAT $p = 4.3x10^{-4}$, and gene ENSG00000251354 has SMMAT $p = 3.6x10^{-6}$.
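
Note on the filtering logic in Solution 6.2: the significance cut-off is a Bonferroni threshold computed after the MAC filter, so the number of tests is the number of rows that survive `n.alt >= 5`. A self-contained sketch on mock data; the gene labels and values below are invented for illustration, and only the column names `n.alt` and `pval_SMMAT` come from the solution code:

```r
# Mock results table; gene labels and values are invented for illustration
results <- data.frame(
  n.alt = c(3, 12, 40, 8),
  pval_SMMAT = c(1e-6, 0.2, 1e-4, 0.03),
  row.names = c("geneA", "geneB", "geneC", "geneD")
)

# Keep aggregate units with cumulative minor allele count >= 5
smmat <- results[results$n.alt >= 5, ]

# Bonferroni threshold: 0.05 divided by the number of tests retained
threshold <- 0.05 / nrow(smmat)

# Genes passing the threshold
sig <- smmat[smmat$pval_SMMAT < threshold, ]
sig
```

In this mock table `geneA` has the smallest p-value but is excluded by the MAC filter before the threshold is applied, so the order of the two steps matters.
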
 
diff --git a/06_aggregate_tests.html b/06_aggregate_tests.html
@@ -830,7 +830,7 @@ <h3>SMMAT Test</h3>
 <h2>Exercise 6.1 (Application)</h2>
 <p>Use the <code>GENESIS Aggregate Association Testing</code> app on the
 BioData Catalyst powered by Seven Bridges platform to perform gene-based
-burden tests for trait_1 using the null model previously fit in the
+SMMAT tests for trait_1 using the null model previously fit in the
 <code>02_GWAS.Rmd</code> tutorial. Only include variants with alternate
 allele frequency &lt; 1% and use the Wu weights to upweight rarer
 variants. Use the genotype data in the genome-wide GDS files you created
@@ -902,9 +902,9 @@ <h2>Exercise 6.1 (Application)</h2>
 <li>assoc_aggregate &gt; Alt Freq Max: 0.01</li>
 <li>assoc_aggregate &gt; Memory GB: 32 (increase to make sure enough
 available)</li>
-<li>assoc_aggregate &gt; Test: burden</li>
+<li>assoc_aggregate &gt; Test: smmat</li>
 <li>assoc_aggregate &gt; Weight Beta: “1 25”</li>
-<li>Output prefix: “1KG_trait_1_burden” (or any other string to name the
+<li>Output prefix: “1KG_trait_1_smmat” (or any other string to name the
 output file)</li>
 <li>GENESIS Association results plotting &gt; Plot MAC threshold: 5</li>
 </ul></li>
@@ -922,7 +922,7 @@ <h2>Exercise 6.1 (Application)</h2>
 Review the Manhattan plot – are there any significant gene
 associations?</p>
 <p>You can find the expected output of this analysis by looking at the
-existing task <code>11 Burden Association Test trait_1</code> in the
+existing task <code>11 SMMAT Association Test trait_1</code> in the
 Tasks menu of your Project. The output files are available in the
 Project, so you do not need to wait for your analysis to finish to look
 at the output.</p>
@@ -931,8 +931,9 @@ <h2>Exercise 6.1 (Application)</h2>
 <h2>Exercise 6.2 (Data Studio)</h2>
 <p>After running an Application, you may want to load the results into
 RStudio to explore them interactively. All of the output files are saved
-in the directory <code>/sbgenomics/project-files/</code>. Load the chr 8
-burden results into RStudio and find the significant genes.</p>
+in the directory <code>/sbgenomics/project-files/</code>. Load the
+chromosome 8 SMMAT results into RStudio and find the significant
+genes.</p>
 <pre class="r"><code># your solution here 
 #
 #
@@ -947,20 +948,22 @@ <h2>Exercise 6.2 (Data Studio)</h2>
 <h3>Solution 6.2 (Data Studio)</h3>
 <p>After running an Application, you may want to load the results into
 RStudio to explore them interactively. All of the output files are saved
-in the directory <code>/sbgenomics/project-files/</code>. Load the chr 8
-burden results into RStudio and find the significant genes.</p>
+in the directory <code>/sbgenomics/project-files/</code>. Load the
+chromosome 8 SMMAT results into RStudio and find the significant
+genes.</p>
 <pre class="r"><code># load
-assoc &lt;- get(load(&#39;/sbgenomics/project-files/1KG_trait_1_burden_chr8.RData&#39;))
+assoc &lt;- get(load(&#39;/sbgenomics/project-files/1KG_trait_1_smmat_chr8.RData&#39;))
 names(assoc)
 
 head(assoc$results)
 
 # filter to cumulative MAC &gt;= 5
-burden &lt;- assoc$results[assoc$results$n.alt &gt;= 5, ]
+smmat &lt;- assoc$results[assoc$results$n.alt &gt;= 5, ]
 
 # significant genes
-burden[burden$Score.pval &lt; 0.05/nrow(burden), ]</code></pre>
-<p>Gene ENSG00000186510.7 has the smallest burden p-value (<span class="math inline">\(p = 4.8x10^{-4}\)</span>).</p>
+smmat[smmat$pval_SMMAT &lt; 0.05/nrow(smmat), ]</code></pre>
+<p>Gene ENSG00000253184 has SMMAT <span class="math inline">\(p =
+4.3x10^{-4}\)</span>, and gene ENSG00000251354 has SMMAT <span class="math inline">\(p = 3.6x10^{-6}\)</span>.</p>
 </div>
 </div>
 </div>

diff --git a/07_STAAR.Rmd b/07_STAAR.Rmd
@@ -47,21 +47,21 @@ We will use the `STAARpipeline` apps on the BioData Catalyst powered by Seven Br
 
 ## Exercise 7.1 (Application)
 
-First, run the `FAVORannotator` app on an example GDS file -- perhaps chromosome 19. This app runs one chromosome at a time. 
+First, run the `FAVORannotator` app on an example GDS file -- for the STAAR exercises we have provided a smaller chromosome 19 subset GDS in the interest of making these run more quickly. This app runs one chromosome at a time. 
 
 - Run the analysis in your project:
   - Click: Apps > `FAVORannotator` > Run
   - Specify the Inputs:
-    - GDS file: `1KG_phase3_subset_1KG_phase3_subset_chr19_pruned.gds`
+    - GDS file: `1KG_phase3_STAAR_subset_chr19.gds`
     - FAVOR database for specific chromosome: `FAVOR_chr19.tar.gz` (provided at link above by STAAR package creators)
     - FAVORdatabase_chrsplit CSV file: `FAVORdatabase_chrsplit.csv` (provided at link above by STAAR package creators)
   - Specify the App Settings:
     - Chromosome: 19
-    - Output file prefix: "1KG_phase3_subset_1KG_phase3_subset_chr19_pruned_favor" (or any other string to name the output file)
+    - Output file prefix: "1KG_phase3_STAAR_subset_chr19_favor" (or any other string to name the output file)
 
 Note: other app setting defaults will not need to be changed for our example, but could be altered depending on cohort size, etc. 
 
-Then, as this task will otherwise take half an hour to run, do not run the task. The output of this task would be an annotated GDS file named `<output_prefix>.gds`. You can find the expected output of this task by looking at the existing task `11. STAARexercise_FAVORannotator_chr19_example` in the Tasks menu of your Project. We will utilize the pre-provided output file available in the Project for the next steps. 
+Then, as this task will otherwise take half an hour to run, do not run the task. The output of this task would be an annotated GDS file named `<output_prefix>.gds`. You can find the expected output of this task by looking at the existing task `12 STAARexercise FAVORannotator Chr19` in the Tasks menu of your Project. We will utilize the pre-provided output file available in the Project for the next steps. 
 
 ## Exercise 7.2 (Application)
 
@@ -70,14 +70,13 @@ Next, run `STAARpipeline`. We will focus on a sliding window test for this exerc
 - First, generate an appropriate null model:
   - Click: Apps > `STAARpipeline` > Run
   - Specify the Inputs:
-    - GDS files: `1KG_phase3_subset_1KG_phase3_subset_chr19_pruned_favor.gds` (in a real analysis, you would select all 22 chromosomes)
     - Annotation name catalog: `Annotation_name_catalog.csv`
     - Phenotype file: `mock_phenotype_SISG.csv`
   - Specify the App Settings:
-    - Phenotype: phenotype
-    - Covariates: age, sex (again, in a real analysis, you would want to include a kinship matrix and ancestry principal components)
-    - Test type: Null
-    - Output file prefix: "chr19_region_null" (or any other string to name the output file)
+    - Column name of outcome variable: phenotype
+    - Covariates: age,sex (again, in a real analysis, you would want to include a kinship matrix and ancestry principal components)
+    - Test type: Null (i.e. only fit the null model)
+    - Output file prefix: "STAAR_chr19_region_null" (or any other string to name the output file)
   - Click: Run
 
 Note: you do not need to provide a variant grouping file, variant annotations are already included in the annotated GDS. 
@@ -86,20 +85,20 @@ The analysis will take a few minutes to run. You can find your analysis in the T
 
 The output file for this null model task is `<output_prefix>.Rdata`, a null model data object.
 
-You can find the expected output of this analysis by looking at the existing task `12. STAARexercise_nullmodel_chr19` in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to look at the output.
+You can find the expected output of this analysis by looking at the existing task `13 STAARexercise Null Model Chr19` in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to look at the output.
 
 
 ## Exercise 7.3 (Application)
 
 - Next, run a sliding window aggregate test (5 kb window).
   - Click: Apps > `STAARpipeline` > Run
   - Specify the Inputs:
-    - GDS files: `1KG_phase3_subset_1KG_phase3_subset_chr19_pruned_favor.gds`
+    - GDS files: `1KG_phase3_STAAR_subset_chr19_favor.gds` (in a real analysis, you would select all 22 chromosomes)
     - Annotation name catalog: `Annotation_name_catalog.csv`
-    - Null model: `chr19_region_null.Rdata`
+    - Null model: `STAAR_chr19_region_null.Rdata`
   - Specify App Settings:
     - Sliding window size (bp) to be used in sliding window test: 5000
-    - Output file prefix: "region_sliding_5kb_chr19" (or any other string to name the output file)
+    - Output file prefix: "STAAR_region_sliding_5kb_chr19" (or any other string to name the output file)
     - Test type: Sliding_Window
   - Click: Run
 
@@ -109,7 +108,7 @@ The analysis will take ~10 minutes to run. You can find your analysis in the Tas
 
 The output from this task is `<output_prefix>.Rdata`, which contains the STAAR association results.
 
-You can find the expected output of this analysis by looking at the existing task `13. STAARexercise_STAARpipeline_run_sliding_5kb` in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to look at the output.
+You can find the expected output of this analysis by looking at the existing task `14 STAARexercise Sliding Window 5kb Chr19` in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to look at the output.
 
 
 ## Exercise 7.4 (Application)
@@ -119,18 +118,18 @@ The output file for these association tests using STAAR is an `.RData` file. To
   - Click: Apps > `STAARpipelineSummary VarSet` > Run
   - Specify the Inputs:
     - Annotation name catalog: `Annotation_name_catalog.csv`
-    - Input array results: `region_sliding_5kb_chr19.Rdata`
+    - Input array results: `STAAR_region_sliding_5kb_chr19.Rdata`
   - Specify App Settings: 
-    - Output file prefix: "result_sliding_5kb_chr19" (or any other string to name the output file)
-    - Prefix of input results: "region_sliding_5kb_chr"
+    - Output file prefix: "STAAR_region_sliding_5kb_chr19" (or any other string to name the output file)
+    - Prefix of input results: "STAAR_region_sliding_5kb_chr19"
     - Test type: Sliding_Window
   - Click: Run
 
 The analysis will take a few minutes to run. You can find your analysis in the Tasks menu of your Project to check on its progress and see the results once it has completed. 
 
-You can find the expected output of this analysis by looking at the existing task `14. STAARexercise_STAARpipelineSummary_VarSet_sliding5kb` in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to look at the output.
+You can find the expected output of this analysis by looking at the existing task `15 STAARexercise Summary Sliding Window 5kb Chr19` in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to look at the output.
 
-The output from this task is `<output_prefix>_results_sliging_window_genome.Rdata` and `<output_prefix>_results_sliging_window_genome_sig.csv`. Do note that the second file "sig", i.e. significant test files in this analysis, are empty - which is good/anticipated with a randomly generated phenotype!
+The output from this task is `<output_prefix>_results_sliding_window_genome.Rdata` and `<output_prefix>_results_sliding_window_genome_sig.csv`. Do note that the second file "sig", i.e. significant test files in this analysis, are empty - which is good/anticipated with a randomly generated phenotype!
 
 We can also use the `STAARpipelineSummary VarSet` app to adjust for a list of known variants we would like to condition our analysis on (i.e. include as a covariate) in order to determine if our identified rare variant signals are independent of such known single variants. Files of significant "cond" results will then appear. If you use this option, note you would need to input `null_obj_file`, `agds_file_name`, and `agds_files` across all chromosomes at once (the app is not intended to be run across a single chromosome and will fail if a file for each autosome is not found). 
 
@@ -141,21 +140,21 @@ You can also use the `STAARpipelineSummary IndVar` App to examine the individual
   - Click: Apps > `STAARpipelineSummary IndVar` > Run
   - Specify the Inputs:
     - Annotation name catalog: `Annotation_name_catalog.csv`
-    - AGDS file: `1KG_phase3_subset_1KG_phase3_subset_chr19_pruned_favor.gds`
-    - Null model: `chr19_region_null.Rdata`
+    - AGDS file: `1KG_phase3_STAAR_subset_chr19_favor.gds`
+    - Null model: `STAAR_chr19_region_null.Rdata`
   - Specify App Settings: 
     - Chromosome: 19
     - End location: 45668803
-    - Output file prefix: "slidingwindow_indvar_19_45663804" (or any other string to name the output file)
+    - Output file prefix: "STAAR_sliding_indvar_19_45663804" (or any other string to name the output file)
     - Start location: 45663804
     - Test type: Sliding_Window
   - Click: Run
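
Note on the start and end locations: the two coordinates above span exactly one 5 kb window, matching the sliding window size set in Exercise 7.3. A quick arithmetic check in base R (the variable names are illustrative only):

```r
# Coordinates from the app settings above
start_loc <- 45663804
end_loc <- 45668803

# Inclusive window length in base pairs
window_bp <- end_loc - start_loc + 1
window_bp

# Should equal the 5000 bp sliding window size from Exercise 7.3
stopifnot(window_bp == 5000)
```
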
 
 The analysis will take a few minutes to run. You can find your analysis in the Tasks menu of your Project to check on its progress and see the results once it has completed.
 
-The output from this analysis will be `<output_prefix>csv`.
+The output from this analysis will be `<output_prefix>.csv`.
 
-You can find the expected output of this analysis by looking at the existing task `15. STAARexercise_STAARpipelineSummary_IndVar` in the Tasks menu of your Project.  The output files are available in the Project, so you do not need to wait for your analysis to finish to look at the output.
+You can find the expected output of this analysis by looking at the existing task `16 STAARexercise Summary IndVar Chr19` in the Tasks menu of your Project.  The output files are available in the Project, so you do not need to wait for your analysis to finish to look at the output.
 
 
 ## Gene-centric Tests