updates to tutorials
mconomos committed Jun 6, 2024
1 parent 573c14a commit e2a9c26
Showing 4 changed files with 85 additions and 87 deletions.
21 changes: 11 additions & 10 deletions 06_aggregate_tests.Rmd
@@ -241,7 +241,7 @@ The SMMAT test found two significant genes, ENSG00000253184 and ENSG00000251354,

## Exercise 6.1 (Application)

Use the `GENESIS Aggregate Association Testing` app on the BioData Catalyst powered by Seven Bridges platform to perform gene-based burden tests for trait_1 using the null model previously fit in the `02_GWAS.Rmd` tutorial. Only include variants with alternate allele frequency < 1% and use the Wu weights to upweight rarer variants. Use the genotype data in the genome-wide GDS files you created previously. \n
Use the `GENESIS Aggregate Association Testing` app on the BioData Catalyst powered by Seven Bridges platform to perform gene-based SMMAT tests for trait_1 using the null model previously fit in the `02_GWAS.Rmd` tutorial. Only include variants with alternate allele frequency < 1% and use the Wu weights to upweight rarer variants. Use the genotype data in the genome-wide GDS files you created previously. \n

The `GENESIS Aggregate Association Testing` app currently requires Variant group files that are RData data.frames (i.e. our GRanges objects with gene definitions will not work). Fortunately, it is easy to transform our GRanges object to the required data.frame. The files you need to run the application are already in the project files on the SBG platform.
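
The GRanges-to-data.frame conversion might look like the sketch below. The toy `genes` object, the output file name, and the column names `group_id`, `chr`, `start`, `end` are assumptions for illustration (the column names reflect the layout we believe position-based aggregation expects); check the app's documentation for the exact format it requires.

```r
library(GenomicRanges)

# Toy GRanges standing in for the gene definitions (coordinates are made up)
genes <- GRanges(
  seqnames = c("8", "8"),
  ranges = IRanges(start = c(1000, 5000), end = c(2000, 7500))
)
names(genes) <- c("geneA", "geneB")

# One row per gene; column names are an assumed layout, not confirmed by the app docs
var_groups <- data.frame(
  group_id = names(genes),
  chr = as.character(seqnames(genes)),
  start = start(genes),
  end = end(genes),
  stringsAsFactors = FALSE
)

# Save as an RData file (hypothetical file name) that could be uploaded to the project
save(var_groups, file = file.path(tempdir(), "variant_groups_chr8.RData"))
```

Building the data.frame column by column, rather than calling `as.data.frame()` on the GRanges, keeps only the fields of interest and makes the column names explicit.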

@@ -275,22 +275,23 @@ The steps to perform this analysis are as follows:
- aggregate_list > Aggregate type: position
- assoc_aggregate > Alt Freq Max: 0.01
- assoc_aggregate > Memory GB: 32 (increase to make sure enough available)
- assoc_aggregate > Test: burden
- assoc_aggregate > Test: smmat
- assoc_aggregate > Weight Beta: "1 25"
- Output prefix: "1KG_trait_1_burden" (or any other string to name the output file)
- Output prefix: "1KG_trait_1_smmat" (or any other string to name the output file)
- GENESIS Association results plotting > Plot MAC threshold: 5
- Click: Run
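
To see what the `Weight Beta: "1 25"` setting does, here is a minimal base-R sketch of the Wu weights: each variant is weighted by the Beta(1, 25) density evaluated at its allele frequency. The frequencies are made up, and we assume the density is applied to the minor allele frequency; the app's internals are not shown in this tutorial.

```r
# Made-up minor allele frequencies, all below the 1% threshold used in the exercise
maf <- c(0.0005, 0.001, 0.005, 0.01)

# Wu weights: Beta(1, 25) density evaluated at each frequency
weights <- dbeta(maf, shape1 = 1, shape2 = 25)

# Rarer variants receive larger weights
data.frame(maf = maf, weight = weights)
```

Because the Beta(1, 25) density decreases steeply as the frequency grows, the rarest variants contribute most to the aggregate test statistic.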

The analysis will take a few minutes to run. You can find your analysis in the Tasks menu of your Project to check on its progress and see the results once it has completed.

The output of this analysis will be 22 `<output_prefix>_chr<CHR>.RData` files with the association test results for each chromosome as well as a `<output_prefix>_manh.png` file with the Manhattan plot and a `<output_prefix>_qq.png` file with the QQ plot. Review the Manhattan plot -- are there any significant gene associations?

You can find the expected output of this analysis by looking at the existing task `11 Burden Association Test trait_1` in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to look at the output.
You can find the expected output of this analysis by looking at the existing task `11 SMMAT Association Test trait_1` in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to look at the output.


## Exercise 6.2 (Data Studio)

After running an Application, you may want to load the results into RStudio to explore them interactively. All of the output files are saved in the directory `/sbgenomics/project-files/`. Load the chr 8 burden results into RStudio and find the significant genes.
After running an Application, you may want to load the results into RStudio to explore them interactively. All of the output files are saved in the directory `/sbgenomics/project-files/`. Load the chromosome 8 SMMAT results into RStudio and find the significant genes.

```{r}
# your solution here
#
@@ -306,21 +307,21 @@ After running an Application, you may want to load the results into RStudio to e

### Solution 6.2 (Data Studio)

After running an Application, you may want to load the results into RStudio to explore them interactively. All of the output files are saved in the directory `/sbgenomics/project-files/`. Load the chr 8 burden results into RStudio and find the significant genes.
After running an Application, you may want to load the results into RStudio to explore them interactively. All of the output files are saved in the directory `/sbgenomics/project-files/`. Load the chromosome 8 SMMAT results into RStudio and find the significant genes.

```{r, eval = FALSE}
# load
assoc <- get(load('/sbgenomics/project-files/1KG_trait_1_burden_chr8.RData'))
assoc <- get(load('/sbgenomics/project-files/1KG_trait_1_smmat_chr8.RData'))
names(assoc)
head(assoc$results)
# filter to cumulative MAC >= 5
burden <- assoc$results[assoc$results$n.alt >= 5, ]
smmat <- assoc$results[assoc$results$n.alt >= 5, ]
# significant genes
burden[burden$Score.pval < 0.05/nrow(burden), ]
smmat[smmat$pval_SMMAT < 0.05/nrow(smmat), ]
```

Gene ENSG00000186510.7 has the smallest burden p-value ($p = 4.8x10^{-4}$).
Gene ENSG00000253184 has SMMAT $p = 4.3 \times 10^{-4}$, and gene ENSG00000251354 has SMMAT $p = 3.6 \times 10^{-6}$.
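
The filtering logic in the solution can be tried without the platform on a toy table. The sketch below uses a made-up data.frame with the same two columns (`n.alt` and `pval_SMMAT`); the gene names and values are hypothetical and are not results from the analysis.

```r
# Toy stand-in for assoc$results; all values are made up for illustration
res <- data.frame(
  n.alt = c(3, 12, 40, 7, 25),
  pval_SMMAT = c(1e-6, 0.2, 1e-4, 0.03, 0.6),
  row.names = paste0("gene", 1:5)
)

# Keep aggregation units with cumulative MAC >= 5
kept <- res[res$n.alt >= 5, ]

# Bonferroni threshold: 0.05 divided by the number of tests retained
threshold <- 0.05 / nrow(kept)

# Significant genes after correction
sig <- kept[kept$pval_SMMAT < threshold, ]
sig
```

Note that the MAC filter is applied before the threshold is computed, so the Bonferroni correction only counts the tests that are actually retained.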

27 changes: 15 additions & 12 deletions 06_aggregate_tests.html
@@ -830,7 +830,7 @@ <h3>SMMAT Test</h3>
<h2>Exercise 6.1 (Application)</h2>
<p>Use the <code>GENESIS Aggregate Association Testing</code> app on the
BioData Catalyst powered by Seven Bridges platform to perform gene-based
burden tests for trait_1 using the null model previously fit in the
SMMAT tests for trait_1 using the null model previously fit in the
<code>02_GWAS.Rmd</code> tutorial. Only include variants with alternate
allele frequency &lt; 1% and use the Wu weights to upweight rarer
variants. Use the genotype data in the genome-wide GDS files you created
@@ -902,9 +902,9 @@ <h2>Exercise 6.1 (Application)</h2>
<li>assoc_aggregate &gt; Alt Freq Max: 0.01</li>
<li>assoc_aggregate &gt; Memory GB: 32 (increase to make sure enough
available)</li>
<li>assoc_aggregate &gt; Test: burden</li>
<li>assoc_aggregate &gt; Test: smmat</li>
<li>assoc_aggregate &gt; Weight Beta: “1 25”</li>
<li>Output prefix: “1KG_trait_1_burden” (or any other string to name the
<li>Output prefix: “1KG_trait_1_smmat” (or any other string to name the
output file)</li>
<li>GENESIS Association results plotting &gt; Plot MAC threshold: 5</li>
</ul></li>
@@ -922,7 +922,7 @@ <h2>Exercise 6.1 (Application)</h2>
Review the Manhattan plot – are there any significant gene
associations?</p>
<p>You can find the expected output of this analysis by looking at the
existing task <code>11 Burden Association Test trait_1</code> in the
existing task <code>11 SMMAT Association Test trait_1</code> in the
Tasks menu of your Project. The output files are available in the
Project, so you do not need to wait for your analysis to finish to look
at the output.</p>
@@ -931,8 +931,9 @@ <h2>Exercise 6.1 (Application)</h2>
<h2>Exercise 6.2 (Data Studio)</h2>
<p>After running an Application, you may want to load the results into
RStudio to explore them interactively. All of the output files are saved
in the directory <code>/sbgenomics/project-files/</code>. Load the chr 8
burden results into RStudio and find the significant genes.</p>
in the directory <code>/sbgenomics/project-files/</code>. Load the
chromosome 8 SMMAT results into RStudio and find the significant
genes.</p>
<pre class="r"><code># your solution here
#
#
@@ -947,20 +948,22 @@ <h2>Exercise 6.2 (Data Studio)</h2>
<h3>Solution 6.2 (Data Studio)</h3>
<p>After running an Application, you may want to load the results into
RStudio to explore them interactively. All of the output files are saved
in the directory <code>/sbgenomics/project-files/</code>. Load the chr 8
burden results into RStudio and find the significant genes.</p>
in the directory <code>/sbgenomics/project-files/</code>. Load the
chromosome 8 SMMAT results into RStudio and find the significant
genes.</p>
<pre class="r"><code># load
assoc &lt;- get(load(&#39;/sbgenomics/project-files/1KG_trait_1_burden_chr8.RData&#39;))
assoc &lt;- get(load(&#39;/sbgenomics/project-files/1KG_trait_1_smmat_chr8.RData&#39;))
names(assoc)

head(assoc$results)

# filter to cumulative MAC &gt;= 5
burden &lt;- assoc$results[assoc$results$n.alt &gt;= 5, ]
smmat &lt;- assoc$results[assoc$results$n.alt &gt;= 5, ]

# significant genes
burden[burden$Score.pval &lt; 0.05/nrow(burden), ]</code></pre>
<p>Gene ENSG00000186510.7 has the smallest burden p-value (<span class="math inline">\(p = 4.8x10^{-4}\)</span>).</p>
smmat[smmat$pval_SMMAT &lt; 0.05/nrow(smmat), ]</code></pre>
<p>Gene ENSG00000253184 has SMMAT <span class="math inline">\(p =
4.3 \times 10^{-4}\)</span>, and gene ENSG00000251354 has SMMAT <span class="math inline">\(p = 3.6 \times 10^{-6}\)</span>.</p>
</div>
</div>
</div>
47 changes: 23 additions & 24 deletions 07_STAAR.Rmd
@@ -47,21 +47,21 @@ We will use the `STAARpipeline` apps on the BioData Catalyst powered by Seven Br

## Exercise 7.1 (Application)

First, run the `FAVORannotator` app on an example GDS file -- perhaps chromosome 19. This app runs one chromosome at a time.
First, run the `FAVORannotator` app on an example GDS file. For the STAAR exercises, we have provided a smaller chromosome 19 subset GDS so that the tasks run more quickly. This app runs one chromosome at a time.

- Run the analysis in your project:
- Click: Apps > `FAVORannotator` > Run
- Specify the Inputs:
- GDS file: `1KG_phase3_subset_1KG_phase3_subset_chr19_pruned.gds`
- GDS file: `1KG_phase3_STAAR_subset_chr19.gds`
- FAVOR database for specific chromosome: `FAVOR_chr19.tar.gz` (provided at link above by STAAR package creators)
- FAVORdatabase_chrsplit CSV file: `FAVORdatabase_chrsplit.csv` (provided at link above by STAAR package creators)
- Specify the App Settings:
- Chromosome: 19
- Output file prefix: "1KG_phase3_subset_1KG_phase3_subset_chr19_pruned_favor" (or any other string to name the output file)
- Output file prefix: "1KG_phase3_STAAR_subset_chr19_favor" (or any other string to name the output file)

Note: other app setting defaults will not need to be changed for our example, but could be altered depending on cohort size, etc.

Then, as this task will otherwise take half an hour to run, do not run the task. The output of this task would be an annotated GDS file named `<output_prefix>.gds`. You can find the expected output of this task by looking at the existing task `11. STAARexercise_FAVORannotator_chr19_example` in the Tasks menu of your Project. We will utilize the pre-provided output file available in the Project for the next steps.
Because this task would take about half an hour to run, do not actually run it. The output of this task would be an annotated GDS file named `<output_prefix>.gds`. You can find the expected output of this task by looking at the existing task `12 STAARexercise FAVORannotator Chr19` in the Tasks menu of your Project. We will use the pre-provided output file available in the Project for the next steps.

## Exercise 7.2 (Application)

@@ -70,14 +70,13 @@ Next, run `STAARpipeline`. We will focus on a sliding window test for this exerc
- First, generate an appropriate null model:
- Click: Apps > `STAARpipeline` > Run
- Specify the Inputs:
- GDS files: `1KG_phase3_subset_1KG_phase3_subset_chr19_pruned_favor.gds` (in a real analysis, you would select all 22 chromosomes)
- Annotation name catalog: `Annotation_name_catalog.csv`
- Phenotype file: `mock_phenotype_SISG.csv`
- Specify the App Settings:
- Phenotype: phenotype
- Covariates: age, sex (again, in a real analysis, you would want to include a kinship matrix and ancestry principal components)
- Test type: Null
- Output file prefix: "chr19_region_null" (or any other string to name the output file)
- Column name of outcome variable: phenotype
- Covariates: age,sex (again, in a real analysis, you would want to include a kinship matrix and ancestry principal components)
- Test type: Null (i.e. only fit the null model)
- Output file prefix: "STAAR_chr19_region_null" (or any other string to name the output file)
- Click: Run

Note: you do not need to provide a variant grouping file, because variant annotations are already included in the annotated GDS.
@@ -86,20 +85,20 @@ The analysis will take a few minutes to run. You can find your analysis in the T

The output file for this null model task is `<output_prefix>.Rdata`, a null model data object.

You can find the expected output of this analysis by looking at the existing task `12. STAARexercise_nullmodel_chr19` in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to look at the output.
You can find the expected output of this analysis by looking at the existing task `13 STAARexercise Null Model Chr19` in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to look at the output.


## Exercise 7.3 (Application)

- Next, run a sliding window aggregate test (5 kb window).
- Click: Apps > `STAARpipeline` > Run
- Specify the Inputs:
- GDS files: `1KG_phase3_subset_1KG_phase3_subset_chr19_pruned_favor.gds`
- GDS files: `1KG_phase3_STAAR_subset_chr19_favor.gds` (in a real analysis, you would select all 22 chromosomes)
- Annotation name catalog: `Annotation_name_catalog.csv`
- Null model: `chr19_region_null.Rdata`
- Null model: `STAAR_chr19_region_null.Rdata`
- Specify App Settings:
- Sliding window size (bp) to be used in sliding window test: 5000
- Output file prefix: "region_sliding_5kb_chr19" (or any other string to name the output file)
- Output file prefix: "STAAR_region_sliding_5kb_chr19" (or any other string to name the output file)
- Test type: Sliding_Window
- Click: Run

@@ -109,7 +108,7 @@ The analysis will take ~10 minutes to run. You can find your analysis in the Tas

The output from this task is `<output_prefix>.Rdata`, which contains the STAAR association results.
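
The names of the objects stored inside the `.Rdata` output are not stated here, so one cautious way to inspect such a file in RStudio is to load it into its own environment and list what it contains. The sketch below is self-contained: it first writes a toy file (hypothetical object and column names) to stand in for the real output.

```r
# Stand-in for a downloaded results file: save one toy object to a temporary .Rdata file
toy_results <- data.frame(window_start = c(1, 2501), pvalue = c(0.4, 0.9))
path <- file.path(tempdir(), "toy_output.Rdata")
save(toy_results, file = path)

# Load into a separate environment so nothing in the workspace is overwritten
env <- new.env()
loaded_names <- load(path, envir = env)

# load() returns the names of the restored objects; inspect the first one
loaded_names
str(env[[loaded_names[1]]])
```

For the real output, replace `path` with the file's location under `/sbgenomics/project-files/`.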

You can find the expected output of this analysis by looking at the existing task `13. STAARexercise_STAARpipeline_run_sliding_5kb` in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to look at the output.
You can find the expected output of this analysis by looking at the existing task `14 STAARexercise Sliding Window 5kb Chr19` in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to look at the output.


## Exercise 7.4 (Application)
@@ -119,18 +118,18 @@ The output file for these association tests using STAAR is an `.RData` file. To
- Click: Apps > `STAARpipelineSummary VarSet` > Run
- Specify the Inputs:
- Annotation name catalog: `Annotation_name_catalog.csv`
- Input array results: `region_sliding_5kb_chr19.Rdata`
- Input array results: `STAAR_region_sliding_5kb_chr19.Rdata`
- Specify App Settings:
- Output file prefix: "result_sliding_5kb_chr19" (or any other string to name the output file)
- Prefix of input results: "region_sliding_5kb_chr"
- Output file prefix: "STAAR_region_sliding_5kb_chr19" (or any other string to name the output file)
- Prefix of input results: "STAAR_region_sliding_5kb_chr19"
- Test type: Sliding_Window
- Click: Run

The analysis will take a few minutes to run. You can find your analysis in the Tasks menu of your Project to check on its progress and see the results once it has completed.

You can find the expected output of this analysis by looking at the existing task `14. STAARexercise_STAARpipelineSummary_VarSet_sliding5kb` in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to look at the output.
You can find the expected output of this analysis by looking at the existing task `15 STAARexercise Summary Sliding Window 5kb Chr19` in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to look at the output.

The output from this task is `<output_prefix>_results_sliging_window_genome.Rdata` and `<output_prefix>_results_sliging_window_genome_sig.csv`. Do note that the second file "sig", i.e. significant test files in this analysis, are empty - which is good/anticipated with a randomly generated phenotype!
The output from this task is `<output_prefix>_results_sliding_window_genome.Rdata` and `<output_prefix>_results_sliding_window_genome_sig.csv`. Note that the second file (the "sig" file of significant results) is empty in this analysis, which is expected with a randomly generated phenotype.

We can also use the `STAARpipelineSummary VarSet` app to adjust for a list of known variants we would like to condition our analysis on (i.e. include as a covariate) in order to determine if our identified rare variant signals are independent of such known single variants. Files of significant "cond" results will then appear. If you use this option, note you would need to input `null_obj_file`, `agds_file_name`, and `agds_files` across all chromosomes at once (the app is not intended to be run across a single chromosome and will fail if a file for each autosome is not found).

@@ -141,21 +140,21 @@ You can also use the `STAARpipelineSummary IndVar` App to examine the individual
- Click: Apps > `STAARpipelineSummary IndVar` > Run
- Specify the Inputs:
- Annotation name catalog: `Annotation_name_catalog.csv`
- AGDS file: `1KG_phase3_subset_1KG_phase3_subset_chr19_pruned_favor.gds`
- Null model: `chr19_region_null.Rdata`
- AGDS file: `1KG_phase3_STAAR_subset_chr19_favor.gds`
- Null model: `STAAR_chr19_region_null.Rdata`
- Specify App Settings:
- Chromosome: 19
- End location: 45668803
- Output file prefix: "slidingwindow_indvar_19_45663804" (or any other string to name the output file)
- Output file prefix: "STAAR_sliding_indvar_19_45663804" (or any other string to name the output file)
- Start location: 45663804
- Test type: Sliding_Window
- Click: Run
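
The start and end locations above span exactly one 5 kb window. A quick arithmetic check in R, assuming both coordinates are inclusive (the app's coordinate convention is not stated here):

```r
# Window size used in the sliding window test and the start location from the settings above
window_size <- 5000L
start_loc <- 45663804L

# With inclusive coordinates, a window of 5000 bp ends 4999 bp after its start
end_loc <- start_loc + window_size - 1L
end_loc
```

The same calculation can be used to derive the end location for any other window start taken from the sliding window results.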

The analysis will take a few minutes to run. You can find your analysis in the Tasks menu of your Project to check on its progress and see the results once it has completed.

The output from this analysis will be `<output_prefix>csv`.
The output from this analysis will be `<output_prefix>.csv`.

You can find the expected output of this analysis by looking at the existing task `15. STAARexercise_STAARpipelineSummary_IndVar` in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to look at the output.
You can find the expected output of this analysis by looking at the existing task `16 STAARexercise Summary IndVar Chr19` in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to look at the output.


## Gene-centric Tests