forked from UW-GAC/SISG_2024
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path03.A_GENESIS_model_explorer.Rmd
142 lines (107 loc) · 11 KB
/
03.A_GENESIS_model_explorer.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
# 3.A. Exploring Association Results
In this tutorial, we will learn how to use the [GENESIS Model Explorer App](https://genesis-model-explorer-app.bdc.sb-webapp.com/?project=smgogarten/uw-gac-commit), which is an "Interactive Browser" built with [R Shiny](https://shiny.rstudio.com/) on the NHLBI BioData Catalyst powered by Seven Bridges cloud platform.
## GENESIS Model Explorer App
The [GENESIS Model Explorer App](https://genesis-model-explorer-app.bdc.sb-webapp.com/?project=smgogarten/uw-gac-commit) is an interactive tool that enables users to make figures to visualize and explore the results of a GENESIS null model, paired with phenotype and genotype data on the same samples. It is meant to provide an intuitive interface for researchers to easily select, visualize, and explore phenotypes, genotypes, and a fitted GENESIS model interactively with no prior R programming knowledge. The app takes three inputs:
- **Null Model File:** The null model file should be any fitted GENESIS null model saved in .RData format. The null model could have been created interactively using the `fitNullModel` function in an R session (e.g. in Data Studio or on your local machine), or it could be the output from the `GENESIS Null Model` application.
- **Phenotype File:** The phenotype file should be a data.frame or `AnnotatedDataFrame` saved in .RData format. The data.frame must contain all of the samples included in your null model file in a column named `sample.id`, with additional columns containing phenotype variables of interest. If you used the `GENESIS Null Model` application to fit your null model, we recommend using the `<output_prefix>_phenotypes.RData` output file, which contains all of the phenotype data from all of the samples used in the analysis. Alternatively, you can use the same phenotype file used as input to fit your null model, or an entirely new file where you have added additional columns with phenotype variables of interest.
- **Genotype File (Optional):** Providing an optional genotype file allows the user to make figures looking at the relationships of variants of interest with null model variables and phenotypes of interest. The genotype file should be a data.frame saved in .rds format. The data.frame must contain all of the samples included in your null model file in a column named `sample.id`, with additional columns containing variant allele counts or dosages. Conveniently, this file can be generated from an existing GDS file with the `GDS Genotype Extractor` application (see below).
We will now use the [GENESIS Model Explorer](https://genesis-model-explorer-app.bdc.sb-webapp.com/?project=smgogarten/uw-gac-commit) to make some figures exploring the data:
- Launch the interactive browser
- From the top menu, click "Public Resources" > "Interactive Web Apps"
- Click: "Open" on the GENESIS Model Explorer App
- Click: "Yes" to proceed
- Click: "Get Started"
- Load Data
- Null Model File
- Project: select your SISG project (should be chosen by default if you launched the app as described)
- Current File: select `1KG_trait_1_null_model_reportonly.RData` (much smaller file without extra matrices required for computing association test statistics)
- Phenotype File:
- Project: select your SISG project
- Current File: select `1KG_trait_1_phenotypes.RData` (this is the phenotype file that was created by the null model application)
- Click: Load Data
Once you load the data, you will be taken to a "Plot setup" screen, where you can select what is plotted. We will make a few different plots. Once you've selected your variables, click "Generate Plot" to render the figure. To make a new plot, change the parameters and click "Generate Plot" again.
- Outcome Histogram
- x-axis: Model: outcome
- Outcome Density Plot
- x-axis: Model: outcome
- plot type: density plot
- Scatterplot of Residuals vs Fitted Values
- x-axis: Model: fitted.values
- y-axis: Model: resid.marginal
- plottype: scatterplot
- Additional Options
- Add y = 0 line
- Add smooth line
- Boxplot of trait_1 by sex
- x-axis: Phenotype: sex
- y-axis: Phenotype: trait_1
- plot type: boxplot
- Scatterplot of trait_1 vs age, grouped by sex (sex indicated by color)
- x-axis: Phenotype: age
- y-axis: Phenotype: trait_1
- plot type: scatterplot
- group by: sex
- Scatterplot of trait_1 vs age, faceted by sex (each sex in its own panel)
- x-axis: Phenotype: age
- y-axis: Phenotype: trait_1
- plot type: scatterplot
- facet by: Phenotype: sex
## Extracting Sample Genotypes from a GDS
Perhaps we want to look at the relationship between the genotype values of our association study "hits" and our phenotypes or model residuals. The GENESIS Model Explorer can do this as well if we provide the optional Genotype file with sample genotype values for the variants of interest. Conveniently, this file can be generated from an existing GDS file with the `GDS Genotype Extractor` application.
First, let's identify a few variants to use for this demonstration. After running an Application (e.g. GENESIS Single Variant Association Testing) on the BioData Catalyst Powered by Seven Bridges platform, the output files are saved in the directory `/sbgenomics/project-files/`. You can load these files into RStudio to explore them interactively -- we load the chromosome 19 single variant association test results from the task `07 Single Variant Association Test trait_1`, and identify the most significant variant.
```{r, eval = FALSE}
# load the association results
assoc <- get(load('/sbgenomics/project-files/1KG_trait_1_assoc_chr19.RData'))
# variant with minimum p-value
assoc[which.min(assoc$Score.pval), ]
```
We need to create a "variant include file" with the `variant.id` of this variants as input for the `GDS Genotype Extractor` application. The variant include should be saved as an .rds file using the `saveRDS` function.
```{r, eval = FALSE}
varid <- '1070473'
saveRDS(varid, file = '/sbgenomics/output-files/1KG_trait_1_chr19_variant_include.rds')
```
We also need to create a "sample include file" with the `sample.id` of all the samples included in our analysis as input for the `GDS Genotype Extractor` application. We can get these `sample.id` values from our fitted null model -- we can use the `<output_prefix>_null_model_reportonly.RData` file, which is much smaller than the `<output_prefix>_null_model.RData` file by excluding some large matrices only needed for computing association test results. The sample include file should also be saved as an .rds file using the `saveRDS` function.
```{r, eval = FALSE}
nullmod <- get(load('/sbgenomics/project-files/1KG_trait_1_null_model_reportonly.RData'))
# the sample.id are stored in the "fit" data.frame
head(nullmod$fit)
sampid <- nullmod$fit$sample.id
length(sampid)
```
```{r, eval = FALSE}
saveRDS(sampid, file = '/sbgenomics/output-files/1KG_trait_1_sample_include.rds')
```
**Note about Directories:** The working directory for the Data Studio is `sbgenomics/workspace/` (you can see this by going to the Terminal in RStudio and typing `pwd`). This directory is accessible in the Data Studio, but the Applications in your Project can **not** see files here. Applications only see the `sbgenomics/project-files/` directory (which is read-only from the Data Studio). In order to make our variant include file visible to the `GDS Genotype Extractor` application, we save our file to the `/sbgenomics/output-files/` directory. When we stop our RStudio session (and only then), new files in the `/sbgenomics/output-files/` directory will be copied over to the `/sbgenomics/project-files/` directory, making them available to Applications. More details can be found in the platform [documentation](https://sb-biodatacatalyst.readme.io/docs/about-files-in-a-data-cruncher-analysis).
## Exercise 3.A.1 (Application)
Use the `GDS Genotype Extractor` app on the BioData Catalyst powered by Seven Bridges platform to create an .rds file with genotype values for all samples in our `1KG_phase3_GRCh38_subset` GDS files at the lead variant on chromosome 19 we identified above. The steps to perform this analysis are as follows:
- Copy the app to your project if it is not already there:
- Click: Public Resources > Workflows and Tools > Browse
- Search for `GDS Genotype Extractor`
- Click: Copy > Select your project > Copy
- Run the analysis in your project:
- Click: Apps > `GDS Genotype Extractor` > Run
- Specify the Inputs:
- GDS file: `1KG_phase3_GRCh38_subset_chr19.gds`
- Sample include file: `1KG_trait_1_sample_include.rds`
- Variant include file: `1KG_trait_1_chr19_variant_include.rds`
- Output prefix: "1KG_trait_1_chr19" (or any other string to name the output file)
- Click: Run
The analysis will take a few minutes to run. You can find your analysis in the Tasks menu of your Project to check on its progress and see the results once it has completed.
The output of this analysis will be a `<output_prefix>_genotypes.rds` file that contains a column of `sample.id` and then one column per variant with the genotype values for each sample, and a `<output_prefix>_variant_info.rds` file with one row per variant and columns providing details such as variant identifiers, chromosome, position, and ref and alt alleles.
You can find the expected output of this analysis by looking at the existing task `GDS Genotype Extractor Chr19` in the Tasks menu of your Project. The output files are available in the Project, so you do not need to wait for your analysis to finish to move to the next exercise.
## Exercise 3.A.2 (GENESIS Model Explorer)
Use the GENESIS Model Explorer to make a boxplot of the Cholesky residuals (`resid.cholesky`) from the 1KG trait_1 null model by genotype value of the variant at chr19:45084084 A>G. The Cholesky residuals are a transformation of the marginal residuals computed using the estimated model covariance structure to remove the correlation among samples. The correlation of these residuals with the genotype values if essentially what the score statistic is measuring when we perform our association tests. What do you observe in the boxplot?
### Solution 3.A.2 (GENESIS Model Explorer)
- In the GENESIS Model Explorer window, click back on the "Load Data" tab and add the following file:
- Genotype File:
- Project: select your SISG project
- Current File: select `1KG_trait_1_chr19_genotypes.rds` (this is the genotype file we created in Exercise 5.2)
- Click: Load Data
- Set the plotting parameters as follows:
- x-axis: Genotype: chr19:45084084_A_G
- y-axis: Model: resid.cholesky
- plot type: boxplot
- Additional Options
- Add y = 0 line
- Click "Generate plot"
From the boxplot, we can see that there is an "upward" trend in the median residual value across genotypes. The values 0/1/2 of the genotype value correspond to the number of copies of the alternate allele (in this case, the G allele), so we observe that having more copies of the G allele is associated with higher values of trait_1, after adjusting for the covariates in our model. This is consistent with the `Score` value for this variant from our association test, which also has a positive value.