forked from UW-GAC/SISG_2024
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path03_advanced_GWAS.Rmd
287 lines (205 loc) · 12.3 KB
/
03_advanced_GWAS.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
# 3. Advanced GWAS Models
This tutorial extends what was previously introduced in the `02_GWAS.Rmd` tutorial to more advanced models using the [GENESIS](https://bioconductor.org/packages/release/bioc/html/GENESIS.html) R/Bioconductor package.
## Sparse Kinship Matrix
Recall that fitting the null model uses a kinship matrix (KM) that captures the genetic correlation among samples. In the `02_GWAS.Rmd` tutorial, we used a dense KM that has non-zero values for all entries in the matrix. This works well with small sample sizes (like we have here), but can require a lot of memory and be very computationally demanding with large samples.
In large population based samples, we can make an empirical KM sparse by zeroing out small values that are near 0. When creating a PC-Relate KM with `pcrelateToMatrix` this can be done by setting the `thresh` parameter equal to the smallest non-zero value to keep in the matrix. Alternatively, we can use the `makeSparseMatrix` function from the GENESIS package to sparsify an existing matrix into a sparse block-diagonal matrix.
```{r, sparse}
repo_path <- "https://github.com/UW-GAC/SISG_2024/raw/main"
if (!dir.exists("data")) dir.create("data")
library(GENESIS)
# load the PC-Relate KM
kmfile <- "data/pcrelate_Matrix.RData"
if (!file.exists(kmfile)) download.file(file.path(repo_path, kmfile), kmfile)
km <- get(load(kmfile))
dim(km)
km[1:5,1:5]
# make the KM sparse at 4th degree relatedness
skm <- makeSparseMatrix(km, thresh = 2*2^(-11/2))
dim(skm)
skm[1:5,1:5]
```
This sparse KM can be used in the null model in place of the original dense KM that we used in the `02_GWAS.Rmd` tutorial -- exercise left to the reader.
## Two Stage Model
As discussed in the lecture, we recommend a fully adjusted two-stage inverse Normalization procedure for fitting the null model for quantitative traits, particularly when the outcome has a non-Normal distribution. See [Sofer et al. (2019)](https://onlinelibrary.wiley.com/doi/10.1002/gepi.22188) for more information on the fully adjusted two-stage model.
### Phenotype Data
First we load the `AnnotatedDataFrame` with the phenotype data that we prepared previously.
```{r, message = FALSE}
library(Biobase)
```
```{r}
# pheno data
annotfile <- "data/pheno_annotated_pcs.RData"
if (!file.exists(annotfile)) download.file(file.path(repo_path, annotfile), annotfile)
annot <- get(load(annotfile))
head(pData(annot))
```
For this section of the tutorial, we will be analyzing `trait_2`. Make a histogram of the trait_2 values -- what do you notice about the distribution?
```{r}
library(ggplot2)
ggplot(pData(annot), aes(x = trait_2)) + geom_histogram()
```
### Standard Analysis
First, let's run the GWAS using the standard LMM. Recall that we first fit the null model and then perform single variant score tests.
#### Null Model
We use the `fitNullModel` function from GENESIS. We need to specify the `AnnotatedDataFrame` with the phenotype data (annot), the outcome variable (trait_2), and the fixed effect covariates (sex, age, and PCs 1-7). We also include the kinship matrix to specify the covariance structure of the polygenic random effect in the model.
```{r}
nullmod <- fitNullModel(annot,
outcome = "trait_2",
covars=c("sex", "age", paste0("PC", c(1:7))),
cov.mat=km,
verbose=TRUE)
```
#### Association Tests
After fitting the null model, we use single-variant score tests to test each variant across the genome separately for association with the outcome, accounting for genetic ancestry and genetic relatedness among the samples. Recall that we make a `SeqVarData` iterator object that links the genotype data in the GDS file with the phenotype data in our `AnnotatedDataFrame` and use the `assocTestSingle` function from GENESIS to perform the association tests.
```{r, message = FALSE}
library(SeqVarTools)
# open a connection to the GDS file
gdsfile <- "data/1KG_phase3_GRCh38_subset_chr1.gds"
if (!file.exists(gdsfile)) download.file(file.path(repo_path, gdsfile), gdsfile)
gdsfmt::showfile.gds(closeall=TRUE) # make sure file is not already open
gds <- seqOpen(gdsfile)
# make the seqVarData object
seqData <- SeqVarData(gds, sampleData=annot)
# make the iterator object
iterator <- SeqVarBlockIterator(seqData, verbose=FALSE)
# run the single-variant association test
assoc <- assocTestSingle(iterator,
null.model = nullmod,
test = "Score")
dim(assoc)
```
We make a QQ plot of all variants with MAC >= 5. What do you notice about this QQ plot?
```{r}
qqPlot <- function(pval) {
pval <- pval[!is.na(pval)]
n <- length(pval)
x <- 1:n
dat <- data.frame(obs=sort(pval),
exp=x/n,
upper=qbeta(0.025, x, rev(x)),
lower=qbeta(0.975, x, rev(x)))
ggplot(dat, aes(-log10(exp), -log10(obs))) +
geom_line(aes(-log10(exp), -log10(upper)), color="gray") +
geom_line(aes(-log10(exp), -log10(lower)), color="gray") +
geom_point() +
geom_abline(intercept=0, slope=1, color="red") +
xlab(expression(paste(-log[10], "(expected P)"))) +
ylab(expression(paste(-log[10], "(observed P)"))) +
theme_bw()
}
# make a QQ plot
qqPlot(assoc$Score.pval[assoc$MAC >= 5])
```
### Fully-Adjusted Two-Stage Model Analysis
Now, let's run the GWAS using the fully-adjusted two-stage LMM.
#### Null Model
To run the fully-adjusted two-stage null model, we simply set the `two.stage` option to `TRUE`.
```{r}
# fit the two stage null model
nullmod.twostage <- fitNullModel(annot,
outcome = "trait_2",
covars=c("sex", "age", paste0("PC", c(1:7))),
cov.mat=km,
two.stage = TRUE,
verbose=TRUE)
```
Notice that the messages from the function show that the model was fit twice -- first with the original outcome and second with the inverse-Normal transformed residuals.
```{r}
# description of the model we fit
nullmod.twostage$model
```
From the model description, we can see this is a two stage model because the formula element has `rankInvNorm(resid(trait_2))` as the outcome variable.
#### Association Tests
After fitting the two-stage null model, the association testing procedure is exactly the same. Since we've already created our `SeqVarData` iterator, we do not need to create it again; however, we do need to "reset" it to the first block of data.
```{r, message = FALSE}
# reset the filter to the first block
seqResetFilter(iterator, verbose = FALSE)
# run the single-variant association test
assoc.twostage <- assocTestSingle(iterator,
null.model = nullmod.twostage,
test = "Score")
dim(assoc.twostage)
```
We make a QQ plot of all variants with MAC >= 5. What do you notice about this QQ plot compared to the one from the standard null model?
```{r}
# make a QQ plot
qqPlot(assoc.twostage$Score.pval[assoc.twostage$MAC >= 5])
```
From the QQ plot, we see that using the fully-adjusted two-stage model substantially decreased the amount of inflation in the test statistics. We know that this is because of the non-Normality / skewness of trait_2, but we can compare the marginal residuals from both null models to see how their distributions look after covariate adjustment
```{r null_model_two_stage}
# merge the data for plotting
pdat <- merge(nullmod$fit, nullmod.twostage$fit,
by = 'sample.id', suffixes = c('.orig', '.twostage'))
head(pdat, 2)
# distribution of residuals - original null model
ggplot(pdat, aes(x = resid.marginal.orig)) + geom_histogram()
# distribution of residuals - two stage null model
ggplot(pdat, aes(x = resid.marginal.twostage)) + geom_histogram()
# compare residuals
ggplot(pdat, aes(x = resid.marginal.orig, y = resid.marginal.twostage)) +
geom_point() +
geom_abline(intercept = 0, slope = 1)
```
As expected, the residuals from the original model are very skewed, while the residuals from the two-stage model are much closer to Normally distributed. The skewness of the residuals in the original model leads to inflation in variant test statistics (as seen in the QQ plots), particularly for rare or low frequency variants, where effect allele carriers with extreme residuals in the tails of the distribution can have high leverage on the score statistics.
## Binary Traits
GENESIS also supports testing binary (e.g. case/control) outcomes.
### Phenotype Data
The outcome `status` in the annotated phenotype data is a binary case/control variable. Look at the prevalence of this outcome
```{r}
table(pData(annot)$status)
```
### Logistic Mixed Model Analysis
GENESIS implements the GMMAT method of approximate logistic mixed models for association testing with binary outcomes. See [Chen et al. (2016)](https://www.sciencedirect.com/science/article/pii/S000292971600063X) for more information on the GMMAT method.
#### Null Model
We can fit a null model using a logistic mixed model by specifying the argument `family=binomial` in the `fitNullModel` function. As before, we need to specify the `AnnotatedDataFrame` with the phenotype data (annot), the outcome variable (status), and the fixed effect covariates (sex, age, and PCs 1-7). We still include the kinship matrix to specify the covariance structure of the polygenic random effect in the model.
```{r}
# fit the null model with logistic mixed model
nullmod.status <- fitNullModel(annot,
outcome="status",
covars=c("sex", "age", paste0("PC", 1:7)),
cov.mat = km,
family=binomial,
verbose=TRUE)
```
#### Association Tests
After fitting the logistic null model, the association testing procedure is still exactly the same. Since we've already created our `SeqVarData` iterator, we do not need to create it again; however, we do need to "reset" it to the first block of data.
```{r, message = FALSE}
# reset the filter to the first block
seqResetFilter(iterator, verbose = FALSE)
# run the single-variant association test
assoc.status <- assocTestSingle(iterator,
null.model = nullmod.status,
test = "Score")
dim(assoc.status)
```
As usual, we make a QQ plot of all variants with MAC >= 5. What do you notice about the tail of this QQ plot?
```{r}
# make a QQ plot
qqPlot(assoc.status$Score.pval[assoc.status$MAC >= 5])
```
#### SPA test
In samples with highly imbalanced case:control ratios, the Score test can be inflated for rare and low frequency variants. Saddlepoint approximation (SPA) can be used to improve $p$-value calculations, and is available in GENESIS by setting the argument `test=Score.SPA` in `assocTestSingle`. See [Dey et al. (2017)](https://www.cell.com/ajhg/fulltext/S0002-9297(17)30201-X) and [Zhou et al. (2018)](https://www.nature.com/articles/s41588-018-0184-y) for details on using SPA in GWAS. Re-run the analysis using the SPA $p$-value calculation.
```{r, message = FALSE}
# reset the filter to the first block
seqResetFilter(iterator, verbose = FALSE)
# run the single-variant association test
assoc.spa <- assocTestSingle(iterator,
null.model = nullmod.status,
test = "Score.SPA")
dim(assoc.spa)
head(assoc.spa)
```
Notice that the results of this test include two new columns: `SPA.pval`, which are the $p$-values after the SPA adjustment, and `SPA.converged` which indicates if the SPA adjustment was successful (`TRUE/FALSE`); if the value is `NA` then the SPA adjustment was not applied and the original score test $p$-value is returned -- for computational efficiency, SPA is only applied when the original score test $p$-value $< 0.05$. Note that the `Score`, `Score.SE`, and `Score.Stat` are exactly the same for all variants as when using the standard score test, as the SPA adjustment only alters the $p$-value.
```{r}
table(assoc.spa$SPA.converged, exclude = NULL)
```
Make a new QQ plot of all variants with MAC >= 5. How does using the SPA $p$-value adjustment affect the results?
```{r}
# make a QQ plot
qqPlot(assoc.spa$SPA.pval[assoc.spa$MAC >= 5])
```
Using the SPA $p$-value adjustment has resolved the test statistic inflation!
```{r}
# close the GDS file!
seqClose(seqData)
```