# Diagnostics {#Diagnostics_Chapter}
```{r, echo=FALSE}
# Detach any packages that happen to already be loaded. In general this is unnecessary,
# but it is important when building the book so that package namespaces don't
# fight unexpectedly.
pkgs <- names(sessionInfo()$otherPkgs)
if( length(pkgs) > 0 ){
  pkgs <- paste('package:', pkgs, sep = "")
  for( i in 1:length(pkgs) ){
    detach(pkgs[i], character.only = TRUE, force = TRUE)
  }
}
# Set my default chunk options
knitr::opts_chunk$set( fig.height=3 )
```
```{r, message=FALSE, warning=FALSE}
library(ggfortify) # for autoplot for lm objects
library(emmeans) # emmeans for pairwise contrasts.
library(tidyverse) # for dplyr, tidyr, ggplot2
```
We will be interested in analyzing whether or not our linear model is a good model and whether or not the data violate any of the required assumptions. In general we will be interested in three classes of assumption violations, and our diagnostic measures might be able to detect one or more of the following issues:
1. Unusual observations that contribute too much influence to the analysis. These few observations might drastically change the outcome of the model.
2. Model misspecification. Our assumption that $E\left[\boldsymbol{y}\right]=\boldsymbol{X}\boldsymbol{\beta}$ might be wrong and we might need to include different covariates in the model to get a satisfactory result.
3. Error distribution. We have assumed that $\epsilon_i \stackrel{iid}{\sim} N\left(0,\sigma^{2}\right)$ but autocorrelation, heteroscedasticity, and non-normality might be present.
Often problems with one of these can be corrected by transforming either the explanatory or response variables.
## Detecting Assumption Violations
Throughout this chapter I will use data created by Francis Anscombe that show how simple linear regression can be misused. In particular, these data sets will show how our diagnostic measures will detect various departures from the model assumptions.
The data are available in R in the data frame `anscombe`, which is loaded by default. The data consist of four datasets, each having the same fitted regression line $\hat{y}=3+0.5\,x$, but the datasets are drastically different.
```{r, message=FALSE, warning=FALSE}
# The anscombe dataset has 8 columns - x1,x2,x3,x4,y1,y2,y3,y4
# and I want it to have 3 columns - Set, X, Y
Anscombe <- rbind(
data.frame(x=anscombe$x1, y=anscombe$y1, set='Set 1'),
data.frame(x=anscombe$x2, y=anscombe$y2, set='Set 2'),
data.frame(x=anscombe$x3, y=anscombe$y3, set='Set 3'),
data.frame(x=anscombe$x4, y=anscombe$y4, set='Set 4'))
# order them by their x values, and add an index column
Anscombe <- Anscombe %>%
group_by(set) %>% # Every subsequent action happens by dataset
arrange(x,y) %>% # sort them on the x-values and if tied, by y-value
mutate( index = 1:n() ) # give each observation within a set, an ID number
# Make a nice graph
ggplot(Anscombe, aes(x=x, y=y)) +
geom_point() +
facet_wrap(~set, scales='free') +
stat_smooth(method="lm", formula=y~x, se=FALSE)
```
### Measures of Influence
#### Standardized Residuals (aka Studentized)
Recall that we have
$$\begin{aligned}\hat{\boldsymbol{y}}
&= \boldsymbol{X}\hat{\boldsymbol{\beta}}\\
&= \boldsymbol{X}\left(\boldsymbol{X}^{T}\boldsymbol{X}\right)^{-1}\boldsymbol{X}^{T}\boldsymbol{y}\\
&= \boldsymbol{H}\boldsymbol{y}\\
\end{aligned}$$
where the “Hat Matrix” is $\boldsymbol{H}=\boldsymbol{X}\left(\boldsymbol{X}^{T}\boldsymbol{X}\right)^{-1}\boldsymbol{X}^{T}$, so named because it “puts the hat on” $\boldsymbol{y}$, i.e. $\hat{\boldsymbol{y}}=\boldsymbol{H}\boldsymbol{y}$. The elements of $\boldsymbol{H}$ can be quite useful in diagnostics. It can be shown that the variance of the $i$th residual is
$$Var\left(\hat{\epsilon}_{i}\right)=\sigma^{2}\left(1-\boldsymbol{H}_{ii}\right)$$
where $\boldsymbol{H}_{ii}$ is the $i$th element of the main diagonal of $\boldsymbol{H}$. This suggests that I could rescale my residuals to
$$\hat{\epsilon}_{i}^{*}=\frac{\hat{\epsilon}_{i}}{\hat{\sigma}\sqrt{1-\boldsymbol{H}_{ii}}}$$ which, if the normality and homoscedasticity assumptions hold, should behave as a $N\left(0,1\right)$ sample.
These rescaled residuals are called “studentized residuals”, though R typically refers to them as “standardized”. Since we have a good intuition about the scale of a standard normal distribution, the scale of the standardized residuals gives a good indication of whether normality is violated.
There are actually two types of studentized residuals, typically called *internal* and *external* among statisticians. The version presented above is the *internal* version, which can be obtained using the R function `rstandard()`, while the *external* version is available using `rstudent()`. Whenever R presents standardized residuals, it means internally studentized residuals. For the sake of clarity, I will use the term *standardized* as well.
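As a quick sanity check, we can verify this formula by hand. The sketch below uses one of the Anscombe fits (my choice here; any `lm` object would do) and should agree with what `rstandard()` reports.
```{r}
# Minimal sketch: compute the internally studentized residuals by hand and
# compare them to rstandard(). Uses Anscombe's first set just as an example.
m <- lm(y1 ~ x1, data = anscombe)
h <- hatvalues(m)                        # the leverages H_ii
s <- summary(m)$sigma                    # sigma-hat from the full fit
manual <- resid(m) / (s * sqrt(1 - h))   # epsilon-hat_i / ( sigma-hat * sqrt(1 - H_ii) )
all.equal( unname(manual), unname(rstandard(m)) )  # should be TRUE
```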
##### Example - Anscombe's set 3
For the third dataset, the outlier is the observation with $x=13$ and $y=12.74$ (observation 10 after sorting by $x$). We calculate the standardized residuals using the function `rstandard()` and plot them
```{r}
Set3 <- Anscombe %>% filter(set == 'Set 3') # Just set 3
model <- lm(y ~ x, data=Set3) # Fit the regression line
Set3$stdresid <- rstandard(model) # rstandard() returns the standardized residuals
ggplot(Set3, aes(x=index, y=stdresid)) + # make a plot
geom_point() +
labs(x='Observation Index',
y='Standardized Residuals',
title='Standardized Residuals vs Observation Index')
```
and we notice that the outlier residual is really big. If the model assumptions were true, then the standardized residuals should follow a standard normal distribution, and I would need to have hundreds of observations before I wouldn't be surprised to see a residual more than 3 standard deviations from 0.
#### Leverage
The extremely large standardized residual suggests that this data point is important, but we would like to quantify how important this observation actually is.
One way to quantify this is to look at the elements of $\boldsymbol{H}$. Because $$\hat{y}_{i}=\sum_{j=1}^{n}\boldsymbol{H}_{ij}y_{j}$$
then the $i$th row of $\boldsymbol{H}$ is a vector of weights that tell us how influential each point $y_{j}$ is in calculating the predicted value $\hat{y}_{i}$. If I look at just the main diagonal of $\boldsymbol{H}$, the element $\boldsymbol{H}_{ii}$ is the weight an observation has on its own predicted value. As such, I can think of $\boldsymbol{H}_{ii}$ as the amount of leverage a particular data point has on the regression line. It can be shown that the leverages must satisfy $0 \le \boldsymbol{H}_{ii} \le 1$ and that $\sum_{i=1}^{n} \boldsymbol{H}_{ii} = p$.
```{r}
Set3 <- Anscombe %>% filter( set == 'Set 3')
Set4 <- Anscombe %>% filter( set == 'Set 4')
model3 <- lm(y ~ x, data = Set3 )
model4 <- lm(y ~ x, data = Set4 )
X <- model.matrix(model3)
H <- X %*% solve( t(X) %*% X) %*% t(X)
round(H, digits=2)
```
Fortunately there is already a function `hatvalues()` to compute these $\boldsymbol{H}_{ii}$ values for me. We will compare the leverages from Anscombe's set 3 versus set 4.
```{r}
Set3 <- Set3 %>% mutate(leverage = hatvalues(model3)) # add leverage columns
Set4 <- Set4 %>% mutate(leverage = hatvalues(model4))
ggplot( rbind(Set3,Set4), aes(x=index, y=leverage) ) +
geom_point() +
facet_grid( . ~ set )
```
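As a quick check of the two facts stated above, the diagonal of the hand-computed $\boldsymbol{H}$ should match `hatvalues()`, and the leverages should sum to $p=2$ for these simple regressions.
```{r}
# Check: diag(H) equals hatvalues(), and the leverages sum to p = 2
all.equal( unname(diag(H)), unname(hatvalues(model3)) )  # should be TRUE
sum( hatvalues(model3) )   # p = 2 parameters (intercept and slope)
sum( hatvalues(model4) )   # same for set 4
```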
This leverage idea only picks out the *potential* for a specific value of $x$ to be influential, but does not actually measure influence. It has picked out the issue with the fourth data set, but does not adequately address the outlier in set 3.
#### Cook's Distance
To attempt to measure the actual influence of an observation $\left\{ y_{i},\boldsymbol{x}_{i}^{T}\right\}$ on the linear model, we consider the effect on the regression if we removed the observation and fit the same model. Let $$\hat{\boldsymbol{y}}=\boldsymbol{X}\hat{\boldsymbol{\beta}}$$
be the vector of predicted values, where $\hat{\boldsymbol{\beta}}$ is created using all of the data, and $\hat{\boldsymbol{y}}_{(i)}=\boldsymbol{X}\hat{\boldsymbol{\beta}}_{(i)}$ be the vector of predicted values where $\hat{\boldsymbol{\beta}}_{(i)}$ was estimated using all of the data except the $i$th observation. Letting $p$ be the number of $\beta_{j}$ parameters as usual we define Cook's distance of the $i$th observation as
$$ D_{i} = \frac{\left(\hat{\boldsymbol{y}}-\hat{\boldsymbol{y}}_{(i)}\right)^{T}\left(\hat{\boldsymbol{y}}-\hat{\boldsymbol{y}}_{(i)}\right)}{p\hat{\sigma}^{2}} = \frac{\sum_{j=1}^n(\hat{y}_{j}-(\hat{y}_{(i)})_j)^2}{p\hat{\sigma}^2}$$
which boils down to saying if the predicted values have large changes when the $i$th element is removed, then the distance is big. It can be shown that this formula can be simplified to
$$D_{i}=\frac{\left(\hat{\epsilon}_{i}^{*}\right)^{2}\boldsymbol{H}_{ii}}{p\left(1-\boldsymbol{H}_{ii}\right)}$$
which expresses Cook's distance in terms of the $i$th studentized residual and the $i$th leverage.
Nicely, the R function `cooks.distance()` will calculate Cook's distance.
```{r, fig.height=3}
Set3 <- Set3 %>% mutate(cooksd = cooks.distance(model3))
Set4 <- Set4 %>% mutate(cooksd = cooks.distance(model4))
# Note: The high leverage point in set 4 has a Cook's distance of Infinity.
ggplot(rbind(Set3,Set4), aes(x=index, y=cooksd)) +
geom_point() +
facet_grid(. ~ set) +
labs(y="Cook's Distance")
```
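To connect the function back to the formula, we can check (for set 3, where no leverage equals 1) that `cooks.distance()` agrees with $D_i = \left(\hat{\epsilon}_i^*\right)^2 \boldsymbol{H}_{ii} / \left[p\left(1-\boldsymbol{H}_{ii}\right)\right]$.
```{r}
# Verify that cooks.distance() matches the studentized-residual formula for set 3.
r <- rstandard(model3)        # internally studentized residuals
h <- hatvalues(model3)        # leverages H_ii
p <- length(coef(model3))     # number of beta parameters (here p = 2)
all.equal( unname( r^2 * h / (p * (1 - h)) ),
           unname( cooks.distance(model3) ) )   # should be TRUE
```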
Some texts give a rule of thumb that points with Cook's distance greater than 1 should be considered influential, while others suggest $4/\left(n-p-1\right)$, where $n$ is the sample size and $p$ is the number of parameters in $\boldsymbol{\beta}$. My take is that you should look for values that are highly different from the rest of your data.
### Diagnostic Plots
After fitting a linear model in R, you have the option of looking at diagnostic plots that help to decide if any assumptions are being violated. We will step through each of the plots generated by the base function `plot(model)` or, with `ggplot2` graphics, by the `ggfortify` package.
In the package `ggfortify` there is a function, `fortify()`, that will calculate the diagnostic measures and add them to your dataset. This will simplify our graphing process.
```{r}
Set1 <- Anscombe %>% filter(set == 'Set 1')
model <- lm( y ~ x, data=Set1)
Set1 <- fortify(model) # add diagnostic measures to the dataset
Set1 %>% round(digits=3) # show the dataset nicely
```
#### Residuals vs Fitted
In the simple linear regression the most useful plot to look at was the residuals versus the $x$-covariate, but we also saw that this was similar to looking at the residuals versus the fitted values. In the general linear model, we will look at the residuals versus the fitted values or possibly the studentized residuals versus the fitted values.
##### Polynomial relationships
To explore how this plot can detect non-linear relationships between $y$ and $x$, we will examine a data set from Ashton et al. (2007) that relates the length of a tortoise's carapace to the number of eggs laid in a clutch. The data are
```{r, fig.height=3}
Eggs <- data.frame(
carapace = c(284,290,290,290,298,299,302,306,306,
309,310,311,317,317,320,323,334,334),
clutch.size = c(3,2,7,7,11,12,10,8,8,
9,10,13,7,9,6,13,2,8))
ggplot(Eggs, aes(x=carapace, y=clutch.size)) +
geom_point()
```
Looking at the data, a straight-line relationship does not seem appropriate, but we will fit the simple linear model anyway and look at the residual graph.
```{r, fig.height=3}
model <- lm( clutch.size ~ carapace, data=Eggs )
plot(model, which=1) # Base R Function: which=1 tells R to only make the first plot
```
```{r, fig.height=5, warning=FALSE, message=FALSE}
lindia::gg_diagnose(model) # using lindia package
```
```{r, fig.height=2.5}
lindia::gg_diagnose(model, plot.all=FALSE)[[3]] # using lindia package
```
```{r, fig.height=2.5, warning=FALSE, message=FALSE}
autoplot(model, which=1) # same plot using ggplot2 and ggfortify package
```
The blue curve running through the plot is a smoother of the residuals. Ideally this should be a flat line and I should see no trend in this plot. Clearly there is a quadratic trend: larger tortoises have larger clutch sizes up to some point, after which the extremely large tortoises start laying fewer eggs (perhaps the extremely large tortoises are extremely old as well). To correct for this, we should fit a model that is quadratic in carapace length. We will create a new covariate, `carapace.2`, which is the square of the carapace length, and add it to the model.
In general I could write the quadratic model as
$$y_{i}=\beta_{0}+\beta_{1}x_{i}+\beta_{2}x_{i}^{2}+\epsilon_{i}$$
and note that my model is still a linear model with respect to covariates $\boldsymbol{x}$ and $\boldsymbol{x}^{2}$ because I can still write the model as
$$\begin{aligned} \boldsymbol{y}
&= \boldsymbol{X}\boldsymbol{\beta}+\boldsymbol{\epsilon} \\
&= \left[\begin{array}{ccc}
1 & x_{1} & x_{1}^{2}\\
1 & x_{2} & x_{2}^{2}\\
1 & x_{3} & x_{3}^{2}\\
\vdots & \vdots & \vdots\\
1 & x_{n} & x_{n}^{2}
\end{array}\right]\left[\begin{array}{c}
\beta_{0}\\
\beta_{1}\\
\beta_{2}
\end{array}\right]
+
\left[\begin{array}{c}
\epsilon_{1}\\
\epsilon_{2}\\
\epsilon_{3}\\
\vdots\\
\epsilon_{n}
\end{array}\right]\end{aligned}$$
```{r}
# add a new column that is carapace^2
Eggs2 <- Eggs %>% mutate( carapace.2 = carapace^2 )
model <- lm( clutch.size ~ carapace + carapace.2, data=Eggs2 )
# make R do it inside the formula... convenient
model <- lm( clutch.size ~ carapace + I(carapace^2), data=Eggs )
# Fit an arbitrary degree polynomial, I recommend this method for fitting the model!
model <- lm( clutch.size ~ poly(carapace, 2), data=Eggs )
# If you use poly() in the formula, you must pass 'data=' to autoplot() below;
# otherwise you can skip it and R will do the right thing.
autoplot(model, which=1, data=Eggs)
```
Now our residual plot versus fitted values does not show any trend, suggesting that the quadratic model is fitting the data well. Graphing the original data along with the predicted values confirms this.
```{r}
# add the fitted and CI lwr/upr columns to my dataset
Eggs <- Eggs %>%
select( -matches('fit'), -matches('lwr'), -matches('upr') ) %>%
cbind( predict(model, interval='confidence') )
ggplot(Eggs, aes(x=carapace)) +
geom_ribbon( aes(ymin=lwr, ymax=upr), fill='red', alpha=.3) +
geom_line(aes(y=fit), color='red') +
geom_point(aes(y=clutch.size))
```
##### Heteroskedasticity
The plot of residuals versus fitted values can detect heteroskedasticity (non-constant variance) in the error terms.
To illustrate this, we turn to another dataset used in the Faraway book. The dataset `airquality` comes from an environmental study that measured four variables (ozone, solar radiation, temperature, and wind speed) for 153 consecutive days in New York. The goal is to predict the level of ozone using the weather variables.
We first graph all pairs of variables in the dataset.
```{r, warning=FALSE, message=FALSE}
data(airquality)
# pairs(~ Ozone + Solar.R + Wind + Temp, data=airquality)
airquality %>% select( Solar.R, Wind, Temp, Ozone) %>%
GGally::ggpairs()
```
and notice that ozone levels are positively correlated with solar radiation and temperature, and negatively correlated with wind speed. A linear relationship with wind is suspect, as is the increasing variability in the response at high temperatures. However, we don't know if those trends will remain after fitting the model, because there is some covariance among the predictors.
```{r, fig.height=3}
model <- lm(Ozone ~ Solar.R + Wind + Temp, data=airquality)
autoplot(model, which=1)
```
As we feared, we have both a non-constant variance and a non-linear relationship. A transformation of the $y$ variable might be able to fix our problem.
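For example, one transformation worth trying is taking the log of the response. The sketch below keeps the same covariates but refits the model with `log(Ozone)` so we can re-examine the residual-vs-fitted plot; this is just one candidate transformation, not necessarily the final model.
```{r, fig.height=3}
# Sketch: refit with a log-transformed response and look at the residual plot again.
model.log <- lm( log(Ozone) ~ Solar.R + Wind + Temp, data=airquality )
autoplot(model.log, which=1)
```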
#### QQplots
If we are taking a sample of size $n=10$ from a standard normal distribution, then I should expect that the smallest observation will be negative. Intuitively, you would expect the smallest observation to be near the $10$th percentile of the standard normal, and likewise the second smallest should be near the $20$th percentile.
This idea needs a little modification because the largest observation cannot be near the $100$th percentile (because that is $\infty$). So we'll adjust the estimates to still be spaced at $(1/n)$ quantile increments, but starting at the $0.5/n$ quantile instead of the $1/n$ quantile. So the smallest observation should be near the $0.05$ quantile, the second smallest should be near the $0.15$ quantile, and the largest observation should be near the $0.95$ quantile. I will refer to these as the *theoretical quantiles*.
```{r, echo=FALSE}
x <- seq(-3,3,length=1000)
labels <- NULL
labels[1] <- expression(paste(z[.05]))
labels[2] <- expression(paste(z[.15]))
labels[3] <- expression(paste(z[.25]))
labels[4] <- expression(paste(z[.35]))
labels[5] <- expression(paste(z[.45]))
labels[6] <- expression(paste(z[.55]))
labels[7] <- expression(paste(z[.65]))
labels[8] <- expression(paste(z[.75]))
labels[9] <- expression(paste(z[.85]))
labels[10]<- expression(paste(z[.95]))
ggplot(data.frame(x=x, y=dnorm(x)), aes(x=x, y=y)) +
scale_x_continuous(breaks=qnorm(seq(.05,.95,by=.1)), labels=labels) +
geom_line() +
labs(y='Density', title='Std Normal distribution')
```
I can then graph the theoretical quantiles versus my observed values, and if they lie near the 1-to-1 line, then my data are consistent with a standard normal distribution.
```{r}
set.seed(93516) # make the random sample in this chunk consistent run-to-run
n <- 10
data <- data.frame( observed = rnorm(n, mean=0, sd=1) ) %>%
arrange(observed) %>%
mutate( theoretical = qnorm( (1:n -.5)/n ) )
ggplot(data, aes(x=theoretical, y=observed) ) +
geom_point() +
geom_abline( intercept=0, slope=1, linetype=2, alpha=.7) +
labs(title='Q-Q Plot: Observed vs Normal Distribution')
```
In the context of a regression model, we wish to look at the residuals and see if there are obvious departures from normality. Returning to the air quality example, R will calculate the qqplot for us.
```{r}
model <- lm(Ozone ~ Solar.R + Wind + Temp, data=airquality)
autoplot(model, which=2)
```
In this case, we have a large number of residuals that are farther from zero than we would expect if they came from a normal distribution. We can test this further using the Shapiro-Wilk test, comparing the standardized residuals against a $N\left(0,1\right)$ distribution.
```{r}
shapiro.test( rstandard(model) )
```
The tail of the distribution of observed residuals is *far* from what we expect to see.
#### Scale-Location Plot
This plot is a variation on the residuals vs fitted plot, but the y-axis uses the square root of the absolute value of the standardized residuals. Supposedly this makes increasing variance easier to detect, but I'm not convinced.
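For reference, the scale-location plot for the air quality model fit above can be produced with `which=3`; the chunk below is just for illustration.
```{r, fig.height=3}
# Scale-location plot: sqrt(|standardized residuals|) versus fitted values.
autoplot(model, which=3)
```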
#### Residuals vs Leverage (plus Cook's Distance)
This plot lets the user examine which observations have a high potential for being influential (i.e., high leverage) versus how large the residual is. Because Cook's distance is a function of those two quantities, we can also divide the graph into regions by the value of Cook's distance.
Returning to Anscombe's third set of data, we see
```{r}
model3 <- lm(y ~ x, data=Set3)
autoplot(model3, which=5)
```
that one data point (observation 10) has an extremely large standardized residual. This is one plot where I prefer the base R graphics to the `ggfortify` version. The base version adds contour lines marking where Cook's distance equals 1/2 and 1.
```{r}
plot(model3, which=5)
```
## Exercises {#Exercises_Diagnostics}
1. In the ANCOVA chapter, we examined the effect of vitamin C dose on guinea pig tooth growth, based on how the vitamin was delivered (orange juice vs a pill supplement).
a. Load the `ToothGrowth` data which is pre-loaded in base R.
b. Plot the data with `dose` level on the x-axis and tooth length growth (`len`) on the y-axis. Color the points by supplement type (`supp`).
c. Is the data evenly distributed along the x-axis? Comment on the wisdom of using this design.
d. Fit a linear model to these data and examine the diagnostic plots. What stands out to you?
e. Log-transform the dose variable and repeat parts (c) and (d). Comment on the effect of the log transformation.
2. The dataset `infmort` in the `faraway` package has information about infant mortality from countries around the world. Be aware that this is an old data set and does not necessarily reflect current conditions. More information about the dataset can be found using `?faraway::infmort`. We will be interested in understanding how infant mortality is predicted by per capita income, world region, and oil export status.
a. Plot the relationship between income and mortality. This can be done using the command
```{r, eval=FALSE}
data('infmort', package='faraway')
pairs(mortality ~., data=infmort)
```
What do you notice about the relationship between mortality and income?
b. Fit a linear model without any interaction terms with all three covariates as predictors of infant mortality. Examine the diagnostic plots. What stands out?
c. Examine the pairs plot with log(mortality), income, and log(income). Which should be used in our model, `income` or `log(income)`?
3. The dataset `pressure` in the `datasets` package gives empirical measurements of the vapor pressure of mercury at various temperatures. Fit a model with pressure as the response and temperature as the predictor, using transformations to obtain a good fit. Feel free to experiment with what might be considered a ridiculously complicated model with a high-degree polynomial. These models can most easily be fit using the `poly(x, degree=p)` function in the formula specification, where you swap out the covariate `x` and polynomial degree `p`.
a. Document your process of building your final model. Do not show graphs or computer output that is not relevant to your decision or that you do not wish to comment on.
b. Comment on the interpretability of your (possibly ridiculously complicated) model. Consider a situation where I'm designing a system where it is easy to measure temperature, but I am unable to measure pressure. Could we get by with just measuring temperature? What if I didn't know the Clausius-Clapeyron equation (which describes vapor pressure at various temperatures) and I'm trying to better understand how temperature and pressure are related?
4. We will consider the relationship between income and race using a subset of employed individuals from the American Community Survey.
a. Load the `EmployedACS` dataset from the `Lock5Data` package.
b. Create a box plot showing the relationship between `Race` and `Income`.
c. Fit an ANOVA model to this data and consider the diagnostic plots for the residuals. What do you notice?