-
Notifications
You must be signed in to change notification settings - Fork 21
/
Copy pathlm.Rmd
385 lines (298 loc) · 11.6 KB
/
lm.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
---
title: "Linear Models"
author: "Douglas Bates"
date: "10/09/2014"
output:
ioslides_presentation:
wide: true
small: true
---
```{r preliminaries,echo=FALSE,results='hide'}
library(knitr)
library(ggplot2)
library(hexbin)
library(xtable)
opts_chunk$set(cache=TRUE)
options(width=100)
```
# Simple linear regression
## The brain-weight data
A famous data set from a 1976 _Science_ paper provides average body weight (kg) and brain weight (g) for 62 species of mammals.
```{r brains}
data("brains",package="alr4")
str(brains)
head(brains)
```
## Initial plots
```{r p1,fig.align='center'}
p <- ggplot(brains,aes(x=BodyWt,y=BrainWt)) + xlab("Average body weight (kg)") + ylab("Average brain weight (g)")
(p <- p + geom_point())
```
## On a log-log scale
```{r log-log,fig.align='center'}
(p <- p + scale_x_log10() + scale_y_log10())
```
## With a smoother line
```{r smoother,fig.align='center',echo=FALSE,warning=FALSE}
p + geom_smooth()
```
## With a regression line
```{r regr,fig.align='center',echo=FALSE}
p + geom_smooth(method="lm")
```
# Fitting a simple linear regression
## Calling `lm`
- As with other model-fitting functions in `R`, the first argument to `lm` is a formula.
- The second, optional but recommended, argument is `data` which is a data frame in which to evaluate the expressions in the formula
+ it may seem handy to omit it and use variables from the GlobalNameSpace but you lose the audit trail when you do this
```{r fm1}
(fm1 <- lm(log(BrainWt) ~ 1 + log(BodyWt), brains))
```
## The `summary` is more than the object
A peculiarity of the terminology in `R` model fitting is that `print`ing a fitted model provided minimal information but applying `summary` to it provides much more.
```{r summaryfm1}
summary(fm1)
```
## Suppressing "significance stars"
- One of the worst decisions in `R` development was adding "significance stars" to the `summary` output for many models.
- They are optional. Unfortunately, the default is to have them.
```{r signifstars,cache=FALSE}
options(show.signif.stars=FALSE)
summary(fm1)
```
## The model matrix
- the terminology used in `R` is that the formula and the data together generate a `model matrix`
- sometimes the term `design matrix` is used. `model matrix` is more accurate
```{r mm}
head(mm1 <- model.matrix(fm1))
str(mm1)
```
## The Intercept term
- Many "statistical packages" assume that an intercept will be included in a model.
- I prefer to indicate the intercept explicitly by writing the formula as `1 + BodyWt`
- To suppress the intercept you must write the formula as `0 + BodyWt`
- The intercept term generates the initial column of 1's in the model matrix
```{r ones}
all(mm1[,1] == 1)
```
## Extracting the coefficient summary
- Sometimes you just want the coefficient table from the summary, which is available as
```{r coef}
(ctbl <- coef(summary(fm1)))
```
- There is a special method for printing tables with p-values in them. It is used in `print.summary.lm` but not in the more terse form used above.
- Because the special printing method shows `< 2e-16` for small p-values, some people believe that probabilities lower than that are not evaluated. This is not the case.
## Plots of an `lm` object
- There are several "pre-packaged" plots for `lm` objects. A total of 6 are available.
```{r plot1,fig.align='center'}
plot(fm1,which=1)
```
## plot 2
```{r plot2,fig.align='center',echo=FALSE}
plot(fm1,which=2)
```
## plot 3
```{r plot3,fig.align='center',echo=FALSE}
plot(fm1,which=3)
```
## plot 4
```{r plot4,fig.align='center',echo=FALSE}
plot(fm1,which=4)
```
## plot 5
```{r plot5,fig.align='center',echo=FALSE}
plot(fm1,which=5)
```
## plot 6
```{r plot6,fig.align='center',echo=FALSE}
plot(fm1,which=6)
```
## Methods for `lm` object
Most model-fitting functions assign a "class" tag to the return value
```{r classlm}
class(fm1)
```
allowing for "generic" functions to have special methods used with such objects.
The `methods` function provides a list of methods for a given generic or for a given class.
```{r methods}
methods(class=class(fm1))
```
## Using extractor functions
- Many of the methods for `lm` objects are "extractor" functions. That it, they extract some information from the fitted model in a way that does not depend on the internal structure.
- Try to use such functions whenever possible so as to "future proof" your code.
- When writing code for your own model fitting, provide extractors when possible
```{r coeff}
coef(fm1) # estimates of the coefficients
deviance(fm1) # residual sum of squares
logLik(fm1) # log-likelihood
```
## More extractors
```{r extractors}
vcov(fm1) # estimated variance-covariance of coefficient estimators
anova(fm1) # analysis of variance table (trivial in this case)
```
## Even more extractors
```{r confint}
confint(fm1) # confidence intervals on the coefficients
df.residual(fm1) # degrees of freedom for residuals
nobs(fm1) # number of observations used to fit the model
formula(fm1) # formula for the model matrix
```
## Fitted values
```{r residuals}
fitted(fm1)
```
## Residuals (raw)
- weighted residuals are available as `residuals(fm1,type = "pearson")` or as `weighted.residuals(fm1)`
```{r rawresiduals}
unname(residuals(fm1)) # suppress printing of names
```
## Various kinds of cooked residuals
- other generics provide the "studentized" residuals, `rstudent`, and the standardized residuals
```{r studentized}
unname(rstudent(fm1))
```
# Numerical methods, including simulation
## Simulating responses from a fitted model
- For generality the `simulate` generic returns a `data.frame` of simulated responses
```{r strsimulate}
str(simulate(fm1,nsim=10L))
```
## Conversion to a matrix
```{r datamatrix}
str(yy <- data.matrix(simulate(fm1,nsim=10000L)))
```
- It is very slow to fit the simulated responses in a loop calling `lm`
- The response in a call to `lm` can be a matrix.
```{r multipleRHS}
system.time(fms <- lm(yy ~ 1 + log(BodyWt),brains))
str(cc <- coef(fms))
```
## Coefficients from simulated data
```{r plot,echo=FALSE,fig.align='center'}
ccf <- unname(as.data.frame(t(cc)))
names(ccf) <- c("Intercept","logBodyWt")
qplot(Intercept,logBodyWt,data=ccf,geom="hex")
```
## `qr` and `effects`
```{r qr}
str(QR <- qr(fm1))
```
- Objects of class `qr` represent an orthogonal-triangular or QR decomposition of a matrix
```{r effects}
str(ee <- effects(fm1))
```
## Properties of `Q` and `R`
- The orthogonal triangular decomposition of an $n\times p$ matrix $\bf X$ is
$$\bf X = QR = Q_1R_1$$
where $\bf Q$ is $n\times n$ and orthogonal (i.e. $\bf Q'Q=QQ'=I_{\rm n}$) and the $n\times p$ matrix $\bf R$ is zero below the main diagonal. In the condensed form $\bf Q_{\rm 1}$ is $n\times p$ and $\bf R_{\rm 1}$ is $p\times p$ and upper triangular.
- Multiplication of a vector, $\bf y$, by $\bf Q$ or by $\bf Q'$ preserves its length.
$$\bf\|Q'y\|^2=y'QQ'y=y'y=\|y\|^2$$
so the residual sum of squares can be written
$$\|\bf y-X\beta\|^2=\|Q'(y-X\beta)\|^2=\|Q'y-Q'QR\beta\|^2=\|c_{\rm 1}-R_{\rm 1}\beta\|^2+\|c_2\|^2$$
where $\bf c_{\rm 1}$ is the first $p$ elements of $\bf Q'y$ and $\bf c_{\rm 2}$ is the last $n-p$.
- If $\bf X$ has full column rank, which amounts to saying that the diagonal elements of $\bf R_{\rm 1}$ are non-zero, then the least squares estimates, $\widehat{\beta}=\arg\min_{\beta}\|y-X\beta\|^2$, is the solution to
$$\bf R_{\rm 1}\widehat{\beta}=c_{\rm 1}$$
## Methods for `qr` objects
- Functions for working with objects that later got the tag `qr` were introduced in S before S3 methods
- The names of these functions do not follow the usual naming conventions. They begin with `qr.`. The methods are
- `qr.Q` - return $\bf Q_{\rm 1}$ (the default) or $\bf Q$
- `qr.R` - return $\bf R_{\rm 1}$ (the default) or $\bf R$
- `qr.X` - reconstruct $\bf X$ from the decomposition
- `qr.qty` - create $\bf Q'y$ from $\bf y$ without evaluating $\bf Q$ explicitly
- `qr.qy` - as above but for $\bf Qy$
- `qr.resid` - evaluate $\bf Q_{\rm 2}Q_{\rm 2}'y$ without evaluating $\bf Q$
- `qr.fitted` - evaluate $\bf\widehat{y} = Q_{\rm 1}Q_{\rm 1}'y$
- `qr.coef` - evaluate $\widehat{\beta}$ from $\bf y$
## Examples of using `qr.*` methods
```{r qandr}
(R <- qr.R(QR))
str(Q1 <- qr.Q(QR))
zapsmall(crossprod(Q1)) # equivalent to Q₁'Q₁
```
## More methods
```{r moremethods}
str(y <- model.response(model.frame(fm1)))
all.equal(unclass(ee),qr.qty(QR,y),check.attributes=FALSE)
(rss <- as.vector(crossprod(ee[-(1:2)]))) # squared length of c₂
all.equal(deviance(fm1), rss)
```
## Even more methods
```{r evenmore}
(betahat <- backsolve(R,ee[1:2]))
c(all.equal(betahat,unname(qr.coef(QR,y))),all.equal(qr.coef(QR,y),coef(fm1)),
all.equal(qr.fitted(QR,y),fitted(fm1)),all.equal(qr.resid(QR,y),resid(fm1)))
anova(fm1)[["Sum Sq"]]
ee[1:2]^2 # "sequential" sums of squares from the effects vector
```
## Numerical linear algebra is not what you learned in an intro course
- In an intro course you use matrix inverses all the time. In numerical linear algebra you almost never evaluate an inverse.
- In an intro course the rank of a matrix is well-defined. In floating point computation it isn't.
- Numerical linear algebra is all about decompositions. The least used is the eigenvalue-eigenvector. In an intro course the only decomposition you discuss is the eigenvalue-eigenvector decomposition.
- In an intro course you evaluate eigenvalues as the roots of the characteristic polynomial. In numerical work you solve for the roots of a polynomial by determining the eigenvalues of a companion matrix.
- One of the few instances of evaluating the inverse of a matrix is for the covariance of $\widehat{\beta}$ but it is $\bf R_{\rm 1}^{\rm -1}$, not $(\bf X'X)^{\rm -1}$ that is evaluated
## Evaluating the covariance matrix
```{r vcov}
chol2inv(R)*deviance(fm1)/df.residual(fm1)
vcov(fm1)
QR$rank # the computational rank of X
```
# Linear models and categorical covariates
## "one-way" analysis of variance
```{r insectsprays}
str(InsectSprays)
```
```{r boxplot1,echo=FALSE,fig.height=2,fig.align='center'}
qplot(reorder(spray,count),count,data=InsectSprays,geom="boxplot")+xlab("Spray")+ylab("Insect count")+coord_flip()
```
```{r boxplot2,echo=FALSE,fig.height=2,fig.align='center'}
qplot(reorder(spray,count),count,data=InsectSprays,geom="boxplot")+xlab("Spray")+ylab("Insect count (square root scale)")+scale_y_sqrt()+coord_flip()
```
## Fitting an `aov` model
```{r fm2}
summary(fm2 <- aov(sqrt(count) ~ 1 + spray,InsectSprays))
class(fm2)
e2 <- effects(fm2)
attr(e2,"assign")
c(e2[1]^2,crossprod(e2[2:6]),crossprod(e2[-(1:6)])) # sequential sum-of-squares decomposition
```
## Using `xtable`
```{r summaryfm2xtable,results='asis',echo=FALSE}
options(xtable.type="html")
xtable(fm2)
oldclass <- class(fm2)
class(fm2) <- "lm"
xtable:::xtable(fm2)
class(fm2) <- oldclass
```
## Model matrix
```{r modelmatrixfm2}
model.frame(fm2)$spray
head(X <- model.matrix(fm2),15)
```
## Model matrix (cont'd)
```{r modelmatrix2fm2}
tail(X,20)
```
## Contrasts
- a categorical covariate with $k$ levels is converted to $k-1$ columns of "contrasts" in the model matrix.
```{r contrasts}
contrasts(model.frame(fm2)$spray)
attr(model.frame(fm2),"terms")
```
## Crossproduct form
- the crossproduct matrix, $\bf X'X$ has the form
```{r xtx1}
crossprod(model.matrix(fm2))
```
## An `aov` model is a `lm` model (class inheritance)
```{r summaryfm2}
summary.lm(fm2)
```
## Interpretation of coefficients
- The default contrasts for a factor are `contr.treatment` and for an ordered factor `contr.poly`
- In the `treatment` contrasts the indicators of the factor levels are generated and the first column dropped
```{r contrtreatment}
contr.treatment(2)
contr.treatment(3)
```