# Parameter Estimation
```{r, echo=FALSE}
# Detach any packages that happen to already be loaded. In general this is unnecessary
# but is important for the creation of the book to not have package namespaces
# fighting unexpectedly.
pkgs = names(sessionInfo()$otherPkgs)
if( length(pkgs) > 0 ){
pkgs = paste('package:', pkgs, sep = "")
for( i in 1:length(pkgs)){
detach(pkgs[i], character.only = TRUE, force=TRUE)
}
}
# Set my default chunk options
knitr::opts_chunk$set( fig.height=3 )
```
```{r, message=FALSE, warning=FALSE}
library(tidyverse) # dplyr, tidyr, ggplot2
```
### Learning Outcomes{-}
* Write simple regression or one-way ANOVA models as $\boldsymbol{y} \sim N(\boldsymbol{X\beta} , \sigma^2 \textbf{I}_n)$
* Utilizing R and a sample of data, calculate the estimators
$$\hat{\boldsymbol{\beta}} = (\textbf{X}^T\textbf{X})^{-1} \textbf{X}^T \textbf{y}$$
$$\hat{\textbf{y}} = \textbf{X}\hat{\boldsymbol{\beta}}$$
and
$$\hat{\sigma}^2 = \textrm{MSE} = \frac{1}{n-p}\;\sum_{i=1}^n (y_i - \hat{y}_i)^2$$
* Calculate the uncertainty of $\hat{\beta}_{j}$ as
$$\textrm{StdErr}\left(\hat{\beta}_{j}\right)=\sqrt{\hat{\sigma}^{2}\left[\left(\boldsymbol{X}^{T}\boldsymbol{X}\right)^{-1}\right]_{jj}}$$
## Introduction
We have previously looked at ANOVA and regression models and, in many
ways, they felt very similar. In this chapter we will introduce the
theory that allows us to understand both models as a particular flavor
of a larger class of models known as _linear models_.
First we clarify what a linear model is. A linear model is a model
where the data (which we will denote using roman letters as $\boldsymbol{x}$
and $\boldsymbol{y}$) and parameters of interest (which we denote
using Greek letters such as $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$)
interact only via addition and multiplication. The following are linear
models:
Model | Formula
------------------ | ---------------------------------------------
ANOVA | $y_{ij}=\mu+\tau_{i}+\epsilon_{ij}$
Simple Regression | $y_{i}=\beta_{0}+\beta_{1}x_{i}+\epsilon_{i}$
Quadratic Term | $y_{i}=\beta_{0}+\beta_{1}x_{i}+\beta_{2}x_{i}^{2}+\epsilon_{i}$
General Regression | $y_{i}=\beta_{0}+\beta_{1}x_{i,1}+\beta_{2}x_{i,2}+\dots+\beta_{p}x_{i,p}+\epsilon_{i}$
Notice that in the Quadratic model the square is not a parameter; we can treat $x_{i}^{2}$ as just another column
of data. This leads to the General Regression model, where we simply add more slopes for additional covariates.
The $p$th covariate, denoted $\boldsymbol{x}_{\cdot,p}$, might itself be some
transformation (such as $x^{2}$ or $\log x$) of another column of
data. The critical point is that the transformation of the data $\boldsymbol{x}$ does not depend
on a parameter. Thus the following is _not_ a linear model
$$
y_{i}=\beta_{0}+\beta_{1}x_{i}^{\alpha}+\epsilon_{i}
$$
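Returning to the Quadratic model above: as a quick sketch (using a small made-up vector of `x` values, not data from any example in this chapter), we can ask R for the design matrix of a quadratic model and see that $x^{2}$ really is just an additional column of data.
```{r}
# Sketch with made-up data: the quadratic term enters the design matrix as
# just another column, so the model is still linear in the parameters.
toy <- data.frame( x = 1:5 )
model.matrix( ~ x + I(x^2), data = toy )
```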
## Model Specifications
### Simple Regression
We would like to represent all linear models in a similar compact
matrix representation. This will allow us to make the transition between
simple and multiple regression (and ANCOVA) painlessly.
To begin, let's consider the simple regression model.
Typically we'll write the model as if we are specifying the $i^{th}$ element of the
data set
$$y_i = \underbrace{\beta_0 + \beta_1 x_i}_{\textrm{signal}} + \underbrace{\epsilon_i}_{\textrm{noise}} \;\;\; \textrm{ where } \epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)$$
Notice this is a data generation model with two parts: the relationship between the
explanatory variable and the response, which we refer to as the "signal", and the
noise term, which represents unknown or unmeasured effects acting on each data point.
We don't know what those effects are, but we do know that their sum results in a
vertical shift of the response away from the signal part of the model.
```{r, echo=FALSE, warning=FALSE, message=FALSE}
set.seed(1018)   # fixed seed so this illustrative figure is reproducible
n <- 20
data <- data.frame( x = runif(n)) %>%
mutate(Ey = .5 + .5*x,
y = Ey + rnorm(n, sd=.2)) %>%
arrange(x)
ggplot(data, aes(x=x)) +
geom_point(aes(y=y)) +
geom_line(aes(y=Ey), color='blue') +
annotate('text', x=.9, y=1, color='blue', label=latex2exp::TeX('$\\beta_0 + \\beta_1 x$')) +
annotate('line', x=rep(data[13,'x'],2), y=c(data[13,'y'], data[13,'Ey']), color='red') +
annotate('text', x=.68, y=1, label=latex2exp::TeX('$\\epsilon_i$'), color='red') +
labs(x='Explanatory Variable', y='Response Variable', title='Simple Regression Model') +
scale_x_continuous(labels=NULL) +
scale_y_continuous(labels=NULL)
```
This representation of the model implicitly assumes that our data set has $n$ observations
and we could write the model out for *all* the observations
using matrices and vectors that correspond to the data and the parameters.
$$\begin{aligned}
y_{1} & = \beta_{0}+\beta_{1}x_{1}+\epsilon_{1}\\
y_{2} & = \beta_{0}+\beta_{1}x_{2}+\epsilon_{2}\\
y_{3} & = \beta_{0}+\beta_{1}x_{3}+\epsilon_{3}\\
& \vdots\\
y_{n-1} & = \beta_{0}+\beta_{1}x_{n-1}+\epsilon_{n-1}\\
y_{n} & = \beta_{0}+\beta_{1}x_{n}+\epsilon_{n}
\end{aligned}$$
where, as usual, $\epsilon_{i}\stackrel{iid}{\sim}N\left(0,\sigma^{2}\right)$.
These equations can be written using matrices as
$$
\underset{\boldsymbol{y}}{\underbrace{\left[\begin{array}{c}
y_{1}\\
y_{2}\\
y_{3}\\
\vdots\\
y_{n-1}\\
y_{n}
\end{array}\right]}}=\underset{\boldsymbol{X}}{\underbrace{\left[\begin{array}{cc}
1 & x_{1}\\
1 & x_{2}\\
1 & x_{3}\\
\vdots & \vdots\\
1 & x_{n-1}\\
1 & x_{n}
\end{array}\right]}}\underset{\boldsymbol{\beta}}{\underbrace{\left[\begin{array}{c}
\beta_{0}\\
\beta_{1}
\end{array}\right]}}+\underset{\boldsymbol{\epsilon}}{\underbrace{\left[\begin{array}{c}
\epsilon_{1}\\
\epsilon_{2}\\
\epsilon_{3}\\
\vdots\\
\epsilon_{n-1}\\
\epsilon_{n}
\end{array}\right]}}
$$
and we compactly write the model as
$$
\boldsymbol{y}=\boldsymbol{X}\boldsymbol{\beta}+\boldsymbol{\epsilon} \;\;
\textrm{ where } \; \boldsymbol{\epsilon} \sim N(\boldsymbol{0}, \sigma^2 \boldsymbol{I}_n)
$$
where $\boldsymbol{X}$ is referred to as the *design matrix*
and $\boldsymbol{\beta}$ is the vector of *location parameters* we are interested
in estimating. In general, the distribution of a vector of random variables must
describe how the elements vary in relation to one another, so we need to specify
a variance *matrix*. However, because $\epsilon_i$ is independent of $\epsilon_j$
for all $(i,j)$ pairs, all of the covariances are zero and the variance matrix
simplifies to $\sigma^2\,\boldsymbol{I}_n$.
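In R, the design matrix implied by a model formula can be inspected with `model.matrix()`. Here is a small sketch using a few made-up `x` values:
```{r}
# Sketch: the simple regression design matrix is a column of 1s (for the
# intercept) alongside the column of explanatory values.
toy <- data.frame( x = c(1.2, 3.4, 2.2, 5.0) )
model.matrix( ~ x, data = toy )
```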
### ANOVA model
The ANOVA model is also a linear model; all we must do is create
an appropriate design matrix. Given the design matrix $\boldsymbol{X}$,
all the calculations are identical to those in the simple regression case.
#### Cell means representation
Recall the cell means representation is
$$
y_{i,j}=\mu_{i}+\epsilon_{i,j}
$$
where $y_{i,j}$ is the $j$th observation within the $i$th group.
To clearly show the creation of the $\boldsymbol{X}$ matrix, let
the number of groups be $p=3$ and the number of observations per
group be $n_{i}=4$. We now expand the formula to show all the data.
$$\begin{aligned}
y_{1,1} &= \mu_{1}+\epsilon_{1,1}\\
y_{1,2} &= \mu_{1}+\epsilon_{1,2}\\
y_{1,3} &= \mu_{1}+\epsilon_{1,3}\\
y_{1,4} &= \mu_{1}+\epsilon_{1,4}\\
y_{2,1} &= \mu_{2}+\epsilon_{2,1}\\
y_{2,2} &= \mu_{2}+\epsilon_{2,2}\\
y_{2,3} &= \mu_{2}+\epsilon_{2,3}\\
y_{2,4} &= \mu_{2}+\epsilon_{2,4}\\
y_{3,1} &= \mu_{3}+\epsilon_{3,1}\\
y_{3,2} &= \mu_{3}+\epsilon_{3,2}\\
y_{3,3} &= \mu_{3}+\epsilon_{3,3}\\
y_{3,4} &= \mu_{3}+\epsilon_{3,4}
\end{aligned}$$
In an effort to write the model as $\boldsymbol{y}=\boldsymbol{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}$
we will write the above as
$$\begin{aligned}
y_{1,1} &= 1\mu_{1}+0\mu_{2}+0\mu_{3}+\epsilon_{1,1}\\
y_{1,2} &= 1\mu_{1}+0\mu_{2}+0\mu_{3}+\epsilon_{1,2}\\
y_{1,3} &= 1\mu_{1}+0\mu_{2}+0\mu_{3}+\epsilon_{1,3}\\
y_{1,4} &= 1\mu_{1}+0\mu_{2}+0\mu_{3}+\epsilon_{1,4}\\
y_{2,1} &= 0\mu_{1}+1\mu_{2}+0\mu_{3}+\epsilon_{2,1}\\
y_{2,2} &= 0\mu_{1}+1\mu_{2}+0\mu_{3}+\epsilon_{2,2}\\
y_{2,3} &= 0\mu_{1}+1\mu_{2}+0\mu_{3}+\epsilon_{2,3}\\
y_{2,4} &= 0\mu_{1}+1\mu_{2}+0\mu_{3}+\epsilon_{2,4}\\
y_{3,1} &= 0\mu_{1}+0\mu_{2}+1\mu_{3}+\epsilon_{3,1}\\
y_{3,2} &= 0\mu_{1}+0\mu_{2}+1\mu_{3}+\epsilon_{3,2}\\
y_{3,3} &= 0\mu_{1}+0\mu_{2}+1\mu_{3}+\epsilon_{3,3}\\
y_{3,4} &= 0\mu_{1}+0\mu_{2}+1\mu_{3}+\epsilon_{3,4}
\end{aligned}$$
and we will finally be able to write the matrix version
$$
\underset{\boldsymbol{y}}{\underbrace{\left[\begin{array}{c}
y_{1,1}\\
y_{1,2}\\
y_{1,3}\\
y_{1,4}\\
y_{2,1}\\
y_{2,2}\\
y_{2,3}\\
y_{2,4}\\
y_{3,1}\\
y_{3,2}\\
y_{3,3}\\
y_{3,4}
\end{array}\right]}}=\underset{\mathbf{X}}{\underbrace{\left[\begin{array}{ccc}
1 & 0 & 0\\
1 & 0 & 0\\
1 & 0 & 0\\
1 & 0 & 0\\
0 & 1 & 0\\
0 & 1 & 0\\
0 & 1 & 0\\
0 & 1 & 0\\
0 & 0 & 1\\
0 & 0 & 1\\
0 & 0 & 1\\
0 & 0 & 1
\end{array}\right]}}\underset{\boldsymbol{\beta}}{\underbrace{\left[\begin{array}{c}
\mu_{1}\\
\mu_{2}\\
\mu_{3}
\end{array}\right]}}+\underset{\boldsymbol{\epsilon}}{\underbrace{\left[\begin{array}{c}
\epsilon_{1,1}\\
\epsilon_{1,2}\\
\epsilon_{1,3}\\
\epsilon_{1,4}\\
\epsilon_{2,1}\\
\epsilon_{2,2}\\
\epsilon_{2,3}\\
\epsilon_{2,4}\\
\epsilon_{3,1}\\
\epsilon_{3,2}\\
\epsilon_{3,3}\\
\epsilon_{3,4}
\end{array}\right]}}
$$
Notice that each column of the $\boldsymbol{X}$ matrix is acting
as an indicator if the observation is an element of the appropriate
group. As such, these are often called *indicator variables*.
Another term for these, which I find less helpful, is *dummy variables*.
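We can ask R to build this cell-means design matrix for us. The sketch below uses a made-up grouping factor with three levels and four observations per group; removing the intercept (the `0` in the formula) produces exactly the indicator columns shown above.
```{r}
# Sketch: cell-means coding with a made-up factor. The `0 +` removes the
# intercept so each column is an indicator for one group.
toy <- data.frame( group = factor( rep(c('A','B','C'), each = 4) ) )
model.matrix( ~ 0 + group, data = toy )
```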
#### Offset from reference group
In this representation of the ANOVA model, $\mu$ is the mean of the reference
group (which will be group one) and the $\tau_{i}$ terms are offsets from that
reference group. The model is thus
$$
y_{i,j}=\mu+\tau_{i}+\epsilon_{i,j}
$$
where $\tau_{1}=0$. We can write this in matrix form as
$$
\underset{\boldsymbol{y}}{\underbrace{\left[\begin{array}{c}
y_{1,1}\\
y_{1,2}\\
y_{1,3}\\
y_{1,4}\\
y_{2,1}\\
y_{2,2}\\
y_{2,3}\\
y_{2,4}\\
y_{3,1}\\
y_{3,2}\\
y_{3,3}\\
y_{3,4}
\end{array}\right]}}=\underset{\mathbf{X}}{\underbrace{\left[\begin{array}{ccc}
1 & 0 & 0\\
1 & 0 & 0\\
1 & 0 & 0\\
1 & 0 & 0\\
1 & 1 & 0\\
1 & 1 & 0\\
1 & 1 & 0\\
1 & 1 & 0\\
1 & 0 & 1\\
1 & 0 & 1\\
1 & 0 & 1\\
1 & 0 & 1
\end{array}\right]}}\underset{\boldsymbol{\beta}}{\underbrace{\left[\begin{array}{c}
\mu\\
\tau_{2}\\
\tau_{3}
\end{array}\right]}}+\underset{\boldsymbol{\epsilon}}{\underbrace{\left[\begin{array}{c}
\epsilon_{1,1}\\
\epsilon_{1,2}\\
\epsilon_{1,3}\\
\epsilon_{1,4}\\
\epsilon_{2,1}\\
\epsilon_{2,2}\\
\epsilon_{2,3}\\
\epsilon_{2,4}\\
\epsilon_{3,1}\\
\epsilon_{3,2}\\
\epsilon_{3,3}\\
\epsilon_{3,4}
\end{array}\right]}}
$$
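This offset-from-reference coding is R's default for factors. Using the same made-up grouping factor as before, `model.matrix()` reproduces the design matrix above, with the first factor level acting as the reference group.
```{r}
# Sketch: R's default (treatment) coding keeps an intercept column and adds
# indicators for every group except the first, which acts as the reference.
toy <- data.frame( group = factor( rep(c('A','B','C'), each = 4) ) )
model.matrix( ~ group, data = toy )
```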
## Parameter Estimation
For both simple regression and ANOVA, we can write the model in matrix form as
$$\boldsymbol{y} = \boldsymbol{X\beta} + \boldsymbol{\epsilon} \;\;
\textrm{ where } \boldsymbol{\epsilon} \sim N(\boldsymbol{0},\sigma^2\boldsymbol{I}_n)$$
which could also be written as
$$\boldsymbol{y} \sim N(\boldsymbol{X\beta},\sigma^2\boldsymbol{I}_n)$$
and we could use the maximum-likelihood principle to find estimators for $\boldsymbol{\beta}$ and $\sigma^2$.
In this section, we will introduce the estimators $\hat{\boldsymbol{\beta}}$ and $\hat{\sigma}^2$.
### Estimation of Location Parameters
Our goal is to find the best estimate of $\boldsymbol{\beta}$
given the data. To justify the formula, consider the case where there
are no error terms (i.e. $\epsilon_{i}=0$ for all $i$). Thus we have
$$
\boldsymbol{y}=\boldsymbol{X}\boldsymbol{\beta}
$$
and our goal is to solve for $\boldsymbol{\beta}$. To do this, we
must use a matrix inverse, but since inverses only exist for square
matrices, we pre-multiply by $\boldsymbol{X}^{T}$ (notice that $\boldsymbol{X}^{T}\boldsymbol{X}$
is a square, symmetric $p\times p$ matrix; in the simple regression case it is $2\times2$).
$$
\boldsymbol{X}^{T}\boldsymbol{y}=\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{\beta}
$$
and then pre-multiply by $\left(\boldsymbol{X}^{T}\boldsymbol{X}\right)^{-1}$.
$$\begin{aligned}
\left(\boldsymbol{X}^{T}\boldsymbol{X}\right)^{-1}\boldsymbol{X}^{T}\boldsymbol{y}
&= \left(\boldsymbol{X}^{T}\boldsymbol{X}\right)^{-1}\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{\beta} \\
\left(\boldsymbol{X}^{T}\boldsymbol{X}\right)^{-1}\boldsymbol{X}^{T}\boldsymbol{y} &= \boldsymbol{\beta}
\end{aligned}$$
This exercise suggests that $\left(\boldsymbol{X}^{T}\boldsymbol{X}\right)^{-1}\boldsymbol{X}^{T}\boldsymbol{y}$
is a good place to start when looking for the maximum-likelihood estimator
for $\boldsymbol{\beta}$.
Happily it turns out that this quantity is in fact the maximum-likelihood estimator for the data generation model
$$\boldsymbol{y} \sim N(\boldsymbol{X\beta}, \sigma^2 \boldsymbol{I}_n)$$
(and equivalently minimizes the sum-of-squared
error). In this course we won't prove these two facts, but we will use this as our estimate of $\boldsymbol{\beta}$.
$$\hat{\boldsymbol{\beta}}=\left(\boldsymbol{X}^{T}\boldsymbol{X}\right)^{-1}\boldsymbol{X}^{T}\boldsymbol{y}$$
### Estimation of Variance Parameter
Recall our simple regression model is
$$y_{i}=\beta_{0}+\beta_{1}x_{i}+\epsilon_{i}$$
where $\epsilon_{i}\stackrel{iid}{\sim}N\left(0,\sigma^{2}\right)$.
Using our estimates $\hat{\boldsymbol{\beta}}$ we can obtain predicted values for the regression line at any x-value. In particular we can find the predicted value for each $x_i$ value in our dataset.
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$
Using matrix notation, I would write $\hat{\boldsymbol{y}}=\boldsymbol{X}\hat{\boldsymbol{\beta}}$.
As usual we will find estimates of the noise terms (which we will call residuals or errors) via
$$\begin{aligned} \hat{\epsilon}_{i}
&= y_{i}-\hat{y}_{i}\\
&= y_{i}-\left(\hat{\beta}_{0}+\hat{\beta}_{1}x_{i}\right)
\end{aligned}$$
Writing $\hat{\boldsymbol{y}}$ in matrix terms we have
$$\begin{aligned}
\hat{\boldsymbol{y}} &= \boldsymbol{X}\hat{\boldsymbol{\beta}}\\
&= \boldsymbol{X}\left(\boldsymbol{X}^{T}\boldsymbol{X}\right)^{-1}\boldsymbol{X}^{T}\boldsymbol{y}\\
&= \boldsymbol{H}\boldsymbol{y}
\end{aligned}$$
where $\boldsymbol{H}=\boldsymbol{X}\left(\boldsymbol{X}^{T}\boldsymbol{X}\right)^{-1}\boldsymbol{X}^{T}$
is often called the *hat-matrix* because it takes $y$ to $\hat{y}$
and has many interesting theoretical properties.\footnote{Mathematically, $\boldsymbol{H}$ is the projection matrix that takes
a vector in $n$-dimensional space and projects it onto the $p$-dimensional
subspace spanned by the columns of $\boldsymbol{X}$. Projection matrices
have many useful properties and much of the theory of linear models
utilizes $\boldsymbol{H}$. }
We can now estimate the error terms via
$$\begin{aligned}
\hat{\boldsymbol{\epsilon}} &= \boldsymbol{y}-\hat{\boldsymbol{y}}\\
&= \boldsymbol{y}-\boldsymbol{H}\boldsymbol{y}\\
&= \left(\boldsymbol{I}_{n}-\boldsymbol{H}\right)\boldsymbol{y}
\end{aligned}$$
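As a small numerical check (using simulated data invented just for this sketch), we can build $\boldsymbol{H}$, verify that it is idempotent, and confirm that $\left(\boldsymbol{I}_{n}-\boldsymbol{H}\right)\boldsymbol{y}$ matches the residuals reported by `lm()`.
```{r}
# Sketch with simulated data: the hat matrix is idempotent (HH = H) and
# (I - H) y reproduces the residuals from lm().
set.seed(42)
n <- 10
x <- runif(n)
y <- 1 + 2*x + rnorm(n, sd = 0.3)
X <- cbind( 1, x )
H <- X %*% solve( t(X) %*% X ) %*% t(X)
all.equal( H %*% H, H )                                     # idempotent
all.equal( as.vector( (diag(n) - H) %*% y ),
           unname( resid( lm(y ~ x) ) ) )                   # same residuals
```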
As usual we estimate $\sigma^{2}$ using the mean-squared error, whose general formula is
$$\begin{aligned} \hat{\sigma}^{2} &= \textrm{MSE} \\ \\
&= \frac{1}{n-p}\;\sum_{i=1}^{n}\hat{\epsilon}_{i}^{2}\\ \\
&= \frac{1}{n-p}\;\hat{\boldsymbol{\epsilon}}^{T}\hat{\boldsymbol{\epsilon}}
\end{aligned}$$
where $\boldsymbol{\beta}$ has $p$ elements, and thus we have $n-p$ degrees of freedom.
## Standard Errors
Because our $\hat{\boldsymbol{\beta}}$ estimates vary from sample to sample, we need to quantify
how much they vary. This will eventually allow us to create confidence intervals for,
and perform hypothesis tests about, the elements of the $\boldsymbol{\beta}$ parameter vector.
### Expectation and variance of a random vector
Just as we needed to derive the expected value and variance of $\bar{x}$
in the previous semester, we must now do the same for $\hat{\boldsymbol{\beta}}$.
But to do this, we need some properties of expectations and variances.
In the following, let $\boldsymbol{A}_{n\times p}$ and $\boldsymbol{b}_{p\times1}$
be constants and $\boldsymbol{\epsilon}_{n\times1}$ be a random vector.
Expectations are very similar to the scalar case where
$$E\left[\boldsymbol{\epsilon}\right]=\left[\begin{array}{c}
E\left[\epsilon_{1}\right]\\
E\left[\epsilon_{2}\right]\\
\vdots\\
E\left[\epsilon_{n}\right]
\end{array}\right]$$
and any constants are pulled through the expectation
$$E\left[\boldsymbol{A}^{T}\boldsymbol{\epsilon}+\boldsymbol{b}\right]=\boldsymbol{A}^{T}\,E\left[\boldsymbol{\epsilon}\right]+\boldsymbol{b}$$
Variances are a little different. The variance of the vector $\boldsymbol{\epsilon}$
is
$$
Var\left(\boldsymbol{\epsilon}\right)=\left[\begin{array}{cccc}
Var\left(\epsilon_{1}\right) & Cov\left(\epsilon_{1},\epsilon_{2}\right) & \dots & Cov\left(\epsilon_{1},\epsilon_{n}\right)\\
Cov\left(\epsilon_{2},\epsilon_{1}\right) & Var\left(\epsilon_{2}\right) & \dots & Cov\left(\epsilon_{2},\epsilon_{n}\right)\\
\vdots & \vdots & \ddots & \vdots\\
Cov\left(\epsilon_{n},\epsilon_{1}\right) & Cov\left(\epsilon_{n},\epsilon_{2}\right) & \dots & Var\left(\epsilon_{n}\right)
\end{array}\right]
$$
and additive constants are ignored, but multiplicative constants
are pulled out as follows:
$$
Var\left(\boldsymbol{A}^{T}\boldsymbol{\epsilon}+\boldsymbol{b}\right)=Var\left(\boldsymbol{A}^{T}\boldsymbol{\epsilon}\right)=\boldsymbol{A}^{T}\,Var\left(\boldsymbol{\epsilon}\right)\,\boldsymbol{A}
$$
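These rules can be verified empirically. The sketch below (all values invented for illustration) simulates many independent error vectors with $Var\left(\boldsymbol{\epsilon}\right)=\sigma^{2}\boldsymbol{I}$, applies a fixed matrix $\boldsymbol{A}^{T}$, and compares the empirical covariance of $\boldsymbol{A}^{T}\boldsymbol{\epsilon}$ to $\boldsymbol{A}^{T}\,Var\left(\boldsymbol{\epsilon}\right)\,\boldsymbol{A}$.
```{r}
# Sketch: empirical check that Var(A^T eps) = A^T Var(eps) A when
# Var(eps) = sigma^2 I. All numbers are made up for illustration.
set.seed(76)
sigma <- 2
A <- matrix( c(1, 0, 2,
               0, 1, 1), nrow = 3 )                  # a 3 x 2 constant matrix
eps <- replicate( 10000, rnorm(3, sd = sigma) )      # each column is one error vector
empirical   <- var( t( t(A) %*% eps ) )              # covariance of A^T eps across draws
theoretical <- sigma^2 * t(A) %*% A                  # A^T (sigma^2 I) A
round( empirical,   2 )
round( theoretical, 2 )
```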
### Variance of Location Parameters
We next derive the sampling variance of our estimator $\hat{\boldsymbol{\beta}}$
by first noting that $\boldsymbol{X}$ and $\boldsymbol{\beta}$ are
constants and therefore
$$\begin{aligned} Var\left(\boldsymbol{y}\right)
&= Var\left(\boldsymbol{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}\right)\\
&= Var\left(\boldsymbol{\epsilon}\right)\\
&= \sigma^{2}\boldsymbol{I}_{n}
\end{aligned}$$
because the error terms are independent and therefore $Cov\left(\epsilon_{i},\epsilon_{j}\right)=0$
when $i\ne j$ and $Var\left(\epsilon_{i}\right)=\sigma^{2}$. Recalling
that multiplicative constants come out of the variance operator as the constant _squared_ (here, in matrix form), we have
$$\begin{aligned} Var\left(\hat{\boldsymbol{\beta}}\right)
&= Var\left(\left(\boldsymbol{X}^{T}\boldsymbol{X}\right)^{-1}\boldsymbol{X}^{T}\boldsymbol{y}\right)\\
&= \left(\boldsymbol{X}^{T}\boldsymbol{X}\right)^{-1}\boldsymbol{X}^{T}\,Var\left(\boldsymbol{y}\right)\,\boldsymbol{X}\left(\boldsymbol{X}^{T}\boldsymbol{X}\right)^{-1}\\
&= \left(\boldsymbol{X}^{T}\boldsymbol{X}\right)^{-1}\boldsymbol{X}^{T}\,\sigma^{2}\boldsymbol{I}_{n}\,\boldsymbol{X}\left(\boldsymbol{X}^{T}\boldsymbol{X}\right)^{-1}\\
&= \sigma^{2}\left(\boldsymbol{X}^{T}\boldsymbol{X}\right)^{-1}\boldsymbol{X}^{T}\boldsymbol{X}\left(\boldsymbol{X}^{T}\boldsymbol{X}\right)^{-1}\\
&= \sigma^{2}\left(\boldsymbol{X}^{T}\boldsymbol{X}\right)^{-1}
\end{aligned}$$
Using this, the standard error (i.e. the estimated standard deviation)
of $\hat{\beta}_{j}$ (for any $j$ in $1,\dots,p$) is
$$StdErr\left(\hat{\beta}_{j}\right)=\sqrt{\hat{\sigma}^{2}\left[\left(\boldsymbol{X}^{T}\boldsymbol{X}\right)^{-1}\right]_{jj}}$$
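We can also check $Var\left(\hat{\boldsymbol{\beta}}\right)=\sigma^{2}\left(\boldsymbol{X}^{T}\boldsymbol{X}\right)^{-1}$ by simulation. The sketch below (with a design matrix and parameter values made up for illustration) repeatedly generates new responses from the same $\boldsymbol{X}$ and compares the empirical variance of the slope estimates to the theoretical value.
```{r}
# Sketch: simulate many data sets from y = X beta + eps and compare the
# empirical variance of the slope estimate to sigma^2 (X^T X)^{-1}.
set.seed(11)
n <- 30
sigma <- 1.5
X <- cbind( 1, runif(n, 0, 10) )
beta <- c(2, 0.5)
beta.hats <- replicate( 5000, {
  y <- X %*% beta + rnorm(n, sd = sigma)
  solve( t(X) %*% X ) %*% t(X) %*% y
})
var( beta.hats[2, ] )                            # empirical variance of the slope
( sigma^2 * solve( t(X) %*% X ) )[2, 2]          # theoretical variance of the slope
```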
### Summary of pertinent results
The statistic $\hat{\boldsymbol{\beta}}=\left(\boldsymbol{X}^{T}\boldsymbol{X}\right)^{-1}\boldsymbol{X}^{T}\boldsymbol{y}$ is the unbiased maximum-likelihood estimator of $\boldsymbol{\beta}$.
The Central Limit Theorem applies to each element of $\hat{\boldsymbol{\beta}}$. That is, as $n\to\infty$, the distribution of $\hat{\beta}_{j}\to N\left(\beta_{j},\left[\sigma^{2}\left(\boldsymbol{X}^{T}\boldsymbol{X}\right)^{-1}\right]_{jj}\right)$.
The fitted values and residuals can be calculated via
$$\begin{aligned} \hat{\boldsymbol{y}} &= \boldsymbol{X}\hat{\boldsymbol{\beta}}\\
\hat{\boldsymbol{\epsilon}} &= \boldsymbol{y}-\hat{\boldsymbol{y}}
\end{aligned}$$
The estimate of $\sigma^{2}$ is
$$\begin{aligned} \hat{\sigma}^{2} = \textrm{MSE}
= \frac{1}{n-p}\;\sum_{i=1}^{n}\hat{\epsilon}_{i}^{2}
= \frac{1}{n-p}\;\hat{\boldsymbol{\epsilon}}^{T}\hat{\boldsymbol{\epsilon}}
\end{aligned}$$
and is the typical squared distance between an observation and the model prediction.
The standard error (i.e. the estimated standard deviation) of $\hat{\beta}_{j}$ (for any $j$ in $1,\dots,p$) is
$$StdErr\left(\hat{\beta}_{j}\right)=\sqrt{\hat{\sigma}^{2}\left[\left(\boldsymbol{X}^{T}\boldsymbol{X}\right)^{-1}\right]_{jj}}$$
## R example
Here we will work an example in R, performing the calculations both "by hand" with matrices and with the `lm()` function, and confirm that the two approaches give the same information.
Consider the following data in a simple regression problem:
```{r, fig.height=3, warning=FALSE}
n <- 20
x <- seq(0,10, length=n)
y <- -3 + 2*x + rnorm(n, sd=2)
my.data <- data.frame(x=x, y=y)
ggplot(my.data) + geom_point(aes(x=x,y=y))
```
First we must create the design matrix $\boldsymbol{X}$. Recall
$$
\boldsymbol{X}=\left[\begin{array}{cc}
1 & x_{1}\\
1 & x_{2}\\
1 & x_{3}\\
\vdots & \vdots\\
1 & x_{n-1}\\
1 & x_{n}
\end{array}\right]
$$
and can be created in R via the following:
```{r}
X <- cbind( rep(1,n), x )   # a column of 1s for the intercept, then the covariate
```
Given $\boldsymbol{X}$ and $\boldsymbol{y}$ we can calculate
$$
\hat{\boldsymbol{\beta}}=\left(\boldsymbol{X}^{T}\boldsymbol{X}\right)^{-1}\boldsymbol{X}^{T}\boldsymbol{y}
$$
in R using the following code:
```{r}
XtXinv <- solve( t(X) %*% X ) # solve() is the inverse function
beta.hat <- XtXinv %*% t(X) %*% y %>% # Do the calculations
as.vector() # make sure the result is a vector
```
Our next step is to calculate the predicted values $\hat{\boldsymbol{y}}$
and the residuals $\hat{\boldsymbol{\epsilon}}$
$$\begin{aligned}
\hat{\boldsymbol{y}} &= \boldsymbol{X}\hat{\boldsymbol{\beta}}\\
\hat{\boldsymbol{\epsilon}} &= \boldsymbol{y}-\hat{\boldsymbol{y}}
\end{aligned}$$
```{r}
y.hat <- X %*% beta.hat
residuals <- y - y.hat
```
Now that we have the residuals, we can calculate $\hat{\sigma}^{2}$
and the standard errors of $\hat{\beta}_{j}$
$$\hat{\sigma}^{2}=\frac{1}{n-p}\,\sum_{i=1}^n\hat{\epsilon}_i^2$$
$$StdErr\left(\hat{\beta}_{j}\right)=\sqrt{\hat{\sigma}^{2}\left[\left(\boldsymbol{X}^{T}\boldsymbol{X}\right)^{-1}\right]_{jj}}$$
```{r}
sigma2.hat <- 1/(n-2) * sum( residuals^2 ) # p = 2
sigma.hat <- sqrt( sigma2.hat )
std.errs <- sqrt( sigma2.hat * diag(XtXinv) )
```
We now print out the important values and compare them to the summary
output given by the `lm()` function in R.
```{r}
cbind(Est=beta.hat, StdErr=std.errs)
sigma.hat
```
```{r}
model <- lm(y~x)
summary(model)
```
## Exercises {#Exercises_Estimation}
1. We will do a simple ANOVA analysis on example 8.2 from Ott \& Longnecker using the matrix representation of the model. A clinical psychologist wished to compare three methods for reducing hostility levels in university students, and used a certain test (HLT) to measure the degree of hostility. A high score on the test indicated great hostility. The psychologist used 24 students who obtained high and nearly equal scores in the experiment. Eight were selected at random from among the 24 problem cases and were treated with method 1. Seven of the remaining 16 students were selected at random and treated with method 2. The remaining nine students were treated with method 3. All treatments were continued for a one-semester period. Each student was given the HLT test at the end of the semester, with the results shown in the following table. (This analysis was done in section 8.3 of my STA 570 notes)
Method | Values
-------- | ------------
1 | 96, 79, 91, 85, 83, 91, 82, 87
2 | 77, 76, 74, 73, 78, 71, 80
3 | 66, 73, 69, 66, 77, 73, 71, 70, 74
We will be using the cell means model of ANOVA
$$ y_{ij}=\beta_{i}+\epsilon_{ij} $$
where $\beta_{i}$ is the mean of group $i$ and $\epsilon_{ij}\stackrel{iid}{\sim}N\left(0,\sigma^{2}\right)$.
a) Create one vector of all 24 hostility test scores `y`. (Use the `c()` function.)
b) Create a design matrix `X` with dummy variables for columns that code for which group an observation belongs to. Notice that `X` will be a $24$-row by $3$-column matrix. _Hint: An R function that might be handy is `cbind(a,b)`, which will bind two vectors or matrices together along the columns. There is also a corresponding `rbind()` function that binds vectors/matrices along rows. Furthermore, the repeat command `rep()` could be handy._
c) Find $\hat{\boldsymbol{\beta}}$ using the matrix formula given in class. _Hint: The R function `t(A)` computes the matrix transpose $\mathbf{A}^{T}$, `solve(A)` computes $\mathbf{A}^{-1}$, and the operator `%*%` does matrix multiplication (used as `A %*% B`)._
d) Examine the matrix $\left(\mathbf{X}^{T}\mathbf{X}\right)^{-1}\mathbf{X}^{T}$.
What do you notice about it? In particular, think about the result
when you right multiply by $\mathbf{y}$. How does this matrix calculate the appropriate group means and using the appropriate group sizes $n_i$?
2. We will calculate the y-intercept and slope estimates in a simple linear model using matrix notation. We will use a data set that gives the diameter at breast height (DBH) versus tree height for a randomly selected set of trees. In addition, for each tree, a ground measurement of crown closure (CC) was taken. Larger values of crown closure indicate more shading and are often associated with taller tree morphology (possibly). We will be interested in creating a regression model that predicts height based on DBH and CC. In the interest of reduced copying, we will only use 10 observations. *(Note: I made this data up and the DBH values might be unrealistic. Don't make fun of me.)*
```{r, echo=FALSE}
data <- data.frame(DBH=c(30.5, 31.5, 31.7, 32.3, 33.3, 35.0, 35.4, 35.6, 36.3, 37.8),
CC =c(.74, .69, .65, .72, .58, .50, .60, .70, .52, .60),
Height=c(58, 64, 65, 70, 68, 63, 78, 80, 74, 76))
pander::pander(t(data))
```
We are interested in fitting the regression model
$$y_{i}=\beta_{0}+\beta_{1}x_{i,1}+\beta_{2}x_{i,2}+\epsilon_{i}$$ where $\beta_{0}$ is the y-intercept and $\beta_{1}$ is the slope parameter associated with DBH and $\beta_{2}$ is the slope parameter associated with Crown Closure.
a) Create a vector of all 10 heights $\mathbf{y}$.
b) Create the design matrix $\mathbf{X}$.
c) Find $\hat{\boldsymbol{\beta}}$ using the matrix formula given in class.
d) Compare your results to the estimated coefficients you get using the `lm()` function. To add the second predictor to the model, your call to `lm()` should look something like `lm(Height ~ DBH + CC)`.