---
title: "Bayesian Linear Regression"
output: statsr:::statswithr_lab
references:
- id: Wooldridge2000
title: 'Introductory Econometrics: A Modern Approach'
author:
- family: Wooldridge
given: Jeffrey
URL: 'http://fmwww.bc.edu/ec-p/data/wooldridge/wage2.dta'
publisher: South-Western College Publishing
type: book
issued:
year: 2000
- id: Clyde2018
title: 'BAS: Bayesian Adaptive Sampling for Bayesian
Model Averaging Version 1.4.9'
author:
- family: Clyde
given: Merlise
URL: 'https://CRAN.R-project.org/package=BAS'
publisher: CRAN
type: book
issued:
year: 2018
- id: StatNews83
author:
- family: Yang
given: Jing
title: Interpreting Coefficients in Regression with Log-Transformed Variables
URL: 'https://www.cscu.cornell.edu/news/statnews/stnews83.pdf'
issued:
year: 2012
publisher: Cornell Statistical Consulting Unit, Cornell University
type: book
---
```{r setup, include=FALSE}
options(digits = 4)
```
<div class="instructions">
Complete all **Exercises**, and submit answers to **Questions** in the **Quiz: Week 4 of Lab** on Coursera.
</div>
## Modeling Wages
In the field of labor economics, the study of income and wages provides insight
about topics ranging from gender discrimination to the benefits of higher
education. In this lab, we will analyze cross-sectional wage data in order to
practice using Bayesian selection methods such as BIC and Bayesian Model
Averaging to construct parsimonious predictive models.
## Getting Started
In this lab we will explore the data using the `dplyr` package and visualize it
using the `ggplot2` package for data visualization. Both of these packages are
part of the tidyverse. We will review simple linear regression using the `lm`
function and how the output can be interpreted from a Bayesian perspective. We
will also use the `broom` package to turn regression outputs to tidy data frames
to help with diagnostic plotting. We will use the `stepAIC` function from the
`MASS` package for stepwise model selection using BIC, and later in the lab we
will use the `bas.lm` function from the `BAS` package to implement Bayesian
Model Averaging. Please make sure that your version of `BAS` is 1.4.9 or greater.
The data can be found in the companion package for this course, `statsr`. Some
learners may want to review material from the earlier courses in the
specialization that covers EDA and regression if they are unfamiliar with
`ggplot` basics or the `lm` function.
### Load packages
Let's load the packages that we will be using:
```{r load-packages, message=FALSE, warning = F}
library(MASS)
library(tidyverse)
library(statsr)
library(BAS)
library(broom)
options(width=100)
```
### The data
The data we will be using in this lab were gathered as a random sample of 935
respondents throughout the United States. This data set was released as part of
the series *Instructional Stata Datasets for Econometrics* by the Boston College
Department of Economics [@Wooldridge2000].
Let's start by loading the data:
```{r load-data, message=FALSE}
data(wage)
```
variable | description
---------------- | -----------
`wage` | weekly earnings (dollars)
`hours` | average hours worked per week
`iq` | IQ score
`kww` | knowledge of world work score
`educ` | number of years of education
`exper` | years of work experience
`tenure` | years with current employer
`age` | age in years
`married` | =1 if married
`black` | =1 if black
`south` | =1 if live in south
`urban` | =1 if live in a Standard Metropolitan Statistical Area
`sibs` | number of siblings
`brthord` | birth order
`meduc` | mother's education (years)
`feduc` | father's education (years)
`lwage` | natural log of `wage`
<div class="question">
Is this an observational study or an experiment? You may refer to
http://study.com/academy/lesson/experiments-vs-observational-studies.html for
the definitions of the two.
* Observational study
* Experiment
</div>
### Setting a seed
In this lab we will do some random generation, which means you should set a seed
at the top of your document. Setting a seed causes R to draw the same random
sample each time you knit your document. This makes sure your results don't
change from knit to knit, and it ensures the reproducibility of your work
(anyone who sets the same seed can reproduce your results).
You can set a seed like this:
```{r set-seed}
set.seed(18382)
```
The number above is completely arbitrary. If you need inspiration, you can use
your ID, birthday, or just a random string of numbers. The important thing is
that you use each seed only once. You only need to do this once in your
R Markdown document, but make sure it comes before sampling.
## Exploring the data
For a new data set, a good place to start is standard exploratory data analysis.
We will begin with the `wage` variable since it will be the response variable in
our models. We may use a histogram to visualize the distribution.
```{r Q1-wage-hist}
ggplot(data = wage, aes(x = wage)) +
geom_histogram(binwidth = 100)
```
For numeric summary statistics, the `summary` function provides additional
insights about the distribution of `wage`.
```{r Q1-wage-dist}
summary(wage$wage)
```
<div class="question">
Which of the following statements is **false** about the distribution of
weekly wages?
* The median of the distribution is 905.
* 25\% of respondents make at least 1160 dollars per week.
* 10 of the respondents make strictly less than 300 dollars per week.
* `wage` is right-skewed, meaning that more respondents have weekly wages below the mean weekly wage than above it.
</div>
## Simple linear regression
Since `wage` is our response variable, we would like to explore the relationship
between `wage` and other variables as predictors. One possible, simplistic,
explanation for the variation in wages that we see in the data is that smarter
people make more money. The plot below visualizes a scatterplot between weekly
wage and IQ score.
```{r scatter-wage-iq}
ggplot(data = wage, aes(x = iq, y = wage)) +
geom_point()
```
There appears to be a positive relationship between IQ score and wage. We can
quantify this by fitting a Bayesian simple linear regression
$$\text{wage}_i = \alpha + \beta \cdot \text{iq}_i + \epsilon_i$$
to the observed data using the reference prior. We can fit the model using the
`lm` function:
```{r wage-iq-model}
m_wage_iq <- lm(wage ~ iq, data = wage)
```
and extract the summary statistics for the posterior distribution using the
output from `lm` by applying the `tidy` function from the `broom` package.
```{r}
tidy(m_wage_iq)
```
The first column displays the posterior means of the linear model's y-intercept
and the regression coefficient of `iq`.
With this we can write down the posterior mean of the regression line
$$`r round(summary(m_wage_iq)$coefficients[1,1],3)` + `r round(summary(m_wage_iq)$coefficients[2,1],3)` \times \text{IQ}$$
and create a scatterplot with the posterior mean for the regression line laid
on top.
```{r reg-with-line}
ggplot(data = wage, aes(x = iq, y = wage)) +
geom_point() +
stat_smooth(method = "lm", se = FALSE)
```
Under the assumption that the errors $\epsilon_i$ are independent and normally
distributed with mean zero and an unknown variance $\sigma^2$, the posterior
distributions for the intercept and slope will have a Student-t distribution
under the reference prior with the posterior means and scales equal to the
ordinary least squares estimates and standard errors respectively. We can create
95% credible intervals for the two parameters using the `confint` function:
```{r}
confint(m_wage_iq)
```
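Under the reference prior, these posterior credible intervals coincide
numerically with the frequentist confidence intervals, so we can reconstruct
them by hand from the posterior means and scales. Here is a minimal sketch for
the slope, reusing the `m_wage_iq` fit from above:
```{r slope-ci-by-hand}
# Reconstruct the 95% credible interval for the slope by hand:
# posterior mean +/- t quantile * posterior scale (the OLS standard error)
post_summary <- tidy(m_wage_iq)
t_crit <- qt(0.975, df = m_wage_iq$df.residual)
post_summary$estimate[2] + c(-1, 1) * t_crit * post_summary$std.error[2]
```
The result should match the `iq` row of the `confint` output above.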
<div class="question">
Fit a new model that uses `educ` (education) to predict average weekly wages.
Using the estimates from the R output, write the equation of the posterior mean
of the regression line and obtain a 95% credible interval for the coefficients.
What does the slope tell us in the context of the relationship between education
and earnings?
* Each additional year of education increases weekly wages by $60.21.
* Each additional year of education increases weekly wages by $146.95.
* For each additional year of education, there is a 95% chance that average weekly wages will possibly decrease by $5.56 or increase by $299.47.
* For each additional year of education, there is a 95% chance that average weekly wages will increase by $49.04 to $71.39.
</div>
```{r Q-SLR}
# Type your code for Question 3 here.
```
## Model diagnostics
The Bayesian model specification assumes that the errors are normally
distributed with a constant variance and that the mean expected weekly wages is
linear in IQ. We can check these assumptions by examining the distribution of
the residuals for the model.
In order to do so we will use predicted values, residuals, and standardized
residuals of the model we fit earlier. The `augment` function in the `broom`
package is going to come in handy here as it takes in a model object (the output
of an `lm`) and returns a data frame with columns corresponding to variables
in the model as well as predicted values (`.fitted`), residuals (`.resid`), and
standardized residuals (`.std.resid`), along with a few others.
```{r augment}
m_wage_iq_aug <- augment(m_wage_iq)
```
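To see the columns described above, we can take a quick peek at the augmented
data frame (a small sketch; the `.`-prefixed columns are the ones added by
`augment`):
```{r augment-peek}
# The original variables plus the fitted values, residuals, and
# standardized residuals computed by broom::augment()
m_wage_iq_aug %>%
  select(wage, iq, .fitted, .resid, .std.resid) %>%
  head()
```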
**Linearity and Constant Variance**: You already checked if the relationship
between weekly wages and IQ is linear using a scatterplot. We should also verify
this condition with a plot of the residuals vs. fitted (predicted) values.
```{r residuals}
ggplot(data = m_wage_iq_aug, aes(x = .fitted, y = .resid)) +
geom_point(alpha = 0.6) +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(x = "Fitted values", y = "Residuals")
```
Also note that we're getting fancy with the code here. We set the `alpha` level
of our points to a value lower than 1 (`0.6` to be precise) in order to
plot the points with some transparency. This will allow us to more easily
identify where the points are more dense vs. more sparse. Then, we overlay a
horizontal dashed line at $y = 0$ (to help us check whether residuals are
distributed evenly around 0 at each fitted value), and we also adjust the axis
labels to be more informative.
**Normality**: To check this condition, we can look at a histogram of residuals
```{r}
ggplot(data = m_wage_iq_aug, aes(x = .resid)) +
geom_histogram(binwidth = 100) +
xlab("Residuals")
```
or a normal probability plot of the residuals
```{r}
ggplot(m_wage_iq_aug) +
geom_qq(aes(sample = .std.resid)) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
labs(x = "Theoretical quantiles", y = "Standardized residuals")
```
where we expect the points to be close to the dashed line, if the assumption of
normality holds. Note that the $y$-axis in the plot uses standardized residuals,
which are the residuals divided by their estimated standard deviations so that
they will have approximately a standard normal distribution (mean zero and
variance one) if the model holds.
<div class="question">
Which of the following statements about the residual plots are **false**?
* The residuals appear to be randomly distributed around 0.
* The residuals are strongly left skewed, hence the normal distribution of errors condition is not met.
* The variability of the residuals appears to increase as the fitted values increase, suggesting that the constant variance assumption does not hold.
* There are more individuals for whom the model underpredicts weekly wages than individuals for whom it overpredicts weekly wages.
</div>
```{r Q-resid}
# Type your code for Question 4 here.
```
<div class="exercise">
Refit the model by using `educ` (education) as the independent variable. Does
your answer to the previous question change?
</div>
```{r E2-educ-resid}
# Type your code for Exercise 1 here.
```
## Linear Regression After Transforming `wage`
One way to accommodate the right-skewness in the residuals is to (natural)
log-transform the dependent variable. Note that this is only possible if the
variable is strictly positive, since the log of a negative value is not defined
and $\ln(0) = -\infty$. Let us try to fit a linear model with log-wage (`lwage`)
as the dependent variable. The next two questions will be based on this log
transformed model.
```{r lwage-iq-model}
m_lwage_iq = lm(lwage ~ iq, data = wage)
```
<div class="exercise">
Examine the residuals of this model. Is the assumption of normally distributed
residuals reasonable?
</div>
```{r E-log-resid}
# Type your code for Exercise 2 here.
```
<div class="exercise">
Excluding `wage` and `lwage`, select two other variables that you think might be
good predictors of `lwage`. Visualize their relationships with `lwage` and check
assumptions using appropriate plots.
</div>
```{r E1-two-vars-eda}
# Type your code for Exercise 3 here.
```
## Outliers
We declared observations to be outliers with respect to the population model if
their deviation or error $\epsilon_i$ was more than $k=3$ standard deviations
above or below 0. Let's use the `Bayes.outlier` function from `BAS` to calculate
these probabilities for the model `m_lwage_iq` and plot them against the
case number.
We start by calculating the probabilities,
```{r wage-iq-outliers-calculate}
outliers <- Bayes.outlier(m_lwage_iq, k = 3)
```
and then store the results in a data frame and plot them.
```{r wage-iq-outliers-plot}
outliers_df <- data.frame(probability = outliers$prob.outlier,
case = 1:length(outliers$prob.outlier))
ggplot(outliers_df, aes(ymax = probability, x = case)) +
geom_linerange(ymin = 0) +
labs(y = "Probability")
```
To identify which cases have probabilities greater than 0.50 of being an outlier,
we can use the `filter` function to return which cases have `probability > 0.50`.
```{r}
outliers_df %>%
filter(probability > 0.50)
```
<div class="question">
Using the definition of outlier above, which statement is **false**?
* Case 434 has a probability close to 1 of being an outlier under the normal error model for regressing `lwage` on `iq`
* Case 514 has a probability close to 1 of being an outlier under the normal error model for regressing `lwage` on `iq`
* Case 616 has a probability close to 1 of being an outlier under the normal error model for regressing `lwage` on `iq`
* Case 784 has a probability close to 1 of being an outlier under the normal error model for regressing `lwage` on `iq`
</div>
```{r Q-outlier-prob}
# Type your code for Question 5 here.
```
While an error exceeding $3$ standard deviations seems extremely unlikely for
a single observation, for large sample sizes, there is often a rather high
probability that there will be at least one error $\epsilon_i$ that exceeds $3$
standard deviations above or below zero _a priori_. We can calculate this as
follows
```{r}
# prob of a case being an outlier:
# being below or above 3 standard deviations from 0
(prob_outlier <- pnorm(-3) + pnorm(3, lower.tail = FALSE))
# probability of a single case not being an outlier is therefore the complement
(prob_not_outlier <- 1 - prob_outlier)
# probability of no outliers in the sample of n assuming errors are independent a priori
n <- nrow(wage)
(prob_no_outliers <- prob_not_outlier^n)
# probability of at least one outlier in the sample is the complement of the
# probability of no outliers in the sample of n
1 - prob_no_outliers
```
With a sample size of `r nrow(wage)` and using $3$ standard deviations to
define outliers, the chance of having at least one outlier in the sample is
`r round((1 - prob_no_outliers)*100, 2)`% so the fact that we did discover some
outliers is not that surprising.
So instead of fixing the number of standard deviations to $k=3$, an alternative
is to fix the prior probability of there being no outliers in the sample,
$$P(\text{no outliers in sample}) = P(\text{observation is not an outlier})^n = 0.95$$
which we can solve for
$$ P(\text{observation is not an outlier}) = 0.95^{1/n} $$
and then solve for $k$ using the normal quantile function.
```{r}
n <- nrow(wage)
(prob_obs_not_outlier <- 0.95^(1/n))
(newk <- qnorm(0.5 + 0.5 * prob_obs_not_outlier))
```
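As a sanity check, this value of $k$ should give the intended prior probability
of no outliers (a quick sketch reusing `n` and `newk` from the chunk above):
```{r check-newk}
# With this k, the prior probability of no outliers among the n
# independent errors should come out to (approximately) 0.95
(pnorm(newk) - pnorm(-newk))^n
```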
The function `Bayes.outlier` can also calculate `k` internally if we specify
the prior probability of there being no outliers in the sample:
```{r no-outliers-in-sample}
outliers <- Bayes.outlier(m_lwage_iq, prior.prob = 0.95)
```
<div class="question">
Use the new value of $k$ to calculate the posterior probability of each observation being an
outlier. Which observation has a posterior probability of being an outlier
that exceeds the prior probability of being an outlier?
* Case 434
* Case 514
* Case 616
* Case 784
</div>
```{r Q6-new-outlier-prob}
# Type your code for Question 6 here.
```
## Multiple linear regression
It is evident that wage can be explained by many predictors, such as experience,
education, IQ, and so on. We can include all relevant covariates in a regression
model in an attempt to explain as much wage variation as possible. In addition,
sometimes outliers can be explained by changing the model by adding other
predictors; let's take a look at multiple regression before removing any cases.
```{r full-lwage-model}
m_lwage_full <- lm(lwage ~ . - wage, data = wage)
```
The use of `. - wage` in the `lm` function tells R to include all covariates in
the model except the `wage` variable from the data set.
However, running this full model has a cost: we will remove observations from
our data if some measurements in the variables (e.g. birth order, mother's
education, and father's education) are missing. By default, the `lm` function
does a complete-case analysis. So it removes any observations with a missing
(`NA`) value in one or more of the predictor variables.
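We can check how many complete cases the full model actually uses (a quick
sketch using base R's `complete.cases`):
```{r count-complete-cases}
# Rows with no missing values; lm() silently drops the rest
c(total = nrow(wage), complete = sum(complete.cases(wage)))
```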
Because of these missing values we must make an additional assumption in order
for our inferences to be valid. This exclusion of rows with missing values
requires that in the data there is no systematic reason for the values to be
missing. In other words, our data must be missing at random. For example, if
all first-born children did not report their birth order, the data would not
be missing at random. Without any additional information we will assume this
is reasonable and use the 663 complete observations (as opposed to the original
935) to fit the model. Both Bayesian and frequentist methods exist to handle
data sets with missing data, but they are beyond the scope of this course.
<div class="question">
From the model, all else being equal, who would you expect to make more: a
married black man or a single non-black man?
* The married black man
* The single non-black man
</div>
```{r Q7}
# Type your code for Question 7 here.
```
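One quick way to inspect the coefficients of the full model is the `tidy`
function we used earlier (`summary(m_lwage_full)` works just as well):
```{r full-model-tidy}
# Estimates, standard errors, and t statistics for all coefficients
# in the full model
tidy(m_lwage_full)
```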
As you can see from the quick summary of the full linear model above, many
coefficients of the independent variables are not statistically significant. In
previous labs within this specialization, you selected variables based on the
values of Adjusted $R^2$. This module introduced the Bayesian Information
Criterion (BIC), a criterion that can be used for model selection. BIC balances
model fit against complexity: it penalizes the number of parameters, with a
penalty that grows with the logarithm of the sample size. We can calculate the
BIC of the full linear model using the command below:
```{r}
BIC(m_lwage_full)
```
We can compare the BIC of the full model with that of a reduced model. Let us
try to remove birth order from the model. To ensure that the observations remain
the same, the data set can be specified as `na.omit(wage)`, which includes only
the observations with no missing values in any variables in the data set.
```{r}
m_lwage_nobrthord <- lm(lwage ~ . - wage - brthord, data = na.omit(wage))
BIC(m_lwage_nobrthord)
```
As you can see, removing birth order from the regression reduces BIC, which
we seek to minimize by model selection.
<div class="question">
Elimination of which variable from the full model yielded the lowest BIC?
* `brthord`
* `sibs`
* `feduc`
* `meduc`
</div>
```{r Q7-BIC}
# Type your code for Question 8 here.
```
<div class="exercise">
R has a function `stepAIC` from the `MASS` package that will work backwards
through the model space, removing variables until the AIC score can no longer
be lowered. It takes the full model as input, along with a penalty parameter $k$.
The default setting is $k=2$ for the AIC score. Find the best model according
to BIC (in which case `k = log(n)` where $n$ is the number of observations).
Remember to use `na.omit(wage)` as your data set. You may type `?stepAIC` in
the RStudio console to see usage details and examples for the `stepAIC` function.
</div>
```{r E4-step}
# Type your code for Exercise 4 here.
```
## Bayesian model averaging
Often, several models are equally plausible and choosing only one ignores the
inherent uncertainty involved in choosing the variables to include in the model.
A way to get around this problem is to implement Bayesian model averaging (BMA),
in which multiple models are averaged to obtain posteriors of coefficients and
predictions from new data. Dr. Merlise Clyde is the primary author of the R
package `BAS`, which implements BMA [@Clyde2018]. We can use this for either
implementing BMA or selecting models.
We start by applying BMA to the wage data using all $15$ potential predictors.
```{r bas-wage}
# Exclude observations with missing values in the data set
wage_no_na <- na.omit(wage)
# Fit the model using Bayesian linear regression, `bas.lm` function in the `BAS` package
bma_lwage <- bas.lm(lwage ~ . -wage, data = wage_no_na,
prior = "BIC",
modelprior = uniform())
# Print out the marginal posterior inclusion probabilities for each variable
bma_lwage
# Top 5 most probable models
summary(bma_lwage)
```
Printing the model object and the summary command gives us both the posterior
model inclusion probability for each variable and the most probable models. For
example, the posterior probability that `hours` is included in the model is
0.855. Further, the most likely model, which has a posterior probability of 0.0455,
includes an intercept, hours worked, IQ, education, tenure, age, marital status,
urban living status, and mother's education. While a posterior probability of
0.0455 sounds small, it is much larger than the uniform prior probability
assigned to it, since there are $2^{15}$ possible models.
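For comparison, the uniform prior probability assigned to any single model is
easy to compute:
```{r uniform-prior-prob}
# Prior probability of any one of the 2^15 equally likely models
1 / 2^15
```
so the data increase the plausibility of the top model by a factor of roughly 1500.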
It is also possible to visualize the posterior distribution of the coefficients
under the model averaging approach. We graph the posterior distribution of the
coefficients of `iq` and `sibs` below. Note that the `subset` argument dictates
which variables are plotted.
```{r vis-BMA}
# Obtain the coefficients from the model `bma_lwage`
coef_lwage <- coefficients(bma_lwage)
# `iq` is the 3rd variable, while `sibs` is the 13th variable in the data set
plot(coef_lwage, subset = c(3,13), ask = FALSE)
```
We can also provide 95% credible intervals for these coefficients:
```{r conf-BMA}
confint(coef_lwage)
```
For Questions 9-10, we'll use a reduced data set which excludes wage, number
of siblings, birth order, and parental education.
```{r rem-vars-wage}
wage_red <- wage %>%
select(-wage, -sibs, -brthord, -meduc, -feduc)
```
Let's use BMA with the Zellner-Siow prior on the regression coefficients:
```{r}
bma_lwage_red <- bas.lm(lwage ~ ., data = wage_red,
prior = "ZS-null",
modelprior = uniform())
```
<div class="question">
Based on this reduced data set, according to Bayesian model averaging, which of
the following variables has the lowest marginal posterior inclusion probability?
* `kww`
* `black`
* `south`
* `age`
</div>
```{r Q9-reduced-BMA}
# Type your code for Question 9 here.
```
<div class="question">
**True** or **False**: The naive model with all variables included has posterior
probability greater than 0.5.
* True
* False
</div>
```{r Q10}
# Type your code for Question 10 here.
```
<div class="exercise">
Graph the posterior distribution of the coefficient of `age`, using the data
set `wage_red`.
</div>
```{r E5-graph}
# Type your code for Exercise 5 here.
```
Because we have log transformed wage, interpretation of coefficients from the
output is not as useful for understanding how the different predictors influence
wages. Instead we can interpret coefficients after transforming back to the
original units by exponentiation. The exponential of the posterior mean is not
the same as the mean of the exponential. However, the median of wage can be
found by exponentiating the median of log wage, since a transformation that
does not change the order of values leaves the middle value in the middle.
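A tiny numeric sketch illustrates this: for an odd-length sample, the middle
value maps straight through any increasing transformation.
```{r median-monotone}
# The median of exp(x) equals exp of the median of x,
# because exp() preserves the order of the values
x <- rnorm(5)
c(median(exp(x)), exp(median(x)))
```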
Let's look at the coefficients and 95% credible intervals
```{r credible_red}
coef(bma_lwage_red)
coef(bma_lwage_red) %>%
confint()
```
The exponential transformation applied to coefficients has a multiplicative
effect on the posterior median. What this means is that a one unit increase in
predictor $X_j$ leads to a $(\exp(\beta_j) -1) \times 100$ percent increase in
the median wage [@StatNews83]. For a factor or indicator variable like
`urban`, the posterior median for wages for urban areas (`urban == 1`) will
be $(\exp(0.1791) - 1) \times 100$ percent higher than in rural areas
(`urban == 0`). Similarly we can use the same transformation with the credible
interval. First, let's calculate the credible interval for the coefficient of
`urban` and exponentiate it.
```{r}
ci_urban <- coef(bma_lwage_red) %>%
confint(parm = "urban1") %>%
exp()
ci_urban
```
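As a quick check of the formula above, we can plug in the posterior median of
the `urban` coefficient quoted earlier ($0.1791$):
```{r urban-point-check}
# Percent increase in median wages implied by urban == 1
(exp(0.1791) - 1) * 100
```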
Returning to the interval, we can subtract 1 from its bounds and multiply them
by 100 to make them a bit more straightforward to interpret
```{r}
(ci_urban - 1) * 100
```
Based on these data, there is a 95% chance that median wages for urban areas are
`r round((exp(confint(coef(bma_lwage_red), parm='urban1'))[1] - 1)*100, 2)`
to
`r round((exp(confint(coef(bma_lwage_red), parm='urban1'))[2] - 1)*100,2)`
percent higher than in rural regions.
<div class="exercise">
Find a 95% credible interval for the coefficient of `educ` and provide an interpretation.
</div>
```{r E6-CI-lwage}
# Type your code for Exercise 6 here
```
## Prediction with BAS
Similar to last week's lab, we will be using the Bayesian predictive
distribution for making and interpreting predictions. Simulation is used in
`BAS` to construct predictive intervals with Bayesian Model Averaging, while
exact inference is often possible with predictive intervals under model
selection.
Returning to the wage data set, let us find the predicted values under the
*Best Predictive Model* (`BPM`), the model whose predictions are closest to
those from BMA, along with the corresponding posterior standard deviations.
```{r bma_predict, cache=TRUE}
BPM_pred_lwage <- predict(bma_lwage, estimator = "BPM", se.fit = TRUE)
variable.names(BPM_pred_lwage)
```
In the code above, the function `variable.names` can be used to extract the names of all of the predictors in the *Best Predictive Model* (`BPM`).
This can be used to identify the variables in the *Highest Probability Model* (`HPM`)
```{r HPM}
HPM_pred_lwage <- predict(bma_lwage, estimator = "HPM")
variable.names(HPM_pred_lwage)
```
and the
*Median Probability Model* (`MPM`)
```{r MPM}
MPM_pred_lwage <- predict(bma_lwage, estimator = "MPM")
variable.names(MPM_pred_lwage)
```
The `MPM` includes `exper` in addition to all of the variables in the
*Highest Probability Model* (`HPM`), while the `BPM` includes `kww` in
addition to all of the variables in the `MPM`.
<div class="question">
Based on these results, which covariates are included in **all** of the following: the best
predictive model, the median probability model, and the highest posterior
probability model?
* `kww`, `married`, `urban`
* `married`, `age`, `black`
* `black`, `south`, `married`
* `meduc`, `urban`, `married`
</div>
```{r Q11}
# Type your code for Question 11 here.
```
Let us turn to examine what characteristics lead to the highest wages in the
`BPM` model.
```{r opt_wage}
# Find the index of the observation with the largest fitted value
opt <- which.max(BPM_pred_lwage$fit)
# Extract the row with this observation and glimpse at the row
wage_no_na %>%
slice(opt) %>%
glimpse()
```
A 95% credible interval for predicting log wages can be obtained by
```{r ci}
ci_lwage <- confint(BPM_pred_lwage, parm = "pred")
ci_lwage[opt,]
```
To translate this back to `wage` (recall that we regress `lwage`), we may
exponentiate the interval to obtain a 95% prediction interval for the wages of
an individual with covariates at the levels of the individual specified by `opt`.
```{r ci_wage}
exp(ci_lwage[opt,])
```
If we were to use BMA, the interval would be
```{r}
BMA_pred_lwage <- predict(bma_lwage, estimator = "BMA", se.fit = TRUE)
ci_bma_lwage <- confint(BMA_pred_lwage, estimator = "BMA")
opt_bma <- which.max(BMA_pred_lwage$fit)
exp(ci_bma_lwage[opt_bma, ])
```
<div class="question">
Repeat these calculations to obtain a 95% prediction interval for the individual
who is predicted to have the highest wages under the best predictive model.
* [414, 1717]
* [782, 1571]
* [782, 3154]
* [706, 2950]
</div>
```{r Q12-cred-interval}
# Type your code for Question 12 here.
```
<div class="license">
This work is licensed under [GNU General Public License v3.0](https://www.gnu.org/licenses/quick-guide-gplv3.html).
</div>
## References