-
Notifications
You must be signed in to change notification settings - Fork 6
/
Copy pathREADME.Rmd
143 lines (89 loc) · 7.12 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
---
output: rmarkdown::github_document
---
# Causal Inference in Introductory Statistics Courses
Below is a causal inference-based student activity using the forced expiratory volume (FEV) dataset originally published in Rosner (1995). Previously, Khan (2005) demonstrated how this data could be used in introductory courses to illustrate traditional statistical concepts. Here, we demonstrate how principles of casual inference can also be easily explained to introductory students using the same data. A student handout and instructor resources related to this activity are available on this site.
Thank you to my colleagues James Pleuss, Dusty Turner, Bryan Adams, Nicholas Clark, and Krista Watts for their assistance.
## Background
Today, it is common knowledge smoking has many negative health consequences. This was not always the case. The mass production and marketing of cigarettes following World War II led to a rapid increase in smoking rates, outpacing scientific and public health knowledge. Massive studies, with millions of participants, in the 1950s and 1960s led by the American Cancer Society established strong evidence connecting smoking to lung cancer. Subsequent studies, like the one this activity is based upon, sought to further refine this evidence. This activity is based on a study conducted in the 1970s. The researchers followed a cohort of children in East Boston, MA for seven years to determine, among other things, the effect of childhood smoking on lung function (Tager et al 1979, 1983).
## Research Question
What is the effect of teenage smoking on lung function?
## Confounding Variables
It would be unethical to conduct a randomized experiment of the effect of teenage smoking on lung function. Therefore, an observational study is necessary. In observational studies, the effect of one variable (the treatment) on another (the outcome) is subject to confounding. Confounding can occur when there are common causes of the variables. If we identify such variables, there are various things we can do in the study design and analysis to eliminate confounding.
What do you think are some confounders of the effect of teenage smoking on lung function?
## Variables
For the purposes of this activity, let's say you were able to control for all other potential confounding variables except subject age, sex, and height. Here is a list of variables we will consider. FEV is the outcome and SMOKE is the treatment.
| Variable | Description |
|-----------|-----------------------------------------------------------------|
| AGE | the age of the subject in years |
| FEV | forced expiratory volume (L), a common measure of lung function |
| HEIGHT | the height of the subject in inches |
| SEX | biological sex of the subject: Female (0), Male (1) |
| SMOKE | whether the subject had ever smoked or not: No (0), Yes (1) |
## Causal Diagram
A key component of causal inference is identifying potential confounding variables using expert knowledge, not the data.
The figure below is a causal diagram describing the variables in this study. An arrow from one variable to another indicates the first variable causes the second. These relationships are specified based on expert knowledge prior to collecting data. Here is a brief justification for the arrows originating from variables in this causal diagram.
* Age. Older teenagers are more likely to smoke (more peers that smoke, greater freedom from parents, etc.), have higher lung capacity, and be taller.
* Sex. Men are more likely to smoke than women, be taller, and have higher FEV (even when they are the same height as women).
* Smoking. Childhood smoking can inhibits growth.
* Height. Taller people have higher lung capacity.

Based on the backdoor path criterion, SEX and AGE are confounders in this causal diagram. Therefore, we should adjust for AGE and SEX in our analysis.
## Data Analysis
```{r, message = FALSE}
library(tidyverse)
fev = read.table('http://jse.amstat.org/datasets/fev.dat.txt')
colnames(fev) = c("AGE", "FEV", "HEIGHT", "SEX", "SMOKE")
fev = as.tibble(fev)
#Summary Statistics
summary <- fev %>% group_by(SEX,SMOKE) %>% summarise_all(funs(mean,sd))
```
```{r, echo = FALSE, results = 'asis'}
library(knitr)
summary = round(summary,1)
summary = summary %>% select(c(1,2,4,7,3,6,5,8))
colnames(summary) = c("SEX", "SMOKE", "FEV (liters)", "stdev", "AGE (yrs)", "stdev", "HEIGHT (in)", "stdev")
summary$SEX = c("FEMALE","FEMALE","MALE","MALE")
kable(summary)
```
Smokers are older than nonsmokers.
```{r, echo = TRUE}
fev %>% ggplot(aes(x = factor(SEX), y = AGE, color = factor(SMOKE))) + geom_boxplot()
```
Smokers have higher FEV than nonsmokers.
```{r, echo = TRUE}
fev %>% ggplot(aes(x = factor(SEX), y = FEV, color = factor(SMOKE))) + geom_boxplot()
```
However, older teenagers have higher FEV.
```{r, echo = TRUE}
fev %>% ggplot(aes(x = AGE, y = FEV, color = factor(SMOKE))) + geom_point()
```
### Unadjusted Analysis
In the unadjusted analysis, smoking is positively associated with lung function.
```{r, echo = TRUE}
summary(lm(FEV ~ SMOKE, data = fev))
```
### Adjusted Analysis
We show two adjusted models (matching and regression). In both cases, there is no longer a beneficial effect of smoking on lung function.
#### Matching Model
In the matching model, we match smokers with all nonsmokers of the same AGE and SEX. Each AGE/SEX combination is called a subclass. Weights are assigned to each observation such that the total weight for both smokers and nonsmokers in each subclass is equal to one. Then, we compute the weighted least squares estimate for the parameter associated with FEV. We use the R MatchIt package for this analysis.
```{r, echo = TRUE}
library(MatchIt)
#Perform matched and extract matched dataset and weights
match.exact = matchit(SMOKE ~ AGE + SEX, data = fev, method = "exact")
matched.data = match.data(match.exact)
#Fit weighted least squares regression model
model.matched = lm(FEV ~ SMOKE + SEX + AGE, data = matched.data, weights = weights)
summary(model.matched)
```
#### Regression Model
```{r, echo = TRUE}
model.regression = lm(FEV ~ SMOKE + AGE + SEX, data = fev)
summary(model.regression)
```
References:
* Kahn, M. (2005). An exhalent problem for teaching statistics.Journal of Statistics Education, 13(2).
* Rosner, Bernard (1995). Fundamentals of Biostatistics. Duxbury Press: New York.
* Tager, I. B., Weiss, S. T., Mu ̃noz, A., Rosner, B., and Speizer, F. E. (1983). Longitudinal study of the effects of maternal smoking on pulmonary function in children. New England Journal of Medicine, 309(12):699–703.
* Tager, I. B., Weiss, S. T., Rosner, B., and Speizer, F. E. (1979). Effect of parental cigarette smoking on the pulmonary function of children. American Journal of Epidemiology, 110(1):15–26.
* Ho, D. E., Imai, K., King, G., and Stuart, E. A. (2011). MatchIt: Nonparametric preprocessing for parametric causal inference.Journal of Statistical Software, 42(8):1–28.