forked from jonberthet/JLeek-Data-Analysis-Notes
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathW5.R
272 lines (214 loc) · 12 KB
/
W5.R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
W1 - ANOVA
multiple factors and variables @ same time
//KEY IDEAS
1. Outcome is quantitative
2. Multiple explanatory variables
3. Goal = ID contributions of different variables
//Anova - aov()
//Left side of tilda is result/outcome, right side of tilda is the covariate variable (factor variable of left side variable)
//Show Analysis of Variance
aovObject <- aov(movies$score ~ movies$rating)
aovObject
//Get coefficients
//Intercept is the average of b0 (G rating movie)
//next variable - Intercept = avg. rating for PG movies
//3rd variable is - Intercept = avg. rating for PG-13 movies, etc...
aovObject$coeff
//Add a second factor to model and see what relative contribution of
2 factors is to the score.
//Here, I just add an extra set of variables (genre) to the first set (rating).
It's really 2 variables with multiple levels.'
//Add a 2nd variable easily
aovObject2 <- aov(movies$score ~ movies$rating + movies$genre)
aovObject2
//Conclusion I reach is DFs for 2nd variable and residuals
//Look at summary
summary(aovObject2)
//Conclusion: tells me DF, Sum of Sq, Mean Sq, F-value
F-value
-1st F-value measures amount of variation explained by rating in the score varaible
-If P-values < 0.05 generally considered significant
-under Genre value, since it's the second' variable in the data set, it is the additional variation that rating did NOT explain already
//So, I should flip the right side tilda, switch the variables
aovObject 3 <- aov(movies$score ~ movies$genre + movies$rating)
//CONCLUSION: results are different. So ORDER MATTERS, especially if I have an unbalanced design
(ie - genre variable is not distributed evenly b/w levels in the movie rating variable) - be careful.
//Adding quantitative variable
- only include as 1 term
aovObject4 <-aov(movies$score ~ movies$genre + movies$rating + movies$box.office)
summary(aovObject4)
//NOTE: only 1 df for box office cuz everything is the same w/ 1 quant variable
//ANOVA language
Unit - 1 observation
Treatments - diff. conditions applied to diff. units in study
Factors - controlled units
Replicates - independent units w/ same factors and treatments
//Great references to ANOVA & A/B Testing
Video 2 - Binary Outcomes
//Linear Regression might not be best option
//When I put summary(lmRavens), 'Estimate' tells me % inc w/ 1 new point
//1 problem: have a probability more than 1 when using linear regression on binary outcomes
//Advantage of Log odds - can calculate -∞ to +∞ = more stable results, but not very useful for applied research
//Odds - calculate 0 to ∞ = prob / (1 - prob)
//probability - calculate (0 to 1)
Regression: Linear Vs Logistic
//Linear = use expected value, or Ravens win is equal to specific value, given by model provided
//Logistic = model probability of Ravens win itself. Instead of looking at Ravens wins itself, look at probability of Ravens wins
-has advantages cuz limits from -∞ to +∞ , not limits of 0 to 1
//interpret Logistic Regression:
log( Pr(RWi|RSi,b0,b1) / (1 - Pr(RWi|RSi,b0,b1) ) = b0 +b1RSi
b0 - Log odds of a Ravens win if they score zero points
b1 - Log odds ratio of win probability for each point scored (compared to zero points)
exp(b1) - Odds ratio of win probability for each point scored (compared to zero points)
//Logistics Regression Line - glm()
//Always say family - "binomial" for logistic regressions
logRegRavens <- glm(ravensData$ravenWinNum ~ ravensData$ravenScore, family = "binomial")
summary(logRegRavens)
//Interpreting results:
intercept term = log odds of Ravens win when they score 0. If # is negative, they basically 0 chance of winning
Estimate for Ravens score = log odds ratio for a 1 unit increase in Ravens score. So every time Ravens score 1 point, their log odds ratio increases by 0.1066. So, since this is a positive #, the more points they score, the more likely to win
//A sensible graph that shows prob of Ravens win vs their scorez-values and p-values analyzed similarly to a linear model
//show logistic regression
plot(ravensData$ravenScore,logRegRavens$fitted,pch=19,col="blue",xlab="Score",ylab="Prob Ravens Win"
//logs odds ratios of <0 means it's more likely for Ravens to lose. vice versa'
exp(logRegRavens$coeff)
//Get confidence intervals
exp(confint(logRegRavens))
//Apply Anova
anova(logRegRavens, test="Chisq")
NOTE: deviance residuals is about log liklihood of the term - can calculate p-values
//Simpson's Paradox'
When doing regression, consider all important terms
//Interpreting Odds Ratios
-not probabilities
-Odds ratio of 1 = no diff. in odds = no effect
-Log odds ratio of 0 = no difference in odds
-Odds ratio < 0.5 or >2 commonly a 'moderate effect' - vague rule of thumb, but consider real issue
-Relative risk = ratio of probabilities
-Odds Ratio is NOT Relative Risk
V3 - COUNT OUTCOMES
//Many data take form of counts, like:
-Calls to a call center
-# of flu cases in an area
-# of cars that cross a bridge
//Data may also be in form of rates
-% of children passing tests
-% of hits to a website from a country
//Linear regression w/ transformation is an option
Poisson Distribution
lambda parameter = rate (so, if lambda = 100, then avg. # of calls coming into call center in 1 hour is 100')
distributions (variance) are smaller when lambda is smaller (most times)
Linear Regression for # of website views
download.file("https://dl.dropbox.com/u/7710864/data/gaData.rda",destfile="./data/gaData.rda",metho load("./data/gaData.rda")
gaData$julian <- julian(gaData$date)
head(gaData)
plot(gaData$julian,gaData$visits,pch=19,col="darkgrey",xlab="Julian",ylab="Visits")
lm1 <- lm(gaData$visits ~ gaData$julian)
abline(lm1,col="red",lwd=3)
//Poisson/log-linear
//If I take the log of a Poisson distribution, I make a linear function
//I get Multiplicative differences
//If JD increases by 1 day(unit), then take previous day expected value and multiply it by e(b1)
plot(gaData$julian,gaData$visits,pch=19,col="darkgrey",xlab="Julian",ylab="Visits")
lm1 <- lm(gaData$visits ~ gaData$julian)
//Mean-Variance Relationship
plot(glm1$fitted,glm1$residuals,pch=19,col="grey",ylab="Residuals",xlab="Date")
abline(lm1,col="red",lwd=3)
//REFERENCE LECTURE for more details: Poisson great for comparing relationships or dependencies between variables.
V4 - MODEL CHECKING AND MODEL SELECTION
Sometimes model checking/selection not allowed
Often problems
-overfitting (model works for sample set but not for other samples)
-overtesting (Confidence intervals don't' have the significance claimed)
-Biased inference
Linear Regression - basic assumptions
-variance constant
-summarizing linear trend
-all terms are right and enough
-no big outlier
//MODEL CHECKING - CONSTANT VARIANCE
//With dataset, see if another variable explains increased variance
//Use vcovHC{sandwich} variance estimators (if n is big) - changing variance situation
//Generated data where x-values are between 0 and 3, create linear model, and plot vcovHC. The result: variance-covariance matrix (variances on the upper left and lower right) + (covariance on lower left and upper right)
library(sandwich)
set.seed(3433); par(mfrow=c(1,2)); data <- rnorm(100,mean=seq(0,3,length=100),sd=seq(0.1,3,length=1)
lm1 <- lm(data ~ seq(0,3,length=100))
vcovHC(lm1)
//Then, just to see, get a linear model
summary(lm1)$cov.unscaled
//RESULT between linear and vcovHC is that I have more confidence than I should have with
linear model relationship than otherwise. I underestimate variability in linear model.
Also, there may not be a linear trend either
//MODEL CHECKING: If I have a curvalinear trend, I can:
1. use Poisson regression
2. Use data transformation (take the log)
3. Smooth the data/fit a nonlinear trend
4. Use linear regression anyway (interpret line or use vcovHC{sandwich} variance estimators (if n is big))
//MODEL CHECKING: Look for missing covariate (find difference b/w residuals of different variables)
set.seed(3433); par(mfrow=c(1,3)); z <- rep(c(-0.5,0.5),50)
data <- rnorm(100,mean=(seq(0,3,length=100) + z),sd=seq(0.1,3,length=100))
lm1 <- lm(data ~ seq(0,3,length=100))
//color the residuals to see which ones stand above and below the regression line
plot(seq(0,3,length=100),data,pch=19,col=((z>0)+3)); abline(lm1,col="red",lwd=3)
plot(seq(0,3,length=100),lm1$residuals,pch=19,col=((z>0)+3)); abline(c(0,0),col="red",lwd=3)
//Boxplot of the residuals and I notice 1 residuals is lower than the other
boxplot(lm1$residuals ~ z,col = ((z>0)+3) )
//So, What to do?:
1. Use explanatory analysis to ID other variables to include
2. use vcovHC{sandwhich} variance estimators (if n is big)
3. Report unexplained patterns in the data - may be violations of assumptions in model
//MODEL CHECKING - outliers
1. If outliers are experimental mistakes - remove & document
2. If they're real', consider reporting how sensitive estimate is to outlier
3. Consider using a robust linear model fit like rlm{MASS} - MASS package = good if lots of outliers
//MODEL CHECKING - default plots
//These give me 3 sets of charts. -
set.seed(343); par(mfrow = c(1,2))
//MODEL CHECKING - deviance
commonly reported for GLM's'
usually compares the model where every point gets its own parameter to the model I'm using'
deviance doesn't tell me how' my model is wrong, just tells me it's' wrong
//MODEL SELECTION
-choose right variables by:
1. domain-specific knowledge
2. exploratory analysis (plots, color residuals to see variances b/w variables)
x <- seq(0,3, length = 100); y <- rnorm(100); lm1 <- lm(y ~ x)
plot(lm1)
3. Statstical Selection (may bias selection)
-step-wise
-AIC
-BIC
-Modern approaches: Lass, Ridge-Regression
Error measures - R squared will always get bigger with more variables
-Adjusted R-squared is a lil better cuz takes into account # of estimated parameters
-AIC penalizes models w/ more parameters = measures quality of model
-BIC does the same with models w/ lots of parameters, but w/ bigger penalty
//AWESOME MODEL SELECTION - STEP
-I want to model scores but I don't know' what variables to include in the model
-Solve this with step function in R
movies <- movies[,1]
//do linear analysis on movie data. consider all the terms in the data set
lm1 <- lm(score ~ .,data=movies)
//aicFormula() = consider deleting one of the variables, then once it does,
it will recompute the AIC score and see if it got better or worse. If it got better,
it will try again. It will try forward, backward, or both tests and give me back a
result when the AIC isn't changing anymore'
//Give me best AIC result
aicFormula <- step(lm1)
//Give me coefficients (estimates), may be biased
aicFormula
MODEL SELECTION - regsubsets() = which looks at all possible subsets of variable and calculate the BIC score - goal is to minimize BIC
library(leaps);
regSub <- regsubsets(score ~ .,data = movies)
plot(regSub)
or use:
//Does model averaging - calculates posterior probabilities that each term be included in the model. If posterior rating is 0, then don't include'
variables with score closer or exactly 100 should be included
library(BMA)
bicglm1 <- bic.glm(score ~.,data=movies,glm.family="gaussian")
NOTE:
exploratory/visual analysis is key
automatic selection produces answers, but may bias inference
separate sample into 2 groups (use 1 group for the selection step and 1 group for inference)
not for causal relationships
print(bicglm1)