forked from jonberthet/JLeek-Data-Analysis-Notes
class() // reports an object's class, e.g. character, numeric, or logical (TRUE/FALSE)
Vectors = set of values with same class
c(1,2,3,4)
c("Jon", "Bob")
Lists = a vector of values of possibly different classes
v1 = c(1,2,3,4)
v2 = c("jon", "bob")
mylist = list(number = v1, ppl = v2)
Matrices = vectors with multiple dimensions
myMatrix = matrix(c(1,2,3,4), byrow = TRUE, nrow = 2)
Data Frames
multiple vectors of the same length, possibly of different classes, put into a table
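The structures above can be sketched in a few lines of base R. This is only an illustration: the values and the names `myList` and `myDataFrame` (with its `firstNames` and `height` columns) are made up for the example.

```r
# Hypothetical example values; only base R constructors are used.
v1 <- c(1, 2, 3, 4)                      # numeric vector
v2 <- c("jon", "bob")                    # character vector
myList <- list(number = v1, ppl = v2)    # a list can hold both classes

# byrow = TRUE fills the matrix one row at a time
myMatrix <- matrix(c(1, 2, 3, 4), byrow = TRUE, nrow = 2)

# A data frame: equal-length columns, each with its own class
myDataFrame <- data.frame(firstNames = c("jeff", "roger"),
                          height = c(188, 181))

class(v1)            # "numeric"
class(v2)            # "character"
class(myList)        # "list"
dim(myMatrix)        # 2 2
class(myDataFrame)   # "data.frame"
```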
Factors = qualitative variables that can be included in models
smoker = c("yes", "no", "yes", "no")
smokeFactor = as.factor(smoker) //use function as.factor!!
Missing Values = usually coded NA
is.na(vector1) // to determine which values are missing
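A short sketch of factors and missing values, assuming a made-up `vector1` that contains two NA entries:

```r
smoker <- c("yes", "no", "yes", "no")
smokerFactor <- as.factor(smoker)
levels(smokerFactor)          # "no" "yes" (levels are sorted alphabetically)
table(smokerFactor)           # counts per level

vector1 <- c(1, NA, 3, NA)    # hypothetical data with missing values
is.na(vector1)                # FALSE TRUE FALSE TRUE
sum(is.na(vector1))           # 2 missing values
mean(vector1, na.rm = TRUE)   # 2, computed after dropping the NAs
```

Without `na.rm = TRUE`, `mean(vector1)` would itself be NA, which is why it helps to check for missing values first.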
Subsetting
Use [1] to get the first value of a vector (on a data frame, [1] returns the first column)
Use [6] to get the 6th value of a vector
Logical Subsetting
Returns only the values for which the condition is TRUE
ie - myDataFrame[myDataFrame$firstNames == "jeff",] // use == to compare; a single = does not test equality
myDataFrame[myDataFrame$height < 190,] // Returns rows with height less than 190. Anything after the ',' selects columns
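The subsetting rules above can be tried on a small data frame. The data here are hypothetical; only the column names `firstNames` and `height` follow the notes.

```r
# Hypothetical data frame for illustration
myDataFrame <- data.frame(firstNames = c("jeff", "roger", "andrew", "brian"),
                          height = c(188, 181, 192, 179),
                          stringsAsFactors = FALSE)

heights <- myDataFrame$height
heights[1]                     # 188, the first value of the vector
myDataFrame[1, ]               # first row, all columns
myDataFrame[, 2]               # second column, returned as a vector

# Logical subsetting: keep the rows where the condition is TRUE
jeffRows <- myDataFrame[myDataFrame$firstNames == "jeff", ]
shorter  <- myDataFrame[myDataFrame$height < 190, ]
nrow(shorter)                  # 3 rows have height below 190
```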
Variable naming Conventions
Camel Caps = myHeightCM = 188
Underscore = my_height_cm = 188
Period = my.height.cm = 188
SIMULATION BASICS
Important Simulation Functions
Distributions - functions that generate random values always start with r (rbeta, rbinom)
Densities - always start with d (dbeta, dbinom)
Sampling - with and w/out replacement
rfoo functions generate data ***args(XXX) = tells me the arguments of a fn***
rbinom's arguments are (n, size, prob) = (number of draws, number of coins flipped per draw, probability of heads for each coin) -> returns how many coins came up heads in each draw
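A minimal sketch of the rfoo functions, assuming an arbitrary seed and made-up parameters (5 draws of 10 fair coins, and 10 heights centered at 188):

```r
args(rbinom)    # prints the argument list: function (n, size, prob)

set.seed(1)     # arbitrary seed, an assumption for this example
# 5 draws; each draw flips 10 fair coins and counts the heads
flips <- rbinom(n = 5, size = 10, prob = 0.5)
length(flips)                      # 5, one count per draw
all(flips >= 0 & flips <= 10)      # TRUE, a count of heads cannot exceed size

# rnorm follows the same pattern: n values from a normal distribution
heights <- rnorm(n = 10, mean = 188, sd = 3)
length(heights)                    # 10
```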
dfoo functions calculate the density
Normal = dnorm
Binomial = dbinom
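As a quick check of what the dfoo functions return, here is a sketch with hand-computable values (the specific inputs are chosen only for illustration):

```r
# Probability of exactly 5 heads in 10 fair coin flips:
# choose(10, 5) / 2^10 = 252 / 1024
pFiveHeads <- dbinom(x = 5, size = 10, prob = 0.5)
pFiveHeads                          # 0.2460938

# Standard normal density at 0 equals 1 / sqrt(2 * pi)
dAtZero <- dnorm(0, mean = 0, sd = 1)
dAtZero                             # 0.3989423

# Binomial probabilities over all possible outcomes sum to 1
sum(dbinom(0:10, size = 10, prob = 0.5))   # 1
```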
sample draws a random sample
sample(x,size=10,replace=FALSE) // x = data to sample from, size = # of values we'd like to draw, replace = whether a value can be drawn more than once
sample can draw according to a set of probabilities
probs = c(0.4,0.3,0.2, 0.1,0,0,0,0,0) // Must have exactly as many entries as x; pass it to sample() as prob = probs
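A sketch of weighted sampling, assuming a hypothetical `x` with 9 values so that it lines up with the 9 weights in `probs`:

```r
x <- 1:9                                        # hypothetical data, one value per weight
probs <- c(0.4, 0.3, 0.2, 0.1, 0, 0, 0, 0, 0)   # weights from the notes
length(probs) == length(x)                      # TRUE, the counts must match

set.seed(12345)
# With replacement, so 10 draws are possible from 4 non-zero weights
draws <- sample(x, size = 10, replace = TRUE, prob = probs)
all(draws %in% 1:4)                             # TRUE, zero-weight values are never drawn

# Without weights, every value is equally likely
plainDraws <- sample(x, size = 4, replace = FALSE)
anyDuplicated(plainDraws)                       # 0, no value repeats without replacement
```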
Setting a Seed = ensures reproducible results from random processes in R
set.seed(12345) // takes 1 argument, a single integer. Then run the function that generates random variables (like rnorm) or draws a random sample (like sample). As long as the seed is set and the values are generated in the same order, we'll get the same values out. MUST KEEP THE ORDER OF SAMPLING THE SAME TO GET THE SAME RESULTS OUT
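The point about ordering can be checked directly. A minimal sketch, using the same seed as the notes and hypothetical variable names:

```r
set.seed(12345)
first <- rnorm(5)

set.seed(12345)              # same seed, same call order
second <- rnorm(5)
identical(first, second)     # TRUE

set.seed(12345)
extra <- sample(1:10, 1)     # one extra random draw before rnorm
third <- rnorm(5)
identical(first, third)      # FALSE in practice: the extra draw shifted the random stream
```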
http://cran.r-project.org/web/views/Distributions.html
TYPES OF DATA ANALYSIS QUESTIONS
1. Descriptive: Describe a set of data
First kind of data analysis performed
Commonly applied to census data
Description and interpretation are different steps
Descriptions usually can't be generalized w/out additional stats modeling
2. Exploratory: Find relationships you didn't know about
Good for discovering new connections
useful for future studies
usually not the final say
should not be used alone for generalizing/predicting
3. Inferential Analysis : use a relatively small sample of data to say something about a bigger population
-common goal of stats models
-involves estimating both the quantity you care about and your uncertainty about the estimate
-depends heavily on both population and sampling scheme
4. Predictive Analysis : use data on some objects to predict values for another object
If X predicts Y, it doesn't mean that X causes Y
Accurate prediction depends heavily on measuring right variables
Although there are better and worse prediction models, more data and a simple model often work well
5. Causal Analysis: to find out what happens to 1 variable when you make another variable change
-need randomized studies to identify causation
-there are approaches to inferring causation in non-randomized studies, but they are complicated and sensitive to assumptions
-usually ID'd as average effects, may not apply to every individual
-usually the 'gold standard' of data analysis
6. Mechanistic Models : understand the exact changes in variables that lead to changes in other variables for individual objects
-very hard to infer, except in simple situations
-usually modeled by a deterministic set of equations (physical/engineering)
-generally the random component of the data is measurement error
-if equations are known but parameters are not, they may be inferred w/ data analysis
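To make the inferential item above concrete (estimate a quantity and the uncertainty about it), here is a rough sketch. The simulated population, the sample size of 50, and the normal-approximation interval are all assumptions made for illustration, not part of the original notes.

```r
set.seed(5)                                        # arbitrary seed
# Hypothetical population of 10,000 heights
population <- rnorm(10000, mean = 188, sd = 3)

# A relatively small random sample from that population
smallSample <- sample(population, size = 50, replace = FALSE)

estimate <- mean(smallSample)                           # quantity we care about
stdError <- sd(smallSample) / sqrt(length(smallSample)) # uncertainty of the estimate

# Approximate 95% interval under a normal approximation
interval <- c(lower = estimate - 1.96 * stdError,
              upper = estimate + 1.96 * stdError)
interval
```

How well the interval covers the population mean depends on the sampling scheme; a convenience sample would not carry the same guarantee.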
SOURCES OF DATA
CENSUS : measure each individual (no inferential problem)
OBSERVATIONAL :
set.seed(5) // sets seed to make results reproducible
sample(1:8, size=4, replace=FALSE) // from ppl 1 thru 8, picks a random sample of size 4 without replacement, so each individual can only be sampled once
CONVENIENCE :
probs = c(5,5,5,5,1,1,1,1)/16 // in this situation, proximity makes it easier to pick ppl closer to us than ones farther away, so each of the first 4 ppl is 5x as likely to be picked as each of the latter 4 ppl (sample() rescales the weights, so they need not sum to 1)
sample(1:8, size = 4, replace=FALSE, prob=probs) // the first 4 ppl are more likely to be chosen than the latter 4 ppl
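The difference between the random sample and the convenience sample shows up when the draw is repeated many times. A sketch, where the helper `nearShare` and the 2000 repetitions are hypothetical choices for this example:

```r
probs <- c(5, 5, 5, 5, 1, 1, 1, 1) / 16   # weights from the notes

# Hypothetical helper: repeat the sample 2000 times and return the
# share of picks that land on the first 4 (nearby) people
nearShare <- function(prob = NULL) {
  picks <- replicate(2000, sample(1:8, size = 4, replace = FALSE, prob = prob))
  mean(picks <= 4)
}

set.seed(5)
randomShare <- nearShare()                # equal weights: close to 0.5
convenienceShare <- nearShare(probs)      # weighted: well above 0.5
c(random = randomShare, convenience = convenienceShare)
```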
Randomized Trial : if we want to do causal analysis, do this!
Prediction Study : need 2 sets of data:
1st - a training set where we build a predictive model and
2nd, a test set where we evaluate the predictive model
Cross-Sectional : picks a particular time point and studies the sample at that time
Longitudinal : follows the same sample of ppl over time
Retrospective : take a random sample at the end of the time period and measure their outcome (inferential, rather than predictive or causal)
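For the prediction study described above, the training/test split might look like the sketch below. The simulated data, the 70/30 split, and the use of `lm` as the predictive model are assumptions for illustration only.

```r
set.seed(5)                                   # arbitrary seed
n <- 100
# Hypothetical data: y depends linearly on x plus noise
x <- rnorm(n)
dat <- data.frame(x = x, y = 2 * x + rnorm(n))

# Randomly assign 70 rows to training; the remaining 30 form the test set
trainRows <- sample(1:n, size = 70, replace = FALSE)
trainSet <- dat[trainRows, ]
testSet  <- dat[-trainRows, ]

# Build the model on the training set only
fit <- lm(y ~ x, data = trainSet)

# Evaluate it on the test set, which the model has not seen
predicted <- predict(fit, newdata = testSet)
testRMSE <- sqrt(mean((testSet$y - predicted)^2))
testRMSE
```

Keeping the test rows out of model fitting is what makes the test error an honest estimate of how the model would do on new data.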