forked from rdpeng/RepData_PeerAssessment1
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathPA1_template.Rmd
211 lines (148 loc) · 7.55 KB
/
PA1_template.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
---
output:
html_document:
keep_md: yes
---
Peer Assessment 1
========================
Necessary libraries for the analysis of the activity.csv data
```{r}
library(lattice)
```
Read in the activity.csv file
```{r}
temp <- tempfile()
download.file("http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip",temp)
data <- read.csv(unz(temp, "activity.csv"))
unlink(temp)
head(data, n = 5)
```
### **What is the total number of steps taken per day?**
- **Make a histogram of the total number of steps taken each day**
```{r}
options(scipen=999)
#Create a data frame for the total number of steps per day for the dataset
totalSteps <- aggregate(data$steps, list(data$date), "sum")
#Rename variables for the dataset containing the step totals per day
names(totalSteps) <- c("date", "total")
#Create a histogram of the total number of steps per day for the dataset
hist(totalSteps$total,
main = "Histogram of The Total Number of Steps Per Day",
xlab = "Total Number of Steps Per Day",
col = "blue")
```
- **Calculate and report the *mean* and *median* total number of steps taken per day**
The mean total number of steps taken per day is `r mean(totalSteps$total, na.rm = TRUE)`
```{r}
mean(totalSteps$total, na.rm = TRUE)
```
The median total number of steps taken per day is `r median(totalSteps$total, na.rm = TRUE)`
```{r}
median(totalSteps$total, na.rm = TRUE)
```
### **What is the average daily activity pattern?**
- **Make a time series plot (i.e. `type = "l"`) of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days (y-axis)**
```{r}
#Create data frame for the mean number of steps per 5 minute interval
meanSteps <- aggregate(data$steps, list(data$interval), "mean", na.rm = TRUE)
#Rename the variables in meanSteps
names(meanSteps) <- c("interval", "mean")
xyplot(mean ~ interval,
data = meanSteps,
type = "l",
main = "Mean Number of Steps Per Interval",
xlab = "Interval",
ylab = "Mean Number of Steps")
```
- **Which 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps?**
The 5-minute interval, on average across all the days in the dataset, that contains the maximum number of steps is `r meanSteps[meanSteps$mean == max(meanSteps$mean, na.rm = TRUE), "interval"]`
```{r}
meanSteps[meanSteps$mean == max(meanSteps$mean, na.rm = TRUE), "interval"]
```
### **Inputing missing values**
**Note that there are a number of days/intervals where there are missing values (coded as `NA`). The presence of missing days may introduce bias into some calculations or summaries of the data.**
- **Calculate and report the total number of missing values in the dataset (i.e. the total number of rows with `NA`s)**
There are `r sum(is.na(data$steps) == TRUE)` missing values in the dataset
```{r}
sum(is.na(data$steps) == TRUE)
```
- **Devise a strategy for filling in all of the missing values in the dataset. The strategy does not need to be sophisticated. For example, you could use the mean/median for that day, or the mean for that 5-minute interval, etc.**
The strategy used consists of replacing the all NA's in data$steps with the mean number of steps for the corresponding time interval. Means were rounded to the nearest integer.
- **Create a new dataset that is equal to the original dataset but with the missing data filled in.**
A new dataset named filledData is created.
```{r}
#Replace all NA's in data$steps with the mean number of steps for the corresponding time interval.
#Means were rounded to the nearest integer
filledData <- data
for(i in seq_len(length(data$steps)))
{
if(is.na(filledData$steps[i]) == TRUE)
{
filledData$steps[i] <- round(meanSteps[meanSteps$interval == filledData$interval[i], "mean"], digits = 0)
}
}
head(filledData, n = 5)
```
- **Make a histogram of the total number of steps taken each day and Calculate and report the *mean* and *median* total number of steps taken per day. Do these values differ from the estimates from the first part of the assignment? What is the impact of imputing missing data on the estimates of the total daily number of steps?**
```{r}
#Create a data frame for the total number of steps per day for the new data frame (filledData) that contains no NA's
totalStepsFilled <- aggregate(filledData$steps, list(filledData$date), "sum")
#Rename variables for the data frame containing the step totals with the NA's filled in
names(totalStepsFilled) <- c("date", "total")
#Create a histogram of the total number of steps per day for the new data frame that contains no NA's
hist(totalStepsFilled$total,
main = expression(atop("Histogram of The Total Number of Steps Per Day", "(NA's filled in with interval means)")),
xlab = "Total Number of Steps Per Day",
col = "red")
```
The mean total number of steps taken per day is `r mean(totalStepsFilled$total)`
```{r}
mean(totalStepsFilled$total)
```
The median total number of steps taken per day is `r median(totalStepsFilled$total)`
```{r}
median(totalStepsFilled$total)
```
Once the NA's were filled in with the mean intervals, the mean decreased by `r mean(totalSteps$total, na.rm = TRUE) - mean(totalStepsFilled$total)` and the median decreased by `r median(totalSteps$total, na.rm = TRUE) - median(totalStepsFilled$total)`
```{r}
mean(totalSteps$total, na.rm = TRUE) - mean(totalStepsFilled$total)
median(totalSteps$total, na.rm = TRUE) - median(totalStepsFilled$total)
```
### **Are there differences in activity patterns between weekdays and weekends?**
**For this part the `weekdays()` function may be of some help here. Use the dataset with the filled-in missing values for this part.**
- **Create a new factor variable in the dataset with two levels - "weekday" and "weekend" indicating whether a given date is a weekday or weekend day.**
```{r}
#Create a new column in filledData data frame with a weekday for the corresponding value in the date column for each row
#Used as.POSIXlt() to convert values in date column from interger type to date type
filledData$weekFactor <- weekdays(as.POSIXlt(filledData$date))
#Turned the column weekFactor into a two level factor by converting Saturday and Sunday values to a weekend value and
#converting Monday-Friday values to a weekday value. Set the class of filledData$weekFactor to Factor.
for(i in seq_len(length(filledData$weekFactor)))
{
if(filledData$weekFactor[i] == "Saturday" | filledData$weekFactor[i] == "Sunday")
{
filledData$weekFactor[i] <- "weekend"
}
else
{
filledData$weekFactor[i] <- "weekday"
}
}
filledData$weekFactor <- as.factor(filledData$weekFactor)
head(filledData, n = 5)
```
- **Make a panel plot containing a time series plot (i.e. `type = "l"`) of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all weekday days or weekend days (y-axis).**
```{r}
#Create data frame for the mean number of steps per 5 minute interval for the filled data
meanStepsFilled <- aggregate(filledData$steps, by = list(filledData$interval, filledData$weekFactor), "mean", na.rm = TRUE)
#Rename the variables in meanSteps
names(meanStepsFilled) <- c("interval","weekFactor", "mean")
#Create a lattice line plot comparing the mean number of steps taken per interval for weekdays vs. weekends
xyplot(mean ~ interval | weekFactor,
data = meanStepsFilled,
type = "l",
layout = c(1, 2),
main = "Mean Number of Steps Per Interval",
xlab = "Interval",
ylab = "Mean Number of Steps")
```