# project1
Aiqun Huang
## Loading and processing data
```{r}
unzip("activity.zip")
data <- read.csv("activity.csv", header=TRUE)
data$date <- as.Date(data$date)
data_n <- na.omit(data)
```
## What is mean total number of steps taken per day?
Sum the steps taken per day and plot the histogram:
```{r}
# Total steps per day; rows with missing steps were already dropped in data_n
daily <- tapply(data_n$steps, data_n$date, sum, simplify=TRUE)
hist(x=daily, breaks=20, col="red", xlab="total steps per day")
```
The mean and median of the total number of steps per day are:
```{r}
mean(daily)
median(daily)
```
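Dropping the missing rows before summing matters for these two numbers. A minimal sketch on a made-up data frame (`toy` and its values are hypothetical, not taken from `activity.csv`) shows that summing with `na.rm = TRUE` instead would count a day with no recorded data as 0 steps, which pulls the mean down:

```r
# Hypothetical toy data: the second day has no recorded steps at all
toy <- data.frame(
  date  = as.Date(c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02")),
  steps = c(10, 20, NA, NA)
)

# Approach used above: drop NA rows first, so the all-NA day disappears
toy_n <- na.omit(toy)
daily_omit <- tapply(toy_n$steps, toy_n$date, sum)

# Alternative: keep all rows and ignore NAs while summing;
# the all-NA day is then counted as 0 steps
daily_narm <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

length(daily_omit)  # number of days left after na.omit()
length(daily_narm)  # number of days when NAs are only ignored
mean(daily_omit)
mean(daily_narm)
```

Which behaviour is preferable depends on whether a day without data should be treated as unknown or as zero activity; the analysis above treats it as unknown.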
## What is the average daily activity pattern?
```{r}
# Mean number of steps in each 5-minute interval, averaged across all days
avgSteps <- aggregate(data_n$steps,
                      list(interval = as.numeric(as.character(data_n$interval))),
                      FUN = "mean")
names(avgSteps)[2] <- "steps"
with(avgSteps,
     plot(interval,
          steps,
          type="l",
          xlab="5-minute interval",
          ylab="average steps in the interval"))
```
Which 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps?
```{r}
avgSteps[avgSteps$steps == max(avgSteps$steps), ]
```
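The `interval` values look like clock times written as hours and minutes (for example, 835 for 08:35). That reading is an assumption, not something stated in this report. Under it, the peak interval can be turned into a time of day. The helper `interval_to_time` and the `toy_avg` numbers below are hypothetical:

```r
# Hypothetical helper: turn an interval identifier such as 835 into "08:35",
# assuming the identifier is hours * 100 + minutes
interval_to_time <- function(interval) {
  interval <- as.integer(interval)
  sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
}

# Made-up averages for three intervals (not results from activity.csv)
toy_avg <- data.frame(interval = c(830, 835, 840), steps = c(1.5, 3.0, 2.0))

# which.max() returns the index of the first maximum
peak <- toy_avg[which.max(toy_avg$steps), ]
interval_to_time(peak$interval)  # "08:35"
```

Unlike the `==` comparison with `max()`, `which.max()` returns only one row even if several intervals tie for the maximum.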
## Imputing missing values
The total number of missing values in the dataset:
```{r}
sum(is.na(data))
```
Replace each missing value with the mean for that 5-minute interval:
```{r}
new_data <- data
nas <- is.na(new_data$steps)
avg <- tapply(new_data$steps, new_data$interval, mean, na.rm=TRUE, simplify=TRUE)
new_data$steps[nas] <- avg[as.character(new_data$interval[nas])]
```
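The same interval-mean imputation can also be written with `ave()`, which returns the group means already aligned with the rows, so no lookup by name is needed. A minimal sketch on made-up data (the `toy` data frame is hypothetical):

```r
# Hypothetical toy data: two intervals observed on three days
toy <- data.frame(
  interval = c(0, 5, 0, 5, 0, 5),
  steps    = c(2, 10, NA, 30, 4, NA)
)

# Mean of the non-missing steps within each interval, repeated for every row
interval_mean <- ave(toy$steps, toy$interval,
                     FUN = function(x) mean(x, na.rm = TRUE))

# Fill only the missing entries
nas <- is.na(toy$steps)
toy$steps[nas] <- interval_mean[nas]
toy$steps
```

Here the missing value in interval 0 becomes 3 (the mean of 2 and 4) and the one in interval 5 becomes 20 (the mean of 10 and 30).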
Make a histogram of the total steps per day after imputation:
```{r}
# Total steps per day after imputation
new_daily <- tapply(new_data$steps, new_data$date, sum, simplify=TRUE)
hist(x=new_daily, breaks=20, col="red", xlab="total steps per day", main="Missing values replaced by mean")
```
The new mean and median are:
```{r}
mean(new_daily)
median(new_daily)
```
After replacing the missing values with the interval means, the mean does not change; the new median is 10766.19, compared with the original 10765.
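One way the mean can stay fixed is if the missing values cover whole days: each such day then receives the same total, the sum of the interval means, which equals the mean of the observed daily totals. Whether that is what happens in `activity.csv` is not checked here; the sketch below only illustrates the arithmetic on made-up numbers (`toy` is hypothetical):

```r
# Made-up data: 3 days x 2 intervals, with the third day entirely missing
toy <- data.frame(
  date     = rep(as.Date("2012-10-01") + 0:2, each = 2),
  interval = rep(c(0, 5), times = 3),
  steps    = c(10, 20, 30, 60, NA, NA)
)

# Mean of the daily totals over the observed days only
obs <- na.omit(toy)
old_mean <- mean(tapply(obs$steps, obs$date, sum))

# Impute with interval means, following the same steps as the section above
avg <- tapply(toy$steps, toy$interval, mean, na.rm = TRUE)
nas <- is.na(toy$steps)
toy$steps[nas] <- avg[as.character(toy$interval[nas])]
new_totals <- tapply(toy$steps, toy$date, sum)

old_mean          # mean before imputation
mean(new_totals)  # mean after imputation
```

In this toy case the imputed day gets a total equal to the old mean, so the mean is unchanged while the median can move toward it.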
## Are there differences in activity patterns between weekdays and weekends?
Create a factor variable `wk` with two levels, `weekday` and `weekend`. The commented-out code is an alternative that also works.
```{r}
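# Classify each date as "weekday" or "weekend".
# Note: weekdays() returns locale-dependent names; this assumes an English locale.
is_weekday <- function(d) {
  wd <- weekdays(d)
  ifelse(wd == "Saturday" | wd == "Sunday", "weekend", "weekday")
}
# Alternative: build a factor of day names and merge its levels.
#new_data$weekdays <- factor(format(new_data$date, "%A"))
#levels(new_data$weekdays)
#levels(new_data$weekdays) <- list(weekday = c("Monday", "Tuesday",
#                                              "Wednesday",
#                                              "Thursday", "Friday"),
#                                  weekend = c("Saturday", "Sunday"))
# is_weekday() is vectorized, so it can be applied to the whole column at once
wx <- is_weekday(new_data$date)
new_data$wk <- as.factor(wx)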
```
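`weekdays()` returns day names in the current locale, so comparing against `"Saturday"` and `"Sunday"` assumes an English locale. A locale-independent variant could use the numeric `wday` field of `POSIXlt` (0 = Sunday, ..., 6 = Saturday). The function name `day_type` is hypothetical and not part of the analysis above:

```r
# Hypothetical locale-independent classifier based on the numeric day of week
day_type <- function(d) {
  wday <- as.POSIXlt(d)$wday  # 0 = Sunday, 1 = Monday, ..., 6 = Saturday
  factor(ifelse(wday %in% c(0, 6), "weekend", "weekday"),
         levels = c("weekday", "weekend"))
}

# 2012-10-05 was a Friday, 2012-10-06 a Saturday, 2012-10-07 a Sunday
day_type(as.Date(c("2012-10-05", "2012-10-06", "2012-10-07")))
```

Fixing the `levels` explicitly keeps both levels present even if the input happens to contain only weekdays or only weekends.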
Average the steps in each interval separately for weekdays and weekends, and plot the result:
```{r}
wk_df <- aggregate(new_data$steps,
                   list(interval = as.numeric(as.character(new_data$interval)),
                        wk = new_data$wk),
                   FUN = "mean")
#wk_df <- aggregate(steps ~ wk+interval, data=new_data, FUN=mean)
names(wk_df)[3] <- "steps"
library(lattice)
xyplot(steps ~ interval | wk,
       layout = c(1, 2),
       xlab="Interval",
       ylab="Number of steps",
       type="l",
       lty=1,
       data=wk_df)
```