forked from rdpeng/RepData_PeerAssessment1
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathPA1_template.Rmd
167 lines (134 loc) · 6.36 KB
/
PA1_template.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
---
title: "Reproducible Research: Peer Assessment 1"
author: "Lizel Greyling"
date: "December 12, 2014"
output:
html_document:
keep_md: true
---
Movement Activity Monitoring
============================
### 1. Loading and preprocessing data
#### 1.1. Load the data
* Load required libraries:
```{r}
library(lubridate)
library(lattice)
```
* Check whether the data file is in the working directory. If not, download and unzip it.
```{r}
if (!file.exists("activity.csv")) {
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip"
temp <- tempfile()
download.file(url, temp)
unzip(temp)
unlink(temp)
}
```
* Read the data file:
```{r}
activity <- read.csv("activity.csv")
```
* Change the date field from factors to date values, using the libridate function ymd(). This will be usefull later, when determining the day of the week.
```{r}
activity$date <- ymd(activity$date)
```
#### 1.2. Process/transform the data into a format suitable for the analysis
* Summarise the steps per day:
```{r}
daysteps <- aggregate(steps ~ date, data = activity, sum)
```
### 2. What is the mean total number of steps taken per day?
*For this part of the assignment, the missing values in the dataset are ignored.*
#### 2.1. Histogram of the total number of steps taken each day
```{r histogram}
hist(daysteps$steps, main = "Histogram of steps per day", xlab = "Steps per day")
```
#### 2.2. Mean and median total number of steps taken per day
```{r}
options(scipen=999) #Used to remove scientific notation, makes output look nicer.
mean(daysteps$steps)
median(daysteps$steps)
```
The mean number of steps per day is **`r format(mean(daysteps$steps),digits=0)`** steps while the median is **`r median(daysteps$steps)`** steps.
### 3. Average daily activity pattern
Create a dataset with the average number of steps per interval:
```{r}
intervalsteps <- aggregate(steps ~ interval, data = activity, mean)
names(intervalsteps) <- c("interval","stepsmean")
```
*The field names are changed to prepare for using a merge() function in step 4 (Replacing missing values).*
#### 3.1. Time series plot of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days
```{r timeseriesplot}
plot(stepsmean ~ interval, data = intervalsteps, type = "l",main = "Average steps per interval", xlab = "Interval", ylab="Average steps")
```
#### 3.2. Which 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps?
```{r}
maxinterval <- intervalsteps[intervalsteps$steps==max(intervalsteps$steps),]
maxinterval
```
The most steps are in interval **`r maxinterval$interval`**, which contains **`r round(maxinterval$steps,digits=0)`** steps.
### 4. Replacing missing values
There are a number of days/intervals where there are missing values (coded as NA). The presence of missing days may introduce bias into some calculations or summaries of the data.
####4.1. Calculate and report the total number of missing values in the dataset.
```{r}
missing <- sum(is.na(activity))
```
There are **`r missing`** rows with missing values.
####4.2. Devise a strategy for filling in all of the missing values in the dataset.
Looking at the data, it seems that there is more variability between intervals than between days. Also, it is possible that there can be missing values for only some intervals in a day, but not for the whole day.
It therefore makes sense to replace any missing value on a interval-by-interval basis with the average for the interval, rather than replacing the missing values for a whole day with the average for the day.
####4.3. Create a new dataset that is equal to the original dataset but with the missing data filled in.
- Create a merged dataframe that is the same as *activity*, but with an additional column that contains the interval mean.
```{r}
activitymrgd <- merge(activity, intervalsteps)
```
- Create an index of rows with missing values:
```{r}
missingrows <- which(is.na(activity))
```
- Create a new dataset that is equal to the original dataset...
```{r}
activityclean <- activity
```
... and replace the missing values in the new dataset with the interval means:
```{r}
activityclean[missingrows,1] <- activitymrgd[missingrows,4]
```
- Check whether all missing values have been replaced:
```{r}
sum(is.na(activityclean))
```
... and whether the new dataset has the same number of rows as the original:
```{r}
nrow(activityclean) - nrow(activity)
```
####4.4. Histogram, mean and median total number of steps taken per day.
```{r histogram2}
daystepsclean <- aggregate(steps ~ date, data = activityclean, sum)
hist(daystepsclean$steps, main = "Histogram of steps per day: NA's removed", xlab = "Steps per day")
mean(daystepsclean$steps)
median(daystepsclean$steps)
```
The mean is now `r round(mean(daystepsclean$steps),digits=0)`, compared to the original `r round(mean(daysteps$steps),digits=0)`; while the median is `r median(daystepsclean$steps)` compared to the original `r median(daysteps$steps)`. There is quite a significant difference.
By estimating the missing values, both the mean and median have increased. The impact is that this might not actually reflect reality.
### 5. Are there differences in activity patterns between weekdays and weekends?
####5.1. Create a new factor variable in the dataset with two levels -- "weekday" and "weekend" indicating whether a given date is a weekday or weekend day.
- Add a column containing the weekday, using the lubridate package function wday():
```{r}
activityclean[,4] <- wday(activityclean$date)
names(activityclean)[4]<-"wday"
```
- Change the numeric value that represents the day of the week (Sunday = 1, etc.) to a character string, either "weekend" or "weekday". If the day is "1" (Sunday) or "6" (Saturday), the value will be "weekend", else it will be "weekday":
```{r}
activityclean$wday <- ifelse(activityclean$wday==1|activityclean$wday==6,"weekend","weekday")
```
- Change the character variable into a factor variable:
```{r}
activityclean$wday <- factor(activityclean$wday, levels = c("weekend","weekday"))
```
####5.2. Time series plot of the 5-minute interval and the average number of steps taken, averaged across all weekday days or weekend days.
```{r panelplot}
cleanintervalsteps <- aggregate(steps ~ wday + interval, data = activityclean, mean)
xyplot(steps ~ interval|wday, type = "l", data = cleanintervalsteps,layout = c(1,2))
```