---
title: "p8105_hw3_mbc2178"
author: "Melvin Coleman"
date: "2022-10-11"
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  fig.width = 6,
  out.width = "90%",
  fig.height = 6)
```
Let's load the packages needed for data import, manipulation, and cleaning in this assignment. We also set a common theme and color options for plots.
```{r message = FALSE}
library(tidyverse)
library(dplyr)
library(readr)
library(p8105.datasets)
library(ggridges)
library(patchwork)

theme_set(theme_minimal() + theme(legend.position = "bottom"))

options(
  ggplot2.continuous.colour = "viridis",
  ggplot2.continuous.fill = "viridis"
)

scale_colour_discrete = scale_colour_viridis_d
scale_fill_discrete = scale_fill_viridis_d
```
## Problem 1
Let's load the `instacart` data from the `p8105.datasets` using the following:
```{r}
data("instacart")

instacart = 
  instacart %>% 
  as_tibble()
```
The `instacart` dataset contains `r nrow(instacart)` observations
and `r ncol(instacart)` variables. The variables include user, order, and product
identifiers along with order details, aisle, and product names. There are
`r instacart %>% select(product_id) %>% distinct %>% count` distinct products found in
`r instacart %>% select(user_id, order_id) %>% distinct %>% count` orders from
`r instacart %>% select(user_id) %>% distinct %>% count` distinct buyers.
There are 134 aisles, and fresh vegetables and fresh fruits are by far the most ordered aisles.
The table below shows the number of items ordered in each aisle.
```{r}
instacart %>% 
  count(aisle) %>% 
  arrange(desc(n))
```
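As a quick check on the number of aisles mentioned above, we can count the distinct values of `aisle` directly (a small verification sketch, not part of the required analysis):
```{r}
# Quick check: number of distinct aisles in the instacart data
instacart %>% 
  distinct(aisle) %>% 
  nrow()
```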
Now, let's make a plot showing the number of items ordered in each aisle, limited
to aisles with more than 10,000 items ordered and arranged in ascending order of item count.
```{r}
instacart %>% 
  count(aisle) %>% 
  filter(n > 10000) %>% 
  mutate(aisle = fct_reorder(aisle, n)) %>% 
  ggplot(aes(x = aisle, y = n)) +
  geom_point() +
  labs(title = "Number of items ordered in each aisle") +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
```
The table below shows the three most popular items in each of the aisles
"baking ingredients", "dog food care", and "packaged vegetables fruits", along with
the name, order count, and rank of each item within its aisle.
```{r}
instacart %>% 
  filter(aisle %in% c("baking ingredients", "dog food care", "packaged vegetables fruits")) %>% 
  group_by(aisle) %>% 
  count(product_name) %>% 
  mutate(rank = min_rank(desc(n))) %>% 
  filter(rank < 4) %>% 
  arrange(desc(n)) %>% 
  knitr::kable()
```
Last, we make a table showing the mean hour of the day at which Pink Lady Apples and Coffee Ice Cream are
ordered on each day of the week. Coffee Ice Cream seems to be ordered somewhat earlier in the day.
```{r}
instacart %>% 
  filter(product_name %in% c("Pink Lady Apples", "Coffee Ice Cream")) %>% 
  group_by(product_name, order_dow) %>% 
  summarize(mean_hour = mean(order_hour_of_day)) %>% 
  spread(key = order_dow, value = mean_hour) %>% 
  knitr::kable(digits = 2)
```
## Problem 2
Let's import the `accel_data` dataset into R. This dataset contains five weeks of
accelerometer data collected on a 63-year-old male with BMI 25, who was admitted
to the Advanced Cardiac Care Center of Columbia University Medical Center and diagnosed
with congestive heart failure (CHF).
We tidy and wrangle the data before performing any analysis. First, we use
`janitor::clean_names()` to clean the variable names. Next, we use `pivot_longer`
to create one variable identifying the minute of the 24-hour day (starting at midnight)
and another variable containing the activity count recorded in that minute. We convert
`day` to a factor, add a `day_of_week` variable classifying each day as a weekday or
weekend, and arrange the data set by that variable.
```{r}
accel_df = 
  read_csv(file = "data/accel_data.csv") %>% 
  janitor::clean_names() %>% 
  pivot_longer(
    activity_1:activity_1440,
    names_to = "activity",
    names_prefix = "activity_",
    values_to = "activity_cnt"
  ) %>% 
  mutate(
    day = factor(day, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")),
    activity = as.numeric(activity),
    day_of_week = case_when(
      day %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday") ~ "Weekday",
      day %in% c("Saturday", "Sunday") ~ "Weekend"
    )) %>% 
  arrange(day_of_week)
```
The `accel_df` data set consists of `r nrow(accel_df)` observations and `r ncol(accel_df)` variables.
The variables include the week, day id, day of the week, the minute of the day (`activity`, running from 1 to 1440),
and the activity count recorded in that minute (`activity_cnt`). Each day therefore contributes `r max(accel_df$activity)` minute-level observations.
The highest activity count recorded in any single minute was `r max(accel_df$activity_cnt)`.
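As a quick check (a sketch using the tidied `accel_df` above), we can locate the single minute with the highest recorded activity count:
```{r}
# Locate the week, day, and minute with the highest recorded activity count
accel_df %>% 
  filter(activity_cnt == max(activity_cnt)) %>% 
  select(week, day_id, day, activity, activity_cnt)
```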
Now let's use our tidied data set to examine total activity over the day for each week.
We aggregate across minutes to create a total activity variable for each day and create a
table showing these totals.
From the table below, there is no obvious trend in total daily activity across days
and weeks. However, the totals from Tuesday to Thursday are fairly similar. The largest
daily total appears on Monday of week 3 (see the quick check after the table).
```{r}
accel_df %>% 
  group_by(day, week) %>% 
  summarize(total_activity_cnt = sum(activity_cnt)) %>% 
  pivot_wider(
    names_from = day,
    values_from = total_activity_cnt
  ) %>% 
  knitr::kable(digits = 1)
```
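To verify the claim above, the chunk below (a small sketch using the same aggregation) pulls out the day with the largest daily total:
```{r}
# Identify the week and day with the largest daily total activity count
accel_df %>% 
  group_by(week, day) %>% 
  summarize(total_activity_cnt = sum(activity_cnt)) %>% 
  ungroup() %>% 
  slice_max(total_activity_cnt, n = 1)
```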
Now, let's create a single-panel plot showing the 24-hour activity course for each day,
using color to indicate the day of the week.
From this graph, the highest activity counts appear to occur on the weekend and on
Monday. Nevertheless, most minute-level activity counts stay below 2,500.
```{r}
accel_df %>% 
  ggplot(aes(x = activity, y = activity_cnt, color = day)) +
  geom_line(alpha = .3)
```
## Problem 3
Let's load the `ny_noaa` dataset to answer problem 3 using the code below.
```{r}
data("ny_noaa")
```
The data set `ny_noaa` contains `r nrow(ny_noaa)` observations and `r ncol(ny_noaa)` variables. There are
many missing values in this data set. One reason is that about half of the stations
report precipitation only. In addition, the length of record keeping varies by station,
ranging from less than a year to more than 175 years.
For example, the number of missing values for snowfall in this data set is 381,221.
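A quick way to see the extent of missingness (a sketch that simply counts `NA` values in each variable of the raw data):
```{r}
# Count missing values in each variable of ny_noaa
ny_noaa %>% 
  summarize(across(everything(), ~ sum(is.na(.x))))
```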
We perform some data cleaning and tidying for this data set using the code below.
We converted precipitation (`prcp`) from tenths of millimeters to millimeters, and maximum
temperature (`tmax`) and minimum temperature (`tmin`) from tenths of a degree to degrees Celsius;
snowfall (`snow`) is already recorded in millimeters. Furthermore, we separated the variable `date`
into three separate variables, `year`, `month`, and `day`. We have data from `r min(ny_noaa$date)` to `r max(ny_noaa$date)`.
```{r}
noaa_df = 
  ny_noaa %>% 
  as_tibble() %>% 
  janitor::clean_names() %>% 
  separate(date, into = c("year", "month", "day")) %>% 
  mutate(
    month = month.name[as.numeric(month)],
    year = as.factor(year),
    tmax = as.numeric(tmax),
    tmin = as.numeric(tmin)) %>% 
  mutate(
    # convert tenths of a degree C to degrees C and tenths of mm to mm;
    # snow is already recorded in mm
    tmax = tmax / 10,
    tmin = tmin / 10,
    prcp = prcp / 10
  )
```
Let's find the most common values of snowfall (`snow`). The table
below shows that 0 mm is by far the most common observation. This is probably because,
from January 1, 1981 through December 31, 2010, most days in New York saw no snowfall
even in winter, and in the warmer months of the year snow is essentially impossible,
so the expected snowfall is 0 mm.
Global warming and climate change could shift these observations over time;
however, that question was not explored in this problem.
```{r}
noaa_df %>% 
  count(snow, name = "n_obs") %>% 
  arrange(desc(n_obs))
```
Now let's make a two-panel plot showing the average maximum temperature in January and
in July at each station across years.
From the graph below, the average January maximum temperature, as expected, ranges from
below 0 to about 10 degrees Celsius across years, with a few outliers.
For example, in 1982 one station's average dropped below -10 degrees Celsius, and in
2004 one was close to 15 degrees Celsius.
In July, as expected, the average maximum temperature was mostly between 25 and 30 degrees
Celsius. A few outliers were observed; for example, in 1987 one station's average
was around 15 degrees Celsius, which is quite unusual for that time of year.
```{r}
noaa_df %>% 
  filter(month %in% c("January", "July")) %>% 
  group_by(month, year, id) %>% 
  summarize(
    mean_tmax = mean(tmax, na.rm = TRUE)) %>% 
  drop_na(mean_tmax) %>% 
  ggplot(aes(x = year, y = mean_tmax, group = id)) +
  geom_point(size = .5, color = "gray48") +
  facet_grid(~ month) +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  labs(title = "Avg. max temperature in Jan. & Jul. by station across years",
       x = "years",
       y = "average max temp")
```
Now, we make a plot of `tmax` vs `tmin` for the full data set. This will be one panel
of the two-panel figure shown at the bottom of the document.
```{r}
tmx_min_p = 
  noaa_df %>% 
  ggplot(aes(x = tmin, y = tmax)) +
  geom_hex() +
  theme(legend.position = "none") +
  labs(title = "Max temp vs Min temp")
```
Let's make another plot showing the distribution of snowfall values greater than 0 and less
than 100 mm, separately by year. The distribution of snowfall appears relatively similar
across years, with peaks differing in some years. This will be the second panel of the
figure at the bottom of the document.
```{r}
snow_p = 
  noaa_df %>% 
  filter(snow > 0, snow < 100) %>% 
  ggplot(aes(x = snow, y = year, fill = year)) +
  geom_density_ridges(alpha = .5, scale = .5) +
  theme(legend.position = "none") +
  labs(title = "Distribution of snowfall values between 0-100mm by year")
```
We combine both graphs created above to display a two-panel plot showing
maximum temperature vs minimum temperature as well as the distribution of
snowfall between 0 and 100 mm by year.
```{r}
tmx_min_p + snow_p
```