-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathtidyverse_20190923_166_hwa.rmd
554 lines (439 loc) · 24.4 KB
/
tidyverse_20190923_166_hwa.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
---
output: html_notebook
---
# 5.7 Summarizing data
An important part of exploratory data analysis is summarizing data. The average and standard deviation are two examples of widely used summary statistics. More informative summaries can often be achieved by first splitting data into groups. In this section, we cover two new **dplyr** verbs that make these computations easier: `summarize` and `group_by`. We learn to access resulting values using the `pull` function.
### 5.7.1 `summarize`
The `summarize` function in **dplyr** provides a way to compute summary statistics with intuitive and readable code. We start with a simple example based on heights. The `heights` dataset includes heights and sex reported by students in an in-class survey.
```{r}
library(dplyr)
library(dslabs)
data(heights)
```
The following code computes the average and standard deviation for females:
```{r}
s <- heights %>%
filter(sex == "Female") %>%
summarize(median = median(height), minimum = min(height), maximum = max(height))
s
```
We can obtain these three values with just one line using the `quantile` function: for example, `quantile(x, c(0,0.5,1))` returns the min (0th percentile), median (50th percentile), and max (100th percentile) of the vector `x`. However, if we attempt to use a function like this that returns two or more values inside `summarize`:
```{r}
heights %>%
filter(sex == "Female") %>%
summarize(range = quantile(height, c(0, 0.5, 1)))
```
we will receive an error: `Error: expecting result of length one, got : 2`. With the function `summarize`, we can only call functions that return a single value. In Section 5.12, we will learn how to deal with functions that return more than one value.
For another example of how we can use the `summarize` function, let’s compute the average murder rate for the United States. Remember our data table includes total murders and population size for each state and we have already used **dplyr** to add a murder rate column:
```{r}
data(murders)
murders <- murders %>% mutate(rate = total/population*100000)
```
Remember that the US murder rate is **not** the average of the state murder rates:
```{r}
summarize(murders, mean(rate))
```
This is because in the computation above the small states are given the same weight as the large ones. The US murder rate is the total number of murders in the US divided by the total US population. So the correct computation is:
```{r}
us_murder_rate <- murders %>%
summarize(rate = sum(total) / sum(population) * 100000)
us_murder_rate
```
This computation counts larger states proportionally to their size which results in a larger value.
### 5.7.2 `pull`
The `us_murder_rate` object defined above represents just one number. Yet we are storing it in a data frame:
```{r}
class(us_murder_rate)
```
since, as most **dplyr** functions, `summarize` always returns a data frame.
This might be problematic if we want to use this result with functions that require a numeric value. Here we show a useful trick for accessing values stored in data when using pipes: when a data object is piped that object and its columns can be accessed using the `pull` function. To understand what we mean take a look at this line of code:
```{r}
us_murder_rate %>% pull(rate)
```
This returns the value in the `rate` column of `us_murder_rate` making it equivalent to `us_murder_rate$rate`.
To get a number from the original data table with one line of code we can type:
```{r}
us_murder_rate <- murders %>%
summarize(rate = sum(total) / sum(population) * 100000) %>%
pull(rate)
us_murder_rate
```
which is now a numeric:
```{r}
class(us_murder_rate)
```
### 5.7.3 Group then summarize with `group_by`
A common operation in data exploration is to first split data into groups and then compute summaries for each group. For example, we may want to compute the average and standard deviation for men’s and women’s heights separately. The `group_by` function helps us do this.
If we type this:
```{r}
heights %>% group_by(sex)
```
The result does not look very different from `heights`, except we see `Groups: sex [2]` when we print the object. Although not immediately obvious from its appearance, this is now a special data frame called a _grouped data frame_ and **dplyr** functions, in particular `summarize`, will behave differently when acting on this object. Conceptually, you can think of this table as many tables, with the same columns but not necessarily the same number of rows, stacked together in one object. When we summarize the data after grouping, this is what happens:
```{r}
heights %>%
group_by(sex) %>%
summarize(average = mean(height), standard_deviation = sd(height))
```
The `summarize` function applies the summarization to each group separately.
For another example, let’s compute the median murder rate in the four regions of the country:
```{r}
murders %>%
group_by(region) %>%
summarize(median_rate = median(rate))
```
# 5.8 Sorting data frames
When examining a dataset, it is often convenient to sort the table by the different columns. We know about the `order` and `sort` function, but for ordering entire tables, the **dplyr** function `arrange` is useful. For example, here we order the states by population size:
```{r}
murders %>%
arrange(population) %>%
head()
```
With `arrange` we get to decide which column to sort by. To see the states by rates, from smallest to largest, we arrange by `rate` instead:
```{r}
murders %>%
arrange(rate) %>%
head()
```
Note that the default behavior is to order in ascending order. In **dplyr**, the function `desc` transforms a vector so that it is in descending order. To sort the table in descending order, we can type:
```{r}
murders %>%
arrange(desc(rate)) %>%
head()
```
### 5.8.1 Nested sorting
If we are ordering by a column with ties, we can use a second column to break the tie. Similarly, a third column can be used to break ties between first and second and so on. Here we order by `region`, then within region we order by murder rate:
```{r}
murders %>%
arrange(region, rate) %>%
head()
```
### 5.8.2 The top _n_
In the code above, we have used the function `head` to avoid having the page fill up with the entire dataset. If we want to see a larger proportion, we can use the `top_n` function. This function takes a data frame as it’s first argument, the number of rows to show in the second, and the variable to filter by in the third. Here is an example of how to see the top 10 rows:
```{r}
murders %>% top_n(10, rate)
```
Note that rows are not sorted by `rate`, only filtered. If want to sort, we need to use `arrange`. Note that if the third argument is left blank, `top_n`, filters by the last column.
# 5.9 Exercises
For these exercises, we will be using the data from the survey collected by the United States National Center for Health Statistics (NCHS). This center has conducted a series of health and nutrition surveys since the 1960’s. Starting in 1999, about 5,000 individuals of all ages have been interviewed every year and they complete the health examination component of the survey. Part of the data is made available via the **NHANES** package. Once you install the **NHANES** package, you can load the data like this:
```{r}
library(NHANES)
data(NHANES)
```
The **NHANES** data has many missing values. Remember that the main summarization function in R will return `NA` if any of the entries of the input vector is an `NA`. Here is an example:
```{r}
library(dslabs)
data(na_example)
mean(na_example)
sd(na_example)
```
To ignore the `NA`s we can use the `na.rm` argument:
```{r}
mean(na_example, na.rm = TRUE)
sd(na_example, na.rm = TRUE)
```
Let’s now explore the NHANES data.
1. We will provide some basic facts about blood pressure. First let’s select a group to set the standard. We will use 20-29 year old females. `AgeDecade` is a categorical variable with these ages. Note that the category is coded like " 20-29“, with a space in front! What is the average and standard deviation of systolic blood pressure as saved in the `BPSysAve` variable? Save it to a variable called `ref`.
Hint: Use `filter` and `summarize` and use the `na.rm = TRUE` argument when computing the average and standard deviation. You can also filter the NA values using `filter`.
```{r}
head(NHANES)
```
```{r}
ref <- NHANES %>%
filter(AgeDecade == ' 20-29') %>%
summarize(average = mean(NHANES$BPSysAve, na.rm = TRUE), standard_deviation = sd(NHANES$BPSysAve, na.rm = TRUE))
ref
```
2. Using a pipe, assign the average to a numeric variable `ref_avg`. Hint: Use the code similar to above and then `pull`.
```{r}
ref_avg <- ref %>% pull(average)
ref_avg
```
3. Now report the min and max values for the same group.
```{r}
NHANES %>%
filter(AgeDecade == ' 20-29') %>%
summarize(minimum = min(BPSysAve, na.rm = TRUE), maximum = max(BPSysAve, na.rm = TRUE))
```
4. Compute the average and standard deviation for females, but for each age group separately rather than a selected decade as in question 1. Note that the age groups are defined by `AgeDecade`. Hint: rather than filtering by age and gender, filter by `Gender` and then use `group_by`.
```{r}
NHANES %>%
filter(Gender == 'female') %>%
group_by(AgeDecade) %>%
summarize(average = mean(BPSysAve, na.rm = TRUE), standard_deviation = sd(BPSysAve, na.rm = TRUE))
```
5. Repeat exercise 4 for males.
```{r}
NHANES %>%
filter(Gender == 'male') %>%
group_by(AgeDecade) %>%
summarize(average = mean(BPSysAve, na.rm = TRUE), standard_deviation = sd(BPSysAve, na.rm = TRUE))
```
6. We can actually combine both summaries for exercises 4 and 5 into one line of code. This is because `group_by` permits us to group by more than one variable. Obtain one big summary table using `group_by(AgeDecade, Gender)`.
```{r}
NHANES %>%
group_by(AgeDecade, Gender) %>%
summarize(average = mean(BPSysAve, na.rm = TRUE), standard_deviation = sd(BPSysAve, na.rm = TRUE))
```
7. For males between the ages of 40-49, compare systolic blood pressure across race as reported in the `Race1` variable. Order the resulting table from lowest to highest average systolic blood pressure.
```{r}
NHANES %>%
filter(Gender == 'male', AgeDecade == ' 40-49') %>%
group_by(Race1) %>%
summarize(average = mean(BPSysAve, na.rm = TRUE)) %>%
arrange(average)
```
# 5.10 Tibbles
Tidy data must be stored in data frames. We introduced the data frame in Section 3.5 and have been using the `murders` data frame throughout the book:
```{r}
data(murders)
class(murders)
```
In Section 5.7.3 we introduced the `group_by` function, which permits stratifying data before computing summary statistics. But where is the group information stored in the data frame?
```{r}
murders %>% group_by(region) %>% head()
#> # A tibble: 6 x 5
#> # Groups: region [2]
#> state abb region population total
#> <chr> <chr> <fct> <dbl> <dbl>
#> 1 Alabama AL South 4779736 135
#> 2 Alaska AK West 710231 19
#> 3 Arizona AZ West 6392017 232
#> 4 Arkansas AR South 2915918 93
#> 5 California CA West 37253956 1257
#> 6 Colorado CO West 5029196 65
```
Notice that there are no columns with this information. But, if you look closely at the output above, you see the line `A tibble: 6 x 5`. We can learn the class of the returned object using:
```{r}
murders %>% group_by(region) %>% class()
```
The `tbl`, pronounced tibble, is a special kind of data frame. The functions `group_by` and `summarize` always return this type of data frame. The `group_by` function returns a special kind of `tbl`, the `grouped_df`. We will say more about these later. For consistency, the __dplyr__ manipulation verbs (`select`, `filter`, `mutate`, and `arrange`) preserve the class of the input: if they receive a regular data frame they return a regular data frame, while if they receive a tibble they return a tibble. But tibbles are the preferred format in the tidyverse and as a result tidyverse functions that produce a data frame from scratch return a tibble. For example, in Chapter 6 we will see that tidyverse functions used to import data create tibbles.
Tibbles are very similar to data frames. In fact, you can think of them as a modern version of data frames. Nonetheless there are three important differences which we describe in the next.
### 5.10.1 Tibbles display better
The print method for tibbles is more readable than that of a data frame. To see this, compare the outputs of typing `murders` and the output of murders if we convert it to a tibble. We can do this using `as_tibble(murders)`. If using RStudio, output for a tibble adjusts to your window size. To see this, change the width of your R console and notice how more/less columns are shown.
### 5.10.2 Subsets of tibbles are tibbles
If you subset the columns of a data frame, you may get back an object that is not a data frame, such as a vector or scalar. For example:
```{r}
class(murders[,4])
```
is not a data frame. With tibbles this does not happen:
```{r}
class(as_tibble(murders)[,4])
```
This is useful in the tidyverse since functions require data frames as input.
With tibbles, if you want to access the vector that defines a column, and not get back a data frame, you need to use the accessor `$`:
```{r}
class(as_tibble(murders)$population)
```
A related feature is that tibbles will give you a warning if you try to access a column that does not exist. If we accidentally write `Population` instead of `population` this:
```{r}
murders$Population
```
returns a `NULL` with no warning, which can make it harder to debug. In contrast, if we try this with a tibble we get an informative warning:
```{r}
as_tibble(murders)$Population
```
### 5.10.3 Tibbles can have complex entries
While data frame columns need to be vectors of numbers, strings or logical values, tibbles can have more complex objects, such as lists or functions. Also, we can create tibbles with functions:
```{r}
tibble(id = c(1, 2, 3), func = c(mean, median, sd))
```
### 5.10.4 Tibbles can be grouped
The function `group_by` returns a special kind of tibble: a grouped tibble. This class stores information that lets you know which rows are in which groups. The tidyverse functions, in particular the `summarize` function, are aware of the group information.
### 5.10.5 Create a tibble using `data_frame` instead of `data.frame`
It is sometimes useful for us to create our own data frames. To create a data frame in the tibble format, you can do this by using the `data_frame` function.
```{r}
grades <- data_frame(names = c("John", "Juan", "Jean", "Yao"),
exam_1 = c(95, 80, 90, 85),
exam_2 = c(90, 85, 85, 90))
grades
```
Note that base R (without packages loaded) has a function with a very similar name, `data.frame`, that can be used to create a regular data frame rather than a tibble. One other important difference is that by default `data.frame` coerces characters into factors without providing a warning or message:
```{r}
grades <- data.frame(names = c("John", "Juan", "Jean", "Yao"),
exam_1 = c(95, 80, 90, 85),
exam_2 = c(90, 85, 85, 90))
class(grades$names)
```
To avoid this, we use the rather cumbersome argument `stringsAsFactors`:
```{r}
grades <- data.frame(names = c("John", "Juan", "Jean", "Yao"),
exam_1 = c(95, 80, 90, 85),
exam_2 = c(90, 85, 85, 90),
stringsAsFactors = FALSE)
class(grades$names)
```
To convert a regular data frame to a tibble, you can use the `as_tibble` function.
```{r}
as_tibble(grades) %>% class()
```
# 5.11 The dot operator
One of the advantages of using the pipe `%>%` is that we do not have to keep naming new objects as we manipulate the data frame. As quick reminder, if we want to compute the median murder rate for states in the southern states, instead of typing:
```{r}
tab_1 <- filter(murders, region == "South")
tab_2 <- mutate(tab_1, rate = total / population * 10^5)
rates <- tab_2$rate
median(rates)
```
We can avoid defining any new intermediate objects by instead typing:
```{r}
filter(murders, region == "South") %>%
mutate(rate = total / population * 10^5) %>%
summarize(median = median(rate)) %>%
pull(median)
```
We can do this because each of these functions takes a data frame as the first argument. But what if we want to access a component of the data frame. For example, what if the `pull` function was not available and we wanted to access `tab_2$rate`? What data frame name would we use? The answer is the dot operator.
For example to access the rate vector without the `pull` function we could use
```{r}
rates <-filter(murders, region == "South") %>%
mutate(rate = total / population * 10^5) %>%
.$rate
median(rates)
```
In the next section, we will see other instances in which using the `.` is useful.
# 5.12 `do`
The tidyverse functions know how to interpret grouped tibbles. Furthermore, to facilitate stringing commands through the pipe `%>%`, tidyverse functions consistently return data frames, since this assures that the output of a function is accepted as the input of another. But most R functions do not recognize grouped tibbles nor do they return data frames. The `quantile` function is an example we described in Section 5.7.1. The `do` functions serves as a bridge between R functions such as `quantile` and the tidyverse. The `do` function understands grouped tibbles and always returns a data frame.
In Section 5.7.1, we noted that if we attempt to use `quantile` to obtain the min, median and max in one call, we will receive an error: `Error: expecting result of length one, got : 2`.
```{r}
data(heights)
heights %>%
filter(sex == "Female") %>%
summarize(range = quantile(height, c(0, 0.5, 1)))
```
We can use the `do` function fix this.
First we have to write a function that fits into the tidyverse approach: that is, it receives a data frame and returns a data frame.
```{r}
my_summary <- function(dat){
x <- quantile(dat$height, c(0, 0.5, 1))
data_frame(min = x[1], median = x[2], max = x[3])
}
```
We can now apply the function to the heights dataset to obtain the summaries:
```{r}
heights %>%
group_by(sex) %>%
my_summary
```
But this is not what we want. We want a summary for each sex and the code returned just one summary. This is because `my_summary` is not part of the tidyverse and does not know how to handled grouped tibbles. `do` makes this connection:
```{r}
heights %>%
group_by(sex) %>%
do(my_summary(.))
```
Note that here we need to use the dot operator. The tibble created by `group_by` is piped to `do`. Within the call to `do`, the name of this tibble is `.` and we want to send it to `my_summary`. If you do not use the dot, then `my_summary` has _no argument_ and returns an error telling us that `argument "dat"` is missing. You can see the error by typing:
```{r}
heights %>%
group_by(sex) %>%
do(my_summary())
```
If you do not use the parenthesis, then the function is not executed and instead do tries to return the function. This gives an error because do must always return a data frame. You can see the error by typing:
```{r}
heights %>%
group_by(sex) %>%
do(my_summary)
```
# 5.13 The purrr package
In Section 4.5 we learned about the `sapply` function, which permitted us to apply the same function to each element of a vector. We constructed this function:
```{r}
compute_s_n <- function(n){
x <- 1:n
sum(x)
}
```
and used `sapply` to compute the sum of the first `n` integers for several values of `n` like this:
```{r}
n <- 1:25
s_n <- sapply(n, compute_s_n)
```
This type of operation, applying the same function or procedure to elements of an object, is quite common in data analysis. The **purrr** package includes functions similar to `sapply` but that better interact with other tidyverse functions. The main advantage is that we can better control the output type of functions. In contrast, `sapply` can return several different object types; for example, we might expect a numeric result from a line of code, but `sapply` might convert our result to character under some circumstances. **purrr** functions will never do this: they will return objects of a specified type or return an error if this is not possible.
The first **purrr** function we will learn is `map`, which works very similar to `sapply` but always, without exception, returns a list:
```{r}
library(purrr)
s_n <- map(n, compute_s_n)
class(s_n)
```
If we want a numeric vector, we can instead use `map_dbl` which always returns a vector of numeric values.
```{r}
s_n <- map_dbl(n, compute_s_n)
class(s_n)
```
This produces the same results as the `sapply` call shown above.
A particularly useful **purrr** function for interacting with the rest of the tidyverse is `map_df`, which always returns a tibble data frame. However, the function being called needs to return a vector a or list with names. For this reason, the following code would result in a `Argument 1 must have names` error:
```{r}
s_n <- map_df(n, compute_s_n)
```
We need to change the function to make this work:
```{r}
compute_s_n <- function(n){
x <- 1:n
data_frame(sum = sum(x))
}
s_n <- map_df(n, compute_s_n)
head(s_n)
```
The **purrr** package provides much more functionality not covered here. For more details you can consult [this online resource](https://jennybc.github.io/purrr-tutorial/).
# 5.14 Tidyverse conditionals
A typical data analysis will often involve one or more conditional operation. In Section 4.1 we described the `ifelse` function, which we will use extensively in this book. In this section we present two **dplyr** functions that provide further functionality for performing conditional operations.
### 5.14.1 `case_when`
The `case_when` function is useful for vectorizing conditional statements. It is similar to `ifelse` but can output any number of values, as opposed to just `TRUE` or `FALSE`. Here is an example splitting numbers into negative, positives and 0:
```{r}
x <- c(-2, -1, 0, 1, 2)
case_when(x < 0 ~ "Negative", x > 0 ~ "Positive", TRUE ~ "Zero")
```
A common use for this function is to define categorical variables based on existing variables. For example, suppose we we want compare the murder rates in in three groups of states: _New England_, _West Coast_, _South_, and _other_. For each state, we need to ask if it is in New England, if it is not we ask if it is in the West Coast, if not we ask if it is in the South and if not we assign _other_. Here is how we use `case_when` to do this:
```{r}
murders %>%
mutate(group = case_when(
abb %in% c("ME", "NH", "VT", "MA", "RI", "CT") ~ "New England",
abb %in% c("WA", "OR", "CA") ~ "West Coast",
region == "South" ~ "South",
TRUE ~ "other")) %>%
group_by(group) %>%
summarize(rate = sum(total) / sum(population) * 10^5) %>%
arrange(rate)
```
### 5.14.2 `between`
A common operation in data analysis is to determine if a value falls inside an interval. We can check this using conditionals. For example to check if the elements of a vector `x` are between `a` and `b` we can type
```{r}
x >= a & x <= b
```
However, this can become cumbersome, especially within the tidyverse approach. The `between` function performs the same operation.
```{r}
between(x, a, b)
```
# 5.15 Exercises
1. Load the `murders` dataset. Which of the following is true?
A. `murders` is in tidy format and is stored in a tibble.
**B. `murders` is in tidy format and is stored in a data frame.**
C. `murders` is not in tidy format and is stored in a tibble.
D. `murders` is not in tidy format and is stored in a data frame.
```{r}
class(murders)
head(murders)
```
2. Use `as_tibble` to covert the `murders` data table into a tibble and save it in an object called `murders_tibble`.
```{r}
murders_tibble <- as_tibble(murders)
```
3. Use the `group_by` function to convert murders into a tibble that is grouped by region.
```{r}
murders_tibble %>%
group_by(region)
```
4. Write tidyverse code that is equivalent to this code: `exp(mean(log(murders$population)))`. Write it using the pipe so that each function is called without arguments. Use the dot operator to access the population. Hint: The code should start with `murders %>%`.
```{r}
calc <- function(dat){
x <- exp(mean(log(dat$population)))
data_frame(x)
}
murders %>%
do(calc(.))
```
5. Use the `map_df` to create a data frame with three columns named `n`, `s_n`, and `s_n_2`. The first column should contain the numbers 1 through 100. The second and third columns should each contain the sum of 1 through $n$ with $n$ the row number.
```{r}
compute_s_n <- function(n){
x <- 1:n
data_frame(n = n, s_n = sum(x), s_n_2 = sum(x))
}
n <- 1:100
map_df(n, compute_s_n)
```