forked from hadley/ggplot2-book
-
Notifications
You must be signed in to change notification settings - Fork 1
/
tidy-data.rmd
336 lines (236 loc) · 15.6 KB
/
tidy-data.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
---
output: oldbookdown::html_chapter
bibliography: references.bib
---
```{r data, include = FALSE}
chapter <- "tidy_data"
source("common.R")
library(tidyr)
```
# Data analysis {#cha:data}
## Introduction
So far, every example in this book has started with a nice dataset that's easy to plot. That's great for learning (because you don't want to struggle with data handling while you're learning visualisation), but in real life, datasets hardly ever come in exactly the right structure. To use ggplot2 in practice, you'll need to learn some data wrangling skills. Indeed, in my experience, visualisation is often the easiest part of the data analysis process: once you have the right data, in the right format, aggregated in the right way, the right visualisation is often obvious.
The goal of this part of the book is to show you how to integrate ggplot2 with other tools needed for a complete data analysis:
* In this chapter, you'll learn the principles of tidy data [@tidy-data],
which help you organise your data in a way that makes it easy to visualise
with ggplot2, manipulate with dplyr and model with the many modelling
packages. The principles of tidy data are supported by the __tidyr__ package,
which helps you tidy messy datasets.
* Most visualisations require some data transformation whether it's
creating a new variable from existing variables, or performing simple
aggregations so you can see the forest for the trees. [dplyr](#cha:dplyr)
will show you how to do this with the __dplyr__ package.
* If you're using R, you're almost certainly using it for its fantastic
modelling capabilities. While there's an R package for almost every type
of model that you can think of, the results of these models can be hard to
visualise. In [modelling](#cha:modelling), you'll learn about the __broom__
package, by David Robinson, to convert models into tidy datasets so you can
easily visualise them with ggplot2.
Tidy data is the foundation for data manipulation and visualising models. In the following sections, you'll learn the definition of tidy data, and the tools you need to make messy data tidy. The chapter concludes with two case studies that show how to apply the tools in sequence to work with real(istic) data.
## Tidy data {#sec:tidy-data}
The principle behind tidy data is simple: storing your data in a consistent way makes it easier to work with it. Tidy data is a mapping between the statistical structure of a data frame (variables and observations) and the physical structure (columns and rows). Tidy data follows two main principles: \index{Tidy data} \index{Data!best form for ggplot2}
1. Variables go in columns.
1. Observations go in rows.
Tidy data is particularly important for ggplot2 because the job of ggplot2 is to map variables to visual properties: if your data isn't tidy, you'll have a hard time visualising it.
Sometimes you'll find a dataset that you have no idea how to plot. That's normally because it's not tidy. For example, take this data frame that contains monthly employment data for the United States:
```{r ec2, include = FALSE}
library("lubridate")
ec2 <-
ggplot2::economics %>%
tbl_df() %>%
transmute(year = year(date), month = month(date), rate = uempmed) %>%
filter(year > 2005) %>%
spread(year, rate)
```
```{r}
ec2
```
(If it looks familiar, it's because it's derived from the `economics` dataset that we used earlier in the book.)
Imagine you want to plot a time series showing how unemployment has changed over the last 10 years. Can you picture the ggplot2 command you'd need to do it? What if you wanted to focus on the seasonal component of unemployment by putting months on the x-axis and drawing one line for each year? It's difficult to see how to create those plots because the data is not tidy. There are three variables, month, year and unemployment rate, but each variable is stored in a different way:
* `month` is stored in a column.
* `year` is spread across the column names.
* `rate` is the value of each cell.
To make it possible to plot this data we first need to tidy it. There are two important pairs of tools:
* Spread & gather.
* Separate & unite.
## Spread and gather {#sec:spread-gather}
Take a look at the two tables below:
```{r sample-data, echo = FALSE}
indexed <- data.frame(
x = c("a", "b", "c", "d", "c"),
y = c("A", "D", "A", "C", "B"),
z = c(1, 5, 4, 9, 10)
) %>% arrange(x, y)
matrix <- indexed %>% spread(y, z)
knitr::kable(indexed)
knitr::kable(matrix)
```
If you study them for a little while, you'll notice that they contain the same data in different forms. I call the first form __indexed__ data, because you look up a value using an index (the values of the `x` and `y` variables). I call the second form __Cartesian__ data, because you find a value by looking at intersection of a row and a column. We can't tell if these datasets are tidy or not. Either form could be tidy depending on what the values "A", "B", "C", "D" mean.
(Also note the missing values: missing values that are explicit in one form may be implicit in the other. An `NA` is the presence of an absense; but sometimes a missing value is the absense of a presence.)
Tidying your data will often require translating Cartesian → indexed forms, called __gathering__, and less commonly, indexed → Cartesian, called __spreading__. The tidyr package provides the `spread()` and `gather()` functions to perform these operations, as described below.
(You can imagine generalising these ideas to higher dimensions. However, data is almost always stored in 2d (rows & columns), so these generalisations are fun to think about, but not that practical. I explore the idea more in @wickham:2007b.
### Gather
`gather()` has four main arguments: \indexf{gather}
* `data`: the dataset to translate.
* `key` & `value`: the key is the name of the variable that will be created
from the column names, and the value is the name of the variable that will
be created from the cell values.
* `...`: which variables to gather. You can specify individually,
`A, B, C, D`, or as a range `A:D`. Alternatively, you can specify which
columns are _not_ to be gathered with `-`: `-E, -F`.
To tidy the economics dataset shown above, you first need to identify the variables: `year`, `month` and `rate`. `month` is already in a column, but `year` and `rate` are in Cartesian form, and we want them in indexed form, so we need to use `gather()`. In this example, the key is `year`, the value is `unemp` and we want to select columns from `2006` to `2015`:
```{r ec2-gather}
gather(ec2, key = year, value = unemp, `2006`:`2015`)
```
Note that the columns have names that are not standard variable names in R (they don't start with a letter). This means that we need to surround them in backticks, i.e. `` `2006` `` to refer to them.
Alternatively, we could gather all columns except `month`:
```{r ec2-gather-exclude}
gather(ec2, key = year, value = unemp, -month)
```
To be most useful, we can provide two extra arguments:
```{r ec2-gather-extra-args}
economics_2 <- gather(ec2, year, rate, `2006`:`2015`,
convert = TRUE, na.rm = TRUE)
economics_2
```
We use `convert = TRUE` to automatically convert the years from character strings to numbers, and `na.rm = TRUE` to remove the months with no data. (In some sense the data isn't actually missing because it represents dates that haven't occurred yet.)
When the data is in this form, it's easy to visualise in many different ways. For example, we can choose to emphasise either long term trend or seasonal variations:
`r columns(2, 2/3)`
```{r ec2-plots}
ggplot(economics_2, aes(year + (month - 1) / 12, rate)) +
geom_line()
ggplot(economics_2, aes(month, rate, group = year)) +
geom_line(aes(colour = year), size = 1)
```
### Spread
`spread()` is the opposite of `gather()`. You use it when you have a pair of columns that are in indexed form, instead of Cartesian form. For example, the following example dataset contains three variables (`day`, `rain` and `temp`), but `rain` and `temp` are stored in indexed form. \indexf{spread}
```{r weather}
weather <- dplyr::data_frame(
day = rep(1:3, 2),
obs = rep(c("temp", "rain"), each = 3),
val = c(c(23, 22, 20), c(0, 0, 5))
)
weather
```
Spread allows us to turn this messy indexed form into a tidy Cartesian form. It shares many of the arguments with `gather()`. You'll need to supply the `data` to translate, as well as the name of the `key` column which gives the variable names, and the `value` column which contains the cell values. Here the key is `obs` and the value is `val`:
```{r weather-spread}
spread(weather, key = obs, value = val)
```
### Exercises
1. How can you translate each of the initial example datasets into
the other form?
1. How can you convert back and forth between the `economics` and
`economics_long` datasets built into ggplot2?
1. Install the EDAWR package from <https://github.com/rstudio/EDAWR>.
Tidy the `storms`, `population` and `tb` datasets.
## Separate and unite {#sec:separate-unite}
Spread and gather help when the variables are in the wrong place in the dataset. Separate and unite help when multiple variables are crammed into one column, or spread across multiple columns. \indexf{separate} \indexf{unite}
For example, the following dataset stores some information about the response to a medical treatment. There are three variables (time, treatment and value), but time and treatment are jammed in one variable together:
```{r trt}
trt <- dplyr::data_frame(
var = paste0(rep(c("beg", "end"), each = 3), "_", rep(c("a", "b", "c"))),
val = c(1, 4, 2, 10, 5, 11)
)
trt
```
The `separate()` function makes it easy to tease apart multiple variables stored in one column. It takes four arguments:
* `data`: the data frame to modify.
* `col`: the name of the variable to split into pieces.
* `into`: a character vector giving the names of the new variables.
* `sep`: a description of how to split the variable apart. This can either be
a regular expression, e.g. `_` to split by underscores, or `[^a-z]` to split
by any non-letter, or an integer giving a position.
In this case, we want to split by the `_` character:
```{r trt-separate}
separate(trt, var, c("time", "treatment"), "_")
```
(If the variables are combined in a more complex form, have a look at `extract()`. Alternatively, you might need to create columns individually yourself using other calculations. A useful tool for this is `mutate()` which you'll learn about in the next chapter.)
`unite()` is the inverse of `separate()` - it joins together multiple columns into one column. This is less common, but it's useful to know about as the inverse of `separate()`.
### Exercises
1. Install the EDAWR package from <https://github.com/rstudio/EDAWR>.
Tidy the `who` dataset.
1. Work through the demos included in the tidyr package
(`demo(package = "tidyr")`)
## Case studies {#sec:tidy-case-study}
For most real datasets, you'll need to use more than one tidying verb. There many be multiple ways to get there, but as long as each step makes the data tidier, you'll eventually get to the tidy dataset. That said, you typically apply the functions in the same order: `gather()`, `separate()` and `spread()` (although you might not use all three).
### Blood pressure
The first step when tidying a new dataset is always to identify the variables. Take the following simulated medical data. There are seven variables in this dataset: name, age, start date, week, systolic & diastolic blood pressure. Can you see how they're stored?
````{r bp0}
# Adapted from example by Barry Rowlingson,
# http://barryrowlingson.github.io/hadleyverse/
bpd <- readr::read_table(
"name age start week1 week2 week3
Anne 35 2014-03-27 100/80 100/75 120/90
Ben 41 2014-03-09 110/65 100/65 135/70
Carl 33 2014-04-02 125/80 <NA> <NA>
", na = "<NA>")
```
The first step is to convert from Cartesian to indexed form:
```{r bp1}
bpd_1 <- gather(bpd, week, bp, week1:week3)
bpd_1
```
This is tidier, but we have two variables combined together in the `bp` variable. This is a common way of writing down the blood pressure, but analysis is easier if we break it into two variables. That's the job of separate:
```{r bp2}
bpd_2 <- separate(bpd_1, bp, c("sys", "dia"), "/")
bpd_2
```
This dataset is now tidy, but we could do a little more to make it easier to use. The following code uses `extract()` to pull the week number out into its own variable (using regular expressions is beyond the scope of the book, but `\\d` stands for any digit). I also use `arrange()` (which you'll learn about in the next chapter) to order the rows to keep the records for each person together.
```{r bp3}
bpd_3 <- extract(bpd_2, week, "week", "(\\d)", convert = TRUE)
bpd_4 <- dplyr::arrange(bpd_3, name, week)
bpd_4
```
You might notice that there's some repetition in this dataset: if you know the name, then you also know the age and start date. This reflects a third condition of tidyness that I don't discuss here: each data frame should contain one and only one data set. Here there are really two datasets: information about each person that doesn't change over time, and their weekly blood pressure measurements. You can learn more about this sort of messiness in the resources mentioned at the end of the chapter.
### Test scores
Imagine you're interested in the effect of an intervention on test scores. You've collected the following data. What are the variables?
```{r scores0}
# Adapted from http://stackoverflow.com/questions/29775461
scores <- dplyr::data_frame(
person = rep(c("Greg", "Sally", "Sue"), each = 2),
time = rep(c("pre", "post"), 3),
test1 = round(rnorm(6, mean = 80, sd = 4), 0),
test2 = round(jitter(test1, 15), 0)
)
scores
```
I think the variables are person, test, pre-test score and post-test score. As usual, we start by converting columns in Cartesian form (`test1` and `test2`) to indexed form (`test` and `score`):
```{r scores1}
scores_1 <- gather(scores, test, score, test1:test2)
scores_1
```
Now we need to do the opposite: `pre` and `post` should be variables, not values, so we need to spread `time` and `score`:
```{r scores2}
scores_2 <- spread(scores_1, time, score)
scores_2
```
A good indication that we have made a tidy dataset is that it's now easy to calculate the statistic of interest: the difference between pre- and post-intervention scores:
```{r scores3}
scores_3 <- mutate(scores_2, diff = post - pre)
scores_3
```
And it's similarly easy to plot:
```{r scores4}
ggplot(scores_3, aes(person, diff, color = test)) +
geom_hline(size = 2, colour = "white", yintercept = 0) +
geom_point() +
geom_path(aes(group = person), colour = "grey50",
arrow = arrow(length = unit(0.25, "cm")))
```
(Again, you'll learn about `mutate()` in the next chapter.)
## Learning more
Data tidying is a big topic and this chapter only scratches the surface. I recommend the following references which go into considerably more depth on this topic:
* The tidyr documentation. I've described the most important arguments, but most
functions have other arguments that help deal with less common situations.
If you're struggling, make sure to read the documentation to see if there's
an argument that might help you.
* "[Tidy data](http://www.jstatsoft.org/v59/i10/)", an article in the _Journal
of Statistical Software_. It describes the ideas of tidy data in more depth
and shows other types of messy data. Unfortunately the paper was written
before tidyr existed, so to see how to use tidyr instead of reshape2, consult
the
[tidyr vignette](http://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html).
* The [data wrangling cheatsheet](http://rstudio.com/cheatsheets) by RStudio,
includes the most common tidyr verbs in a form designed to jog your memory
when you're stuck.
## References