-
Notifications
You must be signed in to change notification settings - Fork 9
/
working-with-dataframes.Rmd
432 lines (292 loc) · 9.39 KB
/
working-with-dataframes.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
---
title: 'Working with dataframes'
---
## Working with dataframes {- #working-with-dataframes}
```{r, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, collapse=TRUE, cache=TRUE, message=F)
library(tidyverse)
library(tufte)
library(pander)
```
### Introducing the `tidyverse` {- #tidyverse}
This guide deliberately ignores many common patterns for working with
dataframes.
There are plenty of other guides for working in these older ways, but for
beginners, these techniques can be confusing. The approach shown here is based
only on functions in [the `tidyverse`](#tidyverse). Although simple --- and easy
to read --- the approach is extremely flexible and covers almost all of the
cases you will encounter when working with psychological data.
Specifically, we make extensive use of two tidyverse packages:
- `dplyr`: to select, filter and summarise data
- `ggplot2`: to make plots
To load the tidyverse first write:
```{r, echo=T, message=F}
library(tidyverse)
```
This can either be typed into the console or (better) included at the top of an
markdown file.
<!---
XXX Add note for teachers here???
--->
## Selecting columns {- #selecting-columns}
To pick out single or multiple columns use the `select()` function.
The `select()` function expects a dataframe as it's first input ('argument', in
R language), followed by the names of the columns you want to extract with a
comma between each name.
It returns a _new_ dataframe with just those columns, in the order you
specified:
```{r}
head(
select(mtcars, cyl, hp)
)
```
#### Saving a subset of the data {-}
Because `dplyr` functions return a _new_ dataframe, we can assign the results to
a variable:
```{r}
justcylandweight <- select(mtcars, cyl, wt)
summary(justcylandweight)
```
#### Excluding columns {-}
If you want to keep most of the columns --- perhaps you just want to get rid of
one and keep the rest --- put a minus (`-`) sign in front of the name of the
column to drop. This then selects everything _except_ the column you named:
```{r}
# Note we are just dropping the Ozone column
head(select(airquality, -Ozone))
```
#### Matching specific columns {-}
You can use a patterns to match a subset of the columns you want. For example,
here we select all the columns where the name contains the letter `d`:
```{r}
head(select(mtcars, contains("d")))
```
And you can combine these techniques to make more complex selections:
```{r}
head(select(mtcars, contains("d"), -drat))
```
#### Other methods of selection {-}
As a quick reference, you can use the following 'verbs' to select columns in
different ways:
- `starts_with()`
- `ends_with()`
- `contains()`
- `everything()`
See the help files for more information (type `??dplyr::select` into the
console).
## Selecting rows {- #selecting-rows}
To select rows from a dataframe use the `filter()` function (again from
`dplyr`).
If we only wanted to rows for 6-cylindered cars, we could write:
```{r}
filter(mtcars, cyl==6)
```
## 'Operators' {- #operators}
<!-- <iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/4TYv2PhG89A?rel=0" frameborder="0" allowfullscreen></iframe>
-->
When selecting rows in the [example above](#selecting-rows) we used two equals
signs `==` to select rows where `cyl` was exactly `6`.
As you might guess, there are other 'operators' we can use to create filters.
Rather than describe them, the examples below demonstrate what each of them do.
###### Equality and matching {-}
As above, to compare a single value we use `==`
```{r}
2 == 2
```
And in a filter:
```{r}
filter(mtcars, cyl==6)
```
:::{.explainer}
You might have noted above that we write `==` rather than just `=` to define the
criteria. This is because most programming languages, including R, use two `=`
symbols to distinguish: _comparison_ from _assignment_.
:::
###### Presence/absence {-}
To test if a value is in a vector of suitable matches we can use: `%in%`:
```{r}
5 %in% 1:10
```
Or for an example which is not true:
```{r}
100 %in% 1:10
```
Perhaps less obviously, we can test whether each value in a vector is _in_ a
second vector.
This returns a vector of `TRUE/FALSE` values as long as the first list:
```{r}
c(1, 2) %in% c(2, 3, 4)
```
This is very useful in a dataframe filter:
```{r}
head(filter(mtcars, cyl %in% c(4, 6)))
```
:::{.explainer}
Here we selected all rows where `cyl` matched either `4` or `6`. That is, where
the value of `cyl` was 'in' the vector `c(4,6)`.
:::
###### Greater/less than {-}
The `<` and `>` symbols work as you'd expect:
```{r, eval=F}
head(filter(mtcars, cyl > 4))
head(filter(mtcars, cyl < 5))
```
You can also use `>=` and `<=`:
```{r, eval=F}
filter(mtcars, cyl >= 6)
filter(mtcars, cyl <= 4)
```
###### Negation (opposite of) {-}
The `!` is very useful to tell R to reverse an expression; that is, take the
opposite of the value. In the simplest example:
```{r}
!TRUE
```
This is helpful because we can reverse the meaning of other expressions:
```{r}
is.na(NA)
!is.na(NA)
```
And we can use in dplyr filters.
Here we select rows where `Ozone` is missing (`NA`):
```{r, eval=F}
filter(airquality, is.na(Ozone))
```
And here we use `!` to reverse the expression and select rows which are not
missing:
```{r, eval=F}
filter(airquality, !is.na(Ozone))
```
:::{.exercise}
Try running these commands for yourself and experiment with changing the
operators to make select different combinations of rows
:::
###### Other logical operators {-}
There are operators for 'and'/'or' which can combine other filters.
Using `&` (and) with two condtions makes the filter more restrictive:
```{r}
filter(mtcars, hp > 200 & wt > 4)
```
In contrast, the pipe symbol, `|`, means 'or', so we match more rows:
```{r}
filter(mtcars, hp > 200 | wt > 4)
```
Finally, you can set the order in which operators are applied by using
parentheses. This means these expressions are subtly different:
```{r}
# first
filter(mtcars, (hp > 200 & wt > 4) | cyl==8)
# second reordered evaluation
filter(mtcars, hp > 200 & (wt > 4 | cyl==8))
```
:::{.exercise}
Try writing in plain English the meaning of the two filter expressions above
:::
## Sorting {- #sorting}
Sort dataframes using `arrange()` from `dplyr`:
```{r}
airquality %>%
arrange(Ozone) %>%
head
```
By default sorting is ascending, but you can use a minus sign to reverse this:
```{r}
airquality %>%
arrange(-Ozone) %>%
head
```
You can sort on multiple columns too, but the order of the variables makes a
difference. This:
```{r}
airquality %>%
select(Month, Ozone) %>%
arrange(Month, -Ozone) %>%
head
```
Is different to this:
```{r}
airquality %>%
select(Month, Ozone) %>%
arrange(-Ozone, Month) %>%
head
```
## Pipes {- #pipes}
We often want to combine `select` and `filter` (and other functions) to return a
subset of our original data.
One way to achieve this is to 'nest' function calls.
Taking the `mtcars` data, we can select the weights of cars with a poor `mpg`:
```{r}
gas.guzzlers <- select(filter(mtcars, mpg < 15), wt)
summary(gas.guzzlers)
```
This is OK, but can be confusing to read. The more deeply nested we go, the
easier it is to make a mistake.
#### `tidyverse` provides an alternative to nested function calls, called the 'pipe'. {-}
Imagine your dataframe as a big bucket, containing data.
From this bucket, you can 'pour' your data down the screen, and it passes
through a series of tubes and filters.
At the bottom of your screen you have a smaller bucket, containing only the data
you want.
![Think of your data 'flowing' down the screen.](media/tubes.jpg)
The 'pipe' operator, `%>%` makes our data 'flow' in this way:
```{r}
big.bucket.of.data <- mtcars
big.bucket.of.data %>%
filter(mpg <15) %>%
select(wt) %>%
summary
```
:::{.explainer}
The `%>%` symbol makes the data flow onto the next step. Each function which
follows the pipe takes the incoming data as it's first input.
:::
Pipes do the same thing as nesting functions, but the code stays more readable.
It's especially nice because the order in which the functions happen is the same
as the order in which we read the code (the opposite is true for nested
functions).
We can save intermediate 'buckets' for use later on:
```{r}
smaller.bucket <- big.bucket.of.data %>%
filter(mpg <15) %>%
select(wt)
```
This is an incredibly useful pattern for processing and working with data.
We can 'pour' data through a series of filters and other operations, saving
intermediate states where necessary.
:::{.tip}
You can insert the `%>%` symbol in RStdudio by typing `cmd-shift-M`, which saves
a lot of typing.
:::
## Modifying and creating new columns {- #mutate}
We often want to compute new columns from data we already have.
Imagine we had heights stored in cm, and weights stored in kg for 100
participants in a study on weight loss:
```{r}
set.seed(1234)
weightloss <- tibble(
height_cm = rnorm(100, 150, 20),
weight_kg = rnorm(100, 65, 10)
)
```
```{r}
weightloss %>% head
```
If we want to compute each participants' Body Mass Index, we first need to
convert their height into meters. We do this with mutate:
```{r}
weightloss %>%
mutate(height_meters = height_cm / 100) %>%
head
```
We then want to calculate BMI:
```{r}
weightloss %>%
mutate(height_meters = height_cm / 100,
bmi = weight_kg / height_meters ^ 2) %>%
head
```
:::{.tip}
You could skip the intermediate step of converting to meters and write:
`bmi = weight_kg / (height_cm/100) ^ 2`. But it's often best to be explicit and
simplify each operation.
:::