-
Notifications
You must be signed in to change notification settings - Fork 0
/
08-functional_p.Rmd
executable file
·491 lines (381 loc) · 22 KB
/
08-functional_p.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
---
output:
pdf_document: default
html_document: default
---
# Functional Programming {#rprog4}
```{r rprog3-1, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.align = 'center')
library(tidyverse)
library(lubridate) # for working with date-times
library(knitr)
library(bookdown)
library(scales)
library(kableExtra)
```
## Ch. 8 Objectives
This chapter is designed around the following learning objectives for functional
programming in R. Upon completing this chapter, you should be able to:
- Describe the basic tenets of functional programming
- Define functions and their basic attributes
- Create simple functions using named and optional arguments
- Define the meaning of *"vectorized operations"* in R
- Interpret and apply the `purrr::map` family of functions
## What is ***functional programming***?
*Functional programming* means just that: programming with functions. A basic
philosophy of functional programming in R is to replace "for-loops" with functions.
While there isn't anything inherently wrong with for-loops, they can be difficult
to follow, especially when they are nested within one to another.
R is well-poised to support the elimination of "for-loops" because R is a
*vectorized language*. To be vectorized means that you don't need a for-loop to
execute this code:
``` {r vectorized-code, eval=FALSE}
evens <- seq(from = 2, to = 20, by = 2)
evens_squared <- evens^2
```
with code like this:
```{r for-loop-code, eval=FALSE}
for( i in seq(1, length(evens))){
evens_squared[i] <- evens[i] * evens[i]
}
```
The top code performs the `squared` operator across each entry in the
vector, one after another. The bottom set uses a for-loop to accomplish the
same task, albeit in a longer (and harder to follow) version.
The pipe operator `%>%` is an ally in this endeavor because it allows you to
pass an object (like a vector or data frame) through series of functions in serial order. The pipe function is also easier to follow (with your mind) because
it allows you to interpret the code as if you were reading instructions in a
"how-to" manual:
*"take this object, then do this, then do this, then do that"*. **Without** the `%>%`
operator, we are forced to nest functions together or write long and verbose
code that steps through treatments slowly (and somewhat painfully). Looking at
the code below should make clear which set is easier to comprehend:
``` {r nested-or-piped, eval=FALSE}
#nested code
daily_show_2000 <- select(filter(rename(daily_show,
year = YEAR,
job = GoogleKnowlege_Occupation,
date = Show,
name = Raw_Guest_List),
year == 2000),
job, date, name)
#piped code
daily_show_2000 <- daily_show %>%
rename(year = YEAR,
job = GoogleKnowlege_Occupation,
date = Show,
name = Raw_Guest_List) %>%
filter(year == 2000) %>%
select(job, date, name)
```
In the nested example above, we "see" the `dplyr::select()` function first,
yet the arguments to this function show up last (at the end of the nested code).
This separation between functions and their arguments only gets worse as the
degree of nesting increases. The piped code, however, produces the same result
as the nested code but in a much more "readable" fashion. A similar analogy
holds for the nesting of "for-loops" (which, when nested, are even
harder to follow!).
## Writng Functions
> <span style="color: blue;"> "If you have to do something more than twice,
use a function to do it." - Programming Proverb </span>
Up to this point, we have relied entirely on functions sourced from base R or
from packages like `Tidyverse::`(and for good reason - they are very useful).
Every programmer, however, at some point in their journey, discovers that the
function they are desiring simply doesn't exist - at least not in the way that
suits their vision. To that end, it's worth learning how to create your own
functions.
The syntax for function generation is relatively straight forward:
``` {r function-gen, eval=FALSE}
#example code; will not run
function_name <- function(argument1, argument2, ...) {
# insert code to manipulate arguments here
# maybe some code to check validity of input arguments
# specify a return value, if desired
}
```
Each function has **arguments** as input, **body code** as the working parts, and
an **environment** where the function resides. These components can be
accessed by passing the function name as an argument to:
- `formals()`, to see arguments
- `body()`, to see the function's code
- or type just the function name, without `()`, into the console
- *but note that many base R functions are coded in C++* and can be accessed
[on GitHub](https://github.com/wch/r-source){target="_blank"}.
- `environment()`, to see where the function resides.
- more on *R environments* below
### Example Function: my_mean
Let's create a simple function, named `my_mean`, to calculate the arithmetic mean
for a numeric vector. This function takes only one argument, a vector `x`, for
which we calculate the mean (`sum(x) / length(x)`), assign it to a
value `y`, and the return `y` as output. Note that all the *"action"* for `my_mean`
happens within the curly braces `{ }` that follow the function assignment.
``` {r my-mean}
my_mean <- function(x) {
y <- sum(x) / length(x)
return(y)
}
my_mean(1:5)
```
While this function is simple and straightforward, it does have some limitations.
For example, what happens if we pass a character vector to `my_mean` or a numeric
vector that contains `NA` values? In the former case, we will get an error
because our function uses the base R function `sum()`, which requires numeric,
logical, or complex vectors as input (so our function inherits the properties
of that function).
If any `NA` values are present, however, we are less fortunate: only NA is
returned. This many not seem like a big deal but if your analysis depends on
calling `my_mean` in several locations (over and over), you might have a hard
time debugging it...
``` {r my-mean-NA}
my_mean(c(1, 2, 3, 4, 5, NA))
```
For this reason, functions often contain *optional arguments* that allow the user
to specify how to handle such errors. The function `my_mean2` below contains
an optional argument, `na.rm = TRUE`, that defaults to remove `NA` values when
present. This function uses `if else` logic to handle the `na.rm = TRUE`
argument, since this argument can only have one of two values.
``` {r my-mean2}
my_mean2 <- function(x, na.rm = TRUE) {
if (na.rm == FALSE) {
y <- sum(x) / length(x)
return(y)
}
else {
y <- sum(na.omit(x)) / length(na.omit(x))
return(y)
}
}
```
Now, when we pass a vector containing `NAs` to `my_mean2`, we get a numeric
result.
``` {r my-mean-2-NA}
my_mean2(c(1:5,NA), na.rm = TRUE)
```
A couple points worth noting about the functions above. First, take note that
most *homemade* functions rely on other functions called within their body text,
and so they inherit the properties of those functions.
Second, you may have noticed that while
the `body()` code in `my_mean2` assigns the mean of `x` to a new variable, `y`,
this function-specific variable does not show up in the **"Global Environment"**
pane after the function is executed. This is due to how *environments* are created in R: an *environment* essentially draws walls around objects.
Indeed, objects that are defined within functions are part of the *evaluation*
*environment* within that function, even if the function returns the value of
the object as output (*"what happens inside functions stays inside functions"*).
A detailed discussion on R environments is beyond the scope of this book,
look [here](https://www.r-bloggers.com/environments-in-r/){target="_blank"}
for an introduction to environments and
[here](https://adv-r.hadley.nz/environments.html){target="_blank"}
for a more detailed tutorial.
The take-home point is that a homemade function can exist within the global
environment and return values to the global environment, even if some of the objects
created within that function don't show up in the global environment.
### Example Function: "import.w.name"
Oftentimes, you will have a list of files of the same type/structure
that you want to import and analyze in a single data frame. This exercise (and
the section that follows) will demonstrate how you can streamline that process
using functional programming. Let's create a function to *"import a file with*
*its name appended".*
For this example, assume that your files represent data from a network of sensors,
where the name/ID assigned to each sensor is **included in its file name** but
*not in the file itself*. To give you an example of what we are working with,
let's use `list.files()` to look at the file names and paths (*note: you can find this*
*.zip file on our Canvas site*). For this exercise,
we show a short list of 8 files (4 each from two sensors) but one could imagine
this list being hundreds of entries.*To Note: these are real data collected*
*using sensors from*
*[this citizen-science network](http://www.purpleair.com/map){target="_blank"}*
*and published*
*by our research group*
*[here](https://doi.org/10.1016/j.atmosenv.2019.117067){target="_blank"} in 2019.*
The "PA" in each file name stands for
[Purple Air](https://www2.purpleair.com/){target="_blank"}.
``` {r list-PA-files}
list.files('./data/purpleair/', full.names=TRUE)
```
Thus, we wish to write a function that not only imports these .csv files into a data
frame but also extracts the part of the file name (i.e., PA019 and PA020)
as one of the data columns (otherwise, when the data were combined we might not know what data was associated with a given sensor!). In this function we will also
include a step to clean up the newly created data frame with a call to
`dplyr::select()` to retain only a few variables of interest and `lubridate::ymd()`
to create a date-time object. Seeing that these
files are .csv, we can leverage `readr::read_csv`
``` {r import-file-name, warning=FALSE, message=FALSE}
# create an object that tracks the file names and file paths
file_list <- list.files('./data/purpleair/', full.names=TRUE)
# function to import a .csv and include part of the filename as a data column
import.w.name <- function(pathname) {
#create a tibble by importing the 'pathname' file
df <- read_csv(pathname, col_names = TRUE)
df <- df %>%
# use stringr::str_extract & a regex to get sensor ID from file name
# regex translation: "look for a /, then extract all letters and numbers that follow until _"
mutate(sensor_ID = str_extract(pathname,
"(?<=//)[:alnum:]+(?=_)"),
# convert Date & Time variable to POSIXct with lubridate
datetime = lubridate::ymd_hms(UTCDateTime)) %>%
# return only a few salient variables to the resultant data frame using dplyr::select
select(datetime,
current_temp_f,
current_humidity,
pressure,
pm2_5_atm,
sensor_ID) %>%
na.omit() # remove NA values, which happens when sensor goes offline
return(df)
}
```
After sourcing this function, we can test it out on the first entry of our list
of files. *We specify the first entry with a subset to* `file_list[1]`.
``` {r import-file-name-2, message=FALSE, warning=FALSE}
PA_data_1 <- import.w.name(file_list[1])
head(PA_data_1)
```
The `import.w.name()` function is useful, but not versatile;
it was written to handle a special type of file with a particular naming
convention. Those limitations aside, one could imagine how this function could
be easily adapted to suit other file types and formats. As you develop your
coding skills, a good strategy is to keep useful functions in a .R script
file so that you can call upon them when needed: `source(import.w.name.R)`
## The `purrr::` package
The `purrr::` package was designed specifically with functional programming in mind.
Similar to the discussion of *vectorized operations* above, `purrr::` was created
to help you apply functions to vectors (and data frames) in a way that is easy to implement and easy to "read".
### Function Mapping
The `map_` family of functions are the core of the `purrr` package. These
functions are intended to *map* functions (i.e., to apply them) to individual elements in a vector (or data frames); the `map_` functions are similar to functions like `lapply()` and `vapply()` from base R (but more versatile). *"Mapping"* a function onto a vector is a common theme of functional programming. To illustrate how the `map_` functions work, its best to visualize the process first.
``` {r map-anno1, fig.cap="The map functions transform their input by applying a function to each element of a list or atomic vector and returning an object of the same length as the input.", fig.align="center", echo=FALSE}
knitr::include_graphics("./images/map_anno1.png")
```
As shown above in Figure \@ref(fig:map-anno1), the generic form of `map()` always returns a `list` object as the output (thanks to Hadley Wickham for the [graphic](https://adv-r.hadley.nz/functionals.html#map){target="_blank"}). A `list` is a generic vector that can contain multiple objects (other vectors, data frames, etc.). Indeed, a data frame (and a `tibble`) is a special kind of list: a data frame is a list of vectors (columns) of equal length. A data frame can contain different vector objects but those objects must be placed in separate columns and each column vector must be the same length (when all the columns are equal length that means that data frames are *rectangular* lists). A generic `list` does not need to be rectangular because generic list entries are more distinct. Further, a list can contain another list (like a set of small boxes that are nested within a larger box). Think of a `list` like a drawer organizer - it can contain lots of different objects, all of which might be related to another in some way or, at least, might be used together for some common purpose. To review:
- A ***vector*** is a one-dimensional object where all the entries are the same type.
- A ***matrix*** is a two-dimensional (rectangular) object where all entries are the same type, but ordered into *rows* and *column* indices.
- A ***data frame*** is a two dimensional (rectangular) object that contains columns of different vectors. The type of vector (e.g., `character`, `numeric`, `logical`, etc.) can differ from one column to the next but all entries within each vector column must be the same.
- A ***list*** can contain different vector objects of different types and different lengths. A list can also contain other lists. A list does not need to be rectangular.
To illustrate the versatility of lists, let's create one that contains a numeric vector, a matrix, and a data frame.
```{r list-example, echo=TRUE}
my.chr.vector <- c("Harry", "Ron", "Hermione", "Draco")
my.num.matrix <- matrix(data = 1:20, nrow=5)
my.df <- slice_sample(.data = mpg, n=7)
my.list <- list("entry_1" = my.chr.vector,
"entry_2" = my.num.matrix,
"entry_3" = my.df)
glimpse(my.list)
```
Lists can be accessed in similar ways to vectors. For example, by using single-bracket indexing, `[ ]`, a list element is returned.
```{r lists2}
my.list[1]
```
Note that *"single bracket"* indexing returns a "list element" of class: `list`.
```{r lists3}
class(my.list[1])
```
If you want access to the vector contents of the list element, you can use the `$` symbol to access the contents of each list entry, or, you can use *"double bracket"* indexing, `[[ ]]`:
```{r lists4}
my.list$entry_1
class(my.list$entry_1)
```
```{r lists5}
my.list[[2]]
class(my.list[[2]])
```
Working with lists is one of the mental hurdles to overcome when learning `map()`. However, there are variants in the `map_` family of functions that return simpler object types; we will work with these simplified functions to begin.
```{r purrr-variants, echo=FALSE}
purrr_variants <- tibble(
name = c("map()",
"map_lgl()",
"map_int()",
"map_dbl()",
"map_chr()",
"map_dfr()",
"map_dfc()"),
returns = c("list",
"logical",
"integer",
"numeric",
"character",
"data frame (output by row-binding)",
"data frame (output by column-binding)")
)
knitr::kable(purrr_variants, col.names = c("`Purrr::` Function", "Object Type Returned"))
```
Let's use `map_dfr()` to apply our function, `import.w.name()`, onto the vector of file names and paths (`file_list`). The *"r"* in `map_dfr()` means that the resultant data frame output by `map_` will be created by binding rows together as the function moves down each index in the file list. This is shown schematically in Figure \@ref(fig:map-dfr-anno) below.
``` {r purrr-import, warning=FALSE, message=FALSE}
file_list <- list.files('./data/purpleair/', full.names=TRUE)
# map the import.w.name() function to all objects within `file_list` sequentially
# and combining the result into a single data frame with row binding
PA_data_merged <- map_dfr(file_list, import.w.name)
glimpse(PA_data_merged)
```
```{r map-dfr-anno, fig.cap="Example: using `map_dfr()` to import a file list using a custom function", echo=FALSE}
knitr::include_graphics("./images/map_dfr_anno.png")
```
## Applying functions to multiple columns: `dplyr::across()`
There are many instances where you will want to apply a function to several columns in a data frame, often in conjunction with `mutate()` or `summarise()`. The `across()` function helps you do this efficiently, by applying a function *across* various column variables in a data frame.
The `across()` function has two required arguments: the column variables being treated (`.cols = `) and the function(s) being applied to those columns (`.fns = `). For example, if you wanted to change the `cyl`, `drv`, and `class` variables in the `mpg` data frame to factors, you could use:
```{r mpg-factors}
mpg_facts <- mutate(mpg, across(.cols = c(cyl, drv, class),
.fns = factor))
```
Before providing more examples of `across()`, it's worth taking a tangent to discuss more efficient ways to select multiple columns beyong using `c()`.
### Using `<tidy-select>` syntax to select columns
Many of the `dplyr::` functions support *Tidy Selection* as a means to choose certain column variables when operating on a data frame. For example, maybe you want to choose only columns whose names start with a common prefix (`starts_with()`) or whose names contain a common string (`contains()`). The `<tidy-select>` helpers provide syntax for efficiently specifying your `.cols = ` argument in R. These helpers, shown below, can be used as column/variable selection arguments whenever you see the text `<tidy-select>` in the function's help file.
```{r tidy-select, echo = FALSE}
tidy_select_1 <- data.frame(
helpers = c("everything()",
"starts_with()",
"ends_with()",
"contains()",
"matches()",
"where()"),
meaning = c("Matches all variables",
"Starts with a prefix",
"Ends with a suffix",
"Contains a literal string",
"Matches a regular expression",
"Selects variables if function returns TRUE"),
example = c("rename_with(mpg, .cols = everything(), .fn = toupper))",
"select(mpg, starts_with(\"c\"))",
"mutate(mpg, across(ends_with(\"y\"), .fns = ~0.425*.x, .names = \"km_per_liter_{.col}\"))",
"summarise(mpg, across(contains(\"y\"), .fns = mean, .names = \"mean_{.col}\"))",
"rename_with(mpg, .cols = matches(\"y$\"), .fn = toupper)",
"select(mpg, where(is.numeric))")
)
knitr::kable(tidy_select_1, col.names = c("Helper", "Explanation", "Example")) %>%
column_spec(2, width = "6cm")
```
There are also helper symbols that can be used (an in conjunction with the helper verbs above) to make selections. For example, if you wanted to keep all columns *except* for those containing `logical` vectors, you could use `select(data = .x, !where(is.logical))`. These symbols are outlined in the table below. For further reference, see this [help page](https://dplyr.tidyverse.org/reference/dplyr_tidy_select.html).
```{r tidy-select2, echo = FALSE}
tidy_select_2 <- tibble(
symbol = c(":", "!", "&", "|", "c()"),
meaning = c("for selecting a range of consecutive columns",
"for taking the complement of a set of variables",
"for selecting the intersection of two sets of variables",
"for selecting the union of two sets of variables",
"for combining selections"),
example = c("model:year",
"!where(is.numeric)",
"where(is.numeric) & contains(\"y\")",
"where(is.numeric) | starts(\"m\")",
"!c(Make, hwy)")
)
knitr::kable(tidy_select_2, col.names = c("Symbol", "Explanation", "Example (`mpg` dataframe)"))
```
### Examples `dplyr::across()`
Perform unit conversions:
```{r metric-units-mpg}
# convert the cty and hwy variables in the mpg data frame to km/l
mutate(mpg, across(ends_with("y"), # select cty and hwy
.fns = ~0.425*.x, # unit conversion m/gal to km/l
.names = "km_per_liter_{.col}")) %>% # how to name new columns
# select first 2 cols and those containing strings indicated
select(1:2 | contains("year") | contains("km")) %>%
slice_sample(n=5)
```
Summarizing the data range for only the numeric vectors:
```{r summarise-numerics, warning=FALSE}
# show the ranges of all numeric vectors in the mpg data frame
mpg %>%
summarise(across(where(is.numeric), .fns = range))
```
## Homework
This homework will give you practice at writing functions, mapping functions, and cleaning/plotting data. To begin, download the PurpleAir data files from Canvas. Note: the data are contained in a .zip file, which you can unzip on your computer or using R!