# Explain important defaults {#sec-def-inform}
```{r}
#| include = FALSE
source("common.R")
library(dplyr, warn.conflicts = FALSE)
```
## What's the pattern?
If a default value is important, and the computation is non-trivial, inform the user what value was used.
This is particularly important when the default value is an educated guess, and you want the user to change it.
It is also important when descriptor arguments (@sec-important-args-first) have defaults.
## What are some examples?
- `dplyr::left_join()` and friends automatically compute the variables to join `by` as the variables that occur in both `x` and `y` (this is called a natural join in SQL).
This is convenient, but it's a heuristic so doesn't always work.
```{r}
#| error = TRUE
library(nycflights13)
library(dplyr)
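# Correct: the only common variable is `carrier`
out <- left_join(flights, airlines)
# Incorrect: also joins on `year`, which means something different in `planes`
out <- left_join(flights, planes)
# Error: no variables in common
out <- left_join(flights, airports)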
```
- `readr::read_csv()` reads a csv file into a data frame.
Because csv files don't store the type of each variable, readr must guess the types.
In order to be fast, `read_csv()` uses some heuristics, so it might guess wrong.
Or it might guess correctly today, but guess incorrectly two months from now, when your automated script runs against data whose format has changed, giving weird downstream errors.
For this reason, `read_csv()` prints the column specification in a way that you can copy-and-paste into your code.
```{r}
library(readr)
mtcars <- read_csv(readr_example("mtcars.csv"))
```
- In `ggplot2::geom_histogram()`, the `binwidth` is an important parameter that you should always experiment with.
This suggests it should be a required argument, but it's hard to know what values to try until you've seen a plot.
For this reason, ggplot2 provides a suboptimal default of 30 bins: this gets you started, and then a message tells you how to modify it.
```{r}
library(ggplot2)
ggplot(diamonds, aes(carat)) + geom_histogram()
```
- When installing packages, `install.packages()` informs you of the value of the `lib` argument, which defaults to `.libPaths()[[1]]`:
```{r}
#| eval = FALSE
install.packages("forcats")
# Installing package into ‘/Users/hadley/R’
# (as ‘lib’ is unspecified)
```
This message is less successful, however: the value is not terribly important (most people only use one library), the message is easy to miss amongst the other output, and it doesn't refer to the mechanism that controls the default (`.libPaths()`).
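The first three messages above each hand you a value that you can make explicit in your own code. What follows is a minimal sketch of acting on them, assuming dplyr, readr (2.0 or later, for literal data via `I()`), and ggplot2 are installed. `flights_small` and `planes_small` are hypothetical toy tables standing in for `flights` and `planes`.

```r
library(dplyr, warn.conflicts = FALSE)
library(readr)
library(ggplot2)

# Hypothetical toy tables: `year` is the year of the flight in one table and
# the year of manufacture in the other, so it is not a sensible join key.
flights_small <- tibble(year = c(2013, 2013), tailnum = c("N1", "N2"))
planes_small <- tibble(
  year = c(1999, 2004),
  tailnum = c("N1", "N2"),
  model = c("A320", "B737")
)

# Supplying `by` replaces the guess (and silences the message).
joined <- left_join(flights_small, planes_small, by = "tailnum")

# Supplying `col_types` pins the column specification instead of guessing it.
cars <- read_csv(
  I("mpg,cyl\n21,6\n22.8,4\n"),
  col_types = cols(mpg = col_double(), cyl = col_double())
)

# Supplying `binwidth` replaces the default of 30 bins.
p <- ggplot(diamonds, aes(carat)) + geom_histogram(binwidth = 0.1)
```

In each case the message did its job once: it showed the guess, and the explicit argument now records the decision in the script.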
## Why is it important?
> There are two ways to fire a machine gun in the dark.
> You can find out exactly where your target is (range, elevation, and azimuth).
> You can determine the environmental conditions (temperature, humidity, air pressure, wind, and so on).
> You can determine the precise specifications of the cartridges and bullets you are using, and their interactions with the actual gun you are firing.
> You can then use tables or a firing computer to calculate the exact bearing and elevation of the barrel.
> If everything works exactly as specified, your tables are correct, and the environment doesn't change, your bullets should land close to their target.
>
> Or you could use tracer bullets.
>
> Tracer bullets are loaded at intervals on the ammo belt alongside regular ammunition.
> When they're fired, their phosphorus ignites and leaves a pyrotechnic trail from the gun to whatever they hit.
> If the tracers are hitting the target, then so are the regular bullets.
>
> --- [The Pragmatic Programmer](https://www.amazon.com/dp/B003GCTQAE)
I think this is a valuable pattern because it helps balance two tensions in function design:
- Forcing the function user to really think about what they want to do.
- Trying to be helpful, so the user of the function can achieve their goal as quickly as possible.
Often your thoughts about a problem will be aided by a first attempt, even if that attempt is wrong.
This pattern helps facilitate iteration: you don't sit down and contemplate for an hour and then write one perfectly formed line of R code.
You take a stab at it, look at the result, and then tweak.
Taking a default that the user really should think carefully about, turning it into a heuristic or educated guess, and reporting the value is like firing a tracer bullet.
The counterpoint to this pattern is that people don't read repeated output.
For example, do you know how to cite R in a paper?
It's mentioned every time that you start R.
Human brains are extremely good at filtering out unchanging signals, which means that you must use this technique with caution.
If every argument tells you the default it uses, it's effectively the same as doing nothing: the most important signals will get buried in the noise.
This is why you'll see the technique used in only a handful of places in the tidyverse.
## How can I use it?
To use this pattern you need to generate a message from the computation of the default value.
The easiest way to do this is to write a small helper function.
It should compute the default value given some inputs and generate a `message()` that gives the code that you could copy and paste into the function call.
Take the dplyr join functions, for example.
They use a function like this:
```{r}
common_by <- function(x, y) {
  # The default is every variable name that appears in both inputs
  common <- intersect(names(x), names(y))
  if (length(common) == 0) {
    stop("Must specify `by` when no common variables in `x` and `y`", call. = FALSE)
  }

  # Report the guess as code that can be pasted into the call
  message("Computing common variables: `by = ", rlang::expr_text(common), "`")
  common
}

common_by(data.frame(x = 1), data.frame(x = 1))
common_by(flights, planes)
```
The technique you use to generate the code will vary from function to function.
`rlang::expr_text()` is useful here because it automatically creates the code you'd use to build the character vector.
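If you would rather not depend on rlang, base R's `deparse()` does a similar job for simple vectors. This is a rough sketch, not what dplyr itself does; `as_code()` is a hypothetical helper name.

```r
# Hypothetical helper: turn a value into the code that would recreate it.
# deparse() can return several strings for long input, so collapse them.
as_code <- function(x) {
  paste(deparse(x), collapse = "")
}

by_vars <- c("year", "tailnum")
code <- as_code(by_vars)

message("Computing common variables: `by = ", code, "`")
```

Because the text is valid R, parsing and evaluating it gives back the original vector, which is what makes the message safe to copy and paste.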
To avoid creating a magical default (@sec-def-magical), either export and document the function, or use one of the techniques in @sec-defaults-short-and-sweet:
```{r}
left_join <- function(x, y, by = NULL) {
  by <- by %||% common_by(x, y)
}
```
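To see the whole pattern end to end without any package dependencies, here is a sketch built on `base::merge()`. `default_by()` and `my_join()` are hypothetical names used only for illustration; `default_by()` mirrors `common_by()` above but uses `deparse()` in place of `rlang::expr_text()`.

```r
# Hypothetical helper: compute the default and report it as pasteable code.
default_by <- function(x, y) {
  common <- intersect(names(x), names(y))
  if (length(common) == 0) {
    stop("Must specify `by` when no common variables in `x` and `y`", call. = FALSE)
  }
  message(
    "Computing common variables: `by = ",
    paste(deparse(common), collapse = ""),
    "`"
  )
  common
}

# Hypothetical wrapper: the helper only runs when `by` is not supplied.
my_join <- function(x, y, by = NULL) {
  if (is.null(by)) {
    by <- default_by(x, y)
  }
  merge(x, y, by = by, all.x = TRUE)
}

x <- data.frame(id = 1:2, a = c("p", "q"))
y <- data.frame(id = 1:2, b = c("r", "s"))

out_guessed <- my_join(x, y)             # emits the message
out_explicit <- my_join(x, y, by = "id") # silent: the user made the choice
```

Both calls return the same result; the only difference is that the first one tells you which value it picked on your behalf.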