forked from tidyverse/design
-
Notifications
You must be signed in to change notification settings - Fork 0
/
cs-stringr.qmd
62 lines (43 loc) · 3.15 KB
/
cs-stringr.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
# Case study: stringr {#sec-cs-stringr}
```{r}
#| include = FALSE
source("common.R")
```
```{=html}
<!--
https://github.com/wch/r-source/blob/trunk/src/main/grep.c#L891-L1151 -->
```
This chapter explores some of the considerations designing the stringr package.
```{r}
library(stringr)
```
## Function names
When the base regular expression functions were written, most R users were familiar with the command line and tools like grepl.
This made naming R's string manipulation functions after these tools seem natural.
When I started work on stringr, the majority of R users were not familiar with linux or the command line, so it made more sense to start afresh.
I think there were successes and failures here.
On the whole, I think `str_replace_all()`, `str_locate()`, and `str_detect()` are easier to remember than `gsub()`, `regexpr()`, and `grepl()`.
However, it's harder to remember what makes `str_subset()` and `str_which()` different.
If I was to do stringr again, I would make more of an effort to distinguish between functions that operated on individual matches and individual strings as `str_locate()` and `str_which()` seem like their names should be more closely related as `str_locate()` returns the location of a match within each string in the vector, and `str_subset()` returns the matching locations within a vector.
## Argument order and names
Base R string functions mostly have `pattern` as the first argument, with the chief exception being `strsplit()`.
stringr functions always have `string` as the first argument.
I regret using `string`; I now think `x` would be a more appropriate name.
## `str_flatten()`
`str_flatten()` was a relatively recent addition to stringr.
It took me a long time to realise that one of the challenges of understanding `paste()` was that depending on the presence or absence of the `collapse` argument it could either transform a string (i.e. return something the same length) or summarise a string (i.e. always return a single string).
Once `str_flatten()` existed it become more clear that it would be useful to have `str_flatten_comma()` which made it easier to use the Oxford comma (which seems to be something that's only needed for English, and ironically the Oxford comma is more common in US English than UK English).
## Recycling rules
stringr implements recycling rules so that you can either supply a vector of strings or a vector of patterns:
```{r}
alphabet <- str_flatten(letters, collapse = "")
vowels <- c("a", "e", "i", "o", "u")
grepl(vowels, alphabet)
str_detect(alphabet, vowels)
```
On the whole I regret this.
It's generally not that useful (since you typically have more than one string, not more than one pattern), most people don't use it, and now it feels overly clever.
## Redundant functions
There are a couple of stringr functions that were very useful at the time, but are now less important.
- `nchar(NA)` used to return 2, and `nchar(factor("abc"))` used to return 1. `str_length()` fixed both of these problems, but those fixes also migrated to base R, leaving `str_length()` as less useful.
- `paste0()` did not exist so `str_c()` was very useful. But now `str_c()` primarily only useful for its recycling logic.