# Long vs Wide processes
Within the example project there's a bothersome little wrinkle: we call a helper
function `compute_h3_indices_at_resolutions()` twice. This function creates
a set of spatial indices for our data. It's a potentially expensive computation on
larger data, and ideally one we'd perform only once.
We call it:

- Once in `wrangle_and_join_weather()`, as part of creating our 'clean' dataset.
- Once in `plot_species_class_accuracy_hexes()`, to create a plot of classifier accuracy by hexagon.
  - At that point we're working with the model's testing data and validation metrics, and we dropped the spatial index when we created the training data.
What we could perhaps do instead is:

1. Compute the spatial indices for our occurrences in a separate dataset.
2. Join them only when needed, e.g. for the hex-binned plots.
But why did this wrinkle appear anyway?
In a classic staged-script workflow, datasets go through a very linear path.
- There's this kind of unspoken quest to build the perfect-one-true-clean dataset from which all analysis can flow.
- Columns get added and added, rarely removed.
- Datasets get quite wide.
- Often the binding name is reused each time:
```{r}
#| eval: false
the_dataset <-
  the_dataset |>
  mutate( # or join, summarise, rbind, cbind etc.
    # ...
  )
```
- This style is what I am going to call a 'long' process.
With `{targets}` long processes probably spend more CPU cycles than necessary. Why?
<details><summary>Answer</summary>
Small changes to a target can trigger a large number of other targets to be rebuilt, because the target you changed has a huge chain of targets hanging off it.
</details>
Initially, when users get started with `{targets}`, there can be a tendency to
continue pursuing this pattern of long chains of data transformations,
represented as linear sequences of targets.
With `{targets}` we have the choice between:

- Minimising the end-to-end running time of the plan
  - as is usually the aim in the classic workflow
  - i.e. using 'long' processes
- Minimising the total running time of the plan over the life of the project
  - by minimising dependencies between targets
  - i.e. using 'wide' processes (contrasted in the sketch below)
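As a schematic sketch only (the `step_*()`, `piece_*`, and `join_pieces()` names are hypothetical, not from the example project), the two shapes might look like this in a `_targets.R` plan:

```{r}
#| eval: false
# 'Long': each target hangs off the previous one, so editing step_2()
# invalidates data_2, data_3 and data_4.
list(
  tar_target(data_1, step_1()),
  tar_target(data_2, step_2(data_1)),
  tar_target(data_3, step_3(data_2)),
  tar_target(data_4, step_4(data_3))
)

# 'Wide': independent pieces hang off the same upstream target and are
# joined only where needed, so an edit to step_a() leaves piece_b alone.
list(
  tar_target(data_1, step_1()),
  tar_target(piece_a, step_a(data_1)),
  tar_target(piece_b, step_b(data_1)),
  tar_target(combined, join_pieces(piece_a, piece_b))
)
```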
We'll make our plan a little wider now by refactoring out the spatial index into a separate target.
# Refactoring Steps
1. Remove this code from `wrangle_and_join_weather()`:
```{r}
#| eval: false
occurrences_weather_hexes <-
  st_as_sf(
    occurrences_weather,
    coords = c("decimalLongitude", "decimalLatitude"),
    remove = FALSE,
    crs = first(occurrences$geodeticDatum)
  ) |>
  mutate(
    compute_h3_indices_at_resolutions(h3_hex_resolutions, geometry)
  )

occurrences_weather_hexes
```
and place it in a new function, `create_h3_indexes()`, that creates a dataset containing the spatial indices, registered as its own target. We'll need to refactor it a bit further in a minute.
```{r}
#| eval: false
tar_target(
  occurrences_hexes,
  create_h3_indexes(
    occurrences_weather,
    h3_hex_resolutions
  )
)
```
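As a first cut, `create_h3_indexes()` can simply wrap the code we removed; a sketch (we slim its return value down in step 6):

```{r}
#| eval: false
# First cut: wrap the removed code as a function. Refined further in step 6.
create_h3_indexes <- function(occurrences_weather, h3_hex_resolutions) {
  st_as_sf(
    occurrences_weather,
    coords = c("decimalLongitude", "decimalLatitude"),
    remove = FALSE,
    crs = first(occurrences_weather$geodeticDatum)
  ) |>
    mutate(
      compute_h3_indices_at_resolutions(h3_hex_resolutions, geometry)
    )
}
```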
2. Rename `occurrences_weather_hexes` to `occurrences_weather`, since it now has nothing to do with hexes.
- Remove the `h3_hex_resolutions` argument from `wrangle_and_join_weather()`.
- Also update the name wherever this dataset is an input to other targets.
3. Add an `id` column to `occurrences_weather` in `wrangle_and_join_weather()` like:
```{r}
#| eval: false
occurrences_weather |>
  mutate(id = seq(n()))
```
- This will give us a key to join on.
4. Add `id` to the `select()` in `create_training_data()` so it is retained in the returned data.
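A sketch of what that might look like (the non-`id` columns here are hypothetical stand-ins for whatever `create_training_data()` already selects):

```{r}
#| eval: false
occurrences_weather |>
  select(
    id, # new: keep the join key
    scientificName,
    starts_with("weather_") # hypothetical stand-in for the existing columns
  )
```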
5. Change the model formulas in `fit_fold_calc_results()` and `fit_final_species_classification_model()` from `scientificName ~ .` to `scientificName ~ . - id` to exclude our `id`.
- Having to change the model formula in two places highlights that we should probably break it out into its own target! Consider that an exercise left to the reader (one possible shape is sketched below).
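A minimal sketch of that exercise, assuming a target name of `species_model_formula` (hypothetical) that both fitting functions would then take as an argument:

```{r}
#| eval: false
# Hypothetical: a single target holding the shared model formula
tar_target(
  species_model_formula,
  scientificName ~ . - id
)
```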
6. Refactor `create_h3_indexes` further to just return `id` and the h3 indexes:
```{r}
#| eval: false
occurrences_hexes <-
  st_as_sf(
    occurrences_weather,
    coords = c("decimalLongitude", "decimalLatitude"),
    remove = FALSE,
    crs = first(occurrences_weather$geodeticDatum)
  ) |>
  mutate(
    compute_h3_indices_at_resolutions(h3_hex_resolutions, geometry)
  ) |>
  st_drop_geometry() |>
  select(
    id,
    starts_with("h3")
  )

occurrences_hexes
```
7. Pass our `occurrences_hexes` target into all the targets that start with `gg_` and end with `hexes`, and join it to the main dataset before plotting.
- E.g. in `plot_species_distribution_hexes()` do this:
```{r}
#| eval: false
hex_occurrences <-
  occurrences_weather |>
  left_join(occurrences_hexes, by = "id") |> # the new bit
  st_drop_geometry() |>
  select(scientificName, h3_hex_8) |>
  summarise(
    count = n(),
    .by = c("scientificName", "h3_hex_8")
  ) |>
  mutate(
    geometry = cell_to_polygon(h3_hex_8)
  ) |>
  st_as_sf()

# plot stuff follows
```
- In `gg_species_class_accuracy_hexes()` remove the `h3_hex_resolutions` argument and replace it with `occurrences_hexes`.
Replace this code:
```{r}
#| eval: false
model_validation_predictions_hex <-
  species_model_validation_data |>
  st_as_sf(
    coords = c("decimalLongitude", "decimalLatitude"),
    remove = FALSE,
    crs = 4326
  ) |>
  mutate(
    compute_h3_indices_at_resolutions(h3_hex_resolutions, geometry)
  ) |>
  st_drop_geometry()
```
With this:
```{r}
#| eval: false
model_validation_predictions_hex <-
  species_model_validation_data |>
  left_join(occurrences_hexes, by = "id")
```
8. In `plot_species_distributions_points()` we no longer have a spatial dataset, so we have to make our data spatial for plotting:
```{r}
#| eval: false
occurrences_weather_points <-
  occurrences_weather |>
  st_as_sf(
    coords = c("decimalLongitude", "decimalLatitude"),
    remove = FALSE,
    crs = first(occurrences_weather$geodeticDatum)
  )

p <-
  ggplot() +
  geom_sf(
    data = brisbane_river
  ) +
  geom_sf(
    data = occurrences_weather_points
  ) +
  facet_wrap(~scientificName) +
  theme_light()

p
```
- We're re-converting to point geometry here.
- We could make a similar argument as with hexes for another target, `occurrences_points`, to be calculated once and joined on as needed:
  - where we calculate hexes
  - where we plot points
- Another exercise for the reader! (A possible sketch follows.)
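A minimal sketch of that exercise, under the assumption that downstream code joins on `id` and uses the `geometry` column:

```{r}
#| eval: false
# Hypothetical: point geometry as its own target, joined on id where needed
tar_target(
  occurrences_points,
  st_as_sf(
    occurrences_weather,
    coords = c("decimalLongitude", "decimalLatitude"),
    remove = FALSE,
    crs = first(occurrences_weather$geodeticDatum)
  ) |>
    select(id, geometry)
)
```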
The completed refactor is available on [this branch of the example project](https://github.com/MilesMcBain/classic_r_project/tree/refactor4).
# A wider angle
If we compare the network graphs before this refactor:
![Everything depending on occurrences_weather_hexes](images/long_network.png)
With the one after this refactor:
![Some targets do not need the hex information](images/wide_network.png)
We can see `occurrences_weather` (formerly `occurrences_weather_hexes`) is less of a chokepoint, and that the modeling branch of the pipeline no longer depends on the spatial indices.
# Review
When working with `{targets}` you have a new criterion to optimise for: minimise the length of your dependency chains.
- 'Widening' your process will lower the total running time over the life of the project, since you can make the best re-use of work saved in the store.
- Using this strategy some things that were annoyingly slow in a linear pipeline are less important to optimise.
- Code might be slow, but it hardly ever runs!
- For `{tidyverse}` users `{targets}` enables you to 'have your cake and eat it too'.
- E.g. Why bother with faster packages with more difficult syntax for code that hardly ever runs?