---
title: "Test DVC"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidymodels)
library(stacks)
library(xgboost)
library(yardstick)
library(tidyverse)
library(workflows)
library(gt)
library(vip)
library(jsonlite)
library(yaml)
library(kableExtra)
```
The goal of this document is to fit and select candidate models. We use [DVC](https://dvc.org) to track and manage the model hyperparameters and metrics, and [tidymodels](https://www.tidymodels.org) for the actual fitting.
## Model Goal
Our goal will be to predict housing prices for the Ames housing data set. For the purposes of DVC, I've written this dataset out to a CSV so that DVC can track it as a pipeline input.
```r
# write the raw Ames data out once so DVC can track it
library(AmesHousing)
library(readr)
write_csv(ames_raw, "housing.csv")
```
We will use the `tune` package to evaluate a few different models. I attempted to use `stacks` to fit an ensemble as well but hit an error, so we will instead consider a manually fit ensemble.
```{r data, message=FALSE, warning=FALSE}
# first read our data and create test / training
data <- read_csv("housing.csv")
housing_split <- initial_split(data)
housing_train <- training(housing_split)
housing_test <- testing(housing_split)
# within training, set up 4-fold cross-validation
folds <- rsample::vfold_cv(housing_train, v = 4)
```
We will create a simple recipe to clean the data a bit.
```{r recipe}
housing_rec <-
  recipe(SalePrice ~ PID + `Lot Area` + `Year Built` + `Total Bsmt SF` + `Full Bath` + `Half Bath` + Fireplaces + `Garage Cars`,
         data = housing_train) %>%
  update_role(PID, new_role = "ID") %>%
  step_dummy(all_nominal()) %>%
  step_meanimpute(all_numeric()) %>%
  step_zv(all_predictors()) %>%
  step_nzv(all_predictors())
housing_rec %>%
  prep() %>%
  bake(housing_train)
# create a workflow (pipeline) that will track this preprocessing
# and allow us to build up a pipeline for fitting the models
housing_wflow <-
  workflow() %>%
  add_recipe(housing_rec)
# use rmse as the optimizing metric
metric <- metric_set(rmse)
```
## Mashing up tidymodels and DVC
Now we will define our candidate models and set them up to be tuned. Here is where there is a bit of a disconnect between DVC and tidymodels: DVC wants to be in control of the stages of processing data, fitting a model, searching hyperparameters, etc., but tidymodels expects to own all of that through the workflow construct!
It is not clear to me how the workflow could be split up so that the different pieces could be controlled and cached by DVC. I think you hit the same problem if you try to mash up packages like `drake` or `targets` with `tidymodels`. My hunch is that you could skip the workflow and just use the raw building blocks of tidymodels, but then you miss out on the simple and clean syntax. For reference, this is a pretty interesting attempt: https://mdneuzerling.com/post/machine-learning-pipelines-with-tidymodels-and-targets/.
> The problem with model flow controllers is that they can't agree where the boundaries should be!
In this case we will take a bit of a hybrid approach. I will let `tune` handle the lasso regression, but will set the number of trees using an experiment parameter that is tracked by DVC and read from the file `params.yaml`. The downside here is that there will potentially be variation in the "experiments" not accounted for by the experimental inputs! I could fix some of this variation by setting a random seed.
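For example, a `params.yaml` with the keys read later in this document might look like the following (the values are placeholders, not the ones used for any particular experiment):

```yaml
# placeholder values; the keys match params$tune$trees and params$ensemble$alpha below
tune:
  trees: 50
ensemble:
  alpha: 1
```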
The outcome is that the DVC "pipeline" will really be just one big step, `rmarkdown::render`, that contains all the sub-steps inside of it. On the plus side, we still get the benefits of using DVC to version the input data and experiment parameters as well as the outputs.
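Concretely, the `dvc.yaml` stage for that single step might look roughly like this (a sketch only: the stage name and the exact deps/outs listed are illustrative, not copied from this repo):

```yaml
stages:
  fit:
    cmd: Rscript -e 'rmarkdown::render("fit_model.Rmd")'
    deps:
      - fit_model.Rmd
      - housing.csv
    params:
      - tune.trees
      - ensemble.alpha
    outs:
      - errors.png
      - ensemble.png
    metrics:
      - metrics.json:
          cache: false
```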
## Back to models
The idea with tuning is to fit a bunch of candidate models using cross-validation: `model_spec` + `workflow` + `tune_grid`. Then, from the candidate models, pick the best one, or use the best of the candidates in an ensemble!
```{r linear, message=TRUE, warning=FALSE}
# first the linear regression with penalty
# here we will use tune to set the penalty and mixture
lin_reg_spec <-
  linear_reg(penalty = tune("penalty"), mixture = tune("mixture")) %>%
  set_engine("glmnet")
# add the preprocessing recipe to the model
lin_reg_wflow <-
  housing_wflow %>%
  add_model(lin_reg_spec)
# setup the grid that will tune the different parameters
# and generate a bunch of resample fit models to use for
# the ensemble stack!
lin_res <-
  tune_grid(
    lin_reg_wflow,
    resamples = folds,
    metrics = metric,
    control = stacks::control_stack_grid()
  )
```
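As an aside: if we only wanted the single best lasso candidate, rather than feeding all of the candidates into an ensemble, `tune` can pick it out directly. A minimal sketch (not run here):

```r
# inspect the top candidates by RMSE, then lock the winning
# penalty/mixture into the workflow and refit on all of the training data
show_best(lin_res, metric = "rmse")
best_params <- select_best(lin_res, metric = "rmse")
best_lin_fit <-
  finalize_workflow(lin_reg_wflow, best_params) %>%
  fit(housing_train)
```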
```{r trees}
# next setup the tree model, using the DVC defined experiment parameters
params <- yaml::read_yaml('params.yaml')
trees <- params$tune$trees
tree_spec <- boost_tree(mode = "regression", trees = trees) %>%
  set_engine("xgboost")
# add the preprocessing recipe to the model
tree_wflow <-
  housing_wflow %>%
  add_model(tree_spec)
# we aren't using a tune grid to create different resamples because we aren't tuning,
# but we still need resamples for the model stacking, so we specify a resample spec:
# the candidates trained by the tune grid need a matching resample when we fit the ensemble
tree_res <-
  fit_resamples(
    tree_wflow,
    metrics = metric,
    resamples = folds,
    control = stacks::control_stack_resamples()
  )
```
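For contrast, had we let `tune` own the tree count as well (rather than reading it from the DVC parameters), the spec would simply mark `trees` for tuning. A sketch of that alternative, not used here:

```r
# tuning `trees` with tune_grid would keep everything inside tidymodels,
# but DVC would no longer see the parameter as an experiment input
tree_spec_tuned <-
  boost_tree(mode = "regression", trees = tune("trees")) %>%
  set_engine("xgboost")
```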
> Tip! Use `yaml::read_yaml('params.yaml')` to read the experiment parameters in a way that is compatible with DVC pipeline definitions.
~~Finally we can create the ensemble definition using `stacks` and begin fitting!~~ Unfortunately this didn't work, but we can get close with our own `glmnet`.
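For reference, the full `stacks` flow would have looked roughly like this, had `blend_predictions()` not errored (a sketch of the intended path, not run):

```r
# intended stacks pipeline: lasso-blend the candidate predictions,
# refit the retained members on the full training set, then predict
housing_ensemble <-
  housing_model_stack %>%
  blend_predictions() %>%
  fit_members()
predict(housing_ensemble, housing_test)
```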
```{r ensemble, message=FALSE, warning=FALSE}
# create the stack
# this is where we align the resample subsets with the resample sets used
# to train the tuned cross validation candidates
housing_model_stack <-
  stacks() %>%
  add_candidates(lin_res) %>%
  add_candidates(tree_res)
# Argh, for some reason this is not working
# housing_model_stack %>%
#   blend_predictions()
# Instead I can just use glmnet manually to figure out the best blend,
# which is what blend_predictions does under the hood.
# cv.glmnet uses alpha = 1 (lasso) by default, but here a DVC parameter
# controls the blend: alpha = 0 is ridge, 1 is lasso, anything in between is elastic net
alpha <- params$ensemble$alpha
cv <- cv.glmnet(
  as.matrix(housing_model_stack[, -1]),
  as.matrix(housing_model_stack[, 1]),
  alpha = alpha
)
cv$lambda.1se
best_mix <- which(cv$lambda == cv$lambda.1se)
weights <- cv$glmnet.fit$beta[,best_mix]
intercept <- cv$glmnet.fit$a0[best_mix]
print('Ensemble Weights:')
print(weights)
```
```{r finalize, message=FALSE, warning=FALSE}
# but now I have to do quite a bit of work to finalize the workflows
# in order to have a prediction function:
# we finalize the models that were fit on a resampled subset (during cv)
# so they are fit on the entire training dataset
# first pull out the models used by the stack
# this is a helpful R tip if there isn't a getter for what you need
all_info <- attributes(housing_model_stack)
# finalize the linear models
# this pulls the tuned parameters for the models used to
# generate the data stack
lin_reg_params <- all_info$model_metrics$lin_res %>%
  select(penalty, mixture)
# this finalizes the model workflow, fitting the cv candidate models
# with the whole training data
lin_reg_wflow_fit <- map(1:nrow(lin_reg_params), function(i) {
  finalize_workflow(lin_reg_wflow, lin_reg_params[i, ]) %>%
    fit(housing_train)
})
# now finalize the tree model; again, no 'tuning' here
# so there is no need to go back to the stack
tree_wflow_fit <-
  finalize_workflow(tree_wflow, list(trees = trees)) %>%
  fit(housing_train)
# with the fitted workflows, and the glmnet ensemble coefficients
# we can create the final "predict" function for our "ensemble"
ensemble_predict <- function(new_data) {
  lin_predictions <- map(lin_reg_wflow_fit, predict, new_data)
  # list to data frame
  lin_predictions_df <- do.call(cbind, lin_predictions)
  tree_predictions <- predict(tree_wflow_fit, new_data)
  # assemble in the right order (same as the stack used in the glm)
  x <- cbind(1, lin_predictions_df, tree_predictions)
  colnames(x) <- c("Intercept", all_info$names[-1])
  beta <- c(intercept, weights)
  ensemble_predictions <- as.matrix(x) %*% beta
  x$Intercept <- NULL
  x$ensemble <- as.numeric(ensemble_predictions)
  x
}
results <- ensemble_predict(housing_test)
```
Now that we have our ensemble, we can produce a few metrics that will be displayed in pull requests via DVC.
```{r metrics, message=FALSE, warning=FALSE}
get_metrics <- function(results, model_name) {
  tibble(
    model = model_name,
    rmse = rmse_vec(housing_test$SalePrice, results[, model_name]),
    rsq = rsq_vec(housing_test$SalePrice, results[, model_name])
  )
}
metrics <- map_df(colnames(results), ~get_metrics(results, .x))
metrics
```
```{r save-table}
# save metrics to show on the DVC comment
# gt's gtsave() could be used to save a richer formatted table
# as a PNG, but it requires webshot
# gt(metrics) %>%
#   gtsave("metrics.png")
kableExtra::kable(metrics) %>%
  write_file('metrics_table.html')
```
```{r save-metrics}
# write out metrics for DVC
metrics %>%
  arrange(desc(rsq)) %>%
  head(n = 1) %>%
  jsonlite::unbox() %>%
  jsonlite::write_json('metrics.json')
```
Well, in this case the ensemble did better than the models alone! Can we explain what this resulting ensemble model is really using to make predictions? Is the minor improvement worth the increased complexity?
## Model Analysis?
```{r errors}
results$truth <- housing_test$SalePrice
p <- ggplot(results) +
  geom_point(aes(truth, ensemble)) +
  theme_minimal() +
  labs(
    title = "Errors",
    x = "Actual Home Values",
    y = "Predicted Home Values"
  ) +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::comma) +
  geom_abline(slope = 1, intercept = 0)
p
ggsave("errors.png", p)
```
```{r errors2}
results$err <- (results$truth - results$ensemble)^2
ggplot(results) +
  geom_point(aes(truth, err)) +
  theme_minimal() +
  labs(
    title = "Errors",
    x = "Actual Home Values",
    y = "Squared Error"
  ) +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::comma) +
  geom_smooth(aes(truth, err))
```
These errors suggest the model predicts worst at the high end, for expensive homes, which is also where there is the least data; it's possible our test set "drifts" from the training data here, but further investigation would be required.
We can look at the ensemble contributors:
```{r ensemble-parts}
tibble(
  models = names(weights),
  relative_contribution = weights
) %>%
  ggplot() +
  geom_col(aes(reorder(models, relative_contribution), relative_contribution)) +
  coord_flip() +
  theme_minimal() +
  labs(
    x = "Models",
    y = "Relative Contribution"
  ) -> p
p
ggsave("ensemble.png", p)
```
So there are two models contributing: mostly the tree model, along with one linear model. For each we can get some sense of what is contributing:
```{r linear-importance}
# this code is a little brittle; ideally it should pull the
# linear model with the biggest relative weight, not just the
# first one (see the sketch after this chunk)
lin_reg_wflow_fit[[1]] %>%
  pull_workflow_fit() %>%
  broom::tidy() %>%
  filter(term != "(Intercept)") %>%
  ggplot(aes(reorder(term, estimate), estimate)) +
  geom_col() +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "20% of Ensemble is...",
    x = "",
    y = "Contribution"
  )
```
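One way to make that less brittle is to index by the ensemble weights themselves. A sketch, assuming the order of `lin_reg_wflow_fit` matches the order of the linear candidate columns in `weights`:

```r
# pick the finalized linear workflow with the largest absolute ensemble weight
lin_weights <- weights[seq_along(lin_reg_wflow_fit)]
top_lin <- which.max(abs(lin_weights))
lin_reg_wflow_fit[[top_lin]] %>%
  pull_workflow_fit() %>%
  broom::tidy()
```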
And the things that are important to the tree model:
```{r tree-importance}
tree_wflow_fit %>%
  pull_workflow_fit() %>%
  vip() -> p
p
ggsave('tree-importance.png', p)
```