first pass at the post-processing container #1
Conversation
R/container.R
Outdated
#'
#' container()
#' @export
container <- function(mode = "unknown", type = "unknown", outcome = character(0),
We’ll probably need to validate the type of model as well as its “species.” Specifically, we might need to differentiate between binary and multiclass classification.
We can do this via an argument/attribute or by adding a more specific class to the container.
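If we go the argument/attribute route, a minimal sketch of inferring the sub-type from the outcome at fit time; `set_container_type()` does appear later in this PR, but the body below is a guess rather than the actual code:

set_container_type <- function(object, outcome) {
  if (is.factor(outcome)) {
    object$type <- if (nlevels(outcome) == 2) "binary" else "multiclass"
  } else if (is.numeric(outcome)) {
    object$type <- "regression"
  } else {
    object$type <- "unknown"
  }
  object
}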
I'd prefer that users need not specify either of `mode` or `type` when using containers with a workflow. It probably makes sense for `container::fit.container()` to continue requiring these be set before running `fit.container()`; `workflows::fit.action_container()` could set `mode` and `type` at `workflows::fit.workflow()` time before dispatching off to `fit.container()`.
Yes, this is what I thought that we would do.
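A purely illustrative sketch of that hand-off; the method name comes from the comment above, but the action structure and accessors below are my assumptions about workflows internals, not the real implementation:

# hypothetical: called from workflows::fit.workflow(), where the model spec's
# mode is already known
fit.action_container <- function(object, workflow, data, ...) {
  post <- object$container                                     # assumed slot name
  post$mode <- workflows::extract_spec_parsnip(workflow)$mode  # e.g. "classification"
  fit(post, data, ...)
}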
R/container.R
Outdated
container <- function(mode = "unknown", type = "unknown", outcome = character(0),
                      estimate = character(0), probabilities = character(0),
                      time = character(0), call = rlang::current_env()) {
  dat <-
Another open question is: when do we have the users specify the relevant names of the expected columns? Within a workflow, we will set these, but there are a few differences between recipes and what we are doing here:
- Recipe columns could be anything, of any type or number. For predictions, we basically know how many columns should be there and, for tidymodels, we know their names.
- We really don't need the column names until fit time. The arguments are here because we'll need to keep them in the container at some point.
- The `fit` method is where these values are currently handled.
related: #1 (comment)
R/container.R
Outdated
#' @export
fit.container <- function(object, .data, outcome, estimate, probabilities = c(),
                          time = c(), call = rlang::current_env(), ...) {
There are placeholders for survival models but we won't have any methods for this for a while.
`time` ought to be `eval_time`, no?
No, it is a placeholder for the predicted time. We probably need `.eval_time` too.
# ------------------------------------------------------------------------------

#' @export
fit.container <- function(object, .data, outcome, estimate, probabilities = c(),
How does `.data` as an argument grab you?
I'd vote we just use `data` for consistency with `fit.model_spec` and `fit.workflow`.
`data` does kind of stink in that it surfaces `data()` when not supplied, but I think consistency here outweighs that oddity.
Yeah. I'm trying to avoid error messages about closures not being subsettable.
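(For anyone skimming later, the oddity is that the name `data` always resolves to something, namely the `utils::data()` function, so mistakes surface as the closure error instead of a clean "object 'data' not found". A tiny illustration of my own, not from this PR:)

# no data frame named `data` exists, yet the name still resolves,
# because utils::data() is on the search path
exists("data")
#> [1] TRUE
data$x
#> Error in data$x : object of type 'closure' is not subsettable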
R/container.R
Outdated
  .data <- .data[, names(.data) %in% unlist(dat)]
  .data <- tibble::as_tibble(.data)
  ptype <- .data[0, ]
Just a note for future us: I considered adding methods for applicability scores. We'd need to update the `ptype` to include the columns of the training set predictors (the data would be an argument to such functions).
R/equivocal_zone.R
Outdated
#' @param value A numeric value (between zero and 1/2) or [hardhat::tune()]. The
#' value is the size of the buffer around the threshold.
#' @param threshold A numeric value (between zero and one) or [hardhat::tune()].
#' @examples
I'll need to add a Details section to explain this as well as what to look out for. The new class predictions have a different class (similar to a factor) that, to date, works with our tidymodels functions.
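In the meantime, a short illustration of that class from my own session (I believe this mirrors what the `@examples` use); predictions inside the buffer print as equivocal:

library(probably)
library(modeldata)

cls_pred <- make_two_class_pred(
  two_class_example$Class1,
  levels = levels(two_class_example$truth),
  threshold = 0.5,
  buffer = 0.1
)
head(cls_pred)
# probabilities within 0.5 ± 0.1 are reported as [EQ]; yardstick treats them as NA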
R/equivocal_zone.R
Outdated
#'
#' predict(post_res, two_class_example)
#' @export
adjust_equivocal_zone <- function(x, value = 0.1, threshold = 1 / 2) {
Users might include a step before this to adjust the threshold, so we should maybe add some code to inherit a previous threshold if one exists in `x`.
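One way that inheritance could look; the `operations`/`arguments` structure mirrors this PR, but the helper itself is hypothetical:

inherit_threshold <- function(x, default = 1 / 2) {
  # reuse the most recently set threshold from earlier operations, if any
  prev <- purrr::keep(x$operations, ~ !is.null(.x$arguments$threshold))
  if (length(prev) == 0) {
    return(default)
  }
  prev[[length(prev)]]$arguments$threshold
}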
R/equivocal_zone.R
Outdated
"equivocal_zone", | ||
inputs = "probability", | ||
outputs = "class", | ||
arguments = list(value = value, threshold = threshold), |
I wish we had structured recipe steps with containers (ha) for inputs and computed results. Let's consider these names in case we do the same for recipes 2E (if that happens).
R/container.R
Outdated
  }

  # validate operation order and check duplicates
  validate_oper_order(operations, mode, call)
This gets invoked when operations are added so we can validate them as they go.
R/equivocal_zone.R
Outdated
  cls_pred <- probably::make_two_class_pred(new_data[[prob_nm]], levels = lvls,
                                            buffer = object$arguments$value,
                                            threshold = object$arguments$threshold)
  new_data[[est_nm]] <- cls_pred # todo convert to factor?
yardstick will convert equivocals to NA so we probably don't have to do it here.
R/equivocal_zone.R
Outdated
tunable.equivocal_zone <- function(x, ...) {
  tibble::tibble(
    name = "buffer",
    call_info = list(list(pkg = "dials", fun = "buffer")),
A function to be created later (as are many others in the package).
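For what it's worth, a guess at what that dials function could eventually look like; the range and label below are my assumptions:

buffer <- function(range = c(0, 0.5), trans = NULL) {
  dials::new_quant_param(
    type = "double",
    range = range,
    inclusive = c(TRUE, TRUE),
    trans = trans,
    label = c(buffer = "Equivocal Zone Buffer")
  )
}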
R/misc.R
Outdated
@@ -0,0 +1,17 @@
is_tune <- function(x) { |
This should belong in hardhat.
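Roughly what such a helper checks, assuming it tests whether a captured argument is a call to `tune()`; a sketch only, the actual body in R/misc.R may differ:

is_tune <- function(x) {
  if (!rlang::is_call(x)) {
    return(FALSE)
  }
  identical(rlang::call_name(x), "tune")
}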
R/misc.R
Outdated
# for operations with no tunable parameters

no_param <-
The same as the one in recipes.
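For reference, that's a zero-row tibble with the standard `tunable()` columns, roughly (paraphrased rather than copied from recipes):

no_param <-
  tibble::tibble(
    name = character(0),
    call_info = list(),
    source = character(0),
    component = character(0),
    component_id = character(0)
  )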
Renames files that define an `adjust_*()` function and associated method as `adjust_function_name.R`. Helps to differentiate this relatively homogeneous file type from other kinds of source files.
Thanks for all of the explanatory code comments and review comments! Really helpful in familiarizing.
Plenty to do, for sure, but from my perspective, this is a really solid starting place and I'd vote that we go ahead and merge and iterate from here once those who want to review have a chance to. My changes above are mostly related to tidy styling and code navigability, holding off on additional features / testing / documentation / more substantive refactors for future PRs.
Going to start poking at workflows—I may end up returning here as I do so, but will generally not have local edits unless I've pinged otherwise.
R/container.R
Outdated
  # ------------------------------------------------------------------------------
  # set columns via tidyselect

  dat <- list()
`dat` is somewhat confusing when a `.data` lives in the same environment. Maybe `columns`, as it's eventually named in the object?
👍 ab07580
R/container.R
Outdated
#' @param type The model sub-type. Possible values are `"unknown"`, `"regression"`,
#' `"binary"`, or `"multiclass"`.
#' @param outcome The name of the outcome variable.
#' @param estimate The name of the point estimate (e.g. predicted class)
Tripped up on this argument name. This is (likely) `.pred_class` or `.pred`?
Yeah. I added more context in this man file. We use it since it is the name of the analogous argument in yardstick metric functions.
R/container.R
Outdated
  }
  ptype <- .data[0, ]

  object <- set_container_type(object, .data[[dat$outcome]])
We should look to `parsnip::check_outcome()` for lessons learned on this kind of checking.
That uses the model spec and, if someone is using container as a stand-alone, they would only be able to specify the mode via the main container argument.
R/container.R
Outdated
#' @param mode The model's mode, one of `"unknown"`, `"classification"`, or
#' `"regression"`. Modes of `"censored regression"` are not currently supported.
I'm fine with being corrected. Would it be worth rethinking `"unknown"` as a possible mode, as in just allowing mode to be `"classification"` or `"regression"`?
I think we can do that (and easily un-do it later). This was meant to be analogous to parsnip's unknown mode. However, I think that this situation is different and we can remove it.
#' `"regression"`. Modes of `"censored regression"` are not currently supported. | ||
#' @param type The model sub-type. Possible values are `"unknown"`, `"regression"`, | ||
#' `"binary"`, or `"multiclass"`. | ||
#' @param outcome The name of the outcome variable. |
Same comment for all params: right now the code specifies that the names of the variables should be given as character vectors. But I feel like we are moving away from strings and toward having everything use {tidypredict} where possible.
If we decide to stick with character vectors, then the documentation should reflect that.
Did you mean {tidyselect} instead of {tidypredict}?
We do use tidyselect in `fit.container()` and the character defaults are there for that. I can make the man page reflect that it will be a tidyselect selector.
yes, tidyselect
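For reference, a minimal sketch of that tidyselect capture at fit time; the function below paraphrases the idea and is not the PR's actual code:

resolve_columns <- function(.data, outcome, estimate) {
  list(
    outcome  = names(tidyselect::eval_select(rlang::enquo(outcome), .data)),
    estimate = names(tidyselect::eval_select(rlang::enquo(estimate), .data))
  )
}

resolve_columns(mtcars, outcome = mpg, estimate = disp)
#> $outcome
#> [1] "mpg"
#>
#> $estimate
#> [1] "disp"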
  new_container(
    mode,
    type,
Seems like we don't need `type` to be passed both in the `type` argument and as part of `columns`. From the looks of it, I think we can exclude it from the `columns` argument.
  if (!tibble::is_tibble(new_data)) {
    new_data <- tibble::as_tibble(new_data)
  }
I would prefer if we didn't have this, and rather made sure that each operation returned a tibble
  if (mode == "classification") {
    check_classification_order(orderings, call)
  } else {
Let's future-proof this by changing `} else {` to `} else if (mode == "regression") {`.
    output_numeric = purrr::map_lgl(ops, ~ grepl("numeric", .x$outputs)),
    output_prob = purrr::map_lgl(ops, ~ grepl("probability", .x$outputs)),
    output_class = purrr::map_lgl(ops, ~ grepl("class", .x$outputs)),
    output_all = purrr::map_lgl(ops, ~ grepl("everything", .x$outputs))
Instead of doing `grepl("numeric", .x$outputs)` here, and `which(x$output_numeric)` later, you could do `match("numeric", .x$outputs)` here and drop the `which` later.
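The general idea, on a toy vector of my own:

outputs <- c("probability", "class", "numeric")

which(grepl("numeric", outputs))
#> [1] 3
match("numeric", outputs)
#> [1] 3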
Co-authored-by: Emil Hvitfeldt <[email protected]>
ideally, there'd be a more canonical place to link to other than the help-files of specific calibration approaches
#' @param calibrator A pre-trained calibration method from the \pkg{probably}
#' package, such as [probably::cal_estimate_linear()].
#' @export
adjust_numeric_calibration <- function(x, calibrator) {
Currently, this function (and its probability analog) takes a trained `calibrator` and then doesn't do any actual fitting at `fit()` time. This means that:
- If users are not using a workflow, they need to specify the names of `truth` and `estimate` (and by different names!).
- If they're using a workflow, they now need to locate the names of the `truth` and `estimate` columns when they didn't use to need to.
- Users (and tune, downstream) need to train the calibrator outside of the workflow.
With the current draft, this looks like:
library(modeldata)
library(probably)
#>
#> Attaching package: 'probably'
#> The following objects are masked from 'package:base':
#>
#> as.factor, as.ordered
library(tibble)
library(container)
# create example data
set.seed(1)
dat <- tibble(y = rnorm(100), y_pred = y/2 + rnorm(100))
dat
#> # A tibble: 100 × 2
#> y y_pred
#> <dbl> <dbl>
#> 1 -0.626 -0.934
#> 2 0.184 0.134
#> 3 -0.836 -1.33
#> 4 1.60 0.956
#> 5 0.330 -0.490
#> 6 -0.820 1.36
#> 7 0.487 0.960
#> 8 0.738 1.28
#> 9 0.576 0.672
#> 10 -0.305 1.53
#> # ℹ 90 more rows
# calibrate numeric predictions
reg_cal <- cal_estimate_linear(dat, truth = y, estimate = y_pred)
# specify calibration
reg_ctr <-
container(mode = "regression") %>%
adjust_numeric_calibration(reg_cal)
# "train" container
reg_ctr_trained <- fit(reg_ctr, dat, outcome = y, estimate = y_pred)
predict(reg_ctr, dat)
#> # A tibble: 100 × 2
#> y y_pred
#> <dbl> <dbl>
#> 1 -0.626 -0.257
#> 2 0.184 0.303
#> 3 -0.836 -0.512
#> 4 1.60 0.479
#> 5 0.330 0.0108
#> 6 -0.820 0.523
#> 7 0.487 0.480
#> 8 0.738 0.516
#> 9 0.576 0.438
#> 10 -0.305 0.536
#> # ℹ 90 more rows
Created on 2024-04-24 with reprex v2.1.0
I propose that we supply some interface in place of `calibrator` that just specifies the kind of calibration that would happen, e.g. `type = "linear", ...` or `fn = cal_estimate_linear, ...`, and actually fit the calibrator at `fit()` time. This would allow workflows users to never have to specify column names and container users to only specify column names once. This would look something like:
library(modeldata)
library(probably)
#>
#> Attaching package: 'probably'
#> The following objects are masked from 'package:base':
#>
#> as.factor, as.ordered
library(tibble)
library(container)
# create example data
set.seed(1)
dat <- tibble(y = rnorm(100), y_pred = y/2 + rnorm(100))
dat
#> # A tibble: 100 × 2
#> y y_pred
#> <dbl> <dbl>
#> 1 -0.626 -0.934
#> 2 0.184 0.134
#> 3 -0.836 -1.33
#> 4 1.60 0.956
#> 5 0.330 -0.490
#> 6 -0.820 1.36
#> 7 0.487 0.960
#> 8 0.738 1.28
#> 9 0.576 0.672
#> 10 -0.305 1.53
#> # ℹ 90 more rows
# specify calibration (no training has happened yet)
reg_ctr <-
container(mode = "regression") %>%
adjust_numeric_calibration(type = "linear") # or `fn = cal_estimate_linear`, etc
# actually train the calibration in the container
reg_ctr_trained <- fit(reg_ctr, dat, outcome = y, estimate = y_pred)
predict(reg_ctr, dat)
#> # A tibble: 100 × 2
#> y y_pred
#> <dbl> <dbl>
#> 1 -0.626 -0.257
#> 2 0.184 0.303
#> 3 -0.836 -0.512
#> 4 1.60 0.479
#> 5 0.330 0.0108
#> 6 -0.820 0.523
#> 7 0.487 0.480
#> 8 0.738 0.516
#> 9 0.576 0.438
#> 10 -0.305 0.536
#> # ℹ 90 more rows
Created on 2024-04-24 with reprex v2.1.0
The problem is: what data do you have to do that?
In the course of tuning, you only have the holdout predictions for that resample. If we had the whole set of out-of-sample predictions, we could train on that. Using re-predicted training data isn't a good option.
Another idea is to supply a `data` argument where people can pass in any predictions that they want. However, unless they have a specific validation set held out for this specific operation, we are back to creating those predictions outside of the workflow/tune process.
I'm not happy with what the process is now but I think it is what we can do (for now, at least).
I think that I'll write a blog post about this to outline my thoughts better (and we can refer to that).
The problem is: what data do you have to do that?
The current approach leans on users to figure that out; an approach like the one I've proposed allows us to take care of that for the user.
I think framing this in the context of tuning/resampling, like you've done, is helpful. When I think through the interface as implemented, I'm unsure how to imagine tuning parameters from the container. `tune_*()` takes in one workflow and resamples it. In the current approach, how is that workflow's postprocessor trained? The calibrator needs predictions, and based on your reply, we don't want those predictions to come from data that the model was trained on. So, do we take a three-way split, train/test split the validation set, train some model on the training portion of that split, train the calibrator on the testing portion, and then pass that fixed calibrator to the workflow? When tuning, which of the resampled models is the one used to train the calibrator? E.g., as I understand it, this would be the resampling flow as implemented:
library(tidymodels)
library(container)
library(probably)
penguins
# usual three-way flow
peng_split <- initial_validation_split(penguins)
peng_train <- training(peng_split)
peng_test <- testing(peng_split)
peng_val <- validation(peng_split)
peng_res <- vfold_cv(peng_train)
bt <- boost_tree("regression", trees = 3, learn_rate = tune())
# don't want to train the calibrator on re-predicted values
peng_split_cal <- initial_split(peng_val)
peng_train_cal <- training(peng_split_cal)
peng_test_cal <- testing(peng_split_cal)
# which of the models to be proposed trains the calibrator?
# bt_finalized <- finalize(bt, ...)
bt_cal_train <- fit(bt_finalized, body_mass_g ~ ., peng_train_cal)
bt_cal_preds <- augment(bt_cal_train, peng_test_cal)
bt_cal <- cal_estimate_linear(bt_cal_preds, ...)
bt_container <- container() %>% adjust_numeric_calibration(bt_cal)
wf <- workflow(body_mass_g ~ ., bt, bt_container)
tune_grid(wf, peng_res)
If we just allow specifying the strategy used to train a post-processor rather than the trained post-processor, the post-processor is actually resampled and the user interface feels much more like tidymodels. Also, the onus of doing the right thing w.r.t. resampling is on us rather than the user:
library(tidymodels)
library(container)
library(probably)
penguins
# usual two-way flow
peng_split <- initial_split(penguins)
peng_train <- training(peng_split)
peng_test <- testing(peng_split)
peng_res <- vfold_cv(peng_train)
bt <- boost_tree("regression", trees = 3, learn_rate = tune())
bt_container <- container() %>% adjust_numeric_calibration(type = "linear")
wf <- workflow(body_mass_g ~ ., bt, bt_container)
tune_grid(wf, peng_res)
In the course of tuning, you only have the holdout predictions for that resample. If we had the whole set of out-of-sample predictions, we could train on that. Using re-predicted training data isn't a good option.
From my understanding of our conversation earlier in the week, you've observed that training calibrators on re-predicted data is typically not a problem practically.
That said, from a technical standpoint, this feels quite solvable in tune:
- Take a computational performance penalty: figure out a reasonably memory-efficient way to train calibrators on unseen data by exchanging out-of-sample predictions after the initial tune grid loop.
- Take a predictive performance penalty: take an "internal" train/test split of the analysis (training portion of the resample) set for each resample when tuning post-processors and train the model on the internal training portion, post-processor on the internal testing portion.
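To make the second option concrete, a rough sketch of that internal split for a single resample using rsample; every object name here is illustrative only:

library(rsample)

set.seed(1)
resample <- vfold_cv(mtcars, v = 5)$splits[[1]]
analysis_set <- analysis(resample)    # the resample's "training" portion

inner <- initial_split(analysis_set, prop = 3 / 4)
model_data <- training(inner)         # fit the model here
post_data <- testing(inner)           # fit the post-processor (calibrator) here

holdout <- assessment(resample)       # assess the composed workflow here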
We will be changing that name 😄
Adds a framework for specifying, fitting, and "predicting" post-processing adjustments.