
first pass at the post-processing container #1

Merged: 39 commits merged into main on Apr 25, 2024

Conversation

@topepo (Member) commented Apr 17, 2024

We will be changing that name 😄

Adds a framework for specifying, fitting, and "predicting" post-processing adjustments.

R/container.R Outdated
#'
#' container()
#' @export
container <- function(mode = "unknown", type = "unknown", outcome = character(0),
Member Author:

We’ll probably need to validate the type of the model as well as its “species.” Specifically, we might need to differentiate between binary and multiclass classification.

We can do this via an argument/attribute or by adding a more specific class to the container.
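
As a rough sketch of that idea (the helper name mirrors the set_container_type() that shows up later in this diff, but the body here is only a guess):

set_container_type <- function(object, outcome) {
  if (is.factor(outcome)) {
    object$type <- if (nlevels(outcome) == 2) "binary" else "multiclass"
  } else if (is.numeric(outcome)) {
    object$type <- "regression"
  }
  # the "more specific class" alternative would be something like:
  # class(object) <- c(paste0("container_", object$type), class(object))
  object
}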

Contributor:

I'd prefer that users not need to specify either mode or type when using containers with a workflow. It probably makes sense for container::fit.container() to keep requiring that these be set before fitting; workflows::fit.action_container() could set mode and type at workflows::fit.workflow() time before dispatching off to fit.container().
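
A loose sketch of what that hand-off could look like on the workflows side (the method signature and the $container field are assumptions; only extract_spec_parsnip() is existing workflows API):

fit.action_container <- function(object, workflow, data, ...) {
  container <- object$container
  spec <- workflows::extract_spec_parsnip(workflow)
  # fill in what the user did not have to specify up front
  container$mode <- spec$mode
  # the sub-type (binary vs. multiclass) could be inferred from the outcome here
  fit(container, data, ...)
}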

Member Author:

Yes, this is what I thought that we would do.

R/container.R Outdated
container <- function(mode = "unknown", type = "unknown", outcome = character(0),
estimate = character(0), probabilities = character(0),
time = character(0), call = rlang::current_env()) {
dat <-
Member Author:

Another open question: when do we have users specify the relevant names of the expected columns? Within a workflow, we will set these, but there are a few differences between recipes and what we are doing here:

  • Recipe columns could be anything, in any number and of any type. For predictions, we basically know how many columns should be there and, for tidymodels, we know their names.

  • We really don’t need the column names until fit time. The arguments are here because we’ll need to keep them in the container at some point.

  • The fit method is where these values are currently handled.

Contributor:

related: #1 (comment)

R/container.R Outdated

#' @export
fit.container <- function(object, .data, outcome, estimate, probabilities = c(),
time = c(), call = rlang::current_env(), ...) {
Member Author:

There are placeholders for survival models but we won't have any methods for this for a while.

Contributor:

time ought to be eval_time, no?

Member Author:

No, it is a placeholder for the predicted time. We probably need .eval_time too.

Member Author:

#3

# ------------------------------------------------------------------------------

#' @export
fit.container <- function(object, .data, outcome, estimate, probabilities = c(),
Member Author:

How does .data as an argument grab you?

Contributor:

I'd vote we just use data for consistency with fit.model_spec and fit.workflow.

Contributor:

data does kind of stink in that it surfaces data() when not supplied, but I think consistency here outweighs that oddity.

Member Author:

Yeah. I'm trying to avoid error messages about unsettable closures.

R/container.R Outdated

.data <- .data[, names(.data) %in% unlist(dat)]
.data <- tibble::as_tibble(.data)
ptype <- .data[0,]
Member Author:

Just a note for future us: I considered adding methods for applicability scores. We'd need to update the ptype to include the columns of the training set predictors (the data would be an argument to such functions).

Member Author:

#4

#' @param value A numeric value (between zero and 1/2) or [hardhat::tune()]. The
#' value is the size of the buffer around the threshold.
#' @param threshold A numeric value (between zero and one) or [hardhat::tune()].
#' @examples
Member Author:

I'll need to add a Details section to explain this as well as what to look out for. The new class predictions have a different class (similar to a factor) that, to date, works with our tidymodels functions.

Member Author:

#5

#'
#' predict(post_res, two_class_example)
#' @export
adjust_equivocal_zone <- function(x, value = 0.1, threshold = 1 / 2) {
Member Author:

Users might include a step before this to adjust the threshold, so we should maybe add some code to inherit a previous threshold if one exists in x.
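
The inheritance could be as simple as scanning the operations already in x; a sketch, where the operation field names are assumptions based on this diff:

inherit_threshold <- function(x, default = 1 / 2) {
  # walk backwards through previously added operations and reuse the most
  # recent threshold, if any
  for (op in rev(x$operations)) {
    if (!is.null(op$arguments$threshold)) {
      return(op$arguments$threshold)
    }
  }
  default
}

# then, in adjust_equivocal_zone(), something like:
# if (missing(threshold)) threshold <- inherit_threshold(x)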

Member Author:

#6

"equivocal_zone",
inputs = "probability",
outputs = "class",
arguments = list(value = value, threshold = threshold),
Member Author:

I wish we had structured recipe steps with containers (ha) for inputs and computed results. Let's consider these names in case we do the same for recipes 2E (if that happens).

R/container.R Outdated
}

# validate operation order and check duplicates
validate_oper_order(operations, mode, call)
Member Author:

This gets invoked when operations are added so we can validate them as they go.

R/equivocal_zone.R (resolved)
cls_pred <- probably::make_two_class_pred(new_data[[prob_nm]], levels = lvls,
buffer = object$arguments$value,
threshold = object$arguments$threshold)
new_data[[ est_nm ]] <- cls_pred # todo convert to factor?
Member Author:

yardstick will convert equivocals to NA so we probably don't have to do it here.
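
A quick illustration of that behavior with probably (the values and level names here are made up):

library(probably)

probs <- c(0.02, 0.48, 0.52, 0.98)
pred  <- make_two_class_pred(probs, levels = c("yes", "no"),
                             threshold = 0.5, buffer = 0.05)

# 0.48 and 0.52 fall inside the buffer and are marked equivocal; coercing to a
# factor turns the equivocal values into NA, which is how yardstick treats them
as.factor(pred)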

tunable.equivocal_zone <- function(x, ...) {
tibble::tibble(
name = "buffer",
call_info = list(list(pkg = "dials", fun = "buffer")),
Member Author:

A function to be created later (as are many others in the package).
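
Presumably it would look like the other dials parameter constructors, along these lines (a sketch only; dials::buffer() does not exist at the time of this PR):

buffer <- function(range = c(0, 0.5), trans = NULL) {
  dials::new_quant_param(
    type = "double",
    range = range,
    inclusive = c(TRUE, TRUE),
    trans = trans,
    label = c(buffer = "Buffer")
  )
}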

Member Author:

#7

R/misc.R Outdated
@@ -0,0 +1,17 @@
is_tune <- function(x) {
Member Author:

This should belong in hardhat.
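
For context, the helper only needs to detect an unevaluated tune() call; a minimal sketch (not necessarily the exact body in this PR):

is_tune <- function(x) {
  # hardhat::tune() returns a call object named `tune`
  is.call(x) && identical(rlang::call_name(x), "tune")
}

is_tune(hardhat::tune())  # TRUE
is_tune(0.1)              # FALSE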

Member Author:

#8

R/misc.R Outdated

# for operations with no tunable parameters

no_param <-
Member Author:

The same as the one in recipes.

renames files that define an `adjust_*()` function and associated method as `adjust_function_name.R`. helps to differentiate this relatively homogeneous file type from other kinds of source files.
@simonpcouch (Contributor) left a comment:

Thanks for all of the explanatory code comments and review comments! Really helpful for getting familiar with the code.

Plenty to do, for sure, but from my perspective, this is a really solid starting place and I'd vote that we go ahead and merge and iterate from here once those who want to review have a chance to. My changes above are mostly related to tidy styling and code navigability, holding off on additional features / testing / documentation / more substantive refactors for future PRs.

Going to start poking at workflows—I may end up returning here as I do so, but will generally not have local edits unless I've pinged otherwise.

R/adjust-numeric-calibration.R (resolved)
R/container.R Outdated
# ------------------------------------------------------------------------------
# set columns via tidyselect

dat <- list()
Contributor:

dat is somewhat confusing when a .data lives in the same environment. Maybe columns, as it's eventually named in the object?

Member Author:

👍 ab07580

R/container.R Outdated
#' @param type The model sub-type. Possible values are `"unknown"`, `"regression"`,
#' `"binary"`, or `"multiclass"`.
#' @param outcome The name of the outcome variable.
#' @param estimate The name of the point estimate (e.g. predicted class)
@simonpcouch (Contributor) commented Apr 22, 2024:

Tripped up on this argument name. This is (likely) .pred_class or .pred?

@topepo (Member Author) commented Apr 25, 2024:

Yeah. I added more context in this man file. We use it since it is the name of the analogous argument in yardstick metric functions.

R/container.R Outdated
}
ptype <- .data[0, ]

object <- set_container_type(object, .data[[dat$outcome]])
Contributor:

We should look to parsnip::check_outcome() for lessons learned on this kind of checking.

Member Author:

That uses the model spec and, if someone is using container as a stand-alone, they would only be able to specify the mode via the main container argument.
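
For comparison, a stand-alone analog keyed off the container's mode might look like the following (the function name and messages are hypothetical):

check_container_outcome <- function(outcome, mode, call = rlang::caller_env()) {
  if (mode == "classification" && !is.factor(outcome)) {
    cli::cli_abort("A factor outcome is required for classification.", call = call)
  }
  if (mode == "regression" && !is.numeric(outcome)) {
    cli::cli_abort("A numeric outcome is required for regression.", call = call)
  }
  invisible(outcome)
}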

R/container.R (resolved threads)
R/container.R Outdated
Comment on lines 3 to 4
#' @param mode The model's mode, one of `"unknown"`, `"classification"`, or
#' `"regression"`. Modes of `"censored regression"` are not currently supported.
Member:

I'm fine with being corrected. Would it be worth rethinking "unknown" as a possible mode, as in just allowing mode to be "classification" or "regression"?

Member Author:

I think we can do that (and easily un-do it later). This was meant to be analogous to parsnip's unknown mode. However, I think that this situation is different and we can remove it.


#' `"regression"`. Modes of `"censored regression"` are not currently supported.
#' @param type The model sub-type. Possible values are `"unknown"`, `"regression"`,
#' `"binary"`, or `"multiclass"`.
#' @param outcome The name of the outcome variable.
Member:

Same comment for all params:

Right now the code specifies that the names of the variables should be given as character vectors. But I feel like we are moving away from strings and toward having everything use {tidypredict} wherever possible.

If we decide to stick with character vectors, then the documentation should reflect that.

Member Author:

Did you mean {tidyselect} instead of {tidypredict}?

We do use tidyselect in fit.container() and the character defaults are there for that. I can make the man page reflect that it will be a tidyselect selector.
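
For the record, the tidyselect pattern in question is roughly this (the helper name is made up; the actual argument handling in fit.container() may differ):

columns_from_selection <- function(expr, data) {
  # resolve bare names, strings, or selector expressions into column names
  names(tidyselect::eval_select(expr, data = data))
}

# e.g., inside fit.container():
# outcome_nm  <- columns_from_selection(rlang::enquo(outcome), .data)
# estimate_nm <- columns_from_selection(rlang::enquo(estimate), .data)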

Member:

yes, tidyselect


new_container(
mode,
type,
Member:

Seems like we don't need type to be passed both as the type argument and as part of columns. From the looks of it, I think we can exclude it from the columns argument.

Comment on lines +144 to +146
if (!tibble::is_tibble(new_data)) {
new_data <- tibble::as_tibble(new_data)
}
Member:

I would prefer that we didn't have this and instead made sure that each operation returns a tibble.


if (mode == "classification") {
check_classification_order(orderings, call)
} else {
Member:

Let's future-proof this with:

Suggested change:
-  } else {
+  } else if (mode == "regression") {

Comment on lines +6 to +9
output_numeric = purrr::map_lgl(ops, ~ grepl("numeric", .x$outputs)),
output_prob = purrr::map_lgl(ops, ~ grepl("probability", .x$outputs)),
output_class = purrr::map_lgl(ops, ~ grepl("class", .x$outputs)),
output_all = purrr::map_lgl(ops, ~ grepl("everything", .x$outputs))
Member:

Instead of doing grepl("numeric", .x$outputs) here and which(x$output_numeric) later, you could do match("numeric", .x$outputs) here and drop the which later.

R/validation-rules.R (resolved)
R/adjust-equivocal-zone.R (resolved)
simonpcouch and others added 4 commits April 24, 2024 12:11
Co-authored-by: Emil Hvitfeldt <[email protected]>
ideally, there'd be a more canonical place to link to other than the help-files of specific calibration approaches
#' @param calibrator A pre-trained calibration method from the \pkg{probably}
#' package, such as [probably::cal_estimate_linear()].
#' @export
adjust_numeric_calibration <- function(x, calibrator) {
@simonpcouch (Contributor) commented Apr 24, 2024:

Currently, this function (and its probability analog) takes a trained calibrator and then doesn't do any actual fitting at fit() time. This means that:

  • If users are not using a workflow, they need to specify the names of truth and estimate (and by different names!).
  • If they're using a workflow, they now need to locate the names of the truth and estimate columns when they previously didn't need to.
  • Users (and tune, downstream) need to train the calibrator outside of the workflow.

With the current draft, this looks like:

library(modeldata)
library(probably)
#> 
#> Attaching package: 'probably'
#> The following objects are masked from 'package:base':
#> 
#>     as.factor, as.ordered
library(tibble)
library(container)

# create example data
set.seed(1)
dat <- tibble(y = rnorm(100), y_pred = y/2 + rnorm(100))

dat
#> # A tibble: 100 × 2
#>         y y_pred
#>     <dbl>  <dbl>
#>  1 -0.626 -0.934
#>  2  0.184  0.134
#>  3 -0.836 -1.33 
#>  4  1.60   0.956
#>  5  0.330 -0.490
#>  6 -0.820  1.36 
#>  7  0.487  0.960
#>  8  0.738  1.28 
#>  9  0.576  0.672
#> 10 -0.305  1.53 
#> # ℹ 90 more rows

# calibrate numeric predictions
reg_cal <- cal_estimate_linear(dat, truth = y, estimate = y_pred)

# specify calibration 
reg_ctr <-
  container(mode = "regression") %>%
  adjust_numeric_calibration(reg_cal)

# "train" container
reg_ctr_trained <- fit(reg_ctr, dat, outcome = y, estimate = y_pred)

predict(reg_ctr, dat)
#> # A tibble: 100 × 2
#>         y  y_pred
#>     <dbl>   <dbl>
#>  1 -0.626 -0.257 
#>  2  0.184  0.303 
#>  3 -0.836 -0.512 
#>  4  1.60   0.479 
#>  5  0.330  0.0108
#>  6 -0.820  0.523 
#>  7  0.487  0.480 
#>  8  0.738  0.516 
#>  9  0.576  0.438 
#> 10 -0.305  0.536 
#> # ℹ 90 more rows

Created on 2024-04-24 with reprex v2.1.0

I propose that we supply some interface in place of calibrator that just specifies the kind of calibration that should happen, e.g. type = "linear", ... or fn = cal_estimate_linear, ..., and actually fit the calibrator at fit() time. This would allow workflows users to never have to specify column names and container users to specify column names only once. This would look something like:

library(modeldata)
library(probably)
#> 
#> Attaching package: 'probably'
#> The following objects are masked from 'package:base':
#> 
#>     as.factor, as.ordered
library(tibble)
library(container)

# create example data
set.seed(1)
dat <- tibble(y = rnorm(100), y_pred = y/2 + rnorm(100))

dat
#> # A tibble: 100 × 2
#>         y y_pred
#>     <dbl>  <dbl>
#>  1 -0.626 -0.934
#>  2  0.184  0.134
#>  3 -0.836 -1.33 
#>  4  1.60   0.956
#>  5  0.330 -0.490
#>  6 -0.820  1.36 
#>  7  0.487  0.960
#>  8  0.738  1.28 
#>  9  0.576  0.672
#> 10 -0.305  1.53 
#> # ℹ 90 more rows

# specify calibration (no training has happened yet)
reg_ctr <-
  container(mode = "regression") %>%
  adjust_numeric_calibration(type = "linear") # or `fn = cal_estimate_linear`, etc

# actually train the calibration in the container
reg_ctr_trained <- fit(reg_ctr, dat, outcome = y, estimate = y_pred)

predict(reg_ctr, dat)
#> # A tibble: 100 × 2
#>         y  y_pred
#>     <dbl>   <dbl>
#>  1 -0.626 -0.257 
#>  2  0.184  0.303 
#>  3 -0.836 -0.512 
#>  4  1.60   0.479 
#>  5  0.330  0.0108
#>  6 -0.820  0.523 
#>  7  0.487  0.480 
#>  8  0.738  0.516 
#>  9  0.576  0.438 
#> 10 -0.305  0.536 
#> # ℹ 90 more rows

Created on 2024-04-24 with reprex v2.1.0

Member Author:

The problem is: what data do you have to do that?

In the course of tuning, you only have the holdout predictions for that resample. If we had the whole set of out-of-sample predictions, we could train on that. Using re-predicted training data isn't a good option.

Another idea is to supply a data argument where people can pass in any predictions that they want. However, unless they have a specific validation set held out for this specific operation, we are back to creating those predictions outside of the workflow/tune process.

I'm not happy with what the process is now but I think it is what we can do (for now, at least).

Member Author:

I think that I'll write a blog post about this to outline my thoughts better (and we can refer to that).

Contributor:

The problem is: what data do you have to do that?

The current approach leans on users to figure that out--an approach like the one I've proposed allows us to take care of that for the user.

I think framing this in the context of tuning/resampling, like you've done, is helpful. When I think through the interface as implemented, I'm unsure how tuning parameters from the container would work. tune_*() takes in one workflow and resamples it. In the current approach, how is that workflow's postprocessor trained? The calibrator needs predictions, and based on your reply, we don't want those predictions to come from data that the model was trained on. So, do we take a three-way split, split the validation set into train/test portions, train some model on the training portion of the validation split, train the calibrator on the testing portion, and then pass that fixed calibrator to the workflow? When tuning, which of the resampled models is the one used to train the calibrator? E.g., as I understand it, this would be the resampling flow as implemented:

library(tidymodels)
library(container)
library(probably)

penguins

# usual three-way flow
peng_split <- initial_validation_split(penguins)
peng_train <- training(peng_split)
peng_test <- testing(peng_split)
peng_val <- validation(peng_split)
peng_res <- vfold_cv(peng_train)

bt <- boost_tree("regression", trees = 3, learn_rate = tune())

# don't want to train the calibrator on re-predicted values
peng_split_cal <- initial_split(peng_val)
peng_train_cal <- training(peng_split_cal)
peng_test_cal <- testing(peng_split_cal)

# which of the models to be proposed trains the calibrator?
# bt_finalized <- finalize(bt, ...)
bt_cal_train <- fit(bt_finalized, body_mass_g ~ ., peng_train_cal)
bt_cal_preds <- augment(bt_cal_train, peng_test_cal)
bt_cal <- cal_estimate_linear(bt_cal_preds, ...)

bt_container <- container() %>% adjust_numeric_calibration(bt_cal)

wf <- workflow(body_mass_g ~ ., bt, bt_container)

tune_grid(wf, peng_res)

If we just allow specifying the strategy used to train a post-processor rather than the trained post-processor, the post-processor is actually resampled and the user interface feels much more like tidymodels. Also, the onus of doing the right thing w.r.t. resampling is on us rather than the user:

library(tidymodels)
library(container)
library(probably)

penguins

# usual two-way flow
peng_split <- initial_split(penguins)
peng_train <- training(peng_split)
peng_test <- testing(peng_split)
peng_res <- vfold_cv(peng_train)

bt <- boost_tree("regression", trees = 3, learn_rate = tune())
bt_container <- container() %>% adjust_numeric_calibration(type = "linear")

wf <- workflow(body_mass_g ~ ., bt, bt_container)

tune_grid(wf, peng_res)

In the course of tuning, you only have the holdout predictions for that resample. If we had the whole set of out-of-sample predictions, we could train on that. Using re-predicted training data isn't a good option.

From my understanding of our conversation earlier in the week, you've observed that training calibrators on re-predicted data is typically not a problem practically.

That said, from a technical standpoint, this feels quite solvable in tune:

  1. Take a computational performance penalty: figure out a reasonably memory-efficient way to train calibrators on unseen data by exchanging out-of-sample predictions after the initial tune grid loop.
  2. Take a predictive performance penalty: take an "internal" train/test split of the analysis (training portion of the resample) set for each resample when tuning post-processors, and train the model on the internal training portion and the post-processor on the internal testing portion (roughly sketched below).
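
For a single resample, option 2 could look like the following (the rsample/workflows calls are real; the flow itself is a hypothetical sketch of tune internals, reusing the penguins example above with the packages already loaded there):

resample_with_postproc <- function(split, wf_model, ctr) {
  analysis_set <- rsample::analysis(split)

  # internal split: the model sees one portion, the calibrator the other
  inner <- rsample::initial_split(analysis_set, prop = 3 / 4)
  model_fit <- fit(wf_model, rsample::training(inner))
  cal_data  <- augment(model_fit, rsample::testing(inner))

  # the post-processor is trained on predictions for data the model never saw
  ctr_fit <- fit(ctr, cal_data, outcome = body_mass_g, estimate = .pred)
  list(model = model_fit, container = ctr_fit)
}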

@simonpcouch simonpcouch merged commit 3874410 into main Apr 25, 2024
1 check passed
@simonpcouch simonpcouch deleted the max branch April 25, 2024 20:40