
first pass at the post-processing container #1

Merged: 39 commits merged into main on Apr 25, 2024

Conversation

@topepo (Member) commented Apr 17, 2024

We will be changing that name 😄

Adds a framework for specifying, fitting, and "predicting" post-processing adjustments.

R/container.R Outdated
#'
#' container()
#' @export
container <- function(mode = "unknown", type = "unknown", outcome = character(0),
Member Author:

We’ll probably need to validate the type of the model as well as its “species.” Specifically, we might need to differentiate between binary and multiclass classification.

We can do this via an argument/attribute or by adding a more specific class to the container.
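
As a rough sketch of that idea (the helper name mirrors the set_container_type() that shows up later in this diff, but the body here is only a guess):

set_container_type <- function(object, outcome) {
  if (is.factor(outcome)) {
    object$type <- if (nlevels(outcome) == 2) "binary" else "multiclass"
  } else if (is.numeric(outcome)) {
    object$type <- "regression"
  }
  # the "more specific class" alternative would be something like:
  # class(object) <- c(paste0("container_", object$type), class(object))
  object
}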

Contributor:

I'd prefer that users not need to specify either mode or type when using containers with a workflow. It probably makes sense for container::fit.container() to keep requiring that these be set before fitting; workflows::fit.action_container() could set mode and type at workflows::fit.workflow() time before dispatching off to fit.container().
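
A loose sketch of what that hand-off could look like on the workflows side (the method signature and the $container field are assumptions; only extract_spec_parsnip() is existing workflows API):

fit.action_container <- function(object, workflow, data, ...) {
  container <- object$container
  spec <- workflows::extract_spec_parsnip(workflow)
  # fill in what the user did not have to specify up front
  container$mode <- spec$mode
  # the sub-type (binary vs. multiclass) could be inferred from the outcome here
  fit(container, data, ...)
}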

Member Author:

Yes, this is what I thought that we would do.

R/container.R Outdated
container <- function(mode = "unknown", type = "unknown", outcome = character(0),
estimate = character(0), probabilities = character(0),
time = character(0), call = rlang::current_env()) {
dat <-
Member Author:

Another open question: when do we have users specify the relevant names of the expected columns? Within a workflow, we will set these, but there are a few differences between recipes and what we are doing here:

  • Recipe columns could be anything, in any number and of any type. For predictions, we basically know how many columns should be there and, for tidymodels, we know their names.

  • We really don’t need the column names until fit time. The arguments are here because we’ll need to keep them in the container at some point.

  • The fit method is where these values are currently handled.

Contributor:

related: #1 (comment)

R/container.R Outdated

#' @export
fit.container <- function(object, .data, outcome, estimate, probabilities = c(),
time = c(), call = rlang::current_env(), ...) {
Member Author:

There are placeholders for survival models but we won't have any methods for this for a while.

Contributor:

time ought to be eval_time, no?

Member Author:

No, it is a placeholder for the predicted time. We probably need .eval_time too.

Member Author:

#3

# ------------------------------------------------------------------------------

#' @export
fit.container <- function(object, .data, outcome, estimate, probabilities = c(),
Member Author:

How does .data as an argument grab you?

Contributor:

I'd vote we just use data for consistency with fit.model_spec and fit.workflow.

Contributor:

data does kind of stink in that it surfaces data() when not supplied, but I think consistency here outweighs that oddity.

Member Author:

Yeah. I'm trying to avoid error messages about unsettable closures.

R/container.R Outdated

.data <- .data[, names(.data) %in% unlist(dat)]
.data <- tibble::as_tibble(.data)
ptype <- .data[0,]
Member Author:

Just a note for future us: I considered adding methods for applicability scores. We'd need to update the ptype to include the columns of the training set predictors (the data would be an argument to such functions).

Member Author:

#4

#' @param value A numeric value (between zero and 1/2) or [hardhat::tune()]. The
#' value is the size of the buffer around the threshold.
#' @param threshold A numeric value (between zero and one) or [hardhat::tune()].
#' @examples
Member Author:

I'll need to add a Details section to explain this as well as what to look out for. The new class predictions have a different class (similar to a factor) that, to date, works with our tidymodels functions.

Member Author:

#5

#'
#' predict(post_res, two_class_example)
#' @export
adjust_equivocal_zone <- function(x, value = 0.1, threshold = 1 / 2) {
Member Author:

Users might include a step before this to adjust the threshold, so we should maybe add some code to inherit a previous threshold if one exists in x.
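
The inheritance could be as simple as scanning the operations already in x; a sketch, where the operation field names are assumptions based on this diff:

inherit_threshold <- function(x, default = 1 / 2) {
  # walk backwards through previously added operations and reuse the most
  # recent threshold, if any
  for (op in rev(x$operations)) {
    if (!is.null(op$arguments$threshold)) {
      return(op$arguments$threshold)
    }
  }
  default
}

# then, in adjust_equivocal_zone(), something like:
# if (missing(threshold)) threshold <- inherit_threshold(x)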

Member Author:

#6

"equivocal_zone",
inputs = "probability",
outputs = "class",
arguments = list(value = value, threshold = threshold),
Member Author:

I wish we had structured recipe steps with containers (ha) for inputs and computed results. Let's consider these names in case we do the same for recipes 2E (if that happens).

R/container.R Outdated
}

# validate operation order and check duplicates
validate_oper_order(operations, mode, call)
Member Author:

This gets invoked when operations are added so we can validate them as they go.

R/equivocal_zone.R (resolved)
cls_pred <- probably::make_two_class_pred(new_data[[prob_nm]], levels = lvls,
buffer = object$arguments$value,
threshold = object$arguments$threshold)
new_data[[ est_nm ]] <- cls_pred # todo convert to factor?
Member Author:

yardstick will convert equivocals to NA so we probably don't have to do it here.
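
A quick illustration of that behavior with probably (the values and level names here are made up):

library(probably)

probs <- c(0.02, 0.48, 0.52, 0.98)
pred  <- make_two_class_pred(probs, levels = c("yes", "no"),
                             threshold = 0.5, buffer = 0.05)

# 0.48 and 0.52 fall inside the buffer and are marked equivocal; coercing to a
# factor turns the equivocal values into NA, which is how yardstick treats them
as.factor(pred)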

tunable.equivocal_zone <- function(x, ...) {
tibble::tibble(
name = "buffer",
call_info = list(list(pkg = "dials", fun = "buffer")),
Member Author:

A function to be created later (as are many others in the package).
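
Presumably it would look like the other dials parameter constructors, along these lines (a sketch only; dials::buffer() does not exist at the time of this PR):

buffer <- function(range = c(0, 0.5), trans = NULL) {
  dials::new_quant_param(
    type = "double",
    range = range,
    inclusive = c(TRUE, TRUE),
    trans = trans,
    label = c(buffer = "Buffer")
  )
}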

Member Author:

#7

R/misc.R Outdated
@@ -0,0 +1,17 @@
is_tune <- function(x) {
Member Author:

This should belong in hardhat.
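
For context, the helper only needs to detect an unevaluated tune() call; a minimal sketch (not necessarily the exact body in this PR):

is_tune <- function(x) {
  # hardhat::tune() returns a call object named `tune`
  is.call(x) && identical(rlang::call_name(x), "tune")
}

is_tune(hardhat::tune())  # TRUE
is_tune(0.1)              # FALSE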

Member Author:

#8

R/misc.R Outdated

# for operations with no tunable parameters

no_param <-
Member Author:

The same as the one in recipes.

renames files that define an `adjust_*()` function and associated method as `adjust_function_name.R`. helps to differentiate this relatively homogeneous file type from other kinds of source files.
@simonpcouch (Contributor) left a comment:

Thanks for all of the explanatory code comments and review comments! Really helpful for getting familiar with the code.

Plenty to do, for sure, but from my perspective, this is a really solid starting place and I'd vote that we go ahead and merge and iterate from here once those who want to review have a chance to. My changes above are mostly related to tidy styling and code navigability, holding off on additional features / testing / documentation / more substantive refactors for future PRs.

Going to start poking at workflows—I may end up returning here as I do so, but will generally not have local edits unless I've pinged otherwise.

R/adjust-numeric-calibration.R (resolved)
R/container.R Outdated
# ------------------------------------------------------------------------------
# set columns via tidyselect

dat <- list()
Contributor:

dat is somewhat confusing when a .data lives in the same environment. Maybe columns, as it's eventually named in the object?

Member Author:

👍 ab07580

R/container.R Outdated
#' @param type The model sub-type. Possible values are `"unknown"`, `"regression"`,
#' `"binary"`, or `"multiclass"`.
#' @param outcome The name of the outcome variable.
#' @param estimate The name of the point estimate (e.g. predicted class)
@simonpcouch (Contributor) commented Apr 22, 2024:

Tripped up on this argument name. This is (likely) .pred_class or .pred?

@topepo (Member Author) commented Apr 25, 2024:

Yeah. I added more context in this man file. We use it since it is the name of the analogous argument in yardstick metric functions.

R/container.R Outdated
}
ptype <- .data[0, ]

object <- set_container_type(object, .data[[dat$outcome]])
Contributor:

We should look to parsnip::check_outcome() for lessons learned on this kind of checking.

Member Author:

That uses the model spec and, if someone is using container as a stand-alone, they would only be able to specify the mode via the main container argument.
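
For comparison, a stand-alone analog keyed off the container's mode might look like the following (the function name and messages are hypothetical):

check_container_outcome <- function(outcome, mode, call = rlang::caller_env()) {
  if (mode == "classification" && !is.factor(outcome)) {
    cli::cli_abort("A factor outcome is required for classification.", call = call)
  }
  if (mode == "regression" && !is.numeric(outcome)) {
    cli::cli_abort("A numeric outcome is required for regression.", call = call)
  }
  invisible(outcome)
}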

R/container.R (resolved threads)
R/container.R Outdated
Comment on lines 3 to 4
#' @param mode The model's mode, one of `"unknown"`, `"classification"`, or
#' `"regression"`. Modes of `"censored regression"` are not currently supported.
Member:

I'm fine with being corrected. Would it be worth rethinking "unknown" as a possible mode, as in just allowing mode to be "classification" or "regression"?

Member Author:

I think we can do that (and easily un-do it later). This was meant to be analogous to parsnip's unknown mode. However, I think that this situation is different and we can remove it.


#' `"regression"`. Modes of `"censored regression"` are not currently supported.
#' @param type The model sub-type. Possible values are `"unknown"`, `"regression"`,
#' `"binary"`, or `"multiclass"`.
#' @param outcome The name of the outcome variable.
Member:

Same comment for all params:

Right now the code specifies that the names of the variables should be given as character vectors. But I feel like we are moving away from strings and toward having everything use {tidypredict} wherever possible.

If we decide to stick with character vectors, then the documentation should reflect that.

Member Author:

Did you mean {tidyselect} instead of {tidypredict}?

We do use tidyselect in fit.container() and the character defaults are there for that. I can make the man page reflect that it will be a tidyselect selector.
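
For the record, the tidyselect pattern in question is roughly this (the helper name is made up; the actual argument handling in fit.container() may differ):

columns_from_selection <- function(expr, data) {
  # resolve bare names, strings, or selector expressions into column names
  names(tidyselect::eval_select(expr, data = data))
}

# e.g., inside fit.container():
# outcome_nm  <- columns_from_selection(rlang::enquo(outcome), .data)
# estimate_nm <- columns_from_selection(rlang::enquo(estimate), .data)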

Member:

yes, tidyselect


new_container(
mode,
type,
Member:

Seems like we don't need type to be passed both as the type argument and as part of columns. From the looks of it, I think we can exclude it from the columns argument.

Comment on lines +144 to +146
if (!tibble::is_tibble(new_data)) {
new_data <- tibble::as_tibble(new_data)
}
Member:

I would prefer that we didn't have this and instead made sure that each operation returns a tibble.


if (mode == "classification") {
check_classification_order(orderings, call)
} else {
Member:

Let's future-proof this with:

Suggested change:
-  } else {
+  } else if (mode == "regression") {

Comment on lines +6 to +9
output_numeric = purrr::map_lgl(ops, ~ grepl("numeric", .x$outputs)),
output_prob = purrr::map_lgl(ops, ~ grepl("probability", .x$outputs)),
output_class = purrr::map_lgl(ops, ~ grepl("class", .x$outputs)),
output_all = purrr::map_lgl(ops, ~ grepl("everything", .x$outputs))
Member:

Instead of doing grepl("numeric", .x$outputs) here and which(x$output_numeric) later, you could do match("numeric", .x$outputs) here and drop the which later.

R/validation-rules.R (resolved)
R/adjust-equivocal-zone.R (resolved)
simonpcouch and others added 4 commits April 24, 2024 12:11
Co-authored-by: Emil Hvitfeldt <[email protected]>
ideally, there'd be a more canonical place to link to other than the help-files of specific calibration approaches
#' @param calibrator A pre-trained calibration method from the \pkg{probably}
#' package, such as [probably::cal_estimate_linear()].
#' @export
adjust_numeric_calibration <- function(x, calibrator) {
@simonpcouch (Contributor) commented Apr 24, 2024:

Currently, this function (and its probability analog) takes a trained calibrator and then doesn't do any actual fitting at fit() time. This means that:

  • If users are not using a workflow, they need to specify the names of truth and estimate (and by different names!).
  • If they're using a workflow, they now need to locate the names of the truth and estimate columns when they previously didn't need to.
  • Users (and tune, downstream) need to train the calibrator outside of the workflow.

With the current draft, this looks like:

library(modeldata)
library(probably)
#> 
#> Attaching package: 'probably'
#> The following objects are masked from 'package:base':
#> 
#>     as.factor, as.ordered
library(tibble)
library(container)

# create example data
set.seed(1)
dat <- tibble(y = rnorm(100), y_pred = y/2 + rnorm(100))

dat
#> # A tibble: 100 × 2
#>         y y_pred
#>     <dbl>  <dbl>
#>  1 -0.626 -0.934
#>  2  0.184  0.134
#>  3 -0.836 -1.33 
#>  4  1.60   0.956
#>  5  0.330 -0.490
#>  6 -0.820  1.36 
#>  7  0.487  0.960
#>  8  0.738  1.28 
#>  9  0.576  0.672
#> 10 -0.305  1.53 
#> # ℹ 90 more rows

# calibrate numeric predictions
reg_cal <- cal_estimate_linear(dat, truth = y, estimate = y_pred)

# specify calibration 
reg_ctr <-
  container(mode = "regression") %>%
  adjust_numeric_calibration(reg_cal)

# "train" container
reg_ctr_trained <- fit(reg_ctr, dat, outcome = y, estimate = y_pred)

predict(reg_ctr, dat)
#> # A tibble: 100 × 2
#>         y  y_pred
#>     <dbl>   <dbl>
#>  1 -0.626 -0.257 
#>  2  0.184  0.303 
#>  3 -0.836 -0.512 
#>  4  1.60   0.479 
#>  5  0.330  0.0108
#>  6 -0.820  0.523 
#>  7  0.487  0.480 
#>  8  0.738  0.516 
#>  9  0.576  0.438 
#> 10 -0.305  0.536 
#> # ℹ 90 more rows

Created on 2024-04-24 with reprex v2.1.0

I propose that we supply some interface in place of calibrator that just specifies the kind of calibration that should happen, e.g. type = "linear", ... or fn = cal_estimate_linear, ..., and actually fit the calibrator at fit() time. This would allow workflows users to never have to specify column names and container users to specify column names only once. This would look something like:

library(modeldata)
library(probably)
#> 
#> Attaching package: 'probably'
#> The following objects are masked from 'package:base':
#> 
#>     as.factor, as.ordered
library(tibble)
library(container)

# create example data
set.seed(1)
dat <- tibble(y = rnorm(100), y_pred = y/2 + rnorm(100))

dat
#> # A tibble: 100 × 2
#>         y y_pred
#>     <dbl>  <dbl>
#>  1 -0.626 -0.934
#>  2  0.184  0.134
#>  3 -0.836 -1.33 
#>  4  1.60   0.956
#>  5  0.330 -0.490
#>  6 -0.820  1.36 
#>  7  0.487  0.960
#>  8  0.738  1.28 
#>  9  0.576  0.672
#> 10 -0.305  1.53 
#> # ℹ 90 more rows

# specify calibration (no training has happened yet)
reg_ctr <-
  container(mode = "regression") %>%
  adjust_numeric_calibration(type = "linear") # or `fn = cal_estimate_linear`, etc

# actually train the calibration in the container
reg_ctr_trained <- fit(reg_ctr, dat, outcome = y, estimate = y_pred)

predict(reg_ctr, dat)
#> # A tibble: 100 × 2
#>         y  y_pred
#>     <dbl>   <dbl>
#>  1 -0.626 -0.257 
#>  2  0.184  0.303 
#>  3 -0.836 -0.512 
#>  4  1.60   0.479 
#>  5  0.330  0.0108
#>  6 -0.820  0.523 
#>  7  0.487  0.480 
#>  8  0.738  0.516 
#>  9  0.576  0.438 
#> 10 -0.305  0.536 
#> # ℹ 90 more rows

Created on 2024-04-24 with reprex v2.1.0

Member Author:

The problem is: what data do you have to do that?

In the course of tuning, you only have the holdout predictions for that resample. If we had the whole set of out-of-sample predictions, we could train on that. Using re-predicted training data isn't a good option.

Another idea is to supply a data argument where people can pass in any predictions that they want. However, unless they have a specific validation set held out for this specific operation, we are back to creating those predictions outside of the workflow/tune process.

I'm not happy with what the process is now but I think it is what we can do (for now, at least).

Member Author:

I think that I'll write a blog post about this to outline my thoughts better (and we can refer to that).

Contributor:

The problem is: what data do you have to do that?

The current approach leans on users to figure that out--an approach like the one I've proposed allows us to take care of that for the user.

I think framing this in the context of tuning/resampling, like you've done, is helpful. When I think through the interface as implemented, I'm unsure how tuning parameters from the container would work. tune_*() takes in one workflow and resamples it. In the current approach, how is that workflow's postprocessor trained? The calibrator needs predictions, and based on your reply, we don't want those predictions to come from data that the model was trained on. So, do we take a three-way split, split the validation set into train/test portions, train some model on the training portion of the validation split, train the calibrator on the testing portion, and then pass that fixed calibrator to the workflow? When tuning, which of the resampled models is the one used to train the calibrator? E.g., as I understand it, this would be the resampling flow as implemented:

library(tidymodels)
library(container)
library(probably)

penguins

# usual three-way flow
peng_split <- initial_validation_split(penguins)
peng_train <- training(peng_split)
peng_test <- testing(peng_split)
peng_val <- validation(peng_split)
peng_res <- vfold_cv(peng_train)

bt <- boost_tree("regression", trees = 3, learn_rate = tune())

# don't want to train the calibrator on re-predicted values
peng_split_cal <- initial_split(peng_val)
peng_train_cal <- training(peng_split_cal)
peng_test_cal <- testing(peng_split_cal)

# which of the models to be proposed trains the calibrator?
# bt_finalized <- finalize(bt, ...)
bt_cal_train <- fit(bt_finalized, body_mass_g ~ ., peng_train_cal)
bt_cal_preds <- augment(bt_cal_train, peng_test_cal)
bt_cal <- cal_estimate_linear(bt_cal_preds, ...)

bt_container <- container() %>% adjust_numeric_calibration(bt_cal)

wf <- workflow(body_mass_g ~ ., bt, bt_container)

tune_grid(wf, peng_res)

If we just allow specifying the strategy used to train a post-processor rather than the trained post-processor, the post-processor is actually resampled and the user interface feels much more like tidymodels. Also, the onus of doing the right thing w.r.t. resampling is on us rather than the user:

library(tidymodels)
library(container)
library(probably)

penguins

# usual two-way flow
peng_split <- initial_split(penguins)
peng_train <- training(peng_split)
peng_test <- testing(peng_split)
peng_res <- vfold_cv(peng_train)

bt <- boost_tree("regression", trees = 3, learn_rate = tune())
bt_container <- container() %>% adjust_numeric_calibration(type = "linear")

wf <- workflow(body_mass_g ~ ., bt, bt_container)

tune_grid(wf, peng_res)

In the course of tuning, you only have the holdout predictions for that resample. If we had the whole set of out-of-sample predictions, we could train on that. Using re-predicted training data isn't a good option.

From my understanding of our conversation earlier in the week, you've observed that training calibrators on re-predicted data is typically not a problem practically.

That said, from a technical standpoint, this feels quite solvable in tune:

  1. Take a computational performance penalty: figure out a reasonably memory-efficient way to train calibrators on unseen data by exchanging out-of-sample predictions after the initial tune grid loop.
  2. Take a predictive performance penalty: take an "internal" train/test split of the analysis (training portion of the resample) set for each resample when tuning post-processors, and train the model on the internal training portion and the post-processor on the internal testing portion (roughly sketched below).
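
For a single resample, option 2 could look like the following (the rsample/workflows calls are real; the flow itself is a hypothetical sketch of tune internals, reusing the penguins example above with the packages already loaded there):

resample_with_postproc <- function(split, wf_model, ctr) {
  analysis_set <- rsample::analysis(split)

  # internal split: the model sees one portion, the calibrator the other
  inner <- rsample::initial_split(analysis_set, prop = 3 / 4)
  model_fit <- fit(wf_model, rsample::training(inner))
  cal_data  <- augment(model_fit, rsample::testing(inner))

  # the post-processor is trained on predictions for data the model never saw
  ctr_fit <- fit(ctr, cal_data, outcome = body_mass_g, estimate = .pred)
  list(model = model_fit, container = ctr_fit)
}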

@simonpcouch simonpcouch merged commit 3874410 into main Apr 25, 2024
1 check passed
@simonpcouch simonpcouch deleted the max branch April 25, 2024 20:40