Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add output type conversion functions #150

Open
wants to merge 48 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 15 commits
Commits
Show all changes
48 commits
Select commit Hold shift + click to select a range
49eab2b
start transformation functions
eahowerton Jun 6, 2024
bca3fab
Add documentation
eahowerton Jun 6, 2024
8dd2ad8
switch from summarize() to reframe()
eahowerton Jun 6, 2024
48dfffb
add preliminary validations
eahowerton Jun 6, 2024
71dacb9
add validations of new_output_type_id
eahowerton Jun 6, 2024
9b81d7a
update documentation, lint
eahowerton Jun 6, 2024
d8f9e4b
Update DESCRIPTION
eahowerton Jun 6, 2024
5b9e311
Create get_task_id_cols.R
eahowerton Jun 6, 2024
2a6c595
use get_task_id_cols()
eahowerton Jun 6, 2024
10ff653
Create get_task_id_cols.R (#149)
eahowerton Jun 6, 2024
7243570
use get_task_id_cols()
eahowerton Jun 6, 2024
243cdb5
Merge branch 'output-type-conversion' of https://github.com/Infectiou…
eahowerton Jun 6, 2024
cc4e002
switch model_outputs to model_out_tbl
eahowerton Jun 6, 2024
4db0992
add functionality for multiple new output_types
eahowerton Jun 7, 2024
bee1c61
lint
eahowerton Jun 7, 2024
94e1b40
Merge branch 'main' into output-type-conversion2
lshandross Sep 11, 2024
28a6be6
Update DESCRIPTION
lshandross Sep 11, 2024
cd82801
Rename `convert_output_types()` title
lshandross Sep 11, 2024
8bfee20
Punctuation fixes
lshandross Sep 11, 2024
727d33d
Remove within package `hubUtils` namespacing
lshandross Sep 11, 2024
d07d985
Update DESCRIPTION
lshandross Sep 11, 2024
2d7d19c
Update NAMESPACE
lshandross Sep 11, 2024
0f73f84
Update convert_output_types.R
lshandross Sep 11, 2024
944404d
Fix linting issues
lshandross Sep 11, 2024
94fa108
Document `convert_output_type()`
lshandross Sep 11, 2024
ac6333a
Update DESCRIPTION
lshandross Sep 11, 2024
e8595e1
Fix failing tests
lshandross Sep 12, 2024
781016d
Improve output type conversion validation messages
lshandross Sep 12, 2024
1a9789e
Define global functions
lshandross Sep 12, 2024
84ff4aa
Clean up `convert_output_types` functions
lshandross Sep 12, 2024
9fa2e35
Don't export smaller conversion functions
lshandross Sep 12, 2024
db08c45
Document `get_task_id_cols()`
lshandross Sep 12, 2024
75df987
Fix `convert_output_type()` example
lshandross Sep 12, 2024
ee6b098
Clarify `convert_output_type()` description and return value
lshandross Sep 12, 2024
de84f22
Fix linting issues
lshandross Sep 12, 2024
d4119e2
Add single starting output type validation and test
lshandross Sep 12, 2024
1e48584
Remove duplicate test
lshandross Sep 12, 2024
66ad690
Remove seed and check for `distfromq`
lshandross Sep 12, 2024
4b8ed54
Fix failing test
lshandross Sep 12, 2024
e23e047
Refactor new `validate_output_type_id()` function
lshandross Sep 13, 2024
6dd2926
Update documentation
lshandross Sep 13, 2024
95d6f28
Testing fixes
lshandross Sep 13, 2024
254b737
Merge branch 'main' into output-type-conversion2
lshandross Oct 1, 2024
522e88f
model_out_tbl output type col is character in `convert_output_type()`…
lshandross Oct 3, 2024
054bffb
Make `new_output_type_id` arg usually a named list; simplify its vali…
lshandross Oct 3, 2024
cbe2acf
Fix quantile -> median, mean/cdf transformation (use intermediary sam…
lshandross Oct 4, 2024
8d2e95b
Cleanup tests
lshandross Oct 4, 2024
874ad81
Rename variable in `convert_output_types()` example to be more accurate
lshandross Oct 11, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -35,6 +35,7 @@ Suggests:
covr,
curl,
digest,
distfromq,
lshandross marked this conversation as resolved.
Show resolved Hide resolved
dplyr,
fs,
gh,
Expand Down
251 changes: 251 additions & 0 deletions R/convert_output_types.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,251 @@
#' Transform between output types, from one starting output_type into new
#' output_types. See details for supported conversions.
#'
lshandross marked this conversation as resolved.
Show resolved Hide resolved
#' @param model_out_tbl an object of class `model_out_tbl` with component model
lshandross marked this conversation as resolved.
Show resolved Hide resolved
#' outputs (e.g., predictions). `model_out_tbl` should contain only one
#' unique value in the `output_type` column
lshandross marked this conversation as resolved.
Show resolved Hide resolved
#' @param new_output_type `string` indicating the desired output_type after
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a suggestion but how do you feel about the argument name to_output_type instead of new_output_type?

#' transformation (`"mean"`, `"median"`, `"quantile"`, `"cdf"`); can also be a
#' vector if multiple new output_types are desired
lshandross marked this conversation as resolved.
Show resolved Hide resolved
#' @param new_output_type_id `vector` indicating desired output_type_ids for
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

related to above comment, how about to_output_type_id?

#' corresponding `new_output_type`; only needs to be specified if
#' `new_output_type` includes `"quantile"` or `"cdf"` (see details for how to
#' specify when both `"quantile"` and `"cdf"` are desired)
lshandross marked this conversation as resolved.
Show resolved Hide resolved
#' @param n_samples `numeric` that specifies the number of samples to use when
#' calculating output_types from an estimated quantile function. Defaults to `1e4`.
#' @param ... parameters that are passed to `distfromq::make_q_fn`, specifying
#' details of how to estimate a quantile function from provided quantile levels
#' and quantile values for `"quantile"` or `"cdf"` output types.
#'
#' @details
#' The following transformations are supported: (i) `"sample"` can be
#' transformed to `"mean"`, `"median"`, `"quantile"`, or `"cdf"`; (ii)
#' `"quantile"` can be transformed to `"mean"`, `"median"`, or `"cdf"`; and
#' (iii) `"cdf"` can be transformed to `"mean"`, `"median"`, or `"quantile"`.
#'
#' For `"quantile"` and `"cdf"` starting output types, we use the following approach:
#' 1. Interpolate and extrapolate from the provided quantiles or probabilities
#' for each component model to obtain an estimate of the cdf of that distribution.
#' 2. Draw samples from the distribution for each component model. To reduce
#' Monte Carlo variability, we use quasi-random samples corresponding to
#' quantiles of the estimated distribution.
#' 3. Calculate the desired quantity (e.g., mean).
#' If the median quantile is provided in the `model_out_tbl` object (i.e., the
#' original output_type is `"median"` and 0.5 is contained in the original
#' output_type_id), the median value is extracted and returned directly.
#'
#' If both `"quantile"` and `"cdf"` output_types are desired, `new_output_type_id`
#' should be a named list, where each element specifies the corresponding
#' `new_output_type_id`. See examples for an illustration.
#'
#' @examples
#' # We illustrate the conversion between output types using normal distributions,
#' ex_quantiles <- c(0.25, 0.5, 0.75)
#' model_out_tbl <- expand.grid(
#' group1 = c(1,2),
#' model_id = "A",
#' output_type = "quantile",
#' output_type_id = ex_quantiles
#' ) %>%
#' dplyr::mutate(value = qnorm(p = output_type_id, mean = group1))
#'
#' convert_output_type(model_out_tbl, c("group1"), new_output_type = "median")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is c("group1") for in this example?

#'
#' # Next, we illustrate conversion from samples to quantile and cdf
#' ex_bins <- seq(-2,2,1)
#' ex_quantiles <- c(0.25, 0.5, 0.75)
#' model_out_tbl <- expand.grid(
#' group1 = c(1,2),
#' model_id = "A
#' output_type = "sample",
#' output_type_id = 1:100
#' ) %>%
#' dplyr::mutate(value = rnorm(100, mean = group1))
#'
#' convert_output_type(model_out_tbl, new_output_type = c("quantile", "cdf"),
#' new_output_type_id = list("quantile" = ex_quantiles, "cdf" = ex_bins))
#'
#' @return object of class `model_out_tbl` containing new output_type
lshandross marked this conversation as resolved.
Show resolved Hide resolved
#' @export
convert_output_type <- function(model_out_tbl, new_output_type,
new_output_type_id = NA, n_samples = 1e4, ...) {
Comment on lines +78 to +79
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic of this function, where everything is converted to samples and from there to the target output type, has a certain kind of elegance and at the same time is kind of inefficient. For example, here's what we do when the starting_output_type is "quantile" and the new_output_type is "cdf":

  1. In get_samples_from_quantiles, we call distfromq::make_q_fn to estimate the quantile function and draw samples from it
  2. Then we compute the empirical cdf of those samples and evaluate that ecdf in

Restating, our goal was to get to cdf values, and in order to do that we (1) estimate the cdf; (2) invert that to get to a quantile function estimate; (3) draw samples from the quantile function; (4) get the empirical cdf of those samples; (5) evaluate that empirical cdf. Although it is conceptually clean to work through the "universal representation" of samples, steps 2-5 here are extra work and in particular the sample-based stuff in steps 3-5 will introduce more layers of approximation to the result.

Another alternative organization to this code might provide a set of "direct" transformations that don't use samples as an intermediate step. For example, the quantile to cdf transform might define a helper function along the lines of transform_quantile_to_cdf_one_group along these lines:

transform_quantile_to_cdf_one_group <- function(starting_ps, starting_qs, to_qs, ...) {
  p_fn <- distfromq::make_p_fn(
    ps = as.numeric(starting_ps),
    qs = starting_qs,
    ...
  )
  return(p_fn(to_qs))
}

Which could then be called on a grouped data frame somewhere, like

  samples <- model_out_tbl %>%
    dplyr::group_by(model_id, dplyr::across(dplyr::all_of(task_id_cols))) %>%
    dplyr::reframe(
      value = transform_quantile_to_cdf_one_group(
        starting_ps=as.numeric(.data$output_type_id),
        starting_qs=.data$value,
        to_qs=new_output_type_id,
        ...
      )

I think this approach would likely involve more code in the end, but could be nice as it would be more direct and involve less approximation. I'm open to your ideas about this!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally, I had conceptualized a set of transformations like transform_quantile_to_cdf_one_group() that you describe @elray1. When most required transforming to samples, I decided for the elegant route. But I agree, there is additional approximation that is not necessary, and would ideally be avoided.

Question if we opt for the individual transformation functions: consider, for example, we are transforming from quantile to mean. The steps are (1) estimate the cdf, (2) invert to quantile function, (3) draw samples from quantile function, (4) calculate mean. Would this function call the transform_quantile_to_samples() function? Or would it replicate steps 1-3 within the transform_quantile_to mean() function?

@lshandross, I'm happy for you to make the final decision on this one.

Copy link
Contributor

@elray1 elray1 Sep 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment was a good prompt for me to think about which transformations require samples, estimates of the QF, and/or estimates of the CDF. Here's my thinking, does this seem corret?

  • starting type = quantile
    • new type = quantile: do we support this? should we? if so, would not require samples, use estimated QF
    • new type = cdf: samples not needed, use estimated CDF
    • new type = pmf: samples not needed, use estimated CDF. Some data structure would be required specifying bin endpoints, or some convention for the names of the output_type_ids for bins, e.g. (0,0.1].
    • new type = sample: samples needed
    • new type = median: samples not needed, use estimated QF
    • new type = mean: samples needed
  • starting type = cdf
    • new type = quantile: samples not needed, use estimated QF
    • new type = cdf: do we support this? should we? if so, would not require samples, use estimated CDF
    • new type = pmf: samples not needed, use estimated CDF
    • new type = sample: samples needed
    • new type = median: samples not needed, use estimated QF
    • new type = mean: samples needed
  • starting type = pmf.
    • overall note, i think that for the most part, transforms here only make sense if the variable is a discretization of an underlying continuous variable and we get some information about the bin endpoints used for the discretization. but in that setting, i think the transformation options match what was said above for starting type = cdf.
    • though if the variable is nominal, we could go to the CDF type naturally.
  • starting type = sample
    • you already have samples!
  • starting type = median
    • can't do any transformations other than the trivial transform to quantile at level 0.5
  • starting type = mean
    • can't do any transformations

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed this with Li just now, and arrived at a plan to merge this PR more or less as-is, then do the following in separate issues:

  • Allow new_type = sample for existing probabilistic starting types where they are not supported (quantile and cdf)
  • Allow new_type = pmf for existing probabilistic starting types (quantile, cdf, and sample).
    • Maybe at the same time, move away from using samples as intermediary for new type = cdf. This would get us to a place where the same methodology is used when the new output type is pmf and cdf.
    • Maybe at the same time, support cdf -> cdf, e.g. extending the set of points where the cdf is evaluated. This feels like a low priority extension, not sure there is really a use case?
  • Allow starting_type = pmf
  • Lower priority issue, support quantile --> quantile transformations, e.g. extending the set of quantile levels predicted. This could be useful to modelers who are doing something like quantile regression, to allow them to get predictions at a small number of quantile levels and then fill in the others.
    • At this time, could switch to using QF as an intermediary when new_type = quantile or median rather than samples.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds great, thank you both for pushing this across the finish line!

# validations
task_id_cols <- get_task_id_cols(model_out_tbl)
starting_output_type <- model_out_tbl$output_type %>% unique()
lshandross marked this conversation as resolved.
Show resolved Hide resolved
starting_output_type_ids <- model_out_tbl$output_type_id %>% unique()
task_id_cols <- get_task_id_cols(model_out_tbl)
lshandross marked this conversation as resolved.
Show resolved Hide resolved
validate_new_output_type(
starting_output_type, new_output_type,
new_output_type_id
)
# for cdf and quantile functions, get samples
if (starting_output_type == "cdf") {
# estimate from samples
model_out_tbl <- get_samples_from_cdf(model_out_tbl, task_id_cols, n_samples)
} else if (starting_output_type == "quantile") {
# if median output desired, and Q50 provided return exact value, otherwise
# estimate from samples
if (!("median" %in% new_output_type && 0.5 %in% starting_output_type_ids)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we might still want this calculation if, for example, new_output_type = c("median", "cdf")

Copy link
Contributor

@lshandross lshandross Sep 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the placement of the parentheses, this calculation does occur in that case because it simplifies to !("median" %in% new_output_type) || !(0.5 %in% starting_output_type_ids)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clarifying/restating -- what if we have a situation where:

  • starting_output_type = "quantile", with starting_output_type_ids = c(0.025, 0.25, 0.5, 0.75, 0.975)
  • new_output_type = c("median", "cdf")

In this case, !("median" %in% new_output_type) and !(0.5 %in% starting_output_type_ids) both evaluate to FALSE and so the condition does not run, right? But in that case we would still need the representation in terms of samples in order to get to the cdf outputs (if we stick with a workflow that uses samples as an intermediate).

Copy link
Contributor

@lshandross lshandross Oct 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Further investigation revealed that your example with median and cdf new output types led to a case where indeed no samples were estimated and the cdf values were instead estimated from the provided quantiles using stats::ecdf().

I went in and made some changes to the function so that samples would be used as an intermediary for consistency, as the plan is to merging this pull request first before switching how we perform the transformation as detailed below in your other comment in commit cbe2acf

model_out_tbl <- get_samples_from_quantiles(model_out_tbl, task_id_cols, n_samples)
}
}
# transform based on new_output_type
grouped_model_out_tbl <- model_out_tbl %>%
dplyr::group_by(model_id, dplyr::across(dplyr::all_of(task_id_cols)))
model_out_tbl_transform <- vector("list", length = length(new_output_type))
for (i in 1:length(new_output_type)) {
# if median output desired, and Q50 provided return exact value
if (new_output_type[i] == "median" && 0.5 %in% starting_output_type_ids) {
model_out_tbl_transform[[i]] <- model_out_tbl %>%
dplyr::filter(output_type_id == 0.5) %>%
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is probably fine, so may not be worth updating -- but it seems like there may be an edge case where one model does not provide a predictive quantile at the level 0.5, or they provide some output_type_id like 0.49999999999999999999999 that doesn't evaluate as equal to 0.5, in which case we might end up without a row with a median where that model did provide forecasts.

I think the conditions of that edge case would be pretty rare, so again, not sure it's worth coding for them...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could file a low priority issue?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, maybe not the most important case to handle. And I might argue that the user should address this (rather than us making assumptions about what is "sufficiently" close to 0.5 to be considered a "median") but I don't feel strongly about this!

dplyr::mutate(
output_type = new_output_type[i],
output_type_id = NA
) %>%
hubUtils::as_model_out_tbl()
lshandross marked this conversation as resolved.
Show resolved Hide resolved
}
# otherwise calculate new values
# first find new_output_type_id
new_output_type_id_tmp <- new_output_type_id
if (new_output_type[i] %in% c("mean", "median")) {
new_output_type_id_tmp <- NA
} else
if (is.list(new_output_type_id)) {
new_output_type_id_tmp <- new_output_type_id[[new_output_type[i]]]
}
lshandross marked this conversation as resolved.
Show resolved Hide resolved
model_out_tbl_transform[[i]] <- convert_from_sample(
grouped_model_out_tbl, new_output_type[i], new_output_type_id_tmp
)
}
return(dplyr::bind_rows(model_out_tbl_transform))
}

validate_new_output_type <- function(starting_output_type, new_output_type,
new_output_type_id) {
valid_conversions <- list(
"sample" = c("mean", "median", "quantile", "cdf"),
"quantile" = c("mean", "median", "cdf"),
"cdf" = c("mean", "median", "quantile")
)
# check starting_output_type is supported
lshandross marked this conversation as resolved.
Show resolved Hide resolved
valid_starting_output_type <- starting_output_type %in% names(valid_conversions)
if (!valid_starting_output_type) {
cli::cli_abort(c(
"{.var output_type} provided cannot be transformed",
lshandross marked this conversation as resolved.
Show resolved Hide resolved
i = "must be of type {.var sample}, {.var quantile}, {.var cdf}."
lshandross marked this conversation as resolved.
Show resolved Hide resolved
))
}
# check new_output_type is supported
invalid_new_output_type <- which(!(new_output_type %in% valid_conversions[[starting_output_type]]))
if (length(invalid_new_output_type) > 0) {
cli::cli_abort(c(
"{invalid_new_output_type} cannot be transformed to the specified
{.var new_output_type}",
i = "new_output_type must be {valid_conversions[[starting_output_type]]}"
))
}
lshandross marked this conversation as resolved.
Show resolved Hide resolved
# check new_output_type_id
if (all(new_output_type %in% c("mean", "median")) && !all(is.na(new_output_type_id))) {
cli::cli_abort(c(
"{.var new_output_type_id} is incompatible with {.var new_output_type}",
i = "{.var new_output_type_id} should be {.var NA}"
))
lshandross marked this conversation as resolved.
Show resolved Hide resolved
} else if ("quantile" %in% new_output_type) {
new_output_type_id_tmp <- new_output_type_id
if (is.list(new_output_type_id)) {
new_output_type_id_tmp <- new_output_type_id[["quantile"]]
lshandross marked this conversation as resolved.
Show resolved Hide resolved
}
if (!is.numeric(new_output_type_id_tmp)) {
cli::cli_abort(c(
"elements of {.var new_output_type_id} should be numeric",
i = "elements of {.var new_output_type_id} represent quantiles
of the predictive distribution"
))
}
if (any(new_output_type_id_tmp < 0) || any(new_output_type_id_tmp > 1)) {
cli::cli_abort(c(
"elements of {.var new_output_type_id} should be between 0 and 1",
i = "elements of {.var new_output_type_id} represent quantiles
of the predictive distribution"
))
}
} else if (new_output_type == "cdf") {
new_output_type_id_tmp <- new_output_type_id
if (is.list(new_output_type_id)) {
new_output_type_id_tmp <- new_output_type_id[["cdf"]]
}
if (!is.numeric(new_output_type_id_tmp)) {
cli::cli_abort(c(
"elements of {.var new_output_type_id} should be numeric",
i = "elements of {.var new_output_type_id} represent possible
values of the target variable"
))
}
}
}

#' @export
get_samples_from_quantiles <- function(model_out_tbl, task_id_cols, n_samples, ...) {
set.seed(101)
lshandross marked this conversation as resolved.
Show resolved Hide resolved
lshandross marked this conversation as resolved.
Show resolved Hide resolved
samples <- model_out_tbl %>%
dplyr::group_by(model_id, dplyr::across(dplyr::all_of(task_id_cols))) %>%
dplyr::reframe(
value = distfromq::make_q_fn(
ps = as.numeric(.data$output_type_id),
qs = .data$value, ...
)(runif(n_samples, 0, 1))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of notes here:

  • This is actually exactly the calculation that's done in the function that's returned by distfromq::make_r_fn. So you could do this in a slightly easier-to-read way by calling
value = distfromq::make_r_fn(
        ps = as.numeric(.data$output_type_id),
        qs = .data$value, ...
      )(n_samples)
  • But these are pseudo-random samples rather than quasi-random samples. I actually think the distinction here is probably of low importance. In case someone is converting to output type sample and they want the randomness they'd expect from that, maybe it's best to keep this behavior rather than updating to quasi-random samples. Then we should just update the documentation for the main function above to remove the line "To reduce Monte Carlo variability, we use quasi-random samples corresponding to quantiles of the estimated distribution."
  • Otherwise, if we do want to stick with quasi-random samples, we could use a line like the following here:
value = distfromq::make_q_fn(
        ps = as.numeric(.data$output_type_id),
        qs = .data$value, ...
      )(seq(from = 0, to = 1, length = n_samples + 2)[1:(n_samples+1)])
  • To address a comment I saw somewhere else from Anna about what's going on here, it might be helpful to refactor this into a separate function that takes .data$output_type_id and .data$value as arguments and has two lines: one calling distfromq::make_*_fn and storing that result in q_fn or r_fn (depending on which way you prefer to go), and a second calling that function. (this refactoring may be less necessary if you switch to just using make_r_fn since that one is easier to read)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I support changing to make_r_fn(), and that following the intuition of samples, we probably want to default to pseudo-random samples. We could also file a low-priority issue to make this sampling scheme an argument that the user can specify? I could imagine use cases of the function where each type of sampling may be desired.

)


return(samples)
}

#' @export
get_samples_from_cdf <- function(model_out_tbl, task_id_cols, n_samples, ...) {
set.seed(101)
lshandross marked this conversation as resolved.
Show resolved Hide resolved
samples <- model_out_tbl %>%
dplyr::group_by(model_id, dplyr::across(dplyr::all_of(task_id_cols))) %>%
dplyr::reframe(
value = distfromq::make_q_fn(
ps = .data$value,
qs = as.numeric(.data$output_type_id), ...
)(runif(n_samples, 0, 1))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is )(runif(n_samples, 0, 1)) doing?

)
return(samples)
}

#' @export
convert_from_sample <- function(grouped_model_out_tbl, new_output_type,
new_output_type_id) {
if (new_output_type == "mean") {
model_out_tbl_transform <- grouped_model_out_tbl %>%
dplyr::reframe(
value = mean(value),
output_type_id = new_output_type_id
)
} else if (new_output_type == "median") {
model_out_tbl_transform <- grouped_model_out_tbl %>%
dplyr::reframe(
value = median(value),
output_type_id = new_output_type_id
)
} else if (new_output_type == "quantile") {
model_out_tbl_transform <- grouped_model_out_tbl %>%
dplyr::reframe(
value = quantile(value, as.numeric(new_output_type_id), names = FALSE
),
output_type_id = new_output_type_id
)
} else if (new_output_type == "cdf") {
model_out_tbl_transform <- grouped_model_out_tbl %>%
dplyr::reframe(
value = ecdf(value)(as.numeric(new_output_type_id)),
output_type_id = new_output_type_id
)
}
# update output_type and output_type_id columns
model_out_tbl_transform <- model_out_tbl_transform %>%
dplyr::mutate(output_type = new_output_type) %>%
hubUtils::as_model_out_tbl()
lshandross marked this conversation as resolved.
Show resolved Hide resolved
return(model_out_tbl_transform)
}
6 changes: 6 additions & 0 deletions R/get_task_id_cols.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
#' @export
get_task_id_cols <- function(model_out_tbl) {
model_out_cols <- colnames(model_out_tbl)
task_id_cols <- model_out_cols[!model_out_cols %in% std_colnames]
return(task_id_cols)
}
Loading
Loading