Add output type conversion functions #150
base: main
Conversation
…s-Disease-Modeling-Hubs/hubUtils into output-type-conversion
Overall looking good!
I've made a lot of minor/stylistic comments which you can feel free to ignore!
I've also made some suggestions for tackling a couple of more salient points.
R/convert_output_types.R
Outdated
```r
value = distfromq::make_q_fn(
  ps = .data$value,
  qs = as.numeric(.data$output_type_id), ...
)(runif(n_samples, 0, 1))
```
What is )(runif(n_samples, 0, 1))
doing?
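The construct here is function application: `distfromq::make_q_fn(...)` returns an estimated quantile function, which is then immediately called on `runif(n_samples, 0, 1)`, i.e., inverse-transform sampling. A minimal sketch of the same pattern, with `stats::qnorm` standing in for the estimated quantile function:

```r
# inverse-transform sampling: evaluate a quantile function at Uniform(0, 1) draws
q_fn <- stats::qnorm                # stand-in for the function make_q_fn() returns
samples <- q_fn(runif(1e4, 0, 1))   # 10,000 pseudo-random standard normal draws
```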
this is a really nice start at this functionality! I had one big question about whether it might make sense to avoid the use of samples as an intermediate representation of the distribution in all settings, and some other smaller comments.
R/convert_output_types.R
Outdated
```r
} else if (starting_output_type == "quantile") {
  # if median output desired, and Q50 provided return exact value, otherwise
  # estimate from samples
  if (!("median" %in% new_output_type && 0.5 %in% starting_output_type_ids)) {
```
It seems like we might still want this calculation if, for example, new_output_type = c("median", "cdf")
From the placement of the parentheses, this calculation does occur in that case because it simplifies to !("median" %in% new_output_type) || !(0.5 %in% starting_output_type_ids)
clarifying/restating -- what if we have a situation where:
- `starting_output_type = "quantile"`, with `starting_output_type_ids = c(0.025, 0.25, 0.5, 0.75, 0.975)`
- `new_output_type = c("median", "cdf")`

In this case, `!("median" %in% new_output_type)` and `!(0.5 %in% starting_output_type_ids)` both evaluate to FALSE, and so the condition does not run, right? But in that case we would still need the representation in terms of samples in order to get to the cdf outputs (if we stick with a workflow that uses samples as an intermediate).
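Working through that example in base R (values copied from above) confirms the condition is FALSE, so the branch that estimates samples is skipped:

```r
new_output_type <- c("median", "cdf")
starting_output_type_ids <- c(0.025, 0.25, 0.5, 0.75, 0.975)
!("median" %in% new_output_type && 0.5 %in% starting_output_type_ids)
#> [1] FALSE
```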
Further investigation revealed that your example with median and cdf new output types led to a case where indeed no samples were estimated and the cdf values were instead estimated from the provided quantiles using `stats::ecdf()`.
I went in and made some changes to the function in commit cbe2acf so that samples would be used as an intermediary for consistency, as the plan is to merge this pull request first before switching how we perform the transformation, as detailed below in your other comment.
R/convert_output_types.R
Outdated
```r
if (new_output_type[i] == "median" && 0.5 %in% starting_output_type_ids) {
  model_out_tbl_transform[[i]] <- model_out_tbl %>%
    dplyr::filter(output_type_id == 0.5) %>%
```
I think this is probably fine, so may not be worth updating -- but it seems like there may be an edge case where one model does not provide a predictive quantile at the level 0.5, or they provide some `output_type_id` like 0.49999999999999999999999 that doesn't evaluate as equal to 0.5, in which case we might end up without a row with a median where that model did provide forecasts.
I think the conditions of that edge case would be pretty rare, so again, not sure it's worth coding for them...
Maybe we could file a low priority issue?
Agree, maybe not the most important case to handle. And I might argue that the user should address this (rather than us making assumptions about what is "sufficiently" close to 0.5 to be considered a "median") but I don't feel strongly about this!
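If this ever does get filed as an issue, one possible (purely hypothetical, not part of this PR) mitigation would be a tolerance-based match instead of exact equality:

```r
# hypothetical sketch: treat any quantile level within a small tolerance of 0.5
# as the median, instead of requiring exact equality with 0.5
median_rows <- model_out_tbl %>%
  dplyr::filter(abs(as.numeric(.data$output_type_id) - 0.5) < 1e-6)
```

Though, as noted, deciding what counts as "sufficiently close" is itself an assumption.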
R/convert_output_types.R
Outdated
```r
value = distfromq::make_q_fn(
  ps = as.numeric(.data$output_type_id),
  qs = .data$value, ...
)(runif(n_samples, 0, 1))
```
Couple of notes here:
- This is actually exactly the calculation that's done in the function that's returned by `distfromq::make_r_fn`. So you could do this in a slightly easier-to-read way by calling

```r
value = distfromq::make_r_fn(
  ps = as.numeric(.data$output_type_id),
  qs = .data$value, ...
)(n_samples)
```

- But these are pseudo-random samples rather than quasi-random samples. I actually think the distinction here is probably of low importance. In case someone is converting to output type sample and they want the randomness they'd expect from that, maybe it's best to keep this behavior rather than updating to quasi-random samples. Then we should just update the documentation for the main function above to remove the line "To reduce Monte Carlo variability, we use quasi-random samples corresponding to quantiles of the estimated distribution."
- Otherwise, if we do want to stick with quasi-random samples, we could use a line like the following here:

```r
value = distfromq::make_q_fn(
  ps = as.numeric(.data$output_type_id),
  qs = .data$value, ...
)(seq(from = 0, to = 1, length = n_samples + 2)[2:(n_samples + 1)])
```

- To address a comment I saw somewhere else from Anna about what's going on here, it might be helpful to refactor this into a separate function that takes `.data$output_type_id` and `.data$value` as arguments and has two lines: one calling `distfromq::make_*_fn` and storing that result in `q_fn` or `r_fn` (depending on which way you prefer to go), and a second calling that function. (this refactoring may be less necessary if you switch to just using `make_r_fn` since that one is easier to read)
I support changing to `make_r_fn()`, and following the intuition of samples, we probably want to default to pseudo-random samples. We could also file a low-priority issue to make this sampling scheme an argument that the user can specify? I could imagine use cases of the function where each type of sampling may be desired.
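A sketch of what that low-priority issue might propose (helper name and interface are hypothetical): a small function that takes the estimated quantile function and lets the caller pick the sampling scheme:

```r
# hypothetical helper: draw n_samples values from an estimated quantile function
# q_fn, using either pseudo-random (runif) or quasi-random (evenly spaced) probs
draw_from_q_fn <- function(q_fn, n_samples, scheme = c("pseudo", "quasi")) {
  scheme <- match.arg(scheme)
  ps <- switch(scheme,
    pseudo = runif(n_samples),
    quasi = seq(from = 0.5 / n_samples, by = 1 / n_samples, length.out = n_samples)
  )
  q_fn(ps)
}

draw_from_q_fn(stats::qnorm, 4, scheme = "quasi")  # qnorm at 0.125, 0.375, 0.625, 0.875
```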
```r
convert_output_type <- function(model_out_tbl, new_output_type,
                                new_output_type_id = NA, n_samples = 1e4, ...) {
```
The logic of this function, where everything is converted to samples and from there to the target output type, has a certain kind of elegance and at the same time is kind of inefficient. For example, here's what we do when the `starting_output_type` is "quantile" and the `new_output_type` is "cdf":
- In `get_samples_from_quantiles`, we call `distfromq::make_q_fn` to estimate the quantile function and draw samples from it
  - internally, to get an estimated quantile function, `distfromq::make_q_fn` estimates the cdf and inverts that
- Then we compute the empirical cdf of those samples and evaluate that ecdf.

Restating, our goal was to get to cdf values, and in order to do that we (1) estimate the cdf; (2) invert that to get to a quantile function estimate; (3) draw samples from the quantile function; (4) get the empirical cdf of those samples; (5) evaluate that empirical cdf. Although it is conceptually clean to work through the "universal representation" of samples, steps 2-5 here are extra work and in particular the sample-based stuff in steps 3-5 will introduce more layers of approximation to the result.
Another alternative organization to this code might provide a set of "direct" transformations that don't use samples as an intermediate step. For example, the quantile to cdf transform might define a helper function along the lines of `transform_quantile_to_cdf_one_group`:

```r
transform_quantile_to_cdf_one_group <- function(starting_ps, starting_qs, to_qs, ...) {
  p_fn <- distfromq::make_p_fn(
    ps = as.numeric(starting_ps),
    qs = starting_qs,
    ...
  )
  return(p_fn(to_qs))
}
```
Which could then be called on a grouped data frame somewhere, like

```r
samples <- model_out_tbl %>%
  dplyr::group_by(model_id, dplyr::across(dplyr::all_of(task_id_cols))) %>%
  dplyr::reframe(
    value = transform_quantile_to_cdf_one_group(
      starting_ps = as.numeric(.data$output_type_id),
      starting_qs = .data$value,
      to_qs = new_output_type_id,
      ...
    )
  )
```
I think this approach would likely involve more code in the end, but could be nice as it would be more direct and involve less approximation. I'm open to your ideas about this!
Originally, I had conceptualized a set of transformations like the `transform_quantile_to_cdf_one_group()` that you describe @elray1. Since most of them required transforming to samples, I decided on the elegant route. But I agree, there is additional approximation that is not necessary and would ideally be avoided.
Question if we opt for the individual transformation functions: consider, for example, that we are transforming from quantile to mean. The steps are (1) estimate the cdf, (2) invert to quantile function, (3) draw samples from quantile function, (4) calculate mean. Would this function call the `transform_quantile_to_samples()` function? Or would it replicate steps 1-3 within the `transform_quantile_to_mean()` function?
@lshandross, I'm happy for you to make the final decision on this one.
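One way to answer that question (function names below follow the naming pattern in this thread but are hypothetical) is to have the mean transform reuse the samples transform, so steps 1-3 live in one place:

```r
# hypothetical sketch: transform_quantile_to_mean built on top of a
# transform_quantile_to_samples helper, rather than repeating steps 1-3
transform_quantile_to_mean_one_group <- function(starting_ps, starting_qs,
                                                 n_samples = 1e4, ...) {
  samples <- transform_quantile_to_samples_one_group(
    starting_ps = starting_ps, starting_qs = starting_qs,
    n_samples = n_samples, ...
  )
  mean(samples)
}
```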
This comment was a good prompt for me to think about which transformations require samples, estimates of the QF, and/or estimates of the CDF. Here's my thinking, does this seem correct?
- starting type = quantile
- new type = quantile: do we support this? should we? if so, would not require samples, use estimated QF
- new type = cdf: samples not needed, use estimated CDF
- new type = pmf: samples not needed, use estimated CDF. Some data structure would be required specifying bin endpoints, or some convention for the names of the `output_type_id`s for bins, e.g. (0,0.1].
- new type = sample: samples needed
- new type = median: samples not needed, use estimated QF
- new type = mean: samples needed
- starting type = cdf
- new type = quantile: samples not needed, use estimated QF
- new type = cdf: do we support this? should we? if so, would not require samples, use estimated CDF
- new type = pmf: samples not needed, use estimated CDF
- new type = sample: samples needed
- new type = median: samples not needed, use estimated QF
- new type = mean: samples needed
- starting type = pmf.
- overall note, i think that for the most part, transforms here only make sense if the variable is a discretization of an underlying continuous variable and we get some information about the bin endpoints used for the discretization. but in that setting, i think the transformation options match what was said above for starting type = cdf.
- though if the variable is nominal, we could go to the CDF type naturally.
- starting type = sample
- you already have samples!
- starting type = median
- can't do any transformations other than the trivial transform to quantile at level 0.5
- starting type = mean
- can't do any transformations
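For reference, the decision table above could be encoded directly in code, e.g. (an illustrative sketch only; the names and structure are not from the PR):

```r
# illustrative encoding of the table above: which intermediary each supported
# conversion would use ("q_fn" = estimated QF, "cdf_fn" = estimated CDF)
intermediary <- list(
  quantile = c(cdf = "cdf_fn", pmf = "cdf_fn", sample = "samples",
               median = "q_fn", mean = "samples"),
  cdf = c(quantile = "q_fn", pmf = "cdf_fn", sample = "samples",
          median = "q_fn", mean = "samples")
)
intermediary[["quantile"]][["mean"]]
#> [1] "samples"
```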
Discussed this with Li just now, and arrived at a plan to merge this PR more or less as-is, then do the following in separate issues:
- Allow new_type = sample for existing probabilistic starting types where they are not supported (quantile and cdf)
- Allow new_type = pmf for existing probabilistic starting types (quantile, cdf, and sample).
- Maybe at the same time, move away from using samples as intermediary for new type = cdf. This would get us to a place where the same methodology is used when the new output type is pmf and cdf.
- Maybe at the same time, support cdf -> cdf, e.g. extending the set of points where the cdf is evaluated. This feels like a low priority extension, not sure there is really a use case?
- Allow `starting_type` = pmf
- Lower priority issue: support quantile --> quantile transformations, e.g. extending the set of quantile levels predicted. This could be useful to modelers who are doing something like quantile regression, to allow them to get predictions at a small number of quantile levels and then fill in the others.
- At this time, could switch to using QF as an intermediary when new_type = quantile or median rather than samples.
This sounds great, thank you both for pushing this across the finish line!
Co-authored-by: Anna Krystalli <[email protected]>
prints more informative validation messages
cdf output type ids are individual values not bins
Our plans have changed since the discussion above: instead of merging this PR basically as-is first, we will essentially overhaul the methodology in this PR so that it does not rely on samples for the underlying calculations. Plans are summarized in this output type conversion vignette for reference
This PR introduces two main functions:
- `convert_output_types()` takes in a `model_out_tbl` with a single output type and can return a new `model_out_tbl` with the desired output types. The idea is to create a user-facing wrapper function that lives in hubData and can process multiple starting output types.
- `validate_ensemble_inputs()`, related to "Refactor to use `get_task_ids_tbl` function for task id validation" (hubEnsembles#111), which could potentially be useful in other packages (e.g., hubUtils)
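A hedged usage sketch, based on the signature quoted earlier in the thread (the input table and argument values are illustrative):

```r
# illustrative call: convert a model_out_tbl containing a single probabilistic
# output type (e.g. cdf) into quantiles at three levels
quantile_out <- convert_output_type(
  model_out_tbl = model_out_tbl,
  new_output_type = "quantile",
  new_output_type_id = c(0.25, 0.5, 0.75),
  n_samples = 1e4
)
```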