Add output type conversion functions #150
base: main
Conversation
…s-Disease-Modeling-Hubs/hubUtils into output-type-conversion
Overall looking good!
I've made a lot of minor/stylistic comments which you can feel free to ignore!
I've also made some suggestions for tackling a couple of more salient points.
R/convert_output_types.R
Outdated
```r
value = distfromq::make_q_fn(
  ps = .data$value,
  qs = as.numeric(.data$output_type_id), ...
)(runif(n_samples, 0, 1))
```
What is )(runif(n_samples, 0, 1))
doing?
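The construct here is function application: `distfromq::make_q_fn(...)` returns an estimated quantile function, which is then immediately called on `runif(n_samples, 0, 1)`, i.e., inverse-transform sampling. A minimal sketch of the same pattern, with `stats::qnorm` standing in for the estimated quantile function:

```r
# inverse-transform sampling: evaluate a quantile function at Uniform(0, 1) draws
q_fn <- stats::qnorm                # stand-in for the function make_q_fn() returns
samples <- q_fn(runif(1e4, 0, 1))   # 10,000 pseudo-random standard normal draws
```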
this is a really nice start at this functionality! I had one big question about whether it might make sense to avoid the use of samples as an intermediate representation of the distribution in all settings, and some other smaller comments.
R/convert_output_types.R
Outdated
```r
} else if (starting_output_type == "quantile") {
  # if median output desired, and Q50 provided return exact value, otherwise
  # estimate from samples
  if (!("median" %in% new_output_type && 0.5 %in% starting_output_type_ids)) {
```
It seems like we might still want this calculation if, for example, new_output_type = c("median", "cdf")
From the placement of the parentheses, this calculation does occur in that case because it simplifies to !("median" %in% new_output_type) || !(0.5 %in% starting_output_type_ids)
clarifying/restating -- what if we have a situation where:
- `starting_output_type = "quantile"`, with `starting_output_type_ids = c(0.025, 0.25, 0.5, 0.75, 0.975)`
- `new_output_type = c("median", "cdf")`

In this case, `!("median" %in% new_output_type)` and `!(0.5 %in% starting_output_type_ids)` both evaluate to FALSE, and so the condition does not run, right? But in that case we would still need the representation in terms of samples in order to get to the cdf outputs (if we stick with a workflow that uses samples as an intermediate).
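Working through that example in base R (values copied from above) confirms the condition is FALSE, so the branch that estimates samples is skipped:

```r
new_output_type <- c("median", "cdf")
starting_output_type_ids <- c(0.025, 0.25, 0.5, 0.75, 0.975)
!("median" %in% new_output_type && 0.5 %in% starting_output_type_ids)
#> [1] FALSE
```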
Further investigation revealed that your example with median and cdf new output types led to a case where indeed no samples were estimated and the cdf values were instead estimated from the provided quantiles using `stats::ecdf()`.
I went in and made some changes to the function in commit cbe2acf so that samples would be used as an intermediary for consistency, as the plan is to merge this pull request first before switching how we perform the transformation, as detailed below in your other comment.
R/convert_output_types.R
Outdated
```r
if (new_output_type[i] == "median" && 0.5 %in% starting_output_type_ids) {
  model_out_tbl_transform[[i]] <- model_out_tbl %>%
    dplyr::filter(output_type_id == 0.5) %>%
```
I think this is probably fine, so may not be worth updating -- but it seems like there may be an edge case where one model does not provide a predictive quantile at the level 0.5, or they provide some `output_type_id` like 0.49999999999999999999999 that doesn't evaluate as equal to 0.5, in which case we might end up without a row with a median where that model did provide forecasts.
I think the conditions of that edge case would be pretty rare, so again, not sure it's worth coding for them...
Maybe we could file a low priority issue?
Agree, maybe not the most important case to handle. And I might argue that the user should address this (rather than us making assumptions about what is "sufficiently" close to 0.5 to be considered a "median") but I don't feel strongly about this!
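If this ever does get filed as an issue, one possible (purely hypothetical, not part of this PR) mitigation would be a tolerance-based match instead of exact equality:

```r
# hypothetical sketch: treat any quantile level within a small tolerance of 0.5
# as the median, instead of requiring exact equality with 0.5
median_rows <- model_out_tbl %>%
  dplyr::filter(abs(as.numeric(.data$output_type_id) - 0.5) < 1e-6)
```

Though, as noted, deciding what counts as "sufficiently close" is itself an assumption.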
R/convert_output_types.R
Outdated
```r
value = distfromq::make_q_fn(
  ps = as.numeric(.data$output_type_id),
  qs = .data$value, ...
)(runif(n_samples, 0, 1))
```
Couple of notes here:
- This is actually exactly the calculation that's done in the function that's returned by `distfromq::make_r_fn`. So you could do this in a slightly easier-to-read way by calling

```r
value = distfromq::make_r_fn(
  ps = as.numeric(.data$output_type_id),
  qs = .data$value, ...
)(n_samples)
```

- But these are pseudo-random samples rather than quasi-random samples. I actually think the distinction here is probably of low importance. In case someone is converting to output type sample and they want the randomness they'd expect from that, maybe it's best to keep this behavior rather than updating to quasi-random samples. Then we should just update the documentation for the main function above to remove the line "To reduce Monte Carlo variability, we use quasi-random samples corresponding to quantiles of the estimated distribution."
- Otherwise, if we do want to stick with quasi-random samples, we could use a line like the following here:

```r
value = distfromq::make_q_fn(
  ps = as.numeric(.data$output_type_id),
  qs = .data$value, ...
)(seq(from = 0, to = 1, length = n_samples + 2)[2:(n_samples + 1)])
```

- To address a comment I saw somewhere else from Anna about what's going on here, it might be helpful to refactor this into a separate function that takes `.data$output_type_id` and `.data$value` as arguments and has two lines: one calling `distfromq::make_*_fn` and storing that result in `q_fn` or `r_fn` (depending on which way you prefer to go), and a second calling that function. (this refactoring may be less necessary if you switch to just using `make_r_fn` since that one is easier to read)
I support changing to `make_r_fn()`, and following the intuition of samples, we probably want to default to pseudo-random samples. We could also file a low-priority issue to make this sampling scheme an argument that the user can specify? I could imagine use cases of the function where each type of sampling may be desired.
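A sketch of what that low-priority issue might propose (helper name and interface are hypothetical): a small function that takes the estimated quantile function and lets the caller pick the sampling scheme:

```r
# hypothetical helper: draw n_samples values from an estimated quantile function
# q_fn, using either pseudo-random (runif) or quasi-random (evenly spaced) probs
draw_from_q_fn <- function(q_fn, n_samples, scheme = c("pseudo", "quasi")) {
  scheme <- match.arg(scheme)
  ps <- switch(scheme,
    pseudo = runif(n_samples),
    quasi = seq(from = 0.5 / n_samples, by = 1 / n_samples, length.out = n_samples)
  )
  q_fn(ps)
}

draw_from_q_fn(stats::qnorm, 4, scheme = "quasi")  # qnorm at 0.125, 0.375, 0.625, 0.875
```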
```r
convert_output_type <- function(model_out_tbl, new_output_type,
                                new_output_type_id = NA, n_samples = 1e4, ...) {
```
The logic of this function, where everything is converted to samples and from there to the target output type, has a certain kind of elegance and at the same time is kind of inefficient. For example, here's what we do when the `starting_output_type` is "quantile" and the `new_output_type` is "cdf":
- In `get_samples_from_quantiles`, we call `distfromq::make_q_fn` to estimate the quantile function and draw samples from it
  - internally, to get an estimated quantile function, `distfromq::make_q_fn` estimates the cdf and inverts that
- Then we compute the empirical cdf of those samples and evaluate that ecdf.

Restating, our goal was to get to cdf values, and in order to do that we (1) estimate the cdf; (2) invert that to get to a quantile function estimate; (3) draw samples from the quantile function; (4) get the empirical cdf of those samples; (5) evaluate that empirical cdf. Although it is conceptually clean to work through the "universal representation" of samples, steps 2-5 here are extra work and in particular the sample-based stuff in steps 3-5 will introduce more layers of approximation to the result.
Another alternative organization to this code might provide a set of "direct" transformations that don't use samples as an intermediate step. For example, the quantile to cdf transform might define a helper function along the lines of `transform_quantile_to_cdf_one_group`:

```r
transform_quantile_to_cdf_one_group <- function(starting_ps, starting_qs, to_qs, ...) {
  p_fn <- distfromq::make_p_fn(
    ps = as.numeric(starting_ps),
    qs = starting_qs,
    ...
  )
  return(p_fn(to_qs))
}
```
Which could then be called on a grouped data frame somewhere, like

```r
samples <- model_out_tbl %>%
  dplyr::group_by(model_id, dplyr::across(dplyr::all_of(task_id_cols))) %>%
  dplyr::reframe(
    value = transform_quantile_to_cdf_one_group(
      starting_ps = as.numeric(.data$output_type_id),
      starting_qs = .data$value,
      to_qs = new_output_type_id,
      ...
    )
  )
```
I think this approach would likely involve more code in the end, but could be nice as it would be more direct and involve less approximation. I'm open to your ideas about this!
Originally, I had conceptualized a set of transformations like the `transform_quantile_to_cdf_one_group()` that you describe @elray1. Since most of them required transforming to samples, I decided on the elegant route. But I agree, there is additional approximation that is not necessary and would ideally be avoided.
Question if we opt for the individual transformation functions: consider, for example, that we are transforming from quantile to mean. The steps are (1) estimate the cdf, (2) invert to quantile function, (3) draw samples from quantile function, (4) calculate mean. Would this function call the `transform_quantile_to_samples()` function? Or would it replicate steps 1-3 within the `transform_quantile_to_mean()` function?
@lshandross, I'm happy for you to make the final decision on this one.
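One way to answer that question (function names below follow the naming pattern in this thread but are hypothetical) is to have the mean transform reuse the samples transform, so steps 1-3 live in one place:

```r
# hypothetical sketch: transform_quantile_to_mean built on top of a
# transform_quantile_to_samples helper, rather than repeating steps 1-3
transform_quantile_to_mean_one_group <- function(starting_ps, starting_qs,
                                                 n_samples = 1e4, ...) {
  samples <- transform_quantile_to_samples_one_group(
    starting_ps = starting_ps, starting_qs = starting_qs,
    n_samples = n_samples, ...
  )
  mean(samples)
}
```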
This comment was a good prompt for me to think about which transformations require samples, estimates of the QF, and/or estimates of the CDF. Here's my thinking, does this seem correct?
- starting type = quantile
- new type = quantile: do we support this? should we? if so, would not require samples, use estimated QF
- new type = cdf: samples not needed, use estimated CDF
- new type = pmf: samples not needed, use estimated CDF. Some data structure would be required specifying bin endpoints, or some convention for the names of the `output_type_id`s for bins, e.g. (0,0.1].
- new type = sample: samples needed
- new type = median: samples not needed, use estimated QF
- new type = mean: samples needed
- starting type = cdf
- new type = quantile: samples not needed, use estimated QF
- new type = cdf: do we support this? should we? if so, would not require samples, use estimated CDF
- new type = pmf: samples not needed, use estimated CDF
- new type = sample: samples needed
- new type = median: samples not needed, use estimated QF
- new type = mean: samples needed
- starting type = pmf.
- overall note, i think that for the most part, transforms here only make sense if the variable is a discretization of an underlying continuous variable and we get some information about the bin endpoints used for the discretization. but in that setting, i think the transformation options match what was said above for starting type = cdf.
- though if the variable is nominal, we could go to the CDF type naturally.
- starting type = sample
- you already have samples!
- starting type = median
- can't do any transformations other than the trivial transform to quantile at level 0.5
- starting type = mean
- can't do any transformations
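For reference, the decision table above could be encoded directly in code, e.g. (an illustrative sketch only; the names and structure are not from the PR):

```r
# illustrative encoding of the table above: which intermediary each supported
# conversion would use ("q_fn" = estimated QF, "cdf_fn" = estimated CDF)
intermediary <- list(
  quantile = c(cdf = "cdf_fn", pmf = "cdf_fn", sample = "samples",
               median = "q_fn", mean = "samples"),
  cdf = c(quantile = "q_fn", pmf = "cdf_fn", sample = "samples",
          median = "q_fn", mean = "samples")
)
intermediary[["quantile"]][["mean"]]
#> [1] "samples"
```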
Discussed this with Li just now, and arrived at a plan to merge this PR more or less as-is, then do the following in separate issues:
- Allow new_type = sample for existing probabilistic starting types where they are not supported (quantile and cdf)
- Allow new_type = pmf for existing probabilistic starting types (quantile, cdf, and sample).
- Maybe at the same time, move away from using samples as intermediary for new type = cdf. This would get us to a place where the same methodology is used when the new output type is pmf and cdf.
- Maybe at the same time, support cdf -> cdf, e.g. extending the set of points where the cdf is evaluated. This feels like a low priority extension, not sure there is really a use case?
- Allow `starting_type` = pmf
- Lower priority issue: support quantile --> quantile transformations, e.g. extending the set of quantile levels predicted. This could be useful to modelers who are doing something like quantile regression, to allow them to get predictions at a small number of quantile levels and then fill in the others.
- At this time, could switch to using QF as an intermediary when new_type = quantile or median rather than samples.
This sounds great, thank you both for pushing this across the finish line!
Co-authored-by: Anna Krystalli <[email protected]>
prints more informative validation messages
cdf output type ids are individual values not bins
Our plans have changed since the discussion above: instead of merging this PR basically as-is first, we will essentially overhaul the methodology in this PR so that it does not rely on samples for the underlying calculations. Plans are summarized in this output type conversion vignette for reference
This PR introduces two main functions:
- `convert_output_types()` takes in a `model_out_tbl` with a single output type and can return a new `model_out_tbl` with the desired output types. The idea is to create a user-facing wrapper function that lives in hubData and can process multiple starting output types.
- `validate_ensemble_inputs()`, related to "Refactor to use `get_task_ids_tbl` function for task id validation" (hubEnsembles#111), which could potentially be useful in other packages (e.g., hubUtils)
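A hedged usage sketch, based on the signature quoted earlier in the thread (the input table and argument values are illustrative):

```r
# illustrative call: convert a model_out_tbl containing a single probabilistic
# output type (e.g. cdf) into quantiles at three levels
quantile_out <- convert_output_type(
  model_out_tbl = model_out_tbl,
  new_output_type = "quantile",
  new_output_type_id = c(0.25, 0.5, 0.75),
  n_samples = 1e4
)
```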