Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add output type conversion functions #150
base: main
Are you sure you want to change the base?
Add output type conversion functions #150
Changes from 15 commits
49eab2b
bca3fab
8dd2ad8
48dfffb
71dacb9
9b81d7a
d8f9e4b
5b9e311
2a6c595
10ff653
7243570
243cdb5
cc4e002
4db0992
bee1c61
94e1b40
28a6be6
cd82801
8bfee20
727d33d
d07d985
2d7d19c
0f73f84
944404d
94fa108
ac6333a
e8595e1
781016d
1a9789e
84ff4aa
9fa2e35
db08c45
75df987
ee6b098
de84f22
d4119e2
1e48584
66ad690
4b8ed54
e23e047
6dd2926
95d6f28
254b737
522e88f
054bffb
cbe2acf
8d2e95b
874ad81
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a suggestion but how do you feel about the argument name
to_output_type
instead ofnew_output_type
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
related to above comment, how about
to_output_type_id
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is
c("group1")
for in this example?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic of this function, where everything is converted to samples and from there to the target output type, has a certain kind of elegance and at the same time is kind of inefficient. For example, here's what we do when the
starting_output_type
is "quantile" and thenew_output_type
is "cdf":get_samples_from_quantiles
, we calldistfromq::make_q_fn
to estimate the quantile function and draw samples from itdistfromq::make_q_fn
estimates the cdf and inverts thatRestating, our goal was to get to cdf values, and in order to do that we (1) estimate the cdf; (2) invert that to get to a quantile function estimate; (3) draw samples from the quantile function; (4) get the empirical cdf of those samples; (5) evaluate that empirical cdf. Although it is conceptually clean to work through the "universal representation" of samples, steps 2-5 here are extra work and in particular the sample-based stuff in steps 3-5 will introduce more layers of approximation to the result.
Another alternative organization to this code might provide a set of "direct" transformations that don't use samples as an intermediate step. For example, the quantile to cdf transform might define a helper function along the lines of
transform_quantile_to_cdf_one_group
along these lines:Which could then be called on a grouped data frame somewhere, like
I think this approach would likely involve more code in the end, but could be nice as it would be more direct and involve less approximation. I'm open to your ideas about this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Originally, I had conceptualized a set of transformations like
transform_quantile_to_cdf_one_group()
that you describe @elray1. When most required transforming to samples, I decided for the elegant route. But I agree, there is additional approximation that is not necessary, and would ideally be avoided.Question if we opt for the individual transformation functions: consider, for example, we are transforming from quantile to mean. The steps are (1) estimate the cdf, (2) invert to quantile function, (3) draw samples from quantile function, (4) calculate mean. Would this function call the
transform_quantile_to_samples()
function? Or would it replicate steps 1-3 within thetransform_quantile_to mean()
function?@lshandross, I'm happy for you to make the final decision on this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment was a good prompt for me to think about which transformations require samples, estimates of the QF, and/or estimates of the CDF. Here's my thinking, does this seem corret?
output_type_id
s for bins, e.g.(0,0.1]
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed this with Li just now, and arrived at a plan to merge this PR more or less as-is, then do the following in separate issues:
starting_type
= pmfThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds great, thank you both for pushing this across the finish line!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like we might still want this calculation if, for example,
new_output_type = c("median", "cdf")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the placement of the parentheses, this calculation does occur in that case because it simplifies to
!("median" %in% new_output_type) || !(0.5 %in% starting_output_type_ids)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clarifying/restating -- what if we have a situation where:
starting_output_type = "quantile"
, withstarting_output_type_ids = c(0.025, 0.25, 0.5, 0.75, 0.975)
new_output_type = c("median", "cdf")
In this case,
!("median" %in% new_output_type)
and!(0.5 %in% starting_output_type_ids)
both evaluate toFALSE
and so the condition does not run, right? But in that case we would still need the representation in terms of samples in order to get to the cdf outputs (if we stick with a workflow that uses samples as an intermediate).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Further investigation revealed that your example with median and cdf new output types led to a case where indeed no samples were estimated and the cdf values were instead estimated from the provided quantiles using
stats::ecdf()
.I went in and made some changes to the function so that samples would be used as an intermediary for consistency, as the plan is to merging this pull request first before switching how we perform the transformation as detailed below in your other comment in commit cbe2acf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is probably fine, so may not be worth updating -- but it seems like there may be an edge case where one model does not provide a predictive quantile at the level 0.5, or they provide some
output_type_id
like0.49999999999999999999999
that doesn't evaluate as equal to 0.5, in which case we might end up without a row with a median where that model did provide forecasts.I think the conditions of that edge case would be pretty rare, so again, not sure it's worth coding for them...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could file a low priority issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, maybe not the most important case to handle. And I might argue that the user should address this (rather than us making assumptions about what is "sufficiently" close to 0.5 to be considered a "median") but I don't feel strongly about this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of notes here:
distfromq::make_r_fn
. So you could do this in a slightly easier-to-read way by calling.data$output_type_id
and.data$value
as arguments and has two lines: one callingdistfromq::make_*_fn
and storing that result inq_fn
orr_fn
(depending on which way you prefer to go), and a second calling that function. (this refactoring may be less necessary if you switch to just usingmake_r_fn
since that one is easier to read)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I support changing to
make_r_fn()
, and that following the intuition of samples, we probably want to default to pseudo-random samples. We could also file a low-priority issue to make this sampling scheme an argument that the user can specify? I could imagine use cases of the function where each type of sampling may be desired.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is
)(runif(n_samples, 0, 1))
doing?