-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add support for ascending CDF values with numeric output_type_id cast as character #105
base: main
Are you sure you want to change the base?
Conversation
Thanks for working on this @zkamvar ! Unfortunately this does not solve the problem when This has sadly been introduced by the fact we relaxed the formatting requirements for character cdf output type IDs in schema version v3.0.1 which is now coming back to bite us. In previous schema, the need for for character cdf output type IDs to conform to the regex pattern See below: library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(hubValidations)
#>
#> Attaching package: 'hubValidations'
#> The following object is masked from 'package:dplyr':
#>
#> combine
library(testthat)
#>
#> Attaching package: 'testthat'
#> The following object is masked from 'package:dplyr':
#>
#> matches
ex <- structure(
list(
location = c(
"01", "01", "01", "01", "01", "01",
"01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01",
"01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01",
"01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01",
"01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01",
"01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01",
"01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01",
"01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01",
"01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01",
"01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01",
"01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01",
"01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01"
),
reference_date = structure(c(
19371, 19371, 19371, 19371, 19371,
19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
19371, 19371, 19371, 19371, 19371
), class = "Date"),
horizon = c(
0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
),
target_end_date = structure(c(
19371, 19371, 19371, 19371,
19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
19371, 19371, 19371, 19371, 19371, 19371
), class = "Date"),
target = c(
"wk inc flu hosp",
"wk inc flu hosp", "wk inc flu hosp", "wk inc flu hosp", "wk inc flu hosp",
"wk inc flu hosp", "wk inc flu hosp", "wk inc flu hosp", "wk inc flu hosp",
"wk inc flu hosp", "wk inc flu hosp", "wk inc flu hosp", "wk inc flu hosp",
"wk inc flu hosp", "wk inc flu hosp", "wk inc flu hosp", "wk inc flu hosp",
"wk inc flu hosp", "wk inc flu hosp", "wk inc flu hosp", "wk inc flu hosp",
"wk inc flu hosp", "wk inc flu hosp", "wk flu hosp rate category",
"wk flu hosp rate category", "wk flu hosp rate category", "wk flu hosp rate category",
"wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
"wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
"wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
"wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
"wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
"wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
"wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
"wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
"wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
"wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
"wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
"wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
"wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
"wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
"wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
"wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
"wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
"wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
"wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
"wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
"wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
"wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
"wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
"wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
"wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate"
),
output_type = c(
"quantile", "quantile", "quantile", "quantile",
"quantile", "quantile", "quantile", "quantile", "quantile", "quantile",
"quantile", "quantile", "quantile", "quantile", "quantile", "quantile",
"quantile", "quantile", "quantile", "quantile", "quantile", "quantile",
"quantile", "pmf", "pmf", "pmf", "pmf", "cdf", "cdf", "cdf",
"cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf",
"cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf",
"cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf",
"cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf",
"cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf",
"cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf",
"cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf",
"cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf",
"cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf",
"cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf",
"cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf"
),
output_type_id = c(
"0.01",
"0.025", "0.05", "0.1", "0.15", "0.2", "0.25", "0.3", "0.35",
"0.4", "0.45", "0.5", "0.55", "0.6", "0.65", "0.7", "0.75", "0.8",
"0.85", "0.9", "0.95", "0.975", "0.99", "low", "moderate", "high",
"very high", "0.25", "0.5", "0.75", "1", "1.25", "1.5", "1.75",
"2", "2.25", "2.5", "2.75", "3", "3.25", "3.5", "3.75", "4",
"4.25", "4.5", "4.75", "5", "5.25", "5.5", "5.75", "6", "6.25",
"6.5", "6.75", "7", "7.25", "7.5", "7.75", "8", "8.25", "8.5",
"8.75", "9", "9.25", "9.5", "9.75", "10", "10.25", "10.5", "10.75",
"11", "11.25", "11.5", "11.75", "12", "12.25", "12.5", "12.75",
"13", "13.25", "13.5", "13.75", "14", "14.25", "14.5", "14.75",
"15", "15.25", "15.5", "15.75", "16", "16.25", "16.5", "16.75",
"17", "17.25", "17.5", "17.75", "18", "18.25", "18.5", "18.75",
"19", "19.25", "19.5", "19.75", "20", "20.25", "20.5", "20.75",
"21", "21.25", "21.5", "21.75", "22", "22.25", "22.5", "22.75",
"23", "23.25", "23.5", "23.75", "24", "24.25", "24.5", "24.75",
"25"
),
value = c(
17, 44, 72, 105, 122, 125, 127, 128, 131, 132,
133, 136, 139, 140, 141, 144, 145, 147, 150, 167, 200, 228, 255,
0.220842781557067, 0.768398474282558, 0.0107282559276931, 3.04882326815914e-05,
0.00853380042747561, 0.0135533534527697, 0.0208413454592117,
0.0299015976961877, 0.0406341808051564, 0.0548617841719505, 0.0720793809659529,
0.0929593384551091, 0.111686153095421, 0.236379680785012, 0.560349384758665,
0.864271648664744, 0.89630163333021, 0.918320726070205, 0.937138052331475,
0.952967021875378, 0.96520141962482, 0.974905656518415, 0.983169904293088,
0.989315411865382, 0.993311235511738, 0.995919595802813, 0.997577247537977,
0.998600149380724, 0.999213049267616, 0.999569631735671, 0.999771071520066,
0.999881567193594, 0.999940419028641, 0.999970855156356, 0.999986139027956,
0.999993591354829, 0.999997119649744, 0.999998741655025, 0.999999465680723,
0.999999779493023, 0.999999911561791, 0.999999965530796, 0.999999986945038,
0.999999995195438, 0.999999998281901, 0.999999999403044, 0.999999999798479,
0.999999999933905, 0.999999999978939, 0.99999999999348, 0.999999999998039,
0.999999999999427, 0.999999999999838, 0.999999999999955, 0.999999999999988,
0.999999999999997, 0.999999999999999, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
)
),
row.names = c(
NA,
-127L
),
class = c("tbl_df", "tbl", "data.frame")
)
res <- check_tbl_value_col_ascending(ex, file_path = "")
expect_s3_class(res, "check_success")
#> Error: `res` inherits from 'check_failure'/'hub_check'/'rlang_warning'/'warning'/'condition' not 'check_success'.
expect_null(res$error_tbl)
#> Error: res$error_tbl is not NULL
#>
#> `actual` is an S3 object of class <tbl_df/tbl/data.frame>, a list
#> `expected` is NULL
# Test on character cdf values with consistent lengths. This should pass and
# was imposed by schema versions less than v3.0.1 where the format of a character
# output_type_id was strict.
ex <- ex |>
dplyr::filter(output_type == "cdf") |>
dplyr::mutate(output_type_id = paste0(
"EW2024",
stringr::str_pad(seq_along(output_type),
width = 2, pad = "0"
)
)) |>
dplyr::slice(1:52)
ex[, c("output_type_id", "value")]
#> # A tibble: 52 × 2
#> output_type_id value
#> <chr> <dbl>
#> 1 EW202401 0.00853
#> 2 EW202402 0.0136
#> 3 EW202403 0.0208
#> 4 EW202404 0.0299
#> 5 EW202405 0.0406
#> 6 EW202406 0.0549
#> 7 EW202407 0.0721
#> 8 EW202408 0.0930
#> 9 EW202409 0.112
#> 10 EW202410 0.236
#> # ℹ 42 more rows
res <- check_tbl_value_col_ascending(ex, file_path = "")
expect_s3_class(res, "check_success")
expect_null(res$error_tbl)
# test on character cdf values with inconsistent lengths. After relaxation of
# the format requirements of cdf character output_type_id this sort of situation
# is now possible. We want this to pass but at the minute it fails for the same
# reasons reported in #78
ex <- ex |>
dplyr::filter(output_type == "cdf") |>
dplyr::mutate(output_type_id = paste0("EW", seq_along(output_type))) |>
dplyr::slice(1:52)
ex[, c("output_type_id", "value")]
#> # A tibble: 52 × 2
#> output_type_id value
#> <chr> <dbl>
#> 1 EW1 0.00853
#> 2 EW2 0.0136
#> 3 EW3 0.0208
#> 4 EW4 0.0299
#> 5 EW5 0.0406
#> 6 EW6 0.0549
#> 7 EW7 0.0721
#> 8 EW8 0.0930
#> 9 EW9 0.112
#> 10 EW10 0.236
#> # ℹ 42 more rows
res <- check_tbl_value_col_ascending(ex, file_path = "")
expect_s3_class(res, "check_success")
#> Error: `res` inherits from 'check_failure'/'hub_check'/'rlang_warning'/'warning'/'condition' not 'check_success'.
expect_null(res$error_tbl)
#> Error: res$error_tbl is not NULL
#>
#> `actual` is an S3 object of class <tbl_df/tbl/data.frame>, a list
#> `expected` is NULL Created on 2024-08-13 with reprex v2.1.0 @elray1 any thoughts on this? |
(NOTE: adding hubverse-org/schemas#80 and the corresponding PR (hubverse-org/schemas#93) so that we have a more direct reference as the issue shows it being fixed by hubverse-org/schemas#88) So, if I understand this correctly, we had a formatting requirement for cdf that was either a number (which this PR would fix) OR an epi week string and then in hubverse-org/schemas#80, and then it was suggested to remove the restrictive requirement for an epi week string because "one could imagine collecting CDF predictions for an ordinal categorical variable other than epidemic weeks with strings naming the categories that don't match that format." The solution was to remove the requirement. In order to solve #78, we need to enforce some format that we can sort numerically. Would a regex of |
I suggest that we address #78 by asserting/requiring that the hub has listed the values in their tasks.json file in the correct order, and do the sorting here in the same order as they were given in that config file. I think it's best to avoid relying on enforcing that the category labels/names can be sorted in a standard numeric or lexicographic order that aligns with the ordering of those categories. There are some examples of how a hub might need to specify this under point 1 in this dicussion comment. The suggestion I made there is that even if the values are numeric (and so an ordering is implicit), we would just use the order that was given in that file. |
This makes sense. We can not enforce the creation of this part of the config computationally, but it's a reasonable social expectation as long as it's documented well. If I am not mistaken, this is a field that hub administrators would set this once and never touch again? As I understand it from #70, modellers want to submit in any order, but they were met with this restriction and we worked around it by arranging by the I will see what I can do to incorporate the order from the config file. |
Quick implementation question @elray1, how would we make use of any ordering split across |
This all sounds right to me. Just to be clear, I think we would still (potentially) need to re-order the submission by the
Good question that I hadn't thought of. It kind of seems intractable? I think I'd be inclined to say that for the cdf and pmf types (where this matters), it doesn't make sense to put some values in required and some in optional, they should either all be required or all be optional. Certainly this seems logical for the pmf type, where we would generally want the probabilities across all categories to sum to 1. It also feels reasonable to me for the cdf type. But I'm a little worried that I'm making up new rules here. |
Clear as an unmuddied lake under an expanse of endless blue sky 👍. My approach uses |
I think this is ready for review now. The hardest part was actually coming up with the test data due to the new requirement of a hub, but I was able to get it to work. |
@elray1 it does feel a little like making up rules on the fly and I worry about how ad hoc this might look in the documentation. If there is definitely no reasonable sortable pattern we can enforce, can we atleast make the requirement for @zkamvar thanks for the speedy response to the feedback! The use of Would you mind holding back until that is all merged in shortly as this PR will likely require modification to make use of the optimisations coming and I wouldn't want to merge as is as it would add to the current computational load. |
Of course! I also realized this is solvable by directly pulling |
In fact you were on the right track with using the expand grid approach because, to be safe when we're comparing submitted data to config settings, because a single round might have multiple model tasks, to make sure we're comparing data to the correct model task config we need to split our See for example what this new function is doing in terms of separating hubValidations/R/match_tbl_to_model_task.R Lines 25 to 49 in 7fe78b4
which is then used to compared the value column values to the config for each model task:hubValidations/R/check_tbl_value_col.R Lines 47 to 58 in 7fe78b4
But the new functions will allow us to just subset for output types of interest and ignore derived task ids which makes everything much faster and more efficient. |
I think the I thought about bringing this up on the hubverse devteam call today, but it felt like there were too few people on the call. If it seems reasonable to you two, suggest that we just hold off on this discussion for a couple of weeks till Nick is back, settle that piece in a devteam call, and let that inform how we deal with this PR/determining the ordering. |
That's a correct assessment since the default hubValidations/R/expand_model_out_grid.R Lines 200 to 206 in 172091c
That being said, I think this can work if we do not assert any relationship between required and optional values (which, yes is kinda silly since they are literally coming from the same model, but it would still be an improvement over not having the ability to validate ascending values for character
|
I agree @elray1 , let's discuss a bit more with others before pushing ahead. |
Note that this should be revisited for schemas v4 |
Ensuring a mix of numeric and character values are sorted correctly involves creating a separate column for the numerics and using that as the first sort key. This then allows the original method of validating ascent to work. Another solution is split the data frame into a list and then perform a numeric sort if the column can be transformed, but the coercion will probably have the equivalent amount of overhead. This will fix #78
02f3939
to
cc7bfe8
Compare
NOTE 1: this requires #107 to be merged first to reduce bottleneck.
NOTE 2: due to the uncertainty of handling ordering the split between required/optional output type ids (#105 (comment)), we are going to leave the solution for discussion until September when we can have more eyeballs on it (#105 (comment))
EDIT: This PR creates a lookup table from the required values in the config file and perform an inner join to force the values in the submission to sort with the config order. An earlier version added a new column to sort by numeric, but that did not work with output types such as
EW42
(#105 (comment))This will fix #78