Feature/ Handle V3 sample specification #82
Conversation
DESCRIPTION
Infectious-Disease-Modeling-Hubs/hubUtils@enhancement/v3-utils,
Infectious-Disease-Modeling-Hubs/hubData@feature/handle-samples,
Infectious-Disease-Modeling-Hubs/hubAdmin@feature/sample-support,
Noting that we will probably want to change this before going ahead with a merge.
Most definitely.
It's the last thing that needs to be done throughout the packages. It is required atm for tests to run successfully.
I have some questions about the check_tbl_spl_n() behavior. I also have some minor documentation questions.
I was also debating whether it makes sense to have somewhere in the documentation (function documentation and/or vignette) a warning saying that large files with a lot of samples might take time to validate. However, as we don't have a clear estimate of what "large" and "take time" mean, I am not sure how helpful it would be.
In my review, I refer to a test I made where I had issues with the results.
For the test, I used my own temporary "test" repo at LucieContamin/hub_test and the file causing issues is "Hubtest-hubtemp_subset/2024-04-28-Hubtest-hubtemp_subset.parquet".
# Count distinct samples (output_type_id) per compound_idx and flag groups
# whose count falls outside the allowed [n_min, n_max] range
n_tbl <- dplyr::group_by(hash_tbl, .data$compound_idx) %>%
  dplyr::summarise(
    n = dplyr::n_distinct(.data$output_type_id),
    mt_id = unique(.data$mt_id)
  ) %>%
  dplyr::left_join(n_ranges, by = "mt_id") %>%
  dplyr::mutate(
    less = .data$n < .data$n_min,
    more = .data$n > .data$n_max,
    out_range = .data$less | .data$more
  ) %>%
  dplyr::filter(.data$out_range)
I am not sure I understand the logic here: we want to check the number of samples, but per compound id and not by unique task id?
Right. compound_idx is what defines a sample and we need to count samples across compound_idx to ensure they are within the correct range. See for example how samples (output_type_ids) are marked up under different compound task id sets in the docs: https://hubverse.io/en/latest/user-guide/sample-output-type.html#four-submissions-differing-by-compound-modeling-task
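To make the counting concrete, here's a toy illustration (made-up data, not the hub's actual hash_tbl; the column names just mirror the snippet quoted above):

library(dplyr)

# Made-up stand-ins for hash_tbl and n_ranges
hash_tbl <- tibble::tibble(
  compound_idx   = c("1", "1", "1", "2", "2"),
  output_type_id = c("s1", "s2", "s3", "s4", "s5"),
  mt_id          = 1L
)
n_ranges <- tibble::tibble(mt_id = 1L, n_min = 3L, n_max = 5L)

# Same logic as the check: count distinct samples per compound_idx and keep only
# groups whose count falls outside [n_min, n_max]
group_by(hash_tbl, compound_idx) %>%
  summarise(n = n_distinct(output_type_id), mt_id = unique(mt_id)) %>%
  left_join(n_ranges, by = "mt_id") %>%
  filter(n < n_min | n > n_max)
#> compound_idx "2" is returned: it has only 2 distinct samples, below n_min = 3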
I've also opened the following issue. You can currently get this information from the hubData::expand_model_out_val_grid(include_sample_ids = TRUE)` but I think a function that creates a clear table illustrating compound idx structure of a given round/modeling task would be really useful https://github.com/Infectious-Disease-Modeling-Hubs/hubData/issues/40
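In the meantime, something along these lines should surface that information (a rough sketch only: include_sample_ids is from this thread, but the read_config() step, the round_id value and the other argument names are my assumptions, so check the hubData docs):

# Sketch: inspect the expected compound idx structure of a round (assumed arguments)
config_tasks <- hubUtils::read_config(hub_path, "tasks")  # hub_path: path to a local hub clone
grid <- hubData::expand_model_out_val_grid(
  config_tasks,
  round_id = "2024-04-28",   # hypothetical round id
  include_sample_ids = TRUE
)
# For v3 sample output types, the output_type_id column of `grid` effectively
# holds the compound_idx
head(grid)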
The part that I don't understand is why it is linked to the output type id information. For me those are two different pieces of information. We are checking the output type id column in the other 2 functions along with the compound ID checking, so if you have an ID error you will get an error, but it does not mean you are missing samples.
For example, if someone provided the correct number of samples but made an error in the output type id column, then it should not return an error here but only about the output type id numbering.
In the validation instructions it states:
The number of rows for each combination of individual modeling task should fall between min_samples_per_task and max_samples_per_task (inclusive).
compound_idx defines the individual modeling task as the combination of unique compound task id values, so samples need to match a modeling task first to then be counted against the required number for that compound_idx.
If you have provided the wrong output type ID, that should cause an error here too because you have not properly specified the samples, so it would be wrong in my view to count them as valid samples against the expected number.
From the validation spec sheet, this is the test I've tried to encode here:
- Example: task ids are location, origin_date, horizon; compound units defined by combinations of location and origin_date; expect the number of samples for each such compound unit to be between specified min and max.
- Logic: Within each group defined by a unique combination of values for the compound_taskid_set, the number of unique values for output_type_id should be between min_samples_per_task and max_samples_per_task
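A toy encoding of that logic, with hypothetical data matching the spec example (just an illustration, not the actual check implementation):

library(dplyr)

# Hypothetical submission: compound units are (location, origin_date); horizon is not in
# the compound_taskid_set, so it varies freely within a sample; 3 samples per compound unit
tbl <- tidyr::expand_grid(
  location    = c("06", "51"),
  origin_date = "2024-04-28",
  horizon     = 1:4,
  sample      = 1:3
) %>%
  mutate(output_type_id = paste0(location, "_", sample)) %>%
  select(-sample)

# Spec logic: within each compound_taskid_set group, count unique output_type_ids
tbl %>%
  group_by(location, origin_date) %>%
  summarise(n = n_distinct(output_type_id), .groups = "drop")
#> Each (location, origin_date) group has n = 3, which is then compared against
#> min_samples_per_task and max_samples_per_task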
Don't worry! It is confusing... I'm confused!
My interpretation is that the combination of individual modeling tasks is defined by the compound_taskid_set in the config. Each unique combination of values in the variables in the compound_taskid_set corresponds to a single compound_idx. And the number of samples (i.e. unique output_type_ids) is counted separately for each compound_idx.
I note as well that some of this confusion is probably coming from the fact that in the description we do mention coarser samples passing validation (as you noted below), but I haven't implemented that as it wasn't in the validation instructions.
See below for my suggestion of how to proceed with this for now.
Interestingly, I tested the validation with coarser sample settings and it passes validation without any issue (JHU subset file) if the same compound id set is used everywhere.
My hypothesis (but I am really not sure) is that it's because of get_mt_spl_hash_tb(), and in particular this part:
split(tbl, f = tbl$output_type_id) %>%
  purrr::map(
    function(.x, compound_taskids, non_compound_taskids) {
      tibble::tibble(
        compound_idx = names(sort(table(.x$compound_idx), decreasing = TRUE))[1L],
        output_type_id = unique(.x$output_type_id),
        hash_comp_tid = rlang::hash(unique(.x[, compound_taskids])),
        hash_non_comp_tid = rlang::hash(.x[, non_compound_taskids]),
        hash_spl_id = rlang::hash(.x)
      )
    },
    non_compound_taskids = non_compound_taskids,
    compound_taskids = compound_taskids
  ) %>%
  purrr::list_rbind()
If all the samples share the same first compound id then it passes without issue.
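A tiny illustration (hypothetical values) of why that can happen: the line above attributes each sample to a single compound_idx, namely its most frequent one.

# A "coarse" sample s1 whose rows span two compound idxs
x <- data.frame(
  output_type_id = "s1",
  compound_idx   = c("1", "1", "1", "2")
)
names(sort(table(x$compound_idx), decreasing = TRUE))[1L]
#> [1] "1"   # s1 is attributed to compound_idx "1" only; its rows under "2" are not counted there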
If you have provided the wrong output type ID, that should cause an error here too because you have not properly specified the samples, so it would be wrong in my view to count them as valid samples against the expected number.
Also, I understand the logic behind the error, but I also wonder if it's confusing for the user.
For example, I still don't understand the output I got from my test. I have the expected number of samples, but they are indeed misidentified. However, as they fit the minimal requirement of 100 samples per the hub compound id set test, I have no idea how to fix it. As we said before, the error message might also need more information because, for example, I got an error and I don't know what the compound idx 10 means here, as it's associated with different output type ids:
Required samples per compound idx task not present.
File contains less ("69") than the minimum required number of samples per task ("100") for
compound idx "10" [...]
If you dig into the errors object, it gives you a lot more detail, including the composition of each compound_idx and how many samples of that idx were successfully counted. E.g. here's the first element of the output:
$compound_idx
[1] "10"
$n
[1] 69
$min_samples_per_task
[1] 100
$max_samples_per_task
[1] 300
$compound_idx_tbl
# A tibble: 39,312 × 6
origin_date scenario_id location target horizon age_group
<chr> <chr> <chr> <chr> <chr> <chr>
1 2024-04-28 A-2024-03-01 51 inc hosp 1 0-64
2 2024-04-28 A-2024-03-01 51 inc hosp 2 0-64
3 2024-04-28 A-2024-03-01 51 inc hosp 3 0-64
4 2024-04-28 A-2024-03-01 51 inc hosp 4 0-64
5 2024-04-28 A-2024-03-01 51 inc hosp 5 0-64
6 2024-04-28 A-2024-03-01 51 inc hosp 6 0-64
7 2024-04-28 A-2024-03-01 51 inc hosp 7 0-64
8 2024-04-28 A-2024-03-01 51 inc hosp 8 0-64
9 2024-04-28 A-2024-03-01 51 inc hosp 9 0-64
10 2024-04-28 A-2024-03-01 51 inc hosp 10 0-64
# ℹ 39,302 more rows
# ℹ Use `print(n = ...)` to see more rows
There's only so much detail I can add to the message (in fact, looking at it, it's too long atm!). Also, after I've implemented https://github.com/Infectious-Disease-Modeling-Hubs/hubData/issues/40, users will have another way to inspect expected hub compound idxs.
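For instance, a quick way to skim those details (just a sketch: it assumes errors is the list whose first element is printed above, with the fields shown):

# Tabulate one row per failing compound_idx from the errors list (assumed structure)
purrr::map_dfr(errors, function(x) {
  tibble::tibble(
    compound_idx = x$compound_idx,
    n            = x$n,                     # samples actually counted for this compound_idx
    min_required = x$min_samples_per_task
  )
})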
So sorry, I was not clear. I dug into the error message but still cannot find how to replicate the n = 69. I know the solution here is to update the sample ID in the output type id column, but it was just to illustrate my comment on how I find it confusing.
While it would be useful, I also agree that as "large" and "take time" are hard to properly define, I'm not sure just how useful it would be. There are plans to try and improve the performance of validations though, so as part of that work we might get a better sense of what would be useful in the documentation also? I'll draft the performance issue up today and make a note about including more on performance in the docs too.
…gle compound idx. Streamline message.
…. Minor docs tweaks.
Firstly, thanks so much for your review and thorough testing @LucieContamin! It's been really useful to work through. In response I've made a number of changes to the functionality / docs:
Let me know if these resolve your issues for the time being, and feel free to open more issues if you think there's more that needs to be addressed.
Thank you for all the updates! To respond to your three updates:
- (streamline) I think it's a very good idea. It makes the validation faster.
- (error message) I ran some tests on the new version and the error I got is easier to understand, thanks again!
- (coarser compound id) The new version of the validations no longer correctly supports sample IDs for coarser compound task id groups (it did in the first reviewed version). I understand why, but I wonder if it's going to be something we need to add relatively quickly. I will comment in the open issue.
I also added some minor comments on the output error messages.
Thank you very much for all the updates and additional information! I think we can merge it for a first version with the samples!
84/ignore compress ext
Update org name
This PR implements and adds new tests for checking the validity of submissions of samples using the v3 schema sample spec. See the v3 sample validation spec for details.
Specific sample validation tests implemented (#80):
- check_tbl_values()
- check_tbl_values_required()
- check_tbl_spl_n()
- check_tbl_spl_non_compound_tid()
- check_tbl_spl_compound_tid()
The key to the new functions check_tbl_spl_n(), check_tbl_spl_non_compound_tid() and check_tbl_spl_compound_tid() is a table of hashes of model output data joined to the output of the new hubData::expand_model_out_val_grid(include_sample_ids = TRUE), where the output type id column for v3 samples effectively contains the compound_idx. The hashes are calculated on the relevant subsets of values of each sample and aggregated/counted at the relevant level for each check, i.e.:
- check_tbl_spl_compound_tid(): ensure there is only a single unique hash of the combination of values across compound task id columns of all rows associated with samples for a given compound idx.
- check_tbl_spl_non_compound_tid(): ensure there is only a single unique hash of the combination of values across non-compound task id columns of all rows associated with samples for a modeling task.
These checks are performed separately for each round modeling task item, allowing for differences in compound task id sets between round modeling tasks.
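To give a feel for the hash-based approach, here's a minimal toy sketch (hypothetical data and column choices, not the actual implementation):

library(dplyr)

# Toy model output: two samples, each spanning two horizons; location is the only
# compound task id, horizon is the only non-compound task id
tbl <- tibble::tibble(
  location       = c("51", "51", "06", "06"),
  horizon        = c(1L, 2L, 1L, 2L),
  output_type_id = c("s1", "s1", "s2", "s2")
)
compound_taskids     <- "location"
non_compound_taskids <- "horizon"

# One row per sample: hash its (unique) compound task id values and its
# non-compound task id values, mirroring the idea behind the checks
split(tbl, tbl$output_type_id) %>%
  purrr::map(function(.x) {
    tibble::tibble(
      output_type_id    = unique(.x$output_type_id),
      hash_comp_tid     = rlang::hash(unique(.x[, compound_taskids])),
      hash_non_comp_tid = rlang::hash(.x[, non_compound_taskids])
    )
  }) %>%
  purrr::list_rbind()
#> The two samples get different hash_comp_tid values (different compound idxs) but share
#> hash_non_comp_tid, since they cover the same non-compound task id values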
Still to do:
- spl_hash_tbl(), on which many of the new checks depend, can be time consuming with complex configs. I have attempted memoisation but have encountered difficulties in testing, so this is still a work in progress; it shouldn't change the rest of the functionality.
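For reference, a minimal sketch of the memoisation idea (generic stand-in code: it assumes the memoise package, and slow_hash() below is just a placeholder for the expensive step):

library(memoise)

slow_hash <- function(tbl) {   # placeholder for the expensive hashing step
  Sys.sleep(2)
  rlang::hash(tbl)
}
fast_hash <- memoise(slow_hash)

fast_hash(mtcars)  # slow on the first call
fast_hash(mtcars)  # identical arguments: result returned instantly from the cache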