
Feature/ Handle V3 sample specification #82

Merged · 40 commits from feature/handle-samples into main · Jun 19, 2024
Conversation

@annakrystalli (Member) commented on May 20, 2024

This PR implements, and adds new tests for, checking the validity of submissions of samples using the v3 schema sample spec. See the v3 sample validation spec for details.

Specific sample validation tests implemented (#80):

  • Validate that all value combinations are valid (as part of check_tbl_values()).
  • Validate that all required value combinations are submitted (as part of check_tbl_values_required()).
  • Validate that the correct number of samples per compound idx is submitted. Added through function check_tbl_spl_n().
  • Validate that samples within a submission file contain the same combination of optional (non-compound task id) values across all samples. Added through function check_tbl_spl_non_compound_tid().
  • Validate that samples conform to the sample dependence defined by the compound task id set configuration, i.e. all samples for a given compound idx contain the same unique combination of compound task id values. Added through function check_tbl_spl_compound_tid().
  • Add the new validation checks to validate_model_data() in a backwards-compatible way (i.e. only deploy them if using a v3 config).

The key to the new functions check_tbl_spl_n(), check_tbl_spl_non_compound_tid() and check_tbl_spl_compound_tid() is a table of hashes computed on model output data joined to the output of the new hubData::expand_model_out_val_grid(include_sample_ids = TRUE), where the output type id column for v3 samples effectively contains the compound_idx. The hashes are calculated on the relevant subsets of values of each sample and aggregated/counted at the relevant level for each check, i.e.:

  • check_tbl_spl_n(): Count unique output type id values per compound idx. The hash table provides a mapping between output type ids and compound idxs.
  • check_tbl_spl_compound_tid(): Ensure there is only a single unique hash of the combination of values across compound task id columns of all rows associated with samples for a given compound idx.
  • check_tbl_spl_non_compound_tid(): Ensure there is only a single unique hash of the combination of values across non-compound task id columns of all rows associated with samples for a modeling task.

These checks are performed separately for each round modeling task item, allowing compound task id sets to differ between the modeling tasks of a round.
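For illustration, the general shape of the hash table computation looks roughly like the following sketch (toy data and hypothetical task id columns; not the actual package code):

library(dplyr)

# Toy submission: 2 compound idxs, 2 samples each; `location` is the compound
# task id, `horizon` the non-compound task id (hypothetical columns)
toy_tbl <- tibble::tibble(
  location       = rep(c("US", "FR"), each = 4),
  horizon        = rep(1:2, times = 4),
  compound_idx   = rep(c("1", "2"), each = 4),
  output_type_id = as.character(rep(1:4, each = 2)) # sample index
)
compound_taskids     <- "location"
non_compound_taskids <- "horizon"

# One row per sample: hash the compound and non-compound task id value subsets
hash_tbl <- split(toy_tbl, f = toy_tbl$output_type_id) %>%
  purrr::map(
    ~ tibble::tibble(
      compound_idx      = unique(.x$compound_idx),
      output_type_id    = unique(.x$output_type_id),
      hash_comp_tid     = rlang::hash(unique(.x[, compound_taskids])),
      hash_non_comp_tid = rlang::hash(.x[, non_compound_taskids])
    )
  ) %>%
  purrr::list_rbind()

# e.g. the basis of check_tbl_spl_n(): count samples per compound idx
hash_tbl %>%
  group_by(compound_idx) %>%
  summarise(n = n_distinct(output_type_id))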

Still to do:

  • The spl_hash_tbl() function on which many of the new checks depend can be time-consuming with complex configs. I have attempted memoisation but have encountered difficulties in testing, so this is still a work in progress; it shouldn't change the rest of the functionality (see the sketch after this list).
  • Add more tests, especially varying the compound task id set.
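A minimal sketch of what such memoisation could look like (the wrapper name and cache settings are illustrative only, not necessarily the approach attempted in this PR):

# Memoise the expensive hash-table computation; assumes spl_hash_tbl() exists
spl_hash_tbl_memo <- memoise::memoise(
  spl_hash_tbl,
  cache = cachem::cache_mem(max_age = 60 * 60) # keep cached results for an hour
)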

@annakrystalli self-assigned this on May 20, 2024
@annakrystalli added this to the sample output_type v1.0 milestone on May 20, 2024
@annakrystalli linked an issue on May 20, 2024 that may be closed by this pull request
@annakrystalli marked this pull request as draft on May 20, 2024 08:46
@annakrystalli changed the title from "[WIP] Feature/ Handle V3 sample specification" to "Feature/ Handle V3 sample specification" on May 28, 2024
@annakrystalli marked this pull request as ready for review on May 28, 2024 13:52
DESCRIPTION Outdated
Comment on lines 63 to 65
Infectious-Disease-Modeling-Hubs/hubUtils@enhancement/v3-utils,
Infectious-Disease-Modeling-Hubs/hubData@feature/handle-samples,
Infectious-Disease-Modeling-Hubs/hubAdmin@feature/sample-support,
Contributor:
Noting that we will probably want to change this before going ahead with a merge.

Member Author:
Most definitely.
It's the last thing that needs to be done throughout the packages. It is required atm for tests to run successfully.

Contributor @LucieContamin left a comment:

I have some questions about the check_tbl_spl_n() behavior. I also have some minor documentation questions.

I was also debating whether it makes sense to have, somewhere in the documentation (function documentation and/or vignette), a warning saying that large files with a lot of samples might take time to validate. However, as we don't have a clear estimate of what "large" and "take time" mean, I am not sure how helpful it would be.

In my review, I refer to a test that I made where I had issues with the results.
For the test, I used my own temporary "test" repo at LucieContamin/hub_test; the file causing the issue is "Hubtest-hubtemp_subset/2024-04-28-Hubtest-hubtemp_subset.parquet".

R/check_tbl_spl_compound_tid.R (resolved)
tests/testthat/_snaps/check_tbl_spl_compound_tid.md (Outdated, resolved)
Comment on lines +17 to +28
n_tbl <- dplyr::group_by(hash_tbl, .data$compound_idx) %>%
dplyr::summarise(
n = dplyr::n_distinct(.data$output_type_id),
mt_id = unique(.data$mt_id)
) %>%
dplyr::left_join(n_ranges, by = "mt_id") %>%
dplyr::mutate(
less = .data$n < .data$n_min,
more = .data$n > .data$n_max,
out_range = .data$less | .data$more
) %>%
dplyr::filter(.data$out_range)
Contributor:
I am not sure I understand the logic here: we want to check the number of samples, but per compound idx and not by unique task id?

Member Author:
Right. compound_idx is what defines a sample, and we need to count samples per compound_idx to ensure they are within the correct range. See, for example, how samples (output_type_ids) are marked up under different compound task id sets in the docs: https://hubverse.io/en/latest/user-guide/sample-output-type.html#four-submissions-differing-by-compound-modeling-task

Member Author @annakrystalli commented on Jun 14, 2024:
I've also opened the following issue. You can currently get this information from hubData::expand_model_out_val_grid(include_sample_ids = TRUE), but I think a function that creates a clear table illustrating the compound idx structure of a given round/modeling task would be really useful: https://github.com/Infectious-Disease-Modeling-Hubs/hubData/issues/40

Contributor:
The part that I don't understand is why it is linked to the output type id information. For me these are two different pieces of information. We are already checking the output type id column in the other two functions alongside the compound ID check, so if you have an ID error you will get an error, but that does not mean you are missing samples.
For example, if someone provided the correct number of samples but made an error in the output type id column, then it should not return an error here but only about the output type id numbering.

Member Author @annakrystalli commented on Jun 14, 2024:
In the validation instructions it states:

The number of rows for each combination of individual modeling task should fall between min_samples_per_task and max_samples_per_task (inclusive).

compound_idx defines an individual modeling task via its unique combination of compound task id values, so samples need to be matched to a modeling task first and then counted against the required number for that compound_idx.

If you have provided the wrong output type ID, that should cause an error here too because you have not properly specified the samples, so it would be wrong in my view to count them as valid samples against the expected number.

From the validation spec sheet, this is the test I've tried to encode here:

  • Example: task ids are location, origin_date, horizon; compound units defined by combinations of location and origin_date; expect the number of samples for each such compound unit to be between specified min and max.
  • Logic: Within each group defined by a unique combination of values for the compound_taskid_set, the number of unique values for output_type_id should be between min_samples_per_task and max_samples_per_task
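To make that logic concrete, here is a toy illustration using the task ids from the spec example above (not code from the PR or its tests):

library(dplyr)

# Compound units defined by location + origin_date; horizon varies within a unit
toy_tbl <- tidyr::expand_grid(
  location       = c("US", "FR"),
  origin_date    = "2024-04-28",
  horizon        = 1:3,
  output_type_id = as.character(1:100) # 100 samples per compound unit
)

# Number of unique output_type_id values per compound unit; each count should
# fall within [min_samples_per_task, max_samples_per_task]
toy_tbl %>%
  group_by(location, origin_date) %>%
  summarise(n = n_distinct(output_type_id), .groups = "drop")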

Member Author:
Don't worry! It is confusing...I'm confused !

My interpretation is that the combination of individual modeling tasks is defined by the compound_taskid_set in the config. Each unique combination of values of the variables in the compound_taskid_set corresponds to a single compound_idx, and the number of samples (i.e. unique output_type_ids) is counted separately for each compound_idx.

I note as well that some of this confusion is probably coming from the fact that in the description we do mention coarser samples passing validation (as you noted below), but I haven't implemented that as it wasn't in the validation instructions.

See below for my suggestion of how to proceed with this for now.

Contributor:
Interestingly, I tested the validation with a coarser sample setting and it passes validation without any issue (JHU subset file) if the same compound id set is used everywhere.
My hypothesis (though I am really not sure) is that it's because of get_mt_spl_hash_tb(), and in particular this part:

split(tbl, f = tbl$output_type_id) %>%
    purrr::map(
      function(.x, compound_taskids, non_compound_taskids) {
        tibble::tibble(
          compound_idx = names(sort(table(.x$compound_idx), decreasing = TRUE))[1L],
          output_type_id = unique(.x$output_type_id),
          hash_comp_tid = rlang::hash(unique(.x[, compound_taskids])),
          hash_non_comp_tid = rlang::hash(.x[, non_compound_taskids]),
          hash_spl_id = rlang::hash(.x)
        )
      },
      non_compound_taskids = non_compound_taskids,
      compound_taskids = compound_taskids
    ) %>%
    purrr::list_rbind()

If all the samples share the same first compound id then it passes without issue.
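For example, a quick toy illustration (not from the package tests) of what that selection does: names(sort(table(x), decreasing = TRUE))[1L] keeps only the most frequent compound idx for a sample, so a minority of mismatched idxs is silently dropped.

x <- c("1", "1", "1", "2")                   # compound idxs across a sample's rows
names(sort(table(x), decreasing = TRUE))[1L] # returns "1"; the stray "2" is dropped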

Contributor @LucieContamin commented on Jun 14, 2024:
If you have provided the wrong output type ID, that should cause an error here too because you have not properly specified the samples, so it would be wrong in my view to count them as valid samples against the expected number.

Also, I understand the logic behind the error but I also wonder whether it's confusing for the user.
For example, I still don't understand the output I got from my test. I have the expected number of samples, but they are indeed misidentified. However, as they fit the minimum requirement of 100 samples per the hub compound id set test, I have no idea how to fix it. As we said before, the error message might also need more information, because for example I got an error and I don't know what compound idx "10" means here, as it's associated with different output type ids:

Required samples per compound idx task not present.
  File contains less ("69") than the minimum required number of samples per task ("100") for
  compound idx "10" [...]

Member Author @annakrystalli commented on Jun 14, 2024:
If you dig into the errors object, it gives you a lot more detail, including the composition of each compound_idx and how many samples of that idx were successfully counted; e.g. here's the first element of the output:

$compound_idx
[1] "10"

$n
[1] 69

$min_samples_per_task
[1] 100

$max_samples_per_task
[1] 300

$compound_idx_tbl
# A tibble: 39,312 × 6
   origin_date scenario_id  location target   horizon age_group
   <chr>       <chr>        <chr>    <chr>    <chr>   <chr>    
 1 2024-04-28  A-2024-03-01 51       inc hosp 1       0-64     
 2 2024-04-28  A-2024-03-01 51       inc hosp 2       0-64     
 3 2024-04-28  A-2024-03-01 51       inc hosp 3       0-64     
 4 2024-04-28  A-2024-03-01 51       inc hosp 4       0-64     
 5 2024-04-28  A-2024-03-01 51       inc hosp 5       0-64     
 6 2024-04-28  A-2024-03-01 51       inc hosp 6       0-64     
 7 2024-04-28  A-2024-03-01 51       inc hosp 7       0-64     
 8 2024-04-28  A-2024-03-01 51       inc hosp 8       0-64     
 9 2024-04-28  A-2024-03-01 51       inc hosp 9       0-64     
10 2024-04-28  A-2024-03-01 51       inc hosp 10      0-64     
# ℹ 39,302 more rows
# ℹ Use `print(n = ...)` to see more rows

There's only so much detail I can add to the message (in fact, looking at it, it's too long atm!). Also, after I've implemented https://github.com/Infectious-Disease-Modeling-Hubs/hubData/issues/40, users will have another way to inspect expected hub compound idxs.

Contributor:
So sorry, I was not clear: I did dig into the errors message but still cannot find how to replicate the n = 69. I know the solution here is to update the sample ID in the output type id column, but it was just to illustrate my comment on how I find it confusing.

R/check_tbl_spl_n.R (Outdated, resolved)
R/check_tbl_spl_compound_tid.R (Outdated, resolved)
@annakrystalli (Member Author):
I was also debating whether it makes sense to have, somewhere in the documentation (function documentation and/or vignette), a warning saying that large files with a lot of samples might take time to validate. However, as we don't have a clear estimate of what "large" and "take time" mean, I am not sure how helpful it would be.

While it would be useful, I also agree that as "large" and "take time" are hard to properly define, I'm not sure just how useful it would be. There are plans to try to improve the performance of validations though, so as part of that work we might get a better sense of what would be useful in the documentation too. I'll draft the performance issue today and make a note about including more on performance in the docs as well.

@annakrystalli (Member Author):
Firstly, thanks so much for your review and thorough testing, @LucieContamin! It's been really useful to work through. In response I've made a number of changes to the functionality/docs:

  • Firstly, to make things more streamlined, I've changed the sequence of execution of the sample checks and set the checks for the compound task id and non-compound task id to return errors and cause validation to return early. That way samples are only counted once we know we have well-formed samples (I know it's different to what you do in the scenario hub, but it makes more sense to me atm; happy to revisit and get more opinions in the next round of work on samples though!).
  • Next, I've reworked the check messages, the names of the objects returned if validation fails, and the information returned as part of each error. Hopefully the information returned is much more useful and intuitive, and the information in the docs is now enough to explain what each check failure means and direct a team to fixing it.
  • I've also opened an issue to add functionality to validate coarser compound task id sets, which hubValidations does not currently support (Validate coarse-grained samples #88). I have added questions to that issue to help me understand the functionality better; your input would be greatly appreciated!

Let me know if these resolve your issues for the time being and feel free to open more issues if you think there's more that needs to be addressed.

Contributor @LucieContamin left a comment:

Thank you for all the updates! To respond to your three update points:

  1. (streamlining) I think it's a very good idea. It makes the validation faster.
  2. (error messages) I ran some tests on the new version and the errors I got are easier to understand, thanks again!
  3. (coarser compound id) The new version of the validations no longer correctly supports sample IDs for coarser compound task id groups (it did in the first reviewed version). I understand why, but I wonder if it's something we will need to add relatively quickly. I will comment in the open issue.

I also added some minor comments on the output error messages.

R/check_tbl_spl_non_compound_tid.R (Outdated, resolved)
R/check_tbl_spl_n.R (resolved)
@LucieContamin self-requested a review on June 18, 2024 14:29
Contributor @LucieContamin left a comment:

Thank you very much for all the updates and additional information! I think we can merge it for a first version with the samples!

@annakrystalli merged commit bfab4d3 into main on Jun 19, 2024
8 checks passed
@annakrystalli deleted the feature/handle-samples branch on June 19, 2024 07:23
Successfully merging this pull request may close the following issue: Add validations for sample output_type data