
Feature/ Handle V3 sample specification #82

Merged · 40 commits from feature/handle-samples into main · Jun 19, 2024
Conversation

@annakrystalli (Member) commented on May 20, 2024

This PR implements, and adds new tests for, checking the validity of submissions of samples using the v3 schema sample spec. See the v3 sample validation spec for details.

Specific sample validation tests implemented (#80):

  • Validate that all value combinations are valid (as part of check_tbl_values()).
  • Validate that all required value combinations are submitted (as part of check_tbl_values_required()).
  • Validate that the correct number of samples per compound idx is submitted. Added through function check_tbl_spl_n().
  • Validate that samples within a submission file contain the same combination of optional (non-compound task id) values across all samples. Added through function check_tbl_spl_non_compound_tid().
  • Validate that samples conform to the sample dependence defined by the compound task id set configuration, i.e. all samples for a given compound idx contain the same unique combination of compound task id values. Added through function check_tbl_spl_compound_tid().
  • Add the new validation checks to validate_model_data() in a backwards-compatible way (i.e. only deploy them if using a v3 config).

The key to the new functions check_tbl_spl_n(), check_tbl_spl_non_compound_tid() and check_tbl_spl_compound_tid() is a table of hashes computed on model output data joined to the output of the new hubData::expand_model_out_val_grid(include_sample_ids = TRUE), where the output type id column for v3 samples effectively contains the compound_idx. The hashes are calculated on the relevant subsets of values of each sample and aggregated/counted at the relevant level for each check, i.e.:

  • check_tbl_spl_n(): Count unique output type id values per compound idx. The hash table provides a mapping between output type ids and compound idxs.
  • check_tbl_spl_compound_tid(): Ensure there is only a single unique hash of the combination of values across compound task id columns of all rows associated with samples for a given compound idx.
  • check_tbl_spl_non_compound_tid(): Ensure there is only a single unique hash of the combination of values across non-compound task id columns of all rows associated with samples for a modeling task.

These checks are performed separately for each round modeling task item, allowing compound task id sets to differ between the modeling tasks of a round.
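For illustration, the general shape of the hash table computation looks roughly like the following sketch (toy data and hypothetical task id columns; not the actual package code):

library(dplyr)

# Toy submission: 2 compound idxs, 2 samples each; `location` is the compound
# task id, `horizon` the non-compound task id (hypothetical columns)
toy_tbl <- tibble::tibble(
  location       = rep(c("US", "FR"), each = 4),
  horizon        = rep(1:2, times = 4),
  compound_idx   = rep(c("1", "2"), each = 4),
  output_type_id = as.character(rep(1:4, each = 2)) # sample index
)
compound_taskids     <- "location"
non_compound_taskids <- "horizon"

# One row per sample: hash the compound and non-compound task id value subsets
hash_tbl <- split(toy_tbl, f = toy_tbl$output_type_id) %>%
  purrr::map(
    ~ tibble::tibble(
      compound_idx      = unique(.x$compound_idx),
      output_type_id    = unique(.x$output_type_id),
      hash_comp_tid     = rlang::hash(unique(.x[, compound_taskids])),
      hash_non_comp_tid = rlang::hash(.x[, non_compound_taskids])
    )
  ) %>%
  purrr::list_rbind()

# e.g. the basis of check_tbl_spl_n(): count samples per compound idx
hash_tbl %>%
  group_by(compound_idx) %>%
  summarise(n = n_distinct(output_type_id))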

Still to do:

  • The spl_hash_tbl() function on which many of the new checks depend can be time-consuming with complex configs. I have attempted memoisation but have encountered difficulties in testing, so this is still a work in progress; it shouldn't change the rest of the functionality (see the sketch after this list).
  • Add more tests, especially varying the compound task id set.
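A minimal sketch of what such memoisation could look like (the wrapper name and cache settings are illustrative only, not necessarily the approach attempted in this PR):

# Memoise the expensive hash-table computation; assumes spl_hash_tbl() exists
spl_hash_tbl_memo <- memoise::memoise(
  spl_hash_tbl,
  cache = cachem::cache_mem(max_age = 60 * 60) # keep cached results for an hour
)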

@annakrystalli self-assigned this on May 20, 2024
@annakrystalli added this to the sample output_type v1.0 milestone on May 20, 2024
@annakrystalli linked an issue on May 20, 2024 that may be closed by this pull request
@annakrystalli marked this pull request as draft on May 20, 2024 08:46
@annakrystalli changed the title from "[WIP] Feature/ Handle V3 sample specification" to "Feature/ Handle V3 sample specification" on May 28, 2024
@annakrystalli marked this pull request as ready for review on May 28, 2024 13:52
DESCRIPTION Outdated
Comment on lines 63 to 65
Infectious-Disease-Modeling-Hubs/hubUtils@enhancement/v3-utils,
Infectious-Disease-Modeling-Hubs/hubData@feature/handle-samples,
Infectious-Disease-Modeling-Hubs/hubAdmin@feature/sample-support,
Contributor:
Noting that we will probably want to change this before going ahead with a merge.

Member Author:
Most definitely.
It's the last thing that needs to be done throughout the packages. It is required atm for tests to run successfully.

Contributor @LucieContamin left a comment:

I have some questions about the check_tbl_spl_n() behavior. I also have some minor documentation questions.

I was also debating whether it makes sense to have, somewhere in the documentation (function documentation and/or vignette), a warning saying that large files with a lot of samples might take time to validate. However, as we don't have a clear estimate of what "large" and "take time" mean, I am not sure how helpful it would be.

In my review, I refer to a test that I made where I had issues with the results.
For the test, I used my own temporary "test" repo at LucieContamin/hub_test; the file causing the issue is "Hubtest-hubtemp_subset/2024-04-28-Hubtest-hubtemp_subset.parquet".

R/check_tbl_spl_compound_tid.R (resolved)
tests/testthat/_snaps/check_tbl_spl_compound_tid.md (Outdated, resolved)
Comment on lines +17 to +28
n_tbl <- dplyr::group_by(hash_tbl, .data$compound_idx) %>%
dplyr::summarise(
n = dplyr::n_distinct(.data$output_type_id),
mt_id = unique(.data$mt_id)
) %>%
dplyr::left_join(n_ranges, by = "mt_id") %>%
dplyr::mutate(
less = .data$n < .data$n_min,
more = .data$n > .data$n_max,
out_range = .data$less | .data$more
) %>%
dplyr::filter(.data$out_range)
Contributor:
I am not sure I understand the logic here: we want to check the number of samples, but per compound idx and not by unique task id?

Member Author:
Right. compound_idx is what defines a sample, and we need to count samples per compound_idx to ensure they are within the correct range. See, for example, how samples (output_type_ids) are marked up under different compound task id sets in the docs: https://hubverse.io/en/latest/user-guide/sample-output-type.html#four-submissions-differing-by-compound-modeling-task

Member Author @annakrystalli commented on Jun 14, 2024:
I've also opened the following issue. You can currently get this information from hubData::expand_model_out_val_grid(include_sample_ids = TRUE), but I think a function that creates a clear table illustrating the compound idx structure of a given round/modeling task would be really useful: https://github.com/Infectious-Disease-Modeling-Hubs/hubData/issues/40

Contributor:
The part that I don't understand is why it is linked to the output type id information. For me these are two different pieces of information. We are already checking the output type id column in the other two functions alongside the compound ID check, so if you have an ID error you will get an error, but that does not mean you are missing samples.
For example, if someone provided the correct number of samples but made an error in the output type id column, then it should not return an error here but only about the output type id numbering.

Member Author @annakrystalli commented on Jun 14, 2024:
In the validation instructions it states:

The number of rows for each combination of individual modeling task should fall between min_samples_per_task and max_samples_per_task (inclusive).

compound_idx defines an individual modeling task via its unique combination of compound task id values, so samples need to be matched to a modeling task first and then counted against the required number for that compound_idx.

If you have provided the wrong output type ID, that should cause an error here too because you have not properly specified the samples, so it would be wrong in my view to count them as valid samples against the expected number.

From the validation spec sheet, this is the test I've tried to encode here:

  • Example: task ids are location, origin_date, horizon; compound units defined by combinations of location and origin_date; expect the number of samples for each such compound unit to be between specified min and max.
  • Logic: Within each group defined by a unique combination of values for the compound_taskid_set, the number of unique values for output_type_id should be between min_samples_per_task and max_samples_per_task
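To make that logic concrete, here is a toy illustration using the task ids from the spec example above (not code from the PR or its tests):

library(dplyr)

# Compound units defined by location + origin_date; horizon varies within a unit
toy_tbl <- tidyr::expand_grid(
  location       = c("US", "FR"),
  origin_date    = "2024-04-28",
  horizon        = 1:3,
  output_type_id = as.character(1:100) # 100 samples per compound unit
)

# Number of unique output_type_id values per compound unit; each count should
# fall within [min_samples_per_task, max_samples_per_task]
toy_tbl %>%
  group_by(location, origin_date) %>%
  summarise(n = n_distinct(output_type_id), .groups = "drop")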

Member Author:
Don't worry! It is confusing...I'm confused !

My interpretation is that the combination of individual modeling tasks is defined by the compound_taskid_set in the config. Each unique combination of values of the variables in the compound_taskid_set corresponds to a single compound_idx, and the number of samples (i.e. unique output_type_ids) is counted separately for each compound_idx.

I note as well that some of this confusion is probably coming from the fact that in the description we do mention coarser samples passing validation (as you noted below), but I haven't implemented that as it wasn't in the validation instructions.

See below for my suggestion of how to proceed with this for now.

Contributor:
Interestingly, I tested the validation with a coarser sample setting and it passes validation without any issue (JHU subset file) if the same compound id set is used everywhere.
My hypothesis (though I am really not sure) is that it's because of get_mt_spl_hash_tb(), and in particular this part:

split(tbl, f = tbl$output_type_id) %>%
    purrr::map(
      function(.x, compound_taskids, non_compound_taskids) {
        tibble::tibble(
          compound_idx = names(sort(table(.x$compound_idx), decreasing = TRUE))[1L],
          output_type_id = unique(.x$output_type_id),
          hash_comp_tid = rlang::hash(unique(.x[, compound_taskids])),
          hash_non_comp_tid = rlang::hash(.x[, non_compound_taskids]),
          hash_spl_id = rlang::hash(.x)
        )
      },
      non_compound_taskids = non_compound_taskids,
      compound_taskids = compound_taskids
    ) %>%
    purrr::list_rbind()

If all the samples share the same first compound id then it passes without issue.
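For example, a quick toy illustration (not from the package tests) of what that selection does: names(sort(table(x), decreasing = TRUE))[1L] keeps only the most frequent compound idx for a sample, so a minority of mismatched idxs is silently dropped.

x <- c("1", "1", "1", "2")                   # compound idxs across a sample's rows
names(sort(table(x), decreasing = TRUE))[1L] # returns "1"; the stray "2" is dropped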

Contributor @LucieContamin commented on Jun 14, 2024:
If you have provided the wrong output type ID, that should cause an error here too because you have not properly specified the samples, so it would be wrong in my view to count them as valid samples against the expected number.

Also, I understand the logic behind the error but I also wonder whether it's confusing for the user.
For example, I still don't understand the output I got from my test. I have the expected number of samples, but they are indeed misidentified. However, as they fit the minimum requirement of 100 samples per the hub compound id set test, I have no idea how to fix it. As we said before, the error message might also need more information, because for example I got an error and I don't know what compound idx "10" means here, as it's associated with different output type ids:

Required samples per compound idx task not present.
  File contains less ("69") than the minimum required number of samples per task ("100") for
  compound idx "10" [...]

Member Author @annakrystalli commented on Jun 14, 2024:
If you dig into the errors object, it gives you a lot more detail, including the composition of each compound_idx and how many samples of that idx were successfully counted; e.g. here's the first element of the output:

$compound_idx
[1] "10"

$n
[1] 69

$min_samples_per_task
[1] 100

$max_samples_per_task
[1] 300

$compound_idx_tbl
# A tibble: 39,312 × 6
   origin_date scenario_id  location target   horizon age_group
   <chr>       <chr>        <chr>    <chr>    <chr>   <chr>    
 1 2024-04-28  A-2024-03-01 51       inc hosp 1       0-64     
 2 2024-04-28  A-2024-03-01 51       inc hosp 2       0-64     
 3 2024-04-28  A-2024-03-01 51       inc hosp 3       0-64     
 4 2024-04-28  A-2024-03-01 51       inc hosp 4       0-64     
 5 2024-04-28  A-2024-03-01 51       inc hosp 5       0-64     
 6 2024-04-28  A-2024-03-01 51       inc hosp 6       0-64     
 7 2024-04-28  A-2024-03-01 51       inc hosp 7       0-64     
 8 2024-04-28  A-2024-03-01 51       inc hosp 8       0-64     
 9 2024-04-28  A-2024-03-01 51       inc hosp 9       0-64     
10 2024-04-28  A-2024-03-01 51       inc hosp 10      0-64     
# ℹ 39,302 more rows
# ℹ Use `print(n = ...)` to see more rows

There's only so much detail I can add to the message (in fact, looking at it, it's too long atm!). Also, after I've implemented https://github.com/Infectious-Disease-Modeling-Hubs/hubData/issues/40, users will have another way to inspect expected hub compound idxs.

Contributor:
So sorry, I was not clear: I did dig into the errors message but still cannot find how to replicate the n = 69. I know the solution here is to update the sample ID in the output type id column, but it was just to illustrate my comment on how I find it confusing.

R/check_tbl_spl_n.R (Outdated, resolved)
R/check_tbl_spl_compound_tid.R (Outdated, resolved)
@annakrystalli (Member Author):
I was also debating whether it makes sense to have, somewhere in the documentation (function documentation and/or vignette), a warning saying that large files with a lot of samples might take time to validate. However, as we don't have a clear estimate of what "large" and "take time" mean, I am not sure how helpful it would be.

While it would be useful, I also agree that as "large" and "take time" are hard to properly define, I'm not sure just how useful it would be. There are plans to try to improve the performance of validations though, so as part of that work we might get a better sense of what would be useful in the documentation too. I'll draft the performance issue today and make a note about including more on performance in the docs as well.

@annakrystalli (Member Author):
Firstly, thanks so much for your review and thorough testing, @LucieContamin! It's been really useful to work through. In response I've made a number of changes to the functionality/docs:

  • Firstly, to make things more streamlined, I've changed the sequence of execution of the sample checks and set the checks for the compound task id and non-compound task id to return errors and cause validation to return early. That way samples are only counted once we know we have well-formed samples (I know it's different to what you do in the scenario hub, but it makes more sense to me atm; happy to revisit and get more opinions in the next round of work on samples though!).
  • Next, I've reworked the check messages, the names of the objects returned if validation fails, and the information returned as part of each error. Hopefully the information returned is much more useful and intuitive, and the information in the docs is now enough to explain what each check failure means and direct a team to fixing it.
  • I've also opened an issue to add functionality to validate coarser compound task id sets, which hubValidations does not currently support (Validate coarse-grained samples #88). I have added questions to that issue to help me understand the functionality better; your input would be greatly appreciated!

Let me know if these resolve your issues for the time being and feel free to open more issues if you think there's more that needs to be addressed.

Contributor @LucieContamin left a comment:

Thank you for all the updates! To respond to your three update points:

  1. (streamlining) I think it's a very good idea. It makes the validation faster.
  2. (error messages) I ran some tests on the new version and the errors I got are easier to understand, thanks again!
  3. (coarser compound id) The new version of the validations no longer correctly supports sample IDs for coarser compound task id groups (it did in the first reviewed version). I understand why, but I wonder if it's something we will need to add relatively quickly. I will comment in the open issue.

I also added some minor comments on the output error messages.

R/check_tbl_spl_non_compound_tid.R (Outdated, resolved)
R/check_tbl_spl_n.R (resolved)
@LucieContamin self-requested a review on June 18, 2024 14:29
Contributor @LucieContamin left a comment:

Thank you very much for all the updates and additional information! I think we can merge it for a first version with the samples!

@annakrystalli merged commit bfab4d3 into main on Jun 19, 2024
8 checks passed
@annakrystalli deleted the feature/handle-samples branch on June 19, 2024 07:23
Successfully merging this pull request may close the following issue: Add validations for sample output_type data