[internal] separate functions concerned with standard output type formats into separate file #163

zkamvar · 2024-11-19T17:14:51Z

As discussed with @annakrystalli, one of the reasons why #161 had a lot of back and forth was because there were a lot of fluid representations of terminology that turned my brain to oatmeal.

Note

This is a discussion about internal data representation in relation to the hubverse schemas.

One concept that escaped me until I had a walkthrough was a "standard" ID type representation. A standard ID type representation is a list with two elements:

required: ALL elements that must be present
optional: ANY elements that may be present

These are present in both task_ids and output_type_ids. For example, from the cdcepi/FluSight-forecast-hub are the horizon task ID and the pmf output type ID (modified to be "optional" for this example):

                        "horizon": {
                            "required": null,
                            "optional": [-1, 0, 1, 2, 3]
                        },

                        "pmf": {
                            "output_type_id": {
                                "required": null,
                                "optional": [
                                    "large_decrease",
                                    "decrease",
                                    "stable",
                                    "increase",
                                    "large_increase"
                                ]
                            },
                        ...
                        }

By having the list of required and optional, we can use the same machinery to process both task IDs and output type IDs. The optional task IDs gives modelers flexibility in what they could model against. For example, if a given location is optional and the model is not performing well, then the modeling team does not have to submit that location. For output type IDs pre-v4, it was also true that modeling teams could submit a subset of them, but this made less sense because the output type IDs are all related within a given output type ID, so excluding one makes it more difficult to handle these downstream.

The v4 switch

The switch to v4 meant that the output_type_id of all output_types now had a new child element called is_required and removes the optional element from output_type_id. First, consider this PMF output type with only optional values showing a diff between v3 (-) and v4 (+)

                        "pmf": {
                            "output_type_id": {
-                               "required": null,
-                               "optional": [
+                               "required": [
                                    "large_decrease",
                                    "decrease",
                                    "stable",
                                    "increase",
                                    "large_increase"
                                ]
                            },
                            "value": {
                                "type": "double",
                                "minimum": 0,
                                "maximum": 1
-                           }
+                           },
+                           "is_required": false
                        }

In pre-v4 hubValidations, all output type ids (with the exception of sample) were represented internally as a list of required and optional elements. This was implicitly the standard data representation and did not need to be explicitly stated.

However, with v4, all output types had only required elements and no optional, even if the output type was optional. To avoid writing a completely new codebase for v4, @annakrystalli had the clever idea to contextualize the list of optional and required values to be the standard type and implemented it in a series of functions:

hubValidations/R/expand_model_out_grid.R

Lines 602 to 692 in 9a19893

    
           extract_mt_output_type_ids <- function(x, config_tid, force_output_types = FALSE) { 
        
             purrr::map( 
        
               x, 
        
               function(.x) { 
        
                 output_type_ids <- .x[[config_tid]] 
        
                 is_std <- std_output_type_ids(output_type_ids) 
        
                 if (is_std && !force_output_types) { 
        
                   return(output_type_ids) 
        
                 } 
        
                 if (is_std && force_output_types) { 
        
                   return(as_required(output_type_ids)) 
        
                 } 
        
                 is_required <- is_required_output_type(.x) || force_output_types 
        
                 standardise_output_types_ids(output_type_ids, is_required) 
        
               } 
        
             ) 
        
           } 
        
           pre_v4_std <- function(output_type_ids) { 
        
             setequal( 
        
               names(output_type_ids), 
        
               c("required", "optional") 
        
             ) 
        
           } 
        
           is_required_output_type <- function(output_type) { 
        
             if (pre_v4_std(output_type[["output_type_id"]])) { 
        
               return(!is.null(output_type[["output_type_id"]][["required"]])) 
        
             } 
        
             isTRUE(output_type[["is_required"]]) || 
        
               isTRUE(output_type[["output_type_id_params"]][["is_required"]]) 
        
           } 
        
           std_output_type_ids <- function(output_type_ids) { 
        
             # Valid output type id configurations cannot be `NULL` . This is the situation 
        
             # in sample output types which are configured through a output_type_id_params 
        
             # property 
        
             no_output_type_ids <- is.null(output_type_ids) 
        
             if (no_output_type_ids) { 
        
               return(FALSE) 
        
             } 
        
             # pre v4 configs have standard format (i.e. both `required` and `optional` fields.) 
        
             # which is the format we are standardising to 
        
             pre_v4_std(output_type_ids) 
        
           } 
        
           standardise_output_types_ids <- function(output_type_ids, is_required) { 
        
             # Determine whether we're dealing with a v4 configuration for a point estimate or 
        
             #  a `NULL` output_type_ids object (in the case of pre v4 samples). 
        
             #  The below expression will return `TRUE` in both situations. 
        
             required_null <- is.null(output_type_ids[["required"]]) 
        
             if (required_null) { 
        
               return(null_output_type_ids(is_required)) 
        
             } 
        
             to_std_output_type_ids(output_type_ids, is_required) 
        
           } 
        
           to_std_output_type_ids <- function(output_type_ids, is_required) { 
        
             if (is_required) { 
        
               as_required(output_type_ids) 
        
             } else { 
        
               as_optional(output_type_ids) 
        
             } 
        
           } 
        
           as_required <- function(output_type_ids) { 
        
             opt_values <- output_type_ids$optional 
        
             req_values <- output_type_ids$required 
        
             values <- c(req_values, opt_values) 
        
             list(required = values, optional = NULL) 
        
           } 
        
           as_optional <- function(output_type_ids) { 
        
             opt_values <- output_type_ids$optional 
        
             req_values <- output_type_ids$required 
        
             values <- c(req_values, opt_values) 
        
             list(required = NULL, optional = values) 
        
           } 
        
           # Create a list of NULL or NA required and optional output type id values depending 
        
           # on whether the output type is required or optional. Allows us to use current 
        
           # infrastructure to convert `NULL`s to `NA`s in a back-compatible way. 
        
           null_output_type_ids <- function(is_required) { 
        
             if (is_required) { 
        
               list(required = NA, optional = NULL) 
        
             } else { 
        
               list(required = NULL, optional = NA) 
        
             } 
        
           }

Internal documentation need

This distinction of a "standard" type id representation is a new concept and is lightly documented in the code, but it could benefit from compartmentalization. Thus, I would like to take a stab at moving all the code concerned with "standard" type ids to a separate file and rearrange some of the comments to bubble up the take-home messages. I believe this will help with maintainability and refactorability of the code in the long run.

The text was updated successfully, but these errors were encountered:

github-project-automation bot moved this to Todo in hubverse Development overview Nov 19, 2024

github-project-automation bot added this to hubverse Development overview Nov 19, 2024

zkamvar added v4.0.0 upkeep maintenance, infrastructure, and similar labels Nov 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[internal] separate functions concerned with standard output type formats into separate file #163

[internal] separate functions concerned with standard output type formats into separate file #163

zkamvar commented Nov 19, 2024

[internal] separate functions concerned with standard output type formats into separate file #163

[internal] separate functions concerned with standard output type formats into separate file #163

Comments

zkamvar commented Nov 19, 2024

The v4 switch

Internal documentation need