Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[internal] separate functions concerned with standard output type formats into separate file #163

Open
zkamvar opened this issue Nov 19, 2024 · 0 comments
Labels
upkeep maintenance, infrastructure, and similar v4.0.0

Comments

@zkamvar
Copy link
Member

zkamvar commented Nov 19, 2024

As discussed with @annakrystalli, one of the reasons why #161 had a lot of back and forth was because there were a lot of fluid representations of terminology that turned my brain to oatmeal.

Note

This is a discussion about internal data representation in relation to the hubverse schemas.

One concept that escaped me until I had a walkthrough was a "standard" ID type representation. A standard ID type representation is a list with two elements:

  • required: ALL elements that must be present
  • optional: ANY elements that may be present

These are present in both task_ids and output_type_ids. For example, from the cdcepi/FluSight-forecast-hub are the horizon task ID and the pmf output type ID (modified to be "optional" for this example):

                        "horizon": {
                            "required": null,
                            "optional": [-1, 0, 1, 2, 3]
                        },
                        "pmf": {
                            "output_type_id": {
                                "required": null,
                                "optional": [
                                    "large_decrease",
                                    "decrease",
                                    "stable",
                                    "increase",
                                    "large_increase"
                                ]
                            },
                        ...
                        }

By having the list of required and optional, we can use the same machinery to process both task IDs and output type IDs. The optional task IDs gives modelers flexibility in what they could model against. For example, if a given location is optional and the model is not performing well, then the modeling team does not have to submit that location. For output type IDs pre-v4, it was also true that modeling teams could submit a subset of them, but this made less sense because the output type IDs are all related within a given output type ID, so excluding one makes it more difficult to handle these downstream.

The v4 switch

The switch to v4 meant that the output_type_id of all output_types now had a new child element called is_required and removes the optional element from output_type_id. First, consider this PMF output type with only optional values showing a diff between v3 (-) and v4 (+)

                        "pmf": {
                            "output_type_id": {
-                               "required": null,
-                               "optional": [
+                               "required": [
                                    "large_decrease",
                                    "decrease",
                                    "stable",
                                    "increase",
                                    "large_increase"
                                ]
                            },
                            "value": {
                                "type": "double",
                                "minimum": 0,
                                "maximum": 1
-                           }
+                           },
+                           "is_required": false
                        }

In pre-v4 hubValidations, all output type ids (with the exception of sample) were represented internally as a list of required and optional elements. This was implicitly the standard data representation and did not need to be explicitly stated.

However, with v4, all output types had only required elements and no optional, even if the output type was optional. To avoid writing a completely new codebase for v4, @annakrystalli had the clever idea to contextualize the list of optional and required values to be the standard type and implemented it in a series of functions:

extract_mt_output_type_ids <- function(x, config_tid, force_output_types = FALSE) {
purrr::map(
x,
function(.x) {
output_type_ids <- .x[[config_tid]]
is_std <- std_output_type_ids(output_type_ids)
if (is_std && !force_output_types) {
return(output_type_ids)
}
if (is_std && force_output_types) {
return(as_required(output_type_ids))
}
is_required <- is_required_output_type(.x) || force_output_types
standardise_output_types_ids(output_type_ids, is_required)
}
)
}
pre_v4_std <- function(output_type_ids) {
setequal(
names(output_type_ids),
c("required", "optional")
)
}
is_required_output_type <- function(output_type) {
if (pre_v4_std(output_type[["output_type_id"]])) {
return(!is.null(output_type[["output_type_id"]][["required"]]))
}
isTRUE(output_type[["is_required"]]) ||
isTRUE(output_type[["output_type_id_params"]][["is_required"]])
}
std_output_type_ids <- function(output_type_ids) {
# Valid output type id configurations cannot be `NULL` . This is the situation
# in sample output types which are configured through a output_type_id_params
# property
no_output_type_ids <- is.null(output_type_ids)
if (no_output_type_ids) {
return(FALSE)
}
# pre v4 configs have standard format (i.e. both `required` and `optional` fields.)
# which is the format we are standardising to
pre_v4_std(output_type_ids)
}
standardise_output_types_ids <- function(output_type_ids, is_required) {
# Determine whether we're dealing with a v4 configuration for a point estimate or
# a `NULL` output_type_ids object (in the case of pre v4 samples).
# The below expression will return `TRUE` in both situations.
required_null <- is.null(output_type_ids[["required"]])
if (required_null) {
return(null_output_type_ids(is_required))
}
to_std_output_type_ids(output_type_ids, is_required)
}
to_std_output_type_ids <- function(output_type_ids, is_required) {
if (is_required) {
as_required(output_type_ids)
} else {
as_optional(output_type_ids)
}
}
as_required <- function(output_type_ids) {
opt_values <- output_type_ids$optional
req_values <- output_type_ids$required
values <- c(req_values, opt_values)
list(required = values, optional = NULL)
}
as_optional <- function(output_type_ids) {
opt_values <- output_type_ids$optional
req_values <- output_type_ids$required
values <- c(req_values, opt_values)
list(required = NULL, optional = values)
}
# Create a list of NULL or NA required and optional output type id values depending
# on whether the output type is required or optional. Allows us to use current
# infrastructure to convert `NULL`s to `NA`s in a back-compatible way.
null_output_type_ids <- function(is_required) {
if (is_required) {
list(required = NA, optional = NULL)
} else {
list(required = NULL, optional = NA)
}
}

Internal documentation need

This distinction of a "standard" type id representation is a new concept and is lightly documented in the code, but it could benefit from compartmentalization. Thus, I would like to take a stab at moving all the code concerned with "standard" type ids to a separate file and rearrange some of the comments to bubble up the take-home messages. I believe this will help with maintainability and refactorability of the code in the long run.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
upkeep maintenance, infrastructure, and similar v4.0.0
Projects
Status: Wishlist
Development

No branches or pull requests

1 participant