clarify missingness of point estimates; add examples for determining output_type_id format #197

Merged
merged 21 commits into from
Oct 31, 2024
Changes from 2 commits
4 changes: 3 additions & 1 deletion docs/source/quickstart-hub-admin/tasks-config.md
@@ -186,9 +186,11 @@ As seen previously, each `task_ids` has a `required` and an `optional` property

### 6.1. Setting the `"mean"`:
- <mark style="background-color: #FFE331">Here, the `"mean"` of the predictive distribution</mark> is set as a valid value for a submission file.
- <mark style="background-color: #32E331">`"output_type_id"` is used</mark> to determine whether the `mean` is a required or an optional `output_type`. Both `"required"` and `"optional"` should be declared, and the option that is chosen (required or optional) should be set to `["NA"]`, whereas the one that is not selected should be set to `null`. In this example, the mean is optional, not required. If the mean is required, `"required"` should be set to `["NA"]`, and `"optional"` should be set to `null`.
- <mark style="background-color: #32E331">`"output_type_id"` is used</mark> to determine whether the `mean` is a required or an optional `output_type`. Both `"required"` and `"optional"` should be declared, and the option that is chosen (required or optional) should be set to `["NA"]`[^missy], whereas the one that is not selected should be set to `null`. In this example, the mean is optional, not required. If the mean is required, `"required"` should be set to `["NA"]`, and `"optional"` should be set to `null`.
- <mark style="background-color: #38C7ED">`"value"` sets the characteristics</mark> of this valid `output_type` (i.e., the mean). In this instance, the value must be an `integer` greater than or equal to `0`.

[^missy]: `NA` (without quotes) is how missingness is represented in R. This notation may seem a bit strange, but it allows us to indicate what we expect to see from modeler submissions.
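
The configuration described above can be sketched as a small Python dictionary serialized to JSON. This is a minimal illustration of an optional mean, not a copy of the hub's actual `tasks.json`: the variable name `mean_config` is hypothetical, and the `"type"` and `"minimum"` keys under `"value"` are an assumption about how "an `integer` greater than or equal to `0`" would be expressed.

```python
import json

# Hypothetical fragment of an output type entry for an optional mean.
# The "type" and "minimum" keys are assumed, not taken from the source.
mean_config = {
    "mean": {
        "output_type_id": {
            "required": None,    # serialized as null: the mean is not required
            "optional": ["NA"],  # the chosen option is set to ["NA"]
        },
        "value": {
            "type": "integer",   # the value must be an integer ...
            "minimum": 0,        # ... greater than or equal to 0
        },
    }
}

serialized = json.dumps(mean_config, indent=2)
print(serialized)
```

To make the mean required instead, the two settings would swap: `"required"` becomes `["NA"]` and `"optional"` becomes `None` (that is, `null` in JSON).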

```{image} ../images/tasks-schema-6-1.png
:alt: Some more lines of code in the tasks.json file
:class: bordered
19 changes: 12 additions & 7 deletions docs/source/user-guide/model-output.md
@@ -31,7 +31,7 @@ one-week-ahead incidence, but probabilities for the timing of a season peak:
:::{table} An example of a model output submission for modelA
| `origin_epiweek` | `target` | `horizon` | `output_type` | `output_type_id` | `value` |
| ------ | ------ | ------ | ------ | ------ | ------ |
| EW202242 | weekly rate | 1 | mean | NA | 5 |
| EW202242 | weekly rate | 1 | mean | NA[^batman] | 5 |
| EW202242 | weekly rate | 1 | quantile | 0.25 | 2 |
| EW202242 | weekly rate | 1 | quantile | 0.5 | 3 |
| EW202242 | weekly rate | 1 | quantile | 0.75 | 10 |
@@ -46,6 +46,10 @@ one-week-ahead incidence, but probabilities for the timing of a season peak:
| EW202242 | weekly rate | 1 | sample | 2 | 3 |
:::

[^batman]: `NA` (without quotes) indicates missingness in R, which is the expected `output_type_id` for a `mean` `output_type`.
This is discussed in the [Output type table](#output-type-table).
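
For readers working in Python rather than R, the following sketch shows how an unquoted `NA` in a CSV submission is read as a missing value. It assumes pandas is used for reading, relies on pandas treating the literal `NA` as missing by default, and reuses the first four rows of the example table above; the variable names are illustrative.

```python
import io

import pandas as pd

# First four rows of the example submission, written as CSV text.
csv_text = """origin_epiweek,target,horizon,output_type,output_type_id,value
EW202242,weekly rate,1,mean,NA,5
EW202242,weekly rate,1,quantile,0.25,2
EW202242,weekly rate,1,quantile,0.5,3
EW202242,weekly rate,1,quantile,0.75,10
"""

# By default pandas parses the unquoted NA as a missing value.
submission = pd.read_csv(io.StringIO(csv_text))
missing_mask = submission["output_type_id"].isna()
print(submission[missing_mask])
```

Only the `mean` row has a missing `output_type_id`; the quantile rows keep their probability levels.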


(formats-of-model-output)=
### File formats

@@ -112,19 +116,20 @@ The following table provides more detail on how to configure the three "model ou
:::{table} Relationship between the three model output representation columns with respect to the type of prediction (`output_type`)
| `output_type` | `output_type_id` | `value` |
| ------ | ------ | ------ |
| `mean` | `"NA"`[^batman] (not used for mean predictions) | Numeric: the mean of the predictive distribution |
| `median` | `"NA"` (not used for median predictions) | Numeric: the median of the predictive distribution |
| `mean` | None[^missingno] (not used for mean predictions) | Numeric: the mean of the predictive distribution |
| `median` | None (not used for median predictions) | Numeric: the median of the predictive distribution |
| `quantile` | Numeric between 0.0 and 1.0: a probability level | Numeric: the quantile of the predictive distribution at the probability level specified by the output_type_id |
| `cdf`[^cdf] | String or numeric: a possible value of the target variable | Numeric between 0.0 and 1.0: the value of the cumulative distribution function of the predictive distribution at the value of the outcome variable specified by the output_type_id |
| `pmf`[^pmf] | String naming a possible category of a discrete outcome variable | Numeric between 0.0 and 1.0: the value of the probability mass function of the predictive distribution when evaluated at a specified level of a categorical outcome variable.[^cdf] |
| `sample`[^sample] | Positive integer sample index | Numeric: a sample from the predictive distribution. |
:::


[^batman]: Why have `"NA"` as the `output_type_id`? There are two reasons for this.
First, this provides a placeholder for the model output CSV file in the presence of other output types for validation.
The second reason is that we already use `null` to indicate the presence of an absence in the `required` and `optional` fields, and having this allows hubverse tools to treat point estimates without special handling.

[^missingno]: Point estimates don't have an `output_type_id` because you can only have one point estimate for each combination of task IDs.
However, because the `output_type_id` column is required, something has to go in this place: a missing value.
This is encoded as [`NA_character_` in R](https://www.njtierney.com/post/2020/09/17/missing-flavour/) (which is why our schemas prior to 4.0.0 encoded these as `["NA"]`).
**If you use Python** to write parquet files, the `output_type_id` column should be an array of `None` and you will need to [explicitly cast the `output_type_id` column as a "string"](https://github.com/hubverse-org/hubValidations/issues/131#issuecomment-2427654006) (via Pandas):
`df["output_type_id"] = df["output_type_id"].astype("string")`
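
As a slightly fuller sketch of the cast above, the snippet below builds a small point-estimate data frame and converts the all-`None` `output_type_id` column to the pandas string type. The row values are illustrative only, and the step of actually writing the parquet file is left out because it depends on the parquet engine installed.

```python
import pandas as pd

# Illustrative point-estimate rows; output_type_id is all None.
df = pd.DataFrame(
    {
        "origin_epiweek": ["EW202242", "EW202242"],
        "target": ["weekly rate", "weekly rate"],
        "horizon": [1, 2],
        "output_type": ["mean", "mean"],
        "output_type_id": [None, None],
        "value": [5, 6],
    }
)

# Without the cast, an all-None column has the generic object dtype.
dtype_before = df["output_type_id"].dtype

# Explicitly cast to the pandas string type; the values stay missing.
df["output_type_id"] = df["output_type_id"].astype("string")
print(dtype_before, "->", df["output_type_id"].dtype)
```

After the cast the column is typed as a string column whose entries are all missing, which is the shape the footnote describes for point estimates.
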
[^pmf]: **Note on `pmf` model output type**: Values are required to sum to 1 across all `output_type_id` values within each combination of values of task ID variables. This representation should only be used if the outcome variable is truly discrete; a CDF representation is preferred if the categories represent a binned discretization of an underlying continuous variable.

[^sample]: **Note on `sample` model output type**: Depending on the hub specification, samples with the same sample index (specified by the `output_type_id`) may be assumed to correspond to a single sample from a joint distribution across multiple levels of the task ID variables — further details are discussed below.