clarify missingness of point estimates; add examples for determining output_type_id format #197

Merged
merged 21 commits into from
Oct 31, 2024
4 changes: 3 additions & 1 deletion docs/source/quickstart-hub-admin/tasks-config.md
@@ -186,9 +186,11 @@ As seen previously, each `task_ids` has a `required` and an `optional` property

### 6.1. Setting the `"mean"`:
- <mark style="background-color: #FFE331">Here, the `"mean"` of the predictive distribution</mark> is set as a valid value for a submission file.
- <mark style="background-color: #32E331">`"output_type_id"` is used</mark> to determine whether the `mean` is a required or an optional `output_type`. Both `"required"` and `"optional"` should be declared, and the option that is chosen (required or optional) should be set to `["NA"]`, whereas the one that is not selected should be set to `null`. In this example, the mean is optional, not required. If the mean is required, `"required"` should be set to `["NA"]`, and `"optional"` should be set to `null`.
- <mark style="background-color: #32E331">`"output_type_id"` is used</mark> to determine whether the `mean` is a required or an optional `output_type`. Both `"required"` and `"optional"` should be declared, and the option that is chosen (required or optional) should be set to `["NA"]`[^missy], whereas the one that is not selected should be set to `null`. In this example, the mean is optional, not required. If the mean is required, `"required"` should be set to `["NA"]`, and `"optional"` should be set to `null`.
- <mark style="background-color: #38C7ED">`"value"` sets the characteristics</mark> of this valid `output_type` (i.e., the mean). In this instance, the value must be an `integer` greater than or equal to `0`.

[^missy]: `NA` (without quotes) is how missingness is represented in R. This notation may seem a bit strange, but it allows us to indicate what we expect to see from modeler submissions.

```{image} ../images/tasks-schema-6-1.png
:alt: Some more lines of code in the tasks.json file
:class: bordered
140 changes: 122 additions & 18 deletions docs/source/user-guide/model-output.md
@@ -31,7 +31,7 @@ one-week-ahead incidence, but probabilities for the timing of a season peak:
:::{table} An example of a model output submission for modelA
| `origin_epiweek` | `target` | `horizon` | `output_type` | `output_type_id` | `value` |
| ------ | ------ | ------ | ------ | ------ | ------ |
| EW202242 | weekly rate | 1 | mean | NA | 5 |
| EW202242 | weekly rate | 1 | mean | NA[^batman] | 5 |
| EW202242 | weekly rate | 1 | quantile | 0.25 | 2 |
| EW202242 | weekly rate | 1 | quantile | 0.5 | 3 |
| EW202242 | weekly rate | 1 | quantile | 0.75 | 10 |
@@ -46,7 +46,11 @@ one-week-ahead incidence, but probabilities for the timing of a season peak:
| EW202242 | weekly rate | 1 | sample | 2 | 3 |
:::

(formats-of-model-output)=
[^batman]: `NA` (without quotes) indicates missingness in R, which is the expected `output_type_id` for a `mean` `output_type`.
This is discussed in the [output type table](#output-type-table).


(file-formats)=
### File formats

Hubs can take submissions in tabular data formats, namely `csv` and `parquet`. These
@@ -66,8 +70,10 @@ submission formats are _not mutually exclusive_; **hubs may choose between
* Disadvantages:
* Compatibility: Harder to work with; teams and people who want to work with files need to install additional libraries

Examples of how to create these file formats in R and Python are listed below in
[the writing model output section](#writing-model-output).

(model-output-format)=
(formats-of-model-output)=
## Formats of model output

```{admonition} Reference
@@ -85,37 +91,135 @@ Per hubverse convention, **there are two groups of columns providing metadata ab
As shown in the [model output submission table](#model-output-example-table) above, there are three **"task ID"** columns: `origin_epiweek`, `target`, and `horizon`; and there are two **"model output representation"** columns: `output_type` and `output_type_id` followed by the `value` column.
More detail about each of these column groups is given in the following points:


1. **"Task IDs" (multiple columns)**: The details of the outcome (the model task) are provided by the modeler and can be stored in a series of "task ID" columns as described in this [section on task ID variables](#task-id-vars). These "task ID" columns may also include additional information, such as any conditions or assumptions used to generate the predictions. Some example task ID variables include `target`, `location`, `reference_date`, and `horizon`. Although there are no restrictions on naming task ID variables, we suggest that hubs adopt the standard task ID or column names and definitions specified in the [section on usage of task ID variables](#task-id-use) when appropriate.
2. **"Model output representation" (2 columns)**: consists of two columns specifying how the model outputs are represented. Both of these columns will be present in all model output data:
1. `output_type` specifies the type of representation of the predictive distribution, namely `"mean"`, `"median"`, `"quantile"`, `"cdf"`, `"cmf"`, `"pmf"`, or `"sample"`.
2. `output_type_id` specifies more identifying information specific to the output type, which varies depending on the `output_type`.
3. `value` contains the model’s prediction.
1. `output_type`{.codeitem} specifies the type of representation of the predictive distribution, namely `"mean"`, `"median"`, `"quantile"`, `"cdf"`, `"cmf"`, `"pmf"`, or `"sample"`.
2. `output_type_id`{.codeitem} specifies more identifying information specific to the output type, which varies depending on the `output_type`.
3. `value`{.codeitem} contains the model’s prediction.


The following table provides more detail on how to configure the three "model output representation" columns based on each model output type:
The following table provides more detail on how to configure the three "model output representation" columns based on each model output type.

(output-type-table)=
:::{table} Relationship between the three model output representation columns with respect to the type of prediction (`output_type`)
| `output_type` | `output_type_id` | `value` |
| ------ | ------ | ------ |
| `mean` | `"NA"`[^batman] (not used for mean predictions) | Numeric: the mean of the predictive distribution |
| `median` | `"NA"` (not used for median predictions) | Numeric: the median of the predictive distribution |
| `mean` | `NA`/`None` (not used for mean predictions) | Numeric: the mean of the predictive distribution |
| `median` | `NA`/`None` (not used for median predictions) | Numeric: the median of the predictive distribution |
| `quantile` | Numeric between 0.0 and 1.0: a probability level | Numeric: the quantile of the predictive distribution at the probability level specified by the output_type_id |
| `cdf`[^cdf] | String or numeric: a possible value of the target variable | Numeric between 0.0 and 1.0: the value of the cumulative distribution function of the predictive distribution at the value of the outcome variable specified by the output_type_id |
| `pmf`[^pmf] | String naming a possible category of a discrete outcome variable | Numeric between 0.0 and 1.0: the value of the probability mass function of the predictive distribution when evaluated at a specified level of a categorical outcome variable.[^cdf] |
| `sample`[^sample] | Positive integer sample index | Numeric: a sample from the predictive distribution. |
| `cdf` | String or numeric: a possible value of the target variable | Numeric between 0.0 and 1.0: the value of the cumulative distribution function of the predictive distribution at the value of the outcome variable specified by the output_type_id |
| `pmf` | String naming a possible category of a discrete outcome variable | Numeric between 0.0 and 1.0: the value of the probability mass function of the predictive distribution when evaluated at a specified level of a categorical outcome variable. |
| `sample` | Positive integer sample index | Numeric: a sample from the predictive distribution. |
:::
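
As a rough illustration of the table above, the following sketch builds a few rows of model output in pandas and checks the `output_type_id` rules for point estimates and quantiles. The rows reuse values from the example submission table; the variable names `is_point`, `point_ids_missing`, and `quantile_ids_valid` are assumptions made for this example and are not part of any hubverse tool.

```python
import pandas as pd

# Rows taken from the example submission table for modelA.
model_out_tbl = pd.DataFrame(
    {
        "origin_epiweek": ["EW202242"] * 4,
        "target": ["weekly rate"] * 4,
        "horizon": [1] * 4,
        "output_type": ["mean", "quantile", "quantile", "quantile"],
        # The mean row has no output_type_id, so it is left missing (None).
        "output_type_id": [None, 0.25, 0.5, 0.75],
        "value": [5, 2, 3, 10],
    }
)

# Point estimates (mean/median) should have a missing output_type_id.
is_point = model_out_tbl["output_type"].isin(["mean", "median"])
point_ids_missing = bool(model_out_tbl.loc[is_point, "output_type_id"].isna().all())

# Quantile output_type_id values should be probability levels in [0, 1].
is_quantile = model_out_tbl["output_type"] == "quantile"
quantile_ids = model_out_tbl.loc[is_quantile, "output_type_id"].astype(float)
quantile_ids_valid = bool(quantile_ids.between(0.0, 1.0).all())
```

This is only a local sanity check under the stated assumptions; it does not replace the validation a hub performs on submissions.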

:::{note}
:name: output-type-caveats

The model output type IDs have different caveats depending on the `output_type`:

`mean` and `median`
: Point estimates do not have an `output_type_id` because you can only have one
point estimate for each combination of task IDs. However, because the
`output_type_id` column is required, it must still contain an entry, and that
entry is a missing value. This is encoded as [`NA` in
R](https://www.njtierney.com/post/2020/09/17/missing-flavour/) (which is why
our schemas prior to 4.0.0 encoded these as `["NA"]`) and `None` in Python. See
[the example on writing parquet files](#example-parquet) for details.

`pmf`
: Values are required to sum to 1 across all
`output_type_id` values within each combination of values of task ID variables.
This representation should only be used if the outcome variable is truly
discrete; a CDF representation is preferred if the categories represent a
binned discretization of an underlying continuous variable.

`sample`
: Depending on the hub specification, samples with the same sample index
(specified by the `output_type_id`) may be assumed to correspond to a single
sample from a joint distribution across multiple levels of the task ID
variables — further details are discussed below.


`cdf` (and `pmf` for ordinal variables)
: In the hub's `tasks.json` configuration file, the values of the
`output_type_id` should be listed in order from low to high.

:::
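
The `pmf` caveat above (values must sum to 1 within each combination of task ID values) can be checked locally with a sketch like the following. The function name `pmf_groups_not_summing_to_one`, the `peak week` target, the category labels, and the tolerance `tol` are all hypothetical choices for illustration; the tolerance a hub actually applies may differ.

```python
import pandas as pd


def pmf_groups_not_summing_to_one(tbl, task_id_cols, tol=1e-8):
    """Return the per-group totals of pmf values that differ from 1 by more than tol."""
    pmf_rows = tbl[tbl["output_type"] == "pmf"]
    totals = pmf_rows.groupby(task_id_cols)["value"].sum()
    return totals[(totals - 1.0).abs() > tol]


# Hypothetical pmf submission: the first task sums to 1, the second only to 0.9.
pmf_tbl = pd.DataFrame(
    {
        "origin_epiweek": ["EW202242"] * 3 + ["EW202243"] * 3,
        "target": ["peak week"] * 6,
        "output_type": ["pmf"] * 6,
        "output_type_id": ["EW202250", "EW202251", "EW202252"] * 2,
        "value": [0.2, 0.5, 0.3, 0.1, 0.6, 0.2],
    }
)

bad_groups = pmf_groups_not_summing_to_one(pmf_tbl, ["origin_epiweek", "target"])
```

Here `bad_groups` is indexed by the task ID combination, so any rows it contains point directly at the tasks whose probabilities need to be renormalized.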

(writing-model-output)=
## Writing model output to a hub

[^batman]: Why have `"NA"` as the `output_type_id`? There are two reasons for this.
First, this provides a placeholder for the model output CSV file in the presence of other output types for validation.
The second reason is that we already use `null` to indicate the presence of an absence in the `required` and `optional` fields, and having this allows hubverse tools to treat point estimates without special handling.
When submitting model output to a hub, place it in a folder named after your
model inside the model outputs folder specified by the hub administrator (this
is usually called `model-output`). Below are two examples of writing model
output to a hub in R and Python, one using CSV files and one using parquet
files. These examples assume the following variables already exist:

[^pmf]: **Note on `pmf` model output type**: Values are required to sum to 1 across all `output_type_id` values within each combination of values of task ID variables. This representation should only be used if the outcome variable is truly discrete; a CDF representation is preferred if the categories represent a binned discretization of an underlying continuous variable.
- `model_out_tbl` is the tabular output from your model formatted as specified
in [the formats of model output section](#formats-of-model-output).
- `path_to_hub` is the path to the hub cloned on your local computer.
- `model_name` is the file name of your model output, formatted as
`<round_id>-<model_name>.csv` (or `.parquet`).

[^sample]: **Note on `sample` model output type**: Depending on the hub specification, samples with the same sample index (specified by the `output_type_id`) may be assumed to correspond to a single sample from a joint distribution across multiple levels of the task ID variables — further details are discussed below.
(example-csv)=
### Example: model output as CSV

[^cdf]: **Note on `cdf` model output type** and `pmf` output type for ordinal variables: In the hub's `tasks.json` configuration file, the values of the `output_type_id` should be listed in order from low to high.
To write to CSV, use the `write_csv()` function from the `readr` package in R
or the `to_csv()` method of a pandas data frame in Python.

#### Writing CSV with R

```r
library("fs")
library("readr")
# ... generate model data ...
outfile <- path(path_to_hub, "model-output", "team1-modelA", model_name)
write_csv(model_out_tbl, outfile)
```

#### Writing CSV with Python

```python
import pandas as pd
import os.path
# ... generate model data ...
outfile = os.path.join(path_to_hub, "model-output", "team1-modelA", model_name)
model_out_tbl.to_csv(outfile, index=False, na_rep="NA")
```
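
To see what `na_rep = "NA"` does to a missing `output_type_id`, the sketch below writes a tiny table to an in-memory buffer instead of a file in the hub and reads it back. The two-row table and the names `buffer`, `csv_text`, and `round_trip` are assumptions for this example only.

```python
import io

import pandas as pd

# Hypothetical two-row table: a mean (no output_type_id) and a median quantile.
model_out_tbl = pd.DataFrame(
    {
        "output_type": ["mean", "quantile"],
        "output_type_id": [None, 0.5],
        "value": [5, 3],
    }
)

# Write to an in-memory buffer so the example does not touch the file system.
buffer = io.StringIO()
model_out_tbl.to_csv(buffer, index=False, na_rep="NA")
csv_text = buffer.getvalue()

# The missing output_type_id is written as the literal text NA.
mean_row = csv_text.splitlines()[1]

# pandas treats the text NA as missing by default when reading the file back.
round_trip = pd.read_csv(io.StringIO(csv_text))
```

Without `na_rep`, pandas would write an empty field for the missing value, so setting it explicitly keeps the file consistent with what R's `write_csv()` produces.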

(example-parquet)=
### Example: model output as parquet

Writing to parquet is similar to writing to CSV, with the caveat that you
additionally need to ensure that the data type of the `output_type_id` column
matches the
[expected `output_type_id_datatype` property of the schema](#output-type-id-datatype).
In practice, you will need to know whether the expected data type is a
**string/character** or a **float/numeric**.


#### Writing parquet with R

```r
library("fs")
library("arrow")
# ... generate model data ...
outfile <- path(path_to_hub, "model-output", "team1-modelA", model_name)
model_out_tbl$output_type_id <- as.character(model_out_tbl$output_type_id) # or as.numeric()
arrow::write_parquet(model_out_tbl, outfile)
```


#### Writing parquet with Python

```python
import pandas as pd
import os.path
# ... generate model data ...
outfile = os.path.join(path_to_hub, "model-output", "team1-modelA", model_name)
model_out_tbl["output_type_id"] = model_out_tbl["output_type_id"].astype("string") # or "float"
model_out_tbl.to_parquet(outfile)
```
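
The type conversion in the Python example can be inspected without writing a file. The following sketch, which uses a hypothetical three-value `output_type_id` column, shows that both the `"string"` and `"float"` conversions keep the missing entry for a point estimate missing rather than turning it into text.

```python
import pandas as pd

# Hypothetical output_type_id column: missing for the mean row, numeric for quantiles.
output_type_id = pd.Series([None, 0.25, 0.5], dtype="object")

# The nullable "string" dtype turns the numbers into text and keeps the missing entry missing.
as_string = output_type_id.astype("string")

# A float conversion keeps the numbers numeric and stores the missing entry as NaN.
as_float = output_type_id.astype("float")

string_missing = as_string.isna().tolist()
float_missing = as_float.isna().tolist()
```

Which of the two conversions to use depends on the hub's expected `output_type_id` data type; this sketch only illustrates how each one treats the missing value.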

(model-output-task-relationship)=
## Model output relationships to task ID variables