Support for forecast evaluation #125

elray1 · 2023-11-14T14:57:19Z

elray1
Nov 14, 2023
Maintainer

Introduction/Overview

I'm hoping that we can use this discussion as a space to sort out what we want to do about supporting forecast evaluation. Here are a few desiderata, mostly suggested by Logan:

1. Evaluation of individual forecasts
2. Evaluation of collections of forecasts.
   i. Just relative WIS approach might be sufficient, but
   ii. sometimes on smaller sets of forecasters it's nice to subset down to a common set of forecast subtasks & just do a simple mean.
3. Forecast formats/output_types and evaluation metrics: (See the metric details page for scoringutils for what is supported there). Logan notes that flexibility would be nice.
   i. quantile format: WIS, per-capita WIS, log-transformed WIS, relative WIS, MAE, MSE, one-sided quantile coverage, interval coverage.
   ii. sample format: same as for quantile format, replacing WIS by CRPS
   iii. pmf format: Brier score, CRPS, log score, I think there is some relevant notion of coverage/calibration that could be applied
   iv. cdf format: Brier score, CRPS, log score, I think there is some relevant notion of coverage/calibration that could be applied
4. Grouping: I'd like to be able to group forecasting subtasks, e.g., by whether things were going up/down/steady, and have evaluations by those groups. (Maybe this is just simple application of group_by, but I haven't given it much thought.)
5. Efficiency: it would also be nice if this worked naturally on large analyses.
6. Compatibility with hub model output formats

Many of these items are satisfied by scoringutils (at least numbers 1, 2i, most of 3i [aside from per-capita WIS and one-sided quantile coverage], 3ii, and 4). Important items that don't seem to be satisfied include 3iii and 3iv (i.e., there seems to be no support for pmf- or cdf-formatted forecasts in scoringutils). Currently, some data massaging is required to get between hub formats and the formats used by scoringutils; see the example below. I haven't tried anything computationally intensive with the package, so I'm not sure about efficiency, but I think they're using data.table, which suggests attention has been paid to this issue.

My overall takeaway is that for quantile and sample format forecasts, it is reasonable to just refer users to scoringutils, likely providing a small function that does some data format conversion. We would need to either work with scoringutils maintainers on adding support for pmf and cdf format forecasts, or implement scoring for those output types ourselves.

What evaluation using scoringutils currently looks like

My understanding is that some redesign of scoringutils is currently underway, but to ground the discussion here's a brief overview of what using scoringutils to do forecast evaluation looks like for an example using FluSight forecasts.

I first do some set-up and then load forecast data in the hub format:

library(hubUtils)
library(scoringutils)

library(lubridate)
library(dplyr)
library(ggplot2)
library(plotly)

library(here)
setwd(here::here())

current_ref_date <- lubridate::ceiling_date(Sys.Date(), "week") - lubridate::days(1)

hub_path <- "../FluSight-forecast-hub"

hub_con <- connect_hub(hub_path)
forecasts <- hub_con |>
  dplyr::filter(
    output_type == "quantile"
  ) |>
  dplyr::collect() |>
  as_model_out_tbl()

head(forecasts)

This produces the following output:

# A tibble: 6 × 9
  model_id              reference_date target          horizon target_end_date location output_type output_type_id value
  <chr>                 <date>         <chr>             <int> <date>          <chr>    <chr>       <chr>          <dbl>
1 CADPH-FluCAT_Ensemble 2023-10-14     wk inc flu hosp      -1 2023-10-07      06       quantile    0.01            40.0
2 CADPH-FluCAT_Ensemble 2023-10-14     wk inc flu hosp      -1 2023-10-07      06       quantile    0.025           40.9
3 CADPH-FluCAT_Ensemble 2023-10-14     wk inc flu hosp      -1 2023-10-07      06       quantile    0.05            41.7
4 CADPH-FluCAT_Ensemble 2023-10-14     wk inc flu hosp      -1 2023-10-07      06       quantile    0.1             42.6
5 CADPH-FluCAT_Ensemble 2023-10-14     wk inc flu hosp      -1 2023-10-07      06       quantile    0.15            43.3
6 CADPH-FluCAT_Ensemble 2023-10-14     wk inc flu hosp      -1 2023-10-07      06       quantile    0.2             43.7

Note that the FluSight project has forecasts for a categorical target in pmf format, but (a) scoringutils only naturally works with one forecast format at a time (e.g. just quantile forecasts or just sample forecasts), and (b) scoringutils does not support scoring of pmf forecasts. Here we're just filtering to quantile forecasts.

We can also load corresponding target data:

target_data <- readr::read_csv("https://raw.githubusercontent.com/cdcepi/FluSight-forecast-hub/main/target-data/target-hospital-admissions.csv")
head(target_data)

...which looks like this:

> head(target_data)
# A tibble: 6 × 6
   ...1 date       location location_name value weekly_rate
  <dbl> <date>     <chr>    <chr>         <dbl>       <dbl>
1     1 2023-11-04 02       Alaska           32       4.50 
2     2 2023-11-04 01       Alabama          38       0.750
3     3 2023-11-04 05       Arkansas         15       0.493
4     4 2023-11-04 04       Arizona          51       0.695
5     5 2023-11-04 06       California       98       0.252
6     6 2023-11-04 08       Colorado         31       0.534

The date column here matches up with target_end_date in the forecasts data, and the target value is given by value.

To use scoringutils, we need to merge the forecast and target data together and get them to have a specific set of column names:

data_for_su <- forecasts |>
  dplyr::filter(horizon >= 0) |>
  dplyr::left_join(
    target_data |> dplyr::select(target_end_date = date, location, true_value = value),
    by = c("location", "target_end_date")
  ) |>
  dplyr::rename(model=model_id, quantile=output_type_id, prediction=value) |>
  dplyr::mutate(quantile = as.numeric(quantile))

head(data_for_su)

# A tibble: 6 × 10
  model                 reference_date target          horizon target_end_date location output_type quantile prediction true_value
  <chr>                 <date>         <chr>             <int> <date>          <chr>    <chr>          <dbl>      <dbl>      <dbl>
1 CADPH-FluCAT_Ensemble 2023-10-14     wk inc flu hosp       0 2023-10-14      06       quantile       0.01        24.7         61
2 CADPH-FluCAT_Ensemble 2023-10-14     wk inc flu hosp       0 2023-10-14      06       quantile       0.025       28.2         61
3 CADPH-FluCAT_Ensemble 2023-10-14     wk inc flu hosp       0 2023-10-14      06       quantile       0.05        31.2         61
4 CADPH-FluCAT_Ensemble 2023-10-14     wk inc flu hosp       0 2023-10-14      06       quantile       0.1         34.6         61
5 CADPH-FluCAT_Ensemble 2023-10-14     wk inc flu hosp       0 2023-10-14      06       quantile       0.15        37.0         61
6 CADPH-FluCAT_Ensemble 2023-10-14     wk inc flu hosp       0 2023-10-14      06       quantile       0.2         38.9         61
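The merge-and-rename step above could be wrapped in a small helper. The following is only a sketch: the function name `hub_quantiles_to_su` is hypothetical (it is not part of hubUtils or scoringutils), and it assumes the column names shown in the example above, which may change with the scoringutils redesign.

```r
library(dplyr)

# Hypothetical helper, not an existing hubUtils/scoringutils function.
# Assumes `forecasts` has hub-format columns (model_id, output_type,
# output_type_id, value, location, target_end_date) and `target_data`
# has columns date, location, and value.
hub_quantiles_to_su <- function(forecasts, target_data) {
  # Keep only the columns needed for the join, renamed to match the forecasts.
  observed <- target_data |>
    dplyr::select(target_end_date = date, location, true_value = value)

  forecasts |>
    dplyr::filter(output_type == "quantile") |>
    dplyr::left_join(observed, by = c("location", "target_end_date")) |>
    dplyr::rename(
      model = model_id,
      quantile = output_type_id,
      prediction = value
    ) |>
    # Quantile levels arrive as character in the hub format.
    dplyr::mutate(quantile = as.numeric(quantile))
}
```

With this in place, the object above would be built as `forecasts |> dplyr::filter(horizon >= 0) |> hub_quantiles_to_su(target_data)`; the horizon filter is left to the caller because it is specific to this analysis.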

We can use scoringutils::check_forecasts to confirm that we have things set up correctly.

data_for_su |>
  scoringutils::check_forecasts()

I subset the output here:

The following messages were produced when checking inputs:
1.  371588 values for `true_value` are NA in the data provided and the corresponding rows were removed. This may indicate a problem if unexpected.
Your forecasts seem to be for a target of the following type:
$target_type
[1] "integer"

and in the following format:
$prediction_type
[1] "quantile"

The unit of a single forecast is defined by:
$forecast_unit
[1] "model"           "reference_date"  "target"          "horizon"         "target_end_date" "location"        "output_type" 

Cleaned data, rows with NA values in prediction or true_value removed:
$cleaned_data
...

Number of unique values per column per model:
$unique_values
                       model reference_date target horizon target_end_date location output_type quantile prediction true_value
 1:    CADPH-FluCAT_Ensemble              4      1       4               4        1           1       23        230          4
 2:         CEPH-Rtrend_fluH              4      1       4               4       53           1       23       1080         69
 3:           CMU-TimeSeries              4      1       4               4       53           1       23      11233         69
...

$messages
[1] "371588 values for `true_value` are NA in the data provided and the corresponding rows were removed. This may indicate a problem if unexpected."

For evaluation, you can first compute some "raw" scores:

scores_raw <- data_for_su |>
  scoringutils::score()

head(scores_raw)

                   model reference_date          target horizon target_end_date location output_type range interval_score dispersion underprediction overprediction coverage coverage_deviation bias
1: CADPH-FluCAT_Ensemble     2023-10-14 wk inc flu hosp       0      2023-10-14       06    quantile     0       13.99418   0.000000        13.99418              0        0                0.0 -0.9
2: CADPH-FluCAT_Ensemble     2023-10-14 wk inc flu hosp       0      2023-10-14       06    quantile    10       13.87927   1.097819        12.78145              0        0               -0.1 -0.9
3: CADPH-FluCAT_Ensemble     2023-10-14 wk inc flu hosp       0      2023-10-14       06    quantile    10       13.87927   1.097819        12.78145              0        0               -0.1 -0.9
4: CADPH-FluCAT_Ensemble     2023-10-14 wk inc flu hosp       0      2023-10-14       06    quantile    20       13.50853   1.957300        11.55123              0        0               -0.2 -0.9
5: CADPH-FluCAT_Ensemble     2023-10-14 wk inc flu hosp       0      2023-10-14       06    quantile    20       13.50853   1.957300        11.55123              0        0               -0.2 -0.9
6: CADPH-FluCAT_Ensemble     2023-10-14 wk inc flu hosp       0      2023-10-14       06    quantile    30       12.89995   2.593227        10.30673              0        0               -0.3 -0.9
   quantile ae_median quantile_coverage
1:     0.50  13.99418             FALSE
2:     0.45  13.99418             FALSE
3:     0.55  13.99418             FALSE
4:     0.40  13.99418             FALSE
5:     0.60  13.99418             FALSE
6:     0.35  13.99418             FALSE

Then you can add interval coverage measures, and effectively group by and summarize these scores as desired:

scores <- scores_raw |>
  add_coverage(ranges = c(50, 80, 95), by = c("model", "reference_date")) |>
  summarise_scores(by = c("model", "reference_date"))

head(scores)

                   model reference_date interval_score dispersion underprediction overprediction coverage_deviation       bias ae_median coverage_50 coverage_80 coverage_95
1: CADPH-FluCAT_Ensemble     2023-10-14       15.94373   3.134861       12.808869      0.0000000         -0.2330435 -0.7625000  23.60530   0.2500000   0.2500000   0.7500000
2: CADPH-FluCAT_Ensemble     2023-10-21       10.31962   3.084835        3.510395      3.7243923         -0.1533333 -0.0400000  16.36846   0.3333333   0.3333333   0.6666667
3: CADPH-FluCAT_Ensemble     2023-10-28       30.86594   2.129572       28.736364      0.0000000         -0.5591304 -1.0000000  38.34343   0.0000000   0.0000000   0.0000000
4: CADPH-FluCAT_Ensemble     2023-11-04       12.71398   2.146931       10.567053      0.0000000         -0.4721739 -0.9800000  19.57430   0.0000000   0.0000000   0.0000000
5:      CEPH-Rtrend_fluH     2023-10-14       13.11365  10.721893        1.741550      0.6502051          0.2076866 -0.1481132  14.54717   0.8773585   0.9905660   1.0000000
6:      CEPH-Rtrend_fluH     2023-10-21       12.45636   9.028824        2.761772      0.6657643          0.1999617 -0.1374214  16.95597   0.8176101   0.9622642   1.0000000

Summing up

Here are some condensed thoughts about what we might like to do to support evaluation/scoring of hub model outputs:

If it's still required after a scoringutils redesign, it might be nice to provide a function in hubUtils that does the data merging and column renaming that's required to get from hub data formats to scoringutils data formats, i.e. creating the data_for_su object above.
Add support for any desired metrics for quantile and sample outputs, e.g. one-sided quantile coverage and per-capita WIS. Ideally we could just add these things to scoringutils.
Think about how we want to support pmf and cdf forecast evaluation. It seems like it should be possible to add pmf evaluation to scoringutils, and then maybe get to cdf evaluation by converting cdf to pmf, as bin probabilities?
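To make the cdf-to-pmf idea in the last item concrete, here is a minimal base R sketch. The bin edges, probabilities, and observation are made up for illustration, and the choice of a negative log probability as the log score (so that lower is better) is an assumption, not a statement about what scoringutils would implement.

```r
# Hypothetical cdf forecast: F(x) reported at the upper edges of four bins,
# with the last edge set to Inf so that F reaches 1.
cdf_forecast <- data.frame(
  output_type_id = c(10, 20, 30, Inf),  # x: upper bin edges
  value = c(0.2, 0.6, 0.9, 1.0)         # F(x)
)

# Bin probabilities are successive differences of the cdf values.
pmf_forecast <- data.frame(
  bin_upper = cdf_forecast$output_type_id,
  prob = diff(c(0, cdf_forecast$value))
)

# Find the bin (lower, upper] containing a made-up observation.
observed <- 17
observed_bin <- findInterval(
  observed,
  c(-Inf, cdf_forecast$output_type_id),
  left.open = TRUE
)

# Log score as the negative log of the probability assigned to that bin.
log_score <- -log(pmf_forecast$prob[observed_bin])
```

This treats bins as left-open, right-closed intervals; a hub that defines bins differently would need a different lookup.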

nickreich · 2023-11-14T17:14:21Z

nickreich
Nov 14, 2023
Maintainer

This is a great pass at showing how some simple scoring functionality would work. Given what @sbfnk has said to us about the plans that @nikosbosse has to revamp scoringutils, possibly including changes that are not backwards compatible, are you thinking about suggesting this as something to tackle after the scoringutils redesign?

1 reply

elray1 Nov 14, 2023
Maintainer Author

I agree that it doesn't make sense to build anything until the scoringutils API is more finalized.

nikosbosse · 2023-11-14T22:19:40Z

nikosbosse
Nov 14, 2023

What exactly do the pmf and cdf formats look like?

One plan of the redesign is that users can provide their own custom functions to score() - so the WIS per capita should not be a problem at all. One-sided quantile coverage is already implemented in scoringutils (and will also be supported in the next version). Are you interested in that for diagnostic plots? Or as a metric for specific quantile levels?
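As a rough illustration of what a per-capita WIS metric might compute, here is a base R sketch that does not use the scoringutils API. The function names and the `population` argument are hypothetical. It relies on the fact that, for a symmetric set of quantile levels that includes the median, WIS equals twice the mean quantile (pinball) loss.

```r
# WIS for one forecast, computed as twice the mean pinball loss across
# quantile levels. Assumes the levels are symmetric around 0.5 and include it.
wis_from_quantiles <- function(observed, quantile_levels, predictions) {
  stopifnot(length(quantile_levels) == length(predictions))
  pinball <- (as.numeric(observed <= predictions) - quantile_levels) *
    (predictions - observed)
  2 * mean(pinball)
}

# Hypothetical per-capita variant: WIS per 100,000 population.
wis_per_100k <- function(observed, quantile_levels, predictions, population) {
  wis_from_quantiles(observed, quantile_levels, predictions) / population * 1e5
}

# Example with made-up numbers: a 50% interval of [8, 12] and median 10.
wis_from_quantiles(10, c(0.25, 0.5, 0.75), c(8, 10, 12))
```

How such a function would be passed to score() depends on the redesigned interface, which is not described here.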

Another plan of the redesign is to make it as easy as possible for users to recreate the scoring functionality for new formats. Maybe we could just add support for the pmf and cdf format.

Writing a general function to convert from one format to the other would be nice. Potentially we could even support that within scoringutils (but would have to think about whether that's the right place).

9 replies

aaronger Nov 20, 2023
Collaborator

Just wanted to chime in here since it seems like there's a chance that this set-up relates to a suggestion I made early on last year in the data format working group. Specifically, I do think it makes sense to have observable variable values in the output_type_id column since such a value is what "identifies" which probability forecast is being elicited. Such a value is, on the other hand, what a quantile forecast outputs.

nikosbosse Nov 20, 2023

I'd also agree that it feels very natural for the pmf. It's only between cdf and quantile that there feels like a bit of tension.

aaronger Nov 20, 2023
Collaborator

Hmm... I guess I would actually not see it as so natural for the pmf since it is not inverting the quantile function and potentially needs a second value to define a bin. But as long as a directional orientation has been established for the outcome, elicitation of a quantile for a probability level and elicitation of a probability for an outcome level are essentially symmetric processes insofar as we are asking the forecaster to describe the same curve by giving either cdf or quantile "y values" over some grid of "x levels".

elray1 Nov 21, 2023
Maintainer Author

Here's the way I think about it: Consistently, the value produced by a modeler goes in the value column. The thing that specifies what value we are asking a modeler to produce, or that identifies what was produced by the modeler, goes in the output_type_id column.

For output type sample, the modeler produces a sample, on the scale of the response variable. The quantity in output_type_id is an index or other label for the sample.
For output type pmf, the value produced by the modeler is a value of the pmf function, f(x). The quantity in output_type_id is the point at which the pmf should be evaluated, x.
For output type cdf, the value produced by the modeler is a value of the cdf function, F(x). The quantity in output_type_id is the point at which the cdf should be evaluated, x.
For output type quantile, the value produced by the modeler is a value of the quantile (i.e. inverse cdf) function, F^{-1}(p). The quantity in output_type_id is the point at which the quantile function should be evaluated, p, a probability in the interval (0, 1).

Fundamentally, the reason that a value of the target variable ends up in the output_type_id column for output types cdf and pmf, but in the value column for output type quantile is that the pmf and cdf take values of the response variable as an input and produce probabilities, but the quantile function takes a probability level as an input and outputs a value of the response variable. Since we want the number in the value column to be the modeler's predicted output of whatever function (d, p, q, or r) the hub specified, the scale of the numbers that show up in the value column differs for these different functions.
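A small base R sketch of this convention, using R's d/p/q/r function family. The Poisson(10) distribution and the evaluation points are arbitrary stand-ins for a model's predictive distribution; in a real hub, the pmf `output_type_id` values would be whatever categories or bins the hub defines.

```r
# Illustrative only: Poisson(10) stands in for a model's predictive distribution.
lambda <- 10
set.seed(1)

x_grid <- c(8, 10, 12)        # points on the scale of the target variable
p_grid <- c(0.25, 0.5, 0.75)  # probability levels

model_out <- rbind(
  # pmf: output_type_id is x, value is the probability f(x)
  data.frame(output_type = "pmf",
             output_type_id = as.character(x_grid),
             value = dpois(x_grid, lambda)),
  # cdf: output_type_id is x, value is the probability F(x)
  data.frame(output_type = "cdf",
             output_type_id = as.character(x_grid),
             value = ppois(x_grid, lambda)),
  # quantile: output_type_id is p, value is on the scale of the target
  data.frame(output_type = "quantile",
             output_type_id = as.character(p_grid),
             value = qpois(p_grid, lambda)),
  # sample: output_type_id is just an index, value is a draw
  data.frame(output_type = "sample",
             output_type_id = as.character(1:3),
             value = rpois(3, lambda))
)

model_out
```

In the resulting table, the pmf and cdf rows have probabilities in `value`, while the quantile and sample rows have values on the scale of the response variable.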

nikosbosse Nov 21, 2023

Agree with the general principle of having the value produced by the modeller in the value column. ~~Is it the case that for the cdf the task of the modeller is to provide the cdf value F(x) and you give them a grid of x values?~~ I guess if you swapped what you ask modellers for then you'd be back at eliciting quantile predictions, so the entire point of the cdf forecast is that things are the other way around compared to quantile forecasts.
I think that resolves my confusion.
