Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementation of the Brier Score and its consistency test. #232

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

pabloitu
Copy link
Collaborator

@pabloitu pabloitu commented Jun 6, 2023

Added new module that makes a rough calculation, derived from the Binary LL test. Added a unit test that evaluates a forecast and plots the results.
pending:

  • Check about if this test should be a conditional test (e.g. to the total number of events)
  • Check if the simulation scheme is appropriate
  • Re-do unit tests

@wsavran @Serra314 This is a preliminary implementation and we can start reviewing the code it in this same PR.

pyCSEP Pull Request Checklist

Type of change:

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

"""

# Array-masking that avoids log singularities:
forecast_data = numpy.ma.masked_where(forecast_data <= 0.0, forecast_data)
Copy link
Contributor

@mherrmann3 mherrmann3 Jun 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beware of the potential side effects of filtering away zero-rate bins, see #226 (tl;dr: allows a modeler to cheat).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the Brier score zero-rates bins are not problematic and should not be removed. One of the advantages of the Brier is indeed that it is always finite and we do not need to filter out or modify any of the rates.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about negative rates (just in case)?

Copy link

@Serra314 Serra314 Jun 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rates should be always non-negative. If we estimate them from catalogue-based forecasts as the average (per synthetic catalog) number of points per bin we should never get negative rates. If the rates come from a grid-based forecast then it should be an indication that there is a problem with the model on how these rates are generated. More specifically, the Brier score does not work directly with the rates, it works by estimating the probability that a bin is active (percentage of synthetic catalogues with at least 1 event in that bin), so zero-rates bins have zero probability, but bins with the same (positive) rate may have different probabilities depending on the variance. The rate completely determines the probability only in the case of grid-based forecasts assuming an independent Poisson distribution for the number of events in each bin. In both cases, (I think) it should be impossible to get negative rates, if the forecast is catalogue-based this is by construction, if the forecast is grid-based with Poisson counts then the rate should always be greater or equal to zero by definition of the Poisson distribution. The same holds if we consider a Negative Binomial distribution which has both parameters, and the mean, greater or equal zero.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick explainer, Francesco!

This reminds me that we should clarify that this PR is currently for grid-based forecasts only.

I only ask about negative rates to think about necessary "safety" measures (currently there are none, apart the silent masking here and in binomial_evaluations). Since ≤ 0 rates could in principle be used for cheating in LL-based tests when masking them away (#226), I wanted to inquire whether similar things are possible with the Brier score.

The first sentence should be "The rates should be always non-negative"

Ah, sure; I mentally replaced positive with non-negative when I read your first sentence ;-) (Btw: you can always edit your posts)

Copy link

@Serra314 Serra314 Jun 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I edited and removed the last comment :) Anyway, I don't know what would happen with the Brier score when using negative rates, it depends on how we calculate the probability of activity given a negative rate. I think that a possible solution could be to impose them to zero and rise a warning saying that some of the rates were negative. This would be similar to setting the negative rates to the minimum possible (not forecasted) value. Otherwise, we can return -inf (which is not attainable by any "valid" forecast with the Brier score) and it would be a clear indication that something is wrong in the forecast. I'd still throw a specific warning though to let the user know what is going on.

Excluding them may be used to cheat by placing them in bins that a malicious user wants to exclude from the computations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, spitting out at least a Warning in those cases seems a common desire in #226. But there are good reasons to not "touch"/"fix" the original forecasts - at least the zero rates. In any case, we should continue this discussion there.

return brier


def _simulate_catalog(sim_cells, sampling_weights, sim_fore, random_numbers=None):
Copy link
Contributor

@mherrmann3 mherrmann3 Jun 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to pass sim_fore, as it is created in this function.
Instead, move L73 inside here (sim_fore = numpy.zeros(sampling_weights.shape).
(Same applies to core/poisson_evaluations._simulate_catalog(), from which this function derives from.)

loc = numpy.searchsorted(sampling_weights, random_num, side='right')
if sim_fore[loc] == 0:
sim_fore[loc] = 1
num_active_cells = num_active_cells + 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

num_active_cells += 1

from csep.core.exceptions import CSEPCatalogException


def bier_score_ndarray(forecast, observations):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not prepending an underscore (_) to the function name?

… test. It is implemented only using a Binary calculation for spatial-magnitude bins.

tests: Implemented brier score tests.
@codecov-commenter
Copy link

Codecov Report

Attention: 25 lines in your changes are missing coverage. Please review.

Comparison is base (083c5ca) 54.95% compared to head (0b9ea62) 55.05%.

❗ Current head 0b9ea62 differs from pull request most recent head a50d1e3. Consider uploading reports for the commit a50d1e3 to get more accurate results

Files Patch % Lines
csep/core/brier_evaluations.py 60.93% 20 Missing and 5 partials ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #232      +/-   ##
==========================================
+ Coverage   54.95%   55.05%   +0.09%     
==========================================
  Files          23       24       +1     
  Lines        3963     4027      +64     
  Branches      578      587       +9     
==========================================
+ Hits         2178     2217      +39     
- Misses       1648     1668      +20     
- Partials      137      142       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@pabloitu
Copy link
Collaborator Author

@Serra314 had an important comment here which I copy here:

Hi Pablo, nice work implementing the Brier score. I only have one doubt with it, why we use poisson.cdf to define the probabilities? I foresee two problems with this, 1. the usual problem with models that do not assume a Poisson distribution in each bin, and 2. if we plan to use this function also to score the magnitude distribution then using Poisson to calculate probabilities is not correct and we would need to implement a different function for that task. My suggestion would be to create a function that takes in input the probabilities (however calculated) and the observations and just calculate the Brier score, then we can apply it to different cases just creating the probability vector differently. This could be also useful to study the difference in the results depending on how we calculate these probabilities. This is just a thought, I am keen to discuss further (sorry to have missed last meeting, probably you already discussed this there).

We definitely have to discuss this. Since we started implementing new metrics, there is an subtle entanglement in pycsep code-base between scores, tests and the probabilistic models, and I have no idea how to address them without a major refactoring. For instance, I agree that the Brier score should accept other distributions than Poisson, but so should the log-likelihood metric (e.g., already implemented as bernoulli and negative binomial models in the binomial_evaluations.py module). Additionally, the Brier score should admit spatial/magnitude/cl consistency tests, and also a non-binary form. Rankings should embrace also different scores other than log-likelihoods, and perhaps t/w-tests when allowed.

I would stick with the Poisson model for the preliminary implementation (as it was the one published by you and soon by me), and then implement stuff forward once we think how to categorize score/tests/rankings in the codebase.

@pabloitu
Copy link
Collaborator Author

@Serra314 Would it be possible for you to test my latest implementation with the forecasts and brier results of your manuscript? If okay with you Francesco, we could include your pycsep2 manuscript's brier paragraph into the documentation.

@pabloitu pabloitu changed the title Preliminary implementation of the Brier Score and its consistency test. Implementation of the Brier Score and its consistency test. Oct 22, 2024
@pabloitu pabloitu mentioned this pull request Nov 25, 2024
22 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants