Add and validate GEV preprocessing step #31

Open
perrette opened this issue Oct 14, 2024 · 15 comments

@perrette
Collaborator

After the NGFS workshop preparation call, and preliminary discussion with @NiklasSchwind and Carl, we'll want to do GEV fitting inside the 21-year window instead of calculating the climatic average. This would require some validation and possible variations such as detrending the data inside the window, to avoid artificially distorting the GEV distribution.
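
For concreteness, a minimal sketch (not part of the original discussion) of what fitting a GEV inside one 21-year window could look like, using scipy.stats.genextreme on synthetic annual maxima; all names and numbers are illustrative:

# Sketch: fit a GEV to one 21-year window of annual maxima instead of
# taking the climatological mean (illustrative only, synthetic data).
import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(0)
annual_max = rng.gumbel(loc=30.0, scale=2.0, size=21)  # e.g. annual max of tasmax, degC

clim_mean = annual_max.mean()                    # the current "climatic average" approach
shape, loc, scale = genextreme.fit(annual_max)   # GEV fit on the same 21 values

# 1-in-20-year event (95th percentile of the fitted annual-max distribution)
rv20 = genextreme.isf(1 / 20, shape, loc, scale)
print(clim_mean, rv20)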

@perrette perrette added the rimeX and preprocessing labels Oct 14, 2024
@perrette perrette self-assigned this Oct 14, 2024
@NiklasSchwind
Collaborator

Good idea! However, as the GEV median and the mean won't exactly match, and this introduces another assumption, I would keep it as an optional preprocessing step :)

@perrette
Collaborator Author

perrette commented Oct 14, 2024

Does it come to mind because of the importance of skewness in GEV fitting? On the other hand, precisely because of its importance, we might want to correct for it. Anyway, we can do sensitivity tests, see how strong the impact is, and decide then what the default should be. On a more general note, the processing of climate model data is not an exact science...

@perrette
Collaborator Author

But I do agree detrending comes with tradeoffs, and it needs careful assessment of whether the cure is better than the disease. E.g. it's probably not a good idea to detrend selectively on 21 years' worth of local data. If anything, we'd need to model the trend in a robust manner that does not add to the variability. A reasonably elegant (because self-consistent) idea would be to use our model for the 21-year mean, perhaps (but not necessarily) calibrated on a per-model basis, to remove the mean.

@NiklasSchwind
Collaborator

NiklasSchwind commented Oct 15, 2024

I wouldn't make it the default, as I think doing so would limit the applicability of the emulator to GEV-distributed extreme indicators. E.g. rx1day follows a GEV, but tas is probably normally distributed, and flood depth probably has a completely different underlying distribution. So using the GEV by default would limit our emulator to rx1day in this case, while keeping it optional broadens the applicability.

I wouldn't even count the GEV fit/mean as part of the emulator per se, but rather as part of the definition of the underlying indicator. (So we emulate indicators like e.g. "1-in-20-year event of rx1day" or "21-year-mean precipitation".)

@NiklasSchwind
Collaborator

Note about an idea by Carl (will elaborate further tomorrow):
Instead of fitting one GEV on every 20-year window, one could try to fit, for each simulation, a (linear?) function predicting the parameters of the GEV distribution from GMT.
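
For what it's worth, a rough sketch of how such a GMT-dependent GEV could be fitted by maximum likelihood, with the location parameter linear in GMT (the parameterisation and all names are assumptions, not Carl's actual proposal):

# Sketch: non-stationary GEV with location linear in GMT,
# fitted by minimizing the negative log-likelihood (illustrative parameterisation).
import numpy as np
from scipy.stats import genextreme
from scipy.optimize import minimize

def neg_loglik(params, annual_max, gmt):
    c, mu0, mu1, log_scale = params
    mu = mu0 + mu1 * gmt                 # location varies with GMT
    scale = np.exp(log_scale)            # keep scale positive
    ll = genextreme.logpdf(annual_max, c, loc=mu, scale=scale)
    return -np.sum(ll)

def fit_gev_vs_gmt(annual_max, gmt):
    x0 = np.array([0.1, annual_max.mean(), 0.0, np.log(annual_max.std())])
    res = minimize(neg_loglik, x0, args=(annual_max, gmt), method="Nelder-Mead")
    return res.x  # (shape, mu0, mu1, log_scale)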

@perrette
Collaborator Author

perrette commented Oct 15, 2024

Here I am only talking about those variables that we assume follow a GEV distribution, like 1-in-X-year events.

The reason why I propose dedicated processing for these variables, instead of treating them as regular variables, is that they are already a statistic over time. There would be no justification to take another 21-year, "climatological" mean of these variables, because they are already a climatological indicator. The question remains whether they need detrending or not, and whether that should be done by default, but I tend to think they do, and I maintain the line put forward above that we could use our classical emulator for the "mean" climatological variable as the trend, and apply the GEV fitting to the daily variable minus that trend (I'd probably use a smooth GMT value to compute the trend).

I don't see exactly what you find problematic here, nor the difference with predicting the parameters of the GEV. To predict the parameters of the GEV, you need to fit the GEV first, right? So according to my understanding, in every 21-year period you'd do a GEV fit on the (probably detrended) time-series and derive the 1-in-X-year event values, thus obtaining a new time-series for each of the desired return periods. Later on, this would not be fundamentally different from the other indicators, except that you wouldn't take the 21-year mean of these values; you'd just use them directly. We can discuss that and other ideas on a call perhaps, also with Carl if needed.
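
To make that concrete, a minimal sketch (assuming annual maxima as input, with the trend already removed by whatever method we settle on; names are illustrative):

# Sketch: per-window GEV fits turned into a time-series of 1-in-X-year values
# (assumes `annual_max_detrended` is a 1-D array of annual maxima, trend removed).
import numpy as np
from scipy.stats import genextreme

def return_level_series(annual_max_detrended, window=21, return_period=20):
    half = window // 2
    n = len(annual_max_detrended)
    levels = np.full(n, np.nan)
    for i in range(half, n - half):
        sample = annual_max_detrended[i - half:i + half + 1]
        c, loc, scale = genextreme.fit(sample)
        levels[i] = genextreme.isf(1 / return_period, c, loc, scale)
    return levels  # one 1-in-20-year value per (centre) year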

@NiklasSchwind
Collaborator

NiklasSchwind commented Nov 11, 2024

My preferred way to go here would be to
(1) calculate a warming-level-dependent GEV for each input ISIMIP simulation.
(2) extract the indicator (e.g. 1-in-20 year event of rx1day) per warming level from every warming-level dependent GEV extracted from the simulation
(3) calculate quantiles of the indicator values weighted by the probability of the warming level

Of course this would require some kind of detrending. Let me check after lunch whether I can find a paper that does something similar; I think similar GEV fits are done in attribution studies. I think detrending with our own method would likely not work, since we calculate medians across all models, but maybe I am misunderstanding the approach.
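
For step (3), a minimal sketch of what a weighted quantile could look like (the function name and interpolation scheme are placeholders, not existing rimeX code):

# Sketch: quantiles of indicator values weighted by warming-level probability.
import numpy as np

def weighted_quantile(values, quantiles, weights):
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cum = np.cumsum(weights) - 0.5 * weights  # cumulative weights at the sample points
    cum /= weights.sum()                      # normalised to [0, 1]
    return np.interp(quantiles, cum, values)

# e.g. indicator values from several simulations, weighted by warming-level probability p:
# q05, q50, q95 = weighted_quantile(values, [0.05, 0.5, 0.95], weights=p)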

@perrette
Collaborator Author

I guess we're on the same page. About the details: in my understanding, (1) would mean doing this in a 21-year moving window, i.e. a time-series covering 2000-2100 is split up into 81 overlapping sets of 21 years each (2000-2020, 2001-2021, ..., 2080-2100) and the GEV is fitted for every set. What is the input data frequency here, daily?

Regarding detrending, I don't have a strong opinion. Simplest is probably to just fit a linear trend and remove it within each 21-year period (keeping the mean, though). I'm not sure whether there could be side effects (like a noisy trend over time).
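
Just to make the window bookkeeping concrete, a small numpy sketch (illustrative only):

# Sketch: enumerate the overlapping 21-year windows of a 2000-2100 series.
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

years = np.arange(2000, 2101)             # 101 years, 2000..2100
windows = sliding_window_view(years, 21)  # 2000-2020, 2001-2021, ..., 2080-2100
print(windows.shape)                      # (81, 21)
print(windows[0][[0, -1]], windows[-1][[0, -1]])  # [2000 2020] [2080 2100]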

@perrette
Collaborator Author

Yeah, I'd do linear detrending in every 21-year window. That would tend to minimize the GEV range, so it's conservative, which is good.
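
A minimal sketch of that per-window detrending (keeping the window mean; names are illustrative):

# Sketch: remove a linear trend within one 21-year window but keep its mean.
import numpy as np

def detrend_window(values):
    values = np.asarray(values, dtype=float)
    t = np.arange(len(values))
    slope, intercept = np.polyfit(t, values, deg=1)
    trend = slope * t + intercept
    return values - trend + values.mean()  # detrended, mean preserved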

@perrette
Collaborator Author

perrette commented Dec 9, 2024

The GEV approach may be powerful, but it is limited to very specific indicators by nature, and cannot be applied to all indicators in the table for which a percentile is required*.

Perhaps a simpler and more general approach would be to use empirical percentiles of our 21-year sample. That would be limited to at most the 95th percentile, and may not be very robust, but it has the merit of being straightforward to calculate:

>>> np.percentile(np.arange(21), [90, 95, 100])
array([18., 19., 20.])

For moderate warming levels with many ESMs available, the expected value of the 95th percentile could become more robust though. Of course that also would need detrending. For the trend, I continue to think the per-model fit of the emulator (GMT -> 21-year mean local impact) would be the most elegant predictor, despite cutting off some of the end of the window. Otherwise a linear regression GMT -> local impact would also be OK.

*GEV deals with time-series of max values (e.g. seasonal or annual max of tasmax: 21 values, a small sample but OK). The resulting percentiles / return values will be a statement about that (e.g. the max temperature experienced in a year). The GEV approach will not work with statistics that are not maxima. There are also related distributions, such as the generalized Pareto distribution, that deal with peaks above a threshold, which can include more data points and could match some of the more specific climate indicators, but that is also not a general way of calculating percentiles.
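
For reference, a sketch of what the peaks-over-threshold alternative would look like with scipy.stats.genpareto (the threshold choice and names are illustrative):

# Sketch: generalized Pareto fit on exceedances above a threshold
# (the threshold choice is illustrative; in practice it needs care).
import numpy as np
from scipy.stats import genpareto

def fit_peaks_over_threshold(daily_values, threshold):
    exceedances = daily_values[daily_values > threshold] - threshold
    c, loc, scale = genpareto.fit(exceedances, floc=0)  # fix location at the threshold
    return c, scale, exceedances.size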

@perrette
Collaborator Author

perrette commented Dec 9, 2024

And related to that: should this 95th percentile be computed on the monthly/annual mean, or on the monthly/annual max? I'm just wondering what is most useful here. And there are also the climate indices (e.g. heat stress indicators) that already focus on extracting the extremes.
At this point, I feel these percentiles in the indicators table raise more questions than they answer, not so much because of technical difficulties but rather in terms of understanding what we want from them.

Perhaps we should only keep them for the climate indices (e.g. the heat stress indicator), where a 21-year mean is useful but a 1-in-20-year event may tell a whole different story?

@NiklasSchwind
Collaborator

I think this has to be calculated on seasonal or annual max values; otherwise we couldn't use GEVs, and we would have to start testing which distributions fit.

To clarify: I think the 95th percentile of an indicator as written in the table, e.g. the 95th percentile of the annual maximum of daily maximum temperature, basically refers to a new indicator: in our example, the 1-in-20-year event of the annual maximum of daily maximum temperature.

We can derive it like that:

  • Calculate the indicator for the regions from the simulations, e.g. annual maximum of daily maximum temperature.
  • Create regional averages, so e.g. average of annual maximum of daily maximum temperature over Germany.
  • For every simulation, year, and region: Fit GEV on the values of the indicator in a 20-year window around the year and calculate the percentile from it. Record this value and the GMT in the same year.
  • Input those records into the binning procedure and in the rime emulator.

And I agree that for lots of indicators where percentiles are listed in the document, calculating this 1-in-X-year-event version would be uninformative, or a GEV would be unfitting to use. You are thus welcome to leave them out, in my opinion. Also, considering our time constraints and that this would probably be challenging to implement, I want to stress that this is a nice-to-have, not a must-have, in my opinion.

@perrette
Collaborator Author

perrette commented Dec 9, 2024

The procedure to calculate those is quite clear, yes; the question is indeed how sensible they are. Considering that these are different indicators already answers part of the question.

However, this does not entirely address the (many if not most) cases where the indicators are not time-series of maxima, because they are already elaborate climate indices, or mean daily values such as mean surface wind. At least in these cases, what about the second question? I.e. could an empirical calculation without a GEV fit make an OK replacement?

Otherwise, what would be the few variables I should focus on? A few come to mind: tasmax, wet_bulb_temperature, rx5day? I could do GEV on these three. And you could then work out later whether you'd like to extend the approach (or play around with its sibling, the generalized Pareto distribution, to deal with peaks-above-threshold indicators).

Alternatively, I could just use the 95th empirical percentile of the 21-year period in a generic fashion, to cover any variable without any assumption about an underlying distribution, until someone dedicates time to study the various indicators more specifically (and use GEV, Pareto or other distributions according to what is more appropriate)? Programmatically, you'd already have the framework ready, and would just need to intervene at the particular location of the code where the 21-year window (21 data points, really) is reduced to one statistic, and update that statistic.
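
Roughly what I mean by that single location in the code, as a sketch (function and option names are placeholders):

# Sketch: the one place where a 21-year window of values is reduced to a statistic.
import numpy as np
from scipy.stats import genextreme

def window_statistic(values, statistic="mean", quantile=None):
    values = np.asarray(values, dtype=float)
    if statistic == "mean":
        return values.mean()
    if statistic == "empirical":
        return np.percentile(values, 100 * quantile)
    if statistic == "gev":
        c, loc, scale = genextreme.fit(values)
        return genextreme.ppf(quantile, c, loc, scale)
    raise ValueError(f"unknown statistic: {statistic}")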

@perrette
Collaborator Author

perrette commented Dec 9, 2024

This does not need to be decided now. I think I'll simply program the various options in the code, to be decided at the time of execution. I propose something like:

[indicator.wet_bulb_temperature_q95]
_copy = "wet_bulb_temperature"
time_aggregation = "max"
running_window_statistic = "gev"  # gev or empirical or mean
running_window_quantile = 0.95
running_window_detrend = true  # can be omitted, since it is the default for "gev"
running_window_detrend_method = "gmt-rime"  # "gmt-rime" (default) or "gmt-linear"

where running_window_statistic defaults to "mean", and "gev" and "empirical" are short for "quantile from a GEV distribution fit" and "quantile from an empirical distribution". running_window_quantile is undefined by default (and required when the statistic is "gev" or "empirical"). running_window_detrend defaults to true for "gev" and "empirical", and to false for "mean". running_window_detrend_method sets how the time-series is detrended, and would feature mostly a) a rime-like GMT-driven trend trained on the same model, and b) a simple linear regression versus GMT (other methods like local linear trends are not great because they would likely be strongly influenced by the extremes if they occur at either end of the window, and would artificially damp the extremes as a result).
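
A sketch of how these defaults could be resolved from the indicator config (hypothetical helper, not existing rimeX code; key names follow the proposal above):

# Sketch: resolve the proposed running-window options and their defaults.
def resolve_running_window_options(cfg: dict) -> dict:
    statistic = cfg.get("running_window_statistic", "mean")
    quantile = cfg.get("running_window_quantile")
    if statistic in ("gev", "empirical") and quantile is None:
        raise ValueError("running_window_quantile is required for 'gev' and 'empirical'")
    detrend = cfg.get("running_window_detrend", statistic != "mean")
    method = cfg.get("running_window_detrend_method", "gmt-rime")
    if method not in ("gmt-rime", "gmt-linear"):
        raise ValueError(f"unknown detrend method: {method}")
    return {"statistic": statistic, "quantile": quantile,
            "detrend": detrend, "detrend_method": method}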

@NiklasSchwind
Collaborator

Hey Mahé,

Thanks for your excellent thoughts on this!

Otherwise, what would be the few variables I should focus on? A few come to mind: tasmax, wet_bulb_temperature, rx5day? I could do GEV on these three. And you could then work out later whether you'd like to extend the approach (or play around with its sibling, the generalized Pareto distribution, to deal with peaks-above-threshold indicators).

This sounds good to me. Whenever you calculate one of those indicators, we should load them into the CIE and test the outcome before we do much more work.

Alternatively, I could just use the 95th empirical percentile of the 21-year period in a generic fashion, to cover any variable without any assumption about an underlying distribution, until someone dedicates time to study the various indicators more specifically (and use GEV, Pareto or other distributions according to what is more appropriate)? Programmatically, you'd already have the framework ready, and would just need to intervene at the particular location of the code where the 21-year window (21 data points, really) is reduced to one statistic, and update that statistic.

This is also a good idea, and in my opinion worth a try, especially because it sounds like a reasonable effort to implement. I see the problem that probably many values will be reused across many records, but I can't yet predict how this will influence the outcome. On the other hand, I really like that it makes assuming a distribution unnecessary, so we would keep our method as general as possible; jackpot if it works.
