Add and validate GEV preprocessing step #31
Good idea! However, as the GEV median and the mean won't exactly match, and this introduces another assumption, I would keep it an optional preprocessing step :)
Does it come to mind because of the importance of skewness in GEV fitting? On the other hand, precisely because of its importance, we might want to correct for it. Anyway, we can do sensitivity tests, see how strong the impact is, and decide then what the default should be. On a more general note, the processing of climate model data is not an exact science...
But I do agree detrending comes with tradeoffs, and it needs careful assessment of whether the cure is better than the ill. E.g. it's probably not a good idea to detrend selectively on 21 years' worth of local data. If anything, we'd need to model the trend in a robust manner that does not add to the variability. A reasonably elegant (because self-consistent) idea would be to use our model for the 21-year mean, perhaps (but not necessarily) calibrated on a per-model basis, to remove the mean.
I wouldn't put it as the default, as I think that doing so would limit the applicability of the emulator to GEV-distributed extreme indicators. E.g. rx1day follows a GEV, but tas is probably normally distributed, and flood depth probably has a completely different underlying distribution. So using the GEV by default would limit the applicability of our emulator to rx1day in this case, while keeping it optional adds to the applicability. I wouldn't even count the GEV fit/mean as part of the emulator per se, but as part of the definition of the underlying indicator. (So we emulate indicators like e.g. "1-in-20-year event of rx1day" or "21-year-mean precipitation".)
Note about an idea by Carl (will elaborate further tomorrow):
Here I am only talking about those variables that we assume follow a GEV distribution, like 1-in-X-year events.
My preferred way to go here would be to … Of course this would require some kind of detrending; let me check if I can find a paper that does something similar after lunch, I think they do similar GEV fits in attribution studies. I think detrending with our own method would likely not work, as we calculate medians from all models, but maybe I am also misunderstanding the approach.
I guess we're on the same page. About the details, in my understanding (1) would mean doing this in a 21-year moving window: a 100-year time series is split up into 80 overlapping 21-year sets (2001-2021, 2002-2022, ..., 2080-2100) and the GEV is fitted to every set. What is the input data frequency here, daily? Regarding detrending, I don't have a strong opinion. The simplest is probably to just fit a linear trend and remove it within each 21-year period (keep the mean though). Not sure whether there could be side effects (like a noisy trend over time).
Yeah, I'd do a linear detrending in every 21-year window. That would tend to minimize the GEV range, so it's conservative, so it's good.
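As a rough sketch of that idea (not project code), a linear trend could be removed within each 21-year window while keeping the window mean, e.g.:

```python
import numpy as np

def detrend_window(values):
    """Remove a linear trend from one 21-year window, keeping the window mean.

    Illustrative sketch only; `values` holds the 21 annual values of the window.
    """
    values = np.asarray(values, dtype=float)
    years = np.arange(values.size)
    # np.polyfit returns the highest-order coefficient first: (slope, intercept)
    slope, intercept = np.polyfit(years, values, deg=1)
    trend = slope * years + intercept
    return values - trend + values.mean()
```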
The GEV approach may be powerful, but it is by nature limited to very specific indicators and cannot be applied to all indicators in the table for which a percentile is required*. Perhaps a simpler, and more general, approach would be to use the empirical percentiles of our 21-year sample. That would be limited to at most the 95th percentile, and may not be very robust, but it has the merit of being straightforward to calculate:

```python
>>> np.percentile(np.arange(21), [90, 95, 100])
array([18., 19., 20.])
```

For moderate warming levels with many ESMs available, the expected value of the 95th percentile could become more robust though. Of course that would also need detrending. For the trend I continue to think the per-model fit of the emulator (GMT -> 21-year mean local impact) would be the most elegant predictor, despite cutting off some of the end of the window. Otherwise a linear regression GMT -> local impact would also be OK.

*GEV deals with time series of max values (e.g. the seasonal or annual max of tasmax: 21 values -- a small sample, but OK). The resulting percentiles / return values will be a statement about that (e.g. the max temperature experienced in a year). The GEV approach will not work with statistics that are not maxima. There are also related statistics such as the generalized Pareto distribution that deal with peaks above a threshold, which can include more data points and could match some of the more specific climate indicators, but it is also not a general way of calculating percentiles.
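For the simpler GMT-based option mentioned above (a linear regression GMT -> local impact), a minimal sketch could look like the following; the function name, the plain least-squares fit, and keeping the window mean are assumptions, and a per-model emulator fit could replace the regression:

```python
import numpy as np

def detrend_with_gmt(local_values, gmt):
    """Remove the part of the local signal explained by a linear regression on GMT,
    keeping the window mean. Illustrative sketch only; `local_values` and `gmt`
    are aligned annual series for one window.
    """
    local_values = np.asarray(local_values, dtype=float)
    gmt = np.asarray(gmt, dtype=float)
    slope, intercept = np.polyfit(gmt, local_values, deg=1)
    predicted = slope * gmt + intercept
    return local_values - predicted + local_values.mean()
```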
And related to that: should this 95th percentile be computed on the monthly/annual mean, or on the monthly/annual max? I'm just wondering what is most useful here. And there are also the climate indices (e.g. heat stress indicators) that already focus on extracting the extremes. Perhaps we should only keep it for the climate indices (e.g. the heat stress indicator), where a 21-year mean is useful but a 1-in-20-year event may tell a whole different story?
I think this has to be calculated on seasonal or annual max values, otherwise we couldn't use GEVs and we would have to start testing which distributions would fit. To clarify: I think the 95th percentile of an indicator as written in the table, e.g. the 95th percentile of the annual maximum of daily maximum temperature, basically refers to a new indicator, in our example the 1-in-20-year event of the annual maximum of daily maximum temperature. We can derive it like that:
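For illustration, such a derivation could look roughly like this sketch (the sample values are made up, and the use of scipy.stats.genextreme and the 20-year return period are assumptions, not the definitive procedure):

```python
import numpy as np
from scipy.stats import genextreme

# 21 annual maxima of daily maximum temperature (illustrative values, degC)
annual_max = np.array([
    33.1, 34.5, 32.8, 35.2, 33.9, 36.1, 34.0, 33.4, 35.8, 34.7, 33.6,
    36.4, 34.9, 33.2, 35.5, 34.3, 36.0, 33.8, 35.1, 34.6, 36.7,
])

# Fit the GEV to the block maxima (scipy's genextreme uses the c = -xi convention)
shape, loc, scale = genextreme.fit(annual_max)

# 1-in-20-year event = 95th percentile of the fitted GEV
return_value_20yr = genextreme.ppf(0.95, shape, loc=loc, scale=scale)
print(return_value_20yr)
```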
And I agree that for lots of indicators where percentiles are listed in the document, calculating this 1-in-X-year event version would be uninformative or unfitting for a GEV. You are thus welcome to leave them out, in my opinion. Also, considering our time constraints and that this would probably be challenging to implement, I want to stress that this is a cool-to-have, not a must-have, in my opinion.
The procedure to calculate those is quite clear, yes; the question is indeed how sensible these are. Considering that these are different indicators already answers part of the question. However, this does not entirely address the (many if not most) cases where the indicators are not time series of maxima, because they are already elaborate climate indices, or mean daily values such as mean surface wind. At least in these cases, what about the second question? I.e. would an empirical calculation without a GEV fit make an OK replacement? Otherwise, what would be the few variables I should focus on? A few come to mind:

Alternatively, I could just use the 95th empirical percentile of the 21-year period in a generic fashion, to cover any variable without any assumption about the underlying distribution, until someone dedicates time to study the various indicators more specifically (and use GEV, Pareto or other distributions according to what's more appropriate). Programmatically, you'd already have the framework ready, and would just need to intervene at the particular location of the code where the 21-year window (21 data points, really) is reduced to one statistic, and update that statistic.
This does not need to be decided now. I think I'll simply program the various options in the code, to be decided at the time of execution. I propose something like:

```toml
[indicator.wet_bulb_temperature_q95]
_copy = "wet_bulb_temperature"
time_aggregation = "max"
running_window_statistic = "gev"  # gev or empirical or mean
running_window_quantile = 0.95
running_window_detrend = true  # can be omitted, since it is the default for "gev"
running_window_detrend_method = "gmt-rime"  # "gmt-rime" (default) or "gmt-linear", where …
```
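For illustration, a sketch of how such options could be dispatched at the point where a 21-year window is reduced to one statistic (function and variable names are illustrative, not the actual emulator code; the GMT-based detrend methods are omitted for brevity):

```python
import numpy as np
from scipy.stats import genextreme

def reduce_window(values, statistic="mean", quantile=0.95, detrend=False):
    """Reduce one 21-year window of annual values to a single statistic.

    Mirrors the proposed config options running_window_statistic,
    running_window_quantile and running_window_detrend; illustrative only.
    """
    values = np.asarray(values, dtype=float)
    if detrend:
        # Remove a linear trend within the window but keep the window mean
        years = np.arange(values.size)
        slope, intercept = np.polyfit(years, values, deg=1)
        values = values - (slope * years + intercept) + values.mean()

    if statistic == "mean":
        return values.mean()
    if statistic == "empirical":
        return np.percentile(values, 100 * quantile)
    if statistic == "gev":
        shape, loc, scale = genextreme.fit(values)
        return genextreme.ppf(quantile, shape, loc=loc, scale=scale)
    raise ValueError(f"unknown statistic: {statistic}")
```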
Hey Mahé, Thanks for your excellent thoughts on this!
This sounds good to me. Whenever you calculate one of those indicators, we should load them into the CIE and test the outcome before we do much more work.
This is also a good idea, and in my opinion worth a try, especially because it sounds like a reasonable effort to implement. I see the problem that probably many values will be reused in many records, but I can't yet predict how this will influence the outcome. On the other hand, I really like that it makes assuming a distribution unnecessary, so we would keep our method as general as possible; jackpot if it works.
After the NGFS workshop preparation call, and preliminary discussion with @NiklasSchwind and Carl, we'll want to do GEV fitting inside the 21-year window instead of calculating the climatic average. This would require some validation and possible variations such as detrending the data inside the window, to avoid artificially distorting the GEV distribution.