Add prototype of class structure #109
base: main
Conversation
@mathause this is sort of what I was thinking (I didn't get up to actually writing tests, so that will have to wait for another day). There's a lot of boilerplate code around the actual calibration. If we can move to some sane classes then it might become much clearer what is actually doing calibration and what is just being copied around because there's not enough utility code available, e.g. the loops at mesmer/mesmer/calibrate_mesmer/train_gv.py line 174 and mesmer/mesmer/calibrate_mesmer/train_lv.py line 239 (in d73e8f5).
Codecov Report

Attention:

```
@@            Coverage Diff             @@
##             main     #109      +/-   ##
==========================================
+ Coverage   87.88%   88.80%   +0.91%
==========================================
  Files          40       42       +2
  Lines        1742     1902     +160
==========================================
+ Hits         1531     1689     +158
- Misses        211      213       +2
```

Flags with carried forward coverage won't be shown.
@mathause I implemented an actual test that passes. Have a look and see if it makes any sense to you. The idea is that almost everything in the legacy implementation is io and reshaping. If we can get our classes set up properly, we can hopefully make it easy to see how the regression is actually done.
Thanks a lot for getting this started & implementing an example - that helps to understand your idea! Some preliminary comments after a quick look.
For example, where should the data enter?

```python
LinearRegression().calibrate(esm_tas, predictors={...})
LinearRegression(esm_tas).calibrate(predictors={...})
LinearRegression(predictors={...}).calibrate(esm_tas)
```
I played with the code for a bit and understand it a bit better (& can now answer many of my questions ;-)). I see now how you construct the flattened arrays. Here is how they look:

```
<xarray.DataArray (gridpoint: 2, stacked_coord: 7)>
array([...])
Coordinates:
  * gridpoint      (gridpoint) int64 0 1
    lat            (gridpoint) int64 -60 60
    lon            (gridpoint) int64 120 240
  * stacked_coord  (stacked_coord) MultiIndex
  - scenario       (stacked_coord) object 'hist' 'hist' ... 'ssp126' 'ssp126'
  - time           (stacked_coord) int64 1850 1950 2014 2015 2050 2100 2300
```

```
<xarray.DataArray 'emulator_tas' (stacked_coord: 7, predictor: 4)>
array([[...)
Coordinates:
  * stacked_coord  (stacked_coord) MultiIndex
  - scenario       (stacked_coord) object 'hist' 'hist' ... 'ssp126' 'ssp126'
  - time           (stacked_coord) int64 1850 1950 2014 2015 2050 2100 2300
  * predictor      (predictor) MultiIndex
  - variable       (predictor) object 'emulator_tas' ... 'global_variability'
```

Thus, you could do the regression per gridpoint with `apply_ufunc`:

```python
import numpy as np
import sklearn.linear_model
import xarray as xr


def _regress_single_group(target_point, predictor, weights=None):
    # this is the method that actually does the regression
    args = [predictor.T, target_point.reshape(-1, 1)]
    if weights is not None:
        args.append(weights)
    reg = sklearn.linear_model.LinearRegression().fit(*args)
    # intercept first, then one coefficient per predictor
    a = np.concatenate([reg.intercept_, *reg.coef_])
    return a


xr.apply_ufunc(
    _regress_single_group,
    target_flattened,
    predictors_flattened,
    input_core_dims=[["stacked_coord"], ["predictor", "stacked_coord"]],
    output_core_dims=(("pred",),),
    vectorize=True,
)
```
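As a sanity check, the `apply_ufunc` pattern can be run end-to-end on made-up data (the sizes, values, and the result name `params` are all invented for illustration):

```python
import numpy as np
import sklearn.linear_model
import xarray as xr


def _regress_single_group(target_point, predictor, weights=None):
    # fit one regression for a single gridpoint
    args = [predictor.T, target_point.reshape(-1, 1)]
    if weights is not None:
        args.append(weights)
    reg = sklearn.linear_model.LinearRegression().fit(*args)
    return np.concatenate([reg.intercept_, *reg.coef_])


# 2 gridpoints, 5 stacked samples, 1 predictor; target = predictor + gridpoint offset
target_flattened = xr.DataArray(
    np.arange(10.0).reshape(2, 5), dims=("gridpoint", "stacked_coord")
)
predictors_flattened = xr.DataArray(
    np.arange(5.0).reshape(1, 5), dims=("predictor", "stacked_coord")
)

params = xr.apply_ufunc(
    _regress_single_group,
    target_flattened,
    predictors_flattened,
    input_core_dims=[["stacked_coord"], ["predictor", "stacked_coord"]],
    output_core_dims=(("pred",),),
    vectorize=True,
)
# params has dims ("gridpoint", "pred"): the intercept followed by one slope
# per predictor, here [[0, 1], [5, 1]]
```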
I think you've got it!
Maybe a comment on this would help e.g. "Make data a flat array with two dimensions so sklearn behaves"
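The flattening that comment refers to can be sketched with `xarray`'s `stack` (the data and dimension names below are invented for illustration):

```python
import numpy as np
import xarray as xr

# toy (gridpoint, scenario, time) cube
da = xr.DataArray(
    np.arange(12.0).reshape(2, 2, 3),
    dims=("gridpoint", "scenario", "time"),
    coords={"scenario": ["hist", "ssp126"], "time": [1850, 1950, 2014]},
)

# collapse (scenario, time) into one sample axis so sklearn sees a 2-D array
flat = da.stack(stacked_coord=("scenario", "time"))
# flat has dims ("gridpoint", "stacked_coord") with a MultiIndex of size 6
```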
A little, but the performance improvements are probably worth it!
Yep, I'm not sure if this is the smartest way though, or if there should be an extra layer, i.e. should the layer I've just written assume that things are already stacked, and then we add an extra layer which handles the stacking for the user, or should we use what we have here? I am tempted to add an extra layer because I think it will provide us with greater control as we add new features.
This is true. I think it's something to think about once we have a few more pieces in place. I think at this point we're so low level that we shouldn't worry about emulation just yet because the process for how calibration and emulation fit together is a bit complicated (you have to calibrate multiple different models, and then make sure they join together properly, before you can actually make emulations).
@mathause I just pushed an attempt to also do the global variability calibration (we can always just pick the commits of interest once we decide on a good direction). It was a good learning exercise, but it's not completely clear to me how we can make a coherent structure out of all this. One to discuss this morning. |
The notes I made on where we landed with train_lv (so we don't lose them and have to work it out again):
I am not saying it's a good idea, but if you want to get rid of the duplicated loops you can use a generator:

```python
def _loop(target, group):
    for _, scenario_vals in target.groupby(group):
        for _, em_vals in scenario_vals.groupby(ensemble_member_level):
            yield em_vals


def _select_auto_regressive_process_order(...):
    for em_vals in _loop(target, ensemble_member_level):
        em_orders = AutoRegression1DOrderSelection().calibrate(
            em_vals, maxlag=maxlag, ic=ic
        )
```
Nice, I tidied up a bit. Still the reimplementation of training local variability to go; let's see if that happens this week or not.
I am not sure where to add this comment so I add it here. I think one thing that bugs me is that we need stacked coords of
Yes, I don't love this either, but I don't have a solution given that the scenarios have different numbers of time points... Options I've thought about (and sadly none have jumped out as great):
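One way to see why ragged scenarios push you toward stacked coordinates: aligning scenarios of different lengths pads with NaN, which stacking plus `dropna` then removes again (toy values, invented names):

```python
import numpy as np
import pandas as pd
import xarray as xr

# hist has 3 time points, ssp126 only 2
hist = xr.DataArray([1.0, 2.0, 3.0], dims="time", coords={"time": [1850, 1950, 2014]})
ssp126 = xr.DataArray([4.0, 5.0], dims="time", coords={"time": [2015, 2100]})

# outer-join concatenation pads the missing times with NaN
both = xr.concat([hist, ssp126], dim=pd.Index(["hist", "ssp126"], name="scenario"))

# stacking and dropping the padded NaNs recovers one flat sample axis of length 5
flat = both.stack(stacked_coord=("scenario", "time")).dropna("stacked_coord")
```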
Based on this comment of @mathause's, #106 (comment), I guess you've moved beyond @znicholls' original option proposals, but I just want to stress nevertheless that I'm very against option 3, i.e.,

as @znicholls already points out himself: this really goes against the general idea of MESMER, which is to be calibrated on a broad range of scenarios simultaneously to ensure that the resulting parameters are able to successfully emulate a broad range of scenarios too, i.e., what we analysed in the GMD paper... How good or bad the single-scenario approach would be of course always depends on what scenario is used for calibration, and so on... but please don't kill MESMER's overall capability to be trained on multiple scenarios at once in this whole refactoring exercise. 😅
Lessons learnt so far from this re-write:
I have train local variability left to sort out, then I'll close this PR and start with some much smaller steps. I think doing this has given me enough experience to have a sense of how to start with a new structure. |
Haven't properly been keeping up with all the refactoring progress that has been going on lately, but in case I could be of help with this point, I assume you'd let me know? ^^' |