
Add prototype of class structure #109 (Draft)

wants to merge 28 commits into main
Conversation

@znicholls (Collaborator)

  • Closes #xxx
  • Tests added
  • Passes isort . && black . && flake8
  • Fully documented, including CHANGELOG.rst

@znicholls (Collaborator Author)

@mathause this is roughly what I was thinking (I didn't get around to actually writing tests, so that will have to wait for another day). There's a lot of boilerplate code around the actual calibration. If we can move to some sensible classes, it should become much clearer what is actually doing calibration and what is just being copied around because there isn't enough utility code available (e.g. this loop

for run in np.arange(nr_runs):

and this loop

for run in np.arange(nr_runs):

are the same idea, but that's really hard to see at the moment). That should make it much easier to see how to scale things, how to add new things, and where utility code is needed.

@codecov-commenter commented Oct 21, 2021

Codecov Report

Attention: 10 lines in your changes are missing coverage. Please review.

Comparison is base (89a9c20) 87.88% compared to head (8f19cd5) 88.80%.

Files Patch % Lines
mesmer/prototype/calibrate.py 89.36% 5 Missing ⚠️
mesmer/prototype/calibrate_multiple.py 95.57% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #109      +/-   ##
==========================================
+ Coverage   87.88%   88.80%   +0.91%     
==========================================
  Files          40       42       +2     
  Lines        1742     1902     +160     
==========================================
+ Hits         1531     1689     +158     
- Misses        211      213       +2     
Flag Coverage Δ
unittests 88.80% <93.75%> (+0.91%) ⬆️

Flags with carried forward coverage won't be shown.


znicholls requested a review from mathause, October 24, 2021 08:47
@znicholls (Collaborator Author) commented Oct 24, 2021

@mathause I implemented an actual test that passes. Have a look and see if it makes any sense to you. The idea is that almost everything in the legacy implementation is IO and reshaping. If we can get our classes set up properly, we can hopefully make it easy to see how the regression is actually done using the _regress_single_point method (or whatever name we end up with). I think some of the wrapping could be done better, but the test at least gives us a starting point to see what we're replacing and to make sure the answer stays the same.
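For illustration, a minimal sketch of the kind of structure being described (the class name and the calibrate signature here are hypothetical; only _regress_single_point is taken from the comment above):

import numpy as np
import sklearn.linear_model


class LinearRegressionCalibrator:  # hypothetical name
    @staticmethod
    def _regress_single_point(target_point, predictors):
        # fit an ordinary least-squares regression for one gridpoint;
        # predictors has shape (n_predictors, n_samples)
        reg = sklearn.linear_model.LinearRegression().fit(
            predictors.T, target_point.reshape(-1, 1)
        )
        return np.concatenate([reg.intercept_, *reg.coef_])

    def calibrate(self, target, predictors):
        # loop over gridpoints; everything else is just reshaping
        return np.stack(
            [
                self._regress_single_point(target[i, :], predictors)
                for i in range(target.shape[0])
            ]
        )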

@mathause (Member)

Thanks a lot for getting this started & implementing an example - that helps me understand your idea! Some preliminary comments after a quick look.

  • Your MesmerCalibrateTargetPredictor defines the interface (calibrate) but all other methods are @staticmethod - so there is not much advantage to having a class. The call could just as well be linear_regression.calibrate(...).
  • What we should think about is what the signatures of our functions should look like. Once we know this, the rest is easy ;-) I'll try to do this later.
  • E.g. which of these should it be:
LinearRegression().calibrate(esm_tas, predictors={...})
LinearRegression(esm_tas).calibrate(predictors={...})
LinearRegression(predictors={...}).calibrate(esm_tas)
  • If we define LinearRegression().calibrate() should we also define LinearRegression().emulate()? I see why we would want to separate them. Yet, keeping them together would enforce implementing both (see the sketch after this list).

  • It had not occurred to me to use an outer join of hist and proj as a single DataArray. It looks elegant but also a bit wasteful (in my analysis pipeline I just concatenate them, so I'm not really allowed to say anything ;-)).
  • How we do this will very much depend on the internal data structure (#106).

  • Do I understand your idea correctly: you flatten your input arrays and then loop over each gridpoint?
  • Could we use xr.apply_ufunc instead?
    • I think this could simplify some of the logic, both internally and for users: if there is a vectorized function in xarray, let's use it; if not, users write a function that works on one gridpoint and xr.apply_ufunc takes care of the rest. Admittedly, xr.apply_ufunc is not trivial either.
    • Or is there something I am missing?
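One way to keep calibrate() and emulate() together and enforce implementing both (the last point in the first list above) would be an abstract base class. A minimal sketch with hypothetical names, not how the prototype is currently structured:

import abc


class MesmerCalibrateEmulate(abc.ABC):  # hypothetical base class name
    """Force subclasses (e.g. LinearRegression) to implement both steps."""

    @abc.abstractmethod
    def calibrate(self, esm_tas, predictors):
        """Derive the parameters from the target and the predictors."""

    @abc.abstractmethod
    def emulate(self, predictors):
        """Generate emulations from the calibrated parameters."""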

@mathause (Member)

I played with the code for a bit and understand it a bit better (& can now answer many of my questions ;-)). I see now that you construct a stacked_coord over scenario x time and only then loop over the points.

Here is what the flattened arrays look like:

<xarray.DataArray (gridpoint: 2, stacked_coord: 7)>
array([...])
Coordinates:
  * gridpoint      (gridpoint) int64 0 1
    lat            (gridpoint) int64 -60 60
    lon            (gridpoint) int64 120 240
  * stacked_coord  (stacked_coord) MultiIndex
  - scenario       (stacked_coord) object 'hist' 'hist' ... 'ssp126' 'ssp126'
  - time           (stacked_coord) int64 1850 1950 2014 2015 2050 2100 2300

<xarray.DataArray 'emulator_tas' (stacked_coord: 7, predictor: 4)>
array([[...]])
Coordinates:
  * stacked_coord  (stacked_coord) MultiIndex
  - scenario       (stacked_coord) object 'hist' 'hist' ... 'ssp126' 'ssp126'
  - time           (stacked_coord) int64 1850 1950 2014 2015 2050 2100 2300
  * predictor      (predictor) MultiIndex
  - variable       (predictor) object 'emulator_tas' ... 'global_variability'

Thus, you could only call apply_ufunc after flattening the arrays in some way. I tried out how this last step could be implemented with apply_ufunc; for this I use your target_flattened and predictors_flattened. Looks more complicated than your solution...

import numpy as np
import sklearn.linear_model
import xarray as xr


def _regress_single_group(target_point, predictor, weights=None):

    # this is the method that actually does the regression
    args = [predictor.T, target_point.reshape(-1, 1)]
    if weights is not None:
        args.append(weights)
    reg = sklearn.linear_model.LinearRegression().fit(*args)
    a = np.concatenate([reg.intercept_, *reg.coef_])

    return a


xr.apply_ufunc(
    _regress_single_group,
    target_flattened,
    predictors_flattened,
    input_core_dims=[["stacked_coord"], ["predictor", "stacked_coord"]],
    output_core_dims=(("pred",),),
    vectorize=True,
)
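For reference, the stacked layout shown above can be produced with xarray's stack; a minimal sketch with made-up toy data (the prototype may construct target_flattened differently):

import numpy as np
import xarray as xr

# hypothetical toy data with dims (gridpoint, scenario, time); entries where a
# scenario has no data for a given year (e.g. hist after 2014) are set to NaN
tas = xr.DataArray(
    np.random.randn(2, 2, 7),
    dims=("gridpoint", "scenario", "time"),
    coords={
        "scenario": ["hist", "ssp126"],
        "time": [1850, 1950, 2014, 2015, 2050, 2100, 2300],
    },
)
tas = tas.where(
    ((tas.scenario == "hist") & (tas.time <= 2014))
    | ((tas.scenario == "ssp126") & (tas.time >= 2015))
)

# stack scenario x time into one dimension and drop the missing combinations,
# giving the (gridpoint, stacked_coord) layout shown above
target_flattened = tas.stack(stacked_coord=("scenario", "time"))
target_flattened = target_flattened.dropna("stacked_coord")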

@znicholls (Collaborator Author)

I think you've got it!

target_point.reshape(-1, 1)

Maybe a comment on this would help e.g. "Make data a flat array with two dimensions so sklearn behaves"

Looks more complicated than your solution...

A little, but the performance improvements are probably worth it!

@znicholls (Collaborator Author)

I see now that you construct a stacked_coord over scenario x time and only then loop over the points.

Yep, although I'm not sure whether this is the smartest way, or whether there should be an extra layer: should the layer I've just written assume that things are already stacked, with an extra layer on top that handles the stacking for the user, or should we just use what we have here? I am tempted to add the extra layer because I think it will give us greater control as we add new features.

@znicholls (Collaborator Author)

  • If we define LinearRegression().calibrate() should we also define LinearRegression().emulate()? I see why we would want to separate them. Yet, keeping them together would enforce implementing both.

This is true. I think it's something to think about once we have a few more pieces in place. At this point we're at such a low level that we shouldn't worry about emulation just yet, because the process for how calibration and emulation fit together is a bit complicated (you have to calibrate multiple different models, and then make sure they join together properly, before you can actually make emulations).

@znicholls (Collaborator Author)

@mathause I just pushed an attempt to also do the global variability calibration (we can always just pick the commits of interest once we decide on a good direction). It was a good learning exercise, but it's not completely clear to me how we can make a coherent structure out of all this. One to discuss this morning.

@znicholls (Collaborator Author) commented Oct 26, 2021

The notes I made on where we landed with train_lv (so we don't lose them and have to work it out again):

  1. calculating distances between points and the corresponding Gaspari-Cohn correlation functions (sketched below)
  2. an autoregression for each gridpoint (weighting each scenario, choices about ensemble members)
  3. calculating the localization radius, the empirical covariance matrix, and the localized empirical covariance matrix
  4. calculating innovations (roughly, realisations drawn from the spatial covariance matrix, i.e. the random element of the draws)
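As an illustration of step 1, a minimal sketch of the Gaspari-Cohn correlation function (the standard fifth-order piecewise rational taper from Gaspari & Cohn, 1999; whether train_lv uses exactly this form is an assumption):

import numpy as np


def gaspari_cohn(r):
    """Gaspari-Cohn correlation as a function of r = distance / localization radius."""
    r = np.abs(np.asarray(r, dtype=float))
    out = np.zeros_like(r)

    # inner branch: 0 <= r < 1
    inner = r < 1
    ri = r[inner]
    out[inner] = (
        -(1 / 4) * ri**5 + (1 / 2) * ri**4 + (5 / 8) * ri**3 - (5 / 3) * ri**2 + 1
    )

    # outer branch: 1 <= r < 2 (the correlation is exactly zero beyond r = 2)
    outer = (r >= 1) & (r < 2)
    ro = r[outer]
    out[outer] = (
        (1 / 12) * ro**5
        - (1 / 2) * ro**4
        + (5 / 8) * ro**3
        + (5 / 3) * ro**2
        - 5 * ro
        + 4
        - (2 / 3) / ro
    )

    return out

For example, gaspari_cohn(np.array([0.0, 1.0, 2.5])) gives approximately [1.0, 0.208, 0.0]: the correlation tapers smoothly to zero at twice the localization radius.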

@mathause (Member)

I am not saying it's a good idea but if you want to get rid of duplicated loops you can use yield:

def _loop(target, scenario_level, ensemble_member_level):
    # yield the values of each ensemble member of each scenario,
    # so the nested loop only has to be written once
    for _, scenario_vals in target.groupby(scenario_level):
        for _, em_vals in scenario_vals.groupby(ensemble_member_level):
            yield em_vals


def _select_auto_regressive_process_order(...):

    for em_vals in _loop(target, scenario_level, ensemble_member_level):
        em_orders = AutoRegression1DOrderSelection().calibrate(
            em_vals, maxlag=maxlag, ic=ic
        )

@znicholls (Collaborator Author)

I am not saying it's a good idea but if you want to get rid of duplicated loops you can use yield

Nice, I tidied things up a bit. The reimplementation of training local variability is still to go; let's see whether that happens this week or not.

mathause mentioned this pull request, Oct 27, 2021
@mathause (Member)

I am not sure where to add this comment so I add it here.

I think one thing that bugs me is that we need stacked coords of time x scenario because of the different times of hist and proj. If all had the same time vector we could do a nice 3D array... gridpoint x time x scenario.

@znicholls (Collaborator Author) commented Oct 27, 2021

I think one thing that bugs me is that we need stacked coords of time x scenario because of the different times of hist and proj. If all had the same time vector we could do a nice 3D array... gridpoint x time x scenario.

Yes, I don't love this either, but I don't have a solution given that the scenarios have different numbers of time points...

Options I've thought about (and sadly none have jumped out as great):

  • concatenate history onto every scenario (cons: duplicates data, which can be memory intensive, and also gives history extra weight / forces us to drop the duplicated history internally, i.e. we end up doing the same operation anyway)
  • force the user to do the flattening first (cons: seems against the spirit of 'just working' with CMIP6)
  • calibrate one scenario at a time (I don't think this works, because the regressions need everything at once)

@leabeusch (Collaborator)

Based on this comment from @mathause in #106 (comment), I guess you've moved beyond @znicholls' original option proposals, but I nevertheless want to stress that I'm very much against option 3, i.e.,

calibrate one scenario at a time (I don't think this works, because the regressions need everything at once)

As @znicholls already points out himself, this really goes against the general idea of MESMER: it is calibrated on a broad range of scenarios simultaneously to ensure that the resulting parameters can successfully emulate a broad range of scenarios too, i.e. what we analysed in the GMD paper. How good or bad a single-scenario approach would be of course depends on which scenario is used for calibration, and so on. But please don't kill MESMER's capability to be trained on multiple scenarios at once in this whole refactoring exercise. 😅

@znicholls (Collaborator Author)

Lessons learnt so far from this re-write:

  1. Keeping calibration and emulation next to each other would make it much easier to understand what is going on with each part of the model/bit of code
  2. Ensuring that methods/functions are no more than 10 lines (with very few exceptions) would make it much easier to see what is going on. Functions with more than 10 lines carry too much context, which makes it hard to focus on what is actually happening.
  3. Using long, descriptive names also makes a massive difference. Abbreviations make it really hard to keep track of what is going on (and, when combined with long functions, make it almost impossible).
  4. Doing a re-write alongside the original writer would probably be much simpler as they know the theory (I had to do a lot of googling to work out what I was actually looking at)

I have train local variability left to sort out, then I'll close this PR and start with some much smaller steps. I think doing this has given me enough experience to have a sense of how to start with a new structure.

@leabeusch (Collaborator)

Doing a re-write alongside the original writer would probably be much simpler as they know the theory (I had to do a lot of googling to work out what I was actually looking at)

Haven't properly been keeping up with all the refactoring progress that has been going on lately, but in case I could be of help with this point, I assume you'd let me know? ^^'

mathause mentioned this pull request, Jun 10, 2022
mathause mentioned this pull request, Sep 25, 2023