Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Important Maintenance: improve gather_dimensions #715

Open
rob-taggart opened this issue Oct 28, 2024 · 4 comments
Open

Important Maintenance: improve gather_dimensions #715

rob-taggart opened this issue Oct 28, 2024 · 4 comments

Comments

@rob-taggart
Copy link
Collaborator

A number of improvements requested:

  1. The gather_dimensions docstring needs updating. Current arguments are iterables, but the docstring indicates they are xarray objects.
  2. Improve handling of weights argument. At the moment if weights=None then we cannot pass on weights.dims to gather_dimensions. So we need to use the following logic prior to every gather_dimensions call: if weights is not None then pass on weights.dims else pass on None. It would be good if this logic was handled by gather_dimensions itself. Probably the easiest solution is to allow gather_dimensions to accept either iterables or xarray objects as inputs (at least for weights if not for fcst or obs).
  3. gather_dimensions has a kwarg score_specific_fcst_dims. It would be good to have an analogous kwarg score_specific_obs_dims. I have at least one use case for it in a new metric I am writing. I don't think this concept needs to be expanded beyond fcsts and obs.
  4. Code for many (most? all?) legacy metrics don't model calling gather_dimensions early as part of arg checks. It would be good to fix this up so that good coding is modelled well for new metrics. This issue has already been noted: Ensure gather_dimensions occurs early in score calculation APIs #299

@tennlee , @nicholasloveday

@nikeethr
Copy link
Collaborator

All the improvements sound good.

  1. is valid, but it will never be perfect. Optional types a) necessitate branches b) can be reduced to their base value only if there is some sort of unifying operation that works for both null and "some value" consistently. Either that or the stack permanently branches to either handling null or handling some value. Otherwise, its unavoidable to have these comparisons here and there. I'm not familiar enough with the code to comment on it further though.

@nicholasloveday
Copy link
Collaborator

Regarding 2. Currently before calling gather_dimensions, to get it to work when weights could be None or an xarray object, you need to have something like the following (as per the crps_for_ensemble function).

weights_dims = None
if weights is not None:
    weights_dims = weights.dims

I agree with Rob that instead of repeating this logic everywhere, it should just be implemented once within gather_dimensions.

Note that most metrics aren't even using the weights_dims arg in gather_dimensions when they should be. I think this was an oversight from back when gather_dimensions and gather_dimensions2 were merged into the one function.

@tennlee
Copy link
Collaborator

tennlee commented Oct 30, 2024

Regarding (2), I disagree that it should just be in gather_dimensions, but of course I'm open to discussing the pros and cons. This code isn't user-facing code, so the focus should be on testatibility and repeatability. It is substantially easier to write well-targetted unit tests if the weights_dims are passed in separately rather than being extracted from the xarray. This is because the tests for the actual weights logic can then be written using strings and simple objects that are easy to construct, without any risk of side-effects. While it's slightly longer calling code as a result, it's not user-facing calling code so I think it's reasonable to put the responsibility inside the score rather than in gather_dimensions. The code in the score could possible be simplified further to a shorter pattern. I could look into it, but I don't think it's so hard to have that code appear. One approach could be to write a simple function called extract_weights_dim containing the if statement, then all the calling code could pass extract_weights_dim(xarray) saving two lines of code.

@nikeethr
Copy link
Collaborator

nikeethr commented Oct 30, 2024

I agree with Rob that instead of repeating this logic everywhere, it should just be implemented once within gather_dimensions.

Again not familiar with the code, so someone else may answer this - the question is: Does the code need to retain the information that weights is an optional type OR does everything past gather_dimensions only depend on the results of gather_dimensions (i.e. weights is consumed and implicit in the result)?

If it's the former, branching is unavoidable since that is by design of the type - otherwise don't use an optional at all. If its the latter then all metrics should use gather_dimensions consistently; in order for this to be guaranteed gather_dimensions should be handled much higher level prior to dispatching calls to metrics; e.g. as some kind of decorator or otherwise a boxed structure/mixin that encompasses this paradigm i.e. (gather [immutable] -> compute [metric dependent]). But the latter will only work if gather_dimensions can guarantee type purity. (i.e output type/structure is consistent for all metrics and doesn't re-introduce branching.)

The question isn't about if the checks should be in gather_dimensions, it should, but imo without sufficient abstraction, it may also need to be outside it.

@tennlee tennlee changed the title improve gather_dimensions Internal Improvement: improve gather_dimensions Nov 5, 2024
@tennlee tennlee changed the title Internal Improvement: improve gather_dimensions Important Maintenance: improve gather_dimensions Nov 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants