internal data structure #106
Comments
@znicholls same comment as for #105 but probably even more important to get your input on this at some point! |
I think it depends on the intended access pattern. In my head, the access pattern for MESMER is always (and if this assumption is wrong, ignore everything I've written):
Given this, calibrating multiple models is an embarrassingly parallel problem: you can calibrate to multiple CMIP models at once, and you don't need information from one calibration to inform another. Hence I would build the internal calibration data structure for the one model/calibration case (so things like needing to store the model as a dimension disappear, and we can just use scenario/experiment as the dimension). That should help simplify the scope.
However, we will then need a second (and maybe third) 'outer-layer' structure which can handle the case where we want to calibrate multiple things at once. That outer layer will need to handle things like parallelisation, knowing which functions to call, how to pass parameters etc. Having multiple layers means that we separate the concerns of each layer, which should make implementing them easier. The other layer we'd probably want is an output storage layer (i.e. some way to store the outputs from multiple calibrations yet still be able to search them). I suspect that would be fairly easy to build because the storage layer would only have one concern (so doing some slightly sneaky things is probably fine).
If all the above is correct, then for our internal layer I would use either a
For the outer layer (i.e. the one that handles multiple model calibrations) then
For the storage layer I would go with
    class MesmerOutputStore:
        def __init__(self):
            # load a 2D table which stores simple metadata info about the data
            self._meta = ...
            # load from a netCDF file which actually has all the calibrations
            self._calibrations = ...

        def get_calibration(self, **kwargs):
            """Get calibrations"""
            selected = self._meta.copy()
            # setting it up this way makes it super easy to then go and filter/search
            for k, v in kwargs.items():
                selected = selected[selected[k] == v]
            # storing metadata in the table means that xarray doesn't have to store the metadata
            # so we don't run into the multi-dimensional sparsity issue
            return self._calibrations.sel(id=selected["id"])
We've actually got this kind of wrapper in https://github.com/openscm/scmdata. It basically has a meta table (https://github.com/openscm/scmdata/blob/029707c57427608026b77f1e263947d8b2a06fac/src/scmdata/run.py#L951) and then a pandas dataframe (because we only deal with timeseries data). Filtering can then be done really fast (https://github.com/openscm/scmdata/blob/029707c57427608026b77f1e263947d8b2a06fac/src/scmdata/run.py#L1074) because you only search meta before deciding what data to return. Storing the meta as a 2D table means that things don't explode even if you have super sparse data. We wouldn't want all of ScmRun here, but the filtering stuff could be really handy, and we could just put a different data model with it. |
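The "search the metadata table first, then fetch the data" idea from the storage-layer comment above can be illustrated without pandas or xarray. The following is only a minimal sketch under stated assumptions: `MetaIndexedStore`, `add`, and `get` are hypothetical names, the metadata table is a plain list of dicts, and the calibration payloads are arbitrary Python objects rather than netCDF-backed arrays.

```python
class MetaIndexedStore:
    """Hypothetical stand-in for a storage layer: filter metadata, then fetch data."""

    def __init__(self):
        self._meta = []   # flat table: one dict (row) per stored calibration
        self._data = {}   # payloads keyed by the row id

    def add(self, data, **meta):
        """Store a payload together with its flat metadata and return its id."""
        row_id = len(self._meta)
        self._meta.append({"id": row_id, **meta})
        self._data[row_id] = data
        return row_id

    def get(self, **filters):
        """Return all payloads whose metadata matches every filter."""
        # Only the small metadata table is scanned; payloads are looked up last.
        rows = [
            row for row in self._meta
            if all(row.get(key) == value for key, value in filters.items())
        ]
        return [self._data[row["id"]] for row in rows]


store = MetaIndexedStore()
store.add({"slope": 1.2}, model="CESM", scen="ssp126")
store.add({"slope": 1.5}, model="CESM", scen="ssp585")
store.add({"slope": 0.9}, model="MIROC", scen="ssp585")

cesm = store.get(model="CESM")
```

Because the metadata lives in a flat table, adding a model that lacks some scenario costs one missing row rather than an empty slab in a multi-dimensional array.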
I agree with this access pattern, it is clear to me.
If by
For the calibration, I slightly prefer the huge DataArray to the DataList. I consider the former easier to navigate and use than a list. In terms of RAM, it can be handled: by my estimate, it is about 10 GB if we keep only the gridpoint axis, not the maps. Besides, by not
IF we go for a huge DataArray, I think that we can simply load an empty array using
IF we stick to DataLists for each layer, I would definitely like such a wrapper to gather the correct items more easily. And actually, such a structure may be closer to what scm does, which is another plus. |
I refer to #109 (comment) from @znicholls (sorry my comments are all over the place). I thought a bit more about how we can avoid concatenating hist + proj and having very sparse arrays and also how a
    dl = DataList(
        (da, {"scen": "historical", "model": "CESM"}),  # plus any further metadata
        (da, {"scen": "ssp126", "model": "CESM"}),
        (da, {"scen": "ssp585", "model": "CESM"}),
    )
i.e. one entry per scenario and model, where da has dims
1. Calculate anomalies
    def calculate_anomalies(dl: DataList, period=slice("1850", "1900"), dim="time", how="individual"):
        out = DataList()
        # loop all simulations
        for ds, meta in dl:
            # find the historical simulation
            hist = dl.select(model=meta["model"], scen="historical")
            if how == "individual":
                # given this makes an inner join it will work
                ds = ds - hist.sel({dim: period}).mean(dim)
            elif how == "all":
                ds = ds - hist.sel({dim: period}).mean()
            else:
                raise ValueError(f"unknown how: {how!r}")
            out.append(ds, meta)
        return out
We loop over all elements in dl, select the corresponding historical simulation and subtract it. For
2. Flatten and concatenate the arrays
    def expand_dims(ds, meta):
        # a length-one list adds scen as a dimension with a coordinate label
        return ds.expand_dims(scen=[meta["scen"]])

    def stack_except(ds, dim):
        dim = [dim] if isinstance(dim, str) else dim
        # keep the original dimension order so the stacking is reproducible
        dims = [d for d in ds.dims if d not in dim]
        return ds.stack(stacked=dims)

    dl = dl.map(expand_dims, pass_meta=True)
    dl = dl.map(stack_except, dim="cell")
    dl = dl.concat(along="stacked")
We add |
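The anomaly step described above (look up each simulation's historical run by metadata and subtract its reference-period mean) can be tried out on plain Python data. This is a sketch under stated assumptions, not the proposed xarray implementation: each "array" is a dict mapping year to a global-mean value, the list is a plain list of (data, meta) tuples, and the helper `select` is hypothetical.

```python
from statistics import mean


def select(dl, **filters):
    """Return all (data, meta) entries whose metadata matches every filter."""
    return [(d, m) for d, m in dl if all(m.get(k) == v for k, v in filters.items())]


def calculate_anomalies(dl, start, end):
    """Subtract each model's historical mean over [start, end] from all of its runs."""
    out = []
    for data, meta in dl:
        # find the historical simulation that belongs to the same model
        matches = select(dl, model=meta["model"], scen="historical")
        if len(matches) != 1:
            raise ValueError(f"expected exactly one historical run for {meta['model']}")
        hist = matches[0][0]
        baseline = mean(v for year, v in hist.items() if start <= year <= end)
        out.append(({year: v - baseline for year, v in data.items()}, meta))
    return out


dl = [
    ({1850: 14.0, 1851: 15.0}, {"model": "CESM", "scen": "historical"}),
    ({2015: 16.5, 2016: 17.0}, {"model": "CESM", "scen": "ssp585"}),
]
anoms = calculate_anomalies(dl, 1850, 1900)
```

Historical and projection runs stay separate entries throughout, so nothing has to be concatenated along time before the baseline is removed.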
I would say this looks absolutely beautiful! Very nice indeed and a great solution to a nasty problem (yes, maybe we find some reason to have to adjust things in future but I think the benefits of trying such a great solution now outweigh the risks of having to adjust in future). |
Yes, aligning hist with projections is a clean solution that gives us the common members for ESM x scenarios. However, MESMER may use two variables, e.g. tas and hfds. Then later in the code, when using the da, we must still check for the members common to tas and hfds, since they may have different sets. |
Thanks 😊 I do still have a lot of open questions....
Yes this is indeed an issue. I see three options.
IIRC @leabeusch would prefer to have one DataList for all variables. My preference would be to have one per variable. What I imagine is something of the sort:
    tas = mesmer.io.cmip6_ng.load(variable="tas")
    hfds = mesmer.io.cmip6_ng.load(variable="hfds")
    tas, hfds = dl.align(tas, hfds)
*I think we could allow passing strings to
    dl.map("sel", time=slice("1850", "1900"))
    # would be turned into
    for data, meta in dl:
        getattr(data, "sel")(time=slice("1850", "1900"))
**The best I can usually come up with is to name the package the same as the class, which is no good, because then the following happens:
    import datalist as dl
    dl = dl.DataList(...)
    # aahrg
B.t.w. thanks for all your input! That helps a lot. |
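The string-dispatch idea for map mentioned above (turning `dl.map("sel", ...)` into a `getattr` call on each data object) could be sketched as follows. The function name `map_entries` is hypothetical, the list is a plain list of (data, meta) tuples, and strings stand in for xarray objects so that the example runs on the standard library alone.

```python
def map_entries(dl, func, *args, pass_meta=False, **kwargs):
    """Apply func to every data object in a list of (data, meta) tuples.

    func is either a callable or the name of a method of the data object.
    """
    out = []
    for data, meta in dl:
        if isinstance(func, str):
            # e.g. "sel" is turned into data.sel(*args, **kwargs)
            result = getattr(data, func)(*args, **kwargs)
        elif pass_meta:
            result = func(data, meta, *args, **kwargs)
        else:
            result = func(data, *args, **kwargs)
        out.append((result, meta))
    return out


dl = [("tas_cesm", {"model": "CESM"}), ("tas_miroc", {"model": "MIROC"})]

upper = map_entries(dl, "upper")  # method looked up by name
tagged = map_entries(dl, lambda d, m: f"{d}:{m['model']}", pass_meta=True)
```

A misspelled method name surfaces as an `AttributeError` on the first element, which is arguably an acceptable failure mode for a convenience shortcut.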
My two cents on the questions
Maximise code reuse as much as possible. I'm very happy to do some refactoring in scmdata so that we can use the filtering capability here without picking up the rest of scmdata's assumptions.
The only one I know is scmdata or pyam (they both have a similar idea for filtering). I would go for scmdata as I have more control. Maybe someone else knows of other options for this sort of filtering wrapping though.
I like it and if there are no other similar packages being worked on I think it's a good start. You could make clear that it's xarray specific by calling it something like
I think
    from datalist import DataList
    dl = DataList(...)
A different option would be
Start within mesmer, then split out once we've got the interface stable enough.
I would lean towards this (all data in one place). I would write
    # assume dl contains both tas and hfds
    # this would work fine
    dl.filter(variable="tas").align(dl.filter(variable="hfds"), "variable")
    # this would raise an error because there is more than one variable in dl
    dl.filter(variable="tas").align(dl, "variable")
    # -> ValueError("self must have only one variable in the dimension being ignored during the alignment")
|
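One way to read the alignment behaviour described above is as an inner join on all metadata except the ignored key, with a guard that each side holds a single value of that key. The sketch below makes exactly those assumptions; `align` here is a hypothetical free function on plain lists of (data, meta) tuples, and it checks both inputs rather than only `self`.

```python
def align(left, right, ignore):
    """Keep entries of two (data, meta) lists whose metadata match, ignoring one key."""
    for name, dl in (("left", left), ("right", right)):
        if len({meta[ignore] for _, meta in dl}) > 1:
            raise ValueError(f"{name} must have only one value of {ignore!r}")

    def key(meta):
        # hashable fingerprint of the metadata without the ignored key
        return tuple(sorted((k, v) for k, v in meta.items() if k != ignore))

    common = {key(m) for _, m in left} & {key(m) for _, m in right}

    def keep(dl):
        return [(d, m) for d, m in dl if key(m) in common]

    return keep(left), keep(right)


tas = [
    ("tas_hist", {"model": "CESM", "scen": "historical", "variable": "tas"}),
    ("tas_585", {"model": "CESM", "scen": "ssp585", "variable": "tas"}),
]
hfds = [("hfds_hist", {"model": "CESM", "scen": "historical", "variable": "hfds"})]

tas_aligned, hfds_aligned = align(tas, hfds, "variable")
```

Here only the historical CESM run exists for both variables, so the ssp585 entry is dropped from the tas side.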
Actually, why not use the attributes of the DataArray? With that, we don't separate the data from its information, and whenever we need to align, concatenate, or do anything else, we can simply adapt the attributes of the new variable accordingly.
I also prefer having all data in a single file. Intuitively, I was thinking simply about something like this:
And inside |
I haven't understood everything going on here from a code perspective yet, so in case some of my statements don't make any sense, this could really just be because of a stupid misunderstanding -> sorry in advance for that. 😅 but I'm confident with some more time / explanations I'll catch up again eventually. 😄 I'll try to add a few things nevertheless already at this stage:
I just really want to keep the option for MESMER to go multivariate. Meaning: I'd like to be able to pass e.g., temperature and precip fields from a single ESM (but from various emission scenarios) at once to whatever forced response module (e.g., regression or sth fancier) & internal variability module (e.g., multivariate Gaussian after some transformations) I have available. Thus it seems weird to me to have those two variables in a different datalist. It would e.g., make more sense to have separate datalists per ESM rather than per variable to me. But maybe I have misunderstood what would need to be included in the same datalist to be able to achieve the functionality I describe?
|
I had a chat with Sarah on her mesmer temperature/ precipitation analysis. She currently organizes her data in one large
|
Yes, particularly how it would handle the very annoying history/scenario split |
But it's of course super convenient because you only need to schlep one data structure around. An alternative to a |
A pointer to cf_xarray and its accessor, which seems to wrap various methods and classes of xarray (as a potential inspiration for how to do this for a https://github.com/xarray-contrib/cf-xarray/blob/main/cf_xarray/accessor.py edit: forgot the link |
Some updates:
|
I had a play with data tree and really liked it. For the kind of data we have here, it seemed the right choice (at least for container, then you just build on top of it as needed). Looking at your links, putting some of the data list ideas onto data tree (or a class that wraps a data tree) seems the best choice to me... filefinder looks sick by the way, can't wait to use that. @mathause I am hoping to build some sort of CMIP data processing thing this year (we need it for MAGICC calibration). Would you be interested in collaborating or is finding time for this an ongoing dream that is never realised? |
@veni-vidi-vici-dormivi (Victoria) started playing with datatree and it indeed looks promising
😊
Yes, please keep me in the loop! I assume you have heard of jbusecke/xMIP? |
Yep. We want to be able to download our own data though so our use of it depends a bit on how tightly coupled it is to the pangeo data archives. |
NOTE: this issue is very much a draft at the moment.
We should give some thought to the internal data structure in mesmer. IMHO this is one of the more important but also difficult things to decide on. Generally the idea is to move to an xarray-based data structure. However, it is unclear how we want to carry metadata around (e.g. model name, experiment, etc.).
Current status
Currently a triply nested dictionary of numpy arrays. Something that roughly looks like:
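As a rough illustration of such a nested layout (a sketch only: the model and scenario names and values are made up, and plain lists stand in for the numpy arrays), the nesting order model, scenario, variable has to be known for every access, and visiting all arrays takes one loop per level:

```python
nested = {
    "CESM": {
        "ssp585": {"tas": [1.0, 2.0]},
        "ssp126": {"tas": [0.5, 0.8]},
    },
    "MIROC": {
        "ssp585": {"tas": [1.1, 2.3]},
    },
}

# Access requires knowing the nesting order: model -> scenario -> variable.
cesm_tas = nested["CESM"]["ssp585"]["tas"]

# Flattening into (data, meta) tuples needs one loop per nesting level.
flat = [
    (data, {"model": model, "scen": scen, "variable": var})
    for model, scens in nested.items()
    for scen, variables in scens.items()
    for var, data in variables.items()
]
```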
Ideas
1. Keep as is
Pros/ Cons
data["CESM"]["ssp585"]["tas"]
2. One huge DataArray
We could potentially create one enormous DataArray that encapsulates all necessary information as coordinates (and non-dimension coordinates). It could look something like this:
Pros/ Cons
3. DataList
DataList is a data structure inspired by ESMValTool that I used extensively for my analysis of the CMIP6 data for IPCC. It is used as a flat data container to store model data (e.g. xr.DataArray objects) and its metadata. It is a list of data, metadata tuples:
where ds is a data store (e.g. an xr.DataArray) and meta is a dict containing metadata, e.g. meta = {"model": "CESM", "exp": "ssp585", ...}. This allows us to (1) work with a convenient flat data structure (no nested loops), (2) store arbitrary data (e.g. xr.DataArray objects with different grids), (3) carry around metadata without having to alter the data store.
The DataList structure could be turned into a class, which would allow for a convenient interface. E.g. it could allow for things like
An implementation could look like (outside of mesmer):
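For orientation, a flat container of (data, meta) tuples with metadata-based selection might be sketched as below. This is a hypothetical minimal class written for this illustration, not the implementation referred to in the issue; the method names `select` and `map` follow the usage elsewhere in this thread, and plain lists stand in for xr.DataArray objects.

```python
class DataList:
    """Flat container of (data, meta) tuples (hypothetical minimal sketch)."""

    def __init__(self, *entries):
        self._entries = list(entries)

    def __iter__(self):
        return iter(self._entries)

    def __len__(self):
        return len(self._entries)

    def append(self, data, meta):
        self._entries.append((data, meta))

    def select(self, **filters):
        """Return a new DataList with the entries matching all metadata filters."""
        return DataList(
            *[(d, m) for d, m in self if all(m.get(k) == v for k, v in filters.items())]
        )

    def map(self, func, **kwargs):
        """Apply func to each data object; the metadata is carried along unchanged."""
        return DataList(*[(func(d, **kwargs), m) for d, m in self])


dl = DataList(
    ([1.0, 2.0], {"model": "CESM", "exp": "historical"}),
    ([3.0, 4.0], {"model": "CESM", "exp": "ssp585"}),
    ([2.5], {"model": "MIROC", "exp": "ssp585"}),
)

ssp585 = dl.select(exp="ssp585")
totals = ssp585.map(sum)
```

Since the data objects are opaque to the container, entries with different lengths or grids can sit side by side, which is the point of the flat structure.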
Pros/ Cons
4. DataTree
WIP implementation of a tree-like hierarchical data structure for xarray, see https://github.com/TomNicholas/datatree
Pros/ Cons
5. Any other ideas?