Dimension names as core array metadata #73

alimanfoo · 2020-05-21T07:48:36Z

Several domains make use of named dimensions, i.e., for a given array with N dimensions, each of those N dimensions is given a human-readable name.

Given the broad utility of this, should we include this within the core array metadata in the v3 protocol? E.g., add a dimensions property within the array metadata document, whose value should be a list of strings:

{
    "shape": [10000, 1000],
    "dimensions": ["space", "time"],
    "data_type": "<f8",
    "chunk_grid": {
        "type": "regular",
        "chunk_shape": [1000, 100]
    },
    "chunk_memory_layout": "C",
    "compressor": {
        "codec": "https://purl.org/zarr/spec/codec/gzip/1.0",
        "configuration": {
            "level": 1
        }
    },
    "fill_value": "NaN",
    "extensions": [],
    "attributes": {
        "foo": 42,
        "bar": "apples",
        "baz": [1, 2, 3, 4]
    }
}

One question this raises is how to handle the case where no names are provided, or only some dimensions are named but not others. I.e., dimension names should probably be optional.
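As a rough illustration of the optional case, a reader could treat a missing property as "all dimensions unnamed". This is only a sketch: the helper name `read_dimension_names` is hypothetical, the property name `dimensions` follows the example above, and using `None` (JSON `null`) for an individual unnamed dimension is an assumption rather than anything decided here.

```python
from typing import List, Optional


def read_dimension_names(metadata: dict) -> List[Optional[str]]:
    """Return one label per dimension; None marks an unnamed dimension."""
    ndim = len(metadata["shape"])
    names = metadata.get("dimensions")  # property name as proposed above
    if names is None:
        # No names provided at all: every dimension is unnamed.
        return [None] * ndim
    if len(names) != ndim:
        raise ValueError(f"expected {ndim} dimension names, got {len(names)}")
    return list(names)


named = {"shape": [10000, 1000], "dimensions": ["space", "time"]}
unnamed = {"shape": [10000, 1000]}
partly_named = {"shape": [10000, 1000], "dimensions": ["space", None]}

print(read_dimension_names(named))
print(read_dimension_names(unnamed))
print(read_dimension_names(partly_named))
```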

The alternative is that we leave this to the community to define a usage convention to store dimension names in the user attributes, e.g., similar to what xarray currently does using the "_ARRAY_DIMENSIONS" attribute name.

meggart · 2020-05-25T11:57:33Z

I would very much appreciate having an "official" way to define dimension names. Currently I mimic the xarray conventions in my Julia code but this feels a bit risky since these conventions are not properly versioned so if there is a change in the future in how these conventions are handled this could lead to unexpected bugs. So I don't mind if this is in the core protocol or in some extension as long as there is a clean way to find out programmatically after which convention dimension names are defined.

rabernat · 2020-06-03T20:06:34Z

I agree with this proposal.

It seems like we definitely want to synchronize this with whatever @DennisHeimbigner, @WardF, and the rest of the Unidata crew decide to do about dimension names.

DennisHeimbigner · 2020-06-03T20:51:29Z

This crosses a problem discussed in the meeting today. There is a strong feeling that the v3 spec should support asynchronous read and write to the degree possible. This is driven by cloud storage models. One consequence is that it should be possible for a process to directly create and write a variable without having to synchronize with any other process. However, it is unclear how this applies to shared dimensions. Should asynchronous creation of a named dimension by a process be allowed? =Dennis Heimbigner Unidata

alimanfoo · 2020-06-03T21:13:23Z

I would suggest that, if we support dimension names in the v3 spec, then they are simply string labels for the dimensions of an array. Nothing else is implied. I.e., if two arrays happen to use the same name for a particular dimension, then at the level of the v3 protocol, that does not imply anything. It could mean that the two arrays have a "shared dimension" in the netCDF sense, it could just be coincidence, at least as far as a vanilla implementation of the v3 protocol is concerned.

A library that supports the full netCDF data model might then choose to treat these dimension names as names for shared dimensions, that would be fine and up to the netCDF layer implementation to manage.

Hope that makes sense.

DennisHeimbigner · 2020-06-03T21:56:27Z

However, the dimension name and size must be stored in the metadata independent of any variable. So adding a dimension may interfere with asynchronicity. =Dennis Heimbigner Unidata

alimanfoo · 2020-06-04T09:43:41Z

> However, the dimension name and size must be stored in the metadata independent of any variable. So adding a dimension may interfere with asynchronicity.

I may need some help from @rabernat here, there's a few different "dimensions" to this problem (sorry for the very bad pun :-)

Note that in this proposal I am simply proposing a metadata property for giving names to the dimensions (axes) of an array. Perhaps the property should be called dimension_names to make that clear. In any case, there is no implication that these dimensions are shared with any other arrays.

E.g., with this feature I could create an array with shape (10, 5) and name the dimensions ("foo", "bar"). In the zarr protocol, it would be totally fine to create another array with shape (100, 5) and name the dimensions ("foo", "qux"). I.e., creating each of these arrays is an independent operation, and the names are just labels for the axes of the arrays, not necessarily shared.

I.e., a vanilla zarr implementation would just offer the ability to provide names for the dimensions (axes) of an array, and might show those names when providing a visual representation of the array, but that would be it.

Now, a higher-level library implementing the netCDF data model might choose to interpret these as names for shared dimensions, under certain circumstances. I.e., if two arrays within the same group both have the name "foo" for one of their dimensions, then assume they are referring to a shared dimension.
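To make the layering concrete, here is a minimal sketch of how a higher-level library might apply a shared-dimension reading on top of plain labels. The function names and the in-memory `group` dict are hypothetical; nothing here is part of the proposal, which attaches no meaning to repeated names.

```python
from collections import defaultdict


def shared_dimension_sizes(arrays: dict) -> dict:
    """Map each dimension name to the sorted sizes it takes within one group."""
    sizes = defaultdict(set)
    for meta in arrays.values():
        for name, size in zip(meta["dimensions"], meta["shape"]):
            sizes[name].add(size)
    return {name: sorted(values) for name, values in sizes.items()}


def conflicting_dimensions(arrays: dict) -> dict:
    """Names that cannot be shared dimensions because their sizes differ."""
    return {
        name: values
        for name, values in shared_dimension_sizes(arrays).items()
        if len(values) > 1
    }


# The two arrays from the example above: valid at the zarr level,
# but "foo" is inconsistent under a shared-dimension interpretation.
group = {
    "a": {"shape": [10, 5], "dimensions": ["foo", "bar"]},
    "b": {"shape": [100, 5], "dimensions": ["foo", "qux"]},
}
print(conflicting_dimensions(group))
```

A vanilla zarr implementation would never run such a check; it would only matter to a layer that opts into netCDF-style semantics.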

This is similar to what xarray does currently. The main difference is that xarray uses an attribute called _ARRAY_DIMENSIONS, whereas this proposal offers a standard metadata property called dimensions (or dimension_names) which might be used for that purpose. There is a slight difference though, in that xarray knows that the _ARRAY_DIMENSIONS attribute is always supposed to indicate names for shared dimensions. I.e., there is stronger semantics for _ARRAY_DIMENSIONS than for the proposed dimensions array metadata property.

Perhaps it would be easier to avoid potential confusion, and for zarr to not try to cross into the netCDF space, and rather allow that to be dealt with via a set of usage conventions that properly deal with the netCDF semantics, such as the xarray approach or the nczarr approach.

alimanfoo · 2020-06-04T09:49:10Z

> However, the dimension name and size must be stored in the metadata independent of any variable.

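Also noting that IIUC this is not necessarily true, e.g., the xarray approach (http://xarray.pydata.org/en/latest/internals.html#zarr-encoding-specification) does not separately store dimension names and sizes. This is different from the nczarr proposal (https://drive.google.com/file/d/1UUGcQMpWqKllMdRFCu97CoL7fB_GWXvg/view). Note that I have no opinion on which of these two approaches is best, just noting the difference.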

rabernat · 2020-06-04T18:40:00Z

> Note that in this proposal I am simply proposing a metadata property for giving names to the dimensions (axes) of an array. Perhaps the property should be called dimension_names to make that clear. In any case, there is no implication that these dimensions are shared with any other arrays.

👍 This is how I have been thinking of it. Rather than calling the axes 0, 1, 2, we can call them time, lat, lon. Additional extensions or applications could decide to interpret this in different ways, such as in the netCDF data model.

> However, the dimension name and size must be stored in the metadata independent of any variable.

I don't see why. The dimension size is determined by the shape of the array.

DennisHeimbigner · 2020-06-04T19:09:10Z

I am glad we have these kinds of discussions; I am to some degree captive of the historical development of netcdf and its assumptions. Does this interpretation seem reasonable WRT the xarray model?

1. The definition of a named dimension is distributed (an important word) to all of the variables which use it. There is no single centralized definition as in netcdf.
2. The costs for the xarray approach are:
   a. inconsistency between the distributed named dimension definitions is possible;
   b. the cost of storing the named dimension info in multiple variables.

The cost for (2b) seems very small and so is not a big issue. The (2a) case is no different than any other hidden data used in, say, netcdf. Presumably the inconsistencies can only occur if the dataset is modified outside of the library.

Since in netcdf, dimensions are scoped by groups, one would need to use the fully qualified names (FQNs) for named dimensions, e.g., /g1/g2/dim1. It would seem that some kind of search is needed to guarantee dimension name uniqueness. It potentially requires looking at all variables within the group part of the FQN of the new dimension to ensure that the name is unique. Does xarray do a similar search when a client defines a new dimension?

In any case, the distributed approach is attractive because it potentially allows asynchronous definition of dimensions if certain constraints can be met so that search can be avoided or minimized. Comments?

=Dennis Heimbigner Unidata

joshmoore · 2020-06-05T11:29:03Z

Thinking out loud somewhat, I wonder if restricting dimension_names to [a-zA-Z0-9_] for the moment wouldn't be prudent. That would allow nice Python referencing and would leave room for a future extension to pathed (/) or dotted (.) nomenclature for looking up named dimensions.
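For what it's worth, that restriction is cheap to check. The sketch below uses a hypothetical helper name and additionally assumes that a name must be non-empty, which the suggestion above does not state.

```python
import re

# Character set suggested above; "+" also rejects the empty string (an assumption).
_DIMENSION_NAME = re.compile(r"[a-zA-Z0-9_]+")


def is_valid_dimension_name(name: str) -> bool:
    """True if the whole name consists only of letters, digits, and underscores."""
    return _DIMENSION_NAME.fullmatch(name) is not None


for candidate in ["time", "lat_0", "/g1/dim1", "a.b", ""]:
    print(repr(candidate), is_valid_dimension_name(candidate))
```

Names containing "/" or "." are rejected, which is what would keep those characters free for a later path-style or dotted lookup.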

Carreau · 2020-09-25T16:33:55Z

Update RFC to say this is something we'd like input on.

jbms · 2022-02-08T18:51:33Z

I would also like to see built-in support for dimension names, and would also suggest that, for simplicity, the zarr specification itself make no assumptions about "shared dimensions" between multiple arrays.

Aside from possible constraints on the allowed characters, I think that empty labels should be allowed (and indicate an unnamed dimension), and non-empty labels must be distinct. Not specifying the dimension names at all would be equivalent to specifying all empty strings as the dimension labels.
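A minimal sketch of those rules, with a hypothetical function name and error messages, might look like this:

```python
def normalize_dimension_names(shape, names=None):
    """Apply the rules above: "" means unnamed, non-empty labels must be distinct."""
    ndim = len(shape)
    if names is None:
        # Omitting the names is equivalent to giving all empty labels.
        return [""] * ndim
    if len(names) != ndim:
        raise ValueError(f"expected {ndim} labels, got {len(names)}")
    non_empty = [name for name in names if name != ""]
    if len(set(non_empty)) != len(non_empty):
        raise ValueError("non-empty dimension labels must be distinct")
    return list(names)


print(normalize_dimension_names([10, 5]))
print(normalize_dimension_names([10, 5, 3], ["x", "", ""]))
```

Note that repeated empty labels are accepted while a repeated non-empty label such as `["x", "x"]` raises an error.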

d-v-b · 2022-02-08T19:06:41Z

What's the advantage of allowing empty labels?

jbms · 2022-02-08T19:13:14Z

Given that dimension names would be optional, it seems natural to me to allow that optionality on a per-dimension basis. E.g. maybe you are computing some sort of multiplication or partial reduction between two zarr arrays A and B, where A has labels and B does not. If the result has some dimensions corresponding to dimensions of A and some dimensions corresponding to dimensions of B, we would like to preserve the dimension labels from A without having to invent fake labels for B.

However, I don't feel too strongly about allowing empty labels.

DennisHeimbigner · 2022-02-08T19:50:46Z

I assume that this would operate like _ARRAY_DIMENSIONS
in that the size of the named dimension is determined from the corresponding
position in the "shape" key. This of course can lead to inconsistency in the size of a
named dimension. Not surprisingly, I prefer the netcdf approach where the name and size
are declared separately from any variable so that inconsistency is not possible.

DennisHeimbigner · 2022-02-08T19:53:42Z

Another point. Unless you require all dimension names to be "global",
then you will need to use fully qualified names (fqn) for dimension names.
So one might have something like this.

"dimensions": ["/dim1", "/grp1/grp2/dim2"]

DennisHeimbigner · 2022-02-08T19:55:42Z

WRT anonymous dimensions. One approach is to merge the shape and dimension keys
and make dimension names be JSON strings and anonymous dimensions be integers.
This avoids empty labels.
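One possible reading of that merged key, sketched here under assumptions: strings refer to named dimensions whose sizes are declared elsewhere (as in netcdf), and integers are the sizes of anonymous dimensions. The function name and the `declared` table are hypothetical.

```python
def resolve_shape(entries, declared):
    """Turn a merged shape/dimension list into a plain list of sizes."""
    shape = []
    for entry in entries:
        if isinstance(entry, str):
            # Named dimension: look up its separately declared size.
            shape.append(declared[entry])
        elif isinstance(entry, int) and not isinstance(entry, bool):
            # Anonymous dimension: the integer is the size itself.
            shape.append(entry)
        else:
            raise TypeError(f"unsupported entry: {entry!r}")
    return shape


declared = {"lat": 5}  # hypothetical separate dimension declarations
print(resolve_shape(["lat", 4], declared))
```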

jbms · 2022-02-08T22:24:22Z

If we allow anonymous dimensions, then I would say they indeed have to be specified by their index rather than name, but of course named dimensions could also be specified by index.

And in many contexts, e.g. for display to a user, I agree that it would be very natural to display just the index in place of the name for anonymous dimensions.

Although the dimension names could be quoted to avoid ambiguity, it might also be good to disallow dimension names that consist only of digits 0-9.

However, I'm unclear exactly what you are proposing as far as having dimension names be either strings or integers. Would that just be a concern of a specific implementation, rather than the zarr spec itself?

Also as far as referencing dimensions by path, as far as I can tell nothing in the current spec requires referencing dimensions; I suppose you are thinking from the context of an extension like ome-zarr or a version of netcdf built on top of zarr.

While I agree that the netcdf data model makes a lot of sense in many cases, I'm not sure how well the unique dimension names constraint / consistent size for every named dimension constraint fits with all intended uses of zarr v3. I guess users could always work around that issue by putting each zarr array in a separate zarr repository, but users might wish to get other data organizational advantages of having multiple arrays in a single zarr repository without constraining themselves to the netcdf data model.

DennisHeimbigner · 2022-02-08T22:38:00Z

> Although the dimension names could be quoted to avoid ambiguity, it might also be good to disallow dimension names that consist only of digits 0-9.

That is the reason I made the string vs number distinction. And the fact that netcdf allows
dimension names that are all digits.

DennisHeimbigner · 2022-02-08T22:46:34Z

> Also as far as referencing dimensions by path, as far as I can tell nothing in the current spec requires referencing dimensions; I suppose you are thinking from the context of an extension like ome-zarr or a version of netcdf built on top of zarr.

I do not understand this comment.
I was referring to a case where we have a variable v1
defined in a group /g1 (i.e., just below the root group)
something like this:

"shape": [1, 17],
"dimensions": ["dim1", "dim17"]

Suppose we have another variable v2 in group /g2.

"shape": [17],
"dimensions": ["dim17"]

How do we know that the two dim17's refer to the same dimension?
I would prefer that "dim17" be replaced with "/g1/dim17"
so that it is clear that the same dimension is being used.

Of course, this assumes one wants the shared dimension name semantics
to matter, but that, of course, is the whole point of named dimensions.
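A rough sketch of the disambiguation step, assuming (this is not settled anywhere in the thread) that an unqualified name resolves against the group containing the variable; the helper name is hypothetical:

```python
import posixpath


def qualify(group: str, name: str) -> str:
    """Return a fully qualified dimension name for a variable defined in `group`."""
    if name.startswith("/"):
        # Already an FQN; just normalize it.
        return posixpath.normpath(name)
    # Assumption: relative names are scoped to the variable's own group.
    return posixpath.join(group, name)


# v1 lives in /g1 and uses the plain name; v2 lives in /g2.
print(qualify("/g1", "dim17"))
print(qualify("/g2", "/g1/dim17"))  # v2 explicitly shares /g1's dimension
print(qualify("/g2", "dim17"))      # v2 with a plain name: a different dimension
```

Under this reading, the two dim17's are the same dimension only when v2 spells out "/g1/dim17".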

jbms · 2022-02-08T23:01:17Z

It seems like just using a unique dimension name might be more natural than specifying a dimension by reference to another array, but I am not sure.

Certainly netcdf shared dimension semantics are applicable in some applications, but I think there are other applications where dimension names are useful but the constraint that all dimensions with a given name should have the same extent is not useful. For example:

- A multiscale dataset, where you have arrays storing the data at multiple scales. Here the dimension names could indicate the correspondence between the dimensions of the arrays at different scales, but the extents will of course be different.
- A large collection of images, with dimensions x, y, c, and a convolutional neural network model with input dimensions x, y, c. All of the images may have different x, y dimensions but you want to apply the neural network model to them, and be sure you aren't accidentally transposing x and y.
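The second case can be sketched without any notion of shared extents: only the names are compared. The function name below is hypothetical, and the returned axis order is the kind of value one would pass to a transpose routine.

```python
def permutation_to(array_dims, target_dims):
    """Axis order that rearranges `array_dims` into `target_dims`, matching by name."""
    if sorted(array_dims) != sorted(target_dims):
        raise ValueError(f"dimension names differ: {array_dims} vs {target_dims}")
    return [array_dims.index(name) for name in target_dims]


model_input_dims = ["x", "y", "c"]

# An image stored as (y, x, c) needs its first two axes swapped.
print(permutation_to(["y", "x", "c"], model_input_dims))
# An image already stored as (x, y, c) is left alone.
print(permutation_to(["x", "y", "c"], model_input_dims))
```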

DennisHeimbigner · 2022-02-08T23:05:47Z

> It seems like just using a unique dimension name might be more natural than specifying a dimension by reference to another array, but I am not sure.

In a sense I agree which is why netcdf declares dimensions separately from variables.
But it appears that this community would rather declare the dimensions as part of the
variable declaration.

DennisHeimbigner · 2022-02-08T23:09:48Z

Your examples still prove my point. You are assuming that the dimensions with the same
name are semantically the same. The issue is being able to use the same simple name (e.g. "x")
in multiple places with different extents. But you still need to disambiguate those
multiple declarations and using the fqn is IMHO the best way to do that.

DennisHeimbigner · 2022-02-09T21:16:07Z

I think that coordinate variables are important in this discussion.

Suppose we have the following:

dimensions:  lat=5, lon=4;
variables:
float temp(lat,lon);
float lat(lat);
float lon(lon);

The temp variable represents the temperature at a given latitude and longitude.

The longitude values are, say, -1deg. thru 2deg.
and the latitude values are, say, -0.5deg. thru 1.5deg.
However the lat dimension runs from 0 thru 4 and lon runs from 0 thru 3.
The so-called coordinate variables map the raw indices to
the actual lat and lon values of the coordinates.
So we have:

lat = -0.5, 0.0, 0.5, 1.0, 1.5 ;
lon = -1.0, 0.0, 1.0, 2.0 ;

This concept of coordinate variables is extremely useful but it relies on
the use of shared names to indicate shared semantics.
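As a small sketch of that lookup, using the values above (the function name and the dict layout are only illustrative):

```python
# Coordinate variables: each is named after the dimension it describes.
coords = {
    "lat": [-0.5, 0.0, 0.5, 1.0, 1.5],
    "lon": [-1.0, 0.0, 1.0, 2.0],
}
temp_dims = ("lat", "lon")  # dimensions of the temp variable


def index_to_coords(index, dims, coords):
    """Map raw indices of a variable to coordinate values via shared names."""
    return {name: coords[name][i] for name, i in zip(dims, index)}


print(index_to_coords((0, 0), temp_dims, coords))
print(index_to_coords((4, 3), temp_dims, coords))
```

The lookup `coords[name]` only works because temp and the coordinate variable use the same dimension name.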

jbms · 2022-02-09T21:21:24Z

I agree that shared names to indicate "shared semantics" in some sense is the point of named dimensions, but I think exactly what those "shared semantics" are depends on the application.

If zarr were to use the netcdf data model, where shared name means shared domain, then how do you propose to deal with the use case of a single zarr repository where the root group contains a collection of arrays named sample0, sample1, ..., sampleN? Each of these samples is a 3-d xyc image, but they don't all have the same x and y dimensions. How would we assign dimension names in this case?

DennisHeimbigner · 2022-02-09T21:29:56Z

In netcdf, you put the various dimensions in different groups (possibly with the relevant
variables).

jstriebel · 2022-11-24T15:37:53Z

Crosslinking #149 (comment)

jstriebel · 2023-02-02T14:17:04Z

Resolved via #162.

alimanfoo added the core-protocol-v3.0 label (issue relates to the core protocol version 3.0 spec) May 21, 2020

Carreau added the todo pre-rfc label Sep 25, 2020

jbms added a commit to jbms/zarr-specs that referenced this issue May 31, 2022

Revise how the domain of an array is specified

80a7ef9

This adds support for dimension names (zarr-developers#73) and non-zero origins (zarr-developers#122).

jbms mentioned this issue May 31, 2022

Revise how the domain of an array is specified #144

Closed

jstriebel mentioned this issue Nov 16, 2022

Add dimension_names array metadata field #162

Merged

jstriebel added this to ZEP1 Nov 16, 2022

jstriebel moved this to In Review in ZEP1 Nov 16, 2022

This was referenced Nov 18, 2022

ZEP0001 - Core v3.0 spec for review #149

Closed

TBD: Extensions to develop with v3 spec. #89

Closed

jstriebel closed this as completed Feb 2, 2023

github-project-automation bot moved this from In Review to Done in ZEP1 Feb 2, 2023

jbms mentioned this issue Mar 8, 2023

Drop dimension_names from v3 #219

Closed
