Categorical Array #8463
Comments
Thanks for opening your first issue here at xarray! Be sure to follow the issue template! |
@pydata/xarray Ilan's question seems interesting and potentially impactful if we can support his use case, but I don't know enough about categorical arrays to weigh in. Can anyone provide guidance? |
It would be great to reuse the Arrow Dictionary concept here somehow. |
@rabernat / @TomNicholas thanks for the replies :) @rabernat What concept here are you referring to? |
Alternatively, could you write an array-api compliant container instead? This would wrap a (potentially lazy) numpy-like array of "codes" with We could "decode" netcdf Enum dtypes to this new container and also allow converting from CF-style "Flag Variables". |
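For readers unfamiliar with the CF-style flag variables mentioned above: the CF conventions describe them with a flag_values attribute and a space-separated flag_meanings attribute. Below is a minimal sketch of turning those attributes into a code-to-label mapping and decoding integer codes with it. The helper names and the sample attributes are made up for illustration and are not part of xarray.

```python
def flags_to_categories(attrs):
    """Build {code: label} from CF-style flag attributes (hypothetical helper)."""
    values = list(attrs["flag_values"])
    # CF stores the meanings as one whitespace-separated string.
    meanings = attrs["flag_meanings"].split()
    if len(values) != len(meanings):
        raise ValueError("flag_values and flag_meanings differ in length")
    return dict(zip(values, meanings))


def decode(codes, categories):
    """Map integer codes to their labels."""
    return [categories[code] for code in codes]


# Sample attributes, invented for this sketch.
attrs = {"flag_values": [0, 1], "flag_meanings": "clear cloudy"}
categories = flags_to_categories(attrs)
labels = decode([1, 0, 1], categories)
print(categories, labels)
```

A netCDF Enum could feed the same kind of mapping, which is why a single container for both sources seems plausible, at least conceptually.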
Supporting both netcdf enum and pandas.Categorical seem sensible to me, although I don't know if it will be easy to do so via a single array wrapper. On a related note, I'm wondering if we shouldn't start thinking about how to generalize the approach based on
I guess we will need to address this if we want to keep Xarray up to date with Pandas switching backends from Numpy to Arrow? The question about how to work with Arrow arrays in Xarray has also been asked multiple times (AFAIK in Xarray geospatial extensions). |
Yes I will try to do that. If I run into any roadbumps, I will post here.
I do think this would be good. I think what I have makes sense, but making it more explicit/public to end-users would be cool. |
This is not the place for this but it is related: how do I encode the necessary information for encoding/decoding? From what I understand, this belongs in In the docs, it seems that |
Relatedly, how strictly should the information in
categories = zarr.open_array('unique_categories')
encoding = {
    ...
    'enum_dict': {
        'from': list(range(len(categories))),
        'to': categories
    }
}
instead of
encoding = {
    ...
    'enum_dict': dict(zip(zarr.open_array('unique_categories'), list(range(len(categories)))))
}
which is what netcdf has (i.e., an on-disk dictionary, instead of a stored array) and would force the user to read in all the categories data as well as the code data. But this might be too big of an ask. |
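To make the difference between the two layouts concrete, here is a small sketch that converts between them. A plain Python list stands in for the zarr array of unique categories, and the two helper functions are hypothetical names used only for this illustration.

```python
# Stand-in for zarr.open_array('unique_categories'); values are made up.
categories = ["clear", "cloudy", "rain"]

# Layout 1: two parallel sequences; "to" could stay as an on-disk array.
array_layout = {"from": list(range(len(categories))), "to": categories}

# Layout 2: a fully materialized name -> code dictionary (netCDF-enum style).
dict_layout = dict(zip(categories, range(len(categories))))


def array_to_dict(layout):
    """Materialize the parallel-sequence layout as a name -> code dict."""
    return dict(zip(layout["to"], layout["from"]))


def dict_to_array(mapping):
    """Turn a name -> code dict back into parallel sequences, ordered by code."""
    items = sorted(mapping.items(), key=lambda kv: kv[1])
    return {
        "from": [code for _, code in items],
        "to": [label for label, _ in items],
    }


print(array_to_dict(array_layout) == dict_layout)
print(dict_to_array(dict_layout) == array_layout)
```

The two carry the same information; the practical difference raised above is that building the dictionary requires reading every category up front, while the parallel-sequence layout does not have to.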
So I have started working on a branch, but have come across a bit of an issue. When the
import netCDF4 as nc
import xarray as xr
import numpy as np
ds = nc.Dataset("mre.nc", "w", format="NETCDF4")
cloud_type_enum = ds.createEnumType(int, "cloud_type", {"clear": 0, "cloudy": 1})
ds.createDimension("time", size=10)
x = np.arange(10)
ds.createVariable("x", np.int32, dimensions=("time",))
ds.variables["x"][:] = x
# {'cloud_type': <class 'netCDF4._netCDF4.EnumType'>: name = 'cloud_type', numpy dtype = int64, fields/values ={'clear': 0, 'cloudy': 1}}
ds.createVariable("cloud", cloud_type_enum, dimensions=("time",))
ds["cloud"][:] = [1, 0, 1, 0, 1, 0, 1, 0, 0, 1]
ds.close()
# -- Open dataset with xarray
xr_ds = xr.open_dataset("./mre.nc")
with my branch lacks any reference to
This is something that was quite surprising, actually - there were 4 or more reconstructions of the
In terms of investigating, there seems to be something going on here where
TL;DR I will keep digging but if anyone has any quick suggestions of how to keep a reference to the underlying object
import numpy as np
import xarray as xr
codes = np.array([0, 1, 2, 1, 0])
categories = {0: 'foo', 1: 'jazz', 2: 'bar'}
cat_arr = xr.coding.variables.CategoricalArray(codes=codes, categories=categories)
v = xr.Variable(("time",), cat_arr, fastpath=True)
ds = xr.Dataset({'cloud': v})
ds['cloud']._variable._data
# CategoricalArray(codes=..., categories={0: 'foo', 1: 'jazz', 2: 'bar'}, ordered=False)
So perhaps it would make sense to release this as a general feature first instead of focusing on NetCDF. Perhaps that is a good place to start for a PR |
@ilan-gold Sorry, just saw this popping up in my inbox. There is work to read/write netCDF4.EnumType in #8147. This PR is close to being finished. The proposed solution is to add metadata to encoding["dtype"]:
if isinstance(var.datatype, netCDF4.EnumType):
    encoding["dtype"] = np.dtype(
        data.dtype,
        metadata={
            "enum": var.datatype.enum_dict,
            "enum_name": var.datatype.name,
        },
    )
This follows what |
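As a standalone illustration of the dtype-metadata idea in the snippet above, the sketch below attaches an enum mapping to a NumPy dtype and reads it back to label some codes. The enum values are invented for this example, and it only shows the NumPy mechanism, not how #8147 wires it into xarray's encoding.

```python
import numpy as np

# Invented enum for illustration; in the PR it would come from var.datatype.
enum_dict = {"clear": 0, "cloudy": 1}

# np.dtype accepts a metadata mapping that travels with the dtype object.
dtype = np.dtype(
    np.int64,
    metadata={"enum": enum_dict, "enum_name": "cloud_type"},
)

codes = np.array([1, 0, 1], dtype=dtype)

# Invert name -> value so codes can be turned back into labels.
label_of = {value: name for name, value in dtype.metadata["enum"].items()}
labels = [label_of[int(code)] for code in codes]
print(dtype.metadata["enum_name"], labels)
```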
For background, these various arrays will all get lost on data load, which I think is what you're discovering. I would instead work on a |
@dcherian That sounds like a plan to me. I will check that out. |
@kmuehlbauer Thanks for this, that is good to see. |
So I have spent some time looking into this. @kmuehlbauer Thanks for directing me to that PR, not sure how I missed it. This PR is interesting and contains some good stuff. It's definitely a step in the right direction. @dcherian I don't think an
Separately, it still seems strange to me that one can write code that declares a
import numpy as np
import xarray as xr
codes = np.array([0, 1, 2, 1, 0])
categories = {0: 'foo', 1: 'jazz', 2: 'bar'}
cat_arr = xr.coding.variables.CategoricalArray(codes=codes, categories=categories)
v = xr.Variable(("time",), cat_arr)
ds = xr.Dataset({'cloud': v})
# Some simple relevant operations, there are probably others but these are what I came up with quickly to test
ds['cloud']._variable._data # This is a CategoricalArray
ds.where(ds.cloud == 1) # all nans as expected even though this is the correct code
ds.where(ds.cloud == "jazz") # correct masking
ds.where(ds.cloud.isin(["jazz", "bar"])) # more correct masking, but more complicated
ds.groupby("cloud") # three groups as expected
The current status of
Would a PR that tests this array and its behavior with
Concretely, I could see a PR containing
Does this seem feasible? |
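The comparison behaviour listed in the snippet above (no match against a raw code, correct masking against a label, isin over several labels) can be mimicked with plain NumPy. The class below is a hypothetical stand-in written for this illustration; it is not the CategoricalArray from the branch and is not an xarray API.

```python
import numpy as np


class TinyCategorical:
    """Hypothetical toy: integer codes inside, labels in the comparison API."""

    def __init__(self, codes, categories):
        self.codes = np.asarray(codes)
        self.categories = dict(categories)  # code -> label
        self._code_of = {label: code for code, label in self.categories.items()}

    def __eq__(self, label):
        # Compare against labels only; an unknown label (or a raw code)
        # matches nothing.
        code = self._code_of.get(label)
        if code is None:
            return np.zeros(self.codes.shape, dtype=bool)
        return self.codes == code

    def isin(self, labels):
        wanted = [self._code_of[l] for l in labels if l in self._code_of]
        return np.isin(self.codes, wanted)

    def decode(self):
        """Materialize the labels as an object array."""
        return np.array([self.categories[c] for c in self.codes.tolist()], dtype=object)


arr = TinyCategorical([0, 1, 2, 1, 0], {0: "foo", 1: "jazz", 2: "bar"})
print(arr == "jazz")
print(arr == 1)
print(arr.isin(["jazz", "bar"]))
```

All comparisons run on the integer codes, so the labels are only materialized when decode is called.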
P.S Another step along this path could be adding #5287 for pandas extension array support, which would give us Categorical support because pandas' categorical array is an extension array. But I'm not sure this really changes the roadmap I've laid out beyond changing the return type of our categorical array wrapper. It's tough to say without seeing #5287 implemented. |
AFAICT you're subclassing from If you satisfy this duck array check, you won't need to do that: Lines 260 to 270 in 33d51c8
I agree that Array API compliance probably makes no sense. Would it make sense to support I agree that writing a NEP-18 wrapper for Pandas Extension arrays may be quite valuable but it doesn. It would allow |
We discussed this at a meeting today.
|
Thanks @dcherian . I will look into these. I think 1+3 from your above points go together, and form one concrete path forward. Point 4 is a separate path forward from what I can tell. I think I will look into the fourth point closely then because it seems to offer the best way forward IMO in terms of generalization, unless this is not an issue (#5287) that is seen as important. If point 4 also goes belly up, we will need to fall back to a combination of 1+3. And I am strongly opposed to point 2 for a variety of reasons, especially now that I have looked into it. |
Is your feature request related to a problem?
We are looking to improve compatibility between AnnData and xarray (see scverse/anndata#744), and so categoricals are naturally on our roadmap. Thus, I think some sort of standard-use categoricals array would be desirable. It seems something similar has come up with netCDF, although my knowledge is limited so this issue may be more distinct than I am aware. So what comes of this issue may kill two birds with one stone, or it may work towards some common solution that can at least help both use-cases (AnnData and netCDF ENUM).
Describe the solution you'd like
The goal would be a standard-use categorical data type xarray container of some sort. I'm not sure what form this can take.
We have something functional here that inherits from ExplicitlyIndexedNDArrayMixin and returns pandas.CategoricalDtype. So let's say this implementation would be at least a conceptual starting point to work from (it also seems not dissimilar to what is done here for new CF types).
Some issues:
xarray categorical array should be (i.e., numpy with the categories applied, pandas, something custom etc.). So I'm not sure if using pandas.CategoricalDtype type is acceptable as I do in the linked implementation. Relatedly....
pandas.CategoricalDtype really helps with the already existing CF Enum need if you want to have the return type be some sort of numpy array (although again, not sure about the return type). As I understand it, though, the whole point of categoricals is to use integers as the base type and then only show "strings" outwardly i.e., printing, the API for equality operations, accessors etc., while the internals are based on integers. So I'm not really sure numpy is even an option here. Maybe we roll our own solution?
Variable? I don't think so, but I am just a beginner here 😄 )
It seems you may want, in addition to the array container, some sort of i/o functionality for this feature (so maybe some on-disk specification?).
Describe alternatives you've considered
I think there is some route via VariableCoder as hinted here i.e., using encode/decode. This would probably be more general purpose as we could encode directly to other data types if using pandas is not desirable. Maybe this would be a way to support both netCDF and returning a pandas.CategoricalDtype (again, not sure what the netCDF return type should be for ENUM).
Additional context
So just for reference, the current behavior of to_xarray with pandas.CategoricalDtype is object dtype from numpy:
And as stated in the netCDF issue, for that use-case, the information about ENUM is lost (from what I can read).
Apologies if I'm missing something here! Feedback welcome! Sorry if this is a bit chaotic, just trying to cover my bases.
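The point made in the issue body, that categoricals keep integers internally and only show labels outwardly, can be seen directly in pandas. This is a small sketch with made-up values; it does not involve xarray and says nothing about what to_xarray returns.

```python
import pandas as pd

# Made-up labels; pandas infers the categories and assigns integer codes.
s = pd.Series(["foo", "jazz", "bar", "jazz", "foo"], dtype="category")

codes = s.cat.codes.tolist()            # the integer storage
categories = s.cat.categories.tolist()  # the labels the codes point to
mask = (s == "jazz").tolist()           # comparisons are written against labels

print(s.dtype)
print(codes, categories, mask)
```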