Alignment with xarray #744
Comments
cc @jpivarski (who may be interested in the Awkward Array connection) |
Supporting Awkward Arrays would likely prevent full reimplementation of anndata with xarray alone, since xarrays can't contain Awkward Arrays or vice-versa. Even the "tree-like data structure" on xarray's road map (experimentally implemented by Datatree) is not quite the same thing, as Datatrees are more like nested groups in an HDF file (as seen in these docs): a small number of nested objects, which can each be large. Awkward Arrays represent a large number of nested objects. The comparison is like "AoS vs SoA" (just an analogy). This comment, pydata/xarray#4118 (comment), seems to be spelling out the difference, and I'm following up with the author on scikit-hep/awkward#1396. As a side note, it looks like there could be some benefit to xarrays containing Awkward Arrays (and not the other way around). That's something I should probably ask the xarray developers someday. Datatree is extending Dataset in a bigger way than it would probably take to wrap an Awkward Array. Unless/until we actually do that, an implementation of anndata with xarray would need some way to handle the fact that Awkward Arrays are not included within xarray's data model. |
My mental model here was a 1d Random thought: storing an arrow |
Can you put Arrow data in xarray? Arrow is interchangeable with Awkward Array, so having Arrow can be seen as equivalent to having Awkward. The ak.to_arrow and ak.from_arrow functions are usually zero-copy, too. If that's already a possibility, it's more than part way there. The main way in which Awkward Arrays differ from all the other array types is that Awkward Arrays do not have A single ak.Array can be split apart into a small number of buffers of different sizes, each of which can be an xr.DataArray, along with some metadata to put them back again. That was the idea for using Awkward Array in Zarr: one ak.Array becomes one Zarr group of datasets. Since xarray Datatree is like Zarr and HDF5 groups, one ak.Array could be decomposed into a Datatree using ak.to_buffers and reconstituted using ak.from_buffers. |
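For anyone skimming the thread, here is a rough sketch of the "split into a few flat buffers plus metadata" idea from the comment above. It is plain Python and does not use Awkward or xarray; `split_ragged` and `join_ragged` are hypothetical helper names made up for illustration, and the `form` dictionary is only a stand-in for whatever metadata a real implementation would carry. Each flat buffer is the kind of thing that could, in principle, be stored as one regular array in a group.

```python
from typing import Any


def split_ragged(rows: list[list[float]]) -> tuple[dict[str, Any], dict[str, list]]:
    """Split a list of variable-length lists into flat, rectangular buffers.

    Hypothetical helper: illustrates the idea only, not Awkward's actual API.
    """
    offsets = [0]          # offsets[i]:offsets[i + 1] delimits row i in `content`
    content: list[float] = []
    for row in rows:
        content.extend(row)
        offsets.append(len(content))
    # Small piece of metadata needed to put the buffers back together.
    form = {"kind": "list-of-lists", "length": len(rows)}
    return form, {"offsets": offsets, "content": content}


def join_ragged(form: dict[str, Any], buffers: dict[str, list]) -> list[list[float]]:
    """Rebuild the nested structure from the metadata and the flat buffers."""
    offsets, content = buffers["offsets"], buffers["content"]
    return [content[offsets[i]:offsets[i + 1]] for i in range(form["length"])]


ragged = [[1.1, 2.2, 3.3], [], [4.4, 5.5]]
form, buffers = split_ragged(ragged)
restored = join_ragged(form, buffers)

print(buffers["offsets"])   # [0, 3, 3, 5]
print(buffers["content"])   # [1.1, 2.2, 3.3, 4.4, 5.5]
print(restored == ragged)   # True
```

The two buffers have different lengths (4 and 5 here), which is why they would have to live as separate arrays inside a group rather than as variables sharing one dimension.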
Bit of a tangent, but it might be worthwhile to write up a Data Array API issue about the Awkward Array use case. |
We already talked about it here: data-apis/consortium-feedback#6. It sounded pretty clear that Awkward (and by extension, Arrow) are out of scope for Data Array API, and it's understandable that the scope would have to cut off somewhere. |
If anyone is looking for more confusion, I'd like to mention scipp, and in particular its Binned data feature. This is somewhat similar to a |
@SimonHeybrock, thanks for pointing that out! From my initial look, the API for scipp looks quite nice. It does seem to cater to some use-cases we're looking at more than the more geospatial focus of xarray. However, I really like that xarray can hold various types of python arrays. For instance, sparse arrays are very important to us – and I'd expect dask will become important as well. |
@ivirshup The two things you point out (holding other Python arrays, dask support) are indeed somewhat sore points for us. We would like to do both, but currently have no funding to do so. We have serialization compatible with dask, so a number of the dask multi-processing APIs can be used, but we do not have an implementation of the dask collections interface, i.e., we currently do not support chunking and operations in the style of xarray's dask support. |
Another potential ask here: not reading the |
👋 Hi folks! Xarray dev here. Just wanted to drop a note to say that we'd be happy to help move this issue forward if/when it becomes a priority. We've been making lots of progress toward flexible indexes and array backends that I assume would be of interest here. |
Hey @jhamman! I think it's pretty close to becoming a priority. Figuring out how heavy of a lift sparse arrays will be is the main thing here. Could you point me to any recent developments around array backends? Are we even talking like a-couple-hours-ago recent? |
Yes "couple of hours" recent. We will refactor out that NamedArray piece over the next couple of months to a new library with minimal dependencies (no pandas!) and support for any array API (+ other array protocols) compliant object. Please read the design doc and let us know what you think. Your input will be very valuable!
From the list in your initial post though, it seems like NamedArray isn't entirely what you want.
|
@dcherian You can see here roughly what we have working at the moment for categoricals: https://github.com/scverse/anndata/pull/947/files#diff-3593f379977a83708f011798996a4e97ec3cf87f11055e3f93651a9718ae4db2R34 We also have something for nullable data types as well. Feedback welcome! |
Follow up on this topic at scikit-hep/ragged#6 |
Just as a note, the scope of the |
Right—sorry for the confusion. If all the conversations linked to the new one, this one is perhaps the least related. I know that you've used missing data and even unions, which will not be supported by the ragged library. Also, it's no minor thing that you've adapted AnnData to use Awkward: the work has been done. I think the users of the Ragged library would be wanting to make smaller changes to adopt something that looks like a normal array. |
All good! Thanks for keeping us in the loop of that discussion! |
I'm opening this issue to track and discuss how our data structure differs from xarray. Ideally I would close it when AnnData could easily be implemented via xarray.
Some previous discussion: #308
The idea
I often think of AnnData as a kind of "special case" of xarray Datasets. We just improve convenience by specializing on the 2d case, plus a few other features. It would be nice if I didn't just think of it that way, and we could actually just use their code here.
sgkit basically accomplishes this. It uses a very "anndata shaped" [1] xarray Dataset [2] for representing genomics data. These data structures and our goals with them are so similar that searching for open issues by the sgkit devs on the xarray repository is a great way to find compatibility issues for anndata.
Additionally, zarr and OME-zarr are quite aligned with xarray.
What's missing
Some things we need, which xarray does not currently provide:
(obsp, varp) Repeated coordinates leads to unintuitive (broken?) indexing behaviour: pydata/xarray#3731
Footnotes
[1] Since we're in the same language, working with biological data, and using many of the same technologies it would make a lot of sense for us to have greater alignment with sgkit.
[2] More context: https://github.com/single-cell-data/matrix-api/issues/11#issuecomment-1072533371