Alignment with xarray #744

ivirshup · 2022-03-23T18:30:30Z

I'm opening this issue to track and discuss how our data structure differs from xarray. Ideally I would close it when AnnData could easily be implemented via xarray.

Some previous discussion: #308

The idea

I often think of AnnData as a kind of "special case" of xarray Datasets. We just improve convenience by specializing on the 2d case, plus a few other features. It would be nice if I didn't just think of it that way, and we could actually just use their code here.

sgkit basically accomplishes this: it uses a very "anndata shaped"¹ xarray Dataset² for representing genomics data. These data structures and our goals with them are so similar that searching for open issues by the sgkit devs on the xarray repository is a great way to find compatibility issues for anndata.

Additionally, zarr and OME-zarr are quite aligned with xarray.
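
To make the "special case" idea concrete, here is a minimal sketch of an "anndata shaped" xarray Dataset. The variable names (X, n_counts, highly_variable, pca) and the toy values are assumptions for illustration, not sgkit's or anndata's actual schema.

```python
import numpy as np
import xarray as xr

# Toy data: 3 observations (cells) x 4 variables (genes).
obs_names = ["cell_a", "cell_b", "cell_c"]
var_names = ["gene_1", "gene_2", "gene_3", "gene_4"]

ds = xr.Dataset(
    data_vars={
        # Main matrix, analogous to AnnData's X.
        "X": (("obs", "var"), np.arange(12, dtype=float).reshape(3, 4)),
        # Per-observation annotation, analogous to a column of obs.
        "n_counts": (("obs",), np.array([10.0, 20.0, 30.0])),
        # Per-variable annotation, analogous to a column of var.
        "highly_variable": (("var",), np.array([True, False, True, False])),
        # Multi-dimensional observation annotation, analogous to obsm["pca"].
        "pca": (("obs", "pc"), np.zeros((3, 2))),
    },
    coords={"obs": obs_names, "var": var_names},
)

# Label-based subsetting keeps every aligned array in sync.
subset = ds.sel(obs=["cell_a", "cell_c"], var=["gene_2", "gene_4"])
```

Every array that shares the obs or var dimension is subset together, which is the behaviour AnnData provides for its 2d case.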

What's missing

Some things we need, which xarray does not currently provide:

We have support for the fast sparse array library (ideally we can get pydata/sparse to become fast)
We support categorical variables
We support repeated dimensions (e.g. obsp, varp). See "Repeated coordinates leads to unintuitive (broken?) indexing behaviour", pydata/xarray#3731.
We have a nested structure (though it's on the roadmap with datatree being implemented)
We are actively working on support for awkward arrays (see "first attempt to support awkward arrays", #647; "Awkward array backend?", pydata/xarray#4285; https://github.com/pystatgen/sgkit/issues/643)

¹ Since we're in the same language, working with biological data, and using many of the same technologies it would make a lot of sense for us to have greater alignment with sgkit.
² More context: https://github.com/single-cell-data/matrix-api/issues/11#issuecomment-1072533371
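
As an illustration of what "repeated dimensions" means for obsp and varp, here is a minimal NumPy-only sketch. The array name and labels are made up; the point is only that a pairwise (obs x obs) array has to be subset along both axes by the same selection.

```python
import numpy as np

obs_names = np.array(["cell_a", "cell_b", "cell_c", "cell_d"])

# Pairwise (obs x obs) array, like an entry of obsp: both axes share the
# same labels, so a subset of observations must be applied to both.
connectivities = np.arange(16).reshape(4, 4)

keep = np.isin(obs_names, ["cell_b", "cell_d"])
idx = np.flatnonzero(keep)

# np.ix_ builds an open mesh so rows and columns are subset together.
sub = connectivities[np.ix_(idx, idx)]
```

A labelled container with a repeated dimension needs to do this implicitly whenever that dimension is indexed.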

jakirkham · 2022-04-04T18:39:33Z

cc @jpivarski (who may be interested in the Awkward Array connection)

jpivarski · 2022-04-04T20:07:00Z

Supporting Awkward Arrays would likely prevent full reimplementation of anndata with xarray alone, since xarrays can't contain Awkward Arrays or vice-versa. Even the "tree-like data structure" on xarray's road map (experimentally implemented by Datatree) is not quite the same thing, as Datatrees are more like nested groups in an HDF file (as seen in these docs): a small number of nested objects, which can each be large. Awkward Arrays represent a large number of nested objects. The comparison is like "AoS vs SoA" (just an analogy). This comment, pydata/xarray#4118 (comment), seems to be spelling out the difference, and I'm following up with the author on scikit-hep/awkward#1396.

As a side note, it looks like there could be some benefit to xarrays containing Awkward Arrays (and not the other way around). That's something I should probably ask the xarray developers someday. Datatree is extending Dataset in a bigger way than it would probably take to wrap an Awkward Array.

Unless/until we actually do that, implementation of anndata with xarray would have to have some way to handle the fact that Awkward Arrays are not included within xarray's data model.

ivirshup · 2022-04-06T16:39:53Z

> Supporting Awkward Arrays would likely prevent full reimplementation of anndata with xarray alone, since xarrays can't contain Awkward Arrays or vice-versa.
> ...
> As a side note, it looks like there could be some benefit to xarrays containing Awkward Arrays

My mental model here was a 1d xr.DataArray containing an ak.Array. This seems fairly doable to me since you really only need labels -> positional indices. Figuring out the merging/concatenation semantics here could take some more doing, but also strikes me as possible.

Random thought: storing an arrow ListArray inside an xr.DataArray could get us part way here.
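
A minimal sketch of the "labels -> positional indices" idea, using only the standard library. LabeledRagged is a hypothetical class, and a plain list of lists stands in for the ak.Array; neither xarray nor Awkward Array is used here.

```python
from typing import Hashable, Sequence


class LabeledRagged:
    """Hypothetical 1-d labelled container; a list of lists stands in for ak.Array."""

    def __init__(self, labels: Sequence[Hashable], data: Sequence[Sequence]):
        if len(labels) != len(data):
            raise ValueError("labels and data must have the same length")
        self.labels = list(labels)
        self.data = [list(row) for row in data]
        # Label -> position lookup; this is the only thing the labels provide.
        self._pos = {label: i for i, label in enumerate(self.labels)}

    def sel(self, labels: Sequence[Hashable]) -> "LabeledRagged":
        # Translate labels to positions, then index the ragged data by position.
        positions = [self._pos[label] for label in labels]
        return LabeledRagged(
            [self.labels[i] for i in positions],
            [self.data[i] for i in positions],
        )


ragged = LabeledRagged(
    ["cell_a", "cell_b", "cell_c"],
    [[1, 2, 3], [], [4, 5]],
)
picked = ragged.sel(["cell_c", "cell_a"])
```

Selection never needs to look inside the rows, so it works for any row length. Merging and concatenation are not covered by this sketch.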

jpivarski · 2022-04-06T18:05:43Z

Can you put Arrow data in xarray? Arrow is interchangeable with Awkward Array, so having Arrow can be seen as equivalent to having Awkward. The ak.to_arrow and ak.from_arrow functions are usually zero-copy, too. If that's already a possibility, it's more than part way there.

The main way in which Awkward Arrays differ from all the other array types is that Awkward Arrays do not have shape and dtype. (Same for Arrow arrays, for the same reason.) That's usually the first thing that we find when we attempt to put Awkward Arrays into Pandas or Dask naively. It's also why we can't participate in the Python array API standard.

A single ak.Array can be split apart into a small number of buffers of different sizes, each of which can be an xr.DataArray, along with some metadata to put them back again. That was the idea for using Awkward Array in Zarr: one ak.Array becomes one Zarr group of datasets. Since xarray Datatree is like Zarr and HDF5 groups, one ak.Array could be decomposed into a Datatree using ak.to_buffers and reconstituted using ak.from_buffers.
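
The decomposition into flat buffers can be sketched with NumPy alone. This is a simplified illustration of the idea for one level of nesting, not the actual output of ak.to_buffers (which also returns metadata describing the layout); the function names below are hypothetical.

```python
import numpy as np


def to_flat_buffers(rows):
    """Split a list of variable-length lists into (offsets, content) arrays."""
    lengths = np.array([len(r) for r in rows], dtype=np.int64)
    # offsets[i]:offsets[i + 1] is the slice of content belonging to row i.
    offsets = np.concatenate([[0], np.cumsum(lengths)]).astype(np.int64)
    content = np.array([x for r in rows for x in r], dtype=np.float64)
    return offsets, content


def from_flat_buffers(offsets, content):
    """Rebuild the list of lists from the two flat buffers."""
    return [
        content[start:stop].tolist()
        for start, stop in zip(offsets[:-1], offsets[1:])
    ]


rows = [[1.0, 2.0, 3.0], [], [4.0, 5.0]]
offsets, content = to_flat_buffers(rows)
roundtrip = from_flat_buffers(offsets, content)
```

The two buffers have different lengths, but each is an ordinary array with a shape and dtype, so each could be stored as its own dataset in a group.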

jakirkham · 2022-04-06T18:19:31Z

> The main way in which Awkward Arrays differ from all the other array types is that Awkward Arrays do not have shape and dtype. (Same for Arrow arrays, for the same reason.) That's usually the first thing that we find when we attempt to put Awkward Arrays into Pandas or Dask naively. It's also why we can't participate in the Python array API standard.

Bit of a tangent, but it might be worthwhile to write up a Data Array API issue about the Awkward Array use case.

jpivarski · 2022-04-06T18:26:21Z

> Bit of a tangent, but it might be worthwhile to write up a Data Array API issue about the Awkward Array use case.

We already talked about it here: data-apis/consortium-feedback#6. It sounded pretty clear that Awkward (and by extension, Arrow) are out of scope for Data Array API, and it's understandable that the scope would have to cut off somewhere.

SimonHeybrock · 2022-05-31T05:12:19Z

If anyone is looking for more confusion, I'd like to mention scipp, and in particular its Binned data feature. This is somewhat similar to a DataArray containing an Awkward Array of records. Happy to share more info if someone is interested.

ivirshup · 2022-06-07T15:27:27Z

@SimonHeybrock, thanks for pointing that out! From my initial look, the API for scipp looks quite nice. It does seem to cater to some use-cases we're looking at more than the more geospatial focus of xarray.

However, I really like that xarray can hold various types of python arrays. For instance, sparse arrays are very important to us – and I'd expect dask will become important as well.

SimonHeybrock · 2022-06-08T06:35:53Z

@ivirshup The two things you point out (holding other Python arrays, dask support) are indeed somewhat sore points for us. We would like to do both, but currently have no funding to do so.

We have serialization compatible with dask, so a number of the dask multi-processing APIs can be used, but we do not have an implementation of the dask collections interface, i.e., we currently do not support chunking and operations in the style of xarray's dask support.

ilan-gold · 2023-07-29T17:18:00Z

Another potential ask here: not reading the dims (like the indices of a dataframe) into memory when a Dataset is declared.

jhamman · 2023-09-27T16:32:24Z

👋 Hi folks! Xarray dev here. Just wanted to drop a note to say that we'd be happy to help move this issue forward if/when it becomes a priority. We've been making lots of progress toward flexible indexes and array backends that I assume would be of interest here.

ivirshup · 2023-09-27T17:40:23Z

Hey @jhamman! I think it's pretty close to becoming a priority. Figuring out how heavy of a lift sparse arrays will be is the main thing here. Could you point me to any recent developments around array backends? Are we even talking like a-couple-hours-ago recent?

initial refactor for NamedArray pydata/xarray#8075

dcherian · 2023-09-27T21:33:35Z

Yes "couple of hours" recent. We will refactor out that NamedArray piece over the next couple of months to a new library with minimal dependencies (no pandas!) and support for any array API (+ other array protocols) compliant object.

Please read the design doc and let us know what you think. Your input will be very valuable!

> Figuring out how heavy of a lift sparse arrays will be is the main thing here.

pydata/sparse is supported. scipy.sparse needs to become array API compliant (which I think is on the cards? you'll know more!). Bottom line is we want to support any standards-conforming array library.

From the list in your initial post though, it seems like NamedArray isn't entirely what you want.

For hierarchies you'd want datatree (as noted), but that pulls xarray, which will pull pandas.
We haven't considered repeated dims yet, but I bet we could support some set of reasonable cases.
Categorical variables are interesting. Again, if there was some array standard compliant container, we'd want to be able to wrap that too.
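
For readers unfamiliar with the idea of a named wrapper around an arbitrary array, here is a rough sketch. MiniNamedArray is a hypothetical class written for this example; it is not xarray's NamedArray and only assumes the wrapped object has .shape and supports slicing.

```python
import numpy as np


class MiniNamedArray:
    """Hypothetical stand-in: dimension names attached to any array with .shape."""

    def __init__(self, dims, data):
        if len(dims) != len(data.shape):
            raise ValueError("need one dimension name per axis")
        self.dims = tuple(dims)
        self.data = data

    def isel(self, **slices):
        """Slice by dimension name; unnamed dimensions are kept whole."""
        unknown = set(slices) - set(self.dims)
        if unknown:
            raise KeyError(f"unknown dimensions: {sorted(unknown)}")
        key = tuple(slices.get(dim, slice(None)) for dim in self.dims)
        return MiniNamedArray(self.dims, self.data[key])


arr = MiniNamedArray(("obs", "var"), np.arange(12).reshape(3, 4))
sub = arr.isel(var=slice(0, 2))

# With a repeated dimension name, the same slice lands on both axes.
pair = MiniNamedArray(("obs", "obs"), np.arange(9).reshape(3, 3))
pair_sub = pair.isel(obs=slice(0, 2))
```

Because the wrapper only touches .shape and __getitem__, any array type that provides those could be wrapped. The repeated-name case is one possible reading of a "reasonable case", not a statement about what xarray will support.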

ilan-gold · 2023-09-28T15:28:36Z

@dcherian You can see here roughly what we have working at the moment for categoricals: https://github.com/scverse/anndata/pull/947/files#diff-3593f379977a83708f011798996a4e97ec3cf87f11055e3f93651a9718ae4db2R34 We also have something for nullable data types as well. Feedback welcome!
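
For context, a categorical array is usually stored as integer codes plus a small array of categories. The NumPy sketch below illustrates that encoding in general; it is not the implementation in the linked PR, and the example labels are made up.

```python
import numpy as np

values = np.array(["T cell", "B cell", "T cell", "NK cell", "B cell"])

# np.unique returns the sorted distinct values and, with return_inverse=True,
# the integer code of each element within them.
categories, codes = np.unique(values, return_inverse=True)

# Decoding is a single take along the categories array.
decoded = categories[codes]
```

Any container that wraps categoricals needs to carry both pieces and apply indexing to the codes only.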

jpivarski · 2023-12-30T18:47:42Z

Follow up on this topic at scikit-hep/ragged#6

grst · 2024-01-03T08:40:36Z

Just as a note, the scope of the ragged library does not cover what we are currently doing in scirpy (heavy use of RecordTypes), nor for what @Zethson is planning in ehrapy (arbitrary nesting). So we'd likely need support for the full awkward array anyway.

jpivarski · 2024-01-03T13:09:07Z

Right—sorry for the confusion. If all the conversations linked to the new one, this one is perhaps the least related. I know that you've used missing data and even unions, which will not be supported by the ragged library.

Also, it's no minor thing that you've adapted AnnData to use Awkward: the work has been done. I think the users of the Ragged library would be wanting to make smaller changes to adopt something that looks like a normal array.

grst · 2024-01-03T13:14:04Z

All good! Thanks for keeping us in the loop of that discussion!

ivirshup added upstream meta labels Mar 23, 2022

ivirshup mentioned this issue Mar 23, 2022: Per element metadata #745 (Open)

ivirshup added this to Distributed and Large AnnData Jun 21, 2022

TomNicholas mentioned this issue Aug 10, 2022: Awkward array backend? pydata/xarray#4285 (Open)

github-actions bot added the stale label Jun 19, 2023

flying-sheep added enhancement and removed stale labels Jun 19, 2023

scverse deleted a comment from github-actions bot Aug 1, 2023

ilan-gold mentioned this issue Nov 17, 2023: Categorical Array pydata/xarray#8463 (Closed)
