Adding sparse array support #245
I also meant to include @joshua-gould.
I think this is both a great idea and pretty straightforward to implement. Before thinking about any spec changes or extensions, I would work on a prototype implementation. I think this can be implemented in two steps in zarr-python relatively simply.
Based on what is learned from this exercise, we can think about what, if any, spec changes or extensions would be required to "formally" support sparse.
P.S. I actually think that the sparse GCXS format is an excellent sparse encoding, superseding both CSR and CSC.
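For illustration only (this sketch is not from the thread), one way such a prototype could look with the zarr-python v2 API is to store the components of a scipy CSR matrix as ordinary arrays inside a group; the group layout, array names, and attributes below are made up, though the approach is close to what anndata already does:

```python
import scipy.sparse
import zarr

# Store the three component arrays of a CSR matrix as plain zarr arrays in a
# group, with the logical shape and format kept as attributes (illustrative only).
m = scipy.sparse.random(10_000, 2_000, density=0.01, format="csr")

g = zarr.open_group("example_csr.zarr", mode="w")
g.array("data", m.data)        # non-zero values
g.array("indices", m.indices)  # column index of each stored value
g.array("indptr", m.indptr)    # row pointers
g.attrs["shape"] = m.shape
g.attrs["format"] = "csr"

# Read the components back and reassemble the scipy matrix.
g2 = zarr.open_group("example_csr.zarr", mode="r")
m2 = scipy.sparse.csr_matrix(
    (g2["data"][:], g2["indices"][:], g2["indptr"][:]),
    shape=tuple(g2.attrs["shape"]),
)
assert (m != m2).nnz == 0
```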
Hey! I will try and write up a longer response later, but am at a hackathon. Just wanted to add a few points:
This is not the case; anndata's sparse storage works well with zarr. The limitation here has been dask support for sparse arrays, which is also something that is being worked on. However, some promising developments: https://scverse.zulipchat.com/#narrow/stream/315789-data-structures/topic/dask.20in.20anndata/near/355745689
@rabernat, I would be interested in talking about this in more detail, but I don't think sparse arrays will work well with a codec. A lot of this has previously been discussed in #62. My current opinion is that structures like
I'd also note that anndata is collaborating with the GraphBLAS team (and others) to generate a more general binary sparse representation built on top of things like zarr and hdf5. You can check that out here. v1 won't contain n-dimensional CSF/GCXS support, but a spec for those kinds of arrays is looking pretty mature. You can find this here:
I think this is one of the key problems with having sparse as a codec. Codecs make a lot of sense for arrays that are one contiguous chunk of memory that needs to be read with a single dtype. A CSR array is really 3 distinct chunks of memory, each of which may need to be interpreted with a different dtype. I think this will be a problem for any codec more complicated than "decompress arbitrary buffer", which seems like the direction codecs are headed in v3, e.g. endianness, transpose, delta...
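To make that concrete: a scipy CSR matrix exposes exactly those three buffers, each with its own dtype (the int32 index dtype shown is what scipy picks for a small matrix; it switches to int64 for larger ones):

```python
import scipy.sparse

# One logical CSR matrix is backed by three separate buffers with different dtypes.
m = scipy.sparse.random(1_000, 1_000, density=0.01, format="csr")
print(m.data.dtype)     # float64 -- the non-zero values
print(m.indices.dtype)  # int32   -- column index of each stored value
print(m.indptr.dtype)   # int32   -- per-row offsets into data/indices
```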
I see your point @ivirshup and I recognize you have thought much more deeply about this than me! 😀
In the ratified version of the v3 spec there are three types of codecs (see #222 and zarr-specs/docs/v3/core/v3.0.rst, lines 1084 to 1095 at a9abaa1).
The spec does not explicitly say that
Yes, I think the sparse codec could be handled as an "array -> bytes" codec under the current v3 spec with no changes required. The in-memory representation of arrays is an implementation detail and not part of the spec. Exactly how easily this can be implemented in zarr-python or zarrita remains to be seen, though. One potential challenge with simply using
@ivirshup you are indeed the expert on this subject. While I'm very new to this, I'm happy to help where I can.
News to me. The website really makes it sound like
I agree completely, though that's a much taller task; I posted here thinking this was a good place to start. binsparse-specification looks promising. Is there a repo for writing/reading arrays to/from zarr or hdf5 files with this spec yet?
@rabernat, naively, writing a codec seems approachable to me, as the process seems to have good documentation. There are implementation questions, though, that raise the question of whether it is actually doable:
Also @ivirshup, I'm still processing "anndata sparse storage works well with zarr". I need to play with it more and dive deeper into the code.
In zarr v3, a sparse "array -> bytes" codec would be responsible for generating a sequence of bytes to represent the sparse array. It might do this by first converting the sparse array into a collection of dense arrays using one of the standard sparse array representations, then encoding each of these dense arrays, and then concatenating them together. It may make sense for this sparse array codec to have as configuration options separate sequences of codecs that can be applied to each of the dense arrays. Zarr v3 also allows you to specify additional "bytes -> bytes" codecs at the top level that would apply to the concatenated result. For example:

```json
"codecs": [
  {
    "name": "sparse_gcxs",
    "configuration": {
      "ro_codecs": [{"name": "endian", "endian": "little"}, {"name": "blosc"}],
      "co_codecs": [...],
      "vl_codecs": [...]
    }
  }
]
```

Normally when working with a sparse array you will need all of the dense arrays, so storing them all together in a single chunk makes sense. However, I suppose in some cases you might only need to access e.g. the values array and not the coordinates array, for example if applying the same operation to every value independently. That would not be possible if they are all stored in a single chunk. To be able to split the chunk over multiple keys in the underlying storage would require some changes to the zarr v3 codec model, though.
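As a rough, non-authoritative sketch of the encode side of such an "array -> bytes" codec (this is not an actual zarr codec implementation, and the per-component codec pipelines from the example above are skipped), one could imagine something like:

```python
import numpy as np
import sparse  # pydata/sparse

def encode_gcxs_chunk(chunk: np.ndarray) -> bytes:
    """Serialize one chunk as GCXS component arrays concatenated behind a small header."""
    g = sparse.GCXS.from_numpy(chunk)
    parts = [
        np.ascontiguousarray(g.data).tobytes(),     # non-zero values
        np.ascontiguousarray(g.indices).tobytes(),  # compressed-axis indices
        np.ascontiguousarray(g.indptr).tobytes(),   # pointer array
    ]
    # In a real codec each part would first pass through its own configured
    # pipeline ("ro_codecs", "co_codecs", "vl_codecs" in the example above).
    header = np.array([len(p) for p in parts], dtype="<u8").tobytes()
    return header + b"".join(parts)
```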
We already have chunks taking up a variable amount of space on disk when using compression with regular dense arrays.
To be clear, there is a difference between in-memory representation and on-disk representation. The existing sparse array formats are well-suited for on-disk representation, since even with dense arrays, zarr has to re-write chunks in their entirety when they are modified (assuming any sort of compression is used; in practice chunks will likely have to be rewritten in full even if no compression is being used). The in-memory representation that makes sense would depend on specific details of the implementation. I don't think there are any fundamental challenges there, but it may be necessary to make changes to the zarr-python code to support it.
Just a [slightly off topic] thought... it would be very cool if pydata/sparse could implement a compressed-sparse-blocks format. There are several examples of this out there, although it looks like they always assume 2D matrices:
This would map very well to Zarr, with each chunk forming a block.
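A toy sketch of that mapping, assuming 2D data and an arbitrary block size: partition the matrix into tiles and keep each tile as its own small CSR matrix, roughly what one zarr chunk per block would hold.

```python
import scipy.sparse

m = scipy.sparse.random(4_000, 4_000, density=0.001, format="csr")
block = 1_000  # arbitrary block/chunk size for illustration

# One small CSR matrix per (row-block, column-block) tile.
blocks = {
    (i // block, j // block): m[i:i + block, j:j + block]
    for i in range(0, m.shape[0], block)
    for j in range(0, m.shape[1], block)
}
```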
@ivirshup can say more, but my understanding from SciPy was that https://github.com/GraphBLAS/binsparse-specification should be ready for investigations in Zarr-land soon.
Has there been any progress made towards this in the past several months? @rabernat @joshua-gould @elyall @jbms
Based on the above links to https://github.com/ivirshup/binsparse-python, there hasn't been a binsparse-python commit in ~6 months, so it may be wise to proceed without it. As has already been mentioned above, there is good dask support for sparse arrays, which is a starting point. While there are complications with the variable sizing of some sparse arrays (CSR, for example), it makes sense to at least start with an implementation to support
I'm hoping to take a stab at this over the coming weeks.
You should check out https://github.com/graphblas/binsparse-specification, which is more up to date. I haven't had time recently to update my initial python code, but hopefully will get to it in March. Currently we are working more on doing stuff with sparse in dask, but I would not call the level of support "good".
As an update to this thread, I've been working with @ivirshup on updating the initial binsparse-py code for this use case.
Has there been an update on this issue? Or is there a current workaround for saving sparse matrices in zarr stores or using them with Dask?
The current workaround is to use https://github.com/ivirshup/binsparse-python for reading/writing sparse arrays from SciPy sparse CSR, CSC, and COO formats.
It would be nice if there were also support for pydata sparse arrays, as they are compatible with xarray and therefore Dask, which is not directly possible with scipy sparse arrays. Every time I try to use Dask to convert the written scipy arrays, I have to densify the arrays and then make them pydata sparse, which kind of defeats the purpose. Is this something that is being worked on or planned on the horizon? Perhaps there is a better way to do this, but I don't know how.
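For reference, the roundabout per-chunk path described above looks roughly like this (shapes and density are arbitrary). pydata/sparse also has `sparse.COO.from_scipy_sparse`, which avoids the dense intermediate for a single matrix, but wiring that through the dask/xarray I/O path is the missing piece being asked about here:

```python
import scipy.sparse
import sparse

csr_chunk = scipy.sparse.random(1_000, 200, density=0.01, format="csr")
dense = csr_chunk.toarray()               # the densification step that defeats the purpose
coo_chunk = sparse.COO.from_numpy(dense)  # pydata/sparse array that xarray/dask can hold
```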
Zarr would benefit from formal support for storing sparse arrays, though I'm not sure what this would look like or what the process for getting there would be (e.g. placing the functionality in the core spec, creating a numcodec, and/or creating a ZEP). But from some of the linked comments below it does seem very achievable.
If I'm wrong anywhere, please correct me. I'm not an expert on these packages.
Example use case: parallel processing of single cell RNA sequencing (scRNA-seq) data across a cluster.
Single cell sequencing produces large arrays that are often sparse and often too large to store in memory. These would benefit from chunked, sparse, on-disk storage to enable efficient chunked parallel processing on a Dask or Ray cluster.
The current scRNA-seq analysis stack uses anndata to wrap the data and scanpy to analyze it. anndata was built upon HDF5 files, but there has been an effort to use zarr and dask (with the current caveat seemingly being that the central matrix is a dense np.ndarray due to zarr's limitation). scanpy doesn't fully support dask arrays at the moment (e.g. scverse/scanpy#2491), but since dask arrays follow the numpy API that shouldn't be hard to fix. Dask supports sparse arrays, and Ray can be used as a plug-in scheduler for dask. In practice I've found anndata is able to load a scipy sparse csr matrix from an hdf5 file.
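As a small illustration of the "Dask supports sparse arrays" point (similar to the example in dask's own sparse-arrays documentation; the sizes and zero threshold are arbitrary):

```python
import dask.array as da
import sparse

x = da.random.random((20_000, 2_000), chunks=(5_000, 2_000))
x = da.where(x < 0.99, 0.0, x)           # make the array ~99% zeros
s = x.map_blocks(sparse.COO.from_numpy)  # dask array backed by sparse.COO chunks
result = s.sum(axis=0).compute()         # reductions run without densifying blocks
```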
Some previous efforts and discussions I've found:
`meta_array` argument to specify array type

Proposed test code to set goalpost
We can try to implement all five of the array formats I've come across (included below), or stick to `sparse.COO`, which follows the numpy API most completely, is more powerful by being n-dimensional, and is easy to chunk as the coordinates are exact (though less space efficient than csr or csc). Also, `sparse.COO` is easily converted in memory to other sparse array types.

My ultimate use case
I'd like to analyze this dataset on a (local/remote) ray cluster using scanpy's default workflow at the start.
Some relevant stakeholders
@ivirshup, @alimanfoo, @rabernat, @jakirkham, @daletovar, @MSanKeys963