ZEP0003 Variable chunks #18
cc @rabernat, who has alternative ideas of how this functionality can be achieved.
Another draft of the same ideas, to be merged into here: https://hackmd.io/NaMo9YnBSFiZiO-ds1SFMA cc @ivirshup
Hi @martindurant. Thanks for working on this. Here's my suggestion: We should complete this PR (ZEP0003) and merge it. After merging, we can start the discussion in PR against this ZEP in the
> ## Detailed description
>
> Currently, array metadata looks something like
This is correct for zarr v2 but does not match the current state of zarr v3. I think it would be helpful to revise this to apply to zarr v3, since I assume your intent here is to propose this as a zarr v3 extension.
The current zarr v3 spec seems to be designed to make the chunking scheme itself an extension point:
https://zarr-specs.readthedocs.io/en/latest/core/v3.0.html#chunk-grid
However, that extension point design has not really been discussed at all, nor have any alternative grid types been proposed, other than this one.
This proposal could simply be a new chunk_grid "type". But I am not sure if that is the best fit --- this proposal allows the non-uniform chunking to be specified for just some of the dimensions. Additionally, the v3 spec has "separator" as a field of the chunk grid type, but it equally applies to both regular and rectilinear grids.
It might be worth considering if there are any other grid types that would be useful, and if so, how they might interact with this proposal.
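To make the "non-uniform for just some of the dimensions" point concrete, here is a minimal sketch, not taken from the ZEP or the v3 spec. It assumes a hypothetical per-dimension specification in which an integer means regular chunking and a list of integers gives explicit chunk sizes; the function name `expand_chunk_sizes` and this representation are illustrative assumptions only.

```python
from typing import List, Sequence, Union

# Hypothetical per-dimension spec: an int is a regular chunk size,
# a sequence of ints lists explicit (variable) chunk sizes.
DimSpec = Union[int, Sequence[int]]


def expand_chunk_sizes(shape: Sequence[int], chunk_spec: Sequence[DimSpec]) -> List[List[int]]:
    """Return the explicit chunk sizes along every dimension."""
    if len(shape) != len(chunk_spec):
        raise ValueError("need exactly one chunk entry per dimension")
    expanded = []
    for length, spec in zip(shape, chunk_spec):
        if isinstance(spec, int):
            if spec <= 0:
                raise ValueError("chunk size must be positive")
            # Regular dimension: repeat the fixed size. The final chunk is
            # clipped to the array extent here purely for illustration.
            full, rest = divmod(length, spec)
            sizes = [spec] * full + ([rest] if rest else [])
        else:
            # Variable dimension: the sizes must tile the dimension exactly.
            sizes = list(spec)
            if any(s <= 0 for s in sizes) or sum(sizes) != length:
                raise ValueError("variable chunk sizes must be positive and sum to the dimension length")
        expanded.append(sizes)
    return expanded


# Variable chunks along the first dimension, regular chunks along the second.
sizes = expand_chunk_sizes((10, 6), [[2, 3, 5], 4])
```

Under this assumed representation a regular grid is just the special case where every entry is an integer, which is one way the two grid types could share a single description.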
Yes, totally agree that, as far as the zarr-specs goes, a new subsection in that area is the way to go - with the regular grid being a special case. However, the spec does not go as far as to explicitly define how the chunks are specified in the metadata, it is only descriptive text. Is there a jsonschema somewhere I should propose a change to? When coming to write this ZEP, that's what I had assumed would be the case, but I cannot find one.
The separator is not material in this proposal - any separator can be used equally with regular or irregular grids.
> if there are any other grid types that would be useful

Certainly worth thinking about. Sharding, for instance, would be thought of as hierarchical chunking. I don't know how we can fit it into this proposal, though. Ideas very welcome!
There is no json schema. The zarr v3 spec does define the json metadata representation, though. For example this is where the existing regular chunk grid is specified:
https://zarr-specs.readthedocs.io/en/latest/core/v3.0.html#chunk-grid
The current spec organization is a bit confusing, though: in many cases one section directly describes a given portion of the array metadata, while a separate section provides more detailed information but does not specifically discuss the metadata representation. For example, the current spec has the following two sections related to the chunk grid:
Array metadata representation:
https://zarr-specs.readthedocs.io/en/latest/core/v3.0.html#chunk-grid
More general information:
https://zarr-specs.readthedocs.io/en/latest/core/v3.0.html#chunk-grids
I think it would make sense to consolidate those sections together.
> Certainly worth thinking about. Sharding, for instance, would be thought of as hierarchical chunking. I don't know how we can fit it into this proposal, though. Ideas very welcome!
I think it could certainly make sense to use sharding in conjunction with a rectilinear grid --- in fact we could allow both the chunk grid and the shard grid to be rectilinear rather than regular.
@jstriebel Any thoughts on this?
I have referenced that link and stated that that's where the change will happen.
Sorry for answering so late. I think nothing blocks using variable chunk sizes together with sharding in general; the sharding extension would just need to support this, or a generalized abstract indexing schema. We also briefly discussed whether sharding should allow combining a flexible number of chunks, but that might be added later if the need arises, and it seems like unnecessary complexity for now.
> if sharding should allow to combine a flexible number of chunks, but that might be added later if the need arises
This need already arose and has been implemented in the special handling in preffs, an alternate implementation of referenceFS for kerchunked data.
Interesting, what is the use-case for it?
I believe it is for representing data that has this internal structure. With kerchunk, you are stuck with whatever chunking the original has, rather than being able to rechunk.
cc @d70-t
Regarding the TODO item in the hackmd document:
In practice an implementation will likely need the cumulative values (i.e. partition points) to do a binary search to map array coordinates to chunks. However, the JSON metadata will need to be fully decoded in any case; it isn't practical to operate directly on the raw JSON representation. Therefore I don't think it matters for implementation efficiency whether the stored representation is cumulative or not, and we should instead decide based on other considerations. Points in favor of cumulative representation:
Points in favor of non-cumulative representation:
Another point in favour of non-cumulative: Note that for dask, the chunks are usually specified by size (step/delta) but stored as cumulative internally.
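The relationship between the two representations discussed above can be sketched as follows. This is an illustration under stated assumptions, not code from zarr-python or dask: the helper names `partition_points` and `locate` are hypothetical, and only the standard-library `itertools.accumulate` and `bisect.bisect_right` are used.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Sequence, Tuple


def partition_points(sizes: Sequence[int]) -> List[int]:
    """Convert per-chunk sizes (non-cumulative) to cumulative end offsets."""
    return list(accumulate(sizes))


def locate(index: int, boundaries: Sequence[int]) -> Tuple[int, int]:
    """Map an array coordinate along one dimension to (chunk number, offset in chunk)."""
    if not boundaries or not 0 <= index < boundaries[-1]:
        raise IndexError("coordinate outside the array extent")
    # Binary search: the first boundary strictly greater than index marks its chunk.
    chunk = bisect_right(boundaries, index)
    start = boundaries[chunk - 1] if chunk else 0
    return chunk, index - start


# Stored as sizes, decoded once into partition points, then searched.
boundaries = partition_points([2, 3, 5])  # [2, 5, 10]
chunk, offset = locate(4, boundaries)     # chunk 1, offset 2
```

Because the conversion is a single linear pass over the decoded metadata, either stored form supports the same lookup once loaded, which is consistent with the argument that efficiency should not drive the choice.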
Community decision please: shall we merge this to make it an official "draft" and start the formal ZEP discussions? I can click the ready-for-review button.
Sorry I have not had time to comment yet. I am happy to see this move to draft status, where we can discuss it further.
Hi @martindurant. I've added a small change, and the rest of the ZEP looks good. Is there anything you'd like to add/change before I merge this?
No, let's leave those, since in the discussion phase we may well decide on different ways of going about this anyway.
Co-authored-by: Isaac Virshup <[email protected]>
@ivirshup welcome to the zep, and thank you for writing your version! I simply didn't get around to updating admin fields.
ZEP0003 published here: https://zarr.dev/zeps/draft/ZEP0003.html 🚀
Discussion on this ZEP currently going on at: https://github.com/orgs/zarr-developers/discussions/52
Very preliminary version for discussion. Happy to start a github discussion or have a thread here. If responses are mostly favourable, I can make a PR to zarr-specs with the change and scope out how much work it would involve in zarr-python.