Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ZEP0003 Variable chunks #18
ZEP0003 Variable chunks #18
Changes from all commits
0d6ca13
3dcef6e
e738f4c
b867afa
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is correct for zarr v2 but does not match the current state of zarr v3. I think it would be helpful to revise this to apply to zarr v3, since I assume your intent here is to propose this as a zarr v3 extension.
The current zarr v3 spec seems to be designed to make the chunking scheme itself an extension point:
https://zarr-specs.readthedocs.io/en/latest/core/v3.0.html#chunk-grid
However, that extension point design has not really been discussed at all, nor have any alternative grid types been proposed, other than this one.
This proposal could simply be a new chunk_grid "type". But I am not sure if that is the best fit --- this proposal allows the non-uniform chunking to be specified for just some of the dimensions. Additionally, the v3 spec has "separator" as a field of the chunk grid type, but it equally applies to both regular and rectilinear grids.
It might be worth considering if there are any other grid types that would be useful, and if so, how they might interact with this proposal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, totally agree that, as far as the zarr-specs goes, a new subsection in that area is the way to go - with the regular grid being a special case. However, the spec does not go as far as to explicitly define how the chunks are specified in the metadata, it is only descriptive text. Is there a jsonschema somewhere I should propose a change to? When coming to write this ZEP, that's what I had assumed would be the case, but I cannot find one.
The separator is not material in this proposal - any separator can be used equally with regular or irregular grids.
Certainly worth thinking about. Sharding, for instance, would be thought of as hierarchical chunking. I don't know how we can fill it into this proposal, though. Ideas very welcome!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no json schema. The zarr v3 spec does define the json metadata representation, though. For example this is where the existing regular chunk grid is specified:
https://zarr-specs.readthedocs.io/en/latest/core/v3.0.html#chunk-grid
The current spec organization is a bit confusing, though, in that in many cases there is 1 section that directly describes a given portion of the array metadata, and another separate section that provides more detailed information but does not specifically discuss the metadata representation. For example, we have the following two sections in the current spec related to the chunk grid:
Array metadata representation:
https://zarr-specs.readthedocs.io/en/latest/core/v3.0.html#chunk-grid
More general information:
https://zarr-specs.readthedocs.io/en/latest/core/v3.0.html#chunk-grids
I think it would make sense to consolidate those sections together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it could certainly make sense to use sharding in conjunction with a rectilinear grid --- in fact we could allow both the chunk grid and the shard grid to be rectilinear rather than regular.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jstriebel Any thoughts on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have referenced that link and stated that that's where the change will happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for answering so late. I think there's nothing that blocks using variable chunk sizes together with sharding in general, just the sharding extension would need to have support for this or a generalized abstract indexing schema. We also discussed briefly if sharding should allow to combinae a flexible number of chunks, but that might be added later if the need arises, and it seems unnecessary complexity for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This need already arose and has been implemented in the special handling in preffs, an alternate implementation of referenceFS for kerchunked data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, what is the use-case for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it is representing data that has this internal structure. Being kerchunk, you are stuck with whatever the original has, rather than being able to rechunk.
cc @d70-t