ZEP0003 Variable chunks #18

Merged: 4 commits, Oct 21, 2022

Changed file: draft/ZEP0003.md (167 additions, 0 deletions)
---
layout: default
title: ZEP0003
description: Variable chunk sizes
parent: draft ZEPs
nav_order: 3
---

# ZEP 3 — Variable chunking

Authors:
* Martin Durant ([@martindurant](https://github.com/martindurant)), Anaconda, Inc.

Status: Draft

Type: Specification

Created: 2022-10-17

## Abstract

To allow the chunks of a zarr array to form a rectangular grid rather than a regular grid,
with the chunk lengths along any dimension given as a list of integers rather than a single chunk size.

## Motivation and Scope

Several specific use cases, given below, have motivated this. More broadly, this generalisation of Zarr's storage
model can be seen as an optional enhancement, and it matches the data model currently used by dask.array.

- when producing a [kerchunked](https://github.com/fsspec/kerchunk) dataset, the native chunking of the targets
cannot be changed. It is common
to have non-regular chunking on at least one dimension, such as a time dimension with one sample per day and chunks
of one month or one year. The change would allow these datasets to be read via kerchunk, and/or converted to
zarr with equivalent chunking to the original. Such data cannot currently be represented in zarr.
- [awkward](https://github.com/scikit-hep/awkward) arrays, ragged arrays and sparse data can be represented as
a set of one-dimensional arrays, with an appropriate metadata description convention. The sizes of the chunks
of each component array corresponding to a logical chunk of the overall array will not, in general, be equal
to each other within a single chunk, nor consistent between chunks, as each row of the array can have a variable
number of non-zero values.
- sensor data may not arrive in fixed increments; variably-chunked storage would be well suited to parallel writing.
With variable chunk sizes, writers only need to ensure the chunk offsets are
correct once done; otherwise, the write location of each chunk depends on the previous chunks.
- in some cases, parts of the overall data array may have very different data distributions, and it can
be very convenient to partition the data by such characteristics to allow, for example, for more efficient encoding
schemes.
- when filtering regular table data on one column and applying the selection to other columns, you necessarily end up
with an unequal number of values in each chunk, which zarr cannot currently handle.

## Usage and Impact

### Creation

```python
zarr.create(1000, chunks=((100, 300, 500, 100),))
```
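For illustration, the chunk specification above must cover the 1000-element array exactly; any conforming implementation would be expected to enforce this invariant. A plain-Python sketch of the check (not part of any zarr API):

```python
# The single-dimension example above: four chunks covering 1000 elements.
chunks = ((100, 300, 500, 100),)
shape = (1000,)

# Each per-dimension chunk list must sum to the array size in that dimension.
for dim_chunks, dim_size in zip(chunks, shape):
    assert sum(dim_chunks) == dim_size

# The number of chunks along a dimension is the length of its list.
n_chunks = [len(c) for c in chunks]
```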


## Backward Compatibility


This change is fully backward compatible: all old data will remain usable. However, data written with
variable chunks will not be readable by older versions of Zarr. It would be reasonable to backport the
feature to v2.

## Detailed description

Currently, the array metadata specifies the chunking scheme as follows
(see https://zarr-specs.readthedocs.io/en/latest/core/v3.0.html#chunk-grid):
```json
{
    "type": "regular",
    "chunk_shape": [10, 10],
    "separator": "/"
}
```

> *Review discussion on this block:*
>
> **Reviewer:** This is correct for zarr v2 but does not match the current state of zarr v3. I think it would be helpful to revise this to apply to zarr v3, since I assume your intent here is to propose this as a zarr v3 extension.
>
> The current zarr v3 spec seems to be designed to make the chunking scheme itself an extension point: https://zarr-specs.readthedocs.io/en/latest/core/v3.0.html#chunk-grid
>
> However, that extension-point design has not really been discussed, nor have any alternative grid types been proposed, other than this one.
>
> This proposal could simply be a new chunk_grid "type", but I am not sure that is the best fit: this proposal allows non-uniform chunking to be specified for just some of the dimensions. Additionally, the v3 spec has "separator" as a field of the chunk grid type, but it applies equally to both regular and rectilinear grids.
>
> It might be worth considering whether there are other grid types that would be useful, and if so, how they might interact with this proposal.
>
> **Author (martindurant):** Yes, I totally agree that, as far as the zarr-specs go, a new subsection in that area is the way to go, with the regular grid being a special case. However, the spec does not explicitly define how the chunks are specified in the metadata; it is only descriptive text. Is there a JSON schema somewhere I should propose a change to? When coming to write this ZEP, that is what I had assumed would be the case, but I cannot find one.
>
> The separator is not material in this proposal: any separator can be used equally with regular or irregular grids.
>
> > if there are any other grid types that would be useful
>
> Certainly worth thinking about. Sharding, for instance, could be thought of as hierarchical chunking. I don't know how to fit it into this proposal, though. Ideas very welcome!
>
> **jbms (Oct 18, 2022):** There is no JSON schema. The zarr v3 spec does define the JSON metadata representation, though. For example, this is where the existing regular chunk grid is specified: https://zarr-specs.readthedocs.io/en/latest/core/v3.0.html#chunk-grid
>
> The current spec organization is a bit confusing, though, in that in many cases there is one section that directly describes a given portion of the array metadata, and another separate section that provides more detailed information but does not specifically discuss the metadata representation. For example, we have the following two sections in the current spec related to the chunk grid:
>
> Array metadata representation: https://zarr-specs.readthedocs.io/en/latest/core/v3.0.html#chunk-grid
>
> More general information: https://zarr-specs.readthedocs.io/en/latest/core/v3.0.html#chunk-grids
>
> I think it would make sense to consolidate those sections.
>
> **Reviewer:**
> > Certainly worth thinking about. Sharding, for instance, could be thought of as hierarchical chunking. I don't know how to fit it into this proposal, though. Ideas very welcome!
>
> I think it could certainly make sense to use sharding in conjunction with a rectilinear grid; in fact, we could allow both the chunk grid and the shard grid to be rectilinear rather than regular.
>
> **Reviewer:** @jstriebel Any thoughts on this?
>
> **Author (martindurant):** I have referenced that link and stated that that is where the change will happen.
>
> **Member:** Sorry for answering so late. I think there is nothing that blocks using variable chunk sizes together with sharding in general; the sharding extension would just need support for this, or a generalised abstract indexing scheme. We also discussed briefly whether sharding should allow combining a flexible number of chunks, but that might be added later if the need arises, and it seems unnecessary complexity for now.
>
> **Author (martindurant):**
> > whether sharding should allow combining a flexible number of chunks, but that might be added later if the need arises
>
> This need has already arisen and has been handled specially in preffs, an alternate implementation of referenceFS for kerchunked data.
>
> **Member:** Interesting, what is the use-case for it?
>
> **Author (martindurant):** I believe it is representing data that has this internal structure. Being kerchunk, you are stuck with whatever chunking the original has, rather than being able to rechunk.
>
> cc @d70-t

The proposal is to allow metadata of the form
```json
{
    "type": "rectangular",
    "chunk_shape": [[5, 5, 5, 15, 15, 20, 35], 10],
    "separator": "/"
}
```
Each element of `chunk_shape`, corresponding to each dimension of the array, may be a single integer, as before,
or a list of integers which add up to the size of the array in that dimension. In this example, the single value
of `10` for the chunks on the second dimension is identical to `[10, 10, 10, 10, 10, 10, 10, 10, 10, 10]`.
The number of values in the list is equal to the number of chunks along that dimension. Thus, a "regular"
grid is simply a special case of a "rectangular" grid.
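This expansion from a scalar to an explicit per-chunk list can be sketched as follows (a hypothetical helper, not part of any existing zarr release; the handling of a trailing partial chunk mirrors current regular-grid behaviour):

```python
def normalize_chunk_shape(shape, chunk_shape):
    """Expand each entry of a rectangular chunk_shape into an explicit
    list of chunk sizes, validated against the array shape."""
    normalized = []
    for size, spec in zip(shape, chunk_shape):
        if isinstance(spec, int):
            # Scalar entry: regular chunking, with a final partial
            # chunk if the dimension size is not an exact multiple.
            n_full, remainder = divmod(size, spec)
            sizes = [spec] * n_full + ([remainder] if remainder else [])
        else:
            sizes = list(spec)
            if sum(sizes) != size:
                raise ValueError(
                    f"chunk sizes {sizes} do not sum to dimension size {size}"
                )
        normalized.append(sizes)
    return normalized
```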

The index bounds along each dimension of the hyperrectangles are formed by a cumulative sum of the chunk sizes,
starting at 0:
```
bounds_axis0 = [0, 5, 10, 15, 30, 45, 65, 100]
bounds_axis1 = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
```
such that key "c0/0" contains values for indices [0, 5) along the first dimension and [0, 10) along the second.
An array index of (17, 17) would be found in key "c3/1", at index (2, 7) within that chunk.
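The bounds computation and index lookup can be sketched in a few lines of Python (illustrative only; the key format assumes the "c" prefix and "/" separator of the example above):

```python
import bisect
from itertools import accumulate

def chunk_bounds(chunk_sizes):
    """Cumulative sum of the chunk sizes along one dimension, starting at 0."""
    return [0] + list(accumulate(chunk_sizes))

def locate(index, chunks_per_dim, separator="/"):
    """Map an array index to (chunk key, index within that chunk)."""
    chunk_ids, offsets = [], []
    for i, sizes in zip(index, chunks_per_dim):
        bounds = chunk_bounds(sizes)
        # The chunk holding i is the one whose half-open interval
        # [bounds[c], bounds[c + 1]) contains it.
        c = bisect.bisect_right(bounds, i) - 1
        chunk_ids.append(c)
        offsets.append(i - bounds[c])
    key = "c" + separator.join(str(c) for c in chunk_ids)
    return key, tuple(offsets)
```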

## Related Work

### Dask

`dask.array` uses rectangular chunking internally, and is one of the major consumers of zarr data. Much of the
code translating logical slices into slices on the individual chunks should be reusable.

### Parquet / Arrow

Arrow describes tables as a collection of record batches. There is no restriction on the size of these batches.
This is not only very flexible, but can be used as an indexing strategy for low cardinality columns within parquet.

```
dataset_name/
year=2007/
month=01/
0.parq
1.parq
...
month=02/
0.parq
1.parq
...
month=03/
...
year=2008/
month=01/
...
...
```

This feature was cited as one of the reasons parquet was chosen over zarr for dask
dataframes: https://github.com/dask/dask/issues/1599

### awkward array

https://github.com/zarr-developers/zarr-specs/issues/62


## Implementation

It is to be hoped that much code can be adapted from dask.array, which already allows variable chunk sizes
on each dimension.

## Alternatives

### Just tune chunk sizes

https://github.com/zarr-developers/zarr-specs/issues/62#issuecomment-1100806513


## Discussion


## References and Footnotes

* Previous discussion:
* [Zarr Dask Table dask/dask#1599](https://github.com/dask/dask/issues/1599)
* [Protocol extensions for awkward arrays zarr-developers/zarr-specs#62](https://github.com/zarr-developers/zarr-specs/issues/62)
* [Handling arrays with non-uniform chunking zarr-developers/zarr-specs#40](https://github.com/zarr-developers/zarr-specs/issues/40)
* [Chunk spec zarr-developers/zarr-spec#7](https://github.com/zarr-developers/zarr-specs/issues/7#issuecomment-468127219)



## Copyright

This document has been placed in the public domain.