---
layout: default
title: ZEP0003
description: Variable chunk sizes
parent: draft ZEPs
nav_order: 3
---

# ZEP 3 — Variable chunking

Authors:
* Martin Durant ([@martindurant](https://github.com/martindurant)), Anaconda, Inc.
* Isaac Virshup ([@ivirshup](https://github.com/ivirshup)), Helmholtz Munich

Status: Draft

Type: Specification

Created: 2022-10-17

## Abstract

To allow the chunks of a Zarr array to form a rectangular grid rather than a regular grid,
with the chunk lengths along any dimension given as a list of integers rather than a single chunk size.

## Motivation and Scope

The specific use cases given below have motivated this change. This generalisation of Zarr's storage
model can be seen as an optional enhancement, and it matches the data model currently used by dask.array.

- when producing a [kerchunked](https://github.com/fsspec/kerchunk) dataset, the native chunking of the targets
cannot be changed. It is common to have non-regular chunking on at least one dimension, such as a time dimension
with one sample per day and chunks of one month or one year. The change would allow such datasets to be read via
kerchunk, and/or converted to zarr with chunking equivalent to the original. These data cannot currently be
represented in zarr.
- [awkward](https://github.com/scikit-hep/awkward) arrays, ragged arrays and sparse data can be represented as
a set of one-dimensional arrays, with an appropriate metadata description convention. The sizes of the chunks
of each component array corresponding to a logical chunk of the overall array will not, in general, be equal
to each other within a single chunk, nor consistent between chunks, since each row of the matrix can have a
variable number of non-zero values.
- sensor data may not arrive in fixed increments; variably chunked storage would suit parallel writing well.
With variable chunk sizes, one only needs to ensure the offsets are correct once writing is done; otherwise,
the write locations of chunks depend on the previous chunks.
- in some cases, parts of the overall data array may have very different data distributions, and it can
be very convenient to partition the data by such characteristics to allow, for example, more efficient encoding
schemes.
- when filtering a regular table on one column and applying the filter to the other columns, you necessarily end
up with an unequal number of values in each chunk, which zarr cannot currently handle.

## Usage and Impact

### Creation

```python
# proposed API: a 1-D array of length 1000 split into chunks of 100, 300, 500 and 100 elements
zarr.create(1000, chunks=((100, 300, 500, 100),))
```
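The same pattern would extend to multi-dimensional arrays, mixing per-dimension lists with plain integers,
mirroring the metadata example in the detailed description below. A hypothetical sketch; the exact `chunks=`
spelling accepted by `zarr.create` is illustrative, not settled:

```python
import zarr

# Hypothetical usage under this proposal: the first dimension uses explicit,
# unequal chunk lengths (summing to 100), the second keeps a regular size of 10.
z = zarr.create((100, 100), chunks=((5, 5, 5, 15, 15, 20, 35), 10), dtype="f8")
```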


## Backward Compatibility


This change is fully backward compatible: all existing data will remain usable. However, data written with
variable chunks will not be readable by older versions of Zarr. It would be reasonable to backport the
feature to v2.

## Detailed description

Currently, the array metadata specifies the chunking scheme as follows
(see https://zarr-specs.readthedocs.io/en/latest/core/v3.0.html#chunk-grid):
```json
{
    "type": "regular",
    "chunk_shape": [10, 10],
    "separator": "/"
}
```

The proposal is to allow metadata of the form
```json
{
    "type": "rectangular",
    "chunk_shape": [[5, 5, 5, 15, 15, 20, 35], 10],
    "separator": "/"
}
```
Each element of `chunk_shape`, corresponding to one dimension of the array, may be a single integer, as before,
or a list of integers which add up to the size of the array in that dimension. In this example, the single value
of `10` for the chunks on the second dimension would be identical to `[10, 10, 10, 10, 10, 10, 10, 10, 10, 10]`.
The number of values in the list equals the number of chunks along that dimension. Thus, any "regular" grid can
also be expressed as a "rectangular" grid.

The index bounds of each hyperrectangle along a dimension are formed by a cumulative sum of the chunk sizes,
starting at 0.
```
bounds_axis0 = [0, 5, 10, 15, 30, 45, 65, 100]
bounds_axis1 = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
```
such that key "c0/0" contains values for indices [0, 5) along the first dimension and [0, 10) along the second
dimension. An array index of (17, 17) would be found in key "c3/1", at within-chunk index (2, 7).
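A minimal sketch of this lookup logic (the helper names here are hypothetical, not part of any implementation):
expand integer chunk specs to lists, compute the cumulative bounds, and locate the chunk and within-chunk index
for a given array index.

```python
import bisect
import itertools

def expand_chunks(spec, length):
    """Turn a per-dimension chunk spec (int or list of ints) into a list of chunk sizes."""
    if isinstance(spec, int):
        n_full, remainder = divmod(length, spec)
        return [spec] * n_full + ([remainder] if remainder else [])
    assert sum(spec) == length, "chunk sizes must add up to the array size"
    return list(spec)

def bounds(chunks):
    """Cumulative bounds starting at 0, e.g. [5, 5, 5, 15, ...] -> [0, 5, 10, 15, 30, ...]."""
    return [0] + list(itertools.accumulate(chunks))

def locate(index, chunk_shape, shape):
    """Return (chunk grid coordinates, within-chunk index) for a single array index."""
    chunk_coords, offsets = [], []
    for i, spec, length in zip(index, chunk_shape, shape):
        b = bounds(expand_chunks(spec, length))
        c = bisect.bisect_right(b, i) - 1   # the chunk whose bounds contain index i
        chunk_coords.append(c)
        offsets.append(i - b[c])
    return tuple(chunk_coords), tuple(offsets)

# The example from the text: chunk_shape [[5, 5, 5, 15, 15, 20, 35], 10] on a (100, 100) array
print(locate((17, 17), [[5, 5, 5, 15, 15, 20, 35], 10], (100, 100)))
# -> ((3, 1), (2, 7)), i.e. key "c3/1", within-chunk index (2, 7)
```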

## Related Work

### Dask

`dask.array` uses rectangular chunking internally, and is one of the major consumers of zarr data. Much of the
code translating logical slices into slices on the individual chunks should be reusable.
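For illustration, dask.array already exposes exactly this tuple-of-tuples chunk layout (this uses current dask,
not any proposed zarr API):

```python
import dask.array as da

# dask.array natively supports explicit, unequal chunk sizes along a dimension
x = da.ones(1000, chunks=((100, 300, 500, 100),))
print(x.chunks)   # ((100, 300, 500, 100),)
```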

### Parquet / Arrow

Arrow describes tables as a collection of record batches, and there is no restriction on the sizes of these batches.
This is not only very flexible, but can also be used as an indexing strategy for low-cardinality columns within
parquet, as in the partitioned layout below.

```
dataset_name/
year=2007/
month=01/
0.parq
1.parq
...
month=02/
0.parq
1.parq
...
month=03/
...
year=2008/
month=01/
...
...
```
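Returning to the record-batch point above, a minimal pyarrow sketch (standard pyarrow API, shown here only for
comparison) builds a table from batches of different lengths:

```python
import pyarrow as pa

# A table assembled from record batches of unequal length
batches = [
    pa.record_batch([pa.array([1, 2, 3])], names=["x"]),
    pa.record_batch([pa.array([4, 5, 6, 7, 8])], names=["x"]),
]
table = pa.Table.from_batches(batches)
print(table.num_rows)                            # 8
print([b.num_rows for b in table.to_batches()])  # [3, 5]
```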

This feature was cited as one of the reasons parquet was chosen over zarr for dask
dataframes: https://github.com/dask/dask/issues/1599

### awkward array

https://github.com/zarr-developers/zarr-specs/issues/62


## Implementation

It is hoped that much of the code can be adapted from dask.array, which already allows variable chunk sizes
along each dimension.

## Alternatives

### Just tune chunk sizes

https://github.com/zarr-developers/zarr-specs/issues/62#issuecomment-1100806513


## Discussion


## References and Footnotes

* Previous discussion:
* [Zarr Dask Table dask/dask#1599](https://github.com/dask/dask/issues/1599)
* [Protocol extensions for awkward arrays zarr-developers/zarr-specs#62](https://github.com/zarr-developers/zarr-specs/issues/62)
* [Handling arrays with non-uniform chunking zarr-developers/zarr-specs#40](https://github.com/zarr-developers/zarr-specs/issues/40)
* [Chunk spec zarr-developers/zarr-specs#7](https://github.com/zarr-developers/zarr-specs/issues/7#issuecomment-468127219)



## Copyright

This document has been placed in the public domain.
