TBD: Extensions to develop with v3 spec. #89

Closed
Carreau opened this issue Aug 24, 2020 · 3 comments
Labels
core-protocol-v3.0 (Issue relates to the core protocol version 3.0 spec), protocol-extension (Protocol extension related issue)

Comments

@Carreau
Contributor

Carreau commented Aug 24, 2020

Named dimensions would need to be described in "Array Metadata".

@jpivarski

I've been discussing an extension in #62: storing and retrieving Awkward Arrays in Zarr v3.

An Awkward Array has rich data types:

>>> import awkward1 as ak
>>> ak_array = ak.Array([[{"x": 1, "y": [1]}, {"x": 2, "y": [1, 2]}], [], [{"x": 3, "y": [1, 2, 3]}]])
>>> ak_array
<Array [[{x: 1, y: [1]}, ... y: [1, 2, 3]}]] type='3 * var * {"x": int64, "y": v...'>
>>> ak_array.type
3 * var * {"x": int64, "y": var * int64}
>>> ak_array.tolist()
[[{'x': 1, 'y': [1]}, {'x': 2, 'y': [1, 2]}], [], [{'x': 3, 'y': [1, 2, 3]}]]

and the idea is to deconstruct it into flat arrays (of different lengths) and store them as a group in Zarr, with some sort of tag that prevents users from accidentally accessing the array pieces. On the common path, it should complain that you need Awkward to reconstruct the array. (There could be a "developer's path" that lets experts access the array pieces.)
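
To make the "tag" idea a bit more concrete, here is a rough sketch of what such a guard might look like. Everything in it is hypothetical: the `GuardedGroup` class, the `developer` flag, and the notion that the tag names a required Python module are assumptions for illustration, not anything defined by Zarr or Awkward.

```python
import importlib.util


class GuardedGroup:
    """Hypothetical wrapper around a group that carries an extension tag."""

    def __init__(self, store, required_module, developer=False):
        self._store = store               # any mapping of key -> array piece
        self._required = required_module  # assumed tag: module needed to rebuild
        self._developer = developer       # the expert "developer's path"

    def __getitem__(self, key):
        # Common path: refuse to hand out raw pieces.
        if not self._developer:
            raise PermissionError(
                f"{key!r} is one piece of an array that needs "
                f"{self._required!r} to reconstruct; pass developer=True "
                "to read the raw pieces"
            )
        return self._store[key]

    def can_reconstruct(self):
        # True only if the module named by the tag is importable.
        return importlib.util.find_spec(self._required) is not None


pieces = {"node2": [1, 2, 3]}

# "json" stands in for the required library so the sketch runs anywhere.
guarded = GuardedGroup(pieces, "json")
try:
    guarded["node2"]
    blocked = False
except PermissionError:
    blocked = True

expert = GuardedGroup(pieces, "json", developer=True)
raw = expert["node2"]
```

A real extension would presumably read the tag from the group's metadata rather than take it as a constructor argument; the point here is only the two access paths.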

This is what the above array looks like when deconstructed:

>>> container = {}
>>> form, container, num_partitions = ak.to_arrayset(ak_array, container=container)

The container is any MutableMapping, which would likely be the Zarr backend, and the form is (convertible to/from) a JSON description of how to put the array back together from its pieces. (The num_partitions is None in this case; in general, these arrays can be chunked, which would map onto Zarr chunking.)

>>> container
{'node0-offsets': array([0, 2, 2, 3], dtype=int64),
 'node2': array([1, 2, 3]),
 'node3-offsets': array([0, 1, 3, 6], dtype=int64),
 'node4': array([1, 1, 2, 1, 2, 3])}
>>> form
{
    "class": "ListOffsetArray64",
    "offsets": "i64",
    "content": {
        "class": "RecordArray",
        "contents": {
            "x": {
                "class": "NumpyArray",
                "itemsize": 8,
                "format": "l",
                "primitive": "int64",
                "form_key": "node2"
            },
            "y": {
                "class": "ListOffsetArray64",
                "offsets": "i64",
                "content": {
                    "class": "NumpyArray",
                    "itemsize": 8,
                    "format": "l",
                    "primitive": "int64",
                    "form_key": "node4"
                },
                "form_key": "node3"
            }
        },
        "form_key": "node1"
    },
    "form_key": "node0"
}
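
As a sketch of how the form ties back to the container, the function below walks the form and lists the container keys it refers to. The naming rule it uses (a `ListOffsetArray64` node owns `<form_key>-offsets`, a `NumpyArray` node owns `<form_key>`, and a `RecordArray` owns no buffer) is inferred from this one example only; `buffer_keys` is a hypothetical helper, not part of Awkward, and it handles just the three node classes that appear here.

```python
# The form printed above, written as a plain Python dict.
form = {
    "class": "ListOffsetArray64",
    "offsets": "i64",
    "content": {
        "class": "RecordArray",
        "contents": {
            "x": {
                "class": "NumpyArray",
                "itemsize": 8,
                "format": "l",
                "primitive": "int64",
                "form_key": "node2",
            },
            "y": {
                "class": "ListOffsetArray64",
                "offsets": "i64",
                "content": {
                    "class": "NumpyArray",
                    "itemsize": 8,
                    "format": "l",
                    "primitive": "int64",
                    "form_key": "node4",
                },
                "form_key": "node3",
            },
        },
        "form_key": "node1",
    },
    "form_key": "node0",
}


def buffer_keys(node):
    """Yield the container keys that this form node and its children refer to."""
    cls = node["class"]
    if cls == "NumpyArray":
        yield node["form_key"]
    elif cls == "ListOffsetArray64":
        yield node["form_key"] + "-offsets"
        yield from buffer_keys(node["content"])
    elif cls == "RecordArray":
        # Assumed from the example: a record has no buffer of its own.
        for child in node["contents"].values():
            yield from buffer_keys(child)
    else:
        raise NotImplementedError(cls)


keys = sorted(buffer_keys(form))
print(keys)
```

For this form, the keys it finds are the same four that appear in `container`, which is the kind of consistency check an extension handler could run before attempting reconstruction.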

This is all the information that would be needed to reconstitute the Awkward Array, which the Zarr v3 extension handler would trigger (or complain that Awkward isn't installed, or something).

>>> reconstituted = ak.from_arrayset(form, container)
>>> reconstituted.type
3 * var * {"x": int64, "y": var * int64}
>>> reconstituted.tolist()
[[{'x': 1, 'y': [1]}, {'x': 2, 'y': [1, 2]}], [], [{'x': 3, 'y': [1, 2, 3]}]]
>>> reconstituted
<Array [[{x: 1, y: [1]}, ... y: [1, 2, 3]}]] type='3 * var * {"x": int64, "y": v...'>
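
For readers unfamiliar with offsets buffers, here is a minimal hand-rolled sketch of the same reconstruction using only plain Python lists. It is not how `ak.from_arrayset` is implemented; `split_by_offsets` is a hypothetical helper, and the buffers are copied from the `container` printed above.

```python
# Buffers copied from `container`, as plain lists.
container = {
    "node0-offsets": [0, 2, 2, 3],
    "node2": [1, 2, 3],
    "node3-offsets": [0, 1, 3, 6],
    "node4": [1, 1, 2, 1, 2, 3],
}


def split_by_offsets(offsets, items):
    """Sublist i is items[offsets[i]:offsets[i + 1]]."""
    return [items[start:stop] for start, stop in zip(offsets[:-1], offsets[1:])]


# "y" field: node3 offsets cut the flat node4 content into inner lists.
y = split_by_offsets(container["node3-offsets"], container["node4"])

# "x" field: a flat array with one entry per record.
x = container["node2"]

# RecordArray (node1): pair up the fields element by element.
records = [{"x": xi, "y": yi} for xi, yi in zip(x, y)]

# Outer lists: node0 offsets cut the records into the top-level lists.
rebuilt = split_by_offsets(container["node0-offsets"], records)
print(rebuilt)
```

The repeated `2` in `node0-offsets` is what encodes the empty list in the middle: the slice `records[2:2]` has no elements.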

That, in a nutshell, is the extension I'd like to develop for Zarr v3 and its library interface. For reference, here's the documentation for ak.to_arrayset and ak.from_arrayset.

Carreau added the core-protocol-v3.0 label Oct 12, 2020

@DennisHeimbigner

Why is this not solvable by an extension type?

@jstriebel
Member

IMO this issue is covered by other issues already:

I'm therefore closing this; please feel free to object if I'm missing something.

jstriebel moved this to Done in ZEP1 Nov 23, 2022