TBD: Extensions to develop with v3 spec. #89
Comments
I've been discussing an extension in #62: storing and retrieving Awkward Arrays in Zarr v3 as an extension. An Awkward Array has rich data types:
>>> import awkward1 as ak
>>> ak_array = ak.Array([[{"x": 1, "y": [1]}, {"x": 2, "y": [1, 2]}], [], [{"x": 3, "y": [1, 2, 3]}]])
>>> ak_array
<Array [[{x: 1, y: [1]}, ... y: [1, 2, 3]}]] type='3 * var * {"x": int64, "y": v...'>
>>> ak_array.type
3 * var * {"x": int64, "y": var * int64}
>>> ak_array.tolist()
[[{'x': 1, 'y': [1]}, {'x': 2, 'y': [1, 2]}], [], [{'x': 3, 'y': [1, 2, 3]}]]
The idea is to deconstruct it into flat arrays (of different lengths) and store them as a group in Zarr with some sort of tag that prevents users from accidentally accessing the array pieces; it should complain that you need Awkward to reconstruct it, at least on the common path. (There could be a "developer's path" that lets experts access the array pieces.) This is what the above array looks like when deconstructed:
>>> container = {}
>>> form, container, num_partitions = ak.to_arrayset(ak_array, container=container)
>>> container
{'node0-offsets': array([0, 2, 2, 3], dtype=int64),
 'node2': array([1, 2, 3]),
 'node3-offsets': array([0, 1, 3, 6], dtype=int64),
 'node4': array([1, 1, 2, 1, 2, 3])}
>>> form
{
    "class": "ListOffsetArray64",
    "offsets": "i64",
    "content": {
        "class": "RecordArray",
        "contents": {
            "x": {
                "class": "NumpyArray",
                "itemsize": 8,
                "format": "l",
                "primitive": "int64",
                "form_key": "node2"
            },
            "y": {
                "class": "ListOffsetArray64",
                "offsets": "i64",
                "content": {
                    "class": "NumpyArray",
                    "itemsize": 8,
                    "format": "l",
                    "primitive": "int64",
                    "form_key": "node4"
                },
                "form_key": "node3"
            }
        },
        "form_key": "node1"
    },
    "form_key": "node0"
}
This is all the information needed to reconstitute the Awkward Array, which the Zarr v3 extension handler would trigger (or complain that Awkward isn't installed, or something).
>>> reconstituted = ak.from_arrayset(form, container)
>>> reconstituted.type
3 * var * {"x": int64, "y": var * int64}
>>> reconstituted.tolist()
[[{'x': 1, 'y': [1]}, {'x': 2, 'y': [1, 2]}], [], [{'x': 3, 'y': [1, 2, 3]}]]
>>> reconstituted
<Array [[{x: 1, y: [1]}, ... y: [1, 2, 3]}]] type='3 * var * {"x": int64, "y": v...'>
That, in a nutshell, is the extension I'd like to develop for Zarr v3 and its library interface. For reference, here's the documentation for ak.to_arrayset and ak.from_arrayset.
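To make the role of the form and the container concrete, here is a minimal pure-Python sketch of the reconstruction step and of the "tag" idea. It is not how Awkward or Zarr implement anything: the buffers are written as plain lists instead of NumPy arrays, the form is trimmed to the keys the sketch reads, and `rebuild` and `AwkwardGroup` are hypothetical names invented for illustration. It handles only the three node classes that appear in the example above.

```python
# Flat buffers from the example above, as plain lists (assumption: no NumPy).
container = {
    "node0-offsets": [0, 2, 2, 3],
    "node2": [1, 2, 3],
    "node3-offsets": [0, 1, 3, 6],
    "node4": [1, 1, 2, 1, 2, 3],
}

# The form from the example, trimmed to the keys this sketch actually reads.
form = {
    "class": "ListOffsetArray64",
    "form_key": "node0",
    "content": {
        "class": "RecordArray",
        "form_key": "node1",
        "contents": {
            "x": {"class": "NumpyArray", "form_key": "node2"},
            "y": {
                "class": "ListOffsetArray64",
                "form_key": "node3",
                "content": {"class": "NumpyArray", "form_key": "node4"},
            },
        },
    },
}


def rebuild(form, container):
    """Hypothetical helper: walk the form and reassemble nested Python lists."""
    cls = form["class"]
    key = form["form_key"]
    if cls == "NumpyArray":
        # Leaf node: the buffer is the data itself.
        return list(container[key])
    if cls == "ListOffsetArray64":
        # offsets[i]:offsets[i + 1] delimits the i-th variable-length list.
        offsets = container[key + "-offsets"]
        content = rebuild(form["content"], container)
        return [content[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]
    if cls == "RecordArray":
        # One column per field; zip the columns back into records.
        columns = {name: rebuild(sub, container) for name, sub in form["contents"].items()}
        length = min((len(col) for col in columns.values()), default=0)
        return [{name: col[i] for name, col in columns.items()} for i in range(length)]
    raise ValueError("node class not handled by this sketch: " + cls)


class AwkwardGroup:
    """Hypothetical tagged group that hides raw buffers on the common path."""

    def __init__(self, form, container):
        self._form = form
        self._container = dict(container)

    def __getitem__(self, key):
        # Common path: refuse direct access to the array pieces.
        raise TypeError(
            "this group holds pieces of an Awkward Array; "
            "use to_list() or buffers(developer=True)"
        )

    def buffers(self, developer=False):
        # Developer's path: experts can opt in to the raw pieces.
        if not developer:
            raise PermissionError("pass developer=True to access raw buffers")
        return self._container

    def to_list(self):
        return rebuild(self._form, self._container)
```

Calling `AwkwardGroup(form, container).to_list()` gives back the same nested structure as `ak_array.tolist()` above, while `group["node2"]` raises instead of silently handing out one of the pieces. A real extension would presumably carry the tag in the group's metadata rather than in a Python class; that part is left open here.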
Why is this not solvable by an extension type?
IMO this issue is covered by other issues already:
I'm therefore closing this; please feel free to object if I'm missing something.
Named dimensions would need to be described in "Array Metadata"