consolidated metadata storage for Zarr v3 #136

grlee77 · 2022-03-11T02:58:59Z

In the current draft of the v3 spec in zarr-developers/zarr-python#898, consolidated metadata has been implemented similarly to v2. Currently this metadata gets stored in the group meta/root/consolidated, but this is not part of the spec. Likely we need to specify an extension for how consolidated metadata should be handled.

The text was updated successfully, but these errors were encountered:

ianliu · 2022-09-06T12:54:03Z

Is the V3 considering metadata compression? I'm working with a dataset where the consolidated metadata (v2) has 1.2GB in size, and opening it from S3 takes a very long time to open.
After gzipping the metadata, it reduces to 20MB, so it would be nice to have compressed metadata.

alimanfoo · 2022-09-07T08:45:58Z

Hi @ianliu, thanks this is very useful to know. Adding a capability for compression into a v3 extension for consolidated metadata sounds like a good idea to me.

rabernat · 2022-09-07T11:25:23Z

@ianliu I'm curious what makes the metadata so big in your case? Do you have lots of arrays? Or very large custom attrs?

V3 allows for defining of custom metadata encoding, so a compressed format could be used.

Also, a major question for V3 is whether user metadata is a separate document or not (see #149 (comment)). If you have 1GB of custom attrs in a single array, that would be a strong motivation not to put user metadata together with the core array metadata.

ianliu · 2022-09-07T13:28:07Z

@rabernat in my particular case, I have many small arrays. It could be argued that there is a better data representation that would consolidate multiple arrays into a single array, thus reducing metadata, but the way our program is setup makes this a little bit trickier (mainly because our program produces results in parallel).

joshmoore · 2022-09-12T14:13:22Z

💯 for potential consolidated metadata features but

but this is not part of the spec. Likely we need to specify an extension for how consolidated metadata should be handled.

would this suggest stripping the current consolidated metadata support in V3 until there's an extension?

TomAugspurger · 2024-08-21T18:42:39Z

I've drafted a version of this at https://github.com/TomAugspurger/zeps/blob/consolidated-metadata/draft/ZEP0006.md. A few questions before turning that into a PR:

Is this appropriate for a ZEP? Or would it go in the zarr-specs repo?

One major design decision is whether this consolidated metadata should be embedded in the zarr.json, or whether it should go in a file next to the root zarr.json (e.g. consolidated_metadata.json). Doing it in the zarr.json would make collecting all the metadata possible in a single HTTP request. With separate zarr.json and consolidated_metadata.json you would need two. The downside of putting it in zarr.json is potentially making any read of that file slow, even if you don't need the consolidated metadata.

I'm also a bit curious about how much async zarr-python reduces the pain of hierarchies with a large number of nodes. With dozens of HTTP requests in flight at the same time, maybe it's not so bad? If #284 (listing the child nodes in the root zarr.json) were implemented, then you'd be able to make all but the first HTTP request concurrently.

jhamman · 2024-08-21T22:09:38Z

I like the draft ZEP @TomAugspurger. Thanks for putting it together.

As far as I understand, the current spec does not provide a way to add arbitrary extensions to a metadata document (this could be changed though). Provided we can address this challenge, we likely will want to support both an inline and external consolidated metadata object.

// inline
"consolidated_metadata": {
    "kind": "inline":
    "metadata": {
        "air": {
          "shape": [2920, 25, 53],
          "fill_value": 0,
          ...
          "node_type": "array",
      },
      "lat": {
          "shape": [25],
          "fill_value": NaN,
          ...
          "node_type": "array",
      }
  }
}
// external
"consolidated_metadata": {
    "kind": "external":
    "metadata": "consolidated_metadata.json"
}

An alternative direction we could take is to define a Consolidated Metadata Store which could enable this functionality without needing to touch the root metadata document.

TomAugspurger · 2024-08-22T01:53:15Z

the current spec does not provide a way to add arbitrary extensions to a metadata document

In the sense that that's just not spelled out anywhere as a thing that can be done? Or that adding additional keys might break things?

we likely will want to support both an inline and external consolidated metadata object

I like the idea of future-proofing things by including a "kind" enum to indicate where to find the consolidated metadata, maybe we pick only one for the spec? This is minor, but I worry about putting too much complexity on the implementors.

jhamman · 2024-08-22T06:33:21Z

In the sense that that's just not spelled out anywhere as a thing that can be done? Or that adding additional keys might break things?

I don't think we have defined the expected behavior for what will happen if an implementation encounters and unexpected key. So yes, I think thinks could break.

edit: there is this section in the spec:

The array metadata object must not contain any other names. Those are reserved for future versions of this specification. An implementation must fail to open Zarr hierarchies, groups or arrays with unknown metadata fields, with the exception of objects with a "must_understand": false key-value pair.

Now, I don't know how/where you would put "must_understand" in this case. See also #270

maybe we pick only one for the spec?

+1 one, pick one for now. We can always expand this later.

TomAugspurger · 2024-08-22T16:42:48Z

I'm not sure I see the motivation for disallowing extra keys in the metadata objects. That seems to make it hard to evolve the spec in forwards-compatible ways.

Anyway, maybe in this case it's fine to use because we're making a "future version of the spec"? And I guess we'll set must_understand: false to indicate that older versions shouldn't error, they should just ignore the extra key and fall back to the non-consolidated metadata.

joshmoore · 2024-08-23T13:55:01Z

Is this appropriate for a ZEP? Or would it go in the zarr-specs repo?

In my mind, this is a great ZEP.

from PR (TODO: is there a better way to indicate that a dataset uses an extension?)

That should definitely be some form of metadata in zarr.json. I'd very much be for using this as a way to kick the tires of the extension process. i.e., if you/we can't do it, then that needs to be fixed at the spec level.

jhamman commented 2 days ago
As far as I understand, the current spec does not provide a way to add arbitrary extensions to a metadata document (this could be changed though)

I think it's implicitly possible but not (well-)documented. So 👍 for doing what we need to do to make it possible.

TomAugspurger commented 20 hours ago
An alternative direction we could take is to define a Consolidated Metadata Store which could enable this functionality without needing to touch the root metadata document.

I'd urge caution here. This led to many issues in v2. Minimally it should be a "wrapping store" so that they can be composed, because otherwise they lead to an explosion of classes that are needed. (e.g. "NestedRemoteDirectoryStore") Additionally, I'd suggestthe metadata MUST be in the zarr.json regardless of the implementation.

Now, I don't know how/where you would put "must_understand" in this case. See also #270

👍 for having must_understand set. FWIW, I've also been considering writing a ZEP to say that "all metadata will be in a RO-Crate file" for OME-Zarr. This might could go hand in hand. My thinking was there should be a flag in zarr.json to say "all metadata loading should follow this other protocol", i.e. a loader or something that could be pip installed.

TomAugspurger · 2024-08-23T16:22:55Z

Whoops, I just pushed this as a PR at #309, rather than a ZEP, sorry.

That should definitely be some form of metadata in zarr.json. I'd very much be for using this as a way to kick the tires of the extension process. i.e., if you/we can't do it, then that needs to be fixed at the spec level.

I'll open a separate issue to discuss this. In my experience, STAC's method of each object containing a stac_extensions list with a list of extensions has worked well.

grlee77 mentioned this issue Mar 11, 2022

Implement Zarr V3 protocol zarr-developers/zarr-python#898

Merged

6 tasks

grlee77 mentioned this issue Apr 20, 2022

misc Zarr V3 bug fixes: open_group, open_consolidated and MemoryStoreV3 zarr-developers/zarr-python#1006

Merged

3 tasks

rabernat mentioned this issue Aug 10, 2022

Beyond consolidated metadata for V3: inspiration from Apache Iceberg #154

Closed

joshmoore mentioned this issue Oct 9, 2022

V3 Consolidated Metadata zarr-developers/zarr-python#1161

Closed

jhamman mentioned this issue Nov 9, 2022

Using kerchunk to reference large sets of netcdf4 files fsspec/kerchunk#240

Closed

jstriebel added this to ZEP1 Nov 10, 2022

jstriebel added the protocol-extension Protocol extension related issue label Nov 15, 2022

jstriebel removed this from ZEP1 Nov 15, 2022

This was referenced Nov 16, 2022

v2: Standardizing .zmetadata #113

Open

Proposal: group to list it's children #15

Open

rabernat mentioned this issue Jan 18, 2023

v3 core spec: Consider to drop /meta prefix, have file at URI #177

Closed

TomAugspurger linked a pull request Aug 23, 2024 that will close this issue

Added consolidated metadata to spec #309

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

consolidated metadata storage for Zarr v3 #136

consolidated metadata storage for Zarr v3 #136

grlee77 commented Mar 11, 2022

ianliu commented Sep 6, 2022

alimanfoo commented Sep 7, 2022

rabernat commented Sep 7, 2022

ianliu commented Sep 7, 2022

joshmoore commented Sep 12, 2022

TomAugspurger commented Aug 21, 2024

jhamman commented Aug 21, 2024

TomAugspurger commented Aug 22, 2024

jhamman commented Aug 22, 2024 •

edited

Loading

TomAugspurger commented Aug 22, 2024

joshmoore commented Aug 23, 2024

TomAugspurger commented Aug 23, 2024

consolidated metadata storage for Zarr v3 #136

consolidated metadata storage for Zarr v3 #136

Comments

grlee77 commented Mar 11, 2022

ianliu commented Sep 6, 2022

alimanfoo commented Sep 7, 2022

rabernat commented Sep 7, 2022

ianliu commented Sep 7, 2022

joshmoore commented Sep 12, 2022

TomAugspurger commented Aug 21, 2024

jhamman commented Aug 21, 2024

TomAugspurger commented Aug 22, 2024

jhamman commented Aug 22, 2024 • edited Loading

TomAugspurger commented Aug 22, 2024

joshmoore commented Aug 23, 2024

TomAugspurger commented Aug 23, 2024

jhamman commented Aug 22, 2024 •

edited

Loading