v3 core spec: Consider to drop /meta prefix, have file at URI #177

Closed
jstriebel opened this issue Nov 24, 2022 · 35 comments · Fixed by #200
Labels
core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec

Comments

@jstriebel
Member

jstriebel commented Nov 24, 2022

Citing @jbms from #149 (comment):

As discussed on the community meeting, this naming scheme has the drawback in that there is not a good way to have a path directly to a non-root array. Additionally, it was noted that to better integrate with existing filesystem completion in editors, etc., it would be helpful if the path were a real filesystem path.

A few proposals were made:

  • Instead of "meta/root" + P + ".group.json" as the key for the metadata file, have it just be: P + ".zr3". This would lead to a key of just ".zr3" for the root. It was noted though that with a consolidated metadata extension these metadata files would not actually exist. The data could be placed in e.g. "_data" or something, where names beginning with an underscore could be reserved to prevent conflicts. This would still require a way to locate the root directory --- that could be done by storing the relative path inside the array metadata.
  • Alternatively, we could leave the directory structure alone but require an extension as part of the directory containing the root, e.g. "foo.zr3", and disallow any array or group names from ending in ".zr3". Then you could use: "path/to/root.zr3/path/to/array" as a pseudo-path to an individual array. The downside is that it may be confusing to use something that looks like a path, and where a portion corresponds to a real filesystem path but a portion does not. File completion in editors also wouldn't support this.
  • We could use a special syntax to combine both a path to a root and a path to an array into a single string, e.g. "path/to/root//path/to/array" or "path/to/root#/path/to/array". We would need to carefully choose the syntax to avoid conflicts with e.g. fsspec, and file completion in editors also wouldn't support this.

Some more comments from discussion rounds I remember:

  • .json suffix is useful to have a correct mimetype by default for many stores, e.g. S3
  • Having a URI (TBD, see v3: Define standard "URL" syntax for referencing a specific array, group, attribute within a zarr repository #132) to an array or group, it would be great to actually find a directory or file there. E.g. s3://bucket-name/key-name/name-of-the-zarr-path.zarr/hierarchy/path/my-data.array.json could be a URI to point to the my-data array at the path hierarchy/path/my-data of the zarr hierarchy which is placed under s3://bucket-name/key-name/name-of-the-zarr-path.zarr/. (Just made up a URI here as an example, feel free to discuss this in v3: Define standard "URL" syntax for referencing a specific array, group, attribute within a zarr repository #132). Using such a URI schema and dropping the /meta prefix, one could find the relevant file (at least for filesystem or http stores or using appropriate clients for other stores).
  • The original motivation to have /meta and /data separate is to be able to list all meta keys without also listing the chunk files for efficiency reasons. If it's possible to exclude directories for key-listings for most relevant stores, only using a prefix for the chunk files would still give this efficiency, but it's unclear if that's the case.
  • It might be useful to be able to place chunk-files in arbitrary locations (possibly even other stores). This could be added as an extension, but can also be considered for the core spec.

Pinging discussion participants I remember so far: @joshmoore @jbms @rabernat @WardF

@jstriebel jstriebel added the core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec label Nov 24, 2022
@jstriebel jstriebel added this to ZEP1 Nov 24, 2022
@jstriebel jstriebel moved this to In Discussion in ZEP1 Nov 24, 2022
@jbms
Contributor

jbms commented Nov 28, 2022

Due to the global storage transformers proposal #182 from @rabernat, my opinion on this has changed a bit:

  • With a global storage transformer in use, there is no guarantee that the array metadata exists as a file in the underlying store. Therefore, the URL syntax should not depend on that.
  • Having something that looks like a file path, "path_to_zarr.zarr/group/array", but where "path_to_zarr.zarr" is a real directory, but "group/array" may not be contained as a file/directory within it, may be confusing and lead users unfamiliar with the special convention to believe files are missing. Instead, I think it would be better to use a more explicit syntax, like "path_to_zarr.zarr#/path/to/array" since that avoids any confusion about which part is a real path, and will work even if users do not use the suggested ".zarr" file extension.
  • The "meta" and "data" prefixes were intended to allow efficiently listing all of the metadata files when using stores like s3, gcs, and other "pure" key-value stores that do not have real directories. However, it provides no benefit for filesystem-backed stores that do have directories. Furthermore, both s3 and gcs support emulating directories via the "delimiter" option to the list API, which allows efficient directory-like listing. Therefore, it is not clear that this naming scheme is really needed for the most common use case for which it seems to have been designed.
  • This naming scheme still doesn't work that well if you have a very large number of arrays --- for example if you have 1 million arrays within a hierarchy of groups (not implausible for some use cases), if the underlying store doesn't support directory-based listing, then to list all of the top-level groups would require listing all million arrays. There are alternative naming schemes that would add directory support on top of a pure key-value store, that could be implemented as a storage transformer. For example, each directory is assigned a uuid (similar to how directories are implemented in filesystems).
  • A natural extension of global storage transformers is per-group storage transformers --- this could be used to link an entire group from some external location. Per-group storage transformers are much more naturally defined if each group corresponds to a single prefix in the key space of the underlying store.
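
The directory emulation via a "delimiter" mentioned above can be illustrated without any cloud SDK. This is a minimal sketch, assuming a flat key space held in a Python list; `list_dir` is a hypothetical helper that mimics prefix-plus-delimiter listing semantics, not a call from the s3 or gcs client libraries, and the keys are made up for illustration.

```python
def list_dir(keys, prefix="", delimiter="/"):
    """Emulate a delimiter-based listing over a flat key space.

    Returns (files, subdirs): the keys directly under `prefix`, and the
    distinct "common prefixes" one level below it.
    """
    files, subdirs = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # Everything below the first delimiter collapses into one entry.
            subdirs.add(prefix + head + delimiter)
        else:
            files.append(key)
    return sorted(files), sorted(subdirs)


# Illustrative keys only; the names are assumptions, not a spec'd layout.
keys = [
    "meta/root.json",
    "meta/root/a.array.json",
    "data/root/a/c0/0",
    "data/root/a/c0/1",
]
files, subdirs = list_dir(keys, prefix="meta/")
```

In this local emulation every key is still scanned; with a real delimiter-aware list API the grouping happens on the server, so the chunk keys below a collapsed prefix are never returned to the client.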

In summary, I think zarr v3 should just use a zarr v2-like naming scheme by default, and rely on a storage transformer extension to provide an alternative naming scheme if it is desired.

This would also allow the root metadata file, and the concept of a root, to be eliminated entirely --- instead just the group metadata file could be used.

@rabernat
Contributor

There's a lot to unpack here, but let me just respond to one specific point first.

  • The "meta" and "data" prefixes were intended to allow efficiently listing all of the metadata files when using stores like s3, gcs, and other "pure" key-value stores that do not have real directories. However, it provides no benefit for filesystem-backed stores that do have directories.

Even on traditional POSIX filesystems, there is a strong benefit to separating metadata and chunks into separate directories. Here's an example: we frequently use Zarr on NASA's Pleiades supercomputer. This system has a large shared filesystem which performs well for some operations. However, listing directories with a large number of files in them can be extremely slow. Separating metadata from chunks allows us to quickly discover the structure of a Zarr hierarchy without having to list millions of chunks.

@jstriebel
Member Author

Interesting idea, I guess that's also where the motivation for #184 is coming from, @jbms?

  • With a global storage transformer in use, there is no guarantee that the array metadata exists as a file in the underlying store. Therefore, the URL syntax should not depend on that.

True, especially with something like consolidated metadata.

  • Having something that looks like a file path, "path_to_zarr.zarr/group/array", but where "path_to_zarr.zarr" is a real directory, but "group/array" may not be contained as a file/directory within it, may be confusing and lead users unfamiliar with the special convention to believe files are missing. Instead, I think it would be better to use a more explicit syntax, like "path_to_zarr.zarr#/path/to/array" since that avoids any confusion about which part is a real path, and will work even if users do not use the suggested ".zarr" file extension.

There's a similar discussion going on for OME-NGFF: 144

@jbms
Contributor

jbms commented Nov 29, 2022

There's a lot to unpack here, but let me just respond to one specific point first.

  • The "meta" and "data" prefixes were intended to allow efficiently listing all of the metadata files when using stores like s3, gcs, and other "pure" key-value stores that do not have real directories. However, it provides no benefit for filesystem-backed stores that do have directories.

Even on traditional POSIX filesystems, there is a strong benefit to separating metadata and chunks into separate directories. Here's an example: we frequently use Zarr on NASA's Pleiades supercomputer. This system has a large shared filesystem which performs well for some operations. However, listing directories with a large number of files in them can be extremely slow. Separating metadata from chunks allows us to quickly discover the structure of a Zarr hierarchy without having to list millions of chunks.

While it might be tricky to accomplish with regular shell utilities like GNU find, I think it can be done from zarr implementations relatively easily: start at the root, check if the current directory is an array based on presence of array metadata, if not, list it and recurse on subdirectories.
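
A minimal sketch of that traversal, assuming a dict-backed store with "/"-separated keys and the zarr v2 metadata key ".zarray" as the marker for an array. `child_dirs` and `find_arrays` are hypothetical helper names, not part of any zarr implementation.

```python
def child_dirs(store, prefix):
    """Immediate sub-"directories" of `prefix` in a flat dict-backed store.

    This toy version scans all keys; a filesystem or a delimiter-aware
    list API would return the same result without touching chunk keys.
    """
    dirs = set()
    for key in store:
        if key.startswith(prefix):
            head, sep, _ = key[len(prefix):].partition("/")
            if sep:
                dirs.add(prefix + head + "/")
    return sorted(dirs)


def find_arrays(store, prefix="", array_meta=".zarray"):
    """Walk the hierarchy, stopping at arrays so their chunks are never listed."""
    if prefix + array_meta in store:  # existence check, no listing needed
        return [prefix.rstrip("/")]
    found = []
    for sub in child_dirs(store, prefix):
        found.extend(find_arrays(store, sub, array_meta))
    return found


store = {
    ".zgroup": b"{}",
    "g/.zgroup": b"{}",
    "g/a/.zarray": b"{}",
    "g/a/0.0": b"",
    "b/.zarray": b"{}",
    "b/0.0": b"",
}
arrays = find_arrays(store)
```

The recursion only lists directories that are not arrays, which is the property that keeps chunk directories out of the listing.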

@rabernat
Contributor

rabernat commented Dec 1, 2022

  • With a global storage transformer in use, there is no guarantee that the array metadata exists as a file in the underlying store. Therefore, the URL syntax should not depend on that.

I agree with this.

  • Instead, I think it would be better to use a more explicit syntax, like "path_to_zarr.zarr#/path/to/array" since that avoids any confusion about which part is a real path, and will work even if users do not use the suggested ".zarr" file extension.

I like this idea in general. However, I have found it to be problematic that the entry point to a zarr store is a directory, not an actual file, because in some contexts (e.g. S3) it is impossible to tell whether a directory exists or not. Given that the first thing that all V3 stores must do is open the root metadata document, wouldn't it make more sense to use that as the actual file / URI identifying the zarr store, e.g.

protocol://path/to/store/zarr.json#/path/to/array

This opens the question of whether the root group actually needs to be named zarr.json or could have any name.
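
A minimal sketch of how a client could split such a URI into the store entry point and the path inside the hierarchy, assuming the "#" fragment syntax discussed above is the one adopted (it is only a proposal here). `split_zarr_url` is a hypothetical helper; `urldefrag` is from the Python standard library.

```python
from urllib.parse import urldefrag


def split_zarr_url(url):
    """Split 'store#/path/in/hierarchy' into (store_url, node_path)."""
    store_url, node_path = urldefrag(url)
    return store_url, node_path


store_url, node_path = split_zarr_url(
    "protocol://path/to/store/zarr.json#/path/to/array"
)
```

The same function works for a plain filesystem path such as "path_to_zarr.zarr#/path/to/array", since everything before the "#" is passed through unchanged.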

  • Therefore, it is not clear that this naming scheme is really needed for the most common use case for which it seems to have been designed.

This is an interesting suggestion which I think we should examine closely in today's call. It would be particularly helpful if anyone could recall the original use cases that motivated the new layout proposal. I am trying to think of one (e.g. a case where you are forced to list a directory with millions of chunks in it just to find a metadata document), but I am not coming up with anything. 🤔

  • This naming scheme still doesn't work that well if you have a very large number of arrays --- for example if you have 1 million arrays within a hierarchy of groups (not implausible for some use cases), if the underlying store doesn't support directory-based listing, then to list all of the top-level groups would require listing all million arrays.

What is this hypothetical store that doesn't support directory-based listing? As you mentioned above, it's not actually S3 / GCS, since they effectively support directory listing. So is it even an important case to consider? In your previous point, you used the fact that S3 does support directory listing as an argument for reverting to the unified storage layout. If your million arrays are sufficiently nested, traversing the directory tree should be feasible, no?

  • A natural extension of global storage transformers is per-group storage transformers --- this could be used to link an entire group from some external location. Per-group storage transformers are much more naturally defined if each group corresponds to a single prefix in the key space of the underlying store.

This makes sense to me. Basically a generalization of the global storage transformers idea (#182) to act at the group level.


Perhaps one path forward would be to go back to the unified storage layout as the default, and then redefine the separate data / meta layout as a global (or possibly group-level) storage transformer (here called storage_layout).

{
  "storage_transformers": [
    {
      "extension": "https://purl.org/zarr/spec/storage_transformers/storage_layout/1.0",
      "configuration": {
        "metadata_path": "./meta/root/",
        "chunk_path": "./data/root/"
      }
    }
  ]
}

This could be used to provide all sorts of redirection to different storage locations, e.g. pointing at a completely different service. (I would actually prefer this over being able to specify a separate chunkstore object at runtime.)
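
A minimal sketch of how such a transformer might translate keys. The proposal above does not say how metadata keys are told apart from chunk keys, so this assumes they are recognised by a ".json" suffix and that both configured paths act as plain key prefixes. `make_layout_transformer` and `meta_suffix` are hypothetical names.

```python
import posixpath


def make_layout_transformer(metadata_path, chunk_path, meta_suffix=".json"):
    """Return a function mapping logical (unified) keys to physical keys."""

    def to_physical(logical_key):
        # Assumption: metadata documents are exactly the keys ending in
        # `meta_suffix`; every other key is treated as chunk data.
        # Logical keys are assumed to be relative (no leading "/").
        base = metadata_path if logical_key.endswith(meta_suffix) else chunk_path
        return posixpath.normpath(posixpath.join(base, logical_key))

    return to_physical


to_physical = make_layout_transformer("./meta/root/", "./data/root/")
meta_key = to_physical("foo/bar/zarr.json")
chunk_key = to_physical("foo/bar/0/0")
```

With this mapping the client keeps working against the unified key space while the underlying store sees the split meta/data layout.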

@jstriebel
Member Author

Crosslinking the discussion from the ZEP meeting yesterday:
https://hackmd.io/ZilORe8AQvyqH6ArqDw0Cg?view

Two action items regarding this issue:

  • @rabernat to do some performance benchmarking
  • @jbms to propose a new storage layout for v3

The agreement seemed to be to have something similar to the v2 format, but with an extra folder for the chunk-data per array. For an array foo in group mygroup it could look similar to:

mygroup/foo/array.json
mygroup/foo/chunks/0/0
mygroup/foo/chunks/0/1
…

@jbms
Contributor

jbms commented Dec 5, 2022

Before I create a PR, I wanted to discuss a few options for the new storage layout:

The basic assumptions I'm making for the default storage layout are:

  • No special root. The root is just a group that happens to not have a parent group.
  • Arrays can be standalone.
  • Arrays and groups are each fully contained within a single prefix/directory.

In my mind, the key question to consider is whether the directory name should encode whether a group member is an array, a group, or perhaps something else, like a bare key-value store prefix to be used by some extension, such as for storing meshes.


If we don't encode the group member type in the name (as in zarr v2), then we have a layout like:

Option 1:

(Bare array)
zarr.json
0/0/0

(Group)
zarr.json
subgroup/zarr.json
subgroup/myarray/zarr.json
subgroup/myarray/0/0/0
myarray/zarr.json
myarray/0/0/0

Pros:

  • Simplest mapping of zarr logical namespace to the underlying store
  • Prevents a group and array having the same name, in a race-free way assuming the underlying store supports atomic operations on a single key.
  • A query to determine if a given member is present, "myarray" in group, efficiently maps to an existence check on the underlying store.

Cons:

  • When listing a group's members, determining whether a group member is a subgroup or an array requires one extra operation per member (reading the zarr.json key). For a large group (e.g. 1 million members), this operation becomes extremely expensive.
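
A minimal sketch of where that extra cost comes from, assuming a toy in-memory store that counts operations. `CountingStore` and `list_members` are hypothetical, and the "node_type" field inside zarr.json is an assumption made for illustration; this comment does not define the metadata contents.

```python
import json


class CountingStore:
    """Toy key-value store that counts reads and directory listings."""

    def __init__(self, data):
        self._data = dict(data)
        self.reads = 0
        self.lists = 0

    def get(self, key):
        self.reads += 1
        return self._data[key]

    def list_dir(self, prefix):
        """Names of the immediate sub-"directories" under `prefix`."""
        self.lists += 1
        names = set()
        for key in self._data:
            if key.startswith(prefix):
                head, sep, _ = key[len(prefix):].partition("/")
                if sep:
                    names.add(head)
        return sorted(names)


def list_members(store, group_prefix=""):
    """Option 1: one listing, then one metadata read per member."""
    members = {}
    for name in store.list_dir(group_prefix):
        meta = json.loads(store.get(f"{group_prefix}{name}/zarr.json"))
        members[name] = meta["node_type"]  # assumed field name
    return members


store = CountingStore({
    "zarr.json": '{"node_type": "group"}',
    "subgroup/zarr.json": '{"node_type": "group"}',
    "subgroup/myarray/zarr.json": '{"node_type": "array"}',
    "subgroup/myarray/0/0/0": "",
    "myarray/zarr.json": '{"node_type": "array"}',
    "myarray/0/0/0": "",
})
members = list_members(store)
```

The number of reads grows linearly with the number of group members, which is the cost described in the bullet above.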

If we encode the group member type in the name, then we have:

Option 2:

(Bare array)
zarr.json
0/0/0

(Group)
subgroup.zgroup/zarr.json
subgroup.zgroup/myarray.zarray/zarr.json
subgroup.zgroup/myarray.zarray/0/0/0
myarray.zarray/zarr.json
myarray.zarray/0/0/0

Option 2A: (use a file extension only for arrays, not for subgroups, or vice versa)

(Bare array)
zarr.json
0/0/0

(Group)
subgroup/zarr.json
subgroup/myarray.zarray/zarr.json
subgroup/myarray.zarray/0/0/0
myarray.zarray/zarr.json
myarray.zarray/0/0/0

Pros:

  • When listing a group's members, simply listing the group's "directory" is sufficient to determine the type of each group member.

Cons:

  • The added extensions add extra "noise" in the keys used in the underlying store.
  • With Option 2A, we need to impose additional requirements on group member names (namely that they cannot end in ".zarray") to avoid ambiguity.
  • Preventing the creation of a subgroup and an array with the same name requires an extra check on every array creation within a group (which is not an insignificant cost), and is racy if the underlying store does not support multi-key transactions (most stores in current use do not support such transactions).
  • Checking if a given group member is present, if we don't know the type of group member, may require multiple queries on the underlying store.

With this storage layout, there are three possible solutions to the issue of avoiding having a group and array with the same name:
i. Expose the file extensions to the user as part of the group member name. For example, it would be an error for a user to attempt to create/open a "myarray" array within a group; instead, they would need to create a "myarray.zarray" array. It would be allowed to have a "myarray" subgroup and a "myarray.zarray" array within the same group. These special naming requirements may create challenges for code intended to work with multiple hierarchical storage formats, e.g. zarr v2, zarr v3, n5, hdf5.
ii. Leave the prevention of a group and array with the same name up to users; we can require in the spec that users avoid this, but not require implementations to check for this when creating a new group or array. If a group and array with the same name are found when listing a group, the implementation could return a warning or error, but should still allow opening both the array and the group using appropriate APIs.
iii. Explicitly allow a group and array to have the same name, and merely recommend against it. APIs for listing a group would need a way to handle the duplicate names, e.g. subgroups might be returned as "subgroup/" while arrays are returned as "array".
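
A minimal sketch of listing under Option 2, assuming the ".zarray" and ".zgroup" suffixes shown above. It classifies members from directory names alone and reports names used by both a group and an array, which is the check that solution ii leaves optional. `classify_members` is a hypothetical helper.

```python
SUFFIXES = {".zarray": "array", ".zgroup": "group"}


def classify_members(dir_names):
    """Return ([(name, kind), ...], [names used by both kinds])."""
    members = []
    kinds_by_name = {}
    for entry in dir_names:
        for suffix, kind in SUFFIXES.items():
            if entry.endswith(suffix):
                name = entry[: -len(suffix)]
                members.append((name, kind))
                kinds_by_name.setdefault(name, set()).add(kind)
                break  # entries without a known suffix are ignored
    duplicates = sorted(n for n, kinds in kinds_by_name.items() if len(kinds) > 1)
    return members, duplicates


members, duplicates = classify_members(
    ["subgroup.zgroup", "myarray.zarray", "myarray.zgroup", "zarr.json"]
)
```

No metadata reads are needed here, but the duplicate check only detects a conflict after the fact; it does not prevent two writers from creating both entries.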


It is not clear to me how important the "list group members including their type" operation is; if that operation is important, then Option 2 is better.

But currently I lean towards Option 1 since it is simpler, and Option 2 could potentially be provided via an extension.


Note that the chunks of an array could optionally be nested within a chunks/ prefix, e.g. for Option 1

(Bare array)
zarr.json
chunks/0/0/0

(Group)
zarr.json
subgroup/zarr.json
subgroup/myarray/zarr.json
subgroup/myarray/chunks/0/0/0
myarray/zarr.json
myarray/chunks/0/0/0

This naming choice appears to be orthogonal to the choice between Option 1 and Option 2.

I don't think it provides any benefit to a zarr implementation, but may provide a better user experience for users interactively browsing a store, e.g. using path completion --- without a "chunks" prefix, it is easy for a user to accidentally list all the chunks of an array. The "chunks" prefix would serve as a warning, and users could avoid accidentally listing the chunks.

I don't have a strong opinion on whether to use the "chunks/" prefix.

@jstriebel
Member Author

Thanks for the great write-up @jbms!

But currently I lean towards Option 1 since it is simpler, and Option 2 could potentially be provided via an extension.

+1 from me.

I don't have a strong opinion on whether to use the "chunks/" prefix.

I'd prefer to use a prefix (maybe rather _chunks, if we prohibit leading underscores in filenames, see #56 (comment)).
This could also be used to signal if an entry is a group or array by explicitly specifying that groups must not contain a _chunks subdirectory. Then only the existence of a _chunks prefix in a directory entry is enough and arrays and groups can be distinguished via key listings.

PS: I just opened a separate issue about dropping the entrypoint metadata / explicit root: #192. I think it's fair to discuss those two things separately, but 👍 for waiting on a decision there before preparing a PR.

@jbms
Contributor

jbms commented Dec 7, 2022

Thanks for the great write-up @jbms!

But currently I lean towards Option 1 since it is simpler, and Option 2 could potentially be provided via an extension.

+1 from me.

Another argument in favor of this is that you still have to read each individual metadata file to determine such information as data type, shape, etc., which in practice is likely to be more useful even than checking for group vs. array, and consolidated metadata solves group vs. array also.

I don't have a strong opinion on whether to use the "chunks/" prefix.

I'd prefer to use a prefix (maybe rather _chunks, if we prohibit leading underscores in filenames, see #56 (comment)).
This could also be used to signal if an entry is a group or array by explicitly specifying that groups must not contain a _chunks subdirectory. Then only the existence of a _chunks prefix in a directory entry is enough and arrays and groups can be distinguished via key listings.

You could have an empty array with no chunks, and on s3/gcs where there are no real directories, there would be no _chunks directory. Also due to the layout, to avoid also listing chunks on s3 and gcs, a separate list request would be required for each array, and list operations cost 10x as much as read operations.

PS: I just opened a separate issue about dropping the entrypoint metadata / explicit root: #192. I think it's fair to discuss those two things separately, but 👍 for waiting on a decision there before preparing a PR.

@rabernat
Contributor

rabernat commented Dec 8, 2022

As we discussed at the last ZEP meeting, a central tradeoff when deciding whether to keep the split /data /meta layout for V3 is related to how stores are browsed.

| layout | pros | cons |
|---|---|---|
| split (original V3) | because metadata are in a separate hierarchy from the data, it is feasible for a client to simply list all objects in the /meta path (list_prefix) when opening a store and thereby discover the entire nested structure of the store | copying a group or array requires copying from both trees; need a root group; more complicated |
| unified (this proposal) | No special root. The root is just a group that happens to not have a parent group. Arrays can be standalone | performing a list_prefix operation at the root of a large store could be extremely slow / expensive, since it would return both metadata nodes and chunk nodes; the only feasible strategy for exploring a large store is iteratively doing listdir on each child group |

To understand this tradeoff, I did a benchmarking experiment. In this experiment, I create very large hierarchies of documents in S3 using different levels of nesting and compare the time it takes to list all documents vs. traversing the hierarchy. The code is all async and is probably as performant as we can get without a lot more work.

Here are some of the interesting results:

Listing time as a function of number of objects

[image: listing time as a function of number of objects]

  • A good number to have in mind: we can list 10000 objects in ~1 s.
  • The list_flat operation has less spread because it doesn't care whether the hierarchy is deeply nested or not.
  • The spread in the list_recursive reflects the sensitivity to the amount of nesting

Listing time sensitivity to nesting

In the next figures, we have the same number of objects nested in different ways, from flat (depth = 1) to as deep as possible (using a binary tree). We can compare how the different strategies perform.

[images: listing time as a function of nesting depth, for a fixed number of objects]

  • The list_flat operation doesn't care whether the hierarchy is deeply nested or not.
  • The list_recursive operation is very sensitive to nesting
  • For deep hierarchies, list_recursive is roughly 10x slower or more
  • The worst case scenario is 200s to list the 65536 object binary tree
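
One way to see why nesting matters is to count list requests rather than time. This is a back-of-the-envelope sketch, not the benchmark code: it assumes a complete tree in which the object count is the number of leaves, one list request per directory at minimum, and a page size of 1000 keys per flat list request (the S3 maximum). Both function names are hypothetical.

```python
import math


def flat_list_requests(n_objects, page_size=1000):
    """Paged requests needed to list everything under one prefix."""
    return math.ceil(n_objects / page_size)


def min_recursive_list_requests(n_leaves, branching):
    """Lower bound: one listdir per directory of a complete b-ary tree.

    A complete tree with n leaves and branching factor b has
    (n - 1) / (b - 1) internal nodes, i.e. directories to list.
    """
    return (n_leaves - 1) // (branching - 1)


flat = flat_list_requests(65536)
binary = min_recursive_list_requests(65536, 2)
wide = min_recursive_list_requests(65536, 16)
```

Under these assumptions the binary tree needs about a thousand times more requests than the flat listing, so even with concurrent requests the recursive traversal has much more work to do.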

Conclusions

The concern about the cost of recursive listing is real. However, I am not convinced that it is really necessary for a client to discover the entire hierarchy immediately upon opening a store. There is no way to present such a large hierarchy to a user all at once anyways. Instead, it would be better if clients would use a lazy strategy, only performing listdir operations on demand as needed.

Based on this, I think we can move forward with the proposal to drop the separate /meta prefix in the core spec; however, we should make sure to allow it to be implemented as a storage transformer for use cases where this performance cost is important.

@jstriebel
Member Author

Thanks a ton for the experiments, @rabernat!

Based on this, I think we can move forward with the proposal to drop the separate /meta prefix in the core spec; however, we should make sure to allow it to be implemented as a storage transformer for use cases where this performance cost is important.

Very good point! I find the argument that the /meta /data split can still be done as a (group storage transformer?) extension most convincing, this would make the unified tree a superset. The default would be less performant for large hierarchies, but I find it fair to expect more manual tuning for those cases and keep the simple use-cases simple.

@rabernat
Copy link
Contributor

rabernat commented Dec 8, 2022

  • Arrays and groups are each fully contained within a single prefix/directory.

Can we talk about this a bit more? I think I understand the reason why this is desirable: you can copy an array or group by copying a directory. On the other hand, I find myself quite liking the V3 layout for other reasons:

/foo.group.json
/foo/bar.array.json
/foo/bar/c0

Having the metadata document just above the directory in the hierarchy resolves many of the issues related to exploring and concurrently writing stores raised in #177 (comment).

The main pros:

  • The entry point to a group or array is an actual file, not a directory. Opening a group or array is just opening that file.
  • Very easy to determine the presence of groups and arrays at any level of the hierarchy (similar to option 2 above) - no need to open any files or try / except around the possible existence of metadata documents in a directory. Just list a directory and look at the suffixes.
  • Arrays can still be standalone

Cons:

  • Arrays and groups are not fully contained in a single directory, making them less portable

  • Potential contention of writers trying to create a group and array at the same time. This might be mitigated by using a chunk subdirectory, which in theory would permit the simultaneous existence of an array and group of the same name, e.g.

    (foo group with a child array)
    /foo.group.json
    /foo/bar.array.json
    /foo/bar/_chunks/0/0
    
    (foo array at the same location -- no conflict)
    /foo.array.json
    /foo/_chunks/0/0 
    

    This is similar to Option 2.iii. It could be very confusing, but it's technically possible

I feel like we should have a bit more discussion about these tradeoffs before abandoning this layout.


I have serious concerns about Option 1:

  • When listing a group's members, determining whether a group member is a subgroup or an array requires one extra operation per member (reading the zarr.json key).

The requirement that the metadata document must be read to discover the contents of a store, together with the V3 change that array / group metadata and user attributes are all stored in the same file I believe can lead to unacceptable performance degradations in very common use cases. It's typical for NetCDF type datasets to be stored as a flat group with 100s of arrays. Each of these arrays can have a lot of user metadata. This proposal would require that all of that be read in just to open the store.

So if we are going with one of the proposals above, I would favor option 2.

But I actually think I favor option 0 (sticking with the existing V3 layout), minus the /meta / /data separation.

@jbms
Contributor

jbms commented Dec 8, 2022

  • Arrays and groups are each fully contained within a single prefix/directory.

Can we talk about this a bit more? I think I understand the reason why this is desirable: you can copy an array or group by copying a directory. On the other hand, I find myself quite liking the V3 layout for other reasons:

/foo.group.json
/foo/bar.array.json
/foo/bar/c0

Having the metadata document just above the directory in the hierarchy resolves many of the issues related to exploring and concurrently writing stores raised in #177 (comment).

In regards to exploring and concurrently writing a zarr hierarchy, it seems to me that it has similar trade-offs as my option 2 --- we can distinguish groups and arrays when listing a group, but we then have to worry about a group and array having the same name. Are there things I'm missing?

The main pros:

  • The entry point to a group or array is an actual file, not a directory. Opening a group or array is just opening that file.

I guess the idea here is that when listing the location of a dataset, e.g. on a website, you would provide a URL to the metadata file, rather than a URL to the directory? That way if they attempt to just navigate to the URL in a browser, it will return the JSON metadata rather than either a directory listing or an error?

This can already be done with zarr v2 and with both my proposed option 1 and 2, but I guess the difference here is that under your proposed scheme zarr v3 implementations might expect to be passed the path to the metadata file rather than the path to the directory?

I think there is a risk that this could be more confusing to users not familiar with zarr v3: specifying a directory very clearly indicates that the group/array is represented as a collection of files. If you specify a URL to just a single json file, someone may visit it in their browser, download it to their local machine, and then later open it and find an empty group or empty array since they did not download the rest of the files.

  • Very easy to determine the presence of groups and arrays at any level of the hierarchy (similar to option 2 above) - no need to open any files or try / except around the possible existence of metadata documents in a directory. Just list a directory and look at the suffixes.
  • Arrays can still be standalone

One slightly awkward aspect of this approach is that even a standalone array or the root group needs to have a "name". For example, with zarr v2 or my proposed option 1 and option 2, we can have a cloud storage bucket or a zip file that contains just an array or a group at its root. With this approach we instead would need to give it a name, like "array" or "root".

Cons:

  • Arrays and groups are not fully contained in a single directory, making them less portable

We also lose the ability to atomically rename an array or group on stores, such as the filesystem, that support atomic renames of individual files/directories. With two things to rename, if the program crashes in the middle, we end up with a corrupt store.
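
A minimal sketch of the difference on a local filesystem, assuming POSIX-style rename semantics. `rename_unified` and `rename_sidecar` are hypothetical helpers, and the file names follow the layouts discussed in this thread.

```python
import os
import tempfile


def rename_unified(root, old, new):
    """Unified layout: the node is one directory, so one rename suffices."""
    os.rename(os.path.join(root, old), os.path.join(root, new))


def rename_sidecar(root, old, new, suffix=".array.json"):
    """Sidecar layout: metadata file and chunk directory move separately."""
    os.rename(os.path.join(root, old + suffix), os.path.join(root, new + suffix))
    # A crash at this point leaves the new metadata file without its
    # chunk directory, and the old chunk directory without metadata.
    os.rename(os.path.join(root, old), os.path.join(root, new))


with tempfile.TemporaryDirectory() as root:
    # Unified: a/zarr.json plus chunks live under the single directory a/.
    os.makedirs(os.path.join(root, "a", "0"))
    open(os.path.join(root, "a", "zarr.json"), "w").close()
    rename_unified(root, "a", "b")
    unified_ok = os.path.exists(os.path.join(root, "b", "zarr.json"))

    # Sidecar: c.array.json sits next to the chunk directory c/.
    os.makedirs(os.path.join(root, "c"))
    open(os.path.join(root, "c.array.json"), "w").close()
    rename_sidecar(root, "c", "d")
    sidecar_ok = os.path.exists(os.path.join(root, "d.array.json")) and os.path.isdir(
        os.path.join(root, "d")
    )
```

Both renames succeed here; the point is only that the sidecar variant has an intermediate state that a crash can expose.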

  • Potential contention of writers trying to create a group and array at the same time. This might be mitigated by using a chunk subdirectory, which in theory would permit the simultaneous existence of an array and group of the same name, e.g.
    (foo group with a child array)
    /foo.group.json
    /foo/bar.array.json
    /foo/bar/_chunks/0/0
    
    (foo array at the same location -- no conflict)
    /foo.array.json
    /foo/_chunks/0/0

This is similar to Option 2.iii. It could be very confusing, but it's technically possible.
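
To make the "just list a directory and look at the suffixes" idea concrete, here is a minimal Python sketch. It assumes the sibling-metadata layout shown above (`NAME.group.json` / `NAME.array.json` next to a `NAME/` directory) and works on a plain list of keys without a leading slash; `classify_members` is a hypothetical helper, not part of any zarr implementation.

```python
# Suffixes taken from the layout sketched above; everything else is illustrative.
SUFFIXES = ((".group.json", "group"), (".array.json", "array"))

def classify_members(keys, prefix=""):
    """Return sorted (name, kind) pairs for direct children of `prefix`,
    decided purely from key suffixes (no metadata document is opened)."""
    members = set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if "/" in rest:
            continue  # key belongs to a deeper level of the hierarchy
        for suffix, kind in SUFFIXES:
            if rest.endswith(suffix) and len(rest) > len(suffix):
                members.add((rest[: -len(suffix)], kind))
    return sorted(members)

keys = ["foo.group.json", "foo/bar.array.json", "foo/bar/_chunks/0/0"]
print(classify_members(keys))          # [('foo', 'group')]
print(classify_members(keys, "foo/"))  # [('bar', 'array')]
```

Returning pairs rather than a name-to-kind mapping keeps the case visible where a group and an array of the same name coexist.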

I feel like we should have a bit more discussion about these tradeoffs before abandoning this layout.

I have serious concerns about Option 1:

> * When listing a group's members, determining whether a group member is a subgroup or an array requires one extra operation per member (reading the zarr.json key).

The requirement that the metadata document _must be read_ to discover the contents of a store, together with the V3 change that _array / group metadata and user attributes are all stored in the same file_ I believe can lead to unacceptable performance degradations in very common use cases. It's typical for NetCDF type datasets to be stored as a flat group with 100s of arrays. Each of these arrays can have a lot of user metadata. This proposal would require that all of that be read in just to open the store.

Merely opening a group should not necessarily involve listing its contents. Furthermore, it seems that in many cases where you do want to list the contents of a group, you would also care about other information, such as the data types and shapes of each array, not merely the names of the arrays. If the default representation makes that too expensive, consolidated metadata would allow you to both efficiently determine the names of the arrays, and also determine their shapes, etc.

However, you have a good point that in some cases the metadata file may become large and then it is problematic to read the whole thing just for one piece of information. That issue also applies more generally, though --- we may only wish to read the array, and not care about its user-defined attributes. Or we may only wish to determine its shape and data type and not read it at all. Splitting the user-defined attributes from the zarr-defined metadata, as in v2, would be one solution, but there may be others.

So if we are going with one of the proposals above, I would favor option 2.

But I actually think I favor option 0 (sticking with the existing V3 layout), minus the /meta / /data separation.

@jstriebel
Member Author

> minus the /meta / /data separation.

Just to summarize, I think this is set, right? We are in favor of dropping the /meta and /data prefixes and having all files in the same shared tree.

I have no strong opinion about the filenames. I just think we should include a clear argument for any deviations from v2 in the ZEP.

@rabernat
Contributor

rabernat commented Dec 9, 2022

I am 👍 on dropping the separation, as long as we create a group-level storage transformer / extension that allows you to bring it back.

However, given the fundamental nature of this change, I would love to hear a few more opinions from e.g. @zarr-developers/python-core-devs. Does anyone strongly want to keep the separation as the default in V3?

@martindurant
Member

> If it's possible to exclude directories for key-listings for most relevant stores, only using a prefix for the chunk files would still give this efficiency, but it's unclear if that's the case.

I wasn't sure what this means exactly, so I don't know if it is or isn't possible. Certainly, traversing the sets of directories and/or possibly listing big directories is not something we want to do. In fact, if we can do without any directory listing actions ever, that's the best.

@martindurant
Member

From an implementation point of view, and kerchunk's interaction with this, an unseparated layout more similar to v2 is more convenient. That, by itself, is not a great motivator.

@rabernat
Contributor

rabernat commented Dec 15, 2022

At today's ZEP meeting, @jstriebel, @joshmoore, @jbms and myself all seemed to be in agreement that option 1 is preferable.

I also favor placing chunks in a separate subdir within an array directory.

We also agreed it would be useful to sketch out the algorithm one would use to recursively browse / explore a store.

@martindurant
Member

Option 1 means a zarr.json file in every group and array alongside the chunks, and that subgroups or array members of a group are subdirectories? I can support this model, it is very similar to v2. I assume the plan is to have the option for an extension to implement things like consolidated metadata.

Question: does a group metadata list its submembers, or is this still done by directory listing?

@jbms
Contributor

jbms commented Dec 16, 2022

> Option 1 means a zarr.json file in every group and array alongside the chunks, and that subgroups or array members of a group are subdirectories? I can support this model, it is very similar to v2. I assume the plan is to have the option for an extension to implement things like consolidated metadata.
>
> Question: does a group metadata list its submembers, or is this still done by directory listing?

Still done by directory listing, otherwise adding members to a group concurrently from multiple machines may be problematic. An extension/storage transformer/storage adapter like kerchunk used beneath zarr could potentially be used to avoid list operations on the underlying store.
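
As a rough illustration of the listing-based browse algorithm mentioned at the ZEP meeting, here is a minimal Python sketch for option 1. It is only a sketch under stated assumptions: the store is modelled as an in-memory dict of key to bytes, `list_children` and `walk` are hypothetical helpers, and the `node_type` field inside `zarr.json` is an assumed way of telling groups from arrays that this thread does not define.

```python
import json

def list_children(store, prefix):
    """Immediate sub-directory names under `prefix`
    (emulates a delimiter-based list operation on an object store)."""
    names = set()
    for key in store:
        if key.startswith(prefix):
            rest = key[len(prefix):]
            if "/" in rest:
                names.add(rest.split("/", 1)[0])
    return sorted(names)

def walk(store, prefix=""):
    """Yield (path, node_type) for the node at `prefix` and its descendants."""
    doc = store.get(prefix + "zarr.json")
    if doc is None:
        return  # no metadata document here, so not a zarr node
    node_type = json.loads(doc)["node_type"]  # assumed field name
    yield (prefix.rstrip("/") or "/", node_type)
    if node_type != "group":
        return  # arrays have no user-defined members; never list their chunks
    for name in list_children(store, prefix):
        yield from walk(store, prefix + name + "/")

store = {
    "zarr.json": b'{"node_type": "group"}',
    "foo/zarr.json": b'{"node_type": "group"}',
    "foo/bar/zarr.json": b'{"node_type": "array"}',
    "foo/bar/chunks/0/0/0": b"",
}
print(list(walk(store)))
# [('/', 'group'), ('foo', 'group'), ('foo/bar', 'array')]
```

Note the cost profile discussed above: one list operation per group plus one metadata read per member, and no listing inside array directories.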

@jstriebel
Member Author

jstriebel commented Dec 22, 2022

To summarize, we agreed to drop the /meta and /data prefixes, and use a naming scheme similar to v2, where one array is completely contained in its parent folder, with the addition of a folder for subchunks. This could look similar to

/foo/zarr.json
/foo/bar/zarr.json
/foo/bar/chunks/0/0/0

for a group foo including an array bar.

One important point @joshmoore emphasized is to add a prefix for the chunks folder and possibly also to the zarr.json files. Any opinions about this prefix? There were ideas such as _, _z_, __.
This prefix should then also be forbidden for any node-names in the hierarchy, and could be used for any additional files that extensions might add. Should we also use it for zarr.json? It's most important for folder-names, since those can not be easily distinguished from nested groups or arrays.

@martindurant
Member

A group or array called exactly "zarr.json" seems unlikely, but one called "chunks" is definitely possible.
Why not stick with ".z" from V2, then?

@jstriebel
Member Author

> Why not stick with ".z" from V2, then?

. as a prefix often means "hidden", and many backup systems ignore these files by default, which was brought up by @jbms. Therefore, starting the prefix with . seems dangerous and might lead to users missing those files on a filesystem.

@martindurant
Member

OK, fair point. "_" is also used by OSs, I think, no? And we shouldn't use capitals, URL-sensitive chars or non-ascii, just in case that's a problem :)

@jstriebel
Member Author

> "_" is also used by OSs, I think, no?

Not that I'm aware of, but that doesn't mean much ;)

> And we shouldn't use capitals, URL-sensitive chars or non-ascii

👍

@jstriebel
Member Author

@martindurant Do you have an example where _ is used by an OS? If not, the current proposal would be _z_. Let's finalize this in the ZEP meeting tomorrow.

@martindurant
Member

I thought of .DS_Store, but I see it starts with a dot. Maybe Windows does this for its directory config files? Since I can't find an example, assume I am wrong and _ is fine!

@jbms
Contributor

jbms commented Jan 18, 2023

Following the discussion at the ZEP meeting on 2023-01-12, @jstriebel and I discussed this further over email and have the following proposal:

  • The primary array and group metadata will be stored as _zarr.json under the directory/prefix of the array/group.
  • Any other names beginning with _zarr are reserved for other zarr-standard metadata.
  • There are no restrictions on group member names, other than disallowing the empty string and "/" characters. If the group member name starts with a _, we escape it with an extra _, so e.g. a group member name of _foo gets stored as __foo.
  • Any other names beginning with _ (not beginning with __ or _zarr) are reserved for use by extensions.
  • Chunks are named like c/X/Y/Z or c.X.Y.Z depending on the dimension_separator. Therefore, in the default case of dimension_separator = "/", chunks are in a separate directory. In the case of a 0-dimensional array, the single chunk key is always just named "c", and dimension_separator has no effect.

Rationale:

On some filesystems, especially distributed filesystems, directories are somewhat expensive; in fact we have run into issues with this at Google when creating a large number of arrays at once. On those systems, it would be unfortunate to have to create two or three directories for every array instead of just one (one extra for chunks, and possibly one extra for extension metadata, if using e.g. "zarr.extensions/" as a prefix).

By slightly tweaking the chunk key encoding compared to the current v3 proposal, the chunks end up in a separate directory by default, but users can choose dimension_separator = "." to avoid that. When using dimension_separator="/", there will already be many directories for the chunks and one extra directory would presumably be insignificant. In cases where creating directories is expensive, and the number of chunks is small or large numbers of files in a single directory is not an issue, users can choose dimension_separator = ".". There is no need to use a special prefix like "_" for the chunks, since array directories cannot contain user-defined members.
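
For illustration, the chunk key encoding described above can be sketched in a few lines of Python. `chunk_key` is a hypothetical helper name; only the `c` prefix, the two separators, and the 0-dimensional case come from the proposal.

```python
def chunk_key(coords, dimension_separator="/"):
    """Encode chunk grid coordinates as a key relative to the array directory."""
    if dimension_separator not in ("/", "."):
        raise ValueError("dimension_separator must be '/' or '.'")
    # For a 0-dimensional array, coords is empty and the key is just "c",
    # so the separator has no effect.
    return dimension_separator.join(["c", *(str(c) for c in coords)])

print(chunk_key((0, 1, 2)))                           # c/0/1/2
print(chunk_key((0, 1, 2), dimension_separator="."))  # c.0.1.2
print(chunk_key(()))                                  # c
```

With the default separator every chunk lands under the `c/` directory, while `"."` keeps all chunks as plain files next to the metadata document.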

Key properties:

  • This proposal provides a clear separate namespace for zarr-standard and extension metadata.
  • Directory names don't have dots in them, unless a user creates a group member with a dot.

@rabernat
Contributor

rabernat commented Jan 18, 2023

I really like this proposal except for one thing:

> • Chunks are named like c/X/Y/Z or c.X.Y.Z depending on the dimension_separator. Therefore, in the default case of dimension_separator = "/", chunks are in a separate directory. In the case of a 0-dimensional array, the single chunk key is always just named "c", and dimension_separator has no effect.

Can we consider decoupling the question "should ALL chunks be in a subdirectory?" from the question "should we store chunks in a nested series of directories?" I can imagine that I might like to do c/X.Y.Z in many cases.

The rationale is as follows. We can imagine that the array directory might have other documents in it related to extensions that are discovered by listing, e.g.

foo/_zarr.json
foo/_zarr_meta.json

I want to absolutely avoid having to list a directory with millions of chunks in it. But I still might not want a nested hierarchy of chunks.

Does that make sense?

@jbms
Contributor

jbms commented Jan 18, 2023

As far as extensions, I imagined that they would always be specified in the _zarr.json file, not discovered by listing. It would be good not to require listing, at least for most functionality, because some stores, like a plain http server, don't support listing, and even when supported, listing is often significantly more expensive than reading a single file (e.g. 10x the operation cost on S3 and GCS).

@rabernat
Contributor

> As far as extensions, I imagined that they would always be specified in the _zarr.json file, not discovered by listing.

Yes this makes sense. If we require this I can drop my objection.

> It would be good not to require listing, at least for most functionality, because some stores, like a plain http server, don't support listing

Solving this in a general way (beyond the extension example) requires us to resolve the issues that motivated consolidated metadata (#136). For our applications, we must be able to discover the children of a group over HTTP. All solutions I can think of here require explicitly enumerating the children in _zarr.json, which would lead to a race condition when multiple writers create arrays at the same time. Do you have any thoughts on how to solve this?

@jbms
Contributor

jbms commented Jan 18, 2023

> > As far as extensions, I imagined that they would always be specified in the _zarr.json file, not discovered by listing.
>
> Yes this makes sense. If we require this I can drop my objection.
>
> > It would be good not to require listing, at least for most functionality, because some stores, like a plain http server, don't support listing
>
> Solving this in a general way (beyond the extension example) requires us to resolve the issues that motivated consolidated metadata (#136). For our applications, we must be able to discover the children of a group over HTTP. All solutions I can think of here require explicitly enumerating the children in _zarr.json, which would lead to a race condition when multiple writers create arrays at the same time. Do you have any thoughts on how to solve this?

Not sure there is a single best/perfect solution. Options I can think of, other than the solution you mentioned, are:

  • Some HTTP servers can be configured to produce directory listings in HTML format --- often this is done by default. In principle the zarr implementation can parse the HTML directory listing. For example, Neuroglancer does that in order to support datasource URL completion, but does not otherwise require list functionality. It is a bit unfortunate to have to rely on something like this, though.
  • Instead of using a plain HTTP server, run an object storage service like minio that supports the S3 API.
  • Use some sort of database (accessible over HTTP) rather than plain files, where the database provides listing without relying on HTTP listing. Like the solution you mentioned of enumerating the members in the _zarr.json file, this would likely require some amount of coordination when writing, though.
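
To show what the first option might involve, here is a minimal Python sketch that extracts child entries from an HTML index page using only the standard library. It assumes the server emits one `<a href>` per child, relative to the directory URL; `list_from_html` and the example URL are hypothetical, and this is not how Neuroglancer or fsspec actually implement it.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class _LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on the page."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def list_from_html(base_url, html):
    """Return names of direct children linked from the index page at
    `base_url` (which must end with '/'); directories keep a trailing '/'."""
    parser = _LinkCollector()
    parser.feed(html)
    base = urlparse(base_url)
    children = set()
    for href in parser.hrefs:
        target = urlparse(urljoin(base_url, href))
        if target.netloc != base.netloc or not target.path.startswith(base.path):
            continue  # parent directory or external link
        rest = target.path[len(base.path):]
        if rest and "/" not in rest.rstrip("/"):
            children.add(rest)  # skips sort links and deeper paths
    return sorted(children)

page = ('<a href="../">Parent</a><a href="?C=N;O=D">Name</a>'
        '<a href="foo/">foo/</a><a href="zarr.json">zarr.json</a>')
print(list_from_html("http://example.com/data/", page))
# ['foo/', 'zarr.json']
```

The filtering is where the fragility lies: index pages differ between servers, so any such parser depends on conventions rather than on a protocol.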

@rabernat
Contributor

Thanks Jeremy!

One thing to mention is that this HTTP use case is mostly a sort of archival, read-only scenario. So I think it should be acceptable for our purposes to enumerate the children explicitly. I'm going to propose an extension to enable this.

@martindurant
Member

> Some HTTP servers can be configured to produce directory listings in HTML format

fsspec supports directory listing on HTTP servers that produce a list of links for child folders/files like this

@jstriebel
Member Author

> If the group member name starts with a _, we escape it with an extra _, so e.g. a group member name of _foo gets stored as __foo.

@jbms and I discussed that this might not be as easy as we first thought, since we don't have a clear entrypoint for a group anymore. This means that the following fs path can be opened differently:
/some/path/__maybegroup/__myarray/_zarr.json
We could open the array in multiple ways:

zarr.open("some/path")["_maybegroup/_myarray"]
zarr.open("some/path/__maybegroup")["_myarray"]
zarr.open("some/path/__maybegroup/__myarray")

Depending on the (user-defined) entry-point of the hierarchy the escaping needs to be done or not, which seems confusing. Also, an array in a path /some/path/_myarray/_zarr.json can never be opened as part of a group, since the array's fs name is not a valid hierarchy node name.
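
The ambiguity can be reproduced with a small Python sketch of the escaping rule from the earlier proposal. `escape`, `unescape` and `fs_path` are hypothetical helpers written only for this illustration.

```python
def escape(name):
    """Node name -> stored name: a leading '_' gets one extra '_'."""
    return "_" + name if name.startswith("_") else name

def unescape(stored):
    """Stored name -> node name; a single leading '_' is reserved."""
    if stored.startswith("__"):
        return stored[1:]
    if stored.startswith("_"):
        raise ValueError(f"{stored!r} is reserved and not a valid stored node name")
    return stored

def fs_path(entry_point, node_path):
    """Filesystem path of the metadata document for `node_path`,
    resolved relative to the directory the user opened."""
    parts = [escape(p) for p in node_path.split("/") if p]
    return "/".join([entry_point.rstrip("/"), *parts, "_zarr.json"])

# Three different (entry point, node path) pairs reach the same file:
a = fs_path("some/path", "_maybegroup/_myarray")
b = fs_path("some/path/__maybegroup", "_myarray")
c = fs_path("some/path/__maybegroup/__myarray", "")
print(a == b == c)  # True
print(a)            # some/path/__maybegroup/__myarray/_zarr.json
```

The entry point itself is never escaped, which is why a directory named `_myarray` can be opened directly but `unescape("_myarray")` fails when the same directory is reached as a group member.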

Since this would be a problem with all escaping schemes, we tend towards disallowing a prefix again. Just _ might be too common for normal names, so we propose to use __ again, but with the same rules as above:

  • The primary array and group metadata will be stored as __zarr.json under the directory/prefix of the array/group.
  • Any other names beginning with __zarr are reserved for other zarr-standard metadata.
  • Node names may not start with __
  • Any other names beginning with __ (not beginning with __zarr) are reserved for use by extensions.
  • Chunks are named like c/X/Y/Z or c.X.Y.Z depending on the dimension_separator. Therefore, in the default case of dimension_separator = "/", chunks are in a separate directory. In the case of a 0-dimensional array, the single chunk key is always just named "c", and dimension_separator has no effect.
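
As a quick check of how the revised naming rules partition the namespace, here is a minimal Python sketch. `classify_name` and its return labels are hypothetical, and it assumes the earlier restrictions (no empty names, no "/" in names) carry over unchanged.

```python
def classify_name(name):
    """Classify a name found directly inside a group directory
    according to the revised `__` prefix rules."""
    if name == "" or "/" in name:
        return "invalid"             # assumed carry-over from the earlier proposal
    if name == "__zarr.json":
        return "metadata"            # primary array/group metadata document
    if name.startswith("__zarr"):
        return "reserved-standard"   # other zarr-standard metadata
    if name.startswith("__"):
        return "reserved-extension"  # reserved for use by extensions
    return "node"                    # usable as a group member name

for name in ["__zarr.json", "__zarr_meta.json", "__cache", "_foo", "temperature"]:
    print(name, "->", classify_name(name))
# __zarr.json -> metadata
# __zarr_meta.json -> reserved-standard
# __cache -> reserved-extension
# _foo -> node
# temperature -> node
```

Because no escaping is involved, the stored name and the node name are identical, so the result does not depend on where the hierarchy is opened.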

@jstriebel jstriebel moved this from In Discussion to Needs PR in ZEP1 Feb 2, 2023
@jstriebel jstriebel moved this from Needs PR to In Review in ZEP1 Feb 9, 2023
@github-project-automation github-project-automation bot moved this from In Review to Done in ZEP1 Feb 9, 2023