perrygreenfield edited this page Jul 16, 2020 · 4 revisions

zarr considered as a data format

Some projects reportedly use zarr storage as a data format because it is explicitly centered around chunking support (though I have not found any specific examples by searching).

Zarr actually supports a multitude of storage representations, the main two types being:

  • Each chunk goes into a separate file, but the organization of files with directories has multiple options.
  • Chunks are stored in a database, with a number of database options provided.

We will only focus on the first option for this comparison.

Zarr storage is supported in multiple languages.

We will start by highlighting how data can be organized, as this is probably the biggest difference between zarr and ASDF.

Zarr allows attributes to be provided as a JSON object. Nesting within the JSON is presumably permitted, so one may have an arbitrarily deep hierarchy of attributes, though there is no mechanism for automatically turning these into special objects in the supporting language. Essentially you are provided a tree of dictionaries, lists, and basic types. These attributes are associated with an array.
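To make the "tree of dictionaries, lists, and basic types" concrete, here is a minimal sketch using only Python's standard json module rather than the zarr library itself. The attribute names and values are invented for illustration.

```python
import json

# Hypothetical attributes for an array; the keys are made up for illustration.
attrs = {
    "instrument": {"name": "example-camera", "filters": ["F110W", "F160W"]},
    "exposure": {"value": 30.0, "unit": "s"},
}

# Attributes are kept as a JSON document, so nesting survives a round trip.
encoded = json.dumps(attrs, indent=4, sort_keys=True)
decoded = json.loads(encoded)

# What comes back is plain dicts, lists, strings, and numbers. Nothing turns
# the "exposure" entry into a quantity object with units attached.
exposure = decoded["exposure"]
print(type(exposure).__name__, exposure)
```

Any interpretation of an entry such as "exposure" as a physical quantity is left to the application reading the attributes.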

Zarr supports groups, and groups may be nested. Each group may consist of arrays and/or groups. Each group may have attributes associated with it just as arrays do.

Zarr is heavily reliant on the file system for organization. The directory structure and the presence of .zgroup and .zarray files indicate whether the directory represents a group or an array (or both?).
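As a rough illustration of that layout, the sketch below builds a small directory tree by hand, following the zarr version 2 directory conventions as we understand them (.zgroup and .zarray marker files, chunk files named by their grid index), and then classifies each directory from its marker files. It does not use the zarr library. The group and array names are hypothetical, and letting .zarray take precedence when both markers are present is an assumption of this sketch, not something the format is known to specify.

```python
import json
import os
import tempfile


def classify(path):
    """Return 'array', 'group', or 'other' from the marker files in a directory."""
    # Assumption: if both markers were present, treat the directory as an array.
    if os.path.exists(os.path.join(path, ".zarray")):
        return "array"
    if os.path.exists(os.path.join(path, ".zgroup")):
        return "group"
    return "other"


def write_json(path, obj):
    with open(path, "w") as f:
        json.dump(obj, f)


layout = {}
with tempfile.TemporaryDirectory() as root:
    # Root group containing a nested group "images" with one array "raw".
    array_dir = os.path.join(root, "images", "raw")
    os.makedirs(array_dir)
    write_json(os.path.join(root, ".zgroup"), {"zarr_format": 2})
    write_json(os.path.join(root, "images", ".zgroup"), {"zarr_format": 2})
    write_json(
        os.path.join(array_dir, ".zarray"),
        {
            "zarr_format": 2,
            "shape": [4, 4],
            "chunks": [2, 2],
            "dtype": "<f8",
            "compressor": None,
            "fill_value": 0,
            "order": "C",
            "filters": None,
        },
    )
    # One file per chunk: 2 x 2 float64 values, uncompressed, is 32 bytes.
    for key in ("0.0", "0.1", "1.0", "1.1"):
        with open(os.path.join(array_dir, key), "wb") as f:
            f.write(bytes(32))

    # Walk the tree and record what each directory represents.
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root).replace(os.sep, "/")
        layout[rel] = classify(dirpath)

for name in sorted(layout):
    print(name, layout[name])
```

The point is that nothing beyond the directory tree and a few small JSON files records the hierarchy, which is why file system operations such as a partial copy can leave the data set inconsistent.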

In discussing ASDF we presume the implementation of ASDF chunking for ASDF v2.0 (which, for the Python implementation, will in fact use zarr underneath, although with its own implementation of a new storage class). ASDF will provide two alternative storage cases: one where all chunks are in ASDF binary blocks, and one where chunks are separate files but with a different naming scheme than zarr uses. Each approach has its advantages. These are summarized as follows:

Chunks in ASDF binary blocks

Advantages:

  • Self-contained; no dependencies on file systems, not fragile to file system operations.
  • Better suited to archive use because it is self-contained.
  • Good format for read-only cloud access using byte-range requests.

Disadvantages:

  • Not as good for updating chunks within a file if combined with compression.
    • Doing so results in some wasted space in the file, either through padding blocks to handle varying sizes or through reallocating blocks.
  • Poor in cloud use if updating chunks is required.
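The trade-off for chunks held in one file can be sketched with a toy container. This is not the ASDF block format: the layout (compressed chunks written back to back, plus an in-memory list of offsets and lengths) and the function names are assumptions made purely for illustration. Reading one chunk needs only its byte range, but replacing a chunk whose compressed size grows forces a reallocation that leaves dead space behind.

```python
import io
import zlib


def pack(chunks):
    """Write compressed chunks back to back; return (buffer, index)."""
    buf = io.BytesIO()
    index = []  # one (offset, length) pair per chunk
    for raw in chunks:
        comp = zlib.compress(raw)
        index.append((buf.tell(), len(comp)))
        buf.write(comp)
    return buf, index


def read_chunk(buf, index, i):
    """Read one chunk by byte range, much as a range request would."""
    offset, length = index[i]
    buf.seek(offset)
    return zlib.decompress(buf.read(length))


def update_chunk(buf, index, i, raw):
    """Overwrite in place if the new bytes fit, else append at the end.

    Returns the number of bytes left unused by the update.
    """
    comp = zlib.compress(raw)
    offset, length = index[i]
    if len(comp) <= length:
        buf.seek(offset)
        buf.write(comp)
        index[i] = (offset, len(comp))
        return length - len(comp)  # slack left after the chunk
    end = buf.seek(0, io.SEEK_END)
    buf.write(comp)
    index[i] = (end, len(comp))
    return length  # the old slot is now dead space


chunks = [bytes(1024), bytes(1024), bytes(1024)]  # zeros compress very well
buf, index = pack(chunks)
size_before = len(buf.getvalue())

# Replace the middle chunk with data that compresses poorly.
new_data = bytes(range(256)) * 4
wasted = update_chunk(buf, index, 1, new_data)
print("wasted bytes:", wasted, "file grew by:", len(buf.getvalue()) - size_before)
```

Padding each slot in advance would avoid the reallocation at the cost of wasting space up front, which is the other option named in the list above.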

Chunks as separate files

Advantages:

  • Better suited to being updated since chunks are easily resized.
  • Better suited to cloud use if updating is required.
  • Data more accessible to outside applications since available as separate files

Disadvantages:

  • Worse for archive use; use of tar or zip makes accessing metadata more cumbersome
  • Moving data can be more complex.
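By contrast, when each chunk is its own file an update is just a file rewrite, and the file system absorbs any change in size. The sketch below assumes a hypothetical naming scheme where the file name is the chunk's grid index; the helper names are invented for this example and do not come from zarr or ASDF.

```python
import os
import tempfile
import zlib


def write_chunk_file(directory, key, raw):
    """Compress and write one chunk, replacing any earlier version."""
    tmp = os.path.join(directory, key + ".tmp")
    with open(tmp, "wb") as f:
        f.write(zlib.compress(raw))
    # Swap the new file into place so readers never see a partial chunk.
    os.replace(tmp, os.path.join(directory, key))


def read_chunk_file(directory, key):
    """Read and decompress one chunk."""
    with open(os.path.join(directory, key), "rb") as f:
        return zlib.decompress(f.read())


with tempfile.TemporaryDirectory() as d:
    write_chunk_file(d, "0.0", bytes(1024))  # zeros compress very well
    small = os.path.getsize(os.path.join(d, "0.0"))

    # The replacement compresses poorly, so the chunk file simply gets bigger.
    write_chunk_file(d, "0.0", bytes(range(256)) * 4)
    large = os.path.getsize(os.path.join(d, "0.0"))

    restored = read_chunk_file(d, "0.0")
    files = sorted(os.listdir(d))

print("chunk file size before and after:", small, large)
```

No index of offsets has to be maintained and no space is wasted, but the data set is now a directory of many files, which is what makes archiving and moving it more awkward.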

We feel both data representations have important enough advantages that both ultimately should be supported.

Contrasting ASDF with zarr storage

Advantages of ASDF

  • Not purely a chunked format; many uses do not need chunking.
  • Not focused solely on array or table data.
  • Supports schemas for verifying metadata meets expectations
  • Supports tags to indicate special conversion to software objects should be performed
  • Much richer set of metadata to support science needs. E.g. support for:
    • standard physical units
    • functional models
    • inline arrays and tables in metadata, permitting all-text data files for small data sets that are directly editable
    • references, allowing many entities to share common data
  • Permits extensibility through URIs that projects can publish.
  • Very general purpose

Disadvantages

  • Mostly used in astronomy, though uses are appearing elsewhere.
  • Does not support chunking yet (but soon...)
  • Only fully supported in Python currently

As far as we can tell, once ASDF supports chunking (using zarr) there is virtually nothing zarr will have that ASDF won't provide, other than (currently) support in other languages, whereas ASDF already provides many capabilities that zarr doesn't.

Admittedly, the chunking/compression option extensions are just vaporware until released.
