Beyond consolidated metadata for V3: inspiration from Apache Iceberg #154
Comments
If I understand correctly, what you are really looking for is a key-value store adapter that handles versioning, snapshots, possibly packs things into a smaller number of files, and avoids the need for a list operation on the underlying storage. This is actually precisely what is addressed by this proposal that we are working on implementing as part of tensorstore (but is not tied to tensorstore): (To be clear, while we are currently implementing it, it is still very much a prototype and the actual format is not at all set in stone.) My understanding is that "storage transformers" were intended to apply only to the chunk data, not the metadata. Since the goal here seems to be to also version the metadata and store it in the same way as the chunks, it seems like this would more appropriately be implemented at the key-value store level and would not need any particular integration with zarr, just that zarr would be used with it. |
Thanks for sharing @jbms! I am currently reading and digesting that proposal. One point of clarification: storage transformers can act on any of the items in the store (chunks AND metadata). For example, the sharding transformer (#152) declares explicitly that
|
Currently, |
Ah good point. I was confused by the comment above. I guess for this to work we would want the storage transformer in the root metadata. |
Kerchunk would welcome being formalised as (part of) a zarrberg manifest, if that gets us a large fraction of the way to your goals. I think it can, since the storage layer can then express a change in the target of a key and could also be a place for per-chunk transformations if we want. That is not the same as a higher-level transformer; but see for example the ability to concat bytes from several references in preffs. A new version is a new references set (or some sort of delta scheme). Of course, it doesn't have to be via kerchunk, especially if you really want to have transformations as a separately specified entity. Note that kerchunk does also have aims with non-zarr data, such as the very directory-file structure of parquet/orc you mentioned, or even just finding valid byte start offsets in a CSV or zstd compressed file. Those are all related-but-not-the-same. |
I wanted to pipe up and say I absolutely love this proposal, @rabernat :) |
@rabernat can we have a huddle about this sometime soon? We are evaluating a feature at ESDIS level to better understand what a persistent zarr store would look like. You pointed me to this thread a while back; we're now having more active conversations and I would like to better understand this proposal because it might exactly address our need. |
@briannapagan , would be happy to take part too. As an orthogonally connected topic, I have also made a client for reading actual iceberg datasets into dask (i.e., tabular as opposed to zarrish nd-arrays). |
@briannapagan - sure thing. I'll ping you via email. For context, this idea is central to what Joe and I are now building at Earthmover. |
I'd be interested in joining (if that's ok). Though no worries if not |
me too! |
also (always) happy to join. |
I had a discussion with my colleague the other day about |
Picking up this conversation as we have an open ticket at ESDIS. @christine-e-smit gave a great summary of our use case:
We've had tangent discussions in pangeo-forge with some solutions mentioned including consolidated-metadata, and @rabernat pointing out that the core issue is an ACID transaction to update the Zarr store.
I am proposing another discussion so we can try and sync these two ongoing conversations. For those interested in joining a discussion with NASA folks, please fill in this doodle by end of day Friday March 17th: https://doodle.com/meeting/participate/id/aQkYRPMd |
As part of the tensorstore project we have developed a key-value store database format called OCDBT (Optionally-cooperative distributed B+tree) that can sit as an adapter between an array format like zarr and the underlying base filesystem / object store. This format is designed to support versioning and copy-on-write snapshots, which could be used in various ways to solve the scenario described above:
You can find a high-level description of the format and its motivations here: And a description of the current on-disk format here: |
@jbms I don't have your contact information but also for anyone else able to join here is the meeting information for tomorrow 2PM eastern: Meetings agenda and notes: |
Some of us in the fusion energy research world have started using zarr (see [mastapp.site](https://mastapp.site) if you’re curious!). For this data repository, we’re looking at using lakefs to handle versioning, but this proposal outlines a much better and more lightweight way to version our data. I’ll be following this with keen interest and would love to have an idea on timelines and progress. |
@NathanCummings : lakeFS versions individual files within a set, I think, which would make it hard to use for this kind of thing |
@martindurant if something like this were implemented, it would suit our needs nicely. |
Absolutely, version control as described in this thread sounds like what you might want (cc @rabernat , if you want to comment). |
Yes, tiledb is another option that we considered before deciding to use zarr. It’s still something I’m keeping in the back of my mind. |
Hi @NathanCummings! We currently support all of these features in Zarr via our Arraylake platform. We are currently working on open sourcing some of these features, along the lines of the original proposal. Not able to share much more publicly at this point, but feel free to reach out ([email protected]) for an update. |
Continuing on the #314 (comment) by @rabernat since more relevant here
any specifics on what components would be open-sourced/what they would allow to do? |
I'm sure the community would be very excited to hear about what's to come from Earthmover, and interest could be stirred by such specifics; but it's hard for developers (not me!) to reveal too much, especially if licenses have not yet been updated in all the places. "later this fall" sounds pretty soon to me, it being already mid-September. |
Our Arraylake platform already offers transactions, serializable isolation, and version control for Zarr data, with the feature set described here (and more technical details here). The way this works today requires both object storage AND interaction with our REST API. With our forthcoming open source release, we have figured out how to do this without the REST API, using only object storage. Happy to chat in person if you're interested in hearing more details: [email protected]. |
More than two years after writing this ticket, we finally built it. 💯 open source (Apache 2.0) |
I just reviewed the V3 spec. As with V2, the core spec does not address how to deal with unlistable stores. In V2, we developed the semi-ad-hoc "consolidated metadata" approach. I know there is already an issue about this for V3 (#136), but I am opening a new one with the hope of broadening the discussion of what consolidated metadata can be.
Background: What is Apache Iceberg
I have recently been reading about Apache Iceberg. Iceberg is aimed at tabular data, so it is not a direct alternative or competitor to Zarr. Iceberg is not a file format itself (it can use Parquet, Avro, or ORC for individual data files). Instead, it is a scheme for organizing many individual data files into a larger (potentially massive) single "virtual" table. In addition to the table files, it tracks copious metadata about the files. In this sense, it is similar to Zarr: Zarr organizes many individual chunk files into a single large array.
For folks who want to quickly come up to speed on Iceberg, I recommend reading the following documents in order:
Architecturally, these are the main points
This diagram says 1000 words.
Another interesting part of the spec, regarding the underlying storage system capabilities:
Within the data lake / warehouse / lakehouse world, you can think of Iceberg as an open-source alternative to Databricks Delta Lake. It can be used by many different data warehouse query engines like Snowflake, Dremio, Spark, etc.
Inspiration for Zarr
It doesn't make sense to use Iceberg directly for Zarr--it's too tied to the tabular data model. There are lots of interesting ideas and concepts in the Iceberg spec that I think are worth trying to copy in Zarr. Here is a list of some that come to mind
Manifests
Manifests are essentially files which list other files in the dataset. This is conceptually similar to our V2 consolidated metadata in that it removes the need to explicitly list a store. We could store whatever is useful to us in our manifests. For example, we could explicitly list all chunks and their sizes. This would make many operations go much faster, particularly on stores that are slow to list or to return file information.
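As a rough illustration of the idea (not a proposed format), a manifest could be a single JSON document mapping chunk keys to sizes. The layout and the helper names `list_prefix` and `total_size` below are assumptions made up for this sketch:

```python
import json

# Hypothetical manifest layout: one JSON document listing every chunk key
# and its size in bytes. This is not part of any Zarr or Iceberg spec.
manifest_json = json.dumps({
    "chunks": {
        "data/root/foo/0.0": {"size": 1024},
        "data/root/foo/0.1": {"size": 980},
        "data/root/foo/1.0": {"size": 1011},
    }
})


def list_prefix(manifest, prefix):
    """Return the keys under `prefix` using only the manifest (no store listing)."""
    return sorted(k for k in manifest["chunks"] if k.startswith(prefix))


def total_size(manifest, prefix=""):
    """Sum the recorded sizes of all chunks under `prefix`."""
    return sum(
        entry["size"]
        for key, entry in manifest["chunks"].items()
        if key.startswith(prefix)
    )


manifest = json.loads(manifest_json)
keys = list_prefix(manifest, "data/root/foo/")
nbytes = total_size(manifest, "data/root/foo/")
```

With one read of the manifest, both the listing and the total size are available without issuing a list or a per-object size request against the store.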
Snapshots, Commits, and Time Travel
Iceberg tables are updated by simply adding new files and then updating the manifest list to point to the new files. The new state is registered via an atomic commit operation (familiar to git users). A snapshot points to a particular set of files. We could imagine doing this with zarr chunks, e.g. using the new V3 layout
where _v0 and _v1 are different versions of the same chunk. We could check out different versions of the array corresponding to different commits, preserving the whole history without having to duplicate unnecessary data. @jakirkham and I discussed this idea at scipy. (TileDB has a similar feature.)
Applying the same concept to metadata documents would allow us to evolve array or group schemas and metadata while preserving previous states.
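To make the snapshot idea concrete, here is a toy in-memory sketch. The `VersionedStore` class and its methods are hypothetical, and a real implementation would need the snapshot update to be an atomic operation on the underlying store, which this sketch does not attempt:

```python
class VersionedStore:
    """Toy store: each commit writes new physical keys and records a snapshot."""

    def __init__(self):
        self.objects = {}    # physical key -> bytes (never overwritten)
        self.snapshots = []  # snapshot id -> {logical key: physical key}

    def commit(self, changes):
        """Write changed chunks under fresh versioned keys; return the snapshot id."""
        new_id = len(self.snapshots)
        # Copy the previous pointers, not the chunk data itself.
        mapping = dict(self.snapshots[-1]) if self.snapshots else {}
        for key, value in changes.items():
            physical = f"{key}_v{new_id}"
            self.objects[physical] = value
            mapping[key] = physical
        # Assumption: in a real store this step must be atomic.
        self.snapshots.append(mapping)
        return new_id

    def get(self, key, snapshot=-1):
        """Read a chunk as of a given snapshot (latest by default)."""
        return self.objects[self.snapshots[snapshot][key]]


store = VersionedStore()
s0 = store.commit({"data/root/foo/0.0": b"a", "data/root/foo/0.1": b"b"})
s1 = store.commit({"data/root/foo/0.0": b"A"})
```

After the second commit, snapshot `s1` still points at `data/root/foo/0.1_v0` for the unchanged chunk, so history is preserved without duplicating data.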
Partition Statistics
If we stored statistics for each chunk, we could optimize many queries. For example, storing the min, max, and sum of each chunk, with reductions applied along each combination of dimensions, would sometimes allow us to avoid explicitly reading a chunk.
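A minimal sketch of how such statistics could answer queries without touching chunk data. It keeps only whole-chunk statistics (not the per-dimension reductions mentioned above), and `ChunkStats`, `compute_stats`, `global_mean`, and `chunks_possibly_above` are names invented for this example:

```python
from dataclasses import dataclass


@dataclass
class ChunkStats:
    vmin: float
    vmax: float
    total: float
    count: int


def compute_stats(values):
    """Summarize one chunk at write time."""
    return ChunkStats(min(values), max(values), sum(values), len(values))


def global_mean(stats):
    """Mean over the whole array, computed from per-chunk sums and counts only."""
    total = sum(s.total for s in stats.values())
    count = sum(s.count for s in stats.values())
    return total / count


def chunks_possibly_above(stats, threshold):
    """Keys of chunks that could contain a value above `threshold`."""
    return [key for key, s in stats.items() if s.vmax > threshold]


# Stand-in chunk contents; in practice the stats would live in a manifest.
chunks = {"0.0": [1.0, 2.0, 3.0], "0.1": [10.0, 12.0], "0.2": [4.0, 5.0]}
stats = {key: compute_stats(values) for key, values in chunks.items()}

mean = global_mean(stats)
candidates = chunks_possibly_above(stats, 4.5)
```

Here the mean needs no chunk reads at all, and a query for values above 4.5 can skip chunk `0.0` entirely because its recorded maximum is 3.0.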
Different underlying file formats
Zarr V2 chunks are extremely simple: a single compressed binary blob. As we move towards the sharding storage transformer (#134, #152), they are getting more complicated--a single shard can contain multiple chunks. Via kerchunk, we have learned that most other common array formats (e.g. NetCDF, HDF5) can actually be mapped directly to the zarr data model. (🙌 @martindurant) We are already using kerchunk to create virtual zarr datasets that map each chunk to different memory locations in multiple different HDF5 files. However, that is all outside of any Zarr specification.
Iceberg uses this exact same pattern! You can use many different optimized tabular formats (Parquet, Avro, ORC) for the data files and still expose everything as part of a single big virtual dataset. Adopting Iceberg-style manifests would enable us to formalize this as a supported feature of Zarr.
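The mapping from a Zarr key to a byte range in some other file can be sketched as follows. The (path, offset, length) layout is loosely modeled on kerchunk-style references, but the exact structure, the file name, and the `read_chunk` helper are assumptions for illustration only:

```python
import os
import tempfile

# Build a stand-in "foreign" file: an 8-byte header followed by two chunks.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "legacy.bin")
with open(path, "wb") as f:
    f.write(b"HEADER--" + b"chunk-one" + b"chunk-two!")

# Hypothetical manifest: logical Zarr key -> (file path, byte offset, byte length).
references = {
    "data/root/foo/0.0": (path, 8, 9),
    "data/root/foo/0.1": (path, 17, 10),
}


def read_chunk(key):
    """Resolve a logical key to a byte range in the backing file and read it."""
    file_path, offset, length = references[key]
    with open(file_path, "rb") as f:
        f.seek(offset)
        return f.read(length)


first = read_chunk("data/root/foo/0.0")
second = read_chunk("data/root/foo/0.1")
```

The reader never needs to know the backing file's format; it only needs the manifest entry for each key.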
Implementation Ideas
Having finally understood how storage transformers are supposed to work (thanks to #149 (comment)), I have convinced myself that the iceberg-like features we would want can all be implemented via a storage transformer. In pseudocode, the API might look like this
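One hypothetical shape for such a transformer, sketched in Python. The `VersioningTransformer` class is invented for this example and is not the API of the V3 spec or of any Zarr implementation; it wraps a plain dict standing in for the underlying store:

```python
class VersioningTransformer:
    """Hypothetical transformer that rewrites logical keys to versioned keys."""

    def __init__(self, inner_store, versions=None):
        self.inner = inner_store        # any dict-like key-value store
        self.versions = versions or {}  # logical key -> current version number

    def _resolve(self, key):
        """Translate a logical key into the physical key of its current version."""
        return f"{key}_v{self.versions[key]}"

    def __getitem__(self, key):
        return self.inner[self._resolve(key)]

    def __setitem__(self, key, value):
        # Never overwrite: each write goes to the next version of the key.
        next_version = self.versions.get(key, -1) + 1
        self.inner[f"{key}_v{next_version}"] = value
        self.versions[key] = next_version


inner = {}
store = VersioningTransformer(inner)
for payload in (b"first", b"second", b"third"):
    store["data/root/foo/0.0"] = payload

current = store["data/root/foo/0.0"]
physical_key = store._resolve("data/root/foo/0.0")
```

After three writes, the logical key `data/root/foo/0.0` resolves to `data/root/foo/0.0_v2`, while the two earlier versions remain in the inner store.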
The storage transformer class would do the work of translating specific storage key requests into the correct paths in the underlying store (e.g. data/root/foo/0.0 ➡️ data/root/foo/0.0_v2).
Summary
I'm quite excited about the possibilities that this direction would open up. Benefits for Zarr would be
At the same time, I am aware of the large amount of work involved. I think a useful path forward is to look for the 90/10 solution--what could we imitate from Iceberg that would give us 90% of the benefits for 10% of the effort.