Question/possible enhancement: Relationship to N5 + Arrow/Parquet? #515
Hi Wictor, good questions...
A lot of the goals of Zarr overlap with N5, and I'm aware of the Zarr
backend for N5. Is there a clear intersection between these two projects
planned?
Yes, the two developer groups are working together, and work has started
towards a zarr version 3 protocol spec that could serve as a common
implementation target for both groups. Some more info here:
https://zarr-developers.github.io/zarr/specs/2019/06/19/zarr-v3-update.html
The funding proposal we put to CZI was to support the work towards
finishing and implementing the v3 protocol spec.
Also, and possibly more importantly: Has there been any communication with
the Apache Arrow/Parquet project? Although their *primary* focus is on
tabular data, support for tensors has been raised and noted on their
Github/JIRA/mailing list--but simply shelved as a lower priority.
Could/will Zarr eventually develop into a natural extension there since
many of the file spec characteristics (at least on the macro level) seem
similar?
I would love to connect with the Arrow project, there are many points of
commonality and it would be great to build together on common foundations.
We haven't had any direct contact as yet but I think we should reach out
asap.
Excellent to hear--and congrats on the CZI funding. I wasn't aware of the development going on with the v3 spec. As a follow-up: What are the team's opinions on TileDB? Is Zarr meant to target different use cases? Also, I apologize for the many project comparisons. It's fantastic that there are multiple groups working on this problem; I'm just trying to get a sense of where everyone is aiming and why, to better inform decisions about future data management at my own institution.
cc @jorisvandenbossche (in case you have thoughts on ways Arrow and Zarr could work together 🙂)
Short answer is that TileDB and Zarr do target very similar use cases but take a different technical approach, particularly regarding how to avoid synchronisation problems when there are multiple writers to a single array.

The essence of it (as I understand it) is that TileDB organises data into files or cloud objects based on write batches, whereas Zarr takes a simpler approach and organises data into files or cloud objects based on chunks. So, e.g., if you write some data to a region of an array, in TileDB the data you write will get stored as a single object, regardless of whether the region you're writing to is contained within a single chunk or overlaps multiple chunks. If you write multiple times to the same chunk, that will get stored as multiple objects. Reading the current state of the data in a chunk then requires traversing all overlapping writes in time order. Zarr stores each chunk in a separate object, i.e., there is only ever one object per chunk, so if the region you are writing overlaps multiple chunks then multiple objects will be updated (or created if they didn't already exist).

Because of this, TileDB is completely lock-free, i.e., you can have multiple writers writing to any region of an array in parallel. In Zarr you can have multiple writers working in parallel, but you have to think about avoiding two writers ending up in contention for the same chunk, which most people do by aligning their write operations to chunk boundaries, but which can also be managed via chunk-level locks.

The tradeoff is that the TileDB storage format is more complicated, and understanding how to optimise stored data for performance may not be so straightforward, because it may require steps like consolidating arrays. The Zarr default storage layout is simpler: you can hack around with the data more easily using generic tools like file system tools or a cloud object browser, and it's a bit more obvious how to tune performance by adjusting chunk size and shape. (In fact Zarr abstracts over storage in a way that would allow someone to write a storage backend for Zarr that follows something like the TileDB approach, i.e., Zarr is not tied to the one-chunk-one-object storage layout. But that's more about what is possible rather than what is currently implemented.)

There's lots more I haven't touched on. A comparison of Zarr vs TileDB probably should become an issue by itself over on the zarr-community repo so others can chime in and we can flesh out the differences a bit more; I'm sure it would be useful for others.

Btw, apologies for the slow response on this.
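To make the chunk-alignment point above concrete, here is a minimal sketch of lock-free parallel writes with Zarr, assuming the zarr-python v2 API; the store path and block size are made up for illustration:

import numpy as np
import zarr
from concurrent.futures import ThreadPoolExecutor

# 1000x1000 array stored as one object per 100x100 chunk
z = zarr.open("example.zarr", mode="w", shape=(1000, 1000),
              chunks=(100, 100), dtype="f8")

def write_block(i, j):
    # each task writes exactly one whole chunk, so no two writers ever
    # touch the same storage object and no locking is needed
    z[i * 100:(i + 1) * 100, j * 100:(j + 1) * 100] = np.random.random((100, 100))

with ThreadPoolExecutor() as pool:
    for i in range(10):
        for j in range(10):
            pool.submit(write_block, i, j)

For writes that don't line up with chunk boundaries, zarr also provides synchronizers (e.g., zarr.ThreadSynchronizer) to take chunk-level locks instead.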
This is a fantastic response, and thank you very much for taking the time to roughly summarize things. And yes, the information you have here makes the different fundamental design choices in the two projects much clearer. I can definitely imagine scenarios where either approach might be more or less advantageous depending on the type of data, access patterns, how many concurrent users, etc. Again, thanks!
No problem at all. You put it very well; in general it's probably a good idea to try out both approaches with some real data in some realistic usage scenarios. And if people end up choosing TileDB then please feel free to tell us about it and why; it's technically very interesting to compare the two approaches, and we're always interested to learn more about what people need :-)
@wsphillips thanks for pointing us to this discussion via the Julia forum. @alimanfoo, thanks for these fair points on TileDB vs Zarr. If you will forgive the lengthy response, I’d like to give some background on the design goals motivating those technical decisions, and how TileDB has evolved to meet the needs of our heaviest users (primarily in genomics):
The tile/chunk-per-file approach does not work in the case of sparse arrays, because a tile contains an ordered collection of non-empty cells which does not rely on some fixed space partitioning (to avoid the storage/memory explosion of densified arrays [1, 2]); some space tiles may be full and some may be almost empty for certain distributions. Updating the sparse cells in a way that destroys that order, while keeping a fixed capacity in each tile, would require regrouping a massive collection of sparse cells into new tiles, stored in new files. This was not feasible given our first goal, hence the decision to create new tiles for the new sparse cells and store them in a new immutable file.

With the addition of timestamps, the above design of writing separate "fragments" (i.e., update batches) very naturally enables time-travel, i.e., reading snapshots of the array at some point in the past, without requiring a separate log (which can lead to consistency issues for parallel writes). Thus with TileDB you are able to open an array at a timestamp and efficiently slice data as if no update had happened beyond that timestamp, or even undo an update (again, for un-consolidated arrays).

There were also some constraints on inode count with a tile-per-file approach when using smaller tile sizes on some file systems (we had originally implemented a tile-per-file solution in 2015), but this is not much of a problem on cloud object stores like AWS S3, especially if you perform parallel reads (in fact, increasing the object prefix count can reduce enforced slow-downs!).

Some additional notes: […]
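To illustrate the time-travel point above, a minimal sketch assuming the TileDB-Py API; the array URI and timestamp value are made up, and exact keyword semantics may vary between versions:

import tiledb

# read the array as of an earlier point in time: fragments written
# after this timestamp are simply not visited by the query
with tiledb.open("s3://my-bucket/my_array", mode="r", timestamp=1600000000000) as A:
    snapshot = A[0:100, 0:100]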
cc @ihnorton
Hi @stavrospapadopoulos, just to briefly say thank you for adding this comment. I think it's hugely valuable to unpack these different technical approaches and to have a chance to discuss and understand them.
I'd like to point out that the project I'll be working on to add Awkward Array as a Zarr v3 extension would be a way of accessing Arrow data through Zarr, though by a less direct route:
I've been talking about this as a possibility on zarr-developers/zarr-specs#62, and @martindurant developed an end-to-end demonstration in Zarr v2: https://github.com/martindurant/awkward_extras/blob/main/awkward_zarr/core.py

To show what I mean, suppose you have a Parquet file like this:

>>> import pyarrow.parquet as pq
>>> import awkward as ak
>>> pyarrow_table = pq.read_table("complicated-example.parquet.txt")
>>> pyarrow_table["simple"].to_pylist()
[1, 2, 3]
>>> pyarrow_table["complicated"].to_pylist()
[
[{'x': 0.0, 'y': []}, {'x': 1.1, 'y': [1]}, {'x': 2.2, 'y': None}],
[],
[{'x': 3.3, 'y': [1, 2, 3]}, None, {'x': 4.4, 'y': [1, 2, 3, 4]}]
]

The Awkward Array library has an equivalent for each of the Arrow types, so even the complicated Arrow array can be viewed as an Awkward Array.

>>> awkward_array = ak.from_arrow(pyarrow_table["complicated"])
>>> awkward_array
<Array [[{x: 0, y: []}, ... 1, 2, 3, 4]}]] type='3 * var * ?{"x": float64, "y": ...'>
>>> awkward_array.tolist()
[
[{'x': 0.0, 'y': []}, {'x': 1.1, 'y': [1]}, {'x': 2.2, 'y': None}],
[],
[{'x': 3.3, 'y': [1, 2, 3]}, None, {'x': 4.4, 'y': [1, 2, 3, 4]}]
]
>>> awkward_array.type # DataShape notation; see https://datashape.readthedocs.io/
3 * var * ?{"x": float64, "y": option[var * int64]} The "Awkward Array <--> Zarr v3" project would use the ak.to_buffers function to decompose the data structure into one-dimensional buffers (shown here as a dict of NumPy arrays). The idea is to replace this dict with the Zarr group, as @martindurant has done, such that one Zarr group corresponds to one complicated array. >>> form, length, container = ak.to_buffers(awkward_array)
>>> container
{
'part0-node0-offsets': array([0, 3, 3, 6], dtype=int64),
'part0-node1-mask': array([47], dtype=uint8),
'part0-node3-data': array([0. , 1.1, 2.2, 3.3, 0. , 4.4]),
'part0-node4-mask': array([1, 1, 0, 1, 0, 1], dtype=int8),
'part0-node5-offsets': array([0, 0, 1, 1, 4, 4, 8], dtype=int64),
'part0-node6-data': array([1, 1, 2, 3, 1, 2, 3, 4])
}

The data needed to reconstitute it are (a) the one-dimensional buffers, (b) this JSON "Form", and (c) the length:

>>> length
3
>>> form
{
"class": "ListOffsetArray64",
"offsets": "i64",
"content": {
"class": "BitMaskedArray",
"mask": "u8",
"content": {
"class": "RecordArray",
"contents": {
"x": {
"class": "NumpyArray",
"itemsize": 8,
"format": "d",
"primitive": "float64",
"form_key": "node3"
},
"y": {
"class": "ByteMaskedArray",
"mask": "i8",
"content": {
"class": "ListOffsetArray64",
"offsets": "i64",
"content": {
"class": "NumpyArray",
"itemsize": 8,
"format": "l",
"primitive": "int64",
"form_key": "node6"
},
"form_key": "node5"
},
"valid_when": true,
"form_key": "node4"
}
},
"form_key": "node2"
},
"valid_when": true,
"lsb_order": true,
"form_key": "node1"
},
"form_key": "node0"
}

The v3 extension to the Zarr client would read it back with ak.from_buffers.

>>> reconstituted = ak.from_buffers(form, length, container)
>>> reconstituted
<Array [[{x: 0, y: []}, ... 1, 2, 3, 4]}]] type='3 * var * ?{"x": float64, "y": ...'>
>>> reconstituted.tolist()
[
[{'x': 0.0, 'y': []}, {'x': 1.1, 'y': [1]}, {'x': 2.2, 'y': None}],
[],
[{'x': 3.3, 'y': [1, 2, 3]}, None, {'x': 4.4, 'y': [1, 2, 3, 4]}]
]
>>> reconstituted.type
3 * var * ?{"x": float64, "y": option[var * int64]} This may be different from what you're thinking about for direct Arrow <--> Zarr because this has Zarr treating the six buffers in this example group as opaque objects, and a v3 extension would present a whole Zarr group as one Awkward Array (complaining if By exploding the data out into a set of non-contiguous buffers that are navigated by name, we can read fields of nested records (and partitions of rows, not shown here) lazily, on demand: >>> class VerboseMapping:
... def __init__(self, container):
... self.container = container
... def __getitem__(self, key):
... print(key)
... return self.container[key]
...
>>> lazy = ak.from_buffers(form, length, VerboseMapping(container), lazy=True)
>>> lazy.x
part0-node0-offsets
part0-node1-mask
part0-node3-data
<Array [[0, 1.1, 2.2], ... [3.3, None, 4.4]] type='3 * var * ?float64'>
>>> lazy.y
part0-node4-mask
part0-node5-offsets
part0-node6-data
<Array [[[], [1], None], ... [1, 2, 3, 4]]] type='3 * var * ?option[var * int64]'>

The […]

Some of the early comments in this thread were about Arrow's "tensor" type, which is included in the […].

On today's Zarr call (2020-12-16), we talked about partial reading of buffers, which could come in handy here. Each of these buffers is a simple array, indexed by the one above it in the tree.
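To give a rough idea of how a Zarr group could stand in for the container dict above, here is a sketch against the zarr-python v2 and Awkward 1.x APIs. It is not @martindurant's actual implementation; the helper names (store_awkward, load_awkward) and the choice to keep the Form and length in group attributes are assumptions for illustration only.

import zarr
import awkward as ak  # Awkward 1.x, as in the examples above

def store_awkward(group, array):
    # decompose into named one-dimensional buffers plus "Form" metadata
    form, length, container = ak.to_buffers(array)
    for key, buffer in container.items():
        group.create_dataset(key, data=buffer)   # one Zarr array per buffer
    # assumes Form.tojson() / from_buffers round-trip via the JSON shown above
    group.attrs["form"] = form.tojson()
    group.attrs["length"] = length

def load_awkward(group):
    # the Zarr group itself plays the role of the container mapping
    container = {key: group[key][:] for key in group.array_keys()}
    return ak.from_buffers(group.attrs["form"], group.attrs["length"], container)

group = zarr.group()   # in-memory group; could equally be a directory store
store_awkward(group, awkward_array)
assert ak.to_list(load_awkward(group)) == ak.to_list(awkward_array)

A real v3 extension would presumably standardize where the Form and length metadata live, rather than stuffing them into group attributes as done here.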
@jpivarski, you might want to start a new thread, since this is rather old. I will add, though, that I would not be surprised if the "bunch of 1D arrays" in Zarr, as a storage format for potentially deeply nested Arrow datasets, is as performant in size, speed, and chunkability as Parquet.
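For what it's worth, a first-order way to check the size half of that guess, reusing the store_awkward helper sketched above (the file names are made up, and a real comparison would also need larger datasets plus attention to compressors and chunking):

import os
import zarr

disk_group = zarr.open_group("awkward-example.zarr", mode="w")
store_awkward(disk_group, awkward_array)

# total bytes of the chunk/metadata files in the Zarr directory store
zarr_bytes = sum(
    os.path.getsize(os.path.join(root, name))
    for root, _, names in os.walk("awkward-example.zarr")
    for name in names
)
parquet_bytes = os.path.getsize("complicated-example.parquet.txt")
print("zarr:", zarr_bytes, "bytes;", "parquet:", parquet_bytes, "bytes")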
Thanks for the suggestion! I thought this was the best place because it already has everyone who's interested in Arrow <--> Zarr.