Add Apache Arrow codec #227
Comments
How do you imagine this working? If people are using Parquet, do they actually need Zarr?
I guess one of the selling features of zarr for me is being able to load only the chunks I need off of a remote server. Arrow is only an in-memory representation, so I guess it is conceivable that a chunk is larger than is reasonable for memory, and it gets spooled to a local disk as a Parquet file? 🤔 I’ll experiment a bit with Arrow, and if I can get the behavior I’m hoping for I’ll submit a PR.
Before getting to a PR, it would be good to get a clearer idea on the usage pattern and how well it generalizes (though it sounds like we are still working on those questions 😉).
Hi @vdwees, just to second @jakirkham's comment, it would be helpful to clarify goals and usage patterns here. IIUC Arrow provides a standard way to share memory buffers between processes. So, e.g., you could imagine loading data from one or more chunks of any Zarr array into a PyArrow array, rather than a numpy array as currently. That is something completely independent of codecs; it's more about how to lay out memory buffers and expose them to applications. Parquet is a file format for columnar data, i.e., serialisation of multiple 1D arrays. Codecs in Zarr are things like compressors, which transform arrays during serialisation or deserialisation. Some of the current codecs in the numcodecs package do borrow some ideas from the Parquet format, but only in very specific ways, e.g., how to serialise strings. Hth.
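To make the codec vs. memory-layout distinction concrete, here is a minimal standard-library sketch of what "codec" means in the sense above: a reversible transform applied to a chunk's bytes on the way to and from storage. `ZlibCodecSketch` and its `level` parameter are hypothetical names for illustration only, not the numcodecs or Zarr API, and the sample chunk is made up.

```python
import zlib
from array import array


class ZlibCodecSketch:
    """Hypothetical codec: a reversible bytes-to-bytes transform."""

    def __init__(self, level=1):
        self.level = level

    def encode(self, buf):
        # Compress the raw chunk bytes before they are written to storage.
        return zlib.compress(bytes(buf), self.level)

    def decode(self, buf):
        # Restore the original chunk bytes after reading from storage.
        return zlib.decompress(buf)


# A made-up in-memory 1D chunk of float64 values.
chunk = array("d", [0.0, 1.5, 1.5, 1.5, 3.0] * 200)

codec = ZlibCodecSketch()
stored = codec.encode(chunk.tobytes())  # what would land in the store

# Decoding yields plain bytes; how those bytes are then exposed
# (numpy array, Arrow buffer, ...) is a separate, later decision.
restored = array("d")
restored.frombytes(codec.decode(stored))
```

Note that the codec only ever sees bytes. Whether the decoded buffer ends up wrapped as a numpy array or an Arrow array happens after `decode`, which is why an Arrow output type would not naturally live in a codec.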
Adding an Apache Arrow codec for efficient data loading.
For data stored in the filesystem, Apache Parquet might be added as well.
Also mentioned here:
zarr-developers/zarr-python#515