Support conversion of rust struct to an arrow2 chunk #40

Closed
ncpenke opened this issue Jun 25, 2022 · 8 comments

@ncpenke
Collaborator

ncpenke commented Jun 25, 2022

Created from the discussion in jorgecarleitao/arrow2#1092.

A Rust struct can conceptually represent either an Arrow Struct or an arrow2::Chunk (a column group). The arrow2::Chunk is important since it's used in the serialization/deserialization APIs for parquet and flight conversion.

We can extend the arrow2_convert::TryIntoArrow and arrow2_convert::FromArrow traits to convert to/from arrow2::Chunk, but there are two possible mappings from a vector of structs, Vec<S>, to a Chunk:

  1. The Chunk has a single field of type Struct
  2. The Chunk contains the same number of fields as the struct.

Option 1 can be easily supported by wrapping the arrow2::Array in a Chunk.

2 has a couple of approaches:

a. A new derive macro to generate the mapping to a Chunk (e.g. ArrowChunk or ArrowRoot).
b. Providing a helper method to convert an arrow2::StructArray to a Chunk by unwrapping the fields.

One related use case that could guide this design is supporting generic, typed versions of the arrow2 csv, json, parquet, and flight serialize/deserialize methods, where the schema is specified by a Rust struct (opened #41 for this). To achieve that, it would be useful to access the serialize/deserialize methods of each column separately for parallelism, which is cleaner via approach 2a.
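
To make the two chunk shapes concrete, here is a rough sketch, assuming the derived arrow2_convert serialization produces a StructArray and using arrow2's Chunk/StructArray APIs with Arc<dyn Array> as the element type (this may be Box<dyn Array> in other arrow2 versions); variable names are illustrative and the snippet is meant to run inside a function returning a Result:

use std::sync::Arc;
use arrow2::array::{Array, StructArray};
use arrow2::chunk::Chunk;
use arrow2_convert::serialize::TryIntoArrow;

// rows: Vec<S>, where S derives the arrow2_convert traits.
let array: Arc<dyn Array> = rows.try_into_arrow()?;

// Mapping 1: a Chunk with a single column of type Struct.
let chunk_single = Chunk::new(vec![array.clone()]);

// Mapping 2 (approach b): unwrap the StructArray so each struct field
// becomes its own column in the Chunk.
let struct_array = array
    .as_any()
    .downcast_ref::<StructArray>()
    .expect("a derived struct serializes to a StructArray");
let chunk_per_field = Chunk::new(struct_array.values().to_vec());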

@ncpenke
Collaborator Author

ncpenke commented Jun 25, 2022

@jorgecarleitao @nielsmeima any thoughts on this?

@jorgecarleitao
Collaborator

I agree that it is more natural for the root to be a Chunk with each field. We still need it to be representable as a StructArray in case someone wants to compose it in another struct, though (i.e. at least internally we need that). We also still need fields(&self) -> &[Field], since users usually do:

let schema = fields.into();
let mut writer = Writer::new(schema);

for chunk in chunks {
    writer.write(chunk)?;
}

Arrow2 supports From<Vec<Field>> for Schema, so we could expose fields and let the user call into(). The difference between the two is just the Schema's metadata: parquet and arrow support file-level custom metadata.
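
As a small sketch of that path (the field definitions here are illustrative, not taken from this crate):

use arrow2::datatypes::{DataType, Field, Schema};

// Whatever fields(&self) -> &[Field] reports for the derived type.
let fields = vec![
    Field::new("name", DataType::Utf8, false),
    Field::new("age", DataType::Int32, true),
];

// From<Vec<Field>> for Schema: the resulting schema starts with empty
// metadata; file-level custom metadata (supported by parquet and Arrow
// IPC/flight) can be attached to the Schema afterwards.
let schema: Schema = fields.into();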

Note that in general StructArray = (Vec<Field>, Chunk, Option<Bitmap>), where

  • the Fields' types are consistent with the Chunk's types.
  • the validity's length is equal to the Chunk's length.

AFAIK we do allow a record (not an item) to be nullable, and thus the correspondence here is even simpler. Could we just expose an "into_chunk" method? It seems we have all the ingredients.
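
A minimal sketch of what such an into_chunk could look like, built only from StructArray's public accessors (accessor names follow arrow2 ~0.12, where the values are Arc<dyn Array>; this is an assumption, not the crate's final API):

use std::sync::Arc;
use arrow2::array::{Array, StructArray};
use arrow2::bitmap::Bitmap;
use arrow2::chunk::Chunk;
use arrow2::datatypes::Field;

// Decompose a StructArray into the (fields, chunk, validity) triple above.
fn into_chunk(array: &StructArray) -> (Vec<Field>, Chunk<Arc<dyn Array>>, Option<Bitmap>) {
    let fields = array.fields().to_vec();
    let columns = array.values().to_vec();
    let validity = array.validity().cloned();
    (fields, Chunk::new(columns), validity)
}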

@ncpenke
Collaborator Author

ncpenke commented Jun 26, 2022

We definitely have all the ingredients. The open question I still have is: for a struct S, should Vec<S>::into_chunk() correspond to a Chunk with all the fields of S, or to a single field that is a Struct?

In jorgecarleitao/arrow2#1092, it was the latter. Maybe that is what you're proposing too. Do you think we should provide a helper somewhere, if needed, to go from a Chunk that wraps a StructArray to a Chunk that wraps the Struct's fields?

As an aside, I think we should unify all the conversion methods (try_into_collection and try_into_arrow) under arrow_try_into. I'll open a separate ticket for this, but it would be more consistent with the Rust convention of omitting type names from conversion method names. Conversion to a Chunk could then be performed by calling arrow_try_into on a Vec<S>.
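
Purely to illustrate the naming idea (arrow_try_into does not exist in the crate; this is a hypothetical trait sketched for the proposal):

// Hypothetical unified conversion trait; today the crate exposes
// separate try_into_arrow / try_into_collection methods instead.
pub trait ArrowTryInto<T>: Sized {
    type Error;
    fn arrow_try_into(self) -> Result<T, Self::Error>;
}

// Usage would then mirror std's TryInto, with the target type selecting
// the conversion, e.g.:
//   let chunk: Chunk<Arc<dyn Array>> = rows.arrow_try_into()?;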

ncpenke added a commit that referenced this issue Jun 26, 2022
…alars + chunks

- Rename ArrowDeserialize to ArrowFieldDeserialize. Similarly for arrow_deserialize
- Rename ArrowSerialize to ArrowFieldSerialize. Similarly for arrow_serialize
- Rename internal trait ArrowArray to ArrowArrayIterator for clarity
ncpenke added a commit that referenced this issue Jun 27, 2022
ncpenke added a commit that referenced this issue Jun 27, 2022
@jorgecarleitao
Collaborator

I think we can support a chunk with a single StructArray by offering a function to convert to a Chunk; the user creates a struct with a single field to do so (right?)

@ncpenke
Collaborator Author

ncpenke commented Jun 27, 2022

It seemed more convenient to provide a one-shot conversion. Thanks for reviewing the PR! Let's see if we get more user feedback with this approach. I'll also add some examples for flight and parquet conversion in this repo.

@nielsmeima
Contributor

In jorgecarleitao/arrow2#1092, it was the latter. Maybe that is what you're proposing too. Do you think we should provide a helper somewhere, if needed, to go from a Chunk that wraps a StructArray to a Chunk that wraps the Struct's fields?

@ncpenke did you opt for providing such a helper? How would one currently use try_into_arrow to differentiate between obtaining a Chunk that wraps a StructArray and a Chunk that wraps the Struct's fields (without falling back to writing a helper themselves)? As far as I can see, let chunk: Chunk<Arc<dyn Array>> = struct.try_into_arrow()? will always resolve to the first variant (the wrapped StructArray).

I would still be interested in a Chunk wrapping the Struct's fields.

@ncpenke
Collaborator Author

ncpenke commented Jun 30, 2022

Thanks for following up, @nielsmeima. You're right that the current implementation always resolves to the first variant. We can add a helper to this crate to facilitate wrapping the fields of a StructArray directly.

I opened #55 with a proposal. Would be thrilled if you want to take a stab at it.

@nielsmeima
Contributor

I will check the proposal and take a stab at implementation in the coming days. Thrilled to do so!
