Feature/221 serialize enums as flattened structs #222
base: main
Hi @raj-nimble Thanks. I like the idea a lot! ATM, I don't really have the time for serde_arrow. I will have a proper look next week! First feedback: it's great that you target the 0.12 release. I would like to limit the current main to bug fixes for the 0.11 release. Re builders, we can also figure it out together once I'm back at my laptop. I imagine the deserialization logic will be a bit tricky. Maybe the impl of untagged enums in serde could be an inspiration (I like to use the Rust playground to expand the generated deserialize impl).
Hi @chmp that's great, I'm happy you like it. I will continue to iterate slowly on my own this week and I look forward to discussing it more with you next week.
I left some comments and thoughts. Thinking about next steps:
For serialization you should be able to follow the `StructArrayBuilder`, maybe even implement the `ArrayBuilder` as a collection of `StructBuilders` and only merge the results in `to_array`. Most likely having some annotation with a strategy is required to allow selecting a specialized implementation.
For deserialization I would implement the strategy to detect any non-null field, select the relevant variant and then follow the impl of the standard `EnumDeserializer`. However, I think getting this logic right will be really tricky. Probably, I would implement this as serialization only for now.
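The "detect any non-null field, select the relevant variant" step can be illustrated in isolation. The following is a minimal sketch, not serde_arrow code: `detect_variant` is a hypothetical helper, and each `Option<i64>` stands in for the whole struct payload of one variant within a single row.

```rust
/// Return the index of the only non-null variant in a flattened row.
///
/// `row` holds one optional payload per enum variant, in variant order.
/// A row with no variant set, or with more than one set, is ambiguous
/// and is reported as an error instead of being guessed at.
pub fn detect_variant(row: &[Option<i64>]) -> Result<usize, String> {
    // Indices of all variants that carry a value in this row.
    let mut set = row
        .iter()
        .enumerate()
        .filter(|(_, value)| value.is_some())
        .map(|(idx, _)| idx);

    match (set.next(), set.next()) {
        (Some(idx), None) => Ok(idx),
        (None, _) => Err("no variant is set in this row".to_string()),
        (Some(a), Some(b)) => Err(format!("variants {a} and {b} are both set")),
    }
}
```

The ambiguous cases are part of why the deserialization side is tricky: a flattened row cannot distinguish a variant whose fields are all null from a variant that was not selected.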
for variant in &self.variants {
if let Some(variant) = variant {
// TODO: does this break if there are no child fields?
Yes, this will most definitely break. For non-struct variants I think it would be better to explicitly detect invalid variants as soon as they are encountered. The simplest solution would be to ensure the variant trace is a struct in `ensure_union_variant`. If you implement that, a test of the error message would be great (there is an `assert_err` helper).
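As a rough illustration of rejecting non-struct variants as soon as they are seen, here is a self-contained sketch. `VariantTrace` and `ensure_struct_variant` are hypothetical stand-ins invented for this example; they do not mirror the real tracer types or the signature of `ensure_union_variant`.

```rust
/// Simplified stand-in for what a tracer has recorded about one variant.
#[derive(Debug, Clone, PartialEq)]
pub enum VariantTrace {
    /// A variant without data, e.g. `Enum::A`.
    Unit,
    /// A variant wrapping a single primitive, e.g. `Enum::B(i32)`.
    Primitive,
    /// A struct variant with named fields, e.g. `Enum::C { x, y }`.
    Struct(Vec<String>),
}

/// Fail early, with a descriptive message, for any variant that cannot
/// be flattened into struct fields.
pub fn ensure_struct_variant(name: &str, trace: &VariantTrace) -> Result<(), String> {
    match trace {
        VariantTrace::Struct(_) => Ok(()),
        other => Err(format!(
            "cannot flatten variant `{name}`: expected a struct variant, found {other:?}"
        )),
    }
}
```

A test of the error message then only needs to check for a stable substring such as "expected a struct variant".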
let opts = TracingOptions::default().enums_with_data_as_structs(true);

let enum_tracer = Tracer::from_samples(&enum_items, opts).unwrap();
let struct_tracer = Tracer::from_samples(&struct_items, TracingOptions::default()).unwrap();
It would also be cool to test `from_type`. My feeling is: the way you implemented everything, it will work out of the box.
I added a basic UT just to see if it works but I want to move the tests together to share the utility code and test all edge cases with both converters.
Hi @chmp. Thanks so much for the comments! I haven't made any progress since last week as I was working on something else, but with your comments now I can hopefully circle back to this starting next week and I'll update you as soon as I have made some progress. Thanks!
Agreed, it's best to implement the feature as serialization only for now. Can add a separate issue to track adding deserialization.
16751a6 to 623467b
Hi @chmp, I'm struggling to find an elegant way to do the serialization. I considered redoing the schema creation to continue to use UnionBuilder and then add logic there to serialize it as a struct instead, but I think that still ran into problems with arrow writer and seemed more convoluted. I then considered creating an entirely new builder, something like an
Do you have any time to take a closer look and give some guidance on exactly how this might best be done? My 2 current
FYI, I rebased onto the latest from develop-0.12. Thanks,
623467b to 4df3f9c
@raj-nimble good questions, what's the best option. One option would probably be to copy the code of
In pseudo code:
// probably you need to do something different here, as you plan to use `Struct` as the type with a custom strategy
pub fn from_union_builder(builder: UnionBuilder) -> Result<FlattenedUnionBuilder> {
let mut builders = Vec::new();
for (child_builder, child_meta) in builder.fields {
// ensure the child builders are nullable here (otherwise `serialize_none` will fail)
let ArrayBuilder::Struct(child_builder) = child_builder else { fail!("Children must be structs") };
// modify the fields to include the variant prefix, the variant name will be in child_meta.name
}
// construct the builder
Ok(FlattenedUnionBuilder { ... })
}
pub fn into_array(self) -> Result<Array> {
let mut fields = Vec::new();
// note: the meta data for the variant is most likely not used, I would simply drop it
for builder in self.fields {
let ArrayBuilder::Struct(builder) = builder else { fail!("..."); };
// concatenate the fields of the different variants
for (field, meta) in builder.fields {
fields.push((field.into_array()?, meta));
}
}
Ok(Array::Struct(StructArray {
// note: you will most likely need to keep track of the number of elements being added, simply add self.len += 1 or similar in the different `serialize_*` methods
len: self.len,
validity: self.seq.validity,
fields,
}))
}
pub fn serialize_variant(&mut self, variant_index: u32) -> Result<&mut ArrayBuilder> {
self.len += 1;
let variant_index = variant_index as usize;
// call serialize_none for any variant that was not selected
for (idx, builder) in self.fields.iter_mut().enumerate() {
if idx != variant_index {
builder.serialize_none()?;
}
}
let Some(variant_builder) = self.fields.get_mut(variant_index) else {
fail!("Could not find variant {variant_index} in Union");
};
Ok(variant_builder)
}
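The core idea of the pseudo code (write the selected variant, push a null into every other variant, then concatenate all variant fields) can be tried out without any serde_arrow internals. The sketch below is a toy model under stated assumptions: `Column` and `FlattenedBuilder` are hypothetical types, `i64` stands in for any value type, and the `::` separator between variant name and field name is an assumption, not something this thread specifies.

```rust
/// Toy nullable column; `i64` stands in for any Arrow value type.
#[derive(Debug, Default, PartialEq)]
pub struct Column {
    pub name: String,
    pub values: Vec<Option<i64>>,
}

/// Hypothetical flattened builder: one group of columns per enum variant.
#[derive(Debug, Default)]
pub struct FlattenedBuilder {
    pub len: usize,
    pub variants: Vec<Vec<Column>>,
}

impl FlattenedBuilder {
    /// Register a variant; its columns are prefixed with the variant name.
    /// The `::` separator is an assumption made for this sketch.
    pub fn add_variant(&mut self, variant: &str, fields: &[&str]) {
        let columns = fields
            .iter()
            .map(|field| Column {
                name: format!("{variant}::{field}"),
                // Back-fill nulls for rows pushed before this variant existed.
                values: vec![None; self.len],
            })
            .collect();
        self.variants.push(columns);
    }

    /// Append one row: the selected variant receives `payload`, every
    /// other variant receives a null in each of its columns.
    pub fn push(&mut self, variant_index: usize, payload: &[i64]) -> Result<(), String> {
        let Some(selected) = self.variants.get(variant_index) else {
            return Err(format!("could not find variant {variant_index}"));
        };
        if selected.len() != payload.len() {
            return Err(format!(
                "variant {variant_index} has {} fields, got {} values",
                selected.len(),
                payload.len()
            ));
        }
        for (idx, columns) in self.variants.iter_mut().enumerate() {
            for (pos, column) in columns.iter_mut().enumerate() {
                let value = (idx == variant_index).then(|| payload[pos]);
                column.values.push(value);
            }
        }
        self.len += 1;
        Ok(())
    }

    /// Concatenate the columns of all variants into one flat list.
    pub fn into_columns(self) -> Vec<Column> {
        self.variants.into_iter().flatten().collect()
    }
}
```

Because every row touches every column exactly once, all columns keep the same length as `len`, which is the invariant the final struct array needs.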
8de8e74 to 49d3cfd
Hi @chmp, thanks so much for your suggestions / guidance, it was immensely helpful. At some point, the UUID support could potentially become the next big hurdle. Our application uses them quite extensively, and as part of using the crate in our app, I switched to using
I commented here on my current workaround for UUIDs when using
@raj-nimble regarding tests: I will add a method to easily construct the new internal array objects. This way you should be able to easily get the underlying data out and compare whether everything worked.
Looks great so far!
}
}

#[cfg(test)]
So far my convention has been to move these tests into the `serde_arrow/src/test_with_arrow` tree. As written in the top-level comment, I will add support to simplify getting the data of the arrays so it will be easier to check the contents.
sounds good, will move the tests once the helper methods are added
serde_arrow/src/internal/serialization/outer_sequence_builder.rs
@raj-nimble I started to work on a proper release for the 0.12 series. Hence the change in base branch.
d785260 to 21ec2c5
21ec2c5 to 4eeab22
Hi @chmp sorry for the delay here. I got pulled away from this for a bit but I'm really hoping to finish up here this week or early next week. Did you get a chance yet to add support to simplify getting the data of the arrays so I can move the tests?
Great to hear! Re.: getting the data out of the arrays: you can use
// see also: https://github.com/chmp/serde_arrow/blob/main/serde_arrow/src/test_with_arrow/impls/chrono.rs#L659
let mut builder = ArrayBuilder::new(schema).unwrap();
builder.extend(&items).unwrap();
let arrays = builder.build_arrays().unwrap();
let [array] = arrays.try_into().unwrap();
let Array::Struct(array) = array else {
panic!("Expected struct array, got: {array:?}");
};
let (first_field, meta) = &array.fields[0];
assert_eq!(meta.name, "...");
assert_eq!(first_field, &Array::Int32(PrimitiveArray { validity: None, values: vec![1, 2, 3, 4] }));
4eeab22 to bf3f789
Hi @chmp I have an issue and I am not yet certain how to fix it. Maybe you can provide some insight? I have a test case pushed up that reproduces the issue at
The error I am getting is:
To repro/demonstrate, let's say we have this nested struct->enum->struct->enum structure
and we have a schema and data as follows
the error above shows up when we serialize
The fields look like this
As I understand it, what is happening is:
So while I understand how we arrive at that section of the code, I don't really understand the underlying behavior of the Dict builder or Int Builder on why this is so. I believe the Dict Builder is a result of the
Do you have any thoughts on what I might need to do here?
@raj-nimble Oh, that's a tricky one. The reason is the following:
One fix would be to force nested dictionary fields to be nullable with
fn fix_dictionaries(field: &mut Field) {
if matches!(field.data_type, DataType::Dictionary(_, _, _)) {
field.nullable = true;
} else if let DataType::Struct(children) = &mut field.data_type {
for child in children {
fix_dictionaries(child);
}
}
}
The recursion on struct is necessary, as nested structs that contain dictionaries will trigger the same error. Just a heads up: unions in Arrow are never nullable. For nested enums that cannot be mapped to dictionaries, you will run into a similar issue. Unfortunately, I have no idea what would be a solution here. I would suggest simply catching this case and raising an error in schema tracing. Maybe it would also be worthwhile to add checks for both issues, non-nullable dictionaries in structs and enums in structs, to the serializer. The schema can also be constructed manually and would result in hard to track down errors.
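To see the recursion at work outside of serde_arrow, here is a self-contained version of the same idea. The `Field` and `DataType` definitions below are simplified stand-ins written for this sketch (for example, `Dictionary` carries two boxed types rather than three components), so only the traversal logic carries over.

```rust
/// Simplified stand-in for an Arrow data type.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int32,
    Utf8,
    /// Dictionary with a key type and a value type.
    Dictionary(Box<DataType>, Box<DataType>),
    Struct(Vec<Field>),
}

/// Simplified stand-in for an Arrow field.
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: DataType,
    pub nullable: bool,
}

/// Mark every dictionary field nullable, descending through nested
/// structs so that dictionaries at any depth are covered.
pub fn fix_dictionaries(field: &mut Field) {
    match &mut field.data_type {
        DataType::Dictionary(_, _) => field.nullable = true,
        DataType::Struct(children) => children.iter_mut().for_each(fix_dictionaries),
        // Other types are left untouched.
        _ => {}
    }
}
```

Only the dictionary fields change; the structs that contain them and any sibling fields keep their original nullability.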
4a0ad54 to c5b98b6
Hi @chmp , sorry for the recent radio silence and lack of progress. I got pulled into more urgent tasks but this has definitely not fallen off my radar. I do plan to return to this hopefully in the next month. For a summary of where I left off:
Sorry for the delay with this.
@raj-nimble Thanks for the update. And no worries. I myself am quite busy with other things and can't really devote much time to serde_arrow. Just give me a ping when I should support :)
This MR isn't complete; I still need to add the correct functionality to the array builder / sequence builder.
This is just an example of the direction I was going, and I was hoping for your thoughts.
With this change, `from_samples` outputs Fields with the desired flattened structure.