Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/221 serialize enums as flattened structs #222

Draft
wants to merge 19 commits into
base: main
Choose a base branch
from

Conversation

raj-nimble
Copy link

This MR isn't complete, still need to add the correct functionality to array builder / sequence builder.
This is just as an example of the direction I was going, was hoping for your thoughts?

With this change, from_samples outputs Fields with the desired flattened structure.

@chmp
Copy link
Owner

chmp commented Aug 15, 2024

Hi @raj-nimble

Thanks. I like the idea a lot!

ATM, I don't really have the time for serde arrow. I will have a proper look next week!

First feedback: it's great, that you target the 0.12 release. I would like to limit the current main to bug fixes for the 0.11 release.

Re builders we can also figure it out together, once I'm back at my laptop. I imagine the deserialization logic will be a bit tricky. Maybe the impl of untagged enums in serde could be an inspiration (I like use the rust playground to expand the generated deserialize impl).

@raj-nimble
Copy link
Author

Hi @chmp that's great, I'm happy you like it. I will continue to iterate slowly on my own this week and I look forward to discussing it more with you next week.

Copy link
Owner

@chmp chmp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments and thoughts. Thinking about next steps:

For serialization you should be able to follow the StructArrayBuilder, maybe even implement the ArrayBuilder as a collection of StructBuilders and only merge the results in to_array. Most likely having some annotation with a strategy is required to allow selecting a specialized implementation.

For deserialization I would implement the strategy to detect any non-null field, select the relevant variant and then follow the impl of the standard EnumDeserializer. However, I think getting this logic right, will be really tricky. Probably, I would implement this as serialization only for now.


for variant in &self.variants {
if let Some(variant) = variant {
// TODO: does this break if there are no child fields?
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this will most definitely break, for non struct variants I think it would be better to explicitly detect invalid variants as soon as they are encountered. The simplest solution would be to ensure the variant trace is a struct, in ensure_union_variant. If you implement that a test of the error message would be great (there is an assert_err helper)

serde_arrow/src/internal/schema/tracer.rs Outdated Show resolved Hide resolved
serde_arrow/src/internal/schema/tracer.rs Show resolved Hide resolved
let opts = TracingOptions::default().enums_with_data_as_structs(true);

let enum_tracer = Tracer::from_samples(&enum_items, opts).unwrap();
let struct_tracer = Tracer::from_samples(&struct_items, TracingOptions::default()).unwrap();
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would also be cool, to test from_type. My feeling is: they way you implemented everything, it will work out of the box.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a basic UT just to see if it works but I want to move the tests together to share the utility code and test all edge cases with both converters.

@raj-nimble
Copy link
Author

Hi @chmp . Thanks so much for the comments! I haven't made any progress since last week as I was working on something else, but with your comments now I can hopefully circle back to this starting next week and I'll update you as soon as I have made some progress. Thanks!

@raj-nimble
Copy link
Author

raj-nimble commented Aug 29, 2024

Probably, I would implement this as serialization only for now.

Agreed, it's best to implement the feature as serialization only for now. Can add a separate issue to track adding deserialization.

@raj-nimble raj-nimble force-pushed the feature/221-serialize-enums-as-flattened-structs branch from 16751a6 to 623467b Compare September 11, 2024 01:56
@raj-nimble
Copy link
Author

raj-nimble commented Sep 11, 2024

Hi @chmp ,

I'm struggling to find an elegant way to do the serialization.
I started out attempting to modify StructBuilder to handle it but the issue becomes that the value being serialized is the actual Enum and it was difficult to attempt to detect that in ArrayBuilder, and deconstruct it into it's fields to call serialize on that.
If not detected, and we do something like trying to return the field ArrayBuilder, due to the serialization it ends up calling serialize_struct_field on the various primitive builders and it didn't seem like the best idea to go through and implement that serialization method on all the primitive builders (although maybe that really is the best way?).

I considered redoing the schema creation to continue to use UnionBuilder and then add logic there to serialize it as a struct instead but I think that still ran into problems with arrow writer and seemed more convoluted.

I then considered creating an entirely new builder, something like an EnumStructBuilder. I think that would move the complexity into OuterSequenceBuilder::new, into the build_builder and build_struct inner methods but I'm not quite sure yet. There was some small complexity in how dispatch! would choose to use this new builder which made it less elegant.

Do you have any time to take a closer look and give some guidance on exactly how this might best be done? My 2 current
options seem to be to make a completely new builder, or use a small modification on StructBuilder which then adds struct field handling to all the primitive builders.

FYI, I rebased onto the latest from develop-0.12.

Thanks,
Raj

@raj-nimble raj-nimble closed this Sep 11, 2024
@raj-nimble raj-nimble force-pushed the feature/221-serialize-enums-as-flattened-structs branch from 623467b to 4df3f9c Compare September 11, 2024 01:57
@raj-nimble raj-nimble reopened this Sep 11, 2024
@chmp
Copy link
Owner

chmp commented Sep 11, 2024

@raj-nimble good questions, what's the best option.

One option would probably be to copy the code of UnionBuilder into a separate file and adapt it:

  • build a union builder first, then modify it. This should fail if any variant is not serialized as a struct (as otherwise it does not make sense to merge it). I would include the logic merging the field names here.
  • merge the different struct builders in into_array(). Note that this would require to call serialize_none on the unused builders

In pseudo code:

    // probably you need to do something different here, as you plan to use `Struct` as the type with a custom strategy
    pub fn from_union_builder(builder: UnionBuilder) -> Result<FlattenedUnionBuilder> {
        let builders = Vec::new();
        for (child_builder, child_meta) in builder.fields {
           // ensure the child builders are nullable here (otherwise `serialize_none` will fail)
           let ArrayBuilder::Struct(child_builder) = child_builder else { fail!("Children must be structs") };
           // modify the fields to include the variant prefix, the variant name will be in child_meta.name
        }

        // construct the builder
       Ok(FlattenedUnionBuilder { ... })
    }

    pub fn into_array(self) -> Result<Array> {
        let mut fields = Vec::new();
        
        // note: the meta data for the variant is most likely not used, I would simply drop it
        for (_, builder) in self.fields.into_iter().enumerate() {
            let ArrayBuilder::Struct(builder) = builder else { fail!("..."); };

            // concatenate the fields of the different variants
            for (field, meta) in builder.fields {
                fields.push((idx.try_into()?, builder.into_array()?, meta));
             }
        }

        Ok(Array::Struct(StructArray {
            // note: you will most likely need to keep track of the number of elements being added, simply add self.len += 1 or similar in the different `serialize_*` methods
            len: self.len,
            validity: self.seq.validity,
            fields,
        }))
    }

    pub fn serialize_variant(&mut self, variant_index: u32) -> Result<&mut ArrayBuilder> {
        self.len += 1;

        let variant_index = variant_index as usize;
        
        // call push_none for any variant that was not selected
        for (idx, builder) in self.fields.iter_mut().enumerate() {
           if idx != variant_idx {
               builder.serialize_none()?;
            }
         }
        
        let Some(variant_builder) = self.fields.get_mut(variant_index) else {
            fail!("Could not find variant {variant_index} in Union");
        };

        Ok(variant_builder)
    }

@raj-nimble raj-nimble force-pushed the feature/221-serialize-enums-as-flattened-structs branch 2 times, most recently from 8de8e74 to 49d3cfd Compare September 18, 2024 21:43
@raj-nimble
Copy link
Author

raj-nimble commented Sep 19, 2024

Hi @chmp , thanks so much for your suggestions / guidance, it was immensely helpful.
I have pushed up the current draft of work. I haven't written any unit tests or done any cleanup up yet but this version is working with the new builder in the most naive sense, and in addition to working in a small test app, I also tested it within our core application and it works there as well for structs that do not contain UUIDs. I plan to spend the next day or two cleaning it up and writing tests.

At some point, the UUID support could potentially become the next big hurdle. Our application uses them quite extensively, and as part of using the crate in our app, I switched to using from_type instead of from_samples since the latter was causing an Error in serde_arrow with a "root type nullable" issue but maybe that's a bug in my current implementation, not sure yet. Unfortunately, the UUID workaround we arrived at in #203 requires using from_samples.

@raj-nimble
Copy link
Author

I commented here on my current workaround for UUIDs when using from_type
#203 (comment)

@chmp
Copy link
Owner

chmp commented Sep 22, 2024

@raj-nimble regarding tests: I will add a method to easily construct the new internal array objects. this way you should be able to easily get the underlying data out and compare whether everything worked.

Copy link
Owner

@chmp chmp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great so far!

}
}

#[cfg(test)]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far my convention has been to move these test into the serde_arrow/src/test_with_arrow tree. As written in the top-level comment, I will add support to simplify getting the data of the arrays so it will be easier to check the contents.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good, will move the tests once the helper methods are added

@chmp chmp deleted the branch chmp:main September 22, 2024 19:36
@chmp chmp closed this Sep 22, 2024
@chmp chmp reopened this Sep 22, 2024
@chmp chmp changed the base branch from develop-0.12 to main September 22, 2024 19:49
@chmp
Copy link
Owner

chmp commented Sep 22, 2024

@raj-nimble I started to work o a proper release for the 0.12 series. Hence the change in base branch.

@raj-nimble raj-nimble force-pushed the feature/221-serialize-enums-as-flattened-structs branch from d785260 to 21ec2c5 Compare September 23, 2024 20:00
@raj-nimble raj-nimble force-pushed the feature/221-serialize-enums-as-flattened-structs branch from 21ec2c5 to 4eeab22 Compare October 4, 2024 19:40
@raj-nimble
Copy link
Author

Hi @chmp sorry for the delay here. I got pulled away from this for a bit but I'm really hoping to finish up here this week or early next week. Did you get a chance yet to add support to simplify getting the data of the arrays so I can move the tests?

@chmp
Copy link
Owner

chmp commented Oct 9, 2024

Great to hear!

Re.: getting the data out of the arrays: you can use build_arrays of the ArrayBuilder and then match on the variants of the Array enum. Something along the lines of

// see also: https://github.com/chmp/serde_arrow/blob/main/serde_arrow/src/test_with_arrow/impls/chrono.rs#L659
let mut builder = ArrayBuilder::new(schema).unwrap();
builder.extend(&items).unwrap();

let arrays = builder.build_arrays().unwrap();
let [array] = arrays.try_into().unwrap();

let Array::Struct(array) = array else { 
    panic!("Expected struct array, got: {array:?}"); 
};

let (first_field, meta) = &array.fields[0];
assert_eq!(meta.name, "...");
assert_eq!(first_field, Array::Int32(PrimtiveArray { validity: None, values: vec![1, 2, 3, 4] }));

@raj-nimble raj-nimble force-pushed the feature/221-serialize-enums-as-flattened-structs branch from 4eeab22 to bf3f789 Compare October 22, 2024 22:12
@raj-nimble
Copy link
Author

raj-nimble commented Oct 24, 2024

Hi @chmp I have an issue and I am not yet certain how to fix it. Maybe you can provide some insight?

I have a test case pushed up that reproduces the issue at serde_arrow/src/test_with_arrow/impls/flattened_union.rs.

The error I am getting is:

thread 'test_with_arrow::impls::flattened_union::test_flattened_union_with_nested_enum' panicked at serde_arrow/src/test_with_arrow/impls/flattened_union.rs:230:10:
failed to serialize: Error: Cannot push null for non-nullable array (data_type: "UInt32", field: "$.data.Two.opts.loc.key")

To repro/demonstrate, lets say we have this nested struct->enum->struct->enum structure

#[derive(Serialize, Deserialize)]
struct ComplexMessage {
    data: MsgData,
}

#[derive(Serialize, Deserialize)]
enum MsgData {
    One { data: usize },
    Two { opts: MsgOptions },
}

#[derive(Serialize, Deserialize)]
struct MsgOptions {
    loc: Location,
}

#[derive(Serialize, Deserialize)]
enum Location {
    Left,
    Right,
}

and we have a schema and data as follows

fn nested_enum_schema() -> SerdeArrowSchema {
    let options = TracingOptions::default()
        .allow_null_fields(true)
        .enums_without_data_as_strings(true)
        .enums_with_named_fields_as_structs(true);

    SerdeArrowSchema::from_type::<ComplexMessage>(options).unwrap()
}

fn nested_enum_data() -> Vec<ComplexMessage> {
    vec![
        ComplexMessage {
            data: MsgData::One { data: 3 },
        },
        ComplexMessage {
            data: MsgData::Two {
                opts: MsgOptions {
                    loc: Location::Right,
                },
            },
        },
    ]
}

the error above shows up when we serialize

    let mut builder = ArrayBuilder::new(nested_enum_schema()).unwrap();
    let serializer = Serializer::new(builder);
    nested_enum_data()
        .serialize(serializer)
        .expect("failed to serialize")
        .into_inner()
        .to_arrow()
        .expect("failed to serialize to arrow");

The fields look like this

[
    Struct(
        StructArray {
            len: 0,
            validity: None,
            fields: [
                (
                    UInt64(
                        PrimitiveArray {
                            validity: Some(
                                [],
                            ),
                            values: [],
                        },
                    ),
                    FieldMeta {
                        name: "One::data",
                        nullable: true,
                        metadata: {},
                    },
                ),
                (
                    Struct(
                        StructArray {
                            len: 0,
                            validity: Some(
                                [],
                            ),
                            fields: [
                                (
                                    Dictionary(
                                        DictionaryArray {
                                            indices: UInt32(
                                                PrimitiveArray {
                                                    validity: None,
                                                    values: [],
                                                },
                                            ),
                                            values: LargeUtf8(
                                                BytesArray {
                                                    validity: None,
                                                    offsets: [
                                                        0,
                                                    ],
                                                    data: [],
                                                },
                                            ),
                                        },
                                    ),
                                    FieldMeta {
                                        name: "loc",
                                        nullable: false,
                                        metadata: {},
                                    },
                                ),
                            ],
                        },
                    ),
                    FieldMeta {
                        name: "Two::opts",
                        nullable: true,
                        metadata: {},
                    },
                ),
            ],
        },
    ),
]

As I understand it, what is happening is:

  • The first variant in the first data structure is serialized, and 2nd variant builder is skipped, so we call StructBuilder::serialize_none()
  • StructBuilder::serialize_none() calls DictionaryUTF8Builder::serialize_default()
  • DictionaryUTF8Builder::serialize_default() calls IntBuilder::serialize_none()
  • IntBuilder::serialize_none() calls self.array.push_scalar_none() which calls set_validity() with no buffer and value false which fails

So while I understand how we arrive at that section of the code, I don't really understand the underlying behavior of the Dict builder or Int Builder on why this is so. I believe the Dict Builder is a result of the TracingOptions::enums_without_data_as_strings field which we do want but I'm not sure how to approach allowing this variant to be none.

Do you have any thoughts on what I might need to do here?

@chmp
Copy link
Owner

chmp commented Oct 26, 2024

@raj-nimble Oh that's a tricky. The reason is the following:

  • A null struct value needs to have value for its fields, that why serialize_none calls serialize_default for each of its fields
  • The call to serialize_default in the dictionary array serializes a null value, as there is no reasonable default. The default for strings, an empty string, is not guaranteed to exist. The index 0 is not guaranteed to map to a valid string, if there are no values for this dictionary (probably a corner case, but nonethless)

One fix would be to force nested dictionary fields to be nullable with enums_with_named_fields_as_structs, regardless of the data. Something like:

fn fix_dictionaries(field: &mut Field) {
  if matches!(field.data_type, DataType::Dictionary(_, _, _)) {
    field.nullable = true;
  } else if let DataType::Struct(children) = &mut field.data_type {
    for child in children {
       fix_dictionaries(child);
    }
  }
}

The recursion on struct is necessary, as nested structs that contain dictionaries will trigger the same error.

Just a heads up: unions in Arrow are never nullable. For nested enums, that cannot be mapped to dictionaries, you will run into a similar issue. Unfortunately, I have no idea what would be a solution here. I would suggest, to simply catch this case and raise an error in schema tracing.

Maybe it would also be worthwhile to add checks for both issues, non-nullable dictionaries in structs and enums in structs, to the serializer. The schema can also be constructed manually and would result in hard to track down errors.

@raj-nimble raj-nimble force-pushed the feature/221-serialize-enums-as-flattened-structs branch from 4a0ad54 to c5b98b6 Compare October 28, 2024 21:51
@raj-nimble
Copy link
Author

Hi @chmp , sorry for the recent radio silence and lack of progress. I got pulled into more urgent tasks but this has definitely not fallen off my radar. I do plan to return to this hopefully in the next month.

For a summary of where I left off:

  • I think most of it is good to go. The naive happy-path behavior is all working and some extended testing in our main application was all working as expected. There are some limitations but they were all expected and able to be handled.
  • I think there is probably some gaps in unit test coverage still.
  • I had found one specific bug that I was investigating at the time I dropped off. I have a unit test that reproduces the issue but I had not started investigating it yet.
  • My plan when I can come back to this is to fix that bug, confirm with our application, add any other possibly missing unit tests, then open this up for official review.

Sorry for the delay with this.

@chmp
Copy link
Owner

chmp commented Dec 11, 2024

@raj-nimble Thanks for the update. And no worries. I myself am quite busy with other things and can't really devote much time to serde_arrow. Just give me a ping when I should support :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants