Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Coral-Schema] Add DATE type support in RelDataTypeToAvroType #484

Merged
merged 3 commits into from
Jan 11, 2024

Conversation

ljfgem
Copy link
Collaborator

@ljfgem ljfgem commented Jan 10, 2024

What changes are proposed in this pull request, and why are they necessary?

This patch added the DATE type support in RelDataTypeToAvroType to prevent the exception java.lang.UnsupportedOperationException: DATE is not supported.

How was this patch tested?

  1. Unit test
  2. Tested on the affected production view, which could be translated well with this PR
  3. Tested on Spark shell, the affected production view can be queried well with this PR

Copy link
Contributor

@aastha25 aastha25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the PR!

schema.addProp("logicalType", "timestamp-millis");
return schema;
case DATE:
schema = Schema.create(Schema.Type.INT);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there's no Date type schema in avro https://github.com/apache/avro/blob/branch-1.10/lang/java/avro/src/main/java/org/apache/avro/Schema.java#L125, how did we choose 'int' here?

What is the default spark derived schema for 'Date' type field?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so spark derived date type is cast-able to coral derived 'int' type?

Copy link
Collaborator Author

@ljfgem ljfgem Jan 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, Spark AvroDeserializer would return date type directly for this schema, and spark.table.printSchema also returns date type, Spark won't cast to int type.

Comment on lines +7 to +10
"type" : [ "null", {
"type" : "int",
"logicalType" : "date"
} ]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this the hive schema representation of 'Date' data type column? who injects the 'logicalType'? why is this not just

{
          "name" : "Date_Field",
          "type" : [ "null", "date" ]
        },

Copy link
Collaborator Author

@ljfgem ljfgem Jan 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is Avro schema representing date type, "type" : [ "null", "date" ] is not valid in Avro.

@aastha25
Copy link
Contributor

have we validated that this approach works with and without viewshift for the impacted dataset?

@ljfgem
Copy link
Collaborator Author

ljfgem commented Jan 10, 2024

have we validated that this approach works with and without viewshift for the impacted dataset?

@aastha25 Tested it on the target viewshift replacement view, which couldn't be translated & queried without this PR, but can be translated & queried with this PR.

@aastha25
Copy link
Contributor

@aastha25 Tested it on the target viewshift replacement view, which couldn't be translated & queried without this PR, but can be translated & queried with this PR.

It'll be good to test the top level view (which invokes this nested viewshift replacement view), to make sure that the schema generated for viewshift view (after comparing spark inferred schema & coral provided schema) is compatible with the top level view.

Comment on lines 118 to 119
// Spark recognizes the data type of {"type": "int", "logicalType": "date"} as a date type:
// https://github.com/apache/spark/blob/master/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala#L145
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the Spark comment relevant?
Especially, the code below converts a Calcite Date Type to Avro Date logical type.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In our use case, Avro schema needs to be deserializer by Spark for upcast, but agreed that the Spark-related comment should not be added here since Coral-Schema is not just used for Spark.

@ljfgem ljfgem merged commit 26cb90c into linkedin:master Jan 11, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants