Correctly handle single type uniontypes in Coral #507
Conversation
@@ -41,6 +41,12 @@ public class RelDataTypeToHiveTypeStringConverter {
private RelDataTypeToHiveTypeStringConverter() {
}

public RelDataTypeToHiveTypeStringConverter(boolean convertUnionTypes) {
Please add Java doc on what the effect of the parameter is.
// Convert single uniontypes back to Hive representation so coalesce_struct UDF can handle
// single uniontypes in Spark correctly
Description/code in coral-common should not be custom to Spark or at least reference Spark, when it can be generic.
Removed reference.
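For context, a hedged sketch of the type mapping this comment refers to; the table and column names are hypothetical, the struct field names are taken from the Trino row types quoted later in this PR description, and the exact converter output under each flag value is an assumption rather than something taken from the code:

```sql
-- Hypothetical Hive column using a single uniontype:
CREATE TABLE db.t (union_col uniontype<array<string>>);

-- Coral models a single uniontype internally as a tagged struct, roughly:
--   struct<tag:int, field0:array<string>>
-- With convertUnionTypes = true, the converter is presumably expected to emit the
-- original Hive form, uniontype<array<string>>, instead of the struct form above.
```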
if (containsSingleUnionType(operandType)) {
// Pass in schema string to keep track of the original Hive schema containing single uniontypes so coalesce_struct
Expand the comment to explain why it is important for coalesce_struct to know this distinction.
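A hedged illustration of the distinction being discussed; the table and column names are hypothetical, and the struct layout for the multi-type case is an assumption based on the Spark Avro conversion rules referenced elsewhere in this PR:

```sql
-- single_col is declared in Hive as uniontype<string>; Spark's Avro conversion unwraps it,
-- so Spark reads the value simply as string.
-- multi_col is declared as uniontype<int, string>; Spark reads it as a tagged struct.
SELECT extract_union(single_col), extract_union(multi_col) FROM db.t;

-- coalesce_struct therefore needs to know which fields originated from a single uniontype;
-- otherwise it would try to treat an already-unwrapped value as a tagged struct.
```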
*
* Note that uniontypes holding a single type need to be handled specially in Spark as there is a Spark-specific mechanism
* that unwraps a single uniontype (a uniontype holding only one data type) to just the single underlying data type.
* Reference: https://spark.apache.org/docs/latest/sql-data-sources-avro.html#supported-types-for-avro---spark-sql-conversion
*
* Check `CoralSparkTest#testUnionExtractUDFOnSingleTypeUnions` for examples.
*
This should state that the behavior is specific to base tables. Giving an example query helps. In fact, you might write a few illustrating different expected behaviors of this transformer.
> This should state that the behavior is specific to base tables.

I'm not sure this unwrapping behavior is specific to base tables. I created a Hive base table with a single union column, then created a view that selects all from the base table. Running desc on both the table and the view in Spark yields the same schema (the single union is unwrapped).

> Giving an example query helps

On line 47, the comment points to CoralSparkTest.testUnionExtractUDFOnSingleTypeUnions, which has examples covering simple and complex cases of Hive queries that would trigger this transformer, as well as the expected behavior.
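A hedged sketch of the experiment described above (table and view names are illustrative):

```sql
-- Hive
CREATE TABLE db.base_table (union_col uniontype<array<string>>);
CREATE VIEW db.v AS SELECT * FROM db.base_table;

-- Spark
DESCRIBE db.base_table;  -- union_col reported as array<string> (single uniontype unwrapped)
DESCRIBE db.v;           -- same unwrapped schema as the base table
```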
> I'm not sure this unwrapping behavior is specific to base tables. I created a Hive base table with a single union column, then created a view that selects all from the base table. Running desc on both the table and the view in Spark yields the same schema (the single union is unwrapped).

I meant explaining that the main reason for the unwrapping lies in base tables. The current description is a bit vague and may not help with the thought process.

Stating examples in the comments helps vs pointing to the test cases.
@wmoustafa Updated this piece of javadoc to include end2end reasoning + examples in comments. PTAL
/**
* DataTypeDerivedSqlCallConverter transforms the sqlCalls
* in the input SqlNode representation to be compatible with Trino engine.
* The transformation may involve change in operator, reordering the operands
* or even re-constructing the SqlNode.
*
* All the transformations performed as part of this shuttle require RelDataType derivation.
*/
This description is confusing (also the class name is confusing).
Why would a Spark shuttle do something specific to Trino?
The description sounds too generic and arbitrary, without a specific contract.
This class is a copy of DataTypeDerivedSqlCallConverter in rel2trino. I could create a follow-up PR to edit the name and Javadoc for both DataTypeDerivedSqlCallConverter classes?

> Why would a Spark shuttle do something specific to Trino?

Forgot to change "Trino engine." to "Spark engine." after copying, fixed.
return sparkSqlNode.accept(new SparkSqlRewriter());
}

public static String constructSparkSQL(SqlNode sparkSqlNode) {
return sparkSqlNode.toSqlString(SparkSqlDialect.INSTANCE).getSql();
}

private static boolean isSelectStar(SqlNode node) {
What does this function have to do with single unions?
My apologies for not explaining earlier. After adding type derivation transformations on the coral-spark RHS, there was a side effect where the old detection for select-star queries no longer worked, requiring us to update the detection logic. This is a more robust check anyhow.
.accept(new CoralToSparkSqlCallConverter(sparkUDFInfos));

SqlNode coralSqlNodeWithRelDataTypeDerivedConversions =
coralSqlNode.accept(new DataTypeDerivedSqlCallConverter(hmsClient, coralSqlNode, sparkUDFInfos));
Did not follow the intuition behind this (specifically in the context of single union types).
In the coral-spark RHS (the RelNode -> SqlNode translation layer), there didn't use to be a need for transformers that require rel type derivation. However, we now need type derivation in ExtractUnionFunctionTransformer (which lives in the coral-spark RHS) to detect extract_union calls on single uniontypes. Only if we detect that a call operates on a single uniontype do we pass in the schema string when transforming the extract_union call to coalesce_struct.
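A hedged sketch of the resulting translation difference; the exact format of the schema-string argument is an assumption (the thread only states that a schema string is passed for single uniontypes):

```sql
-- Hive view text: single_col is uniontype<array<string>>, multi_col is uniontype<int, string>
SELECT extract_union(single_col), extract_union(multi_col) FROM db.t;

-- Coral-Spark translation (sketch): only the single-uniontype call carries the schema string
SELECT coalesce_struct(single_col, 'uniontype<array<string>>'),
       coalesce_struct(multi_col)
FROM db.t;
```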
Incidentally, this is a needed step for Coral IR upgrades ("Introduce API to enable type derivation in the SqlNode transformation layer"), which we have now validated is doable. cc: @aastha25
I think we need a discussion around this. Our objective here is to standardize SqlNode-to-SqlNode conversions to happen strictly through SqlCallTransformers. We should discuss whether this API is sufficient and, if not, how to organize/standardize things that happen outside it. The objective is to minimize ad hoc transformations, and this seems to add an ad hoc transformation.
Discussed offline that we organize CoralRelNode -> LanguageSqlNode as 3 logical steps:

1. Apply a SqlShuttle for the list of SqlCall transformers that require rel type derivation
2. Apply a SqlShuttle for the list of SqlCall transformers that do not require rel type derivation
3. SqlNode transformations that cannot be done at the SqlCall layer (in coral-spark this is the CoralSqlNodeToSparkSqlNodeConverter class)

Steps 1 and 2 must be separated, in that order, because intermixing type-derivation transformers with non-type-derivation transformers causes failures: the non-type-derivation transformers introduce certain operators that the type derivation util cannot yet handle.

A future PR will be set up to refactor these steps into a well-documented class that loops through the 3 SqlShuttles.
Commits:

- fix single uniontypes in Coral
- remove SingleUnionFieldReferenceTransformer
- remove field reference fix to put in separate PR
- spotless
- update comments
- fix comment + add single uniontype test for RelDataTypeToHiveTypeStringConverter
- spotless
- improve Javadoc for ExtractUnionFunctionTransformer
- use html brackets in javadoc
Intro
The goal of this PR is to fix the unaccounted-for side effects of the single uniontype related changes from #409 (a single uniontype is a uniontype holding only one data type, such as uniontype<array<string>>).

Example 1: issues surrounding view text translations on field references for extractions on single union datatypes

Necessary context: There is a Spark-specific mechanism that unwraps a single uniontype to just the single underlying data type when reading from Avro schemas. Reference: https://spark.apache.org/docs/latest/sql-data-sources-avro.html#supported-types-for-avro---spark-sql-conversion
Behavior before this PR:

- Hive: the original view text of v1
- Spark: what view v1 translated to
- Trino: what view v1 translated to
Issues:

1. In Trino, single uniontypes like union_col are not unwrapped as they are in Spark, so extract_union(union_col) would produce struct<tag_0:string> and we need the .tag_0 reference to extract out the underlying datatype.
2. We shouldn't be taking 2 operators (the extract_union UDF call + the field reference) and clubbing them into a single identifier, as it is not modular enough. We cannot assume extract_union is never called on its own, and we need extract_union on single unions to work on its own.

Behavior after this PR:

- Hive: the view text remains the same as before
- Spark: what view v1 translates to (see the hedged sketch below)
- Trino: what view v1 translates to
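A hedged reconstruction of the kind of view and translations Example 1 describes; the DDL and translated SQL are illustrative assumptions, and only the direction of the change (a schema string for Spark, an untouched extract_union path for Trino) comes from this PR:

```sql
-- Hive: a view reading a single-uniontype column (union_col declared as uniontype<string>)
CREATE VIEW db.v1 AS SELECT extract_union(union_col) AS c FROM db.t;

-- Spark, after this PR (sketch): the extract_union call becomes coalesce_struct with the
-- original Hive schema string, so the UDF knows union_col was a single uniontype that
-- Spark has already unwrapped to string:
--   SELECT coalesce_struct(union_col, 'uniontype<string>') AS c FROM db.t
--
-- Trino, after this PR: the translation path for extract_union on single unions is left
-- untouched; per issue (1) above, the result is struct<tag_0:string>, and a .tag_0 field
-- reference extracts the underlying string.
```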
Solution to issues:

1. Remove SingleUnionFieldReferenceTransformer entirely (which we no longer need anyway based on issue (2)); it lived in the Hive LHS. The Trino translation path for extract_union on single unions is now left untouched.
2. Made changes in the coalesce_struct UDF to add a second, optional schema string parameter, which is generated and passed in by ExtractUnionFunctionTransformer while we still have context of the Hive type and any single uniontypes. Now the coalesce_struct UDF knows which fields (nested or not) came from a Hive single uniontype and can coalesce them as such.

Example 2: issues surrounding schema type derivations for Trino single uniontypes

Behavior before this PR:

- Hive: the original view text of v1
- struct_col column Rel datatype in v1's RelNode

Issue: Because of the earlier change to surface the underlying field and its schema in Coral's SqlNode and RelNode representations when a field is a single uniontype, the Rel type of struct_col is simply struct<union_field:array<string>> instead of struct<union_field:uniontype<array<string>>>. That change was intended to fix an issue where the Avro schema for extract_union calls on single unions (generated from the RelNode using coral-schema) could not be analyzed by the Spark engine. However, Coral also generates Trino schemas from the RelNode, and Trino now cannot analyze the schema: we have a type coercion problem where Trino cannot coerce a column of type row(union_field row(tag tinyint, field0 array(varchar))) (a struct representing a single uniontype) to the type of the column stored in the view definition (generated from Coral's RelNode with the uniontype unwrapped), row(union_field array(varchar)).

Behavior after this PR:

- Hive: the view text remains the same as before
- struct_col column Rel datatype in v1's RelNode

Solution to issue: Remove the unwrapping of single uniontypes in RelNodes on the Hive LHS (see the hedged sketch below).
Note this change doesn't break Spark's type derivation for single uniontypes at LinkedIn. There is no regression in the schema produced for a column that is a single uniontype in Hive, verified by creating a view with a single uniontype column and describing it in the Spark shell with the changes from this PR.
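A hedged sketch of the Example 2 schema difference; the table and view DDL are illustrative, while the Rel datatypes and Trino row types are the ones quoted in the description above:

```sql
-- Illustrative base table and view for Example 2:
CREATE TABLE db.t (struct_col struct<union_field:uniontype<array<string>>>);
CREATE VIEW db.v1 AS SELECT struct_col FROM db.t;

-- Rel datatype of struct_col in v1's RelNode:
--   before this PR: struct<union_field:array<string>>             (single uniontype unwrapped)
--   after this PR:  struct<union_field:uniontype<array<string>>>  (uniontype preserved)
--
-- Corresponding Trino types from the coercion problem above:
--   actual column type:            row(union_field row(tag tinyint, field0 array(varchar)))
--   type stored in the view (old): row(union_field array(varchar))
```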
How was this patch tested?

Tested together with the coalesce_struct UDF changes to ensure the entire UX is as expected.