-
Notifications
You must be signed in to change notification settings - Fork 78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve Schema Diff Error Message #160
Conversation
core/src/main/scala/com/github/mrpowers/spark/fast/tests/ProductUtil.scala
Show resolved
Hide resolved
core/src/main/scala/com/github/mrpowers/spark/fast/tests/ProductUtil.scala
Outdated
Show resolved
Hide resolved
else if (expectedSeq.isEmpty) | ||
List(Red(actualSeq.mkString("[", ",", "]")), Green("[]")) | ||
List(Red(prodToString(actualSeq)), Green(emptyProd)) | ||
else { | ||
val withEquals = actualSeq |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest skipping this part if the Type is not Row, if it´s T
the schema will be compared in compile time and if the type T
is equal to the DataSet, it will fail when we execute .collect()
in runtime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought by this point we only work with in memory data - Seq[T] ?
or you mean in case of showProductDiff[Dataset[...]]
core/src/main/scala/com/github/mrpowers/spark/fast/tests/ProductUtil.scala
Outdated
Show resolved
Hide resolved
val a = actualDS.collect().toSeq | ||
val e = expectedDS.collect().toSeq | ||
if (!a.approximateSameElements(e, equals)) { | ||
val arr = ("Actual Content", "Expected Content") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont see the improvement, any Dataset[T] can be transformed into a DataFrame, the only change here is that it will work with the original Class
T` that was provided, but the result will be the same, only with the Class name prefixing each line. If the only thing was to prevent the .asRows method we could have done this and not any other change will be needed.
val a = actualDS.collect().toSeq | |
val e = expectedDS.collect().toSeq | |
if (!a.approximateSameElements(e, equals)) { | |
val arr = ("Actual Content", "Expected Content") | |
val a = actualDS.toDF.collect() //now its a Row | |
val e = expectedDS.toDF.collect() | |
if (!a.approximateSameElements(e, equals)) { | |
val arr = ("Actual Content", "Expected Content") | |
val msg = "Diffs\n" ++ DataframeUtil.showDataframeDiff(arr, a, e, truncate) |
I see this as more unified and we wont have to check if its a Product or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ShowProductDiff
is an update/extension to the original showDataframeDiff
so we can have a nicely format table display for any Case Class
and additionally Row. This was done so we have nicely format table for StructField
And since showProductDiff
can now also display differences for any Case Class
I figured we no longer need to convert a Dataset[T]
into a Dataframe
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All these changes could be simplified by only adding a single parameter in the header because at the end the processing doesn't care if it's a row or a case class, it only requires the list of elements:
private[mrpowers] def showDataframeDiff(
header: (String, String),
actual: Seq[Row],
expected: Seq[Row],
truncate: Int = 20,
minColWidth: Int = 3,
className = Option[String]
): String = {
val (className, lBracket, rBracket) = className match {
case None => ("", "[", "]")
case Some(cn) (cn, "(", ")")
}
Use these 3 elements to make the representation, and when you call this function, pass the class name if it's not a Row.
I don't see any other benefit of getting the element as T
because the comparison is not done here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since we also use it for Seq[StructField] for displaying schema fields won't we have to make actual and expected Seq[Any]
?
private[mrpowers] def showDataframeDiff(
header: (String, String),
actual: Seq[Any],
expected: Seq[Any],
truncate: Int = 20,
minColWidth: Int = 3,
className = Option[String]
): String = {
val (className, lBracket, rBracket) = className match {
case None => ("", "[", "]")
case Some(cn) (cn, "(", ")")
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def assertSmallDatasetContentEquality[T: ClassTag](actualDS: Dataset[T], expectedDS: Dataset[T], truncate: Int, equals: (T, T) => Boolean): Unit = {
val a = actualDS.collect().toSeq
val e = expectedDS.collect().toSeq
if (!a.approximateSameElements(e, equals)) {
val arr = ("Actual Content", "Expected Content")
val runTimeClass = implicitly[ClassTag[T]].runtimeClass
val prefix = if (runTimeClass == classOf[Row]) None else Some(runTimeClass.getSimpleName)
val msg = "Diffs\n" ++ DataframeUtil.showDataframeDiff(arr, a.asRows, e.asRows, truncate, prefix)
throw DatasetContentMismatch(msg)
}
}
If you don't want to do this here, move it inside showDataframeDiff
, you can put the type parameter but you ar not forced to use it. But I see it weird to have the signature of your function as:
private[mrpowers] def showDataframeDiff[T:ClassTag](
header: (String, String),
actual: Seq[Any],
expected: Seq[Any],
truncate: Int = 20,
minColWidth: Int = 3
): String
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer putting it in showProductDiff
since it will also need to be on betterSchemaMismatchMessage
. Do you think we should still use Seq[T]
? I think that would make it at least safer for the input Type. ofcourse unless someone decided to use Any
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok to change the name, it's called from a dataset method. To keep the [T]
, this is not a public method, it carries the type and the compiler will validate the schema of the actual and expected dataframe. The only problem is if the conversion to DataSet[T] is not correct, e.g. :
assertSmallDataset(spark.table("aTable").as[MyType], expected)
If the table doesn't match, the error will be raised with an exception in the .as[T]
method in runtime.
Twhy reason I say if we detect a non-Row DataSet, to skschema validation schema because it's already done.
I see it simpler if you know the exact type of the data as a Row only, and you only have one way to work with it. If you still think this could be good I would suggest first making more tests, not only with products but also with datasets of only one type like DataSet[String]
DataSet[Int]
DataSet[Timestamp]
DataSet[Array[Int]]. These types are not products and in some cases, I don't know how will it work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see. I forgot that people can total do Dataset[Array[Int]]
. I did some test, and it support single value type correctly but not Array[Int]
.
Actually, currently our main branch also has issue with Dataset[Seq[Int]]
. let's me try fixing that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alfonsorr I fixed issue with Dataset[Array[Int]] and also add test cases for String, Int, Array, Seq cases
What changed:
Schema Diff before:
Schema Diff now:
Dispaly for Dataframe diff
Display for Dataset diff